Better Faster Lighter Java [Electronic resources] (text version)

9.2 Examining the Requirements

The requirements for the Simple Spider leave a wide variety of design
decisions open. Possible solutions might be based on hosted EJB
solutions with XML-configurable indexing schedules, SOAP-encrusted web
services with pass-through security, and any number of other
combinations of buzzwords, golden hammers, and time-wasting
complexities. The first step in designing the Spider was
to eliminate complexity and focus on the problem at hand. In this
section, we will go through the decision-making steps together. The
mantra for this part of the process: ignore what you think you need
and examine what you know you need.

9.2.1 Breaking It Down

The first two services described by the requirements are the crawler
and the indexer. They are listed as separate services in the
requirements, but in examining the overall picture, we see no current
need to separate them. There are no other services that rely on the
crawler absent the indexer, and it doesn't make
sense to run the indexer unless the crawler has provided a fresh look
at the search domain. Therefore, in the name of simplicity,
let's collapse these two requirements into a single
service that both crawls and indexes a web site.

The requirements next state that the crawler needs to ignore links to
image files, since it would be meaningless to index them for textual
search and doing so would take up valuable resources. This is a good
place to apply the Inventor's Paradox. Think for a second about the
Web: there are more kinds of
links to ignore than just image files and, over time, the list is
likely to grow. Let's allow for a configuration file
that specifies what types of links to ignore.
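A minimal sketch of such a configurable filter follows. It assumes the ignored link types are file extensions listed under a hypothetical `ignore.extensions` key in a properties file; the class name `LinkFilter` and the key are illustrative and are not taken from the Spider's actual code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: decides whether the crawler should follow a link,
// based on file extensions listed in the configuration.
class LinkFilter {
    private final Set<String> ignoredExtensions = new HashSet<>();

    // Reads the assumed "ignore.extensions" key, e.g. "gif, jpg, png, pdf".
    LinkFilter(Properties config) {
        String raw = config.getProperty("ignore.extensions", "");
        for (String ext : raw.split(",")) {
            String trimmed = ext.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                ignoredExtensions.add(trimmed);
            }
        }
    }

    // Returns false when the link's path ends in an ignored extension.
    boolean shouldFollow(String url) {
        String path = url;
        int cut = path.indexOf('?');          // drop the query string
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');              // drop the fragment
        if (cut >= 0) path = path.substring(0, cut);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return true;                      // last path segment has no extension
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !ignoredExtensions.contains(ext);
    }
}
```

Because the list lives in configuration, adding a new link type to ignore (say, `zip`) would require no code change under this design.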

After the link-type requirement comes a requirement for configuring
the maximum number of links to follow. Since we have just decided to
include a configuration option of some kind, this requirement fits
our needs and we can leave it as-is.
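One way to picture the link limit is as a bound on a breadth-first traversal. The sketch below is an assumption about how such a limit could work, not the Spider's implementation: an in-memory map stands in for fetched pages, and the names `BoundedCrawl` and `maxLinks` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: breadth-first crawl that stops after maxLinks pages.
class BoundedCrawl {
    // linkGraph maps a page URL to the links found on that page
    // (a stand-in for actually fetching and parsing the page).
    static List<String> crawl(String start, Map<String, List<String>> linkGraph, int maxLinks) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        seen.add(start);
        while (!frontier.isEmpty() && visited.size() < maxLinks) {
            String page = frontier.removeFirst();
            visited.add(page);                // a real crawler would fetch and index here
            for (String link : linkGraph.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(link)) {         // add() is false for links already queued
                    frontier.addLast(link);
                }
            }
        }
        return visited;
    }
}
```

In this sketch the limit counts pages visited, including the starting page; whether the real requirement counts pages or individual links is left open by the text.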

Next, we have a requirement for making the indexer schedulable.
Creating a scheduling service involves implementing a long-running
process that sits dormant most of the time, waking up at specified
intervals to fire up the indexing service. Writing such a process is
not overly complex, but it is redundant and well outside the primary
problem domain. In the spirit of choosing the right tools and doing
one thing well, we can eliminate this entire requirement by relying
on the deployment platform's own scheduling
services. On Linux and Unix we have cron, and on
Windows we have at. To hook into these
system services, we need only provide an entry point to the Spider
that can be used to fire off the indexing service. System
administrators can then configure their schedulers to perform the
task at whatever intervals are required.
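As an illustration, such an entry point could be little more than a `main` method that returns a meaningful exit code so the scheduler can detect failures. The class name `SpiderMain`, its argument order, and the placeholder `crawlAndIndex` method are assumptions made for this sketch.

```java
// Hypothetical command-line entry point, usable by a scheduler or a person.
class SpiderMain {
    // Parses arguments and returns a process exit code (0 = success).
    static int run(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: SpiderMain <start-url> <index-dir>");
            return 2;
        }
        String startUrl = args[0];
        String indexDir = args[1];
        try {
            crawlAndIndex(startUrl, indexDir);
            return 0;
        } catch (RuntimeException e) {
            System.err.println("indexing failed: " + e.getMessage());
            return 1;
        }
    }

    // Placeholder for the real crawl-and-index service.
    static void crawlAndIndex(String startUrl, String indexDir) {
        System.out.println("indexing " + startUrl + " into " + indexDir);
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```

With something like this in place, a cron entry along the lines of `0 2 * * * java -cp spider.jar SpiderMain http://example.com/ /var/spider/index` would re-index nightly at 02:00; the jar name and paths here are made up for the example.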

The final service requirement is the search service. Even though the
requirements don't specify it as an individual
service, it must be invoked independently of the indexer (we
wouldn't want to re-run the indexer every time we
wanted to search for something), so it clearly needs to be a
separate service within the application. Unfortunately, the search
service cannot be fully decoupled from the indexing
service, because it must understand the format of the
indexing service's data source. No global standard
API currently exists for text index file formats. If and when such a
standard comes into being, we'll upgrade the Spider
to take advantage of the new standard and make the searching and
indexing services completely decoupled from one another.
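The nature of this coupling can be sketched with a toy index. In the hypothetical code below, the indexer and the searcher never reference each other, but both depend on the same made-up line format (`term<TAB>path1,path2`); none of these class names come from the Spider itself, and the format ignores real-world concerns such as paths that contain commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical shared format: one line per term, "term<TAB>path1,path2".
// Both services depend on this class, which is the coupling described above.
class IndexFormat {
    static String encode(String term, Iterable<String> paths) {
        return term + "\t" + String.join(",", paths);
    }
    static String termOf(String line) {
        return line.substring(0, line.indexOf('\t'));
    }
    static List<String> pathsOf(String line) {
        return Arrays.asList(line.substring(line.indexOf('\t') + 1).split(","));
    }
}

// Writes the index; knows nothing about the searcher.
class Indexer {
    private final Map<String, TreeSet<String>> postings = new TreeMap<>();

    void add(String path, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(path);
            }
        }
    }

    List<String> write() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, TreeSet<String>> e : postings.entrySet()) {
            lines.add(IndexFormat.encode(e.getKey(), e.getValue()));
        }
        return lines;
    }
}

// Reads the index; knows nothing about the indexer, only the format.
class Searcher {
    private final List<String> lines;

    Searcher(List<String> lines) {
        this.lines = lines;
    }

    List<String> search(String term) {
        for (String line : lines) {
            if (IndexFormat.termOf(line).equals(term.toLowerCase())) {
                return IndexFormat.pathsOf(line);
            }
        }
        return new ArrayList<>();
    }
}
```

If the format changed, both `Indexer` and `Searcher` would have to change with it, which is exactly the dependency a standard index API would remove.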

As for the user interfaces, a console interface is a fairly
straightforward choice. However, the mere mention of web services
often sends people into paroxysms of standards exuberance. Because of
the voluminous and increasingly complex web services standards stack,
actually implementing a web service is becoming more and more
difficult. Looking at our requirements, however, we see that we can
cut through most of the extraneous standards. Our service only needs
to launch a search and return an XML result set. The default
implementation of an Axis web service can provide those capabilities
without us messing around with either socket-level programming or
high-level standards implementation.
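To make this concrete: Apache Axis 1.x can expose the public methods of a plain Java class as web service operations, so the service code itself needs no SOAP or socket handling. The sketch below is only an illustration of that idea. The class name `SearchService`, the XML element names, the in-memory document map, and the occurrence-count ranking are all assumptions, and deployment details are omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical service class: it contains no Axis imports, because the
// toolkit rather than this class would handle SOAP and HTTP.
// (Left package-private to keep the sketch single-file friendly; a deployed
// service class would normally be public.)
class SearchService {
    // path -> page text; a stand-in for a real index lookup.
    private final Map<String, String> documents;

    SearchService(Map<String, String> documents) {
        this.documents = documents;
    }

    // Launches a search and returns the result set as an XML string.
    public String search(String term) {
        String needle = term.toLowerCase();
        Map<String, Integer> counts = new LinkedHashMap<>();
        int max = 0;
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            int n = countOccurrences(doc.getValue().toLowerCase(), needle);
            if (n > 0) {
                counts.put(doc.getKey(), n);
                max = Math.max(max, n);
            }
        }
        StringBuilder xml = new StringBuilder("<results>");
        for (Map.Entry<String, Integer> hit : counts.entrySet()) {
            // Rank is relative to the best hit; the algorithm is a placeholder.
            double rank = (double) hit.getValue() / max;
            xml.append("<result path=\"").append(escape(hit.getKey()))
               .append("\" rank=\"").append(rank).append("\"/>");
        }
        return xml.append("</results>").toString();
    }

    private static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) return 0;
        int count = 0;
        for (int i = text.indexOf(needle); i >= 0; i = text.indexOf(needle, i + needle.length())) {
            count++;
        }
        return count;
    }

    // Escapes characters that are special inside XML attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```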

9.2.2 Refining the Requirements

We can greatly improve on the initial requirements. Using the
Inventor's Paradox, common sense, and available
tools, we can generalize some requirements and eliminate a few
others. Given this analysis, our new requirements are:

Provide a service to crawl and index a web site.

Allow the user to pass a starting point for the search domain.

Let the user configure the service to ignore certain types of links.

Let the user configure the maximum number of links the service
follows.

Expose an entry point that both an existing scheduler and a human can invoke.

Provide a search service over the results of the crawler/indexer.

The search should collect a search word or phrase.

Search results should include a full path to the file containing the
search term.

Search results should contain a relative rank for each result. The
actual algorithm for determining the rank is unimportant.

Provide a console-based interface for invoking the indexer/crawler
and search service.

Provide a web service interface for invoking the indexer/crawler and
the search service. The web service interface does not need to
explicitly provide authentication or authorization.

These requirements represent a cleaner design that allows future
extensibility and focuses development on tasks that are essential to
the problem domain. This is exactly what we need from requirements.
They should provide a clear roadmap to success. If you get lost, take
a deep breath. It's okay to ask for directions and
clarify requirements with a customer.
