Better Faster Lighter Java [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Better Faster Lighter Java [Electronic resources] - نسخه متنی

Justin Gehtland; Bruce A. Tate

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
لیست موضوعات
توضیحات
افزودن یادداشت جدید








10.5 Making Use of the Configuration Service


If we jump straight in and

start using the search as
it's currently configured, we'll
notice a problem. Our searches are returning lots of
resultsmore than can be possible given the number of products
in the database. In fact, a search for
"dog" returns over 20 results, even
though there are only 6 dogs in the database.

This is happening because of the brute-force nature of the crawling
service. Without extra help, the crawler finds every link on every
page and follows it, adding the results to the index. The problem is
that in addition to the links that allow users to browse animals in
the catalog, there are also links that allow users to add the animals
to their shopping carts, links to let them remove those items from
their carts, links to a sign-in page (which, by default in
jPetStore, loads with real credentials stored in
the textboxes), and a live link for
"Login," which the crawler will
happily followthus generating an entirely new set of links,
with a session ID attached to them.

We need to make sure our crawler doesn't get
suckered into following all the extraneous links and generate more
results than are helpful for our users. In the first part of Chapter 9, we talked about the three major problems
that turn up in a naïve approach to crawling a site:

Infinite loops


Once a link has been followed, the crawler must ignore it.


Off-site jumps


Since we are looking at
http://localhost/jpetstore, we
don't want links to external resources to be
indexed: that would lead to indexing the entire Internet (or, at
least, blowing up the application due to memory problems after hours
of trying).


Pages that shouldn't be indexed


In this case, that's pages like the sign-in page,
any page with a session ID attached to it, and so on.



Our crawler/indexer service handles the first two issues for us
automatically. Let's go back and look at the code.
The IndexLinks class has three collections it
consults every time it considers a new link:

Set linksAlreadyFollowed = new HashSet( );
HashSet linkPrefixesToFollow = new HashSet( );
HashSet linkPrefixesToAvoid = new HashSet( );

Every time a link is followed, it gets added to
linksAlreadyFollowed. The crawler never revisits a
link stored here. The other two collections are a list of link
prefixes that are allowed and a list of the ones that are denied.
When we call IndexLinks.setInitialLink, we add the
root link to the linkPrefixesToFollow set:

linkPrefixesToFollow.add(new URL(initialLink));

IndexLinks also exposes a method,
initAvoidPrefixesFromSystemProperties, which tells
the IndexLinks bean to read the configured system
properties in order to initialize the list:

  public void initAvoidPrefixesFromSystemProperties( ) throws MalformedURLException {
String avoidPrefixes = System.getProperty("com.relevance.ss.AvoidLinks");
if (avoidPrefixes == null || avoidPrefixes.length( ) == 0) return;
String[] prefixes = avoidPrefixes.split(" ");
if (prefixes != null && prefixes.length != 0) {
setAvoidPrefixes(prefixes);
}
}

First, the logic for considering a link checks to make sure the new
link matches one of the prefixes in
linkPrefixesToFollow. For us, the only value
stored there is http://localhost/jpetstore. If
it is a subpage of that prefix, we make sure the link
doesn't match one of the prefixes in
linkPrefixesToAvoid.

A special side note: good code documentation is an important part of
maintainability and flexibility. Notice the rather severe lack of
comments in the code for the Simple Spider. On the other hand, it has
rather lengthy method and type names (like
initAvoidPrefixesFromSystemProperties), which make
comments redundant, since they clearly describe the entity at hand.
Good naming, not strict commenting discipline, is often the key to
code readability.

All we need to do is populate the
linkPrefixesToAvoid collection.
ConsoleSearch already calls
initAvoidPrefixesFromSystemProperties for us, so
all we have to do is add the necessary values to the
com.relevance.ss.properties file:

AvoidLinks=http://localhost:8080/jpetstore/shop/signonForm.do http://localhost:8080/
jpetstore/shop/viewCart.do http://localhost:8080/jpetstore/shop/
searchProducts.do http://localhost:8080/jpetstore/shop/viewCategory.do;jsessionid=
http://localhost:8080/jpetstore/shop/addItemToCart.do http://localhost:8080/jpetstore/shop/
removeItemFromCart.do

These prefixes represent, in order, the sign-on form of the
application, any links that show the current user's
cart, the results of another search, any pages that are the result of
a successful logon, pages that add items to a users cart, and pages
that remove items from a users cart.


10.5.1 Principles in Action


Keep it simple: use existing Properties tools, not XML

Choose the right tools: java.util.Properties

Do one thing, and do it well: the service worries about following
provided links; the configuration files worry about deciding what
links can be followed

Strive for transparency: the service doesn't know
ahead of time what kinds of links will be acceptable; configuration
files make that decision transparent to the service

Allow for extension: expandable list of allowable link types



/ 111