9.6 The Crawler/Indexer Service


The application needs a way to dynamically follow the links from a given URL and the links from those pages, ad infinitum, in order to create the full domain of searchable pages. Just thinking about writing all of the web-related code to do that work gives me the screaming heebie-jeebies. We would have to write methods to post web requests, listen for responses, parse those responses looking for links, and so on.

In light of the "keep it simple" chapter, it seems we are immediately faced with a buy-it-or-build-it question. This functionality must exist already; the question is, where? It turns out we already have a library at our disposal that contains everything we need: HTTPUnit. Because HTTPUnit's purpose in life is to imitate a browser, it can be used to make HTTP requests, examine the HTML results, and follow the links contained therein.

Using HTTPUnit to do the work for us is a fairly nonstandard
approach. HTTPUnit is considered a testing framework, not an
application development framework. However, since it accomplishes
exactly what we need to do with regard to navigating web sites, it
would be a waste of effort and resources to attempt to recreate that
functionality on our own.

Our main entry point to the crawler/indexer service is
IndexLinks. This class establishes the entry point
for the indexable domain and all of the configuration settings for
controlling the overall result set. The constructor for the class
should accept as much of the configuration information as possible:

  public IndexLinks(String indexPath, int maxLinks,
                    String skippedLinksOutputFileName) throws IOException {
      this.maxLinks = maxLinks;
      this.linksNotFollowedOutputFileName = skippedLinksOutputFileName;
      writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
  }

The writer is an instance of org.apache.lucene.index.IndexWriter, initialized to point to the path where the index should live; the final true argument tells Lucene to create a new index at that path rather than open an existing one. Because the IndexWriter constructor can throw an IOException, our constructor declares it as well.

Our instance requires a series of collections to manage our links.
Those collections are:

  Set linksAlreadyFollowed = new HashSet();
  Set linksNotFollowed = new HashSet();
  Set linkPrefixesToFollow = new HashSet();
  Set linkPrefixesToAvoid = new HashSet();

The first two are used to store the links as we discover and
categorize them. The next two are configuration settings used to
determine if we should follow the link based on its prefix. These
settings allow us to eliminate subsites or certain external sites
from the search set, thus giving us the ability to prevent the
crawler from running all over the Internet, indexing everything.

The other object we need is a
com.meterware.httpunit.WebConversation. HTTPUnit
uses this class to model a browser-server session. It provides
methods for making requests to web servers, retrieving responses, and
manipulating the HTTP messages that result. We'll
use it to retrieve our indexable pages.

  WebConversation conversation = new WebConversation();
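To see how much work HTTPUnit saves us, here is a minimal standalone sketch of fetching a page and enumerating its links (the URL and class name are placeholders); these three calls, getResponse, getLinks, and getRequest().getURL(), are essentially all the navigation machinery the crawler needs:

  import com.meterware.httpunit.*;

  public class FetchDemo {
      public static void main(String[] args) throws Exception {
          WebConversation wc = new WebConversation();
          // One round trip: fetch the page; HTTPUnit parses the HTML for us.
          WebResponse response = wc.getResponse("http://www.example.com/");
          // The links are already extracted; no parsing code of our own.
          WebLink[] links = response.getLinks();
          for (int i = 0; i < links.length; i++) {
              System.out.println(links[i].getRequest().getURL().toExternalForm());
          }
      }
  }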

We must provide setter methods so the users of the indexer/crawler
can add prefixes to these two collections:

  public void setFollowPrefixes(String[] prefixesToFollow)
          throws MalformedURLException {
      for (int i = 0; i < prefixesToFollow.length; i++) {
          String s = prefixesToFollow[i];
          linkPrefixesToFollow.add(new URL(s));
      }
  }

  public void setAvoidPrefixes(String[] prefixesToAvoid)
          throws MalformedURLException {
      for (int i = 0; i < prefixesToAvoid.length; i++) {
          String s = prefixesToAvoid[i];
          linkPrefixesToAvoid.add(new URL(s));
      }
  }

In order to allow users of the application maximum flexibility, we
also provide a way to store lists of common prefixes that they want
to allow or avoid:

  public void initFollowPrefixesFromSystemProperties() throws MalformedURLException {
      String followPrefixes = System.getProperty("com.relevance.ss.FollowLinks");
      if (followPrefixes == null || followPrefixes.length() == 0) return;
      String[] prefixes = followPrefixes.split(" ");
      if (prefixes.length != 0) {
          setFollowPrefixes(prefixes);
      }
  }

  public void initAvoidPrefixesFromSystemProperties() throws MalformedURLException {
      String avoidPrefixes = System.getProperty("com.relevance.ss.AvoidLinks");
      if (avoidPrefixes == null || avoidPrefixes.length() == 0) return;
      String[] prefixes = avoidPrefixes.split(" ");
      if (prefixes.length != 0) {
          setAvoidPrefixes(prefixes);
      }
  }
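A deployer can then seed both collections from the command line without touching code; the property values are space-separated URL lists, matching the split(" ") calls above. For example (the URLs are placeholders and the main class name is hypothetical):

  java -Dcom.relevance.ss.FollowLinks="http://www.example.com/docs http://www.example.com/faq" \
       -Dcom.relevance.ss.AvoidLinks="http://www.example.com/docs/private" \
       com.relevance.ss.Crawler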

As links are considered for inclusion in the index,
we'll be executing the same code against each to
determine its worth to the index. We need a few helper methods to
make those determinations:

  boolean shouldFollowLink(URL newLink) {
      for (Iterator iterator = linkPrefixesToFollow.iterator(); iterator.hasNext();) {
          URL u = (URL) iterator.next();
          if (matchesDownToPathPrefix(u, newLink)) {
              return true;
          }
      }
      return false;
  }

  boolean shouldNotFollowLink(URL newLink) {
      for (Iterator iterator = linkPrefixesToAvoid.iterator(); iterator.hasNext();) {
          URL u = (URL) iterator.next();
          if (matchesDownToPathPrefix(u, newLink)) {
              return true;
          }
      }
      return false;
  }

  private boolean matchesDownToPathPrefix(URL matchBase, URL newLink) {
      return matchBase.getHost().equals(newLink.getHost()) &&
             matchBase.getPort() == newLink.getPort() &&
             matchBase.getProtocol().equals(newLink.getProtocol()) &&
             newLink.getPath().startsWith(matchBase.getPath());
  }

The first two methods, shouldFollowLink and shouldNotFollowLink, compare the URL to the corresponding prefix collection. The third, matchesDownToPathPrefix, compares the link to one prefix from a collection, making sure the host, port, and protocol are all the same and that the link's path begins with the prefix's path.
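A few worked cases (with a placeholder prefix URL) make the matching rule concrete:

  URL prefix = new URL("http://www.example.com/docs");

  matchesDownToPathPrefix(prefix, new URL("http://www.example.com/docs/ch01.html"));
  // true: same host, port, and protocol; "/docs/ch01.html" starts with "/docs"
  matchesDownToPathPrefix(prefix, new URL("http://www.example.com/private/a.html"));
  // false: the path does not start with "/docs"
  matchesDownToPathPrefix(prefix, new URL("https://www.example.com/docs/ch01.html"));
  // false: the protocol differs
  matchesDownToPathPrefix(prefix, new URL("http://www.example.com:8080/docs/"));
  // false: an explicit port does not match the prefix's default port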

The service needs a way to consider a link for inclusion in the
index. It must accept the new link to consider and the page that
contained the link (for record-keeping):

  void considerNewLink(String linkFrom, WebLink newLink) throws MalformedURLException {
      URL url = newLink.getRequest().getURL();
      if (shouldFollowLink(url)) {
          if (linksAlreadyFollowed.add(url.toExternalForm())) {
              if (linksAlreadyFollowed.size() > maxLinks) {
                  linksAlreadyFollowed.remove(url.toExternalForm());
                  throw new Error("Max links exceeded " + maxLinks);
              }
              if (shouldNotFollowLink(url)) {
                  IndexLink.log.info("Not following " + url.toExternalForm()
                          + " from " + linkFrom);
              } else {
                  IndexLink.log.info("Following " + url.toExternalForm()
                          + " from " + linkFrom);
                  addLink(new IndexLink(url.toString(), conversation, this));
              }
          }
      } else {
          ignoreLink(url, linkFrom);
      }
  }

newLink is an instance of com.meterware.httpunit.WebLink, which represents a single link (an HTML anchor) found during the web conversation. This method starts by determining whether the new URL is in our list of approved prefixes; if it isn't, the method calls the helper ignoreLink (which we'll see in a minute). If it is approved, we test whether we have already followed this link; if we have, we just move on to the next one. Note that we verify whether the link has already been followed by attempting to add it to the linksAlreadyFollowed set: if the value already exists in the set, add returns false; otherwise it returns true and the value is added to the set.
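This add-and-test idiom replaces a separate contains check followed by an add; a minimal illustration (the URL is a placeholder value):

  Set followed = new HashSet();
  followed.add("http://www.example.com/");  // returns true: first time we see it
  followed.add("http://www.example.com/");  // returns false: already present, set unchanged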

We also determine whether adding the link has caused the linksAlreadyFollowed set to grow past our configured maximum number of links. If it has, we remove the link we just added and throw an error.

Finally, the method checks to make sure the current URL is not in the
collection of proscribed prefixes. If it isn't, we
call the helper method addLink in order to add the
link to the index:

  private void ignoreLink(URL url, String linkFrom) {
      String status = "Ignoring " + url.toExternalForm() + " from " + linkFrom;
      linksNotFollowed.add(status);
      IndexLink.log.fine(status);
  }

  public void addLink(IndexLink link) {
      try {
          link.checkLink();
      } catch (Exception ex) {
          // handle error...
      }
  }

Finally, we need an entry point to kick off the whole process. This
method should take the root page of our site to index and begin
processing URLs based on our configuration criteria:

  public void setInitialLink(String initialLink) throws MalformedURLException {
      if ((initialLink == null) || (initialLink.length() == 0)) {
          throw new Error("Must specify a non-null initialLink");
      }
      linkPrefixesToFollow.add(new URL(initialLink));
      this.initialLink = initialLink;
      addLink(new IndexLink(initialLink, conversation, this));
  }

Next, we define a class to model the links themselves and allow us
access to their textual representations for inclusion in the index.
That class is the IndexLink class.
IndexLink needs three declarations:

  private WebConversation conversation;
  private IndexLinks suite;
  private String name;

The WebConversation instance again gives us the HTTPUnit framework's implementation of a browser-server session. The IndexLinks suite is the parent instance of IndexLinks that is managing this indexing session. The name variable stores the current link's full URL as a String.

Creating an instance of the IndexLink class should
provide values for all three of these variables:

  public IndexLink(String name, WebConversation conversation, IndexLinks suite) {
      if ((name == null) || (conversation == null) || (suite == null)) {
          throw new IllegalArgumentException(
                  "IndexLink constructor requires non-null args");
      }
      this.name = name;
      this.conversation = conversation;
      this.suite = suite;
  }

Each IndexLink exposes a method, checkLink, that navigates to the endpoint specified by the URL and checks whether the result is an HTML page or other indexable text. If the page is indexable, it is added to the parent suite's index. Finally, we examine the response to see if it contains links to other pages; for each such link, the process starts over:

  public void checkLink() throws Exception {
      WebResponse response = null;
      try {
          response = conversation.getResponse(this.name);
      } catch (HttpNotFoundException hnfe) {
          // handle error: a broken link leaves us nothing to index
          return;
      }
      if (!isIndexable(response)) {
          return;
      }
      addToIndex(response);
      WebLink[] links = response.getLinks();
      for (int i = 0; i < links.length; i++) {
          WebLink link = links[i];
          suite.considerNewLink(this.name, link);
      }
  }

The isIndexable method simply verifies the content
type of the returned result:

  private boolean isIndexable(WebResponse response) {
      return response.getContentType().equals("text/html") ||
             response.getContentType().equals("text/ascii");
  }

whereas the addToIndex method actually retrieves
the full textual result from the URL and adds it to the
suite's index:

  private void addToIndex(WebResponse response) throws SAXException, IOException,
          InterruptedException {
      Document d = new Document();
      HTMLParser parser = new HTMLParser(response.getInputStream());
      d.add(Field.UnIndexed("url", response.getURL().toExternalForm()));
      d.add(Field.UnIndexed("summary", parser.getSummary()));
      d.add(Field.Text("title", parser.getTitle()));
      d.add(Field.Text("contents", parser.getReader()));
      suite.addToIndex(d);
  }

The parser is an instance of org.apache.lucene.demo.html.HTMLParser, a freely available component from the Lucene team that takes an HTML document and supplies a collection-based interface to its constituent components. Note the final call to suite.addToIndex, a method on our IndexLinks class that takes the Document and adds it to the central index:

  // note: method of IndexLinks
  public void addToIndex(Document d) {
      try {
          writer.addDocument(d);
      } catch (Exception ex) {
          // handle error...
      }
  }

That's it. Together, these two classes provide a
single entry point for starting a crawling/indexing session. They
ignore the concept of scheduling an indexing event; that task is left
to the user interface layers. We only have two classes, making the
model extremely simple to maintain. And we chose to take advantage of
an unusual library (HTTPUnit) to keep us from writing code outside
our problem domain (namely, web request/response processing).
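Putting the pieces together, a caller might wire up and start an indexing session like this (a sketch only: the index path, link limit, filename, and URLs are placeholders, and exception handling is elided):

  IndexLinks indexer = new IndexLinks("/tmp/site-index", 500, "skipped-links.txt");
  indexer.initFollowPrefixesFromSystemProperties();
  indexer.initAvoidPrefixesFromSystemProperties();
  indexer.setAvoidPrefixes(new String[] { "http://www.example.com/private" });
  // Setting the initial link seeds the follow list and kicks off the crawl.
  indexer.setInitialLink("http://www.example.com/");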


9.6.1 Principles in Action


Keep it simple: choose HTTPUnit for the web navigation code, with minimal performance enhancements (the maxLinks limit and the linkPrefixesToAvoid collection)

Choose the right tools: JUnit, HTTPUnit, Cactus,[1] Lucene

[1] Unit tests elided for conciseness. Download the full version to see the tests.

Do one thing, and do it well: interface-free model, single entry point to the service, reliance on the platform's scheduler; we also ignored this principle in deference to simplicity by combining the crawler and indexer

Strive for transparency: none

Allow for extension: configuration settings for links to ignore


