9.6 The Crawler/Indexer Service
The application needs a way to dynamically follow the links from a
given URL and the links from those pages, ad infinitum, in order to
create the full domain of searchable pages. Just thinking about
writing all of the web-related code to do that work gives me the
screaming heebie-jeebies. We would have to write methods to post web
requests, listen for responses, parse those responses looking for
links, and so on.

In light of the "keep it simple" chapter, it seems we are immediately
faced with a buy-it-or-build-it question. This functionality must
exist already; the question is, where? It turns out we already have a
library at our disposal that contains everything we need: HTTPUnit.
Because HTTPUnit's purpose in life is to imitate a browser, it can be
used to make HTTP requests, examine the HTML results, and follow the
links contained therein.

Using HTTPUnit to do the work for us is a fairly nonstandard
approach. HTTPUnit is considered a testing framework, not an
application development framework. However, since it accomplishes
exactly what we need to do with regard to navigating web sites, it
would be a waste of effort and resources to attempt to recreate that
functionality on our own.

Our main entry point to the crawler/indexer service is IndexLinks.
This class establishes the entry point for the indexable domain and
all of the configuration settings for controlling the overall result
set. The constructor for the class should accept as much of the
configuration information as possible:

    public IndexLinks(String indexPath, int maxLinks,
                      String skippedLinksOutputFileName) throws IOException
    {
        this.maxLinks = maxLinks;
        this.linksNotFollowedOutputFileName = skippedLinksOutputFileName;
        // true = create a new index at indexPath
        writer = new IndexWriter(indexPath, new StandardAnalyzer( ), true);
    }

The writer is an instance of org.apache.lucene.index.IndexWriter,
which is initialized to point to the path where a new index should be
created.

Our instance requires a series of collections to manage our links.
Those collections are:
    Set linksAlreadyFollowed = new HashSet( );
    Set linksNotFollowed = new HashSet( );
    Set linkPrefixesToFollow = new HashSet( );
    Set linkPrefixesToAvoid = new HashSet( );

The first two are used to store the links as we discover and
categorize them. The next two are configuration settings used to
determine if we should follow the link based on its prefix. These
settings allow us to eliminate subsites or certain external sites
from the search set, thus giving us the ability to prevent the
crawler from running all over the Internet, indexing everything.

The other object we need is a
com.meterware.httpunit.WebConversation. HTTPUnit uses this class to
model a browser-server session. It provides methods for making
requests to web servers, retrieving responses, and manipulating the
HTTP messages that result. We'll use it to retrieve our indexable
pages:

    WebConversation conversation = new WebConversation( );

We must provide setter methods so the users of the indexer/crawler
can add prefixes to the two prefix collections
(linkPrefixesToFollow and linkPrefixesToAvoid):
    public void setFollowPrefixes(String[] prefixesToFollow)
            throws MalformedURLException {
        for (int i = 0; i < prefixesToFollow.length; i++) {
            String s = prefixesToFollow[i];
            linkPrefixesToFollow.add(new URL(s));
        }
    }

    public void setAvoidPrefixes(String[] prefixesToAvoid)
            throws MalformedURLException {
        for (int i = 0; i < prefixesToAvoid.length; i++) {
            String s = prefixesToAvoid[i];
            linkPrefixesToAvoid.add(new URL(s));
        }
    }

In order to allow users of the application maximum flexibility, we
also provide a way to store lists of common prefixes that they want
to allow or avoid:
    public void initFollowPrefixesFromSystemProperties( )
            throws MalformedURLException {
        // the property holds a space-separated list of URL prefixes
        String followPrefixes = System.getProperty("com.relevance.ss.FollowLinks");
        if (followPrefixes == null || followPrefixes.length( ) == 0) return;
        String[] prefixes = followPrefixes.split(" ");
        if (prefixes != null && prefixes.length != 0) {
            setFollowPrefixes(prefixes);
        }
    }

    public void initAvoidPrefixesFromSystemProperties( )
            throws MalformedURLException {
        String avoidPrefixes = System.getProperty("com.relevance.ss.AvoidLinks");
        if (avoidPrefixes == null || avoidPrefixes.length( ) == 0) return;
        String[] prefixes = avoidPrefixes.split(" ");
        if (prefixes != null && prefixes.length != 0) {
            setAvoidPrefixes(prefixes);
        }
    }

As links are considered for inclusion in the index, we'll be
executing the same code against each to determine its worth to the
index. We need a few helper methods to make those determinations:
    boolean shouldFollowLink(URL newLink) {
        for (Iterator iterator = linkPrefixesToFollow.iterator( ); iterator.hasNext( );) {
            URL u = (URL) iterator.next( );
            if (matchesDownToPathPrefix(u, newLink)) {
                return true;
            }
        }
        return false;
    }

    boolean shouldNotFollowLink(URL newLink) {
        for (Iterator iterator = linkPrefixesToAvoid.iterator( ); iterator.hasNext( );) {
            URL u = (URL) iterator.next( );
            if (matchesDownToPathPrefix(u, newLink)) {
                return true;
            }
        }
        return false;
    }

    private boolean matchesDownToPathPrefix(URL matchBase, URL newLink) {
        return matchBase.getHost( ).equals(newLink.getHost( )) &&
               matchBase.getPort( ) == newLink.getPort( ) &&
               matchBase.getProtocol( ).equals(newLink.getProtocol( )) &&
               newLink.getPath( ).startsWith(matchBase.getPath( ));
    }

The first two methods, shouldFollowLink and shouldNotFollowLink,
compare the URL to the prefixes in their respective collections. The
third, matchesDownToPathPrefix, compares the link to one prefix from
the collection, making sure the host, port, and protocol are all the
same and that the link's path starts with the prefix's path.

The service needs a way to consider a link for inclusion in the
index. It must accept the new link to consider and the page that
contained the link (for record-keeping):
    void considerNewLink(String linkFrom, WebLink newLink)
            throws MalformedURLException {
        URL url = newLink.getRequest( ).getURL( );
        if (shouldFollowLink(url)) {
            // add( ) returns false if the link is already in the set
            if (linksAlreadyFollowed.add(url.toExternalForm( ))) {
                if (linksAlreadyFollowed.size( ) > maxLinks) {
                    linksAlreadyFollowed.remove(url.toExternalForm( ));
                    throw new Error("Max links exceeded " + maxLinks);
                }
                if (shouldNotFollowLink(url)) {
                    IndexLink.log.info("Not following " + url.toExternalForm( )
                            + " from " + linkFrom);
                } else {
                    IndexLink.log.info("Following " + url.toExternalForm( )
                            + " from " + linkFrom);
                    addLink(new IndexLink(url.toString( ), conversation, this));
                }
            }
        } else {
            ignoreLink(url, linkFrom);
        }
    }

newLink is an instance of com.meterware.httpunit.WebLink, which
represents a link found in a page of a web conversation. This method
starts by determining whether the new URL is in our list of approved
prefixes; if it isn't, considerNewLink calls the helper method
ignoreLink (which we'll see in a minute). If it is approved, we test
to see if we have already followed this link; if we have, we just
move on to the next link. Note that we verify whether the link has
already been followed by attempting to add it to the
linksAlreadyFollowed set. If the value already exists in the set, the
add call returns false. Otherwise, it returns true and the value is
added to the set.

We also determine if the addition of the link has caused the
linksAlreadyFollowed set to grow past our configured maximum number
of links. If it has, we remove the link we just added and throw an
error.

Finally, the method checks to make sure the current URL is not in the
collection of proscribed prefixes. If it isn't, we call the helper
method addLink in order to add the link to the index:
    private void ignoreLink(URL url, String linkFrom) {
        String status = "Ignoring " + url.toExternalForm( ) + " from " + linkFrom;
        linksNotFollowed.add(status);
        IndexLink.log.fine(status);
    }

    public void addLink(IndexLink link)
    {
        try
        {
            // the crawling and indexing work is done by IndexLink
            link.runTest( );
        }
        catch(Exception ex)
        {
            // handle error...
        }
    }

Finally, we need an entry point to kick off the whole process. This
method should take the root page of our site to index and begin
processing URLs based on our configuration criteria:
    public void setInitialLink(String initialLink) throws MalformedURLException {
        if ((initialLink == null) || (initialLink.length( ) == 0)) {
            throw new Error("Must specify a non-null initialLink");
        }
        linkPrefixesToFollow.add(new URL(initialLink));
        this.initialLink = initialLink;
        addLink(new IndexLink(initialLink, conversation, this));
    }

Next, we define a class to model the links themselves and allow us
access to their textual representations for inclusion in the index.
That class is the IndexLink class. IndexLink needs three
declarations:
    private WebConversation conversation;
    private IndexLinks suite;
    private String name;

The WebConversation instance again provides us the HTTPUnit
framework's implementation of a browser-server session. The
IndexLinks suite is the parent instance of IndexLinks that is
managing this indexing session. The name variable stores the current
link's full URL as a String.

Creating an instance of the IndexLink class should provide values for
all three of these variables:
    public IndexLink(String name, WebConversation conversation, IndexLinks suite) {
        // validate before storing anything
        if ((name == null) || (conversation == null) || (suite == null)) {
            throw new IllegalArgumentException(
                "IndexLink constructor requires non-null args");
        }
        this.name = name;
        this.conversation = conversation;
        this.suite = suite;
    }

Each IndexLink exposes a method that navigates to the endpoint
specified by the URL and checks to see if the result is an HTML page
or other indexable text. If the page is indexable, it is added to the
parent suite's index. Finally, we examine the current results to see
if they contain links to other pages. For each such link, the process
must start over:
    public void checkLink( ) throws Exception {
        WebResponse response = null;
        try {
            response = conversation.getResponse(this.name);
        } catch (HttpNotFoundException hnfe) {
            // handle error; there is no response to index
            return;
        }
        if (!isIndexable(response)) {
            return;
        }
        addToIndex(response);
        WebLink[] links = response.getLinks( );
        for (int i = 0; i < links.length; i++) {
            WebLink link = links[i];
            suite.considerNewLink(this.name, link);
        }
    }

The isIndexable method simply verifies the content type of the
returned result:
    private boolean isIndexable(WebResponse response) {
        return response.getContentType( ).equals("text/html")
            || response.getContentType( ).equals("text/ascii");
    }

The addToIndex method, by contrast, retrieves the full textual result
from the URL and adds it to the suite's index:
    private void addToIndex(WebResponse response)
            throws SAXException, IOException, InterruptedException {
        Document d = new Document( );
        HTMLParser parser = new HTMLParser(response.getInputStream( ));
        // url and summary are stored but not searchable
        d.add(Field.UnIndexed("url", response.getURL( ).toExternalForm( )));
        d.add(Field.UnIndexed("summary", parser.getSummary( )));
        // title and contents are the searchable fields
        d.add(Field.Text("title", parser.getTitle( )));
        d.add(Field.Text("contents", parser.getReader( )));
        suite.addToIndex(d);
    }

The parser is an instance of org.apache.lucene.demo.html.HTMLParser,
a freely available component from the Lucene team that takes an HTML
document and supplies a collection-based interface to its constituent
components. Note the final call to suite.addToIndex, a method on our
IndexLinks class that takes the Document and adds it to the central
index:
    // note : method of IndexLinks
    public void addToIndex(Document d)
    {
        try
        {
            writer.addDocument(d);
        }
        catch(Exception ex)
        {
            // handle error...
        }
    }

That's it. Together, these two classes provide a single entry point
for starting a crawling/indexing session. They ignore the concept of
scheduling an indexing event; that task is left to the user interface
layers. We only have two classes, making the model extremely simple
to maintain. And we chose to take advantage of an unusual library
(HTTPUnit) to keep us from writing code outside our problem domain
(namely, web request/response processing).
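The link-filtering rules in this section can be tried out on their
own, without HTTPUnit or Lucene on the classpath. The following is
only a minimal sketch under stated assumptions: LinkFilterSketch, its
Decision enum, and its method names are hypothetical and are not part
of IndexLinks. The sketch keeps the prefix comparison from
matchesDownToPathPrefix and the order of checks from considerNewLink,
leaves out the maxLinks limit, and uses example.com URLs purely for
illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, stand-alone sketch of the link-filtering rules.
// It is not the IndexLinks class and needs neither HTTPUnit nor Lucene.
class LinkFilterSketch {
    enum Decision { FOLLOW, AVOIDED, ALREADY_SEEN, IGNORED }

    // Lists rather than sets of URL: URL.hashCode() may resolve host names.
    private final List<URL> followPrefixes = new ArrayList<>();
    private final List<URL> avoidPrefixes = new ArrayList<>();
    private final Set<String> alreadySeen = new HashSet<>();

    private static URL toUrl(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + s, e);
        }
    }

    void addFollowPrefix(String prefix) { followPrefixes.add(toUrl(prefix)); }

    void addAvoidPrefix(String prefix) { avoidPrefixes.add(toUrl(prefix)); }

    // Same comparison as matchesDownToPathPrefix in the text.
    static boolean matchesDownToPathPrefix(URL base, URL link) {
        return base.getHost().equals(link.getHost())
                && base.getPort() == link.getPort()
                && base.getProtocol().equals(link.getProtocol())
                && link.getPath().startsWith(base.getPath());
    }

    private static boolean matchesAny(List<URL> prefixes, URL link) {
        for (URL prefix : prefixes) {
            if (matchesDownToPathPrefix(prefix, link)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the order of checks in considerNewLink, minus the maxLinks limit.
    Decision consider(String link) {
        URL url = toUrl(link);
        if (!matchesAny(followPrefixes, url)) {
            return Decision.IGNORED;
        }
        // Set.add returns false when the value is already present.
        if (!alreadySeen.add(url.toExternalForm())) {
            return Decision.ALREADY_SEEN;
        }
        if (matchesAny(avoidPrefixes, url)) {
            return Decision.AVOIDED;
        }
        return Decision.FOLLOW;
    }

    public static void main(String[] args) {
        LinkFilterSketch filter = new LinkFilterSketch();
        filter.addFollowPrefix("http://example.com/docs");
        filter.addAvoidPrefix("http://example.com/docs/private");
        String[] links = {
            "http://example.com/docs/intro.html",
            "http://example.com/docs/intro.html",
            "http://example.com/docs/private/x.html",
            "http://other.example.org/docs/intro.html"
        };
        for (String link : links) {
            System.out.println(filter.consider(link) + " " + link);
        }
    }
}
```

Running main classifies the four sample links as FOLLOW,
ALREADY_SEEN, AVOIDED, and IGNORED, in that order. Because the port
comparison uses URL.getPort, which returns -1 when no port is given,
a prefix written without a port will not match a link that spells out
the default port explicitly.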
9.6.1 Principles in Action
Keep it simple: choose HTTPUnit for web navigation code, minimum performance enhancements (maxLinks, linkPrefixesToAvoid collection)
Choose the right tools: JUnit, HTTPUnit, Cactus,[1] Lucene
Do one thing, and do it well: interface-free model, single entry-point to service, reliance on platform's scheduler; we also ignored this principle in deference to simplicity by combining the crawler and indexer
Strive for transparency: none
Allow for extension: configuration settings for links to ignore

[1] Unit tests elided for conciseness. Download the full version to see the tests.
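The configuration settings for links to follow or ignore are read
from the system properties com.relevance.ss.FollowLinks and
com.relevance.ss.AvoidLinks, each holding a space-separated list of
URL prefixes. As a rough illustration of that format, the sketch
below parses such a property into URLs. PrefixPropertySketch and
prefixesFromProperty are hypothetical names, not part of the
service, and unlike the original code the sketch splits on any run of
whitespace rather than on a single space.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing the space-separated prefix format.
class PrefixPropertySketch {

    // Returns the URL prefixes listed in the named system property,
    // or an empty list when the property is missing or blank.
    static List<URL> prefixesFromProperty(String propertyName) {
        List<URL> result = new ArrayList<>();
        String raw = System.getProperty(propertyName);
        if (raw == null || raw.trim().isEmpty()) {
            return result;
        }
        // Assumption: tolerate repeated spaces between prefixes.
        for (String token : raw.trim().split("\\s+")) {
            try {
                result.add(URI.create(token).toURL());
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad prefix: " + token, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Normally supplied with -D on the command line; set here for the demo.
        System.setProperty("com.relevance.ss.AvoidLinks",
                "http://example.com/private http://example.com/tmp");
        for (URL prefix : prefixesFromProperty("com.relevance.ss.AvoidLinks")) {
            System.out.println("avoid: " + prefix.toExternalForm());
        }
    }
}
```

In practice the same values could be passed when launching the JVM,
for example with -Dcom.relevance.ss.AvoidLinks="http://example.com/private http://example.com/tmp".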