Lucene is useful for searching through many kinds of documents in addition to flat files stored on local disks. One particularly important example is searching Web pages, and a tool for indexing such pages will be developed in this section. Although this tool poses no threat to companies like Google, it is nonetheless useful for indexing a small to medium-sized intranet or Web site.
The first step in developing an indexing application is the Analyzer. In particular, the first consideration should be whether the StandardAnalyzer meets the needs of the project. In this case the answer is "no," because a Web index should omit HTML tags and attributes. This means a custom Analyzer and a custom TokenStream will be needed.
The analyzer is straightforward and is shown in Listing 14.4.
package com.awl.toolbook.chapter14;

import java.io.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

public class WebAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader in) {
        char data[] = new char[2048];
        int count   = 0;
        StringBuffer buffy = new StringBuffer();

        try {
            while((count = in.read(data)) > 0) {
                buffy.append(new String(data,0,count));
            }
        } catch (Exception e) {
            System.err.println("Error reading: " + e);
        }

        return new WebTokenStream(buffy.toString());
    }
}
Essentially all that Listing 14.4 does is assemble a string containing all the data and pass that string to a WebTokenStream. This token stream handles extracting text from the page, and it is shown in Listing 14.5.
package com.awl.toolbook.chapter14;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

import org.apache.oro.text.regex.*;

public class WebTokenStream extends TokenStream {
    private Pattern tagPattern;
    private Pattern wordPattern;
    private PatternMatcherInput input;
    private PatternMatcherInput wordInput;
    private Perl5Matcher tagMatcher;
    private Perl5Matcher wordMatcher;
    private int baseOffset = 0;

    public WebTokenStream(String data) {
        try {
            Perl5Compiler compiler = new Perl5Compiler();

            // A tag, followed by the block of text up to the next tag
            tagPattern = compiler.compile("<[^>]*>([^<]*)",
                                          Perl5Compiler.SINGLELINE_MASK);

            // A single word within such a block
            wordPattern = compiler.compile("[\\w]+",
                                           Perl5Compiler.SINGLELINE_MASK);

            input      = new PatternMatcherInput(data);
            tagMatcher = new Perl5Matcher();
        } catch (MalformedPatternException e) {
            // The patterns are constants, so this should never happen
        }
    }

    public Token next() {
        // Phase two - look for a word
        if(wordMatcher != null) {
            if(wordMatcher.contains(wordInput, wordPattern)) {
                MatchResult r = wordMatcher.getMatch();
                return new Token(
                    r.toString().toLowerCase(),
                    baseOffset + r.beginOffset(0),
                    baseOffset + r.endOffset(0));
            } else {
                wordMatcher = null;
            }
        }

        // Phase one - look for blocks of words among tags
        if(tagMatcher.contains(input,tagPattern)) {
            MatchResult r = tagMatcher.getMatch();
            wordInput   = new PatternMatcherInput(r.group(1));
            baseOffset  = r.beginOffset(1);
            wordMatcher = new Perl5Matcher();
            return next();
        } else {
            return null;
        }
    }

    public void close() {}
}
The regular expression work is done with the ORO package from Chapter 13. Here two patterns are used: tagPattern, which separates contiguous sequences of characters that do not contain a < from the surrounding tags, and wordPattern, which pulls the individual words out of each such sequence.
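As a quick check of these two classes, a small driver along the following lines can be used to print the tokens produced for a fragment of HTML. This is not one of the chapter's listings; the class name and the sample markup are invented for illustration, and it simply assumes WebAnalyzer and WebTokenStream are on the classpath.

package com.awl.toolbook.chapter14;

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class WebAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        String html =
            "<html><body><h1>Hello</h1><p>Lucene Web demo</p></body></html>";

        // Run the snippet through the analyzer just as the indexer would
        TokenStream stream = new WebAnalyzer()
            .tokenStream("body", new StringReader(html));

        // Prints each word found between the tags, lowercased,
        // along with its offsets into the original page
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText() + " ["
                + t.startOffset() + "," + t.endOffset() + "]");
        }
    }
}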
The classes developed so far could be combined with the code from Listing 14.1 to create an index of Web pages stored in local files. It is more interesting to index pages that are out on the Internet, however. This is typically done by starting from one page and then following all links from that page to find others. Such a program is called a spider, because it walks across the Web. The spider in Listing 14.6 retrieves pages with HttpUnit from Chapter 5. While this is primarily intended to help unit test Web pages, it also makes it easy to fetch a page and follow its links. The Commons CLI package from Chapter 12 can be used to handle arguments to the program such as maximum depth and number of pages, as well as generate help messages. Listing 14.6 is thus an excellent example of many toolkits working together to make programming faster and easier.
Note: Before the code is presented, a disclaimer is necessary. Using this program on any site other than your own will likely be considered very bad behavior. Production-quality spiders as used by major search engines respect a number of established conventions to ensure they do not unduly burden remote sites. None of these features is supported in this simple example.
package com.awl.toolbook.chapter14;

import java.util.HashMap;
import java.util.List;

import java.io.IOException;
import java.io.File;

import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;

import org.apache.commons.cli.*;

import com.meterware.httpunit.*;

public class WebIndex {
    private int maxDepth = 3;
    private int maxPages = 100;
    private int pages    = 0;

    private HashMap visitedPages    = new HashMap();
    private IndexWriter indexWriter = null;

    private String[] args;
    public String[] getArgs() {return args;}
    public void setArgs(String[] args) {this.args = args;}

    private static Options makeOptions() {
        Options options = new Options();

        options.addOption(OptionBuilder
                          .withDescription("maximum depth to descend")
                          .withLongOpt("max-depth")
                          .hasArg()
                          .create('d'));

        options.addOption(OptionBuilder
                          .withDescription(
                              "maximum number of pages to index")
                          .withLongOpt("max-pages")
                          .hasArg()
                          .create('n'));

        return options;
    }

    public static void usage(Options options) {
        HelpFormatter formatter = new HelpFormatter();
        formatter.printHelp(
            "WebIndex [OPTION] url",
            options,
            true);
    }

    public static void main(String args[]) {
        WebIndex w = new WebIndex(args);
        w.run();
    }

    public WebIndex() {}

    public WebIndex(String args[]) {
        setArgs(args);
    }

    public void run() {
        try {
            doRun();
        } catch (Exception e) {
            System.err.println("Unable to run: " + e);
        }
    }

    private void doRun() throws Exception {
        // Create the indexWriter; open the index directory if it
        // already exists, otherwise create a new one
        Analyzer analyzer = new WebAnalyzer();

        File f = new File("webIndex");
        if(f.exists()) {
            indexWriter = new IndexWriter("webIndex",
                                          analyzer,
                                          false);
        } else {
            indexWriter = new IndexWriter("webIndex",
                                          analyzer,
                                          true);
        }

        // Process the arguments, which will also start
        // indexing at the top page
        processArgs();

        // Close and clean up
        close();
    }

    public void close() throws IOException {
        indexWriter.optimize();
        indexWriter.close();
    }

    public void processArgs() {
        // Parse the arguments; if there's an error,
        // report usage
        Options options = makeOptions();
        CommandLineParser parser = new BasicParser();
        CommandLine cmd = null;

        try {
            cmd = parser.parse(options,args);
        } catch (ParseException e) {
            System.err.println(e.getMessage());
            usage(options);
            System.exit(-1);
        }

        if(cmd.hasOption('d')) {
            maxDepth = Integer.parseInt(
                cmd.getOptionValue('d'));
        }

        if(cmd.hasOption('n')) {
            maxPages = Integer.parseInt(
                cmd.getOptionValue('n'));
        }

        List others = cmd.getArgList();
        try {
            indexPage(others.get(0).toString(),0);
        } catch (Exception e) {
            System.err.println("Unable to index: " + e);
        }
    }

    public void indexPage(String url, int depth)
        throws Exception
    {
        WebConversation wc = new WebConversation();
        WebRequest req     = new GetMethodWebRequest(url);
        WebResponse resp   = wc.getResponse(req);

        visitedPages.put(url,Boolean.TRUE);

        Document doc = new Document();
        doc.add(Field.Text("title",resp.getTitle()));
        doc.add(Field.Text("url",
                           resp.getURL().toString()));
        doc.add(Field.Text(
            "description",
            arrayToString(
                resp.getMetaTagContent("name",
                                       "description"))));
        doc.add(Field.UnStored(
            "keywords",
            arrayToString(
                resp.getMetaTagContent("name",
                                       "keywords"))));
        doc.add(Field.UnStored("body",resp.getText()));

        indexWriter.addDocument(doc);

        if(depth < maxDepth) {
            WebLink links[] = resp.getLinks();
            int newDepth    = depth+1;

            for(int i=0;i<links.length;i++) {
                String newUrl = links[i].getURLString();
                if(visitedPages.get(newUrl) == null &&
                   pages < maxPages)
                {
                    indexPage(newUrl,newDepth);
                    pages++;
                }
            }
        }
    }

    private String arrayToString(String in[]) {
        StringBuffer buffy = new StringBuffer();

        for(int i=0;i<in.length;i++) {
            buffy.append(in[i]);
            buffy.append(' ');
        }

        return buffy.toString();
    }
}
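Assuming the class and its supporting libraries (Lucene, HttpUnit, ORO, and Commons CLI) are on the classpath, an invocation might look like the following; the URL and option values here are placeholders only:

java com.awl.toolbook.chapter14.WebIndex -d 2 -n 50 http://www.example.com/

This would follow links up to two levels deep from the starting page, stop after indexing at most fifty pages, and write the result to the webIndex directory.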
Searching the new index can be done with a small variation of the program from Listing 14.2. The name of the index directory would need to be changed from index to webIndex, and the names of the retrieved fields would also need to be modified. A better solution would be to allow the name of the index and the retrieved fields to be specified on the command line.
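As a rough sketch of such a search program (this is not Listing 14.2 itself; the class name is invented, and it assumes the field names stored by WebIndex together with Lucene's QueryParser, IndexSearcher, and Hits classes):

package com.awl.toolbook.chapter14;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchWebIndex {
    public static void main(String[] args) throws Exception {
        // Open the index built by WebIndex rather than the "index" directory
        IndexSearcher searcher = new IndexSearcher("webIndex");

        // Search the "body" field by default, using the same analyzer
        Query query = QueryParser.parse(args[0], "body", new WebAnalyzer());
        Hits hits   = searcher.search(query);

        // Report the title and url fields stored by WebIndex
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("title") + " - " + doc.get("url"));
        }

        searcher.close();
    }
}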