Apache Jakarta and Beyond: A Java Programmer's Introduction [Electronic resources]


Larne Pekowsky


14.1. Creating Indices


Indices are concerned with two things: the objects being indexed and the object that will extract useful data from each of these objects. Objects to be indexed are generically called documents and are encapsulated by the Document class. Classes that extract data from documents are called analyzers.

Documents are collections of fields, implemented as instances of the Field class. A field consists of a name and a value. The value may be free text or a static string. Any field may be stored in the index, and any field may be indexed for searching. These are independent decisions; it is possible to have a field that is indexed but not stored, or stored but not indexed.

Indexed fields may be searched; for example, the Document representing a Web page might have a searchable field for the page body and another for the title. It would then be possible to search for pages whose title contains "lucene" and whose body contains "create index."

Stored fields are available as the result of a search. The Web page Document might have a stored field for the URL so that after a search, the location of the page is readily available. Note that the title might be stored, but the body probably would not be, because doing so would make the index huge. Thus, even though it would be possible to search for bodies containing given text, displaying that body would require accessing the file via the stored URL.

The Field class contains a number of static methods that create fields of various types. UnIndexed() creates a field that is stored but not indexed. There are a few variations of the Text() method, including one that creates a stored and indexed static string and one that takes a Reader and creates a field that will be indexed based on the data read but will not be stored. A number of these variations are used in Listing 14.1, which shows a class that constructs Document instances suitable for storing information about files. The Document class is final; otherwise, Listing 14.1 would likely extend it.


Listing 14.1. A simple container for File information


package com.awl.toolbook.chapter14;

import java.io.*;

import org.apache.lucene.document.*;

public class FileDocument {
    public static Document makeDocument(File f)
        throws IOException
    {
        Document d = new Document();

        // Stored and indexed
        d.add(Field.Text("path", f.getPath()));

        // Stored and indexed
        d.add(Field.Keyword("lastModified",
                            DateField.timeToString(
                                f.lastModified())));

        // Stored, not indexed
        d.add(Field.UnIndexed("length",
                              Long.toString(f.length())));

        // Indexed, not stored
        d.add(Field.Text("contents",
                         new FileReader(f)));

        return d;
    }
}

The path and lastModified fields are stored and indexed. The DateField.timeToString() method converts a date into a form that can be lexicographically ordered, ensuring that searches on date fields are meaningful.
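The idea behind a lexicographically orderable encoding can be illustrated without Lucene: encoding a number as a fixed-width, zero-padded string makes string comparison agree with numeric comparison. The PadDate class below is a hypothetical sketch of this principle only; it is not Lucene's actual encoding.

```java
// Sketch: fixed-width, zero-padded timestamps sort lexicographically in the
// same order as they sort numerically, which is what makes range searches
// on date fields meaningful. (PadDate is illustrative, not part of Lucene.)
public class PadDate {
    // Encode a millisecond timestamp as a fixed-width decimal string.
    // A long has at most 19 decimal digits, hence the width of 19.
    public static String encode(long millis) {
        return String.format("%019d", millis);
    }

    public static void main(String[] args) {
        String earlier = encode(1000L);
        String later   = encode(999999L);
        // String order matches numeric order thanks to the padding.
        System.out.println(earlier.compareTo(later) < 0); // true
    }
}
```

Without the padding, "999999" would compare as greater than "1000" numerically but "1000" would sort before "999999" only by accident of digit count; zero-padding removes that ambiguity.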

An analyzer's job is to convert the fields in a document into streams of tokens. A token may be thought of as a single word, although in general the issue is more complex, as discussed in Chapter 15. Future versions of Lucene will likely come with a tokenizer and analyzer that work with POI.

At the API level all an Analyzer must do is provide a tokenStream() method that takes as arguments the name of a field and a Reader from which the data will be obtained. This method should return an instance of a class that extends TokenStream. This class must provide a next() method that will return the next token, or null when there are no more tokens available. Lucene provides a StandardTokenizer that uses rules common to many European languages to split a stream into words.
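The return-a-token-or-null contract of next() can be illustrated in plain Java. The SimpleTokenStream class below is a hypothetical stand-in built on java.io.StreamTokenizer; it mimics the protocol described above but is not Lucene's TokenStream and does not extend it.

```java
// Sketch of the TokenStream contract: next() returns the next token, or
// null when the stream is exhausted. (SimpleTokenStream is illustrative
// and does not use any Lucene classes.)
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

public class SimpleTokenStream {
    private final StreamTokenizer tokenizer;

    public SimpleTokenStream(Reader reader) {
        tokenizer = new StreamTokenizer(new BufferedReader(reader));
        tokenizer.lowerCaseMode(true); // normalize case, as analyzers often do
    }

    // Return the next word token, or null when no more tokens are available.
    public String next() {
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    return tokenizer.sval;
                }
            }
        } catch (IOException e) {
            // For this sketch, treat an I/O failure as end of stream.
        }
        return null;
    }

    public static void main(String[] args) {
        SimpleTokenStream ts = new SimpleTokenStream(
            new java.io.StringReader("Create an Index"));
        String token;
        while ((token = ts.next()) != null) {
            System.out.println(token);
        }
    }
}
```

A caller simply loops until next() returns null, exactly as Lucene does internally when indexing a field.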

Along with the StandardTokenizer there is a provided StandardAnalyzer class, which will be more than sufficient for a great many applications. In addition to using the words obtained from StandardTokenizer, StandardAnalyzer also discards stop words, common English words like "a" and "the" that are not useful for searching.[1]

[1] Readers familiar with how compilers work may note interesting parallels between the lexing and parsing phases and tokenizers and analyzers. In both there is first a step that determines what the symbols are and a second that determines what they mean.
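Stop-word removal itself is straightforward to sketch in plain Java. The StopFilter class below uses an illustrative stop set, not Lucene's actual list, and is a hypothetical stand-in for what StandardAnalyzer does internally.

```java
// Sketch of stop-word removal: tokens that appear in a fixed set of common
// words are dropped before indexing. (The stop set here is an illustrative
// subset, not the list StandardAnalyzer actually uses.)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopFilter {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Return the input tokens with stop words removed.
    public static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(filter(
            Arrays.asList("the", "quick", "brown", "fox")));
        // prints [quick, brown, fox]
    }
}
```

Dropping such words shrinks the index and avoids matches that carry no meaning; the cost is that phrases consisting entirely of stop words cannot be searched for.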


The TokenStream and Analyzer are two of the three classes that are needed in order to create indices. The third class is IndexWriter, which is provided as part of Lucene and does not need to be extended. IndexWriter is initialized with the name of the index to create, an analyzer, and a flag indicating whether a new index should be created or an existing one should be opened for appending.

Listing 14.2 shows a program that will create an index from a collection of files and directories specified on the command line.


Listing 14.2. A simple program to index files


package com.awl.toolbook.chapter14;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;

public class SimpleIndex {
    public static void main(String[] args)
        throws Exception
    {
        SimpleIndex indx = new SimpleIndex();
        for(int i=0; i<args.length; i++) {
            indx.addToIndex(new File(args[i]));
        }
        indx.close();
    }

    private int count = 0;
    private IndexWriter indexWriter = null;

    public SimpleIndex()
        throws Exception
    {
        Analyzer analyzer = new StandardAnalyzer();

        File f = new File("index");
        if(f.exists()) {
            // Open the existing index for appending
            indexWriter = new IndexWriter("index",
                                          analyzer,
                                          false);
        } else {
            // Create a new index
            indexWriter = new IndexWriter("index",
                                          analyzer,
                                          true);
        }
    }

    public void close() throws IOException {
        System.out.println("");
        System.out.println("Indexed " + count +
                           " documents");
        indexWriter.optimize();
        indexWriter.close();
    }

    public void addToIndex(File f)
        throws Exception
    {
        if(f.isDirectory()) {
            String[] files = f.list();
            for(int i=0; i<files.length; i++) {
                addToIndex(new File(f, files[i]));
            }
        } else {
            indexWriter.addDocument(
                FileDocument.makeDocument(f));
            count++;
            if(count % 100 == 0) {
                System.out.print(count + "...");
            }
        }
    }
}

main() simply creates an instance of SimpleIndex and passes each of the arguments to addToIndex(). The name used to create the index ("index" in this case) names a directory, and several files will be created in this directory. The constructor for SimpleIndex checks whether there is already an index contained in the index directory. If so, it opens the index for appending; if not, it creates a new index.

The single most important line in the program is the call to


indexWriter.addDocument(FileDocument.makeDocument(f))

First, a new Document is created, and the various fields are initialized. When this is passed to indexWriter, each of the fields will be examined. Fields that are not indexed will be written into the index verbatim. Those fields that are indexed will be run through the analyzer, and the resulting tokens will be added to the index.
