14.2. Using Indices
Searching is handled by an instance of the IndexSearcher class, which is constructed with the name of the index to use. To prepare a program to use the index constructed in the preceding section, only the following is necessary:
Searcher searcher = new IndexSearcher("index");
Details about the search to perform are encapsulated in objects that extend the base Query class. There are numerous ways to obtain such a Query, which will be examined shortly.
Assuming a Query has been obtained, performing the search and processing the results are straightforward. The Query is passed to the IndexSearcher, which returns a Hits object. This object provides methods to obtain the number of matching documents and the documents themselves. Methods are also available to obtain a score, a number representing how well each document matches the given query. The documents obtained from the IndexSearcher are closely related to the documents that were originally given to the indexWriter.addDocument() method, except that nonstored fields are not available. These ideas are illustrated in the following code snippet:
// Run the query and report how many documents matched
Hits hits = searcher.search(query);
int len = hits.length();
System.out.println("Found " + len + " documents");
// Print the stored fields and the score of each match
for (int i = 0; i < len; i++) {
    Document doc = hits.doc(i);
    System.out.println(i + ". " +
                       doc.get("path") +
                       " (" + hits.score(i) + ") " +
                       "Length:" + doc.get("length") +
                       " Modified:" + doc.get("lastModified"));
}
Queries may be constructed programmatically or through a user-friendly query language that can be parsed into a query. The programmatic method will be presented first.
The most basic kind of query searches for the occurrence of a word in a field. This is done through use of an auxiliary class called a Term. For example, to search for all documents whose contents contain the word "jakarta":
Term term = new Term("contents","jakarta");
Query query = new TermQuery(term);
Closely related to TermQuery is PhraseQuery, which looks for a sequence of terms in order. The following will search for the phrase "lucene makes searching easy" in file contents:
// Declared as PhraseQuery, since add() is not defined on the Query base class
PhraseQuery query = new PhraseQuery();
query.add(new Term("contents","lucene"));
query.add(new Term("contents","makes"));
query.add(new Term("contents","searching"));
query.add(new Term("contents","easy"));
Note that each call to add() appends the Term to the end of the phrase. Also note that this is not equivalent to
Query query =
    new TermQuery(
        new Term("contents","lucene makes searching easy"));
This is because the BasicTokenStream splits contents into individual words, and it is these words that are indexed. While it is certainly possible to create a tokenizer that splits fields into sentences or larger units, this was not done in the example, and therefore there is no single token that contains the entire phrase. Consequently, a search on the whole phrase with a single TermQuery will return zero results.
There is also a WildcardQuery that can search for terms with a limited, regular-expression-like syntax. A question mark in a term represents any single character, and an asterisk represents any number of characters.
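The wildcard rules described above can be tried out in plain Java, without Lucene. The class below, WildcardSketch, is a hypothetical helper written only for illustration. It is not part of Lucene and makes no claim about how WildcardQuery is implemented. It assumes that text is split into tokens on whitespace and that a pattern must match a whole token.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of "?" and "*" wildcard semantics; not Lucene code.
final class WildcardSketch {

    // Translate a wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');          // exactly one character
            } else if (c == '*') {
                sb.append(".*");         // zero or more characters
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // True if the whole token matches the wildcard pattern.
    static boolean matches(String wildcard, String token) {
        return toRegex(wildcard).matcher(token).matches();
    }

    // True if any whitespace-separated token of the text matches.
    // Splitting on whitespace is an assumption standing in for a real tokenizer.
    static boolean anyTokenMatches(String wildcard, String text) {
        for (String token : text.split("\\s+")) {
            if (matches(wildcard, token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("?oose", "moose"));
        System.out.println(matches("m*se", "muse"));
        System.out.println(anyTokenMatches("m*se", "my nose"));
    }
}
```

Because each token is tested separately, a pattern can never match across a word boundary, which is the behavior the wildcard examples in this section rely on.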
Query query = new WildcardQuery(
    new Term("contents","?oose"));
will find all documents containing any of "moose," "goose," and so on. Similarly,
Query query = new WildcardQuery(
    new Term("contents","m*se"));
will find documents with "moose," "muse," and so on. Note that this only matches single tokens, so documents containing "my nose" will not match.
There is also a FuzzyQuery that matches words that are "somewhat like" the given term. The exact algorithm by which this is done is beyond the scope of this book, but the intent is to catch common misspellings. The best way to get a feel for how this works is to try it by experimenting with the code provided in this chapter.
Finally, there is a RangeQuery that matches all terms in a given range. This may be used on text fields as well as dates.
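A range over terms can be read as an ordinary string comparison. The sketch below is plain Java, not Lucene code. It assumes that terms are ordered as String.compareTo orders them and that dates are stored as yyyyMMdd strings, as the examples in this section suggest. RangeSketch and inRange are hypothetical names introduced for illustration.

```java
// Hypothetical illustration of range matching over terms; not Lucene code.
final class RangeSketch {

    // True if term lies between lower and upper in String.compareTo order.
    // When inclusive is true the two bounds themselves also match.
    static boolean inRange(String term, String lower, String upper,
                           boolean inclusive) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }

    public static void main(String[] args) {
        // Dates written as yyyyMMdd sort in calendar order.
        System.out.println(inRange("20020315", "20020301", "20020331", true));
        System.out.println(inRange("20020401", "20020301", "20020331", true));
        // "ozone" sorts after "oz", so it falls outside the range oa..oz.
        System.out.println(inRange("ozone", "oa", "oz", true));
    }
}
```

Storing dates with the most significant part first is what makes a string range behave like a date range. A format such as ddMMyyyy would not sort chronologically.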
Query query = new RangeQuery(
    new Term("contents","oa"),
    new Term("contents","oz"),
    true);
will match documents that contain words starting with the letter "o." The first argument in the constructor is the lower bound, the second is the upper bound, and the last is a flag indicating whether the search should be inclusive. Note that this is roughly equivalent to a WildcardQuery on "o*".
Query query = new RangeQuery(
    new Term("lastUpdated","20020301"),
    new Term("lastUpdated","20020331"),
    true);
will match documents that were last modified in March of 2002.
Each of these is an atomic query, meaning it places a single restriction on a single field. Atomic queries may be combined into BooleanQueries to express more complex searches, such as documents containing two terms or documents created in a certain time range that do not contain a certain term.
A BooleanQuery is initially created empty:
BooleanQuery bq = new BooleanQuery();
Any query may be made part of a BooleanQuery by calling the add() method. This method takes three arguments: the query, a flag indicating whether the query must match a document in order for the document to be included in the results of the search, and another flag indicating whether the query must not match in order for the document to be included in the results. Adding two required queries, as in
bq.add(new TermQuery(new Term("contents","java")),
       true,
       false);
bq.add(new TermQuery(new Term("contents","linux")),
       true,
       false);
means that returned documents must contain both "java" and "linux." All queries added with the required flag set to true are therefore logically connected by boolean AND.
Adding a query with the "forbidden" flag set to true will remove matching documents. Adding the following lines to the preceding ones
bq.add(new RangeQuery(
           new Term("lastUpdated","20020601"),
           new Term("lastUpdated","20020631"),
           true),
       false,
       true);
will result in the retrieval of documents containing "java" and "linux" except those created in June 2002. Queries added in this way are thus logically connected by boolean AND NOT operators.
Clearly a condition cannot be both required and forbidden, and so it is an error to set both flags to true. However, a condition can be neither required nor forbidden, in which case it is optional. If all the queries added to a BooleanQuery have neither flag set, then at least one query must match for a document to be included in the results. Such queries are therefore connected by boolean OR.
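The three kinds of clause can be summarized in a small plain-Java model. BooleanSketch below is a hypothetical illustration of the rules described in this section, not Lucene's implementation. A document is modeled as a set of words. As a simplifying assumption, a document is accepted when no forbidden clause matches, every required clause matches, and, if there are no required clauses at all, at least one optional clause matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of required, forbidden, and optional clauses; not Lucene code.
final class BooleanSketch {
    private final List<Predicate<Set<String>>> required = new ArrayList<>();
    private final List<Predicate<Set<String>>> forbidden = new ArrayList<>();
    private final List<Predicate<Set<String>>> optional = new ArrayList<>();

    // Mirrors the three-argument add() described in the text.
    void add(Predicate<Set<String>> clause, boolean isRequired, boolean isForbidden) {
        if (isRequired && isForbidden) {
            throw new IllegalArgumentException(
                "a clause cannot be both required and forbidden");
        }
        if (isRequired) {
            required.add(clause);
        } else if (isForbidden) {
            forbidden.add(clause);
        } else {
            optional.add(clause);
        }
    }

    boolean matches(Set<String> doc) {
        for (Predicate<Set<String>> p : forbidden) {
            if (p.test(doc)) return false;      // AND NOT
        }
        for (Predicate<Set<String>> p : required) {
            if (!p.test(doc)) return false;     // AND
        }
        if (!required.isEmpty()) return true;
        for (Predicate<Set<String>> p : optional) {
            if (p.test(doc)) return true;       // OR
        }
        return false;
    }

    // A clause that matches documents containing the given word.
    static Predicate<Set<String>> term(String word) {
        return doc -> doc.contains(word);
    }

    // Convenience constructor for a document made of the given words.
    static Set<String> doc(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    public static void main(String[] args) {
        BooleanSketch either = new BooleanSketch();
        either.add(term("linux"), false, false);
        either.add(term("windows"), false, false);
        System.out.println(either.matches(doc("java", "windows")));
        System.out.println(either.matches(doc("java", "solaris")));
    }
}
```

Since each clause is itself a predicate, one BooleanSketch can be wrapped as a clause of another with a method reference such as inner::matches, which gives the same kind of nesting that BooleanQuery allows.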
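// A phrase is built term by term; PhraseQuery has no (field, phrase) constructor
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents","java"));
phrase.add(new Term("contents","search"));
phrase.add(new Term("contents","tool"));
BooleanQuery bq = new BooleanQuery();
bq.add(phrase,
       false,
       false);
bq.add(new TermQuery(new Term("contents","lucene")),
       false,
       false);
will find documents containing either the phrase "java search tool," the word "lucene," or both.
Any query can be added into a BooleanQuery, including other BooleanQueries. This allows complex logical expressions to be constructed, as in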
BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("contents","java")),
       true,
       false);
bq.add(new RangeQuery(
           new Term("lastUpdated","20021201"),
           new Term("lastUpdated","20021231"),
           true),
       true,
       false);
// The nested query is an OR of two optional terms
BooleanQuery nested = new BooleanQuery();
nested.add(new TermQuery(new Term("contents","linux")),
           false,
           false);
nested.add(new TermQuery(new Term("contents","windows")),
           false,
           false);
bq.add(nested, true, false);
which represents a search for documents that were last modified between December 1, 2002, and December 31, 2002, and that contain the word "java" along with either "linux," "windows," or both.
Programmatically building queries allows rules to be specified precisely, but it can be rather limiting. Typically, search applications do not hard-code the search parameters but allow the end user to enter them interactively. Writing a query as a sequence of Java statements makes this kind of interactivity awkward to support. For that reason Lucene supports a query language with a simple user-friendly syntax that can be easily parsed into a query.
Term queries are represented as the name of the field to search, followed by a colon, followed by the term. For example,
contents:java
or
lastUpdated:20030301
When the search is performed on a field specified as the default, the name of the field can be omitted.
Queries with wildcards, such as m?se or octo*, are automatically handled as WildcardQuery instances.
Phrase queries are indicated by quotation marks around the phrase, such as
title:"how to use lucene"
Queries can be grouped with parentheses and can be connected by logical operators AND, OR, and NOT.
java AND (linux OR windows) AND NOT title:"installing"
would search for documents containing "java" and either "linux" or "windows" in the default field, omitting documents with "installing" in the title.
Logical constraints can also be specified with the + and - modifiers. A + before a term, phrase, or other clause requires that term, just as setting the required flag to true does. Likewise, a - requires that a term not be present, just as setting the forbidden flag to true does.
The preceding example is therefore equivalent to
+java (linux windows) -title:installing
Note that linux and windows have no modifiers; this is because the default operator is OR.
A field specifier can precede a group, so
title:(lucene OR search)
means the same as
(title:lucene OR title:search)
Fuzzy matches are indicated by a tilde (~) after the term.
A query expression, here held in the String line, is parsed into a Query by the QueryParser class:
Analyzer analyzer = new BasicAnalyzer();
Query query = QueryParser.parse(line, "contents", analyzer);
The resulting query can be used as normal. Listing 14.3 shows a complete search application that takes as an argument a search expression and returns the list of matching documents.
Listing 14.3. A simple search application
package com.awl.toolbook.chapter14;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class BasicSearch {
    public static void main(String[] args)
        throws Exception
    {
        // Open the index built in the preceding section
        Searcher searcher = new IndexSearcher("index");
        Analyzer analyzer = new StandardAnalyzer();

        // Parse the first argument, using "contents" as the default field
        Query query = QueryParser.parse(args[0],
                                        "contents",
                                        analyzer);

        Hits hits = searcher.search(query);
        int len = hits.length();
        System.out.println("Found " + len + " documents");

        // Print the stored fields and the score of each match
        for (int i = 0; i < len; i++) {
            Document doc = hits.doc(i);
            System.out.println(i + ". " +
                               doc.get("path") +
                               " (" + hits.score(i) + ") " +
                               "Length:" + doc.get("length") +
                               " Modified:" + doc.get("lastModified"));
        }
    }
}
Note that Listing 14.3 expects the entire search expression as the first argument. In most shells this will mean putting single or double quotes around compound expressions.