Apache Jakarta and Beyond: A Java Programmer's Introduction

14.2. Using Indices

Searching is handled by an instance of the IndexSearcher class, which is constructed with the name of the index to use. To use the index constructed in the preceding section, a program needs only the following:


Searcher searcher = new IndexSearcher("index");

Details about the search to perform are encapsulated in objects that extend the base Query class. There are numerous ways to obtain such a Query, which will be examined shortly.

Assuming a Query has been obtained, performing the search and processing the results are straightforward. The Query is passed to the IndexSearcher, which returns a Hits object. This object provides methods to obtain the number of matching documents and the documents themselves. Methods are also available to obtain a score, a number representing how well each document matches the given query. The documents obtained from the IndexSearcher are closely related to the documents that were originally given to the indexWriter.addDocument() method, except that nonstored fields are not available. These ideas are illustrated in the following code snippet:


Hits hits = searcher.search(query);
int len   = hits.length();
System.out.println("Found " + len + " documents");
for(int i = 0; i < len; i++) {
    Document doc = hits.doc(i);
    System.out.println(i + ". " +
                       doc.get("path") +
                       " (" + hits.score(i) + ") " +
                       "Length:" + doc.get("length") +
                       " Modified:" + doc.get("lastModified"));
}

Queries may be constructed programmatically or through a user-friendly query language that can be parsed into a query. The programmatic method will be presented first.

The most basic kind of query searches for the occurrence of a word in a field. This is done through use of an auxiliary class called a Term. For example, to search for all documents whose contents contain the word "jakarta":


Term term   = new Term("contents","jakarta");
Query query = new TermQuery(term);

Closely related to TermQuery is PhraseQuery, which looks for a sequence of terms in order. The following will search for the phrase "lucene makes searching easy" in file contents:


PhraseQuery query = new PhraseQuery();
query.add(new Term("contents","lucene"));
query.add(new Term("contents","makes"));
query.add(new Term("contents","searching"));
query.add(new Term("contents","easy"));

Note that each call to add() appends the Term to the end of the phrase. Also note that this is not equivalent to


Query query =
new TermQuery(
new Term("contents","lucene makes searching easy"));

This is because the BasicTokenStream splits contents into individual words, and it is these words that are indexed. While it is certainly possible to create a tokenizer that splits fields into sentences or larger units, this was not done in the example, and therefore there is no single token that contains the entire phrase. Consequently, a search on the whole phrase with a single TermQuery will return zero results.
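The difference can be illustrated without Lucene at all. The following is a minimal sketch, assuming a tokenizer that simply splits on non-alphanumeric characters; the class and method names are hypothetical and are not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names are Lucene classes.
final class PhraseVersusTermSketch {

    // Split text into lowercase word tokens, roughly as a word-level tokenizer would.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // A term matches only if it is exactly equal to one indexed token.
    static boolean termMatches(List<String> tokens, String term) {
        return tokens.contains(term);
    }

    // A phrase matches if its words appear as consecutive tokens, in order.
    static boolean phraseMatches(List<String> tokens, String... words) {
        return Collections.indexOfSubList(tokens, Arrays.asList(words)) >= 0;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Lucene makes searching easy.");
        // No single token holds the whole phrase, so the term lookup fails.
        System.out.println(termMatches(tokens, "lucene makes searching easy"));
        // The four words do occur consecutively, so the phrase lookup succeeds.
        System.out.println(phraseMatches(tokens, "lucene", "makes", "searching", "easy"));
    }
}
```

Running main prints false for the single-term lookup and true for the phrase lookup, which mirrors the zero-result behavior described above.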

There is also a WildcardQuery that can search for terms with a limited regular expression syntax. A question mark in a term represents a single letter, and an asterisk represents any number of letters.


Query query = new WildcardQuery(
new Term("contents","?oose"));

will find all documents containing any of "moose," "goose," and so on. Similarly,


Query query = new WildcardQuery(
new Term("contents","m*se"));

will find documents with "moose," "muse," and so on. Note that this only matches single tokens, so documents containing "my nose" will not match.
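The token-level behavior can be sketched in plain Java. This is an illustration under stated assumptions, not Lucene's implementation: the wildcard is translated into a java.util.regex pattern, and the hypothetical anyTokenMatches() helper tests each token separately.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how Lucene implements WildcardQuery.
final class WildcardSketch {

    // Translate '?' to "any one character" and '*' to "any run of characters";
    // every other character is quoted so it matches itself literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // The pattern must cover one whole token, never a span of several tokens.
    static boolean anyTokenMatches(String wildcard, List<String> tokens) {
        Pattern pattern = toPattern(wildcard);
        for (String token : tokens) {
            if (pattern.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyTokenMatches("?oose", Arrays.asList("a", "goose")));  // true
        System.out.println(anyTokenMatches("m*se", Arrays.asList("the", "muse")));  // true
        System.out.println(anyTokenMatches("m*se", Arrays.asList("my", "nose")));   // false
    }
}
```

The last call returns false because neither "my" nor "nose" matches on its own, even though the two words together would fit the pattern.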

There is also a FuzzyQuery that matches words that are "somewhat like" the given term. The exact algorithm by which this is done is beyond the scope of this book, but the intent is to catch common misspellings. The best way to get a feel for how this works is to try it by experimenting with the code provided in this chapter.

Finally, there is a RangeQuery that matches all terms in a given range. This may be used on text fields, as well as dates.


Query query = new RangeQuery(
new Term("contents","oa"),
new Term("contents","oz"),
true);

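will match documents that contain any word between "oa" and "oz" in lexicographic order, which covers nearly every word starting with the letter "o." The first argument in the constructor is the lower bound, the second is the upper bound, and the last is a flag indicating whether the bounds themselves are included. Note that this is roughly equivalent to a WildcardQuery on the term "o*"; the range misses the bare word "o" and words such as "ozone" that sort after "oz."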


Query query = new RangeQuery(
new Term("lastUpdated","20020301"),
new Term("lastUpdated","20020331"),
true);

will match documents that were last modified in March of 2002.
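Both examples rely on terms being compared as strings. The sketch below assumes plain lexicographic ordering, which is what String.compareTo() provides, and uses a hypothetical inRange() helper to show why dates written as yyyyMMdd sort correctly and where a text range begins and ends.

```java
// Hypothetical illustration only; inRange() is not a Lucene method.
final class RangeSketch {

    // True if term lies between lower and upper in lexicographic order.
    // The inclusive flag decides whether the two bounds themselves count.
    static boolean inRange(String term, String lower, String upper, boolean inclusive) {
        int vsLower = term.compareTo(lower);
        int vsUpper = term.compareTo(upper);
        return inclusive
            ? (vsLower >= 0 && vsUpper <= 0)
            : (vsLower > 0 && vsUpper < 0);
    }

    public static void main(String[] args) {
        // Dates stored as yyyyMMdd strings sort in calendar order.
        System.out.println(inRange("20020315", "20020301", "20020331", true));  // true
        System.out.println(inRange("20020401", "20020301", "20020331", true));  // false

        // Text ranges follow the same rule.
        System.out.println(inRange("ocean", "oa", "oz", true));   // true
        System.out.println(inRange("ozone", "oa", "oz", true));   // false: sorts after "oz"
    }
}
```

This is also why dates need a fixed-width, most-significant-first format: a value such as "2002-3-1" would not compare correctly against "2002-12-1" as a string.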

Each of these is an atomic query, meaning it places a single restriction on a single field. Atomic queries may be combined into BooleanQueries to express more complex searches such as documents containing two terms or documents created in a certain time range that do not contain a certain term.

A BooleanQuery is initially created empty:


BooleanQuery bq = new BooleanQuery();

Any query may be made part of a BooleanQuery by calling the add() method. This method takes three arguments: the query, a flag indicating whether the query must match a document in order for the document to be included in the results of the search, and another flag indicating whether the query must not match in order for the document to be included in the results. Adding two required queries, as in


bq.add(new TermQuery(new Term("contents","java")),
       true,
       false);
bq.add(new TermQuery(new Term("contents","linux")),
       true,
       false);

means that returned documents must contain both "java" and "linux." All queries added with the required flag set to true are therefore logically connected by boolean AND.

Adding a query with the "forbidden" flag set to true will remove matching documents. Adding the following lines to the preceding ones


bq.add(new RangeQuery(
           new Term("lastUpdated","20020601"),
           new Term("lastUpdated","20020630"),
           true),
       false,
       true);

will result in the retrieval of documents containing "java" and "linux" except those last modified in June 2002. Queries added in this way are thus logically connected by boolean AND NOT operators.

Clearly a condition cannot be both required and forbidden, and so it is an error to set both flags to true. However, a condition can be neither required nor forbidden, in which case it is optional. If all the queries added to a BooleanQuery have neither flag set, then at least one query must match for a document to be included in the results. Such queries are therefore connected by boolean OR.


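PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents","java"));
phrase.add(new Term("contents","search"));
phrase.add(new Term("contents","tool"));
bq.add(phrase,
       false,
       false);
bq.add(new TermQuery(new Term("contents","lucene")),
       false,
       false);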

will find documents containing either the phrase "java search tool," the word "lucene," or both.
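The three combinations of flags can be summarized in a small stand-alone model. This is a sketch of the rules as described above, not Lucene's implementation: documents are plain strings, clauses are predicates, and BooleanClauseSketch and word() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical model of required / forbidden / optional clauses; not a Lucene class.
final class BooleanClauseSketch {

    private static final class Clause {
        final Predicate<String> test;
        final boolean required;
        final boolean prohibited;

        Clause(Predicate<String> test, boolean required, boolean prohibited) {
            this.test = test;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    private final List<Clause> clauses = new ArrayList<>();

    // Helper: a clause that matches documents containing the given lowercase word.
    static Predicate<String> word(String w) {
        return doc -> Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")).contains(w);
    }

    // Mirrors the three-argument add(): a clause cannot be both required and forbidden.
    BooleanClauseSketch add(Predicate<String> test, boolean required, boolean prohibited) {
        if (required && prohibited) {
            throw new IllegalArgumentException("clause cannot be both required and prohibited");
        }
        clauses.add(new Clause(test, required, prohibited));
        return this;
    }

    boolean matches(String doc) {
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (Clause c : clauses) {
            boolean hit = c.test.test(doc);
            if (c.prohibited) {
                if (hit) return false;       // AND NOT: a forbidden match excludes the document
            } else if (c.required) {
                hasRequired = true;
                if (!hit) return false;      // AND: every required clause must match
            } else if (hit) {
                optionalHit = true;          // OR: remember that some optional clause matched
            }
        }
        // With no required clauses, at least one optional clause must have matched.
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        BooleanClauseSketch both = new BooleanClauseSketch()
            .add(word("java"), true, false)
            .add(word("linux"), true, false);
        System.out.println(both.matches("java on linux"));    // true
        System.out.println(both.matches("java on windows"));  // false
    }
}
```

One design choice worth noting: in this model a query made up only of forbidden clauses matches nothing, since there is no positive clause to select documents in the first place.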

Any query can be added into a BooleanQuery, including other BooleanQueries. This allows complex logical expressions to be constructed, as in


BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("contents","java")),
       true,
       false);
bq.add(new RangeQuery(
           new Term("lastUpdated","20021201"),
           new Term("lastUpdated","20021231"),
           true),
       true,
       false);
BooleanQuery nested = new BooleanQuery();
nested.add(new TermQuery(new Term("contents","linux")),
           false,
           false);
nested.add(new TermQuery(new Term("contents","windows")),
           false,
           false);
bq.add(nested,true,false);

which represents a search for documents that were last modified between December 1, 2002, and December 31, 2002, and that contain the word "java" along with either "linux," "windows," or both.

Programmatically building queries allows rules to be specified precisely, but it can be rather limiting. Typically search applications do not hard-code the search parameters but allow the end user to enter them interactively. A query hard-coded as a sequence of Java statements cannot support this kind of interactivity. For that reason Lucene supports a query language with a simple user-friendly syntax that can be easily parsed into a query.

Term queries are represented as the name of the field to search, followed by a colon, followed by the term. For example,

contents:java

or

lastUpdated:20030301

When the search is performed on a field specified as the default, the name of the field can be omitted.

Queries with wildcards, such as m?se or octo*, are automatically handled as WildcardQuery instances.

Phrase queries are indicated by quotation marks around the phrase, such as


title:"how to use lucene"

Queries can be grouped with parentheses and can be connected by logical operators AND, OR, and NOT.


java AND (linux OR windows) AND NOT title:"installing"

would search for documents containing "java" and either "linux" or "windows" in the default field, omitting documents with "installing" in the title.

Logical constraints can also be specified with the + and - modifiers. A + before a term, phrase, or other clause requires that clause to match, just as setting the required flag to true does. Likewise, a - requires that the clause not match, just as setting the forbidden flag to true does.

The preceding example is therefore equivalent to


+java +(linux windows) -title:installing

Note that linux and windows have no modifiers; this is because the default operator is OR.
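The correspondence between the modifiers and the two flags can be sketched as a tiny classifier. This is a simplified illustration, not the QueryParser: it splits on whitespace only, so it does not handle phrases or parenthesized groups, and the ModifierSketch class and its Occur enum are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; the real QueryParser handles far more syntax.
final class ModifierSketch {

    enum Occur { REQUIRED, PROHIBITED, OPTIONAL }

    // A leading '+' means required, a leading '-' means forbidden, otherwise optional.
    static Occur occurOf(String clause) {
        if (clause.startsWith("+")) return Occur.REQUIRED;
        if (clause.startsWith("-")) return Occur.PROHIBITED;
        return Occur.OPTIONAL;
    }

    // The clause text with any leading modifier removed.
    static String body(String clause) {
        return occurOf(clause) == Occur.OPTIONAL ? clause : clause.substring(1);
    }

    // Classify every whitespace-separated clause, preserving the original order.
    static Map<String, Occur> classify(String query) {
        Map<String, Occur> result = new LinkedHashMap<>();
        for (String clause : query.trim().split("\\s+")) {
            if (!clause.isEmpty()) {
                result.put(body(clause), occurOf(clause));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify("+java linux -title:installing"));
    }
}
```

In terms of the add() method shown earlier, REQUIRED corresponds to the flags (true, false), PROHIBITED to (false, true), and OPTIONAL to (false, false).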

A field specifier can precede a group, so


title:(lucene OR search)

means the same as


(title:lucene OR title:search)

Fuzzy matches are indicated by a tilde (~) following the word, as in lucene~. Adding a number after the tilde affects the "degree" of fuzziness; lucene~10 will match documents with words that are further from "lucene" than lucene~4.

Range queries are indicated with the word TO and enclosed in brackets, as in modifiedDate:[20030101 TO 20030201] or title:[aardvark TO ant].

Programs can support the query language with two lines of code. A static parse() method in the QueryParser class takes care of all the hard work. This method takes a string representing the query to be parsed, the name of the default field, and an Analyzer, which will usually be the same one with which the index was built.


Analyzer analyzer = new BasicAnalyzer();
Query query = QueryParser.parse(line, "contents", analyzer);

The resulting query can be used as normal. Listing 14.3 shows a complete search application that takes as an argument a search expression and returns the list of matching documents.

Listing 14.3. A simple search application


package com.awl.toolbook.chapter14;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class BasicSearch {
    public static void main(String[] args)
        throws Exception
    {
        // Open the index built in the preceding section
        Searcher searcher = new IndexSearcher("index");
        Analyzer analyzer = new StandardAnalyzer();

        // Parse the first argument, using "contents" as the default field
        Query query       = QueryParser.parse(args[0],
                                              "contents",
                                              analyzer);

        // Run the search and report each hit with its score
        Hits hits = searcher.search(query);
        int len   = hits.length();
        System.out.println("Found " + len + " documents");
        for(int i = 0; i < len; i++) {
            Document doc = hits.doc(i);
            System.out.println(i + ". " +
                               doc.get("path") +
                               " (" + hits.score(i) + ") " +
                               "Length:" + doc.get("length") +
                               " Modified:" + doc.get("lastModified"));
        }
    }
}

Note that Listing 14.3 expects the entire search expression as the first argument. In most shells this will mean putting single or double quotes around compound expressions.

Larne Pekowsky
