Chapter 13 discussed one common approach to working with text, defining a structure through a language called regular expressions and checking an input against this structure. One application of this approach is finding portions of text from a collection that meet a certain precisely defined criteria.
A very common related problem is finding documents from within a large collection that meet a less precisely defined requirement: for example, finding all Web pages that discuss JavaServer Pages or finding all e-mails from John Smith. This is called a
text search or
free text search to reinforce the idea that the desired text is "free" to appear anywhere in the documents.
Text searches could be tackled by the tools discussed in Chapter 13. The Egrep tool could be used to examine a set of Web pages for the pattern "JavaServer\W*pages" or e-mails for the pattern "John\W*Smith." However, this approach is so inefficient as to be worthless for anything beyond a small set of documents. First, the full power of regular expressions is wasted on simple patterns consisting of two static strings. Second, every character of every document has to be examined to determine which documents contain the pattern, a tremendously slow operation by computer standards.
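The brute-force approach can be sketched in a few lines of Java using the standard java.util.regex package. The class and method names here are illustrative only, and the cost is exactly what the text warns about: every character of every document is scanned on every search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NaiveSearch {
    // Returns the indexes of all documents matching the given pattern.
    // Every character of every document is examined on each call.
    public static List<Integer> search(List<String> docs, String regex) {
        Pattern p = Pattern.compile(regex);
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (p.matcher(docs.get(i)).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
            "An introduction to JavaServer Pages",
            "Mail from John Smith about lunch",
            "Notes on regular expressions");
        System.out.println(search(docs, "JavaServer\\W*Pages")); // prints [0]
    }
}
```

Running this over three short strings is instantaneous, but the time grows with the total number of characters in the collection, which is why it breaks down for anything beyond a small set of documents.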
A better approach in this case would be to construct an
index containing every significant word or phrase that appears in every document, along with a reference indicating which documents contain each word. There are a great many issues with this approach: determining what constitutes "significant," arranging the index itself for efficiency, and so on. These problems have been extensively studied by both
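The index idea can be sketched as a map from words to the set of documents containing them. This is a minimal illustration, not a production design: the crude "three or more letters" rule stands in for a real definition of "significant," which would involve stop lists, stemming, and similar machinery.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndex {
    // word -> IDs of the documents containing that word
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Index one document. As a stand-in for "significant," keep only
    // lower-cased words of three or more characters.
    public void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() >= 3) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Answer a query by a single map lookup; no document text is
    // scanned at search time.
    public Set<Integer> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Set.of());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add(0, "An introduction to JavaServer Pages");
        idx.add(1, "Mail from John Smith about lunch");
        System.out.println(idx.lookup("smith")); // prints [1]
    }
}
```

The cost of scanning every document is paid once, when the index is built; each subsequent query is a hash lookup whose cost is independent of the size of the collection.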