Perl Cd Bookshelf [Electronic resources] نسخه متنی

Full Text Search

Oracle Text, known as Oracle interMedia Text in Oracle8i, provides the full text search functionality that allows users to quickly find the documents that contain certain words or phrases.

To utilize the full text search functionality in Oracle Text, you can create a CONTEXT index provided by Oracle Text and specify searches in SQL using the CONTAINS() function. For example, you can create a CONTEXT index on the description column of the product table as follows:

CREATE INDEX desc_idx ON product(description)
INDEXTYPE IS CTXSYS.CONTEXT;

The CONTEXT index type is specified by the INDEXTYPE clause. After the CONTEXT index is created, you can submit a SQL query as follows:

SELECT name FROM product WHERE CONTAINS(description, 'XSLT')>0;

The first argument of the CONTAINS() function is the column being searched, and the second argument includes the search words or phrases. The CONTAINS() function returns a number for each row in the table, indicating how closely the document matches the query. If the returned number is greater than zero, the document matches the query. As a result, the example query returns the name of every product containing XSLT in its description.

By default, the CONTEXT index does not provide transactional support for the DML operations, including data insertions and updates. You need to synchronize the CONTEXT index to ensure that the index is up-to-date.

In Oracle Database 10g, a new index property called TRANSACTIONAL is introduced for the CONTEXT index to provide transactional support, as shown in the following example:

CREATE INDEX desc_idx ON product(description)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('TRANSACTIONAL');

if the CONTEXT index is created with the TRANSACTIONAL property turned on. In addition to processing the synchronized ROWIDs in the CONTEXT index, the CONTAINS() function does in-memory indexing and processing for the updated or inserted ROWIDs that have not yet been synchronized. You still need to synchronize the index to bring the pending ROWIDs into the CONTEXT index for better performance.

In summary, the following is the basic procedure for using the full text search functionality provided by Oracle Text:

Create a CONTEXT index on VARCHAR2, CLOB, BLOB, BFILE, UriType, or XMLType columns.

Query table-indexed columns using the CONTAINS() function in SQL.

Synchronize the CONTEXT index after DML operations and optimize the index regularly to ensure better performance.

We will discuss the CONTEXT index optimization in the following sections.

Note

Oracle Text provides a transaction-based index called the CTXCAT index. However, normally it is not used for indexing XML documents because it supports only CHAR and VARCHAR2 columns and has limited support for structured document search.

Searching XML Using Oracle Text

In searching XML, you often want to use the structure of the XML document to restrict the search. In Oracle Text, document sections are used to define and add structure information from the XML document to the CONTEXT index. Each document section is defined based on the start tag/end tag of the XML elements or on certain XPath detections. During indexing, the CONTEXT index stores the document sections for every indexed token. This allows users to specify the scope of the text search either by using XPath expressions with INPATH/HASPATH operators or by using XML element names with the WITHIN operator.

Oracle Text provides three types of document section for XML documents: AUTO_SECTION_GROUP, PATH_SECTION_GROUP, and XML_SECTION_GROUP. Each section group is an object that consists of a collection of tags that should be indexed and a section type specification that indicates the format of the document and how to parse the format.

Both AUTO_SECTION_GROUP and XML_SECTION_GROUP index the document based on the element/attribute names. The difference between the two is that AUTO_SECTION_GROUP indexes every XML tag whereas XML_SECTION_GROUP indexes only the tags in its tag collection. PATH_SECTION_GROUP indexes all the element tags and attributes while maintaining the XPath information.

Note

Using AUTO_SECTION_GROUP in Oracle Database 10g is not recommended, because its searching functionality can be fully covered by the PATH_SECTION_GROUP index.

Using PATH_SECTION_GROUP

PATH_SECTION_GROUP is the default section group for XMLTypes. Since PATH_SECTION_GROUP indexes all the sections in the XML document, there is no need to declare the document sections to be indexed. Therefore, users normally use the default instance of PATH_SECTION_GROUP provided by Oracle Text to create the index. This default instance of PATH_SECTION_GROUP contains the predefined full text search preferences. For example, you can create a CONTEXT index called product_idx with PATH_SECTION_GROUP as follows:

CREATE INDEX product_idx ON product(description)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('SECTION GROUP CTXSYS.PATH_SECTION_GROUP');

With PATH_SECTION_GROUP, you can use the WITHIN operator, but you can additionally choose from two more-powerful operators, the HASPATH or INPATH operators, to incorporate a subset of XPath expressions within the CONTAINS() function. HASPATH returns TRUE if the specified XPath exists in the XML document, and INPATH returns TRUE if the text specified exists in the XML content selected out by the XPath expressions. Here are examples of both operators in action:

SQL> SELECT product_id, name
FROM product
WHERE CONTAINS(description,'SQL INPATH(/Description/
OracleSupport)')>0;
PRODUCT_Id NAME
----------------------------------------------------------------------
1 XSLT Processor
SQL> SELECT product_id, name
FROM product
WHERE CONTAINS(description,
'HASPATH(/Description//Standard[@type="W3C"])')>0;
PRODUCT_ID NAME
----------------------------------------------------------------------
2 XSQL
1 XSLT Processor

The first SQL query uses INPATH to limit the <OracleSupport> element to be a direct child of the root element, <Description>, which contains the word SQL. The HASPATH operator in the second query is used to evaluate the XPath expression and return TRUE if the XPath /Description//Standard[@type=“W3C”] exists in the XML document.

Note

The text used in XPath is case sensitive; for example, INPATH(//Product) is different from INPATH(//product).

Using XML_SECTION_GROUP

The CONTEXT index in PATH_SECTION_GROUP is easy to use, but it can be expensive because it indexes all the document sections in the XML document, including those whose content will not be searched. To avoid the overhead, you can use XML_SECTION_GROUP to specify a limited number of document sections based on the XML elements. For example, you can create an instance of XML_SECTION_GROUP called productGroup and limit the index on one document section called whatis_sec containing content from the <Whatis> element:

EXEC CTX_DDL.CREATE_SECTION_GROUP('productGroup','XML_SECTION_GROUP');
EXEC CTX_DDL.ADD_ZONE_SECTION('productGroup','whatis_sec','Whatis');

Then, you can create a CONTEXT index using the productGroup:

CREATE INDEX desc_xml_idx ON product(description)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('SECTION GROUP productGroup');

The SQL query can use the WITHIN operator to specify the document section:

SELECT name
FROM product
WHERE CONTAINS(description, 'java WITHIN whatis_sec')>0;

You can also add a new zone section named orasupport_sec using the tag name of the <OracleSupport> element by rebuilding the index:

ALTER INDEX desc_xml_idx REBUILD
PARAMETERS ('ADD ZONE SECTION orasupport_sec TAG OracleSupport');

However, the following SQL query will not return any rows:

SELECT name
FROM product
WHERE CONTAINS(description, 'java WITHIN orasupport_sec')>0;

This is because the rebuild process only modifies the index metadata, and does not rebuild the index in any way. This means that these section updates do not affect any indexed documents. To include the existing documents in the index with the new sections, you need to manually mark the documents for re-indexing or simply drop and re-create the index.

In contrast to PATH_SECTION_GROUP, XML_SECTION_GROUP is efficient. However, when the element names are not unique in the XML document, you cannot create zone sections to differentiate these elements in the full text searches.

Synchronizing the Index

Unlike the B*Tree index, where the database maintains the index on the basis of DML transactions, the CONTEXT index is not updated after the DML transactions, including the data insertions and updates. Instead, only the ROWIDs of the updated or inserted table records are kept for later index synchronization. Therefore, if there are any inserts or updates of documents in the base table, you need to synchronize the CONTEXT index for the records to be part of the searching content.

Note

The CONTEXT index transactionally supports data deletion operations.

You can decide whether synchronization is needed by reviewing the CTX_USER_PENDING view, where all ROWIDs related to DML operations are stored. For example, if the product table is updated, you can see new entries in the CTX_USER_PENDING view:

SQL> UPDATE product SET
description=updateXML(description,'/Description/OracleSupport',
'<OracleSupport>Oracle XSQL provides command-line utility, run-time
Servlet engine and extensible Java development framework
</OracleSupport>')
WHERE product_id=2;
1 row updated.
SQL> SELECT pnd_index_name, pnd_rowid, TO_CHAR(pnd_timestamp,
'dd-mon-yyyy') timestamp
FROM CTX_USER_PENDING;
PND_INDEX_NAME PND_ROWID TIMESTAMP
------------------------------ ------------------ -----------
PRODUCT_IDX AAAK6jAAEAAAAbsAAB 09-aug-2003

You can synchronize your index manually by calling the CTX_DDL.SYNC_INDEX() procedure. In order to have the execution privilege on the CTX_DDL PL/SQL package, you need to have the CTXAPP role granted by the CTXSYS user:

GRANT ctxapp TO &user_name;

The following example synchronizes the product_idx index using 2MB of memory:

SQL> BEGIN
2 CTX_DDL.SYNC_INDEX('product_idx', '2M');
3 END;
4 /
PL/SQL procedure successfully completed.
SQL> SELECT pnd_index_name, pnd_rowid, TO_CHAR(pnd_timestamp,
'dd-mon-yyyy') timestamp FROM CTX_USER_PENDING;
no rows selected

After index synchronization, the CTX_USER_PENDING view will not have any entries.

Alternatively, you can run CTX_DDL.SYNC_INDEX() automatically at regular intervals using the DBMS_JOB.SUBMIT() procedure. Oracle Text includes a SQL script you can use to do this:

$ORACLE_HOME/ctx/sample/script/drjobdml.sql
$ORACLE_HOME/ctx/sample/script/drbgdml.sql

To use this script, you must be the owner of the index and you must have execution privileges on the CTX_DDL package. You must also set the JOB_QUEUE_PROCESSES parameter in the Oracle initialization file. For example, to set the index synchronization to run every 360 minutes on product_idx, you can issue the following command in SQL*Plus:

SQL> @drjobdml product_idx 360;

You need to properly choose an interval for CONTEXT index synchronization. This is because synchronizing the CONTEXT indexes right after each data insertion or update may result in index fragmentation, which may dramatically reduce the performance of the full text search queries.

If the application requires frequent synchronization of the CONTEXT index or even requires the transactional update of the CONTEXT index, you then need to schedule additional regular CONTEXT index optimization to ensure high performance.

How Oracle Text Search Works

Knowing the details of how searching with Oracle Text works helps you build better searches. This is particularly useful when you want to know why the data is selected. This section discusses those features that help you understand the search process.

Whenever the CONTEXT index is created, all the XML documents are analyzed and a set of tables is created to store the indexed tokens. For example, along with the product_idx index, the following index tables are created:

SQL> SELECT table_name FROM user_tables
WHERE table_name LIKE '%PRODUCT_IDX%';
TABLE_NAME
------------------------------
DR$PRODUCT_IDX$I
DR$PRODUCT_IDX$K
DR$PRODUCT_IDX$N
DR$PRODUCT_IDX$R

The Oracle Text engine creates four tables automatically. These tables start with the DR$ prefix and end with the suffix to be $I, $K, $N, or $R. Between the prefix and suffix is the index name.

The DR$<INDEX NAME>$K and DR$<INDEX NAME>$R tables store the mapping of the row information from the internal document identifiers (DOCIDs) to external ROWID values. The database optimizer (Cost Based Optimizer) determines the type of lookup. For functional lookups, the $K table is used. For indexed lookups, the $R table is used. Hence, you can easily find out whether a functional or indexed lookup is used by examining the SQL trace and looking for the $K or $R tables.

The $N table is used to maintain information for the deleted rows and the $I table is the one that stores all the tokens indexed by Oracle Text. The search queries will look up all the tokens stored in the $I table before returning the result. The following is the schema of the $I table:

SQL> DESC DR$product_idx$I;
Name Null? Type
--------------- ------------------- -------- ----------------
TOKEN_TEXT NOT NULL VARCHAR2(64)
TOKEN_TYPE NOT NULL NUMBER(3)
TOKEN_FIRST NOT NULL NUMBER(10)
TOKEN_LAST NOT NULL NUMBER(10)
TOKEN_COUNT NOT NULL NUMBER(10)
TOKEN_INFO BLOB

The TOKEN_TEXT column stores the tokens and the TOKEN_INFO column stores the information about the row and word positions where the token occurs.

To ensure the preferred content is indexed by Oracle Text, you can use the following command to check the $I table:

SELECT TOKEN_TEXT FROM DR$product_idx$I
WHERE TOKEN_TEXT LIKE 'XSLT';

This SQL query checks whether the word XSLT is indexed by the CONTEXT index called product_idx.

Sometimes an indexing operation might fail or not complete successfully. When the system encounters an error indexing a row, it logs the error in an Oracle Text view. You can view these errors for specific users using the CTX_USER_INDEX_ERRORS view or query the CTXSYS errors for all users using the CTX_INDEX_ERRORS view. For example, to view the most recent index errors of the current logged in user, you can issue the following SQL command:

SELECT err_timestamp, err_text
FROM ctx_user_index_errors
ORDER BY err_timestamp DESC;

To clear the view of errors, you can issue the following command:

DELETE FROM CTX_USER_INDEX_ERRORS;

Optimizing Oracle Text Searches

We have discussed that you need to optimize the CONTEXT index after multiple index synchronizations. This is because index synchronization may result in index fragmentation that adversely affects query performance.

Additionally, in order to improve query response time, when many rows are deleted from the base table, you need to optimize the CONTEXT index in FULL mode. This is because the old data in the index table is marked only as deleted but is not cleaned up immediately. Because the old data takes up space and can cause extra overhead at query time, you must remove this data from the index by optimizing the index in FULL mode. This is called garbage collection. Optimizing in FULL mode for garbage collection is also necessary when you have had frequent updates or deletions to the base table. If you rarely delete or update data from a table, you can use the OPTIMIZE FAST option. Otherwise, you should use OPTIMIZE FULL.

You can also optimize the index in TOKEN mode, where you specify a specific token to be optimized. You can use this mode to optimize index tokens that are frequently searched, without spending time optimizing tokens that are rarely referenced. For example, you can specify that only the token “Functionality” in the product description be optimized in the index if you know that this token is updated and queried frequently. To optimize an index in TOKEN mode, you can use CTX_DDL.OPTIMIZE_INDEX() as follows:

BEGIN
CTX_DDL.OPTIMIZE_INDEX('product_idx','token',
TOKEN=>'Functionality');
END;

An optimized token can improve its query response time. Again, the tokens can be found in the $I table.

To help you decide whether to optimize your index, you can create a statistical report on your index using the CTX_REPORT.INDEX_STATS() procedure. The report includes information on the optimal row fragmentation, a list of the most fragmented tokens, and the amount of garbage data in your index. Although this report might take a long time to run for large indexes, it really helps.

Advanced Text Searches

Oracle Text provides many full text search features. This section discusses techniques on how to return scores that reflect the relevance of the query and returned records, how to leverage the full text search on XML documents in the XML DB Repository, and how to set up categories for XML document searching.

Using Scores

When you issue a full text search query, Oracle Text returns a relevance score for each document returned. The score is between 1 and 10. The higher the score, the more relevant the document is to the query. The score can be selected using the SCORE() operator with a label of 1:

SQL> SELECT score(1) AS score, product_id, name
FROM product
WHERE CONTAINS(description,'W3C INPATH(//Standard/@type)',1)>0;
SCORE PRODUCT_ID NAME
----------------------------------------------------------------
6 2 XSQL
3 1 XSLT Processor

You can use these scores to order the returned documents by showing the most relevant documents first.

Searching XML in the Oracle XML DB Repository

To add full text search for XML files stored in the Oracle XML DB repository, you can create a CONTEXT index using a subtype of the UriType called XDBUriType.

Note

Additionally, you can create CONTEXT indexes on files in HTTP server, FTP server, and OS file systems using the UriType or its subtypes such as the HTTPUriType.

For example, you can create a table with an XDBUriType column containing the Oracle XML DB repository URIs pointing to the files stored the XML DB repository. For example, the XML DB repository URI of the xdbconfig.xml file is /xdbconfig.xml:

CREATE TABLE myconfig(url SYS.XDBURITYPE);
INSERT INTO myconfig VALUES(SYS.XDBURITYPE.CREATEURI('/xdbconfig.xml'));

Then, you can create a CONTEXT index on this column:

CREATE INDEX config_idx ON myconfig(url) INDEXTYPE IS CTXSYS.CONTEXT;

After the index is created, you can use the following SQL CONTAINS query to search the content:

SELECT * FROM myconfig WHERE CONTAINS(url, 'http')>0;

Category-Based Searching

When searching a large number of XML documents, you can classify the documents and establish the document routing by creating a CTXRULE index. In Oracle Database 10g, you can build a CTXRULE index based on your XML document queries.

The first step is to create a table of queries that defines the classifications. For example, you can create a QUERY_CAT table to hold the category name and query text:

CREATE TABLE QUERY_CAT (
query_id NUMBER PRIMARY KEY,
category VARCHAR2(30),
query VARCHAR2(2000));

You then can populate the table with the classifications and the queries that define each category. For example, consider a classification for the subjects XML and SQL:

INSERT INTO QUERY_CAT VALUES(1, 'XML Product', 'xml');
INSERT INTO QUERY_CAT VALUES(2, 'SQL', 'ABOUT(sql)');

You can create the CTXRULE index as follows:

CREATE INDEX ctxrule_idx ON query_cat(query)
INDEXTYPE IS CTXSYS.CTXRULE;

After a CTXRULE index is created on the query set, you can use the MATCHES() function to classify the XML documents. In the following example, a product_route table is created to store results of the document classification:

create table product_route
(product_id number,
category varchar2(30));

The following PL/SQL code populates the product_route table using MATCHES():

DECLARE 
CURSOR p_desc IS SELECT product_id, description FROM product;
BEGIN
FOR item IN p_desc LOOP
FOR c1 IN (SELECT category
FROM query_cat
WHERE MATCHES(query,item.description.getStringVal())>0)
LOOP
INSERT INTO product_route(product_id, category)
VALUES (item.product_id, c1.category);
END LOOP;
END LOOP;
END;

Using this functionality, you can classify XML documents into a number of predefined categories.

Perl Cd Bookshelf [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی