All of the data structures discussed so far in this chapter are server entities. Users request data from an Oracle server through database queries. Oracle's query optimizer must then determine the best way to access the data requested by each query.
One of the great virtues of a relational database is its ability to access data without predefining the access paths to the data. When a SQL query is submitted to an Oracle database, Oracle must decide how to access the data. The process of making this decision is called query optimization, because Oracle looks for the optimal way to retrieve the data. This retrieval is known as the execution path. The trick behind query optimization is to choose the most efficient way to get the data, because there may be many different options available.
For instance, even with a query that involves only a single table, Oracle can take either of these approaches:
Use an index to find the ROWIDs of the requested rows and then retrieve those rows from the table.
Scan the table to find and retrieve the rows; this is referred to as a full table scan.
Although it's usually much faster to retrieve data using an index, the process of getting the values from the index involves an additional I/O step in processing the query. Query optimization may be as simple as determining whether the query involves selection conditions that can be imposed on values in the index. Using the index values to select the desired rows involves less I/O and is therefore more efficient than retrieving all the data from the table and then imposing the selection conditions.
Another factor in determining the optimal query execution plan is whether there is an ORDER BY condition in the query that can be automatically implemented by the presorted index. Alternatively, if the table is small enough, the optimizer may decide to simply read all the blocks of the table and bypass the index because it estimates the cost of the index I/O plus the table I/O to be higher than just the table I/O.
The query optimizer has to make some key decisions even with a query on a single table. When a more involved query is submitted, such as one involving many tables that must be joined together efficiently or one that has a complex selection criteria and multiple levels of sorting, the query optimizer has a much more complex task.
Prior to Oracle Database 10g, you could choose between two different Oracle query optimizers, a rule-based optimizer and a cost-based optimizer, which are described in the following sections. With Oracle Database 10g, the rule-based optimizer is desupported. The references to syntax and operations for the rule-based optimizer in the following sections are provided as a convenience and are applicable only if you are running an older release of Oracle.
Oracle has always had a query optimizer, but until Oracle7 the optimizer was only rule-based. The rule-based optimizer, as the name implies, uses a set of predefined rules as the main determinant of query optimization decisions. As noted earlier, the rule-based optimizer is desupported in Oracle Database 10g.
Rule-based optimization was sometimes a better choice than other random methods of optimization or the Oracle cost-based optimizer in early Oracle releases, despite some significant weaknesses. One weakness is the simplistic set of rules. The Oracle rule-based optimizer has about twenty rules and assigns a weight to each one of them. In a complex database, a query can easily involve several tables, each with several indexes and complex selection conditions and ordering. This complexity means that there will be a lot of options, and the simple set of rules used by the rule-based optimizer might not differentiate the choices well enough to make the best choice.
The rule-based optimizer assigns an optimization score to each potential execution path and then takes the path with the best optimization score. Another main weakness of the rule-based optimizer stems from the way it resolved optimization choices made in the event of a "tie" score. When two paths present the same optimization score, the rule-based optimizer looks to the syntax of the SQL statement to resolve the tie. The winning execution path is based on the order in which the tables occur in the SQL statement.
You can understand the potential impact of this type of tie-breaker by looking at a simple situation in which a small table with 10 rows, SMALLTAB, is joined to a large table with 10,000 rows, LARGETAB, as shown in Figure 4-4. If the optimizer chooses to read SMALLTAB first, the Oracle database will read the 10 rows and then read LARGETAB to find the matching rows for all 10 of these rows. If the optimizer chooses to read LARGETAB first, the database will read 10,000 rows from LARGETAB and then read SMALLTAB 10,000 times to find the matching rows. Of course, the rows in SMALLTAB will probably be cached, thus reducing the impact of each probe, but the two optimizer choices still offer a dramatic difference in potential performance.
Differences like this occur as a result of the ordering of the table names in the query. In the previous situation the rule-based optimizer returns the same results for the query, but it uses widely varying amounts of resources to retrieve those results.
To improve the optimization of SQL statements, Oracle introduced the cost-based optimizer in Oracle7. As the name implies, the cost-based optimizer does more than simply look at a set of optimization rules; instead, it selects the execution path that requires the least number of logical I/O operations. This approach avoids the error discussed in the previous section. After all, the cost-based optimizer would know which table was bigger and would select the right table to begin the query, regardless of the syntax of the SQL statement.
Oracle8 and later versions, by default, attempt to use the cost-based optimizer to identify the optimal execution plan. As we've mentioned, in Oracle Database 10g, the cost-based optimizer is the only supported choice. To properly evaluate the cost of any particular execution plan, the cost-based optimizer uses statistics about the composition of the relevant data structures. These statistics are automatically gathered in Oracle Database 10g.
Like the rule-based optimizer, the cost-based optimizer finds the optimal execution plan by assigning an optimization score for each of the potential execution plans. However, the cost-based optimizer uses its own internal rules and logic along with statistics that reflect the state of the data structures in the database. These statistics relate to the tables, columns, and indexes involved in the execution plan. The statistics for each type of data structure are listed in Table 4-1.
Oracle Database 10g also collects overall system statistics, including I/O and CPU performance and utilization.
These statistics are stored in three tables in the data dictionary, which are described in the final section of this chapter, Section 4.10.
Data structure |
Type of statistics |
---|---|
Table |
Number of rows Number of blocks Number of unused blocks Average available free space per block Number of chained rows Average row length |
Column |
Number of distinct values per column Second-lowest column value Second-highest column value Column density factor |
Index |
Depth of index B*-tree structure Number of leaf blocks Number of distinct values Average number of leaf blocks per key Average number of data blocks per key Clustering factor |
You can see that these statistics can be used individually and in combination to determine the overall cost of the I/O required by an execution plan. The statistics reflect both the size of a table and the amount of unused space within the blocks; this space can, in turn, affect how many I/O operations are needed to retrieve rows. The index statistics reflect not only the depth and breadth of the index tree, but also the uniqueness of the values in the tree, which can affect the ease with which values can be selected using the index.
Oracle8 and later versions use the SQL statement ANALYZE to collect these statistics. (This is no longer needed in Oracle Database 10g because statistics can be gathered automatically.) You can analyze a table, an index, or a cluster in a single SQL statement. Collecting statistics can be a resource-intensive job; in some ways, it's like building an index. Because of its potential impact, the ANALYZE command has two options:
Calculates the statistics on the entire data structure.
Specifies a number of rows or overall percentage of the data structure for statistical analysis. You can also let the DBMS_STATS package determine the appropriate sample size for best results.
The latter choice makes gathering the relative statistics for the data structure consume far fewer resources than computing the exact figures for the entire structure. If you have a very large table, for example, analyzing 5% or less of the table will probably produce an accurate estimate of the relative percentages of unused space and other relative data.
The use of statistics makes it possible for the cost-based optimizer to make a much more well-informed choice of the optimal execution plan. For instance, the optimizer could be trying to decide between two indexes to use in an execution plan that involves a selection based on a value in either index. The rule-based optimizer might very well rate both indexes equally and resort to the order in which they appear in the WHERE clause to choose an execution plan. The cost-based optimizer, however, knows that one index contains 1,000 entries while the other contains 10,000 entries. It even knows that the index that contains 1,000 values contains only 20 unique values, while the index that contains 10,000 values has 5,000 unique values. The selectivity offered by the larger index is much greater, so that index will be assigned a better optimization score and used for the query.
Testing the Effect of New StatisticsThere may be times when you don't want to update your statistics, such as when the distribution of data in your database has reached a steady state or when your queries are already performing in an optimal way. Oracle gives you a way that you can try out a new set of statistics to see if they might make things better while still maintaining the option of returning to the old set: you can save your statistics in a separate table and then collect new ones. If, after testing your application with these new statistics, you decide you preferred the way the old statistics worked, you can simply reload the saved statistics. |
In Oracle9i, you have the option of allowing the cost-based optimizer to use CPU speed as one of the factors in determining the optimal execution plan. An initialization parameter turns this feature on and off. In Oracle Database 10g, the default cost basis is calculated on the CPU cost plus the I/O cost for a plan.
Even with all the information available to it, the cost-based optimizer did have some noticeable initial flaws. Aside from the fact that it (like all software) occasionally has bugs, the cost-based optimizer used statistics that didn't provide a complete picture of the data structures. In the previous example, the only thing the statistics tell the optimizer about the indexes is the number of distinct values in each index. They don't reveal anything about the distribution of those values. For instance, the larger index can contain 5,000 unique values, but these values can each represent two rows in the associated table, or one index value can represent 5,001 rows while the rest of the index values represent a single row. The selectivity of the index can vary wildly, depending on the value used in the selection criteria of the SQL statement. Fortunately, Oracle 7.3 introduced support for collecting histogram statistics for indexes to address this exact problem. You create histograms using syntax within the ANALYZE INDEX statement when you gather statistics yourself in Oracle versions prior to Oracle Database 10g. This syntax is described in your Oracle SQL reference documentation.
Oracle8 and more recent releases come with a built-in PL/SQL package, DBMS_STATS, which contains a number of procedures that can help you to automate the process of collecting statistics. Many of the procedures in this package also collect statistics with parallel operations, which can speed up the collection process.
There are two ways you can influence the way the cost-based optimizer selects an execution plan. The first way is by setting the optimizer mode to favor either batch- type requests or interactive requests with the ALL_ROWS or FIRST_ROWS choice.
You can set the optimizer mode to weigh the options for the execution plan to either favor ALL_ROWS, meaning the overall time that it takes to complete the execution of the SQL statement, or FIRST_ROWS, meaning the response time for returning the first set of rows from a SQL statement. The optimizer mode tilts the evaluation of optimization scores slightly and, in some cases, may result in a different execution plan. ALL_ROWS and FIRST_ROWS are two of four choices for the optimizer mode, which is described in more detail in the next section.
Oracle also gives you a way to completely override the decisions of the optimizer with a technique called hints. A hint is nothing more than a comment with a specific format inside a SQL statement. You can use hints to force a variety of decisions onto the optimizer, such as:
Use of the rule-based optimizer prior to Oracle Database 10g
Use of a full table scan
Use of a particular index
Use of a specific number of parallel processes for the statement
Hints come with their own set of problems. A hint looks just like a comment, as shown in this extremely simple SQL statement. Here, the hint forces the optimizer to use the EMP_IDX index for the EMP table:
SELECT /*+ INDEX(EMP_IDX) */ LASTNAME, FIRSTNAME, PHONE FROM EMP
If a hint isn't in the right place in the SQL statement, if the hint keyword is misspelled, or if you change the name of a data structure so that the hint no longer refers to an existing structure, the hint will simply be ignored, just as a comment would be. Because hints are embedded into SQL statements, repairing them can be quite frustrating and time-consuming if they aren't working properly. In addition, if you add a hint to a SQL statement to address a problem caused by a bug in the cost-based optimizer and the cost-based optimizer is subsequently fixed, the SQL statement is still outside the scope of the optimization calculated by the optimizer.
However, hints do have a placefor example, when a developer has a user-defined datatype that suggests a particular type of access. The optimizer cannot anticipate the effect of user-defined datatypes, but a hint can properly enable the appropriate retrieval path.
|
For more details about when hints might be considered, see the sidebar "Accepting the Verdict of the Optimizer" later in this chapter.
In the previous section we mentioned two optimizer modes: ALL_ROWS and FIRST_ROWS. The two other valid optimizer modes for all Oracle versions prior to Oracle Database 10g are:
RULE
Forces the use of the rule-based optimizer
CHOOSE
Allows Oracle to choose whether to use the cost-based optimizer or the rule-based optimizer
With an optimizer mode of CHOOSE, which is the default setting, Oracle will use the cost-based optimizer if any of the tables in the SQL statement have statistics associated with them. The cost-based optimizer will make a statistical estimate for the tables that lack statistics. It's important to understand that partially collected statistics can cause tremendous problems. If one of the tables in a SQL statement has statistics, Oracle will use the cost-based optimizer. If the optimizer is acting on incomplete information, the quality of optimization will suffer accordingly. If you're going to use the cost-based optimizer, make sure you gather complete statistics for your databases.
You can set the optimizer level at the instance level, at the session level, or within an individual SQL statement. But the big question remains, especially for those of you who have been using Oracle since before the introduction of the cost-based optimizer: which optimizer should I choose?
We favor using the cost-based optimizer, even prior to Oracle Database 10g, for several reasons.
First, the cost-based optimizer makes decisions with a wider range of knowledge about the data structures in the database. Although the cost-based optimizer isn't flawless in its decision-making process, it does tend to make more accurate decisions based on its wider base of information, especially because it has been around since Oracle7 and has been improved with each new release.
Second, the cost-based optimizer has been enhanced to take into account improvements in the Oracle database itself, while the rule-based optimizer has not. For instance, the cost-based optimizer understands the impact that partitioned tables have on the selection of an execution plan, while the rule-based optimizer does not. As another example, the cost-based optimizer can optimize execution plans for star schema queries, which are heavily used in data warehousing, while the rule-based optimizer has not been enhanced to deal effectively with these types of queries.
The reason for this bias is simple: Oracle Corporation has been quite frank about their intention to make the cost-based optimizer the optimizer for the Oracle database. Oracle hasn't been adding new features to the rule-based optimizer and hasn't guaranteed support for it in future releases. As we have mentioned, with Oracle Database 10g, the rule-based optimizer is no longer supported
So your future (or your present, with Oracle Database 10g) lies in the cost-based optimizer. But there are still situations in which you might want to use the rule-based optimizer if you are running a release of Oracle prior to Oracle Database 10g.
As the old saying goes, if it ain't broke, don't fix it. And you may be in an environment in which you've designed and tuned your SQL to operate optimally with the rule-based optimizer. Although you should still look ahead to a future in which only the cost-based optimizer is supported, there may be no reason to switch over to the cost-based optimizer now if you have an application that is already performing at its best and you are not planning to move to Oracle Database 10g in the near term.
The chances are pretty good that the cost-based optimizer will choose the same execution plan as a properly tuned application using the rule-based optimizer. But there is always a chance that the cost-based optimizer will make a different choice, which can create more work for you, because you might have to spend time tracking down the different optimizations.
Remember the bottom line for all optimizers: no optimizer can provide a performance increase for a SQL statement that's already running optimally. The cost-based optimizer is not a magic potion that remedies the problems brought on by a poor database and application design or an inefficient implementation platform.
The good news is that Oracle gives you a lot of flexibility in accepting, storing, or overriding the verdict of the optimizer.
Accepting the Verdict of the OptimizerSome of you may be doubting the effectiveness of Oracle query optimization, especially those of you who encountered bumpy times with the early releases of the cost-based optimizer. You may have seen cases in which the query optimizer chose an incorrect execution path that resulted in poor performance. You may feel that you have a better understanding of the structure and use of the database than does the query optimizer. For these reasons, you probably include a lot of hints to force the acceptance of the execution path you feel is correct. We recommend using the query optimizer for all of your queries. Although the Oracle developers who wrote the query optimizer had no knowledge of your particular database, they did depend on a lot of customer feedback, experience, and knowledge of how Oracle processes queries during the creation of the query optimizer. They designed the cost-based optimizer to efficiently execute all types of queries that may be submitted to the Oracle database. In addition, there are three advantages that the query optimizer has over your discretion in all cases: The optimizer sees the structure of the entire database. Many Oracle databases support a variety of applications and users and it's quite possible that your system shares data with other systems, making the overall structure and composition of the data somewhat out of your control. In addition, you have probably designed and tested your systems in a limited environment, so your idea of the optimal execution path may not match the reality of the production environment. The optimizer has a dynamically changing view of the database and its data. The statistics used by the cost-based optimizer can change with each new collection operation. Although the statistics typically don't change the query optimization in more or less steady-state production databases, exceptions to this rule do occur. In addition to the changing statistical conditions, the internal workings of the optimizer are occasionally changed to fix bugs or to accommodate changes in the way the Oracle database operates. In Oracle9i, the cost-based optimizer can even take into account the speed of the CPU, and Oracle Database 10g also collect statistics on I/O. If you force the selection of a particular query plan with a hint, you won't benefit from these changes in Oracle. A bad choice by the optimizer may be a sign that something is amiss in your database. For the most part, the query optimizer selects the optimal execution path. What may be seen as a mistake by the query optimizer can, in reality, be traced to a misconception about the database and its design or to an improper implementation. A mistake is always an opportunity to learn, and you should always take advantage of any opportunity to increase your overall understanding of how Oracle and its optimizer work.
We recommend that you consider using hints only when you have determined them to be absolutely necessary by thoroughly investigating the causes for an optimization problem. The hint syntax was included in Oracle syntax as a way to handle exceptional situations, rather than to allow you to circumvent the query optimizer. If you've found a performance anomaly and further investigation has led to the discovery that the query optimizer is choosing an incorrect execution path, then and only then should you assign a hint to a query. Even in this situation, we recommend that you keep an eye on the hinted query in a production environment to make sure that the forced execution path is still working optimally. |
There may be times when you want to prevent the optimizer from calculating a new plan whenever a SQL statement is submitted. For example, you might do this if you've finally reached a point at which you feel the SQL is running optimally, and you don't want the plan to change regardless of future changes to the optimizer or the database.
Starting with Oracle8i, you can create a stored outline that will store the attributes used by the optimizer to create an execution plan. Once you have a stored outline, the optimizer simply uses the stored attributes to create an execution plan. With Oracle9i, you can also edit the hints that make up a stored outline.
Remember that storing an outline fixes the optimization of the outline at the time the outline was stored. Any subsequent improvements to the optimizer will not affect the stored outlines, so you should document your use of stored outlines and consider restoring them with new releases.