4.1 Universal Techniques for Controlling Plans
This section describes a number of
database-independent techniques you can use to control execution
plans. The techniques are good for the following purposes:
- Enabling use of the index you want
- Preventing use of the wrong indexes
- Enabling the join order you want
- Preventing join orders you do not want
- Choosing the order to execute outer queries and subqueries
- Providing the cost-based optimizer with good data
- Fooling the cost-based optimizer with bad data
These vendor-independent techniques often offer an alternative method
to achieve ends you could also achieve with vendor-specific methods.
When you have a choice, the vendor-specific methods are usually
cleaner. However, some problems are solvable only by these universal
techniques, which offer solutions that can sometimes work on SQL that
is intended to run on multiple vendor databases.
4.1.1 Enabling Use of the Index You Want
To enable efficient use of an index,
you need a reasonably selective condition on the leading column (or
only column) of that index. The condition must also be expressed in a
way that enables the database to establish a reasonably narrow index
range for the index values. The ideal form this takes is:
SomeAlias.Leading_Indexed_Column=<Expression>
In less ideal cases, the comparison is with some range of values,
using BETWEEN, LIKE,
<, >,
<=, or >=. These range
comparisons also potentially enable use of the index, but the index
range is likely to be larger and the resulting query therefore
slower. If the index range is too large, the optimizer might conclude
that the index is not worth using and choose another path to the
data. When you combine equalities and range conditions for
multicolumn indexes, you should prefer indexes that lead with the
columns that have equality conditions and finish with columns that
have range conditions. Note that the left side of the comparison
simply names the column, with no function around the column, and no
expression (such as addition) using the column. Use of a
function, a type conversion, or an
arithmetic expression on the side with the indexed column will
generally disable use of that index.
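The effect is easy to demonstrate on any database that exposes its execution plans. Here is a minimal sketch using SQLite from Python (not one of the databases discussed in this book, but the principle is the same); the table, column, and index names are invented for the demonstration:

```python
# Sketch using SQLite via Python; names are invented for illustration.
# The principle carries over to Oracle, SQL Server, and DB2: a bare
# indexed column enables an index range scan, while a function wrapped
# around the column disables it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Order_ID INTEGER, Order_Date TEXT)")
conn.execute("CREATE INDEX Order_Date_Idx ON Orders(Order_Date)")
conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(i, "2023-01-%02d" % (i % 28 + 1)) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the plan description in column 3.
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Bare column on the indexed side: the index is usable.
indexed = plan("SELECT * FROM Orders WHERE Order_Date = '2023-01-15'")
# Function wrapped around the column: the optimizer falls back to a scan.
wrapped = plan("SELECT * FROM Orders WHERE SUBSTR(Order_Date, 1, 7) = '2023-01'")

print(indexed)  # mentions Order_Date_Idx
print(wrapped)  # full scan of Orders
```

Printing the two plans shows an index search for the first query and a full table scan for the second, mirroring the behavior described above.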
Type conversions are a particularly subtle
way that SQL sometimes disables use of an index. DB2 returns an error
if you compare two expressions with incompatible types. SQL Server
prefers to perform the implicit conversion on the side of the
comparison that does not disable the index. Oracle implicitly
converts character-type expressions to the type of the other side,
even when this disables index use. For example, consider this
expression:
P.Phone_Number=5551212
If Phone_Number were a character-type column, this
would likely evaluate internally on Oracle and SQL Server as:
On Oracle: TO_NUMBER(P.Phone_Number)=5551212
On SQL Server: P.Phone_Number=CAST(5551212 AS VARCHAR)
SQL Server preserves indexed access to the column. On Oracle, the
implicit use of TO_NUMBER( ) disables use of the
index just as surely as if you made the expression explicit. (The
only real difference is that the problem is harder to find in the
implicit form.) The same problem can plague index use for joins, as
well as for single-table conditions. For example, consider the join:
P.Phone_Number=C.Contact_Number
If Contact_Number were a number type and
Phone_Number were a character type, the implicit
conversion on Oracle would prevent an index-driven nested-loops join
from C to P. A join in the
other direction would be unhindered.

The expression opposite the indexed column reference can be
arbitrarily complex. However, it must not reference columns in the
same alias with the indexed column. For example, consider the
condition:
P.Phone_Number=P.Area_Code||'5551212'
The database cannot drive into the index on
P.Phone_Number with this condition, because the
database must reach alias P before it can evaluate
the expression on the right side. This chicken-and-egg problem
prevents identifying (with the index) the subset of the table that
meets this condition until after the database examines the whole
table.

The final way that SQL often disables
index use is with conditions combined with OR. For
example, consider the query:
SELECT ...
FROM Order_Details D, ...
WHERE ...
AND (D.Order_ID=:1 or :1 IS NULL)
AND ...
In this example, the database can reach
Order_Details through an index on
Order_ID if the bind variable
:1 happens to be nonnull. But if
:1 is bound to a null value, there is no
restriction at all on Order_ID and thus no use for
that index. Since the database cannot tell what :1
will be bound to when it parses the SQL and prepares the plan, it
will find no good opportunity to use the index. In this case, the
solution is to create a two-part plan, with each part optimized for
one of the cases:
SELECT ...
FROM Order_Details D, ...
WHERE ...
AND D.Order_ID=:1
AND :1 IS NOT NULL
AND ...
UNION ALL
SELECT ...
FROM Order_Details D, ...
WHERE ...
AND :1 IS NULL
AND ...
When you view the execution plan for this query, it shows both
indexed access through the index on
Order_Details(Order_ID) and full-table-scan access
to Order_Details. This might appear to be the
worst of both worlds, but you are saved by the conditions:
AND :1 IS NOT NULL
...
AND :1 IS NULL
These conditions make no reference at all to any data in the
database, so the database can and does evaluate them before it even
begins reading data for that half of the combined statement.
Therefore, it never actually executes the full table scan when
:1 is not null, and it never actually executes an
indexed read (or any other part of the execution plan for the first
half of the query) when :1 is null. This amounts
to a method to branch your execution plan depending on conditions on
the bind variables, the variables that determine which data is
available to drive the query. The only catch is that you must ensure
that the conditions on the bind variables are mutually exclusive, so
that exactly one of the branches actually
returns data. For example, if you have another bind variable to
provide Customer_Name, you might put together a
query like this:
SELECT ...
FROM Order_Details D, Customers C, ...
WHERE ...
AND D.Order_ID=:1
AND :1 IS NOT NULL
AND (C.Customer_Name=:2 OR :2 IS NULL)
AND ...
UNION ALL
SELECT ...
FROM Order_Details D, Customers C, ...
WHERE ...
AND :1 IS NULL
AND :2 IS NOT NULL
AND C.Customer_Name=:2
AND ...
UNION ALL
SELECT ...
FROM Order_Details D, Customers C, ...
WHERE ...
AND :1 IS NULL
AND :2 IS NULL
AND ...
This could support a three-part plan, in which the database would:
- Drive into Order_Details on the index on Order_ID (your first choice), when possible.
- Otherwise, drive into Customers on the index on Customer_Name (your second choice) when it has no Order_ID specified but has a customer name.
- Otherwise, just get all the rows, probably beginning with a full table scan, when it has no selective conditions at all.
In any case, the conditions on the bind variables in the three parts
are contrived to be mutually exclusive:
AND :1 IS NOT NULL
...
AND :1 IS NULL
AND :2 IS NOT NULL
...
AND :1 IS NULL
AND :2 IS NULL
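The equivalence of the OR form and the UNION ALL form is easy to check. This sketch uses SQLite from Python with an invented miniature schema; it verifies only that the two forms return the same rows for both null and nonnull bindings, not the plan-pruning behavior, which varies by database:

```python
# Sketch using SQLite via Python with an invented miniature schema. It
# checks result equivalence of the OR form and the UNION ALL form for
# both null and nonnull bindings; whether the unused branch is actually
# pruned at runtime is a property of the specific optimizer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Order_Details (Order_ID INTEGER, Item TEXT)")
conn.executemany("INSERT INTO Order_Details VALUES (?, ?)",
                 [(i % 10, "item%d" % i) for i in range(100)])

or_form = """SELECT Item FROM Order_Details
             WHERE (Order_ID = ?1 OR ?1 IS NULL)"""
union_form = """SELECT Item FROM Order_Details
                WHERE Order_ID = ?1 AND ?1 IS NOT NULL
                UNION ALL
                SELECT Item FROM Order_Details WHERE ?1 IS NULL"""

results = {}
for binding in (3, None):
    a = sorted(conn.execute(or_form, (binding,)))
    b = sorted(conn.execute(union_form, (binding,)))
    results[binding] = (a, b)
```

For the nonnull binding, only the first branch of the UNION ALL contributes rows; for the null binding, only the second does, which is exactly the mutual exclusivity the technique requires.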
4.1.2 Preventing Use of the Wrong Indexes
Join expressions are usually simple,
usually between consistent types, and usually between numerical IDs.
Conditions on the driving table are usually simple and compatible
with index use. A more frequent problem than enabling use of the
right index is preventing use of the wrong indexes. In many queries,
there are multiple single-table conditions that are capable of
reaching multiple indexes, but you want to use only a specific one of
those indexes. Join conditions are usually expressed to allow
index-driven joins in either direction, although only one of the
possible join directions turns out to be optimal. Occasionally,
you'll prefer to disable use of an index on a join
altogether, to force a hash or sort-merge join.

To disable use of an index, create the
simplest possible expression around the indexed column reference. For
example, you should prevent use of an index on
Status_Code for the unselective condition on
closed orders, as the number of closed orders will eclipse open
orders as you do more and more business:
O.Status_Code='CL'
Since Status_Code is a character-type column, a
simple expression to disable index use without changing the results
would simply concatenate an empty string to the end of
Status_Code:
On Oracle and DB2: O.Status_Code||''='CL'
On SQL Server: O.Status_Code+''='CL'
For number-type columns, you can add 0:
O.Region_ID+0=137
All databases have some sort of function that evaluates to the first
argument when the argument is null and otherwise returns the second
argument. On Oracle, the
function is NVL( ). On
SQL Server and DB2, it is
COALESCE( ). If both
arguments are the same column, the function always returns the same
result as the bare column, regardless of the column type. Therefore,
this makes a handy recipe to deactivate index use regardless of
column type:
On Oracle: NVL(O.Order_Date,O.Order_Date)=<Value>
On DB2 and SQL Server: COALESCE(O.Order_Date,O.Order_Date)=<Value>
In a join condition, a join that disables an indexed path to
O.Region_ID (but not to
R.Region_ID) could look like this:
O.Region_ID+0=R.Region_ID
Using the type-independent approach, this same join would look like
this:
NVL(O.Region_ID,O.Region_ID)=R.Region_ID
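These index-disabling expressions are observable in any database's plan output. Here is a sketch using SQLite from Python (SQLite has COALESCE but not NVL; names are invented), showing that both the +0 trick and COALESCE(col,col) stop the optimizer from driving through the index:

```python
# Sketch using SQLite via Python; table and index names are invented.
# A bare column comparison uses the index; the same comparison with a
# +0 or COALESCE(col,col) wrapper around the column forces a scan,
# without changing which rows qualify.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (Region_ID INTEGER, Status_Code TEXT)")
conn.execute("CREATE INDEX Region_Idx ON Orders(Region_ID)")
conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(i % 5, "CL") for i in range(200)])

def plan(sql):
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

bare = plan("SELECT * FROM Orders WHERE Region_ID = 3")
plus_zero = plan("SELECT * FROM Orders WHERE Region_ID + 0 = 3")
coalesced = plan("SELECT * FROM Orders WHERE COALESCE(Region_ID, Region_ID) = 3")
```

Inspecting the three plans shows an index search for the bare comparison and full scans for the other two.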
4.1.3 Enabling the Join Order You Want
Apart from unintentionally disabled
indexes, there are two issues that sometimes disable desired join
orders:
- Outer joins
- Missing redundant join conditions
4.1.3.1 Outer joins
Consider an outer join
query, in Oracle-style notation:
SELECT ...
FROM Employees E, Locations L
WHERE E.Location_ID=L.Location_ID(+)
or in the newer, universal notation:
SELECT ...
FROM Employees E LEFT OUTER JOIN Locations L
ON E.Location_ID=L.Location_ID
This query requests employee records with their matching locations,
when employees have locations; otherwise, null location data is used
for employees that have no matching locations. Based on the request,
it is clear that the query cannot effectively drive from
Locations to Employees, since
even employees without locations are needed. Consider a case in which
this query is just a template to which an application adds conditions
that depend on search criteria provided by an end user. If the end
user wants to see employees for a particular location, the
application might create this query:
SELECT ...
FROM Employees E LEFT OUTER JOIN Locations L
ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
In the outer case of the join from Employees to
Locations, L.Description will
be assigned a generated value of null, and the
condition on L.Description will be
false. Only the inner case of the join will return
rows that might meet the restriction on
L.Description, so now it makes perfect sense to
drive the query in the other join order, from
Locations to Employees.
However, the existence of the outer join often prevents automated
optimizers from allowing this reversed order on the outer joins, so
you need to make the join explicitly an inner join to get the
reversed join direction:
SELECT ...
FROM Employees E INNER JOIN Locations L
ON E.Location_ID=L.Location_ID
WHERE L.Description='Headquarters'
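You can confirm that the two forms return identical rows once the filter on the outer table's column is present. A sketch in SQLite from Python, with invented data including an employee who has no location:

```python
# Sketch using SQLite via Python; schema and data are invented. A WHERE
# condition on a column of the outer-joined table rejects the all-null
# rows the outer join generates, so LEFT OUTER JOIN and INNER JOIN
# return the same rows, and the join order is free to reverse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Locations (Location_ID INTEGER, Description TEXT);
CREATE TABLE Employees (Name TEXT, Location_ID INTEGER);
INSERT INTO Locations VALUES (1, 'Headquarters'), (2, 'Branch');
INSERT INTO Employees VALUES ('Smith', 1), ('Jones', 2), ('Nomad', NULL);
""")

outer_q = """SELECT E.Name FROM Employees E LEFT OUTER JOIN Locations L
             ON E.Location_ID = L.Location_ID
             WHERE L.Description = 'Headquarters'"""
inner_q = """SELECT E.Name FROM Employees E INNER JOIN Locations L
             ON E.Location_ID = L.Location_ID
             WHERE L.Description = 'Headquarters'"""

outer_rows = sorted(conn.execute(outer_q))
inner_rows = sorted(conn.execute(inner_q))
```

The location-less employee ('Nomad') is excluded either way, because the generated null Description fails the WHERE condition, which is exactly why the outer join buys nothing here.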
4.1.3.2 Missing redundant join conditions
Normally, between any number of tables,
the join count is the number of tables minus one. For example,
between three tables, you expect to find two joins. Occasionally, a
query permits an extra, redundant join. For example, if you have an
Addresses table that contains all addresses
significant to the company, it might have a one-to-zero or one-to-one
relationship with the earlier Locations table,
which contains only locations owned by the company and which
references Addresses through a matching primary
key. In this case, you might find a query like the following:
SELECT ...
FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
AND E.Location_ID=A.Address_ID
AND A.ZIP_Code=95628
By transitivity (if
a=b and
b=c, then
a=c), you can deduce that
the condition L.Location_ID=A.Address_ID must be
true for all rows this query would return.
However, that condition is not explicit in the query, and not all
databases will deduce it and fill it in if it is left out. The best
plan, in this case, will likely begin with all addresses within that
ZIP Code and immediately join to Locations to
discard all addresses except the one or two that correspond to
company locations, before joining to Employees.
Since that join order requires the missing join condition to support
an indexed path from Addresses to
Locations, you should make the missing join
condition explicit:
SELECT ...
FROM Employees E, Locations L, Addresses A
WHERE E.Location_ID=L.Location_ID
AND E.Location_ID=A.Address_ID
AND L.Location_ID=A.Address_ID
AND A.ZIP_Code=95628
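Adding the transitively implied condition cannot change the rows returned, which is easy to verify. A sketch in SQLite from Python with an invented three-table schema:

```python
# Sketch using SQLite via Python; schema and data are invented. The
# redundant, transitively implied join condition leaves the result set
# unchanged while offering the optimizer an extra join path from
# Addresses to Locations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Addresses (Address_ID INTEGER, ZIP_Code INTEGER);
CREATE TABLE Locations (Location_ID INTEGER);
CREATE TABLE Employees (Name TEXT, Location_ID INTEGER);
INSERT INTO Addresses VALUES (1, 95628), (2, 94133), (3, 95628);
INSERT INTO Locations VALUES (1), (2);
INSERT INTO Employees VALUES ('Smith', 1), ('Jones', 2), ('Doe', 1);
""")

base = """SELECT E.Name FROM Employees E, Locations L, Addresses A
          WHERE E.Location_ID = L.Location_ID
          AND E.Location_ID = A.Address_ID
          AND A.ZIP_Code = 95628"""
redundant = base + " AND L.Location_ID = A.Address_ID"

base_rows = sorted(conn.execute(base))
red_rows = sorted(conn.execute(redundant))
```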
Since you do not want to follow the join from
Addresses to Employees
directly, you could also remove, if necessary, the redundant join
condition E.Location_ID=A.Address_ID, to
discourage that unwanted join operation.
4.1.4 Preventing Join Orders You Do Not Want
Forcing joins in the direction you
want, using the earlier techniques for preventing use of the wrong
indexes, will prevent many undesired join orders. What do you do when
you want the database to follow a particular join direction
eventually, but not too early in the execution plan? You cannot
afford to disable an index, because you must use that index
eventually, just not too early. Consider the following two joins, in
which you want to start the query with reads of T1
and then join to T2 before joining to
T3:
... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID=T3.Key3_ID ...
Here, you want to follow nested loops into both T2
and T3, following indexes in the keys mentioned
and reaching T2 before reaching
T3. To postpone the join you want to happen later,
make it depend (or at least to appear to depend) on data from the
join that must happen earlier. Here is a solution:
... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID ...
You and I know that the second version is logically equivalent to the
first. However, the database just finds an expression on the left
side of the second join that depends on both T1
and T2 (not recognizing that no value from
T2 can change the result), so it
won't try to perform the join to
T3 until after T2.

If necessary, you can string together joins like this to completely
constrain a join order. For each join after the first, add a
logically irrelevant component referencing one of the columns added
in the preceding join to the join expression. For example, if you
want to reach tables T1 through
T5 in numerical order, you can use the following.
Notice that the join condition for the T3 table
uses the expression 0*T2.Key2_ID to force the join
to T2 to occur first. Likewise, the join condition
for the T4 table uses
0*T3.Key3_ID to force T3 to be
joined first.
... AND T1.Key2_ID=T2.Key2_ID
AND T1.Key3_ID+0*T2.Key2_ID=T3.Key3_ID
AND T1.Key4_ID+0*T3.Key3_ID=T4.Key4_ID
AND T1.Key5_ID+0*T4.Key4_ID=T5.Key5_ID ...
I'll apply this method to a concrete example.
Consider the following SQL, adapted from Chapter 3:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
AND E.Manager_ID=M.Employee_ID
AND E.Location_ID=LE.Location_ID
AND M.Location_ID=LM.Location_ID
AND LE.Description='Dallas'
Assume that you have an execution plan that drives from the index on
the employee's last name, but you find that the join
to the employee's location (alias
LE) to discard employees at locations other than
Dallas is unfortunately happening last, after the
other joins (to M and LM). You
should join to LE immediately from
E, to minimize the number of rows you need to join
to the other two tables. Starting from E, the join
to LM is not immediately possible, so if you
prevent the join to M before
LE, you should get the join order you want.
Here's how:
SELECT E.First_Name, E.Last_Name, E.Salary, LE.Description,
M.First_Name, M.Last_Name, LM.Description
FROM Locations LE, Locations LM, Employees M, Employees E
WHERE E.Last_Name = 'Johnson'
AND E.Manager_ID+0*LE.Location_ID=M.Employee_ID
AND E.Location_ID=LE.Location_ID
AND M.Location_ID=LM.Location_ID
AND LE.Description='Dallas'
The key here is that I've made the join to
M dependent on the value from
LE. The expression
0*LE.Location_ID forces the optimizer to join to
LE before M. Because of the
multiply-by-zero, the added expression has no effect on the results
returned by the query.
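Because the added term multiplies by zero, the rewritten join predicate is equivalent as long as the factor's column cannot be null in the surviving rows (here LE.Location_ID comes from an inner join, so it cannot). A sketch checking that equivalence in SQLite from Python, with invented, simplified data (the LM alias is omitted for brevity):

```python
# Sketch using SQLite via Python; data is invented and the query is
# simplified. Adding 0*LE.Location_ID to the join expression changes no
# results, since LE.Location_ID is non-null in every row surviving the
# inner join to LE. (With a nullable factor, 0*NULL would be NULL and
# would wrongly reject rows.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Locations (Location_ID INTEGER, Description TEXT);
CREATE TABLE Employees (Employee_ID INTEGER, Last_Name TEXT,
                        Manager_ID INTEGER, Location_ID INTEGER);
INSERT INTO Locations VALUES (1, 'Dallas'), (2, 'Austin');
INSERT INTO Employees VALUES
  (10, 'Boss', NULL, 2),
  (11, 'Johnson', 10, 1),
  (12, 'Johnson', 10, 2);
""")

plain = """SELECT E.Employee_ID FROM Locations LE, Employees M, Employees E
           WHERE E.Last_Name = 'Johnson'
           AND E.Manager_ID = M.Employee_ID
           AND E.Location_ID = LE.Location_ID
           AND LE.Description = 'Dallas'"""
forced = """SELECT E.Employee_ID FROM Locations LE, Employees M, Employees E
            WHERE E.Last_Name = 'Johnson'
            AND E.Manager_ID + 0*LE.Location_ID = M.Employee_ID
            AND E.Location_ID = LE.Location_ID
            AND LE.Description = 'Dallas'"""

plain_rows = sorted(conn.execute(plain))
forced_rows = sorted(conn.execute(forced))
```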
4.1.5 Forcing Execution Order for Outer Queries and Subqueries
Most queries with subqueries can
logically drive from either the outer query or the subquery.
Depending on the selectivity of the subquery condition, either choice
can be best. The choice generally arises for queries with
EXISTS or
IN conditions. You can always convert between an
EXISTS condition on a correlated subquery and the
equivalent IN condition on a noncorrelated
subquery. For example, you can convert this:
SELECT ...
FROM Departments D
WHERE EXISTS (SELECT NULL FROM Employees E
WHERE E.Department_ID=D.Department_ID)
to this:
SELECT ...
FROM Departments D
WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E)
The first form implies that the database drives from the outer query
to the subquery. For each row returned by the outer query, the
database executes the join in the subquery. The second form implies
that you begin with the list of distinct departments that have
employees, as found in the noncorrelated subquery, and drive from
that list into the matching list of such departments in the outer
query. Sometimes, the database itself follows this implied join
order, although some databases can make the conversion internally if
their optimizer finds that the alternate order is better. To make
your own SQL more readable and to make it work well regardless of
whether your database can convert the forms internally, use the form
that implies the order you want. To force that order even when the
database could make the conversion, use the same
join-direction-forcing technique used in Section 4.1.4. Thus, an
EXISTS condition that forces the outer query to
execute first would look like this:
SELECT ...
FROM Departments D
WHERE EXISTS (SELECT NULL FROM Employees E
WHERE E.Department_ID=D.Department_ID+0)
For the contrary order, an IN condition that
forces the implied driving order from the subquery to the outer query
would look like this:
SELECT ...
FROM Departments D
WHERE D.Department_ID IN (SELECT E.Department_ID+0 FROM Employees E)
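The EXISTS and IN forms are interchangeable in their results, as this SQLite-from-Python sketch (invented schema, including a department with no employees) confirms:

```python
# Sketch using SQLite via Python; schema and data are invented. The
# correlated EXISTS form and the noncorrelated IN form return the same
# departments; only the implied (and sometimes actual) driving order
# differs.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (Department_ID INTEGER, Name TEXT);
CREATE TABLE Employees (Employee_ID INTEGER, Department_ID INTEGER);
INSERT INTO Departments VALUES (1, 'Sales'), (2, 'Legal'), (3, 'Empty');
INSERT INTO Employees VALUES (100, 1), (101, 1), (102, 2);
""")

exists_q = """SELECT D.Name FROM Departments D
              WHERE EXISTS (SELECT NULL FROM Employees E
                            WHERE E.Department_ID = D.Department_ID)"""
in_q = """SELECT D.Name FROM Departments D
          WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E)"""

exists_rows = sorted(conn.execute(exists_q))
in_rows = sorted(conn.execute(in_q))
```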
Sometimes, a query includes multiple subqueries that must
drive from the outer query to the subquery (such as NOT
EXISTS subqueries) or should drive in that order. Such a
case implies a choice of the order of execution of the subqueries.
You can also have choices about whether to execute subqueries after
completing the outer query, or at the first opportunity, as soon as
the correlation join is possible, or at some point between these
extremes.

The first tactic for controlling the order of subquery execution is
simply to list the subqueries in order in the
WHERE clause (i.e., the top subquery to be
executed should be listed first). This is one of the few times when
WHERE-clause order seems to matter.

Rarely, the database will execute a subquery sooner than you would
like. The same tactic for postponing joins (described in
Section 4.1.4) works for
correlation joins,
the joins in subqueries that correlate the subqueries to the outer
queries. For example, consider this query:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
AND O.Customer_ID=C.Customer_ID
AND C.Customer_Type_Code='GOV'
AND C.Region_ID=R.Region_ID
AND EXISTS (SELECT NULL
FROM Order_Details OD
WHERE O.Order_ID=OD.Order_ID
AND OD.Shipped_Flag='Y')
For this query you might find that the subquery runs as soon as you
reach the driving Orders table, but you might wish
to perform the join to Customers first, to discard
nongovernmental orders, before you take the expense of the subquery
execution. In this case, this would be the transformation to postpone
the correlation join:
SELECT ...
FROM Orders O, Customers C, Regions R
WHERE O.Status_Code='OP'
AND O.Customer_ID=C.Customer_ID
AND C.Customer_Type_Code='GOV'
AND C.Region_ID=R.Region_ID
AND EXISTS (SELECT NULL
FROM Order_Details OD
WHERE O.Order_ID+0*C.Customer_ID=OD.Order_ID
AND OD.Shipped_Flag='Y')
Notice the addition of +0*C.Customer_ID to the
subquery's WHERE clause. This
ensures the join to Customers occurs first, before
the subquery executes.
4.1.6 Providing the Cost-Based Optimizer with Good Data
On any cost-based optimizer (that
is, for any query except one running on the Oracle
rule-based optimizer, since only
Oracle has a rule-based optimizer), the second most common source of
poor execution plans (after missing indexes) is missing statistics on
the
tables, columns, and
indexes involved in the query. In all, cost-based optimizers do a
fairly good job of finding the best plan without help when they have
good information to begin with. However, when they are missing
information (for example, because a table or index has been
rebuilt without regenerating statistics for that object), they
tend to make terrible assumptions.

If you are running on any database except Oracle, or if you are on
Oracle's cost-based optimizer (as is most common and
as Oracle recommends) and not forcing the rule-based optimizer, the
first thing you should try if you are not getting the execution plan
you want is to regenerate statistics on every table and index
relevant to the query. Standard statistics will usually suffice to
get reasonable execution plans.

Cost-based optimizers usually assume that data is uniformly
distributed. For example, if the optimizer statistics show a table of
1,000,000 rows with 50,000 distinct values for some indexed foreign
key, the database will optimize on the assumption that every value of
that key will match exactly 20 rows. For most indexed columns, like
foreign keys, this assumption of a uniform data distribution works
well. However, some columns have highly skewed distributions, such as
status, code, or type columns, or foreign keys to status or type
tables. For example, consider this query:
SELECT ... FROM Orders WHERE Status_Code = 'OP'
There might only be three or four values of
Status_Code across a 1,000,000-row
Orders table, but if 'OP' means
this is an open order, not yet fulfilled or cancelled, this condition
is far more selective than the optimizer would expect based solely on
the number of distinct values. If the column had an index, the
optimizer might never use that index if it knew only the small number
of distinct indexed values. However, on some databases, you can
generate added statistics that let the database know not only the
number of distinct values but also the distribution of those values,
a necessary step when you have such highly skewed distributions.
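The arithmetic behind the uniform-distribution assumption is simple to sketch; the row counts and value frequencies below are invented, purely for illustration:

```python
# Illustrative arithmetic only; the row counts and value frequencies
# are invented. Given only a row count and a distinct-value count, the
# optimizer estimates rows-per-value as their ratio, badly
# underestimating the selectivity of rare values in a skewed column.
total_rows = 1_000_000
distinct_status_codes = 4

# Uniform-distribution estimate: every value matches the same count.
uniform_estimate = total_rows / distinct_status_codes   # 250,000 rows

# Suppose only 500 orders are actually open ('OP').
actual_open = 500
error_factor = uniform_estimate / actual_open           # 500x overestimate
```

An optimizer expecting 250,000 matching rows would rightly shun an index that in reality returns only 500 rows; distribution statistics (histograms) close exactly this gap.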
4.1.7 Fooling the Cost-Based Optimizer with Incorrect Data
This last technique is dangerous,
and I recommend it only as a last resort. Sometimes, you want to
simulate a large database on a small, development database. If you
can extrapolate (or, better, measure from an actual database)
statistics that apply to a large database, you can manually modify
the data-dictionary tables that store those statistics for the
optimizer, to fool the optimizer into thinking it is working with a
large database. The small database will have statistics that show
large tables with many distinct values on most indexes. This is a
handy way to see execution plans that will apply to production
volumes when you have only a test database with toy data volumes. For
such toy-sized databases, there is no great risk to this approach. On
production databases, the optimizer will occasionally make better
choices if it has the wrong data, usually if it has data that
exaggerates the selectivity of desired indexes or that exaggerates
the size of a table when a full table scan is undesirable.

Imagine reversing the logic the optimizer follows: ask
"What would I need to believe about the tables and
indexes of this query to find an alternative plan (the alternative
that you, the human optimizer, want) much more
attractive?" It is not hard to fool the optimizer
into doing what you want rather than what it would choose on its own,
if you lie to it about the statistics. However, on production
systems, this is dangerous in several ways:
- As soon as anyone regenerates statistics for the tables or indexes, the optimizer will revert to the original error, unless the manual statistics-tweak is reapplied. You will have to rigorously control statistics generation to prevent this.
- As soon as the database optimizer improves (with the next release, perhaps), it is denied the chance to exploit those improvements with correct data.
- Most importantly, every other query against the tables and indexes with false statistics is at risk and will potentially be harmed, just to help the one query you wanted to tune when you fudged the statistics.
I have never needed to play this card to get an adequately optimized
plan on Oracle, SQL Server, or DB2, and I recommend you avoid it if
possible.