2.6 Calculating Selectivity
When
tuning queries, you can best picture nonjoin (single-table)
conditions as filters. These filters pass through the table rows that
the application needs (the rows satisfying the conditions), while
discarding the rows that the application does not need (the rows
failing to satisfy the query conditions). The application has
functional reasons to exclude unneeded rows, but from a performance
point of view the job of the filter is to save the database work and
time. The trick of tuning is to avoid work, as much as possible, on
rows destined for the trash heap.

Selectivity
is the power of a filter as a row-excluding tool, expressed as the
fraction of table rows that the filter passes through.
The database can apply filters at
three points, with varying success at minimizing query cost:

Determining the index range condition
Conditions that determine the limits of an index range scan require
no work at all on the excluded rows.
Determining the table rows reached from the index
Sometimes, conditions do not determine the index range limits but
nevertheless can be evaluated in the index before reaching the table.
These conditions require touching ultimately excluded row entries in
the index, but not in the table.
Determining the rows returned after table access
If a condition requires columns held in the table but not in an
index, a database cannot evaluate that condition until it reads the
table rows. Filters evaluated in the table are of no value in
reducing the costs of reaching the rows of that table, though they
reduce network costs because the excluded rows do not need to be
returned. If the filtered table is not the last or the only table
touched, any filter also reduces the costs of joins to other tables
later in the execution plan.
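To make the three filter points concrete, consider a hypothetical illustration: assume an Orders table with an index on (Customer_ID, Order_Date, Ship_Method), while Status is stored only in the table:

SELECT ... FROM Orders
WHERE Customer_ID = 42       -- determines the index range endpoints
  AND Ship_Method = 'AIR'    -- cannot narrow the range (no equality
                             -- condition on Order_Date), but is checked
                             -- in the index, before any table access
  AND Status = 'OPEN';       -- Status is not in the index, so this is
                             -- evaluated only after reading the table row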
2.6.1 Filter Selectivity
In
this section, I explain how to calculate the selectivity of
conditions on a table. Let's begin with some
definitions:

Individual-condition filter selectivity
The fraction of table rows that satisfy a single condition on that
table.
Multiple-condition filter selectivity
The fraction of table rows that satisfy the combination of conditions
referring exclusively to that table.
Filter independence
The assumption, usually true, that you can calculate
multiple-condition selectivity as the simple product of
individual-condition selectivity fractions. For example, a condition
on a person's first name and a condition on a
person's Zip Code are logically independent. You can
assume that the fraction of rows that have both the correct first
name and the correct Zip Code will be roughly the product of the
fraction of rows that have the correct name and the fraction that
have the correct Zip Code. For example, if
1/100th of the rows contain the desired
first name and 1/500th of the rows contain
the desired Zip Code, then the multiple-condition filter selectivity
is 1/100 x 1/500 = 1/50,000.
Filter redundancy
The opposite of filter independence. The truth of one condition
guarantees the truth of another. For example, a condition on the Zip
Code likely guarantees a single value for telephone area code, so the
selectivity of conditions on both would be no better than the
selectivity of the Zip Code alone. You can always test for complete
or partial filter redundancy by calculating the multicondition filter
selectivity based on filter independence and seeing whether it equals
the actual selectivity of the combination of the conditions.
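To test for redundancy between, say, the Zip Code and area-code filters, you can run counts such as the following sketch (the table and column names here are hypothetical):

SELECT COUNT(*) FROM Customers;                           -- all rows
SELECT COUNT(*) FROM Customers WHERE Zip_Code = '94303';  -- filter 1 alone
SELECT COUNT(*) FROM Customers WHERE Area_Code = 650;     -- filter 2 alone
SELECT COUNT(*) FROM Customers
WHERE Zip_Code = '94303' AND Area_Code = 650;             -- both filters

If the last count divided by the first is close to the product of the two individual selectivity fractions, the filters are independent; if it instead approaches the smaller individual fraction, the filters are largely redundant.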
When you tune a query and evaluate filter selectivity, start by
asking if the query stands alone or if it represents a whole group of
queries. If it represents a group of queries, ask about the
distribution of values within that group. For example, consider the
query:[5]

SELECT ... FROM Orders WHERE Unpaid_Flag='Y';

[5] Note that I usually replace the SELECT list with ... in my examples. It turns out that the list of columns and expressions that you select has little impact on query performance, which is dominated by the list of tables in the FROM clause and the conditions in the WHERE clause.
This query has (we would hope) high selectivity, being true for a
small fraction of the complete history of orders. If you can count on
finding a value of 'Y' for
Unpaid_Flag in the query, you might want an index
on that column. But if the query is part of a group that just as
often searches for the unselective condition
Unpaid_Flag='N', you should likely avoid the
index. In this example, the meaning of the flag is likely special to
the query, driving the very purpose of the query (to find bills to
send out), so you can count on finding primarily queries against
'Y', which is the rare value.
To measure the selectivity of the condition
Unpaid_Flag='Y', begin by executing the following
two queries:
SELECT COUNT(*) FROM Orders WHERE Unpaid_Flag='Y';
SELECT COUNT(*) FROM Orders;
The selectivity of the condition is the first count divided by the
second.

On the other hand, consider the query:
SELECT ... FROM Order_Details WHERE Order_ID=:id;
End users would query for details of orders roughly at random. You
could assume this even if the application replaced the bind variable
:id with an actual value, since an application
would have no reason to always query the same order. The query
against any particular ID does not stand alone, but represents a
whole family of queries that you should tune as a unit. End users are
as likely to query one order as another, so you could estimate the
filter selectivity with:
SELECT 1/COUNT(DISTINCT Order_ID) FROM Order_Details;
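For example, with hypothetical counts: if Order_Details held 10,000,000 rows spread across 400,000 distinct orders, this calculation would yield 1/400,000 = 0.0000025, easily selective enough to favor indexed access.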
A more subtle case arises when the end user might query on any value,
but the end user is more likely to query on common values than
uncommon values. For example, if operators bring up customers by
finding all customers that have the last name of the calling
customer, they will bring up common last names more often than
uncommon ones with a query such as:
SELECT ... FROM Customers WHERE Last_Name = 'SMITH';
Here, if you just counted distinct names, as you counted order IDs
earlier, you would see an over-optimistic selectivity that assumed
you were just as likely to search for
Last_Name='KMETEC' as for
Last_Name='SMITH'. Each last name has a
selectivity of
n(i)/C,
where n(i) is the count of
rows with the ith nonnull last name and
C is the count of all rows in the table. If
choosing any last name were equally probable, you could just average
n(i)/C
over all the last names. That average would equal one over the number
of distinct names. However, the probability of searching on a last
name in this scenario is
n(i)/C',
where C' is the count of rows having nonnull
last names. Therefore, you really need the sum of the selectivities
times the probability of seeing each selectivity, that is, the sum of
(n(i)/C') x (n(i)/C) over all last names. Since C' is also the sum of
the individual n(i) values, you can compute the filter selectivity in
SQL as follows:
SELECT SUM(COUNT(Last_Name)*COUNT(Last_Name))/
(SUM(COUNT(Last_Name))*SUM(COUNT(*)))
FROM Customers GROUP BY Last_Name;
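To see how the weighting matters, take a hypothetical 10-row table with Last_Name values SMITH x 6, JONES x 3, and KMETEC x 1. The naive estimate of one over the number of distinct names gives 1/3, but the weighted sum is (36 + 9 + 1)/(10 x 10) = 0.46, reflecting that searches land on the common name SMITH most often.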
2.6.2 Index Range-Condition Selectivity
Index range-condition selectivity is the
fraction of table rows the database examines in an index while
scanning the index. Every index range scan must have two endpoints
that define the beginning and end of the range. These endpoints can
be equivalent to positive or negative infinity, meaning that the
range can be unbounded on either end, though not generally on both
ends. The range can exclude either or both of the range endpoints,
depending on the nature of the bounding condition. For example, the
range condition (Salary > 4000 AND Salary <
8000) excludes both endpoints with >
and < bounding conditions.

There are common problems that prevent a database from finding a
concrete beginning and end for a range scan, even given a very
selective condition on the indexed column, generally preventing
effective use of the index. It is hard or even impossible (depending
on the function) to translate a condition on
SomeFunction(Some_Column) into a single range
condition on a simple index leading with
Some_Column. Generally, a database does not even
try to do so. Therefore, conditions on expressions of a column, rather
than directly on the column itself, do not usually enable indexed
access to ranges defined on that column, often making indexes useless
for queries using those expressions. This is a double-edged sword: it
forces you to rewrite some queries to enable the index that you want
to use, but it is also a useful tool, allowing you to rewrite other
queries to prevent the database from using indexes that you do not
want to use. A subtle example
of a function that disables use of an index is when you compare
expressions of different types and the database applies a
type-conversion function implicitly. For example, the
type-inconsistent condition CharacterColumn=94303
implicitly becomes, in Oracle,
TO_NUMBER(CharacterColumn)=94303. To resolve this
problem and enable use of an index on the character column, perform
the conversion explicitly on the other side. For example, replace
CharacterColumn=94303 with CharacterColumn='94303'.

Even when they do not disable an index outright, type conversions can
cause difficulties when a database vendor supports different number
types, such as integer and decimal types. Type conversions also
interfere with efficient, indexed join paths for which a foreign key
of one type points to a primary key of another type.
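As a sketch of the rewrite (the Addresses table and its character-type Zip_Code column are hypothetical):

SELECT ... FROM Addresses WHERE Zip_Code = 94303;
-- On Oracle, this type-inconsistent comparison implicitly becomes
-- TO_NUMBER(Zip_Code)=94303, disabling any index leading with Zip_Code.

SELECT ... FROM Addresses WHERE Zip_Code = '94303';
-- With a character-type literal, the indexed column stays bare, so an
-- index leading with Zip_Code can deliver a range scan.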
Several rules determine how large an index range scan will be:

Conditions on the leading column of the index under consideration are
suited to provide a range start or endpoint. Conditions on later
columns of the same index are not suited to provide a range start or
endpoint, unless you also have exact equality conditions on all
preceding columns of that index. For example, an index on
(Date_Column, ID_Number)
applied to the condition Date_Column
>= TO_DATE('2003/01/01',
'YYYY/MM/DD') delivers a range scan fully
determined by the date condition. Further conditions on the second
column, such as ID_Number=137, do not further
narrow the index range scan. To further narrow the range based on the
second condition, you have to look at a long list of ranges, one for
each possible value of Date_Column that satisfies
the first condition, but databases do not do this. However, if you
flip the index column order to (ID_Number,
Date_Column), then the same two conditions
together define the range endpoints and you get
a much shorter, faster index range scan.
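A sketch of the two alternatives just described, with hypothetical index names:

CREATE INDEX Orders_Date_ID ON Orders (Date_Column, ID_Number);
-- Only the date condition sets range endpoints; ID_Number=137 is merely
-- checked against each index entry in that long range.

CREATE INDEX Orders_ID_Date ON Orders (ID_Number, Date_Column);
-- The equality on the leading column plus the date condition together
-- define both endpoints of a much shorter range.

SELECT ... FROM Orders
WHERE ID_Number = 137
  AND Date_Column >= TO_DATE('2003/01/01', 'YYYY/MM/DD');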
Depending on the database, Indexed_Col IS NULL usually denotes too
large a range to be useful, and the database ignores such conditions
when establishing range endpoints.
Similarly, the database assumes that Indexed_Col IS NOT NULL covers too
large a range to be useful, so the database will not drive to an
index from this condition. In rare cases, however, nonnull values are
so uncommon that an index range scan over all possible nonnull values is
beneficial. In such cases, if you can figure out a safe lower or
upper limit to the range of all possible values, you can enable a
range scan with a condition such as Positive_ID_Column >
-1 or Date_Column >
TO_DATE('0001/01/01','YYYY/MM/DD').

Indexed_Char_Col LIKE 'ABC%' establishes the start
and end of a legal range for an index, with
Indexed_Char_Col as the leading column of the
index. (LIKE is SQL's
pattern-matching comparator, and % is the
pattern-matching wildcard.)

Indexed_Char_Col LIKE 'ABC%DEF' also establishes
the start and end of a legal range, but in this case only the
'ABC%' part of the comparison string contributes
to narrowing the index range scan.

Indexed_Number_Column LIKE '123%'
does not establish the start and end of a legal range,
because the LIKE comparison is meaningful only
when applied to character strings and the database must implicitly
convert Indexed_Number_Column to a character
string to check this condition, disabling any index leading with
Indexed_Number_Column. In terms native to numbers,
this condition would point to a whole series of ranges:
(Indexed_Number_Column >= 123 AND Indexed_Number_Column < 124) OR
(Indexed_Number_Column >= 1230 AND Indexed_Number_Column < 1240) OR
(Indexed_Number_Column >= 12300 AND Indexed_Number_Column < 12400) OR...

Indexed_Char_Col LIKE '%ABC%' does not establish the start and end of
a legal range, because the leading wildcard implies this pattern might
be matched anywhere in the entire range of the index.

Equality
(=), BETWEEN, and most
inequality (<, <=,
>, >=) conditions on
first columns of indexes establish legitimate index range endpoints.

The not-equal-to inequality condition,
usually expressed as != or
<>, does not establish an index range,
because the database assumes this condition to be too unselective to
be worth indexed access. If the excluded value covers almost all the
rows, with other values being rare, you can usefully enable indexed
access by replacing the negative condition
Column!='DominantValue' with Column IN
(<List of all
other values, all of which are
rare>), although it could prove
inconvenient to keep this list up-to-date as the application evolves.

Series of conditions combined by OR or by an
IN list can lead to a series of range scans, but
only if every such condition points to a legal range. For example,
IDColumn IN (123,124,125) yields three legal
equality conditions that deliver three range scans, and
(Name='Smith' OR Name='Jones') yields a pair of
range scans, but (Name='Smith' OR Name IS NULL)
does not enable use of an index (except on DB2), since IS
NULL does not mark a legal range.
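A sketch of the rewrite of a not-equal-to condition described above, assuming a hypothetical Order_Status column dominated by the value 'CLOSED', with all other values rare:

-- The negative condition cannot drive to an index:
SELECT ... FROM Orders WHERE Order_Status != 'CLOSED';
-- The equivalent IN list yields one legal range scan per rare value:
SELECT ... FROM Orders WHERE Order_Status IN ('OPEN', 'BACKORDERED');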
If you have conditions that do not specify a restricted range on at
least the first column of an index, the only way to use an index at
all is to perform a
full index scan, a
scan of every index entry, across all leaf blocks. Databases will not
usually choose full index scans, because to read an entire index is
expensive and full index scans usually end up pointing to much of the
table as well, with a net cost greater than the competing full table
scan. You can force full index scans (often without meaning to) by
adding a query hint (more on these later) that instructs the database
to use an index even though it would not choose to use that index on
its own. More often than not, adding such a hint is a mistake, a
desperate attempt to get something better than a full table scan.
Usually, this happens when a developer values an index too much,
actually leading to a worse plan than the plan the database would
choose without help.

Even in cases for which a full index scan is better than the
competing full table scan, it is almost surely worse than a third
alternative: usually driving from a different, sometimes new index or
changing a condition on a function to enable a narrow range scan.
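For illustration, here is a sketch of how such a hint might look on Oracle (hints are covered later; the index name here is hypothetical):

SELECT /*+ INDEX(O Order_Status_Inx) */ ...
FROM Orders O
WHERE Order_Status != 'CLOSED';
-- No condition defines a range on the index's leading column, so
-- honoring the hint forces a full index scan, usually a mistake.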
2.6.3 Selectivity on Table Rows Reached from the Index
Since indexes are
compact and better cached objects, compared to tables, selectivity of
an index range scan matters less than efficient table access. Even
when an index and table are perfectly cached, the selectivity on the
table matters more than on the index, because you read about 300
index rows in a single logical I/O to a leaf block but you must do a
separate logical I/O for every table row. This brings us to the
second part of evaluating the usefulness of an index: how narrowly
will the index restrict access to the table? Fortunately, database
implementations are smart enough to check all conditions as early in
the execution plan as possible, before touching a table, when the
index allows. This reduces table reads whenever an index includes
columns required to evaluate a condition, even if the database cannot
use that condition to determine the range scan endpoints. For
example, consider a Persons table with one index
that includes the Area_Code,
Phone_Number, Last_Name, and
First_Name columns. Consider
a query against this table with the conditions:
WHERE Area_Code=916 AND UPPER(First_Name)='IVA'
Only the first condition contributes to the endpoints of the index
range scan. The second condition, on the fourth indexed column, fails
to narrow that index range scan for three reasons:

The second index column, Phone_Number, has no equality condition
specified (no condition at all, for that matter).

The third column, Last_Name, also has no equality condition specified.

The condition on First_Name is disabled as range-limiting by the
UPPER function.
Fortunately, none of these reasons prevents the database from
checking the second condition before accessing the table. Because the
index contains the First_Name column, the database
can test conditions against that column using data from the index. In
this case, the database uses the Area_Code
condition to define an index range scan. Then, as it looks at each
index entry in the range, the database discards entries with first
names other than Iva. The likely result is just
one or two table-row reads on this selective combination of
conditions.

Let's examine this situation more closely to decide
whether it would be worthwhile to create a better index for this
combination of conditions. Both Area_Code and
First_Name will see skewed distributions. You will
query the common area codes and common first names more often than
the uncommon ones, so follow the approach for skewed distributions
and find the individual selectivity factors:
SELECT SUM(COUNT(Area_Code)*COUNT(Area_Code))/
(SUM(COUNT(Area_Code))*SUM(COUNT(*)))
FROM Persons GROUP BY Area_Code;
SELECT SUM(COUNT(First_Name)*COUNT(First_Name))/
(SUM(COUNT(First_Name))*SUM(COUNT(*)))
FROM Persons GROUP BY First_Name;
Let's say that the first calculation yields a
selectivity of 0.0086, and the second yields a selectivity of 0.018.
Let's further assume that you have a 1,000,000-row
table and the index references 200 rows per leaf block (less than
usual, because this is an unusually wide index key). The condition on
Area_Code alone determines the number of index
blocks scanned, so first find the number of rows that condition
references. The calculation is 0.0086 x 1,000,000=8,600, indicating
that the database scans 43 (8,600/200) leaf blocks. (You can always
neglect root and branch block reads in single-table queries.) These
43 leaf blocks are likely well cached and require about a millisecond
to read.
The condition on Area_Code alone would point to all 8,600 of those
rows in the table; fortunately, you pick up the additional selectivity of the condition
on First_Name before reaching the table, reducing
the row count to about 155 (8,600 x 0.018) rows. You will see about
155 logical I/Os to the table, with a few physical I/Os, since you
will see a less favorable hit ratio on the table, compared to the
more compact index.

The multicolumn filter selectivity, 0.0086 x 0.018 = 0.000155, is
well into the range that favors indexed access, even though the
individual selectivity of the first column alone is in the gray area.
Notice that even if you modify the query or the data model to pick up
the full selectivity in the index range conditions, the cost will go
down only marginally, eliminating most of the 43 well-cached index
leaf-block reads and speeding the query by about a millisecond. This
will have no effect on the 155 more expensive table-block reads. (You
could modify the application to do a case-sensitive match on the
first name, and create a new index to cover just these two columns.
Alternatively, you could modify the table to store the uppercase
first names and index the new column together with area code.)
Therefore, even though this query and index are not perfectly suited
to one another, the index is good enough to use. It is not worthwhile
to add a new index and change the query for the small potential
gain.
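If the gain ever did justify the change, the second alternative above might look like this sketch (the new column and index names are hypothetical, and the application would have to maintain the new column):

ALTER TABLE Persons ADD (First_Name_Upper VARCHAR2(40));
UPDATE Persons SET First_Name_Upper = UPPER(First_Name);
CREATE INDEX Persons_Area_FName ON Persons (Area_Code, First_Name_Upper);
-- The query would then compare the bare indexed column, enabling a
-- two-column range scan:
-- WHERE Area_Code=916 AND First_Name_Upper='IVA'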
2.6.4 Combining Indexes
Occasionally,
the database finds equality-condition filters that point to different
single-column indexes. By combining index range scans on these
indexes, the database can pick up the multicolumn filter selectivity
before reaching the table. Let's reuse the earlier
example of a query that specifies area code and customer first name,
but replace the multicolumn index with single-column indexes on the
two columns. Further, replace the condition on
UPPER(First_Name) with a simple match on the
column, assuming the column already stores uppercase values. Typical
conditions are then:
WHERE Area_Code=415 AND First_Name='BRUCE';
You can now assume that you have closer to the typical 300 rows or
more per index leaf block, since these are single-column indexes on
short columns. Using the same single-column filter selectivity
factors and assuming as before that the table holds 1,000,000 rows,
you can again predict that the Area_Code condition
will yield 8,600 (0.0086 x 1,000,000) rows, requiring 29 (8,600/300)
index leaf blocks. The second index range scan, for
First_Name, will yield 18,000 (0.018 x 1,000,000)
rows, requiring 60 (18,000/300) index leaf blocks. Since these are
both equality conditions, the database finds the two rowid lists for
these two conditions presorted, and it can easily merge these two
sorted lists to find the same 155 table rowids found with the earlier
multicolumn index.

The operation just described, merging lists of rowids from two
indexes, is called the index AND-EQUAL
MERGE operation, and the database can
do this between any number of equality conditions on single-column
indexes. The cost of the table access is the same as finding the rows
with one multicolumn index, but you must read more index leaf
blocks: in this case, 89 (29+60) instead of 43 in the earlier example.

As you can see, the AND-EQUAL MERGE is more expensive than a
multicolumn index, but it might be better than using just one
single-column index. However, this case, in which the AND-EQUAL MERGE is
reasonably attractive, is rare. Almost always, the most selective
single-column condition is fine by itself and the cost of the less
selective index range scan exceeds the savings in table access, or
the added cost of the AND-EQUAL MERGE justifies creating a
multicolumn index. When you see an AND-EQUAL MERGE in an execution
plan, you should almost always either suppress use of the less
selective index or create a multicolumn index to use instead. The new
multicolumn index should start with the more selective
column (in this case, Area_Code) and
should usually take the place of any index on that column alone. Such
an index sometimes enables you to drop the index on the other column
as well.
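A sketch of that fix, with hypothetical index names:

-- Replace the single-column index on the more selective column...
DROP INDEX Persons_Area_Code;
-- ...with a multicolumn index leading with that same column:
CREATE INDEX Persons_Area_First ON Persons (Area_Code, First_Name);
-- If no other query needs it, the index on the other column alone can
-- sometimes be dropped as well:
DROP INDEX Persons_First_Name;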