SQL Tuning [Electronic resources] نسخه متنی

5.2 Full Query Diagrams

Example 5-1 is a simple
query that illustrates all the significant elements of a query
diagram.

Example 5-1. A simple query with one join

SELECT D.Department_Name, E.Last_Name, E,First_Name
FROM Employees E, Departments D
WHERE E.Department_Id=D.Department_Id
AND E.Exempt_Flag=Y
AND D.US_Based_Flag=Y;

This query translates to the query diagram in Figure 5-1.

Figure 5-1. A full query diagram for a simple query

First, Ill describe the meaning of each of these
elements, and then Ill explain how to produce the
diagram, using the SQL as a starting point.

5.2.1 Information Included in Query Diagrams

In mathematical terms, what you see in Figure 5-1
is a
directed graph: a
collection of nodes and links, where the links often have arrows
indicating their directionality. The nodes in this diagram are
represented by the letters E and
D. Alongside both the nodes and the two ends of
each link, numbers indicate further properties of the nodes and
links. In terms of the query, you can interpret these diagram
elements as follows.

5.2.1.1 Nodes

Nodes represent tables or table aliases in
the FROM clausein this case, the aliases
E and D. You can choose to
abbreviate table or alias names as convenient, as long as doing so
does not create confusion.

5.2.1.2 Links

Links represent joins between tables, where
an
arrow on one end
of the link means that the join is guaranteed to be unique on that
end of the join. In this case, Department_ID is
the
primary (unique) key
to the table Departments, so the link includes an
arrow on the end pointing to the D node. Since
Department_ID is not unique in the
Employees table, the other end of the link does
not have an arrow. Following my convention, always draw links with
one arrow with the arrow pointing downward. Consistency is useful
here. When you always draw arrows pointing downward, it becomes much
easier to refer consistently to master tables as always being located
below detail tables. In Chapter 6 and
Chapter 7, I describe rules of thumb for finding
optimum execution plans from a query diagram. Since the rules of
thumb treat joins to master tables differently than joins to detail
tables, the rules specifically refer to downward
joins and upward joins as a shorthand means of
distinguishing between the two types.

Although you might guess that Department_ID is the
primary key of Departments, the SQL does not
explicitly declare which side of each join is a primary key and which
side is a foreign key. You need to check indexes or declared keys to
be certain that Department_ID is guaranteed to be
unique on Departments, so the arrow combines
information about the nature of the database keys and information
about which keys the SQL uses.

5.2.1.3 Underlined numbers

Underlined
numbers
next to the nodes represent the fraction of
rows in each table that will satisfy the filter conditions on that
table, where filter conditions are nonjoin conditions in the
diagrammed SQL that refer solely to that table. Figure 5-1 indicates that 10% of the rows in
Employees have Exempt_Flag=Y
and 50% of the rows in Departments have
US_Based_Flag=Y. I call these the
filter ratios.

Frequently, there are no filter conditions at all on one or more of
the tables. In that case, use 1.0 for the filter ratio
(R), since 100% of the rows meet the
(nonexistent) filter conditions on that table. Also, in such cases, I
normally leave the filter ratio off the diagram entirely; the
numbers absence implies R=1.0
for that table. A filter ratio cannot be greater than 1.0. You can
often guess approximate filter ratios from knowledge of what the
tables and columns represent (if you have such knowledge). It is
best, when you have realistic production data distributions
available, to find exact filter ratios by directly querying the data.
Treat each filtered table, with just the filter clauses that apply to
that table, as a single-table query, and find the selectivity of the
filter conditions just as I describe in Chapter 2
under Section 2.6.

During the development phase of an application, you do not always
know what filter ratio to expect when the application will run
against production volumes. In that case, make your best estimate
based on your knowledge of the application, not based on tiny,
artificial data volumes on development databases.

5.2.1.4 Nonunderlined numbers

Nonunderlined
numbers next to the two ends of the link
represent the average number of rows found in the table on that end
of the join per matching row from the other end of the join. I call
these the join ratios. The join
ratio on the nonunique end of the join is the
detail join ratio,
while the join on the unique end (with the arrow) is the
master join ratio.

Master join ratios are always less than or equal to 1.0, because the
unique key ensures against finding multiple masters per detail. In
the common case of a detail table with a mandatory foreign key and
perfect referential integrity (guaranteeing a matching master row),
the master join ratio is exactly 1.0.

Detail join ratios can be any nonnegative number. They can be less
than 1.0, since some master-detail relationships allow zero, one, or
many detail rows, with the one-to-zero case dominating the
statistics. In the example, note that the average
Employees row matches (joins to) a row in
Departments 98% of the time, while the average
Departments row matches (joins to) 20
Employees. Although you might guess approximate
join ratios from knowledge of what the tables represent, query these
values when possible from the real, full-volume data distributions.
As with filter ratios, you might need to estimate join ratios during
the development phase of an application.

5.2.2 What Query Diagrams Leave Out

As important as what query diagrams include is what they do not
include. The following sections describe several things that query
diagrams omit, and explain why that is the case.

5.2.2.1 Select lists

Query diagrams entirely exclude any
reference to the list of columns and expressions that a query selects
(everything between SELECT and
FROM, that is). Query performance is almost
entirely determined by which rows in the database you touch and how
you reach them.
What you do with those rows, which
columns you return, and which expressions you calculate are almost
irrelevant to performance. The main, but rare, exception to this rule
is when you occasionally select so few columns from a table that it
turns out the database can satisfy the query from data in an index,
avoiding access to the base table altogether.
Occasionally, this
index-only access can lead to major savings, but it has little
bearing on decisions you make about the rest of the execution plan.
You should decide whether to try index-only access as a last step in
the tuning process and only if the best execution plan without this
strategy turns out to be too slow.

5.2.2.2 Ordering and aggregation

The diagram excludes any
indication of ordering
(ORDER BY),
grouping (GROUP BY), or

post-grouping filtering
(HAVING). These operations are almost never
important. The sort step these generally imply is not free, but you
can usually do little to affect its cost, and its cost is usually not
the problem behind a badly performing query.

5.2.2.3 Table names

Query diagrams usually abstract table names
to table aliases. Once you have the necessary table statistics, most
other details are immaterial to the tuning problem. For example, it
makes no difference which tables a query reads or which physical
entities the tables represent. In the end, you must be able to
translate the result back to actions on the original SQL and on the
database (actions such as creating a new index, for example).
However, when you are solving the abstract
tuning problem, the more abstract the node names, the better.
Analogously, when your objective is to correctly add a series of
numbers, its no help to worry about whether you are
dealing with items in an inventory or humans in a hospital. The more
abstractly you view a problem, the clearer it is and the more clearly
you notice analogies with similar, past problems that might have
dealt with different entities.

5.2.2.4 Detailed join conditions

The details of join conditions are lost
when you abstract joins as simple arrows with a couple of numbers
coming from outside of the SQL. As long as you know the statistics of
the join, the details of the join columns and how they are compared
do not matter.

5.2.2.5 Absolute table sizes (as opposed to relative sizes)

The diagram does not represent table
sizes. However, you can generally infer relative table sizes from the
detail join ratio, by the upper end of the
link. Given a query diagram, you need to know overall table sizes to
estimate overall rows returned and runtime of the query, but it turns
out that you do not need this information to estimate
relative runtimes of the alternatives and
therefore to find the best alternative. This is a helpful result,
because you often must tune a query to run well not just on a single
database instance, but also on a whole wide range of instances across
many customers. Different customers frequently have different
absolute table sizes, but the relative sizes of those tables tend not
to vary so much, and join and filter ratios tend to vary far less; in
fact, they tend to vary little enough that the differences can be
ignored.

In my experience, it is a myth that one of
the main advantages of cost-based optimizers is their ability to
respond to varying table and index statistics with different plans
for the same applications SQL. This theoretical
possibility is cited as an argument against manual tuning of an
applications SQL, since the result might
(theoretically) hurt as many customers as it would help, by
figuratively tying the optimizers hands. On the
contrary, I have yet to find a single query that
didnt have an execution plan that worked fine on
every data distribution I encountered or was likely to encounter.
(These plans were not optimal for every distribution, but they were
close enough to optimal so as not to matter.) As a result, careful
manual tuning of a products SQL need not harm one
customer to help another.

5.2.2.6 Filter condition details

The details of filter conditions are
lost when you abstract them to simple numbers. You can choose an
optimum path to the data without knowing anything about the details
of how, or using which columns, the database excludes rows from a
query result. You need only know how effective, numerically, each
filter is for the purpose of excluding rows. Once you know that
optimum abstract path, you need to return to the detailed conditions,
in many cases, to deduce what to change. You can change indexes to
enable that optimum path, or you can change the SQL to enable the
database to exploit the indexes it already has, but this final step
is simple once you know the optimum abstract path.

5.2.3 When Query Diagrams Help the Most

Just as some word problems are too simple
to really require abstraction (e.g., "If Johnny has
five apples and eats two, how many apples does he have
left?"), some queries, like the one in Example 5-1, are so simple that an abstracted
representation of the problem might not seem necessary. If all
queries were two-way joins, you could manage fine without query
diagrams. However, in real-world applications, it is common to join 6
or more tables, and joining 20 or more is not as rare as you might
think, especially among the queries you will tune, since these more
complex queries are more likely to need manual tuning. (My personal
record is a 115-way join, and I routinely tune joins with more than
40 tables.) The more fully normalized your database and the more
sophisticated your user interface, the more likely it is that you
will have many-way joins. It is a myth that current databases cannot
handle 20-way joins, although it is no myth that databases (some more
than others) are much more likely to require manual tuning to get
decent performance for such complex queries, as compared to simple
queries.

Apart from the obvious advantages of stripping away irrelevant detail
and focusing attention on just the abstract essentials, query
diagrams offer another, more subtle benefit: practice problems are
much easier to generate! In this chapter, I begin each problem with
realistic SQL, but I cannot use real examples from proprietary
applications I have tuned. So, for good examples, I invent an
application database design and imagine real business scenarios that
might generate each SQL statement. Once I have thoroughly
demonstrated the process of reducing a SQL statement to a query
diagram, the rest of the book will have problems that
start with the query diagrams. Similarly, most
of a math text focuses on abstract math, not on word problems.
Abstraction yields compact problems and efficient teaching, and
abstraction further saves me the bother of inventing hundreds of
realistic-sounding word problems that correspond to the mathematical
concepts I need to teach. You might not worry much about how much
bother this saves me, but it saves
you time as well. To better learn SQL tuning,
one of the best exercises you can do is to invent SQL tuning problems
and solve them. If you start with realistic SQL, you will find this a
slow and painful process, but if you just invent problems expressed
through query diagrams, you will find you can crank out exercises by
the hundreds.

5.2.4 Conceptual Demonstration of Query Diagrams in Use

I have stated without proof that query
diagrams abstract all the data needed to answer the most important
questions about tuning. To avoid trying your patience and credulity
too severely, I will demonstrate a way to use the query diagram in
Figure 5-1 to tune its query in detail, especially
to find the best join order. It turns out that the best access path,
or execution plan, for most multitable queries involves indexed
access to the first, driving table, followed by nested loops to the
rest of the tables, where those nested loops follow indexes to their
join keys, in some optimum join order.

For small master tables, you sometimes can do slightly better by
replacing the nested-loops joins to those small tables with
hash joins, but the improvement is usually
minor and likely not worth the bother or the robustness risk that
comes with a hash join.

However, for many-way joins, the number of possible join orders can
easily be in the high millions or billions. Therefore, the single
most important result will be a fast method to find the best join
order.

Before I demonstrate the complex process of optimizing join order for
a many-way join using a query diagram, lets just
see what the diagram in Figure 5-1 tells us about
comparative costs of the two-way join in the underlying SQL query, if
you were to try both possible join orders. To a surprisingly good
approximation, the cost of all-indexed access, with nested loops, is
proportional to the number of table rows the database accesses, so I
will use rows touched as the proxy for query cost. If you begin with
the detail node E, you cannot work out an absolute
cost without assuming some

rowcount for that table. However, the
rowcount is missing from the diagram, so let it be
C and follow the consequences algebraically:

Using an index on the filter for E, the database
must begin by touching 0.1 /
C rows in the driving table
E.

Following nested loops to master table D, the
master join ratio shows that the database will reach 0.1
/ C x 0.98 rows of
D.

Adding the results of Steps 1 and 2, the total count of table rows
touched is (0.1 x C)+(0.1 x
C x 0.98).

The total rowcount the database will reach, then, factoring out
C and 0.1 from both terms, is C
x (0.1 x
(1+0.98)), or 0.198 x
C.

Beginning with the other end of the join, lets
still express the cost in terms of C, so first
work out the size of D in terms of
C. Given the size of D in
terms of C, the cost of the query in terms of
C follows:

For every row in D, the database finds on average
20 rows in E, based on the detail join ratio.
However, just 98% of the rows in E even match rows
in D, so the total rowcount C
must be 20/0.98 times larger than the rowcount of
D. Conversely, then, the rowcount of
D is C x
0.98/20.

Following an index to reach only the filter rows on
D results in reaching half these rows for the
driving table for this alternative, or C /
0.5 x
0.98/20.

Following the join to E, then, the database
reaches 20 times as many rows in that table: 20
x C x
0.5 x
0.98/20, or C x
0.5 x
0.98.

Adding the two rowcounts and factoring out the common terms as
before, find the total cost: C x
(0.5 x
0.98 x
((1/20)+1)), or 0.5145 x
C.

The absolute cost is not really the question; the relative cost is,
so you can choose the best plan and, whatever C
is, 0.198 x C will surely
be much lower than 0.5145 x
C. The query diagram has all the information you
need to know that the plan driving from E and
joining to D will surely be much faster (about 2.6
times faster) than the plan driving from D and
joining to E. If you found these alternatives to
be much closer, you might worry whether the cost estimate based on
simple rowcounts was not quite right, and I will deal with that issue
later. The message here is that the query diagram answers the key
questions that are relevant to optimizing a query; you just need an
efficient way to use query diagrams for complex queries.

5.2.5 Creating Query Diagrams

Now that you see what a query diagram
means, here are the steps to create one from an SQL statement,
including some steps that will not apply to the particular query
shown earlier, but that you will use for later queries:

Begin with an arbitrary choice of table alias in the
FROM clause, and place this in the middle of a
blank page. For now, I will refer to this table as the
focus table, simply meaning that this table will
be the current point from which to add further details to the query
diagram. I will choose alias E as the starting
point for the example, just because it is first in the query.

Search for join conditions that match up to a single value of the
focus tables primary key. For each such join that
you find, draw an arrow pointing down toward the focus table alias
from above, labeling the other end of the arrow with the alias at the
other end of the join. (You will usually find at most one join into a
table from above, for reasons I will discuss later.) When the link
represents an outer join, add an arrowhead halfway along the link and
point the arrowhead toward the optional table.

In the example, there is no join to the Employees
tables primary key, presumed to be
Employee_ID. If you want to be certain that
Department_ID is not a poorly named primary key of
Employees, you can check the
tables declared keys and unique indexes. If you
suspect that it might really be unique but that the database designer
failed to declare or enforce that uniqueness with either a declared
key or index, you can check for uniqueness with SELECT COUNT(*), COUNT(DISTINCT Department_ID) FROM Employees;. However, you might be surprised how rarely
there is any doubt about where the unique end of the join lies, just
from column and table naming. You can generally get by just fine with
intelligent guesses, revisiting those guesses only if the answer you
come up with performs worse than you expect or require.

Search for join conditions that go from a foreign key in the focus
table to a primary key in another table, and draw arrows for those
joins pointing down from the focus table, with aliases for the
joined-to tables on the lower end of each arrow. When a link
represents an outer join, add an arrowhead pointing to the optional
table halfway along the link.

Shift focus to another, previously unvisited node in the diagram and
repeat Steps 2 and 3 until you have nodes that represent every alias
in the FROM clause and arrows that represent every
join. (For example, I represent a multipart join as a single arrow
from a multipart foreign key to a multipart primary key.) Usually,
you will find just a single arrow pointing down into a node, so you
will normally search for fresh downward-pointing joins from nodes
already on the lower, arrow end of a join. This usually leads to an
upside-down tree structure, descending from a single detail table at
the top.

After completing all the nodes and links, fill in the numbers for
filter ratios and join ratios based, if possible, on
production-application table statistics. If you
dont have production data, then estimate as best
you can. It is not necessary to add join ratios alongside links that
represent outer joins. You almost always find that the optional table
in an outer join (on the (+) side of the join, in
old Oracle notation, or following the LEFT OUTER
keywords in the newer style) has no filter condition, leaving the
filter ratio equal to 1.0, which you denote by omitting it from the
diagram.

Place an asterisk next to the filter ratio for any filter that is
guaranteed to return at most a single row. This is not a function
just of the ratio and the rowcount of the table, since a condition
might return an average of a single row without guaranteeing a
maximum of a single row. To guarantee at most a single row, you
should have a unique index or a well-understood application
constraint that creates a true guarantee.

Guaranteeing Uniqueness

Sometimes, a filter condition is
so close to guaranteeing a unique match that it is tempting to treat
it as unique. Usually, you can get away with this, but I find that it
is useful to first try to make the guarantee complete, by filling in
some missing condition or correcting the database design. For
example, you might have a filter Order_ID=:1 for
an Orders table, with the primary key of
Orders that consists of the
Order_ID and Company_ID
columns, but Company_ID is a single, dominant
value 99.9% of the time. The most likely reason the developer left
out any condition on Company_ID is simply that he
forgot that it is sometimes not equal to the dominant value and that
Order_ID does not always specify a unique order by
itself. I have found that when filter conditions are close to unique,
the intent is almost invariably to make them unique, and the missing
condition or conditions should be added to achieve the original
intended functionality.

Similar comments apply to almost-unique joins. When the join
conditions imply a many-to-almost-always-one join, chances are high
that the query or the database design should be corrected to
guarantee a perfect many-to-one join.

If you have available production data, it is the ideal source of
filter and join ratios. For Example 5-1, you would
perform the following queries (Q1 through
Q5) to determine these ratios rigorously:

Q1: SELECT COUNT(*) A1 FROM Employees 
    WHERE Exempt_Flag=Y;
A1: 1,000
Q2: SELECT COUNT(*) A2 FROM Employees;
A2: 10,000
Q3: SELECT COUNT(*) A3 FROM Departments 
    WHERE US_Based_Flag=Y;
A3: 245
Q4: SELECT COUNT(*) A4 FROM Departments;
A4: 490
Q5: SELECT COUNT(*) A5 FROM Employees E, Departments D 
    WHERE E.Department_ID=D.Department_ID;
A5: 9,800

With A1 through A5 representing
the results these queries return, respectively, do the following
math:

To find the filter ratio for the E node, find
A1/A2.

To find the filter ratio for the D node, find
A3/A4.

To find the detail join ratio, find
A5/A4, the rowcount of the
unfiltered join divided by the master-table rowcount.

To find the master join ratio, find
A5/A2, the rowcount of the
unfiltered join divided by the detail-table rowcount.

For example, the results from queries Q1 through
Q51,000, 10,000, 245, 490, and 9,800,
respectivelywould yield precisely the filter and join ratios
in Figure 5-1, and if you multiplied all of these
query results by the same constant, you would still get the same
ratios.

5.2.6 A More Complex Example

Now I illustrate the diagramming process for a query that is large
enough to get interesting. I will diagram the query in Example 5-2, which expresses outer joins in the older
Oracle style (try diagramming this query yourself first, if you feel
ambitious).

Example 5-2. A more complex query, with eight joined tables

SELECT C.Phone_Number, C.Honorific, C.First_Name, C.Last_Name, 
C.Suffix, C.Address_ID, A.Address_ID, A.Street_Address_Line1, 
A.Street_Address_Line2, A.City_Name, A.State_Abbreviation, 
A.ZIP_Code, OD.Deferred_Shipment_Date, OD.Item_Count, 
ODT.Text, OT.Text, P.Product_Description, S.Shipment_Date 
FROM Orders O, Order_Details OD, Products P, Customers C, Shipments S, 
Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name) LIKE :Last_Name||%
AND UPPER(C.First_Name) LIKE :First_Name||%
AND OD.Order_ID = O.Order_ID
AND O.Customer_ID = C.Customer_ID
AND OD.Product_ID = P.Product_ID(+)
AND OD.Shipment_ID = S.Shipment_ID(+)
AND S.Address_ID = A.Address_ID(+)
AND O.Status_Code = OT.Code
AND OT.Code_Type = ORDER_STATUS
AND OD.Status_Code = ODT.Code
AND ODT.Code_Type = ORDER_DETAIL_STATUS
AND O.Order_Date > :Now - 366
ORDER BY C.Customer_ID, O.Order_ID DESC, S.Shipment_ID, OD.Order_Detail_ID;

As before, ignore all parts of the query except the
FROM and WHERE clauses. All
tables have intuitively named primary keys except
Code_Translations, which has a two-part primary
key: (Code_Type, Code).

Note that when you find a single-table equality condition on part of
a primary key, such as in this query on
Code_Translations, consider that condition to be
part of the join, not a filter condition. When
you treat a single-table condition as part of a join, you are
normally looking at a physical table that has two or more logical
subtables. I call these apples-and-oranges
tables: tables that hold somewhat different, but related
entity types within the same physical table. For purposes of
optimizing a query on these subtables, you are really interested only
in the statistics for the specific subtable the query references.
Therefore, qualify all queries for table statistics with the added
condition that specifies the subtable of interest, including the
query for overall rowcount (SELECT COUNT(*) FROM
<Table>), which becomes, for this
example, a subtable rowcount (SELECT COUNT(*) FROM Code_Translations WHERE
Code_Type=ORDER_STATUS).

If you wish to complete the skeleton of the query diagram yourself,
as an exercise, follow the steps for creating a query diagram shown
earlier up to Step 4, and skip ahead to Figure 5-4
to see if you got it right.

5.2.6.1 Diagram joins to the first focus

For Step
1, place the alias O, arbitrarily, in the center
of the diagram. For Step 2, search for any join to
Os primary key,
O.Order_ID. You should find the join from
OD, which is represented in Figure 5-2 as a downward arrow from OD
to O. To help remember that you have diagrammed
this join, cross it out in the query.

Figure 5-2. Beginnings of the diagram for Example 5-2

5.2.6.2 Diagram joins from the first focus

Moving on to Step 3, search for other joins (not to
Order_ID, but to foreign keys) to
Orders, and add these as arrows pointing down from
O to C and to
OT, since these joins reach
Customers and Code_Translations
on their primary keys. At this point, the diagram matches Figure 5-3.

Figure 5-3. After Step 3 for the second diagram

5.2.6.3 Change focus and repeat

This completes all joins that originate from alias
O. To remember which part is complete, cross out
the O in the FROM clause, and
cross out the joins already diagrammed with links. Step 4 indicates
you should repeat Steps 2 and 3 with a new focus, but it does not go
into details about how to choose that focus. If you try every node as
the focus once, you will complete the diagram correctly regardless of
order, but you can save time with a couple of rules of thumb:

Try to complete upper parts of the diagram first. Since you have not
yet tried the top node OD as a focus, start there.

Use undiagrammed joins to find a new focus that has at least some
work remaining. Checking the list of joins that have not yet been
crossed out would lead to the following list of potential candidates:
OD, P, S,
A, and ODT. However, out of
this list, only OD is now on the diagram, so it is
the only focus now available to extend the diagram. If you happened
to notice that there are no further joins to C and
OT, you could cross them out in the
FROM clause as requiring no further work.

By following these rules of thumb you can complete the diagram
faster. If at any stage you find the diagram getting crowded, you can
redraw what you have to spread things out better. The spatial
arrangement of the nodes is only for convenience; it carries no
meaning beyond the meaning conveyed by the arrows. For example, I
have drawn C to the left of OT,
but you could switch these if it helps make the rest of the diagram
fit the page.

The illustrators at OReilly have worked hard to
make my example diagrams look nice, but you dont
have to. In most cases, no one will see your diagrams except you, and
you will probably find your own diagrams sufficiently clear with
little bother about their appearance.

You find no joins to the primary key of
Order_Details (to
OD.Order_Detail_ID, that is), so all links to
OD point downward to P,
S, and ODT, which all join to
OD with their primary keys. Since
S and P are optional tables in
outer joins, add downward-pointing arrowheads midway down these
links. Cross out the joins from OD that these
three new links represent. This leaves just one join left, an outer
join from S to the primary key of
A. Therefore, shift focus one last time to
S and add one last arrow pointing downward to
A, with a

midpoint arrowhead indicating this outer
join. Cross out the last join and you need not shift focus again,
because all joins and all aliases are on the diagram.

You should quickly confirm that no alias is orphaned (i.e., has no
join condition at all linking it to the rest of the diagram). This
occasionally happens, most often unintentionally, when the query
includes a Cartesian join.

At this stage, if youve been following along, your
join diagram should match the connectivity shown in Figure 5-4, although the spatial arrangement is
immaterial. I call this the query skeleton or
join skeleton; once you complete it, you need
only fill in the numbers to represent relevant data distributions
that come from the database, as in Step 5 of the diagramming
process.

Figure 5-4. The query skeleton

5.2.6.4 Compute filter and join ratios

There are no filter conditions on the
optional tables in the outer joins, so you do not need statistics
regarding those tables and joins. Also, recall that the apparent
filter conditions OT.Code_Type
= ORDER_STATUS and
ODT.Code_Type =
ORDER_DETAIL_STATUS are part of the joins to
their respective tables, not true filter conditions, since they are
part of the join keys to reach those tables. That leaves only the
filter conditions on customer name and order date as true filter
conditions. If you wish to practice the method for finding the join
and filter ratios for the query diagram, try writing the queries to
work out those numbers and the formulae for calculating the ratios
before looking further ahead.

The selectivity of the conditions on first and last names depends on
the lengths of the matching patterns. Work out an estimated
selectivity, assuming that :Last_Name and
:First_Name typically are bound to the first three
letters of each name and that the end users search on common
three-letter strings (as found in actual customer names)
proportionally more often than uncommon strings. Since this is an
Oracle example, use Oracles expression
SUBSTR(Char_Col, 1, 3) for an expression that
returns the first three characters of each of these character
columns.

Recall that for the apples-and-oranges table
Code_Translations, you should gather statistics
for specific types only, just as if each type was in its own,
separate physical table. In other words, if your query uses order
status codes in some join conditions, then query
Code_Translations not for the total number of rows
in that table, but rather for the number of order status rows. Only
table rows for the types mentioned in the query turn out to
significantly influence query costs. You may end up querying the same
table twice, but for different subsets of the table. Example 5-3 queries Code_Translations
once to count order status codes and again to count order detail
status codes.

The queries in Example 5-3 yield all the data you
need for the join and filter ratios.

Example 5-3. Queries for query-tuning statistics

Q1:  SELECT SUM(COUNT(*)*COUNT(*))/(SUM(COUNT(*))*SUM(COUNT(*))) A1
     FROM Customers 
     GROUP BY UPPER(SUBSTR(First_Name, 1, 3)), UPPER(SUBSTR(Last_Name, 1, 3));
A1:  0.0002
Q2:  SELECT COUNT(*) A2 FROM Customers;
A2:  5,000,000
Q3:  SELECT COUNT(*) A3 FROM Orders 
     WHERE Order_Date > SYSDATE - 366;
A3:  1,200,000
Q4:  SELECT COUNT(*) A4 FROM Orders;
A4:  4,000,000
Q5:  SELECT COUNT(*) A5 FROM Orders O, Customers C 
     WHERE O.Customer_ID = C.Customer_ID;
A5:  4,000,000
Q6:  SELECT COUNT(*) A6 FROM Order_Details;
A6:  12,000,000
Q7:  SELECT COUNT(*) A7 FROM Orders O, Order_Details OD 
     WHERE OD.Order_ID = O.Order_ID;
A7:  12,000,000
Q8:  SELECT COUNT(*) A8 FROM Code_Translations 
     WHERE Code_Type = ORDER_STATUS;
A8:  4
Q9:  SELECT COUNT(*) A9 FROM Orders O, Code_Translations OT
     WHERE O.Status_Code = OT.Code 
       AND Code_Type = ORDER_STATUS;
A9:  4,000,000
Q10: SELECT COUNT(*) A10 FROM Code_Translations 
     WHERE Code_Type = ORDER_DETAIL_STATUS;
A10: 3
Q11: SELECT COUNT(*) A11 FROM Order_Details OD, Code_Translations ODT
     WHERE OD.Status_Code = ODT.Code 
       AND Code_Type = ORDER_DETAIL_STATUS;
A11: 12,000,000

Beginning with filter ratios, get the weighted-average filter ratio
for the conditions on Customers
first and last name directly from A1. The result
of that query is 0.0002. Find the filter ratio on
Orders from
A3/A4, which comes out to 0.3.
Since no other alias has any filters, the filter ratios on the others
are 1.0, which you imply just by leaving filter ratios off the query
diagram for the other nodes.

Find the detail join ratios, to place alongside the upper end of each
inner join arrow, by dividing the count on the lower table (the
master table of that master-detail relationship) by the count on the
join of the two tables. The ratios for the upper ends of the joins
from OD to O and
ODT are 3 (A7/A4) and 4,000,000
(A11/A10), respectively.

To fit join ratios on the diagram, I will abbreviate millions as
m and thousands as k, so I
will diagram the last result as 4m.

The ratios for the upper ends of the joins from O
to C and OT are 0.8 (from
A5/A2) and 1,000,000 (or
1m, from
A9/A8), respectively.

Find the master join ratios for the lower ends of each inner-join
arrow, by dividing the count for the join by the upper table count.
When you have mandatory foreign keys and referential integrity
(foreign keys that do not fail to point to existing primary-key
values), you find master join ratios equal to exactly 1.0, as in
every case in this example, from
A7/A6,
A11/A6,
A5/A4, and
A9/A4.

Check for any unique filter conditions that you would annotate with
an asterisk (Step 6). In the case of this example, there are no such
conditions.

Then, place all of these numbers onto the query diagram, as shown

in
Figure 5-5.

Figure 5-5. The completed query diagram for the second example

5.2.7 Shortcuts

Although the full process of completing a detailed, complete query
diagram for a many-way join looks and is time-consuming, you can
employ many shortcuts that usually reduce the process to a few
minutes or even less:

You can usually guess from the names which columns are the primary
keys.

You can usually ignore ratios for
the outer joins and the optional tables they reach. Often, you can
leave outer joins off the diagram altogether, at least for a
preliminary solution that shows how much room for improvement the
query has.

The master join ratio (on the primary-key
end of a join) is almost always exactly 1.0, unless the foreign key
is optional. Unless you have reason to suspect that the foreign key
is frequently null or frequently points to no-longer-existing
primary-key values, just leave it off, implying a value of 1.0. The
master join ratio is never greater than 1.0.

Some databases implement
constraints that rigorously guarantee referential integrity. Even
without such constraints, well-built applications maintain
referential integrity, or at least near-integrity, in their tables.
If you guarantee, or at least assume, referential integrity, you can
assume that nonnull foreign keys always point to valid primary keys.
In this case, you do not need to run the (comparatively slow) query
for joined-table counts to find the join factors. Instead, you just
need the percentage of rows that have a null foreign key and the two
table counts. To compute those, begin by executing SQL similar to
this:

SELECT COUNT(*) D, COUNT(<Foreign_Key_Column>) F FROM <Detail_Table>;
SELECT COUNT(*) M FROM
<Master_Table>;

Letting D be the first count from the first query,
F be the second count from the same query, and
M be the count from the second query, the master
join ratio is just F/D, while
the detail join ratio is F/M.

Join ratios rarely matter; usually, you can just guess. The detail
join ratio matters most in the unusual case when it is much less than
1.0, and it matters somewhat whether the detail join factor is close
to 1.0 or much larger. You can usually guess closely enough to get in
the right ballpark.

Informed guesses for join factors are sometimes best! Even if your
current data shows surprising statistics for a specific master-detail
relationship, you might not want to depend on those statistics for
manual query tuning, since they might change with time or might not
apply to other database instances that will execute the same SQL.
(This last issue is especially true if you are tuning for a shared
application product.)

If you tune many queries from the same
application, reuse join-ratio data as you encounter the same joins
again and again.

Filter ratios matter the most but usually
vary by orders of magnitude, and, when filter ratios are not close,
order-of-magnitude guesses suffice. You almost never need more
precision than the first nonzero digit (one significant
figure, if you remember the term from high school science)
to tune a query well for any of the ratios, so round the numbers you
show. (For example, 0.043 rounds to 0.04 with one significant figure,
and 472 rounds to 500.)

Just as for join ratios, informed guesses for filter ratios are
sometimes best. An informed guess will likely yield a result that
works well across a range of customers and over a long time.
Overdependence on measured values might lead to optimizations that
are not robust between application instances or over time.

The key value of diagramming is its effect on your understanding of
the tuning problem. With practice, you can often identify the best
plan in your head, without physically drawing the query diagram at
all!