8.1 The Case for Nested Loops
Throughout this book, I have asserted that
nested-loops joins on the join keys offer more robust execution plans
than hash or sort-merge joins. Let's examine why.
Consider a two-table query diagram, as shown in Figure 8-1.
Figure 8-1. A prototype two-table query
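In SQL terms, Figure 8-1 might diagram a query of the following shape. This is only a hypothetical sketch: the names A and B, the key and filter columns, and the bind variables are placeholders, not names taken from any real schema:

  SELECT ...
  FROM A, B                           -- A is the detail (top) table, B the master
  WHERE A.Fkey_ID = B.Pkey_ID         -- the join, through A's foreign key
    AND A.Filter_Col = :a_condition   -- filter on A, with filter ratio Fa
    AND B.Filter_Col = :b_condition;  -- filter on B, with filter ratio Fb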
Sight-unseen, it is possible to say a surprising amount about this
query, to a high probability, if you already know that it is a
functionally useful business query that has come to you for tuning:

- The top table, at least, is large or expected to grow large. Because Jd, the detail join ratio, is greater than 1.0 (as is usual), the bottom table is smaller than the top table by some factor, but that factor could be moderate or large. Queries that read only small tables are common enough, but they rarely become a tuning concern; the native optimizer produces fast execution plans without help when every table is small.

- Large tables are likely to grow larger with time. They usually became large in the first place through steady growth, and that growth rarely stops or slows down.

- The query should return a moderate number of rows, a small fraction of the number in the detail table. Certainly, queries can return large rowsets, but such queries are usually not useful to a real application, because end users cannot effectively use that much data at once. Online queries for a real application should generally return fewer than 100 rows, while even batch queries should usually return no more than a few thousand.

- While tables grow with time, the inability of end users to digest too much data does not change. This often means that the query conditions must point to an ever-decreasing fraction of the rows in the larger table. Usually, this is achieved by some condition that points to recent data, since recent data tends to be more interesting from a business perspective. Although a table might contain an ever-lengthening history of a business's data, the set of recent data will grow much more slowly or not at all.

- The number of rows the query returns is Ca x Fa x Fb, where Ca is the rowcount of table A. (See the worked example after this list.)
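To make this formula concrete, here is a worked example with purely hypothetical numbers: a million-row detail table A, a 1% filter on A, and a 0.1% filter on B:

  rows returned = Ca x Fa x Fb
                = 1,000,000 x 0.01 x 0.001
                = 10

Ten rows is comfortably within what an end user can digest, consistent with the assertions above.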
These assertions lead to the conclusion that the product of the two
filter ratios (Fa x
Fb) must be quite small and
will likely grow smaller with time. Therefore, at least one of
Fa and Fb must also be
quite small. In practice, this is almost always achieved with one of
the filter ratios being much smaller than the other; the smaller
filter ratio is almost always the one that grows steadily smaller
with time. Generally, the smaller of these filter ratios justifies
indexed access in preference to a full table scan.

If the best (smallest) filter ratio is Fb and it
is low enough to justify indexed access to table
B, then nested loops from B
will generally point to the same fraction (Fb)
of rows in table A. (A given fraction of master
records will point to the same fraction of details.) This fraction
will also be low enough to justify indexed access (through the
foreign key, in this case) to table A, with lower
cost than a full table scan. Since
Fb < Fa under this
assumption, nested loops will minimize the number of rows touched and
therefore minimize the number of physical and logical I/Os to table
A, compared to an execution plan that drives
directly through the index for the filter on A.
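Reusing the earlier hypothetical numbers (Ca = 1,000,000, Fa = 0.01, Fb = 0.001), the gap between the two driving choices is easy to quantify:

  driving from the filter on B:  Fb x Ca = 0.001 x 1,000,000 =  1,000 rows touched in A
  driving from the filter on A:  Fa x Ca = 0.01 x 1,000,000  = 10,000 rows touched in A

Under these assumptions, nested loops from B touch one-tenth as many rows of A, and the advantage grows as Fb shrinks relative to Fa.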
Either a hash or a sort-merge join to table A
would require a more expensive full table scan or index range scan on
the less selective filter for table A. Since index
blocks for the join are better cached than the large table
A, they are inexpensive to read, by comparison to
table blocks. Therefore, when the best filter ratio is
Fb, nested loops minimize the cost of the read
to table A.

When the best filter is on the detail table (table
A, in this case), the same argument holds at the
limit where Jd = 1. When Jd > 1, larger values of Jd tend to favor
hash joins. However, unless Jd is quite large, this
factor does not usually overcome the strong advantage of
driving to every table from the most selective filter. When
Jd is large, this implies that the master table
B is much smaller than the detail table
A and will consequently be much better cached and
less expensive to reach, regardless of the join method. I already
discussed this case at length in Section 6.6.5 in Chapter 6, and I won't repeat it all
here. The key point is that, even in the cases in which hash joins
improve a given join cost the most, they usually reduce only a
comparatively minor component of the query cost: the cost of
reaching a much smaller, better-cached master table. To achieve even
this small benefit, the database must place the joined-to rowset in
memory or, worse, store the hashed set temporarily on disk.
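When the optimizer does not find this plan on its own, it can usually be forced. The following Oracle-style sketch uses real Oracle hint syntax, but the table, column, and bind names are hypothetical placeholders; it drives from B and joins to A with nested loops through the foreign-key index:

  SELECT /*+ ORDERED USE_NL(A) */ ...
  FROM B, A                           -- ORDERED begins the join order with B
  WHERE B.Filter_Col = :b_condition   -- the selective filter, ratio Fb
    AND A.Fkey_ID = B.Pkey_ID         -- USE_NL(A) requests nested loops into A
    AND A.Filter_Col = :a_condition;  -- the weaker filter merely discards rows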