10.1 When Very Fast Is Not Fast Enough
Online
steps in a business process that
run in less than a second are unlikely to significantly slow the end
user who is performing the process. Even steps that take well over a
second can often be made convenient to the end user if those parts of
the process are taken offline and made into batch processes. The only
business-driven need for queries to run very fastunder a half
a second, for examplecomes about when a single step in a
business process requires queries to run repeatedly. Some application
designs repeat a query hundreds of times for a single online event,
or up to millions of times for a single batch process. In these
cases, clearly, a query would have to run in a few milliseconds or
less to meet an end user's needs, and most queries
are not that fast, even after perfect tuning. When you run into this
problem, the real issue is the need for so many queries, not the
runtime of each query. There are three basic solutions to the need
for repeated queries:Cache everything the repeated queries might need, once, in
application memory, and subsequently get data as needed from the
application cache.Consolidate the repeated queries into a new, single query.Merge the repeated queries into joins that are added to a single
query the application already runs.
Before I discuss these three solutions, consider a typical repeated
query, which retrieves the pricing history for a given product:
SELECT Product_Price FROM Price_List
WHERE Product_ID = :1
AND Effective_Date <= :today
ORDER BY Effective_Date DESC;
To find the best solution to such a repeated query, you need to
consider several questions:How many times must the application run the query sequentially to
perform a task?What is the source of the values of :1 and
:today that drive the query?How many rows would you need to preread and cache for the cache to
supply all the data the application might need? How much future use
could you get from that cache?What is the rowcount range and average for the query? How many of
those rows does the application actually use?
The next few sections describe three possible solutions and show how
the answers to these questions can lead to one of those solutions.
10.1.1 Caching to Avoid Repeated Queries
If the set of possible
Product_IDs is small compared to the number of
times the example query repeats, you have a good case to cache the
whole set. For example, we might have 10
Product_IDs while repeating the query 1,000 times,
so it makes more sense to precache results across the whole set and
read from the cache in place of repeated queries to the database. For
example, you could issue this query:
SELECT Product_Price FROM Price_List
WHERE Effective_Date <= :today
ORDER BY Product_ID ASC, Effective_Date DESC;
The cache could then hold these resultsfor example, in a
structure in application memory using hash buckets or a binary tree
that gives rapid (microseconds) access to the results for a
particular Product_ID. Since individual reads from
the first approach outnumber the Product_IDs, this
approach offers two advantages in terms of the cost of database
access:The database reads fewer rows, with fewer logical I/Os. In the
original approach, the database read each row more than once, on
average, since it hit the average Product_ID more
than once. These reads were likely less efficient in terms of logical
I/Os per row as well, because they likely drove through an index read
that required more than one logical I/O per row read. The precaching
approach, in contrast, likely drives from a full table scan that
reads several rows per logical I/O.The database gets the data in a single query, likely with a single
round trip over the network. Even with reuse of a preparsed query,
the repeated-query approach requires at least one round trip to the
database per repeat of the query, with considerable overhead.
In the current example, the repeated query likely returns multiple
rows for each Product_ID, but it is quite likely
that only the row with the highest Effective_Date
returned (the first one in the sort order) is relevant to the
application. An ideal caching algorithm would take this into account
and simply discard the other rows, saving memory and other costs to
fill the cache.Therefore, even if such a cache is used one set of times, to perform
a single business task for a single end user, it can cost less than
the repeated queries it makes unnecessary. If the cache continues to
be useful for future tasks by the same end user, the benefit
increases, even justifying a cache that is initially more expensive
to populate than the repeated queries for a single task. If the cache
resides in shared memory that is accessible to multiple end users,
assuming its contents are useful to the whole community of end users,
the benefits multiply still further.
10.1.2 Consolidated Queries
A blind query to read every row that a repeated query might
potentially touch can be too expensive, and the result set might be
too large to cache. However, that does not prevent the application
from consolidating the multiple queries into a single query with one
of two forms. In the current example, each pass through the loop
would find a Product_ID to drive the initial
repeated query. You could run through the loops without querying and
simply compile a list of IDs to use in a single consolidated query.
For the example, these forms would look like this:
SELECT Product_Price FROM Price_List
WHERE Product_ID in (<long list of IDs>)
AND Effective_Date <= :today
ORDER BY Product_ID ASC, Effective_Date DESC;
or this:
SELECT Product_Price FROM Price_List
WHERE Product_ID in (<subquery that returns the list of IDs>)
AND Effective_Date <= :today
ORDER BY Product_ID ASC, Effective_Date DESC;
This still has the advantages of eliminating the overhead of handling
multiple queries, and it handles a case in which caching the set of
all possible query results would not work, but it produces a result
that yields far less row reuse. In all, this query-consolidation
approach resolves a broad class of repeated-query performance
problems.
10.1.3 Merging Repeated Queries into a Preexisting Query
Usually, the source of the variables in repeated queries is an
earlier query. In the current example, :today
surely comes from the clock/calendar, but :1
almost certainly comes from an earlier query that returned a long
list of Product_IDs. The most likely reason behind
the repeated query is to fill in a current (with the most recent
Effective_Date) product price for each of the rows
the earlier query returned, based on the first row that the query
returns each time it runs. Since the goal of the query is to find a
matching price row for each of the earlier query's
rows, this is functionally like a join, though not nearly so
efficient.The technique in these cases is to convert these to actual joins.
When these repeated queries are just simple single-row queries on a
primary key, the conversion to a join is obvious. In cases like the
current example, the solution is subtler. The query returned a price
history sorted by Effective_Date to place the
currently effective price at the top. This currently effective price
record is the only record you actually want. To consolidate this
repeated query with the query that provides the list of
Product_IDs, you have to find a way to join with
the first row a sorted query returns without reading the other rows.There are several approaches that resolve this requirement, and I
will describe two. Each approach begins with a new column,
Current_Price_Flag, which equals
'Y' if and only if the price is the price
currently in effect:If rows are never created with future
Effective_Dates, create a trigger that sets the
newly obsolete price's
Current_Price_Flag to 'N' and
that creates new prices with Current_Price_Flag
set to 'Y'.If not-yet-effective price records are sometimes created, run a
nightly batch process that looks for future effective price records
that just became current at midnight, and have that process update
those records to have Current_Price_Flag equal to
'Y', while updating
Current_Price_Flag of newly obsolete prices to
'N'. This nightly batch process will probably
supplement trigger-driven updates from new price records that are
current at creation time.
Given such a properly maintained
Current_Price_Flag column, the join is simple. If
the initial query looked like this:
SELECT ... OD.Product_ID, ...
FROM ... Order_Details OD, ...
WHERE ...
the new query that incorporates the join would be:
SELECT ... OD.Product_ID, ..., PL.Product_Price
FROM ... Order_Details OD, ..., Price_List PL
WHERE ...
AND OD.Product_ID=PL.Product_ID
AND PL.Current_Price_Flag='Y'
Not all procedural code is replaced by a join as easily as in this
last example. However, SQL incorporates powerful logic expressions,
such as CASE, that enable virtually any procedural
data-lookup function to be converted to some series of joins combined
with complex (often nested) SQL expressions.
• Table of Contents• Index• Reviews• Examples• Reader Reviews• Errata• AcademicSQL TuningBy