SQL Tuning [Electronic resources] نسخه متنی

10.2 Queries that Return Data from Too Many Rows

There are four main contexts
in which queries cover too many rows, resulting in long-running
queries even when per-row costs are as low as they can be:

Online queries that return hundreds or more rows

These queries are invariably excessive from the point of view of end
users; such large result sets are inconvenient or impossible to use
online. The most likely response of an end user is simply to repeat
such a query with some further refinement to the conditions to try to
reduce the results to a manageable size.

Batch reports that return tens of thousands or more rows

These queries are invariably excessive from the point of view of end
users; such large result sets are inconvenient to read, even in a
report. Certainly, a large, well-ordered report does not have to be
read from cover to cover, but if the intent is just to provide a way
to look up selected facts, why not use a well-designed online
transaction for the purpose? A well-designed application on a
relational database is a better way to look up selected facts than
the output of a huge flat report.

Aggregations

Queries sometimes aggregate (i.e., summarize)
large rowsets into results that are small enough for end users to
digest. This can happen online or in batch.

Middleware

Batch processes sometimes function as
middlewaresoftware that moves data around
within a system or systems without sending that data to the end
users. Since end users aren't part of the picture
and computers have plenty of patience for large data volumes, these
batch processes sometimes legitimately need to handle data volumes
that are too large for human consumption.

Let's examine ways to fix each of these types of
excessive reads, in turn.

10.2.1 Large Online Queries

In the long run, large online queries
tend to take care of themselves. Every time end users deliberately or
accidentally trigger a query with insufficiently selective filter
criteria, they receive negative feedback in the form of a long wait
and a returned result that is inconveniently large for online
viewing. The returned result has too much chaff (unneeded data) mixed
in with the wheat (the data the end users really want) for convenient
use. Over time, end users learn which query criteria will likely
deliver punishingly large result sets, and they learn to avoid those
query criteria. End-user education can help this learning process
along, but the problem is somewhat self-repairing even without formal
effort.

Unfortunately, there are two key times in the short run when this
sort of automated behavior modification doesn't
prevent serious trouble:

During test usage

End users and developers often try out online forms in scenarios in
which any value will suffice. In real use, an end user knows the
name, or at least most of the name, before looking it up, and he
knows to type at least a partial name to avoid having to scroll
through a pick list of thousands of entries to find the right name.
However, in a test scenario, the tester might be perfectly happy
querying every name (such a query is called a

blind
query) and picking, for convenience,
"Aaron, Abigail" from the top of
the huge alphabetic list.

Blind queries such as this are much more common when end users are
just surfing a test system, to test a preproduction application, than
when they have production work to do. Blind queries like this can
even find their way into hurriedly compiled, automated benchmark load
tests, horribly distorting benchmark results. Unfortunately, it is
exactly during such testing that you will likely be searching for
performance problems, so these blind queries distort the problem
picture and deliver a poor first impression of the application
performance.

Early in end users' learning process

Novice end users do not yet know to avoid blind and insufficiently
selective (i.e., semiblind) queries even in
production use. When they execute such long-running queries, they
form a poor, often costly first impression of the application
performance. In early production use, and even later if end-user
turnover is high, the system-wide load from these mistakes can be
high, hurting everyone, even end users who avoid the mistakes.

Blind and semiblind queries in the most frequently used transactions
are worth preventing in the application layer. When possible, the
application should simply refuse to perform a potentially large query
that does not have at least one known-to-be-selective search criteria
specified at query time. This requires a little extra effort to
discover in advance which selective search criteria will be available
to end users in production use. These selective search criteria will
map to a list of table columns that must drive efficient queries.
However, you need to ask just these questions to make sure that you
have indexed paths from those columns to the rest of the data;
otherwise, you end up tuning queries, with otherwise harmful indexes,
that will never come up in real application use. When end users query
without reasonable search criteria, it is better to return immediate
error messages that suggest more selective search criteria than to
tie up their windows with long-running, useless queries.

Sometimes, you cannot guess in advance how selective a search will be
until it runs; the selectivity depends on the application data. For
example, the application might return far more matches to a last name
of "Smith" than
"Kmetec," but you would not want to
hardcode a list of name frequencies into the application. In these
cases, you need a way to see that a list is too long, without the
expense of reading the whole list. The solution has several steps:

Determine the maximum-length list you want to return without error.
For purposes of discussion, let's say the maximum
length is 500.

In the call to the database, request one more row than that maximum
length (501 rows for this discussion).

Arrange in advance that the query execution plan is robust, following
nested loops to return rows without having to prehash, sort, or
otherwise store whole large rowsets before returning the first 501
rows.

Request that the query result from the database be unsorted, so the
database does not need to read all the rows before returning the
first rows.

Sort the result as needed in the application layer if the result does
not trigger an error by hitting the error threshold (501 rows for
this discussion).

Cancel the query and return an error that suggests a more selective
search if the rowcount hits the maximum length (501 rows for this
discussion).

In my former performance-tuning position at TenFold Corporation, this
technique proved so useful that we made it the automatic behavior of
the EnterpriseTenFold application platform, with an adjustable
maximum rowcount.

In summary, prevent large online queries with a three-pronged
approach:

Train end users to specify narrow enough searches that they
don't get more data than is useful or efficient to
read.

Return error messages when end users attempt obviously unselective
queries, especially blind queries based on large root detail tables.

Run potentially large queries in such a way that you get the first
rows quickly, and return an error as soon as the number of returned
rows is excessive.

10.2.2 Large Batch Reports

Slow online events are punishing enough
that they don't go uncorrected. The affected end
users either modify their behavior or complain loudly enough that the
problem is corrected. Batch load can be more subtly dangerous in the
end, since batch performance problems sometimes go unnoticed,
creating enormous load and preventing adequate system throughput
without being an obvious problem. When overall load is too high, all
parts of an application slow down, but the villains, the batch
processes that consume too much of the system resources, might be
performing well enough to go unnoticed, especially when they are
low-priority processes that no one is awaiting. Automatically
prescheduled, periodic batch processes are especially dangerous: they
might run much more frequently than needed, without anyone noticing.
They might cease to be needed as often as they once were, or cease to
be needed at all, but go right on tying up your system unnoticed.

Conceptually, there are several questions regarding a large batch
report that are relevant to choosing a performance solution:

What is the reason for the report?

How is the report triggered?

Why is performance of the report a concern?

What sort of information does the reader extract from the report?

The answers to these questions all affect the best answer to the
final question: how do you fix the report performance?

10.2.2.1 Reasons for large reports

Beginning with the assumption that no one person will ever read a
huge application report from cover to cover, why should applications
ever need huge report queries? Common reasons for huge report
queries, and performance strategies for each reason, include:

A report has many readers, each of whom is interested in a different
subset of the data. No one reads the report from cover to cover, but any given part of it
might be interesting to at least one reader. The needs filled by such
all-inclusive reports are often better served by multiple smaller
reports. These can run in parallel, when needed, making it easier for
the system to reach all the necessary data quickly enough. They also
can each run just as often as their readers need, rather than read
everyone's data as often as the most demanding
readers require.

All details of a report are potentially interesting at the time the
report is requested, but end users will read only a small part of the
report, based on which questions happen to arise that the end users
must answer. The need to answer such ad hoc questions as they arise is far better
met by online application queries to the database. A flat report
structure in a huge report can never offer a path to data as
convenient as a well-built application. When you use reports for ad
hoc data access in place of online applications, you are bypassing
all the advantages of a relational database. Since the ad hoc online
queries that replace the entire huge report will touch only a small
subset of the data the report must read, the online solution requires
far less logical and physical I/O.

Only a subset of the query data is ever used.
Here, the solution is obvious and overwhelmingly beneficial:
eliminate from the query and the report those rows that are never
used. Where the report lists fewer rows than the query returns, add
filters to match just the rows the report needs. If you trim the
report itself, add filters to the queries that serve the trimmed
report and tune the queries so they never touch the unused data. A
special case in which only a subset of the data is required occurs
when the end user needs only the summary information (the
aggregations) in a report that includes both details and
aggregations. In this case, eliminate the details from both the
report and the database queries, and see Section 10.2.3 to consider further
solutions.

A report is required only for legal reasons, not because anyone will
ever read it.
Such a justification for a report invites several questions. Is it
still legally mandated, or did that requirement vanish and
you're just producing the report out of habit? Is it
really required as often as you are producing it? Rare huge reports
are not likely to be a performance or throughput problem, so the
essential point is to run them rarely if you cannot get rid of them.
Does the law really require the data in the form of that report, or
is it enough just to have access to the data in some form? Often,
requirements for data retention do not specify the form of the
retained data. The database itself, or its backups, might satisfy the
requirements without an application report. If the report is required
only for legal reasons, it can likely run during off hours, when load
is less of an issue.

I once heard a thirdhand tale of a report produced for legal reasons
that had been broken from the beginning; it filled many pages with a
system dump every time it ran. Since no one actually used the report,
no one noticed that it didn't work. However, each
paper printout of the system dump was dutifully filed away, and the
filer checked off a box in a form stating that the legal requirement
to run and file the report was met. When the system was reengineered,
the report was finally recognized as broken and was eliminated. The
reengineering team concluded the report had never really been
necessary after all. However, even when told that the report
didn't work and never had, the person in charge of
checking off the ran-and-filed box in the old form wanted the report
put back in its original, broken form.

10.2.2.2 Ways reports are triggered

There are two basic ways reports get triggered:

Ad hoc requests

When a report is specifically, manually requested, chances are high
that at least the requestor knows a genuine need for at least part of
the report. It is also likely that the requestor cares how long she
will have to wait for the report output, and a long report runtime
might cost the business. When the requestor needs the report soon, it
should get high priority, potentially even running parallel processes
that consume much of the system's resources to meet
the deadline. Otherwise, it is helpful to ensure that the queries
avoid running in parallel, thus avoiding harm to higher-priority
process performance. Furthermore, low-priority reports should
automatically be relegated to run during low-load periods, which is
often enough to eliminate any performance impact from these reports.

Automatic requests

Much of the batch load on most business systems is automatic, in the
form of reports that were long ago scheduled to run automatically
every day or week, or on some other periodic basis. These periodic
batch processes are a particularly insidious load source, because
they are so easy to forget and to ignore. Most addressees of reports
receive far more material than they could ever read, even if you
count only material produced by humans, setting aside monster
automated reports with vast collections of uninteresting numbers.
Somehow, most business people also feel vaguely embarrassed by their
inability to read and digest the vast amount of information that
comes their way. Therefore, rather than complain that they never read
some monster of a report that they receive daily or weekly and argue
that it is a waste of resources, they will keep meekly quiet about
their embarrassing inability to accomplish the impossible. (I know
I've done this;
haven't you?) If you want to reduce the frequency of
huge reports or eliminate them altogether, don't
wait for the addressees to complain that the reports
aren't needed!

One strategy to eliminate reports is to ask a leading question:
"We suspect this report is no longer useful; what do
you think?" Another strategy is to state,
"We're going to eliminate this
report, which we suspect is no longer useful, unless one of you lets
us know you still need it. If you still need it, do you still need it
on the same frequency, or can we produce it less
often?" My favorite strategy is to eliminate
everyone except yourself from the addressee list and just see if
anyone complains. If they complain, send them your copy and keep the
report, adding them back to the addressee list. If no one complains,
drop the scheduled report after a safe interval. (Of course, I
wouldn't suggest doing this without authority.) For
pull reportsreports that the reader must
navigate to, rather than receiving them by email or on paper (these
are called push reports)you can get the
same result by making the file inaccessible and waiting for
complaints.

10.2.2.3 Reasons batch performance is a concern

If a long-running batch job does not create an overloaded system or
otherwise cost the business money, don't worry about
it. Otherwise, there are several reasons to address performance of
the process, with different solutions for each:

End users await results

When a process requires an end user to request a report and the next
important task the end user can do requires the output of that
report, then report runtime is the main concern. Parallelizing the
report process is one solution, allowing several processors to attack
the problem at once. You can accomplish this in the database layer
with parallel execution plans (which I have never found necessary in
my own experience) or in the application layer, spawning multiple
application processes that attack the problem at once and consolidate
results in the end. However, much more often the solution lies in
eliminating all or part of the report with the strategies described
earlier under "Reasons for large
reports." This is especially true because a
bottleneck process like this rarely requires the reader to digest
enormous amounts of data; chances are that only a few lines of the
report are really required by the end user.

Report fails to fit a processing time-window

Many recurring processes take place in time-windows that are
preassigned for the work, often in the middle of the night or over
weekends. The rest of this chapter has strategies to reduce runtimes,
making it easier to meet any given time-window. However, often the
simplest strategy is to relax the time-window constraints. When you
step back and examine the fundamental needs of the business,
time-window constraints usually turn out to be more flexible than
they appear. Usually, the window of a process exists within a larger
window for a whole collection of processes. That larger window might
be smaller than it needs to be. Even if it is not, you can almost
always rearrange the processes within the collection to allow enough
time for any given process, if the other processes are reasonably
tuned.

Report overloads the system

The runtime of the report might not directly be a problem at all; no
one needs the results right away. This is usually the easiest problem
to solve: just make sure the report ties up only moderate amounts of
scarce resources during its run. Avoid running parallel processes to
hurry a low-priority report at the expense of higher-priority work.
Almost always, you can find a way to push such low-priority processes
to off hours, when system resources are not scarce and when the
effect on the performance of other processes is negligible.

10.2.2.4 Report information types

Behind every report is (or should be) one or more business needs that
the report addresses. If a report never influences the behavior of
its addressees, it is a waste of resources. In a perfect world, a
report would simply say, "Do the following:
...," and would be so well designed that you could
follow that advice with perfect confidence. Since the recommended
action would never take long to describe, long reports would be
nonexistent. In the real world, the report helps some human to reach
the same conclusion by following human reasoning about the
information the report provides. However, it is still the case that,
when the report is much longer than the description of the decision
it supports, it is probably not distilling the data as well as it
should. By discovering how to distill the data to a reasonable volume
in the report queries, you not only make the queries inherently
faster, you also produce a more usable report, helping to find the
forest for the trees, so to speak.

Consider different ways a manager might use a long report, on the
assumption that thousands of lines of numbers cannot each,
individually, be relevant to business decision-making:

Only the totals,
averages, and other aggregations are interesting, and the detail
lines are useless.
When this is the case, eliminate the detail lines and report only the
aggregations. Refer to the strategies under the following section,
"Aggregations of Many Details," to
further tune the resulting queries.

The aggregations
are what really matter, at least as approximations, but
they're not even in the report.
Instead, the poor manager must scan the details to do her own
"eyeball" averages or sums, a
miserable waste of human effort and an unreliable way to do
arithmetic. If the addressees find themselves doing such eyeball
arithmetic, it is a sure sign that the report is poorly designed and
that it should report the aggregations directly.

Exceptions are what matter.
The vast majority of report rows are useless, but the manager scans
the report for special cases that call for action. In the
manager's head are criteria for the conditions that
call for action, or at least for closer consideration. The answer in
this case is clear: define the exception criteria and report only the
exceptions. When the exception criteria are fuzzy, at least figure
out what defines a clear nonexception and filter those out. The
result is almost certain to run much faster and to be a much more
usable report.

The top (or bottom) n (for some nice, round
n) are what really matter.
This is really a special case of exceptions, but it is somewhat
harder to define the exception criteria without first examining all
the details from a sort. The key to handling this case is to realize
that there is nothing magic about a nice, round
n. It is probably just as good to produce a
sorted list of records that meet a preset exception criteria. For
example, you might choose to reward your top 10 salespersons.
However, would you really want to reward the 10th-best if the
10th-best sold less than last year's average
salesperson? On the flip side, would you want to bypass rewarding the
11th-best, who missed being 10th-best by $15? The point is that,
compared to a top-n list, it is probably at
least as useful to report a sorted list of exceptionsfor
example, salespersons who exceeded half a million dollars in sales in
the quarter. By defining good exception criteria, which can change as
the business evolves, you save the database the work of finding every
row to perform a complete sort when almost all of the data is
unexceptional and has no real chance to reach the top of the sort.
You also provide added information, such as how close the 11th-best
was to the 10th-best and how many salespersons exceeded the
threshold, compared to the last time you read the report.

A subset of the data is what matters.
Discard the rest of the set.

10.2.2.5 Solutions

For any particular large report, at least one of the earlier sets of
questions should lead to a solution to your performance problem.
Sometimes, a combination of questions will lead to a multipart
solution or will just reinforce the set of reasons for a single
solution. In summary, these are the techniques that resolve
performance problems:

Eliminate unused subsets of the reported data. Make sure these unused
subsets are not only not reported, but are not even queried from the
database. Special cases of this include:

Eliminate details, in favor of aggregations only, and see Section 10.2.3, for how to
report aggregations of many details quickly.

Report exceptions only, eliminating nonexceptional rows.

Replace top-n reporting with sorted lists of
exceptions.

Replace large reports with several smaller reports that each cover
just the needs of a subset of the addressees.

Eliminate large reports in favor of online functionality that helps
the end users find the information they need as they decide they need
it, instead of having them search a report for that information.

Eliminate reporting driven by data-retention requirements in favor of
just maintaining access to the same data in the database or its
backups.

Parallelize processing of high-priority reports that are needed
quickly.

Serialize processing of low-priority reports, and push processing to
off hours and lower frequency. Often, the correct frequency for
low-priority reports is never.

Rearrange load during processing time-windows to relax
time-window-driven constraints.

Admittedly, use of these techniques is more art than science, but
take heart: I have never found a case in which no reasonable solution

existed.

10.2.3 Aggregations of Many Details

It never makes sense
to show an end user a million rows of data, either online or in a
report. However, it makes perfect sense that an end user would want
to know something about an aggregation of a million or more rows,
such as "What was last quarter's
revenue?," where that revenue summarizes a million
or more order details. Unfortunately, it is no easier for the
database to read a million rows for purposes of aggregation than for
purposes of detail reporting. As a result, these large aggregations
are often the ultimate in thorny performance problems: perfectly
reasonable queries, from a functional perspective, that are
inherently expensive to run, even with perfect tuning. These problems
often call for a two-pronged attack:

Examine the query as if it were a query of many details, using the
methods of the earlier sections, and apply techniques for large
detail queries when possible.
In particular, consider eliminating details that do not contribute to
the aggregations. For example, when summarizing order details for
revenue, you might find many order details, such as advertising
inserts included for free, that have no cost to the customer. The
aggregations are unaffected by these, so the developer might not have
bothered to exclude them from the query. But if you exclude them
explicitly, the query handles fewer rows without changing the result.

When necessary, preaggregate data in the database and report the
summaries without touching the details.
This is the most commonly justified form of redundant data stored in
a database. It might be possible, for example, to deduce an account
balance from the sum of all transactions to that account in its
history. However, account balances are needed so commonly that it
makes much more sense to store the running balance directly and make
very sure that it is rigorously synchronized with the sum of all past
transactions. Much of the complexity, error, and redundancy in
applications originates from just such synchronization demands.
Today, much of this need can be met, with careful design, through
database triggers that automatically increment or decrement running
sums and counts every time a detail record is inserted, updated, or
deleted. With a trigger-based approach, the application frontend need
not handle the synchronization needs at all, and many different
sources of detail change can all propagate to a summary through a
single trigger in the background. This guarantees, far better than
the application code can, clean, rigorous synchronization of the
redundant data.

10.2.4 Middleware Processes Handling Too Many Rows

When the
destination for collected data is a system (either the same system,
when creating redundant data, or another system, when propagating
data) rather than a human, larger data volumes often make sense.
Nevertheless, the techniques to reduce middleware data volumes look
much like the techniques for reducing data volume in reports,
substituting a machine addressee for a human one. These are the
techniques most reminiscent of report-load-reduction techniques:

Eliminate unused subsets of the transferred data. Make sure these
unused subsets are not only not transferred, but are not even queried
from the database. Special cases of this include:

Eliminate details, in favor of aggregations only, and see the
previous section, Section 10.2.3, for
how to collect aggregations of many details quickly.

Transfer exceptions only, eliminating nonexceptional rows.

Parallelize processing of high-priority system interfaces that must
move data quickly.

Serialize processing of low-priority system interfaces, and push
processing to off hours and lower frequency. Often, the correct
frequency for low-priority system interfaces is
never.

Rearrange load during processing time-windows to relax
time-window-driven constraints.

In addition to these familiar-looking techniques, there are several
techniques specific to middleware:

Transfer only changed data, not data that hasn't
changed since the last run.
This is, by far, the most powerful technique to reduce middleware
data volumes, since changes to data involve much lower volumes than
fresh collections of all data. However, middleware commonly takes the
slower path to all data, because it is harder, functionally, to keep
data in sync through propagating changes to data rather than through
complete data refreshes. Just as for preaggregation, which is after
all just a special case of data transfer, the safest strategies for
propagating only data changes depend on well-designed database
triggers. With triggers, any data change, from any cause, will
automatically fire the trigger and record the necessary changes
whenever the source data is changed.

Eliminate the interface.
When applications share a database instance, you can often eliminate
interfaces, allowing applications to read each
other's data instead of copying data from one
application to another.

Move the dividing line between applications.
For example, you might have one application that is responsible for
Order Entry, and another for Order Fulfillment and Accounts
Receivable, combined. The interface between Order Entry and Order
Fulfillment would likely involve almost complete duplication of data.
If you rearranged the systems to combine Order Entry and Order
Fulfillment, you would find a much thinner interface, moving less
data, to Accounts Receivable.

Make the interface faster.
If you must move high data volumes between applications, at least
arrange that those high volumes can move fast. The fastest interface
simply moves data between tables within a database instance. The
next-fastest interface moves data between instances on the same
hardware. The next-fastest moves data across a local area network
between instances on different machines. The slowest interface
alternative transfers data across low-bandwidth, wide-area network
links, potentially over intercontinental distances.

SQL Tuning [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی