5.1 Query Processing Basics
How
MySQL goes from receiving a query to sending the results back to a
client is relatively straightforward. The work happens in several
distinct stages. Let's walk through them.
5.1.1 Query Cache
You
can enable the query cache (available as of MySQL 4.0.1) by setting
query_cache_type to an appropriate
value in my.cnf:
query_cache_type = 1
MySQL attempts to locate the results of any SELECT
query in the query cache before bothering to analyze or execute it.
It does this by hashing the query and using the hashed value to check
for the results in the cache. MySQL uses the exact query text it
receives, so the cache is sensitive to the most trivial variations.As far as the cache is concerned, the query:
SELECT * FROM table1
is different from:
select * FROM table1
The same goes for variations in
whitespace. MySQL
doesn't trim extra space from the beginning or end
of queries. This is rarely a problem because most repetitive queries
are generated by applications rather than humans sitting at a
keyboard.To save some effort, MySQL cheats a bit. It
only bothers to hash SELECT queries, since
they're the only ones it makes any sense to cache.
Unfortunately, older 4.0 versions of
MySQL
don't consider every SELECT
query. The logic it uses simply checks the first three characters of
your query, looking for SEL in a case-insensitive
way.As a result of this three-character "tunnel
vision," any time you introduce whitespace or
anything else at the beginning of the query, MySQL
won't bother with the query cache. This can be a
real problem in some applications. We know of a feed-processing
system in which the developers uses comments to embed extra
information at the beginning of each query:
/* <b>GetLatestStuff</b> */ SELECT * FROM sometable WHERE ...
The comment made is easier to identify the queries in an
administrative tool that grabs the output of SHOW
PROCESSLIST for display on a web page. Unfortunately,
there's no way to tell MySQL to
"try harder" when deciding whether
a query is a SELECT, so these queries are never
cached. Luckily, this problem is cured with a simple upgrade to MySQL
5.0.It is possible to tell MySQL that it
should not cache a given query, however. The way
to dodge the query cache is to add the
SQL_NO_CACHE hint to your query.
SELECT SQL_NO_CACHE * FROM mytable
This is helpful in controlling cache pollution. If your application
has a set of queries that will never benefit from the query cache
(perhaps because they run only once a day), there's
no sense in caching them. Telling MySQL not to cache such queries
leaves more room for storing the results of repetitive queries.When the query cache is running in ondemand mode (set
query_cache_type to 2 in
my.cnf), MySQL does the work of trying to find a
query in the cache only when it sees a
SQL_CACHE hint in the query:
SELECT SQL_CACHE * FROM mytable
If the query's hashed value is found in the cache,
MySQL sends the results from the cache to the client, bypassing any
additional effort, just as expected.The format of the results in the query cache is identical to the
format used when sending them to a client. So there is very little
overhead in retrieving results from the cache and sending them to a
client. MySQL simply sends the data over the network.
We'll look at query cache performance in Chapter 6.
5.1.2 Parsing, Analysis, and Optimization
Before
MySQL can do anything interesting (or useful) with a noncached query,
it must parse the query into its component parts. As part of that
process, it verifies that the query is syntactically valid and
gathers some basic information about the query:What
type of query is this? Is it a SELECT,
INSERT, UPDATE, or
DELETE, or some other administrative command such
as SET or GRANT?Which tables are involved? Are there any aliases used?What is the WHERE clause?Are there other hints or modifiers involved?
Once a query is broken down into more basic pieces, MySQL begins the
more challenging work of figuring out what to do with it. This is
where the query
optimizer kicks in. The query
optimizer's goal, simply put, is to find the most
efficient way to execute a query given all the available information.
Most of the time, this means the optimizer works to limit the number
of records it must examine. It does this because the time associated
with disk I/O is often (but not always) the governing factor that
determines how long a query will take. Intuitively, this makes
complete sense. It is an extension of the very same logic that
explains why indexes are so helpful.How
the optimizer goes about making decisions is often regarded by people
unfamiliar with MySQL internals as something like voodoo. Of course,
it's not voodoo at all. MySQL has a set of rules and
heuristics that have been evolving since its early days. These rules
guide its decision-making process. But like any computer program that
must deal with the infinite ways humans can assemble data and ask
questions about it, the optimizer's not perfect. The
rules and heuristics it uses work very well much of the time, but, on
occasion, they do not.The MySQL developers are constantly improving the
optimizerattempting to make it smarter and faster with each
new release. Based on feedback from real-world users, they are always
looking for ways to refine MySQL's ability to make
the right decision. If you find a query that causes MySQL to make bad
decisions, be sure to report it. Unreported problems are rarely
fixed.To make good decisions, MySQL tries to answer several important
questions.Are there any indexes that are candidates for finding the rows
quickly?Which index is best? If multiple tables are involved, which index is
best for each table?Which tables depend on which other tables in the join?What's the optimal join order for the tables?
Of course, MySQL needs to make a decision very quickly and without
actually testing all the options. Otherwise it might spend more time
deciding how to execute the query than actually executing it!The bulk of MySQL's effort centers around indexes
and table join order. These aren't the only factors,
but they're certainly the important ones. To get a
better understanding of what MySQL is thinking about a
SELECT query, it's best to look
at the EXPLAIN output for the
query.
5.1.3 Using EXPLAIN
So, what sort of knowledge can MySQL
gather without expending a lot of effort and time?
Let's look at a some queries against a news headline
tablethe sort of thing you might use to build a customizable
news web site. The structure of the table is listed next. Rather than
guessing what MySQL will probably do, we'll use its
under-appreciated
EXPLAIN command to help
figure that out. In doing so, we'll see how adding
an index or simply rephrasing a query can often better use an
existing index and greatly improve performance.
mysql> describe Headline;
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| Id | int(10) unsigned | | PRI | NULL | auto_increment |
| Headline | varchar(255) | | | | |
| Url | varchar(255) | | UNI | | |
| Time | int(10) unsigned | | MUL | 0 | |
| ExpireTime | int(10) unsigned | | | 0 | |
| Date | varchar(6) | | | | |
| Summary | text | YES | | NULL | |
| ModTime | timestamp | YES | | NULL | |
+------------+------------------+------+-----+---------+----------------+
8 rows in set (0.00 sec)
As you can tell, the Headline table contains
information about news stories: title, summary, date, and so on.
Headlines can be associated with multiple topics, which are defined
in the Topic table. The T2H
table maps topics to headlines and vice versa. The relationship is
many-to-many because a single headline may be associated with
multiple topics.When you write a query against a primary key or unique index, MySQL
should know that there can be only a single match for each value.
Indeed, this query is very fast:
mysql> SELECT Headline, Url FROM Headline WHERE Id = 13950120 \G
*************************** 1. row ***************************
Headline: Midwest Cash Grain PM - Soy off, USDA data awaited
Url: http://biz.yahoo.com/rm/030328/markets_grain_cash_2l
1 row in set (0.00 sec)
Just as it's obvious to you or me, MySQL knows that
only one record can possibly match. Its strategy for finding the row
is straightforward: simply check the primary index for a match. If it
exists, fetch the row. To verify that, let's
EXPLAIN it:
mysql> EXPLAIN SELECT Headline, Url FROM Headline WHERE id = 13950120 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra:
1 row in set (0.00 sec)
Just as expected, MySQL knows there's only one
matching row. The rows value tells you so. What
MySQL says here isn't always to be completely
trusted, however, as you'll see in a little bit.Of course, EXPLAIN is relating much more than how many
rows to expect. Let's quickly review the information
it provides:id
The ID of this table in the query. EXPLAIN
produces one output record for each table in the query.
select_type
What is this table's role in the larger query?
Possible values are SIMPLE,
PRIMARY, UNION,
DEPENDENT UNION, SUBSELECT, and
DERIVED. As we look at the more complicated
queries, the meaning will become clearer.
table
The name of the table MySQL will read records from.
type
What type of join will MySQL use? In this example, you see
const because there was a constant value in the
query. Other possible values are system,
eq_ref, ref,
range, index, or
ALL. We'll revisit this in more
detail in the "Joins" section,
later in this chapter.
possible_keys
A list of the indexes (or NULL if none) MySQL can use to find rows in
the table.
key
The name of the index MySQL decides to use, after checking all
possible indexes (listed in possible_keys) and
choosing the best.
key_len
The size of the key value (in bytes).
ref
The columns or values that are used to match against the key.
rows
The number of rows MySQL thinks it needs to examine to satisfy the
query. If you frequently add and remove records from the table,
running ANALYZE TABLE lets MySQL update the index
statistics so it can make better estimates.
Extra
Any extra information MySQL wishes to convey about the execution of
this query. We'll see some examples of that shortly.
The simple case is just thatsimple. Let's ask
for a range of values instead.
mysql> SELECT Url FROM Headline WHERE id BETWEEN 13950120 AND 13950125;
+----------------------------------------------------------+
| Url |
+----------------------------------------------------------+
| http://biz.yahoo.com/rm/030328/markets_grain_cash_2l |
| http://biz.yahoo.com/prnews/030328/cgf038_1l |
| http://biz.yahoo.com/bw/030328/285487_1l |
| http://biz.yahoo.com/rc/030328/turkey_hijack_5l |
| http://biz.yahoo.com/rm/030328/food_aid_iraq_1l |
+----------------------------------------------------------+
5 rows in set (0.00 sec)
mysql> EXPLAIN SELECT Url FROM Headline WHERE id BETWEEN 13950120 AND 13950125 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 3
Extra: Using where
1 row in set (0.00 sec)
In this case, the type has switched from
const to range to indicate that
a search for more than one value. Similarly, ref
is now NULL.Every thing seems reasonable unless you
look closely. When executed, the query returns five rows, but the
rows says three. That's because
the rows value is merely an estimate. It probably
should have been called estimated_rows.The estimate is based on the index MySQL is using. Based on the
distribution of records across the possible key values, it simply
approximates that there are three valid records between 13950120 and
13950125.Also notice that the Extra column says
Using where. That's
MySQL's reassuring way of telling you that
it's using the limitations specified in the
WHERE clause to select records. It
wasn't present in the first example because MySQL
treats a single-row lookup using the primary key as special case.What if we try fetching records based on a nonindexed column:
mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600;
+----------+
| COUNT(*) |
+----------+
| 3971 |
+----------+
1 row in set (1.04 sec)
mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 302116
Extra: Using where
1 row in set (0.00 sec)
The NULL value in the key
column of the EXPLAIN output tell us that MySQL
won't be using an index for this query. In fact, the
NULL value in the possible_keys
column tells us that there were no indexes to pick from at all. If
this type of query is likely to be common, we can simply add an index
and rerun the query (or the EXPLAIN) to verify
that MySQL uses it.
mysql> ALTER TABLE Headline ADD INDEX (ExpireTime);
Query OK, 302116 rows affected (40.02 sec)
Records: 302116 Duplicates: 0 Warnings: 0
mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600;
+----------+
| COUNT(*) |
+----------+
| 3971 |
+----------+
1 row in set (0.01 sec)
mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: ExpireTime
key: ExpireTime
key_len: 4
ref: NULL
rows: 12009
Extra: Using where; Using index
1 row in set (0.00 sec)
The query now runs in 0.01 seconds instead of 1.04. The
EXPLAIN output looks much better, with the new
ExpireTime index being used for a
range search. Note again the discrepancy between
rows (12009) and the actual row count (3971). In a
case like this, it might be possible to improve the estimate that
MySQL makes by running either ANALYZE
TABLE or OPTIMIZE
TABLE on the Headline table.Also, notice that MySQL said Using index. That
means this is an index-only
query. MySQL is able to get all the data it
needs from the ExpireTime index, so it
doesn't bother fetching any of the rows from disk.But what if you need to fetch multiple headlines, and you know their
IDs? Should you use OR or
IN(...)? Let's find out what
MySQL can tell us, using the lowest and highest headline IDs as well
as one in between:
mysql> SELECT Url FROM Headline WHERE Id IN(1531513, 10231599, 13962322);
+----------------------------------------------+
| Url |
+----------------------------------------------+
| http://biz.yahoo.com/bond/010117/bfl |
| http://biz.yahoo.com/e/021101/yhoo10-ql |
| http://biz.yahoo.com/bw/030331/315850_1l |
+----------------------------------------------+
3 rows in set (0.00 sec)
mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN(1531513, 10231599, 13962322) \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 3
Extra: Using where
1 row in set (0.00 sec)
mysql> SELECT Url FROM Headline WHERE Id = 1531513 OR Id = 10231599 OR Id = 13962322;
+----------------------------------------------+
| Url |
+----------------------------------------------+
| http://biz.yahoo.com/bond/010117/bfl |
| http://biz.yahoo.com/e/021101/yhoo10-ql |
| http://biz.yahoo.com/bw/030331/315850_1l |
+----------------------------------------------+
3 rows in set (0.03 sec)
mysql> EXPLAIN SELECT Url FROM Headline WHERE Id = 1531513 OR Id = 10231599 OR Id =
13962322 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 3
Extra: Using where
1 row in set (0.00 sec)
Both queries execute very quickly, and their
EXPLAIN output is the same.
They're functionally the same. It's
clear that either query may return anywhere from zero to three rows.
We're querying based on a unique index (the primary
key), so there isn't much for MySQL to think about.
As it turns out, we happen to know that in this case, MySQL
internally changed the multi-OR query to one that
uses a single IN(...) list. However,
it's clear that as the number of IDs increases, the
query string will be smaller if you use the
IN(..). A smaller query means less parsing
overhead and better performance.What if we use a subquery to fetch the URL for the highest numbered
headline?
mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN (SELECT MAX(Id) FROM Headline);
After waiting five minutes, we killed the query. Either we did
something wrong, or MySQL wasn't using the obvious
approach to resolve this query. Hmm.To find out, let's explain it.
mysql> EXPLAIN SELECT Url FROM Headline WHERE Id IN (SELECT MAX(id) FROM Headline) \G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: Headline
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 302116
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBSELECT
table: Headline
type: index
possible_keys: NULL
key: PRIMARY
key_len: 4
ref: NULL
rows: 302116
Extra: Using index
2 rows in set (0.00 sec)
Yikes!MySQL isn't using any indexes! Notice that both
possible_keys fields are NULL.
Is this a bug? Perhaps, especially when you consider that the
key field in the dependent subselect says that it
selected the primary key. But it wasn't in the list
of possible keys. And, worse yet, MySQL believes it must examine
302,116 rows to resolve a single-record lookup supposedly based on a
primary key.Of course, this testing was performed with MySQL 4.1.0 alpha,
prerelease code in which the query optimizer hadn't
been properly tuned to handle subselects well.[1] The point
isn't that MySQL didn't do the
right thing. No matter how well tuned it is, MySQL will make a bad
decision once in a while. When it does, you need to be able to
diagnose the problem and, in some cases, come up with a workaround.[1] Subsequent tests with the 4.1.2 alpha version proved that the
query optimizer no longer had this bug.
So let's rewrite the query a bit to simplify things.
We're using IN(...) in a query
that can only return one row. So let's change that
to an equality (=) test.
mysql> SELECT Url FROM Headline WHERE Id = (SELECT MAX(id) FROM Headline);
+----------------------------------------------+
| Url |
+----------------------------------------------+
| http://biz.yahoo.com/bw/030331/315850_1l |
+----------------------------------------------+
1 row in set (0.00 sec)
mysql> EXPLAIN SELECT Url FROM Headline WHERE Id = (SELECT MAX(id) FROM Headline) \G
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: Headline
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra:
*************************** 2. row ***************************
id: 2
select_type: SUBSELECT
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away
2 rows in set (0.00 sec)
A-ha! That did it. The query ran in a split second.The EXPLAIN output is interesting too. MySQL
noticed that we were attempting something very trivial and optimized
away the second table. All those NULL values are
MySQL's way of saying, "These
simply don't matter."But what if that hadn't worked? Or what if
we're using an older MySQL that
doesn't have subselects yet? Simple. We can rewrite
the query as two SELECT statements and store the
intermediate value in a variable on the server side so that no
client-side state is required:
mysql> SELECT @max := MAX(Id) FROM Headline;
+-----------------+
| @max := MAX(Id) |
+-----------------+
| 13962322 |
+-----------------+
1 row in set (0.00 sec)
mysql> SELECT Url FROM Headline WHERE Id = @max;
+----------------------------------------------+
| Url |
+----------------------------------------------+
| http://biz.yahoo.com/bw/030331/315850_1l |
+----------------------------------------------+
1 row in set (0.00 sec)
We don't even need to explain those queries. Based
on what we already know, they'll obviously be fast
(and they are). Both are queries on primary keys and fetch single
values.And, for completeness, the most MySQL-like way to write that query is
to use an ORDER BY and LIMIT:
SELECT Url FROM Headline ORDER BY Id DESC LIMIT 1;
Let's look at one last example. What if you query
based on two different indexed fields? MySQL tries to select the
index that will result in the fewest rows being examined. So the
results will vary depending on your data and the values you choose.
mysql> SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1112201600 AND Id <=
5000000;
+----------+
| COUNT(*) |
+----------+
| 1175 |
+----------+
1 row in set (0.04 sec)
mysql> EXPLAIN SELECT COUNT(*) FROM Headline
-> WHERE ExpireTime >= 1112201600 AND Id <= 5000000 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: PRIMARY,ExpireTime
key: ExpireTime
key_len: 4
ref: NULL
rows: 12009
Extra: Using where
1 row in set (0.00 sec)
For this query, given the choice between the primary key field
(Id) and the ExpireTime, MySQL
decided to use ExpireTime. However, if the
ExpireTime value is changed so that it matches
many more rows, MySQL should favor the primary key:
mysql> EXPLAIN SELECT COUNT(*) FROM Headline WHERE ExpireTime >= 1012201600 AND Id <=
5000000 \G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: Headline
type: range
possible_keys: PRIMARY,ExpireTime
key: PRIMARY
key_len: 4
ref: NULL
rows: 13174
Extra: Using where
1 row in set (0.00 sec)
As expected, it does.Again, this decision-making process is all based on
MySQL's notion of what the data looks likehow
evenly distributed the values are. Different storage engines (InnoDB,
MyISAM, BDB) use different methods to gather those statistics. As a
result, you may find that some queries are executed differently if
you convert your data to a different table type. Of course, running
ANALYZE TABLE will also affect
MySQL's statistics.
5.1.3.1 Joins
Things become slightly
more complex when you're querying multiple tables.
MySQL has to decide which order makes the most sense. Again, the goal
it to read as few rows as possible, so it will consider each table
and estimate how many rows it must read from each. In doing so, it
also needs to understand the relationship among the tables. For
example, with a query like this, it's clear that
MySQL can't read the table order
first:
SELECT customer.name, order.date_placed, region.name
FROM customer, order, region
WHERE order.customer_id = customer.id
AND customer.region_id = region.id
AND customer.name = 'John Doe'
The rows MySQL will need to retrieve from the
order table depend on the
customer table. So it must read
customer before order. In fact,
the same is true of region. So in this case, MySQL
has to read customer records first. From there it
will decide to read the remaining tables in whatever order it
chooses.Unfortunately, finding the optimal join order is one of
MySQL's weakest skills. Rather than being clever
about this problem, the optimizer simply tries to brute-force its way
through. It tries every possible combination before choosing one.
That can spell disaster in a some cases. We've seen
at least one case in which MySQL took 29 seconds to decide how to
execute a multitable join and then 1 second to actually execute it.
In this particular case, there were over 10 tables involved. Since
MySQL is considering all possible combinations, performance begins to
degrade quite drastically as you go beyond a handful of tables. The
exact number, of course, depends on how powerful CPUs are this year.
5.1.4 Execution
There's
not a lot to say about query execution. MySQL simply follows its
plan, fetching rows from each table in order and joining based on the
relevant columns (hopefully using indexes). Along the way, it may
need to create a temporary table (in memory or on disk) to store the
results. Once all the rows are available, it sends them to the
client.Along the way, MySQL gathers some information and statistics about
each query it executes, including:Who issued the queryHow long the process tookHow many rows were returned
That information will appear in the slow query log (discussed later
in this chapter) if the query time exceeds the
server's threshold, and the log is enabled. If the
query is issued interactively, it will also appear after the query
results.