4.1 Indexing Basics
To understand how MySQL uses indexes, it's best
first to understand the basic workings and features of indexes. Once
you have a basic understanding of their characteristics, you can
start to make more intelligent choices about the right way to use
them.
4.1.1 Index Concepts
To understand what indexes allow MySQL to do, it's
best to think about how MySQL works to answer a query. Imagine that
phone_book is a table containing an aggregate
phone book for the state of California, with roughly 35 million
entries. And keep in mind that records within tables
aren't inherently sorted. Consider a query like this
one:
SELECT * FROM phone_book WHERE last_name = 'Zawodny'
Without any sort of index to consult, MySQL must read all the records
in the phone_book table and compare the
last_name field with the string
"Zawodny" to see whether they
match. Clearly that's not efficient. As the number
of records increases, so does the effort necessary to find a given
record. In computer science, we call that an O(n) problem.But given a real phone book, we all know how to quickly locate anyone
named Zawodny: flip to the Zs at the back of book and start there.
Since the second letter is "a," we
know that any matches will be at or near the front of the list of all
names starting with Z. The method used is based on knowledge of the
data and how it is sorted.That's cheating, isn't it? Not at
all. The reason you can find the Zawodnys so quickly is that
they're sorted alphabetically by last name. So
it's easy to find them, provided you know your ABCs,
of course.Most technical books (like this one) provide an index at the back. It
allows you to find the location of important terms and concepts
quickly because they're listed in sorted order along
with the corresponding page numbers. Need to know where
mysqlhotcopy is discussed? Just look up the page
number in the index.Database indexes are similar. Just as
the book author or publisher may choose to create an index of the
important concepts and terms in the book, you can choose to create an
index on a particular column of a database table. Using the previous
example, you might create an index on the last name to make looking
up phone numbers faster:
ALTER TABLE phone_book ADD INDEX (last_name)
In doing so, you're asking MySQL to create an
ordered list of all the last names in the
phone_book table. Along with each name, it notes
the positions of the matching recordsjust as the index at the
back of this book lists page numbers for each entry.[1][1] That's a bit of a lie. MySQL
doesn't always store the position of the matching
records. We'll see why soon enough.
From the database server's point of view, indexes
exist so that the database can quickly eliminate possible rows from
the result set when executing a query. Without any indexes, MySQL
(like any database server) must examine every row in a table. Not
only is that time consuming, it uses a lot of disk I/O and can
effectively pollute the disk cache.In the real world, it's rare to find dynamic data
that just happens to be sorted (and stays sorted). Books are a
special case; they tend to remain static.Because MySQL needs to
maintain a separate list of indexes' values and keep
them updated as your data changes, you really don't
want to index every column in a table. Indexes are a trade-off
between space and time. You're sacrificing some
extra disk space and a bit of CPU overhead on each
INSERT, UPDATE, and
DELETE query to make most (if not all) your
queries much faster.Much of the MySQL documentation uses the terms
index and key
interchangeably. Saying that last_name is a key in
the phone_book table is the same as saying that
the last_name field of the
phone_book table is indexed.
4.1.1.1 Partial indexes
Indexes trade space for performance.
But sometimes you'd rather not trade too much space
for the performance you're after. Luckily, MySQL
gives you a lot of control over how much space is used by the
indexes. Maybe you have a phone_book table with 2
billion rows in it. Adding an index on last_name
will require a lot of space. If the average
last_name is 8 bytes long, you're
looking at roughly 16 GB of space for the data portion of the index;
the row pointers are there no matter what you do, and they add
another 4-8 bytes per record.[2][2] That's
a bit of an oversimplification, too. MySQL has some strategies for
reducing the size of the index, but they also come at a price.
Instead of indexing the entire last name, you might index only the
first 4 bytes:
ALTER TABLE phone_book ADD INDEX (last_name(4))
In doing so, you've reduced the space requirements
for the data portion of the index by roughly half. The trade-off is
that MySQL can't eliminate quite as many rows using
this index. A query such as:
SELECT * FROM phone_book WHERE last_name = 'Smith'
retrieves all fields beginning with Smit,
including all people with name Smith,
Smitty, and so on. The query must then discard
Smitty and all other irrelevant rows.
4.1.1.2 Multicolumn indexes
Like many relational database engines,
MySQL allows you to create indexes that are composed of multiple
columns:
ALTER TABLE phone_book ADD INDEX (last_name, first_name)
Such indexes can improve the query speed if you often query all
columns together in the WHERE clause or if a
single column doesn't have sufficient variety. Of
course, you can use partial indexes to reduce the space required:
ALTER TABLE phone_book ADD INDEX (last_name(4), first_name(4))
In either case, a query to find Josh Woodward executes quickly:
SELECT * FROM phone_book
WHERE last_name = 'Woodward'
AND first_name = 'Josh'
Having the last name and first name indexed together means that MySQL
can eliminate rows based on both fields, thereby greatly reducing the
number of rows it must consider. After all, there are a lot more
people in the phone book whose last name starts with
"Wood" than there are folks whose
last name starts with "Wood" and
whose first name also starts with
"Josh."When discussing multicolumn indexes, you may see the individual
indexed columns referred to as key
parts or "parts of the
key." Multicolumn indexes are also referred to as
composite indexes or compound indexes.So why not just create two indexes, one on
last_name and one on
first_name? You could do that, but MySQL
won't use them both at the same time. In fact, MySQL
will only ever use one index per table per queryexcept for
UNIONs.[3] This fact is important
enough to say again: MySQL will only ever use one index per
table per query.[3] In a
UNION, each logical query is run separately, and
the results are merged.
With separate indexes on first_name and
last_name, MySQL will choose one or the other. It
does so by making an educated guess about which index allows it to
match fewer rows. We call it an educated guess because MySQL keeps
track of some index statistics that allow it to infer what the data
looks like. The statistics, of course, are generalizations. While
they often let MySQL make smart decisions, if you have very clumpy
data, MySQL may make suboptimal choices about index use. We call data
clumpy if the key being indexed is sparse in some
areas (such as names beginning with X) and highly concentrated in
others (such as the name Smith in English-speaking
countries). This is an important topic that we'll
revisit later in this book.
4.1.1.3 Index order
How does MySQL order values in the
index? If you've used another RDBMS, you might
expect MySQL to have syntax for specifying that an index be sorted in
ascending, descending, or some other order. MySQL gives you no
control over its internal sorting of index values. It has little
reason to. As of Version 4.0, it does a good job of optimizing cases
that cause slower performance for other database systems.For example, some database products may execute this query quickly:
SELECT * FROM phone_book WHERE last_name = 'Zawodny'
ORDER BY first_name DESC
And this query slowly:
SELECT * FROM phone_book WHERE last_name = 'Zawodny'
ORDER BY first_name ASC
Why? Because some databases store the indexes in descending order and
are optimized for reading them in that order. In the first case, the
database uses the multicolumn index to locate all the matching
records. Since the records are already stored in descending order,
there's no need to sort them. But in the second
case, the server finds all matching records and then performs a
second pass over those rows to sort them.MySQL is smart enough to "traverse the index
backwards" when necessary. It will execute both
queries very quickly. In neither case does it need to sort the
records.
4.1.1.4 Indexes as constraints
Indexes aren't always used
to locate matching rows for a query. A unique
index specifies that a particular value may
only appear once in a given column.[4] In the
phone book example, you might create a unique index on
phone_number to ensure that each phone number
appears only once: [5][4] Except for NULL,
of course. NULL is always a special case.[5] In the real world, however, this
would be a very bad practice, as anyone who has shared a phone with
several housemates can tell you.
ALTER TABLE phone_book ADD UNIQUE (phone_number)
The unique index serves a dual purpose. It functions just like any
other index when you perform a query based on a phone number:
SELECT * FROM phone_book WHERE phone_number = '555-7271'
However, it also checks every value when attempting to insert or
update a record to ensure that the value doesn't
already exist. In this way, the unique index acts as a constraint.Unique indexes use as much space as nonunique indexes do. The value
of every column as well as the record's location is
stored. This can be a waste if you use the unique index as a
constraint and never as an index. Put another way, you may rely on
the unique index to enforce uniqueness but never write a query that
uses the unique value. In this case, there's no need
for MySQL to store the locations of every record in the index:
you'll never use them.Unfortunately, there's no way to signal your
intentions to MySQL. In the future, we'll likely
find a feature introduced for this specific case. The MyISAM storage
engine already has support for unique columns without an index (it
uses a hash-based system), but the mechanism isn't
exposed at the SQL level yet.
4.1.1.5 Clustered and secondary indexes
With MyISAM tables, the indexes are kept in a
completely separate file that contains a list of primary (and
possibly secondary) keys and a value that represents the byte offset
for the record. These ensure MySQL can find and then quickly skip to
that point within the database to locate the record. MySQL has to
store the indexes this way because the records are stored in
essentially random order.With clustered
indexes, the primary key and the record itself are
"clustered" together, and the
records are all stored in primary-key order.
InnoDB uses clustered indexes. In the
Oracle world, clustered indexes are known as
"index-organized
tables," which may help you remember the
relationship between the primary key and row ordering.When your data is almost always searched on via its primary key,
clustered indexes can make lookups incredibly fast. With a standard
MyISAM index, there are two lookups, one to the index, and a second
to the table itself via the location specified in the index. With
clustered indexes, there's a single lookup that
points directly to the record in question.Some operations render clustered indexes less effective. For
instance, consider when a secondary index is in use. Going back to
our phone book example, suppose you have last_name
set as the primary index and phone_number set as a
secondary index, and you perform the following query:
SELECT * FROM phone_book WHERE phone_number = '555-7271'
MySQL scans the phone_number index to find the
entry for 555-7271, which contains the primary key
entry Zawodny because
phone_book's primary index is the
last name. MySQL then skips to the relevant entry in the database
itself.In other words, lookups based on your primary key happen exceedingly
fast, and lookups based on secondary indexes happen at essentially
the same speed as MyISAM index lookups would.But under the right (or rather, the wrong) circumstances, the
clustered index can actually
degrade performance. When you use one together with a secondary
index, you have to consider the combined impact on storage. Secondary
indexes point to the primary key rather than the row. Therefore, if
you index on a very large value and have several secondary indexes,
you will end up with many duplicate copies of that primary index,
first as the clustered index stored alongside the records themselves,
but then again for as many times as you have secondary indexes
pointing to those clustered indexes. With a small value as the
primary key, this may not be so bad, but if you are using something
potentially long, such as a URL, this repeated storage of the primary
key on disk may cause storage issues.Another less common but equally problematic condition happens when
the data is altered such that the primary key is changed on a record.
This is the most costly function of clustered indexes. A number of
things can happen to make this operation a more severe performance
hit:Alter the record in question according to the query that was issued.Determine the new primary key for that record, based on the altered
data record.Relocate the stored records so that the record in question is moved
to the proper location in the tablespace.Update any secondary indexes that point to that primary key.
As you might imagine, if you're altering the primary
key for a number of records, that UPDATE command
might take quite some time to do its job, especially on larger
tables. Choose your primary keys wisely. Use values that are unlikely
to change, such as a Social Security account number instead of a last
name, serial number instead of a product name, and so on.
4.1.1.6 Unique indexes versus primary keys
If you're coming from
other relational databases, you might wonder what the difference
between a primary key and a unique index is in MySQL. As usual, it
depends. In MyISAM tables, there's almost no
difference. The only thing special about a primary key is that it
can't contain NULL values. The primary key is simply
a NOT NULL
UNIQUE INDEX named
PRIMARY. MyISAM tables don't
require that you declare a primary key.InnoDB and BDB tables require primary keys
for every table. There's no requirement that you
specify one, however. If you don't, the storage
engine automatically adds a hidden primary key for you. In both
cases, the primary keys are simply incrementing numeric values,
similar to an AUTO-INCREMENT column. If you decide
to add your own primary key at a later time, simply use
ALTER TABLE to add one. Both
storage engines will discard their internally generated keys in favor
of yours. Heap tables don't
require a primary key but will create one for you. In fact, you can
create Heap tables with no indexes at all.
4.1.1.7 Indexing NULLs
It
is often difficult to remember that SQL uses tristate logic when
performing logical operations. Unless a column is declared
NOT NULL, there are three
possible outcomes in a logical comparison. The comparison may be true
because the values are equivalent; it may be false because the values
aren't equivalent; or it may not match because one
of the values is NULL. Whenever one of the values is NULL, the
outcome is also NULL.Programmers often think of NULL as undefined or unknown.
It's a way of telling the database server
"an unknown value goes here." So
how do NULL values affect indexes?NULL values may be used in normal (nonunique) indexes. This is true
of all database servers. However, unlike many database servers, MySQL
allows you to use NULL values in unique indexes.[6] You can store as many
NULL values as you'd like in such an index. This may
seem a bit counterintuitive, but that's the nature
of NULL. Because NULL represents an undefined value, MySQL needs to
assert that all NULL values are the same if it allowed only a single
value in a unique index.[6] MySQL Version 3.23 and older don't allow this,
Versions 4.0 and newer do.
To make things just a bit more interesting, a
NULL value may appear only once as a
primary key. Why? The SQL standard dictates this behavior. It is one
of the few ways in which primary keys are different from unique
indexes in MySQL. And, in case you're wondering,
allowing NULL values in the index really doesn't
impact performance.