SQL Performance Tuning [Electronic resources]

Peter Gulutzan, Trudy Pelzer

نسخه متنی -صفحه : 124/ 56
نمايش فراداده

The Normal Forms

Given what we've seen so far, wouldn't it be wonderful if there was a standard way to: (a) reduce row lengths by splitting rows vertically, (b) isolate different data on different files so it could be partitioned, and (c) mitigate fragmentation by cramming truly similar data into the same pages? Well, there is a standard way. It's called normalization.

Some people think that normalizing is like Latinsupposedly, learning Latin helps you think clearly and look good, but it's an optional subject. Well, if that's so, it's odd that so many plumbers and engineerspeople who slept through all the artsy stuff in schoolinsist on using normalization. Perhaps it's time to take a second look at the subject.

Consider Table 8-5, which shows an unnormalized "table" called Diplomatic_1. Diplomatic_1 contains the following information: the name of a diplomat, the language the diplomat speaks, the number of years the diplomat has spoken the language, the diplomat's title, the diplomat's length of service with the diplomatic corps, the diplomat's salary, the group with which the diplomat is affiliated, and the name of the person who heads the group. Each diplomat has a unique name, so the name column can be used as the table's primary key, as indicated by the fact that the name heading is underlined in Table 8-5. (In practice, a key can usually be constructed in more than one wayfor example, a social security number is another value that can uniquely identify an individual.)

Requests of the form "Tell me something about a diplomat" are easily answered by the structure of the data shown in Table 8-5. But answers to requests like "Which diplomats speak Greek?" and "Change the diplomat in charge of the FTA" are far more difficult. Normalization changes the organization of a table so that you can use its data more flexibly.

The first step in normalizing is based on the relational rule that each column of a table may contain only one, atomic, value per row. Looking at Table 8-5, it's easy to see that the language, years_used, work_group, and head_honcho columns violate this rule, because a diplomat may speak multiple languages and may be affiliated with multiple groups. If we add new rows by duplicating the nonrepeating values for each combination of repeating values though, Table 8-5 will be represented in first normal form (1NF). (A 1NF table contains only scalar values.) The Diplomatic_2 table shown in Table 8-6 is the Diplomatic_1 table normalized to first normal form.

Table 8-5. Diplomatic_1 Table
name language years_used title service_length salary work_group head_honcho
====
Axworthy

French

German

3

2

Consul 4 30,000.00

WHO

IMF

Greene

Craig

Broadbent

Russian

Greek

1

3

Diplomat 2 25,000.00

IMF

FTA

Craig

Crandall

Campbell

French

Spanish

Italian

2

1

3

Consul 3 28,000.00 EA Temple
Craig

French

Greek

Russian

Spanish

1

5

2

9

Ambassador 8 65,000.00 IMF Craig
Crandall French 9 Ambassador 3 55,000.00 FTA Crandall
Greene

French

Spanish

Italian

Japanese

3

7

1

4

Ambassador 9 70,000.00 WHO Greene
Temple

French

Russian

4

2

Ambassador 2 60,000.00 EA Temple

Take another look at dependence has two rules. First, if the value of column1 uniquely determines the value of column2, then column2 is functionally dependent on column1. Second, if the value of column1 limits the possible values in column2 to a specific set, then column2 is set dependent on column1. For example, because each diplomat has only one title, name will uniquely determine titletherefore, title is functionally dependent on name. Further, work_group is set dependent on name, because each diplomat is assigned to one or more of a specific set of groups.

The concept of dependence tells us that the title, service_length, and salary columns are not dependent on the entire primary key(name, language, work_group)of Diplomatic_2; they are dependent on name alone. Because this violates the relational rule, let's create a new tablecontaining only name, title, service_length, and salary. The key for this table will, once again, be name. Let's call the new table Diplomats.

Of the remaining Diplomatic_2 columns, years_used is determined by both name and language and therefore doesn't properly belong to Diplomats, so let's create another new tablecalled Languagesusing just these three columns. The Languages table's key is (name, language). Because Languages also contains the name column, it is still possible to associate a diplomat's language experience with his or her other data.

Splitting a table like this prevents it from having columns that are dependent on only part of the table's key. A first normal form table that also has no partial key dependence is said to be in second normal form (2NF). (A 2NF table is a 1NF table that contains only columns that are dependent upon the entire primary key.) Both Diplomats (Table 8-7) and Languages (Table 8-8) are in second normal form.

Table 8-6. Diplomatic_2 Table
name language years_used title service_length salary work_group head_honcho
==== ======= =========
Axworthy French 3 Consul 4 30,000.00 WHO Greene
Axworthy German 2 Consul 4 300,00.00 IMF Craig
Broadbent Russian 1 Diplomat 2 25,000.00 IMF Craig
Broadbent Greek 3 Diplomat 2 25,000.00 FTA Crandall
Campbell French 2 Consul 3 28,000.00 EA Temple
Campbell Spanish 1 Consul 3 28,000.00 EA Temple
Campbell Italian 3 Consul 3 28,000.00 EA Temple
Craig French 1 Ambassador 8 65,000.00 IMF Craig
Craig Greek 5 Ambassador 8 65,000.00 IMF Craig
Craig Russian 2 Ambassador 8 65,000.00 IMF Craig
Craig Spanish 9 Ambassador 8 65,000.00 IMF Craig
Crandall French 9 Ambassador 3 55,000.00 FTA Crandall
Greene French 3 Ambassador 9 70,000.00 WHO Greene
Greene Spanish 7 Ambassador 9 70,000.00 WHO Greene
Greene Italian 1 Ambassador 9 70,000.00 WHO Greene
Greene Japanese 4 Ambassador 9 70,000.00 WHO Greene
Temple French 4 Ambassador 2 60,000.00 EA Temple
Temple Russian 2 Ambassador 2 60,000.00 EA Temple

The situation with the Diplomatic_2.work_group column is slightly different. We have already noted that a diplomat's name determines the set of groups with which that diplomat is affiliated. This information is independent of the languages spoken by the diplomat, so a table with the work_group column shouldn't have language in its primary key. But work_group does uniquely determine a group's head_honcho.

Remember the Law of Transitivity from Chapter 2, "Simple Searches"? Well, it applies to dependence as well. That is, if column2 is dependent on column1 and column3 is dependent on column2, then it is also true that column3 is dependent on column1. This is known as a transitive dependencecolumn3 is transitively dependent on column1, via column2.

In our example, head_honcho is transitively dependent on name, because name determines a set of values for work_group, and head_honcho is functionally dependent on work_group. A second normal form table that has no transitive dependence is said to be in third normal form (3NF) and thus fulfills the relational requirement thatin every row of a tableall columns must depend directly on the primary key, without any transitive dependencies through other columns. (A 3NF table is a 2NF table whose non-key columns are also mutually independent; that is, each column can be updated independently of all the rest.)

Because each work_group has only one head_honcho, let's finish our normalization design by creating two more new tables. The first, called Groups, will contain the columns work_group and head_honcho. The Groups table's primary key is work_group. (head_honcho could also be a key if each diplomat managed only one work_group.) And finally, because each diplomat is affiliated with one or more groups, we'll create an Affiliations table, using the columns name and work_group. Affiliations forms the association between the Diplomats and Groups tables and is "all key"that is, Affiliations has no additional dependent columns, because the only thing dependent on both name and work_group is the fact that they are associated. Table 8-9 shows the third normal form Groups table, and Table 8-10 shows the third normal form Affiliations table.

Table 8-7. Diplomats Table
name title service_length salary
====
Axworthy Consul 4 30,000.00
Broadbent Diplomat 2 25,000.00
Campbell Consul 3 28,000.00
Craig Ambassador 8 65,000.00
Crandall Ambassador 3 55,000.00
Greene Ambassador 9 70,000.00
Temple Ambassador 2 60,000.00

Table 8-8. Languages Table
name language years_used
==== =======
Axworthy French 3
Axworthy German 2
Broadbent Russian 1
Broadbent Greek 3
Campbell French 2
Campbell Spanish 1
Campbell Italian 3
Craig French 1
Craig Greek 5
Craig Russian 2
Craig Spanish 9
Crandall French 9
Greene French 3
Greene Spanish 7
Greene Italian 1
Greene Japanese 4
Temple French 4
Temple Russian 2

Table 8-9. Groups Table
work_group head_honcho
=========
EA Temple
FTA Crandall
IMF Craig
WHO Greene

Table 8-10. Affiliations Table
name work_group
==== =========
Axworthy WHO
Axworthy IMF
Broadbent IMF
Broadbent FTA
Campbell EA
Craig IMF
Crandall FTA
Greene WHO
Temple EA

At this point, our normalization process is complete. All tables are in third normal form, and requests such as those listed earlier can easily be dealt with.

Here are some tips for good database design:

Don't use an existing database as the basis for a new database structureyou don't want to inadvertently duplicate awkward or inconsistent table definitions.

Avoid unnecessary duplication of datamake sure each table represents just one subject.

Define a primary key for every tablenot only will it uniquely identify a row, you'll use it to join tables. A primary key must have unique, NOT NULL values that, preferably, change only rarely.

Define unique keys and foreign keys for your tables.

Breaking Normalization Rules

Still on the subject of normalization, strictly speaking, an error isn't a performance problem unless you correct it. So the following observations won't convince the true cowboys in the crowd, because they only affect performance when results must be correct. Um, that still sounds nontrivial, so let's observe two scenarios.

Scenario #1: BREAK 1NF

InterBase, Informix, and Oracle have supported arrays for many years, and now ARRAY is an SQL:1999 data type. Another SQL:1999 collection data type, ROW, is supported by Informix. Suppose we have a table that contains addresses. We want to split the address column into four parts becausewhen the data is printedfour lines are used for the address. (In effect, we want to break first normal form.) So, instead of using Definition #1, we make our Addresses table with either Definition #2 or Definition #3:

Definition #1:
CREATE TABLE Addresses (
identifier INTEGER PRIMARY KEY,
address CHARACTER(100),
... )
Definition #2:
CREATE TABLE Addresses (          /* Use ARRAY data type */
identifier INTEGER PRIMARY KEY,
address CHAR(25) ARRAY[4],
... )
Definition #3:
CREATE TABLE Addresses (          /* Use ROW data type */
identifier INTEGER PRIMARY KEY,
address ROW (r1 CHAR(25),
r2 CHAR(25),
r3 CHAR(25),
r4 CHAR(25)),
... )

At this point, someone needs an answer to the questionHow many addresses are in New York? We can no longer answer the question with this query:

SELECT * FROM Addresses
WHERE address LIKE '%New York%'

Instead, we'll have to use one of these queries to get the answer … which takes considerably longer to execute. (Yes we tried it, on Informix, whose ROW data type complies with the SQL Standard.)

SELECT * FROM Addresses     /* for the ARRAY data type */
WHERE address[1] LIKE '%New York%'
OR address[2] LIKE '%New York%'
OR address[3] LIKE '%New York%'
OR address[4] LIKE '%New York%'
SELECT * FROM Addresses     /* for the ROW data type */
WHERE address.r1 LIKE '%New York%'
OR address.r2 LIKE '%New York%'
OR address.r3 LIKE '%New York%'
OR address.r4 LIKE '%New York%'

And herein lies the problem with breaking 1NF just to make the printing job easier. Not only are the collection type queries slower, they're not even strictly correctNew York could have been split over two or more lines.

Scenario #2: BREAK 2NF

Now, suppose we create two more tables, as follows:

CREATE TABLE States (
state_code CHARACTER(2) PRIMARY KEY,
state_name CHARACTER(25))
CREATE TABLE Cities (
identifier INTEGER PRIMARY KEY,
city_name CHARACTER(25),
state_code CHARACTER(2))

One of the rows in Cities contains a state_code value of AL. We can look up AL in the States table to determine that AL means Alaska. But we find that when we print addresses we must do joins to display the name of the state:

SELECT identifier, city_name, state_name
FROM Cities, States
WHERE Cities.state_code = States.state_code

Because we don't want to waste time doing the join, we duplicate the state_name in the Cities table and end up with the tables shown in Table 8-11 and Table 8-12.

With the new Cities table, we don't have to join any more. We can simply execute this SQL statement:

SELECT identifier, city_name, state_name
FROM Cities

We've broken second normal form (state_name is dependent on state_code, not just identifier)but performance will obviously improve.

Time passes… .

At this point, we find that some user, who didn't understand the system, thought state_name in the Cities table is an updatable column. So she changed it to Saudi Arabiabut forgot to update state_code as well. To prevent that from happening again, we add a CHECK constraint to Cities:

ALTER TABLE Cities
ADD CONSTRAINT State_Check CHECK (
(state_code, state_name) =
(SELECT state_code, state_name FROM States))
/* this could also be done with a FOREIGN KEY constraint */

Table 8-11. Cities Table
identifier city_name state_code state_name
1 Riyadh AL Alaska
2 Medina AL Alaska
3 Mecca AL Alaska

Table 8-12. States Table
state_code state_name
KY Kentucky
AL Alaska
NY New York

This is inevitable. Whenever redundant columns are added, errors creep in, and we correct them by adding constraints or triggers. (By the way, not all DBMSs will accept the syntax of this CHECK constraint.) So let's take stock:

All the rows in the Cities table are slightly larger. As we know from the last chapter, this means when we SELECT from Cities alone, the selection is slower.

The Cities table is subject to a new constraint. That means INSERTs will be slower. In fact, because the constraint references a far table, INSERTs could slow down by just as much as we saved by not joining on the SELECTs.

One other thing. AL doesn't stand for Alaskait stands for Alabama. When we fix that, we'll have to remember to fix it in two places.

The point we want to make here is that denormalizing only appears to make things faster. That is, when we deliberately denormalized, we failed to make an improvement overall. In fact, it appears that denormalizing improved one thing, but caused an offsetting problem to pop out somewhere else.

Relational proponents like to notice that sort of thing. They say it proves that you might as well try to optimize for a broad range of cases, because when you optimize for a single case, you muck things up elsewhere. And here we can address the claim that "it is practical" to denormalize. That's truebut "it" is a singular pronoun. Normalizing is an attempt to optimize generally. Denormalizing is an attempt to normalize for a particular application, to introduce a bias.

By following normalization rules, you will be splitting a tablefor example, where the original table was Table1 (column1, column2, column3, column4), you end up with Table1 (column1, column2, T2_ID) and Table2 (T2_ID, column3, column4). The advantages of this arrangement are three-fold:

Tables are narrower. Therefore, scans are quicker.

There are more tables. Therefore partitioning is easier.

Irrelevant information is separated out. Therefore a data change affecting column4 won't block a transaction affecting column1locks will be less frequent.

The Bottom Line: Normalization

"There is basically no need for OLTP denormalization today."

Richard Yevich and Susan Lawson, DB2 High Performance Design and Tuning, Prentice Hall PTR

Normalizing means smaller rows. Smaller rows mean faster retrievals because (a) the chance of overflowing the operating system cache is smaller, (b) a single page read retrieves more records into cache, and (c) byte-by-byte comparisons involve less bytes. On the other hand, (a) fragmentation is more likely as rows are updated or deleted, (b) string comparisons always involve an initial read of the size (the parser cannot decide in advance how many repetitions to plug into the loop that it produces), and (c) row position within a file can't be calculated in advance because row start is not a simple matter of saying (row size * row number).

Normalizing also has these effects: (a) table scans for a particular column are twice as fast (on average) because the number of rows per page is greater, (b) parallel queries are possible, (c) fetch and lock times are reduced, and (d) INSERTs are slowed by primary-key/foreign-key integrity checks.

Normalizing helps design and programming, and it makes things more efficient too (at least as far as 3NF, which is easy to reach with case tools). The downside is that too many slow joins happen if you ask for all information related to an entitybut really, you should be thinking about making the join faster or cutting out unnecessary joins rather than opting to denormalize.