SQL Performance Tuning
Peter Gulutzan, Trudy Pelzer

The Normal Forms



Given what we've seen so far, wouldn't it be wonderful if there was a standard way to: (a) reduce row lengths by splitting rows vertically, (b) isolate different data on different files so it could be partitioned, and (c) mitigate fragmentation by cramming truly similar data into the same pages? Well, there is a standard way. It's called normalization.


Some people think that normalizing is like Latin: supposedly, learning Latin helps you think clearly and look good, but it's an optional subject. Well, if that's so, it's odd that so many plumbers and engineers (people who slept through all the artsy stuff in school) insist on using normalization. Perhaps it's time to take a second look at the subject.


Consider Table 8-5, which shows an unnormalized "table" called Diplomatic_1. Diplomatic_1 contains the following information: the name of a diplomat, the language the diplomat speaks, the number of years the diplomat has spoken the language, the diplomat's title, the diplomat's length of service with the diplomatic corps, the diplomat's salary, the group with which the diplomat is affiliated, and the name of the person who heads the group. Each diplomat has a unique name, so the name column can be used as the table's primary key, as indicated by the underlining of the name heading in Table 8-5. (In practice, a key can usually be constructed in more than one way; for example, a social security number is another value that can uniquely identify an individual.)


Requests of the form "Tell me something about a diplomat" are easily answered by the structure of the data shown in Table 8-5. But answers to requests like "Which diplomats speak Greek?" and "Change the diplomat in charge of the FTA" are far more difficult. Normalization changes the organization of a table so that you can use its data more flexibly.


The first step in normalizing is based on the relational rule that each column of a table may contain only one, atomic, value per row. Looking at Table 8-5, it's easy to see that the language, years_used, work_group, and head_honcho columns violate this rule, because a diplomat may speak multiple languages and may be affiliated with multiple groups. If we add new rows by duplicating the nonrepeating values for each combination of repeating values, though, Table 8-5 will be represented in first normal form (1NF). (A 1NF table contains only scalar values.) The Diplomatic_2 table shown in Table 8-6 is the Diplomatic_1 table normalized to first normal form.




Table 8-5. Diplomatic_1 Table

name        language   years_used   title        service_length   salary      work_group   head_honcho
====
Axworthy    French     3            Consul       4                30,000.00   WHO          Greene
            German     2                                                      IMF          Craig
Broadbent   Russian    1            Diplomat     2                25,000.00   IMF          Craig
            Greek      3                                                      FTA          Crandall
Campbell    French     2            Consul       3                28,000.00   EA           Temple
            Spanish    1
            Italian    3
Craig       French     1            Ambassador   8                65,000.00   IMF          Craig
            Greek      5
            Russian    2
            Spanish    9
Crandall    French     9            Ambassador   3                55,000.00   FTA          Crandall
Greene      French     3            Ambassador   9                70,000.00   WHO          Greene
            Spanish    7
            Italian    1
            Japanese   4
Temple      French     4            Ambassador   2                60,000.00   EA           Temple
            Russian    2


Take another look at Table 8-6, this time with dependence in mind. Dependence has two rules. First, if the value of column1 uniquely determines the value of column2, then column2 is functionally dependent on column1. Second, if the value of column1 limits the possible values in column2 to a specific set, then column2 is set dependent on column1. For example, because each diplomat has only one title, name uniquely determines title; therefore, title is functionally dependent on name. Further, work_group is set dependent on name, because each diplomat is assigned to one or more of a specific set of groups.


The concept of dependence tells us that the title, service_length, and salary columns are not dependent on the entire primary key (name, language, work_group) of Diplomatic_2; they are dependent on name alone. Because this violates the rule that every non-key column must depend on the whole key, let's create a new table containing only name, title, service_length, and salary. The key for this table will, once again, be name. Let's call the new table Diplomats.


Of the remaining Diplomatic_2 columns, years_used is determined by both name and language and therefore doesn't properly belong in Diplomats, so let's create another new table, called Languages, using just these three columns (name, language, years_used). The Languages table's key is (name, language). Because Languages also contains the name column, it is still possible to associate a diplomat's language experience with his or her other data.


Splitting a table like this prevents it from having columns that are dependent on only part of the table's key. A first normal form table that also has no partial key dependence is said to be in second normal form (2NF). (A 2NF table is a 1NF table that contains only columns that are dependent upon the entire primary key.) Both Diplomats (Table 8-7) and Languages (Table 8-8) are in second normal form.
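The chapter doesn't spell out DDL for this split, so here is a minimal sketch; the data types are assumptions based on the sample data:

CREATE TABLE Diplomats (
   name CHARACTER(10) PRIMARY KEY,   /* 2NF: every other column depends on name */
   title CHARACTER(10),
   service_length INTEGER,
   salary DECIMAL(9,2))

CREATE TABLE Languages (
   name CHARACTER(10) REFERENCES Diplomats,
   language CHARACTER(10),
   years_used INTEGER,
   PRIMARY KEY (name, language))     /* 2NF: years_used depends on the whole key */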


Table 8-6. Diplomatic_2 Table

name        language   years_used   title        service_length   salary      work_group   head_honcho
====        ========                                                          ==========
Axworthy    French     3            Consul       4                30,000.00   WHO          Greene
Axworthy    German     2            Consul       4                30,000.00   IMF          Craig
Broadbent   Russian    1            Diplomat     2                25,000.00   IMF          Craig
Broadbent   Greek      3            Diplomat     2                25,000.00   FTA          Crandall
Campbell    French     2            Consul       3                28,000.00   EA           Temple
Campbell    Spanish    1            Consul       3                28,000.00   EA           Temple
Campbell    Italian    3            Consul       3                28,000.00   EA           Temple
Craig       French     1            Ambassador   8                65,000.00   IMF          Craig
Craig       Greek      5            Ambassador   8                65,000.00   IMF          Craig
Craig       Russian    2            Ambassador   8                65,000.00   IMF          Craig
Craig       Spanish    9            Ambassador   8                65,000.00   IMF          Craig
Crandall    French     9            Ambassador   3                55,000.00   FTA          Crandall
Greene      French     3            Ambassador   9                70,000.00   WHO          Greene
Greene      Spanish    7            Ambassador   9                70,000.00   WHO          Greene
Greene      Italian    1            Ambassador   9                70,000.00   WHO          Greene
Greene      Japanese   4            Ambassador   9                70,000.00   WHO          Greene
Temple      French     4            Ambassador   2                60,000.00   EA           Temple
Temple      Russian    2            Ambassador   2                60,000.00   EA           Temple


The situation with the Diplomatic_2.work_group column is slightly different. We have already noted that a diplomat's name determines the set of groups with which that diplomat is affiliated. This information is independent of the languages spoken by the diplomat, so a table with the work_group column shouldn't have language in its primary key. But work_group does uniquely determine a group's head_honcho.


Remember the Law of Transitivity from Chapter 2, "Simple Searches"? Well, it applies to dependence as well. That is, if column2 is dependent on column1 and column3 is dependent on column2, then it is also true that column3 is dependent on column1. This is known as a transitive dependence: column3 is transitively dependent on column1, via column2.


In our example, head_honcho is transitively dependent on name, because name determines a set of values for work_group, and head_honcho is functionally dependent on work_group. A second normal form table that has no transitive dependence is said to be in third normal form (3NF) and thus fulfills the relational requirement that, in every row of a table, all columns depend directly on the primary key, without any transitive dependencies through other columns. (A 3NF table is a 2NF table whose non-key columns are also mutually independent; that is, each column can be updated independently of all the rest.)


Because each work_group has only one head_honcho, let's finish our normalization design by creating two more new tables. The first, called Groups, will contain the columns work_group and head_honcho. The Groups table's primary key is work_group. (head_honcho could also be a key if each diplomat managed only one work_group.) And finally, because each diplomat is affiliated with one or more groups, we'll create an Affiliations table, using the columns name and work_group. Affiliations forms the association between the Diplomats and Groups tables and is "all key"; that is, Affiliations has no additional dependent columns, because the only thing dependent on both name and work_group is the fact that they are associated. Table 8-9 shows the third normal form Groups table, and Table 8-10 shows the third normal form Affiliations table.
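Again as a hedged DDL sketch, with assumed data types, the two new tables could be declared like this:

CREATE TABLE Groups (
   work_group CHARACTER(3) PRIMARY KEY,
   head_honcho CHARACTER(10))

CREATE TABLE Affiliations (
   name CHARACTER(10) REFERENCES Diplomats,
   work_group CHARACTER(3) REFERENCES Groups,
   PRIMARY KEY (name, work_group))   /* "all key": no non-key columns */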


Table 8-7. Diplomats Table

name        title        service_length   salary
====
Axworthy    Consul       4                30,000.00
Broadbent   Diplomat     2                25,000.00
Campbell    Consul       3                28,000.00
Craig       Ambassador   8                65,000.00
Crandall    Ambassador   3                55,000.00
Greene      Ambassador   9                70,000.00
Temple      Ambassador   2                60,000.00


Table 8-8. Languages Table

name        language   years_used
====        ========
Axworthy    French     3
Axworthy    German     2
Broadbent   Russian    1
Broadbent   Greek      3
Campbell    French     2
Campbell    Spanish    1
Campbell    Italian    3
Craig       French     1
Craig       Greek      5
Craig       Russian    2
Craig       Spanish    9
Crandall    French     9
Greene      French     3
Greene      Spanish    7
Greene      Italian    1
Greene      Japanese   4
Temple      French     4
Temple      Russian    2


Table 8-9. Groups Table

work_group   head_honcho
==========
EA           Temple
FTA          Crandall
IMF          Craig
WHO          Greene


Table 8-10. Affiliations Table

name        work_group
====        ==========
Axworthy    WHO
Axworthy    IMF
Broadbent   IMF
Broadbent   FTA
Campbell    EA
Craig       IMF
Crandall    FTA
Greene      WHO
Temple      EA


At this point, our normalization process is complete. All tables are in third normal form, and requests such as those listed earlier can easily be dealt with.
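For example, the two requests that were awkward against Diplomatic_1 become simple single-table statements. This is only a sketch; 'Greene' is an illustrative value, not a requirement of the design:

SELECT name FROM Languages       /* Which diplomats speak Greek? */
WHERE language = 'Greek'

UPDATE Groups                    /* Change the diplomat in charge of the FTA */
SET head_honcho = 'Greene'
WHERE work_group = 'FTA'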


Here are some tips for good database design:




Don't use an existing database as the basis for a new database structure; you don't want to inadvertently duplicate awkward or inconsistent table definitions.




Avoid unnecessary duplication of data; make sure each table represents just one subject.




Define a primary key for every table; not only will it uniquely identify a row, you'll also use it to join tables. A primary key must have unique, NOT NULL values that, preferably, change only rarely.




Define unique keys and foreign keys for your tables; a compact sketch of these last two tips follows this list.
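Here is that sketch in DDL form. The tables and columns are hypothetical, and ssn echoes the social security example earlier in this section:

CREATE TABLE Departments (
   dept_id INTEGER PRIMARY KEY,            /* unique, NOT NULL by definition */
   dept_name CHARACTER(25))

CREATE TABLE Employees (
   employee_id INTEGER PRIMARY KEY,        /* stable, rarely updated */
   ssn CHARACTER(11) NOT NULL UNIQUE,      /* an alternate (unique) key */
   dept_id INTEGER REFERENCES Departments) /* foreign key, used for joins */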




Breaking Normalization Rules



Still on the subject of normalization: strictly speaking, an error isn't a performance problem unless you correct it. So the following observations won't convince the true cowboys in the crowd, because they affect performance only when results must be correct. Um, that still sounds nontrivial, so let's observe two scenarios.


Scenario #1: BREAK 1NF


InterBase, Informix, and Oracle have supported arrays for many years, and now ARRAY is an SQL:1999 data type. Another SQL:1999 collection data type, ROW, is supported by Informix. Suppose we have a table that contains addresses. We want to split the address column into four parts because, when the data is printed, four lines are used for the address. (In effect, we want to break first normal form.) So, instead of using Definition #1, we make our Addresses table with either Definition #2 or Definition #3:



Definition #1:

CREATE TABLE Addresses (
   identifier INTEGER PRIMARY KEY,
   address CHARACTER(100),
   ... )

Definition #2:

CREATE TABLE Addresses (          /* Use ARRAY data type */
   identifier INTEGER PRIMARY KEY,
   address CHAR(25) ARRAY[4],
   ... )

Definition #3:

CREATE TABLE Addresses (          /* Use ROW data type */
   identifier INTEGER PRIMARY KEY,
   address ROW (r1 CHAR(25),
                r2 CHAR(25),
                r3 CHAR(25),
                r4 CHAR(25)),
   ... )


At this point, someone needs an answer to the question: How many addresses are in New York? We can no longer answer the question with this query:



SELECT * FROM Addresses
WHERE address LIKE '%New York%'


Instead, we'll have to use one of these queries to get the answer … which takes considerably longer to execute. (Yes, we tried it, on Informix, whose ROW data type complies with the SQL Standard.)



SELECT * FROM Addresses           /* for the ARRAY data type */
WHERE address[1] LIKE '%New York%'
   OR address[2] LIKE '%New York%'
   OR address[3] LIKE '%New York%'
   OR address[4] LIKE '%New York%'

SELECT * FROM Addresses           /* for the ROW data type */
WHERE address.r1 LIKE '%New York%'
   OR address.r2 LIKE '%New York%'
   OR address.r3 LIKE '%New York%'
   OR address.r4 LIKE '%New York%'


And herein lies the problem with breaking 1NF just to make the printing job easier. Not only are the collection-type queries slower, they're not even strictly correct: New York could have been split over two or more lines.


Scenario #2: BREAK 2NF


Now, suppose we create two more tables, as follows:



CREATE TABLE States (
   state_code CHARACTER(2) PRIMARY KEY,
   state_name CHARACTER(25))

CREATE TABLE Cities (
   identifier INTEGER PRIMARY KEY,
   city_name CHARACTER(25),
   state_code CHARACTER(2))


One of the rows in Cities contains a state_code value of AL. We can look up AL in the States table to determine that AL means Alaska. But we find that when we print addresses we must do joins to display the name of the state:



SELECT identifier, city_name, state_name
FROM Cities, States
WHERE Cities.state_code = States.state_code


Because we don't want to waste time doing the join, we duplicate the state_name in the Cities table and end up with the tables shown in Table 8-11 and Table 8-12.
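The migration step itself isn't shown in the text; a minimal sketch, assuming the DBMS supports ALTER TABLE ... ADD COLUMN and a correlated subquery in UPDATE, would be:

ALTER TABLE Cities
   ADD COLUMN state_name CHARACTER(25)

UPDATE Cities
   SET state_name =
       (SELECT state_name FROM States
        WHERE States.state_code = Cities.state_code)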


With the new Cities table, we don't have to join any more. We can simply execute this SQL statement:



SELECT identifier, city_name, state_name
FROM Cities


We've broken second normal form (state_name is dependent on state_code, not just on identifier), but performance will obviously improve.


Time passes…


At this point, we find that some user, who didn't understand the system, thought state_name in the Cities table was an updatable column. So she changed it to Saudi Arabia, but forgot to update state_code as well. To prevent that from happening again, we add a CHECK constraint to Cities:



ALTER TABLE Cities
   ADD CONSTRAINT State_Check CHECK (
      (state_code, state_name) =
      (SELECT state_code, state_name FROM States))
/* this could also be done with a FOREIGN KEY constraint */
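The FOREIGN KEY alternative mentioned in the comment would look something like the following sketch; it assumes we first give States a unique key on the column pair:

ALTER TABLE States
   ADD CONSTRAINT State_Pair_UQ UNIQUE (state_code, state_name)

ALTER TABLE Cities
   ADD CONSTRAINT State_FK FOREIGN KEY (state_code, state_name)
       REFERENCES States (state_code, state_name)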


Table 8-11. Cities Table

identifier   city_name   state_code   state_name
==========
1            Riyadh      AL           Alaska
2            Medina      AL           Alaska
3            Mecca       AL           Alaska


Table 8-12. States Table

state_code   state_name
==========
KY           Kentucky
AL           Alaska
NY           New York


This is inevitable. Whenever redundant columns are added, errors creep in, and we correct them by adding constraints or triggers. (By the way, not all DBMSs will accept the syntax of this CHECK constraint.) So let's take stock:




All the rows in the Cities table are slightly larger. As we know from the last chapter, this means that when we SELECT from Cities alone, the selection is slower.




The Cities table is subject to a new constraint. That means INSERTs will be slower. In fact, because the constraint references a far table, INSERTs could slow down by just as much as we saved by not joining on the SELECTs.




One other thing. AL doesn't stand for Alaska; it stands for Alabama. When we fix that, we'll have to remember to fix it in two places, as the two UPDATEs below show.
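With the denormalized table, the fix is two statements instead of one (a sketch):

UPDATE States
SET state_name = 'Alabama'
WHERE state_code = 'AL'

UPDATE Cities                     /* the redundant copy must be fixed too */
SET state_name = 'Alabama'
WHERE state_code = 'AL'

Under the fully normalized design, the first UPDATE alone would have done the job.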





The point we want to make here is that denormalizing only appears to make things faster. That is, when we deliberately denormalized, we failed to make an improvement overall. In fact, denormalizing improved one thing but caused an offsetting problem to pop up somewhere else.


Relational proponents like to notice that sort of thing. They say it proves that you might as well optimize for a broad range of cases, because when you optimize for a single case, you muck things up elsewhere. And here we can address the claim that "it is practical" to denormalize. That's true, but "it" is a singular pronoun. Normalizing is an attempt to optimize generally. Denormalizing is an attempt to optimize for a particular application, to introduce a bias.


By following normalization rules, you will be splitting a table. For example, where the original table was Table1 (column1, column2, column3, column4), you end up with Table1 (column1, column2, T2_ID) and Table2 (T2_ID, column3, column4); a DDL sketch follows the list below. The advantages of this arrangement are threefold:




Tables are narrower. Therefore, scans are quicker.




There are more tables. Therefore, partitioning is easier.




Irrelevant information is separated out. Therefore, a data change affecting column4 won't block a transaction affecting column1; locks will be less frequent.
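Here is that split as a sketch (the data types are placeholders):

CREATE TABLE Table2 (
   T2_ID INTEGER PRIMARY KEY,
   column3 CHARACTER(10),
   column4 CHARACTER(10))

CREATE TABLE Table1 (
   column1 INTEGER PRIMARY KEY,
   column2 CHARACTER(10),
   T2_ID INTEGER REFERENCES Table2)

A query that needs all four columns joins the two tables on T2_ID; a query that touches only column1 and column2 scans the narrower Table1.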




The Bottom Line: Normalization




"There is basically no need for OLTP denormalization today."


Richard Yevich and Susan Lawson, DB2 High Performance Design and Tuning, Prentice Hall PTR




Normalizing means smaller rows. Smaller rows mean faster retrievals because (a) the chance of overflowing the operating system cache is smaller, (b) a single page read retrieves more records into cache, and (c) byte-by-byte comparisons involve fewer bytes. On the other hand, (a) fragmentation is more likely as rows are updated or deleted, (b) string comparisons always involve an initial read of the size (the parser cannot decide in advance how many repetitions to plug into the loop that it produces), and (c) row position within a file can't be calculated in advance, because a row's start is not a simple matter of saying (row size * row number).


Normalizing also has these effects: (a) table scans for a particular column are twice as fast (on average) because the number of rows per page is greater, (b) parallel queries are possible, (c) fetch and lock times are reduced, and (d) INSERTs are slowed by primary-key/foreign-key integrity checks.


Normalizing helps design and programming, and it makes things more efficient too (at least as far as 3NF, which is easy to reach with CASE tools). The downside is that too many slow joins happen if you ask for all information related to an entity. But really, you should be thinking about making the join faster, or cutting out unnecessary joins, rather than opting to denormalize.

