SQL in a Nutshell, 2nd Edition [Electronic resources] نسخه متنی

1.1 The Relational Model and ANSI SQL

Relational Database Management Systems (RDBMSs), such as those
covered in this book, are the primary engines of information systems
worldwide, particularly web applications and distributed
client/server computing systems. They enable a multitude of users to
quickly and simultaneously access, create, edit, and manipulate data
without impeding or impacting other users. They also allow developers
to write useful applications to access their resources as well as
provide administrators with the capabilities they need to maintain,
secure, and optimize organizational data resources.

An RDBMS is defined as a system whose users view data as a collection
of tables related to each other through common data values. Data is
stored in tables, which are composed of
rows and columns. Tables of
independent data can be linked (or related) to
one another if they each have unique, identifying columns of data
(called keys) that represent a data value held
in common. E. F.
Codd first described relational database theory in his landmark paper
"A Relational Model of Data for Large Shared Data
Banks," published in the Communications of
the ACM (Association for Computing Machinery) in June,
1970. Under Codd's new relational data model, data
was structured (into tables of rows and columns); manageable using
operations like selections, projections, and joins; and consistent as
the result of integrity rules like keys and referential integrity.
Codd also articulated rules that governed how a relational database
should be designed; the process for applying these rules is now known
as normalization.

1.1.1 Codd's Rules for Relational Database Systems

Codd applied rigorous mathematical theories, primarily set theory, to
the management of data, and he compiled a list of criteria a database
must meet to be considered relational. At its core, the relational
database concept centers around storing data in tables. This concept
is now so common as to seem trivial; however, not long ago designing
a system capable of sustaining the relational model was considered a
long shot with limited usefulness. Following are
Codd's Twelve Principles of Relational
Databases :

Information is represented logically in tables.

Data must be logically accessible by table, primary key, and column.

Null values must be uniformly treated as "missing
information," not as empty strings, blanks, or
zeros.

Metadata (data about the database) must be stored in the database
just as regular data is.

A single language must be able to define data, views, integrity
constraints, authorization, transactions, and data manipulation.

Views must show the updates of their base tables and vice versa.

A single operation must be available to do each of the following
operations: retrieve data, insert data, update data, or delete data.

Batch and end-user operations are logically separate from physical
storage and access methods.

Batch and end-user operations can change the database schema without
having to recreate it or the applications built upon it.

Integrity constraints must be available and stored in the metadata,
not in an application program.

The data manipulation language of the relational system should not
care where or how the physical data is distributed and should not
require alteration if the physical data is centralized or
distributed.

Any row processing done in the system must obey the same integrity
rules and constraints that set-processing operations do.

These principles continue to be the litmus test used to validate the
"relational" characteristics of a
database platform; a database that does not meet all of these rules
is not fully relational. While these rules do not apply to
applications development, they do determine whether the database
engine itself can be considered truly
"relational." Currently, most
commercial RDBMS products pass Codd's test. Among
the platforms discussed in SQL in a Nutshell,
Second Edition, only MySQL fails to support all of these
requirements.

Understanding Codd's principles assists programmers
and developers in the proper development and design of Relational
Databases (RDBs). The following sections detail how some of these
requirements are met within SQL using RDBs.

1.1.1.1 Data structures (rules 1, 2, and 8)

Codd's rules 1 and 2 state that
"information is represented logically in
tables" and that "data must be
logically accessible by table, primary key, and
column." So the process of defining a table for a
SQL database does not require that programs instruct the database how
to interact with the underlying physical data structures.
Furthermore, SQL logically isolates the process of accessing data and
physically maintaining that data as required by rule 8,
"batch and end-user operations are logically
separate from physical storage and access methods."

In the relational model, data is shown logically as a two-dimensional
table that describes a single entity (for
example, business expenses). Academics refer to tables as
entities and to columns as
attributes. Tables are composed of
rows, or records (academics
call them tuples), and
columns (called attributes,
since each column of a table describes a specific attribute of the
entity). The intersection of a record and a column provides a single
value. The column or columns whose values
uniquely identify each record can act as a primary
key. These days this representation seems elementary, but
it was actually quite innovative when it was first proposed.

SQL2003 defines a whole data
structure hierarchy beyond simple tables, though tables are the core
data structure. Relational design handles data on a table-by-table
basis, not on a record-by-record basis. This table-centric
orientation is the heart of set programming. Consequently, almost all
SQL commands operate much more efficiently against sets of data
within or across tables than against individual records. Said another
way, effective SQL programming requires that you think in terms of
sets of data, rather than of individual rows.

Figure 1-1 is a description of the SQL2003
terminology used to describe the hierarchical data structures used by
a relational database:
clusters
contain sets of
catalogs;
catalogs contain sets of
schemas;
schemas contain sets of
objects,
such as
tables
and
views;
and tables are composed of sets of
columns
and
records.

Figure 1-1. SQL2003 dataset hierarchy

For example, in a Business_Expense
table, a column called Expense_Date
might show when an expense was incurred. Each record in the table
describes a specific entity; in this case, everything that makes up a
business expense (when it happened, how much it cost, who incurred
the expense, what it was for, and so on). Each attribute of an
expense, in other words each column, is supposed to be
atomic; that is, each column is supposed to
contain one, and only one, value. If a table is constructed in which
the intersection of a row and column can contain more than one
distinct value, then one of SQL's primary design
guidelines has been violated. (Naturally, some of the database
platforms discussed in this book allow you to place more than one
value into a column, via VARRAY or
TABLE datatypes.)

There are rules of behavior specified for column values. Foremost is
that column values must share a common domain,
better known as a
datatype.
For example, the value ELMER should not be
placed into the Expense_Date field.
The value ELMER is a string, not a date, while
the Expense_Date field can contain
only dates. Therefore, this column would be defined as having a
DATE datatype. In addition, SQL2003 allows
further controls of such values through the application of
constraints and assertions.
(Constraints are discussed in detail later in Chapter 2.) A SQL constraint might limit Expense_Date to expenses less than a year old.

Additionally, data access for all individuals and computer processes
is controlled at the schema level by an
AuthorizationID
or
user.
Permissions to
specific sets of data may be granted or restricted to each user.

Moreover, SQL databases also employ character
sets and
collations.
Character sets are the "symbols" or
"alphabets" used by the
"language" of the data. For
example, the American English character set does not contain the
special character for ñ in the
Spanish character set. Collations are sets of sorting rules that
operate on a character set. A collation defines how a given data
manipulation operation sorts data. For example, an American English
character set might be sorted either by
character-order,
case-insensitive, or by
character-order,
case-sensitive.

The ANSI
standard does not say how sorts should be done, only that platforms
must provide common collations found in a particular language.

It is important to know what collation you are using when
writing SQL code against a database platform since it can have a
direct impact on how queries behave, particularly on the behavior of
the WHERE and ORDER BY
clauses of SELECT statements. For example, a
query that sorts data using a binary collation will return data in a
very different order than one that sorts data using, say, an American
English collation.

1.1.1.2 NULLS (rule 3)

Most databases allow any of their supported datatypes to store
NULL values. Inexperienced SQL
programmers and developers tend to think of NULL as zero or blank. In
fact, NULL is neither of these. In SQL2003, NULL literally means that
the value is unknown or indeterminate. (This question
alonewhether NULL should be considered unknown or
indeterminateis the subject of academic debate.) This
differentiation enables a database designer to distinguish between
those entries that represent a deliberately placed zero (for example)
and those where either the data is not recorded in the system or
where a NULL has been explicitly entered. For an example of this
semantic difference, consider a system that tracks payments. A
product with a NULL price does not mean that the product is free;
instead, a NULL price indicates that the amount is not known or
perhaps not yet determined.

There is a good deal of differentiation between the database
platforms in terms of how they handle NULL values. This leads to some
major porting issues between those platforms relating to NULLs. For
example, an empty string (i.e., a NULL string) is inserted as a NULL
value on Oracle. With the exception of Sybase, all the other
databases covered in this book permit the insertion of an empty
string into VARCHAR and
CHAR columns.

One side effect of the
indeterminate nature of a NULL value is that it cannot be used in a
calculation or a comparison. Here are a few brief but very important
rules, from the ANSI standard, to remember about the behavior of NULL
values when dealing with NULLs in SQL statements:

A NULL value cannot be inserted into a
column defined as NOT NULL.

NULL
values are not equal to each other. It is a frequent mistake to
compare two columns that contain NULL and expect the NULL values to
match. (The proper way to identify a NULL value in a
WHERE clause or in a Boolean expression is to
use phrases such as "value IS NULL"
and "value IS NOT NULL".)

A column containing a NULL value is
ignored in the calculation of aggregate values such as
AVG, SUM, or
MAX. COUNT.

When columns that contain NULL values are
listed in the GROUP BY clause of a query, the
query output contains a single row for NULL values. In essence, the
ANSI standard considers all NULLs found to be in a single group.

DISTINCT and
ORDER BY clauses, like GROUP
BY, also see NULL values as indistinguishable from each
other. With the ORDER BY clause, the vendor is
free to choose whether NULL values sort high (first in the result
set) or sort low (last in the result set) by default.

1.1.1.3 Metadata (rules 4 and 10)

Codd's fourth rule for relational databases states
that data about the database (metadata) must be stored in standard
tables, just as all other data.
Metadata
is data that describes the database itself. For example, every time
you create a new table or view in a database, records are created and
stored that describe the new table. Additional records are needed to
store any columns, keys, or constraints on the table. This technique
is implemented in most commercial and open source SQL database
products. For example, SQL Server uses what it calls
"system tables" to track all the
information about the databases, tables, and database objects in any
given database. It also has "system
databases" that keep track of information about the
server on which the database is installed and configured.

1.1.1.4 The language (rules 5 and 11)

Codd's rules do not require SQL
to be used with a relational database. His rules, particularly rules
5 and 11, only specify how the language should behave when coupled
with a relational database. At one time, SQL competed with other
languages (like Digital's RDO or Fox/PRO) that might
have fit the relational bill, but SQL won out for three reasons.
First, SQL is a relatively simple, intuitive, English-like language
that handles most aspects of data manipulation. Second, SQL is
satisfyingly high-level. A programmer or Database Administrator (DBA)
does not have to spend time ensuring that data is stored in the
proper memory registers or that data is cached to disk. The Database
Management System (DBMS) handles that task automatically. Finally,
because SQL is not owned by any single vendor, it was adopted across
a number of platforms.

1.1.1.5 Views (rule 6)

A
view
is a virtual table that does not exist as a physical data repository
of data, but is instead constructed, on the fly, from a
SELECT statement whenever the view is queried.
Views enable you to construct different representations out of the
same source data for a variety of audiences without having to alter
the way in which the data is stored.

Some vendors support database objects called materialized
views. Materialized views are not governed by the same
rules as ANSI standard views.

1.1.1.6 Set operations (rules 7 and 12)

Other database manipulation languages, such as the venerable
Xbase, perform their
data operations quite differently from SQL. These languages require
you to tell the program exactly how to treat the data, one record at
a time. Since the program cycles down through a list of records,
performing its logic on one record after another, this style of
programming is frequently called row
processing or procedural
programming.

In contrast, SQL programs operate on logical
sets of data. Set theory is applied in most all
SQL statements such as SELECT,
INSERT, UPDATE, or
DELETE statements. In effect, data is selected
from a set called a table. Unlike the row processing style,
set processing allows a programmer to tell the
database simply what is required, not
how each individual piece of data should be
handled. Sometimes set processing is referred to as
declarative
processing, since a programmer
declares only what data is wanted, as in "Give me
all employees in the southern region who earn more than $70,000 per
year," rather than describing the exact procedure
used to retrieve or manipulate the data.

Set theory was the brainchild of mathematician Georg Cantor, who
developed it at the end of the nineteenth century. At the time, set
theory (and his theory of the infinite) was quite controversial.
Today, set theory is such a common part of life that it is learned in
elementary school.

Examples of set theory in conjunction with
relational databases are detailed in the following section.

1.1.2 Codd's Rules in Action: Simple SELECT Examples

Up to this point, the chapter has been about the individual aspects
of a relational database platform as defined by Codd and implemented
under ANSI SQL. This section presents a high-level overview of the
most important SQL statement, SELECT, and
some of its most salient pointsnamely, the relational
operations known as projections,
selections, and joins:

Projection

Retrieves specific columns of data.

Selection

Retrieves specific rows of data.

Join

Returns columns and rows from two or more tables in a single result
set.

Although at first glance it might appear that the
SELECT statement deals only with the relational
selection operation, in actuality, SELECT
embodies all three operations. (Refer to SELECT Statement.)

The following statement embodies the projection operation by
selecting the first and last names of an author, plus his home state,
from the authors table:

SELECT au_fname, au_lname, state
FROM   authors

The results from any such SELECT statement are
presented as another table of data:

au_fname             au_lname                            state
-------------------- ----------------------------------- ----- 
Johnson              White                               CA
Marjorie             Green                               CA    
Cheryl               Carson                              CA    
Michael              O'Leary                             CA
Meander              Smith                               KS    
Morningstar          Greene                              TN    
Reginald             Blotchet-Halls                      OR    
Innes                del Castillo                        MI

The resulting data is sometimes called a result
set, work table, or a
derived table, differentiating it from the
base table in the database that is the target of
the SELECT statement.

It is important to note that the relational operation of projection,
and not selection, is specified using the SELECT
clause (that is, the keyword SELECT followed by
a list of expressions to be retrieved) of a
SELECT statement. Selection, the operation of
retrieving specific rows of data, is specified using the
WHERE clause in a SELECT
statement. WHERE filters out unwanted rows of
data and retrieves only the requested rows. Using the previous
example, the following selects authors from states other than
California:

SELECT au_fname, au_lname, state
FROM   authors
WHERE  state <> 'CA'

The previous query retrieved all authors; the result of this query is
a much smaller subset of records:

au_fname             au_lname                            state 
-------------------- ----------------------------------- ----- 
Meander              Smith                               KS    
Morningstar          Greene                              TN    
Reginald             Blotchet-Halls                      OR    
Innes                del Castillo                        MI

By combining the capabilities of projection and selection together in
a single query, you can use SQL to retrieve only the columns and
records that you need at any given time.

Joins are the next, and last, relational operation
we're going to talk about in this section. A join
relates one table to another in order to return a result set
consisting of related data from both tables.

Different vendors allow varying numbers of tables to join in a single
join operation. For example, Oracle places no limit on the number of
joins, while Microsoft SQL Server allows up to 256 tables in a join
operation.

The ANSI standard method of performing joins is to
use the JOIN clause in a
SELECT statement. An older method, know as a
theta join, performs the join analysis in the
WHERE clause. The following example shows both
approaches. Each statement retrieves employee information from the
employee base table as well as job
descriptions from the jobs base
table. The first SELECT uses the newer, ANSI
JOIN clause, while the second
SELECT uses a theta join:

-- ANSI style
SELECT a.au_fname, a.au_lname, t.title_id
FROM   authors AS a
JOIN   titleauthor AS t ON a.au_id = t.au_id
WHERE  a.state <> 'CA'
-- Theta style
SELECT a.au_fname, a.au_lname, t.title_id
FROM   authors AS a,
titleauthor AS t 
WHERE  a.au_id = t.au_id
AND a.state <> 'CA'

For more information about joins, refer to JOIN Subclause.

SQL in a Nutshell, 2nd Edition [Electronic resources] نسخه متنی

فارسی

کردی

العربیه

اردو

Türkçe

Русский

English

Français

کانال فیلم من

تبیان من

فایلهای من

کتابخانه من

پنل پیامکی

وبلاگ من

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی