9.2 Data Warehouse Design
The
database serves as the foundation of the data warehouse: it is the
place where the data is stored. But there is more to a data warehouse
than simply dataa data warehouse becomes useful only when
business users properly use data stored within it. This may seem like
a trivial point, but we've seen numerous companies
build data warehouses without consulting business users to determine
business needs; thus, the database ends up with very few users,
little activity, and no business intelligence is gained.Assuming that your warehouse is well planned and there is a demand
for the data, your next challenge will be to figure out how to handle
the demand. You will be faced with the need to design your data
warehouse to deliver appropriate performance to your
usersperformance that may initially seem far beyond the
capabilities of your system, because the information requested from
the data warehouse can involve summaries and comparisons of massive
amounts of detailed data.When you start designing the data warehouse, also remember that a
data warehouse is never complete. When the business changes, so too
must the data warehouse. Thus, the ability to track changes through
metadata stored in a repository often becomes critical in design
phases.Various design tools can provide this capability.
Oracle's Warehouse
Builder provides a metadata repository and also the capability to
import metadata from operational tables and then forward-engineer new
schema and tables. A data warehouse designer creates columns for the
new tables and builds constraints for the new schema. Maps are then
created between source and target columns with appropriate
transformations. DML scripts for creation of new tables, and PL/SQL
or SQL*Loader scripts for extraction, transformation, and loading
(ETL) are automatically generated.As noted previously, data warehouses historically have had a
different set of usage characteristics from those of an OLTP
database. One aspect that makes it easier to meet data warehousing
performance requirements is the high percentage of read operations.
Oracle's locking model, described in detail in
Chapter 7, is ideally suited to support data warehouse operations. Oracle
doesn't place any locks onto data
that's being read, thus reducing contention and
resource requirements for situations in which there are a lot of
database reads. Oracle is consequently well-suited to act as the
repository for a data warehouse. Because locks don't
escalate, Oracle is also extremely appropriate for near real-time
data feeds into the warehouse in a scenario not unlike some OLTP
implementations.Warehousing usage characteristics also lead to different types of
design schema. In OLTP databases, transaction data is usually stored
in multiple tables and data items are stored only once. If a query
requests data from more than one transaction table, the tables are
joined together. Typically, the database query optimizer decides
which table to use as the starting point for the join, based on the
assumption that the data in the tables is essentially equally
important.A warehouse schema may use a different
model. Although Oracle-based data warehouses may be modeled as third
normal form (3NF), when business users need an understandable schema
to formulate their own ad-hoc queries or analytical processing is
required, key transaction data may be more appropriately stored in a
central fact table, surrounded by dimension or look-up tables, as
shown in Figure 9-2. The fact table can contain summarized data
for data items duplicated elsewhere in the warehouse, and
dimension
tables can contain multiple hierarchies. As companies consolidate
their data marts into enterprise data warehouses, many now deploy a
variation called a hybrid
schema, a mixture of third normal form and
star schema.Ralph
Kimball, author of the widely read book The Data Warehouse
Toolkit (see Appendix B for details), is largely credited
with discovering that users of data warehouses typically pose their
queries in such a manner that the star schema, shown in Figure 9-2, is an
appropriate model to use. A typical query might be something like the
following:
Show me how many sales of computers (a product type) were sold by a
store chain (a channel) in Wisconsin (a geography) over the past 6
months (a time).
The schema in Figure 9-2 shows a relatively large sales transactions
table (called a fact
table) surrounded by smaller tables (called
dimensions or look-up
tables). The query just described is often
called
multidimensional, because several dimensions are
included (time is almost always one of them). Because these queries
are typical in a data warehouse, the recognition of the star schema
by a cost-based optimizer can deliver enormous performance benefits.
Figure 9-2. Typical star schema
