Oracle Essentials: Oracle Database 10g, 3rd Edition (text version)

Jonathan Stern



9.6 Other Data Warehouse Software


A data warehouse
isn't necessarily built with a single software
product, nor is it simply a database. In addition to the database
capabilities we've described, if
you're going to build an effective data warehouse
topology like the one we've outlined, your software
must provide the following functionality:

Extract data from operational data sources


The extraction process is fairly
straightforward but can be significantly optimized, because it
typically involves moving large amounts of data.


Transform and/or cleanse data


Because the data in a data warehouse can come from many different
sources, the data must frequently be converted into a common format.
Because of inherent problems in legacy systems, the original data may
also have to be cleansed by eliminating or modifying invalid values.


Transport data from OLTP systems to the data warehouse/marts


As with the extraction process, the transportation process can be
optimized for performance.


Provide basic reporting


Basic reporting against the warehouse gives you the same reporting
capabilities you would have against an operational database.


Provide OLAP for multidimensional analysis


OLAP tools give you the specialized type of analysis you need to spot
business changes and trends. These tools typically provide a way to
translate multiple dimensions into the two-dimensional world of a
screen and give you the ability to easily rotate those dimensions.


Provide data mining


Data mining is a process in which the data in a data warehouse is
modeled against known outcomes, and the models can lead to new
business opportunities.


Provide a way to store and retrieve metadata


A metadata repository can be used not only to store the descriptive
data, but also to enable extended management services such as
versioning and impact analysis.



The following sections provide details on how each of these
requirements can be met.


9.6.1 Extraction, Transformation, and Loading


The
first three requirements described in the previous list are often
handled by what are called ETL tools (for extraction, transformation,
and loading). Those who are new to data warehousing may be tempted to
assume that the data in the operational sources is clean and that
transformations can be ignored.
Unfortunately, this is rarely the case; the process of
understanding the data sources, designing the transformations,
testing the loading process, and debugging is often the most
time-consuming part of deployment. Transformations generally remove
bogus data (including erroneous entries and duplicate entries),
convert data items to an agreed-upon format, and filter data not
considered necessary for the warehouse. These operations not only
improve the quality of the data, but also frequently reduce the
overall amount of data, which, in turn, improves data warehouse
performance.

The frequency with which the extraction occurs is largely determined
by the required timeliness of the data in the warehouse. Most
extraction and loading takes place on more of a
"batch" basis with a known time
delay (typically hourly or daily, but sometimes even weekly or
monthly). Many early warehouses were completely refreshed during the
loading
process. As data volumes grew, this became impractical due to limited
time frames available for loading. Today, updates to tables are more
common. When a need for near real-time data exists, warehouses can be
loaded nearly continuously using a trickle feed.


Is Cleanliness Best?


Once the data in the warehouse is
"clean," is this version of the
true nature of the data propagated back to the originating OLTP
systems? This is an important issue for data warehouse
implementation. In some cases, a "closed
loop" process is implemented whereby updates are
provided back to the originating systems. In addition to minimizing
some of the cleansing that takes place during future extractions,
operational reports thus become more accurate.

Another viable option is to avoid cleansing by improving the quality
of the data at the time it's input into the
operational system. This can sometimes be accomplished by not
allowing a "default" condition as
allowable input into a data field. Presenting the data-entry person
with an array of valid options, one of which
must be selected, is often a way to ensure the
most consistent and valid responses. Many companies also
educate the data-entry people at this point, showing them how the
data they're keying in will be used and what its significance is.
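As a minimal sketch of enforcing valid input at the source (the table
and column names here are hypothetical), a NOT NULL column with a
CHECK constraint lets the entry screen present a pick list while the
database rejects defaults and free-form values:

```sql
-- Hypothetical order-entry table: region must be one of a fixed set
-- of values, so no "default" or invalid entry can reach the warehouse.
CREATE TABLE orders (
  order_id    NUMBER       PRIMARY KEY,
  region      VARCHAR2(10) NOT NULL
              CHECK (region IN ('NORTH', 'SOUTH', 'EAST', 'WEST')),
  order_date  DATE         DEFAULT SYSDATE NOT NULL
);
```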

Simple
extraction and transportation of data is possible using one of
several Oracle database features:

Transparent Gateways and Heterogeneous Services


Provide
a bridge to retrieve data from non-Oracle sources using Oracle SQL to
load an Oracle database. Heterogeneous Services provides Open
Database Connectivity (ODBC) access to non-Oracle relational
sources. Gateways can optionally provide a higher level of
performance when extracting data from non-Oracle sources.
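As a sketch of extraction through such a connection (the TNS alias,
link, and table names are hypothetical, and the alias is assumed to
be configured by the DBA), once a database link is defined, the
remote source is queried with ordinary Oracle SQL:

```sql
-- Database link pointing at a non-Oracle source reached through
-- Heterogeneous Services or a Transparent Gateway.
CREATE DATABASE LINK legacy_link
  CONNECT TO scott IDENTIFIED BY tiger
  USING 'legacy_db';

-- Extract into a staging table with plain SQL over the link.
CREATE TABLE stage_customers AS
  SELECT * FROM customers@legacy_link;
```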


Transportable tablespaces


Another feature for data movement,
transportable tablespaces enable rapid data movement between Oracle
instances without export/import. Metadata (the data dictionary) is
exported from the source and imported to the target. The transferred
tablespace can then be mounted on the target. Oracle Database
10g introduces cross-platform transportable
tablespaces, which can move a tablespace from one type of system
(e.g., Solaris) to another (e.g., Linux).
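The sequence of steps can be sketched as follows (the tablespace name
is hypothetical; the metadata export and datafile copy are operating
system steps, not SQL):

```sql
-- On the source: make the tablespace read-only so its datafiles are
-- consistent, then export its metadata from the OS command line
-- (e.g., an export run with transport_tablespace=y).
ALTER TABLESPACE sales_data READ ONLY;

-- Copy the tablespace's datafiles to the target system and import
-- the metadata there; then reopen the tablespace for writing.
ALTER TABLESPACE sales_data READ WRITE;
```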


Oracle Streams


Comes bundled
with Oracle since Oracle9i Release 2. Streams
includes Oracle's log-based replication, Advanced
Queuing (AQ), and, in Oracle Database 10g,
transportable tablespaces. It is often used for near real-time data
movement. Oracle Database 10g adds support for
downstream capture, which collects changed data from shipped log
files, reducing overhead on the source system; for instantiation via
RMAN and transportable tablespaces; and for the LONG, LONG RAW, and
NCLOB datatypes. Oracle Database 10g also
introduces asynchronous change data capture, which uses Streams to
transport only changed records from a source database to a target.


Data Pump Fast Import/Export


New in Oracle Database
10g and built on the database's external table
support, Data Pump is a high-speed import/export facility. Parallel
direct path loading and unloading is supported.
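A sketch of unloading query results through Data Pump's external
table support (the directory path and table names are hypothetical):

```sql
-- Directory object naming the OS path where dump files are written.
CREATE DIRECTORY dump_dir AS '/u01/dumps';

-- Unload the result of a query into a Data Pump file by creating an
-- external table with the ORACLE_DATAPUMP access driver.
CREATE TABLE sales_unload
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_DATAPUMP
    DEFAULT DIRECTORY dump_dir
    LOCATION ('sales.dmp')
  )
  AS SELECT * FROM sales;
```

The resulting file can then be read on another database as an
external table or imported with Data Pump import.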



Each of these database features is typically used for
high-performance data transfers, not (by itself) for difficult
transformations. Oracle Warehouse Builder (OWB) is
Oracle's ETL tool used for building maps from
extraction sources, through predefined or custom transformations to
target tables. OWB can then be used to automatically generate the
scripts needed to perform the ETL. More than just an ETL tool, OWB
can also be used as a data warehouse design tool and provides a
metadata repository. Designs may also be imported from a variety of
design tools, such as Oracle Designer, CA's ERwin,
and Sybase PowerDesigner.

In most warehouse building, metadata is first imported that describes
source tables, including Oracle (via database links) and other RDBMS
systems (through ODBC or gateways) and flat files. Target tables are
designed or imported, and source metadata is mapped to target
metadata, including transformations (see Figure 9-4 for an example).
OWB's basic set of transformations includes a name
and address cleansing operator for use with Oracle
partners' libraries and applications that perform
"householding," matching, and
merging of data.

OWB can then validate the source-to-target mappings. Once validated,
you can then generate any of the following:

DDL, if target tables are to be created

SQL*Loader control files, for the loading of flat files

PL/SQL scripts, for ETL from relational sources



Figure 9-4. Typical Oracle Warehouse Builder source table-to-target table mapping


Scripts are deployed to, and run at, the target data warehouse,
typically scheduled using the Enterprise Manager job scheduler. For
more complex scheduling of ETL jobs where certain prerequisites must
be met, OWB includes an interface to Oracle Workflow.

OWB provides access to a number of other non-Oracle sources. An SAP
Integrator provides SQL access to SAP tables stored in an Oracle
RDBMS, and an ABAP code generator builds access to any SAP table on
any database, including cluster tables, through an RFC connection.

For high-speed loading of flat files, Oracle SQL*Loader's
direct path load option provides rapid loading
by bypassing the buffer cache and rollback mechanism and writing
directly to the datafile. You can run SQL*Loader sessions in parallel
to further speed the table loading process (as many warehouses need
to be loaded in a limited "window"
of time). Many popular extraction tools, including OWB, generate
SQL*Loader scripts.
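A minimal control file for such a load might look like the following
sketch (the file, table, and column names are hypothetical; direct
path is requested on the command line, e.g. sqlldr userid=scott
control=sales.ctl direct=true):

```sql
-- sales.ctl: load a comma-delimited flat file, appending rows to
-- the sales table.
LOAD DATA
INFILE 'sales.dat'
APPEND INTO TABLE sales
FIELDS TERMINATED BY ','
(sale_id,
 region,
 amount,
 sale_date DATE 'YYYY-MM-DD')
```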

Oracle9i first added key ETL functionality in
the core database engine including support for external tables, table
functions, merge (i.e., insert or update depending on the existence
of a data item), multi-table inserts, change data capture, and
resumable statements. Today, OWB can be used to leverage this
functionality. Additionally, OWB can create trickle feeds through
Oracle's AQ, a component of Streams.
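Two of those ETL features can be sketched briefly (all table and
column names here are hypothetical). MERGE updates rows that already
exist and inserts those that don't in a single pass, and a
multi-table insert routes rows to different targets in one statement:

```sql
-- "Upsert" staged customer rows into the warehouse dimension table.
MERGE INTO dw_customers d
USING stage_customers s
   ON (d.customer_id = s.customer_id)
 WHEN MATCHED THEN
   UPDATE SET d.name = s.name, d.region = s.region
 WHEN NOT MATCHED THEN
   INSERT (customer_id, name, region)
   VALUES (s.customer_id, s.name, s.region);

-- Multi-table insert: split one source scan across two targets.
INSERT ALL
  WHEN amount >= 10000 THEN INTO big_sales   VALUES (sale_id, amount)
  WHEN amount <  10000 THEN INTO small_sales VALUES (sale_id, amount)
SELECT sale_id, amount FROM stage_sales;
```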


9.6.2 Reporting and Ad-Hoc Query Tools


Marketing, financial, and other business
analysts are rarely interested in the storage and schema that hold
their information. Their interest levels rise when the discussion
turns to the tools they'll be using. Business
intelligence tools are often evaluated and purchased within
individual business areas, sometimes without close IT coordination.
Oracle database customers have a choice of Oracle's
suite of business intelligence tools or popular independent
vendors' products such as Hyperion/Brio, Business
Objects, Cognos, and MicroStrategy. Oracle bundles the business tools
it provides in the Oracle Application Server.

Of the tools users need, the least-glamorous but most-often-accessed
are reports.
Business analysts may wish to avoid reporting software if it requires
help from the IT staff, especially if the process of creating and
deploying the reports is more time-consuming and less flexible than
manipulating the data directly. However, reports appeal to executives
within a company because they require no special IT skills or
knowledge. The distribution of such reports has been greatly improved
as web-based report generation becomes commonplace and the user needs
only familiarity with a browser.

Oracle's tool for generating reports,
Oracle
Reports, has a wizard-based frontend for building reports that can
then be deployed to the Web for access as Adobe Acrobat, plain text,
or HTML files. With this tool, you can cache reports on a middle-tier
server for better performance. The tool also provides some limited
drill-down search capabilities, in which a user can ask for more
detail about a particular portion of a report.

More sophisticated users need to pose
"what-if" questions, such as
"what sales volume occurred in the Midwest
sub-regions in 2004 and how did sales compare to
2003?" Such users need ad-hoc query tools. Initially
developed as client/server tools, many are now web-based and provide
similar functionality to the earlier reporting tools. A common theme
has been to make these tools much more graphically intuitive. Queries
are typically posed by selecting icons that represent tables and
items. Users need only know where their data is stored and what it
represents; the SQL is generated behind the scenes.
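The Midwest question above, for example, translates into a star query
of roughly the following shape, which is the kind of SQL an ad-hoc
tool generates on the user's behalf (the fact and dimension table
names are hypothetical):

```sql
-- Compare 2003 and 2004 sales for the Midwest sub-regions by joining
-- the sales fact table to geography and time dimensions.
SELECT g.sub_region,
       t.year,
       SUM(s.amount) AS total_sales
FROM   sales     s
JOIN   geography g ON s.geo_id  = g.geo_id
JOIN   time_dim  t ON s.time_id = t.time_id
WHERE  g.region = 'Midwest'
AND    t.year IN (2003, 2004)
GROUP  BY g.sub_region, t.year
ORDER  BY g.sub_region, t.year;
```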

Oracle
Discoverer, a web-based ad-hoc query tool, provides an easy-to-use
frontend for picking and choosing data items around which to build
queries and is typically used by business users. Users can generate
their own reports and deploy them to the Web as HTML files.
Discoverer has a query governor that can predict the amount of time a
query will take based on comparisons in records of previous queries
kept in the database server. IT can use a separate administrative
layer to place limits on the time allowed to perform queries, because
an analyst can (either intentionally or accidentally) request
incredibly complex and lengthy queries for a particular report. This
type of request could result in slow delivery of results and
negatively impact overall database performance.

Oracle Discoverer is most often used for querying an Oracle
relational database, but ODBC support enables this tool to be used
with other databases.

The Portal product, a part of Oracle Application Server, provides an
integration point for custom-built business intelligence applications
using Oracle Reports, Discoverer, and other tools. It also provides
access to a number of other applications and web sites through its
interface, and it is highly customizable by users.


9.6.3 OLAP and OLAP Applications Building


As business users become more sophisticated, their questions evolve
from "what happened" to
"what trends are present and what might happen in
the future?" OLAP tools provide
the ability to handle time-series and mathematical analysis for
understanding past trends and forecasting the future.

OLAP initially grew around the early inability of relational
databases to effectively handle multidimensional queries (described
above in Section 9.2). The first OLAP tools were packaged with
their own data "cubes," into which data
was downloaded from relational sources.

These separate database engines are called Multidimensional On-Line Analytical
Processing engines, or MOLAP engines. Examples
include Oracle's Express Server and Hyperion
Essbase, as well as
the Microsoft offering that is part of SQL Server. These MOLAP
engines handle queries extremely quickly and work best when the data
is not updated frequently (because the cube-generation process takes
time).

Today, OLAP tools are used more often with relational databases than
MOLAP engines, because star queries are supported to various degrees
in many databases and because there is an increased need for very
frequently updated data. When used in this fashion, the interaction
is called ROLAP, which
stands for Relational On-Line Analytical Processing. Tools that can
work against either relational databases or MOLAP engines are
sometimes referred to as hybrid tools.

Oracle's OLAP functionality now resides in the
Oracle relational database management system (RDBMS). For ROLAP
deployment, Oracle Discoverer and other tools can leverage ANSI
standard analytic functions built into the database as SQL
extensions. Discoverer can also leverage the RDBMS materialized views
and an analyzer component provides a frontend to the database OLAP
option.
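As a sketch of those analytic SQL extensions (the schema is
hypothetical), ranking and share-of-total calculations that once
required a separate OLAP engine can be expressed directly in a query:

```sql
-- Rank regions by sales and compute each region's share of the
-- total using ANSI analytic functions.
SELECT g.region,
       SUM(s.amount)                             AS region_sales,
       RANK() OVER (ORDER BY SUM(s.amount) DESC) AS sales_rank,
       RATIO_TO_REPORT(SUM(s.amount)) OVER ()    AS share_of_total
FROM   sales s
JOIN   geography g ON s.geo_id = g.geo_id
GROUP  BY g.region;
```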

Popular business intelligence tools can also access the OLAP Option,
a MOLAP cube within the relational database, via SQL (since
Oracle9i Release 2). Custom OLAP applications,
formerly built via Oracle's Express Objects, are now
built using Oracle's
JDeveloper and
business
intelligence beans. The Java beans provide prebuilt
components for manipulating tables, crosstabs, and graphs, and for
building queries and calculations similar to the functionality
previously found in Express. JDeveloper generates Java code using
these building blocks that map to the Java OLAP API provided by
Oracle's OLAP Option. The Oracle Database
10g release adds the capability to use XML to
describe the OLAP multidimensional data model.


9.6.4 Data Mining


Data
mining, an often overused and misunderstood term in data
warehousing, is the use of mathematical algorithms to model
relationships in the data that wouldn't be apparent
by using any other tools. Most companies shouldn't
approach data mining unless analysts have met the following criteria:

An understanding of the quality and meaning of the data in the
warehouse

Business insight gained using other tools and the warehouse

An understanding of a business issue being driven by too many
variables to model outcomes in any other way


In other words, data-mining tools are not a replacement for the
analytical skills of data warehouse users.

The data-mining tools themselves can rely on a number of techniques
to produce the relationships, such as:

Extended statistical algorithms, provided by statistical tools
vendors, that can highlight statistical variations in the data.

Clustering techniques that show how business outcomes can fall into
certain groups, such as insurance claims versus time for various age
brackets. In this example, once a low-risk group is found or
classified, further research into influencing factors or
"associations" might take place.

Logic models (if A occurs, then B or C are possible outcomes)
validated against small sample sets and then applied to larger data
models for prediction.

Neural networks "trained" against
small sets, with known results to be applied later against a much
larger set.

Visualization techniques used to graphically plot variables and
understand which variables are key to a particular outcome.


Data mining has been used successfully to solve difficult business
problems such as fraud detection and churn in micro-opportunity
marketing, as well as in other areas where many variables can
influence an outcome. Companies servicing credit cards use data
mining to track unusual usage: for example, the unexpected
charging to a credit card of expensive jewelry in a city not normally
traveled to by the cardholder. Discovering clusters of unusual buying
patterns within certain small groups might also drive
micro-opportunity marketing campaigns aimed at small audiences with a
high probability of purchasing products or services.

A recent
trend among relational database providers is tighter integration of
data-mining algorithms into the relational database.
Oracle's data-mining strategy initially leveraged a
traditional client/server version of a product called
Oracle Darwin
to provide algorithms for modeling associations, neural networks,
classification and regression trees, and clusters against Oracle
tables or flat files. Oracle began to embed these algorithms in the
Oracle9i database. Algorithms now embedded in
the database include Naïve Bayes, Associations, Adaptive
Bayes Networks, Clustering, Support Vector Machines (SVM), and
Nonnegative Matrix Factorization (NMF). The algorithms are accessible
via a Java API; however, they may prove most useful when leveraged by
Oracle products such as Oracle Personalization, for real-time web
site recommendations. New data mining capabilities in Oracle Database
10g include text mining (providing document
clustering and classification) and BLAST similarity searches,
leveraging the new SVM algorithms (common in genetic research). Data
mining applications can be custom built using
Oracle's JDeveloper in combination with
DM4J. DM4J is used to
develop, test, and score the models. It provides the ability to
define metadata, tune the generated Java code, view generated XML
files, and test application components.
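Alongside the Java API, the embedded algorithms can be driven from
PL/SQL. The following is a hedged sketch of building a classification
model with the DBMS_DATA_MINING package (the model, table, and column
names are hypothetical; a separate settings table selects the
algorithm, e.g. SVM):

```sql
-- Build a classification model predicting customer churn from a
-- history table; one row per customer, target column "churned".
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'churn_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'customer_history',
    case_id_column_name => 'customer_id',
    target_column_name  => 'churned',
    settings_table_name => 'churn_settings');
END;
/
```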


9.6.5 Business Intelligence Applications


Business intelligence applications
are prebuilt solutions providing extended reporting and
"dashboard-like" interfaces to
display business trends. The applications may directly access OLTP
applications by leveraging Oracle materialized views, such as
Oracle's Daily Business Intelligence, or more
traditional data warehouses, such as SAP Business Warehouse and most
custom deployment topologies.

The new business intelligence applications from software vendors
focus on specific areas of business expertise, such as marketing or
financial analysis. These applications include predefined queries,
reports, and charts that deliver the kind of information required for
a particular type of business analysis while sparing the end user the
complexity of creating these objects from scratch.

Oracle ERP Applications Intelligence modules provide a
mixture of access into Oracle transactions tables (for Daily Business
Intelligence) and into Oracle's Enterprise Data
Warehouse (EDW). Access is commonly through prebuilt Oracle
Discoverer workbooks containing prepopulated business metadata.
Oracle intelligence modules include Financial, Purchasing, Project,
Manufacturing, Supply Chain, Marketing, and Human Resources
Intelligence. The EDW can leverage customized Oracle Applications
Flexfields through an Oracle Applications Integrator available for
Oracle Warehouse Builder. It can also be extended to extract and load
data from non-Oracle sources through OWB. Oracle also offers other
useful applications, including a Balanced Scorecard for measuring and
displaying progress toward meeting business objectives and key
performance indicators.

The promise of tightly integrated solutions, such as the Intelligence
modules, is that they will provide easier-to-deploy solutions with
more out-of-the-box functionality. While some customization will
probably always be needed, the time required to deploy an initial and
useful solution should be substantially reduced.

