Oracle Essentials [Electronic resources] : Oracle Database 10g, 3rd Edition

Jonathan Stern


9.6 Other Data Warehouse Software

A data warehouse isn't necessarily built with a single software product, nor is it simply a database. In addition to the database capabilities we've described, if you're going to build an effective data warehouse topology like the one we've outlined, your software will provide the following functionality:

Extract data from operational data sources

The extraction process is fairly straightforward but can be significantly optimized, because it typically involves moving large amounts of data.

Transform and/or cleanse data

Because the data in a data warehouse can come from many different sources, the data must frequently be converted into a common format. Because of inherent problems in legacy systems, the original data may also have to be cleansed by eliminating or modifying invalid values.

Transport data from OLTP systems to the data warehouse/marts

As with the extraction process, the transportation process can be optimized for performance.

Provide basic reporting

Basic reporting gives warehouse users the same standard reporting capabilities they would have against any operational database.

Provide OLAP for multidimensional analysis

OLAP tools give you the specialized type of analysis you need to spot business changes and trends. These tools typically provide a way to translate multiple dimensions into the two-dimensional world of a screen and give you the ability to easily rotate those dimensions.

Provide data mining

Data mining is a process in which the data in a data warehouse is modeled against known outcomes; the resulting models can point to new business opportunities.

Provide a way to store and retrieve metadata

A metadata repository can be used not only to store the descriptive data, but also to enable extended management services such as versioning and impact analysis.

The following sections provide details on how each of these requirements can be met.

9.6.1 Extraction, Transformation, and Loading

The first three requirements described in the previous list are often handled by what are called ETL tools (for extraction, transformation, and loading). Those who are new to data warehousing might assume that the data in the operational sources is clean and that transformations can be ignored. Unfortunately, this is rarely the case: the process of understanding the data sources, designing the transformations, testing the loading process, and debugging is often the most time-consuming part of deployment. Transformations generally remove bogus data (including erroneous and duplicate entries), convert data items to an agreed-upon format, and filter out data not considered necessary for the warehouse. These operations not only improve the quality of the data, but also frequently reduce the overall amount of data, which, in turn, improves data warehouse performance.
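Transformations of this kind are often expressed as SQL run against a staging area. The following is a minimal sketch of such a cleansing pass; the table and column names (stg_customers, dw_customers, and so on) are illustrative, not taken from any particular warehouse:

```sql
-- Hypothetical staging-to-warehouse cleansing pass:
INSERT INTO dw_customers (customer_id, name, state_code)
SELECT s.customer_id,
       INITCAP(TRIM(s.name)),                      -- normalize formatting
       CASE WHEN s.state_code IN ('N/A', 'UNKNOWN')
            THEN NULL                              -- map invalid codes to NULL
            ELSE UPPER(s.state_code) END
FROM   stg_customers s
WHERE  s.customer_id IS NOT NULL                   -- drop unusable rows
AND    s.rowid = (SELECT MIN(d.rowid)              -- keep one copy of duplicates
                  FROM   stg_customers d
                  WHERE  d.customer_id = s.customer_id);
```

In practice such logic is generated by an ETL tool rather than hand-written, but the generated scripts reduce to statements of this shape.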

The frequency with which the extraction occurs is largely determined by the required timeliness of the data in the warehouse. Most extraction and loading takes place on more of a "batch" basis with a known time delay (typically hourly or daily, but sometimes even weekly or monthly). Many early warehouses were completely refreshed during the loading process. As data volumes grew, this became impractical due to limited time frames available for loading. Today, updates to tables are more common. When a need for near real-time data exists, warehouses can be loaded nearly continuously using a trickle feed.

Is Cleanliness Best?

Once the data in the warehouse is "clean," should this truer version of the data be propagated back to the originating OLTP systems? This is an important issue for data warehouse implementation. In some cases, a "closed loop" process is implemented whereby updates are provided back to the originating systems. Doing so not only minimizes some of the cleansing needed during future extractions, but also makes operational reports more accurate.

Another viable option is to avoid cleansing by improving the quality of the data at the time it's input into the operational system. This can sometimes be accomplished by not allowing a "default" condition as input into a data field. Presenting the data-entry person with an array of valid options, one of which must be selected, often ensures the most consistent and valid responses. Many companies also educate data-entry staff at this stage, showing them how the data they're keying in will be used and why it is significant.

Simple extraction and transportation of data is possible using one of several Oracle database features:

Transparent Gateways and Heterogeneous Services

Provide a bridge to retrieve data from non-Oracle sources using Oracle SQL to load an Oracle database. Heterogeneous Services provide Open DataBase Connectivity (ODBC) connectivity to non-Oracle relational sources. Gateways can optionally provide a higher level of performance when extracting data from non-Oracle sources.
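As a sketch of how such a bridge is used, the following assumes a gateway (or Heterogeneous Services ODBC agent) has been configured under the hypothetical service name sybase_gw; once the database link exists, the remote table is queried with ordinary Oracle SQL:

```sql
-- Hypothetical names throughout; the link resolves through a
-- Transparent Gateway or a Heterogeneous Services (ODBC) agent:
CREATE DATABASE LINK sybase_link
  CONNECT TO scott IDENTIFIED BY tiger USING 'sybase_gw';

-- Extract the remote rows into a local staging table:
INSERT INTO stg_orders
SELECT * FROM orders@sybase_link;
```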

Transportable tablespaces

Another feature for data movement, transportable tablespaces enable rapid data movement between Oracle instances without export/import. Metadata (the data dictionary) is exported from the source and imported to the target. The transferred tablespace can then be mounted on the target. Oracle Database 10g introduces cross-platform transportable tablespaces, which can move a tablespace from one type of system (e.g., Solaris) to another (e.g., Linux).
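The transportable tablespace workflow can be sketched as follows; the tablespace and file names are illustrative, and the expdp/impdp commands shown are the Oracle Database 10g Data Pump form:

```sql
-- On the source instance: make the tablespace read-only, export only
-- its metadata, then copy the datafiles to the target system.
ALTER TABLESPACE sales_q1 READ ONLY;
--   expdp system DUMPFILE=sales_q1.dmp TRANSPORT_TABLESPACES=sales_q1

-- On the target instance: plug the copied datafiles in via import,
-- then reopen the tablespace for writing.
--   impdp system DUMPFILE=sales_q1.dmp TRANSPORT_DATAFILES='/u01/app/sales_q1.dbf'
ALTER TABLESPACE sales_q1 READ WRITE;
```

Because only metadata passes through export/import, the elapsed time is dominated by the datafile copy rather than by row-at-a-time reloading.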

Oracle Streams

Bundled with the Oracle database since Oracle9i Release 2. Streams includes Oracle's log-based replication, Advanced Queuing (AQ), and, in Oracle Database 10g, transportable tablespaces. It is often used for near real-time data movement. Oracle Database 10g adds support for downstream capture, which collects changed data from log files and so removes capture overhead from the source system; for instantiation via RMAN and transportable tablespaces; and for the LONG, LONG RAW, and NCLOB datatypes. Oracle Database 10g also adds asynchronous change data capture, which uses Streams to transport only changed records from a source database to a target.

Data Pump Fast Import/Export

New in Oracle Database 10g and built on external table support, Data Pump is a high-speed import/export facility with its own file format. Parallel direct path loading and unloading are supported.

Each of these database features is typically used for high-performance data transfers and not (by itself) for difficult transformations. Oracle Warehouse Builder (OWB) is Oracle's ETL tool, used for building mappings from extraction sources, through predefined or custom transformations, to target tables. OWB can then automatically generate the scripts needed to perform the ETL. More than just an ETL tool, OWB can also be used as a data warehouse design tool and provides a metadata repository. Designs may also be imported from a variety of design tools, such as Oracle Designer, CA's ERwin, and Sybase PowerDesigner.

In most warehouse-building projects, metadata describing the source tables is imported first; sources include Oracle databases (via database links), other RDBMSs (through ODBC or gateways), and flat files. Target tables are designed or imported, and source metadata is mapped to target metadata, including transformations (see Figure 9-4 for an example). OWB's basic set of transformations includes a name-and-address cleansing operator for use with Oracle partners' libraries and applications that perform "householding," matching, and merging of data.

OWB can then validate the source-to-target mappings. Once validated, you can then generate any of the following:

DDL, if target tables are to be created

SQL*Loader control files, for the loading of flat files

PL/SQL scripts, for ETL from relational sources

Figure 9-4. Typical Oracle Warehouse Builder source table-to-target table mapping

Scripts are deployed to, and run at, the target data warehouse, typically scheduled using the Enterprise Manager job scheduler. For more complex scheduling of ETL jobs where certain prerequisites must be met, OWB includes an interface to Oracle Workflow.

OWB provides access to a number of other non-Oracle sources. An SAP Integrator provides SQL access to Oracle RDBMS-based SAP tables, and an ABAP code generator builds access to any SAP table on any database, including cluster tables, through an RFC connection.

For high-speed loading of flat files, Oracle SQL*Loader's direct path load option provides rapid loading by bypassing the buffer cache and rollback mechanism and writing directly to the datafile. You can run SQL*Loader sessions in parallel to further speed the table loading process (as many warehouses need to be loaded in a limited "window" of time). Many popular extraction tools, including OWB, generate SQL*Loader scripts.
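A SQL*Loader load is driven by a control file. The following minimal sketch (hypothetical table and file names) loads a comma-delimited flat file; the command line beneath it requests the direct path and parallel options described above:

```sql
-- load_sales.ctl (illustrative control file)
LOAD DATA
INFILE 'sales.dat'
APPEND
INTO TABLE sales
FIELDS TERMINATED BY ','
(sale_id, product_id, amount, sale_date DATE 'YYYY-MM-DD')

-- Invoked with direct path writes, one session per input file:
--   sqlldr scott CONTROL=load_sales.ctl DIRECT=TRUE PARALLEL=TRUE
```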

Oracle9i first added key ETL functionality in the core database engine including support for external tables, table functions, merge (i.e., insert or update depending on the existence of a data item), multi-table inserts, change data capture, and resumable statements. Today, OWB can be used to leverage this functionality. Additionally, OWB can create trickle feeds through Oracle's AQ, a component of Streams.
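As an illustration of the merge capability mentioned above, the following upsert (hypothetical table names) updates a warehouse row in place when its key already exists and inserts the row otherwise, in a single statement:

```sql
-- Upsert staged rows into the warehouse:
MERGE INTO dw_sales t
USING stg_sales s
ON (t.sale_id = s.sale_id)
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (sale_id, amount)
  VALUES (s.sale_id, s.amount);
```

Before MERGE existed, the same effect required separate UPDATE and INSERT passes, or procedural row-by-row logic.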

9.6.2 Reporting and Ad-Hoc Query Tools

Marketing, financial, and other business analysts are rarely interested in the storage and schema that hold their information. Their interest levels rise when the discussion turns to the tools they'll be using. Business intelligence tools are often evaluated and purchased within individual business areas, sometimes without close IT coordination. Oracle database customers have a choice of Oracle's suite of business intelligence tools or popular independent vendors' products such as Hyperion/Brio, Business Objects, Cognos, and MicroStrategy. Oracle bundles the business tools it provides in the Oracle Application Server.

Of the tools users need, the least glamorous but most often accessed are reports. Business analysts may wish to avoid reporting software if it requires help from the IT staff, especially if the process of creating and deploying the reports is more time-consuming and less flexible than manipulating the data directly. However, reports appeal to executives within a company because they require no special IT skills or knowledge. The distribution of such reports has greatly improved as web-based report generation has become commonplace and the user needs only familiarity with a browser.

Oracle's tool for generating reports, Oracle Reports, has a wizard-based frontend for building reports that can then be deployed to the Web for access as Adobe Acrobat, plain text, or HTML files. With this tool, you can cache reports on a middle-tier server for better performance. The tool also provides some limited drill-down search capabilities, in which a user can ask for more detail about a particular portion of a report.

More sophisticated users need to pose "what-if" questions, such as "What sales volume occurred in the Midwest sub-regions in 2004, and how did those sales compare to 2003?" Such users need ad-hoc query tools. Initially developed as client/server tools, many are now web-based and provide functionality similar to that of the earlier reporting tools. A common theme has been to make these tools much more graphically intuitive: queries are typically posed by selecting icons that represent tables and items. Users need only know where their data is stored and what it represents; the SQL is generated behind the scenes.

Oracle Discoverer, a web-based ad-hoc query tool, provides an easy-to-use frontend for picking and choosing data items around which to build queries and is typically used by business users. Users can generate their own reports and deploy them to the Web as HTML files. Discoverer has a query governor that can predict the amount of time a query will take, based on records of previous queries kept on the database server. Through a separate administrative layer, IT can place limits on the time allowed for queries, because an analyst can (intentionally or accidentally) request an incredibly complex and lengthy query for a particular report. Such a request could deliver results slowly and negatively impact overall database performance.

Oracle Discoverer is most often used for querying an Oracle relational database, but ODBC support enables this tool to be used with other databases.

The Portal product, a part of Oracle Application Server, provides an integration point for custom-built business intelligence applications using Oracle Reports, Discoverer, and other tools. It also provides access to a number of other applications and web sites through its interface, and it is highly customizable by users.

9.6.3 OLAP and OLAP Applications Building

As business users become more sophisticated, their questions evolve from "what happened" to "what trends are present and what might happen in the future?" OLAP tools provide the ability to handle time-series and mathematical analysis for understanding past trends and forecasting the future.

OLAP initially grew out of the early inability of relational databases to effectively handle multidimensional queries (described above in Section 9.2). The first OLAP tools were packaged with their own data "cubes," into which data was downloaded from relational sources.

These separate database engines are called Multidimensional On-Line Analytical Processing engines, or MOLAP engines. Examples include Oracle's Express Server and Hyperion Essbase, as well as the Microsoft offering that is part of SQL Server. These MOLAP engines handle queries extremely quickly and work best when the data is not updated frequently (because the cube-generation process takes time).

Today, OLAP tools are used more often with relational databases than MOLAP engines, because star queries are supported to various degrees in many databases and because there is an increased need for very frequently updated data. When used in this fashion, the interaction is called ROLAP, which stands for Relational On-Line Analytical Processing. Tools that can work against either relational databases or MOLAP engines are sometimes referred to as hybrid tools.

Oracle's OLAP functionality now resides in the Oracle relational database management system (RDBMS). For ROLAP deployment, Oracle Discoverer and other tools can leverage ANSI-standard analytic functions built into the database as SQL extensions. Discoverer can also leverage the RDBMS's materialized views, and an analyzer component provides a frontend to the database's OLAP Option.
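As a sketch of this analytic-function style of ROLAP query, the following (hypothetical schema) answers a year-over-year question like the one posed earlier by placing each region's prior-year sales alongside the current year:

```sql
-- Compare each region's yearly sales against the prior year:
SELECT region,
       yr,
       sales,
       LAG(sales) OVER (PARTITION BY region ORDER BY yr) AS prior_yr_sales
FROM  (SELECT region,
              EXTRACT(YEAR FROM sale_date) AS yr,
              SUM(amount) AS sales
       FROM   dw_sales
       GROUP  BY region, EXTRACT(YEAR FROM sale_date));
```

Without the LAG analytic function, the same comparison would require a self-join of the aggregated results.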

Popular business intelligence tools can also access the OLAP Option, a MOLAP cube within the relational database, via SQL (since Oracle9i Release 2). Custom OLAP applications, formerly built via Oracle's Express Objects, are now built using Oracle's JDeveloper and business intelligence beans. The Java beans provide prebuilt components for manipulating tables, crosstabs, and graphs, and for building queries and calculations similar to the functionality previously found in Express. JDeveloper generates Java code using these building blocks that map to the Java OLAP API provided by Oracle's OLAP Option. The Oracle Database 10g release adds the capability to use XML to describe the OLAP multidimensional data model.

9.6.4 Data Mining

Data mining, an often overused and misunderstood term in data warehousing, is the use of mathematical algorithms to model relationships in the data that wouldn't be apparent by using any other tools. Most companies shouldn't approach data mining unless analysts have met the following criteria:

An understanding of the quality and meaning of the data in the warehouse

Business insight gained using other tools and the warehouse

An understanding of a business issue being driven by too many variables to model outcomes in any other way

In other words, data-mining tools are not a replacement for the analytical skills of data warehouse users.

The data-mining tools themselves can rely on a number of techniques to produce the relationships, such as:

Extended statistical algorithms, provided by statistical tools vendors, that can highlight statistical variations in the data.

Clustering techniques that show how business outcomes can fall into certain groups, such as insurance claims versus time for various age brackets. In this example, once a low-risk group is found or classified, further research into influencing factors or "associations" might take place.

Logic models (if A occurs, then B or C are possible outcomes) validated against small sample sets and then applied to larger data models for prediction.

Neural networks "trained" against small sets, with known results to be applied later against a much larger set.

Visualization techniques used to graphically plot variables and understand which variables are key to a particular outcome.

Data mining has been used successfully to solve difficult business problems such as fraud detection and churn in micro-opportunity marketing, as well as in other areas where many variables can influence an outcome. Companies servicing credit cards use data mining to track unusual usage; for example, the unexpected charging to a credit card of expensive jewelry in a city the cardholder does not normally travel to. Discovering clusters of unusual buying patterns within certain small groups might also drive micro-opportunity marketing campaigns aimed at small audiences with a high probability of purchasing products or services.

A recent trend among relational database providers is tighter integration of data-mining algorithms into the relational database. Oracle's data-mining strategy initially leveraged a traditional client/server product called Oracle Darwin to provide algorithms for modeling associations, neural networks, classification and regression trees, and clusters against Oracle tables or flat files. Oracle began to embed these algorithms in the Oracle9i database. Algorithms now embedded in the database include Naïve Bayes, Associations, Adaptive Bayes Networks, Clustering, Support Vector Machines (SVM), and Nonnegative Matrix Factorization (NMF). The algorithms are accessible via a Java API; however, they may prove most useful when leveraged by Oracle products such as Oracle Personalization, which makes real-time web site recommendations. New data-mining capabilities in Oracle Database 10g include text mining (providing document clustering and classification) and BLAST similarity searches (common in genetic research), both leveraging the new SVM algorithms. Data-mining applications can be custom built using Oracle's JDeveloper in combination with DM4J, which is used to develop, test, and score the models. It provides the ability to define metadata, tune the generated Java code, view generated XML files, and test application components.

9.6.5 Business Intelligence Applications

Business intelligence applications are prebuilt solutions providing extended reporting and "dashboard-like" interfaces to display business trends. Such applications may access OLTP applications directly by leveraging Oracle materialized views (as Oracle's Daily Business Intelligence does), or they may access more traditional data warehouses (as SAP Business Warehouse and most custom deployment topologies do).

The new business intelligence applications from software vendors focus on specific areas of business expertise, such as marketing or financial analysis. These applications include predefined queries, reports, and charts that deliver the kind of information required for a particular type of business analysis while sparing the end user the complexity of creating these objects from scratch.

Oracle ERP Applications Intelligence modules provide a mixture of access into Oracle transaction tables (for Daily Business Intelligence) and into Oracle's Enterprise Data Warehouse (EDW). Access is commonly through prebuilt Oracle Discoverer workbooks containing prepopulated business metadata. Oracle intelligence modules include Financial, Purchasing, Project, Manufacturing, Supply Chain, Marketing, and Human Resources Intelligence. The EDW can leverage customized Oracle Applications Flexfields through an Oracle Applications Integrator available for Oracle Warehouse Builder. It can also be extended to extract and load data from non-Oracle sources through OWB. Oracle also offers other useful applications, including a Balanced Scorecard for measuring and displaying progress toward meeting business objectives and key performance indicators.

The promise of tightly integrated solutions, such as the Intelligence modules, is that they will provide easier-to-deploy solutions with more out-of-the-box functionality. While some customization will probably always be needed, the time required to deploy an initial and useful solution should be substantially reduced.