Why build a data warehouse or business intelligence solution? Why is the data in an On-Line Transaction Processing (OLTP) database only part of a business intelligence solution? Data warehouses are often designed with the following in mind:
Strategic analyses discern trends in data, rather than individual facts.
Data warehouses are often used to create simple reports based on aggregate values culled from enormous amounts of data. If OLTP databases attempted to compute these aggregates on the fly, they would consume significant resources, impairing their ability to process transactions in a timely manner.
The information in a data warehouse is mostly read-only, with infrequent updates.
This means that the overhead of transaction control, an important part of a normal OLTP database system, isn't really needed for a data warehouse.
The data used for analyses may not require up-to-the-minute accuracy.
Because strategic analysis is concerned with trends over time, the data used can be a day, a week, or even a month old, depending on the analysis being done. This means that data in a data warehouse can have some or all of its aggregate values created as part of an offline batch process. However, as warehouses are increasingly used for tactical management, near-real-time accuracy is emerging as a requirement.
The design required for an efficient data warehouse differs from the standard normalized design for a relational database.
Queries typically read from a fact table, which may contain summary information, using a specific type of schema design called a star schema. This design lets you access facts quite flexibly along key dimensions, or "look-up" values. (The star schema is described in more detail later, in Section 9.2.) For instance, a data warehouse user may want to compare the total amount of sales, which comes from a fact table, by region, by store within a region, and by item, all of which can be considered key dimensions.
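The kind of query described above can be sketched with a minimal star schema. The table and column names below (sales_fact, store_dim, item_dim) are hypothetical, chosen only to illustrate a fact table joined to its dimension tables; this example uses SQLite in memory purely for convenience.

```python
import sqlite3

# Hypothetical star schema: one sales fact table joined to two
# dimension ("look-up") tables, store/region and item.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE item_dim  (item_id INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact (
    store_id INTEGER REFERENCES store_dim,
    item_id  INTEGER REFERENCES item_dim,
    amount   REAL
);
INSERT INTO store_dim VALUES (1, 'Downtown', 'East'), (2, 'Mall', 'West');
INSERT INTO item_dim  VALUES (10, 'Widget'), (20, 'Gadget');
INSERT INTO sales_fact VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 75.0);
""")

# Total sales sliced along the key dimensions: region, store, and item.
rows = conn.execute("""
    SELECT s.region, s.store_name, i.item_name, SUM(f.amount) AS total_sales
    FROM sales_fact f
    JOIN store_dim s ON f.store_id = s.store_id
    JOIN item_dim  i ON f.item_id  = i.item_id
    GROUP BY s.region, s.store_name, i.item_name
""").fetchall()
for row in rows:
    print(row)
```

Note that the fact table holds only foreign keys and measures, while each dimension table carries the descriptive attributes; adding a new dimension (say, time) means adding one more small table and one more join, which is what makes the design flexible.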
Business intelligence is not a new idea. The use of corporate data for strategic decision making beyond simple tracking and day-to-day operations has gone on for almost as long as computing itself.
Quite early, builders and users of operational systems recognized the potential business benefits of analyzing the data in complementary systems. In fact, much of the early growth in personal computers was tied to the use of spreadsheets that performed analyses against data downloaded from operational systems. Business executives began to direct IT efforts toward building solutions that provided a better understanding of the business using existing business data. This understanding enabled new business strategies. Many of these early joint initiatives were successful data warehousing projects. Today, data warehouses are used in business areas such as customer relationship management, sales and marketing campaign analysis, product management and packaging, financial analysis, supply chain analysis, and risk and fraud analysis.
In the 1980s, many organizations began using dedicated servers for these applications, which were collectively known as decision support systems (DSS). Decision-support queries tended to be particularly CPU and memory intensive and took place in a primarily read-only environment, whereas traditional OLTP was typically I/O intensive with a large number of updates. The queries themselves were also much less predictable (far more ad hoc) than those experienced in OLTP systems. This led to the development of data stores for decision support apart from those for OLTP.
When Bill Inmon (whose books are noted in Appendix B) and others popularized the term "data warehouse" in the early 1990s, a formalized common infrastructure for building a solution came into being. Since then, the topology of the data warehouse has continued to evolve, as the next section illustrates.
Although data warehouses were initially conceived as a large, enterprise-wide source of all information, their topology often now appears as a multi-tier architecture, as shown in Figure 9-1.
The topology shown in Figure 9-1 is the mature result of years of experience. The evolution from a traditional client/server environment to multiple tiers occurred for a variety of reasons. Initial efforts at creating a single warehouse often resulted in "analysis paralysis." Just as efforts to define an enterprise-wide OLTP model often take years (due to cross-departmental politics and the scope of the effort), similar attempts in data warehousing also took much longer than businesses were willing to accept. These efforts were further hampered by the continually changing analysis requirements necessitated by a continually changing marketplace. While the data elements and requirements for operational systems can remain relatively stagnant over time, understanding business trends can be like trying to catch lightning in a bottle.
Consequently, attempts at building such enterprise-wide models in an effort to satisfy everyone often satisfied no one.
When some large-scale, enterprise-only data warehouse efforts ended in dismal failure, frustration and impatience followed. Some reacted by building department-focused, independent data marts that extracted data from the appropriate operational source systems. Many data marts were initially quite successful because they fulfilled a specific business need relatively quickly.
However, problems began to surface. There was often no coordination between departments regarding basic definitions, such as "customer." If a senior manager asked the same question of multiple departments, the answers provided by these independent data marts were often different, thus calling into question the validity of any of the marts. Many departments also encountered ongoing difficulty in managing these multiple data marts and in maintaining extractions from operational sources (which were often duplicated across multiple departments).
As decision-support architects took another look at their solutions, they began to realize that it was very important to have a consistent view of the detailed data at an enterprise data warehouse level. They also saw that data marts could solve business problems and provide return on investment in an incremental fashion. Today, most successful implementers simultaneously grow dependent data marts one business solution at a time while growing the enterprise warehouse server in an incremental fashion.
The currently accepted definition of a data mart is simply a subject- or application-specific data warehouse, usually implemented within a department. Typically, these data marts are built for performance and may include a large number of summary tables. Data marts were initially thought of as being small, because not all the detail data for a department or data from other departments need be loaded in the mart. However, some marts get quite large as they incorporate data from outside sources (sometimes purchased) that isn't relevant in other parts of the business.
In some companies, data marts are deployed to meet specific project goals with models optimized for performance for that particular project. Such data marts are retired when the project is completed and the hardware is reused for other projects. As the analysis requirements for a business change, the topology of any particular data warehouse is subject to evolution over time, so developers must be aware of this possibility.
Increasingly, companies interested in cost savings are reexamining the wisdom of having a large number of data marts and are consolidating marts where possible. New database features that enable effective management of different user communities have helped make this consolidation possible.
The operational data store (ODS) concept also grew in popularity in the 1990s. The ODS may best be described as a distribution center for current data. As in OLTP servers, its schema is highly normalized and its data is recent. The ODS serves as a consolidation point for reporting and can give the business one location for viewing current data that crosses divisions or departments. The popularity of the ODS grew in part as a result of companies in the midst of acquisitions and mergers, since such organizations often face mixed-application environments. The ODS acts as a staging location that can be used as the source for further transformations into a data warehouse or into data marts.
The warehouse server, or enterprise data warehouse, is a multi-subject historical information store usually supporting multiple departments and often serving as the corporate database of record. When an ODS is established, the warehouse server often extracts data from the ODS. When an ODS isn't present, data for the warehouse is directly extracted and transformed from operational sources. External data may also feed the warehouse server.
As noted previously in the data mart discussion, platform consolidation is being examined within these tiers today. The enterprise data warehouse can be the point of consolidation for the ODS and multiple data marts. Although different logical models remain, they are sometimes consolidated to a single platform.