The purpose of this chapter is to introduce students to the rationale and basic concepts of data warehousing from a database management point of view. We contrast operational and informational processing, and we discuss the reasons why so many organizations are seeking to exploit data warehouses for competitive advantage. We discuss alternative data warehouse architectures (especially the database architectures) and techniques for populating a warehouse.
Specific student objectives are included at the beginning of the chapter. From an instructor’s point of view, the objectives of this chapter are to:
1. Establish the fact that many organizations today are experiencing an information gap. That is, they are drowning in data but starving for information.
2. Define data warehousing and describe four characteristics of a data warehouse.
3. Describe two major factors that drive the need for data warehousing as well as several advances in the field of information systems that have enabled data warehousing.
4. Contrast operational systems and informational systems from the viewpoint of data management.
5. Describe the basic architectures that are most often used with data warehouses.
6. Contrast transient and periodic data, and discuss how data warehouses are used to build a historical record of an organization.
7. Discuss the purposes of populating a data warehouse and the problems of data reconciliation.
8. Contrast data warehouses and data marts.
9. Describe and illustrate the dimensional data model (or star schema) that is often used in data warehouse design.
Modern Database 400 Management, Ninth Edition
Independent data mart
Real-time data warehouse
Logical data mart
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Dependent data mart
Online analytical processing (OLAP)
Operational data store (ODS)
Enterprise data warehouse (EDW)
1. Discuss the importance of data warehousing in organizations today. Over 90 percent of large (Fortune 1000) companies have completed data warehouses or have a warehousing project underway. Ask your students to suggest reasons for this popularity.
2. Discuss job opportunities in data warehousing, business intelligence, and data mining. Job listings appear on numerous Web sites as well as in newspaper advertisements.
3. Emphasize that a successful data warehousing project requires the integration of everything the students have learned throughout the database course (in fact, everything in the IS curriculum).
4. Discuss the idea of heterogeneous data (use Figure 11-1). Ask your students for reasons why such data are so commonplace and what problems they present.
5. Compare operational and informational systems using Table 11-1. Ask your students for examples of each type of system.
6. Compare the two-layer (Figure 11-2), independent data mart (Figure 11-3), dependent data mart and operational data store (Figure 11-4), and logical data mart (Figure 11-5) architectures.
7. Discuss the three-layer data architecture (Figure 11-6). Ask your students why it might be necessary to have both a reconciled data layer and a derived data layer.
8. Compare transient data (Figure 11-8) with periodic data (Figure 11-9). Explain how periodic data provide a historical record of events.
9. Discuss the steps in data reconciliation (Figure 11-10). Emphasize that this is generally considered to be the most complex challenge in data warehousing.
10. Discuss some of the typical data transformation functions (use Figures 11-11 and 11-12). Have your students suggest other practical examples.
11. Introduce components of a star schema (Figure 11-13) and discuss the example shown in Figures 11-14 and 11-15. Have your students help you diagram another example (university, football team, etc.).
Chapter 11 401
12. Discuss some variations of the star schema (Figure 11-18).
13. Discuss conformed dimensions and how these could be used (Figure 11-17).
14. Discuss normalizing dimension tables (Figures 11-19 and 11-20).
15. Discuss slowly changing dimensions and some ways to handle them. Use Figure 11-21 as an example of one possible solution.
16. Discuss some of the ways end users can use a data warehouse or data mart (use Figures 11-22 and 11-23). Ask your students to suggest some advantages of these user interfaces.
17. Introduce the topic of data mining (use Table 11-4). If time permits, have your students read a recent article on data mining in a publication such as DM Review (available at www.dmreview.com).
18. Have your students register for the Teradata Student Network and show them how they can access extensive material on data warehousing and business intelligence.
Answers to Review Questions
1. Define each of the following terms:
a. Data warehouse A subject-oriented, integrated, time-variant, non-volatile collection of data used in support of management decision-making processes (Inmon and Hackathorn, 1994).
b. Data mart A data warehouse that is limited in scope, whose data are obtained by selecting and (where appropriate) summarizing data from the enterprise data warehouse.
c. Reconciled data Detailed, historical data that are intended to be the single, authoritative source for all decision support applications and not generally intended to be accessed directly by end users.
d. Derived data Data that have been selected, formatted, and aggregated for end-user decision support applications.
e. Online analytical processing (OLAP) The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.
f. Data mining Knowledge discovery using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics (Weldon 1996).
g. Star schema A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for star schema.
h. Snowflake schema An expanded version of a star schema in which all of the tables are fully normalized.
i. Grain The length of time (or other meaning) associated with each record in the table.
j. Conformed dimension One or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with each fact table.
2. Match the following terms and definitions:
c periodic data
d data mart
e star schema
f data mining
b reconciled data
g dependent data mart
i data visualization
a transient data
h snowflake schema
3. Contrast the following terms:
a. Transient data; periodic data In transient data, changes to existing records are written over previous records, thus destroying the previous data content. In periodic data, the data are never physically altered or deleted once they have been added to the store.
b. Data warehouse; data mart; operational data store A data warehouse is an integrated and consistent store of subject-oriented data that are obtained from a variety of sources and formatted into a meaningful context to support decision making in an organization. A data mart is a data warehouse that is limited in scope and whose data are obtained by selecting and (where appropriate) summarizing data from the enterprise data warehouse. An operational data store is much different from a data warehouse or data mart because it is updatable, has a limited amount of historical data, and is available to operational users for use in decision support.
c. Reconciled data; derived data Reconciled data are intended to be the single, authoritative source for all decision-support applications and not generally intended to be accessed by end users; derived data have been selected, formatted, and aggregated for end-user decision support applications.
d. Fact table; dimension table Fact tables contain factual or quantitative data about a business such as units sold, orders booked, and so on. Dimensional tables hold descriptive data about the business.
e. Star schema; snowflake schema A star schema is a simple database design in which dimensional data are separated from fact or event data, while a snowflake schema is an expanded version of a star schema in which all of the tables are fully normalized.
f. Independent data mart; dependent data mart; logical data mart An independent data mart is populated with data extracted from the operational environment without the benefit of a reconciled data layer; a dependent data mart is populated exclusively from the enterprise data warehouse and its reconciled data layer. A logical data mart is created from a relational view of a data warehouse.
4. Five major trends that necessitate data warehousing in many organizations today:
a. No single system of record
b. Multiple systems are not synchronized
c. Organizations want to analyze the activities in a balanced way
d. Customer relationship management
e. Supplier relationship management
5. Major components of a data warehouse architecture:
a. Operational data Stored in the various operational systems throughout the organization (and sometimes in external systems)
b. Reconciled data The type of data stored in the enterprise data warehouse
c. Derived data The type of data stored in each of the data marts
6. List three types of metadata that appear in a three-layer data warehouse architecture, and briefly describe the purpose of each type:
a. Operational metadata These are metadata that describe the data in the various operational systems (as well as external data) that feed the enterprise data warehouse. Operational metadata typically exist in a number of different formats and, unfortunately, are often of poor quality.
b. Enterprise data warehouse (EDW) metadata These metadata are derived from (or at least are consistent with) the enterprise data model. They describe the reconciled data layer. EDW metadata also describe the rules that are used to transform operational data to reconciled data.
c. Data mart metadata These metadata describe the derived data layer. They also describe the rules that are used to transform reconciled data to derived data.
7. Four characteristics of a data warehouse:
a. Subject-oriented A data warehouse is organized around the key subjects (or high-level entities) of the enterprise. Major subjects may include customers, patients, students, products, and time.
b. Integrated The data housed in the data warehouse are defined using consistent naming conventions, formats, encoding structures, and related characteristics gathered from several internal systems of record and also often from sources external to the organization. This means that the data warehouse holds the one version of “the truth.”
c. Time-variant Data in the data warehouse contain a time dimension so that they may be used to study trends and changes.
d. Nonupdatable Data in the data warehouse are loaded and refreshed from operational systems, but cannot be updated by end users.
8. Five claimed limitations of independent data marts:
a. A separate ETL process is developed for each data mart. This can yield costly redundant data and efforts.
b. A clear, enterprise-wide view of data may not be provided because data marts may not be consistent with one another.
c. Analysis is limited because there is no capability to drill down into greater detail or into related facts in other data marts.
d. Scaling costs are excessive as each new application creates a separate data mart, which repeats all the extract and load steps.
e. Attempting to make the separate data marts consistent generates a high cost to the organization.
9. Two claimed benefits of independent data marts:
a. Allow for the concept of a data warehouse to be proved by working on a series of small, fairly independent projects.
b. Reduce the amount of time until the organization perceives a benefit from data warehousing, so that there is no delay until all data are centralized.
10. Three types of operations that can be easily performed with OLAP tools:
a. Slicing a cube
c. Data mining
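Slicing a cube can be illustrated with a small sketch. The cube below is a hypothetical three-dimensional sales cube (product, region, month) held in a Python dictionary; the names and figures are invented for illustration and are not taken from the text.

```python
# Hypothetical sales cube keyed by (product, region, month); values are units sold.
cube = {
    ("Shoes", "East", "Jan"): 10,
    ("Shoes", "West", "Jan"): 7,
    ("Shoes", "East", "Feb"): 12,
    ("Hats", "East", "Jan"): 4,
    ("Hats", "West", "Feb"): 9,
}

def slice_cube(cube, product):
    """Fix the product dimension and return the remaining region-by-month table."""
    return {
        (region, month): units
        for (prod, region, month), units in cube.items()
        if prod == product
    }

shoes_slice = slice_cube(cube, "Shoes")
for (region, month), units in sorted(shoes_slice.items()):
    print(region, month, units)
```

Holding one dimension at a single value reduces the three-dimensional cube to a two-dimensional table, which is the view an OLAP tool would present after a slice.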
11. List four objectives of derived data:
a. Provide ease of use for decision support applications
b. Provide fast response for predefined user queries or requests for information
c. Customize data for particular target user groups
d. Support ad-hoc queries and data mining and other analytic applications
12. Is the star schema a relational data model? Why or why not?
The star schema is a denormalized implementation of the relational data model. The fact table plays the role of a normalized n-ary associative entity that links together the instances of the various dimensions. Usually, the dimension tables are in second normal form or possibly (but rarely) in third normal form. Because the dimension tables are denormalized and are neither updated nor joined with one another, they provide an optimized user view for specific information needs, but they could not be used for operational purposes.
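A minimal sketch of a star schema, using Python's built-in sqlite3 module, may help make this concrete. The table names, columns, and rows (product_dim, period_dim, sales_fact) are hypothetical and chosen only for illustration. The fact table's primary key is the combination of the dimension keys, and each dimension is one join away from the facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension tables: descriptive data about the business.
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
-- Fact table: composite primary key made of the dimension keys, plus a measure.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key  INTEGER REFERENCES period_dim(period_key),
    units_sold  INTEGER,
    PRIMARY KEY (product_key, period_key)
);
INSERT INTO product_dim VALUES (1, 'Oak desk', 'Desks'), (2, 'Pine chair', 'Chairs');
INSERT INTO period_dim  VALUES (1, 2008, 1), (2, 2008, 2);
INSERT INTO sales_fact  VALUES (1, 1, 5), (1, 2, 8), (2, 1, 20);
""")

# Each dimension joins directly to the fact table; dimensions never join to each other.
rows = conn.execute("""
    SELECT p.category, t.quarter, SUM(f.units_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN period_dim  t ON t.period_key  = f.period_key
    GROUP BY p.category, t.quarter
    ORDER BY p.category, t.quarter
""").fetchall()
print(rows)
```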
13. Explain how the volatility of a data warehouse is different from the volatility of a database for an operational information system:
A major difference between a data warehouse and an operational system is the type of data stored. An operational system most often stores transient data, which are overwritten when changes to the data occur. Thus, the data in an operational system are very volatile. On the other hand, a data warehouse usually contains periodic data, which are never overwritten once they have been added to the store. A data warehouse contains a history of the varying values for important (dimensional) data.
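The contrast between transient and periodic data can be sketched in a few lines of Python. The account number, balances, and dates are invented; the dictionary stands in for an operational store and the list for a warehouse store.

```python
# Transient store: one current value per key; an update overwrites the old value.
transient = {}
# Periodic store: an append-only list of time-stamped records.
periodic = []

def record_balance(account, balance, as_of):
    """Apply the same change to both stores (hypothetical helper)."""
    transient[account] = balance                  # previous content is destroyed
    periodic.append((account, balance, as_of))    # previous rows are kept

record_balance("A-100", 500, "2008-01-31")
record_balance("A-100", 350, "2008-02-29")

history = [row for row in periodic if row[0] == "A-100"]
print(transient["A-100"])
print(history)
```

After the second update the transient store holds only the latest balance, while the periodic store still holds both time-stamped values and can therefore answer questions about the past.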
14. Explain the pros and cons of logical data marts:
a. New data marts can be created quickly because no physical database or database technology needs to be acquired or created. Also, loading routines do not need to be written.
b. Data marts are always up-to-date because data in a view are created when the view is referenced. Views can be materialized.
The main drawback is that logical data marts are practical only for moderate-sized data warehouses or when high-performance data warehousing technology is used.
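A logical data mart can be sketched as an ordinary relational view. The example below uses Python's sqlite3 module with a hypothetical warehouse table (edw_sales) and view (east_sales_mart); both names and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table.
CREATE TABLE edw_sales (region TEXT, product TEXT, sale_date TEXT, amount REAL);
INSERT INTO edw_sales VALUES
    ('East', 'Desk',  '2008-01-15', 400.0),
    ('East', 'Chair', '2008-01-20', 150.0),
    ('West', 'Desk',  '2008-02-02', 420.0);
-- The "data mart" is only a view definition; no data are copied or loaded.
CREATE VIEW east_sales_mart AS
    SELECT product, SUM(amount) AS total_amount
    FROM edw_sales
    WHERE region = 'East'
    GROUP BY product;
""")

before = conn.execute("SELECT * FROM east_sales_mart ORDER BY product").fetchall()
# A new warehouse row is visible through the view immediately, with no reload step.
conn.execute("INSERT INTO edw_sales VALUES ('East', 'Desk', '2008-03-01', 100.0)")
after = conn.execute("SELECT * FROM east_sales_mart ORDER BY product").fetchall()
print(before)
print(after)
```

Because the view is evaluated when it is referenced, each query pays the cost of the aggregation, which is why this approach depends on a moderate data volume or fast warehouse technology.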
15. What is a helper table and why is it often used to help organize derived data?
A star schema data mart is composed of fact and dimension tables. Fact tables are completely normalized because each fact depends on the whole composite primary key and nothing but the composite primary key. Dimension tables may not be fully normalized. Helper tables in the data warehouse world act as associative entities in the conceptual model world to link instances of data in M:N relationships. The helper table acts as a way to normalize the relationship between the dimension data and the fact data, such as in the case of a multivalued dimension situation explained in Figure 11-15 in the text.
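One way to picture a helper table is the sketch below, which uses sqlite3 and a hypothetical clinic example that is not the example in the text: a visit can carry several diagnoses, so the fact row points to a group key and the helper table lists the diagnoses in each group. All table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis_dim (diagnosis_key INTEGER PRIMARY KEY, name TEXT);
-- Helper table: one row per (group, diagnosis) pair resolves the M:N relationship.
CREATE TABLE diagnosis_group_helper (
    group_key     INTEGER,
    diagnosis_key INTEGER REFERENCES diagnosis_dim(diagnosis_key),
    PRIMARY KEY (group_key, diagnosis_key)
);
-- Each fact row points at a single group key rather than at many diagnoses.
CREATE TABLE visit_fact (visit_key INTEGER PRIMARY KEY, group_key INTEGER, charge REAL);
INSERT INTO diagnosis_dim VALUES (1, 'Asthma'), (2, 'Diabetes');
INSERT INTO diagnosis_group_helper VALUES (10, 1), (10, 2), (11, 1);
INSERT INTO visit_fact VALUES (100, 10, 250.0), (101, 11, 80.0);
""")

rows = conn.execute("""
    SELECT f.visit_key, d.name
    FROM visit_fact f
    JOIN diagnosis_group_helper h ON h.group_key = f.group_key
    JOIN diagnosis_dim d ON d.diagnosis_key = h.diagnosis_key
    ORDER BY f.visit_key, d.name
""").fetchall()
print(rows)
```

Note that summing the charge across this join would count visit 100 twice, once per diagnosis, so measures need care whenever a helper table is involved.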
16. The characteristics of a surrogate key as used in a data warehouse or data mart:
All keys used to join the fact table to the dimension tables should be system assigned. The key should be simple as compared to the production or composite key. It is best to maintain the same length and format for all surrogate keys across the entire data warehouse, regardless of the business dimensions involved.
17. Time is almost always a dimension in a data warehouse or data mart because data marts and data warehouses record facts about dimensions over time. Date and time are almost always included as a dimension table, and a date surrogate key is usually one of the components of the primary key of the fact table. The time dimension is critical to most of the reporting and analysis needs that end users of the data warehouse have. Often, users will want to view how facts (such as sales) have changed over time or may want to compare one time period against another.
18. What is the purpose of conformed dimensions for different star schemas within the same data warehousing environment?
A conformed dimension is one or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary keys. Thus, conformed dimensions are important when there are multiple fact tables (often because there are multiple data marts) to be able to have consistent results across the marts and to be able to write queries that cut across the different marts.
Conformed dimensions allow users to:
a. Share nonkey dimension data
b. Query across fact tables with consistency
c. Work on facts and business subjects for which all users have the same meaning
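The idea of querying across fact tables can be sketched with sqlite3. In this hypothetical example, two fact tables (sales_fact and inventory_fact) share one conformed product dimension with the same key and meaning; all names and values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by both fact tables.
CREATE TABLE product_dim    (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales_fact     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE inventory_fact (product_key INTEGER, units_on_hand INTEGER);
INSERT INTO product_dim    VALUES (1, 'Desk'), (2, 'Chair');
INSERT INTO sales_fact     VALUES (1, 5), (1, 3), (2, 10);
INSERT INTO inventory_fact VALUES (1, 40), (2, 25);
""")

# Summarize each fact table separately, then line the results up on the shared key.
rows = conn.execute("""
    SELECT p.name,
           (SELECT SUM(s.units_sold)    FROM sales_fact s
             WHERE s.product_key = p.product_key) AS sold,
           (SELECT SUM(i.units_on_hand) FROM inventory_fact i
             WHERE i.product_key = p.product_key) AS on_hand
    FROM product_dim p
    ORDER BY p.name
""").fetchall()
print(rows)
```

Each fact table is aggregated on its own before the results are combined; joining the two fact tables directly to each other would multiply rows and inflate the totals.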
19. Can a fact table have no nonkey attributes?
Yes, this would be an example of a factless fact table. There are two general situations in which this might be useful: to track events and to inventory the set of possible occurrences.
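An event-tracking factless fact table can be sketched as follows. The class attendance example, table names, and rows are hypothetical; the point is that the fact table holds only dimension keys, and analysis is done by counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_dim (student_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course_dim  (course_key  INTEGER PRIMARY KEY, title TEXT);
-- Factless fact table: dimension keys only, no nonkey (measure) attributes.
CREATE TABLE attendance_fact (
    student_key INTEGER REFERENCES student_dim(student_key),
    course_key  INTEGER REFERENCES course_dim(course_key),
    date_key    INTEGER,
    PRIMARY KEY (student_key, course_key, date_key)
);
INSERT INTO student_dim VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO course_dim  VALUES (1, 'Databases'), (2, 'Networks');
INSERT INTO attendance_fact VALUES (1, 1, 20080901), (2, 1, 20080901), (1, 2, 20080902);
""")

# The existence of a row is the fact; COUNT(*) gives attendance per course.
rows = conn.execute("""
    SELECT c.title, COUNT(*)
    FROM attendance_fact a
    JOIN course_dim c ON c.course_key = a.course_key
    GROUP BY c.title
    ORDER BY c.title
""").fetchall()
print(rows)
```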
20. In what way are dimension tables often not normalized?
Most dimension tables are not normalized so that for a given user group the dimension data are only one join away from associated facts. One example might be multivalued data, in which one could store multiple values by using several different fields. Another example would be the incorporation of data from other tables that are not part of the star schema but might be needed for analysis.
21. What is a hierarchy as it relates to a dimension table?
A dimension table often has a natural hierarchy among the rows. Some examples might be geographical hierarchies (markets within a state, states within a region) and product hierarchies (products within a product line). These hierarchies can be handled in two ways:
a. Include all information for each level of the hierarchy in a single, denormalized table with a helper table (Figure 11-20)
b. Normalize the dimension into a nested set of tables (one for each level of the hierarchy) with 1:M relationships between them
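The two alternatives can be compared in a small sqlite3 sketch. The geography tables and rows are hypothetical, and the helper table mentioned in option (a) is omitted for brevity, so the flat table here is only a simplified stand-in for that option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Option (a), simplified: every level of the hierarchy repeated in one table.
CREATE TABLE market_dim_flat (market_key INTEGER PRIMARY KEY,
                              market TEXT, state TEXT, region TEXT);
-- Option (b): one table per level, linked by 1:M relationships.
CREATE TABLE region_dim (region_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE state_dim  (state_key INTEGER PRIMARY KEY, state TEXT,
                         region_key INTEGER REFERENCES region_dim(region_key));
CREATE TABLE market_dim (market_key INTEGER PRIMARY KEY, market TEXT,
                         state_key INTEGER REFERENCES state_dim(state_key));
CREATE TABLE sales_fact (market_key INTEGER, units INTEGER);

INSERT INTO market_dim_flat VALUES (1, 'Dayton', 'Ohio', 'Midwest'),
                                   (2, 'Toledo', 'Ohio', 'Midwest'),
                                   (3, 'Austin', 'Texas', 'South');
INSERT INTO region_dim VALUES (1, 'Midwest'), (2, 'South');
INSERT INTO state_dim  VALUES (1, 'Ohio', 1), (2, 'Texas', 2);
INSERT INTO market_dim VALUES (1, 'Dayton', 1), (2, 'Toledo', 1), (3, 'Austin', 2);
INSERT INTO sales_fact VALUES (1, 5), (2, 7), (3, 4);
""")

# Flat dimension: the region is one join away from the facts.
flat = conn.execute("""
    SELECT d.region, SUM(f.units) FROM sales_fact f
    JOIN market_dim_flat d ON d.market_key = f.market_key
    GROUP BY d.region ORDER BY d.region
""").fetchall()

# Normalized dimension: the same roll-up needs one join per hierarchy level.
nested = conn.execute("""
    SELECT r.region, SUM(f.units) FROM sales_fact f
    JOIN market_dim m ON m.market_key = f.market_key
    JOIN state_dim  s ON s.state_key  = m.state_key
    JOIN region_dim r ON r.region_key = s.region_key
    GROUP BY r.region ORDER BY r.region
""").fetchall()
print(flat, nested)
```

Both designs return the same regional totals; the trade-off is repeated state and region values in the flat table versus extra joins in the normalized one.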
22. What is the meaning of the phrase “slowly changing dimension”?
Although data warehouses track data over time, the business does not remain static. We need to keep track of the history of values in order to record the history of facts with the dimensional descriptions that were correct when the facts occurred. Dimension data change more slowly than transactional data; thus, we can consider dimensions to be slowly changing dimensions.
23. Explain the most common approach used to handle slowly changing dimensions.
Create a new dimension table row (with a new key) each time the dimension object changes; this new row contains all the dimension characteristics. A fact row is associated with the key whose attributes apply at the time of the fact. This approach allows us to record as many dimensional object changes as necessary, although it can become unwieldy if rows change frequently. We may also want to store the surrogate key value for the original object in the dimension row so that we can relate changes back to the original object.
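The approach can be sketched in plain Python. The customer dimension, the add_version and city_for helpers, and the sample facts are all hypothetical; the sketch assumes surrogate keys are simple system-assigned integers.

```python
from itertools import count

_next_key = count(1)          # system-assigned surrogate keys (assumed integers)
customer_dim = []             # one row per version of each customer

def add_version(customer_id, city, original_key=None):
    """Append a new dimension row; earlier rows are never overwritten."""
    key = next(_next_key)
    customer_dim.append({
        "customer_key": key,                        # surrogate key
        "customer_id": customer_id,                 # production (business) key
        "city": city,
        # Key of the first version, so changes can be related back to it.
        "original_key": original_key if original_key is not None else key,
    })
    return key

k1 = add_version("C-42", "Dayton")
sale_1 = {"customer_key": k1, "amount": 100.0}      # fact recorded before the move
k2 = add_version("C-42", "Austin", original_key=k1)
sale_2 = {"customer_key": k2, "amount": 60.0}       # fact recorded after the move

def city_for(fact):
    """Look up the dimension attributes that applied when the fact occurred."""
    return next(r["city"] for r in customer_dim
                if r["customer_key"] == fact["customer_key"])

print(city_for(sale_1), city_for(sale_2))
```

Each sale keeps the description that was true when it happened, and the original_key column ties both versions back to the same customer.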
24. One of the claimed characteristics of a data warehouse is that it is nonupdatable. What does this mean?
Nonupdatable means that data, once put in the data warehouse, are never changed (except to correct errors), but rather new versions of the same data may be stored.
25. In what ways are a data staging area and an enterprise data warehouse different?
The data staging area contains only current, consolidated data from source systems whereas an enterprise data warehouse (EDW) contains time-stamped history.