Chapter 4: Data Warehousing and On-line Analytical
Processing
■ Data Warehouse: Basic Concepts
■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Design and Usage
■ Data Warehouse Implementation
■ Data Generalization by Attribute-Oriented
Induction
■ Summary
What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database that is maintained separately from
the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
■ Data warehousing:
■ The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
■ Organized around major subjects, such as customer,
product, sales
■ Focuses on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
■ Provides a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process
Data Warehouse—Integrated
■ Constructed by integrating multiple, heterogeneous data
sources
■ relational databases, flat files, on-line transaction
records
■ Data cleaning and data integration techniques are
applied.
■ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.
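The conversion step above can be illustrated with a minimal sketch. It assumes two hypothetical sources that quote hotel prices under different conventions (currency, tax, breakfast) and one warehouse convention chosen here for illustration: USD, tax included, breakfast excluded. The exchange rate, tax rate, and breakfast fee are made-up values, and the names `normalize` and `HotelPrice` are not from the slides.

```python
from dataclasses import dataclass

# Illustrative, made-up conversion rates to USD (assumption, not real data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}

@dataclass
class HotelPrice:
    """Warehouse-side convention: USD, tax included, breakfast excluded."""
    hotel: str
    usd_per_night: float

def normalize(hotel, amount, currency, tax_included, tax_rate, breakfast_fee=0.0):
    """Convert one source record to the warehouse convention.

    breakfast_fee is the part of `amount` that pays for breakfast, in the
    source currency (0.0 if breakfast is not covered).
    """
    room_only = amount - breakfast_fee
    if not tax_included:
        room_only *= 1.0 + tax_rate
    return HotelPrice(hotel, round(room_only * RATES_TO_USD[currency], 2))

# Source A quotes EUR, tax included, breakfast covered.
a = normalize("Hotel A", 120.0, "EUR", tax_included=True, tax_rate=0.10, breakfast_fee=20.0)
# Source B quotes USD, before tax, no breakfast.
b = normalize("Hotel B", 100.0, "USD", tax_included=False, tax_rate=0.10)
```

After normalization the two records are directly comparable, which is the point of integrating before loading.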
Data Warehouse—Time Variant
■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provides information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly
■ But the key of operational data may or may not
contain “time element”
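A small sketch of the difference in key structure, assuming a hypothetical price record. The operational store keeps one current value per key, while the warehouse key carries an explicit time element so that history is retained. All names here (`record_price`, `price_as_of`) are illustrative.

```python
from datetime import date

# Operational store: one current value per key, overwritten in place.
operational = {}
# Warehouse store: the key includes a time element, so history is kept.
history = {}

def record_price(item_id, price, as_of):
    operational[item_id] = price          # the old value is lost
    history[(item_id, as_of)] = price     # the old value is retained

def price_as_of(item_id, when):
    """Latest warehouse price for item_id effective on or before `when`."""
    dated = [(d, p) for (i, d), p in history.items() if i == item_id and d <= when]
    return max(dated)[1] if dated else None

record_price("item-1", 9.99, date(2020, 1, 1))
record_price("item-1", 12.49, date(2021, 1, 1))
```

Only the warehouse-style store can answer a historical question such as `price_as_of("item-1", date(2020, 6, 1))`.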
Data Warehouse—Nonvolatile
■ A physically separate store of data transformed from the
operational environment
■ Operational update of data does not occur in the data
warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data
OLTP vs. OLAP
Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
■ Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture
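[Figure: multi-tiered data warehouse architecture, left to right]
■ Data sources: operational DBs, other sources
■ Extract, transform, load, refresh
■ Data storage: data warehouse, data marts, metadata, monitor & integrator
■ OLAP engine: OLAP server, served from the data storage tier
■ Front-end tools: analysis, query, reports, data mining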
Three Data Warehouse Models
■ Enterprise warehouse
■ collects all of the information about subjects spanning
the entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases
■ Only some of the possible summary views may be
materialized
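The distinction between a plain view and a materialized summary can be sketched with Python's built-in `sqlite3` module. The `orders` table and its contents are hypothetical. SQLite has no materialized views, so the sketch stands in for one with an ordinary table built from the view (an assumption of this example, not a feature of any particular warehouse product).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders (region, amount) VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);

    -- Virtual warehouse: a summary view, recomputed on every query.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

    -- Stand-in for a materialized view: stored once, refreshed explicitly.
    CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational row shows up in the view but not in the stored summary.
conn.execute("INSERT INTO orders (region, amount) VALUES ('west', 2.5)")
view_rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
mat_rows = conn.execute(
    "SELECT region, total FROM sales_by_region_mat ORDER BY region").fetchall()
```

The view is always current but puts its query load on the operational database. The stored summary is cheap to read but goes stale until it is refreshed.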
Extraction, Transformation, and Loading (ETL)
■ Data extraction
■ get data from multiple, heterogeneous, and external
sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse
format
■ Load
■ sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the
warehouse
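The first four steps above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions: the CSV text stands in for an external source, cleaning simply drops rows it cannot repair, and the "warehouse" is an in-memory dict. Refresh is not shown. All function and field names are hypothetical.

```python
import csv
import io

# Hypothetical extract: CSV text standing in for an external source.
SOURCE_CSV = "id,amount,country\n1,10.5,us\n2,,ca\n3,abc,US\n4,7.0,Ca\n"

def extract(text):
    """Read raw records from the source as dicts of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Drop rows whose amount is missing or not numeric."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Convert to the warehouse format: typed values, upper-case country codes."""
    return [
        {"id": int(r["id"]), "amount": float(r["amount"]), "country": r["country"].upper()}
        for r in rows
    ]

def load(rows, dw):
    """Sort the facts, store them, and build a summary by country."""
    dw["facts"] = sorted(rows, key=lambda r: r["id"])
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0.0) + r["amount"]
    dw["total_by_country"] = totals

dw = {}
load(transform(clean(extract(SOURCE_CSV))), dw)
```

Keeping each stage a separate function makes it easy to test cleaning rules independently of the load logic.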
Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, views, dimensions, hierarchies, derived data definitions, data
mart locations and contents
■ Operational metadata
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
From Tables and Spreadsheets to
Data Cubes
■ A data warehouse is based on a multidimensional data model
which views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions
■ Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
■ Fact table contains measures (such as dollars_sold) and keys
to each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
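The lattice described above can be enumerated directly: each cuboid corresponds to one subset of the dimensions, so n dimensions give 2^n cuboids when concept hierarchies are ignored. The sketch below assumes four example dimensions; the function name `cuboids` is illustrative.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return all cuboids (subsets of dimensions), grouped by dimensionality."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

lattice = cuboids(["time", "item", "location", "supplier"])
total = sum(len(group) for group in lattice.values())

apex = lattice[0]   # the single 0-D cuboid, with no group-by dimensions
base = lattice[4]   # the single 4-D cuboid, with all dimensions
```

Here `lattice[k]` holds the k-D cuboids, so `lattice[2]` lists the six 2-D group-bys.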
Cube: A Lattice of Cuboids
■ 0-D (apex) cuboid: all
■ 1-D cuboids: time; item; location; supplier
■ 2-D cuboids: (time, item); (time, location); (time, supplier);
(item, location); (item, supplier); (location, supplier)
■ 3-D cuboids: (time, item, location); (time, item, supplier);
(time, location, supplier); (item, location, supplier)
■ 4-D (base) cuboid: (time, item, location, supplier)
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
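A star schema can be sketched with Python's built-in `sqlite3` module: one fact table holding foreign keys and measures, joined to its dimension tables at query time. The table names `time_dim`, `item_dim`, and `sales_fact`, the reduced column sets, and the sample rows are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    -- Fact table: one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time_dim VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
    INSERT INTO item_dim VALUES (1, 'TV', 'BrandA'), (2, 'PC', 'BrandB');
    INSERT INTO sales_fact VALUES (1, 1, 2, 800.0), (1, 2, 1, 900.0), (2, 1, 3, 1200.0);
""")

# A typical star join: aggregate a measure grouped by dimension attributes.
rows = conn.execute("""
    SELECT i.item_name, t.year, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY i.item_name, t.year
    ORDER BY i.item_name
""").fetchall()
```

A snowflake variant would split a dimension further, for example moving supplier attributes out of the item table into their own table referenced by a key.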
Example of Star Schema
■ Sales fact table
■ Keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_type)
■ branch (branch_key, branch_name, branch_type)
■ location (location_key, street, city, state_or_province, country)
Example of Snowflake Schema
■ Sales fact table
■ Keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
■ Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_key)
■ supplier (supplier_key, supplier_type)
■ branch (branch_key, branch_name, branch_type)
■ location (location_key, street, city_key)
■ city (city_key, city, state_or_province, country)
Example of Fact Constellation
■ Sales fact table
■ Keys: time_key, item_key, branch_key, location_key
■ Measures: units_sold, dollars_sold, avg_sales
■ Shipping fact table
■ Keys: time_key, item_key, shipper_key, from_location, to_location
■ Measures: dollars_cost, units_shipped
■ Dimension tables
■ time (time_key, day, day_of_the_week, month, quarter, year)
■ item (item_key, item_name, brand, type, supplier_type)
■ branch (branch_key, branch_name, branch_type)
■ location (location_key, street, city, province_or_state, country)
■ shipper (shipper_key, shipper_name, location_key, shipper_type)
A Concept Hierarchy:
Dimension (location)
all all
region Europe ... North_America
country Germany ... Spain Canada ... Mexico
city Frankfurt ... Vancouver ... Toronto
office L. Chan ... M. Wind
Data Cube Measures: Three Categories
■ Distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate
function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
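The three categories can be checked on a small example. The sketch below assumes the data is split into three arbitrary partitions and uses only the standard library; the numbers are made up.

```python
from statistics import median

partitions = [[1, 2, 3], [10, 20], [100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
sum_of_sums = sum(sum(part) for part in partitions)

# Algebraic: avg is derived from two distributive aggregates, sum and count.
total_count = sum(len(part) for part in partitions)
avg_from_parts = sum_of_sums / total_count

# Holistic: the median of per-partition medians is generally not the median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

For sum and avg, a fixed number of values per partition is enough to recover the exact result. For the median, no bounded summary of each partition suffices in general, which is why the two median values above differ.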
Multidimensional Data
■ Sales volume as a function of product, month,
and region
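■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths (most general level first)
■ Product: Industry, Category, Product
■ Location: Region, Country, City, Office
■ Time: Year, Quarter, Month / Week, Day
[Figure: 3-D cube with axes Product, Region, Month]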
A Sample Data Cube
[Figure: 3-D sales cube with a sum margin on each dimension]
■ Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
■ Product: TV, PC, VCR, sum
■ Country: U.S.A., Canada, Mexico, sum
■ Highlighted cell: total annual sales of TVs in U.S.A.
Cuboids Corresponding to the Cube
■ 0-D (apex) cuboid: all
■ 1-D cuboids: product; date; country
■ 2-D cuboids: (product, date); (product, country); (date, country)
■ 3-D (base) cuboid: (product, date, country)
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
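The operations above can be sketched on a tiny cube stored as a Python dict. This is a minimal illustration, assuming a hypothetical base cuboid keyed by (city, quarter, product) with made-up unit counts; the function names are not taken from any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, quarter, product) -> units sold.
CUBE = {
    ("Vancouver", "Q1", "TV"): 5, ("Vancouver", "Q2", "TV"): 7,
    ("Toronto", "Q1", "TV"): 3, ("Toronto", "Q1", "PC"): 4,
    ("Frankfurt", "Q1", "PC"): 6, ("Frankfurt", "Q2", "TV"): 2,
}
# Concept hierarchy for location: city -> country.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}

def roll_up_location(cube):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, product), units in cube.items():
        out[(CITY_TO_COUNTRY[city], quarter, product)] += units
    return dict(out)

def roll_up_drop(cube, axis):
    """Roll up by dimension reduction: aggregate away one dimension."""
    out = defaultdict(int)
    for key, units in cube.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)

def slice_(cube, axis, value):
    """Slice: select one value on one dimension and project that dimension out."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}

def dice(cube, allowed):
    """Dice: select on several dimensions; `allowed` maps axis -> set of values."""
    return {k: v for k, v in cube.items()
            if all(k[a] in vals for a, vals in allowed.items())}

def pivot(cube, order):
    """Pivot: reorder the dimensions without changing any value."""
    return {tuple(k[a] for a in order): v for k, v in cube.items()}

by_country = roll_up_location(CUBE)
by_city_product = roll_up_drop(CUBE, axis=1)
q1 = slice_(CUBE, axis=1, value="Q1")
canada_tv = dice(CUBE, {0: {"Vancouver", "Toronto"}, 2: {"TV"}})
rotated = pivot(CUBE, (2, 0, 1))
```

Drill-down has no function here because it cannot be computed from a rolled-up result alone: it requires going back to the more detailed cuboid, which in this sketch is `CUBE` itself.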
Fig. 3.10 Typical OLAP
Operations
Design of Data Warehouse: A Business
Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the
viewpoint of the end user
Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From a software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, with
short turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measures that will populate each fact table record
Data Warehouse Usage
■ Three kinds of data warehouse applications
■ Information processing
■ supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs
■ Analytical processing
■ multidimensional analysis of data warehouse data
■ supports basic OLAP operations, slice-dice, drilling, pivoting
■ Data mining
■ knowledge discovery from hidden patterns
■ supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools