An Introduction to
Data Warehousing
Data, Data everywhere
yet ...
I cant find the data I need
data is scattered over the
network
many
subtle
I
cant
get versions,
the data I need
differences
need an expert to get the data
I cant understand the data I found
available data poorly
documented
I cant use the data I found
results are unexpected
data needs to be transformed
from one form to other
2
So What Is a Data
Warehouse?
Definition: A single, complete and consistent
store of data obtained from a variety of different
sources made available to end users in a what
they can understand and use in a business
context. [Barry Devlin]
By comparison: an OLTP (on-line transaction
processor) or operational system is used to deal
with the everyday running of one aspect of an
enterprise.
OLTP systems are usually designed
independently of each other and it is difficult for
them to share information.
Why Do We Need Data
Warehouses?
Consolidation of information
resources
Improved query performance
Separate research and decision
support functions from the
operational systems
Foundation for data mining, data
visualization, advanced reporting
and OLAP tools
Why Data Warehousing?
Which
Whichare
areour
our
lowest/highest
lowest/highestmargin
margin
customers
customers??
What
Whatisisthe
themost
most
effective
effectivedistribution
distribution
channel?
channel?
What
Whatproduct
productpromprom-otions
-otionshave
havethe
thebiggest
biggest
impact
impacton
onrevenue?
revenue?
Who
Whoare
aremy
mycustomers
customers
and
andwhat
whatproducts
products
are
arethey
theybuying?
buying?
Which
Whichcustomers
customers
are
aremost
mostlikely
likelyto
togo
go
to
tothe
thecompetition
competition??
What
Whatimpact
impactwill
will
new
newproducts/services
products/services
have
haveon
onrevenue
revenue
and
andmargins?
margins?
What Is a Data Warehouse Used
for?
Knowledge discovery
Making consolidated reports
Finding relationships and correlations
Data mining
Examples
Banks identifying credit risks
Insurance companies searching for fraud
Medical research
How Do Data Warehouses Differ
From Operational Systems?
Goals
Structure
Size
Performance optimization
Technologies used
Comparison Chart of Database
Types
Data warehouse
Operational system
Subject oriented
Transaction oriented
Large (hundreds of GB up to several
TB)
Small (MB up to several GB)
Historic data
Current data
De-normalized table structure (few
tables, many columns per table)
Normalized table structure (many
tables, few columns per table)
Batch updates
Continuous updates
Usually very complex queries
Simple to complex queries
Design Differences
Operational System
ER Diagram
Data Warehouse
Star Schema
Supporting a Complete
Solution
Operational SystemData Entry
Data WarehouseData Retrieval
Data Warehouses, Data Marts, and Operational Data
Stores
Data Warehouse The queryable source of data
in the enterprise. It is comprised of the union of
all of its constituent data marts.
Data Mart A logical subset of the complete data
warehouse. Often viewed as a restriction of the
data warehouse to a single business process or to
a group of related business processes targeted
toward a particular business group.
Operational Data Store (ODS) A point of
integration for operational systems that
developed independent of each other. Since an
ODS supports day to day operations, it needs to
be continually updated.
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than
update
Use of the system is loosely defined
and can be ad-hoc
Used by managers and end-users to
understand the business and make
judgements
12
What are the users
saying...
Data should be integrated
across the enterprise
Summary data had a real
value to the organization
Historical data held the
key to understanding data
over time
What-if capabilities are
required
13
Data Warehousing -It is a process
Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
Thus making decisions that
were not previous possible
A decision support database
maintained separately from
the organizations operational
database
14
Data Warehouse
Architecture
Relational
Databases
Legacy
Data
Purchased
Data
Optimized Loader
Extraction
Cleansing
Data Warehouse
Engine
Analyze
Query
Metadata Repository
15
From the Data Warehouse to Data
Marts
Information
Individually
Structured
Departmentally
Structured
Organizationally
Data Warehouse
Structured
Less
History
Normalized
Detailed
More
Data
16
Users have different views
of Data
OLAP
Tourists: Browse
information harvested
by farmers
Farmers: Harvest information
from known access paths
Organizationally
structured
Explorers: Seek out the unknown
and previously unsuspected rewards
hiding in the detailed data
17
Wal*Mart Case Study
Founded by Sam Walton
One the largest Super Market Chains
in the US
Wal*Mart: 2000+ Retail Stores
SAM's Clubs 100+Wholesalers Stores
This case study is from Felipe Carinos (NCR Teradata)
presentation made at Stanford Database Seminar
18
Old Retail Paradigm
Wal*Mart
Inventory Management
Merchandise Accounts
Payable
Purchasing
Supplier Promotions:
National, Region, Store
Level
Suppliers
Accept Orders
Promote Products
Provide special
Incentives
Monitor and Track
The Incentives
Bill and Collect
Receivables
Estimate Retailer
Demands
19
New (Just-In-Time) Retail
Paradigm
No more deals
Shelf-Pass Through (POS Application)
One Unit Price
Suppliers paid once a week on ACTUAL items sold
Wal*Mart Manager
Daily Inventory Restock
Suppliers (sometimes SameDay) ship to Wal*Mart
Warehouse-Pass Through
Stock some Large Items
Delivery may come from supplier
Distribution Center
Suppliers merchandise unloaded directly onto Wal*Mart
Trucks
20
Information as a Strategic
Weapon
Daily Summary of all Sales Information
Regional Analysis of all Stores in a logical
area
Specific Product Sales
Specific Supplies Sales
Trend Analysis, etc.
Wal*Mart uses information when
negotiating with
Suppliers
Advertisers etc.
21
Schema Design
Database organization
must look like business
must be recognizable by business user
approachable by business user
Must be simple
Schema Types
Star Schema
Fact Constellation Schema
Snowflake schema
22
Star Schema
A single fact table and for each
dimension one dimension table
Does not capture hierarchies directly
T
i
date, custno, prodno, cityname, sales
m
e
c
u
s
t
f
a
c
t
p
r
o
d
c
i
t
y
23
Dimension Tables
Dimension tables
Define business in terms already familiar
to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
heavily indexed
typical dimensions
time periods, geographic region (markets,
cities), products, customers, salesperson,
etc.
24
Fact Table
Central table
Typical example: individual sales
records
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a
billion)
Access via dimensions
25
Snowflake schema
Represent dimensional hierarchy directly
by normalizing tables.
Easy to maintain and saves storage
T
i
p
r
o
d
date, custno, prodno, cityname, ...
m
e
c
u
s
t
f
a
c
t
c
i
t
y
r
e
g
i
26 o
n
Fact Constellation
Fact Constellation
Multiple fact tables that share many
dimension tables
Booking and Checkout may share many
dimension tables in the hotel industry
Hotels
Promotion
Booking
Checkout
Travel Agents
Room Type
Customer
27
Data Granularity in
Warehouse
Summarized data stored
reduce storage costs
reduce cpu usage
increases performance since smaller
number of records to be processed
design around traditional high level
reporting needs
tradeoff with volume of data to be
stored and detailed usage of data
28
Granularity in Warehouse
Solution is to have dual level of
granularity
Store summary data on disks
95% of DSS processing done against this
data
Store detail on tapes
5% of DSS processing against this data
29
Levels of Granularity
Banking Example
Operational
account
activity date
amount
teller
location
account bal 60 days of
account
month
# trans
withdrawals
monthly account deposits
register -- up to
average bal
10 years
activity
Not all fields
need be
archived
amount
activity date
amount
account bal
30
Data Integration Across
Sources
Savings
Same data
different name
Loans
Different data
Same name
Trust
Data found here
nowhere else
Credit card
Different keys
same data
31
Data Transformation
Operational/
Source Data
Sequential
Legacy
Data
Accessing
Capturing
Extracting
Transformation Reconciling Conditioning Loading
Relational
External
Householding Filtering
Validating
Scoring
Data transformation is the foundation
for achieving single version of the truth
Major concern for IT
Data warehouse can fail if appropriate
data transformation strategy is not
developed
32
Data Transformation
Example
encoding
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female
unit
appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds
field
Data Warehouse
appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
33
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
mumbai, bombay
Different account numbers generated by different
applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
manual entry leads to mistakes
in case of a problem use 9999999
34
Data Transformation Terms
Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating
35
Data Transformation Terms
Householding
Identifying all members of a household
(living at the same address)
Ensures only one mail is sent to a
household
Can result in substantial savings: 1
million catalogues at Rs. 50 each costs
Rs. 50 million . A 2% savings would save
Rs. 1 million
36
Refresh
Propagate updates on source data to
the warehouse
Issues:
when to refresh
how to refresh -- incremental refresh
techniques
37
When to Refresh?
periodically (e.g., every night, every
week) or after significant events
on every update: not warranted
unless warehouse data require
current data (up to the minute stock
quotes)
refresh policy set by administrator
based on user needs and traffic
possibly different policies for
different sources
38
Refresh techniques
Incremental techniques
detect changes on base tables:
replication servers (e.g., Sybase, Oracle,
IBM Data Propagator)
snapshots (Oracle)
transaction shipping (Sybase)
compute changes to derived and
summary tables
maintain transactional correctness for
incremental load
39
How To Detect Changes
Create a snapshot log table to record
ids of updated rows of source data
and timestamp
Detect changes by:
Defining after row triggers to update
snapshot log when source table changes
Using regular transaction log to detect
changes to source data
40
Querying Data Warehouses
SQL Extensions
Multidimensional modeling of data
OLAP
More on OLAP later
41
SQL Extensions
Extended family of aggregate
functions
rank (top 10 customers)
percentile (top 30% of customers)
median, mode
Object Relational Systems allow addition
of new aggregate functions
Reporting features
running total, cumulative totals
42
Reporting Tools
Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for
Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist,
ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data
43
Decision support tools
Direct
Query
Reporting
tools
Crystal reports
Merge
Clean
Summarize
Detailed
transactional
data
Mining
tools
OLAP
Intelligent Miner
Essbase
Relational
DBMS+
e.g. Redbrick
Data warehouse
GIS
data
Operational data
Bombay branch
Oracle
Delhi branch
Calcutta branch
IMS
Census
data
SAS
44
Deploying Data
Warehouses
What business information
keeps you in business
today? What business
information can put you out
of business tomorrow?
What business information
should be a mouse click
away?
What business conditions
are the driving the need for
business information?
45
Cultural Considerations
Not just a technology
project
New way of using
information to support
daily activities and
decision making
Care must be taken to
prepare organization for
change
Must have organizational
backing and support
46
User Training
Users must have a higher level of IT
proficiency than for operational
systems
Training to help users analyze data in
the warehouse effectively
47
Summary: Building a Data
Warehouse
Data Warehouse Lifecycle
Analysis
Design
Import data
Install front-end
tools
Test and deploy
A case -- the STORET Central
Warehouse
Improved performance and faster
data retrieval
Ability to produce larger reports
Ability to provide more data query
options
Streamlined application navigation
Old Web Application Flow
Central Warehouse Application
Flow
Search Criteria
Selection
Report Size Feedback/
Report Customization
Report Generation
Web Application Demo
STORET Central Warehouse:
http://epa.gov/storet/dw_home.html
STORET Central Warehouse
Potential Future Enhancements
More query functionality
Additional report types
Web Services
Additional source systems?
STORET
State
System A
State
System B
Data Warehouse
Components
SOURCE:
Ralph Kimball
Data Warehouse Components
Detailed
SOURCE:
Ralph Kimball
Online analytical
processing
(OLAP)
56
Nature of OLAP Analysis
Aggregation -- (total sales, percentto-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate
data
Complex criteria specification
Visualization
Need interactive response to aggregate
57
Multi-dimensional Data
Re
gi
on
Measure - sales (actual, plan, variance)
Product
W
S
N
Juice
Cola
Milk
Cream
Toothpaste
Soap
1 2 34 5 6 7
Month
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product
Industry
Region
Country
Time
Year
Category
Region
Quarter
Product
City
Office
Month
Day
week
58
Conceptual Model for OLAP
Numeric measures to be analyzed
e.g. Sales (Rs), sales (volume), budget,
revenue, inventory
Dimensions
other attributes of data, define the
space
e.g., store, product, date-of-sale
hierarchies on dimensions
e.g. branch -> city -> state
59
Operations
Rollup: summarize data
e.g., given sales data, summarize sales
for last year by product category and
region
Drill down: get more details
e.g., given summarized sales as above,
find breakup of sales by city within each
region, or within the Andhra region
60
More OLAP Operations
Hypothesis driven search: E.g.
factors affecting defaulters
view defaulting rate on age aggregated over
other dimensions
for particular age segment detail along
profession
Need interactive response to aggregate
queries
=> precompute various aggregates
61
MOLAP vs ROLAP
MOLAP: Multidimensional array OLAP
ROLAP: Relational OLAP
Type
Size
Colour Amount
Shirt
Shirt
Shirt
Shirt
Shirt
Shirt
Shirt
ALL
S
L
ALL
S
L
ALL
ALL
ALL
Blue
Blue
Blue
Red
Red
Red
ALL
ALL
10
25
35
3
7
10
45
1290
62
SQL Extensions
Cube operator
group by on all subsets of a set of
attributes (month,city)
redundant scan and sorting of data can
be avoided
Various other non-standard SQL
extensions by vendors
63
OLAP: 3 Tier DSS
Data Warehouse
OLAP Engine
Database Layer
Application Logic Layer
Presentation Layer
Generate SQL
execution plans in the
OLAP engine to obtain
OLAP functionality.
Obtain multidimensional reports
from the DSS Client.
Store atomic data
in industry
standard Data
Warehouse.
Decision Support Client
64
Strengths of OLAP
It is a powerful
visualization tool
It provides fast,
interactive response
times
It is good for analyzing
time series
It can be useful to find
some clusters and
outliners
Many vendors offer OLAP
tools
65
Brief History
Express and System W DSS
Online Analytical Processing - coined by
EF Codd in 1994 - white paper by
Arbor Software
Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence, Executive
Information System
MOLAP: Multidimensional OLAP (Hyperion (Arbor
Essbase), Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)
66
OLAP and Executive Information
Systems
Andyne Computing -Pablo
Arbor Software -Essbase
Cognos -- PowerPlay
Comshare -Commander OLAP
Holistic Systems -Holos
Information Advantage
-- AXSYS, WebOLAP
Informix -- Metacube
Microstrategies
--DSS/Agent
Oracle -- Express
Pilot -- LightShip
Planning Sciences -Gentium
Platinum Technology -ProdeaBeacon, Forest
& Trees
SAS Institute -SAS/EIS, OLAP++
Speedware -- Media
67
Microsoft OLAP strategy
Plato: OLAP server: powerful, integrating
various operational sources
OLE-DB for OLAP: emerging industry standard
based on MDX --> extension of SQL for OLAP
Pivot-table services: integrate with Office 2000
Every desktop will have OLAP capability.
Client side caching and calculations
Partitioned and virtual cube
Hybrid relational and multidimensional storage
68