Data Warehousing and On-Line Analytical Processing

The document discusses key concepts related to data warehousing and online analytical processing (OLAP). It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. The document outlines characteristics of a data warehouse such as being subject-oriented, integrated, time-variant and non-volatile. It also discusses ETL processes, multidimensional data modeling using cubes and dimensions, and different data warehouse architectures including enterprise data warehouses and data marts.

Uploaded by

Chitransh Naman

Data Warehousing and On-line Analytical Processing

What is a Data Warehouse?
- Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database.
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Data warehousing: the process of constructing and using data warehouses.

Data Warehouse: Subject-Oriented
- Organized around major subjects, such as customer, product, and sales.
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
- Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
- Constructed by integrating multiple, heterogeneous data sources:
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied.
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
    - E.g., hotel price: currency, tax, breakfast covered, etc.
  - When data is moved to the warehouse, it is converted.
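
The hotel-price example can be made concrete with a minimal Python sketch. The two source record layouts, the exchange rate, and the tax rate below are hypothetical values chosen only for illustration; the point is that each source is converted to one agreed warehouse format on the way in.

```python
from dataclasses import dataclass

# Hypothetical conversion constants, for illustration only.
EUR_TO_USD = 1.10   # assumed exchange rate
TAX_RATE = 0.10     # assumed tax rate that source B leaves out of its prices


@dataclass(frozen=True)
class HotelPrice:
    """Warehouse format: USD, tax included, breakfast flag explicit."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(rec: dict) -> HotelPrice:
    # Source A (assumed): USD, tax already included, breakfast as "Y"/"N".
    return HotelPrice(rec["name"], round(rec["price"], 2), rec["bfast"] == "Y")


def from_source_b(rec: dict) -> HotelPrice:
    # Source B (assumed): EUR, tax excluded, breakfast as a boolean.
    usd = rec["net_eur"] * (1 + TAX_RATE) * EUR_TO_USD
    return HotelPrice(rec["hotel_name"], round(usd, 2), rec["breakfast"])


warehouse_rows = [
    from_source_a({"name": "Alpha Inn", "price": 120.0, "bfast": "Y"}),
    from_source_b({"hotel_name": "Beta Hof", "net_eur": 100.0, "breakfast": False}),
]
```

After conversion, both rows share the same currency, tax treatment, and breakfast encoding, so they can be compared directly.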

Data Warehouse: Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems.
  - Operational database: current-value data.
  - Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years).
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
  - The key of operational data, by contrast, may or may not contain a time element.

Data Warehouse: Nonvolatile
- A physically separate store of data transformed from the operational environment.
- Operational update of data does not occur in the data warehouse environment.
  - Does not require transaction processing, recovery, and concurrency control mechanisms.
  - Requires only two operations in data accessing: initial loading of data and access of data.

A Data Warehouse Is A Process
[Figure: process diagram. Stages: Source OLTP Systems; Data Warehouse (Central Repository); Architected Data Mart; End User Workstations. Data characteristics: raw detail with no/minimal history; integrated, scrubbed, history, summaries; targeted, specialized (OLAP). Process steps: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis; Resource Scheduling & Distribution. Supporting layers: Meta Data, System Monitoring.]

There Are Many Options
[Figure: architecture options connecting Operational Source Systems, Extraction Systems, an Operational Data Store, an Independent Data Mart, the Data Warehouse, an Architected Data Mart, and User Workstations.]

OLTP vs. OLAP

                   | OLTP                                                     | OLAP
users              | clerk, IT professional                                   | knowledge worker
function           | day-to-day operations                                    | decision support
DB design          | application-oriented                                     | subject-oriented
data               | current, up-to-date; detailed, flat relational; isolated | historical; summarized, multidimensional; integrated, consolidated
usage              | repetitive                                               | ad hoc
access             | read/write; index/hash on primary key                    | lots of scans
unit of work       | short, simple transaction                                | complex query
# records accessed | tens                                                     | millions
# users            | thousands                                                | hundreds
DB size            | 100MB-GB                                                 | 100GB-TB
metric             | transaction throughput                                   | query throughput, response
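
The "unit of work" and "access" rows can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical. The same engine is used for both statements purely for convenience, not as a claim about how either kind of system is deployed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 20.0), (3, "east", 30.0)],
)

# OLTP-style unit of work: a short read/write transaction that touches
# one row, located through the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = amount + 5.0 WHERE order_id = ?", (2,))

# OLAP-style unit of work: a read-only query that scans many rows
# and returns a summary.
totals_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
```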



Why a Separate Data Warehouse?
- High performance for both systems
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
- Different functions and different data:
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain.
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
  - Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
- Note: more and more systems perform OLAP analysis directly on relational databases.

Data Warehouse: A Multi-Tiered Architecture
[Figure: four columns, left to right. Data Sources: operational DBs and other sources, fed in through Extract, Transform, Load, and Refresh. Data Storage: the data warehouse and data marts, with a metadata repository and a monitor & integrator. OLAP Engine: OLAP server, which serves the front end. Front-End Tools: analysis, query, reports, data mining.]

Three-Tier Data Warehouse Architecture
- The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
- The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
- The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
- From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.

Three Data Warehouse Models
- Enterprise warehouse
  - Collects all of the information about subjects spanning the entire organization.
- Data mart
  - A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
  - Independent vs. dependent (sourced directly from the warehouse) data mart.
- Virtual warehouse
  - A set of views over operational databases.
  - Only some of the possible summary views may be materialized.
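
A virtual warehouse can be sketched as a summary view defined directly over an operational table. The example below uses Python's built-in `sqlite3` module; the table `op_sales`, the view `v_sales_by_branch`, and the rows are hypothetical names and data introduced only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for an operational table (hypothetical schema).
conn.execute(
    "CREATE TABLE op_sales (sale_id INTEGER PRIMARY KEY, branch TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO op_sales VALUES (?, ?, ?)",
    [(1, "B1", 100.0), (2, "B1", 50.0), (3, "B2", 70.0)],
)

# The "virtual warehouse": a summary view over the operational table.
conn.execute("""
    CREATE VIEW v_sales_by_branch AS
    SELECT branch, SUM(amount) AS total_amount
    FROM op_sales
    GROUP BY branch
""")

# This view is not materialized, so it reflects new operational rows
# as soon as they are inserted.
conn.execute("INSERT INTO op_sales VALUES (4, 'B2', 30.0)")
summary = conn.execute(
    "SELECT branch, total_amount FROM v_sales_by_branch ORDER BY branch"
).fetchall()
```

Because every query against the view is recomputed from the operational table, it competes with operational work for the same server, which is one reason to materialize some of the summary views.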

What are the pros and cons of the top-down and bottom-up approaches to data warehouse development?
- Top-down: it is expensive, takes a long time to develop, and lacks flexibility due to the difficulty in achieving consistency.
- Bottom-up: the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return on investment. The drawback is the difficulty of later integrating the various disparate data marts into a consistent enterprise data warehouse.
- A recommended method for the development of data warehouse systems is to implement the warehouse in an incremental and evolutionary manner.

Extraction, Transformation, and Loading (ETL)
- Data extraction
  - Get data from multiple, heterogeneous, and external sources.
- Data cleaning
  - Detect errors in the data and rectify them when possible.
- Data transformation
  - Convert data from legacy or host format to warehouse format.
- Load
  - Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
- Refresh
  - Propagate the updates from the data sources to the warehouse.
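
The ETL steps above can be sketched as a tiny pipeline in Python. This is only an illustration under stated assumptions: the two source layouts, the `sales` table, and the sample records are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source records in two different "legacy" formats.
source_a = [
    {"id": "1", "amount": "10.50", "date": "2024-01-05"},
    {"id": "2", "amount": "n/a", "date": "2024-01-06"},  # dirty row
]
source_b = [("3", 700, "05/02/2024")]  # amount in cents, date as DD/MM/YYYY


def extract():
    """Extraction: gather raw rows from multiple, heterogeneous sources."""
    for rec in source_a:
        yield ("a", rec)
    for rec in source_b:
        yield ("b", rec)


def clean_and_transform(tagged):
    """Cleaning and transformation: skip unfixable rows, convert the rest."""
    for source, rec in tagged:
        try:
            if source == "a":
                yield (int(rec["id"]), float(rec["amount"]), rec["date"])
            else:
                day, month, year = rec[2].split("/")
                yield (int(rec[0]), rec[1] / 100.0, f"{year}-{month}-{day}")
        except ValueError:
            continue  # a real pipeline would log or quarantine the row


def load(conn, rows):
    """Load: store rows, enforce key integrity, and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(sale_id INTEGER PRIMARY KEY, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, clean_and_transform(extract()))
loaded = conn.execute("SELECT * FROM sales ORDER BY sale_id").fetchall()
```

In this sketch, calling `load` again with newer source rows plays the role of a refresh, since `INSERT OR REPLACE` overwrites rows that share a key.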

Metadata Repository
Metadata is the data defining warehouse objects. It stores:
- Description of the structure of the data warehouse
  - schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents
- Operational metadata
  - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
- The algorithms used for summarization
- The mapping from the operational environment to the data warehouse
- Data related to system performance
  - warehouse schema, view and derived data definitions
- Business data
  - business terms and definitions, ownership of data, charging policies

Chapter 4: Data Warehousing and On-line Analytical Processing
- Data Warehouse: Basic Concepts
- Data Warehouse Modeling: Data Cube and OLAP
- Data Warehouse Design and Usage
- Data Warehouse Implementation
- Summary

From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  - The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
- In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

What is a Data Cube?
- A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
- Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
- Facts are numerical measures. They are the quantities by which we want to analyze relationships between dimensions.

Cube: A Lattice of Cuboids
  0-D (apex) cuboid: all
  1-D cuboids:       time; item; location; supplier
  2-D cuboids:       time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids:       time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid: time,item,location,supplier
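
The lattice can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions without concept hierarchies give 2^n cuboids. The following short Python sketch groups the subsets by size; the variable names are illustrative.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

# Group every subset of the dimensions by its size:
# level 0 is the apex cuboid, level len(dimensions) is the base cuboid.
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

total_cuboids = sum(len(level) for level in lattice.values())
```

For the four dimensions on this slide, the levels contain 1, 4, 6, 4, and 1 cuboids, for 16 in total.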

Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables.
  - Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
  - Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation.

Example of Star Schema
Sales Fact Table (center)
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country
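
The star schema above can be sketched with Python's built-in `sqlite3` module. Table and column names follow the slide. The column types, the sample rows, and the omission of `avg_sales` (which can be derived from the other two measures) are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       state_or_province TEXT, country TEXT);
-- Fact table: one foreign key per dimension, plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO time VALUES (1, 5, 'Fri', 1, 'Q1', 2024)")
conn.execute("INSERT INTO time VALUES (2, 9, 'Tue', 4, 'Q2', 2024)")
conn.execute("INSERT INTO item VALUES (1, 'TV-100', 'Acme', 'TV', 'wholesale')")
conn.execute("INSERT INTO branch VALUES (1, 'Main', 'retail')")
conn.execute("INSERT INTO location VALUES (1, '1 Main St', 'Vancouver', 'BC', 'Canada')")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 2, 800.0), (1, 1, 1, 1, 1, 400.0), (2, 1, 1, 1, 3, 1200.0)],
)

# A typical star join: aggregate the measures by attributes of two dimensions.
rows = conn.execute("""
    SELECT t.quarter, l.country, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t     ON s.time_key = t.time_key
    JOIN location AS l ON s.location_key = l.location_key
    GROUP BY t.quarter, l.country
    ORDER BY t.quarter
""").fetchall()
```

Every dimension is one join away from the fact table, which keeps such queries short.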

Example of Snowflake Schema
Sales Fact Table (center)
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
Normalized sub-dimension tables
  supplier: supplier_key, supplier_type
  city:     city_key, city, state_or_province, country
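
The effect of normalizing location into a separate city table can be seen in a minimal `sqlite3` sketch. Only the location branch of the snowflake is modeled, and the fact table is reduced to one key and one measure; the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       state_or_province TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                       dollars_sold REAL);

INSERT INTO city VALUES (1, 'Vancouver', 'BC', 'Canada'), (2, 'Toronto', 'ON', 'Canada');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak Ave', 1), (12, '3 King St', 2);
INSERT INTO sales VALUES (10, 100.0), (11, 50.0), (12, 70.0);
""")

# Reaching city-level attributes now takes two joins instead of one.
by_city = conn.execute("""
    SELECT c.city, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN location AS l ON s.location_key = l.location_key
    JOIN city AS c     ON l.city_key = c.city_key
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
```

The trade-off is visible here: city attributes are stored once per city rather than once per street address, at the cost of an extra join in queries.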

Example of Fact Constellation
Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
  keys:     time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped
Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type
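
A Concept Hierarchy: Dimension (location)
[Figure: tree over the levels office < city < country < region < all.
  all:     all
  region:  Europe, ..., North_America
  country: Germany, ..., Spain (under Europe); Canada, ..., Mexico (under North_America)
  city:    Frankfurt, ...; Vancouver, ..., Toronto
  office:  L. Chan, ..., M. Wind]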

View of Warehouses and Hierarchies
- Specification of hierarchies
  - Schema hierarchy: day < {month < quarter; week} < year
  - Set_grouping hierarchy: {1..10} < inexpensive
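
Both kinds of hierarchy can be expressed as simple mapping functions. In the Python sketch below, the function names, the use of ISO week numbers, and the catch-all group "other" are assumptions of the example; only the `{1..10} < inexpensive` grouping comes from the slide.

```python
from datetime import date


def schema_hierarchy(d: date) -> dict:
    """Map a day to its ancestors: day < {month < quarter; week} < year."""
    _iso_year, iso_week, _iso_weekday = d.isocalendar()
    return {
        "day": d.isoformat(),
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "week": iso_week,  # ISO week number (an assumption of this sketch)
        "year": d.year,
    }


def price_group(price: float) -> str:
    """Set-grouping hierarchy: {1..10} < inexpensive."""
    if 1 <= price <= 10:
        return "inexpensive"
    return "other"  # hypothetical catch-all; the slide defines only one group


levels = schema_hierarchy(date(2024, 5, 17))
```

Weeks can straddle month boundaries, which is why week sits on its own branch between day and year rather than under month.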

Multidimensional Data
- Sales volume as a function of product, month, and region
[Figure: sales-volume cube over product, month, and region]
- Dimensions: Product, Location, Time
- Hierarchical summarization paths (most general level first):

  Product  | Location | Time
  Industry | Region   | Year
  Category | Country  | Quarter
  Product  | City     | Month, Week
           | Office   | Day

A Sample Data Cube
[Figure: 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell holds the total annual sales of TVs in U.S.A.]

Cuboids Corresponding to the Cube
  0-D (apex) cuboid: all
  1-D cuboids:       product; date; country
  2-D cuboids:       product,date; product,country; date,country
  3-D (base) cuboid: product,date,country

Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate):
  - reorient the cube for visualization, e.g., 3-D to a series of 2-D planes
- Other operations
  - Drill across: involving (across) more than one fact table
  - Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

Fig. 3.10 Typical OLAP Operations
- Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
- Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
- Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The dice operation defines a subcube by performing a selection on two or more dimensions.
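
The operations described above can be sketched on a tiny in-memory cube. In this Python example the cube is a dictionary from (quarter, city, item) to dollars sold; the cells, the city-to-country mapping, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical base cuboid: (quarter, city, item) -> dollars_sold.
cube = {
    ("Q1", "Vancouver", "TV"): 100.0,
    ("Q1", "Toronto", "TV"): 80.0,
    ("Q1", "Vancouver", "PC"): 60.0,
    ("Q2", "Vancouver", "TV"): 120.0,
    ("Q2", "Chicago", "PC"): 90.0,
}
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(float)
    for (quarter, city, item), value in cells.items():
        out[(quarter, city_to_country[city], item)] += value
    return dict(out)


def roll_up_drop_item(cells):
    """Roll-up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(float)
    for (quarter, city, _item), value in cells.items():
        out[(quarter, city)] += value
    return dict(out)


def slice_quarter(cells, quarter):
    """Slice: a selection on one dimension (time = quarter)."""
    return {k: v for k, v in cells.items() if k[0] == quarter}


def dice(cells, quarters, cities):
    """Dice: a selection on two dimensions at once."""
    return {k: v for k, v in cells.items() if k[0] in quarters and k[1] in cities}


by_country = roll_up_location(cube)
by_quarter_city = roll_up_drop_item(cube)
q1_only = slice_quarter(cube, "Q1")
vancouver_sub = dice(cube, {"Q1", "Q2"}, {"Vancouver"})
```

Drill-down is the reverse direction: going from `by_country` back to `cube` is only possible because the more detailed cells are still stored.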

Need for Data Warehousing
- Industry has huge amounts of operational data.
- Knowledge workers want to turn this data into useful information.
- They use this information to support strategic decision making.
- A data warehouse is a platform for consolidated historical data for analysis.
- It stores data of good quality so that knowledge workers can make correct decisions.

Need for Data Warehousing (contd.)
From a business perspective:
- It is the latest marketing weapon.
- It helps to keep customers by learning more about their needs.
- It is a valuable tool in today's competitive, fast-evolving world.

Data Warehousing Tools
- Data warehouse
  - SQL Server 2000 DTS
  - Oracle 8i Warehouse Builder
- OLAP tools
  - SQL Server Analysis Services
  - Oracle Express Server
- Reporting tools
  - MS Excel Pivot Chart
  - VB Applications

References
- Building the Data Warehouse, by W. H. Inmon
- Data Mining: Concepts and Techniques, by J. Han and M. Kamber
- www.dwinfocenter.org
- www.datawarehousingonline.com
- www.billinmon.com
