Data Warehouse

The document discusses data warehousing and OLAP technology. It describes the differences between data warehouses and heterogeneous databases as well as operational databases. The document also explains concepts like star schemas, snowflake schemas, and fact constellations that are used in data warehouse design.

Uploaded by

Vignesh Senthil
Copyright © All Rights Reserved

Data Warehousing and OLAP Technology

* 1
Data Warehouse vs. Heterogeneous DBMS

■ Traditional heterogeneous DB integration:
  ■ Build wrappers/mediators on top of heterogeneous databases
  ■ Query-driven approach
    ■ When a query is posed to a client site, a meta-dictionary is
    used to translate the query into queries appropriate for the
    individual heterogeneous sites involved, and the results are
    integrated into a global answer set
    ■ Requires complex information filtering, and queries compete
    for resources at the local sites
■ Data warehouse: update-driven, high performance
  ■ Information from heterogeneous sources is integrated in advance
  and stored in warehouses for direct query and analysis
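
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source "sites", their field names, and the wrapper functions are hypothetical and do not come from the slides.

```python
# Hypothetical sources: two sites that store the same kind of fact
# under different local schemas (assumption for illustration only).
SITE_A = [{"cust": "Ann", "amt_usd": 120.0}, {"cust": "Bob", "amt_usd": 80.0}]
SITE_B = [{"customer_name": "Ann", "amount_cents": 5000}]


def wrap_a(row):
    """Wrapper for site A: map its local schema to the global schema."""
    return {"customer": row["cust"], "amount": row["amt_usd"]}


def wrap_b(row):
    """Wrapper for site B: rename fields and convert cents to dollars."""
    return {"customer": row["customer_name"], "amount": row["amount_cents"] / 100}


def integrated_rows():
    """Translate every source row into the global schema."""
    return [wrap_a(r) for r in SITE_A] + [wrap_b(r) for r in SITE_B]


def mediated_total(customer):
    """Query-driven: every query touches all sources at query time."""
    return sum(r["amount"] for r in integrated_rows() if r["customer"] == customer)


def load_warehouse():
    """Update-driven: integrate once, in advance, and store the result."""
    warehouse = {}
    for r in integrated_rows():
        warehouse[r["customer"]] = warehouse.get(r["customer"], 0.0) + r["amount"]
    return warehouse


WAREHOUSE = load_warehouse()


def warehouse_total(customer):
    """Queries read the pre-integrated store and never touch the sources."""
    return WAREHOUSE.get(customer, 0.0)
```

Both functions return the same answer; the difference is when the integration work is paid for. The mediator repeats the translation on every query and competes with local workloads, while the warehouse pays once at load time.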

* 2
Data Warehouse vs. Operational DBMS
■ OLTP (on-line transaction processing)
■ Major task of traditional relational DBMS
■ Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
■ OLAP (on-line analytical processing)
■ Major task of data warehouse system
■ Data analysis and decision making
■ Distinct features (OLTP vs. OLAP):
■ User and system orientation: customer vs. market
■ Data contents: current, detailed vs. historical, consolidated
■ Database design: ER + application vs. star + subject
■ View: current, local vs. evolutionary, integrated
■ Access patterns: update vs. read-only but complex queries
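
As a rough illustration of the two access patterns, the sketch below uses Python's standard sqlite3 module. The `orders` table and its columns are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, year, amount) VALUES (?, ?, ?)",
    [("Ann", 2022, 100.0), ("Bob", 2022, 50.0), ("Ann", 2023, 70.0)],
)

# OLTP-style access: a short transaction that updates one current record.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (120.0, 1))

# OLAP-style access: a read-only query that consolidates many records.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```

In a real deployment the two workloads would normally run on separate systems; a single in-memory database is used here only to keep the example self-contained.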
* 3
OLTP vs. OLAP

* 4
Why Separate Data Warehouse?
■ High performance for both systems
  ■ DBMS: tuned for OLTP (access methods, indexing,
  concurrency control, recovery)
  ■ Warehouse: tuned for OLAP (complex OLAP queries,
  multidimensional view, consolidation)
■ Different functions and different data:
  ■ Missing data: decision support (DS) requires historical data,
  which operational DBs do not typically maintain
  ■ Data consolidation: DS requires consolidation
  (aggregation, summarization) of data from
  heterogeneous sources
  ■ Data quality: different sources typically use
  inconsistent data representations, codes, and formats,
  which have to be reconciled
* 5
From Tables and Spreadsheets to Data Cubes

■ A data warehouse is based on a multidimensional data model, which
views data in the form of a data cube
■ A data cube, such as sales, allows data to be modeled and viewed
in multiple dimensions
  ■ Dimension tables, such as item (item_name, brand, type) or
  time (day, week, month, quarter, year)
  ■ Fact table contains measures (such as dollars_sold) and keys to
  each of the related dimension tables
■ In data warehousing literature, an n-D base cube is called a base
cuboid. The topmost 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
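
A minimal sketch of how cuboids relate to a fact table, assuming a tiny hypothetical sales table with two dimensions (time and item) and one measure (dollars_sold). The row values and the helper name `cuboid` are illustrative only.

```python
from collections import defaultdict

# Hypothetical fact rows: (time_key, item_key, dollars_sold).
fact_rows = [
    ("Q1", "TV", 400.0),
    ("Q1", "PC", 300.0),
    ("Q2", "TV", 500.0),
    ("Q1", "TV", 100.0),
]


def cuboid(rows, dims):
    """Sum dollars_sold grouped by the dimension positions listed in dims."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in dims)
        totals[key] += row[-1]  # the measure is the last field
    return dict(totals)


base_cuboid = cuboid(fact_rows, (0, 1))  # 2-D base cuboid: (time, item)
time_cuboid = cuboid(fact_rows, (0,))    # 1-D cuboid: time only
apex_cuboid = cuboid(fact_rows, ())      # 0-D apex cuboid: grand total
```

Each cuboid is the same measure aggregated over a different subset of the dimensions, which is why the full set of cuboids forms a lattice.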
* 6
Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location;
item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier;
time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
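
The lattice for the four dimensions on this slide can be enumerated mechanically. The sketch below assumes no concept hierarchies within a dimension, so every subset of the dimensions is exactly one cuboid; the function name `lattice` is an assumption of this example.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")


def lattice(dims):
    """Map each level k to the list of k-D cuboids (subsets of size k)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


levels = lattice(dimensions)
total_cuboids = sum(len(cuboids) for cuboids in levels.values())
```

With n dimensions and no hierarchies there are 2^n cuboids, so the four dimensions here give 16: one apex, four 1-D, six 2-D, four 3-D, and one base cuboid.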
* 7
Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation
* 8
Example of Star Schema
Sales Fact Table (time_key, item_key, branch_key, location_key)
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
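
One way to make the star schema concrete is to build a cut-down version with Python's standard sqlite3 module. The table names (`sales_fact`, `time_dim`, `item_dim`, `location_dim`), the reduced column set, and the sample rows are assumptions for illustration, not part of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
    -- The fact table holds one foreign key per dimension plus the measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    """
)
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 2023), (2, "Q2", 2023)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "TV", "Acme"), (2, "PC", "Acme")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "Toronto", "Canada"), (2, "Chicago", "USA")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 2, 800.0), (1, 2, 2, 1, 1200.0),
                  (2, 1, 2, 3, 1200.0), (2, 1, 1, 1, 400.0)])

# A typical star join: the fact table joined to two of its dimensions.
sales_by_country_quarter = conn.execute(
    """
    SELECT l.country, t.quarter, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN time_dim AS t ON s.time_key = t.time_key
    JOIN location_dim AS l ON s.location_key = l.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
    """
).fetchall()
```

Every dimension is reached from the fact table in a single join, which is the defining property of the star layout.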

* 9
Example of Snowflake Schema
Sales Fact Table (time_key, item_key, branch_key, location_key)
  Measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
    supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
    city (city_key, city, province_or_state, country)
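
The step from star to snowflake for the location dimension can be sketched as a small normalization routine. The sample rows and the function name `snowflake_location` are hypothetical, and the surrogate city keys are simply assigned in order of first appearance.

```python
# Denormalized (star-style) location rows; the values are made up.
star_location = [
    {"location_key": 1, "street": "1 Main St", "city": "Vancouver",
     "province_or_state": "BC", "country": "Canada"},
    {"location_key": 2, "street": "9 Oak Ave", "city": "Vancouver",
     "province_or_state": "BC", "country": "Canada"},
    {"location_key": 3, "street": "5 Lake Rd", "city": "Chicago",
     "province_or_state": "IL", "country": "USA"},
]


def snowflake_location(rows):
    """Split location rows into a location table and a city table."""
    city_keys = {}  # (city, province_or_state, country) -> city_key
    city_table, location_table = [], []
    for row in rows:
        attrs = (row["city"], row["province_or_state"], row["country"])
        if attrs not in city_keys:
            city_keys[attrs] = len(city_keys) + 1
            city_table.append({
                "city_key": city_keys[attrs],
                "city": attrs[0],
                "province_or_state": attrs[1],
                "country": attrs[2],
            })
        location_table.append({
            "location_key": row["location_key"],
            "street": row["street"],
            "city_key": city_keys[attrs],
        })
    return location_table, city_table


location_table, city_table = snowflake_location(star_location)
```

The repeated city attributes are stored once in `city_table`, at the cost of an extra join whenever a query needs the country of a location.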

* 10
Example of Fact Constellation
Sales Fact Table (time_key, item_key, branch_key, location_key)
  Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table (time_key, item_key, shipper_key, from_location,
to_location)
  Measures: dollars_cost, units_shipped
Dimension tables:
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)
* 11
Measures: Three Categories
■ distributive: if the result derived by applying the function
to n aggregate values is the same as that derived by
applying the function on all the data without partitioning.
■ E.g., count(), sum(), min(), max().
■ algebraic: if it can be computed by an algebraic function
with M arguments (where M is a bounded integer), each
of which is obtained by applying a distributive aggregate
function.
■ E.g., avg(), min_N(), standard_deviation().
■ holistic: if there is no constant bound on the storage size
needed to describe a subaggregate.
■ E.g., median(), mode(), rank().
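
The three categories can be checked on a small example. The partitioning below is arbitrary and chosen only to make the differences visible.

```python
from statistics import median

# Data split into two partitions, as a cube computation might do.
partitions = [[1, 2, 3], [4, 100]]
all_values = [v for part in partitions for v in part]

# Distributive: applying sum() to the partial sums gives the overall sum.
sum_of_sums = sum(sum(part) for part in partitions)
max_of_maxes = max(max(part) for part in partitions)

# Algebraic: avg() is computed from two distributive results per
# partition, sum() and count(), so M = 2 arguments are enough.
subaggregates = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in subaggregates) / sum(c for _, c in subaggregates)

# Holistic: combining per-partition medians does not give the true
# median; in general the full data is needed.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here the median of the partition medians is 27.0 while the true median is 3, which shows why no bounded-size subaggregate is enough for a holistic measure.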
* 12
Multidimensional Data
■ Sales volume as a function of product, month,
and region
■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths (highest level first):

  Product     Location    Time
  Industry    Region      Year
  Category    Country     Quarter
  Product     City        Month   Week
              Office      Day

[Figure: data cube with axes Product, Region, and Month]
* 13
A Sample Data Cube
[Figure: 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum),
Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico, sum).
The highlighted cell holds the total annual sales of TV in U.S.A.]
* 14
Cuboids Corresponding to the Cube

0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product,date; product,country; date,country
3-D (base) cuboid: product, date, country

* 15
Browsing a Data Cube

■ Visualization
■ OLAP capabilities
■ Interactive manipulation
* 16
Typical OLAP Operations

■ Roll up (drill-up): summarize data
  ■ by climbing up a hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
  ■ from higher-level summary to lower-level summary or detailed
  data, or by introducing new dimensions
■ Slice and dice:
  ■ project and select
■ Pivot (rotate):
  ■ reorient the cube for visualization, e.g., from 3-D to a series
  of 2-D planes
■ Other operations
  ■ drill across: involving (across) more than one fact table
  ■ drill through: through the bottom level of the cube to its
  back-end relational tables (using SQL)
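
The core operations can be sketched on a tiny cube held in a Python dictionary. The cell values, the city-to-country hierarchy, and all function names are assumptions made up for this example; a real OLAP server would implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (city, quarter, item) -> units sold.
cube = {
    ("Toronto", "Q1", "TV"): 10,
    ("Toronto", "Q2", "TV"): 20,
    ("Vancouver", "Q1", "TV"): 5,
    ("Chicago", "Q1", "PC"): 7,
    ("Chicago", "Q2", "TV"): 3,
}
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}


def roll_up_location(cells):
    """Roll up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter, item), units in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def roll_up_drop_item(cells):
    """Roll up by dimension reduction: aggregate the item dimension away."""
    out = defaultdict(int)
    for (location, quarter, _item), units in cells.items():
        out[(location, quarter)] += units
    return dict(out)


def slice_quarter(cells, quarter):
    """Slice: select one value on one dimension, leaving a 2-D table."""
    return {(loc, item): units
            for (loc, q, item), units in cells.items() if q == quarter}


def dice(cells, locations, quarters):
    """Dice: select a sub-cube by restricting two dimensions."""
    return {key: units for key, units in cells.items()
            if key[0] in locations and key[1] in quarters}


def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): units for (a, b), units in table.items()}


by_country = roll_up_location(cube)
q1_slice = slice_quarter(cube, "Q1")
```

Drill-down is not shown as a function because it is the reverse of roll-up: it needs the finer-grained cells (here, `cube` itself) rather than a computation on the summary.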
* 17
A Star-Net Query Model
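[Figure: star-net query model. Radial lines extend from a central point,
one per dimension, with the abstraction levels marked along each line:]
■ Customer Orders: CONTRACTS, ORDER
■ Shipping Method: AIR-EXPRESS, TRUCK
■ Customer
■ Time: ANNUALLY, QTRLY, DAILY
■ Product: PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP
■ Location: CITY, COUNTRY, REGION
■ Organization: SALES PERSON, DISTRICT, DIVISION
■ Promotion
Each circle (abstraction level) is called a footprint
* 18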
Design of a Data Warehouse: A Business Analysis Framework
■ Four views regarding the design of a data warehouse
  ■ Top-down view
    ■ allows selection of the relevant information necessary for the
    data warehouse
  ■ Data source view
    ■ exposes the information being captured, stored, and
    managed by operational systems
  ■ Data warehouse view
    ■ consists of fact tables and dimension tables
  ■ Business query view
    ■ sees the data in the warehouse from the viewpoint of the
    end user
* 19
Data Warehouse Design Process

■ Top-down approach, bottom-up approach, or a combination of both
  ■ Top-down: starts with overall design and planning (mature)
  ■ Bottom-up: starts with experiments and prototypes (rapid)
■ From a software engineering point of view
  ■ Waterfall: structured and systematic analysis at each step before
  proceeding to the next
  ■ Spiral: rapid generation of increasingly functional systems, with
  short turnaround time
■ Typical data warehouse design process
  ■ Choose a business process to model, e.g., orders, invoices, etc.
  ■ Choose the grain (atomic level of data) of the business process
  ■ Choose the dimensions that will apply to each fact table record
  ■ Choose the measure that will populate each fact table record
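
The four design choices can be made concrete with a small sketch. Everything here is an assumption for illustration: the business process is "orders", the grain is one fact record per day, item, and branch, and the `OrderLine` record stands in for whatever the operational system actually stores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderLine:
    """Hypothetical operational record for the 'orders' business process."""
    order_id: int
    day: str
    item: str
    branch: str
    units: int
    dollars: float


# Grain: one fact record per (day, item, branch). These are also the
# dimensions that apply to each fact table record.
GRAIN = ("day", "item", "branch")


def build_fact_table(lines):
    """Aggregate order lines to the chosen grain; measures are summed."""
    facts = defaultdict(lambda: {"units_sold": 0, "dollars_sold": 0.0})
    for line in lines:
        key = tuple(getattr(line, dim) for dim in GRAIN)
        facts[key]["units_sold"] += line.units
        facts[key]["dollars_sold"] += line.dollars
    return dict(facts)


order_lines = [
    OrderLine(1, "2024-01-01", "TV", "B1", 1, 400.0),
    OrderLine(2, "2024-01-01", "TV", "B1", 2, 800.0),
    OrderLine(3, "2024-01-02", "TV", "B1", 1, 400.0),
]
fact_table = build_fact_table(order_lines)
```

Choosing a coarser grain (for example, dropping `branch`) shrinks the fact table but makes branch-level analysis impossible later, which is why the grain is fixed before the dimensions and measures.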

* 20
Multi-Tiered Architecture
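[Figure: multi-tiered architecture, left to right:]
■ Data Sources: operational DBs and other sources
■ Extract, Transform, Load, Refresh (with Monitor & Integrator)
■ Data Storage: data warehouse and data marts, with metadata
■ OLAP Engine: OLAP server, which serves the stored data to the front end
■ Front-End Tools: analysis, query, reports, data mining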
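
The extract, transform, and load path from the data sources into the warehouse can be sketched as three small functions. The source rows, the `COUNTRY_CODES` mapping, and the cleaning rules are hypothetical; they only illustrate reconciling inconsistent codes and formats before loading.

```python
# Hypothetical mapping that reconciles inconsistent country codes.
COUNTRY_CODES = {
    "us": "USA", "u.s.a.": "USA", "usa": "USA",
    "ca": "Canada", "canada": "Canada",
}


def extract(source):
    """Extract: copy rows out of a source without modifying the source."""
    return [dict(row) for row in source]


def transform(rows):
    """Transform: reconcile country codes and amount formats."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "country": COUNTRY_CODES[row["country"].strip().lower()],
            "dollars_sold": float(str(row["amount"]).replace(",", "")),
        })
    return cleaned


def load(warehouse, rows):
    """Load: append the cleaned rows to the warehouse store."""
    warehouse.extend(rows)
    return warehouse


operational_db = [
    {"country": "US", "amount": "1,200.50"},
    {"country": " Canada", "amount": 300},
]
other_source = [{"country": "U.S.A.", "amount": "99.50"}]

warehouse = []
for source in (operational_db, other_source):
    load(warehouse, transform(extract(source)))
```

A refresh step would rerun the same pipeline on rows that changed since the last load; that part is omitted here.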


* 21
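Data Warehouse Development: A Recommended Approach
[Figure: development flow, bottom to top:]
■ Define a high-level corporate data model
■ Model refinement leads to data marts on one path and to the
enterprise data warehouse on the other
■ Data marts lead to distributed data marts
■ Distributed data marts and the enterprise data warehouse lead to
the multi-tier data warehouse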

* 22
OLAP Server Architectures
■ Relational OLAP (ROLAP)
  ■ Use a relational or extended-relational DBMS to store and manage
  warehouse data, and OLAP middleware to support missing pieces
  ■ Include optimization of the DBMS back end, implementation of
  aggregation navigation logic, and additional tools and services
  ■ Greater scalability
■ Multidimensional OLAP (MOLAP)
  ■ Array-based multidimensional storage engine (sparse matrix
  techniques)
  ■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP)
  ■ User flexibility, e.g., low level: relational, high level: array
■ Specialized SQL servers
  ■ Specialized support for SQL queries over star/snowflake schemas
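
The storage difference between ROLAP and MOLAP can be sketched with plain Python structures. This is a toy model under stated assumptions: the row data and axis labels are made up, and real MOLAP engines use far more elaborate sparse-array techniques than the dictionary shown here.

```python
# ROLAP-style storage: one relational row per non-empty cell.
rolap_rows = [("TV", "Q1", 10), ("TV", "Q3", 4), ("PC", "Q2", 7)]

# MOLAP-style storage: cells addressed by array position on each axis.
PRODUCTS = ["TV", "PC", "VCR"]
QUARTERS = ["Q1", "Q2", "Q3", "Q4"]


def to_dense_array(rows):
    """Dense array: every cell is allocated, including empty ones."""
    array = [[0] * len(QUARTERS) for _ in PRODUCTS]
    for product, quarter, units in rows:
        array[PRODUCTS.index(product)][QUARTERS.index(quarter)] = units
    return array


def to_sparse_array(rows):
    """Sparse array: only non-empty cells are stored, keyed by position."""
    return {(PRODUCTS.index(p), QUARTERS.index(q)): units
            for p, q, units in rows}


dense = to_dense_array(rolap_rows)
sparse = to_sparse_array(rolap_rows)
```

The dense array gives direct positional lookup (`dense[0][2]`) but allocates 12 cells for 3 values; the sparse form keeps positional addressing while storing only the occupied cells.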
* 23
