DWH Concepts

This document provides an introduction to data warehousing. It defines a data warehouse as a subject-oriented, integrated, nonvolatile, and time-variant collection of data to support management's decisions. The document discusses the evolution of data warehousing and the need for data warehousing. It contrasts online transaction processing systems with data warehouse applications and also contrasts data marts with data warehouses. The document provides an overview of data warehouse architecture and data modeling concepts.

Uploaded by

er_sgoel6903

An Introduction to Data Warehousing

Objectives

At the end of this session, you will know:

– What is Data Warehousing
– The evolution of Data Warehousing
– Need for Data Warehousing
– OLTP Vs Warehouse Applications
– Data marts Vs Data Warehouses
– Operational Data Stores
– Overview of Warehouse Architecture (Inmon, Kimball)
– Data Modeling (Normalization)
– Dimension Modeling, Star schemas, SCD
What is a Data Warehouse?

"A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions."
- W. H. Inmon

W. H. Inmon is regarded as the father of data warehousing.

Subject-Oriented - Characteristics of a Data Warehouse

[Figure: Operational systems compared with the Data Warehouse. Labels shown: Leads, Prospects, Customers, Products, Quotes, Regions, Time, Orders.]

Focus is on Subject Areas rather than Applications

Integrated - Characteristics of a Data Warehouse

Encoding
  Appl A - m,f
  Appl B - 1,0
  Appl C - male,female
  Data Warehouse - m,f

Attribute measurement
  Appl A - balance dec fixed (13,2)
  Appl B - balance pic 9(9)V99
  Appl C - balance pic S9(7)V99 comp-3
  Data Warehouse - balance dec fixed (13,2)

Naming
  Appl A - bal-on-hand
  Appl B - current-balance
  Appl C - cash-on-hand
  Data Warehouse - Current balance

Date format
  Appl A - date (julian)
  Appl B - date (yymmdd)
  Appl C - date (absolute)
  Data Warehouse - date (julian)

Integrated View Is The Essence Of A Data Warehouse
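
The kind of integration shown on this slide can be sketched in a few lines of Python. This is only an illustration: the `integrate` function, the dictionary names, and the assumption that Appl B uses 1 for male and 0 for female are hypothetical and not part of the original deck.

```python
from decimal import Decimal

# Hypothetical per-application mappings onto the warehouse encoding (m, f).
GENDER_MAP = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},  # assumption: 1 = male, 0 = female
    "C": {"male": "m", "female": "f"},
}

# Name of the field that holds the balance in each source application.
BALANCE_FIELD = {"A": "bal-on-hand", "B": "current-balance", "C": "cash-on-hand"}


def integrate(app, record):
    """Convert one source record into the warehouse's common representation."""
    raw_balance = Decimal(str(record[BALANCE_FIELD[app]]))
    return {
        "gender": GENDER_MAP[app][str(record["gender"]).lower()],
        # Two decimal places, mirroring the dec fixed (13,2) target format.
        "current_balance": raw_balance.quantize(Decimal("0.01")),
    }


rows = [
    integrate("A", {"gender": "m", "bal-on-hand": "100.50"}),
    integrate("B", {"gender": 0, "current-balance": 2500}),
    integrate("C", {"gender": "Female", "cash-on-hand": "12.30"}),
]
for row in rows:
    print(row)
```

Whatever the source application, every row leaves `integrate` with the same field names and the same encodings, which is the point of the integrated view.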


Non-volatile - Characteristics of a Data Warehouse

Operational: insert, delete, replace, change
Data Warehouse: load, read-only access

Data Warehouse Is Relatively Static In Nature

Time Variant - Characteristics of a Data Warehouse

Operational
• Current value data
• Time horizon: 60-90 days

Data Warehouse
• Snapshot data
• Time horizon: 5-10 years
• Data warehouse stores historical data

Data Warehouse Typically Spans Across Time

Alternate Definitions

"A collection of integrated, subject oriented databases designed to support the DSS function, where each unit of data is relevant to some moment of time"
- Imhoff

"Data Warehouse is a repository of data summarized or aggregated in simplified form from operational systems. End user orientated data access and reporting tools let users get at the data for decision support"
- Babcock
Evolution of Data Warehousing
1960 - 1985 : MIS Era

• Unfriendly
• Slow
• Dependent on IS programmers
• Inflexible
• Analysis limited to defined reports
Focus on Reporting
Evolution of Data Warehousing

1985 - 1990 : Querying Era

• Ad hoc, unstructured access to corporate data (ad hoc queries are those formulated by the user on the spur of the moment)
• SQL as an interface is not scalable
• Cannot handle complex analysis

Focus on Online Querying


Evolution of Data Warehousing
1990 - 20xx : Analysis Era

• Trend Analysis
• What If ?
• Cross Dimensional Comparisons
• Statistical profiles
• Automated pattern and rule discovery

Focus on Online Analysis


Typical Business Queries

• Which product generated maximum revenue over the last two quarters in a chosen geographical region, city-wise, relative to the previous version of the product, compared with the plan?
• What percentage of customers procure product A with B in a chosen region, broken down by city, season, and income group?


OLTP Systems Vs Data Warehouse

Remember: between OLTP and Data Warehouse systems,
• users are different
• data content is different
• data structures are different
• hardware is different

Understanding The Differences Is The Key
OLTP Vs Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing
OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
OLTP Vs Warehouse

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness
Data Marts
• Enterprise-wide data warehousing projects have a very long cycle time
• Getting consensus between multiple parties may also be difficult
• Departments may not be satisfied with the priority accorded to them
• Sometimes individual departmental needs may be strong enough to warrant a local implementation
• Application/database distribution is also an important factor
Data Marts

Subject or Application Oriented Business View of the Warehouse
» Finance, Manufacturing, Sales etc.
» Smaller amount of data used for Analytic Processing
» Address a single business process

A Logical Subset of The Complete Data Warehouse


Data Warehouse and Data Mart

Scope
  Data Warehouse: Application neutral; centralized, shared; cross LOB/enterprise
  Data Marts: Specific application requirement; LOB, department; business process oriented

Data Perspective
  Data Warehouse: Historical detailed data; some summary
  Data Marts: Detailed (some history); summarized

Subjects
  Data Warehouse: Multiple subject areas
  Data Marts: Single partial subject; multiple partial subjects; OLTP snapshots
Data Warehouse and Data Mart

Data Sources
  Data Warehouse: Many; operational/external data
  Data Marts: Few; operational, external data; OLTP snapshots

Implementation Time Frame
  Data Warehouse: 9-18 months for first stage; multiple stage implementation
  Data Marts: 4-12 months

Characteristics
  Data Warehouse: Flexible, extensible; durable/strategic; data orientation
  Data Marts: Restrictive, non-extensible; short life/tactical; project orientation
Warehouse or Mart First?

Data Warehouse First | Data Mart First
Expensive | Relatively cheap
Large development cycle | Delivered in < 6 months
Change management is difficult | Easy to manage change
Difficult to obtain continuous corporate support | Can lead to independent and incompatible marts
Technical challenges in building large databases | Cleansing, transformation, modeling techniques may be incompatible
Different kinds of Information Needs

• Current: Is this medicine available in stock?
• Recent: What are the tests this patient has completed so far?
• Historical: Has the incidence of Tuberculosis increased in the last 5 years in the Southern region?
Operational Data Store - Definition

A subject oriented, integrated, volatile, current valued data store containing only corporate detailed data

• Example request: Can I see the credit report from Accounts, Sales from marketing and the open order report from order entry for this customer?
• Data from multiple sources is integrated for a subject
• Identical queries may give different results at different times
• Supports analysis requiring current data
• Data is stored only for the current period. Old data is either archived or moved to the Data Warehouse
OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Audience | Operating personnel | Analysts | Managers and analysts
Data access | Individual records, transaction driven | Individual records, transaction or analysis driven | Set of records, analysis driven
Data content | Current, real-time | Current and near-current | Historical
Data granularity | Detailed | Detailed and lightly summarized | Summarized and derived
Data organization | Functional | Subject-oriented | Subject-oriented
Data quality | All application specific detailed data needed to support a business activity | All integrated data needed to support a business activity | Data relevant to management information needs
OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic
OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Development methodology | Requirements driven, structured | Data driven, somewhat evolutionary | Data driven, evolutionary
Operational priorities | Performance and availability | Availability | Access flexibility and end user autonomy
Philosophy | Support day-to-day operation | Support day-to-day decisions & operational activities | Support managing the enterprise
Predictability | Stable | Mostly stable, some unpredictability | Unpredictable
Response time | Sub-second | Seconds to minutes | Seconds to minutes
Return set | Small amount of data | Small to medium amount of data | Small to large amount of data
Typical Data Warehouse Architecture

[Figure: Multi-tiered Data Warehouse without ODS. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse; Data Marts; Metadata; Middleware/API; front-end tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]


Typical Data Warehouse Architecture

[Figure: Multi-tiered Data Warehouse with ODS. Components shown: Operational Systems/Data; a Data Preparation stage (Select, Extract, Transform, Integrate, Maintain) with the ODS; a second Data Preparation stage (Select, Extract, Transform, Load) with the Data Warehouse; Data Marts; Metadata at each stage.]


Warehouse Architecture - 1

[Figure: Enterprise Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse; Metadata; Middleware/API; front-end tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]


Warehouse Architecture - 2

[Figure: Single Department Data Mart. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); three Data Marts, each with its own Metadata; Middleware/API; front-end tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]


Warehouse Architecture - 3

[Figure: Multi-tiered Data Warehouse. Components shown: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Operational Data Store; Data Warehouse; Data Marts; Metadata; Middleware/API; front-end tools (EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining).]


Data Warehouse Architectures

There are three schools of thought about DW architectures:
– The first supports Dimensional Modeling all through (Ralph Kimball)
– The second supports ER for the Data Warehouse and Star Schemas for Data Marts
– The third supports the ER model for the DW
Kimball’s View

[Figure: Multiple Data Marts With Conformed Dimensions. Components shown: Operational Systems, Staging Area, Presentation Server, LAN.]

Data Warehouse Server Processes
• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance

Notes on the figure
• Each Star is a Data Mart and has both summary and detail data
• DW is the sum total of all Data Marts
• DW Bus using Conformed Dimensions


Inmon’s View

[Figure: Data Warehouse (ER) Feeding Multiple Data Marts (Star Schema). Components shown: Operational Systems, Staging Area, Data Warehouse, Data Marts, LAN.]

Data Warehouse Server Processes
• Extract
• Scrubbing
• Transformation
• Load Jobs
• Aggregation Jobs
• Replication
• Monitoring
• Management
• Meta Data Repository
• Meta Data Population
• Meta Data Maintenance

Notes on the figure
• Data Warehouse: Detail Data in ER format
• Data Marts: Summarized Data in Star formats


Introduction to Data Modeling

Data Modeling

• Requires a knowledge of the SDLC.
• Based on analysis of raw data: create the logical design of the database using entities, attributes and relationships.
STEPS in DATA MODELING
Problem & scope definition

Requirement Gathering

Analysis

Logical Database Design

Deciding Database

Physical Database design

Schema Generation
Normalization

• First Normal Form
• Second Normal Form
• Third Normal Form
• Fourth Normal Form
• Fifth Normal Form
Modeling Techniques
• Entity-Relationship Modeling
  – Traditional modeling technique
  – Technique of choice for OLTP
  – Suited for corporate data warehouse
• Dimensional Modeling
  – Analyzing business measures in the specific business context
  – Helps visualize very abstract business questions
  – End users can easily understand and navigate the data structure
Entity-Relationship Modeling - Basic Concepts

• Entity
  – Object that can be observed and classified by its properties and characteristics
  – Business definition with a clear boundary
  – Characterized by a noun
  – Examples:
    • Product
    • Employee
Entity-Relationship Modeling - Basic Concepts

• Relationship
  – Relationship between entities - structural interaction and association
  – Described by a verb
  – Cardinality
    • 1-1
    • 1-M
    • M-M
  – Example: Books belong to Printed Media
Dimensional Modeling - Basic Concepts

Dimensional modeling - The star (join) schema

• The center of the star is the fact table
• The fact table contains the actual transactions
• The points of the star are the dimension tables
• A dimension table contains data about information objects or time
• Fact and dimension tables are joined through the multi-part key in the fact table
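
A minimal sketch of such a star join, using Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_city`) and the sample rows are invented for illustration and are not taken from the deck.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE dim_city    (city_key    INTEGER PRIMARY KEY, region TEXT);
-- The fact table's multi-part key is built from foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    city_key    INTEGER REFERENCES dim_city(city_key),
    amount      REAL,
    PRIMARY KEY (product_key, city_key)
);
INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Zenith');
INSERT INTO dim_city    VALUES (10, 'South'), (20, 'North');
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 25.0);
""")

# Join the fact table to each dimension and aggregate the measure.
sales_by_region_and_brand = con.execute("""
    SELECT c.region, p.brand, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_city c    ON c.city_key = f.city_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY c.region, p.brand
    ORDER BY c.region, p.brand
""").fetchall()
print(sales_by_region_and_brand)
```

The query touches the large fact table once and reaches each dimension with a single join, which is why the star layout suits ad hoc analytical queries.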
Dimensional Modeling - Basic Concepts
• Fact
  – Consists of measures
  – Contains a unique compound key made up of foreign keys to dimension tables
  – Typically represents a business transaction, or an event that can be used in analyzing a business process
  – Not every combination of foreign keys need exist
  – Contains a very large number of rows - hundreds of millions and up
Dimensional Modeling - Basic Concepts

• Dimension
  – Collection of members or units of the same type of views
  – Determines the contextual background for facts
  – Parameters for OLAP
  – Examples:
    • Time
    • Location/Region
    • Customers
Dimensional Modeling - Basic Concepts
• Measures
  – A numeric attribute of a fact
  – Represents performance or behavior of the business relative to the dimensions
  – The actual numbers are called variables
  – Examples:
    • Quantity supplied
    • Transaction amount
    • Sales volume
Dimensional Data Model

• Star Schema
  – Fact Tables
  – Dimension Tables
• Snowflake Schema
• Coverage Tables
• Factless Tables
Star Join Schema Design

• Schema designed to process large, complex, ad hoc and data-intensive queries
• No concern for concurrency, locking and insert/update/delete performance
• There are two kinds of tables in a Star Schema
  – Fact Tables
  – Dimension Tables
An Order Processing ER Model

[Figure: ER diagram. Entities shown: City, Sales District, Sales Region, Sales Country, Salesrep table, Order Header, Order Details, Customer Table, Item Table, Product Brand, Product Category, linked by foreign keys (FK).]
Star Schema

Fact table: SALES
  Keys: CITY, PRODUCT, PERIOD, CUSTOMER
  Measures: AMOUNT, UNITS

Dimensions
  CITY: DISTRICT, STATE, REGION
  PRODUCT: BRAND, CATEGORY, COLOR, SIZE
  PERIOD: DAY, MONTH, QUARTER, YEAR
  CUSTOMER: ADDRESS, CATEGORY, CONTACT
Fact Table & Dimension Tables

Fact Tables: Numerical measurements of the business are stored in Fact Tables.
Dimension Tables: Dimensions are attributes about facts.
Fact Tables

• Facts should be continuously valued and additive
• By nature fact tables are sparse
• Usually very large - billions of records
  – Fact tables contain the actual transactions or values being analyzed. They have a composite primary key in which each attribute is a foreign key to a dimension table.
  – (NOTE: fact tables must be in first, second and third normal form.)
Dimensional Tables
• Attributes should be textual and discrete
• Occupy very little space compared to Fact Tables
• Some common dimensions are:
  – Customer
  – Geography
  – Time
  – Products
  – (Basically, they contain the detailed information describing the entries in the fact tables.)
BUS ARCHITECTURE

• Put part of the system in place first and add to it as and when requirements arise.
• Used to transfer information from Data Mart to System
Conformed Dimensions
• A dimension that means the same thing with every possible fact table that it can be joined with
• Conformed dimensions are most essential for:
  – The Bus Architecture
  – The integrated function of the Data Warehouse
• Some common dimensions are:
  – Customer
  – Product
  – Location
  – Time
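
As a rough illustration of why a conformed dimension matters, the sketch below compares two fact tables through one shared product dimension. The tables `fact_sales` and `fact_shipments` and their contents are hypothetical and only serve the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One product dimension shared (conformed) across both fact tables.
CREATE TABLE dim_product    (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE fact_shipments (product_key INTEGER, units_shipped INTEGER);
INSERT INTO dim_product    VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales     VALUES (1, 5), (1, 7), (2, 3);
INSERT INTO fact_shipments VALUES (1, 10), (2, 4), (2, 1);
""")

# Aggregate each fact table separately, then line the results up on the shared dimension.
sold_vs_shipped = con.execute("""
    SELECT p.product_name, s.sold, h.shipped
    FROM dim_product p
    JOIN (SELECT product_key, SUM(units_sold) AS sold
          FROM fact_sales GROUP BY product_key) s
      ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(units_shipped) AS shipped
          FROM fact_shipments GROUP BY product_key) h
      ON h.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()
print(sold_vs_shipped)
```

Because both fact tables use the same product keys with the same meaning, their totals can be placed side by side without any translation step.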
Surrogate Keys
• No table (fact or dimension) should use production keys; all should use Data Warehouse generated surrogate keys
  – Production keys sometimes get reused
  – In case of mergers/acquisitions, surrogate keys protect you from different key formats
  – Production systems may change to generalize their key definitions
  – Using surrogate keys will be faster
  – Surrogate keys can handle Slowly Changing Dimensions well
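
One way to picture surrogate key assignment is a lookup that hands out warehouse-owned integers per source system and production key. The `SurrogateKeyGenerator` class and the sample keys below are hypothetical, and a real warehouse would normally persist this mapping in a table rather than in memory.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out warehouse-owned integer keys for (source system, production key) pairs."""

    def __init__(self):
        self._next = count(1)
        self._assigned = {}

    def key_for(self, source, production_key):
        natural = (source, production_key)
        if natural not in self._assigned:
            self._assigned[natural] = next(self._next)
        return self._assigned[natural]


keys = SurrogateKeyGenerator()
a = keys.key_for("crm", "C-001")
b = keys.key_for("billing", "000123")  # different key format, same integer key space
c = keys.key_for("crm", "C-001")       # a known production key maps to its existing surrogate
print(a, b, c)
```

The warehouse tables then store only the small integer keys, so differences in production key formats stay confined to the lookup.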
Slowly Changing Dimensions
Certain kinds of dimension attribute changes need to be handled differently in the Data Warehouse
• Type I - Overwrite
  – e.g. Name correction, description changes
• Type II - Partition History
  – e.g. Packing change, customer movement
  – Create a new dimension record with a new surrogate key
• Type III - Organizational changes
  – e.g. Sales force reorganization
  – Show sales broken down by new and old organizations
  – Need to create an old and a new field
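
The three types can be sketched on a small in-memory customer dimension. Everything here is an assumption made for illustration: the row layout, the `current` flag, and the function names are not prescribed by the deck.

```python
from itertools import count

_next_key = count(1)  # warehouse-generated surrogate keys


def new_row(customer_id, name, city, sales_org):
    return {"key": next(_next_key), "customer_id": customer_id, "name": name,
            "city": city, "sales_org": sales_org, "old_sales_org": None,
            "current": True}


def type1_overwrite(dim, customer_id, name):
    """Type I: overwrite the attribute in place; no history is kept."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["name"] = name


def type2_new_record(dim, customer_id, city):
    """Type II: retire the current row and add a new row with a new surrogate key."""
    old = next(r for r in dim if r["customer_id"] == customer_id and r["current"])
    old["current"] = False
    row = dict(old, key=next(_next_key), city=city, current=True)
    dim.append(row)
    return row


def type3_old_and_new(dim, customer_id, sales_org):
    """Type III: keep the previous value in an 'old' field on the same row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["old_sales_org"], row["sales_org"] = row["sales_org"], sales_org


dim = [new_row("C-001", "Jon Smith", "Pune", "North")]
type1_overwrite(dim, "C-001", "John Smith")  # name correction
type2_new_record(dim, "C-001", "Delhi")      # customer movement
type3_old_and_new(dim, "C-001", "West")      # sales force reorganization
for row in dim:
    print(row)
```

Facts loaded before the move keep pointing at the first surrogate key, so history stays partitioned by the version of the customer that was valid at the time.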
Factless Fact Tables

For Event Tracking, e.g. attendance

Fact table keys and the dimensions they reference:
  Date_Key - Date Dimension
  Student_Key - Student Dimension
  Course_Key - Course Dimension
  Teacher_Key - Teacher Dimension
  Facility_Key - Facility Dimension
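
A small sketch of the attendance example using `sqlite3`. The table name `fact_attendance` and the sample key values are made up; the point is that the table holds only dimension keys and analysis is done by counting rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One row per attendance event: only dimension keys, no measure columns.
CREATE TABLE fact_attendance (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    teacher_key INTEGER, facility_key INTEGER
);
INSERT INTO fact_attendance VALUES
    (20240101, 1, 100, 7, 3),
    (20240101, 2, 100, 7, 3),
    (20240102, 1, 200, 8, 3);
""")

# With no measures to sum, questions are answered by counting events.
attendance_per_course = con.execute("""
    SELECT course_key, COUNT(*)
    FROM fact_attendance
    GROUP BY course_key
    ORDER BY course_key
""").fetchall()
print(attendance_per_course)
```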
