Data Warehouse Development Approaches
Fundamental Questions
Before deciding to build a data warehouse for your organization, you need to ask the following basic and fundamental questions and address the relevant issues:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first: data warehouse or data mart?
Build a pilot or go with a full-fledged implementation?
Dependent or independent data marts?
Data Warehouse Development Approaches
Data warehouse development approaches
Inmon Model: EDW approach
Kimball Model: Data mart approach
Which model is better?
There is no one-size-fits-all strategy to data warehousing
One alternative is the hosted warehouse
General Data Warehouse Development Approaches
Big bang approach
Incremental approach:
Top-down incremental approach
Bottom-up incremental approach
ISQS 6339, Data Mgmt & BI, Zhangxi Lin
Big Bang Approach
Analyze enterprise requirements
Build enterprise data warehouse
Report in subsets or store in data marts
Incremental Approach to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
[Figure: Increment 1: Strategy, Definition, Analysis, Design, Build, Production (iterative)]
Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load processes for the initial subject area
Create and populate the initial subject area data mart within the overall warehouse framework
Top-Down
The advantages of this approach are:
A truly corporate effort, an enterprise view of data
Inherently architected, not a union of disparate data marts
Single, central storage of data about the content
Centralized rules and control
May see quick results if implemented with iterations
The disadvantages are:
Takes longer to build even with an iterative method
High exposure/risk to failure
Needs high level of cross-functional skills
High outlay without proof of concept
Bottom-Up Approach
Define the scope and coverage of the data warehouse and analyze the source systems within this scope
Define the initial increment based on the political pressure, assumed business benefit and data volume
Implement base technical architecture and establish metadata, extraction, and load processes as required by increment
Create and populate the initial subject areas within the overall warehouse framework
Bottom-Up
The advantages of this approach are:
Faster and easier implementation of manageable pieces
Favorable return on investment and proof of concept
Less risk of failure
Inherently incremental; can schedule important data marts first
Allows project team to learn and grow
The disadvantages are:
Each data mart has its own narrow view of data
Permeates redundant data in every data mart
Perpetuates inconsistent and irreconcilable data
Proliferates unmanageable interfaces
Dimensional Modeling Process
High level dimensional model design
Choosing business model
Declaring the grain
Choosing dimensions
Identifying the facts
Detailed dimensional model development
Dimensional model review and validation: IS, core users, business community
Final design iteration
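The high level design steps above can be written down as a small data structure. This is only a sketch: the `DimensionalModel` class, the retail sales process, and every dimension and fact name are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionalModel:
    """Records the four high level design decisions for one fact table."""
    business_process: str                            # what is being modeled
    grain: str                                       # what one fact row represents
    dimensions: list = field(default_factory=list)   # descriptive context
    facts: list = field(default_factory=list)        # numeric measurements

    def is_complete(self):
        """A design is ready for detailed development once all four are set."""
        return bool(self.business_process and self.grain
                    and self.dimensions and self.facts)


# Hypothetical example: retail sales at order-line grain.
retail_sales = DimensionalModel(
    business_process="retail sales",
    grain="one row per order line",
    dimensions=["date", "product", "store", "customer"],
    facts=["quantity_sold", "sales_amount"],
)

print(retail_sales.is_complete())
```

Declaring the grain before listing dimensions and facts matters because every dimension and fact must make sense at that level of detail.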
Supplemental Slides: Data Warehouse Design Phases
Defining the Business Requirements
The concept of business dimensions is fundamental to the requirements definition for a data warehouse.
Information package
Your primary goal in the requirements definition phase is to compile information packages.
Once you have firmed up the information packages, you'll be able to proceed to the other phases. Essentially, information packages enable you to:
Define the common subject areas
Design key business metrics
Decide how data must be presented
Determine how users will aggregate or roll up
Decide the data quantity for user analysis or query
Decide how data will be accessed
Supplemental Slides: The Others
Snowflake Schema Model
[Figure: snowflaked geography hierarchy: Country, State, County, City]
Advantages:
Direct use by some tools
More flexible to change
Provides for speedier data loading
Disadvantages:
Can become large and unmanageable
Degrades query performance
More complex metadata
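A minimal sketch of why a snowflaked hierarchy costs extra work at query time, assuming the Country, State, County, City levels are each stored as a separate table keyed to its parent. The dictionaries, keys, and rows are hypothetical stand-ins for those tables.

```python
# Hypothetical snowflaked geography dimension: each level is its own table
# and points to its parent by key.
country = {1: {"name": "USA"}}
state = {10: {"name": "Texas", "country_id": 1}}
county = {100: {"name": "Lubbock County", "state_id": 10}}
city = {1000: {"name": "Lubbock", "county_id": 100}}


def resolve_city(city_id):
    """Walk the normalized hierarchy; each hop corresponds to one join."""
    city_row = city[city_id]
    county_row = county[city_row["county_id"]]
    state_row = state[county_row["state_id"]]
    country_row = country[state_row["country_id"]]
    return {
        "city": city_row["name"],
        "county": county_row["name"],
        "state": state_row["name"],
        "country": country_row["name"],
    }


print(resolve_city(1000))
```

Three lookups are needed to reach the country from a city key. In a star schema the same attributes would sit in one denormalized row, which is the trade-off behind the query performance point above.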
Degenerate Dimensions
order_number and order_line in the fact table
For example, you may be looking for the average number of products per order. You then have to relate the products to the order number to calculate the average. Attributes such as order_number and order_line in this example are called degenerate dimensions, and they are kept as attributes of the fact table.
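The calculation described above can be sketched as follows. The fact rows, the `product_key` and `quantity` columns, and the function name are hypothetical; only `order_number` and `order_line` come from the slide.

```python
from collections import defaultdict

# Hypothetical fact table at order-line grain. order_number and order_line
# are degenerate dimensions: they live in the fact row and have no
# dimension table of their own.
fact_sales = [
    {"order_number": 1001, "order_line": 1, "product_key": 11, "quantity": 2},
    {"order_number": 1001, "order_line": 2, "product_key": 12, "quantity": 1},
    {"order_number": 1002, "order_line": 1, "product_key": 11, "quantity": 5},
    {"order_number": 1003, "order_line": 1, "product_key": 13, "quantity": 1},
    {"order_number": 1003, "order_line": 2, "product_key": 14, "quantity": 1},
    {"order_number": 1003, "order_line": 3, "product_key": 11, "quantity": 3},
]


def average_products_per_order(rows):
    """Group fact rows by order_number, then average the distinct products."""
    products_by_order = defaultdict(set)
    for row in rows:
        products_by_order[row["order_number"]].add(row["product_key"])
    total_products = sum(len(p) for p in products_by_order.values())
    return total_products / len(products_by_order)


print(average_products_per_order(fact_sales))  # 2.0
```

Without `order_number` in the fact table there would be no way to tell which order lines belong to the same order.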
Storage and Performance Considerations
Database sizing
Data partitioning
Indexing
Star query optimization
Database Sizing - Test Load Sampling
Analyze a representative sample of the data chosen using proven statistical methods.
Ensure that the sample reflects:
Test loads for different periods
Day-to-day operations
Seasonal data and worst-case scenarios
Indexes and summaries
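One simple way to turn a test load sample into a size estimate is to scale the average sampled row size by the projected row count and add an allowance for indexes and summaries. This formula is an assumption made for illustration, not a method given in the slides, and all figures below are hypothetical.

```python
def estimate_size_bytes(sample_row_sizes, projected_rows, overhead_ratio):
    """Scale the average sampled row size to the projected row count.

    overhead_ratio is a hypothetical allowance for indexes and summary
    tables, expressed as a fraction of the base table size.
    """
    average_row_size = sum(sample_row_sizes) / len(sample_row_sizes)
    return average_row_size * projected_rows * (1 + overhead_ratio)


# Hypothetical samples: row sizes in bytes from two kinds of test load.
regular_day = [100, 120, 140]
seasonal_peak = [200, 220, 240]

regular_estimate = estimate_size_bytes(regular_day, 1_000_000, 0.5)
peak_estimate = estimate_size_bytes(seasonal_peak, 1_000_000, 0.5)

# Size for the worst case so that seasonal loads still fit.
print(max(regular_estimate, peak_estimate))
```

Sampling several periods matters because an estimate based only on day-to-day operations can understate what seasonal data requires.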
Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Types of data partitioning:
Horizontal partitioning
Vertical partitioning
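The two partitioning types can be sketched on an in-memory table. The rows, column names, and helper functions are hypothetical; a real warehouse would use the partitioning features of its database rather than application code.

```python
from collections import defaultdict

# Hypothetical sales rows.
sales = [
    {"sale_id": 1, "year": 2022, "amount": 10.0, "notes": "gift wrap"},
    {"sale_id": 2, "year": 2022, "amount": 20.0, "notes": ""},
    {"sale_id": 3, "year": 2023, "amount": 30.0, "notes": "rush order"},
]


def horizontal_partition(rows, column):
    """Split by rows: every partition keeps all columns, but only some rows."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def vertical_partition(rows, key, columns):
    """Split by columns: both parts keep the key so they can be rejoined."""
    selected = [{key: r[key], **{c: r[c] for c in columns}} for r in rows]
    remainder = [
        {k: v for k, v in r.items() if k == key or k not in columns}
        for r in rows
    ]
    return selected, remainder


by_year = horizontal_partition(sales, "year")
hot, cold = vertical_partition(sales, "sale_id", ["year", "amount"])
print(sorted(by_year), hot[0], cold[0])
```

Here the horizontal split would let one year be loaded or archived on its own, while the vertical split keeps frequently read columns apart from rarely read ones.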
Indexing
Indexing is used for the following reasons:
It is a huge cost saving, greatly improving performance and scalability.
It can replace a full table scan by a quick read of the index followed by a read of only those disk blocks that contain the rows needed.
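The switch from a full table scan to an index read can be observed with SQLite's `EXPLAIN QUERY PLAN`, used here through Python's standard `sqlite3` module. SQLite is only a convenient stand-in for a warehouse database, and the table, index, and data are hypothetical. The exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(connection, sql, params=()):
    """Return the human-readable detail column of the query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


sql = "SELECT amount FROM sales WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))   # expected to be a table scan
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_after = query_plan(conn, sql, (42,))    # expected to use the index

print(plan_before)
print(plan_after)
```

The second plan should name `idx_sales_customer`, showing that the engine reads the index and then only the matching rows.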
Parallelism
[Figure: Sales table and Customers table, partitions P1, P2, P3, Parallel Execution Servers]
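The idea of assigning each partition to its own execution server can be sketched with a worker pool. This is an illustration of the structure only: the partition contents are hypothetical, and Python threads stand in for database parallel execution servers without implying the same performance characteristics.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales amounts, already split into partitions P1 to P3.
sales_partitions = {
    "P1": [100.0, 250.0],
    "P2": [75.0, 25.0],
    "P3": [300.0],
}


def partition_total(amounts):
    """Work done by one execution server on one partition."""
    return sum(amounts)


def parallel_total(partitions, max_workers=3):
    """Aggregate each partition independently, then combine the partials."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partition_total, partitions.values()))
    return sum(partials)


print(parallel_total(sales_partitions))  # 750.0
```

The approach works because each partition can be processed independently, leaving only the small partial results to combine at the end.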
Using Summary Data
Designing summary tables offers the following benefits:
Provides fast access to precomputed data
Reduces use of I/O, CPU, and memory
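A minimal sketch of a summary table, assuming detail rows of (month, region, amount). The data and function names are hypothetical. The summary is built once, and later queries read the precomputed totals instead of rescanning the detail rows.

```python
from collections import defaultdict

# Hypothetical detail-level sales rows: (month, region, amount).
detail_sales = [
    ("2024-01", "East", 100.0),
    ("2024-01", "East", 50.0),
    ("2024-01", "West", 70.0),
    ("2024-02", "East", 30.0),
]


def build_summary(rows):
    """Precompute total sales per (month, region); run once after each load."""
    totals = defaultdict(float)
    for month, region, amount in rows:
        totals[(month, region)] += amount
    return dict(totals)


def monthly_region_sales(summary, month, region):
    """Answer a query with a single lookup in the summary table."""
    return summary.get((month, region), 0.0)


summary_sales = build_summary(detail_sales)
print(monthly_region_sales(summary_sales, "2024-01", "East"))  # 150.0
```

The cost is that the summary must be rebuilt or refreshed whenever the detail data changes.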