Data Modeling
Data Warehouse Defined
A data warehouse is a collection of corporate
information, derived directly from
operational systems and some external data
sources. Its specific purpose is to support
business decisions, not business operations.
Characteristics of a DW
Subject-oriented Data
collects all data for a subject, from different sources
Read-only Requests
loaded during off-hours, read-only during day hours
Interactive Features, ad-hoc query
flexible design to handle spontaneous user queries
Pre-aggregated data
to improve runtime performance
Highly denormalized data structures
fat tables with redundant columns
Components of a Data Warehouse
Source Systems
Data Staging Area
  Storage: flat files, RDBMS
  Processing
  No user query services
DWH Servers
  Data Mart 1 (dimensional, conforms to DW bus)
  Data Mart 2
End User Data Access
  Query tools
  Report writers
  Mining tools
STAGING AREA - SOME
CLARITY
Staging Area
optional
to cleanse the source data
Accepts data from different sources
Data model is required at staging area
Multiple data models may be required for
parking different sources and for transformed
data to be pushed out to warehouse
ODS - SOME CLARITY
Operational Data Store
Optional
Granular, detailed level data
May feed warehouse (eg when warehouse is
aggregated)
Usually a relational model
May keep data for a smaller time period than
warehouse
Data Modeling
WHAT IS A DATA MODEL???
A data model is an abstraction of some aspect of the real
world (system).
WHY A DATA MODEL???
Helps to visualise the business
A model is a means of communication.
Models help elicit and document requirements.
Models reduce the cost of change.
Model is the essence of DW architecture based on which
DW will be implemented
What do we want to do with the
data?
Model depends on what kind of data analysis we
want to do:
Different Data Analysis Techniques
Query and reporting
Display Query Results
Multidimensional analysis
Analyse data content by looking at it in different
perspectives
Data mining
discover patterns and clustering attributes in data
Impact of Data Analysis
Techniques on DM
Query and reporting
Normalized data model
Select associated data elements
summarize and group by category
present results
direct table scan
ER with normalized / denormalized appropriate
Query and reporting
Requirements of a Decision
Support Query Environment
To provide a method for testing hypotheses (e.g.
what if ...)
To allow ad-hoc queries
To allow human input (DSS makes decisions
with users )
Expects user knowledge of problem
To simulate the behaviour of a real-world
problem
Impact of Data Analysis
Techniques on DM
Multidimensional analysis
Fast and easy access to data
Any number of analysis dimensions in any
combinations
ER will mean many joins
Dimensional model appropriate
Multidimensional Analysis
Data Mining
Data Mining
discovers unusual patterns
requires low level of detail data
A look at different warehouse
architectures
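[Figure: warehouse architecture diagram. Labels: Operational Data, External Data, Detailed Information, Summary Information, Meta Data, Warehouse Manager, OLAP, and two vertical manager labels that are only partly legible (one ends in "RY MANAGER")]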
Data Warehouse Architecture - 2
Data Warehouse Architecture - 3
Data Warehouse Architecture - 4
DW Architectures
Architecture Choices depend on
Current infrastructure
Business environment
Desired management and control structure
resources
commitment ..
Data Warehouse/data mart
DW Architectures
Architecture Choices determine
Where will DW reside?
Centrally / locally / distributed
Where will it be managed from?
Centrally / independently
3 choices
Global
Independent
Interconnected
or a combination of the three
DW Architectures
Global Architecture
related to scope of data access and storage
does not mean centralized
can be physically centralized or distributed
enterprise view of data
time-consuming & costly to implement
Global Architecture
DW Architectures
Independent Architecture
stand-alone
controlled by a department
minimal integration
no global view
very fast to implement
DW Architectures
Interconnected Architecture
distributed
integrated and interconnected
gives a global view of enterprise
more complexity
who manages / controls data
another tier in architecture to share common data
between multiple data marts
have a data sharing schema across data marts
Independent and Interconnected
Architecture
Types of Data Warehouse
Enterprise Data Warehouse
Data Mart
Enterprise
Data Warehouse
Datamart
Datamart
Datamart
Enterprise data warehouse
Contains data drawn from multiple operational
systems
Supports time-series and trend analysis across
different business areas
Can be used as a transient storage area to clean
all data and ensure consistency
Can be used to populate data marts
Can be used for everyday and strategic
decision making
Data Mart
Logical subset of enterprise data warehouse
Organized around a single business process
Based on granular data
May or may not contain aggregates
Object of analytical processing by the end user.
Less expensive and much smaller than a full
blown corporate data warehouse.
Distributed and Centralized
Data warehouses
DW sitting on a monolithic machine unrealistic
Separate machines, different OS, different DB
systems - reality
Solution
Share a uniform architecture to allow them to
be fused coherently
Classical Architectures
Physical data warehouse (physical)
Data warehouse --> data marts
Data marts --> data warehouse
Parallel data warehouse and data marts
Physical data warehouse:
Data warehouse --> data marts
[Figure labels: source data (operational data, external data), staging area, data warehouse, data marts]
Physical data warehouse:
Data marts --> data warehouse
[Figure labels: source data (operational data, external data), staging area, data marts, data warehouse]
Physical data warehouse:
Parallel data warehouse and data marts
[Figure labels: source data (operational data, external data), staging area, data warehouse, data marts]
DW Implementation Approaches
Top Down
Bottom-up
Combination of both
Choices depend on:
current infrastructure
resources
architecture
ROI
Implementation speed
Top Down Implementation
Bottom Up Implementation
DW Implementation Approaches
Top Down
More planning and design
initially
Involve people from
different work-groups,
departments
Data marts may be built
later from Global DW
Overall data model to be
decided up-front
Bottom Up
Can plan initially without
waiting for global
infrastructure
built incrementally
can be built before or in
parallel with Global DW
Less complexity in design
DW Implementation Approaches
Top Down
Consistent data definition
and enforcement of
business rules across
enterprise
High cost, lengthy
process, time consuming
Works well when there is
centralized IS department
responsible for all H/W
and resources
Bottom Up
Data redundancy and
inconsistency between
data marts may occur
Integration requires great
planning
Less cost of H/W and
other resources
Faster pay-back
DW Implementation Approaches
Combined Approach
Determine degree of planning and design for a global
approach to integrate data marts being built by bottom-up
approach
Develop base level infrastructure definition for global DW
at business level
Develop plan to handle data elements needed by multiple
data marts
Build a common data store to be used by data marts and
global DW
Levels of modeling
Business
Process
Conceptual
Logical
Model
Physical
Model
Levels of modeling
Conceptual modeling
Describe data requirements from a business
point of view without technical details
Logical modeling
Refine conceptual models
Data structure oriented, platform independent
Physical modeling
Detailed specification of what is physically
implemented using specific technology
Conceptual Model
A conceptual model shows data through
business eyes.
All entities which have business meaning.
Important relationships
Few significant attributes in the entities.
Few identifiers or candidate keys.
Sample conceptual model
[Figure: entities shown are Products, Customer Invoices, Customers, Sales Reps, Customer Addresses, Geographic Boundaries]
Logical Model
Replaces many-to-many relationships with
associative entities.
Defines a full population of entity attributes.
May use non-physical entities for domains
and sub-types.
Establishes entity identifiers.
Has no specifics for any RDBMS or
configuration.
Sample logical model
CUSTOMER INVOICE
#INVOICE ID
#LINE ITEM SEQ
.INVOICE DATE
PRODUCT
#PRODUCT CODE
.PRODUCT DESCRIPTION
CUSTOMER ADDRESS
#CUSTOMER ID
#ADDRESS ID
CUSTOMER
#CUSTOMER ID
#SNAPSHOT DATE
.CUSTOMER NAME
SALES REP
#SALES REP ID
GEOGRAPHIC BOUNDARY
#GEO CODE
Relationship labels in the diagram: the bill for, purchased by, sold by, the bill sent to, purchased at, located within, managed by, sold to by, the salesman for, the sales manager for, the general location of
Physical Model
A Physical data model may include
Referential Integrity
Indexes
Views
Alternate keys and other constraints
Tablespaces and physical storage objects.
PRODUCTS
#PRODUCT_CODE
PRODUCT_DESCRIPTION
CATEGORY_CODE
CATEGORY_DESCRIPTION
SALES_REPS
#SALES_REP_ID
LAST_NAME
FIRST_NAME
oMANAGER_FIRST_NAME
oMANAGER_LAST_NAME
CUSTOMER_INVOICES
#INVOICE_ID
#LINE_ITEM_SEQ
INVOICE_DATE
CUSTOMER_ID
BILL_TO_ADDRESS_ID
SALES_REP_ID
MANAGER_REP_ID
ORGANIZATION_ID
ORG_ADDRESS_ID
PRODUCT_CODE
QUANTITY
UNIT_PRICE
AMOUNT
oPRODUCT_COST
LOAD_DATE
CUSTOMERS
#CUSTOMER_ID
#SNAPSHOT_DATE
CUSTOMER_NAME
oAGE
oMARITAL_STATUS
CREDIT_RATING
CUSTOMER_ADDRESSES
#CUSTOMER_ID
#ADDRESS_ID
ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
SALES_REP_ID
GEO_CODE
LOAD_DATE
GEOGRAPHIC_BOUNDARIES
#GEO_CODE
CITY_NAME
STATE_NAME
COUNTRY_NAME
oCITY_ABBRV
oSTATE_ABBRV
oCOUNTRY_ABBRV
Sample Physical Model
Data Architecting
What is data architecting???
Structure and locate data according to its
characteristics
3 Basic types of data
Real time data
Derived data
Reconciled data
Data Architecting-Real time
data
Represents current status of business
Used by operational systems to run business
Changes as operational transactions are processed
Very detailed (low level of granularity, i.e. little summarization)
Data Architecting - Real time
data
To use Real time data in DW:
Must be
Cleansed (comes from different sources; cleansed to ensure
data consistency and quality)
Summarized (because it contains individual,
transactional, detailed data)
Transformed
into an easily understandable format for manipulation
by analysts
Eg. Different units of measure, currency, exchange rates
Data Architecting - Derived data
Data created by summarizing, aggregating,
averaging real-time data through some process
represents a view of business data at a specific
time
Historical record of business over a period
Precalculate derived data elements and summarize
detailed data to improve query processing
Data Architecting - Reconciled
data
Real-time data cleansed, adjusted, enhanced to
provide an integrated source of data for analysis
Create and maintain historical data while reconciling
Normally not explicitly defined
Logical result of derivation operations
May be stored as temporary files used to transform
operational data for consistency
Enterprise Data Model (EDM)
Consistent definition of data elements common to
a business
High-level business view
Generic logical data model
Physical data design
EDM - The Phased Enterprise
Data Model
Enterprise Data Model (EDM)
Phases
Increasing order of Information required
Information Planning
Business Analysing
Logical Data Modeling
Physical Data design
Enterprise Data Model (EDM)
Information Planning
Consolidated view of the business
Identify some business concepts (20-30)
called subject areas / super entities / business entities, in which
the organization is interested (for example, product)
Purpose
To set up scope and architecture of DW
To provide a single comprehensive point of view
Enterprise Data Model (EDM)
Business Analyzing
Define contents of primary business concepts.
Gather and arrange business requirements
Defines business terms
Purpose
To set up scope and architecture of DW
To provide a single comprehensive point of view
Enterprise Data Model (EDM)
Logical Data Modeling
Enterprise-wide in scope
consists of several entities, relationships, attributes
complete model in 3rd Normal Form.
Can be divided into 2 types:
Generic logical data model (enterprise level)
Logical application model (application level)
Enterprise Data Model (EDM)
Physical Data Design
space
performance
physical distribution of data
Purpose:
To design for the physical implementation
Enterprise Data Model (EDM)
Is it possible to draw an EDM ???
Not always!!
Phased approach OR a simple EDM
list of subject areas (<25)
define business relationships between subject areas
define contents of each subject area
Granularity
Level of summarization of data elements
Level of detail available in the data
The more detail, the lower the level of granularity
Why is it important in DW???
Opportunity for TRADE-OFF
performance
vs. volume of data stored
ability to access detailed data vs. cost of storage
Granularity
Granularity
To overcome trade-offs between data volume and
query capability :
Divide the data in the DW
Create 2 levels of granularity of data
Detailed Raw data
keep it on separate storage medium
load when required
Summarized data
Data Partitioning Model
WHY?
To understand, maintain and navigate a DW
TYPES of Partitioning
Logical and Physical
Data Partitioning Model
Logical Partitioning - WHY?
Goals:
Data Partitioning Model Logical Partitioning
Partition large volumes of data by splitting
Helps to make data easier to:
Restructure
Index
Sequentially scan
Reorganize
Recover
Monitor
Data Partitioning Model Logical Partitioning
Logical Partition - HOW??
Criteria
Time period (date, month, or quarter)
almost always chosen
Geography (location)
Product (more generically, by line of business)
Organizational unit
A combination of the above
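The time-period criterion can be sketched as follows. This is a minimal illustration, assuming rows carry an ISO-formatted `invoice_date` string; the row layout and the function name `partition_by_month` are hypothetical and not part of the original slides.

```python
from collections import defaultdict

# Hypothetical invoice rows; only invoice_date drives the partitioning.
rows = [
    {"invoice_date": "2024-01-15", "amount": 10.0},
    {"invoice_date": "2024-01-31", "amount": 5.0},
    {"invoice_date": "2024-02-01", "amount": 7.5},
]

def partition_by_month(rows):
    """Group rows into logical partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["invoice_date"][:7]].append(row)
    return dict(partitions)

partitions = partition_by_month(rows)
for month, members in sorted(partitions.items()):
    print(month, len(members))
```

Each partition can then be indexed, scanned, reorganized or recovered on its own, which is the benefit the slides list for logical partitioning.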
Data Partitioning Model Logical Partitioning
Data Partitioning Model -Subject
Areas
Subject areas classified by the topics of interest to
the business.
5W1H rule
when, where, who, what, why, and how
eg. who could be customer, employee, manager, supplier,
business partner, competitor.
Get a candidate list of subject areas
Decompose,rearrange, select, redefine in more
detail
Data Partitioning Model -Subject
Areas
Define the business relationships among
subject areas
This will determine the dimensions used
Subject Areas help define criteria like:
Unit of the data model
Unit of an implementation project
Unit of management of the data
Basis for the integration of multiple implementations
unit for analysis should be business process
Data Modeling - Techniques
What needs to be modeled during
a data warehouse project
STAGING AREA
YES ! (maybe multiple data models are
required)
ODS
YES !
DATAWAREHOUSE/DATAMART
YES!
Data Modeling - Techniques
Modeling techniques
E-R Modeling
Dimensional Modeling
Implementation and modeling
styles
Modeling versus implementation
Modeling: describe what should be built to
non-technical folks
Implementation: describe what is actually built
to technical folks
Implementation and modeling
styles (Contd )
Relational modeling
Use for implementation
Difficult to understand by non-technical folks
Dimensional modeling
Use for modeling during analysis and design
phases
Can be implemented using other modeling
styles e.g. object-oriented, relational
E-R Modeling
Produces a data model, using two basic
concepts entities and the relationships
between those entities.
Detailed ER models also contain attributes,
which can be properties of either the entities
or the relationships.
Conventions used in E-R
modeling
Entities (e.g. EMPLOYEE)
Attributes (e.g. EmpName, Address)
Relationships or Associations (e.g. Belongs To)
Entities
Principal data objects about which information
is to be collected.
Usually recognizable concepts such as person,
things, or events.
Examples : EMPLOYEES, PROJECTS,
INVOICES.
Attributes & Relationships
Attributes describe the entity with which they
are associated.
A relationship represents an association
between two or more entities. An example :
Employees are assigned to projects
Departments manage one or more projects.
Types of Data Relationships Cardinality
One - One
1: 1
One - Many
1: m
Many - Many
m:n
Recursive data relationship
Normalization
Remove data redundancy
0 NF - contains repeating values
1 NF - No repeating values
2 NF - Every non-key attribute is dependent on the
whole key, not on just part of it
3 NF - No non-key attribute is functionally
dependent on another non-key attribute (every attribute depends
on the key, the whole key and nothing but the key)
Denormalization - carefully introduced redundancy to
improve query performance
Normalization - 1NF
Eliminate Repeating groups
Person   Skills
A        Oracle, DB2
B        MS Access, Oracle
C        Oracle, CICS, SQL
D        DB2, CICS
Who are the ones who have DB2 skills???
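A minimal sketch of why the repeating group is awkward and how 1NF helps, using the Person/Skills rows from the slide. The variable and function names (`person_skill`, `people_with`) are illustrative assumptions.

```python
# Un-normalized rows from the slide: one Skills cell holds a repeating group.
unnormalized = {
    "A": "Oracle, DB2",
    "B": "MS Access, Oracle",
    "C": "Oracle, CICS, SQL",
    "D": "DB2, CICS",
}

# 1NF: one (person, skill) pair per row, no repeating values in a cell.
person_skill = [
    (person, skill.strip())
    for person, skills in unnormalized.items()
    for skill in skills.split(",")
]

def people_with(skill):
    """Return the people holding the given skill, in sorted order."""
    return sorted(p for p, s in person_skill if s == skill)

db2_people = people_with("DB2")
print(db2_people)
```

Once every skill sits in its own row, the DB2 question becomes a simple equality filter instead of a substring search inside a text cell.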
Normalization - 2NF
Eliminate Redundant data
Skill ID Skill Description
S1 DB2
S2 Oracle
S3 MS Access
S4 CICS
S5 SQL
Normalization - 3NF
Eliminate Columns Not Dependent On Key
Memb ID   Skill ID
A         S1

Comp ID   Comp Name   Location
D1        Core Tech   HYD
Relational modeling
Represents business entities, data items
associated with each entity, and the
relationships of business interest among the
entities
Entities are usually broken down into
smallest possible units and combined using
relationships
Diagram looks like a spiderweb
Entity Completeness Checklist
Name
to describe the data contained
to meet naming conventions/standards
Description
to describe precisely what the entity represents
required for sharing and reuse of data model
components
Category
classifies entities sharing common characteristics
Entity Completeness Checklist
(contd.)
Category Types
Fundamental entities(represents basic or core
concepts)
Associative/intersecting entities(to associate
entities to reconcile m-m relation)
Attributive entities(to describe or categorize other
entity)
Subtype entities(to represent a subset of
occurrences of parent entity)
Entity Completeness Checklist
(contd.)
Abbreviations
document the abbreviation and full definition
Acronyms
avoid (not understood by all, not unique)
if used, document them
Current Number of occurrences
to estimate entity statistics for all entity
categories
Entity Completeness Checklist
(contd.)
Authority
Metadata authority(to approve change of entities,
attributes etc.)
Data authority(to change occurrences of entity)
Primary Key/Foreign Key/Non-key attribute
names
Relationships to other entities
no entity stands by itself
Homonyms
Same or similar in sound or spelling as another
BUT DIFFERENT IN MEANING!!
Create CONFUSION!
IDENTIFY AND ELIMINATE them for
entities and attributes!!
Synonyms
Same meaning ...
Same logical concept ...
Assigned different names!!
Introduce redundancy in model!
IDENTIFY AND RESOLVE them - for entities
and attributes!!
Synonyms (contd.)
Compare Definition, Relationships to other entities, Key
structure, attributes, domain values
Attribute Completeness
Checklist
Name
to uniquely identify the attribute
to meet naming conventions/standards
Description
to describe precisely what the attribute represents
Type
refers to how the attribute is used in the datamodel
Completeness Checklist (contd.)
Key attributes
primary keys in the entity in which they are defined
(primary / foreign keys in other entities in which they occur)
implemented with a unique index
Non-Key attributes
contain the bulk of the information
need not be unique
candidate keys not selected as primary keys
secondary keys may be selected as access paths
implemented using non-unique index
Completeness Checklist (contd.)
Domain
set of permitted values for the attribute
Domain elements
General Domain
describes the manner in which data is represented(data type)
alphanumeric, real, integer, boolean, sound, digital video etc.
Specific Domain
Enumerated domain
specific set of values that are valid and allowed
static values (eg. Flat type : 2 bed, 3 bed, duplex etc)
Completeness Checklist (contd.)
Abbreviations
document the abbreviation and full definition
Acronyms
avoid (not understood by all, not unique)
if used, document them
Key use
applies only to primary keys
will serve as primary or foreign key in child entity
Source
whether attribute is primitive or derived
Completeness Checklist (contd.)
If derived, establish the formula
document formula
formula should identify any other attributes required to
generate value for derived attribute
Traceability
why is the attribute there
refer to source (paragraph, citation of statement, physical
data structure element ...)
mapped to metadata object that is maintained as part of
system lifecycle (eg. Critical success factor, objective,
physical system element like file, table
Derived Attributes
Created by accumulating values of multiple
instances of attributes. Eg.
Aggregation/summarization
Library Branch: Branch id, Total Titles
Branch Holding: Branch id, book id, number of copies
Total Titles = count of (Branch Holding) where
(Branch Holding) Branch id = (Library) Branch id
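The Total Titles rule above can be sketched as a count over Branch Holding rows. The sample rows and identifiers such as `B1` and `bk-001` are made-up illustration data, not taken from the slides.

```python
from collections import Counter

# Hypothetical Branch Holding rows: (branch_id, book_id, number_of_copies).
branch_holding = [
    ("B1", "bk-001", 3),
    ("B1", "bk-002", 1),
    ("B2", "bk-001", 2),
]

# Total Titles = count of Branch Holding rows whose branch id matches.
total_titles = Counter(branch_id for branch_id, _book, _copies in branch_holding)

# Derived attribute attached to each Library Branch; B3 holds nothing.
library_branch = [
    {"branch_id": b, "total_titles": total_titles[b]} for b in ("B1", "B2", "B3")
]
print(library_branch)
```

Note that the count is over titles (rows), not over `number_of_copies`; summing copies would be a different derived attribute.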
Calculated Attributes
Describes a feature of a single instance of entity
Calculated from another single instance of related attribute
Attribute Metadata
TASK
Task id
Task Start date
Task End Date
Task Duration
Calculation formula for task duration:
Task duration = task end date - task start date
Derivation Dependencies :
1. Task start date and Task duration
2. Task end date and Task duration
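As a small sketch of the formula and its derivation dependencies, the functions below show that any two of the three task attributes determine the third. The function names and the sample dates are assumptions chosen for illustration.

```python
from datetime import date, timedelta

def task_duration(start: date, end: date) -> timedelta:
    """Task duration = task end date - task start date."""
    return end - start

def task_end(start: date, duration: timedelta) -> date:
    """Dependency 1: start date and duration give the end date."""
    return start + duration

def task_start(end: date, duration: timedelta) -> date:
    """Dependency 2: end date and duration give the start date."""
    return end - duration

start, end = date(2024, 1, 10), date(2024, 1, 25)
duration = task_duration(start, end)
print(duration.days)
```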
Calculated Attributes - contd..
Should Data model contain derived attributes??
YES !!
represent information that management actually wants
users have an opportunity to specify business rules
provide an opportunity to validate that all necessary base
data is captured
design is made easier as requirements are already
mapped
In DSS environment - ESSENTIAL
NEVER use derived attributes as PRIMARY keys
Derived attributes - An example
ORDER
Order # (PK)
order date
TIME PERIOD
Period Start Date (PK)
Period End Date (PK)
Period Reference Name
PRODUCT
Product # (PK)
Product Name
Product Price
PRODUCT ORDER
Product # (PK)
Order # (PK)
Total units sold
Total sales price
PRODUCT PERIOD
Product # (PK)
Period Start Date (PK)
Period End Date (PK)
Total product period sales
Attribute Names
Unique name representing its business meaning
clear, concise, self-explanatory
minimize use of special characters
length > 50 gives flexibility
limitations of 32, 33 exist in some CASE tools
standard documented abbreviations made
SHOULD NOT
replace or contradict definition of attribute
contain abbreviations not approved by authority
Attribute Names
SHOULD NOT CONTAIN
possessive forms (individual's birth date)
articles (a, an, the)
conjunctions (and, but)
verbs (person owns property)
prepositions (at, by, under, for, of ..)
plural words (product names..)
names of organizations, forms, screens, reports
eg. Block 61 title (refers to a specific field on a form)
Attribute Description
Builds on and is consistent with attribute name
unambiguous, clear, economically worded
stand alone (not dependent on another attribute
definition to convey meaning. BEWARE of circular
attribute definitions)
Never MISS giving a description
AVOID:
restating the name of attribute and/or characteristics (eg.
Length, data type, domain values)
using technical jargon
limiting description to direct extract from dictionary
Some attribute descriptions
Need improvement
Location name - the name of a location
order line total quantity - a six-digit integer total
directional indicator - E, W, S, N, NE
Pretty Good
Safety level quantity - The calculated minimum quantity of a
product SKU that must be on hand to reduce risk of out-of-stock
conditions
operating quantity - The calculated, demand-driven quantity of a
material item that must be maintained and replenished for use in
day-to-day operations
Primary Key Attributes
Stable (not to change in value, cannot be null)
Minimal (in number of attributes.. Large composite
keys not advisable)
Factless(should not contain intelligent groupings of
data)
Definitive(value always exists for every occurrence)
Primary Key Attributes
Candidate Keys (Possible primary keys)
One among them is chosen as Primary key
The others are alternate keys
eg. Candidate keys for a U.S. Citizen are:
driving license #
passport #
SS #
None of them are definitive
Fingerprint ID Is DEFINITIVE
Primary Key Attributes - Surrogate Keys
Use artificial key/surrogate key/pseudokey/system-generated key to ensure uniqueness
when:
no attribute possesses all PK characteristics
candidate keys are large and complex
ALWAYS USE IN DW Data Model
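One way to picture surrogate key assignment is a lookup that hands out meaningless integers per production (source) key. This is only a sketch under assumptions: the class `SurrogateKeyGenerator`, its method `key_for`, and the source-system names are hypothetical.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Assigns a system-generated integer key to each production key."""

    def __init__(self):
        self._next = count(1)      # meaningless, monotonically increasing integers
        self._by_natural = {}      # (source system, production key) -> surrogate

    def key_for(self, source_system, production_key):
        natural = (source_system, production_key)
        if natural not in self._by_natural:
            self._by_natural[natural] = next(self._next)
        return self._by_natural[natural]

keys = SurrogateKeyGenerator()
k1 = keys.key_for("billing", "CUST-0042")
k2 = keys.key_for("crm", "CUST-0042")      # same code, different source system
k3 = keys.key_for("billing", "CUST-0042")  # repeat lookup returns the same key
print(k1, k2, k3)
```

Because the surrogate carries no business meaning, the warehouse is insulated when a source system reuses or changes its own keys.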
Relationships- Checklist
Name & Description - Optional
Type (identifying/non-identifying)
Cardinality (Degree/Nature)
one-to-one 1:1
many-to-one m:1
one-to many 1:m
many-to-many m:m(resolved using associative entities)
Deletion Integrity Rules
(cascade/disassociate/disallow)
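The three deletion rules can be tried out with SQLite through Python's standard `sqlite3` module. Mapping cascade, disassociate and disallow to `ON DELETE CASCADE`, `ON DELETE SET NULL` and `ON DELETE RESTRICT` is an interpretation made for this sketch, and the tables (`department`, `project`, `employee`, `budget`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
-- cascade: children are deleted with the parent
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES department(dept_id) ON DELETE CASCADE);
-- disassociate: children survive, the link is nulled
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES department(dept_id) ON DELETE SET NULL);
-- disallow: the parent cannot be deleted while children exist
CREATE TABLE budget (
    budget_id INTEGER PRIMARY KEY,
    dept_id INTEGER NOT NULL REFERENCES department(dept_id) ON DELETE RESTRICT);
INSERT INTO department VALUES (1), (2);
INSERT INTO project VALUES (10, 1);
INSERT INTO employee VALUES (100, 1);
INSERT INTO budget VALUES (1000, 2);
""")

conn.execute("DELETE FROM department WHERE dept_id = 1")
projects_left = conn.execute("SELECT COUNT(*) FROM project").fetchone()[0]
employee_dept = conn.execute(
    "SELECT dept_id FROM employee WHERE emp_id = 100").fetchone()[0]

try:
    conn.execute("DELETE FROM department WHERE dept_id = 2")
    disallowed = False
except sqlite3.IntegrityError:
    disallowed = True

print(projects_left, employee_dept, disallowed)
```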
Limitations of E-R Modeling
Poor Performance
Tend to be very complex and difficult to
navigate.
Dimensional Modeling
Dimensional modeling uses three basic
concepts : measures, facts, dimensions.
Is powerful in representing the requirements
of the business user in the context of
database tables.
Focuses on numeric data, such as values,
counts, weights, balances and occurrences.
Dimensional modeling
Must identify
Business process to be supported
Grain (level of detail)
Dimensions
Facts
Conventions used in Dimensional
modeling
Facts
Measures(Variables)
Dimensions
Dimension members
Dimension hierarchies
Facts
A fact is a collection of related data items,
consisting of measures and context data.
Each fact typically represents a business
item, a business transaction, or an event that
can be used in analyzing the business or
business process.
Facts are measured, continuously valued,
rapidly changing information. Can be
calculated and/or derived.
Fact Table
A table that is used to store business
information (measures) that can be used in
mathematical equations.
Quantities
Percentages
Prices
Dimensions
A dimension is a collection of members or
units of the same type of views.
Dimensions determine the contextual
background for the facts.
Dimensions represent the way business
people talk about the data resulting from a
business process, e.g., who, what, when,
where, why, how
Dimension Table
Table used to store qualitative data about
fact records
Who
What
When
Where
Why
Dimension data should be
verbose, descriptive
complete
no misspellings, impossible values
indexed
equally available
documented ( metadata to explain origin,
interpretation of each attribute)
Dimensional model
visualise a dimensional model as a CUBE
(hypercube because dimensions can be more than
3 in number)
Operations for OLAP
Drill Down :Higher level of detail
Roll Up: summarized level of data
(The navigation path is determined by hierarchies within dimensions.)
Slice: cuts through the cube so that users can focus on specific
perspectives
Dice: rotates the cube to another perspective (change the
dimension)
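A rough sketch of roll up, drill down and slice on a tiny in-memory cube follows. The cube contents, the month-to-quarter hierarchy helper `quarter_of`, and the function names are illustrative assumptions, not a real OLAP engine.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (month, region, product) holding a sales measure.
cube = {
    ("2024-01", "East", "P1"): 10,
    ("2024-02", "East", "P1"): 20,
    ("2024-01", "West", "P1"): 5,
    ("2024-04", "East", "P2"): 7,
}

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn' (the time hierarchy month -> quarter)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(cells):
    """Summarize month-level cells to quarter level along the time hierarchy."""
    out = defaultdict(int)
    for (month, region, product), value in cells.items():
        out[(quarter_of(month), region, product)] += value
    return dict(out)

def drill_down(cells, quarter):
    """Return the month-level cells that make up one quarter."""
    return {k: v for k, v in cells.items() if quarter_of(k[0]) == quarter}

def slice_region(cells, region):
    """Fix one member of the region dimension, leaving a (month, product) view."""
    return {(m, p): v for (m, r, p), v in cells.items() if r == region}

by_quarter = roll_up(cube)
q1_detail = drill_down(cube, "2024-Q1")
east = slice_region(cube, "East")
print(by_quarter)
```

The navigation path (month to quarter) comes from the hierarchy inside the time dimension, which matches the note on hierarchies above.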
Drill down . Roll up
Slice and Dice
Dimensions
Collection of members or units of the same type of
views.
determine the contextual background for the facts.
the parameters over which we want to perform
OLAP (Eg. Time, Location/region, Customers)
Member: a distinct name used to determine a data item's
position (eg. Time - Month, quarter)
Hierarchy: arranges members into levels
Hierarchies
Allow for the rollup of data to more
summarized levels.
Time
day
month
quarter
year
Hierarchies
Aggregates
Aggregate Tables are pre-stored
summarized tables created at a higher
level of granularity across any or all of the
dimensions.
If the existing granularity is Day wise sales,
then creating a separate month wise sales
table is an example of Aggregate Table.
Aggregates
The use of such aggregates is the single
most effective tool the data warehouse
designer has to improve query performance.
Usage of Aggregates can increase the
performance of Queries by several times.
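The day-wise to month-wise example can be sketched as a pre-computed roll-up. The sample rows and the name `monthly_sales` are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical day-grain fact rows: (sale_date, product, amount).
daily_sales = [
    ("2024-01-30", "P1", 10.0),
    ("2024-01-31", "P1", 5.0),
    ("2024-02-01", "P1", 7.0),
    ("2024-02-01", "P2", 4.0),
]

# Build the aggregate once, at month grain, so queries need not rescan daily rows.
monthly_sales = defaultdict(float)
for sale_date, product, amount in daily_sales:
    monthly_sales[(sale_date[:7], product)] += amount

for key in sorted(monthly_sales):
    print(key, monthly_sales[key])
```

A month-level query then reads three pre-summed rows here instead of all daily rows; the saving grows with the number of detail rows per month.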
Measures
A measure is a numeric attribute of a fact,
representing the performance or behaviour of the
business relative to dimensions.
The actual numbers are called variables.
Eg. sales in money, sales volume, quantity supplied, supply cost,
transaction amount
A measure is determined by combinations of the
members of the dimensions and is located on
facts.
THE CUBE
Types of Facts
Additive
Able to add the facts along all the dimensions
Discrete numerical measures eg. Retail sales in $
Semi Additive
Snapshot, taken at a point in time
Measures of Intensity
Not additive along time dimension eg. Account
balance, Inventory balance
Added and divided by number of time period to get
a time-average
Types of Facts
Non Additive
Numeric measures that cannot be added across any
dimensions
Intensity measure averaged across all dimensions eg.
Room temperature
Textual facts - AVOID THEM
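To make the semi-additive case concrete, the sketch below adds account balances across accounts at one point in time but averages them over time. The account names and balance figures are invented for the example.

```python
# Hypothetical end-of-day balance snapshots for three consecutive days.
balances = {
    "acct-1": [100.0, 120.0, 110.0],
    "acct-2": [50.0, 50.0, 80.0],
}

# Valid: add balances across accounts for a single day (here, day 3).
total_day3 = sum(days[2] for days in balances.values())

# Not valid: adding one account's balances across days.
# Instead, add and divide by the number of periods to get a time-average.
time_average = {acct: sum(days) / len(days) for acct, days in balances.items()}

print(total_day3, time_average)
```

A fully additive fact such as retail sales in $ would not need the second step: it can simply be summed along every dimension, including time.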
Advantages of Dimensional
Modeling
Allows complex multi-dimensional data
structure to be defined with a very simple data
model.
Reduces number of physical joins the query
has to process
Simplifies the view of data model.
Allows DWH to expand and evolve with
relatively low maintenance.
Sample business process versus
dimension table
Dimensions (columns): Products, Customers, Location, Sales Rep, Date
Business processes (rows): Product Sales, Product Manufacturing, Employee Compensation
Sample measure versus
dimension table
Dimensions (columns): Products, Customers, Location, Sales Rep, Date
Measures (rows): Product Sales ($), Product Manufacturing (units), Sales Commission ($), Payroll (gross) ($)
TIME PERIOD
Invoice date
Fiscal year
Quarter
Month
Week
PRODUCT
Product description
Category code
Category description
SALES REP
Last name
First name
CUSTOMER REP SALES
Customer snapshot date
Invoice date
Gross sales
Quantity
Product cost
CUSTOMERS
Customer name
ADDRESS
Address line 1
Address line 2
City name
State abbreviation
Postal code
Country name
CUSTOMER DEMOGRAPHICS
Snapshot date
Credit rating
Marital status
Age
Sample Logical Model
for Dimensional Data Mart
PRODUCT_SNAPSHOTS
#PRODUCT_CODE
#SNAPSHOT_DATE
. MSRP
. UOM
. PRIMARY_SUPPLIER_NAME
. SUPPLIER_CITY_NAME
. SUPPLIER_STATE_ABBRV
. SUPPLIER_COUNTRY_NAME
PRODUCTS
#PRODUCT_CODE
. PRODUCT_DESCRIPTION
. CATEGORY_CODE
. CATEGORY_DESCRIPTION
SALES_REPS
# SALES_REP_ID
. LAST_NAME
. FIRST_NAME
oMANAGER_FIRST_NAME
oMANAGER_LAST_NAME
CUSTOMER_INVOICES
#INVOICE_ID
#LINE_ITEM_SEQ
. INVOICE_DATE
. CUSTOMER_DATE
. BILL_TO_ADDRESS_ID
. SALES_REP_ID
. MANAGER_REP_ID
. ORGANIZATION_ID
. ORG_ADDRESS_ID
. PRODUCT_CODE
. QUANTITY
. UNIT_PRICE
. AMOUNT
oPRODUCT_COST
. LOAD_DATE
CUSTOMER_ADDRESSES
#CUSTOMER_ID
#ADDRESS_ID
. ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
. SALES_REP_ID
. GEO_CODE
. LOAD_DATE
PURCHASE_INVOICES
# INVOICE_ID
#LINE_ITEM_SEQ
. INVOICE_DATE
. SUPPLIER_ID
. ADDRESS_ID
. BUDGET_ID
. REVISION_SEQ
. BUDGET_LINE_ITEM_SEQ
. PRODUCT_CODE
. QUANTITY
. UNIT_PRICE
. AMOUNT
. LOAD_DATE
CUSTOMERS
#CUSTOMER_ID
#SNAPSHOT_DATE
. CUSTOMER_NAME
oAGE
oMARITAL_STATUS
. CREDIT_RATING
BUDGET_DETAILS
#BUDGET_ID
#REVISION_SEQ
#LINE_ITEM_SEQ
. BLI_TYPE_CODE
. BLI_TYPE_DESCRIPTION
. ORGANIZATION_ID
. ADDRESS_ID
. BUDGET_PERIOD
. LOAD_DATE
. BUDGET_AMOUNT
. EXPENDITURES
oPRODUCT_CODE
SUPPLIER_ADDRESSES
#SUPPLIER_ID
#ADDRESS_ID
. SUPPLIER_NAME
oPOSTAL_CODE
. GEO_CODE
. LOAD_DATE
GEOGRAPHIC_BOUNDARIES
#GEO_CODE
. CITY_NAME
. STATE_NAME
. COUNTRY_NAME
oCITY_ABBRV
oSTATE_ABBRV
oCOUNTRY_ABBRV
Sample Physical Model for Data Warehouse
INTERNAL_ORG_ADDRESSES
#ORGANIZATION_ID
#ADDRESS_ID
. ORG_TYPE
. ORGANIZATION_NAME
. ADDRESS_LINE1
oADDRESS_LINE2
oPOSTAL_CODE
. GEO_CODE
oPARENT_ORG_ID
. LOAD_DATE
Common structures for datamarts:
Denormalize!
Star
Single fact table surrounded by denormalised
dimension tables
The fact table primary key is the composite of the
foreign keys (primary keys of dimension tables)
Fact table contains transaction type information.
Many star schemas in a data mart
Easily understood by end users, more disk storage
required
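A star schema along these lines can be sketched with SQLite via Python's standard `sqlite3` module. The table names (`dim_product`, `dim_date`, `fact_sales`), columns and sample rows are assumptions made for this illustration; the fact table's primary key is the composite of its dimension foreign keys, as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimensions: category text is repeated on each product row.
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_description TEXT,
    category_description TEXT);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    invoice_date TEXT,
    fiscal_year INTEGER,
    month INTEGER);
-- Single fact table; its key is the composite of the dimension keys.
CREATE TABLE fact_sales (
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    date_key INTEGER NOT NULL REFERENCES dim_date(date_key),
    quantity INTEGER,
    gross_sales REAL,
    PRIMARY KEY (product_key, date_key));
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES
    (1, '2024-01-15', 2024, 1), (2, '2024-02-15', 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 2, 20.0), (2, 1, 1, 15.0), (3, 2, 4, 40.0), (1, 2, 1, 10.0);
""")

# One join per dimension, then group by descriptive dimension attributes.
rows = conn.execute("""
    SELECT p.category_description, d.fiscal_year, SUM(f.gross_sales)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category_description, d.fiscal_year
    ORDER BY p.category_description
""").fetchall()
print(rows)
```

In a snowflake variant, `category_description` would move to its own table keyed by a category code, costing one more join per query.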
Example of Star- schema
Common structures for datamarts:
Denormalize!
Snowflake
Single fact table surrounded by normalised dimension
tables
Normalizes dimension table to save data storage space.
Used when dimensions become very large
Less intuitive, slower performance due to joins
May want to use both approaches, especially if
supporting multiple end-user tools.
Example of Snowflake Schema
Snowflake - Disadvantages
Normalization of dimensions makes the model
difficult for users to understand
Decreases query performance because it
involves more joins
Dimension tables are normally smaller than
fact tables - space may not be a major issue
to warrant snowflaking
Keys ..
Primary Keys
uniquely identify a record
Foreign Keys
primary key of another table referred here
Surrogate Keys
system-generated key for dimensions
key on its own has no meaning
integer key, less space
More Keys ..
Smart Keys
primary key out of various attributes of
dimension
AVOID THEM!
Join to Fact table should be on single surrogate
key
Production Keys
DO NOT USE Production defined attributes
Business may reuse/change them - DW cannot!
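A minimal sketch of surrogate key assignment, assuming an in-memory lookup (the class name and the sample production keys are hypothetical). The generated integer carries no meaning of its own, so the business can change or reuse its production keys without affecting joins to the fact table.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out meaningless integer keys, one per production (source) key."""

    def __init__(self) -> None:
        self._next_key = count(1)   # 1, 2, 3, ...
        self._assigned = {}         # production key -> surrogate key

    def key_for(self, production_key: str) -> int:
        # Reuse the key if this production key was seen before.
        if production_key not in self._assigned:
            self._assigned[production_key] = next(self._next_key)
        return self._assigned[production_key]


keys = SurrogateKeyGenerator()
k1 = keys.key_for("PROD-2024-RED-XL")        # hypothetical smart/production key
k2 = keys.key_for("PROD-2024-BLU-S")
k1_again = keys.key_for("PROD-2024-RED-XL")
print(k1, k2, k1_again)
```

With Type 2 slowly changing dimensions the lookup would hold several surrogate keys per production key; this sketch keeps only one.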
Basic Dimensional Modeling
Techniques
Slowly Changing Dimensions
Rapidly changing Small Dimensions
Large Dimensions
Rapidly changing Large Dimensions
Degenerate Dimensions
Junk Dimensions
Slowly Changing Dimensions
A dimension is considered a Slowly
Changing Dimension when its attributes
remain almost constant over time, requiring
relatively minor alterations to represent the
evolved state.
Slowly Changing Dimension: Options
E.g. the key does not change but the description changes (product
description)
TYPE 1
Overwrite dimension record with new
values
used when old value of attribute has no
significance
Slowly Changing Dimension: Options
TYPE 2
Create a new record using a new value of
surrogate key
used when history can be clearly partitioned
(a query on the new value or on the old value returns only that
version; a query on some other attribute returns all
records)
Slowly Changing Dimension: Options (contd..)
TYPE 3
Create an old field in dimension to store
immediate previous value
used when change is a soft change
no perfect partition in history
may want to track for some time with both the old
and the new value
do not use when there are too many such soft
changes successively
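The three options can be compared side by side in a small sketch. It assumes an in-memory product dimension; the class, function and field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProductRow:
    product_key: int                        # surrogate key
    product_id: str                         # production key from the source
    description: str
    prev_description: Optional[str] = None  # extra "old value" field, Type 3 only


def type1_overwrite(dim: List[ProductRow], product_id: str, new_desc: str) -> None:
    """Type 1: overwrite in place; the old value is lost."""
    for row in dim:
        if row.product_id == product_id:
            row.description = new_desc


def type2_new_row(dim: List[ProductRow], product_id: str, new_desc: str) -> int:
    """Type 2: add a new record under a new surrogate key; old records stay."""
    new_key = max((r.product_key for r in dim), default=0) + 1
    dim.append(ProductRow(new_key, product_id, new_desc))
    return new_key


def type3_keep_previous(dim: List[ProductRow], product_id: str, new_desc: str) -> None:
    """Type 3: keep only the immediately previous value in an extra field."""
    for row in dim:
        if row.product_id == product_id:
            row.prev_description = row.description
            row.description = new_desc


dim1 = [ProductRow(1, "P100", "Widget")]
type1_overwrite(dim1, "P100", "Widget v2")

dim2 = [ProductRow(1, "P100", "Widget")]
new_key = type2_new_row(dim2, "P100", "Widget v2")

dim3 = [ProductRow(1, "P100", "Widget")]
type3_keep_previous(dim3, "P100", "Widget v2")

print(dim1, dim2, dim3, sep="\n")
```

After a Type 2 change, new fact rows carry the new surrogate key while old fact rows keep pointing at the old record, which is what partitions history.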
Slowly Changing Dimension: An Example
Rapidly Changing Small
Dimensions
Eg. Rapid changes to product dimension
Type 2 (use surrogate key and create a new
record)
use effective dates
use only while the dimension table remains
small
Large Dimensions
Dimensions containing several million records!!!
HOW TO SUPPORT???
The database must offer indexing technology
that supports rapid browsing
Find and suppress duplicate entries in the
dimension (eg. Name and address
matching)
Never use Type 2 to solve changing
dimensions (i.e. adding records)
Rapidly Changing Monster
Dimensions
Dimensions containing > 100 million records!!!
HOW TO SUPPORT???
Break the Monster dimension into separate
dimension tables
Constant information into original table
New dimension table can have discrete
values for each attribute
Choose pre-defined set of values per
attribute
Rapidly Changing Monster
Dimensions (contd..)
Build the data in this dimension with all
possible combinations of values for each
attribute
Identify each combination uniquely
Every time an event occurs and is recorded
in fact table, attach it with the unique
combination ID.
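A minimal sketch of the technique just described, under stated assumptions: the attribute names, the value bands and the income thresholds are hypothetical. Every combination of the pre-defined values gets one row and one unique key, and each fact row picks up the key of the combination that applies when the event occurs.

```python
from itertools import product

# Pre-defined, discrete values per attribute (hypothetical bands).
INCOME_BANDS = ["under 20K", "20K-50K", "over 50K"]
EDUCATION = ["School", "Graduate", "Postgraduate"]
MARITAL_STATUS = ["Single", "Married"]

# One row per possible combination, each with a unique key.
demographics_dim = {
    key: combo
    for key, combo in enumerate(product(INCOME_BANDS, EDUCATION, MARITAL_STATUS), start=1)
}
demog_key_lookup = {combo: key for key, combo in demographics_dim.items()}


def income_band(income: float) -> str:
    """Map a continuous value onto its band (thresholds are assumptions)."""
    if income < 20_000:
        return "under 20K"
    if income <= 50_000:
        return "20K-50K"
    return "over 50K"


def fact_row(customer_key: int, income: float, education: str,
             marital_status: str, amount: float) -> dict:
    """Attach the demographics combination that is current at event time."""
    demog_key = demog_key_lookup[(income_band(income), education, marital_status)]
    return {"customer_key": customer_key, "demog_key": demog_key, "amount": amount}


row = fact_row(7, 35_000, "Graduate", "Married", 120.0)
print(len(demographics_dim), row)
```

The dimension holds 3 x 3 x 2 = 18 rows no matter how many customers exist or how often their demographics change.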
Before:
Customer Dimension
Customer_Key (PK)
Name
Original_Address
date_of_birth
first_order_date
..
Income
Education
Number_children
marital_status
credit_score
purchase_score
Fact Table
Any fact table containing customer_key as a foreign key ..
Becomes..
Customer Dimension
Customer_Key (PK)
Name
Original_Address
date_of_birth
first_order_date
..
Demographics Dimension
Demog_Key (PK)
Income
Education
Number_children
marital_status
credit_score
purchase_score
Fact Table
Any fact table containing customer_key and demog_key as foreign keys ..
Variant with a separate Purchase-Credit Demographics dimension:
Customer Dimension
Customer_Key (PK)
Relatively constant attributes ..
Demographics dimension
demog_key and demographic attributes ..
Purchase-Credit Demographics dimension
purch_cred_demog_key ..
Fact Table
Any fact table containing customer_key, demog_key and purch_cred_demog_key as foreign keys ..
Rapidly Changing Monster
Dimensions (contd..)
Advantages
No increase in data storage every time an event occurs
Drawbacks
Forced to use ranges of discrete values for
dimensional attributes
New dimension cannot be too big (not more than 1 million rows)
Data in the new dimension can be accessed along with
the static data only through the fact table - slower
The static and changing portions of the dimension are linked
only if an event occurs - keep a dummy event in the fact table
Degenerate Dimensions
Occur in line item oriented fact tables
occur when dimension table is left only
with a single key and no other fields
all other attributes have been moved into
other dimension tables
Moved to fact table - not joined to anything
Junk Dimensions
Number of miscellaneous flags and text
attributes left over after design
WHAT TO DO WITH THEM????
DO NOT
Leave them behind in the fact table
Make each flag and attribute into its own dimension
Strip off all such flags and attributes
Junk Dimensions (contd..)
DO
Group the random flags and attributes:
take them out of the fact table and put them into a junk
dimension
e.g. open-ended comment fields
Conformed Dimensions
A dimension that means the same thing in every
fact table to which it is joined.
The dimension is identical in each
data mart
A major responsibility of the central DW design team is to
establish, publish, maintain and enforce them
DW cannot function as an integrated whole without
strict adherence to conformed dimensions
Conformed Dimensions (Contd.)
When you don't need Conformed Dimensions
Several lines of business where the customers and
products are disjoint.
You don't manage these separate business lines
together
THE TIME DIMENSION
Time_key
day_of_week
day_number_in_month
day_number_overall
week_number_in_year
month
quarter
fiscal_period
holiday_flag
weekday_flag
last_day_in_month_flag
season
event
Time Dimension
A dedicated Time dimension is required
because SQL date semantics and
functions cannot generate several important
attributes required for analytical purposes.
Attributes like fiscal period, holidays, season
and events cannot be derived from a date by SQL
statements, and weekday/weekend flags are
simpler to query as stored columns.
Time Dimension
Moreover SQL date stamps occupy more
space largely increasing the size of the fact
table.
Joins on such SQL generated date-stamps
are costly decreasing the query speed
significantly.
Time Dimension
The Day of week (Monday, ...) is useful to
create reports comparing, for example, Monday
sales to Friday sales.
The Day number in month is useful for
comparing measures for the same day in
each month.
The last day in month flag is useful for
performing payday analysis.
Time Dimension
The holiday flag and season attributes are
useful for holiday vs. non-holiday analysis
and seasonal business analysis.
The Event attribute is needed to record special
days such as strike days.
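A time dimension is usually populated once by a load program rather than computed in queries. The sketch below is one possible way to generate the calendar-derived columns; the function name is hypothetical, week numbers follow the ISO calendar as an assumption, and business-specific columns (fiscal_period, season, event) are left out because they depend on rules the slides do not give. Holidays are passed in as a set of dates.

```python
import calendar
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def build_time_dimension(start: date, end: date, holidays=frozenset()):
    """Return one row (dict) per calendar day from start to end inclusive."""
    rows = []
    day, key = start, 1
    while day <= end:
        last_day_of_month = calendar.monthrange(day.year, day.month)[1]
        rows.append({
            "time_key": key,                           # surrogate key
            "date": day,
            "day_of_week": DAY_NAMES[day.weekday()],
            "day_number_in_month": day.day,
            "day_number_overall": key,
            "week_number_in_year": day.isocalendar()[1],  # ISO week (assumption)
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "holiday_flag": day in holidays,
            "weekday_flag": day.weekday() < 5,         # Monday to Friday
            "last_day_in_month_flag": day.day == last_day_of_month,
        })
        day += timedelta(days=1)
        key += 1
    return rows


rows = build_time_dimension(date(2024, 1, 1), date(2024, 1, 31),
                            holidays={date(2024, 1, 26)})
print(rows[0])
```

Fact rows then carry only the small integer time_key, and reports filter or group on the descriptive columns.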
Case Study
on
Data Modeling
Store
Store Key
Store Id
Store Name
Locality
Region
.
.
Sales Fact
Time Key
Product Key
Store Key
Promotion Key
Sales (Rs.)
Product
Product Key
Product Id
Product category
..
Brand Name
SKU
..
Promotion Fact
Time Key
Product key
Store key
Promotion key
Time
Time Key
Time Id
Date
Month
Year
.
.
Promotion
Promotion key
Promotion Id
Promotion Category
..
Promotion Name
..
A Retail chain sample dimensional model
Retail Chains Sample Dimensional model
The first sales fact table measures the sales
figures at a granularity of SKU, Day and
Individual Store and Promotion name.
Only the SKUs that actually sell on the
day make it into the sales fact table,
irrespective of whether they are on
promotion or not.
Retail Chains Sample Dimensional model
The second promotion fact table is a
factless fact table. It has a granularity of
SKU, Day, Store and Promotion Name.
This promotion fact table records which
items are on promotion in which stores and
at what times.
Retail Chains Sample Dimensional model
Time, Product and Store are common
dimensions in both the fact tables.
Product and Promotion are Type 2 Slowly
changing dimensions.
Retail Chains Sample Dimensional model
The sales fact enables the sales monitoring and
analysis across Product, Stores, Time and
Promotion dimensions.
The second promotion fact table is needed to
answer the critical question: which
products were on promotion but did not sell
on a particular day?
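The question above is a set difference between the two fact tables. A minimal sketch with hypothetical key values, where promotion key 0 is assumed to mean "no promotion":

```python
# Factless promotion fact: (time_key, product_key, store_key, promotion_key)
promotion_fact = {
    (1, 101, 10, 5),
    (1, 102, 10, 5),
    (1, 103, 11, 6),
}

# Sales fact: same keys, plus the sales measure. Only SKUs that sold appear.
sales_fact = {
    (1, 101, 10, 5): 250.0,
    (1, 104, 10, 0): 90.0,   # sold without being on promotion
}


def promoted_but_not_sold(time_key: int):
    """Products (with store) on promotion on a day that have no sales row."""
    sold = {(t, p, s) for (t, p, s, _promo) in sales_fact if t == time_key}
    return sorted(
        (p, s)
        for (t, p, s, _promo) in promotion_fact
        if t == time_key and (t, p, s) not in sold
    )


print(promoted_but_not_sold(1))
```

In SQL the same result would typically come from an outer join or a NOT EXISTS subquery between the two fact tables.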
Retail Chains Sample Dimensional model
The second fact table can be avoided if we keep
zero sales figures in the sales fact table, but that
would make our sales fact table very
large: less than 5% of the products that
are on promotion on a particular day actually
sell.
Retail Chains Sample Dimensional model
Bitmap Indexes on the foreign key columns in
the fact tables.
Bitmap Indexes on low cardinality columns in
dimensional tables like Month, Product
Category, Store category, etc
B-Tree Indexes on Dimension key columns.
Retail Chains Sample Dimensional model
The sales fact is partitioned across the Month
column.
Aggregates can be created in future, based on
an understanding of which frequent,
time-consuming queries look for summarized
information.
Aggregates
Consider a schema with Product and Time
Dimensions with a granularity of individual
product Brand and day wise sales.
The Product Hierarchy:
Category-Product-Brand
The Time Hierarchy:
Year-Month-Day
Aggregates
Product Dimension
Categories : 3
Products : 30
Brands : 150
i.e. 150 rows in the Product Dimension
Time Dimension
Years : 5
Months : 60
Days : 365*5 = 1825
i.e. 1825 rows in the Time Dimension
Aggregates
Assuming a transaction for each of the
Brands everyday; we have 1825*150 rows
in our sales Fact table.
A Query like: Show Category wise sales
figures for the past five years would have to
access 1825*150 rows to get the answer.
Aggregates
Aggregated Tables
Product
Category: 3
Time
Year : 5
Month: 60
There would be 60*3=180 rows in this
aggregated fact table.
The query on this table needs to access only
180 rows to get the same set of results.
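The row counts above can be reproduced with a small sketch. Assumptions made for illustration: brands are numbered 0 to 149 with 50 brands per category, every brand sells one unit a day, and the 1825 days start on 2021-01-01.

```python
from collections import defaultdict
from datetime import date, timedelta

N_BRANDS = 150
BRANDS_PER_CATEGORY = 50          # 3 categories (assumed even split)
N_DAYS = 365 * 5                  # 1825 days, as on the slide

days = [date(2021, 1, 1) + timedelta(days=i) for i in range(N_DAYS)]

# Base fact table: one row per brand per day, sales of 1.0 each.
base_fact = {(brand, day): 1.0 for brand in range(N_BRANDS) for day in days}

# Aggregated fact table: roll brand up to category and day up to month.
agg_fact = defaultdict(float)
for (brand, day), sales in base_fact.items():
    category = brand // BRANDS_PER_CATEGORY
    agg_fact[(category, day.year, day.month)] += sales

print(len(base_fact))   # 1825 * 150 rows in the base fact table
print(len(agg_fact))    # 60 * 3 rows in the aggregated fact table
```

A category-by-year query reads the 180 aggregate rows and still returns the same totals as a scan of the 273,750 base rows.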
Aggregates
MONTH (dimension)
Time_Key
Month
Fiscal_Period
Season
CATEGORY (dimension)
Category_Key
Department
Category
AGG. SALES FACT
Category_Key
Time_Key
Sales
Cost
Aggregates
Aggregates increase the complexity of the
data model.
Aggregates increase the maintenance load
on the Data warehouse. They must be
updated as the base table data gets updated.
Aggregates occupy storage space. Hence
aggregates should be created only for
frequent and time-consuming queries.
Aggregate Navigation
Aggregate Navigation features enable end users to query the data mart without
worrying about the presence of aggregates.
Without Aggregate navigation, the end user
needs to be aware of the presence of
aggregates in order to query the
aggregated table instead of the detailed table,
which increases the complexity of the user
interface.
Aggregate Navigation
An aggregate navigator intercepts the
client's SQL and, if possible, transforms
base-level SQL into aggregate-aware SQL.
The Aggregate Aware function in Business
Objects 4.1 is an example of an Aggregate
navigator.
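The core decision an aggregate navigator makes can be sketched as: among the tables that carry every level the query asks for, pick the smallest. This is a toy illustration only; the table names, level names and row counts are hypothetical, and it does not show how Business Objects or any other product implements the feature.

```python
# Catalogue of available tables: the levels each can answer, and its size.
TABLES = [
    {"name": "sales_fact",
     "levels": {"brand", "product", "category", "day", "month", "year"},
     "rows": 273_750},
    {"name": "agg_sales_category_month",
     "levels": {"category", "month", "year"},
     "rows": 180},
]


def choose_table(requested_levels) -> str:
    """Return the smallest table that can answer a query at these levels."""
    requested = set(requested_levels)
    candidates = [t for t in TABLES if requested <= t["levels"]]
    if not candidates:
        raise ValueError(f"no table covers levels {sorted(requested)}")
    return min(candidates, key=lambda t: t["rows"])["name"]


print(choose_table({"category", "year"}))   # the aggregate can answer this
print(choose_table({"brand", "month"}))     # needs the detailed table
```

A real navigator would also rewrite the table and column names in the SQL text; only the table selection step is shown here.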
Aggregate Navigation
New features in Oracle 8i, such as Materialized
Views and Query Rewrite,
enable aggregate navigation to be built
within the data mart DBMS instead of in the front-end
access tools.
This enables all front-end access tools to use
the aggregate navigation feature.
Factless Fact table
Factless fact tables are fact tables that do
not have any measures.
This kind of fact table arises when there
are no obvious measures for the business
area.
Daily attendance tracking is one such
example of a business area having no
concrete measures.
Factless fact tables
Dimensions: TIME, STUDENT, COURSE, TEACHER
Fact table columns:
Time_Key
Student_Key
Course_Key
Teacher_Key
attendance=1
The grain of this fact table is the individual attendance event.
A dummy measure, attendance, is included to make the SQL
more readable.
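A minimal sketch of why the dummy measure helps readability, using Python's built-in sqlite3 module. The table name and sample rows are hypothetical. Counting rows and summing the dummy measure give the same answer; the SUM form simply reads like a query on any other fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attendance_fact (
        time_key    INTEGER,
        student_key INTEGER,
        course_key  INTEGER,
        teacher_key INTEGER,
        attendance  INTEGER DEFAULT 1   -- dummy measure, always 1
    )
""")
conn.executemany(
    "INSERT INTO attendance_fact (time_key, student_key, course_key, teacher_key) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, 10, 100), (1, 2, 10, 100), (2, 1, 10, 100), (2, 1, 20, 200)],
)

# Without the dummy measure: count the rows.
by_count = conn.execute(
    "SELECT course_key, COUNT(*) FROM attendance_fact "
    "GROUP BY course_key ORDER BY course_key"
).fetchall()

# With the dummy measure: sum it like any other fact.
by_sum = conn.execute(
    "SELECT course_key, SUM(attendance) FROM attendance_fact "
    "GROUP BY course_key ORDER BY course_key"
).fetchall()

print(by_count, by_sum)
```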
A Retail chain sample dimensional model (shown earlier; its Promotion Fact is a factless fact table)
When to start data modeling???
When requirements address these questions:
Who (people, groups, organizations) is of interest to the
user?
What (functions) is the user trying to analyze?
Why does the user need the data?
When (for what point in time) does the data need to be
recorded?
Where (geographically, organizationally) do relevant
processes occur?
How do we measure the performance or state of the
functions being analyzed?
Approaches to Data Gathering
1. Source Driven
define requirements by using the source data in
production operational systems.
by analyzing an ER model of source data OR
by analyzing the actual physical record layouts and
selecting data elements deemed to be of interest.
Advantages
Know data that you can supply
Minimize user involvement in early stages of project
Disadvantages
Increased risk of producing wrong set of requirements
Approaches to Data Gathering
2. User Driven
define requirements by investigating the functions the
users perform
done through a series of meetings and/or interviews
with users.
Advantages
Focus on what is needed rather than what is available
Disadvantages
Expectations to be closely managed.
Combine both: Identify Subject areas (Source driven) and
define specific requirements in a Subject area (User driven)
Data Modeling for Data
Warehouse - Steps
12 Steps to Data modeling for Data Warehouse
1. Study ER
2. Evaluate and Analyse
3. Review Dimension
4. Add Time Dimension
5. Identify Facts
6. Granularity
7. Merge Facts
8. Review Facts
9. Name Facts
10. Size the model
11. Record Metadata
12. Validate model
Case Study
CelDial Case Study
Case study (contd..)
1. Study the ER
Step 1: Remove all entities that act as
associative entities and all subtype
entities.
([Link] Component, Inventory,
Order Line, Order, Retail Store, and
Corporate Sales Office)
Note: Be careful to create all the
many-to-many relationships that
replace these entities
Case study (contd..)
Step 2: Roll up the entities at the end of each of
the many-to-many relationships into single
entities.
For each new entity, consider which attributes
in the original entities would be useful
constraints on the new dimension.
Note : Remember to consider attributes of any
subtype entities removed in the first step.
Logical Model is a logical representation:
remove individual keys and replace with
generic key for each dimension.
Case study (contd..)
Note:
Roll the salesperson up into the sales
dimension
implies (correctly) that the relationships
among outlet, salesperson and customer
roll up into the sales to customer
relationship.
The many-to-many relationship
between customer and sales prevents
the erroneous rollup of customer into
sales person and ultimately into sales.
Case study (contd..)
2. Evaluate and Analyze business of the
organization
Requirements that are collected must represent
these :
what is being analyzed (Dimensions)
evaluation criteria for what is being analyzed
(Measures)
IDENTIFY the measures and dimensions
Analyze the questions, define measures and
dimensions to meet requirements.
Case study (contd..)
Advantage of the approach:
Used all information available
Corporate Dimensions (from ER)
From requirements gathered
Disadvantage of using only requirements:
More time consuming
Miss some dimensions altogether
Eg. Customer and Component dimensions and the Number of
Cash Registers and Floor Space, attributes of the Sales
dimension
Case study (contd..)
3. Review the dimensions
Do we have all data to answer all the questions?
a. Sales and Manufacturing??? Yes
b. Product
Q2, Q3 can they be answered? NO!
What's MISSING??
Unit cost of model at any point in time is required.
History of unit cost required. Add begin and end
date in product dimension.
Unit cost Derivation rule?? Given (Defining Cost and
Revenue)
Case study (contd..)
4. Add Time Dimensions
Lowest level of Time - DAY
Reporting requirements ???
By day, week and month
Final Dimension List
Case study (contd..)
5. Identify Facts
One set of dimensions and its associated measures
make up what is called a fact.
Organizing the dimensions and measures into facts:
the process of grouping dimensions and measures
together in a manner that can address the specified
requirements. HOW?
First create an initial fact for each of the queries in
the case study.
Note: For any measures that describe exactly the same set of
dimensions, create only one fact
Case study (contd..)
Note:
Q6, Q8, Q9 do not have any measures
If we did not:
merge Q6 with Q5, Q7 in Fact 4
merge Q8 and Q9 with Q2 in Fact 2
we would be left with Factless Facts (facts with no measures)
Such a fact only records that the sale of a product at a point
in time (facts 2 and 3), at a specific location (fact 2 only),
has occurred. No other measurement is required.
Case study (contd..)
6. Determine Granularity
Level of detail at which fact is recorded
Try to keep at most detailed level (summarize if required)
Additivity : ability of a measure to be
summarized
fully additive (additive across all dimensions - advised)
non-additive (e.g. adding the % of 2 facts - not possible)
semi-additive (e.g. balances of the same account at 2
different points in time cannot be added; additive only
across some dimensions)
Case study (contd..)
Fact 1 :Average quantity on hand (monthly)
Total cost and total revenue (daily)
Solution: a. Split into 2 facts, or
b. Make the time dimension consistent
Set time to the lowest level - DAY
Average quantity on hand - non-additive
Solution: store the actual quantity on hand and let the
query calculate the average.
Case study (contd..)
Fact 2:
Two different levels of granularity
Q2 (daily)
Q8, Q9 (monthly)
Solution: Since measures are fully additive, set the
grain of time to a day. A query can handle any
summarization to the monthly level.
Case study (contd..)
Fact 3,4:
Two different grains of time. Neither can roll up to the other.
Options:
a. Change grain. But measures are non-additive
b. Split into multiple facts. But both facts have same
measures with only time grain different
Solution: Change time grain to DAY
Change non-additive measure to additive by storing atomic
elements of %.
Case study (contd..)
Solution
Replace % (Fact 3) with quantity of models sold
through: - retail outlet,
- corporate sales office
- salesperson.
Total quantity sold is already present. % can be calculated
Replace % (Fact 4) with :
- number of models eligible for discount,
- quantity of models eligible for discount actually sold
- quantity of models sold at a discount.
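A short sketch of why the atomic quantities are stored instead of the percentage. The daily figures and column names are hypothetical. Quantities add up across days, so the percentage can be derived at any grain; averaging stored daily percentages would give a different, wrong answer.

```python
CHANNELS = ("qty_retail", "qty_corporate", "qty_salesperson")

# Daily fact rows holding the additive quantities (hypothetical figures).
daily_fact = [
    {"day": 1, "qty_retail": 30,  "qty_corporate": 60, "qty_salesperson": 10},
    {"day": 2, "qty_retail": 150, "qty_corporate": 30, "qty_salesperson": 20},
]


def pct_by_channel(rows):
    """Sum the quantities first, then derive the percentage per channel."""
    totals = {c: sum(r[c] for r in rows) for c in CHANNELS}
    grand_total = sum(totals.values())
    return {c: 100 * q / grand_total for c, q in totals.items()}


period_pct = pct_by_channel(daily_fact)

# Incorrect alternative: average the two daily percentages for retail.
daily_retail_pct = [pct_by_channel([r])["qty_retail"] for r in daily_fact]
naive_avg = sum(daily_retail_pct) / len(daily_retail_pct)

print(period_pct["qty_retail"], naive_avg)
```

Here retail accounts for 180 of 300 units over the two days, while the average of the daily percentages (30% and 75%) understates it.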
Case study (contd..)
7. Merge Facts
Consolidate facts where possible WHY?
Easier for a user to find the data needed to satisfy a query if
there are fewer places to look.
Expand the analysis potential because you can relate more
measures to more dimensions at a higher level of
granularity.
Fewer facts - less administration
HOW??
Determine for each measure which additional dimensions can
be added to increase its granularity
Case study (contd..)
Fact 1: No finer breakdown for quantity on hand or reorder
level
Fact 2: Already has all the dimensions in Fact 3 and 4
Fact 3 : Add Sales dimension to break up
Total Cost, Total Revenue, Total quantity sold
The sales dimension contains both outlet type and
salesperson data. Using this structure we can
classify the total quantity sold, negating the need
to store the three individual totals.
Solution: Merge Fact 3 into Fact 2
Case study (contd..)
Fact 4: Add Product dimension.
Number of models eligible for discount can be calculated
directly from the product dimension. Not needed in
consolidated Fact
Product dimension tells whether an individual model is
eligible for discount
Use the total quantity sold (consolidated from fact 3) to
represent the quantity of models eligible for discount
actually sold.
Case study (contd..)
Fact 4: Add Product dimension.
Quantity of models sold at a discount - Retain
OR
record the discount amount and generate the
quantity sold at a discount by adding up the
quantity sold where the discount amount is not
zero.
Solution: Merge Fact 2, 3, 4
Case study (contd..)
8. Review the facts for opportunities to add other
dimensions, increasing the potential for valuable analysis.
Fact 1 : Can it be broken down further ?? NO
Fact 2 : YES! manufacturing and customer dimensions can
be applied ; Component Dimension cannot be applied
dimensions are: sales, product, manufacturing, customer,
time.
All can be identified at the time an order is placed.
ADD order as a dimension (to increase analysis potential)
Order has no attributes. Add order key to fact as a
degenerate dimension
Case study (contd..)
9. Name the facts
Fact 1 - Inventory Fact
Fact 2 - Sales Fact
10. Size the model
calculate the size of the data in a table
number of rows * length of each row
To calculate row length:
4 bytes for each numeric or date attribute
number of characters for character attribute
number of digits in a decimal attribute / 2 and rounded up.
Case study (contd..)
To calculate Number of rows:
No history maintained. Use from operational system.
Seller = 48 ( 3 corp+ 15retail + 30 salesmen)
Customer = 3000
Manufacturing plants = 7
How long should we keep data? 3 complete yrs (the row counts below use 4 years)
Time : 1 row per day = 1461 (4 * 365 + 1 day for leap year)
No. of models of products = 300
No. of models experiencing changes = 10 per week =
10 * 52 * 4 = 2080
No. of product rows = 300 + 2080 = 2380
Case study (contd..)
Size of Inventory Fact =
7 plants x 300 models x 1,461 = 3,068,100 rows
Size of Sales Fact
Corporate Sales
500 sales x 10 models x 5 days x 52 weeks x 4 years =
5,200,000 rows
Retail Sales
1000 sales x 2 models x 7 days x 52 weeks x 4 years =
2,912,000 rows
Size of Sale Fact = 8,112,000 rows
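The sizing arithmetic can be checked with a few lines. The row counts use the figures from the slides; the row_length helper applies the byte rules from step 10, and the example row layout passed to it is purely hypothetical.

```python
import math


def row_length(n_numeric_or_date: int, char_lengths=(), decimal_digits=()) -> int:
    """Bytes per row: 4 per numeric/date attribute, 1 per character,
    and digits / 2 rounded up for each decimal attribute."""
    return (4 * n_numeric_or_date
            + sum(char_lengths)
            + sum(math.ceil(d / 2) for d in decimal_digits))


YEARS = 4

time_rows = YEARS * 365 + 1                 # one leap day
product_rows = 300 + 10 * 52 * YEARS        # models + weekly changes
inventory_rows = 7 * 300 * time_rows        # plants x models x days

corporate_sales_rows = 500 * 10 * 5 * 52 * YEARS
retail_sales_rows = 1000 * 2 * 7 * 52 * YEARS
sales_rows = corporate_sales_rows + retail_sales_rows

# Hypothetical layout: 5 integer keys and 2 decimal measures of 9 digits each.
example_row_bytes = row_length(5, decimal_digits=(9, 9))
example_table_bytes = sales_rows * example_row_bytes

print(time_rows, product_rows, inventory_rows, sales_rows, example_table_bytes)
```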
Case Study (contd..)
11. Record Metadata
Model (Name, Definition, Purpose, Contact Person, List of
Facts, dimensions and measures)
Fact (Name, Definition, Alias, Load Frequency, Measure,
Grain of time, dimensions, contact person)
Dimension (Name, Definition, Alias, hierarchy, change rule,
load frequency, attribute, fact, measure, contact person)
Attribute (Name, Definition, Alias, change rule, data type,
domain, derivation rule)
Measure (Name, Definition, Alias, data type, domain,
derivation rule, fact, dimension)
Case Study (contd..)
12. Validate the Model with user
Confirms that model meets user requirements
Confirms that user understands the model.
Validated portion goes through design
Remaining goes back in iterative development of model