Lecture 4 2025 EUI

The document outlines key concepts in data warehousing, including the star and snowflake schema, granularity, and the structure of fact and dimension tables. It discusses the importance of identifying business processes, the role of OLAP operations, and the challenges in maintaining data integrity and handling promotional data. Additionally, it presents case studies and examples to illustrate inventory models and the design steps for creating effective data warehouses.

Part I: Data Warehousing

Prof. Hoda M. O. Mokhtar


Lecture Outline

01 REVIEW OF THE PREVIOUS LECTURE

02 STAR & SNOWFLAKE SCHEMA

03 INVENTORY CASE STUDY

04 TYPES of FACT TABLE



Granularity
• What is granularity?
• The level of detail or summarization of the units of data in the data warehouse

• Low (coarse) granularity: highly summarized data
• High (fine) granularity: detailed data


Model Building Steps

1. Identify the business process
• Identify the actual business process the data warehouse should cover. This could be Marketing, Sales, HR, etc.
• It is the most important step of the data modelling process.

2. Identify the grain
• The grain describes the level of detail for the business problem/solution.
• It is the process of identifying the lowest level of information for any table in your data warehouse.

3. Identify the dimensions
• Dimensions are nouns like date, location, etc.
• These dimensions are where all the descriptive data should be stored.

4. Identify the facts
• Identify the numeric measures (facts).

5. Build the schema
• Star or snowflake schema


Elements of Dimensional Data Model

Fact Table
• A fact table is a primary table in dimensional modelling.
• A fact table contains:
  1. Measurements/facts
  2. Foreign keys to the dimension tables

Dimensions
• A dimension provides the context surrounding a business process event.
• In simple terms, dimensions give the who, what, and where of a fact.
• In the Sales business process, for the fact "quarterly sales number", the dimensions would be:
  • Who: Customers
  • Where: Location
  • What: Products

Dimension Tables
• A dimension table contains the dimensions of a fact.
• Dimension tables are joined to the fact table via a foreign key.
• The dimension attributes are the various columns in a dimension table.
• There is no limit on the number of dimensions.
• A dimension can also contain one or more hierarchical relationships.

Dimension Attributes
• The attributes are the various characteristics of the dimension in dimensional data modeling.
• In the Location dimension, the attributes can be:
  • State
  • Country
  • Zipcode

Facts or Measures
• Facts are the measurements/metrics from your business process.
• For a Sales business process, a measurement would be the sales number.

Dimensional Model: A Simple Example
[Figure: a central Sales Fact Table, holding foreign keys to the dimension tables plus the facts/measures, surrounded by four dimensions]
• Date Dimension: Date_Dim PK, date-related attributes
• Product Dimension: Product_Dim PK, product-related attributes
• Location Dimension: Loc_Dim PK, location-related attributes
• Customer Dimension: Cust_Dim PK, customer-related attributes


Example 2: Data Cubes



Example 2 - II



Example 2: III



Cuboids Corresponding to the Cube

• 0-D (apex) cuboid: all
• 1-D cuboids: product; customer; location
• 2-D cuboids: (product, customer); (product, location); (customer, location)
• 3-D (base) cuboid: (product, customer, location)
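
The cuboid lattice above can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions give 2^n cuboids. The following is a minimal sketch; the function name `cuboids` is an illustrative choice, not something from the lecture.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return all cuboids (subsets of the dimensions), keyed by their size."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # Each k-element subset of the dimensions is one k-D cuboid.
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice


lattice = cuboids(["product", "customer", "location"])
for k, groups in lattice.items():
    print(f"{k}-D cuboids:", groups)
```

The empty tuple at level 0 plays the role of the apex cuboid ("all"), and the single tuple at level 3 is the base cuboid.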



OLAP Operations: Sales OLAP Cube

Total sales amount in $ in State X for Product Y during the Year Z

https://www.coursera.org/learn/data-warehouse-fundamentals/ungradedLti/7tDHW/hands-on-lab-populating-a-data-warehouse-using-postgresql


Slicing Data Cubes

Slicing reduces the cube dimension by 1


Dicing Data Cubes

Dicing shrinks a dimension


Drilling Up and Down Data Cubes

Drilling into subcategories within a dimension


Pivoting Data Cubes



Rolling Up in Data Cubes

Summarize a dimension
(average, count, sum, etc.)
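
The slice, dice, and roll-up operations from the preceding slides can be sketched on a tiny in-memory cube. This is only an illustration: the cube cells, the dimension names in `DIMS`, and the helper functions are hypothetical, and a real OLAP engine works very differently internally.

```python
from collections import defaultdict

# Hypothetical cube cells: (year, state, product) -> sales amount in $.
DIMS = ("year", "state", "product")
cube = {
    (2018, "NY", "Gloves"): 100,
    (2018, "NY", "Hats"): 50,
    (2018, "CA", "Gloves"): 70,
    (2019, "NY", "Gloves"): 120,
    (2019, "CA", "Hats"): 30,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and drop it (dimensionality - 1)."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, allowed):
    """Keep only the cells whose coordinates fall inside the allowed value sets."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cube, dim):
    """Summarize one dimension away by summing over all of its values."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)


sales_2018 = slice_cube(cube, "year", 2018)          # 2-D result: (state, product)
ny_gloves = dice_cube(cube, {"state": {"NY"}, "product": {"Gloves"}})
by_year_state = roll_up(cube, "product")             # 2-D result: (year, state)
```

Here slicing and rolling up both return cells with one coordinate fewer, while dicing keeps all three coordinates and only shrinks the range of values along two of them.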
3. Choose the Dimensions



Star Schema: Retail Case Study


Surrogate Keys
• Integer keys that are sequentially assigned as needed in the staging area
to populate a dimension table and join to the fact table.

• In the dimension table, the surrogate key is the primary key.


• In the fact table, the surrogate key is a foreign key to a specific
dimension and may be part of the fact table’s primary key

• Is not a smart key in any way.

• Also known as artificial keys, integer keys, meaningless keys, non-natural keys, and synthetic keys.
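
A minimal sketch of sequential surrogate key assignment during staging follows. The class name `SurrogateKeyGenerator`, the SKU values, and the row layout are hypothetical and only illustrate the idea of mapping each natural key to a meaningless integer.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Assigns sequential integer keys to natural (operational) keys."""

    def __init__(self, start=1):
        self._next_key = count(start)
        self._lookup = {}  # natural key -> surrogate key

    def key_for(self, natural_key):
        # A natural key seen for the first time gets the next integer;
        # a key seen before reuses the integer it was given earlier.
        if natural_key not in self._lookup:
            self._lookup[natural_key] = next(self._next_key)
        return self._lookup[natural_key]


product_keys = SurrogateKeyGenerator()
source_rows = [("SKU-0042", 3), ("SKU-0007", 1), ("SKU-0042", 2)]  # (natural key, quantity)
fact_rows = [(product_keys.key_for(sku), qty) for sku, qty in source_rows]
```

The integers carry no business meaning; they only tie each fact row to its dimension row.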



What is the problem?

• Many item lines could be for items not promoted

• Referential integrity should be maintained

• Introduce “not on promotion” value for those items



Rule 1: Null values

• You must avoid “null” keys in the fact table.

• A proper design includes a row in the corresponding dimension table to identify that the dimension is not applicable to the measurement.
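
As a sketch of this rule, the load step can map a missing promotion code to a reserved "not applicable" dimension row instead of writing a null key. The reserved key value 0, the promotion names, and the helper `promotion_key` are assumptions made for illustration.

```python
NO_PROMOTION_KEY = 0  # assumed reserved surrogate key for "not on promotion"

# Promotion dimension: the reserved row exists alongside the real promotions.
promotion_dim = {
    NO_PROMOTION_KEY: "No promotion in effect",
    1: "Winter coupon",
}
promotion_lookup = {"WINTER-CPN": 1}  # source promotion code -> surrogate key


def promotion_key(source_code):
    """Return a valid dimension key even when the sale had no promotion."""
    if source_code is None:
        return NO_PROMOTION_KEY
    return promotion_lookup[source_code]


sale_lines = [("Chips", "WINTER-CPN"), ("Soda", None)]
fact_rows = [(product, promotion_key(code)) for product, code in sale_lines]
```

Every fact row now joins to exactly one promotion row, so referential integrity holds and inner joins do not silently drop unpromoted sales.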



Promotion dimension (cont.)



New problems !!

• What products were on promotion but did not sell?

• The sales fact table only records the products actually sold. There are
no fact table rows with zero facts for products that didn’t sell because
doing so would enlarge the fact table enormously.

• Thus, a second promotion coverage or event fact table is needed to help answer the question concerning what didn’t happen.


Promotion Coverage Fact Table
• The promotion coverage fact table keys would be date, product, store,
and promotion in our case study.

• This is similar to the sales fact table we just designed; however, the
grain would be significantly different.

• Here we’d load one row in the fact table for each product on promotion
in a store regardless of whether the product sold or not.



Factless Fact Table
• We refer to the promotion coverage table as a factless fact table
because it has no measurement metrics; it merely captures the
relationship between keys.

• Answering "What products were on promotion but didn’t sell?" requires a 3-step process:
1. Query the promotion coverage table to determine the universe of products that were on promotion on a given day.
2. Determine what products sold from the POS sales fact table on that day.
3. The answer to our original question is the set difference between these two lists of products.
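
The three steps can be sketched with an in-memory SQLite database, where SQL's `EXCEPT` performs the set difference. The table names, column names, and sample rows are hypothetical; the slides only state that the coverage table is keyed by date, product, store, and promotion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotion_coverage_fact (
    date_key INTEGER, product_key INTEGER, store_key INTEGER, promotion_key INTEGER
);
CREATE TABLE pos_sales_fact (
    date_key INTEGER, product_key INTEGER, store_key INTEGER, promotion_key INTEGER,
    sales_quantity INTEGER
);
-- Products 1, 2 and 3 were on promotion 5 in store 10 on 2002-01-01.
INSERT INTO promotion_coverage_fact VALUES
    (20020101, 1, 10, 5), (20020101, 2, 10, 5), (20020101, 3, 10, 5);
-- Only products 1 and 3 actually sold that day.
INSERT INTO pos_sales_fact VALUES
    (20020101, 1, 10, 5, 4), (20020101, 3, 10, 5, 1);
""")

# Step 1 (first SELECT), step 2 (second SELECT), step 3 (EXCEPT = set difference).
unsold = conn.execute(
    """
    SELECT product_key FROM promotion_coverage_fact
    WHERE date_key = ? AND store_key = ?
    EXCEPT
    SELECT product_key FROM pos_sales_fact
    WHERE date_key = ? AND store_key = ?
    """,
    (20020101, 10, 20020101, 10),
).fetchall()
```

Note that the coverage table carries no measure columns at all, which is exactly what makes it factless.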



Rule 2: Degenerate Dimension (DD)

• Operational control numbers such as order numbers, invoice numbers, and bill numbers usually give rise to empty dimensions and are represented as degenerate dimensions (that is, dimension keys without corresponding dimension tables).


Retail Model



Retail Model in Action

• A business user might be interested in better understanding monthly sales dollar volume by promotion for the snacks category during January 2002 for stores in the Boston district.

• Place query constraints on month and year in the date dimension, district in the store dimension, and category in the product dimension.
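
The query described above can be sketched against a toy version of the retail star schema in SQLite. All table names, column names, and rows below are hypothetical stand-ins for the schema shown in the lecture figures, which are not reproduced in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT, district TEXT);
CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key INTEGER REFERENCES store_dim(store_key),
    promotion_key INTEGER REFERENCES promotion_dim(promotion_key),
    sales_dollar_amount REAL
);
INSERT INTO date_dim VALUES (1, 'January', 2002), (2, 'February', 2002);
INSERT INTO product_dim VALUES (1, 'Potato chips', 'Snacks'), (2, 'Milk', 'Dairy');
INSERT INTO store_dim VALUES (1, 'Store A', 'Boston'), (2, 'Store B', 'Hartford');
INSERT INTO promotion_dim VALUES (0, 'No Promotion'), (1, 'Coupon');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 10.0),  -- January, snacks, Boston, coupon
    (1, 1, 1, 0, 5.0),   -- January, snacks, Boston, no promotion
    (1, 1, 1, 1, 2.5),   -- January, snacks, Boston, coupon
    (1, 2, 1, 1, 9.0),   -- dairy: filtered out by category
    (2, 1, 1, 1, 7.0),   -- February: filtered out by month
    (1, 1, 2, 1, 3.0);   -- Hartford: filtered out by district
""")

# Constraints sit on the dimension tables; the measure comes from the fact table.
rows = conn.execute("""
    SELECT p.promotion_name, SUM(f.sales_dollar_amount) AS sales_dollars
    FROM sales_fact f
    JOIN date_dim d       ON f.date_key = d.date_key
    JOIN product_dim pr   ON f.product_key = pr.product_key
    JOIN store_dim s      ON f.store_key = s.store_key
    JOIN promotion_dim p  ON f.promotion_key = p.promotion_key
    WHERE d.month = 'January' AND d.year = 2002
      AND s.district = 'Boston'
      AND pr.category = 'Snacks'
    GROUP BY p.promotion_name
    ORDER BY p.promotion_name
""").fetchall()
```

Each dimension joins only to the fact table and never to another dimension, which is the join shape the star schema is designed for.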



Example



Query Result



Quiz 1



What is a schema?
• Definition 1: A schema is a definition of an entire database. It defines
the structure and the type of contents that each data element within
the structure can contain.

• Definition 2: A schema is the structure of a database system. In a relational database, the schema defines the tables and the attributes in each table.

• Definition 3: A schema describes the objects that are represented in the database and the relationships among them.


What is a Star Schema?
• The star schema is a simple data warehouse schema.

• It is called a star schema because the diagram of this schema resembles a star, with points radiating from a central table.

• The center of the star consists of a large fact table and the points of the star
are the dimension tables.

• A star schema is characterized by one or more very large fact tables that
contain the primary information in the data warehouse, and a number of
much smaller dimension tables (or lookup tables), each of which contains
information about the entries for a particular attribute in the fact table.



Examples of Star Schema

• Relationships in the star schema are 1:m (one dimension row to many fact rows).

• Example: a product is unique in the product dimension but can occur in more than one record in the fact table.

Why Star Schema?

1. Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
2. Provide highly optimized performance for typical star queries.
3. Are widely supported by a large number of business intelligence tools.
4. Every dimension is equivalent: the logical design is independent of query patterns.
5. Extensible design
   • Add facts -> add records to the fact table
   • Add dimension


What is Star Query?

• Is a join between a fact table and a number of dimension tables.

• Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.


Star Schema: Retail Case Study


An Expanded Star Schema


Dimension Normalization: Snowflaking

• Dimension table normalization typically is referred to as snowflaking.

• Redundant attributes are removed from the flat, denormalized dimension table and placed in normalized secondary dimension tables.

• If the schema were fully snowflaked, it would appear as a full 3NF ER diagram.


What is a Snowflake Schema?

• A snowflake schema is a set of tables = a single, central fact table + normalized dimension hierarchies.

• Each dimension level is represented in a table.

• Snowflake schemas implement dimensional data structures with fully normalized dimensions.
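
To make the idea concrete, the sketch below snowflakes one level of a flat product dimension by moving the category (and the department that depends on it) into a secondary table. The sample rows, the assumption that department depends on category, and the helper `snowflake` are all hypothetical.

```python
# Flat, denormalized product dimension (hypothetical rows).
flat_product_dim = [
    {"product_key": 1, "description": "Potato chips", "category": "Snacks", "department": "Grocery"},
    {"product_key": 2, "description": "Pretzels", "category": "Snacks", "department": "Grocery"},
    {"product_key": 3, "description": "Milk", "category": "Dairy", "department": "Refrigerated"},
]


def snowflake(rows, attribute, dependents=()):
    """Move `attribute` and its dependent attributes into a secondary dimension table."""
    keys = {}       # attribute value -> key in the secondary table
    secondary = []  # the new normalized table
    primary = []    # the original table with the attribute replaced by a key
    for row in rows:
        value = row[attribute]
        if value not in keys:
            keys[value] = len(keys) + 1
            new_row = {f"{attribute}_key": keys[value], attribute: value}
            new_row.update({d: row[d] for d in dependents})
            secondary.append(new_row)
        kept = {k: v for k, v in row.items() if k != attribute and k not in dependents}
        kept[f"{attribute}_key"] = keys[value]
        primary.append(kept)
    return primary, secondary


product_dim, category_dim = snowflake(flat_product_dim, "category", dependents=("department",))
```

The repeated "Snacks"/"Grocery" text is now stored once in `category_dim`, at the cost of one extra join whenever a query needs the category name.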



Normalizing the Retail Schema


Snowflaking: Product Dimension



Why Snowflake?

1. If a dimension is very sparse (i.e., most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database, and snowflaking may be appropriate.

2. A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will hence be easier to implement.


Why not snowflake?

1. The multitude of snowflaked tables makes users struggle with the complexity (simplicity is one of the primary objectives of a denormalized model).

2. Numerous tables and joins usually translate into slower query performance.

3. The minor disk space savings associated with snowflaked dimension tables are insignificant. If we replaced the 20-byte department description in our 150,000-row product dimension table with a 2-byte code, we’d save about 2.7 MB (150,000 x 18 bytes), but we may have a 10-GB fact table!

Why Does Data Grow Rapidly in a DW?

1. Historical data

2. Data can be at a fine level of granularity for additional query flexibility

3. Data in the data warehouse is integrated from various sources


Granularity Revisited

• Highly detailed data can be so voluminous that it is unusable.

• If the data is truly huge, move inactive data to overflow/archive storage.


DW Size Estimation

• Consider the shown retail store star schema.


• Assume ~10,000 transactions/hour
• 60,000 products
• Each key = 2 bytes
• Each measure = 2 bytes
• DD = 2 bytes
• Time horizon = 5 years



Size Estimation Procedure

1. Identify all tables to be built. The rule is that there will be 1 or 2 large tables and a number of smaller tables.

2. Estimate the size of the row in each table.

3. For a 1-year horizon, estimate the maximum number of rows in the table:
   1. Use previous data
   2. Estimate based on the market

4. Repeat for the 5-year horizon.
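
The procedure can be sketched as a small calculation for the fact table. The byte sizes, transaction rate, and 5-year horizon come from the "DW Size Estimation" slide; everything else is an assumption made here because the schema figure is not reproduced in this text: 4 dimension keys, 4 measures, 1 degenerate dimension, one fact row per transaction, and stores operating 24 hours a day, 365 days a year.

```python
def fact_table_size(rows_per_year, n_keys, n_measures, n_degenerate,
                    key_bytes=2, measure_bytes=2, dd_bytes=2, years=5):
    """Return (bytes per row, total bytes over the horizon) for one fact table."""
    row_bytes = (n_keys * key_bytes
                 + n_measures * measure_bytes
                 + n_degenerate * dd_bytes)
    return row_bytes, rows_per_year * years * row_bytes


# ASSUMPTIONS (not from the slides): round-the-clock operation and
# one fact row per transaction.
transactions_per_hour = 10_000
rows_per_year = transactions_per_hour * 24 * 365

# ASSUMPTIONS (not from the slides): 4 keys, 4 measures, 1 degenerate dimension.
row_bytes, total_bytes = fact_table_size(
    rows_per_year, n_keys=4, n_measures=4, n_degenerate=1
)
print(f"{row_bytes} bytes per row, {total_bytes / 1e9:.2f} GB over 5 years")
```

Changing any of the assumed counts only requires changing the arguments; the dimension tables (for example, 60,000 product rows) would be estimated the same way and are typically small by comparison.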


Inventory Case Study



Retailer Value Chain

• Consists of the organization’s key business processes
• The flow of an organization’s primary activities
• Provides high-level insight into the overall enterprise DW


Inventory Models

1. Inventory Periodic Snapshot

2. Inventory Transactions

3. Inventory Accumulating Snapshot



Inventory Model 1

• We need to analyze the daily quantity-on-hand inventory levels by product and store

• Questions:
• What are the 4 design steps?
• What type of fact table do we have in this model?



Model 1: Inventory Periodic Snapshot

• Every day we measure the inventory level of each product and place each measurement on a separate row in the fact table

• We need to analyze the daily quantity-on-hand inventory levels by product and store


4-step Design Approach
1. Select a business process: we’re interested in analyzing the retail
store inventory.

2. Declare the grain: we want to see daily inventory by product at each individual store

3. Choose the dimensions: date, product, and store.

4. Identify the numeric facts: quantity on hand



Periodic Snapshot Schema



New Challenges

• Inventory levels are measured frequently
• To avoid out-of-stock situations
• Inventory generates dense snapshot tables, since there is a row in the fact table for virtually every product in every store every day.

• Example
• 60,000 products x 100 stores x 14-byte row width = 84 MB per daily snapshot
• A year’s worth of daily snapshots >= 30 GB
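
The figures in this example can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a 365-day year.

```python
products = 60_000
stores = 100
row_bytes = 14  # row width from the example above

# One row per product per store per day.
daily_bytes = products * stores * row_bytes
yearly_bytes = daily_bytes * 365

print(f"Daily snapshot:  {daily_bytes / 1e6:.0f} MB")
print(f"Yearly snapshot: {yearly_bytes / 1e9:.2f} GB")
```

The yearly total comes out slightly above 30 GB, consistent with the ">= 30 GB" bound on the slide.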



Semi-additive Facts
• Quantity on hand can be summarized across products or stores and result
in a valid total.

• Inventory levels, however, are not additive across dates because they
represent snapshots of a level or balance at one point in time.

• Because inventory levels are additive across some dimensions but not all,
we refer to them as semiadditive facts.



Semi-additive facts: Example
• Consider your bank account balance
• On Monday, you had $50 in your account.
• On Tuesday, the balance remains unchanged.
• On Wednesday, you deposit another $50 into your account so that the balance
is now $100.
• The account has no further activity through the end of the week.
• On Friday, you can’t add up the daily balances during the week and declare that
your balance is $400 (based on $50 + 50 + 100 + 100 + 100).
• To combine account balances and inventory levels across dates we average them
(resulting in an $80 average balance in the checking example).
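
The bank balance example can be reproduced directly. This sketch uses the Monday-to-Friday balances given above; the variable names are illustrative.

```python
from statistics import mean

# End-of-day balances for the week described above.
daily_balances = {"Mon": 50, "Tue": 50, "Wed": 100, "Thu": 100, "Fri": 100}

# Summing a balance across dates is meaningless: this is not the Friday balance.
misleading_total = sum(daily_balances.values())

# Averaging over the number of periods is the valid way to combine across dates.
average_balance = mean(list(daily_balances.values()))

# The actual balance on Friday is simply the last snapshot.
friday_balance = daily_balances["Fri"]
```

The same pattern applies to quantity on hand: sum across products or stores for a single date, but average (or take the latest snapshot) across dates.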



Rule 1

• All measures that record a static level (inventory levels, financial account balances, and measures of intensity such as room temperatures) are inherently non-additive across the date dimension and possibly other dimensions.

• In these cases, the measure may be aggregated usefully across time, for example, by averaging over the number of time periods.


Model 2: Inventory Transactions

• Record every transaction that affects inventory.


• Inventory transactions at the warehouse might include the following:
• Receive product
• Place product into inspection hold
• Release product from inspection hold
• Return product to vendor due to inspection failure
• Authorize product for sale
• Package product for shipment
• Ship product to customer
• Return product to inventory from customer return



Warehouse Inventory Transaction Schema

• Each inventory transaction identifies the date, product, warehouse, vendor, transaction type, and in most cases, a single amount representing the inventory quantity impact caused by the transaction.


Model 3: Inventory Accumulating Snapshot

• Let’s assume that the inventory goes through a series of well-defined events or milestones as it moves through the warehouse, such as receiving, inspection, bin placement, authorization to sell, picking, boxing, and shipping.

• The idea behind the accumulating snapshot fact table is to provide an updated status of the product shipment as it moves through these milestones.

• Each fact table row will be updated until the product leaves the
warehouse.
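
A minimal sketch of an accumulating snapshot row follows: one row per product lot, with one date column per milestone that starts empty and is filled in as the lot progresses. The milestone column names are derived from the list above; the lot identifier, the dates, and the helper functions are hypothetical.

```python
from datetime import date

MILESTONES = ("received", "inspected", "placed_in_bin", "authorized",
              "picked", "boxed", "shipped")


def new_row(product_lot):
    """Create a fact row with every milestone date still unknown."""
    row = {"product_lot": product_lot}
    row.update({f"{m}_date": None for m in MILESTONES})
    return row


def record_milestone(row, milestone, when):
    """Revisit the existing row and fill in one milestone date."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row[f"{milestone}_date"] = when


def days_between(row, start, end):
    """Lag between two milestones, or None if either has not happened yet."""
    a, b = row[f"{start}_date"], row[f"{end}_date"]
    return None if a is None or b is None else (b - a).days


row = new_row("LOT-001")
record_milestone(row, "received", date(2025, 1, 6))
record_milestone(row, "inspected", date(2025, 1, 8))
record_milestone(row, "shipped", date(2025, 1, 20))
```

Unlike a transaction or periodic snapshot row, this row is updated in place, and lags such as `days_between(row, "received", "shipped")` become simple column arithmetic.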



Accumulating Snapshot Schema



Periodic Snapshot Fact Table

• Periodic snapshots are needed to see the cumulative performance of the business at regular, predictable time intervals.

• We take a picture of the activity at the end of a day, week, or month, then another picture at the end of the next period, and so on.

• The periodic snapshots are stacked consecutively. The periodic snapshot fact table often is the only place to easily retrieve a regular, predictable, trendable view of the business performance metrics.


Transaction Fact Table - I

• These fact tables represent an event that occurred at an instantaneous point in time.

• A row exists in the fact table for a given customer or product only if a transaction event occurred.

• Conversely, a given customer or product likely is linked to multiple rows in the fact table because hopefully the customer or product is involved in more than one transaction.


Accumulating Snapshot Fact Table

• Accumulating snapshots represent an indeterminate time span, covering the complete life of a milestone

• The accumulating snapshot fits naturally with short-lived processes that have a definite beginning and end. Long-lived processes, such as bank accounts, are better modeled with periodic snapshot fact tables


Accumulating Snapshot Rules

Rule 1:
Accumulating snapshots typically have multiple dates in the fact
table representing the major milestones of the process. However,
just because a fact table has several dates doesn’t dictate that it
is an accumulating snapshot.

Rule 2:
The primary differentiator of an accumulating snapshot is that
we typically revisit the fact rows as activity takes place.



When to use an accumulating snapshot fact table? Example
• Consider the order management process as a pipeline: Customers
place an order, order goes into backlog until it is released to manufacturing.
The manufactured products are placed in finished goods inventory, then
shipped to the customers and invoiced.

• So far we’ve considered each of these activities as a separate fact table. This allows us to isolate our analysis to the performance of a single business process.
• Sometimes we need to understand product velocity, or how quickly products move through the order fulfillment pipeline.


Order Fulfillment Pipeline



Order Fulfillment Accumulating Snapshot Fact Table



Comparison of Fact Table Types
