DWM Module 1
TE – D – DWM
Prof. Smruti Vyavahare
Assistant Professor
Dept. of Computer Engineering,
SIES Graduate School of Technology
Contents
Course Structure & Syllabus
Study Material
Course Objectives & Course Outcomes
Feature and Real-Time Application Example
• Database designed for analytical tasks: Amazon uses its data warehouse to analyze customer buying patterns to recommend products.
• Data from multiple applications: Flipkart pulls data from orders, customer reviews, logistics, and support systems into one warehouse.
• Easy to use and supports long interactive sessions: Marketing managers at Swiggy explore customer order trends during campaigns to measure success.
1. Running of simple queries and reports against current and historical data:
Users should be able to easily retrieve both recent and old data to generate insights or reports.
Example: Sales trends over the last 5 years.
2. Makes the enterprise’s current and historical information easily available for strategic decision-making:
Stores both recent and past data to help leaders make informed choices.
Example: Nestlé uses historical sales data to decide which products to promote during festive seasons.
3. Ability to query, step back, analyze, and then continue the process to any desired length:
The system supports interactive exploration of data — users can drill down, go back, and
explore further.
Useful for deep data analysis.
4. Ability to spot historical trends and apply them in future interactive processes:
The system identifies patterns over time (e.g., seasonal trends) and uses them in future
planning or decision models.
Definition of Data Warehouse
"A Data Warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in
support of management’s decisions."
— Bill Inmon (Father of Data Warehousing)
1. Subject-Oriented Data
Data is organized around key business subjects or areas such as sales, finance, customer, and product; a data
warehouse focuses on analyzing specific topics for decision-making.
Example: All data related to customers — their purchases, feedback, history — is grouped for analysis.
2. Integrated Data
Data from multiple sources (like ERP, CRM, spreadsheets) is cleaned, transformed, and stored in a
unified format.
Example: Customer IDs from different departments are standardized into one format.
3. Time-Variant Data
Data is stored with a time dimension (daily, monthly, yearly) to allow analysis over a long period.
It enables trend analysis, forecasting, and comparisons.
Example: Sales reports over the past 5 years can be compared to identify seasonal trends.
4. Non-Volatile Data
Once data enters the warehouse, it is not frequently updated or deleted.
Example: A sales transaction from 2022 remains unchanged and available for future analysis.
5. Data Granularity
It refers to the level of detail of the data — from highly detailed (transaction-level) to summarized
(monthly or yearly reports).
Data warehouses usually store data at multiple levels of granularity.
Example: You can view both individual customer purchases and total monthly revenue.
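The two granularity levels above can be sketched in Python: detailed transaction-level rows are kept, and a coarser monthly summary is derived from them. The sample data is hypothetical.

```python
from collections import defaultdict

# Transaction-level (fine-grained) sales records: hypothetical sample data.
transactions = [
    {"customer": "C1", "month": "2024-01", "amount": 120.0},
    {"customer": "C2", "month": "2024-01", "amount": 80.0},
    {"customer": "C1", "month": "2024-02", "amount": 200.0},
]

# Coarser granularity: total revenue per month, derived from the detail rows.
monthly_revenue = defaultdict(float)
for t in transactions:
    monthly_revenue[t["month"]] += t["amount"]

print(monthly_revenue["2024-01"])  # 200.0 (120.0 + 80.0)
print(monthly_revenue["2024-02"])  # 200.0
```

Because the warehouse keeps the detailed rows, any coarser level (monthly, yearly) can always be recomputed; the reverse is not true.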
1. Data Sources
Data sources include:
Operational Systems: Databases used in day-to-day business operations like ERP, CRM, billing systems, etc.
Flat Files: Data stored in spreadsheets, CSV files, logs, etc.
These systems are typically OLTP (Online Transaction Processing) systems.
2. Staging Area
This is where ETL (Extract, Transform, Load) happens:
Extract data from sources.
Transform data (cleaning, formatting, merging, etc.).
Load the processed data into the data warehouse.
This stage ensures data quality and consistency before it's stored permanently.
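The Extract, Transform, and Load steps above can be sketched in Python. The CSV contents, table name, and column names are illustrative, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read rows from a flat file exported by an operational system
# (hypothetical sample data, deliberately messy).
raw = "cust_id,name,amount\n 101 ,Alice,250\n102, bob ,100\n101,Alice,50\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, normalize case, convert types.
clean = [
    {"cust_id": int(r["cust_id"]),
     "name": r["name"].strip().title(),
     "amount": float(r["amount"])}
    for r in rows
]

# Load: insert the cleaned rows into a warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (cust_id INT, name TEXT, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (:cust_id, :name, :amount)", clean)
total = db.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
print(total)  # 400.0
```

In a real pipeline each step would be far more elaborate (schedules, error handling, refresh logic), but the shape is the same: extract raw rows, transform them into a consistent format, then load them.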
3. Warehouse
The central component where data is stored and managed. It includes:
Metadata: Data about the data (e.g., source, time stamp, format).
Summary Data: Aggregated or pre-calculated data (e.g., monthly sales totals).
Raw Data: Detailed data from the source systems (e.g., individual transactions).
This enables both high-level analysis and deep dives into raw data.
4. Users
These are the end-users who consume the data using different tools for various purposes:
Analytics: Advanced analysis, dashboards, data visualization.
Reporting: Standardized reports for business operations.
Mining: Data mining techniques to discover patterns, trends, and insights.
These outputs support decision-making at tactical and strategic levels.
1. Bottom Tier: Data Source Layer
This is where the data originates from.
• Operational Databases: These are the existing databases used for day-to-day business operations
(e.g., ERP, CRM).
• External Sources: These can be flat files, cloud data, or third-party data providers.
• ETL Process (Extract, Clean, Transform, Load, Refresh):
Extract: Pulls raw data from operational and external sources.
Clean: Removes errors and inconsistencies.
Transform: Converts data into a usable format.
Load: Transfers the transformed data into the data warehouse.
Refresh: Updates the data periodically.
A Data Mart is essentially a smaller, more focused version of a data warehouse. It’s tailored to meet the
needs of a specific group within an organization.
Key Characteristics:
Subset of a Data Warehouse: It contains a portion of the data warehouse, usually relevant to a specific
business area.
Department-Specific: Designed for use by a particular department like Marketing, Sales, HR, or
Finance.
Department-Controlled: Typically managed and maintained by the department that uses it.
Fewer Data Sources: Pulls data from a limited number of sources, unlike a full data warehouse which
integrates data from across the organization.
Smaller and More Agile: Because of its limited scope, it’s easier to manage and adapt to changes.
Data warehousing design strategies or Approaches for building data warehouse
Top-Down Approach
The Top-Down Approach, introduced by Bill Inmon, is a method for designing data warehouses that starts
by building a centralized, company-wide data warehouse. It ensures data consistency and provides a strong
foundation for decision-making.
Working of Top-Down Approach
Central Data Warehouse: The process begins with creating a comprehensive data warehouse where data
from various sources is collected, integrated, and stored. This involves the ETL (Extract, Transform, Load)
process to clean and transform the data.
Specialized Data Marts: Once the central warehouse is established, smaller, department-specific data marts
(e.g., for finance or marketing) are built. These data marts pull information from the main data warehouse,
ensuring consistency across departments.
Bottom-Up Approach
The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and incremental path to
designing data warehouses. Instead of starting with a central data warehouse, it begins by building small,
department-specific data marts that cater to the immediate needs of individual teams, such as sales or finance.
These data marts are later integrated to form a larger, unified data warehouse.
• Department-Specific Data Marts: The process starts with creating data marts for individual departments
or specific business functions. These data marts are designed to meet immediate data analysis and reporting
needs, allowing departments to gain quick insights.
• Integration into a Data Warehouse: Over time, these data marts are connected and consolidated to create
a unified data warehouse. The integration ensures consistency and provides a comprehensive view of the
organization’s data.
Difference between Top-Down Approach & Bottom-Up Approach
Metadata
Metadata is "data about the data". In the context of a data warehouse, it provides descriptive information
about the warehouse's data contents, structure, source, and processes.
1. Metadata means "data about the data":
Metadata gives context to raw data. It explains things like:
• What each column means (e.g., “DOB” = Date of Birth)
• Where the data comes from
• What format it is in (e.g., date, number, text)
• How often it is updated
3. Directory Function
• Metadata helps navigate, access, and manage the contents of a data warehouse, especially useful
when data volume is large and complex.
4. Architectural Role:
• Metadata is a core architectural component of a data warehouse—it supports data integration, access,
governance, and quality control.
Types of Metadata
1. Operational Metadata
It contains information about the operational data sources, the systems that provide data to the warehouse.
The examples are:
• File names and formats (CSV, Excel, etc.)
• Data refresh schedules
• Load times and history logs
Helps in tracking and managing the source data and its movement into the warehouse.
2. Extraction and Transformation Metadata
It includes:
• Data extraction from source systems: how, when, and how often data is pulled.
• Business rules applied during extraction.
• Data transformations performed before storing in the warehouse.
The examples are:
• Extraction methods (API, ETL tools)
• Frequency (daily, hourly, real-time)
• Rules like converting formats, removing duplicates, or merging tables
It helps in documenting the entire ETL (Extract, Transform, Load) process for auditing,
debugging, and ensuring data accuracy.
3. End-user Metadata
This is the user-friendly metadata that acts like a "navigational map" of the data warehouse for business users.
The examples are:
• Data dictionary (table/column names with business descriptions)
• Report labels and definitions
• Tooltips in dashboards
It enables end users (like analysts or managers) to easily find, understand, and use data for decision-
making.
ER-Modelling VS Dimensional-Modelling
1. Usage
• E-R Modelling:
1. Supports OLTP – used for daily transactions like banking, bookings, etc.
2. It focuses on data integrity and efficiency in updating data.
• Dimensional Modelling:
1. Supports OLAP – used for data analysis, business intelligence, reporting.
2. Optimized for data retrieval and queries.
2. Structure
• E-R Modelling:
1. Entities (tables) are connected using multiple joins (often normalized).
• Dimensional Modelling:
Still uses joins, but mostly between fact and dimension tables in a simpler way
(star/snowflake schema).
3. Data Organization
• E-R Modelling:
Normalized: Data is split into smaller tables to reduce redundancy.
• Dimensional Modelling:
Denormalized: Data is often repeated to simplify queries and improve performance.
4. Redundancy
• E-R Modelling:
Tries to remove redundancy to ensure consistency.
• Dimensional Modelling:
Allows redundancy for faster querying and simplicity.
5. Modification Impact
• E-R Modelling:
If you change the model, it often affects the applications using it.
• Dimensional Modelling:
More flexible and extensible: it can handle new data elements easily without breaking
existing designs.
6. Adaptability
• E-R Modelling:
Can be fragile if query patterns change—structure is complex.
• Dimensional Modelling:
More robust—design is stable even if reporting/query needs evolve.
Dimensional-Modelling
Dimensional modelling is a data structure technique used to design databases that support efficient
querying and reporting. It’s especially used in data warehouses.
4. Developed by Ralph Kimball
• Developed by Ralph Kimball and it consists of 'fact' and 'dimension' tables.
• Fact Tables: Store measurable data (e.g., sales, profit, quantity).
• Dimension Tables: Store context about facts (e.g., customer, time, product, region).
Example dimension tables: Product_Dim, Date_Dim, Customer_Dim.
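The fact/dimension split can be sketched in Python with plain dicts; the table and column names here are illustrative, not from the slides. Fact rows hold measures plus foreign keys, and joining to the dimensions resolves those keys into readable context.

```python
# Dimension tables: descriptive context, keyed by surrogate keys
# (hypothetical sample data).
product_dim = {1: {"name": "Laptop", "category": "Electronics"},
               2: {"name": "Desk", "category": "Furniture"}}
date_dim = {10: {"month": "Jan", "year": 2024}}

# Fact table: measurable values plus foreign keys into the dimensions.
sales_fact = [
    {"product_id": 1, "date_id": 10, "amount": 900.0},
    {"product_id": 2, "date_id": 10, "amount": 150.0},
]

# Resolving a fact row's context is a join from the fact table
# to its dimension tables.
row = sales_fact[0]
report = (product_dim[row["product_id"]]["name"],
          date_dim[row["date_id"]]["month"],
          row["amount"])
print(report)  # ('Laptop', 'Jan', 900.0)
```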
Information Package Diagram (IPD)
1. An Information Package Diagram (IPD) is a logical design tool used in dimensional modeling
to:
• Identify business dimensions (descriptive data)
• Identify facts or metrics (numerical data)
It is the foundation step in designing a data warehouse schema like star schema or snowflake
schema.
2. It helps structure the dimensions (e.g., time, customer, product) and the metrics/facts (e.g., sales,
quantity, profit) that are to be analyzed.
5. Every IPD includes measurable metrics (facts) alongside their corresponding dimensions.
1. Star Schema
• Central fact table connects directly to several dimension tables.
• Dimension tables are denormalized (flat, no sub-tables).
• Simple and fast to query.
Mainly used for: Quick report generation.
Examples:
• Fact table: Sales_Fact (sale_id, date_id, product_id, amount)
• Dimensions: Product_Dim, Customer_Dim, Date_Dim
2. Snowflake Schema
• Extension of the star schema.
• Dimension tables are normalized (split into sub-tables).
• Uses less storage but is slower to query.
Mainly used for: Saving space and reducing redundancy.
Examples:
• Product_Dim might link to Category_Dim and Supplier_Dim.
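The difference between the two schemas can be sketched in Python with plain dicts; all table, column, and key names are illustrative. The star schema stores the category name inside Product_Dim (denormalized), while the snowflake schema splits it into a normalized Category_Dim sub-table, costing one extra lookup per query.

```python
# Star schema: flat, denormalized dimension (hypothetical sample data).
product_dim_star = {
    1: {"name": "Laptop", "category_name": "Electronics"},
}

# Snowflake schema: Product_Dim references a normalized Category_Dim.
category_dim = {100: {"category_name": "Electronics"}}
product_dim_snow = {1: {"name": "Laptop", "category_id": 100}}

# Star resolves the category in one lookup; snowflake needs a second
# lookup (an extra join in SQL terms).
star_cat = product_dim_star[1]["category_name"]
snow_cat = category_dim[product_dim_snow[1]["category_id"]]["category_name"]
print(star_cat)  # Electronics
print(snow_cat)  # Electronics
```

Both forms answer the same question; the trade-off is query simplicity and speed (star) versus storage and redundancy (snowflake).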
Major Steps in ETL Process
OLAP Operations
OLAP stands for Online Analytical Processing. It is a software technology that allows users to
analyze information from multiple database systems at the same time. It is based on a multidimensional data
model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes, and these cubes are known as hypercubes.
1. Drill down: In the drill-down operation, less detailed data is converted into highly detailed data. It can
be done by stepping down a concept hierarchy for a dimension or by introducing a new dimension.
In the cube given in the overview section, the drill-down operation is performed by moving down in the
concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can
be done by climbing up a concept hierarchy for a dimension or by dimension reduction.
In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept
hierarchy of the Location dimension (City -> Country).
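Both operations above can be sketched in Python as aggregations over cube cells; the city, month, and sales figures below are hypothetical sample data.

```python
from collections import defaultdict

# Hypothetical cube cells: sales by (city, country, month, quarter).
cells = [
    {"city": "Delhi", "country": "India", "month": "Jan", "quarter": "Q1", "sales": 10},
    {"city": "Delhi", "country": "India", "month": "Feb", "quarter": "Q1", "sales": 20},
    {"city": "Mumbai", "country": "India", "month": "Jan", "quarter": "Q1", "sales": 5},
]

# Roll-up: climb the Location hierarchy (City -> Country) by aggregating.
by_country = defaultdict(int)
for c in cells:
    by_country[c["country"]] += c["sales"]
print(by_country["India"])  # 35

# Drill-down: step down the Time hierarchy (Quarter -> Month) for more detail.
by_month = defaultdict(int)
for c in cells:
    by_month[(c["quarter"], c["month"])] += c["sales"]
print(by_month[("Q1", "Jan")])  # 15
```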
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the cube
given in the overview section, a sub-cube is selected by applying criteria on the following dimension:
Location = "Delhi" or "Kolkata"
4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube. In
the cube given in the overview section, slice is performed on the dimension Time = "Q1".
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the
representation. Performing a pivot on the sub-cube obtained after the slice operation gives a new view of it.
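The slice, dice, and pivot operations above can be sketched in Python over a small list of cube cells; the cities, items, and sales figures are hypothetical sample data.

```python
# Hypothetical cube cells: sales by location, time, and item.
cells = [
    {"city": "Delhi", "quarter": "Q1", "item": "Mobile", "sales": 10},
    {"city": "Delhi", "quarter": "Q2", "item": "Mobile", "sales": 12},
    {"city": "Kolkata", "quarter": "Q1", "item": "Modem", "sales": 7},
    {"city": "Mumbai", "quarter": "Q1", "item": "Mobile", "sales": 9},
]

# Slice: fix one dimension (Time = "Q1") to get a lower-dimensional sub-cube.
slice_q1 = [c for c in cells if c["quarter"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = [c for c in slice_q1 if c["city"] in ("Delhi", "Kolkata")]

# Pivot: rotate the view, e.g. from city-major to item-major grouping.
pivot = {}
for c in dice:
    pivot.setdefault(c["item"], {})[c["city"]] = c["sales"]

print(len(slice_q1))             # 3
print(pivot["Mobile"]["Delhi"])  # 10
```

Note that slice and dice select data, while pivot only rearranges how the same data is presented.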
Thank You!
([email protected])