DWM Module 1

Module 1: Data Warehousing Fundamentals

TE – D – DWM
Prof. Smruti Vyavahare
Assistant Professor
Dept. of Computer Engineering,
SIES Graduate School of Technology

Contents

• Course Structure & Syllabus
• Study Material
• Course Objectives & Course Outcomes
• Introduction to Data Warehouse
• Data warehouse architecture
• Data warehouse versus Data Marts
• E-R Modeling versus Dimensional Modeling
• Information Package Diagram
• Data Warehouse Schemas; Star Schema, Snowflake Schema, Factless Fact Table, Fact Constellation Schema.
• Update to the dimension tables.
• Major steps in ETL process
• OLTP versus OLAP
• OLAP operations: Slice, Dice, Rollup, Drilldown and Pivot.

Course Structure & Syllabus

Study Material

Course Objectives & Course Outcomes

Feature | Real-Time Application Example
Database designed for analytical tasks | Amazon uses its data warehouse to analyze customer buying patterns to recommend products.
Data from multiple applications | Flipkart pulls data from orders, customer reviews, logistics, and support systems into one warehouse.
Easy to use and supports long interactive sessions | Marketing managers at Swiggy explore customer order trends during campaigns to measure success.
Read-intensive data usage | Banks use data warehouses to analyze customer transactions for fraud detection.
Direct user interaction without IT help | HR managers at Infosys use dashboards to track employee performance without needing IT support.
Content updated periodically and stable | Big Bazaar loads sales data every night to analyze store performance.
Includes current and historical data | Netflix stores historical viewing data to understand viewer behavior and improve recommendations.
Ability to run queries and get results online | Zomato can quickly analyze how many orders were placed by city and time slot during peak hours.
Users can initiate reports | Sales teams at Tata Motors can generate region-wise sales reports without waiting for IT teams.
1. Provides an integrated and total view of the enterprise
Combines data from different departments (sales, HR, finance, etc.) into one place.
Example: A company like Tata Consultancy Services (TCS) uses a data warehouse to see
complete data across global offices in one unified dashboard.

2. Makes the enterprise’s current and historical information easily available for strategic
decision-making
Stores both recent and past data to help leaders make informed choices.
Example: Nestlé uses historical sales data to decide which products to promote during festive
seasons.

3. Makes decision-support transactions possible without hindering operational systems


Allows deep data analysis without slowing down daily operations (like order processing).
Example: While Flipkart runs millions of daily transactions, its warehouse separately analyzes
data for demand forecasting.

4. Renders the organization’s information consistent


Data from different systems is cleaned and standardized for accuracy and consistency.
Example: If one HR system records gender as “M” and another as “Male,” both are stored as a single
standard value in the warehouse to maintain consistency.

5. Presents a flexible and interactive source of strategic information


Users can query the system, generate reports, or perform data visualization easily.
Example: Marketing analysts at Zomato use dashboards to track user behavior and food trends
interactively.
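The consistency point (item 4) can be sketched in a few lines of Python. The mapping table and the function name `standardize_gender` are hypothetical and only illustrate the idea; an actual warehouse would normally apply such rules inside its ETL layer.

```python
# Hypothetical mapping from source-system codes to one warehouse value.
GENDER_MAP = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}


def standardize_gender(raw):
    """Return the warehouse-standard value, or "Unknown" for unmapped codes."""
    key = raw.strip().lower()
    return GENDER_MAP.get(key, "Unknown")


# Values as they might arrive from different source systems (made-up data).
source_values = ["M", "Male", "male ", "F", "x"]
cleaned = [standardize_gender(value) for value in source_values]
print(cleaned)
```

Unmapped codes are kept as "Unknown" rather than dropped, so that data-quality problems stay visible in the warehouse.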
A Data Warehouse is a central repository where data from multiple sources is stored, integrated, and
analyzed to support decision-making.

Processing Requirements in the New System

1. Running of simple queries and reports against current and historical data:
Users should be able to easily retrieve both recent and old data to generate insights or reports.
Example: Sales trends over the last 5 years.

2. Ability to perform “what if” analysis in many different ways:


Users can simulate different scenarios.
Example: “What if we increase product prices by 10%?”

3. Ability to query, step back, analyze, and then continue the process to any desired length:
The system supports interactive exploration of data — users can drill down, go back, and
explore further.
Useful for deep data analysis.

4. Ability to spot historical trends and apply them in future interactive processes:
The system identifies patterns over time (e.g., seasonal trends) and uses them in future
planning or decision models.
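A "what if" question such as the 10% price increase in item 2 can be sketched as a small scenario function. The figures and the function name `revenue` are made up for illustration, and the volume change is simply an assumed input, not something the model predicts.

```python
# Hypothetical yearly figures: (product, unit_price, units_sold).
baseline = [("Tea", 100.0, 1000), ("Coffee", 200.0, 500)]


def revenue(rows, price_change=0.0, volume_change=0.0):
    """Total revenue after applying fractional changes to price and volume."""
    return sum(
        price * (1 + price_change) * units * (1 + volume_change)
        for _product, price, units in rows
    )


base = revenue(baseline)
# Scenario A: prices rise by 10% and volumes stay the same.
scenario_a = revenue(baseline, price_change=0.10)
# Scenario B: prices rise by 10% and volumes are assumed to fall by 5%.
scenario_b = revenue(baseline, price_change=0.10, volume_change=-0.05)
print(base, round(scenario_a, 2), round(scenario_b, 2))
```

Each scenario re-reads the same baseline data, which is why such analysis fits a read-intensive warehouse rather than an operational system.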

“This new environment is known as the Data Warehouse environment”

Definition of Data Warehouse
"A Data Warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in
support of management’s decisions."
— Bill Inmon (Father of Data Warehousing)

Features of Data Warehouse

1. Subject-Oriented Data
Data is organized around key business subjects such as sales, finance, customer, and product. A data
warehouse focuses on analyzing specific subjects for decision-making.
Example: All data related to customers — their purchases, feedback, history — is grouped for analysis.

2. Integrated Data
Data from multiple sources (like ERP, CRM, spreadsheets) is cleaned, transformed, and stored in a
unified format.
Example: Customer IDs from different departments are standardized into one format.

3. Time-Variant Data
Data is stored with a time dimension (daily, monthly, yearly) to allow analysis over a long period.
It enables trend analysis, forecasting, and comparisons.
Example: Sales reports over the past 5 years can be compared to identify seasonal trends.

4. Non-Volatile Data
Once data enters the warehouse, it is not frequently updated or deleted.
Example: A sales transaction from 2022 remains unchanged and available for future analysis.

5. Data Granularity
It refers to the level of detail of the data — from highly detailed (transaction-level) to summarized
(monthly or yearly reports).
Data warehouses usually store data at multiple levels of granularity.
Example: You can view both individual customer purchases and total monthly revenue.
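The two levels of granularity in the example above can be sketched as follows. The transaction rows and the function name `monthly_revenue` are hypothetical; the point is only that the summarized level is derived from the detailed level.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (date "YYYY-MM-DD", customer, amount).
transactions = [
    ("2022-01-05", "C1", 120.0),
    ("2022-01-20", "C2", 80.0),
    ("2022-02-03", "C1", 50.0),
]


def monthly_revenue(rows):
    """Summarize detailed rows into one total per month (coarser granularity)."""
    totals = defaultdict(float)
    for date, _customer, amount in rows:
        totals[date[:7]] += amount  # "YYYY-MM" is the month key
    return dict(totals)


summary = monthly_revenue(transactions)
print(summary)
```

Keeping the detailed rows allows any coarser summary to be rebuilt later, whereas a warehouse that stores only the summary cannot recover individual purchases.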

1. Data Sources
These include:
Operational Systems: Databases used in day-to-day business operations like ERP, CRM, billing systems, etc.
Flat Files: Data stored in spreadsheets, CSV files, logs, etc.
These systems are typically OLTP (Online Transaction Processing) systems.

2. Staging Area
This is where ETL (Extract, Transform, Load) happens:
Extract data from sources.
Transform data (cleaning, formatting, merging, etc.).
Load the processed data into the data warehouse.
This stage ensures data quality and consistency before it's stored permanently.

3. Warehouse
The central component where data is stored and managed. It includes:
Metadata: Data about the data (e.g., source, time stamp, format).
Summary Data: Aggregated or pre-calculated data (e.g., monthly sales totals).
Raw Data: Detailed data from the source systems (e.g., individual transactions).
This enables both high-level analysis and deep dives into raw data.

4. Users
These are the end-users who consume the data using different tools for various purposes:
Analytics: Advanced analysis, dashboards, data visualization.
Reporting: Standardized reports for business operations.
Mining: Data mining techniques to discover patterns, trends, and insights.
These outputs support decision-making at tactical and strategic levels.
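The staging-area steps (extract, transform, load) can be sketched with Python's standard `csv` and `sqlite3` modules. The CSV content, the `orders` table, and the cleaning rules are assumptions made for this illustration, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; a real pipeline would read files or databases.
SOURCE_CSV = """order_id,customer,amount
1, alice ,100.50
2,BOB,
3,alice,20
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and standardize names."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        clean.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return clean


def load(conn, rows):
    """Load: write the processed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()
print(loaded)
```

Because the name is standardized before loading, the two spellings of the same customer end up in one group.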
1. Bottom Tier: Data Source Layer
This is where the data originates from.
• Operational Databases: These are the existing databases used for day-to-day business operations
(e.g., ERP, CRM).
• External Sources: These can be flat files, cloud data, or third-party data providers.
• ETL Process (Extract, Clean, Transform, Load, Refresh):
Extract: Pulls raw data from operational and external sources.
Clean: Removes errors and inconsistencies.
Transform: Converts data into a usable format.
Load: Transfers the transformed data into the data warehouse.
Refresh: Updates the data periodically.

2. Middle Tier: OLAP Server


This tier acts as an interface between stored data and end-user tools.
• Data Warehouse: The central repository where cleaned and transformed data is stored.
• Data Marts: Subsets of the data warehouse, created for specific departments (e.g., sales, finance)
• OLAP Servers (Online Analytical Processing): Used for fast querying, multidimensional analysis, and
summarizing large datasets.
• Metadata Repository: Stores information about data (such as source, structure, transformation rules)
and is used for monitoring and administration.

3. Top Tier: Front-End Tools


This tier allows users to access and analyze the data.
• Query/Report Tools: For generating reports and querying the database.
• Analysis Tools: Visual dashboards and data visualization tools.
• Data Mining Tools: For discovering hidden patterns and insights from large datasets.
Data Mart

A Data Mart is essentially a smaller, more focused version of a data warehouse. It’s tailored to meet the
needs of a specific group within an organization.
Key Characteristics:
Subset of a Data Warehouse: It contains a portion of the data warehouse, usually relevant to a specific
business area.
Department-Specific: Designed for use by a particular department like Marketing, Sales, HR, or
Finance.
Department-Controlled: Typically managed and maintained by the department that uses it.
Fewer Data Sources: Pulls data from a limited number of sources, unlike a full data warehouse which
integrates data from across the organization.
Smaller and More Agile: Because of its limited scope, it’s easier to manage and adapt to changes.
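The "subset of a data warehouse" idea can be sketched with an SQLite view. The table `warehouse_facts`, the view `sales_mart`, and the data are hypothetical. A view is only one possible way to expose a mart; many data marts are built as separate physical databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL);
    INSERT INTO warehouse_facts VALUES
      ('Sales', 'revenue', 500.0),
      ('Sales', 'orders', 40.0),
      ('HR', 'headcount', 12.0),
      ('Finance', 'expenses', 300.0);
    -- The mart exposes only the rows that one department needs.
    CREATE VIEW sales_mart AS
      SELECT metric, value FROM warehouse_facts WHERE department = 'Sales';
    """
)
mart_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
print(mart_rows)
```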

Data warehousing design strategies or Approaches for building data warehouse

Top-Down Approach
The Top-Down Approach, introduced by Bill Inmon, is a method for designing data warehouses that starts
by building a centralized, company-wide data warehouse. It ensures data consistency and provides a strong
foundation for decision-making.
Working of Top-Down Approach
Central Data Warehouse: The process begins with creating a comprehensive data warehouse where data
from various sources is collected, integrated, and stored. This involves the ETL (Extract, Transform, Load)
process to clean and transform the data.
Specialized Data Marts: Once the central warehouse is established, smaller, department-specific data marts
(e.g., for finance or marketing) are built. These data marts pull information from the main data warehouse,
ensuring consistency across departments.

Bottom-Up Approach
The Bottom-Up Approach, popularized by Ralph Kimball, takes a more flexible and incremental path to
designing data warehouses. Instead of starting with a central data warehouse, it begins by building small,
department-specific data marts that cater to the immediate needs of individual teams, such as sales or finance.
These data marts are later integrated to form a larger, unified data warehouse.

Working of Bottom-Up Approach

• Department-Specific Data Marts: The process starts with creating data marts for individual departments
or specific business functions. These data marts are designed to meet immediate data analysis and reporting
needs, allowing departments to gain quick insights.

• Integration into a Data Warehouse: Over time, these data marts are connected and consolidated to create
a unified data warehouse. The integration ensures consistency and provides a comprehensive view of the
organization’s data.

Difference between Top-Down Approach & Bottom-Up Approach

Feature | Top-Down (Inmon) | Bottom-Up (Kimball)
First Step | Build EDW (enterprise data warehouse) | Build Data Marts
Time to Implement | Longer | Faster
Integration Level | High | Medium
Initial Cost | High | Low
Suitable For | Strategic enterprise vision | Departmental focus

Metadata
Metadata is "data about the data". In the context of a data warehouse, it provides descriptive information
about the warehouse's data contents, structure, source, and processes.
1. Metadata means "data about the data":
Metadata gives context to raw data. It explains things like:
• What each column means (e.g., “DOB” = Date of Birth)
• Where the data comes from
• What format it is in (e.g., date, number, text)
• How often it is updated

2. Yellow Pages Analogy:


• The Yellow Pages tell you which business is where and what it does. Similarly, metadata tells you what data
is where in the data warehouse and what it means.

3. Directory Function
• Metadata helps navigate, access, and manage the contents of a data warehouse, especially useful
when data volume is large and complex.

4. Architectural Role:
• Metadata is a core architectural component of a data warehouse—it supports data integration, access,
governance, and quality control.

Types of Metadata
1. Operational Metadata
It contains information about the operational data sources, the systems that provide data to the warehouse.
The examples are:
• File names and formats (CSV, Excel, etc.)
• Data refresh schedules
• Load times and history logs
Helps in tracking and managing the source data and its movement into the warehouse.
2. Extraction and Transformation Metadata
It includes:
• Data extraction from source systems: how, when, and how often data is pulled.
• Business rules applied during extraction.
• Data transformations performed before storing in the warehouse.
The examples are:
• Extraction methods (API, ETL tools)
• Frequency (daily, hourly, real-time)
• Rules like converting formats, removing duplicates, or merging tables
It helps in documenting the entire ETL (Extract, Transform, Load) process for auditing,
debugging, and ensuring data accuracy.
3. End-user Metadata
This is the user-friendly metadata that acts like a "navigational map" of the data warehouse for business users.
The examples are:
• Data dictionary (table/column names with business descriptions)
• Report labels and definitions
• Tooltips in dashboards
It enables end users (like analysts or managers) to easily find, understand, and use data for decision-
making.
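A minimal sketch of a data dictionary that combines the three metadata types for one column is shown below. The class `ColumnMetadata`, its fields, and the sample entries are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    name: str
    description: str  # end-user metadata: business meaning
    source: str       # operational metadata: where the data comes from
    data_format: str  # format of the stored value
    refresh: str      # extraction metadata: how often it is pulled


# Hypothetical entries; "DOB" follows the example used earlier in the notes.
DATA_DICTIONARY = {
    "DOB": ColumnMetadata("DOB", "Date of Birth", "hr_system.csv", "date", "daily"),
    "CUST_ID": ColumnMetadata("CUST_ID", "Customer identifier", "crm_db", "text", "hourly"),
}


def describe(column):
    """Return a one-line, user-friendly description of a warehouse column."""
    meta = DATA_DICTIONARY.get(column)
    if meta is None:
        return f"{column}: no metadata recorded"
    return (
        f"{meta.name}: {meta.description} "
        f"({meta.data_format}, from {meta.source}, refreshed {meta.refresh})"
    )


print(describe("DOB"))
```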
ER-Modelling VS Dimensional-Modelling

1. Usage
• E-R Modelling:
1. Supports OLTP – used for daily transactions like banking, bookings, etc.
2. It focuses on data integrity and efficiency in updating data.
• Dimensional Modelling:
1. Supports OLAP – used for data analysis, business intelligence, reporting.
2. Optimized for data retrieval and queries.

2. Structure
• E-R Modelling:
Entities (tables) are connected using multiple joins (often normalized).
• Dimensional Modelling:
Still uses joins, but mostly between fact and dimension tables in a simpler way
(star/snowflake schema).

3. Data Organization
• E-R Modelling:
Normalized: Data is split into smaller tables to reduce redundancy.
• Dimensional Modelling:
Denormalized: Data is often repeated to simplify queries and improve performance.

4. Redundancy
• E-R Modelling:
Tries to remove redundancy to ensure consistency.
• Dimensional Modelling:
Allows redundancy for faster querying and simplicity.

5. Modification Impact
• E-R Modelling:
If the model changes, the change often affects the applications that use it.
• Dimensional Modelling:
More flexible and extensible: it can handle new data elements easily without breaking
existing designs.

6. Adaptability
• E-R Modelling:
Can be fragile if query patterns change—structure is complex.
• Dimensional Modelling:
More robust—design is stable even if reporting/query needs evolve.

7. Complexity for Users


• E-R Modelling:
Hard to understand for business users; complex joins and structure.
• Dimensional Modelling:
Easy and user-friendly, especially for analysts and decision-makers.

Dimensional-Modelling
Dimensional modelling is a data structure technique used to design databases that support efficient
querying and reporting. It’s especially used in data warehouses.

1. Optimized for Data Warehousing


• Data structure technique optimized for data storage in a Data Warehouse.
• It is designed to store historical and analytical data that supports decision-making, not real-time
updates.

2. Faster Data Retrieval


• Optimizes the database for faster retrieval of data.
• It organizes data in a denormalized structure (fewer joins), making it faster to query, especially for
large datasets.
• Beneficial for dashboards and data analysis.

3. Designed for Analytical Use


• A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric information.
• It’s suitable for:
1. Reading and aggregating data (e.g., total sales by month)
2. Summarizing across dimensions (e.g., sales by region, by product)
3. Analyzing trends (e.g., monthly performance)

4. Developed by Ralph Kimball
• Developed by Ralph Kimball and it consists of 'fact' and 'dimension' tables.
• Fact Tables: Store measurable data (e.g., sales, profit, quantity).
• Dimension Tables: Store context about facts (e.g., customer, time, product, region).

Example (Retail Data Warehouse):

Fact Table: Sales_Fact -> sale_id, product_id, date_id, total_amount
Dimension Tables: Product_Dim, Date_Dim, Customer_Dim
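The retail example can be tried out with Python's built-in `sqlite3` module. The fact-table columns follow the example; the dimension attributes `product_name` and `month` and all data values are assumptions added for illustration, and Customer_Dim is left out because the example fact table lists no customer key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Product_Dim (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE Date_Dim (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE Sales_Fact (
      sale_id INTEGER PRIMARY KEY,
      product_id INTEGER REFERENCES Product_Dim(product_id),
      date_id INTEGER REFERENCES Date_Dim(date_id),
      total_amount REAL
    );
    INSERT INTO Product_Dim VALUES (1, 'Pen'), (2, 'Book');
    INSERT INTO Date_Dim VALUES (10, '2024-01'), (11, '2024-02');
    INSERT INTO Sales_Fact VALUES
      (1, 1, 10, 20.0), (2, 2, 10, 150.0), (3, 1, 11, 30.0);
    """
)
# Total sales by month: one join from the fact table to one dimension table.
sales_by_month = conn.execute(
    """
    SELECT d.month, SUM(f.total_amount)
    FROM Sales_Fact AS f
    JOIN Date_Dim AS d ON f.date_id = d.date_id
    GROUP BY d.month
    ORDER BY d.month
    """
).fetchall()
print(sales_by_month)
```

The measure (`total_amount`) lives in the fact table, while the grouping attribute (`month`) comes from a dimension table.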

Information Package Diagram (IPD)

1. An Information Package Diagram (IPD) is a logical design tool used in dimensional modeling
to:
• Identify business dimensions (descriptive data)
• Identify facts or metrics (numerical data)
It is the foundation step in designing a data warehouse schema like star schema or snowflake
schema.

2. It helps structure the dimensions (e.g., time, customer, product) and the metrics/facts (e.g., sales,
quantity, profit) that are to be analyzed.

3. Before designing database tables, use an IPD to plan out the dimensions and facts.

4. IPD defines the structure of a dimensional model.

5. Every IPD includes measurable metrics (facts) alongside their corresponding dimensions.
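As a rough sketch, an information package can be written down as a plain data structure before any tables are designed. The subject, the hierarchy levels, and the helper `schema_outline` are hypothetical examples built from the dimensions and facts named in point 2.

```python
# Hypothetical information package for the subject "Sales".
information_package = {
    "subject": "Sales",
    "dimensions": {
        "Time": ["Year", "Quarter", "Month"],
        "Product": ["Category", "Product Name"],
        "Customer": ["Region", "Customer Name"],
    },
    "facts": ["sales", "quantity", "profit"],
}


def schema_outline(package):
    """Name the fact and dimension tables a star schema would need."""
    dimension_tables = [f"{name}_Dim" for name in package["dimensions"]]
    fact_table = f"{package['subject']}_Fact"
    return fact_table, dimension_tables


fact_table, dimension_tables = schema_outline(information_package)
print(fact_table, dimension_tables)
```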

1. Star Schema
• Central fact table connects directly to several dimension tables.
• Dimension tables are denormalized (flat, no sub-tables).
• Simple and fast to query.
Mainly used for: Quick report generation.
Example:
• Fact table: Sales_Fact (sale_id, date_id, product_id, amount)
• Dimensions: Product_Dim, Customer_Dim, Date_Dim

2. Snowflake Schema
• Extension of the star schema.
• Dimension tables are normalized (split into sub-tables).
• Uses less storage but is slower to query.
Mainly used for: Saving space and reducing redundancy.
Example:
• Product_Dim might link to Category_Dim and Supplier_Dim.

3. Fact Constellation Schema (Galaxy Schema)


• Contains multiple fact tables that share common dimension tables.
• Useful for complex systems with more than one business process.
Mainly used for: Large enterprises with varied operations.
Example:
• Fact tables: Sales_Fact, Returns_Fact
• Shared dimensions: Customer_Dim, Date_Dim
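The snowflake example, where Product_Dim links to Category_Dim, can be sketched in SQLite as below. Column names such as `category_name` and all data values are assumptions. The query needs two joins to reach the category, which illustrates the extra query cost noted for the snowflake schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Category_Dim (category_id INTEGER PRIMARY KEY, category_name TEXT);
    -- Snowflake: the product dimension is normalized and points to a sub-table.
    CREATE TABLE Product_Dim (
      product_id INTEGER PRIMARY KEY,
      product_name TEXT,
      category_id INTEGER REFERENCES Category_Dim(category_id)
    );
    CREATE TABLE Sales_Fact (
      sale_id INTEGER PRIMARY KEY, date_id INTEGER, product_id INTEGER, amount REAL
    );
    INSERT INTO Category_Dim VALUES (1, 'Stationery'), (2, 'Electronics');
    INSERT INTO Product_Dim VALUES (1, 'Pen', 1), (2, 'Notebook', 1), (3, 'Phone', 2);
    INSERT INTO Sales_Fact VALUES (1, 10, 1, 5.0), (2, 10, 2, 15.0), (3, 11, 3, 300.0);
    """
)
sales_by_category = conn.execute(
    """
    SELECT c.category_name, SUM(f.amount)
    FROM Sales_Fact AS f
    JOIN Product_Dim AS p ON f.product_id = p.product_id
    JOIN Category_Dim AS c ON p.category_id = c.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
    """
).fetchall()
print(sales_by_category)
```

In a star schema the category name would be stored directly in Product_Dim, so the same report would need only one join.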

Major Steps in ETL Process

OLAP Operations
OLAP stands for Online Analytical Processing. It is a software technology that allows users to
analyze information from multiple database systems at the same time. It is based on a multidimensional data
model and allows the user to query multidimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP
databases are divided into one or more cubes, and these cubes are known as hypercubes.

1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed data. It can
be done by:

• Moving down in the concept hierarchy

• Adding a new dimension

In the cube given in the overview section, the drill-down operation is performed by moving down in the
concept hierarchy of the Time dimension (Quarter -> Month).
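Drill-down along the Time hierarchy can be sketched with a small cube held in a dictionary. The cell values, the `MONTH_TO_QUARTER` mapping, and the function names are hypothetical; an OLAP server would do this on its own storage.

```python
from collections import defaultdict

# Concept hierarchy for Time: Month -> Quarter (only a few months shown).
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Hypothetical base cells: (city, month, item) -> units sold.
cells = {
    ("Delhi", "Jan", "Car"): 10,
    ("Delhi", "Feb", "Car"): 20,
    ("Delhi", "Apr", "Car"): 5,
}


def by_quarter(data):
    """The less detailed view: months aggregated into quarters."""
    out = defaultdict(int)
    for (city, month, item), units in data.items():
        out[(city, MONTH_TO_QUARTER[month], item)] += units
    return dict(out)


def drill_down(data, city, quarter, item):
    """Return the month-level cells behind one quarter-level cell."""
    return {
        month: units
        for (c, month, i), units in data.items()
        if c == city and i == item and MONTH_TO_QUARTER[month] == quarter
    }


quarter_view = by_quarter(cells)
q1_detail = drill_down(cells, "Delhi", "Q1", "Car")
print(quarter_view, q1_detail)
```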

2. Roll up: It is the opposite of the drill-down operation. It performs aggregation on the OLAP cube. It can
be done by:

• Climbing up in the concept hierarchy

• Reducing the dimensions

In the cube given in the overview section, the roll-up operation is performed by climbing up in the concept
hierarchy of the Location dimension (City -> Country).
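Both ways of rolling up can be sketched on a dictionary-based cube. The cell values, the `CITY_TO_COUNTRY` mapping, and the function names are made up for illustration.

```python
from collections import defaultdict

# Concept hierarchy for Location: City -> Country.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}

# Hypothetical cells: (city, quarter, item) -> units sold.
cells = {
    ("Delhi", "Q1", "Car"): 30,
    ("Kolkata", "Q1", "Car"): 12,
    ("Toronto", "Q1", "Car"): 7,
    ("Delhi", "Q2", "Car"): 5,
}


def roll_up_location(data):
    """Climb the hierarchy: aggregate cities into countries."""
    out = defaultdict(int)
    for (city, quarter, item), units in data.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(out)


def roll_up_drop(data, axis):
    """Reduce the dimensions: aggregate away the axis at position `axis`."""
    out = defaultdict(int)
    for key, units in data.items():
        out[key[:axis] + key[axis + 1:]] += units
    return dict(out)


country_view = roll_up_location(cells)
without_time = roll_up_drop(cells, 1)  # drop the Time dimension
print(country_view, without_time)
```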

3. Dice: It selects a sub-cube from the OLAP cube by applying selection criteria on two or more dimensions.
In the cube given in the overview section, a sub-cube is selected using the following criteria:

• Location = "Delhi" or "Kolkata"

• Time = "Q1" or "Q2"

• Item = "Car" or "Bus"
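The dice criteria above can be sketched as a filter over a dictionary-based cube. The cell values and the function name `dice` are hypothetical.

```python
# Hypothetical cells: (location, quarter, item) -> units sold.
cells = {
    ("Delhi", "Q1", "Car"): 30,
    ("Delhi", "Q3", "Car"): 9,
    ("Kolkata", "Q2", "Bus"): 4,
    ("Mumbai", "Q1", "Car"): 11,
    ("Delhi", "Q1", "Phone"): 50,
}


def dice(data, locations, quarters, items):
    """Keep only the cells that satisfy the criteria on all three dimensions."""
    return {
        key: units
        for key, units in data.items()
        if key[0] in locations and key[1] in quarters and key[2] in items
    }


sub_cube = dice(cells, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"})
print(sub_cube)
```

The result still has all three dimensions; each one is just restricted to fewer values.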

4. Slice: It fixes a single value on one dimension of the OLAP cube, which results in a new sub-cube. In
the cube given in the overview section, Slice is performed on the dimension Time = "Q1".

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new view of the
representation. In the sub-cube obtained after the slice operation, performing pivot operation gives a new
view of it.
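Slice and pivot can be sketched together on a dictionary-based cube. The cell values and the function names are hypothetical. Slicing on Time = "Q1" leaves a two-dimensional view, and pivoting swaps its two axes.

```python
# Hypothetical cells: (city, quarter, item) -> units sold.
cells = {
    ("Delhi", "Q1", "Car"): 30,
    ("Delhi", "Q1", "Bus"): 8,
    ("Kolkata", "Q1", "Car"): 12,
    ("Delhi", "Q2", "Car"): 5,
}


def slice_time(data, quarter):
    """Fix Time to one value; the result keeps only Location and Item."""
    return {
        (city, item): units
        for (city, q, item), units in data.items()
        if q == quarter
    }


def pivot(table):
    """Rotate a 2-D view: rows become columns and columns become rows."""
    return {(b, a): units for (a, b), units in table.items()}


q1_slice = slice_time(cells, "Q1")
rotated = pivot(q1_slice)
print(q1_slice, rotated)
```

Pivoting changes only how the view is laid out; the values themselves are unchanged.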

Thank You!
([email protected])
