Chapter 3

Uploaded by

Syed Kazmi


Data Warehousing (CS614)

Lecture No. 03
Lecture 3: Introduction to Data Warehousing, Part III

Learning Goals
It is a blend of many technologies, the basic concept being:

• Take all data from different operational systems.

• If necessary, add relevant data from industry.
• Transform all data and bring into a uniform format.
• Integrate all data as a single entity.
• Store data in a format supporting easy access for decision support.
• Create performance enhancing indices.
• Implement performance enhancement joins.
• Run ad-hoc queries with low selectivity.

3.1 What is a Data Warehouse?

A Data Warehouse is not something shrink-wrapped, i.e. you cannot take a set of CDs,
install them on a box and soon have a Data Warehouse up and running. A Data Warehouse
evolves over time; you don't buy it. Basically it is about collecting data from
different heterogeneous sources. Heterogeneous means that not only the operating systems
differ, but also the underlying file formats and databases, and even within the same
database system there can be different representations of the same entity. This could be
anything from different column names to different data types for the same entity.

Companies collect and record their own operational data, but at the same time they also
use reference data obtained from external sources, such as codes, prices etc. This is not
the only external data: customer lists with contact information are also obtained from
external sources. Therefore, all this external data is also added to the data
warehouse.

As mentioned earlier, even the data collected within the company is not standard, for a
host of different reasons. For example, the different operational systems used in the
company were developed by different vendors over a period of time, so there is little or
no consistency in data representation. When that is the (quite normal) state of affairs
within a company, there is even less control over the quality of data obtained from
external sources. Hence all the data has to be transformed into a uniform format,
standardized and integrated before it can go into the data warehouse.
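
The transformation step described above can be illustrated with a minimal sketch. The two source systems, their column names and their date formats are hypothetical, chosen only to show the same entity arriving in different representations and leaving in one uniform format.

```python
from datetime import date, datetime

# Hypothetical rows from two source systems describing the same kind of entity.
# Column names, data types and date formats differ between the systems.
billing_rows = [{"CUST_ID": "17", "Gender": "M", "DOB": "1980-03-05"}]
crm_rows = [{"customer_no": 42, "sex": "female", "birth_date": "05/03/1975"}]

# One standard code per gender, whatever the source system used.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_billing(row):
    """Map a billing-system row to the uniform warehouse format."""
    return {
        "customer_id": int(row["CUST_ID"]),
        "gender": GENDER_CODES[row["Gender"].lower()],
        "birth_date": date.fromisoformat(row["DOB"]),
    }


def from_crm(row):
    """Map a CRM-system row to the uniform warehouse format."""
    return {
        "customer_id": int(row["customer_no"]),
        "gender": GENDER_CODES[row["sex"].lower()],
        "birth_date": datetime.strptime(row["birth_date"], "%d/%m/%Y").date(),
    }


def integrate(billing, crm):
    """Transform both sources and return them as a single, uniform list."""
    unified = [from_billing(r) for r in billing] + [from_crm(r) for r in crm]
    return sorted(unified, key=lambda r: r["customer_id"])


warehouse_rows = integrate(billing_rows, crm_rows)
```

After integration every row has the same keys and data types, so downstream queries no longer need to know which system a record came from.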

In a decision support environment, the end user, i.e. the decision maker, is interested
in the big picture. Typical DSS queries do not involve using a primary key or
asking questions about a particular customer or account. DSS queries deal with a
number of variables spanning a number of tables (i.e. join operations) and
look at lots of historical data. As a result, a large number of records are processed
and retrieved. For such cases, specialized database architectures/topologies
are required, such as the star schema. We will cover this in detail in the relevant lecture.
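
To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customer` and `sale` tables and their contents are made up for illustration; they are not a schema from this course.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                   sale_year INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Lahore'), (2, 'Rawalpindi'), (3, 'Lahore');
INSERT INTO sale VALUES
  (1, 1, 2004, 100.0), (2, 1, 2005, 150.0),
  (3, 2, 2004, 200.0), (4, 3, 2005, 50.0);
""")

# OLTP-style query: one particular customer, located through the primary key.
oltp_row = conn.execute(
    "SELECT city FROM customer WHERE customer_id = ?", (2,)
).fetchone()

# DSS-style query: no primary key in the condition; it joins tables and
# aggregates over all of the historical sales to show the big picture.
dss_rows = conn.execute("""
SELECT c.city, s.sale_year, SUM(s.amount)
FROM sale AS s
JOIN customer AS c ON c.customer_id = s.customer_id
GROUP BY c.city, s.sale_year
ORDER BY c.city, s.sale_year
""").fetchall()
```

The first query touches a single row; the second has to read every row of both tables, which is why DSS workloads call for different physical designs.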

© Copyright Virtual University of Pakistan 21


Recall that a B-Tree is a data structure that supports dictionary operations. In the context
of a database, a B-Tree is used as an index that provides access to records without
actually scanning the entire table. However, for very large databases the corresponding
B-Trees become very large. Typically each node of a B-Tree is stored in one block (page),
and traversing a B-Tree involves O(log n) page faults. This is undesirable, because
the height of the B-Tree, and with it the cost of every lookup, grows with the size of
the database. Therefore, different indexing techniques are required in the DWH or DSS
environment, such as bitmapped indexes or clustered indexes. In some cases the designers
want indexing techniques so powerful that queries are satisfied from the indexes alone,
without going to the fact tables.
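
As a rough illustration of the bitmapped index idea, the sketch below keeps one bitmap per distinct column value, stored as a Python integer. The column data and helper names are hypothetical, and real database engines add compression and other refinements that are not shown here.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_from_bitmap(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical low-cardinality columns of the same five-row table.
gender = ["M", "F", "F", "M", "F"]
region = ["N", "N", "S", "S", "N"]

gender_idx = build_bitmap_index(gender)
region_idx = build_bitmap_index(region)

# "gender = 'F' AND region = 'N'" is answered with a single bitwise AND
# of two bitmaps, without scanning the table rows themselves.
matches = gender_idx["F"] & region_idx["N"]
matching_rows = rows_from_bitmap(matches)
match_count = bin(matches).count("1")
```

Note that `match_count` is obtained purely from the index, which is the situation the paragraph describes: the query is satisfied without going to the table.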

In typical OLTP environments, the tables are relatively small and the number of rows of
interest is also very small, as the queries are confined to specifics. Hence traditional
joins, such as the nested-loop join of quadratic time complexity, do not hurt performance,
i.e. the time to get the answer. However, for very large databases, where table sizes are
in the millions and the rows of interest are in the hundreds of thousands, the nested-loop
join becomes a bottleneck and is hardly used. Therefore, other forms of join are
required, such as the sort-merge join or the hash-based join.
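
The difference between the two join strategies can be sketched in a few lines. This is a simplified in-memory illustration with hypothetical rows; a real database engine also has to deal with data that does not fit in memory.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Compare every left row with every right row: len(left) * len(right) steps."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = defaultdict(list)
    for r in right:                      # build phase: one pass over `right`
        buckets[r[key]].append(r)
    # probe phase: one pass over `left`, each lookup is roughly constant time
    return [(l, r) for l in left for r in buckets.get(l[key], [])]


# Hypothetical rows; `cust` is the join column.
accounts = [{"cust": 1, "acct": "A"}, {"cust": 2, "acct": "B"}, {"cust": 1, "acct": "C"}]
customers = [{"cust": 1, "name": "Ali"}, {"cust": 2, "name": "Sara"}, {"cust": 3, "name": "Omar"}]

slow = nested_loop_join(accounts, customers, "cust")
fast = hash_join(accounts, customers, "cust")
```

Both functions return the same pairs, but the hash join does work proportional to the sum of the two table sizes rather than their product, which is what matters when the tables hold millions of rows.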

Run Ad-Hoc queries with low Selectivity

We have already explained what is meant by ad-hoc queries. A little bit about selectivity
is in order. Selectivity describes how sharply a condition narrows down the rows of a
table. For a single column it can be measured as the fraction of rows that match one value
of that column, which for evenly distributed values is one divided by the number of unique
values in the column. For example, the ratio for the gender column is 50%, assuming the
gender of all customers is known. If there are N records in a table, then the ratio for
the primary key column is 1/N. Note that a query retrieves records based on a combination
of different columns, hence the choice of columns determines the selectivity of the query,
measured as the number of records retrieved divided by the total number of records present
in the table. The smaller this ratio, the higher the selectivity.
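
The ratio of records retrieved to records present can be computed directly. The following is a minimal sketch over a made-up customer table of N = 1000 rows; the function name `retrieved_fraction` is only illustrative.

```python
def retrieved_fraction(rows, predicate):
    """Records retrieved divided by total records present in the table."""
    matched = sum(1 for row in rows if predicate(row))
    return matched / len(rows)


# Hypothetical table: ids 1..1000, gender split evenly.
customers = [{"id": i, "gender": "F" if i % 2 else "M"} for i in range(1, 1001)]

# Primary-key condition: one row out of N, a tiny ratio (high selectivity).
by_pk = retrieved_fraction(customers, lambda r: r["id"] == 7)

# Gender condition: half of the table, a large ratio (low selectivity).
by_gender = retrieved_fraction(customers, lambda r: r["gender"] == "F")
```

Here `by_pk` equals 1/N and `by_gender` equals one half, matching the two examples in the text.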

In an OLTP (On-Line Transaction Processing) or MIS (Management Information System)
environment, the queries are typically Primary Key (PK) based, hence the number of
records retrieved is not more than a hundred rows. The ratio of records retrieved to
records present is therefore tiny, and the selectivity is very high. In a Data Warehouse
(DWH) environment, we are interested in the big picture and have queries that are not
very specific in nature and hardly use a PK. As a result hundreds of thousands of records
(or rows) are retrieved from very large tables. Thus the ratio of records retrieved to
the total number of records present is high, and hence the selectivity is low.

3.2 How is it different?

Decision making is Ad-Hoc


Figure-3.1: Running in circles

Consider a decision maker or a business user who wants some questions answered.
He/she sets up a meeting with the IT people and explains the requirements. The
IT people go through the cycle of system analysis and design, which takes anywhere from
a couple of weeks to a couple of months, and they finally design and develop the system.
Happy and proud of their achievement, the IT people go to the business user with the
reporting or MIS system. After a learning curve the business user spends some
time with the brand new system, and may get some answers to the required questions. But
then those answers result in more questions. The business user has no choice but to meet
the IT people with a new set of requirements. The business user is frustrated that his/her
questions are not getting answered, while the IT people are frustrated that the business
user always changes the requirements. Both are correct in their frustration.

Different patterns of hardware utilization

Figure-3.2: Different patterns of CPU usage (Operational vs. DWH, on a utilization scale up to 100%)


Although there are peaks and valleys in operational processing, ultimately there is a
relatively static pattern of utilization. The pattern of hardware utilization in the data
warehouse environment is essentially different: it is a binary pattern, i.e. the hardware
is either utilized fully or not at all. Calculating a mean utilization for a DWH is
therefore not a meaningful activity, and trying to mix the two environments is a recipe
for disaster. You can optimize the machine for the performance of one type of application,
not for both.

Bus vs. Train Analogy

Consider the analogy of a bus and a train. You can find dozens of buses operating
between Lahore and Rawalpindi, leaving almost every 30 minutes. As a consequence, there
are buses moving between Lahore and Rawalpindi almost continuously throughout
the day. But how many times does a dedicated train move between the two cities? Only twice
a day, and it carries a bulk of passengers and cargo. It is a binary operation, i.e.
either traveling or not. The train can NOT be optimized for travel every 30 minutes: it
would never fill to capacity and would run at a loss. A bus cannot be optimized to travel
only twice a day: it would stand idle and passengers would take vans etc. Bottom line: the
two modes of transportation cannot be interchanged.

Combines historical & Operational Data

Don’t do data entry into a DWH; OLTP or ERP systems are the source systems.

OLTP systems don’t keep history; you can’t get a balance statement more than a year
old.

A DWH keeps historical data, even of bygone customers. Why?

In the context of a bank, we want to know why the customer left.

What were the events that led to his/her leaving? Why?

Customer retention.

3.3 Why keep historical data?

The data warehouse is different because, again, it is not a database into which you do
data entry. You are actually collecting data from the operational systems and loading it
into the DWH. So the transactional processing systems, like the ERP system, are the source
systems for the data warehouse. You feed the data into the data warehouse, and the data
warehouse typically collects data over a period of time. If you look at your transactional
processing (OLTP) systems, normally such systems don’t keep very much history.
Normally, if a customer leaves or an account expires, the OLTP system purges the data
associated with that customer, and all of the customer's transactions, from the database
after some amount of time. So normally, once a year, most businesses will purge the
database of all the old customers and old transactions. In the data warehouse we save the
historical data, because although you don’t need historical data to do business today, you
do need it to understand patterns of business usage in order to do business tomorrow, such
as why a customer left.


How much History?

Depends on:
Industry.

Cost of storing historical data.

Economic value of historical data.

Industries and history

Telecom: calls far outnumber bank transactions - 18 months of historical data.

Retailers: interested in analyzing yearly seasonal patterns - 65 weeks of historical data.

Insurance companies: want to do actuarial analysis, using the historical data in order to
predict risk - 7 years of historical data.

Hence, a DWH is NOT a complete repository of data

How far back do you look historically? It really depends a lot on the industry. Typically
it’s an economic equation: how far back depends on how much it costs to store that extra
year’s worth of data and what its economic value is. So, for example, financial
organizations typically store at least 3 years of data going backward. Again, that is
typical; it’s not a hard and fast rule.

In a telecommunications company, for example, typically around 18 months of data is
stored. Because there are a lot more call detail records than there are deposits and
withdrawals from a bank account, the storage period is shorter, as one typically cannot
afford to store as much of it. Another important point is that the further back in history
you store the data, the less value it normally has. Most of the time, most of the access
to the data is within the last 3 to 6 months. That’s the most predictive data.

In the retail business, retailers typically store at least 65 weeks of data. Why do they
do that? Because they want to be able to compare this season’s selling history with last
season’s selling history. For example, if it is the Eid buying season, I want to look at
the buying this Eid and compare it with that of a year ago. This means I need 65 weeks in
order to go a year back, which is actually more than a year: it’s a year (52 weeks) and a
season, so 13 weeks are additionally added to do the analysis. So it really depends a lot
on the industry. But normally you expect at least 13 months.
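
The 65-week figure is simply 52 weeks plus a 13-week season. A small sketch of how such a retention window might be applied follows; the function names and the example dates are hypothetical.

```python
from datetime import date, timedelta

WEEKS_PER_YEAR = 52
WEEKS_PER_SEASON = 13   # one quarter of a year


def retention_cutoff(today, weeks=WEEKS_PER_YEAR + WEEKS_PER_SEASON):
    """Oldest date that still falls inside the retention window."""
    return today - timedelta(weeks=weeks)


def in_window(record_date, today):
    """True when a record is recent enough to be kept in the warehouse."""
    return record_date >= retention_cutoff(today)


today = date(2024, 6, 30)            # hypothetical "current" date
cutoff = retention_cutoff(today)
```

With this window, a sale from the same season one year earlier is still inside the retained data, so the two seasons can be compared.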

Economic value of data vs. storage cost: is a Data Warehouse a complete repository of
data?

This raises an interesting question: do we decide about the storage of historical data
using only time, or do we consider space also, or both?

Usually (but not always) periodic or batch updates rather than real-time.


The boundary is blurring for active data warehousing.

For an ATM, if the update is not in real-time, there is a lot of real trouble.

A DWH is for strategic decision making based on historical data. It won’t hurt if the
transactions of the last hour/day are absent.

Rate of update depends on:

Volume of data,
Nature of business,
Cost of keeping historical data,
Benefit of keeping historical data.

It’s also true that in the traditional data warehouse, data acquisition is done on a
periodic or batch basis, rather than in real time. Think again about an ATM system: when I
insert my ATM card and make a withdrawal, the transactions happen in real time, because if
they didn’t the bank could get into trouble. Someone could withdraw more money than they
had in their account! Obviously that is not acceptable. So in an online transaction
processing (OLTP) system, the records are updated, deleted and inserted in real-time as
the business events take place, as the data entry takes place, or as the point-of-sale
system at a supermarket captures the sales data and inserts it into the database.

In a traditional data warehouse that is not true, because the traditional data warehouse is
for strategic decision-making, not for running the day-to-day business. For strategic
decision making, I don’t need to know the last hour’s worth of ATM deposits, because
strategic decisions take the long-term perspective. For this reason, and for efficiency
reasons, the data warehouse is normally updated on some predefined schedule. Maybe it’s
once a month, maybe it’s once a week, maybe it’s even once every night. It depends on the
volume of data you are working with, how important the timeliness of the data is, and so on.

3.4 Deviation from the PURIST approach

Let me first explain what/who a purist is. A purist is an idealist or traditionalist who
wants everything to be done by the book or in the old arcane ways (that only he/she knows);
in short, he/she is not a pragmatist or realist. Because the purist wants everything to be
perfect, he/she has good excuses for doing nothing, as it is not a perfect world. When
automobiles were first invented, it was the purists who said that automobiles would fail,
as they scare the horses. As Iqbal very rightly said, “Aina no sa durna Tarzay Kuhan Pay
Arna…” (to fear the new ways and cling stubbornly to the old ones).

As data warehouses become mainstream and the corresponding technology also becomes
mainstream, some traditional attributes are being relaxed in order to meet the
increasing demands of the users. We have already discussed and reconciled with the fact
that a data warehouse is NOT the complete repository of data. The other most noticeable
deviations are from time variance and non-volatility.

Deviation from Time Variance and Nonvolatility

As the size of the data warehouse grows over time (e.g., into terabytes), reloading and
appending data can become a very tedious and time-consuming task. Furthermore, as
business users get the “hang of it” they start demanding that more up-to-date data be
available in the data warehouse. Therefore, instead of sticking to the traditional data
warehouse characteristics of keeping the data nonvolatile and time variant, new data is
being added to the data warehouse on a daily basis, if not on a real-time basis, and at
the same time historical data is removed to make room for the “fresh” data. Thus, new
approaches are being devised to tackle this task. Two possible methods are as follows:

• Perform hourly/daily batch updates from shadow tables or log files.
Transformation rules are executed during the loading process. Thus, when the data
reaches the target data warehouse database, it is already transformed, cleansed and
summarized.

• Perform real-time updates from shadow tables or log files. Again, transformation rules
are executed during the loading process. Instead of batch updates, this takes place on a
per-transaction basis for transactions that meet certain business selection criteria.
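
The two methods can be contrasted with a small sketch. The log entry layout, the transformation rule and the selection threshold below are all hypothetical; they only illustrate that the transformation runs during loading in both cases, while the trigger for loading differs.

```python
def transform(entry):
    """Hypothetical transformation rule applied while loading:
    uniform account codes and amounts stored as whole paisa."""
    return {
        "account": entry["acct"].strip().upper(),
        "amount_paisa": round(entry["amt"] * 100),
    }


def meets_criteria(entry):
    """Hypothetical business selection criterion for real-time loading."""
    return entry["amt"] >= 1000


def batch_update(warehouse, log):
    """Method 1: on a schedule (hourly/daily), drain the whole log at once."""
    warehouse.extend(transform(e) for e in log)
    log.clear()


def realtime_update(warehouse, entry):
    """Method 2: called per transaction; loads only entries that qualify."""
    if meets_criteria(entry):
        warehouse.append(transform(entry))


# Batch: entries accumulate in the log and are loaded together.
log = [{"acct": " pk-001 ", "amt": 250.0}, {"acct": "pk-002", "amt": 1500.0}]
batch_warehouse = []
batch_update(batch_warehouse, log)

# Real-time: each entry is offered as it happens; only one qualifies.
realtime_warehouse = []
for entry in [{"acct": " pk-001 ", "amt": 250.0}, {"acct": "pk-002", "amt": 1500.0}]:
    realtime_update(realtime_warehouse, entry)
```

In both variants the rows that reach the warehouse are already transformed; the batch variant trades freshness for fewer, larger loads.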
