1.8.4 Junk Dimension

A junk dimension is a grouping of typically low-cardinality attributes, so that you can remove them from the main dimension.
You can use junk dimensions to implement a rapidly changing dimension, where the junk dimension stores the attributes that change rapidly. For example, attributes such as flags, weights, BMI (body mass index), etc.

1.8.5 Degenerated Dimension

ETL Tools : Most commonly used ETL tools are Sybase, Oracle Warehouse Builder, Clover ETL and MarkLogic.

(1A16)Fig. 1.9.1 : ETL Process

Let us understand each step of the ETL process in depth:
Only these changes in data will be extracted and then loaded. Identifying the last changed data is itself a complex process and involves a lot of logic.
You can detect the changes in the source system from a specific column in the source system that holds the last-changed timestamp. You can also create a change table in the source system, which keeps track of the changes in the source data.

(B) Physical Extraction

Physical extraction has two methods: Online and Offline extraction.

(i) Online Extraction
In this process, the extraction process directly connects to the source system and extracts the source data.

(ii) Offline Extraction
The data is not extracted directly from the source system but is staged explicitly outside the original source system.
The following common structures are used in offline extraction:
1. Flat file : Generic format
2. Dump file : Database-specific file

2. Transformation

The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format.
It may involve the following processes/tasks:
1. Filtering – loading only certain attributes into the data warehouse.
2. Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States and America into USA, etc.
3. Joining – joining multiple attributes into one.
4. Splitting – splitting a single attribute into multiple attributes.
5. Sorting – sorting tuples on the basis of some attribute (generally the key attribute).

Data Transformation Techniques

Data Smoothing : This method is used for removing noise from a dataset. Noise refers to distorted and meaningless data within a dataset. Smoothing uses algorithms to highlight the special features in the data. After removing noise, the process can detect any small changes in the data and thereby detect special patterns.

Data Aggregation : Aggregation is the process of collecting data from a variety of sources and storing it in a single format. Here, data is collected, stored, analyzed and presented in a report or summary format. It helps in gathering more information about a particular data cluster. The method helps in collecting vast amounts of data.

Discretization : This is a process of converting continuous data into a set of data intervals. Continuous attribute values are substituted by small interval labels. This makes the data easier to study and analyze.

Generalization : In this process, low-level data attributes are transformed into high-level data attributes using concept hierarchies. For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher conceptual level, i.e. a categorical value (young, old).

Attribute construction : In the attribute construction method, new attributes are created from an existing set of attributes. For example, in a dataset of employee information, the attributes can be employee name, employee ID and address. These attributes can be used to construct another dataset that contains information about only the employees who joined in the year 2019. This method of construction makes mining more efficient and helps in creating new datasets quickly.

Normalization : Also called data pre-processing, this is one of the crucial techniques for data transformation in data mining. Here, the data is transformed so that it falls within a given range. When attributes are on different ranges or scales, data modelling and mining can be difficult. Normalization helps in applying data mining algorithms and extracting data faster. (A small code sketch of several of these tasks and techniques is given at the end of this ETL discussion.)

3. Loading

The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse.
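To make the transformation step concrete, here is a minimal Python/pandas sketch of several of the tasks and techniques described above: cleaning NULLs, mapping country spellings to a single value, filtering attributes, discretizing/generalizing age, min-max normalization and sorting on a key attribute. All column names, sample values and bin boundaries are illustrative assumptions, not taken from the text.

```python
import pandas as pd

# Hypothetical extracted records; column names and values are illustrative only.
extracted = pd.DataFrame({
    "emp_id":  [101, 102, 103, 104],
    "name":    ["Asha", "Ravi", None, "Meera"],
    "country": ["U.S.A", "United States", "America", "India"],
    "age":     [23, 41, 35, 58],
    "salary":  [30000, 52000, 45000, 61000],
})

transformed = extracted.copy()

# Cleaning: fill NULLs with a default and map country spellings to one value.
transformed["name"] = transformed["name"].fillna("UNKNOWN")
transformed["country"] = transformed["country"].replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})

# Filtering: keep only the attributes the warehouse needs.
transformed = transformed[["emp_id", "name", "country", "age", "salary"]]

# Discretization / generalization: replace raw ages with interval labels.
transformed["age_group"] = pd.cut(
    transformed["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "old"])

# Normalization (min-max): bring salary into the range [0, 1].
s = transformed["salary"]
transformed["salary_norm"] = (s - s.min()) / (s.max() - s.min())

# Sorting on the key attribute before loading.
transformed = transformed.sort_values("emp_id")
print(transformed)
```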
Sometimes the data is updated by loading into the data warehouse very frequently, and sometimes it is done after longer but regular intervals. The rate and period of loading solely depend on the requirements and vary from system to system.

Loading can be carried out in two ways:

(A) Refresh : Data Warehouse data is completely rewritten. This means that the older file is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.

(B) Update : Only those changes applied to source information are added to the Data Warehouse. An update is typically carried out without deleting or modifying pre-existing data. This method is used in combination with incremental extraction to update data warehouses regularly.

1.10 OLTP VS OLAP

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments, and an effectiveness measured by the number of transactions per second. In an OLTP database there is detailed and current data, and the schema used to store transactional databases is the entity model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. In an OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually star schema). For example, a bank storing years of historical records of check deposits could use an OLAP database to provide reporting to business users. OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports becomes easy. At the core of the OLAP concept is an OLAP Cube. The OLAP cube is a data structure optimized for very quick data analysis. The OLAP Cube consists of numeric facts called measures which are categorized by dimensions. The OLAP Cube is also called the hypercube.

(1A17)Fig. 1.10.1 : OLTP Vs OLAP Operations

The following table summarizes the major differences between OLTP and OLAP system design.

Table 1.10.1 : OLTP Vs OLAP

| Parameters | OLTP | OLAP |
|---|---|---|
| Process | It is an online transactional system. It manages database modification. | OLAP is an online analysis and data retrieving process. |
| Characteristic | It is characterized by large numbers of short online transactions. | It is characterized by a large volume of data. |
| Functionality | OLTP is an online database modifying system. | OLAP is an online database query management system. |
| Method | OLTP uses traditional DBMS. | OLAP uses the data warehouse. |
| Query | Insert, Update, and Delete information from the database. | Mostly Select operations. |
| Table | Tables in OLTP database are normalized. | Tables in OLAP database are not normalized. |
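To illustrate the OLAP cube idea described just before the table (numeric measures categorized by dimensions), here is a minimal Python sketch; the dimension members and sales figures are made-up assumptions, not from the text.

```python
# A toy OLAP cube: the measure "sales" indexed by three dimensions.
cube = {
    # (product, region, quarter) -> sales
    ("Laptop", "East", "Q1"): 120,
    ("Laptop", "West", "Q1"):  80,
    ("Phone",  "East", "Q1"): 200,
    ("Phone",  "East", "Q2"): 150,
}

# Aggregating the measure over one dimension (here: region) gives the kind
# of summarized view an OLAP query returns.
by_product_quarter = {}
for (product, region, quarter), sales in cube.items():
    key = (product, quarter)
    by_product_quarter[key] = by_product_quarter.get(key, 0) + sales

print(by_product_quarter)
# {('Laptop', 'Q1'): 200, ('Phone', 'Q1'): 200, ('Phone', 'Q2'): 150}
```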
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple
levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data
from different perspectives.
For example, if we have attributes such as day, temperature and humidity, we can group their values into subsets and name these subsets, thus obtaining a set of hierarchies as shown in Fig. 1.11.1.
(1A18)Fig. 1.11.1
OLAP provides a user-friendly environment for interactive data analysis. A number of OLAP data cube operations
exist to materialize different views of data, allowing interactive querying and analysis of the data.
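To make the idea of a concept hierarchy concrete, a minimal Python sketch is given below; it maps a raw date up the time hierarchy day < month < quarter < year that is also used later in the drill-down example. The sample date and the level labels are illustrative assumptions.

```python
from datetime import date

# A concept hierarchy for the Time dimension (day < month < quarter < year).
def time_hierarchy(d: date) -> dict:
    return {
        "day":     d.isoformat(),
        "month":   f"{d.year}-{d.month:02d}",
        "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        "year":    str(d.year),
    }

print(time_hierarchy(date(2023, 5, 14)))
# {'day': '2023-05-14', 'month': '2023-05', 'quarter': '2023-Q2', 'year': '2023'}
```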
The most popular end-user operations on dimensional data are:

1. Roll-Up

The roll-up operation (also called drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by climbing down a concept hierarchy, i.e. dimension reduction. Let us explain roll-up with an example:
Consider the following cube illustrating the temperatures of certain days recorded weekly:

| Temperature | 64 | 65 | 68 | 69 | 70 | 71 | 72 | 75 | 80 | 81 | 83 | 85 |
| Week1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Week2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 0 |

Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the above cube.
To do this, we have to group the columns and add up the values according to the concept hierarchies. This operation is known as a roll-up. By doing this, we get the following cube:

| Temperature | Cool | Mild | Hot |
| Week1 | 2 | 1 | 1 |

The concept hierarchy can be defined as hot < day < week. The roll-up operation groups the data by levels of temperature. (A small code sketch of this roll-up is given after the drill-down discussion below.)

2. Drill-Down

The drill-down operation (also called roll-down) is the reverse operation of roll-up. Drill-down is like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed either by stepping down a concept hierarchy for a dimension or by adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy which is defined as day, month, quarter, and year. Drill-down appears by descending the time hierarchy from the level of the quarter to a more detailed level of the month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a new dimension to a cube. For example, a drill-down on the central cubes of the figure can occur by introducing an additional dimension, such as a customer group.
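The temperature roll-up above can be expressed directly in code. The plain-Python sketch below uses the counts from the weekly cube, maps each raw temperature to its level in the concept hierarchy, and sums the counts; it reproduces the Week1 row of the rolled-up cube and also computes the Week2 row, which is truncated in the printed table. A drill-down would simply navigate back from these levels to the raw temperatures.

```python
# Weekly counts of days at each recorded temperature (from the cube above).
temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
week1 = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
week2 = [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0]

def level(t):
    # Concept hierarchy: map a raw temperature to its level.
    if 80 <= t <= 85:
        return "hot"
    if 70 <= t <= 75:
        return "mild"
    return "cool"  # 64-69

def roll_up(counts):
    # Group the temperature columns by level and add up the counts.
    out = {"cool": 0, "mild": 0, "hot": 0}
    for t, c in zip(temps, counts):
        out[level(t)] += c
    return out

print(roll_up(week1))  # {'cool': 2, 'mild': 1, 'hot': 1}
print(roll_up(week2))  # {'cool': 1, 'mild': 3, 'hot': 1}
```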
The following figures illustrate each of these operations on a sample cube with location, time and item dimensions:

1. Roll-up Operation
2. Drill-down Operation
3. Slice Operation
4. Dice Operation
(1A23)Fig. 1.11.6 : Dice Operation for Location and Time Dimensions
5. Pivot Operation
(1A24)Fig. 1.11.7 : Pivot Operation for Location and Item Dimensions
Ex. 1.11.1 : Consider a data warehouse for a hospital where there are three dimensions: a) Doctor b) Patient c) Time. Consider two measures: i) Count ii) Charge, where charge is the fee that the doctor charges a patient for a visit. For the above example create a cube and illustrate the following OLAP operations: 1) Roll-up 2) Drill-down 3) Slice 4) Dice 5) Pivot.
Soln. : 1. Roll-up Operation
2. Drill-down Operation
3. Slice Operation
4. Dice Operation
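Since the solution figures for Ex. 1.11.1 are not reproduced here, the following pandas sketch suggests how each of the five operations could be expressed on the hospital cube. The doctors, patients, years and charges are made-up sample data, not values from the text.

```python
import pandas as pd

# Hypothetical visit-level data for the hospital warehouse (dimensions
# Doctor, Patient, Time; measures count and charge).
visits = pd.DataFrame({
    "doctor":  ["Dr.A", "Dr.A", "Dr.B", "Dr.B", "Dr.A", "Dr.B"],
    "patient": ["P1",   "P2",   "P1",   "P3",   "P2",   "P3"],
    "year":    [2022,   2022,   2022,   2023,   2023,   2023],
    "charge":  [500,    700,    400,    650,    800,    600],
})

# Roll-up: aggregate the measures up the hierarchy (patient -> doctor, year).
rollup = visits.groupby(["doctor", "year"])["charge"].agg(["count", "sum"])

# Drill-down: the reverse, back down to per-patient detail.
drilldown = visits.groupby(["doctor", "year", "patient"])["charge"].sum()

# Slice: fix one dimension to a single value (Time = 2022).
slice_2022 = visits[visits["year"] == 2022]

# Dice: restrict several dimensions at once (doctor Dr.A, years 2022-2023).
dice = visits[(visits["doctor"] == "Dr.A") & visits["year"].isin([2022, 2023])]

# Pivot: rotate the cube so doctors become rows and years become columns.
pivot = pd.pivot_table(visits, values="charge", index="doctor",
                       columns="year", aggfunc="sum")

print(rollup, drilldown, slice_2022, dice, pivot, sep="\n\n")
```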
There are three types of OLAP servers, namely, Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP) and
Hybrid OLAP (HOLAP).
Relational On-Line Analytical Processing (ROLAP) works mainly with data that resides in a relational database, where the base data and dimension tables are stored as relational tables.
ROLAP servers are placed between the relational back-end server and the client front-end tools.
ROLAP servers use an RDBMS to store and manage warehouse data, and OLAP middleware to support missing pieces.
Example : DSS Server of MicroStrategy

Advantages of ROLAP
1. ROLAP can handle large amounts of data.
2. Can be used with data warehouse and OLTP systems.

Disadvantages of ROLAP
Multidimensional On-Line Analytical Processing (MOLAP) supports multidimensional views of data through array-based multidimensional storage engines.
With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Example : Oracle Essbase

Advantages of MOLAP

Disadvantages of MOLAP
Advantages of HOLAP
Disadvantage of HOLAP
1. HOLAP architecture is very complex because it supports both MOLAP and ROLAP servers.
The drawback is that this is less straightforward than a hypercube and can carry a steeper learning curve. Some systems use the combined approach of hypercube and multi-cubes, by separating the storage, processing, and presentation layers: data is stored as multi-cubes but presented as a hypercube.

For example, take the “Orders” business process from an online catalog company where you might have customer orders in a fact table called FactOrders with dimensions Customer, Product, and OrderDate. With possibly millions of orders in the transaction fact, it makes sense to start thinking about aggregates.

To further the above example, assume that the business is interested in a report: “Monthly orders by state and product type”. While you could generate this easily enough using the FactOrders fact table, you could likely speed up the data retrieval for the report by at least half (but likely much, much more) using an aggregate.
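A minimal pandas sketch of such an aggregate is shown below. It builds the “Monthly orders by state and product type” summary from a tiny FactOrders sample; the specific rows and the amount column are illustrative assumptions, not from the text.

```python
import pandas as pd

# Hypothetical FactOrders rows; the real fact table would have millions of them.
fact_orders = pd.DataFrame({
    "order_date":   pd.to_datetime(["2023-01-05", "2023-01-17",
                                    "2023-02-02", "2023-02-20"]),
    "state":        ["CA", "CA", "NY", "CA"],
    "product_type": ["Books", "Toys", "Books", "Books"],
    "amount":       [40.0, 25.0, 60.0, 35.0],
})

# Pre-computed aggregate for the report "Monthly orders by state and product
# type". Report queries hit this small table instead of scanning the full fact.
agg_orders = (
    fact_orders
    .assign(month=fact_orders["order_date"].dt.to_period("M"))
    .groupby(["month", "state", "product_type"], as_index=False)
    .agg(order_count=("amount", "size"), total_amount=("amount", "sum"))
)
print(agg_orders)
```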
Q. 1.2 A data warehouse is which of the following?
(a) Can be updated by end users.
(b) Contains numerous naming conventions and formats.
(c) Organized around important subject areas.
(d) Contains only current data.
Ans. : (c)

Q. 1.3 An operational system is which of the following?
(a) A system that is used to run the business in real time and is based on historical data.
(b) A system that is used to run the business in real time and is based on current data.
(c) A system that is used to support decision making and is based on current data.
(d) A system that is used to support decision making and is based on historical data.
Ans. : (b)

Q. 1.4 What is the type of relationship in a star schema?
(a) many-to-many (b) one-to-one
(c) many-to-one (d) one-to-many
Ans. : (d)

Q. 1.5 Fact tables are _______.
(a) completely denormalized (b) partially denormalized
(c) completely normalized (d) partially normalized
Ans. : (c)

Q. 1.6 Data warehouse is volatile, because obsolete data are discarded.
(a) True (b) False
Ans. : (b)

Q. 1.7 Which is NOT a basic conceptual schema in Data Modeling of Data Warehouses?
(a) Star Schema (b) Tree Schema
(c) Snowflake Schema (d) Fact Constellation Schema
Ans. : (b)

Q. 1.8 Among the following, which is not a characteristic of a Data Warehouse?
(a) Integrated (b) Volatile
(c) Time-variant (d) Subject-oriented
Ans. : (b)

Q. 1.9 What is not considered as an issue in data warehousing?
(a) Optimization (b) Data transformation
(c) Extraction (d) Intermediation
Ans. : (d)

Q. 1.10 Which is NOT considered as a standard querying technique?
(a) Roll-up (b) Drill-down
(c) DSS (d) Pivot
Ans. : (c)

Q. 1.11 Among the following, which is not a type of business data?
(a) Real time data (b) Application data
(c) Reconciled data (d) Derived data
Ans. : (b)

Q. 1.12 A snowflake schema has which of the following types of tables?
(a) Fact (b) Dimension
(c) Helper (d) All of the above
Ans. : (d)

Q. 1.13 The extract process is which of the following?
(a) Capturing all of the data contained in various operational systems
(b) Capturing a subset of the data contained in various operational systems
(c) Capturing all of the data contained in various decision support systems
(d) Capturing a subset of the data contained in various decision support systems
Ans. : (b)

Q. 1.14 Which of the following is not true regarding characteristics of warehoused data?
(a) Changed data will be added as new data
(b) Data warehouse can contain historical data
(c) Obsolete data are discarded
(d) Users can change data once entered into the data warehouse
Ans. : (d)

Q. 1.15 Which of the following statements is incorrect?
(a) ROLAPs have large data volumes
(b) Data form of ROLAP is a large multidimensional array made of cubes
(c) MOLAP uses sparse matrix technology to manage data sparsity
(d) Access for MOLAP is faster than ROLAP
Ans. : (b)

Q. 1.16 Which of the following standard query techniques increases the granularity?
(a) roll-up (b) drill-down
(c) slicing (d) dicing
Ans. : (b)

Q. 1.17 The full form of OLAP is
(a) Online Analytical Processing (b) Online Advanced Processing
(c) Online Analytical Performance (d) Online Advanced Preparation
Ans. : (a)