Data Mining and Data Warehousing Notes

This document contains module notes on data warehousing and modeling. It discusses the difference between operational database systems and data warehouses. Data warehouses are designed for online analytical processing and report generation while operational databases are for online transaction processing. It also compares OLTP and OLAP systems. The notes define data warehousing as a subject-oriented, integrated, time-variant and non-volatile collection of data to support management decision making. A multitier architecture for data warehouses is described involving extraction, transformation and loading of data from operational sources into a data warehouse for analysis.

Uploaded by

shilpa veeru

S J P N Trust's CSE Dept.

Hirasugar Institute of Technology, Nidasoshi. DMDW

Inculcating Values, Promoting Prosperity Module-I Notes
Approved by AICTE, Recognized by Govt. of Karnataka, Affiliated to VTU Belagavi,
Accredited at “A” Grade by NAAC and Recognized Under Section 2(f) of UGC Act, 1956
2018-19 (EVEN)

Subject: DATA MINING AND DATA WAREHOUSING
Subject Code: 15CS651
Module No: I
Module Name: DATA WAREHOUSING AND MODELING
Prepared By: Prof. Shilpa B. Hosagoudra

MODULE – 1

DATA WAREHOUSING AND MODELING (08 Hours)

MODULE CONTENT: Basic Concepts: Difference between Operational Database systems and
Data warehouse, Data Warehousing: A multitier Architecture, Data warehouse models:
Enterprise warehouse, Data mart and virtual warehouse, Extraction, Transformation and loading,
Metadata Repository, Data warehouse design and usage: Business Analysis framework, Data
warehouse design process and usage for information processing, Online analytical processing to
multidimensional data mining. Data Cube: A multidimensional data model, Stars, Snowflakes
and Fact constellations: Schemas for multidimensional Data models, Dimensions: The role of
concept Hierarchies, Measures: Their Categorization and computation, Typical OLAP
Operations.

Dept. of CSE, HSIT Nidasoshi, Taq: Hukkeri, Dist: Belagavi, Karnataka - 591 236
Prepared by: Prof Shilpa B. Hosagoudra, mail-id : [email protected]

Q1) Explain/Define Data warehousing? Differentiate & Compare Operational Database System
(ODS) and Data Warehousing System (DWS).
Ans: A data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for integration of a variety
of application systems. They support information processing by providing a solid platform of
consolidated historic data for analysis. According to William H. Inmon, “A data warehouse is a
subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of
management’s decision making process”.

Q2) What are operational database systems/stores (ODS)? How do they differ from a data
warehouse? Explain. Ans for Q1, Q2, and Q3:

Comparison of operational systems/stores, ODS (OLTP), with data warehousing systems, DW (OLAP):

1) ODS: Designed to support high-volume online transaction processing (OLTP) with minimal back-end reporting.
   DW: Designed to support high-volume online analytical processing (OLAP) and subsequent, often elaborate, report generation.
2) ODS: Generally process-oriented or process-driven, meaning that they are focused on specific business processes or tasks. Example: Tasks include billing, registration, etc.
   DW: Generally subject-oriented, organized around business areas that the organization needs information about. Such subject areas are usually populated with data from one or more operational systems. Example: Revenue may be a subject area of a data warehouse that incorporates data from operational systems that contain student tuition data, alumni gift data, financial aid data, etc.
3) ODS: Data are generally updated regularly according to need.
   DW: Data are generally non-volatile, meaning that new data may be added regularly, but once loaded, the data are rarely changed, thus preserving an ever-growing history of information. In short, data within a data warehouse are generally read-only.
4) ODS: Generally optimized to perform fast inserts and updates of relatively small volumes of data.
   DW: Generally optimized to perform fast retrievals of relatively large volumes of data.
5) ODS: Generally application-specific, resulting in a multitude of partially or non-integrated systems and redundant data.
   DW: Generally integrated at a layer above the application layer, avoiding data redundancy problems.
6) ODS: Generally require a non-trivial level of computing skills amongst the end-user community.
   DW: Generally appeal to an end-user community with a wide range of computing skills, from novice to expert users.
7) ODS: Involves day-to-day processing.
   DW: Involves historical processing of information.
8) ODS: Used by clerks, DBAs, or database professionals.
   DW: Used by knowledge workers such as executives, managers, and analysts.
9) ODS: Used to run the business.
   DW: Used to analyze the business.
10) ODS: Based on the Entity Relationship Model.
    DW: Based on the Star Schema, Snowflake Schema, and Fact Constellation Schema.
11) ODS: Application oriented.
    DW: Focuses on information out.
12) ODS: Provides primitive and highly detailed data.
    DW: Provides summarized and consolidated data.
13) ODS: Provides a detailed and flat relational view of data.
    DW: Provides a summarized and multidimensional view of data.
14) ODS: The database size is from 100 MB to 100 GB.
    DW: The database size is from 100 GB to 100 TB.
15) ODS: Provides high performance.
    DW: Highly flexible.
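
The contrast between the two workloads can be sketched in a few lines. This is only an illustration using Python's built-in sqlite3 module; the billing table, its columns, and its values are hypothetical and do not come from the notes. The first group of statements is OLTP-like (small inserts and an update), and the final query is OLAP-like (a read-only summary over many rows).

```python
import sqlite3

# Hypothetical single-table example; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE billing (student_id INTEGER, term TEXT, amount REAL)")

# OLTP-style work: many small inserts and updates, each touching a few rows.
conn.execute("INSERT INTO billing VALUES (1, '2018-EVEN', 500.0)")
conn.execute("INSERT INTO billing VALUES (2, '2018-EVEN', 450.0)")
conn.execute("INSERT INTO billing VALUES (1, '2019-ODD', 520.0)")
conn.execute("UPDATE billing SET amount = 480.0 WHERE student_id = 2")
conn.commit()

# OLAP-style work: a read-only query that scans and summarizes many rows.
revenue_by_term = dict(
    conn.execute("SELECT term, SUM(amount) FROM billing GROUP BY term").fetchall()
)
print(revenue_by_term)
```

In a real deployment the two workloads would normally run on separate systems, which is the point of the comparison above.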

Q3) Differentiate between OLTP and OLAP systems (or: compare OLTP and OLAP systems).
Ans: See the ODS (OLTP) versus DW (OLAP) comparison given above.

Q4) What is a data warehouse? Explain the multitier architecture.

ANS:
Data warehouses have been defined in many ways. Data warehousing provides architectures and
tools for business executives to systematically organize, understand, and use their data to make
strategic decisions. A data warehouse refers to a data repository that is maintained separately
from an organization’s operational databases. Data warehouse systems allow for integration of a
variety of application systems. They support information processing by providing a solid
platform of consolidated historic data for analysis.
According to William H. Inmon “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of management’s decision making process”.
Major features of a data warehouse:
The four keywords in this definition are subject-oriented, integrated, time-variant, and nonvolatile.
1) Subject-oriented: A data warehouse is organized around major subjects such as
customer, supplier, product, and sales. Data warehouses typically provide a simple and
concise view of particular subject issues by excluding data that are not useful in the
decision support process.
2) Integrated: A data warehouse is usually constructed by integrating multiple
heterogeneous sources, such as relational databases, flat files, and online transaction
records.

3) Time-variant: Data are stored to provide information from an historic perspective (e.g.,
the past 5–10 years). Every key structure in the data warehouse contains, either implicitly
or explicitly, a time element.
4) Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms. It usually requires only two operations in data accessing: initial
loading of data and access of data.

How are organizations using the information from data warehouses?

Many organizations use this information to support business decision-making activities including
(1) Increasing customer focus, which includes the analysis of customer buying patterns (such as
buying preference, buying time, budget cycles, and appetites for spending).
(2) Repositioning products and managing product portfolios by comparing the performance of
sales by quarter, by year, and by geographic regions in order to fine-tune production strategies.
(3) Analyzing operations and looking for sources of profit.
(4) Managing customer relationships, making environmental corrections, and managing the cost
of corporate assets.
Data Warehousing: A Multitiered Architecture:

1) The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier
from operational databases or other external sources. These tools and utilities perform
data extraction, cleaning, and transformation, as well as load and refresh functions to
update the data warehouse. The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server. Examples of gateways include
ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding,
Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a
metadata repository, which stores information about the data warehouse and its contents.
2) The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or (2) a
multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
3) The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
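
The three tiers can be sketched in miniature as follows. This is a toy illustration, not how a production OLAP server is built: an in-memory SQLite database stands in for the warehouse server, and the function names rolap_aggregate and report, the sales table, and its rows are all hypothetical. The middle function follows the ROLAP idea of mapping a multidimensional request onto a standard relational GROUP BY.

```python
import sqlite3

# Bottom tier: a relational warehouse store (in-memory SQLite stands in for it).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (quarter TEXT, city TEXT, item TEXT, dollars REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Q1", "Vancouver", "computer", 100.0),
        ("Q1", "Toronto", "computer", 150.0),
        ("Q2", "Vancouver", "phone", 80.0),
    ],
)

ALLOWED_DIMENSIONS = {"quarter", "city", "item"}

def rolap_aggregate(dimensions):
    """Middle tier (ROLAP style): map a multidimensional request to a GROUP BY."""
    if not dimensions or not set(dimensions) <= ALLOWED_DIMENSIONS:
        raise ValueError("unknown dimension requested")
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(dollars) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return warehouse.execute(sql).fetchall()

def report(dimensions):
    """Top tier: a front-end tool that formats the result for the user."""
    return [" / ".join(row[:-1]) + f": {row[-1]:.2f}" for row in rolap_aggregate(dimensions)]

print(report(["city"]))
```

Column names are checked against a fixed whitelist before being placed into the SQL text, because identifiers cannot be passed as bound parameters.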

Data Warehouse Models:

1) Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or
more operational systems or external information providers, and is cross-functional in scope. It
typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be
implemented on traditional mainframes, computer superservers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design and build.

2) Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data marts tend
to be summarized. Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. Depending on the source of data, data marts can be categorized
as independent or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated locally
within a particular department or geographic area. Dependent data marts are sourced directly
from enterprise data warehouses.

3) Virtual warehouse: A virtual warehouse is a set of views over operational databases. For
efficient query processing, only some of the possible summary views may be materialized. A
virtual warehouse is easy to build but requires excess capacity on operational database servers.
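
A minimal sketch of the virtual warehouse idea, assuming a hypothetical orders table in an operational database. The view sales_by_city copies nothing and is evaluated against the operational table on every query, while sales_by_city_mat imitates a materialized summary by storing the view's result as a table (SQLite has no built-in materialized views, so this is a stand-in).

```python
import sqlite3

# Operational database (hypothetical orders table).
ops = sqlite3.connect(":memory:")
ops.execute("CREATE TABLE orders (order_id INTEGER, city TEXT, amount REAL)")
ops.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "Vancouver", 20.0), (2, "Toronto", 35.0), (3, "Vancouver", 15.0)])

# Virtual warehouse: a view; nothing is copied, queries hit the operational table.
ops.execute("CREATE VIEW sales_by_city AS "
            "SELECT city, SUM(amount) AS total FROM orders GROUP BY city")

# Stand-in for a materialized summary view: store the view's result as a table.
ops.execute("CREATE TABLE sales_by_city_mat AS SELECT * FROM sales_by_city")

# A later operational insert is visible through the view but not in the stored copy.
ops.execute("INSERT INTO orders VALUES (4, 'Toronto', 5.0)")
live = dict(ops.execute("SELECT city, total FROM sales_by_city").fetchall())
stored = dict(ops.execute("SELECT city, total FROM sales_by_city_mat").fetchall())
```

The live view always reflects current data but costs the operational server work on every query, which is the "excess capacity" trade-off noted above.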

Extraction, Transformation, and Loading:

Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:

1) Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
2) Data cleaning, which detects errors in the data and rectifies them when possible.
3) Data transformation, which converts data from legacy or host format to warehouse format.
4) Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.
5) Refresh, which propagates the updates from the data sources to the warehouse.
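
The five functions above can be sketched as a small pipeline. Everything here is an assumption made for illustration: the two source layouts, the cleaning rule (drop rows with a non-numeric amount), and the function names are hypothetical, and real ETL tools do far more.

```python
import sqlite3

# Hypothetical source records from two heterogeneous systems.
source_a = [{"id": "1", "city": "vancouver ", "amount": "100.5"},
            {"id": "2", "city": "Toronto", "amount": "n/a"}]     # bad amount
source_b = [("3", "TORONTO", 40)]

def extract():
    """Extraction: gather rows from every source into one common shape."""
    rows = [(r["id"], r["city"], r["amount"]) for r in source_a]
    rows += [(i, c, a) for i, c, a in source_b]
    return rows

def clean(rows):
    """Cleaning: detect errors; here, drop rows whose amount is not numeric."""
    good = []
    for i, c, a in rows:
        try:
            good.append((i, c, float(a)))
        except (TypeError, ValueError):
            continue
    return good

def transform(rows):
    """Transformation: convert to the warehouse format (int key, title-case city)."""
    return [(int(i), c.strip().title(), a) for i, c, a in rows]

def load(conn, rows):
    """Load: insert the rows and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_city ON sales (city)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT id, city, amount FROM sales ORDER BY id").fetchall()
```

In this sketch, a refresh amounts to running the same pipeline again later: INSERT OR REPLACE overwrites rows whose key already exists, so source updates propagate to the warehouse table.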

Metadata Repository:
Metadata are data about data. When used in a data warehouse, metadata are the data that
define warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time-stamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

1) A description of the data warehouse structure, which includes the warehouse schema, views,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
2) Operational metadata, which include data lineage (history of migrated data and the sequence
of transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
3) The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
4) Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user authorization
and access control).
5) Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh,
update, and replication cycles.
6) Business metadata, which include business terms and definitions, data ownership information,
and charging policies.

Data Cube: A Multidimensional Data Model

Q) What is a data cube?
ANS) A data cube allows data to be modeled and viewed in multiple dimensions. It is defined
by dimensions and facts. In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.
For example, AllElectronics may create a sales data warehouse in order to keep records of the
store’s sales with respect to the dimensions time, item, branch, and location. These dimensions
allow the store to keep track of things like monthly sales of items and the branches and locations
at which the items were sold.
Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension. For example, a dimension table for item may contain the attributes item
name, brand, and type. A multidimensional data model is typically organized around a central
theme, such as sales. This theme is represented by a fact table. Facts are numeric measures. The
fact table contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.

The simplest case is a 2-D data cube that is, in fact, a table or spreadsheet of sales data from
AllElectronics. In particular, we look at the AllElectronics sales data for items sold per
quarter in the city of Vancouver.
Now suppose we would like to view the sales data with a third dimension, for instance
according to time and item as well as location. The 3-D data can be represented as a series
of 2-D tables. Similarly, we can think of a 4-D cube as a series of 3-D cubes.

In general, any n-dimensional data can be displayed as a series of (n-1)-dimensional “cubes”.

The data cube is a metaphor for multidimensional data storage. A data cube at a particular level
of summarization is often referred to as a cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of
the given dimensions. The result would form a lattice of cuboids, each showing the data at a
different level of summarization, or group-by. The lattice of cuboids is then referred to as a data
cube.

Base cuboid: The cuboid that holds the lowest level of summarization is called the base cuboid.
Ex: In the figure, the 4-D cuboid is the base cuboid for the given time, item, location, and
supplier dimensions.
Apex cuboid (the 0-D cuboid): It holds the highest level of summarization. In the figure, it is
the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is typically
denoted by all.
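
The lattice of cuboids can be generated mechanically: one group-by per subset of the dimensions. The sketch below uses three dimensions and made-up sales figures (both are assumptions for illustration), so the lattice has 2^3 = 8 cuboids, from the 0-D apex cuboid to the 3-D base cuboid.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical base data: (time, item, location, dollars sold).
DIMENSIONS = ("time", "item", "location")
facts = [
    ("Q1", "computer", "Vancouver", 100),
    ("Q1", "phone", "Vancouver", 50),
    ("Q2", "computer", "Toronto", 70),
]

def cuboid(group_by):
    """Aggregate dollars sold over the dimensions named in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

# One cuboid per subset of the dimensions: 2**3 = 8 cuboids in the lattice.
lattice = {
    subset: cuboid(subset)
    for k in range(len(DIMENSIONS) + 1)
    for subset in combinations(DIMENSIONS, k)
}

apex = lattice[()]             # 0-D cuboid: total over all dimensions
base = lattice[DIMENSIONS]     # 3-D base cuboid: no summarization
```

Computing every cuboid from the raw facts, as done here, is the naive approach; it is shown only to make the lattice concrete.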

Schemas for Multidimensional Data Models:

For online transaction processing, the entity-relationship data model is commonly used. A
data warehouse, however, requires a concise, subject-oriented schema that facilitates online
data analysis. The most popular data model for a data warehouse is a multidimensional model,
which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.

1) Star Schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains a large central table (fact table) containing the bulk of the data, with no
redundancy, and a set of smaller attendant tables (dimension tables), one for each
dimension. The schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.
The following figure shows the star schema model for the sales of AllElectronics.
Sales are considered along four dimensions: time, item, branch, and location. The schema
contains a central fact table for sales that contains keys to each of the four dimensions,
along with two measures: dollars sold and units sold. To minimize the size of the fact
table, dimension identifiers (e.g., time key and item key) are system-generated identifiers.
Each dimension is represented by only one table, and each table contains a set of
attributes.
Disadvantage: Because each dimension is held in a single table, repeated values for some
attributes (for example, the same country stored for every city in it) introduce redundancy
within the dimension table.
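
A star schema along these lines can be sketched with SQLite. The table and column names (time_dim, sales_fact, and so on) and all row values are assumptions chosen for this example; only the overall shape, one fact table with keys to four dimension tables and two measures, follows the description above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim,
    item_key     INTEGER REFERENCES item_dim,
    branch_key   INTEGER REFERENCES branch_dim,
    location_key INTEGER REFERENCES location_dim,
    dollars_sold REAL,
    units_sold   INTEGER
);
INSERT INTO time_dim VALUES (1, 'Q1', 2018), (2, 'Q2', 2018);
INSERT INTO item_dim VALUES (1, 'laptop', 'BrandA', 'computer');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1000.0, 2), (1, 1, 1, 2, 500.0, 1), (2, 1, 1, 1, 1500.0, 3);
""")

# One join per dimension used: dollars sold per quarter and city.
star_rows = db.execute("""
    SELECT t.quarter, l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
```

Note that 'Canada' is stored once per city row in location_dim, which is the redundancy the star schema accepts in exchange for fewer joins.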

2) Snowflake Schema:

The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.

Advantages and disadvantages: The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space. However, this space
savings is negligible in comparison to the typical magnitude of the fact table. Furthermore, the
snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to
execute a query. Consequently, the system performance may be adversely impacted. Hence,
although the snowflake schema reduces redundancy, it is not as popular as the star schema in
data warehouse design.

The main difference between the two schemas is in the definition of dimension tables. The
single dimension table for item in the star schema is normalized in the snowflake schema,
resulting in new item and supplier tables. For example, the item dimension table now contains
the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to
the supplier dimension table, containing supplier key and supplier type information. Similarly,
the single dimension table for location in the star schema can be normalized into two new tables:
location and city. The city key in the new location table links to the city dimension. Notice that,
when desirable, further normalization can be performed on province or state and country in the
snowflake schema shown in the figure above.
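
The extra join introduced by snowflaking can be seen in a small sketch. The item and supplier tables follow the attributes listed above; the table names, the reduced fact table, and the data are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier_dim
);
CREATE TABLE sales_fact (item_key INTEGER REFERENCES item_dim, dollars_sold REAL);
INSERT INTO supplier_dim VALUES (1, 'wholesale'), (2, 'retail');
INSERT INTO item_dim VALUES (10, 'laptop', 'BrandA', 'computer', 1),
                            (11, 'phone', 'BrandB', 'phone', 2);
INSERT INTO sales_fact VALUES (10, 1000.0), (11, 300.0), (10, 200.0);
""")

# Reaching supplier_type needs two joins (fact -> item -> supplier),
# where a star schema would need only one (fact -> item).
by_supplier_type = db.execute("""
    SELECT s.supplier_type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN item_dim i ON f.item_key = i.item_key
    JOIN supplier_dim s ON i.supplier_key = s.supplier_key
    GROUP BY s.supplier_type
    ORDER BY s.supplier_type
""").fetchall()
```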

3) Fact Constellation Schema:

Sophisticated applications may require multiple fact tables to share dimension tables. This kind
of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.

The above figure shows a fact constellation schema that specifies two fact tables, sales and
shipping. The sales table definition is identical to that of the star schema (Figure 4.6). The
shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location,
and to_location), and two measures (dollars_cost and units_shipped). A fact constellation schema
allows dimension tables to be shared between fact tables. For example, the dimension tables for
time, item, and location are shared between the sales and shipping fact tables.
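
A reduced sketch of a fact constellation, with two fact tables sharing one item dimension. The tables are cut down to a single key and a single measure each, and the names and values are assumptions for illustration. Each fact table is aggregated on its own and the results are then combined through the shared dimension, which avoids multiplying rows when the two fact tables have different numbers of rows per item.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact    (item_key INTEGER REFERENCES item_dim, dollars_sold REAL);
CREATE TABLE shipping_fact (item_key INTEGER REFERENCES item_dim, units_shipped INTEGER);
INSERT INTO item_dim VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO sales_fact VALUES (1, 1000.0), (1, 500.0), (2, 300.0);
INSERT INTO shipping_fact VALUES (1, 4), (2, 2), (2, 1);
""")

# Aggregate each fact table separately, then combine through the shared dimension.
per_item = db.execute("""
    SELECT i.item_name, s.dollars, sh.units
    FROM item_dim i
    JOIN (SELECT item_key, SUM(dollars_sold) AS dollars
          FROM sales_fact GROUP BY item_key) s ON s.item_key = i.item_key
    JOIN (SELECT item_key, SUM(units_shipped) AS units
          FROM shipping_fact GROUP BY item_key) sh ON sh.item_key = i.item_key
    ORDER BY i.item_name
""").fetchall()
```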

Q) What is the distinction between a data warehouse and a data mart?

Ans:
Data Warehouse:
1) A data warehouse collects information about subjects that span the entire organization, such
as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
2) For data warehouses, the fact constellation schema is commonly used, since it can model
multiple, interrelated subjects.

Data Mart:
1) A data mart, on the other hand, is a department subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide.
2) For data marts, the star or snowflake schema is commonly used, since both are geared
toward modeling single subjects, although the star schema is more popular and efficient.

Dimensions: The Role of Concept Hierarchies:

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts.
Ex: Consider a concept hierarchy for the dimension location. City values for location include
Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the
province or state to which it belongs. The provinces and states can in turn be mapped to the
country (e.g., Canada or the United States) to which they belong. These mappings form a
concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to
higher-level, more general concepts (i.e., countries). This concept hierarchy is illustrated in
the following figure.
Figure: A concept hierarchy for location.

Many concept hierarchies are implicit within the database schema. For example, suppose that the
dimension location is described by the attributes street, city, province or state, and country.
These attributes are related by a total order, forming a concept hierarchy such as
“street < city < province or state < country.” This hierarchy is shown in Figure (a).
Alternatively, the attributes of a dimension may be organized in a partial order, forming a
lattice. An example of a partial order for the time dimension based on the attributes day, week,
month, quarter, and year is “day < {month < quarter; week} < year”. This lattice structure is
shown in Figure (b).

Schema Hierarchy. A concept hierarchy that is a total or partial order among attributes in a
database schema is called a schema hierarchy.

Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can be
defined among groups of values.

Concept hierarchies may be provided manually by system users, domain experts, or knowledge
engineers, or may be automatically generated based on statistical analysis of the data
distribution.
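
Both kinds of hierarchy can be written down as simple mappings. In the sketch below the city-to-region and region-to-country tables are small hand-written examples, and the price ranges used for the set-grouping hierarchy are arbitrary assumptions; the function names are hypothetical.

```python
# Schema hierarchy for location: city < province_or_state < country.
# The mappings below are small hand-written examples.
CITY_TO_REGION = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                  "New York": "New York", "Chicago": "Illinois"}
REGION_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada",
                     "New York": "USA", "Illinois": "USA"}

def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    region = CITY_TO_REGION[city]
    if level == "province_or_state":
        return region
    if level == "country":
        return REGION_TO_COUNTRY[region]
    raise ValueError(f"unknown level: {level}")

def price_group(price):
    """Set-grouping hierarchy: discretize a numeric value into ranges.
    The range boundaries are illustrative assumptions."""
    if price < 100:
        return "inexpensive"
    if price < 500:
        return "moderately priced"
    return "expensive"
```

For example, generalize("Toronto", "country") maps the city up two levels to "Canada", and price_group(250) places a price in the middle range.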

Measures: Their Categorization and Computation:

A data cube measure is a numeric function that can be evaluated at each point in the data cube
space. A measure value is computed for a given point by aggregating the data corresponding to
the respective dimension–value pairs defining the given point.

Ex: A multidimensional point in the data cube space can be defined by a set of dimension–value
pairs; for example, <time = “Q1”, location = “Vancouver”, item = “computer”>.

Measures can be organized into three categories (distributive, algebraic, and holistic)
based on the kind of aggregate functions used.

1) Distributive: An aggregate function is distributive if it can be computed in a distributed
manner as follows. Suppose the data are partitioned into n sets. We apply the function to each
partition, resulting in n aggregate values. If the result derived by applying the function to the n
aggregate values is the same as that derived by applying the function to the entire data set
(without partitioning), the function can be computed in a distributed manner.

Example: sum(), count(), min(), and max() are distributive aggregate functions.
sum() can be computed for a data cube by first partitioning the cube into a set of
subcubes, computing sum() for each subcube, and then summing up the sums obtained for each
subcube. Hence, sum() is a distributive aggregate function.
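
The definition can be checked directly on a small list. The numbers and the way they are split into three partitions are arbitrary choices for illustration.

```python
data = [4, 8, 15, 16, 23, 42]
partitions = [data[0:2], data[2:5], data[5:]]   # n = 3 arbitrary partitions

# Apply the function to each partition, then apply it again to the partial results.
sum_of_sums = sum(sum(p) for p in partitions)
max_of_maxes = max(max(p) for p in partitions)
min_of_mins = min(min(p) for p in partitions)

# count() combines slightly differently: the partial counts are summed.
total_count = sum(len(p) for p in partitions)
```

Each combined value equals the result of applying the same function to the whole list, which is what makes these functions distributive.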

2) Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function
with M arguments (where M is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function. A measure is algebraic if it is obtained by applying an
algebraic aggregate function.
Example: avg(), min_N(), max_N(), and standard_deviation() are algebraic aggregate functions.
avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive
aggregate functions.
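
A short sketch of avg() as an algebraic function with M = 2 arguments. The partitions are made-up values; the last line shows why the two sub-aggregates are needed, since simply averaging the partition averages gives a different answer when the partitions differ in size.

```python
partitions = [[4, 8], [15, 16, 23], [42]]

# Each partition reports M = 2 distributive sub-aggregates: (sum, count).
partials = [(sum(p), len(p)) for p in partitions]

total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
average = total_sum / total_count

# Averaging the per-partition averages would NOT give the same answer
# when partitions differ in size.
avg_of_avgs = sum(s / c for s, c in partials) / len(partials)
```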

3) Holistic: An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a subaggregate. That is, there does not exist an algebraic function with M
arguments (where M is a constant) that characterizes the computation. A measure is holistic if it
is obtained by applying a holistic aggregate function.
Example: Holistic functions include median(), mode(), and rank().
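
To see why median() does not fit the two earlier categories, the sketch below compares the true median with the median of per-partition medians on made-up data. The two values differ, so a single median per partition is not enough to recover the overall median.

```python
from statistics import median

partitions = [[1, 2, 3], [4, 5, 100], [6, 7, 8, 9, 10]]
all_values = [v for p in partitions for v in p]

true_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)
```

This is a single counterexample, not a proof; it only illustrates that a fixed-size summary per partition does not suffice for median().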

Typical OLAP Operations:

How are concept hierarchies useful in OLAP?

Ans:
In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies. This organization provides
users with the flexibility to view data from different perspectives. A number of OLAP data cube
operations exist to materialize these different views, allowing interactive querying and analysis
of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data
analysis.
The typical OLAP operations are roll-up, drill-down, slice and dice, and pivot (rotate), along
with other OLAP operations such as drill-across and drill-through.

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by
dimension reduction.
Ex: The figure shows the result of a roll-up operation performed on the central cube by climbing
up the concept hierarchy for location. This hierarchy was defined as the total order “street < city <
province or state < country.” The roll-up operation shown aggregates the data by ascending the
location hierarchy from the level of city to the level of country.

NOTE: When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube.
Ex: consider a sales data cube containing only the location and time dimensions. Roll-up may be
performed by removing, say, the time dimension, resulting in an aggregation of the total sales by
location, rather than by location and by time.
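
Both forms of roll-up can be sketched on a tiny cube held in a dictionary. The cell values, the city-to-country mapping, and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, quarter) -> dollars sold.
cube = {("Vancouver", "Q1"): 100, ("Toronto", "Q1"): 150,
        ("Chicago", "Q1"): 200, ("Vancouver", "Q2"): 120}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (city, quarter), dollars in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter)] += dollars
    return dict(out)

def roll_up_drop_time(cells):
    """Roll-up by dimension reduction: remove the time dimension."""
    out = defaultdict(int)
    for (city, _quarter), dollars in cells.items():
        out[(city,)] += dollars
    return dict(out)
```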

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
Ex: The figure shows the result of a drill-down operation performed on the central cube by
stepping down a concept hierarchy for time defined as “day < month < quarter < year.” Drill-
down occurs by descending the time hierarchy from the level of quarter to the more detailed
level of month. The resulting data cube details the total sales per month rather than summarizing
them by quarter.

NOTE: Because drill-down adds more detail to the given data, it can also be performed by adding
new dimensions to a cube, such as customer group in the figure.
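
Drill-down is only possible if the more detailed data are stored somewhere. The sketch below assumes facts kept at the month level (hypothetical values) and a hypothetical view_at function that returns the cube at either time level; moving from the quarter view to the month view is the drill-down.

```python
from collections import defaultdict

# Detailed facts stored at the month level (hypothetical values).
monthly_sales = {("Vancouver", "Jan"): 30, ("Vancouver", "Feb"): 40,
                 ("Vancouver", "Mar"): 30, ("Vancouver", "Apr"): 50}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def view_at(level):
    """Return the cube at the requested time level: 'quarter' or 'month'."""
    if level == "month":
        return dict(monthly_sales)          # most detailed level available
    if level == "quarter":
        out = defaultdict(int)
        for (city, month), dollars in monthly_sales.items():
            out[(city, MONTH_TO_QUARTER[month])] += dollars
        return dict(out)
    raise ValueError(f"unknown level: {level}")

summary = view_at("quarter")    # starting point
detail = view_at("month")       # drill-down: quarter -> month
```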

Slice: The slice operation performs a selection on one dimension of the given cube, resulting in a
subcube.
Ex: The figure shows a slice operation where the sales data are selected from the central cube
for the dimension time using the criterion time = “Q1”.

Dice: The dice operation defines a subcube by performing a selection on two or more
dimensions.
Ex: The figure shows a dice operation on the central cube based on the following selection
criteria that involve three dimensions: (location = “Toronto” or “Vancouver”) and
(time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).
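
Slice and dice are both selections, differing only in how many dimensions are restricted. The sketch below uses a made-up cube and a hypothetical select helper; the dice criteria mirror the example above.

```python
# Hypothetical cube cells: (location, time, item) -> dollars sold.
cube = {
    ("Toronto", "Q1", "computer"): 100,
    ("Toronto", "Q3", "computer"): 90,
    ("Vancouver", "Q1", "phone"): 60,
    ("Vancouver", "Q2", "home entertainment"): 80,
    ("Chicago", "Q1", "computer"): 70,
}

def select(cells, location=None, time=None, item=None):
    """Keep cells whose coordinates fall in the given sets (None = no restriction)."""
    return {
        (loc, t, it): v
        for (loc, t, it), v in cells.items()
        if (location is None or loc in location)
        and (time is None or t in time)
        and (item is None or it in item)
    }

# Slice: selection on one dimension.
slice_q1 = select(cube, time={"Q1"})

# Dice: selection on two or more dimensions.
diced = select(cube, location={"Toronto", "Vancouver"}, time={"Q1", "Q2"},
               item={"home entertainment", "computer"})
```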

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in
view to provide an alternative data presentation.
Ex: The figure shows a pivot operation where the item and location axes in a 2-D slice are
rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube
into a series of 2-D planes.
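
For a 2-D slice held as a list of rows, pivoting is a transpose. The item and location labels and the numbers are made-up values for illustration.

```python
# A 2-D slice stored as rows = items, columns = locations (hypothetical values).
items = ["computer", "phone"]
locations = ["Toronto", "Vancouver", "Chicago"]
table = [[100, 60, 70],
         [40, 55, 20]]

# Pivot: swap the axes so rows = locations and columns = items.
pivoted = [list(col) for col in zip(*table)]
```

The data values are unchanged; only their arrangement for presentation differs.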

Other OLAP operations: Drill-across, Drill-through

Some OLAP systems offer additional drilling operations. For example:

The drill-across operation executes queries involving (i.e., across) more than one fact table.
The drill-through operation uses relational SQL facilities to drill through the bottom level of a
data cube down to its back-end relational tables.

Figure: Examples of typical OLAP operations on multidimensional data.

Q) Explain the three kinds of data warehouse applications. How does data mining relate to
information processing and online analytical processing?
Ans:
There are three kinds of data warehouse applications: information processing, analytical
processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is
to construct low-cost web-based accessing tools that are then integrated with web browsers.

Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down,
roll-up, and pivoting. It generally operates on historic data in both summarized and detailed
forms. The major strength of online analytical processing over information processing is the
multidimensional data analysis of data warehouse data.

Data mining supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models, performing classification and prediction, and presenting the
mining results using visualization tools.

Information processing, based on queries, can find useful information. However, answers to
such queries reflect the information directly stored in databases or computable by aggregate
functions. They do not reflect sophisticated patterns or regularities buried in the database.
Therefore, information processing is not data mining. Online analytical processing comes a step
closer to data mining because it can derive information summarized at multiple granularities
from user-specified subsets of a data warehouse.
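
The contrast between a plain aggregate query and an OLAP-style summary at several granularities can be illustrated as follows. This is a minimal sketch in plain Python. The records, the field order, and the helper `summarize` are assumptions made for the example.

```python
from collections import defaultdict

# Invented sales records: (country, city, quarter, dollars_sold).
records = [
    ("Canada", "Toronto", "Q1", 500),
    ("Canada", "Toronto", "Q2", 700),
    ("Canada", "Vancouver", "Q1", 300),
    ("USA", "Chicago", "Q1", 400),
    ("USA", "New York", "Q2", 900),
]

# Information processing: a single aggregate answered directly from stored data.
total_sales = sum(dollars for _, _, _, dollars in records)


def summarize(rows, key_fields):
    """Sum dollars_sold grouped by the chosen positions of each record."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in key_fields)] += row[3]
    return dict(totals)


# Analytical processing: pick a subset (Q1 only), then view it at two granularities.
q1_rows = [row for row in records if row[2] == "Q1"]
by_city = summarize(q1_rows, key_fields=(0, 1))   # finer level: country, city
by_country = summarize(q1_rows, key_fields=(0,))  # roll-up: country only
```

Neither result is a mined pattern. Both are summaries of stored values, which is why OLAP is described as only a step closer to data mining.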

Q) Why is multidimensional data mining particularly important? Mention the
reasons.
The reasons are as follows:
1) High quality of data in data warehouses: Most data mining tools need to work on
integrated, consistent, and cleaned data, which requires costly data cleaning, data integration, and
data transformation as preprocessing steps. A data warehouse constructed by such preprocessing
serves as a valuable source of high-quality data for OLAP as well as for data mining. Notice that
data mining may serve as a valuable tool for data cleaning and data integration as well.
2) Available information processing infrastructure surrounding data warehouses:
Comprehensive information processing and data analysis infrastructures have been or will be
systematically constructed surrounding data warehouses, which include accessing, integration,
consolidation, and transformation of multiple heterogeneous databases, ODBC/OLEDB
connections, Web accessing and service facilities, and reporting and OLAP analysis tools. It is
prudent to make the best use of the available infrastructures rather than constructing everything
from scratch.
3) OLAP-based exploration of multidimensional data: Effective data mining needs
exploratory data analysis. A user will often want to traverse through a database, select portions
of relevant data, analyze them at different granularities, and present knowledge/results in
different forms. Multidimensional data mining provides facilities for mining on different subsets
of data and at varying levels of abstraction—by drilling, pivoting, filtering, dicing, and slicing on
a data cube and/or intermediate data mining results. This, together with data/knowledge
visualization tools, greatly enhances the power and flexibility of data mining.

4) Online selection of data mining functions: Users may not always know the specific kinds of
knowledge they want to mine. By integrating OLAP with various data mining functions,
multidimensional data mining provides users with the flexibility to select desired data mining
functions and swap data mining tasks dynamically.

Q) Illustrate the business analysis framework for data warehouse design and mention the four
different views regarding a data warehouse design.
Ans:
A data warehouse offers business analysts several benefits.
First, having a data warehouse may provide a competitive advantage by presenting relevant
information from which to measure performance and make critical adjustments to help win over
competitors.
Second, a data warehouse can enhance business productivity because it is able to quickly and
efficiently gather information that accurately describes the organization.
Third, a data warehouse facilitates customer relationship management because it provides a
consistent view of customers and items across all lines of business, all departments, and all
markets.
Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and
exceptions over long periods in a consistent and reliable manner.

To design an effective data warehouse we need to understand and analyze business needs and
construct a business analysis framework. The construction of a large and complex information
system can be viewed as the construction of a large and complex building, for which the owner,
architect, and builder have different views.
Four different views regarding a data warehouse design must be considered: the top-down view,
the data source view, the data warehouse view, and the business query view.
1) The top-down view allows the selection of the relevant information necessary for the data
warehouse. This information matches current and future business needs.
2) The data source view exposes the information being captured, stored, and managed by
operational systems. This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source tables. Data sources are
often modeled by traditional data modeling techniques, such as the entity-relationship model or
CASE (computer-aided software engineering) tools.
3) The data warehouse view includes fact tables and dimension tables. It represents the
information that is stored inside the data warehouse, including precalculated totals and counts,
as well as information regarding the source, date, and time of origin, added to provide historical
context.
4) The business query view is the data perspective in the data warehouse from the end-user’s
viewpoint.

Q) Explain the steps involved in Data Warehouse design process.

Ans:
The warehouse design process consists of the following steps:
1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account
administration, sales, or the general ledger). If the business process is organizational and
involves multiple complex object collections, a data warehouse model should be followed.
However, if the process is departmental and focuses on the analysis of one kind of business
process, a data mart model should be chosen.
2. Choose the business process grain, which is the fundamental, atomic level of data to be
represented in the fact table for this process (e.g., individual transactions, individual daily
snapshots, and so on).
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time,
item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
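
The four steps above can be sketched as a small star schema. This is only an illustration, assuming the sales process is chosen with individual transactions as the grain. The table names, column names, and rows are hypothetical, and Python's built-in `sqlite3` module stands in for a real warehouse server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Step 3: dimensions that apply to each fact record.
    CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, quarter TEXT);
    CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

    -- Steps 1, 2 and 4: the sales process, one row per individual transaction
    -- (the grain), with numeric additive measures.
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        customer_key INTEGER REFERENCES customer_dim(customer_key),
        dollars_sold REAL,
        units_sold   INTEGER
    );

    INSERT INTO time_dim VALUES (1, '2019-01-15', 'Q1'), (2, '2019-04-02', 'Q2');
    INSERT INTO item_dim VALUES (1, 'computer'), (2, 'phone');
    INSERT INTO customer_dim VALUES (1, 'Customer A');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 1000.0, 2), (1, 2, 1, 300.0, 1), (2, 1, 1, 500.0, 1);
    """
)

# Additive measures can be summed along any dimension, here by quarter.
sales_by_quarter = conn.execute(
    """
    SELECT t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f JOIN time_dim AS t ON f.time_key = t.time_key
    GROUP BY t.quarter ORDER BY t.quarter
    """
).fetchall()
```

Choosing the transaction as the grain keeps the finest detail in the fact table, so any coarser summary (by quarter, by item, by customer) can still be computed from it.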

Because data warehouse construction is a difficult and long-term task, its implementation
scope should be clearly defined. The goals of an initial data warehouse implementation should be
specific, achievable, and measurable. This involves determining the time and budget allocations,
the subset of the organization that is to be modeled, the number of data sources selected, and the
number and types of departments to be served.

ASSIGNMENT - I
Sem: VI Sub: Data Mining and Data Warehousing Sub. Code: 15CS651
Max. Marks : 25 Mapped CO: C320.1 Date: 07-03-2019

Module – 1
Q. No.   Description of Question   Marks   RBT Level

BATCH 1

1   Define data warehousing. Explain the differences between an Operational Database System (ODS) and a Data Warehousing System (DWS), or OLTP and OLAP.   5   L2
2   With a neat diagram, explain in detail the multitier data warehouse architecture.   5   L2
3   Explain in detail the star schema, snowflake schema, and fact constellation schema with a neat diagram.   5   L2
4   What is a data cube measure? How is it categorized? Explain.   5   L2
5   Explain typical OLAP operations with an example.   5   L2

BATCH 2

1   Explain the data warehouse metadata repository.   5   L2
2   Briefly explain data warehousing and also explain data warehouse models.   5   L2
3   Explain the following: 1) The concept of hierarchies and its representation, with an example. 2) Measures and their categories.   5   L2
4   What are the functions of data warehouse tools and utilities? Explain.   5   L2
5   What is the KDD process? Explain in detail with a neat diagram.   5   L2

BATCH 3

1   Define data warehouse. List and define the key features of a data warehouse.   5   L2
2   Explain the data cube (OLAP) operations with an example for each.   5   L2
3   Define data cube. With an example, explain a multidimensional data model.   5   L2
4   Explain the recommended approach for data warehouse development with a neat diagram.   5   L2
5   Develop a 4-D data cube representation of sales data, according to time, item, location, and supplier.   5   L3

Last Date of Submission: 11-03-2019

Course Coordinator Module Coordinator HOD

Mrs. Shilpa B. Hosagoudra Mr. S. V. Manjaragi Dr. Parashuram Baraki

Dept. of CSE, HSIT Nidasoshi, Taq: Hukkeri, Dist: Belagavi, Karnataka - 591 236
Prepared by: Prof Shilpa B. Hosagoudra, mail-id : [email protected]

