ML Module1 Ppt - Copy

The document provides an overview of data warehousing and modeling, emphasizing its importance for businesses to analyze and utilize data for strategic decision-making. It defines data warehouses, outlines their key features, and compares them with operational databases, highlighting their role in business intelligence. Additionally, it discusses various types of data warehouses, data integration processes, and OLAP operations for data analysis.

Module 1

Data warehousing & Modelling


Motivation for data warehouse
With competition mounting in every industry, data warehousing is the
must-have marketing weapon for retaining customers by learning more about
their needs.

Data warehousing provides tools and architectures for business
executives to organise, understand, and use their data to make strategic decisions.
A data warehouse refers to a data repository that is maintained
separately from an organization’s operational databases. Data
warehouse systems allow for integration of a variety of application
systems. They support information processing by providing a solid
platform of consolidated historic data for analysis.
Oracle definition of data warehouse
• A data warehouse is a type of data management system that is
designed to enable and support business intelligence (BI) activities,
especially analytics. Data warehouses are solely intended to perform
queries and analysis and often contain large amounts of historical
data.
IBM definition
• A core component of business intelligence, a data warehouse
pulls together data from many different sources into a single
data repository for sophisticated analytics and decision
support.
Key features of a data warehouse

Integrated: data from various sources is kept in a consistent format

Subject-oriented: organized around a particular subject, e.g. customer or sales (instead of the
whole organization's data)

Nonvolatile: data is unchanged once loaded; access is read-only

Time-variant: data is recorded with respect to time (e.g. sales over 1-5 years)


• A data warehouse is mainly used for decision support
• Knowledge workers such as managers and analysts use the data warehouse
to obtain an overview of, or insights into, the data
• It supports business decisions such as:
✓Analysis of customer buying patterns (e.g. combos)
✓Repositioning products and managing product portfolios according to
sales per time period and per geographical region (e.g. clothes)
✓Coming up with strategies to improve profit
✓Managing customer relationships and corporate assets
▪ Organizations collect diverse, heterogeneous, and distributed data; integrating
it and providing access to it is the major challenge
▪ In the traditional database approach, an integrator (wrapper) is used to address
this
▪ The integrator is a mediator on top of multiple heterogeneous databases: when a
query is posed at a client site, a metadata dictionary is used to
translate it into queries appropriate for each database involved
▪ These queries are mapped and sent to local query processors.
▪ The results returned from the different sites are integrated into a global
answer set
▪ This query-driven approach requires complex information filtering and
integration processes and competes with local sites for processing
resources. It is inefficient and potentially expensive for frequent queries.
• Data warehousing employs an update-driven approach in which
information from multiple, heterogeneous sources is integrated in
advance and stored in a warehouse for direct querying.
• A data warehouse brings high performance to the integrated
heterogeneous database system because data are copied, preprocessed,
integrated, and summarized into one data store.
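The contrast between the query-driven and update-driven approaches can be sketched in a few lines of Python. This is only an illustration: the two sources, their field names, and the figures are made up, and a real mediator or warehouse involves far more than this.

```python
# Two hypothetical operational sources that store the same kind of
# information under different field names (heterogeneous schemas).
source_a = [{"cust": "Ann", "amt": 120.0}, {"cust": "Bob", "amt": 80.0}]
source_b = [{"customer_name": "Ann", "total": 50.0}]


def query_driven_total(customer):
    """Query-driven: translate the query for each source at query time,
    then integrate the partial answers into a global answer."""
    total = sum(r["amt"] for r in source_a if r["cust"] == customer)
    total += sum(r["total"] for r in source_b
                 if r["customer_name"] == customer)
    return total


def build_warehouse():
    """Update-driven: integrate and summarize once, in advance."""
    warehouse = {}
    for r in source_a:
        warehouse[r["cust"]] = warehouse.get(r["cust"], 0.0) + r["amt"]
    for r in source_b:
        name = r["customer_name"]
        warehouse[name] = warehouse.get(name, 0.0) + r["total"]
    return warehouse


warehouse = build_warehouse()      # done ahead of time
print(query_driven_total("Ann"))   # touches every source on each query
print(warehouse["Ann"])            # direct lookup in the integrated store
```

Both calls give the same answer, but the warehouse lookup does not compete with the local sites for processing resources at query time.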
Difference between OLAP & OLTP
• Difference between an operational database (OLTP) and a data
warehouse (OLAP):
Feature | OLTP | OLAP
Characteristic | Operational processing | Information processing and analysis
Orientation | Transaction | Analysis
User | Clerk, DBA, database professional | Knowledge worker
Function | Day-to-day operations | Long-term informational requirements, decision support
DB design | ER-based, application-oriented | Subject-oriented
Data | Current, guaranteed up to date | Historic, accuracy maintained over time
Summarization | Highly detailed | Consolidated
View | Detailed, flat relational | Summarized, multidimensional
Unit of work | Simple | Complex
Access | Read/write | Mostly read
Focus | Data in | Information out
No. of records | Tens | Millions
No. of users | Thousands | Hundreds
DB size | GB | ≥ TB
Priority | High performance, high availability | High flexibility, end-user autonomy


Three Tier data warehouse architecture
Types of warehouse:
Enterprise warehouse
• Holds information spanning the entire organization
• Integrates data from one or more operational systems
• Contains both detailed and summarized data
• Size ranges from gigabytes upward
• Implemented on mainframes or super servers
• Requires extensive business modelling, hence may take years to design and build
Data mart
• Subset of the data warehouse
• Confined to selected subjects and a specific group of users
• Can be built on low-cost servers running Windows or Linux
• Implementation can be done in weeks
• Types: dependent or independent
• Independent: sourced from data generated locally within a department
• Dependent: sourced from the enterprise warehouse
Virtual data warehouse
• A set of views over operational databases; summary views are created for frequent queries
• Easy to build
• Requires excess capacity on the operational database servers
• Bottom-up development (data marts first): integrating the data later is difficult
• Top-down development (enterprise warehouse first): agreeing on a common data
model for the whole enterprise is difficult
Extraction, Transformation, and Loading
• Data extraction, which typically gathers data from multiple,
heterogeneous, and external sources.
• Data cleaning, which detects errors in the data and rectifies them
when possible.
• Data transformation, which converts data from legacy or host format
to warehouse format.
• Load, which sorts, summarizes, consolidates, checks integrity, and
builds indices and partitions.
• Refresh, which propagates the updates from the data sources to the
warehouse.
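The steps listed above can be sketched as a tiny pipeline. This is a minimal illustration under stated assumptions, not a real ETL tool: the two sources (`crm_rows`, `erp_rows`), their field names, and the "n/a" missing-value marker are all hypothetical, and the refresh step is omitted.

```python
def extract():
    """Extraction: gather rows from two hypothetical heterogeneous sources."""
    crm_rows = [{"name": " ann ", "sales": "120"},
                {"name": "bob", "sales": "n/a"}]
    erp_rows = [{"NAME": "CARL", "SALES": "75.5"}]
    return crm_rows, erp_rows


def clean(rows):
    """Cleaning: strip stray whitespace and drop rows that contain a
    missing-value marker that cannot be rectified here."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if "n/a" not in row.values():
            cleaned.append(row)
    return cleaned


def transform(crm_rows, erp_rows):
    """Transformation: convert both source formats to one warehouse
    format, a (name, sales) tuple with a numeric sales value."""
    unified = [(r["name"].title(), float(r["sales"])) for r in crm_rows]
    unified += [(r["NAME"].title(), float(r["SALES"])) for r in erp_rows]
    return unified


def load(rows):
    """Load: sort the rows and build a simple index from name to sales."""
    return {name: sales for name, sales in sorted(rows)}


crm, erp = extract()
warehouse = load(transform(clean(crm), clean(erp)))
print(warehouse)
```

The row for "bob" is dropped during cleaning because its sales value is missing; the remaining rows end up in one consistent format.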
Metadata repository
• Description of the warehouse structure: schema, dimensions, etc.
• Operational metadata: history of migrated data, error and monitoring
reports
• Algorithms used for summarization, aggregation, queries, and
reports
• Data sources and gateways (e.g. ODBC)
• Data related to system performance, which includes indices and
profiles that improve data access and retrieval performance
• Business terms
Data cube – Multidimensional approach
• A data cube allows data to be modelled in multiple dimensions
• Dimension: a perspective with respect to which an organization wants to keep its data
• Ex: a sales data store with the dimensions time, item, and branch
• Each dimension is described by a table called a dimension table
• Ex: the dimension table for item may contain item name, brand, and type
• A multidimensional data model is organized around a central theme, e.g. sales
• Facts: numeric values (measures) of the central theme
• The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables
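The relationship between a fact table and its dimension tables can be illustrated with Python's built-in sqlite3 module. The table names (`sales_fact`, `item_dim`, and so on), the columns, and the rows are assumptions made for this sketch only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per perspective (time, item, branch).
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim   (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);

-- Fact table: keys to each dimension table plus the numeric measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO time_dim   VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim   VALUES (1, 'Laptop X', 'Acme', 'computer'),
                              (2, 'Phone Y',  'Acme', 'phone');
INSERT INTO branch_dim VALUES (1, 'Main Street');
INSERT INTO sales_fact VALUES (1, 1, 1, 2, 2000.0),
                              (1, 2, 1, 5, 1500.0),
                              (2, 1, 1, 1, 1000.0);
""")

# Aggregate a measure from the fact table, grouped by a dimension attribute.
rows = conn.execute("""
    SELECT i.type, SUM(s.dollars_sold)
    FROM sales_fact AS s
    JOIN item_dim AS i ON s.item_key = i.item_key
    GROUP BY i.type
    ORDER BY i.type
""").fetchall()
print(rows)
```

The fact table itself holds only keys and measures; descriptive attributes such as brand or type live in the dimension tables and are reached through a join.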
Schema for multidimensional model
• Operational database: an entity-relationship schema is used
• Warehouse: uses a multidimensional model, and hence a star,
snowflake, or fact constellation schema
Star schema
Snowflake schema
Dimension: The role of concept hierarchies
• A concept hierarchy defines a sequence of mappings from a set of
low-level concepts to higher-level, more general concepts
• A concept hierarchy that is a total or partial order among attributes
in a database schema is called a schema hierarchy
• Concept hierarchies may also be defined by grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy
• There may be more than one concept hierarchy for a given attribute
or dimension, based on different user viewpoints.
• Concept hierarchies may be provided manually by system users,
domain experts, or knowledge engineers, or may be automatically
generated based on statistical analysis of the data distribution.
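The two kinds of concept hierarchy can be sketched as simple mappings. The city and province entries and the price thresholds below are illustrative assumptions, as are the function names.

```python
# Schema hierarchy for location: city < province_or_state < country.
# Each dictionary maps a low-level concept to the next, more general one.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
}


def city_to_country(city):
    """Climb two levels of the location hierarchy."""
    return PROVINCE_TO_COUNTRY[CITY_TO_PROVINCE[city]]


def price_group(price):
    """Set-grouping hierarchy: group raw price values into named ranges.
    The thresholds are hypothetical."""
    if price < 100:
        return "inexpensive"
    if price < 500:
        return "moderately priced"
    return "expensive"


print(city_to_country("Vancouver"))
print(price_group(250))
```

A different user viewpoint could define another grouping for the same attribute, for example different price ranges, which is why one attribute may have more than one hierarchy.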
Measures: categorisation
• A data cube measure is a numeric function that can be evaluated at
each point in the data cube space
• A measure value is computed for a given point by aggregating the
data corresponding to the respective dimension–value pairs defining
the given point
• Example of dimension–value pairs defining a point: time = “Q1”, location = “Vancouver”, item =
“computer”
• Measures can be organized into three categories—distributive,
algebraic, and holistic—based on the kind of aggregate functions used
• Distributive: An aggregate function is distributive if it can be
computed in a distributed manner. Suppose the data are partitioned
into n sets. We apply the function to each partition, resulting in n
aggregate values. If applying the function to these n aggregate values
gives the same result as applying it to the entire data set (without
partitioning), the function can be computed in a distributed manner.
• For example, sum() can be computed for a data cube by first
partitioning the cube into a set of subcubes, computing sum() for
each subcube, and then summing up the partial sums obtained for each
subcube. Hence, sum() is a distributive aggregate function
• count(), min(), and max() are also distributive aggregate functions
• Distributive measures can be computed efficiently because of the way
the computation can be partitioned.
• Algebraic: An aggregate function is algebraic if it can be computed by
an algebraic function with M arguments (where M is a bounded positive
integer), each of which is obtained by applying a distributive aggregate function.
• For example, avg() (average) can be computed by sum()/count(),
where both sum() and count() are distributive aggregate functions
• Holistic: An aggregate function is holistic if there does not exist an
algebraic function with M arguments (where M is a constant) that
characterizes the computation. Examples include median(), mode(), and rank().
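The three categories can be checked on a small data set. The numbers and the way they are partitioned are arbitrary choices for this sketch.

```python
from statistics import median

data = [4, 8, 15, 16, 23, 42, 7]
partitions = [data[:3], data[3:5], data[5:]]   # n = 3 partitions

# Distributive: combine per-partition results to get the global result.
partial_sums = [sum(p) for p in partitions]
total_sum = sum(partial_sums)                  # same as sum(data)
partial_counts = [len(p) for p in partitions]
total_count = sum(partial_counts)              # same as len(data)

# Algebraic: avg() is computed from M = 2 distributive results.
average = total_sum / total_count

# Holistic: median() cannot be rebuilt from a fixed number of
# per-partition summaries; the median of medians is generally wrong.
median_of_medians = median([median(p) for p in partitions])
true_median = median(data)

print(total_sum, total_count, average)
print(median_of_medians, true_median)
```

For this data the median of the partition medians differs from the true median, which is why a holistic measure needs access to the full data rather than to partial aggregates.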
OLAP operations
• The multidimensional organization of data allows the user to view data from different perspectives.
• Several OLAP data cube operations exist to materialize these
different views, allowing interactive querying and analysis of the data
at hand
• Hence, OLAP provides a user-friendly environment for interactive
data analysis
OLAP operations
Roll up
Drill down
Slice and dice
Pivot (rotate)
• Roll-up: The roll-up operation (also called the drill-up operation by
some vendors) performs aggregation on a data cube, either by
climbing up a concept hierarchy for a dimension or by dimension
reduction.
• For example, let the location hierarchy be defined as the total order “street < city < province
or state < country.” A roll-up on location aggregates the data
by ascending the hierarchy from the level of city to the level
of country. In other words, rather than grouping the data by city, the
resulting cube groups the data by country
• Drill-down: Drill-down is the reverse of roll-up.
• It navigates from less detailed data to more detailed data. Drill-down
can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions
• For example, let the concept hierarchy for time be defined as “day < month < quarter < year.”
A drill-down on time descends the hierarchy from the level of
quarter to the more detailed level of month
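Roll-up and drill-down can be sketched as re-aggregating the same base data at different hierarchy levels. The cell values, the mapping tables, and the helper `aggregate` are assumptions made for this illustration.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA"}
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Base cells at the most detailed level: (city, month) -> dollars sold.
base = {
    ("Vancouver", "Jan"): 100,
    ("Vancouver", "Feb"): 150,
    ("Toronto", "Jan"): 200,
    ("Chicago", "Apr"): 300,
}


def identity(value):
    return value


def aggregate(cells, map_location, map_time):
    """Group the base cells at the levels chosen by the two mappers."""
    result = defaultdict(int)
    for (city, month), dollars in cells.items():
        result[(map_location(city), map_time(month))] += dollars
    return dict(result)


# Starting view: grouped by city and quarter.
by_city_quarter = aggregate(base, identity, MONTH_TO_QUARTER.get)

# Roll-up on location: climb from city to country.
by_country_quarter = aggregate(base, CITY_TO_COUNTRY.get,
                               MONTH_TO_QUARTER.get)

# Drill-down on time: descend from quarter to month.
by_city_month = aggregate(base, identity, identity)

print(by_country_quarter)
print(by_city_month)
```

Note that the drill-down is computed from the detailed base data: the quarterly totals alone do not contain enough information to recover the monthly figures.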
• Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a sub-cube
• For example, a slice can be performed on the dimension "time" using the
criterion time = "Q1".
• This forms a new sub-cube containing only the cells that match
the criterion on that dimension.
• The dice operation defines a new sub-cube by performing a selection
on two or more dimensions of a given cube.
• For example, a dice based on the following
selection criteria involves three dimensions:
•(location = "Toronto" or "Vancouver")
•(time = "Q1" or "Q2")
•(item = "Mobile" or "Modem")
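A small sketch of slice and dice on a cube stored as a dictionary keyed by (location, time, item). The cell values and the function names `slice_cube` and `dice_cube` are hypothetical.

```python
# Cube cells: (location, time, item) -> units sold (made-up values).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q2", "Modem"): 20,
    ("Vancouver", "Q1", "Modem"): 30,
    ("Vancouver", "Q3", "Mobile"): 40,
    ("Chicago", "Q1", "Mobile"): 50,
    ("Toronto", "Q1", "Security"): 60,
}


def slice_cube(cells, time):
    """Slice: select on one dimension (time); the fixed dimension
    drops out of the keys of the resulting sub-cube."""
    return {(loc, item): v
            for (loc, t, item), v in cells.items() if t == time}


def dice_cube(cells, locations, times, items):
    """Dice: select on two or more dimensions at once."""
    return {key: v for key, v in cells.items()
            if key[0] in locations and key[1] in times and key[2] in items}


q1_slice = slice_cube(cube, "Q1")
diced = dice_cube(cube,
                  locations={"Toronto", "Vancouver"},
                  times={"Q1", "Q2"},
                  items={"Mobile", "Modem"})
print(q1_slice)
print(diced)
```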
• Pivot
• The pivot operation is also known as rotation. It rotates the
data axes in view in order to provide an alternative
presentation of data
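Pivot can be sketched as swapping the row and column axes of a two-dimensional view. The table contents and the helper name `pivot` are assumptions for illustration.

```python
# A 2-D view of the cube: rows are time, columns are item.
table = {
    "Q1": {"Mobile": 10, "Modem": 30},
    "Q2": {"Mobile": 5, "Modem": 20},
}


def pivot(view):
    """Rotate the axes: rows become columns and columns become rows.
    The cell values themselves are unchanged."""
    rotated = {}
    for row_key, row in view.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


rotated = pivot(table)   # rows are now item, columns are time
print(rotated)
```

Pivoting twice returns the original view, which reflects that the operation changes only the presentation and not the data.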
OLAP Systems versus Statistical Databases
• A statistical database (SDB) is a database system designed to support statistical
applications, typically socioeconomic applications
• OLAP is targeted at business intelligence and is designed to handle large amounts of data
• SDBs have truth issues
Star net model for querying
• The querying of multidimensional databases can be based on a star
net model, which consists of radial lines emerging from a central
point, where each line represents a concept hierarchy for a
dimension.
• Each abstraction level in the hierarchy is called a footprint. These
represent the granularities available for use by OLAP operations such
as drill-down and roll-up
• An example starnet consists of four radial lines, representing concept
hierarchies for the dimensions location, customer, item, and time,
respectively. Each line consists of footprints representing abstraction
levels of the dimension
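A starnet can be sketched as a mapping from each dimension to its ordered footprints. Only the location and time hierarchies given earlier in these notes are filled in; the customer and item lines are left out because their levels are not listed here. The function names are hypothetical.

```python
# Each radial line is a dimension; each entry on it is a footprint,
# ordered from the most detailed to the most general level.
STARNET = {
    "location": ["street", "city", "province_or_state", "country"],
    "time": ["day", "month", "quarter", "year"],
}


def roll_up(dimension, footprint):
    """Return the next more general footprint, or None at the top."""
    levels = STARNET[dimension]
    position = levels.index(footprint)
    return levels[position + 1] if position + 1 < len(levels) else None


def drill_down(dimension, footprint):
    """Return the next more detailed footprint, or None at the bottom."""
    levels = STARNET[dimension]
    position = levels.index(footprint)
    return levels[position - 1] if position > 0 else None


print(roll_up("location", "city"))
print(drill_down("time", "quarter"))
```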
