Advanced System Analysis and Design

Chapter 6: DATA WAREHOUSING


Trương Quốc Định
Faculty of Information Systems – College of Information & Communication Technology
What is Business Intelligence?
• BI refers to a set of tools and techniques that enable a company to transform its business data into timely and accurate information for the decision-making process
• This information is delivered to the right persons in the most suitable form
What is Business Intelligence?
• BI is different from Artificial Intelligence (AI)
• AI systems make decisions for the users
• BI systems help the users make the right decisions, based on available data
What is Business Intelligence?

BI and the Web


• The Web makes BI even more useful
• Customers do not appear “physically” in a store;
• A website log is used to capture the behavior of each customer;
• Combine web data with traditional customer data
What is Business Intelligence?

Data Analysis Problems


• The same data found in many different systems
• Heterogeneous sources
• Data are structured for operational systems, not for analysis
• Data are “volatile”
• Data change over time, but no historical information is kept
What is Business Intelligence?

Traditional database
• Focused on Online Transactional Processing (OLTP)
• Short, simple queries and frequent updates involving a relatively small
number of tuples.
• Transaction
• Single event that changes something
• Processing of transactions includes storage and editing of data
• When a transaction is completed, the records of the organization are changed
What is Business Intelligence?

Traditional database
• Used for transaction/function oriented applications
• Used by lower-level employees
• Quick updates and retrievals
• Many users accessing the same data
• Users are not technical persons
• Response time is very short
Traditional database

Hard/Infeasible Queries for OLTP


• Examples of business analysis queries
• In the past five years, which product has been the most profitable?
• On which public holiday do we have the largest sales?
• In which week do we have the largest sales?
• Do the sales of dairy products increase over time?
• These queries are difficult to express in SQL, hence the need for multidimensional modeling
Data Warehousing

Purpose and Definition


• Information organized in a unified data model
• Data collected from a number of different sources
• Support decision making
• Easy to perform advanced analysis
Data Warehousing

Solution
• New analysis environment
• Subject oriented (versus function oriented)
• Integrated (logically and physically)
• Time variant (data can always be related to time)
• Stable (data not deleted, several versions)
• Supporting management decisions (different organization)
Data Warehousing

Function vs. Subject Orientation


Data Warehousing

Multidimensional Modeling (sales of supermarkets)


• Facts and measures
• Each sales record is a fact,
and its sales value is a
measure
• Dimensions
• Group correlated attributes
into the same dimension
Data Warehousing

OLTP vs OLAP
Aspect | OLTP | OLAP
Focus | Data in | Data out
Source of data | Operational/transactional data | Data extracted from various operational data sources, transformed and loaded into the data warehouse
Purpose of data | Manage (control and execute) basic business tasks | Assists in planning, budgeting, forecasting and decision making
Data contents | Current data. Far too detailed, not suitable for decision making | Historical data. Has support for summarization and aggregation. Stores and manages data at various levels of granularity, thereby suitable for decision making
Inserts and updates | Very frequent updates and inserts | Periodic updates to refresh the data warehouse
Processing speed | Usually returns fast | Queries usually take a long time (several hours) to execute and return
Access | Field level access | Typically aggregated access to data of business interest
Data Warehousing

OLTP vs OLAP
Aspect | OLTP | OLAP
Database design | Typically normalized tables. OLTP system adopts E-R (Entity Relationship) Model | Typically de-normalized tables; uses star or snowflake schema
Operations | Read/Write | Mostly read
Backup and recovery | Regular backups of operational data are mandatory. Requires concurrency control (locking) and recovery mechanisms (logging) | Instead of regular backups, data warehouse is refreshed periodically using data from operational data sources
Joins | Many | Few
Derived data and aggregates | Rare | Common
Data structures | Complex | Multi-dimensional
Sample queries | Search & locate student(s); print student scores; filter students above 90% marks | Which courses have productivity impact on-the-job? How much training is needed on future technologies for non-linear growth in BI? Why consider investing in a DSS experience lab?
Data Warehousing

OLAP cube
• Data cube
• Useful data analysis tool
• Generalized GROUP BY queries
• Aggregate facts based on chosen
dimensions
• Why data cube?
• Good for visualization (plain-text results are hard to understand)
• Multidimensional, intuitive
• Support interactive OLAP operations
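
The idea of a data cube as a set of generalized GROUP BY queries can be sketched as follows. This is a minimal illustration, not a production OLAP engine: the sample rows, the dimension names and the `data_cube` helper are assumptions made for the example. It computes one aggregate per subset of the dimensions, from the grand total up to the full detail.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (product, store, month, sales value)
SALES = [
    ("milk", "north", "jan", 10),
    ("milk", "south", "jan", 5),
    ("bread", "north", "jan", 7),
    ("milk", "north", "feb", 3),
]
DIMENSIONS = ("product", "store", "month")


def data_cube(rows, dimensions=DIMENSIONS):
    """Aggregate the measure for every subset of the dimensions.

    Returns {grouping: {key: total}}, where grouping is a tuple of
    dimension names and key is the matching tuple of dimension values.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for positions in combinations(range(len(dimensions)), size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in positions)
                totals[key] += row[-1]  # the measure is the last field
            grouping = tuple(dimensions[i] for i in positions)
            cube[grouping] = dict(totals)
    return cube


cube = data_cube(SALES)
print(cube[("product",)])          # same as GROUP BY product
print(cube[("product", "store")])  # same as GROUP BY product, store
print(cube[()])                    # grand total, no grouping
```

With three dimensions the cube holds 2^3 = 8 groupings, which is why real systems precompute only a chosen subset of them.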
Data Warehousing

Extract, Transform, Load (ETL)


• Problems
• Data from different sources
• Data with different formats
• Handling of missing data and erroneous data
• Query performance of DW
• ETL
• Extract
• Transformations / cleansing
• Load
• The most time-consuming process in DW development
• 80% of development time spent on ETL
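
The three ETL steps can be sketched in a few lines. The two source extracts, their formats, the `sales` table and the rule of rejecting rows with a missing amount are all assumptions chosen for illustration; a real pipeline would apply whatever cleansing rules the business defines.

```python
import sqlite3
from datetime import datetime

# Extract: hypothetical records from two sources with different formats
SOURCE_A = [
    {"id": "1", "amount": "10.50", "date": "2024-01-05"},
    {"id": "2", "amount": "", "date": "2024-01-06"},  # missing amount
]
SOURCE_B = [
    {"id": "7", "amount": "3,25", "date": "06/01/2024"},  # comma decimal, DD/MM/YYYY
]


def transform(record, date_format, source):
    """Cleanse one record; return None when it has to be rejected."""
    amount_text = record["amount"].replace(",", ".")
    if not amount_text:
        return None  # assumed rule: rows without an amount are rejected
    day = datetime.strptime(record["date"], date_format).date().isoformat()
    return (f"{source}-{record['id']}", float(amount_text), day)


def run_etl(conn):
    """Load cleansed rows into the warehouse table; return the reject count."""
    conn.execute(
        "CREATE TABLE sales (sale_key TEXT PRIMARY KEY, amount REAL, sale_date TEXT)"
    )
    rejected = 0
    sources = (("A", SOURCE_A, "%Y-%m-%d"), ("B", SOURCE_B, "%d/%m/%Y"))
    for source, rows, date_format in sources:
        for record in rows:
            clean = transform(record, date_format, source)
            if clean is None:
                rejected += 1
                continue
            conn.execute("INSERT INTO sales VALUES (?, ?, ?)", clean)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
rejected = run_etl(conn)
loaded = conn.execute(
    "SELECT sale_key, amount, sale_date FROM sales ORDER BY sale_key"
).fetchall()
print(loaded, rejected)
```

Prefixing the key with the source name is one simple way to keep identifiers unique after integration.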
Data Warehousing

Performance Optimization
• Precompute some partial result in advance and store it.
• At query time, such partial result can be utilized to derive the final
result very fast
• Example
• 1 billion sales rows, 1000 products, 100 locations
• Original query
SELECT p.category, SUM(s.sales)
FROM Products p, Sales s
WHERE p.pid = s.pid
GROUP BY p.category
• Precomputed partial result
CREATE VIEW TotalSales (pid, locid, total) AS
SELECT s.pid, s.locid, SUM(s.sales)
FROM Sales s
GROUP BY s.pid, s.locid
• Query rewritten to use the partial result
SELECT p.category, SUM(t.total)
FROM Products p, TotalSales t
WHERE p.pid = t.pid
GROUP BY p.category
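
The precomputation idea can be checked on a toy dataset. The sketch below uses SQLite, which has no materialized views, so the partial result is stored with `CREATE TABLE ... AS` instead; that substitution and the sample rows are assumptions for the example. With 1000 products and 100 locations the stored result has at most 100,000 rows, versus 1 billion rows in `Sales`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales REAL);
INSERT INTO Products VALUES (1, 'dairy'), (2, 'dairy'), (3, 'bakery');
INSERT INTO Sales VALUES (1, 10, 5.0), (1, 10, 2.0), (1, 20, 1.0),
                         (2, 10, 4.0), (3, 20, 6.0), (3, 20, 3.0);
-- Partial result stored in advance, one row per (pid, locid)
CREATE TABLE TotalSales AS
    SELECT pid, locid, SUM(sales) AS total
    FROM Sales
    GROUP BY pid, locid;
""")

# Query against the detailed Sales table
direct = conn.execute("""
    SELECT p.category, SUM(s.sales)
    FROM Products p, Sales s
    WHERE p.pid = s.pid
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# Same question answered from the much smaller precomputed table
via_partial = conn.execute("""
    SELECT p.category, SUM(t.total)
    FROM Products p, TotalSales t
    WHERE p.pid = t.pid
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

partial_rows = conn.execute("SELECT COUNT(*) FROM TotalSales").fetchone()[0]
print(direct, via_partial, partial_rows)
```

Both queries return the same totals because SUM is additive: summing the per-(pid, locid) subtotals gives the same result as summing the detail rows.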
Data Warehouse Architecture

Central DW Architecture
• All data in one central DW; all client queries run directly on it
• Pros
• Simplicity
• Easy to manage
• Cons
• Bad performance due to no
redundancy/workload distribution
Data Warehouse Architecture

Federated DW Architecture
• Data stored in separate data marts,
aimed at special departments
• Logical DW (i.e., virtual)
• Data marts contain detail data
• Pros
• Performance due to distribution
• Cons
• More complex
Data Warehouse Architecture

Tiered Architecture
• Data is distributed to data marts in one or
more tiers.
• Only aggregated data in cube tiers.
• Data is aggregated/reduced as it moves
through tiers.
• Pros
• Best performance due to redundancy and
distribution
• Cons
• Most complex
• Hard to manage
Data Warehousing

Common Issues
• Metadata management
• Understanding the data requires metadata
• Need to know about:
• Data definitions, dataflow, transformations, versions, usage, security
• DW project management
• Data marts are smaller and “safer” (bottom up approach)
• Reasons for failure
• Lack of proper design methodologies
• Deployment problems (lack of training)
• Organizational change is hard… (new processes, data ownership,..)
• Ethical issues (security, privacy,…)
Multidimensional Databases

ER Model vs. Multidimensional Model

Entity Relationship Model
• All types of data are “equal”, difficult to identify the data that is important for business analysis
• No difference between:
• What is important
• What just describes the important
• Normalized databases spread information
• When analyzing data, the information must be integrated again

Multidimensional Model
• Its only purpose: data analysis
• It is not suitable for OLTP systems
• More built in “meaning”
• What is important
• What describes the important
• What we want to optimize
• Easy for query operations
Multidimensional Databases

Basic concepts
• Data is divided into:
• Facts
• Dimensions
• Facts are the important entity: a sale
• Facts have measures that can be aggregated: sales price
• Dimensions describe facts
• A sale has the dimensions Product, Store and Time
• Facts “live” in a multidimensional cube (dice)
• Goal for dimensional modelling:
• Surround facts with as much context (dimensions) as possible.
• Hint: redundancy may be ok.
Multidimensional Databases

Dimensions
• Dimensions have hierarchies with levels
• Typically 3-5 levels (of detail)
• Dimension values are organized in a tree structure
• Product: Product->Type->Category
• Store: Store->Area->City->County
• Time: Day->Month->Quarter->Year
• Levels may have attributes
• Simple, non-hierarchical information
• Day has Workday as attribute
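
A dimension hierarchy can be pictured as a function that maps each lowest-level value to its ancestor at a chosen level. The sketch below does this for the Time dimension (Day->Month->Quarter->Year). The helper names, the sample figures and the Monday-to-Friday definition of a workday are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def time_level(day, level):
    """Map a day to its ancestor at the given level of the Time hierarchy."""
    if level == "day":
        return day.isoformat()
    if level == "month":
        return f"{day.year}-{day.month:02d}"
    if level == "quarter":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if level == "year":
        return str(day.year)
    raise ValueError(f"unknown level: {level}")


def is_workday(day):
    """Level attribute of Day (assumes a Monday-to-Friday working week)."""
    return day.weekday() < 5


# Hypothetical daily sales totals
DAILY_SALES = {
    date(2024, 1, 15): 100,
    date(2024, 2, 3): 50,
    date(2024, 4, 1): 70,
    date(2025, 1, 2): 20,
}


def aggregate(daily_sales, level):
    """Sum the daily values at a higher level of the hierarchy."""
    totals = defaultdict(int)
    for day, sales in daily_sales.items():
        totals[time_level(day, level)] += sales
    return dict(totals)


print(aggregate(DAILY_SALES, "quarter"))
print(aggregate(DAILY_SALES, "year"))
```

Because every day belongs to exactly one month, quarter and year, the totals at each level add up to the same grand total.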
Multidimensional Databases

Types of Facts
• Additive (summative)
• Can be used with aggregation functions such as sum, average …
• Semi-additive (semi-summative)
• Only a small number of aggregation functions apply.
• Non-additive
• Cannot be used with numerical aggregation functions.
• Typical examples are ratios and percentages.
Multidimensional Databases

Types of Fact Table


• Transaction fact table
• A fact for every business transaction.
• Point of Sales (POS) system that records each sale.
Multidimensional Databases

Types of Fact Table


• Periodic Snapshot fact table
• A row summarizes many measurement events occurring over a standard
period, such as a day, a week, or a month.
• The grain is the period, not the individual transaction.
• If no activity takes place during the period, a row is typically inserted
in the fact table containing a zero or null for each fact.
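
Building a periodic snapshot from transactions can be sketched as below, with a day as the period. The transaction list and the `daily_snapshot` helper are hypothetical; the point is that a period without activity still gets a row, here with a zero.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical transaction facts: (day, sales amount)
TRANSACTIONS = [
    (date(2024, 3, 1), 20.0),
    (date(2024, 3, 1), 5.0),
    (date(2024, 3, 3), 12.5),
]


def daily_snapshot(transactions, start, end):
    """Return one (day, total) row per day from start to end inclusive."""
    totals = defaultdict(float)
    for day, amount in transactions:
        totals[day] += amount
    rows = []
    day = start
    while day <= end:
        # Days without any transaction still produce a row, with zero
        rows.append((day, totals.get(day, 0.0)))
        day += timedelta(days=1)
    return rows


snapshot = daily_snapshot(TRANSACTIONS, date(2024, 3, 1), date(2024, 3, 3))
for row in snapshot:
    print(row)
```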
Multidimensional Databases

Types of Fact Table


• Accumulating Snapshot fact table
• A row in an accumulating snapshot fact table summarizes the
measurement events occurring at predictable steps between the
beginning and the end of a process.
• Pipeline or workflow processes, such as claim processing, that have a
defined start point, standard intermediate steps, and defined end point
can be modelled with this type of fact table.
Multidimensional Databases

Types of Fact Table


• Accumulating Snapshot fact table (example)
• Each row in this table represents an order or a batch of orders.
• Each of these rows is expected to be updated multiple times as it proceeds through the order fulfilment pipeline.
• When a row is first created in this table, most of its dates start out as nulls and are filled in as time passes.
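
The life cycle of one accumulating snapshot row can be sketched as follows. The milestone names and helper functions are assumptions chosen for an order fulfilment example; the slide's own table may use different columns.

```python
from datetime import date

# Hypothetical milestones of an order fulfilment pipeline
MILESTONES = ("ordered", "paid", "shipped", "delivered")


def new_order_row(order_id, ordered_on):
    """Create the row; every milestone date except the first starts as None."""
    row = {"order_id": order_id}
    row.update({milestone: None for milestone in MILESTONES})
    row["ordered"] = ordered_on
    return row


def record_milestone(row, milestone, when):
    """Update the existing row in place when the order reaches a milestone."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row[milestone] = when
    return row


def days_to_ship(row):
    """Lag between two milestones; None while the order has not shipped."""
    if row["shipped"] is None:
        return None
    return (row["shipped"] - row["ordered"]).days


row = new_order_row(42, date(2024, 5, 1))
print(days_to_ship(row))  # the order has not shipped yet
record_milestone(row, "paid", date(2024, 5, 2))
record_milestone(row, "shipped", date(2024, 5, 4))
print(row, days_to_ship(row))
```

Unlike a transaction fact table, no new row is inserted for each event: the same row is revisited until the process ends.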
Multidimensional Databases

Measures
• Measures represent the fact property that the users want to study
and optimize
• Example: total sales price
• A measure has two components
• Numerical value: (sales price)
• Aggregation formula (SUM): used for aggregating/combining a
number of measure values into one.
Multidimensional Databases

Types of Measures
• Additive
• Can be aggregated over all dimensions
• Example: sales price
• Often occur in event facts
• Semi-additive
• Cannot be aggregated over some dimensions - typically time
• Example: inventory
• Often occur in snapshot facts
• Non-additive
• Cannot be aggregated over any dimensions
• Example: average sales price
• Occur in all types of facts
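
The three kinds of measures behave differently under aggregation, which the following sketch illustrates with made-up numbers. The sample rows and the choice of "closing stock" as the aggregate over time are assumptions for the example.

```python
# Hypothetical facts
SALES = [  # (store, units_sold, revenue)
    ("north", 2, 20.0),
    ("south", 8, 40.0),
]
INVENTORY = [  # (day, store, units_in_stock), one snapshot per day
    (1, "north", 10), (1, "south", 4),
    (2, "north", 8), (2, "south", 7),
]

# Additive: revenue can be summed over any dimension
total_revenue = sum(revenue for _, _, revenue in SALES)


def stock_on(day):
    """Semi-additive: summing stock over stores for one day is meaningful."""
    return sum(units for d, _, units in INVENTORY if d == day)


def closing_stock():
    """Over time, use the last snapshot instead of a sum."""
    last_day = max(day for day, _, _ in INVENTORY)
    return stock_on(last_day)


# Summing stock over days counts the same goods twice
naive_sum_over_time = sum(units for _, _, units in INVENTORY)

# Non-additive: an average price cannot be combined from other averages
average_of_averages = sum(rev / units for _, units, rev in SALES) / len(SALES)
true_average_price = total_revenue / sum(units for _, units, _ in SALES)

print(total_revenue, closing_stock(), naive_sum_over_time)
print(average_of_averages, true_average_price)
```

The usual remedy for a non-additive measure is to store its additive components (here revenue and units) and compute the ratio after aggregation.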
The Complete Decision Support System
OLAP Server

ROLAP
• Relational OLAP
• Data stored in relational tables
• Star (or snowflake) schemas used for modelling
• SQL used for querying
• Pros
• Leverages investments in relational technology
• Scalable (billions of facts)
• Flexible, designs easier to change
• New, performance enhancing techniques adapted from MOLAP
• Cons
• Storage use (often 3-4 times MOLAP)
• Response times
OLAP Server

MOLAP
• Multidimensional OLAP
• Data stored in special multidimensional data structures
• E.g., multidimensional array on hard disk
• Pros
• Less storage use (“foreign keys” not stored)
• Faster query response times
• Cons
• Scalability has so far been limited
• Less flexible, e.g., cube must be re-computed when design changes
• Does not reuse an existing investment
OLAP Server

HOLAP
• Hybrid OLAP
• Detail data stored in relational tables (ROLAP)
• Aggregates stored in multidimensional structures (MOLAP)
• Pros
• Scalable (as ROLAP)
• Fast (as MOLAP)
• Cons
• High complexity
Data warehouse

Relational Implementation
• Goal for dimensional modelling: surround the facts with as much
context (dimensions) as we can
• Granularity of the fact table is important
• What does one fact table row represent?
• Important for the size of the fact table
• Often corresponding to a single business transaction (sale)
• But it can be aggregated (sales per product per day per store)
• Some properties
• Many-to-one relationship from fact to dimension
• Many-to-one relationships from lower to higher levels in the hierarchies
Relational Implementation

Classic Star Schema


• A single fact table, with detail and
summary data
• Fact table primary key has only one
key column per dimension
• Each key is generated
• Each dimension is a single table,
highly denormalized
• A level indicator is needed whenever aggregates are stored together with detail facts, so that the two can be told apart.
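
A star schema can be sketched in SQLite as one fact table surrounded by denormalized dimension tables. The table and column names and the sample rows are assumptions for illustration, and this simplified version stores detail facts only, so it needs no level indicator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each dimension is a single denormalized table with a generated key
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, type TEXT, category TEXT);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY,
                       day TEXT, month TEXT, year INTEGER);

-- The fact table key has one key column per dimension
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key INTEGER REFERENCES dim_store(store_key),
    time_key INTEGER REFERENCES dim_time(time_key),
    sales_price REAL,
    PRIMARY KEY (product_key, store_key, time_key)
);

INSERT INTO dim_product VALUES (1, 'Milk', 'Fresh', 'Dairy'),
                               (2, 'Cheese', 'Aged', 'Dairy'),
                               (3, 'Bread', 'Loaf', 'Bakery');
INSERT INTO dim_store VALUES (1, 'S1', 'Can Tho'), (2, 'S2', 'Hanoi');
INSERT INTO dim_time VALUES (1, '2024-01-05', '2024-01', 2024),
                            (2, '2025-01-05', '2025-01', 2025);
INSERT INTO fact_sales VALUES (1, 1, 1, 10.0), (2, 1, 1, 5.0),
                              (3, 2, 1, 4.0), (1, 2, 2, 8.0);
""")

# A typical star join: one join per dimension used, then GROUP BY
sales_by_category_year = conn.execute("""
    SELECT p.category, t.year, SUM(f.sales_price)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_time t ON t.time_key = f.time_key
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
print(sales_by_category_year)
```

Every join goes from the fact table to a dimension along a many-to-one relationship, so no fact row is counted twice.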
Relational Implementation

The “Fact Constellation” Schema


• In the Fact Constellation schema, aggregate tables are created separately from the detail table
• Therefore it is impossible to pick up detail rows when querying an aggregate table.
Relational Implementation

Snowflake Schema
• Normalize the dimension tables by attribute level, with each smaller
dimension table pointing to an appropriate aggregated fact table.
Multi-Dimensional On-Line Analytical Processing

The MOLAP Cube


• Fact table view vs Multi-dimensional cube
Multi-Dimensional On-Line Analytical Processing

The MOLAP Cube


• Multi-dimensional cube
Cube operations

Roll-up
• Also known as drill-up or aggregation operation
• Performs aggregation on a data cube
• by climbing up a concept hierarchy for a dimension, or
• by dimension reduction.
• Roll-up is like zooming out on the data cube.
Cube operations

Drill-down
• Also called roll-down; it is the reverse operation of roll-up.
• Drill-down is like zooming in on the data cube.
• Navigates from less detailed data to more detailed data.
• Drill-down can be performed by
either
• Stepping down a concept hierarchy
for a dimension
• Or adding additional dimensions.
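
Roll-up and drill-down can be sketched on a small two-dimensional cube. The city-to-country hierarchy, the sample cells and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the location dimension
CITY_TO_COUNTRY = {"Hanoi": "Vietnam", "Can Tho": "Vietnam", "Paris": "France"}

# Cube cells at the detailed level: (city, quarter) -> sales
CELLS = {
    ("Hanoi", "Q1"): 10,
    ("Hanoi", "Q2"): 12,
    ("Can Tho", "Q1"): 5,
    ("Paris", "Q1"): 7,
}


def roll_up_location(cells):
    """Roll-up by climbing the hierarchy: city -> country."""
    totals = defaultdict(int)
    for (city, quarter), sales in cells.items():
        totals[(CITY_TO_COUNTRY[city], quarter)] += sales
    return dict(totals)


def roll_up_drop_quarter(cells):
    """Roll-up by dimension reduction: the quarter dimension is removed."""
    totals = defaultdict(int)
    for (city, _quarter), sales in cells.items():
        totals[city] += sales
    return dict(totals)


def drill_down(cells, country, quarter):
    """Drill-down: show the city-level cells behind one country-level cell."""
    return {
        city: sales
        for (city, q), sales in cells.items()
        if q == quarter and CITY_TO_COUNTRY[city] == country
    }


by_country = roll_up_location(CELLS)
by_city = roll_up_drop_quarter(CELLS)
detail = drill_down(CELLS, "Vietnam", "Q1")
print(by_country, by_city, detail)
```

Drilling down on a cell returns exactly the detailed cells whose values add up to the rolled-up value.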
Cube operations

Slicing
• A slice is a subset of the cube
• corresponding to a single value for one or more dimensions.
Cube operations

Dicing
• The dice operation defines a subcube by performing a selection on two or more dimensions.
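
Slicing and dicing can be sketched on a cube stored as a dictionary keyed by dimension values. The axes, the sample cells and the two helper functions are assumptions for illustration.

```python
AXES = ("product", "store", "quarter")

# Hypothetical cube cells: (product, store, quarter) -> sales
CUBE = {
    ("milk", "north", "Q1"): 10,
    ("milk", "south", "Q1"): 5,
    ("bread", "north", "Q1"): 7,
    ("milk", "north", "Q2"): 3,
    ("bread", "south", "Q2"): 4,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value; that dimension disappears."""
    position = AXES.index(axis)
    return {
        key[:position] + key[position + 1:]: total
        for key, total in cube.items()
        if key[position] == value
    }


def dice_cube(cube, selections):
    """Keep the cells whose values fall in the allowed set on each axis."""
    allowed = {AXES.index(axis): set(values) for axis, values in selections.items()}
    return {
        key: total
        for key, total in cube.items()
        if all(key[position] in values for position, values in allowed.items())
    }


q1_slice = slice_cube(CUBE, "quarter", "Q1")
subcube = dice_cube(CUBE, {"product": {"milk", "bread"}, "quarter": {"Q2"}})
print(q1_slice)
print(subcube)
```

The slice has one dimension fewer than the cube, while the dice keeps all dimensions and only narrows their ranges.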
Cube operations

Pivot
• The pivot operation is also called a rotation.
• Pivot is a visualization operation that rotates the data axes in view to provide an alternative presentation of the data.
• It may involve swapping the rows and columns, or moving one of the row dimensions into the column dimensions.
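
The simplest pivot, swapping rows and columns, can be sketched on a two-dimensional view held as nested dictionaries. The sample table and the `pivot` helper are assumptions for illustration.

```python
# Hypothetical 2-D view of a cube: rows are products, columns are quarters
TABLE = {
    "milk": {"Q1": 15, "Q2": 3},
    "bread": {"Q1": 7, "Q2": 4},
}


def pivot(table):
    """Swap the row and column axes; the cell values stay unchanged."""
    rotated = {}
    for row_key, columns in table.items():
        for column_key, value in columns.items():
            rotated.setdefault(column_key, {})[row_key] = value
    return rotated


by_quarter = pivot(TABLE)
print(by_quarter)
```

Pivoting changes only the presentation: applying it twice gives back the original view.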
