Chapter6_DataWareHousing_final
Traditional database
• Focused on Online Transactional Processing (OLTP)
• Short, simple queries and frequent updates involving a relatively small
number of tuples.
• Transaction
• Single event that changes something
• Processing of transactions includes storage and editing of data
• When a transaction is completed, the records of the organization are changed
What is Business Intelligence?
Traditional database
• Used for transaction/function oriented applications
• Used by lower-level (operational) employees
• Quick updates and retrievals
• Many users accessing the same data
• Users are typically non-technical
• Response times are very short
Traditional database
Solution
• New analysis environment
• Subject oriented (versus function oriented)
• Integrated (logically and physically)
• Time variant (data can always be related to time)
• Stable, i.e., non-volatile (data is not deleted; several versions are kept)
• Supporting management decisions (different organization)
Data Warehousing
OLTP vs OLAP
Focus
• OLTP: Data in
• OLAP: Data out
Source of data
• OLTP: Operational/transactional data
• OLAP: Data extracted from various operational data sources, transformed and loaded into the data warehouse
Purpose of data
• OLTP: Manage (control and execute) basic business tasks
• OLAP: Assist in planning, budgeting, forecasting and decision making
Data contents
• OLTP: Current data. Far too detailed; not suitable for decision making
• OLAP: Historical data. Supports summarization and aggregation. Stores and manages data at various levels of granularity, and is therefore suitable for decision making
Inserts and updates
• OLTP: Very frequent updates and inserts
• OLAP: Periodic updates to refresh the data warehouse
Processing speed
• OLTP: Queries usually return fast
• OLAP: Complex queries can take a long time (up to several hours) to execute and return
Access
• OLTP: Field-level access
• OLAP: Typically aggregated access to data of business interest
Data Warehousing
OLTP vs OLAP
Database design
• OLTP: Typically normalized tables. OLTP systems adopt the E-R (Entity-Relationship) model
• OLAP: Typically de-normalized tables; uses a star or snowflake schema
Operations
• OLTP: Read/write
• OLAP: Mostly read
Backup and recovery
• OLTP: Regular backups of operational data are mandatory. Requires concurrency control (locking) and recovery mechanisms (logging)
• OLAP: Instead of regular backups, the data warehouse is refreshed periodically using data from operational data sources
Joins
• OLTP: Many
• OLAP: Few
Derived data and aggregates
• OLTP: Rare
• OLAP: Common
Data structures
• OLTP: Complex
• OLAP: Multi-dimensional
Sample queries
• OLTP: Search and locate student(s); print student scores; filter students with marks above 90%
• OLAP: Which courses have a productivity impact on the job? How much training is needed on future technologies for non-linear growth in BI? Why consider investing in a DSS experience lab?
Data Warehousing
OLAP cube
• Data cube
• Useful data analysis tool
• Generalized GROUP BY queries
• Aggregate facts based on chosen
dimensions
• Why data cube?
• Good for visualization (plain text results are
hard to understand)
• Multidimensional, intuitive
• Support interactive OLAP operations
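The idea of a data cube as a generalized GROUP BY can be sketched in a few lines of Python. This is only an illustration under assumed toy data: the rows, the dimension names and the helper `data_cube` are hypothetical, and a real OLAP engine would not enumerate groupings this naively.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy fact rows: (product, store, year, sales)
SALES = [
    ("milk", "north", 2023, 10),
    ("milk", "south", 2023, 5),
    ("bread", "north", 2023, 7),
    ("bread", "north", 2024, 3),
]
DIMENSIONS = ("product", "store", "year")


def data_cube(rows, dimensions):
    """Return {grouping: {key: total}} for every subset of the dimensions."""
    cube = {}
    indices = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indices, size):
            totals = defaultdict(int)
            for row in rows:
                # Group key = the row's values for the chosen dimensions.
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]  # last column is the measure
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = data_cube(SALES, DIMENSIONS)
print(cube[("product",)])  # one GROUP BY among the 2**3 groupings
```

Each entry of `cube` corresponds to one GROUP BY over a subset of the dimensions; the empty grouping `()` holds the grand total.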
Data Warehousing
Performance Optimization
• Precompute some partial result in advance and store it.
• At query time, the stored partial result can be used to derive the final
result much faster
• Example
• 1 billion sales rows, 1000 products, 100 locations
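• Query on the base data (reads all 1 billion Sales rows):
SELECT p.category, SUM(s.sales)
FROM Products p, Sales s
WHERE p.pid = s.pid
GROUP BY p.category
• Precomputed partial result, to be stored as a materialized view (at most 1000 × 100 = 100,000 rows):
CREATE VIEW TotalSales (pid, locid, total) AS
SELECT s.pid, s.locid, SUM(s.sales)
FROM Sales s
GROUP BY s.pid, s.locid
• Same query rewritten to use the precomputed result:
SELECT p.category, SUM(t.total)
FROM Products p, TotalSales t
WHERE p.pid = t.pid
GROUP BY p.category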
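The precomputation idea can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the handful of rows is made up, and because SQLite has no materialized views the partial result is stored with CREATE TABLE ... AS instead of CREATE VIEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE Sales (pid INTEGER, locid INTEGER, sales INTEGER);
    INSERT INTO Products VALUES (1, 'dairy'), (2, 'dairy'), (3, 'bakery');
    INSERT INTO Sales VALUES
        (1, 10, 5), (1, 10, 7), (1, 20, 1),
        (2, 10, 4), (3, 20, 6), (3, 20, 2);
    -- Precompute the partial result once and store it as a real table
    -- (a plain view would be re-evaluated on every query).
    CREATE TABLE TotalSales AS
        SELECT pid, locid, SUM(sales) AS total
        FROM Sales
        GROUP BY pid, locid;
""")

# Answer computed directly from the detailed Sales rows.
direct = conn.execute("""
    SELECT p.category, SUM(s.sales)
    FROM Products p JOIN Sales s ON p.pid = s.pid
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# Same answer derived from the much smaller precomputed table.
via_summary = conn.execute("""
    SELECT p.category, SUM(t.total)
    FROM Products p JOIN TotalSales t ON p.pid = t.pid
    GROUP BY p.category ORDER BY p.category
""").fetchall()

summary_rows = conn.execute("SELECT COUNT(*) FROM TotalSales").fetchone()[0]
print(direct, via_summary, summary_rows)
```

Both queries return the same totals because SUM can be computed from partial sums; the stored table must be refreshed whenever Sales changes.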
Data Warehouse Architecture
Central DW Architecture
• All data is stored in one central DW; all client
queries run directly against it
• Pros
• Simplicity
• Easy to manage
• Cons
• Poor performance, since there is no
redundancy or workload distribution
Data Warehouse Architecture
Federated DW Architecture
• Data is stored in separate data marts,
aimed at specific departments
• Logical DW (i.e., virtual)
• Data marts contain detail data
• Pros
• Performance due to distribution
• Cons
• More complex
Data Warehouse Architecture
Tiered Architecture
• Data is distributed to data marts in one or
more tiers.
• Only aggregated data in cube tiers.
• Data is aggregated/reduced as it moves
through tiers.
• Pros
• Best performance due to redundancy and
distribution
• Cons
• Most complex
• Hard to manage
Data Warehousing
Common Issues
• Metadata management
• Understanding the data requires metadata
• Need to know about:
• Data definitions, dataflow, transformations, versions, usage, security
• DW project management
• Data marts are smaller and “safer” (bottom up approach)
• Reasons for failure
• Lack of proper design methodologies
• Deployment problems (lack of training)
• Organizational change is hard (new processes, data ownership, …)
• Ethical issues (security, privacy, …)
Multidimensional Databases
Basic concepts
• Data is divided into:
• Facts
• Dimensions
• Facts are the important entities, e.g., a sale
• Facts have measures that can be aggregated: sales price
• Dimensions describe facts
• A sale has the dimensions Product, Store and Time
• Facts “live” in a multidimensional cube (dice)
• Goal for dimensional modelling:
• Surround facts with as much context (dimensions) as possible.
• Hint: redundancy may be ok.
Multidimensional Databases
Dimensions
• Dimensions have hierarchies with levels
• Typically 3-5 levels (of detail)
• Dimension values are organized in a tree structure
• Product: Product->Type->Category
• Store: Store->Area->City->County
• Time: Day->Month->Quarter->Year
• Levels may have attributes
• Simple, non-hierarchical information
• Day has Workday as attribute
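As an illustration of levels and level attributes, the Time hierarchy above can be derived from a single Day value. The function name `time_levels`, the label formats and the rule that Monday to Friday count as workdays are assumptions made for this sketch.

```python
from datetime import date


def time_levels(day: date) -> dict:
    """Map a Day value to its ancestors in Day -> Month -> Quarter -> Year."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "day": day.isoformat(),
        "month": f"{day.year}-{day.month:02d}",
        "quarter": f"{day.year}-Q{quarter}",
        "year": str(day.year),
        # Level attribute, not a hierarchy level:
        # Monday..Friday are treated as workdays (assumption).
        "workday": day.weekday() < 5,
    }


levels = time_levels(date(2024, 5, 18))
print(levels)
```

Every day maps to exactly one month, quarter and year, which is what makes the dimension values a tree.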
Multidimensional Databases
Types of Facts
• Additive (summative)
• Can be used with aggregation functions such as sum, average, …
• Semi-additive (semi-summative)
• Only a small number of aggregation functions
apply.
• Non-additive
• Numerical aggregation functions cannot be applied meaningfully.
• Typically a ratio or percentage.
Multidimensional Databases
Measures
• Measures represent the fact property that the users want to study
and optimize
• Example: total sales price
• A measure has two components
• Numerical value: (sales price)
• Aggregation formula (SUM): used for aggregating/combining a
number of measure values into one.
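The two components of a measure can be modelled directly. This is a minimal sketch; the class name `Measure` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Measure:
    name: str  # which numerical fact property is studied
    aggregate: Callable[[Iterable[float]], float]  # aggregation formula


# "Total sales price": numerical value = sales price, formula = SUM.
total_sales_price = Measure("sales_price", sum)

values = [19.5, 5.0, 0.5]  # made-up sales prices of three facts
combined = total_sales_price.aggregate(values)
print(combined)
```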
Multidimensional Databases
Types of Measures
• Additive
• Can be aggregated over all dimensions
• Example: sales price
• Often occur in event facts
• Semi-additive
• Cannot be aggregated over some dimensions - typically time
• Example: inventory
• Often occur in snapshot facts
• Non-additive
• Cannot be aggregated over any dimensions
• Example: average sales price
• Occur in all types of facts
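A small numeric sketch shows why the distinction matters. All numbers are made up; the inventory rows stand in for snapshot facts and the price lists for event facts.

```python
# Hypothetical snapshot facts: (day, store, units_in_stock)
INVENTORY = [
    ("mon", "north", 10), ("mon", "south", 4),
    ("tue", "north", 8), ("tue", "south", 6),
]

# Semi-additive: summing stock across stores for one day is meaningful...
stock_on_monday = sum(u for d, _, u in INVENTORY if d == "mon")

# ...but summing across days counts the same goods twice;
# one alternative is to take the latest snapshot instead.
north_summed_over_time = sum(u for _, s, u in INVENTORY if s == "north")
north_latest = [u for _, s, u in INVENTORY if s == "north"][-1]

# Non-additive: average sales prices cannot be combined directly.
PRICES = {"north": [10.0, 20.0, 30.0], "south": [50.0]}
avg_per_store = {s: sum(p) / len(p) for s, p in PRICES.items()}
avg_of_avgs = sum(avg_per_store.values()) / len(avg_per_store)
all_prices = [p for prices in PRICES.values() for p in prices]
true_avg = sum(all_prices) / len(all_prices)

print(stock_on_monday, north_summed_over_time, north_latest)
print(avg_of_avgs, true_avg)
```

A common workaround for non-additive measures is to store the additive parts (here: sum and count) and compute the ratio at query time.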
The Complete Decision Support System
OLAP Server
ROLAP
• Relational OLAP
• Data stored in relational tables
• Star (or snowflake) schemas used for modelling
• SQL used for querying
• Pros
• Leverages investments in relational technology
• Scalable (billions of facts)
• Flexible, designs easier to change
• New, performance enhancing techniques adapted from MOLAP
• Cons
• Storage use (often 3-4 times MOLAP)
• Response times
OLAP Server
MOLAP
• Multidimensional OLAP
• Data stored in special multidimensional data structures
• E.g., multidimensional array on hard disk
• Pros
• Less storage use (“foreign keys” not stored)
• Faster query response times
• Cons
• Historically, limited scalability
• Less flexible, e.g., cube must be re-computed when design changes
• Does not reuse existing investments in relational technology
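Why a multidimensional array needs no stored keys can be sketched as follows: the position of a cell encodes its dimension values. The member lists and the in-memory, row-major layout are assumptions for illustration; real MOLAP servers use more elaborate (often compressed, disk-based) structures.

```python
# Hypothetical dimension members; their list positions act as coordinates.
PRODUCTS = ["milk", "bread"]
STORES = ["Oslo", "Bergen", "Tromso"]

# Dense 2-D cube stored as one flat array in row-major order.
cells = [0.0] * (len(PRODUCTS) * len(STORES))


def offset(product: str, store: str) -> int:
    """Compute the array position from the dimension coordinates."""
    return PRODUCTS.index(product) * len(STORES) + STORES.index(store)


def add_sale(product: str, store: str, price: float) -> None:
    cells[offset(product, store)] += price


add_sale("milk", "Bergen", 2.5)
add_sale("bread", "Tromso", 3.0)
add_sale("milk", "Bergen", 1.5)

# All cells of one product are contiguous, so its total is one slice sum.
milk_total = sum(cells[0:len(STORES)])
print(cells, milk_total)
```

The same layout also hints at the flexibility cost: adding a store changes every offset, so the array has to be rebuilt.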
OLAP Server
HOLAP
• Hybrid OLAP
• Detail data stored in relational tables (ROLAP)
• Aggregates stored in multidimensional structures (MOLAP)
• Pros
• Scalable (as ROLAP)
• Fast (as MOLAP)
• Cons
• High complexity
Data warehouse
Relational Implementation
• Goal for dimensional modelling: surround the facts with as much
context (dimensions) as we can
• Granularity of the fact table is important
• What does one fact table row represent?
• Important for the size of the fact table
• Often corresponding to a single business transaction (sale)
• But it can be aggregated (sales per product per day per store)
• Some properties
• Many-to-one relationship from fact to dimension
• Many-to-one relationships from lower to higher levels in the hierarchies
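A relational (star-style) layout with one fact table and its dimension tables might look like the sketch below, run through Python's sqlite3 module. Table names, column names and rows are hypothetical; the grain chosen here is one fact row per sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per product / store / day.
    CREATE TABLE ProductDim (pid INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE StoreDim (sid INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE TimeDim (tid INTEGER PRIMARY KEY, day TEXT, year INTEGER);
    -- Fact table: many fact rows reference one row of each dimension.
    CREATE TABLE Sale (
        pid INTEGER REFERENCES ProductDim(pid),
        sid INTEGER REFERENCES StoreDim(sid),
        tid INTEGER REFERENCES TimeDim(tid),
        sales_price REAL
    );
    INSERT INTO ProductDim VALUES (1, 'milk', 'dairy'), (2, 'bread', 'bakery');
    INSERT INTO StoreDim VALUES (1, 'Oslo'), (2, 'Bergen');
    INSERT INTO TimeDim VALUES (1, '2024-01-01', 2024), (2, '2024-01-02', 2024);
    INSERT INTO Sale VALUES
        (1, 1, 1, 2.0), (1, 2, 1, 2.5), (2, 1, 2, 3.0), (1, 1, 2, 2.0);
""")

# Aggregate the measure using context taken from two dimensions.
per_category_city = conn.execute("""
    SELECT p.category, s.city, SUM(f.sales_price)
    FROM Sale f
    JOIN ProductDim p ON p.pid = f.pid
    JOIN StoreDim s ON s.sid = f.sid
    GROUP BY p.category, s.city
    ORDER BY p.category, s.city
""").fetchall()
print(per_category_city)
```

Choosing a coarser grain (for example sales per product per day per store) would shrink the fact table but lose the individual transactions.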
Relational Implementation
Snowflake Schema
• Normalize the dimension tables by attribute level: each hierarchy level gets its
own smaller dimension table, and each smaller dimension table points to an
appropriate aggregated fact table.
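The normalization step can be sketched with sqlite3 for a Product dimension split by level (Product -> Type -> Category). All table names, column names and rows are hypothetical, and only the normalized dimension is shown, not the aggregated fact tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table per hierarchy level; each level references its parent level.
    CREATE TABLE ProductCategory (cid INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE ProductType (
        tid INTEGER PRIMARY KEY, type_name TEXT,
        cid INTEGER REFERENCES ProductCategory(cid)
    );
    CREATE TABLE Product (
        pid INTEGER PRIMARY KEY, name TEXT,
        tid INTEGER REFERENCES ProductType(tid)
    );
    CREATE TABLE Sale (pid INTEGER REFERENCES Product(pid), sales_price REAL);
    INSERT INTO ProductCategory VALUES (1, 'food'), (2, 'drink');
    INSERT INTO ProductType VALUES (1, 'dairy', 1), (2, 'bakery', 1), (3, 'juice', 2);
    INSERT INTO Product VALUES (1, 'cheese', 1), (2, 'bread', 2), (3, 'orange juice', 3);
    INSERT INTO Sale VALUES (1, 4.0), (2, 3.0), (3, 2.0), (3, 2.0);
""")

# Reaching the Category level now takes one join per hierarchy level.
per_category = conn.execute("""
    SELECT c.category, SUM(f.sales_price)
    FROM Sale f
    JOIN Product p ON p.pid = f.pid
    JOIN ProductType t ON t.tid = p.tid
    JOIN ProductCategory c ON c.cid = t.cid
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(per_category)
```

Compared with a single de-normalized Product table, the category name is stored once instead of once per product, at the price of extra joins.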
Multi-Dimensional On-Line Analytical Processing
Roll-up
• Also known as drill-up or
aggregation
• Performs aggregation on a data
cube, either
• by climbing up a concept
hierarchy for a dimension, or
• by dimension reduction.
• Roll-up is like zooming-out on the
data cubes.
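Both ways of rolling up can be sketched on a tiny cube held in a dictionary: moving one level up in the Time hierarchy (Month to Quarter) and removing a dimension altogether. The cube contents and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, month) -> sales
CUBE = {
    ("milk", "Oslo", "2024-01"): 4,
    ("milk", "Oslo", "2024-02"): 6,
    ("milk", "Bergen", "2024-01"): 1,
    ("milk", "Bergen", "2024-04"): 5,
    ("bread", "Oslo", "2024-04"): 3,
}


def month_to_quarter(month: str) -> str:
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def roll_up_time(cube):
    """Replace each month by its quarter and sum the merged cells."""
    out = defaultdict(int)
    for (product, city, month), sales in cube.items():
        out[(product, city, month_to_quarter(month))] += sales
    return dict(out)


def drop_dimension(cube, axis):
    """Roll up by dimension reduction: aggregate one dimension away."""
    out = defaultdict(int)
    for key, sales in cube.items():
        out[key[:axis] + key[axis + 1:]] += sales
    return dict(out)


by_quarter = roll_up_time(CUBE)
without_city = drop_dimension(CUBE, axis=1)
print(by_quarter)
print(without_city)
```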
Cube operations
Drill-down
• Also called roll-down; it is the
reverse operation of roll-up.
• Drill-down is like zooming in on
the data cube.
• Navigates from less detailed data to
more detailed data.
• Drill-down can be performed by
either
• Stepping down a concept hierarchy
for a dimension
• Or adding additional dimensions.
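The two ways of drilling down can be sketched by re-aggregating the detail facts at a finer level, since the extra detail cannot be recovered from the coarse totals themselves. Facts and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail facts: (product, month, sales)
FACTS = [
    ("milk", "2024-01", 4),
    ("milk", "2024-02", 6),
    ("bread", "2024-02", 3),
]


def aggregate(facts, key_fn):
    """Sum sales per group, where key_fn picks the grouping level."""
    out = defaultdict(int)
    for product, month, sales in facts:
        out[key_fn(product, month)] += sales
    return dict(out)


# Coarse starting view: totals per year.
by_year = aggregate(FACTS, lambda product, month: (month[:4],))
# Drill down by stepping down the Time hierarchy: Year -> Month.
by_month = aggregate(FACTS, lambda product, month: (month,))
# Drill down further by adding the Product dimension.
by_month_product = aggregate(FACTS, lambda product, month: (month, product))

print(by_year, by_month, by_month_product)
```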
Cube operations
Slicing
• A slice is a subset of the cube
• corresponding to a single value for one (or
more) of the dimensions.
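A slice on a dictionary-based cube can be sketched as a filter on one coordinate, after which that coordinate is dropped from the keys. Cube contents and the helper name `slice_cube` are hypothetical.

```python
# Hypothetical cube cells: (product, city, quarter) -> sales
CUBE = {
    ("milk", "Oslo", "Q1"): 10,
    ("milk", "Bergen", "Q1"): 1,
    ("bread", "Oslo", "Q2"): 3,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and remove it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: sales
        for key, sales in cube.items()
        if key[axis] == value
    }


oslo = slice_cube(CUBE, axis=1, value="Oslo")
print(oslo)  # a 2-D view: (product, quarter) -> sales
```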
Cube operations
Dicing
• The dice operation defines a subcube
by performing a selection on two or more
dimensions.
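A dice can be sketched as a selection with one allowed set per dimension; unlike a slice, the result keeps all dimensions. Cube contents and the helper name `dice` are hypothetical.

```python
# Hypothetical cube cells: (product, city, quarter) -> sales
CUBE = {
    ("milk", "Oslo", "Q1"): 10,
    ("milk", "Bergen", "Q2"): 5,
    ("bread", "Oslo", "Q1"): 3,
    ("juice", "Oslo", "Q3"): 2,
    ("milk", "Oslo", "Q3"): 7,
}


def dice(cube, allowed):
    """Keep cells whose coordinates lie in the allowed sets (None = any value)."""
    return {
        key: sales
        for key, sales in cube.items()
        if all(a is None or c in a for c, a in zip(key, allowed))
    }


# Select on two dimensions: Product and Time; City is left unrestricted.
sub_cube = dice(CUBE, ({"milk", "bread"}, None, {"Q1", "Q2"}))
print(sub_cube)
```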
Cube operations
Pivot
• The pivot operation is also called a rotation.
• Pivot is a visualization operation that rotates the data axes in the view to provide an
alternative presentation of the data.
• May involve swapping the rows and columns or moving one of the row dimensions into the column
dimensions.
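The simplest pivot, swapping rows and columns of a two-dimensional view, can be sketched on a nested dictionary. The view contents and the helper name `pivot` are hypothetical.

```python
# Hypothetical 2-D view: rows = product, columns = quarter
VIEW = {
    "milk": {"Q1": 10, "Q2": 5},
    "bread": {"Q1": 3},
}


def pivot(view):
    """Swap the row and column axes; the cell values stay the same."""
    rotated = {}
    for row, columns in view.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


rotated = pivot(VIEW)
print(rotated)  # rows = quarter, columns = product
```

Pivoting twice gives back the original view, which shows that only the presentation changes, not the data.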