Data Warehousing/Mining 1
Data Warehousing/Mining
Comp 150
Data Warehousing Introduction
(not in book)
Instructor: Dan Hebert
Data Warehousing/Mining 2
Outline of Lecture
• Data Warehousing and Information Integration
• Brief History of Data Warehousing
• What is a Data Warehouse?
• Types of Data and Their Uses
• Data Warehouse Architectures
• Issues in Data Warehousing
Data Warehousing/Mining 3
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
[Figure: example sources: personal databases, digital libraries, scientific databases, the World Wide Web]
Data Warehousing/Mining 4
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems
[Figure: departmental stove pipes (Sales, Administration, Finance, Manufacturing, ...), each with its own systems, e.g. Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory]
Data Warehousing/Mining 5
Goal: Unified Access to Data
Integration System
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
[Figure: sources behind the integration system: World Wide Web, digital libraries, scientific databases, personal databases]
Data Warehousing/Mining 6
The Traditional Research Approach
• Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through a wrapper]
Data Warehousing/Mining 7
Disadvantages of Query-Driven Approach
• Delay in query processing
– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry
Data Warehousing/Mining 8
The Warehousing Approach
• Information integrated in advance
• Stored in warehouse for direct querying and analysis
[Figure: an extractor/monitor at each source feeds the integration system (with metadata), which loads the data warehouse that clients query]
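The contrast between the query-driven and the warehousing approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SOURCES` dictionary, its rows, and every function name are hypothetical stand-ins for real remote sources and a real warehouse.

```python
# Hypothetical in-memory "sources"; real ones would be remote DBMSs, files, or web sites.
SOURCES = {
    "sales": [{"customer": "Jane Doe", "amount": 120}],
    "web_orders": [
        {"customer": "Jane Doe", "amount": 30},
        {"customer": "Raj Patel", "amount": 40},
    ],
}


def query_driven_total(customer):
    """Lazy approach: visit every source each time a query arrives."""
    return sum(
        row["amount"]
        for rows in SOURCES.values()
        for row in rows
        if row["customer"] == customer
    )


def build_warehouse():
    """Eager approach: integrate all sources in advance into one store."""
    totals = {}
    for rows in SOURCES.values():
        for row in rows:
            totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def warehouse_total(warehouse, customer):
    """Answer from the pre-integrated copy without touching the sources."""
    return warehouse.get(customer, 0)


warehouse = build_warehouse()
```

Both functions agree right after a load. If a source changes afterwards, `query_driven_total` sees the change immediately, while `warehouse_total` keeps returning the older answer until `build_warehouse` runs again, which is the "not necessarily most current information" trade-off.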
Data Warehousing/Mining 9
Advantages of Warehousing Approach
• High query performance
– But not necessarily most current information
• Doesn’t interfere with local processing at sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry
Data Warehousing/Mining 10
Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large numbers of sources
– Clients with unpredictable needs
Data Warehousing/Mining 11
Data Warehouse Evolution
[Figure: timeline from 1960 to 2000. Eras: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management. Milestones: relational databases, PCs and spreadsheets, end-user interfaces, data replication tools, 1st DW article, “Building the DW” (Inmon, 1992), DW conferences, vendor DW frameworks, company DWs]
Data Warehousing/Mining 12
What is a Data Warehouse?
A Practitioner’s Viewpoint
“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
Data Warehousing/Mining 13
A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented database
• User interface aimed at executives
Data Warehousing/Mining 14
A Data Warehouse is... (continued)
• Large volume of data (GB, TB)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios
Data Warehousing/Mining 15
Summary
[Figure: components of the overall picture: operational systems, enterprise modeling, business information guide, data warehouse catalog, data warehouse population, data warehouse, business information interface]
Data Warehousing/Mining 16
Warehouse is a Specialized DB
Standard DB                                | Warehouse
Mostly updates                             | Mostly reads
Many small transactions                    | Queries are long and complex
MB - GB of data                            | GB - TB of data
Current snapshot                           | History
Index/hash on primary key                  | Lots of scans
Raw data                                   | Summarized, reconciled data
Thousands of users (e.g., clerical users)  | Hundreds of users (e.g., decision-makers, analysts)
Data Warehousing/Mining 17
Warehousing and Industry
• Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• WalMart has largest warehouse
– 900-CPU, 2,700-disk, 23 TB Teradata system
– ~7 TB in warehouse
– 40-50 GB per day
Data Warehousing/Mining 18
Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a textbook
Data Warehousing/Mining 19
Data Warehouse Architectures: Conceptual View
• Single-layer
– Every data element is stored once only
– Virtual warehouse
• Two-layer
– Real-time + derived data
– Most commonly used approach in industry today
[Figure: single-layer: operational and informational systems share one layer of “real-time data”. Two-layer: operational systems use real-time data, informational systems use derived data]
Data Warehousing/Mining 20
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived data really requires two steps
[Figure: operational systems use real-time data; reconciled data is the physical implementation of the data warehouse; derived data is the view level (“particular informational needs”) used by informational systems]
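The two steps of the three-layer architecture (real-time to reconciled, then reconciled to derived) can be sketched as two separate functions. This is a minimal sketch under assumptions: the two operational systems, their field names, and their units are invented for illustration and do not come from the lecture.

```python
from collections import defaultdict

# Real-time data from two hypothetical operational systems with different conventions.
orders_system = [
    {"cust": "c1", "amount_cents": 1250, "day": "1997-03-01"},
    {"cust": "c2", "amount_cents": 300, "day": "1997-03-01"},
]
billing_system = [
    {"customer_id": "C1", "amount": 7.5, "date": "1997-03-02"},
]


def reconcile(orders, billing):
    """Step 1: real-time -> reconciled (one schema, one key format, one unit)."""
    reconciled = []
    for r in orders:
        reconciled.append(
            {"customer": r["cust"].upper(), "amount": r["amount_cents"] / 100, "date": r["day"]}
        )
    for r in billing:
        reconciled.append(
            {"customer": r["customer_id"].upper(), "amount": r["amount"], "date": r["date"]}
        )
    return reconciled


def derive_totals(reconciled):
    """Step 2: reconciled -> derived (a summary for one particular informational need)."""
    totals = defaultdict(float)
    for r in reconciled:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


reconciled_layer = reconcile(orders_system, billing_system)
derived_layer = derive_totals(reconciled_layer)
```

Keeping the reconciled layer as its own stored result means further derived views can be built from it without repeating the source-specific cleanup.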
Data Warehousing/Mining 21
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse: “Data warehousing”
(2) What to do with data once it’s in warehouse: “Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)
Data Warehousing/Mining 22
Issues in Data Warehousing
• Warehouse Design
• Extraction
– Wrappers, monitors (change detectors)
• Integration
– Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)
Data Warehousing/Mining 23
Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”
Data Warehousing/Mining 24
Warehouse Architecture
[Figure: each source has an extractor/monitor that feeds the integrator; the integrator loads the warehouse (with metadata); clients reach the warehouse through query & analysis]
Data Warehousing/Mining 25
Issues (1)
• Warehouse uses relational data model or multidimensional data model (e.g., data cube)
• On the other hand, source types
– Relational, OO, hierarchical, legacy
– Semistructured: flat file, WWW
• How do we get the data out?
Data Warehousing/Mining 26
Issues (2)
• Warehouse must be kept current in light of changes to underlying sources
• How do we detect updates in sources?
Data Warehousing/Mining 27
Wrapper
• Converts data and queries from one data model to another
• Extends query capabilities for sources with limited capabilities
[Figure: the wrapper sits in front of a source and translates queries and data between data model A and data model B]
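A wrapper's two jobs can be sketched with a deliberately limited source. This is a hypothetical example, not a real tool's API: `FlatFileSource` (which can only hand back its whole contents) and `FlatFileWrapper` (which offers a relational-style `select`) are names invented here.

```python
import csv
import io


class FlatFileSource:
    """Hypothetical limited source: its only capability is returning all of its text."""

    def __init__(self, text):
        self._text = text

    def scan(self):
        return self._text


class FlatFileWrapper:
    """Presents the flat file as a relation with projection and selection."""

    def __init__(self, source):
        self._source = source

    def select(self, columns, where=None):
        # Convert the data model: lines of text become records keyed by the header row.
        reader = csv.DictReader(io.StringIO(self._source.scan()))
        rows = []
        for record in reader:
            # Extend query capability: filtering happens here because the source cannot do it.
            if where is None or where(record):
                rows.append(tuple(record[c] for c in columns))
        return rows


source = FlatFileSource("name,city\nJane Doe,Boston\nRaj Patel,Medford\n")
wrapper = FlatFileWrapper(source)
boston_names = wrapper.select(["name"], where=lambda r: r["city"] == "Boston")
```

The filtering is done inside the wrapper, so the cost of a full scan is still paid at the source; a wrapper over a more capable source could push the predicate down instead.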
Data Warehousing/Mining 28
Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation
[Figure: a wrapper generator takes a definition and produces a wrapper]
Data Warehousing/Mining 29
Wrapper Approach
• Source-specific adapter (a.k.a. wrapper, translator)
• “Thickness” of adapter depends on source
– Data model used (e.g., relational schema vs. unstructured)
– Interface (e.g., query language, API)
– Active capabilities (e.g., triggers)
– Degree of autonomy (e.g., same owner & modifiable vs. controlled by external entity & no changes possible)
– Cooperation (e.g., friendly vs. uncooperative)
Data Warehousing/Mining 30
Routine When...
• Many tools for dealing with “standard situations”
– Standard sources with full/many capabilities (e.g., most commercial DBMSs, all ODBC-compliant sources)
– Standard interactions (e.g., pass-through queries, extraction from relational tables, replication)
– Cooperative sources or sources under our control
• Tools
– Replication tools, ODBC, report writers, third-party “wrappers”
Data Warehousing/Mining 31
Not So Routine When...
• “Non-standard situations”
– Unstructured or semistructured sources with little or no explicit schema
– Uncooperative sources
– Sources with limited capabilities (e.g., legacy sources, WWW)
• Few commercial tools
• Mostly research
Data Warehousing/Mining 32
Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples
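The transformations listed above can be sketched with Python's standard `struct` module. The record layout (a big-endian 4-byte integer followed by an 8-byte NUL-padded name), the `src1-` key prefix, and the `load_date` attribute are assumptions made up for this example.

```python
import struct

# Assumed source layout: big-endian int32 id, then an 8-byte NUL-padded ASCII name.
SOURCE_LAYOUT = ">i8s"


def decode_source_record(raw):
    """Convert to uniform format: fix byte order and strip the NUL string termination."""
    record_id, name_bytes = struct.unpack(SOURCE_LAYOUT, raw)
    name = name_bytes.split(b"\x00", 1)[0].decode("ascii")
    return record_id, name


def to_warehouse_row(raw, load_date):
    """Add a warehouse key and a load date so that history can be kept."""
    record_id, name = decode_source_record(raw)
    return {"key": f"src1-{record_id}", "name": name, "load_date": load_date}


def transform_batch(raw_records, load_date):
    """Transform every record, then sort the tuples by key before loading."""
    rows = [to_warehouse_row(raw, load_date) for raw in raw_records]
    return sorted(rows, key=lambda row: row["key"])


batch = [struct.pack(SOURCE_LAYOUT, 7, b"Raj"), struct.pack(SOURCE_LAYOUT, 3, b"Jane")]
rows = transform_batch(batch, "1997-03-01")
```

Note that the sort here is on the string key, so `src1-10` would sort before `src1-3`; a real loader would choose the sort order that matches the warehouse's physical organization.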
Data Warehousing/Mining 33
Monitors
• Goal: Detect changes of interest and propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps
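Of the techniques listed, comparing snapshots is the easiest to sketch. The following is a minimal version under the assumption that each snapshot fits in memory as a dictionary from primary key to row; the function name and the sample rows are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dict: key -> row) and report the changes between them."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


yesterday = {1: ("Jane Doe", "Boston"), 2: ("Raj Patel", "Medford")}
today = {1: ("Jane Doe", "Salem"), 3: ("Ana Silva", "Boston")}
inserts, deletes, updates = diff_snapshots(yesterday, today)
```

This approach needs no cooperation from the source beyond a periodic dump, but it only sees the net change between two dumps and misses intermediate updates, which is one reason triggers or log sniffing are preferred when the source allows them.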
Data Warehousing/Mining 34
Data Integration
• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (warehouse updates)
– etc.
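A rule-based integrator can be sketched as a list of small functions that each inspect an incoming change. Everything here is an assumption for illustration: the change format (`key`, `row`), the `updated` field holding an ISO date string, and the two sample rules.

```python
def drop_exact_duplicates(warehouse, change):
    """Rule: ignore a change whose row is already stored unchanged."""
    if warehouse.get(change["key"]) == change["row"]:
        return None
    return change


def keep_newest(warehouse, change):
    """Rule: resolve an inconsistency by keeping the most recently updated row."""
    current = warehouse.get(change["key"])
    if current is not None and current["updated"] > change["row"]["updated"]:
        return None
    return change


def integrate(warehouse, changes, rules):
    """Apply each change that survives every rule; return how many were applied."""
    applied = 0
    for change in changes:
        for rule in rules:
            change = rule(warehouse, change)
            if change is None:
                break  # a rule rejected this change
        else:
            warehouse[change["key"]] = change["row"]
            applied += 1
    return applied


warehouse = {"C1": {"city": "Boston", "updated": "1997-03-01"}}
changes = [
    {"key": "C1", "row": {"city": "Boston", "updated": "1997-03-01"}},
    {"key": "C1", "row": {"city": "Medford", "updated": "1997-02-01"}},
    {"key": "C2", "row": {"city": "Salem", "updated": "1997-03-02"}},
]
applied = integrate(warehouse, changes, [drop_exact_duplicates, keep_newest])
```

Because the warehouse is usually not empty when changes arrive, each rule receives the current warehouse contents as well as the incoming change.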
Data Warehousing/Mining 35
Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
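Duplicate detection for the "Jane Doe vs. Jane Q. Doe" case can be sketched by comparing normalized names. The normalization rule used here (lowercase, drop punctuation, drop single-letter middle initials) is an assumption chosen for the example and would produce false matches on real data without further checks.

```python
import re


def name_key(name):
    """Normalize a name: lowercase, strip punctuation, drop single-letter middle initials."""
    tokens = re.findall(r"[a-z]+", name.lower())
    if len(tokens) > 2:
        middle = [t for t in tokens[1:-1] if len(t) > 1]
        tokens = [tokens[0]] + middle + [tokens[-1]]
    return " ".join(tokens)


def find_duplicates(rows):
    """Group rows by normalized name and return the groups with more than one row."""
    groups = {}
    for row in rows:
        groups.setdefault(name_key(row["name"]), []).append(row)
    return [group for group in groups.values() if len(group) > 1]


customers = [
    {"id": 1, "name": "Jane Doe"},
    {"id": 2, "name": "Jane Q. Doe"},
    {"id": 3, "name": "Raj Patel"},
]
duplicate_groups = find_duplicates(customers)
```

In practice a candidate group like this would be confirmed against other attributes (address, date of birth) before any tuple is removed.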

