Data Warehouse Architecture: Processes
Overview
The architecture is the technical blueprint stage. It must support three major driving forces:
Populating the warehouse: data extraction, cleaning, and loading
Day-to-day management of the warehouse: handling large volumes of data, creating and deleting summaries
The ability to cope with requirement evolution: coping with future changes in query profiles
Typical Process Flow
Extract and load the data
Clean and transform the data into a form that provides good query performance
Back up and archive the data
Manage queries and direct them to the appropriate data sources
Extract & Load Process
Extract
Takes data from the source systems and makes it available to the data warehouse
Load
Takes extracted data and loads it into the data warehouse
Data in operational systems is held in a form suitable for that system. Before loading the data into the DW, its information content must be reconstructed: the data must become value-added business information.
The extract and load process must therefore take the data and add context and meaning.
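A minimal sketch of that idea, assuming a hypothetical billing source held in SQLite; all table, column, and status-code names are illustrative:

```python
import sqlite3

# Illustrative mapping from operational status codes to business terms.
STATUS_LABELS = {"A": "active", "C": "cancelled", "S": "suspended"}

def extract_customers(source: sqlite3.Connection) -> list[dict]:
    """Pull raw rows from the source system and reconstruct their
    information content as business-level records."""
    rows = source.execute(
        "SELECT cust_id, status_cd, region_cd FROM customers"
    ).fetchall()
    return [
        {
            "customer_id": cust_id,
            # Decode the operational code into business vocabulary.
            "status": STATUS_LABELS.get(status_cd, "unknown"),
            "region": region_cd,
            "source_system": "billing",  # provenance adds context in the DW
        }
        for cust_id, status_cd, region_cd in rows
    ]
```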
Issues with the Extract & Load Process
When should data extraction start, when should transformations and consistency checks run, and so on?
A controlling mechanism is essential to fire each module when appropriate
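One simple form such a controlling mechanism could take (a sketch; the module names and the registry pattern are illustrative, not a prescribed design):

```python
from typing import Callable

# Ordered registry of pipeline modules; each reports success/failure.
PIPELINE: list[tuple[str, Callable[[], bool]]] = []

def register(name: str):
    def wrap(fn: Callable[[], bool]):
        PIPELINE.append((name, fn))
        return fn
    return wrap

def run_pipeline() -> None:
    # Fire each module only after its predecessor completed successfully.
    for name, module in PIPELINE:
        if not module():
            raise RuntimeError(f"{name} failed; halting the pipeline")

@register("extract")
def extract() -> bool:
    return True  # placeholder: pull data from the source systems

@register("load_staging")
def load_staging() -> bool:
    return True  # placeholder: load extracts into the temporary store

@register("consistency_checks")
def consistency_checks() -> bool:
    return True  # placeholder: check the staged data

run_pipeline()
```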
When to extract?
The data must be in a consistent state. Start extracting data from a source only when it represents the same snapshot of time as all the other data sources.
E.g., a customer database
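A sketch of that gate, assuming each source can report the business date of its last completed update run (how a real source exposes this varies):

```python
import datetime

def ready_to_extract(snapshots: dict[str, datetime.date]) -> bool:
    """Extract only when every source reports the same snapshot date."""
    return len(set(snapshots.values())) == 1

# The customer database is a day behind, so extraction must wait.
snapshots = {
    "orders":    datetime.date(2024, 1, 31),
    "customers": datetime.date(2024, 1, 30),
}
assert not ready_to_extract(snapshots)
```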
Loading the data
Extracted data is loaded into a temporary data store where it is cleaned up and checked for consistency. Do not execute the consistency checks until all the data sources have been loaded into the temporary data store.
E.g., a customer cancelling a subscription: the cancellation may appear in one source before it is reflected in the others.
Error recovery must be an integral part of the design.
The effort required to clean up the source systems increases exponentially with the number of overlapping data sources.
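A sketch of the staging pattern, assuming two hypothetical overlapping sources (billing and CRM) landed in a SQLite staging area; all names are illustrative:

```python
import sqlite3

def load_into_staging(staging: sqlite3.Connection,
                      sources: dict[str, list[tuple[str, str]]]) -> None:
    # Load every source into its own temporary table first.
    for name, rows in sources.items():
        staging.execute(
            f"CREATE TABLE IF NOT EXISTS stg_{name} (cust_id TEXT, status TEXT)")
        staging.executemany(f"INSERT INTO stg_{name} VALUES (?, ?)", rows)

def check_consistency(staging: sqlite3.Connection) -> list[str]:
    # Runs only after ALL sources are staged: a cancellation recorded in
    # billing but not yet in the CRM would otherwise be flagged too early.
    rows = staging.execute(
        """SELECT b.cust_id FROM stg_billing b
           JOIN stg_crm c ON c.cust_id = b.cust_id
           WHERE b.status <> c.status""").fetchall()
    return [cust_id for (cust_id,) in rows]

staging = sqlite3.connect(":memory:")
load_into_staging(staging, {
    "billing": [("c1", "cancelled")],
    "crm":     [("c1", "cancelled")],
})
assert check_consistency(staging) == []  # both sources now agree
```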
Copy management tools and clean-up
E.g., IBM's Information Warehouse framework, with Data Refresher and Data Hub
Most copy management tools cannot perform consistency checks directly; the user must write and code that logic. Perform a cost-benefit analysis before purchasing a copy management tool.
If source systems do not overlap, then consistency checks are very simple
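A sketch of why the non-overlapping case is simple: with no shared entities, checking reduces to validating each source on its own, e.g. key uniqueness (an illustrative check):

```python
def check_disjoint_source(rows: list[tuple]) -> bool:
    """With no overlap between sources, consistency checking reduces
    to per-source validation; here, primary keys must be unique."""
    keys = [row[0] for row in rows]
    return len(keys) == len(set(keys))

assert check_disjoint_source([("c1", "active"), ("c2", "active")])
assert not check_disjoint_source([("c1", "active"), ("c1", "cancelled")])
```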
Clean and Transform Data
Steps involved are:
Clean and transform the loaded data into a structure that speeds up queries
Partition the data to speed up queries, optimize hardware performance, and simplify DW management
Create aggregations to speed up the common queries
Data needs to be cleaned and checked in the following ways:
Make sure the data is consistent with itself
Make sure the data is consistent with other data within the same source
Make sure the data is consistent with data in the other source systems
Make sure the data is consistent with the information already in the DW
Once data is cleaned, convert source data into a structure that is designed to balance query performance and operational cost
The structure must be suitable for long term storage
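A sketch of partitioning and pre-aggregation, assuming a hypothetical sales fact table in SQLite keyed by sale_date; the monthly partitioning scheme and the rollup are illustrative choices, not a prescribed design:

```python
import sqlite3

def partition_by_month(dw: sqlite3.Connection, month: str) -> None:
    """Copy one month of facts into its own table (e.g. sales_2024_01)
    so common queries scan only the partition they need."""
    part = "sales_" + month.replace("-", "_")
    dw.execute(
        f"CREATE TABLE IF NOT EXISTS {part} AS "
        "SELECT * FROM sales WHERE substr(sale_date, 1, 7) = ?", (month,))

def build_monthly_aggregate(dw: sqlite3.Connection) -> None:
    """Precompute a common rollup so frequent queries avoid the detail rows."""
    dw.execute("DROP TABLE IF EXISTS agg_sales_monthly")
    dw.execute(
        """CREATE TABLE agg_sales_monthly AS
           SELECT substr(sale_date, 1, 7) AS month,
                  product_id,
                  SUM(amount) AS total_amount,
                  COUNT(*)    AS n_rows
           FROM sales
           GROUP BY month, product_id""")
```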
Backup & Archive
Regular backup is essential to recover data from loss.
Archiving
Older data is removed from the system in a format that allows it to be quickly restored if required.
Issue
As the DW evolves, all of its information may change. Hence, to ensure that a restored archive is valid, it becomes necessary to archive all related data and structures as well.
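A sketch of archiving data together with its structures, using SQLite's iterdump, which emits the schema and the rows as one script; the file layout and the substring filter are illustrative simplifications:

```python
import sqlite3

def archive_table(dw: sqlite3.Connection, table: str, path: str) -> None:
    """Write the table's CREATE statement and its rows to one file,
    then drop the old data from the DW."""
    with open(path, "w") as f:
        for stmt in dw.iterdump():  # emits schema and data together
            if table in stmt:       # crude filter, adequate for a sketch
                f.write(stmt + "\n")
    dw.execute(f"DROP TABLE {table}")

def restore_table(dw: sqlite3.Connection, path: str) -> None:
    """Replay the archived statements; because the structure was saved
    with the data, the restore is valid even after the DW has evolved."""
    with open(path) as f:
        dw.executescript(f.read())
```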
Query Management Process
It is a system process that:
Manages the queries
Speeds them up by directing each query to the most effective data source
Ensures that all system resources are used effectively
Monitors query profiles to decide which aggregations to generate
This process operates at all times.
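A sketch of the direction step, reusing the illustrative agg_sales_monthly rollup from the earlier sketch; the routing rule (the aggregate can answer a query only if it covers all grouped columns) is a simplification:

```python
def route_query(group_by: set[str], agg_columns: set[str]) -> str:
    """Direct a query to the pre-built aggregate when it can answer it,
    otherwise to the detail (fact) table."""
    return "agg_sales_monthly" if group_by <= agg_columns else "sales"

# A monthly-by-product profile is served by the aggregate...
assert route_query({"month", "product_id"},
                   {"month", "product_id"}) == "agg_sales_monthly"
# ...while a per-customer query falls back to the detail rows.
assert route_query({"customer_id"},
                   {"month", "product_id"}) == "sales"
```

Monitoring which table each query profile ends up hitting is what tells this process which aggregations are worth generating or deleting.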