The document provides an overview of data warehousing and modeling, emphasizing its importance for businesses to analyze and utilize data for strategic decision-making. It defines data warehouses, outlines their key features, and compares them with operational databases, highlighting their role in business intelligence. Additionally, it discusses various types of data warehouses, data integration processes, and OLAP operations for data analysis.
ML Module1 Ppt - Copy
Module 1
Data warehousing & Modelling
Motivation for data warehouse
• With competition mounting in every industry, data warehousing is the must-have marketing weapon for retaining customers by learning more about their needs.
• Data warehousing provides tools and architectures that help business executives organize, understand, and use their data to make strategic decisions.
• A data warehouse is a data repository that is maintained separately from an organization's operational databases.
• Data warehouse systems allow for the integration of a variety of application systems.
• They support information processing by providing a solid platform of consolidated historical data for analysis.
Oracle definition of data warehouse
• A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data.
IBM definition
• A core component of business intelligence, a data warehouse pulls together data from many different sources into a single data repository for sophisticated analytics and decision support.
Key features of a data warehouse
• Integrated: data from various sources is kept in a consistent format
• Subject-oriented: organized around a particular subject, e.g. customer or sales (instead of the whole organization's data)
• Nonvolatile: data is not changed once loaded; it is read-only
• Time-variant: data is recorded with respect to time (e.g. sales over 1-5 years)
• A data warehouse is mainly used for decision support.
• Knowledge workers such as managers and analysts work on the data warehouse to obtain an overview of, or insights into, the data.
• Business decisions it supports include:
✓ Analysis of customer buying patterns (e.g. combo)
✓ Repositioning products and managing product portfolios according to sales per time period and per geographical region (e.g. clothes)
✓ Coming up with strategies to improve profit
✓ Managing customer relationships and managing corporate assets
▪ Organizations collect diverse, heterogeneous, and distributed data; integrating this data and providing access to it is the major challenge.
▪ In the traditional database approach, an integrator (wrapper) is used to address this.
▪ The integrator is a mediator on top of multiple heterogeneous databases. When a query is posed to a client site, a metadata dictionary is used to translate it into queries appropriate for the individual databases involved.
▪ These queries are mapped and sent to local query processors.
▪ The results returned from the different sites are integrated into a global answer set.
▪ This query-driven approach requires complex information filtering and integration processes and competes with local sites for processing resources. It is inefficient and potentially expensive for frequent queries.
• Data warehousing employs an update-driven approach, in which information from multiple heterogeneous sources is integrated in advance and stored in a warehouse for direct querying.
• A data warehouse brings high performance to the integrated heterogeneous database system because data are copied, preprocessed, integrated, and summarized into one data store.
Difference between OLAP & OLTP
• Difference between an operational database (OLTP) and a data warehouse (OLAP):
Feature | OLTP | OLAP
Characteristic | Operational processing | Information processing and analysis
Orientation | Transaction | Analysis
User | Clerk, DBA, database professional | Knowledge worker
Function | Day-to-day operations | Long-term informational requirements, decision support
DB design | ER-based, application-oriented | Subject-oriented
Data | Current, guaranteed up to date | Historical, accuracy maintained over time
Priority | High performance, high availability | High flexibility, end-user autonomy
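The contrast in the table can be sketched with a small example. The following is only an illustration, assuming a hypothetical `sales` table in an in-memory SQLite database; in practice the operational database and the warehouse are separate systems, and they share one database here purely for brevity. The function names are assumptions made for this sketch.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory table of sales rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, year INTEGER, "
        "region TEXT, amount REAL)"
    )
    rows = [
        (1, 2022, "North", 100.0),
        (2, 2022, "South", 150.0),
        (3, 2023, "North", 200.0),
        (4, 2023, "South", 50.0),
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

def oltp_record_sale(conn, sale_id, year, region, amount):
    """OLTP-style work: a short transaction that writes one current row."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?, ?)",
            (sale_id, year, region, amount),
        )

def olap_sales_by_year(conn):
    """OLAP-style work: a read-only query that summarizes historical data."""
    cur = conn.execute(
        "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
    )
    return cur.fetchall()

conn = build_demo_db()
oltp_record_sale(conn, 5, 2023, "North", 25.0)
yearly_totals = olap_sales_by_year(conn)
print(yearly_totals)
```

The OLTP function touches a single row inside a short transaction, while the OLAP function scans the whole history and returns a summary, which is the kind of query a knowledge worker would pose.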
Three-tier data warehouse architecture
Types of warehouse:
Enterprise warehouse
• Contains all the information of the entire organization
• Integrates data from one or more operational systems
• Holds detailed and summarized data
• Size ranges from gigabytes upward
• Implemented on mainframes or super servers
• Requires extensive business modelling, so it can take years to design and build
Data mart
• Subset of the data warehouse
• Confined to a particular subject and group of users
• Can be built on low-cost servers such as Windows or Linux
• Implementation can be done in weeks
• Types: dependent or independent
• Independent: generated locally within a department
• Dependent: sourced from the enterprise warehouse
Virtual data warehouse
• A set of views (summaries) created by looking at frequent queries
• Easy to build
• Requires excess capacity on the servers
• Data integration is difficult in the bottom-up approach
• Building a common data model for the whole enterprise is difficult in the top-down approach
Extraction, Transformation, and Loading
• Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
• Data cleaning, which detects errors in the data and rectifies them when possible.
• Data transformation, which converts data from legacy or host format to warehouse format.
• Load, which sorts, summarizes, consolidates, checks integrity, and builds indices and partitions.
• Refresh, which propagates the updates from the data sources to the warehouse.
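The extraction, cleaning, transformation, and load steps above can be sketched in a few lines. This is a minimal illustration, not a real ETL tool: the two source formats, the field names, and the dictionary used as a "warehouse" are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical source records in two different legacy formats.
SOURCE_A = [
    {"cust": "alice", "amt": "120.50", "date": "2023-01-15"},
    {"cust": "bob", "amt": "n/a", "date": "2023-01-16"},  # unparseable amount
]
SOURCE_B = [
    {"customer_name": "Alice", "total_cents": 3000, "day": "2023-02-01"},
]

def extract():
    """Extraction: gather raw records from every source, tagged by origin."""
    return [("A", r) for r in SOURCE_A] + [("B", r) for r in SOURCE_B]

def clean_and_transform(tagged_records):
    """Cleaning and transformation: reject bad rows, unify the format."""
    unified, rejected = [], []
    for origin, rec in tagged_records:
        try:
            if origin == "A":
                row = {"customer": rec["cust"].title(),
                       "amount": float(rec["amt"]),
                       "date": rec["date"]}
            else:
                row = {"customer": rec["customer_name"].title(),
                       "amount": rec["total_cents"] / 100,
                       "date": rec["day"]}
            unified.append(row)
        except (KeyError, ValueError):
            rejected.append(rec)  # kept aside for an error report
    return unified, rejected

def load(warehouse, rows):
    """Load: sort rows by date and maintain a per-customer summary."""
    for row in sorted(rows, key=lambda r: r["date"]):
        warehouse["facts"].append(row)
        warehouse["total_by_customer"][row["customer"]] += row["amount"]

warehouse = {"facts": [], "total_by_customer": defaultdict(float)}
rows, rejected = clean_and_transform(extract())
load(warehouse, rows)
print(dict(warehouse["total_by_customer"]), len(rejected))
```

A refresh step would amount to running the same pipeline again on records that arrived since the last load; it is omitted here to keep the sketch short.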
Metadata repository
• Description of the warehouse structure: schema, dimensions, etc.
• Operational metadata: history of migrated data, error and monitoring reports
• The algorithms used for summarization, aggregation, queries, and reports
• Data sources and gateways (e.g. ODBC)
• Data related to system performance, including indices and profiles that improve data access and retrieval performance
• Business terms
Data cube: multidimensional approach
• A data cube allows data to be modelled in multiple dimensions
• Dimension: a perspective with respect to which the organization wants to keep its data
• Ex: a sales store; dimensions: time, item, branch
• Each dimension is kept in a table called a dimension table
• Ex: item (item name, brand, type)
• The multidimensional data model is organized around a central theme, e.g. sales
• Facts: numeric values/measures
• The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables
Schema for multidimensional model
• Database: an entity-relationship schema is used
• Warehouse: has a multidimensional model and hence uses a star, snowflake, or fact constellation schema
Star schema
Snowflake schema
Dimension: the role of concept hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
• A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy
• Concept hierarchies may also be defined by grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy
• There may be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.
• Concept hierarchies may be provided manually by system users, domain experts, or knowledge engineers, or may be automatically generated based on statistical analysis of the data distribution.
Measures: categorisation
• A data cube measure is a numeric function that can be evaluated at each point in the data cube space
• A measure value is computed for a given point by aggregating the data corresponding to the respective dimension–value pairs defining the given point
• Example dimension–value pairs: time = “Q1”, location = “Vancouver”, item = “computer”
• Measures can be organized into three categories (distributive, algebraic, and holistic) based on the kind of aggregate functions used
• Distributive: An aggregate function is distributive if it can be computed in a distributed manner. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If applying the function to those n values gives the same result as applying it to the entire data set, the function is distributive.
• For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the sums obtained for each subcube. Hence, sum() is a distributive aggregate function
• count(), min(), and max() are also distributive aggregate functions
• Distributive measures can be computed efficiently because of the way the computation can be partitioned.
• Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a constant), each of which is obtained by applying a distributive aggregate function.
• For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions
• Holistic: An aggregate function is holistic if there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Examples are median(), mode(), and rank().
OLAP operations
• Organizing data in a multidimensional model allows the user to view it from different perspectives.
• Several OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand
• Hence, OLAP provides a user-friendly environment for interactive data analysis
OLAP operations
• Roll up
• Drill down
• Slice and dice
• Pivot (rotate)
• Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
• For example, take a location hierarchy defined as the total order “street < city < province or state < country.” A roll-up on location aggregates the data by ascending the hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country
• Drill-down: Drill-down is the reverse of roll-up.
• It navigates from less detailed data to more detailed data. Drill-down can be realized either by stepping down a concept hierarchy for a dimension or by introducing additional dimensions
• For example, with a concept hierarchy for time defined as “day < month < quarter < year,” drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month
• Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube
• For example, a slice on the dimension "time" uses the criterion time = "Q1".
• The result is a new sub-cube in which that one dimension is fixed to the selected value.
• Dice: The dice operation defines a sub-cube by performing a selection on two or more dimensions
• For example, a dice based on the following selection criteria involves three dimensions:
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
• Pivot: The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data
OLAP systems versus statistical databases
• A statistical database (SDB) is a database system designed to support statistical applications, typically socio-economic applications
• OLAP is targeted at business intelligence with large amounts of data
• SDBs have truth issues
Star net model for querying
• The querying of multidimensional databases can be based on a starnet model, which consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
• Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations such as drill-down and roll-up
• The example starnet consists of four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively. Each line consists of footprints representing abstraction levels of the dimension
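The OLAP operations described earlier (roll-up, drill-down, slice, dice, and pivot) can be tried out on a tiny star schema. The sketch below is only an illustration using Python's built-in sqlite3 module; the table names, column names, and sales figures are all invented for the example, and a real OLAP server would offer these operations directly rather than through hand-written SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location_dim (loc_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
-- Fact table: one measure plus a key into each dimension table.
CREATE TABLE sales_fact (loc_key INTEGER, time_key INTEGER,
                         item_key INTEGER, units_sold INTEGER);
INSERT INTO location_dim VALUES
  (1, 'Toronto', 'Canada'), (2, 'Vancouver', 'Canada'), (3, 'Chicago', 'USA');
INSERT INTO time_dim VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO item_dim VALUES (1, 'Mobile'), (2, 'Modem'), (3, 'Phone');
INSERT INTO sales_fact VALUES
  (1, 1, 1, 10), (1, 2, 2, 20), (2, 1, 2, 30),
  (2, 2, 3, 40), (3, 1, 1, 50), (3, 2, 3, 60);
""")

# Join the fact table to every dimension table (the "star").
JOINED = """
FROM sales_fact f
JOIN location_dim l ON f.loc_key = l.loc_key
JOIN time_dim t ON f.time_key = t.time_key
JOIN item_dim i ON f.item_key = i.item_key
"""

# Roll-up: climb the location hierarchy from city to country.
roll_up = conn.execute(
    "SELECT l.country, SUM(f.units_sold)" + JOINED +
    "GROUP BY l.country ORDER BY l.country").fetchall()

# Drill-down: step back down from country to city.
drill_down = conn.execute(
    "SELECT l.city, SUM(f.units_sold)" + JOINED +
    "GROUP BY l.city ORDER BY l.city").fetchall()

# Slice: select on one dimension (time = 'Q1').
slice_q1 = conn.execute(
    "SELECT l.city, i.item_name, f.units_sold" + JOINED +
    "WHERE t.quarter = 'Q1' ORDER BY l.city").fetchall()

# Dice: select on three dimensions at once.
dice = conn.execute(
    "SELECT l.city, t.quarter, i.item_name, f.units_sold" + JOINED +
    "WHERE l.city IN ('Toronto', 'Vancouver') "
    "AND t.quarter IN ('Q1', 'Q2') "
    "AND i.item_name IN ('Mobile', 'Modem') "
    "ORDER BY l.city, t.quarter").fetchall()

# Pivot: build a city x quarter view, then swap the two axes.
by_city_quarter = {
    (city, quarter): total
    for city, quarter, total in conn.execute(
        "SELECT l.city, t.quarter, SUM(f.units_sold)" + JOINED +
        "GROUP BY l.city, t.quarter")
}
pivoted = {(q, c): v for (c, q), v in by_city_quarter.items()}

print(roll_up)
print(dice)
```

Each operation is just a different query over the same fact table: roll-up and drill-down change the grouping level, slice and dice change the selection, and pivot changes only how the result is laid out.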