Datascience Unit 02 1

A Data Warehouse (DW) is a relational database designed for query and analysis, integrating historical data from various sources to support decision-making. Key characteristics include being subject-oriented, integrated, time-variant, and non-volatile, while data marts serve specific business needs. The document also discusses data warehouse architecture, OLAP operations, and data preprocessing techniques essential for effective data mining and analysis.

DATA WAREHOUSE

UNIT-02
WHAT IS A DATA WAREHOUSE?

• A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources.
• A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on supporting decision-makers in data modeling and analysis. A Data Warehouse is a collection of data for the entire organization, not only for a particular group of users. It is not used for daily operations and transaction processing.
CHARACTERISTICS OF DATA WAREHOUSE
SUBJECT-ORIENTED

• A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing operations as a whole. This is done by excluding data that are not useful for the subject and including all data needed by the users to understand the subject.
INTEGRATED

• A data warehouse integrates various heterogeneous data sources such as RDBMS, flat files, and online transaction records. Data cleaning and integration must be performed during data warehousing to ensure consistency in naming conventions, attribute types, etc., among the different data sources.
TIME-VARIANT

• Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, or 12 months ago, or even older data, from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
NON-VOLATILE

• The data warehouse is a physically separate data store, which is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures in data accessing: initial loading of data and access to data. Therefore, the DW does not require transaction processing, recovery, and concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.
GOALS OF DATA WAREHOUSING

• To support reporting as well as analysis.
• To maintain the organization's historical information.
• To be the foundation for decision making.
DATA MART:

• A data mart is a subset of a main data warehouse that is segmented to serve business needs, typically with a focus on a particular purpose.
• For example, if we treat an honours college as the data warehouse, then its departments are:
• Geography Dept.
• History Dept.
• English Dept.
• Bengali Dept.
• CSE Dept.
• Each of these departments is a data mart of the data warehouse.
• There may be distinct data marts for finance, sales, production, or marketing. Each one comprises the software, hardware, programs, and data related to a particular department inside the firm.
• Although each of these data marts is unique, they may all be coordinated.
• The data marts of different departments differ from one another.
• In short, a data mart is a small warehouse planned at the department level.
METADATA

• Metadata is the directory that lists the contents of your data warehouse.
• Forms of metadata
• Three main types of metadata may be found in a data warehouse:
• Operational metadata
• Extraction and transformation metadata
• End-user metadata
1. OPERATIONAL METADATA

• Data for the data warehouse originates from several operational systems inside the organization. Operational metadata encompasses all relevant information about these operational data sources.
2. EXTRACTION AND TRANSFORMATION METADATA

• It includes details on every data transformation that has been applied to the data.
3. END-USER METADATA (INDEX)

• End-user metadata is the data warehouse's navigational map. It makes it possible for the end user to locate information in the data warehouse.
DATA WAREHOUSE ARCHITECTURE
BACK-END TOOLS AND UTILITIES

• Back-end tools and utilities are used to feed data from operational databases or other external sources into the data warehouse (bottom layer).
• In addition to data extraction, cleansing, and transformation (e.g., merging comparable data from several sources into a unified format), these tools and utilities carry out load and refresh operations to update the data warehouse.
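
As a rough illustration of the extraction, cleansing, transformation, and load steps described above, the following Python sketch merges records from two sources into one unified format. The source layouts, field names, and the in-memory `warehouse` list are assumptions made up for this example and do not correspond to any particular back-end tool.

```python
# Two hypothetical operational sources with different naming conventions.
crm_rows = [{"CustName": " Alice ", "Amount": "120.50"},
            {"CustName": "Bob", "Amount": "80"}]
pos_rows = [{"customer": "alice", "total_cents": 9900}]


def from_crm(row):
    # Cleansing: trim whitespace and normalize case.
    # Transformation: convert the text amount to a number.
    return {"customer": row["CustName"].strip().lower(),
            "amount": float(row["Amount"])}


def from_pos(row):
    # Transformation: convert cents to currency units so both sources agree.
    return {"customer": row["customer"].strip().lower(),
            "amount": row["total_cents"] / 100}


def load(target, rows):
    # Load/refresh: append the unified rows to the warehouse table.
    target.extend(rows)


warehouse = []  # stand-in for the warehouse table (assumption)
load(warehouse, [from_crm(r) for r in crm_rows])
load(warehouse, [from_pos(r) for r in pos_rows])
print(warehouse)
```

After both loads, every row uses the same field names and units, whichever source it came from.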
BOTTOM TIER

• The warehouse database server is almost always a relational database system.
• A data warehouse can be created by connecting many data marts.
• Additionally, this layer has a metadata repository that holds data about the content of the data warehouse.
• Additionally, there are integrators and monitors on
THE MIDDLE TIER

• The middle tier is an OLAP server.
• It is typically implemented using MOLAP or ROLAP.
• ROLAP (relational OLAP) is a server that works on top of relational databases.
• MOLAP is a unique kind of server that is
specifically designed for multidimensional
THE TOP TIER

• It is a front-end client layer that includes data mining, analysis, and query and reporting capabilities.
NEED FOR DATA WAREHOUSE
BENEFITS OF DATA WAREHOUSE

• Data warehouses help users understand business trends and make better forecasting decisions.
• Data warehouses are designed to perform well with enormous amounts of data.
• The structure of data warehouses is easier for end-users to navigate, understand, and query.
• Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
• Data warehousing is an efficient method to manage demand for lots of information from lots of users.
• Data warehousing provides the capability to analyze a large amount of historical data.
MULTIDIMENSIONAL DATA MODEL

• A multidimensional data model is a method used to organize data in a database, allowing users to view and analyze data from multiple perspectives. This model is particularly useful in data warehousing and OLAP (Online Analytical Processing), where it enables users to quickly retrieve answers to complex analytical queries.
KEY CONCEPTS

• Data Cubes: The core concept of the multidimensional data model is the data cube, which allows data to be modeled and viewed in multiple dimensions. Each dimension represents a different perspective or entity, such as time, location, or product. For example, a sales data warehouse might have dimensions for time, items, and locations.
• Dimensions and Facts:
• Dimensions: These are attributes that describe the measures, such as time, location, or product. They are typically stored in dimension tables.
• Facts: These are numerical measures that represent the central theme of the data, such as sales or revenue. Facts are stored in fact tables, which contain the measures together with keys to the related dimension tables.
MULTIDIMENSIONAL DATA MODEL SCHEMA

I. Star Schema: Each dimension in a star schema is represented by only one dimension table, which contains the set of attributes for that dimension. For example, the sales data of a company can be modeled with respect to four dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of the four dimensions.
The fact table also contains the measures, namely dollars sold and units sold.
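
The star schema described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration only: the table names, the columns other than the keys and the two measures, and all sample rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension (column choices are illustrative assumptions).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

-- Made-up sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'Mobile', 'BrandA'), (2, 'Modem', 'BrandB');
INSERT INTO branch_dim VALUES (1, 'Main');
INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Vancouver', 'Canada');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1000.0, 10),
    (1, 2, 1, 2,  400.0,  8),
    (2, 1, 1, 1, 1500.0, 15);
""")

# Join the fact table to two of its dimensions and aggregate one measure.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter, l.city
""").fetchall()
print(rows)
```

Every analytical query follows the same pattern: start at the fact table and join outward to whichever dimension tables the question needs.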
II. Snowflake Schema: Unlike in the star schema, some dimension tables in the snowflake schema are normalized, which splits the data into additional tables. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables. The item dimension table now contains the attributes item_key, item_name, type, brand, and supplier_key. The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.
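
A minimal `sqlite3` sketch of the normalized item dimension follows. The attribute names are the ones listed above; the table names `item_dim` and `supplier_dim` and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supplier details are split out of the item dimension (normalization).
CREATE TABLE supplier_dim (
    supplier_key  INTEGER PRIMARY KEY,
    supplier_type TEXT
);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);

-- Made-up sample rows.
INSERT INTO supplier_dim VALUES (10, 'wholesale'), (20, 'retail');
INSERT INTO item_dim VALUES
    (1, 'Mobile', 'phone',   'BrandA', 10),
    (2, 'Modem',  'network', 'BrandB', 20);
""")

# Reaching supplier attributes now costs one extra join compared with a star schema.
rows = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item_dim i
    JOIN supplier_dim s ON i.supplier_key = s.supplier_key
    ORDER BY i.item_key
""").fetchall()
print(rows)
```

The trade-off is visible in the query: normalization removes repeated supplier data from the item table but adds a join whenever supplier attributes are needed.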
III. Fact Constellation Schema: A fact constellation has multiple fact tables. It is also known as a galaxy schema. The sales fact table is the same as that in the star schema. The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location. The shipping fact table also contains two measures, namely dollars cost and units shipped. It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
OLAP OPERATIONS

• Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
• Here is the list of OLAP operations −
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Roll-up: Roll-up performs aggregation on a data cube in either of the following ways −
• By climbing up a concept hierarchy for a dimension
• By dimension reduction
• For example, roll-up is performed by climbing up a concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country".
• On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is now grouped into countries rather than cities.
• When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
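
Both forms of roll-up can be sketched in a few lines of Python. The sales figures, the city-to-country mapping, and the function names below are hypothetical and serve only to illustrate the aggregation.

```python
from collections import defaultdict

# Hypothetical facts at city level: (city, quarter, units_sold).
sales = [("Toronto", "Q1", 605), ("Vancouver", "Q1", 825),
         ("Chicago", "Q1", 440), ("New York", "Q1", 1560),
         ("Toronto", "Q2", 680)]

# One step of the location concept hierarchy: city -> country.
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "New York": "USA"}


def roll_up_location(facts, hierarchy):
    """Climb the hierarchy: aggregate city-level rows to country level."""
    totals = defaultdict(int)
    for city, quarter, units in facts:
        totals[(hierarchy[city], quarter)] += units
    return dict(totals)


def roll_up_drop_time(facts):
    """Dimension reduction: remove the time dimension entirely."""
    totals = defaultdict(int)
    for city, _quarter, units in facts:
        totals[city] += units
    return dict(totals)


by_country = roll_up_location(sales, city_to_country)
by_city_all_time = roll_up_drop_time(sales)
print(by_country)
print(by_city_all_time)
```

In both cases the result has fewer, coarser cells than the input, which is what distinguishes roll-up from the other operations.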
• Drill-down: Drill-down is the reverse operation of roll-up. It is performed in either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension
• For example, drill-down is performed by stepping down a concept hierarchy for the dimension time.
• Initially the concept hierarchy was "day < month < quarter < year".
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
• It navigates the data from less detailed data to highly detailed data.
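
A small Python sketch of drilling down from quarter to month is given here. It assumes the facts are stored at month level, since drill-down can only reveal detail that has been kept; the data and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical facts stored at the finest level kept: (month, item, units_sold).
monthly = [("Jan", "Mobile", 150), ("Feb", "Mobile", 100),
           ("Mar", "Mobile", 150), ("Apr", "Mobile", 200)]

# One step of the time concept hierarchy: month -> quarter.
month_to_quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}


def quarterly_view(facts):
    """The summarized view the analyst starts from."""
    totals = defaultdict(int)
    for month, item, units in facts:
        totals[(month_to_quarter[month], item)] += units
    return dict(totals)


def drill_down(facts, quarter):
    """Descend from quarter to month for one chosen quarter."""
    return {(month, item): units
            for month, item, units in facts
            if month_to_quarter[month] == quarter}


summary = quarterly_view(monthly)
q1_detail = drill_down(monthly, "Q1")
print(summary)
print(q1_detail)
```

The monthly cells returned for Q1 add up to the Q1 total in the summarized view, which shows that drill-down reverses roll-up.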
• Slice: The slice operation fixes a single value on one particular dimension of a given cube and provides a new sub-cube.
• For example, slice is performed for the dimension "time" using the criterion time = "Q1".
• It forms a new sub-cube that contains only the data matching that criterion.
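
The slice for time = "Q1" can be sketched in Python by modeling the cube as a dictionary keyed by (location, time, item). The cell values and the helper name `slice_cube` are assumptions made for illustration.

```python
# Hypothetical cube: (location, time, item) -> units_sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Mobile"): 680,
    ("Vancouver", "Q1", "Modem"): 825,
    ("Vancouver", "Q3", "Modem"): 512,
}

TIME = 1  # position of the time dimension in each key


def slice_cube(cells, axis, value):
    """Keep cells where the chosen dimension equals value, then drop that dimension."""
    return {key[:axis] + key[axis + 1:]: measure
            for key, measure in cells.items()
            if key[axis] == value}


q1 = slice_cube(cube, TIME, "Q1")
print(q1)
```

The resulting sub-cube has keys of the form (location, item): the time dimension is gone because it has been fixed to a single value.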
• Dice: Dice applies selection criteria on two or more dimensions of a given cube and provides a new sub-cube.
• For example, the dice operation on the cube based on the following selection criteria involves three dimensions.
• (location = "Toronto" or "Vancouver")
• (time = "Q1" or "Q2")
• (item = "Mobile" or "Modem")
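
The three selection criteria above translate directly into a filter over the cube's cells. In this Python sketch the cube contents and the function name `dice` are invented for the example.

```python
# Hypothetical cube: (location, time, item) -> units_sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q3", "Mobile"): 700,
    ("Vancouver", "Q2", "Modem"): 825,
    ("Chicago", "Q1", "Mobile"): 440,
    ("Vancouver", "Q1", "Security"): 400,
}


def dice(cells, locations, times, items):
    """Keep only cells that satisfy the criterion on every one of the three dimensions."""
    return {key: measure
            for key, measure in cells.items()
            if key[0] in locations and key[1] in times and key[2] in items}


sub_cube = dice(cube,
                locations={"Toronto", "Vancouver"},
                times={"Q1", "Q2"},
                items={"Mobile", "Modem"})
print(sub_cube)
```

Unlike a slice, the diced sub-cube keeps all three dimensions; each one is merely restricted to the listed values.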
• Pivot: The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. For example, the rows and columns of a two-dimensional view can be swapped.
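
A pivot on a two-dimensional view can be sketched as swapping the row and column axes. The view below, with locations as rows and items as columns, uses made-up numbers.

```python
# Hypothetical 2-D view: rows are locations, columns are items.
view = {
    "Toronto":   {"Mobile": 605, "Modem": 825},
    "Vancouver": {"Mobile": 440, "Modem": 1560},
}


def pivot(table):
    """Rotate the axes: former column labels become row labels and vice versa."""
    rotated = {}
    for row_label, cells in table.items():
        for col_label, value in cells.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


by_item = pivot(view)
print(by_item)
```

No values are aggregated or filtered; only the presentation changes, and pivoting twice returns the original view.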
ADVANTAGES OF DATA CUBES:

• Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business data, allowing users to view data from different perspectives and levels of detail.
• Interactivity: Data cubes provide interactive access to large amounts of data, allowing
users to easily navigate and manipulate the data to support their analysis.
• Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and
efficient querying and aggregation of data.
• Data aggregation: Data cubes support complex calculations and data aggregation,
enabling users to quickly and easily summarize large amounts of data.
• Improved decision-making: Data cubes provide a clear and comprehensive view of
business data, enabling improved decision-making and business intelligence.
• Accessibility: Data cubes can be accessed from a variety of devices and platforms,
making it easy for users to access and analyze business data from anywhere.
DISADVANTAGES OF DATA CUBE:

• Complexity: OLAP systems can be complex to set up and maintain, requiring specialized technical
expertise.
• Data size limitations: OLAP systems can struggle with very large data sets and may require
extensive data aggregation or summarization.
• Performance issues: OLAP systems can be slow when dealing with large amounts of data,
especially when running complex queries or calculations.
• Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of
OLAP analysis.
• Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the need
for specialized hardware and software.
• Inflexibility: OLAP systems may not easily accommodate changing business needs and may
require significant effort to modify or extend.
DATA PREPROCESSING IN DATA MINING

• Data preprocessing is an important step in data mining. In this step, raw data is converted into an understandable format and made ready for further analysis. The aim is to improve data quality and bring it up to the standard required for specific tasks.
• Tasks in Data Preprocessing
DATA CLEANING

• Data cleaning helps us remove inaccurate, incomplete, and incorrect data from the dataset. Some techniques used in data cleaning are −
• Binning − This method smooths noisy data. The sorted data is divided into bins of equal size, and a smoothing method (for example, replacing each value by its bin mean) is then applied to each bin.
• Regression − Regression functions are used to smooth the data. Regression can be linear (one independent variable) or multiple (several independent variables).
• Clustering − It groups similar data into clusters and is used for finding outliers.
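
Binning can be illustrated with smoothing by bin means, one common variant of the technique. The sketch below sorts the values, splits them into bins of equal size, and replaces every value by the mean of its bin; the sample values and the function name are assumptions.

```python
def smooth_by_bin_means(values, bin_size):
    """Equal-frequency binning followed by smoothing with each bin's mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Made-up noisy measurements.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
```

With bins of three values, the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so each value is replaced by 9.0, 22.0, or 29.0 respectively.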
DATA INTEGRATION

• Data integration is the process of combining data from multiple sources (databases, spreadsheets, text files) into a single dataset. A single, consistent view of the data is created in this process. Major problems during data integration are schema integration (integrating sets of data collected from various sources), entity identification (identifying the same entities across different databases), and detecting and resolving data value conflicts.
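
The three integration problems named above can be seen together in a small Python sketch. The two source layouts, the field names, and the rule of keeping the first record seen for each customer are all assumptions chosen for illustration.

```python
# Two hypothetical sources describing customers under different schemas.
sales_db = [{"cust_id": 1, "name": "Alice", "balance": 120.5}]
billing_db = [{"customer_number": 1, "full_name": "ALICE", "balance_cents": 12050},
              {"customer_number": 2, "full_name": "Bob", "balance_cents": 8000}]


def to_common(row):
    """Schema integration: map either source layout onto one common schema."""
    if "cust_id" in row:
        return {"customer_id": row["cust_id"],
                "name": row["name"].title(),
                "balance": row["balance"]}
    # Value conflict: this source stores cents, so convert to currency units.
    return {"customer_id": row["customer_number"],
            "name": row["full_name"].title(),
            "balance": row["balance_cents"] / 100}


def integrate(*sources):
    """Entity identification: rows with the same customer_id are one entity."""
    merged = {}
    for source in sources:
        for row in source:
            record = to_common(row)
            # Keep the first record seen for each entity (a simplifying assumption).
            merged.setdefault(record["customer_id"], record)
    return list(merged.values())


customers = integrate(sales_db, billing_db)
print(customers)
```

Real integration jobs rarely have a shared key as clean as this one; matching on names or addresses usually needs fuzzier rules.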
DATA TRANSFORMATION

• In this step, the format or structure of the data is changed to make it suitable for the mining process. Methods for data transformation are −
• Normalization − Scales the data so that it falls within a specific smaller range (for example, -1.0 to 1.0).
• Discretization − Divides continuous data into intervals, which also helps reduce the data size.
• Attribute Selection − To help the mining process, new attributes are derived from the given attributes.
• Concept Hierarchy Generation − Attributes are changed from a lower level to a higher level in the hierarchy.
• Aggregation − A summary of the data is stored. How much to summarize depends on the quality and quantity of the data, with the aim of improving the result.
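
Normalization and discretization are easy to show in code. The sketch below uses min-max scaling, one common normalization method, to map values into the range -1.0 to 1.0, and a simple threshold lookup for discretization. The function names, interval edges, and labels are assumptions for this example.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the middle of the target range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def discretize(value, edges, labels):
    """Return the label of the first interval whose upper edge exceeds value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


scaled = min_max_normalize([10, 20, 30])
age_groups = [discretize(age, edges=[18, 65], labels=["youth", "adult", "senior"])
              for age in (12, 40, 70)]
print(scaled)
print(age_groups)
```

Here 10, 20, and 30 become -1.0, 0.0, and 1.0, and the ages 12, 40, and 70 fall into "youth", "adult", and "senior".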
DATA REDUCTION

• Data reduction decreases the volume of stored data, which improves storage efficiency and makes analysis easier while producing almost the same results. Analysis becomes harder when working with huge amounts of data, so reduction is used to address that problem.
