
Data Warehousing and Data Mining

UNIT II

Data Preprocessing: An Overview, Data Cleaning, Data Integration,


Data Reduction, Data Transformation and Data Discretization.
Data Preprocessing: An Overview

Why Preprocess the Data

•Real-world data is often dirty, incomplete, inconsistent, and noisy.


•Poor data quality affects mining results — “Garbage In, Garbage Out.”
•Preprocessing improves:
• Accuracy of patterns/models
• Efficiency of mining algorithms
• Interpretability of results
Common Data Issues:
•Incomplete: Missing attributes or records
•Noisy: Errors, outliers, or wrong values
•Inconsistent: Conflicting data or formats

Preprocessing ensures high-quality, reliable input for mining.


Data Preprocessing: An Overview

Major Preprocessing Tasks

The key tasks involved in preparing data for mining include:


1. Data Cleaning:
   • Removes noise and handles missing values.
   • Ensures consistency and accuracy.
2. Data Integration:
   • Combines data from multiple sources into a coherent store.
   • Resolves schema and entity conflicts.
3. Data Reduction:
   • Reduces volume without losing valuable data.
   • Includes techniques such as PCA, sampling, and aggregation.
4. Data Transformation:
   • Converts data into an appropriate format or scale.
   • Includes normalization, encoding, and aggregation.
Data Preprocessing: An Overview

Preprocessing: Summary & Examples

Task: Example
•Data Cleaning: Fill missing ages with the average; remove typos
•Data Integration: Merge sales data from branches A and B
•Data Reduction: PCA to reduce 10 features to 3
•Data Transformation: Normalize salary between 0 and 1
•Data Discretization: Convert “age = 21” into the “young” category
Data Cleaning
What is Data Cleaning?
Data cleaning is a fundamental step in the data preprocessing process, where the goal is to detect
and correct errors and inconsistencies in the data to improve its quality and usefulness for mining.
Why is Data Cleaning Needed?
In real-world scenarios, data collected from various sources often suffer from:
•Missing values (e.g., blank fields in forms)
•Noisy data (e.g., typos, outliers, errors from sensors)
•Inconsistent data (e.g., differing date formats like "2025/01/01" vs "01-Jan-2025")

These problems arise due to:


•Faulty data collection instruments
•Human errors during data entry
•Lack of standardization
•Integration of data from multiple sources

Goals of Data Cleaning:


•Improve the accuracy, completeness, and consistency of data.
•Make the dataset suitable for analysis and mining.
•Reduce the risk of biased or incorrect conclusions.
Data Cleaning
Missing Values
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill
in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded
value for several attributes such as customer income. How can you go about filling in the missing values for this attribute?
Let’s look at the following methods.
1. Ignore the tuple:
• Skip the entire record if the class label is missing.
• Suitable only for large datasets with few missing values.
2. Fill in the missing value manually:
• Manually enter the missing data.
• Impractical for large datasets.
3. Use a global constant to fill in the missing value:
• Replace missing values with a constant like "Unknown" or -1.
• Risk: The mining algorithm might treat the constant as meaningful data.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
• Fill using the mean (for symmetric data) or median (for skewed data).
• Example: Replace missing income with the average income of $56,000.
Data Cleaning
Missing Values
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
• Replace using mean/median of the same class.
• Example: Use the average income of “low-risk” customers only.
6. Use the most probable value to fill in the missing value:
• Predict using models (e.g., regression, Bayesian, decision tree).
• More accurate but computationally more complex.
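The strategies above map directly onto a few lines of pandas. Below is a minimal sketch, assuming a small illustrative table with an income attribute and a risk class label (both hypothetical); it shows filling with the overall mean, filling with the class-wise mean, and dropping incomplete tuples.

# Minimal sketch of common missing-value strategies with pandas.
# The columns "income" and "risk" are illustrative, not from the original data set.
import pandas as pd

df = pd.DataFrame({
    "income": [48000, None, 56000, None, 72000, 61000],
    "risk":   ["low", "low", "high", "high", "low", "high"],
})

# Method 4: fill with a global measure of central tendency (mean or median).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of the same class (here, the same "risk" group).
df["income_class_filled"] = df["income"].fillna(
    df.groupby("risk")["income"].transform("mean")
)

# Method 1: or simply drop tuples that still have missing values.
df_complete = df.dropna(subset=["income"])
print(df)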
Data Cleaning
Noisy Data
What is Noisy Data?
Data with random errors, outliers, or incorrect values.
Common in sensor data, user entries, or during transmission.
Causes of Noise:
Faulty instruments
Human entry errors
Data corruption or inconsistencies
Example:
A customer’s age is entered as 250 or salary as 0 — both are likely noise.
Data Cleaning
Noisy Data: Techniques to Handle Noisy Data

1. Binning:
   • Sort the data and smooth values within “bins.”
   • Methods: smoothing by bin means, medians, or boundaries.
2. Regression:
   • Fit a function (linear or nonlinear) to predict values.
   • Smooth deviations from the fitted line.
3. Clustering:
   • Group similar values together; treat outliers as noise.
4. Combined Computer & Human Inspection:
   • Manually verify values flagged by the system.
5. Smoothing by Aggregation:
   • Replace individual values with grouped averages.
Data Cleaning
Noisy Data: Binning
1. Binning:
   • Sort the data and smooth values within “bins.”
   • Methods: smoothing by bin means, medians, or boundaries.

Example: sorted price data 4, 8, 15, 21, 21, 24, 25, 28, 34, partitioned into three equal-frequency bins of three values each.

Smoothing by Bin Means
Bin 1: Mean = (4 + 8 + 15) / 3 = 9 → Smoothed values: 9, 9, 9
Bin 2: Mean = (21 + 21 + 24) / 3 = 22 → Smoothed values: 22, 22, 22
Bin 3: Mean = (25 + 28 + 34) / 3 = 29 → Smoothed values: 29, 29, 29
Smoothing by Bin Boundaries
How it works:
•For each bin, identify the minimum and maximum values (boundaries).
•Replace each value with the closest boundary (either min or max).

Bin 1: 4, 8, 15
•Min = 4, Max = 15
•Replace:
• 4 → 4 (equal to min)
• 8 → closer to 4 than to 15 → 4
• 15 → 15 (equal to max)
Result: 4, 4, 15
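The two smoothing variants above can be reproduced with plain Python; the sketch below uses the nine sorted prices from the example and equal-frequency bins of three values each.

# Rough sketch of equal-frequency binning with smoothing by bin means
# and by bin boundaries (data taken from the example above).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]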
Data Cleaning
Noisy Data: Regression

Example data:
Price (X)   Sales (Y)
100         40
200         80
300         130
400         160
500         210

Types of Regression for Smoothing:
1. Linear Regression:
   • Fits a straight line Y = aX + b to data points involving two variables (X and Y).
   • Used when one attribute can predict another (e.g., sales vs. price).
   • Example: fitting the data above by least squares gives the line Y = 0.42X − 2.
     Predicted sales at X = 350 → Y = 0.42 × 350 − 2 = 145.
     If the actual value is Y = 180 (a noisy value), it is smoothed to 145.
2. Multiple Linear Regression:
   • Extends linear regression to more than two variables.
   • Fits the data to a multidimensional surface.
   • Useful when multiple features influence a target (e.g., sales = f(price, ad budget, rating)).
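As a rough illustration of regression-based smoothing, the NumPy sketch below fits the least-squares line to the example table (recovering Y = 0.42X − 2) and replaces a noisy observation with its fitted value.

# Smoothing by linear regression with NumPy.
import numpy as np

price = np.array([100, 200, 300, 400, 500], dtype=float)
sales = np.array([40, 80, 130, 160, 210], dtype=float)

slope, intercept = np.polyfit(price, sales, deg=1)   # least-squares line
print(slope, intercept)                              # ~0.42, ~-2.0

# Smooth a noisy observation by replacing it with the fitted value.
x_new, y_noisy = 350.0, 180.0
y_fit = slope * x_new + intercept                    # 145.0
print(f"noisy y={y_noisy} smoothed to {y_fit:.1f}")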
Data Cleaning
Noisy Data: Outlier Analysis
Outlier Detection Using Clustering
•Clustering groups similar data points based on attributes.
•Outliers are data points that lie outside all clusters.
•These are potential errors, anomalies, or rare events.

Use of Clustering:
•Helps separate normal behavior (clustered points) from
anomalies.
•Useful for detecting fraud, sensor errors, or extreme events.
Data Integration
What is Data Integration?
•Data Integration is the process of combining data from multiple heterogeneous sources into a coherent data
store.
•Common in enterprise systems where data comes from:
• Databases
• Flat files
• Web data
• Sensor logs

Why is Data Integration Needed?


•Enables complete and unified views for analysis.
•Avoids fragmented decision-making.
•Essential for building data warehouses and performing data mining.

Key Challenges:
•Inconsistent attribute names and formats
•Missing or conflicting values
•Duplicate records or tuples
•Matching entities from different systems
Data Integration
What is the Entity Identification Problem?
•Occurs when data from multiple sources refer to the same real-world entity but are represented differently.
•Also known as record linkage, object matching, or merge-purge problem.

Why It Happens:
•Different databases may use:
• Different names or aliases
e.g., "Cust_ID" vs. "CustomerNumber"
• Different formats or data types
e.g., Date of birth: “12/02/2001” vs. “2001-02-12”
• Different identifiers for the same entity
e.g., Employee ID vs. Social Security Number

Goal:
To identify and unify such records before loading into a data warehouse.
Data Integration

What is Redundancy?

Redundancy in Data Integration

•Redundancy means an attribute can be derived from other attributes.


• Example: Annual revenue might be calculated from monthly sales.
•Common causes:
• Derived or duplicate attributes
• Inconsistent naming (e.g., “salary” vs. “income”)
• Dimensional misalignment
Solution: Use correlation analysis to detect and reduce redundancy.
Data Integration: Redundancy
Chi-Square Test for Nominal Data

χ² Correlation Test for Nominal Attributes

•Used to test the correlation between two categorical (nominal) attributes.
•Construct a contingency table and calculate:
  χ² = Σ (observed − expected)² / expected,
  where the expected count of a cell = (row total × column total) / grand total.

Example:
•Analyzed gender vs. reading preference.
•Used a contingency table (Table 3.1).
•Result: a high χ² value → strong correlation.
Interpretation: If χ² is high and the p-value is below the threshold (e.g.,
0.001), reject the null hypothesis of independence → the attributes are correlated.
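For illustration, the χ² test can be run with SciPy's chi2_contingency. The sketch below assumes the gender vs. preferred-reading counts commonly used in the textbook example (250/50 and 200/1000); treat the exact figures as illustrative.

# Chi-square correlation test for two nominal attributes with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

#                     fiction  non-fiction
observed = np.array([[   250,        50],   # male
                     [   200,      1000]])  # female

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2, tiny p-value -> the attributes are correlated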
Data Integration: Redundancy
Correlation Coefficient for Numeric Data

Pearson’s Correlation Coefficient (Numeric)

•For attributes A and B with n tuples:
  r(A, B) = Σ (aᵢ − Ā)(bᵢ − B̄) / (n · σ_A · σ_B)
•Value ranges:
 • +1: perfect positive correlation
 • −1: perfect negative correlation
 • 0: no linear correlation

Interpretation:
•If r ≈ 0 → no linear relationship (no redundancy).
•If r is close to ±1 → the attributes are strongly correlated; one may be removed as redundant.
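A minimal sketch of redundancy checking with Pearson's r, using NumPy; the income and salary arrays are hypothetical, and the 0.95 cutoff is an arbitrary choice.

# Checking two numeric fields for redundancy via Pearson's r.
import numpy as np

income = np.array([48.0, 52.0, 61.0, 70.0, 75.0])
salary = np.array([47.0, 53.0, 60.0, 69.0, 76.0])

r = np.corrcoef(income, salary)[0, 1]
print(r)            # close to +1 -> the attributes are largely redundant
if abs(r) > 0.95:   # threshold is a judgement call
    print("Consider dropping one of the two attributes")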
Data Integration: Redundancy
Covariance

Covariance measures how two numeric variables change together:
  Cov(A, B) = (1/n) Σ (aᵢ − Ā)(bᵢ − B̄)

•If both variables increase together, the covariance is positive.
•If one increases while the other decreases, the covariance is negative.
•If there is no consistent pattern, the covariance is near zero.

Example:
Day   Stock A   Stock B
1     100       200
2     110       210
3     90        190
4     95        195

Here, both stocks tend to move up and down together, so the covariance is positive.
Data Integration: Redundancy

Covariance and Example

For the stock data above, Cov(A, B) = ((1.25)(1.25) + (11.25)(11.25) + (−8.75)(−8.75) + (−3.75)(−3.75)) / 4 ≈ 54.69 > 0.
Therefore, given the positive covariance, we can say that the stock prices of both companies tend to rise together.
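The same result can be confirmed with NumPy; the sketch below computes the population covariance (dividing by n) of the two stock series.

# Covariance of the two stock series above.
import numpy as np

stock_a = np.array([100, 110, 90, 95], dtype=float)
stock_b = np.array([200, 210, 190, 195], dtype=float)

cov = np.cov(stock_a, stock_b, bias=True)[0, 1]   # bias=True -> divide by n
print(cov)   # ~54.69, positive -> the two stocks tend to move together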
Data Integration: Redundancy

Technique: Use in Data Integration
•Chi-Square Test: Checks whether categorical fields (from two sources) are related or redundant.
•Pearson Correlation (r): Detects whether two numeric fields carry similar patterns; helps decide if one can be dropped.
•Covariance: Detects whether two numeric fields change together; useful for checking dependencies across sources.

Example Scenario:
You’re merging customer datasets from two branches:
•One has Income, the other has Spending, and a third has Salary.
You use:
•Correlation between Income and Salary to detect redundancy.
•Chi-Square to see if Gender and Segment from different systems have consistent relations.
•Covariance to detect directional similarity.
Data Reduction

Content:
•Large datasets can make analysis slow and impractical.
•Data reduction techniques help:
• Reduce volume
• Retain integrity of original data
• Speed up mining without losing meaningful results
•Strategies include:
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data Reduction: Dimensionality Reduction – Wavelet Transforms

Wavelet Transforms
•The discrete wavelet transform (DWT) transforms data into wavelet coefficients.
•Only the strongest coefficients are kept; the others are set to 0.
•Key advantages:
 • Reduces noise
 • Supports lossy compression
 • Works well for skewed, sparse, or high-dimensional data
•Common wavelet families: Haar, Daubechies
•Applied recursively using the pyramid algorithm; matrix multiplication with orthonormal matrices is used for the transformation.

How the Wavelet Transform Works – Step by Step
1. Pad the vector length to a power of 2.
2. Apply smoothing and differencing functions to pairs of data points.
3. Split into a low-frequency (smooth) part and a high-frequency (detail) part.
4. Repeat recursively on the smoothed part.
5. Keep the top coefficients as the compressed representation.
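To make the pairwise smoothing/differencing step concrete, here is a rough NumPy sketch of one level of the Haar transform on a length-8 vector (the input values are arbitrary); a full DWT would repeat this recursively and keep only the strongest coefficients.

# One level of the Haar wavelet transform: pairwise averages and differences.
import numpy as np

x = np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0])  # length already a power of 2

smooth = (x[0::2] + x[1::2]) / 2.0   # low-frequency (smoothed) part
detail = (x[0::2] - x[1::2]) / 2.0   # high-frequency (detail) part
print(smooth)  # [ 6.  18.  22.5 26.5]
print(detail)  # [-2.  -3.  -1.5 -1.5]

# For compression, keep only the largest-magnitude coefficients and set the rest to 0;
# the step is then repeated recursively on the smoothed part (the pyramid algorithm).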
Data Reduction: Principal Components Analysis (PCA)
Figure: PCA axes Y₁ and Y₂ projected from the original axes X₁ and X₂.
•Projects high-dimensional data onto a lower-dimensional space.
•Creates new variables (principal components) from combinations of the original ones.
•Captures the maximum variance in fewer dimensions.
•PCA steps:
 • Normalize the data
 • Compute orthonormal vectors (principal components)
 • Sort components by variance (importance)
 • Keep the top components (reduces noise and size)
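A minimal PCA sketch with scikit-learn, following the steps above (normalize, then keep the top components); the 10-attribute random matrix is purely illustrative.

# PCA-based dimensionality reduction with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))             # 100 tuples, 10 attributes

X_std = StandardScaler().fit_transform(X)  # normalize first
pca = PCA(n_components=3)                  # keep the top 3 components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # variance captured by each kept component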
Wavelet Transform vs. PCA

Feature:      Wavelet Transform / PCA
Data Type:    Works well with ordered data / Handles sparse, unordered data
Output:       Transformed wavelet coefficients / Principal components
Strength:     Local detail, noise reduction / Variance-based dimensionality reduction
Compression:  Lossy (typically) / Lossy (based on dropping low-variance components)
Data Reduction: Attribute Subset Selection

•Many datasets contain irrelevant or redundant attributes


•Including them:
• Increases computation
• Confuses learning algorithms
• Produces poor-quality patterns

•Goal: Select a minimum attribute set that retains the original data's predictive power
•Helps improve accuracy, interpretability, and speed
Data Reduction: Attribute Subset Selection
Why Attribute Subset Selection Is Needed
Why Reduce Attributes?

•Examples of irrelevant attributes:


• Phone number (when predicting music preference)
•Benefits of reduction:
• Faster mining
• Better pattern quality
• Easier interpretation
•Exhaustive selection is exponential (2ⁿ subsets)
→ Heuristic methods are used instead
Data Reduction: Attribute Subset Selection
Greedy Methods Illustration
Figure: Greedy Attribute Subset Selection
•Forward Selection:
 • Start: {}
 • Add A1, then A4 → Final reduced set: {A1, A4, A6}
•Backward Elimination:
 • Start: {A1, A2, A3, A4, A5, A6}
 • Remove A2, A3, A5 → Final reduced set: {A1, A4, A6}
•Decision Tree Induction:
 • Attributes used in the tree's splits form the selected subset.
 • The final tree uses A4, A1, A6 → discard the others.
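As a sketch of the greedy forward-selection idea (not the exact procedure from any particular figure), the code below repeatedly adds the attribute that most improves cross-validated accuracy of a simple classifier, on a synthetic dataset.

# Greedy forward attribute selection sketch with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Score every candidate attribute when added to the current subset.
    scores = {a: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    a_best = max(scores, key=scores.get)
    if scores[a_best] <= best_score:        # no further improvement -> stop
        break
    selected.append(a_best)
    remaining.remove(a_best)
    best_score = scores[a_best]

print("selected attributes:", selected, "cv accuracy:", round(best_score, 3))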
Data Reduction: Attribute Subset Selection

Attribute Evaluation Measures


How to Choose Best/Worst Attributes?
•Based on statistical significance
•Assume attributes are independent

•Common metrics:
• Information Gain
• Gini Index
• Chi-square score
•Used in decision trees and subset selection
Data Reduction: Parametric Data Reduction

Aim: Approximate data using mathematical models


•Two common parametric techniques:
• Regression Models
• Log-Linear Models
•Benefit: Reduce data volume while preserving patterns
Data Reduction: Parametric Data Reduction

Regression Models for Data Reduction


•Linear Regression:
Models relationship between two numeric attributes
y = wx + b
•y: response variable
•x: predictor variable
•w, b: coefficients
•Multiple Regression:
Extends to multiple predictors
y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

Use: Best fit line using least squares error minimization


Data Reduction: Parametric Data Reduction

Log-Linear Models
•Used for discrete attributes
•Estimate probability of data points using combinations of
lower-dimensional spaces
•Supports:
• Dimensionality reduction
• Smoothing for sparse data
•Suitable for high-dimensional categorical data
Data Reduction: Histograms

•Histograms approximate data distributions using buckets (bins)


•Each bucket summarizes a portion of the data
•Useful for reducing the size of numeric datasets
•Applied on individual attributes or multiple attributes

Used in data preprocessing, analytics, and compression


Data Reduction: Histograms

Example: The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Singleton Buckets
•Each bucket stores a single value–frequency pair.
•Helps retain exact details.
•Useful for capturing outliers and precise value counts.
•Example counts: number of 1's = 2, number of 5's = 5, number of 8's = 2, and so on.
Data Reduction: Histograms

Equal-Frequency Histograms (same AllElectronics price list as above)
•Each bucket holds roughly the same number of items.
•Useful when the value distribution is skewed.
•Better than equal-width binning for capturing balanced patterns.
•Example: 30 data points → 3 buckets → ~10 items per bucket.
•In the price list, the values from 1 to 10 alone account for 13 items: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10.
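For illustration, equal-width and roughly equal-frequency buckets for the price list can be computed with NumPy as sketched below (the equal-frequency cut uses quantiles, so bucket counts are only approximately equal when values repeat).

# Equal-width vs. (approximately) equal-frequency buckets for the price list.
import numpy as np

prices = np.array([1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,
                   18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,
                   25,25,25,25,25,28,28,30,30,30])

# Equal-width histogram: 3 buckets of equal price range.
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)

# Equal-frequency buckets: cut at the 1/3 and 2/3 quantiles instead.
q_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
q_counts, _ = np.histogram(prices, bins=q_edges)
print(q_edges, q_counts)   # roughly the same number of items per bucket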


Data Reduction: Histograms

Summary and Advantages


•Histograms reduce data size while retaining distribution
•Suitable for:
• Sparse or dense data
• Skewed or uniform data
•Multidimensional histograms:
• Capture attribute dependencies
• Work well up to ~5 attributes
Supports data summarization, compression, and faster mining
Data Reduction: Clustering for Data Reduction

•Clustering groups similar data objects into clusters


•Each cluster represents a set of similar data points
•Objects in different clusters are dissimilar
•Commonly based on distance measures (e.g., Euclidean distance)

Purpose in Data Reduction:


•Replace all records in a cluster with the cluster centroid
•Reduces data size, preserving patterns
•More effective when data naturally forms distinct groups
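A minimal sketch of clustering-based reduction with scikit-learn's KMeans: each record is replaced by its cluster centroid (the two-cluster 2-D data are synthetic).

# Clustering-based data reduction: summarize records by cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
reduced = km.cluster_centers_                 # 200 records summarized by 2 centroids
approx = km.cluster_centers_[km.labels_]      # per-record centroid approximation

print(reduced)
print(np.mean(np.linalg.norm(X - approx, axis=1)))  # average approximation error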
Data Reduction: Sampling for Data Reduction

•Sampling selects a small representative subset of a large dataset for analysis


•Reduces cost and time without scanning the full dataset
•Especially useful for aggregate query estimation

Types of Sampling :
•SRSWOR (Simple Random Sampling Without Replacement):
Selects s unique records randomly from dataset D
•SRSWR (With Replacement):
A record can be selected more than once
•Cluster Sampling:
Selects entire groups (e.g., data pages) instead of individual
records
•Stratified Sampling:
Divides dataset into strata (e.g., age groups) and samples from
each
➤ Ensures representation of minority groups in skewed data
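The sampling types above can be sketched with pandas/NumPy as follows; the age_group strata and the 5% sampling fraction are illustrative choices.

# SRSWOR, SRSWR, and stratified sampling on a toy dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
D = pd.DataFrame({"value": rng.integers(0, 100, size=1000),
                  "age_group": rng.choice(["young", "adult", "senior"],
                                          size=1000, p=[0.6, 0.3, 0.1])})
s = 50

srswor = D.sample(n=s, replace=False, random_state=0)   # without replacement
srswr  = D.sample(n=s, replace=True,  random_state=0)   # with replacement

# Stratified: sample 5% from every age group so minority strata stay represented.
stratified = D.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0))
print(len(srswor), len(srswr), len(stratified))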
Data Reduction: Data Cube Aggregation
•Purpose: Reduces data by summarizing measures across dimensions
•Aggregation = combining detailed data into higher-level summaries
•Example:
Instead of quarterly sales (Q1–Q4), aggregate into annual sales
➤ Reduces data size
➤ Keeps only necessary granularity for analysis
•Benefits:
• Smaller data size
• Faster query processing
• Retains essential analytical meaning
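A small pandas sketch of this roll-up: quarterly sales (hypothetical figures) aggregated into annual totals.

# Data cube style aggregation: quarterly sales rolled up to annual sales.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 420, 380, 600],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # one row per year instead of four -> smaller, coarser-grained data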
Data Transformation and Discretization

•Overview of transformation techniques


•Importance in preprocessing
•Link to data reduction and mining efficiency
Data Transformation Overview

•Conversion or consolidation of data into a suitable format for mining.


•Benefits:
• Improves mining efficiency.
• Simplifies patterns for better understanding.
•Strategies:
• Smoothing
• Attribute Construction
• Aggregation
• Normalization
• Discretization
• Concept Hierarchy Generation
Data Transformation Overview
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Strategies include:
1.Smoothing – Removes noise from the data. Techniques: binning, regression,
clustering.
2.Attribute Construction (Feature Construction) – Creates new attributes from
given attributes to assist the mining process.
3.Aggregation – Applies summary or aggregation (e.g., daily → monthly totals,
used in data cubes).
4.Normalization – Scales data into a smaller range (e.g., -1.0 to 1.0 or 0.0 to 1.0).
5.Discretization – Replaces raw numeric values with intervals (e.g., 0–10) or
conceptual labels (e.g., youth, adult, senior), forming a concept hierarchy.
6.Concept Hierarchy Generation for Nominal Data – Generalizes nominal
attributes (e.g., street → city → country), often defined automatically at schema
level.
Data Transformation by Normalization

Purpose:
•Scale attribute data into a common range.
•Useful for:
• Neural networks
• Distance-based methods (kNN, clustering)
Methods:
1.Min–Max Normalization
2.Z-Score Normalization
3.Decimal Scaling
Data Transformation by Normalization

Min–Max Normalization
•Maps a value v of attribute A to v′ in a new range [new_min, new_max]:
  v′ = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min
•Example 3.4: income range [$12,000, $98,000] mapped to [0.0, 1.0]; v = $73,600 → v′ = (73,600 − 12,000) / (98,000 − 12,000) × 1.0 ≈ 0.716.

Data Transformation by Normalization

Z-Score Normalization
•Normalizes v using the mean and standard deviation of A:
  v′ = (v − mean_A) / std_A
•Example 3.5: with mean income $54,000 and standard deviation $16,000, v = $73,600 → v′ = (73,600 − 54,000) / 16,000 = 1.225.

Data Transformation by Normalization

Decimal Scaling Normalization
•Moves the decimal point of v based on the maximum absolute value of A:
  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
•Example: if the values of A range from −986 to 917, then j = 3 and −986 → −0.986.
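The three formulas can be checked with a few lines of NumPy, reusing the income figures from the worked examples; the decimal-scaling exponent is computed from the maximum absolute value (a sketch, not robust to edge cases such as a maximum that is exactly a power of ten).

# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = 73600.0
min_a, max_a = 12000.0, 98000.0
mean_a, std_a = 54000.0, 16000.0

minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # -> ~0.716
zscore = (v - mean_a) / std_a                                # -> 1.225

a = np.array([-986.0, 345.0, 917.0])
j = int(np.ceil(np.log10(np.max(np.abs(a)))))                # smallest j with max|v/10^j| < 1
decimal_scaled = a / 10**j                                   # -> [-0.986, 0.345, 0.917]

print(round(minmax, 3), round(zscore, 3), decimal_scaled, j)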


Data Discretization

1.Discretization by Binning
2.Discretization by Histogram Analysis
3.Discretization by Cluster, Decision Tree, and Correlation Analyses
4.Concept Hierarchy Generation for Nominal Data
Data Discretization: Discretization by Binning

•Definition: Top-down splitting technique that divides attribute values into a fixed
number of bins.
•Types:
• Equal-width binning: Bins have equal value range.
• Equal-frequency binning: Each bin has the same number of tuples.
•Usage:
• Can replace bin values with bin mean or median (smoothing).
• Can be applied recursively to create concept hierarchies.
•Nature: Unsupervised, sensitive to bin count and outliers.
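A short pandas sketch of equal-width vs. equal-frequency discretization, plus replacement with conceptual labels; the age values and cut points are illustrative.

# Equal-width and equal-frequency discretization with pandas.
import pandas as pd

ages = pd.Series([18, 21, 22, 25, 27, 30, 35, 41, 52, 60, 63, 70])

equal_width = pd.cut(ages, bins=3)    # 3 bins of equal value range
equal_freq  = pd.qcut(ages, q=3)      # 3 bins with roughly equal tuple counts

# Replacing values with conceptual labels produces a simple concept hierarchy.
labels = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle_aged", "senior"])
print(pd.concat([ages, equal_width, equal_freq, labels], axis=1))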
Data Discretization: Discretization by Histogram Analysis

•Definition: Unsupervised method partitioning attribute values into


disjoint ranges (“bins”).
•Types:
• Equal-width histogram: Fixed value range per bin.
• Equal-frequency histogram: Equal number of values per bin.
•Features:
• Recursive application can build multi-level concept hierarchies.
• Minimum interval size can control recursion.
•Example: Price ranges of $10 (equal-width) or equal number of items
(equal-frequency).
Data Discretization: Discretization by Cluster, Decision Tree & Correlation Analyses

•Clustering: Groups attribute values into clusters based on similarity and


distribution.
• Top-down: Split clusters further.
• Bottom-up: Merge neighboring clusters.
•Decision Tree Analysis: Supervised; uses class labels and entropy to select
split-points that improve classification accuracy.
•Correlation Analysis (ChiMerge):
• Bottom-up, supervised method.
• Merge intervals with similar class distributions using χ² test.
• Stop merging when a predefined criterion is met.
Data Discretization: Concept Hierarchy Generation for Nominal Data
•Organizes nominal attribute values into higher abstraction levels for flexible mining.
•Methods:
 • Manual Ordering: Specify the hierarchy at the schema level (e.g., street < city < state < country).
 • Explicit Grouping: Manually group intermediate-level values.
 • Automatic Ordering: Sort attributes by their number of distinct values (fewer distinct values → higher level in the hierarchy).
 • Partial Specification with Semantic Linking: Automatically include related attributes based on semantic relationships.
•Example: A location hierarchy (Figure 3.13) derived from distinct value counts.
