DMDW Unit II
UNIT II
Preprocessing Tasks and Examples
•Data Cleaning: Fill missing ages with the average; remove typos
•Data Integration: Merge sales data from branches A and B
•Data Reduction: PCA to reduce 10 features to 3
•Data Transformation: Normalize salary between 0 and 1
•Data Discretization: Convert "age=21" → "young" category
Data Cleaning
What is Data Cleaning?
Data cleaning is a fundamental step in the data preprocessing process, where the goal is to detect
and correct errors and inconsistencies in the data to improve its quality and usefulness for mining.
Why is Data Cleaning Needed?
In real-world scenarios, data collected from various sources often suffer from:
•Missing values (e.g., blank fields in forms)
•Noisy data (e.g., typos, outliers, errors from sensors)
•Inconsistent data (e.g., differing date formats like "2025/01/01" vs "01-Jan-2025")
Techniques for Handling Noisy Data:
1. Binning: Sort the data and smooth values within "bins"; methods: by bin mean, median, or boundary.
2. Regression: Fit a function (linear/nonlinear) to predict values; smooth deviations from the fitted line.
3. Clustering: Group similar values together; treat outliers as noise.
4. Combined Computer & Human Inspection: Manually verify values flagged by the system.
5. Smoothing by Aggregation: Replace individual values with grouped averages.
Data Cleaning
Noisy Data: Binning
•Sort the data and partition it into "bins"; smooth within each bin by mean, median, or boundary.
Example: smoothing by bin boundaries
Bin 1: 4, 8, 15
•Min = 4, Max = 15
•Replace each value with the closer boundary:
 • 4 → 4 (equal to min)
 • 8 → closer to 4 than to 15 → 4
 • 15 → 15 (equal to max)
Result: 4, 4, 15
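A minimal Python sketch of smoothing by bin boundaries and by bin means; the data below extends the slide's Bin 1 (4, 8, 15) with illustrative values so that three equal-frequency bins can be formed:

```python
def smooth_by_boundaries(bins):
    """Replace each value in a bin by whichever of the bin's min or max is closer."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if (v - lo) <= (hi - v) else hi for v in b])
    return smoothed

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

# Sorted data partitioned into equal-frequency bins of 3 values each
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```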
Data Cleaning
Noisy Data: Regression
Types of Regression for Smoothing:
1. Linear Regression:
 • Fits a straight line to data points involving two variables (X and Y).
 • Used when one attribute can predict another.
Example data:
Price (X): 100, 200, 300, 400
Sales (Y): 40, 80, 130, 160
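A minimal sketch of regression-based smoothing on the Price/Sales table above, using numpy's least-squares line fit:

```python
import numpy as np

# Price (X) and Sales (Y) pairs from the table above
price = np.array([100, 200, 300, 400], dtype=float)
sales = np.array([40, 80, 130, 160], dtype=float)

# Fit a straight line Y = a*X + b by least squares
a, b = np.polyfit(price, sales, deg=1)

# Smoothing: replace the observed values by the values on the fitted line
fitted = a * price + b
print(f"slope = {a:.3f}, intercept = {b:.1f}")   # slope = 0.410, intercept = 0.0
print("smoothed sales:", fitted.round(1))        # [ 41.  82. 123. 164.]
```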
Use of Clustering:
•Helps separate normal behavior (clustered points) from
anomalies.
•Useful for detecting fraud, sensor errors, or extreme events.
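A minimal sketch of clustering-based noise detection; the readings are hypothetical, and scikit-learn's DBSCAN is used here because it labels points that fit no cluster as noise (-1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical sensor readings: two dense groups plus a few extreme values
readings = np.array([[1.0], [1.1], [0.9], [1.2],
                     [5.0], [5.1], [4.9], [5.2],
                     [12.0], [-3.0]])

# Points without enough close neighbours receive the noise label -1
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(readings)
print("cluster labels:", labels)
print("flagged as noise:", readings[labels == -1].ravel())   # [12. -3.]
```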
Data Integration
What is Data Integration?
•Data Integration is the process of combining data from multiple heterogeneous sources into a coherent data
store.
•Common in enterprise systems where data comes from:
• Databases
• Flat files
• Web data
• Sensor logs
Key Challenges:
•Inconsistent attribute names and formats
•Missing or conflicting values
•Duplicate records or tuples
•Matching entities from different systems
Data Integration
What is the Entity Identification Problem?
•Occurs when data from multiple sources refer to the same real-world entity but are represented differently.
•Also known as record linkage, object matching, or merge-purge problem.
Why It Happens:
•Different databases may use:
• Different names or aliases
e.g., "Cust_ID" vs. "CustomerNumber"
• Different formats or data types
e.g., Date of birth: “12/02/2001” vs. “2001-02-12”
• Different identifiers for the same entity
e.g., Employee ID vs. Social Security Number
Goal:
To identify and unify such records before loading into a data warehouse.
Data Integration
What is Redundancy?
•An attribute is redundant if it can be derived from another attribute or set of attributes (e.g., annual revenue derivable from monthly sales).
•Inconsistencies in attribute or dimension naming can also cause redundancies.
•Redundancy can be detected by correlation analysis: the χ² test for nominal data, and the correlation coefficient or covariance for numeric data.
Data Integration: Redundancy
Chi-Square Test for Nominal Data
•For nominal attributes A and B, build a contingency table of their joint counts and compute
 χ² = Σ (observed − expected)² / expected, where expected(i, j) = count(A = aᵢ) × count(B = bⱼ) / n.
Example:
•Analyzed gender vs. reading preference
•Used a contingency table (Table 3.1)
•Result: high χ² value → strong correlation
Interpretation: If χ² is high and the p-value is below the threshold (e.g., 0.001), reject the null hypothesis of independence → the attributes are correlated.
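A minimal sketch of the test with scipy; the contingency counts below are illustrative, not the values from Table 3.1:

```python
from scipy.stats import chi2_contingency

# Illustrative contingency table:
# rows = gender (male, female), columns = preferred reading (fiction, non-fiction)
observed = [[250, 50],
            [200, 1000]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}, dof = {dof}")

# p below the chosen threshold (e.g., 0.001) -> reject independence -> correlated
if p_value < 0.001:
    print("gender and reading preference are correlated")
```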
Data Integration: Redundancy
Correlation Coefficient for Numeric Data
•For numeric attributes A and B, the (Pearson) correlation coefficient is
 r(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / (n · σ_A · σ_B)
•Value ranges:
 • +1: perfect positive correlation
 • –1: perfect negative correlation
 • 0: no linear correlation
Interpretation:
•If r ≈ 0 → no redundancy
•If r is close to ±1 → one attribute may be removed
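A minimal numpy sketch; the Income and Salary values are hypothetical:

```python
import numpy as np

# Hypothetical Income and Salary columns from two merged customer tables
income = np.array([42000, 55000, 61000, 48000, 70000], dtype=float)
salary = np.array([40000, 54000, 60000, 47000, 69000], dtype=float)

r = np.corrcoef(income, salary)[0, 1]   # Pearson correlation coefficient
print(f"r = {r:.3f}")

# r close to +1 or -1 suggests redundancy: one of the attributes may be dropped
if abs(r) > 0.9:
    print("strongly correlated -> candidate for removal")
```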
Data Integration: Redundancy
Covariance
•Covariance measures how two numeric attributes vary together:
 Cov(A, B) = E[(A − Ā)(B − B̄)] = E[A·B] − Ā·B̄
•In the stock-price example, the covariance is positive, so we can say that the stock prices of both companies rise together.
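A minimal sketch of the covariance computation; the weekly prices below are illustrative stand-ins for the two companies:

```python
import numpy as np

# Illustrative weekly closing prices for two companies
stock_a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
stock_b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])

# Cov(A, B) = E[(A - mean_A)(B - mean_B)]
cov = np.mean((stock_a - stock_a.mean()) * (stock_b - stock_b.mean()))
print(f"Cov(A, B) = {cov:.2f}")   # positive -> the two prices tend to rise together
```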
Data Integration: Redundancy
Example Scenario:
You’re merging customer datasets from two branches:
•One has Income, the other has Spending, and a third has Salary.
You use:
•Correlation between Income and Salary to detect redundancy.
•Chi-Square to see if Gender and Segment from different systems have consistent relations.
•Covariance to detect directional similarity.
Data Reduction
Content:
•Large datasets can make analysis slow and impractical.
•Data reduction techniques help:
• Reduce volume
• Retain integrity of original data
• Speed up mining without losing meaningful results
•Strategies include:
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data Reduction: Dimensionality Reduction – Wavelet Transforms
Wavelet Transforms
•The discrete wavelet transform (DWT) transforms data into wavelet coefficients
•Only the strongest coefficients are kept; the others are set to 0
•Key advantages:
 • Reduces noise
 • Supports lossy compression
 • Works well for skewed, sparse, or high-dimensional data
•Common wavelet families: Haar, Daubechies
•Applied recursively using the pyramid algorithm (orthonormal matrices); matrix multiplication is used for the transformation
How the Wavelet Transform Works – Step-by-Step:
1. Pad the vector length to a power of 2
2. Apply smoothing and differencing functions to data pairs
3. Split into low-frequency (smooth) and high-frequency (detail) parts
4. Repeat recursively
5. Keep the top coefficients as the compressed representation
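A minimal numpy sketch of the Haar pyramid algorithm (libraries such as PyWavelets provide full implementations); the input vector is illustrative and already a power-of-2 length:

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar transform: pairwise smoothing and differencing."""
    x = np.asarray(x, dtype=float)
    smooth = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency (averages)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency (differences)
    return smooth, detail

def haar_pyramid(x):
    """Apply the transform recursively to the smooth part (pyramid algorithm)."""
    coeffs = []
    smooth = np.asarray(x, dtype=float)
    while len(smooth) > 1:
        smooth, detail = haar_dwt(smooth)
        coeffs.append(detail)
    coeffs.append(smooth)
    return coeffs

data = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
coeffs = haar_pyramid(data)

# Keep only the strongest coefficients; the rest are set to 0 (lossy compression)
flat = np.concatenate(coeffs)
threshold = np.sort(np.abs(flat))[-3]            # keep the 3 largest coefficients
compressed = np.where(np.abs(flat) >= threshold, flat, 0.0)
print(compressed)
```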
Data Reduction: Principal Components Analysis (PCA)
Figure: PCA axes (Y₁ and Y₂ projected from X₁ and X₂)
•Projects high-dimensional data onto a lower-dimensional space
•Creates new variables (principal components) from combinations of the original ones
•Captures the maximum variance in fewer dimensions
•PCA steps:
 • Normalize the data
 • Compute orthonormal vectors (principal components)
 • Sort by variance (importance)
 • Keep the top components (reduce noise & size)
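A minimal scikit-learn sketch of the steps above (normalize, compute components, keep the top ones); the 10-attribute dataset is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 records described by 10 correlated numeric attributes
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7)) + 0.1 * rng.normal(size=(100, 7))])

# 1. Normalize, 2. compute principal components, 3. keep the top 3 (sorted by variance)
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)                        # (100, 3)
print("variance explained:", pca.explained_variance_ratio_.sum().round(3))
```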
Wavelet vs PCA
•PCA tends to handle sparse data better, while wavelet transforms are more suitable for data of high dimensionality.
Data Reduction: Attribute Subset Selection
Why Attribute Subset Selection Is Needed
•Goal: Select a minimum attribute set that retains the original data's predictive power
•Helps improve accuracy, interpretability, and speed
•Common metrics for ranking attributes:
 • Information Gain
 • Gini Index
 • Chi-square score
•Used in decision trees and subset selection
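A minimal scikit-learn sketch that ranks attributes by mutual information (an information-gain style score) and keeps the best two; the Iris dataset is used purely as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score every attribute against the class label, then keep the k best
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("scores per attribute:", np.round(selector.scores_, 3))
print("selected attribute indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)   # (150, 2)
```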
Data Reduction: Parametric Data Reduction
Log-Linear Models
•Used for discrete attributes
•Estimate probability of data points using combinations of
lower-dimensional spaces
•Supports:
• Dimensionality reduction
• Smoothing for sparse data
•Suitable for high-dimensional categorical data
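A minimal sketch of the core idea with a simple decomposable log-linear model: a 3-dimensional joint distribution over binary attributes is estimated from 2-dimensional marginals (the counts are hypothetical):

```python
import numpy as np

# Hypothetical joint counts over three binary attributes A, B, C (shape 2x2x2)
counts = np.array([[[30, 10], [5, 15]],
                   [[20, 5],  [10, 25]]], dtype=float)
joint = counts / counts.sum()

# Decomposable log-linear model A-B-C: estimate the 3-D joint from
# two 2-D marginals, P(A,B,C) ≈ P(A,B) * P(B,C) / P(B)
p_ab = joint.sum(axis=2)          # marginal over C
p_bc = joint.sum(axis=0)          # marginal over A
p_b = joint.sum(axis=(0, 2))      # marginal over A and C

estimate = p_ab[:, :, None] * p_bc[None, :, :] / p_b[None, :, None]
print("max absolute error of the estimate:", np.abs(estimate - joint).max().round(3))
```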
Data Reduction: Histograms
The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
•Counting frequencies for a histogram: No. of 1's: 2, No. of 5's: 5, No. of 8's: 2, and so on.
Equal-Frequency Histograms
•Each bucket has roughly the same number of items
•Useful when the value distribution is skewed
•Better than equal-width for capturing balanced patterns
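A minimal numpy sketch that splits the sorted price list above into four equal-frequency buckets (the bucket count is an illustrative choice):

```python
import numpy as np

# AllElectronics price list from the slide (already sorted)
prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20,
          20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Equal-frequency histogram: buckets with (roughly) the same number of items
n_buckets = 4
buckets = np.array_split(np.array(prices), n_buckets)
for i, b in enumerate(buckets, 1):
    print(f"bucket {i}: range {b.min()}-{b.max()}, count {len(b)}")
```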
Data Reduction: Sampling for Data Reduction
Types of Sampling :
•SRSWOR (Simple Random Sampling Without Replacement):
Selects s unique records randomly from dataset D
•SRSWR (With Replacement):
A record can be selected more than once
•Cluster Sampling:
Selects entire groups (e.g., data pages) instead of individual
records
•Stratified Sampling:
Divides dataset into strata (e.g., age groups) and samples from
each
➤ Ensures representation of minority groups in skewed data
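A minimal pandas sketch of SRSWOR, SRSWR, and stratified sampling; the customer table and age-group proportions are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical customer dataset D with an age-group attribute for stratification
rng = np.random.default_rng(1)
D = pd.DataFrame({
    "customer_id": range(1, 1001),
    "age_group": rng.choice(["young", "middle", "senior"], size=1000, p=[0.6, 0.3, 0.1]),
})

s = 50
srswor = D.sample(n=s, replace=False, random_state=1)   # SRSWOR: each record at most once
srswr = D.sample(n=s, replace=True, random_state=1)     # SRSWR: a record may repeat

# Stratified sampling: draw the same fraction from every age-group stratum
stratified = (D.groupby("age_group", group_keys=False)
                .apply(lambda g: g.sample(frac=s / len(D), random_state=1)))

print(stratified["age_group"].value_counts())           # minority groups stay represented
```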
Data Reduction: Data Cube Aggregation
•Purpose: Reduces data by summarizing measures across dimensions
•Aggregation = combining detailed data into higher-level summaries
•Example:
Instead of quarterly sales (Q1–Q4), aggregate into annual sales
➤ Reduces data size
➤ Keeps only necessary granularity for analysis
•Benefits:
• Smaller data size
• Faster query processing
• Retains essential analytical meaning
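A minimal pandas sketch of aggregating quarterly detail into annual totals; the branches and amounts are hypothetical:

```python
import pandas as pd

# Hypothetical quarterly sales detail (one row per branch and quarter)
sales = pd.DataFrame({
    "branch":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":    [2024] * 8,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [224, 408, 350, 586, 301, 412, 390, 510],
})

# Aggregate quarterly detail up to annual totals per branch (coarser granularity)
annual = sales.groupby(["branch", "year"], as_index=False)["amount"].sum()
print(annual)   # 8 detail rows reduced to 2 summary rows
```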
Data Transformation and Discretization
Purpose:
•Scale attribute data into a common range.
•Useful for:
• Neural networks
• Distance-based methods (kNN, clustering)
Methods:
1.Min–Max Normalization
2.Z-Score Normalization
3.Decimal Scaling
Data Transformation by Normalization
Min–Max Normalization
•Maps a value v of attribute A to v′ in a new range [new_min, new_max]:
 v′ = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min
Data Transformation by Normalization
Z-Score Normalization
•Maps a value v of attribute A using the mean Ā and standard deviation σ_A:
 v′ = (v − Ā) / σ_A
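A minimal numpy sketch of min–max and z-score normalization on a hypothetical salary column:

```python
import numpy as np

salary = np.array([12000, 16000, 30000, 54000, 98000], dtype=float)

# Min-max normalization to [0, 1]
minmax = (salary - salary.min()) / (salary.max() - salary.min())

# Z-score normalization (mean 0, standard deviation 1)
zscore = (salary - salary.mean()) / salary.std()

print("min-max:", minmax.round(3))
print("z-score:", zscore.round(3))
```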
Data Transformation by Normalization
Decimal Scaling
•Moves the decimal point of values of A: v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.
Data Discretization Methods:
1.Discretization by Binning
2.Discretization by Histogram Analysis
3.Discretization by Cluster, Decision Tree, and Correlation Analyses
4.Concept Hierarchy Generation for Nominal Data
Data Discretization: Discretization by Binning
•Definition: Top-down splitting technique that divides attribute values into a fixed
number of bins.
•Types:
• Equal-width binning: Bins have equal value range.
• Equal-frequency binning: Each bin has the same number of tuples.
•Usage:
• Can replace bin values with bin mean or median (smoothing).
• Can be applied recursively to create concept hierarchies.
•Nature: Unsupervised, sensitive to bin count and outliers.
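A minimal pandas sketch contrasting equal-width and equal-frequency binning on a hypothetical age column:

```python
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 35, 36, 40, 45, 46, 52, 70])

# Equal-width binning: every bin spans the same value range
equal_width = pd.cut(ages, bins=3, labels=["young", "middle_aged", "senior"])

# Equal-frequency binning: every bin holds roughly the same number of tuples
equal_freq = pd.qcut(ages, q=3, labels=["young", "middle_aged", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```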
Data Discretization: Discretization by Histogram Analysis