
Data Warehousing and Data Mining

UNIT II

Data Preprocessing: An Overview, Data Cleaning, Data Integration,


Data Reduction, Data Transformation and Data Discretization.
Data Preprocessing: An Overview

Why Preprocess the Data

•Real-world data is often dirty, incomplete, inconsistent, and noisy.


•Poor data quality affects mining results — “Garbage In, Garbage Out.”
•Preprocessing improves:
• Accuracy of patterns/models
• Efficiency of mining algorithms
• Interpretability of results
Common Data Issues:
•Incomplete: Missing attributes or records
•Noisy: Errors, outliers, or wrong values
•Inconsistent: Conflicting data or formats

Preprocessing ensures high-quality, reliable input for mining.


Data Preprocessing: An Overview

Major Preprocessing Tasks

The key tasks involved in preparing data for mining include:


1. Data Cleaning:
   • Removes noise and handles missing values.
   • Ensures consistency and accuracy.
2. Data Integration:
   • Combines data from multiple sources into a coherent store.
   • Resolves schema and entity conflicts.
3. Data Reduction:
   • Reduces volume without losing valuable data.
   • Includes techniques such as PCA, sampling, and aggregation.
4. Data Transformation:
   • Converts data into an appropriate format or scale.
   • Includes normalization, encoding, and aggregation.
Data Preprocessing: An Overview

Preprocessing: Summary & Examples

Task: Example
•Data Cleaning: Fill missing ages with the average; remove typos
•Data Integration: Merge sales data from branches A and B
•Data Reduction: PCA to reduce 10 features to 3
•Data Transformation: Normalize salary between 0 and 1
•Data Discretization: Convert “age = 21” into the “young” category
Data Cleaning
What is Data Cleaning?
Data cleaning is a fundamental step in the data preprocessing process, where the goal is to detect
and correct errors and inconsistencies in the data to improve its quality and usefulness for mining.
Why is Data Cleaning Needed?
In real-world scenarios, data collected from various sources often suffer from:
•Missing values (e.g., blank fields in forms)
•Noisy data (e.g., typos, outliers, errors from sensors)
•Inconsistent data (e.g., differing date formats like "2025/01/01" vs "01-Jan-2025")

These problems arise due to:


•Faulty data collection instruments
•Human errors during data entry
•Lack of standardization
•Integration of data from multiple sources

Goals of Data Cleaning:


•Improve the accuracy, completeness, and consistency of data.
•Make the dataset suitable for analysis and mining.
•Reduce the risk of biased or incorrect conclusions.
Data Cleaning
Missing Values
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill
in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded
value for several attributes such as customer income. How can you go about filling in the missing values for this attribute?
Let’s look at the following methods.
1. Ignore the tuple:
• Skip the entire record if the class label is missing.
• Suitable only for large datasets with few missing values.
2. Fill in the missing value manually:
• Manually enter the missing data.
• Impractical for large datasets.
3. Use a global constant to fill in the missing value:
• Replace missing values with a constant like "Unknown" or -1.
• Risk: The mining algorithm might treat the constant as meaningful data.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
• Fill using the mean (for symmetric data) or median (for skewed data).
• Example: Replace missing income with the average income of $56,000.
Data Cleaning
Missing Values
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple:
• Replace using mean/median of the same class.
• Example: Use the average income of “low-risk” customers only.
6. Use the most probable value to fill in the missing value:
• Predict using models (e.g., regression, Bayesian, decision tree).
• More accurate but computationally more complex.
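The strategies above map directly onto a few lines of pandas. Below is a minimal sketch, assuming a small illustrative table with an income attribute and a risk class label (both hypothetical); it shows filling with the overall mean, filling with the class-wise mean, and dropping incomplete tuples.

# Minimal sketch of common missing-value strategies with pandas.
# The columns "income" and "risk" are illustrative, not from the original data set.
import pandas as pd

df = pd.DataFrame({
    "income": [48000, None, 56000, None, 72000, 61000],
    "risk":   ["low", "low", "high", "high", "low", "high"],
})

# Method 4: fill with a global measure of central tendency (mean or median).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of the same class (here, the same "risk" group).
df["income_class_filled"] = df["income"].fillna(
    df.groupby("risk")["income"].transform("mean")
)

# Method 1: or simply drop tuples that still have missing values.
df_complete = df.dropna(subset=["income"])
print(df)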
Data Cleaning
Noisy Data
What is Noisy Data?
Data with random errors, outliers, or incorrect values.
Common in sensor data, user entries, or during transmission.
Causes of Noise:
Faulty instruments
Human entry errors
Data corruption or inconsistencies
Example:
A customer’s age is entered as 250 or salary as 0 — both are likely noise.
Data Cleaning
Noisy Data: Techniques to Handle Noisy Data

1. Binning:
   • Sort the data and smooth values within “bins.”
   • Methods: smoothing by bin means, medians, or boundaries.
2. Regression:
   • Fit a function (linear or nonlinear) to predict values.
   • Smooth deviations from the fitted line.
3. Clustering:
   • Group similar values together; treat outliers as noise.
4. Combined Computer & Human Inspection:
   • Manually verify values flagged by the system.
5. Smoothing by Aggregation:
   • Replace individual values with grouped averages.
Data Cleaning
Noisy Data: Binning
1. Binning:
   • Sort the data and smooth values within “bins.”
   • Methods: smoothing by bin means, medians, or boundaries.

Example: sorted price data 4, 8, 15, 21, 21, 24, 25, 28, 34, partitioned into three equal-frequency bins of three values each.

Smoothing by Bin Means
Bin 1: Mean = (4 + 8 + 15) / 3 = 9 → Smoothed values: 9, 9, 9
Bin 2: Mean = (21 + 21 + 24) / 3 = 22 → Smoothed values: 22, 22, 22
Bin 3: Mean = (25 + 28 + 34) / 3 = 29 → Smoothed values: 29, 29, 29
Smoothing by Bin Boundaries
How it works:
•For each bin, identify the minimum and maximum values (boundaries).
•Replace each value with the closest boundary (either min or max).

Bin 1: 4, 8, 15
•Min = 4, Max = 15
•Replace:
• 4 → 4 (equal to min)
• 8 → closer to 4 than to 15 → 4
• 15 → 15 (equal to max)
Result: 4, 4, 15
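The two smoothing variants above can be reproduced with plain Python; the sketch below uses the nine sorted prices from the example and equal-frequency bins of three values each.

# Rough sketch of equal-frequency binning with smoothing by bin means
# and by bin boundaries (data taken from the example above).
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]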
Data Cleaning
Noisy Data: Regression

Example data:
Price (X)   Sales (Y)
100         40
200         80
300         130
400         160
500         210

Types of Regression for Smoothing:
1. Linear Regression:
   • Fits a straight line Y = aX + b to data points involving two variables (X and Y).
   • Used when one attribute can predict another (e.g., sales vs. price).
   • Example: fitting the data above by least squares gives the line Y = 0.42X − 2.
     Predicted sales at X = 350 → Y = 0.42 × 350 − 2 = 145.
     If the actual value is Y = 180 (a noisy value), it is smoothed to 145.
2. Multiple Linear Regression:
   • Extends linear regression to more than two variables.
   • Fits the data to a multidimensional surface.
   • Useful when multiple features influence a target (e.g., sales = f(price, ad budget, rating)).
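As a rough illustration of regression-based smoothing, the NumPy sketch below fits the least-squares line to the example table (recovering Y = 0.42X − 2) and replaces a noisy observation with its fitted value.

# Smoothing by linear regression with NumPy.
import numpy as np

price = np.array([100, 200, 300, 400, 500], dtype=float)
sales = np.array([40, 80, 130, 160, 210], dtype=float)

slope, intercept = np.polyfit(price, sales, deg=1)   # least-squares line
print(slope, intercept)                              # ~0.42, ~-2.0

# Smooth a noisy observation by replacing it with the fitted value.
x_new, y_noisy = 350.0, 180.0
y_fit = slope * x_new + intercept                    # 145.0
print(f"noisy y={y_noisy} smoothed to {y_fit:.1f}")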
Data Cleaning
Noisy Data: Outlier Analysis
Outlier Detection Using Clustering
•Clustering groups similar data points based on attributes.
•Outliers are data points that lie outside all clusters.
•These are potential errors, anomalies, or rare events.

Use of Clustering:
•Helps separate normal behavior (clustered points) from
anomalies.
•Useful for detecting fraud, sensor errors, or extreme events.
Data Integration
What is Data Integration?
•Data Integration is the process of combining data from multiple heterogeneous sources into a coherent data
store.
•Common in enterprise systems where data comes from:
• Databases
• Flat files
• Web data
• Sensor logs

Why is Data Integration Needed?


•Enables complete and unified views for analysis.
•Avoids fragmented decision-making.
•Essential for building data warehouses and performing data mining.

Key Challenges:
•Inconsistent attribute names and formats
•Missing or conflicting values
•Duplicate records or tuples
•Matching entities from different systems
Data Integration
What is the Entity Identification Problem?
•Occurs when data from multiple sources refer to the same real-world entity but are represented differently.
•Also known as record linkage, object matching, or merge-purge problem.

Why It Happens:
•Different databases may use:
• Different names or aliases
e.g., "Cust_ID" vs. "CustomerNumber"
• Different formats or data types
e.g., Date of birth: “12/02/2001” vs. “2001-02-12”
• Different identifiers for the same entity
e.g., Employee ID vs. Social Security Number

Goal:
To identify and unify such records before loading into a data warehouse.
Data Integration

What is Redundancy?

Redundancy in Data Integration

•Redundancy means an attribute can be derived from other attributes.


• Example: Annual revenue might be calculated from monthly sales.
•Common causes:
• Derived or duplicate attributes
• Inconsistent naming (e.g., “salary” vs. “income”)
• Dimensional misalignment
Solution: Use correlation analysis to detect and reduce redundancy.
Data Integration: Redundancy
Chi-Square Test for Nominal Data

χ² Correlation Test for Nominal Attributes

•Used to test the correlation between two categorical (nominal) attributes.
•Construct a contingency table and calculate:
  χ² = Σ (observed − expected)² / expected,
  where the expected count of a cell = (row total × column total) / grand total.

Example:
•Analyzed gender vs. reading preference.
•Used a contingency table (Table 3.1).
•Result: a high χ² value → strong correlation.
Interpretation: If χ² is high and the p-value is below the threshold (e.g.,
0.001), reject the null hypothesis of independence → the attributes are correlated.
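For illustration, the χ² test can be run with SciPy's chi2_contingency. The sketch below assumes the gender vs. preferred-reading counts commonly used in the textbook example (250/50 and 200/1000); treat the exact figures as illustrative.

# Chi-square correlation test for two nominal attributes with SciPy.
import numpy as np
from scipy.stats import chi2_contingency

#                     fiction  non-fiction
observed = np.array([[   250,        50],   # male
                     [   200,      1000]])  # female

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)   # large chi2, tiny p-value -> the attributes are correlated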
Data Integration: Redundancy
Correlation Coefficient for Numeric Data

Pearson’s Correlation Coefficient (Numeric)

•For attributes A and B with n tuples:
  r(A, B) = Σ (aᵢ − Ā)(bᵢ − B̄) / (n · σ_A · σ_B)
•Value ranges:
 • +1: perfect positive correlation
 • −1: perfect negative correlation
 • 0: no linear correlation

Interpretation:
•If r ≈ 0 → no linear relationship (no redundancy).
•If r is close to ±1 → the attributes are strongly correlated; one may be removed as redundant.
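A minimal sketch of redundancy checking with Pearson's r, using NumPy; the income and salary arrays are hypothetical, and the 0.95 cutoff is an arbitrary choice.

# Checking two numeric fields for redundancy via Pearson's r.
import numpy as np

income = np.array([48.0, 52.0, 61.0, 70.0, 75.0])
salary = np.array([47.0, 53.0, 60.0, 69.0, 76.0])

r = np.corrcoef(income, salary)[0, 1]
print(r)            # close to +1 -> the attributes are largely redundant
if abs(r) > 0.95:   # threshold is a judgement call
    print("Consider dropping one of the two attributes")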
Data Integration: Redundancy
Covariance

Covariance measures how two numeric variables change together:
  Cov(A, B) = (1/n) Σ (aᵢ − Ā)(bᵢ − B̄)

•If both variables increase together, the covariance is positive.
•If one increases while the other decreases, the covariance is negative.
•If there is no consistent pattern, the covariance is near zero.

Example:
Day   Stock A   Stock B
1     100       200
2     110       210
3     90        190
4     95        195

Here, both stocks tend to move up and down together, so the covariance is positive.
Data Integration: Redundancy

Covariance and Example

For the stock data above, Cov(A, B) = ((1.25)(1.25) + (11.25)(11.25) + (−8.75)(−8.75) + (−3.75)(−3.75)) / 4 ≈ 54.69 > 0.
Therefore, given the positive covariance, we can say that the stock prices of both companies tend to rise together.
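The same result can be confirmed with NumPy; the sketch below computes the population covariance (dividing by n) of the two stock series.

# Covariance of the two stock series above.
import numpy as np

stock_a = np.array([100, 110, 90, 95], dtype=float)
stock_b = np.array([200, 210, 190, 195], dtype=float)

cov = np.cov(stock_a, stock_b, bias=True)[0, 1]   # bias=True -> divide by n
print(cov)   # ~54.69, positive -> the two stocks tend to move together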
Data Integration: Redundancy

Technique: Use in Data Integration
•Chi-Square Test: Checks whether categorical fields (from two sources) are related or redundant.
•Pearson Correlation (r): Detects whether two numeric fields carry similar patterns; helps decide if one can be dropped.
•Covariance: Detects whether two numeric fields change together; useful for checking dependencies across sources.

Example Scenario:
You’re merging customer datasets from two branches:
•One has Income, the other has Spending, and a third has Salary.
You use:
•Correlation between Income and Salary to detect redundancy.
•Chi-Square to see if Gender and Segment from different systems have consistent relations.
•Covariance to detect directional similarity.
Data Reduction

Content:
•Large datasets can make analysis slow and impractical.
•Data reduction techniques help:
• Reduce volume
• Retain integrity of original data
• Speed up mining without losing meaningful results
•Strategies include:
• Dimensionality reduction
• Numerosity reduction
• Data compression
Data Reduction: Dimensionality Reduction – Wavelet Transforms

Wavelet Transforms
•The discrete wavelet transform (DWT) transforms data into wavelet coefficients.
•Only the strongest coefficients are kept; the others are set to 0.
•Key advantages:
 • Reduces noise
 • Supports lossy compression
 • Works well for skewed, sparse, or high-dimensional data
•Common wavelet families: Haar, Daubechies
•Applied recursively using the pyramid algorithm; matrix multiplication with orthonormal matrices is used for the transformation.

How the Wavelet Transform Works – Step by Step
1. Pad the vector length to a power of 2.
2. Apply smoothing and differencing functions to pairs of data points.
3. Split into a low-frequency (smooth) part and a high-frequency (detail) part.
4. Repeat recursively on the smoothed part.
5. Keep the top coefficients as the compressed representation.
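To make the pairwise smoothing/differencing step concrete, here is a rough NumPy sketch of one level of the Haar transform on a length-8 vector (the input values are arbitrary); a full DWT would repeat this recursively and keep only the strongest coefficients.

# One level of the Haar wavelet transform: pairwise averages and differences.
import numpy as np

x = np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0])  # length already a power of 2

smooth = (x[0::2] + x[1::2]) / 2.0   # low-frequency (smoothed) part
detail = (x[0::2] - x[1::2]) / 2.0   # high-frequency (detail) part
print(smooth)  # [ 6.  18.  22.5 26.5]
print(detail)  # [-2.  -3.  -1.5 -1.5]

# For compression, keep only the largest-magnitude coefficients and set the rest to 0;
# the step is then repeated recursively on the smoothed part (the pyramid algorithm).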
Data Reduction: Principal Components Analysis (PCA)
Figure: PCA axes Y₁ and Y₂ projected from the original axes X₁ and X₂.
•Projects high-dimensional data onto a lower-dimensional space.
•Creates new variables (principal components) from combinations of the original ones.
•Captures the maximum variance in fewer dimensions.
•PCA steps:
 • Normalize the data
 • Compute orthonormal vectors (principal components)
 • Sort components by variance (importance)
 • Keep the top components (reduces noise and size)
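A minimal PCA sketch with scikit-learn, following the steps above (normalize, then keep the top components); the 10-attribute random matrix is purely illustrative.

# PCA-based dimensionality reduction with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))             # 100 tuples, 10 attributes

X_std = StandardScaler().fit_transform(X)  # normalize first
pca = PCA(n_components=3)                  # keep the top 3 components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # variance captured by each kept component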
Wavelet Transform vs. PCA

Feature:      Wavelet Transform / PCA
Data Type:    Works well with ordered data / Handles sparse, unordered data
Output:       Transformed wavelet coefficients / Principal components
Strength:     Local detail, noise reduction / Variance-based dimensionality reduction
Compression:  Lossy (typically) / Lossy (based on dropping low-variance components)
Data Reduction: Attribute Subset Selection

•Many datasets contain irrelevant or redundant attributes


•Including them:
• Increases computation
• Confuses learning algorithms
• Produces poor-quality patterns

•Goal: Select a minimum attribute set that retains the original data's predictive power
•Helps improve accuracy, interpretability, and speed
Data Reduction: Attribute Subset Selection
Why Attribute Subset Selection Is Needed
Why Reduce Attributes?

•Examples of irrelevant attributes:


• Phone number (when predicting music preference)
•Benefits of reduction:
• Faster mining
• Better pattern quality
• Easier interpretation
•Exhaustive selection is exponential (2ⁿ subsets)
→ Heuristic methods are used instead
Data Reduction: Attribute Subset Selection
Greedy Methods Illustration
Figure: Greedy Attribute Subset Selection
•Forward Selection:
 • Start: {}
 • Add A1, then A4 → Final reduced set: {A1, A4, A6}
•Backward Elimination:
 • Start: {A1, A2, A3, A4, A5, A6}
 • Remove A2, A3, A5 → Final reduced set: {A1, A4, A6}
•Decision Tree Induction:
 • Attributes used in the tree's splits form the selected subset.
 • The final tree uses A4, A1, A6 → discard the others.
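As a sketch of the greedy forward-selection idea (not the exact procedure from any particular figure), the code below repeatedly adds the attribute that most improves cross-validated accuracy of a simple classifier, on a synthetic dataset.

# Greedy forward attribute selection sketch with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Score every candidate attribute when added to the current subset.
    scores = {a: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [a]], y, cv=5).mean()
              for a in remaining}
    a_best = max(scores, key=scores.get)
    if scores[a_best] <= best_score:        # no further improvement -> stop
        break
    selected.append(a_best)
    remaining.remove(a_best)
    best_score = scores[a_best]

print("selected attributes:", selected, "cv accuracy:", round(best_score, 3))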
Data Reduction: Attribute Subset Selection

Attribute Evaluation Measures


How to Choose Best/Worst Attributes?
•Based on statistical significance
•Assume attributes are independent

•Common metrics:
• Information Gain
• Gini Index
• Chi-square score
•Used in decision trees and subset selection
Data Reduction: Parametric Data Reduction

Aim: Approximate data using mathematical models


•Two common parametric techniques:
• Regression Models
• Log-Linear Models
•Benefit: Reduce data volume while preserving patterns
Data Reduction: Parametric Data Reduction

Regression Models for Data Reduction


•Linear Regression:
Models relationship between two numeric attributes
y = wx + b
•y: response variable
•x: predictor variable
•w, b: coefficients
•Multiple Regression:
Extends to multiple predictors
y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

Use: Best fit line using least squares error minimization


Data Reduction: Parametric Data Reduction

Log-Linear Models
•Used for discrete attributes
•Estimate probability of data points using combinations of
lower-dimensional spaces
•Supports:
• Dimensionality reduction
• Smoothing for sparse data
•Suitable for high-dimensional categorical data
Data Reduction: Histograms

•Histograms approximate data distributions using buckets (bins)


•Each bucket summarizes a portion of the data
•Useful for reducing the size of numeric datasets
•Applied on individual attributes or multiple attributes

Used in data preprocessing, analytics, and compression


Data Reduction: Histograms

Example: The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

Singleton Buckets
•Each bucket stores a single value–frequency pair.
•Helps retain exact details.
•Useful for capturing outliers and precise value counts.
•Example counts: number of 1's = 2, number of 5's = 5, number of 8's = 2, and so on.
Data Reduction: Histograms

Equal-Frequency Histograms (same AllElectronics price list as above)
•Each bucket holds roughly the same number of items.
•Useful when the value distribution is skewed.
•Better than equal-width binning for capturing balanced patterns.
•Example: 30 data points → 3 buckets → ~10 items per bucket.
•In the price list, the values from 1 to 10 alone account for 13 items: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10.
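For illustration, equal-width and roughly equal-frequency buckets for the price list can be computed with NumPy as sketched below (the equal-frequency cut uses quantiles, so bucket counts are only approximately equal when values repeat).

# Equal-width vs. (approximately) equal-frequency buckets for the price list.
import numpy as np

prices = np.array([1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,
                   18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,
                   25,25,25,25,25,28,28,30,30,30])

# Equal-width histogram: 3 buckets of equal price range.
counts, edges = np.histogram(prices, bins=3)
print(edges, counts)

# Equal-frequency buckets: cut at the 1/3 and 2/3 quantiles instead.
q_edges = np.quantile(prices, [0.0, 1/3, 2/3, 1.0])
q_counts, _ = np.histogram(prices, bins=q_edges)
print(q_edges, q_counts)   # roughly the same number of items per bucket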


Data Reduction: Histograms

Summary and Advantages


•Histograms reduce data size while retaining distribution
•Suitable for:
• Sparse or dense data
• Skewed or uniform data
•Multidimensional histograms:
• Capture attribute dependencies
• Work well up to ~5 attributes
Supports data summarization, compression, and faster mining
Data Reduction: Clustering for Data Reduction

•Clustering groups similar data objects into clusters


•Each cluster represents a set of similar data points
•Objects in different clusters are dissimilar
•Commonly based on distance measures (e.g., Euclidean distance)

Purpose in Data Reduction:


•Replace all records in a cluster with the cluster centroid
•Reduces data size, preserving patterns
•More effective when data naturally forms distinct groups
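A minimal sketch of clustering-based reduction with scikit-learn's KMeans: each record is replaced by its cluster centroid (the two-cluster 2-D data are synthetic).

# Clustering-based data reduction: summarize records by cluster centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
reduced = km.cluster_centers_                 # 200 records summarized by 2 centroids
approx = km.cluster_centers_[km.labels_]      # per-record centroid approximation

print(reduced)
print(np.mean(np.linalg.norm(X - approx, axis=1)))  # average approximation error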
Data Reduction: Sampling for Data Reduction

•Sampling selects a small representative subset of a large dataset for analysis


•Reduces cost and time without scanning the full dataset
•Especially useful for aggregate query estimation

Types of Sampling :
•SRSWOR (Simple Random Sampling Without Replacement):
Selects s unique records randomly from dataset D
•SRSWR (With Replacement):
A record can be selected more than once
•Cluster Sampling:
Selects entire groups (e.g., data pages) instead of individual
records
•Stratified Sampling:
Divides dataset into strata (e.g., age groups) and samples from
each
➤ Ensures representation of minority groups in skewed data
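The sampling types above can be sketched with pandas/NumPy as follows; the age_group strata and the 5% sampling fraction are illustrative choices.

# SRSWOR, SRSWR, and stratified sampling on a toy dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
D = pd.DataFrame({"value": rng.integers(0, 100, size=1000),
                  "age_group": rng.choice(["young", "adult", "senior"],
                                          size=1000, p=[0.6, 0.3, 0.1])})
s = 50

srswor = D.sample(n=s, replace=False, random_state=0)   # without replacement
srswr  = D.sample(n=s, replace=True,  random_state=0)   # with replacement

# Stratified: sample 5% from every age group so minority strata stay represented.
stratified = D.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.05, random_state=0))
print(len(srswor), len(srswr), len(stratified))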
Data Reduction: Data Cube Aggregation
•Purpose: Reduces data by summarizing measures across dimensions
•Aggregation = combining detailed data into higher-level summaries
•Example:
Instead of quarterly sales (Q1–Q4), aggregate into annual sales
➤ Reduces data size
➤ Keeps only necessary granularity for analysis
•Benefits:
• Smaller data size
• Faster query processing
• Retains essential analytical meaning
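A small pandas sketch of this roll-up: quarterly sales (hypothetical figures) aggregated into annual totals.

# Data cube style aggregation: quarterly sales rolled up to annual sales.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 420, 380, 600],
})

annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)   # one row per year instead of four -> smaller, coarser-grained data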
Data Transformation and Discretization

•Overview of transformation techniques


•Importance in preprocessing
•Link to data reduction and mining efficiency
Data Transformation Overview

•Conversion or consolidation of data into a suitable format for mining.


•Benefits:
• Improves mining efficiency.
• Simplifies patterns for better understanding.
•Strategies:
• Smoothing
• Attribute Construction
• Aggregation
• Normalization
• Discretization
• Concept Hierarchy Generation
Data Transformation Overview
In data transformation, the data are transformed or consolidated into forms
appropriate for mining. Strategies include:
1.Smoothing – Removes noise from the data. Techniques: binning, regression,
clustering.
2.Attribute Construction (Feature Construction) – Creates new attributes from
given attributes to assist the mining process.
3.Aggregation – Applies summary or aggregation (e.g., daily → monthly totals,
used in data cubes).
4.Normalization – Scales data into a smaller range (e.g., -1.0 to 1.0 or 0.0 to 1.0).
5.Discretization – Replaces raw numeric values with intervals (e.g., 0–10) or
conceptual labels (e.g., youth, adult, senior), forming a concept hierarchy.
6.Concept Hierarchy Generation for Nominal Data – Generalizes nominal
attributes (e.g., street → city → country), often defined automatically at schema
level.
Data Transformation by Normalization

Purpose:
•Scale attribute data into a common range.
•Useful for:
• Neural networks
• Distance-based methods (kNN, clustering)
Methods:
1.Min–Max Normalization
2.Z-Score Normalization
3.Decimal Scaling
Data Transformation by Normalization

Min–Max Normalization
•Maps a value v of attribute A to v′ in a new range [new_min, new_max]:
  v′ = ((v − min_A) / (max_A − min_A)) × (new_max − new_min) + new_min
•Example 3.4: income range [$12,000, $98,000] mapped to [0.0, 1.0]; v = $73,600 → v′ = (73,600 − 12,000) / (98,000 − 12,000) × 1.0 ≈ 0.716.

Data Transformation by Normalization

Z-Score Normalization
•Normalizes v using the mean and standard deviation of A:
  v′ = (v − mean_A) / std_A
•Example 3.5: with mean income $54,000 and standard deviation $16,000, v = $73,600 → v′ = (73,600 − 54,000) / 16,000 = 1.225.

Data Transformation by Normalization

Decimal Scaling Normalization
•Moves the decimal point of v based on the maximum absolute value of A:
  v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
•Example: if the values of A range from −986 to 917, then j = 3 and −986 → −0.986.
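The three formulas can be checked with a few lines of NumPy, reusing the income figures from the worked examples; the decimal-scaling exponent is computed from the maximum absolute value (a sketch, not robust to edge cases such as a maximum that is exactly a power of ten).

# Min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = 73600.0
min_a, max_a = 12000.0, 98000.0
mean_a, std_a = 54000.0, 16000.0

minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # -> ~0.716
zscore = (v - mean_a) / std_a                                # -> 1.225

a = np.array([-986.0, 345.0, 917.0])
j = int(np.ceil(np.log10(np.max(np.abs(a)))))                # smallest j with max|v/10^j| < 1
decimal_scaled = a / 10**j                                   # -> [-0.986, 0.345, 0.917]

print(round(minmax, 3), round(zscore, 3), decimal_scaled, j)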


Data Discretization

1.Discretization by Binning
2.Discretization by Histogram Analysis
3.Discretization by Cluster, Decision Tree, and Correlation Analyses
4.Concept Hierarchy Generation for Nominal Data
Data Discretization: Discretization by Binning

•Definition: Top-down splitting technique that divides attribute values into a fixed
number of bins.
•Types:
• Equal-width binning: Bins have equal value range.
• Equal-frequency binning: Each bin has the same number of tuples.
•Usage:
• Can replace bin values with bin mean or median (smoothing).
• Can be applied recursively to create concept hierarchies.
•Nature: Unsupervised, sensitive to bin count and outliers.
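A short pandas sketch of equal-width vs. equal-frequency discretization, plus replacement with conceptual labels; the age values and cut points are illustrative.

# Equal-width and equal-frequency discretization with pandas.
import pandas as pd

ages = pd.Series([18, 21, 22, 25, 27, 30, 35, 41, 52, 60, 63, 70])

equal_width = pd.cut(ages, bins=3)    # 3 bins of equal value range
equal_freq  = pd.qcut(ages, q=3)      # 3 bins with roughly equal tuple counts

# Replacing values with conceptual labels produces a simple concept hierarchy.
labels = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle_aged", "senior"])
print(pd.concat([ages, equal_width, equal_freq, labels], axis=1))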
Data Discretization: Discretization by Histogram Analysis

•Definition: Unsupervised method partitioning attribute values into


disjoint ranges (“bins”).
•Types:
• Equal-width histogram: Fixed value range per bin.
• Equal-frequency histogram: Equal number of values per bin.
•Features:
• Recursive application can build multi-level concept hierarchies.
• Minimum interval size can control recursion.
•Example: Price ranges of $10 (equal-width) or equal number of items
(equal-frequency).
Data Discretization: Discretization by Cluster, Decision Tree & Correlation Analyses

•Clustering: Groups attribute values into clusters based on similarity and


distribution.
• Top-down: Split clusters further.
• Bottom-up: Merge neighboring clusters.
•Decision Tree Analysis: Supervised; uses class labels and entropy to select
split-points that improve classification accuracy.
•Correlation Analysis (ChiMerge):
• Bottom-up, supervised method.
• Merge intervals with similar class distributions using χ² test.
• Stop merging when a predefined criterion is met.
Data Discretization: Concept Hierarchy Generation for Nominal Data
•Organizes nominal attribute values into higher abstraction levels for flexible mining.
•Methods:
 • Manual Ordering: Specify the hierarchy at the schema level (e.g., street < city < state < country).
 • Explicit Grouping: Manually group intermediate-level values.
 • Automatic Ordering: Sort attributes by their number of distinct values (fewer distinct values → higher level in the hierarchy).
 • Partial Specification with Semantic Linking: Automatically include related attributes based on semantic relationships.
•Example: A location hierarchy (Figure 3.13) derived from distinct value counts.
