Data Preprocessing
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction (e.g., sampling)
Data transformation and data discretization
Normalization
…
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., due to faulty instruments, human or computer error, or transmission errors
incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
First sort data and partition into (equal-frequency) bins
Then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
How to Handle Noisy Data (cont.)
Regression
smooth by fitting the data to regression functions
How to Handle Noisy Data (cont.)
Clustering
detect and remove outliers
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Feature Engineering
Feature Extraction / Construction aims to reduce the number
of features in a dataset by creating new features from the existing
ones (and then discarding the original features).
e.g., PCA (see the sketch below)
Data compression
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distances between points, which are critical to clustering and
outlier analysis, become less meaningful (see the sketch below)
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
Duplicate much or all of the information contained in
one or more other features
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant features
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Hierarchical clustering is also possible, with clusters stored in
multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Sampling
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
Sampling: With or without Replacement
Sampling: Cluster or Stratified Sampling
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
min-max normalization: to [new_minA, new_maxA]
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to (73600 − 12000) / (98000 − 12000) × (1.0 − 0) + 0 = 0.716
z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then (73600 − 54000) / 16000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1