Data Preprocessing
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction (e.g., sampling)
Data transformation and data discretization
Normalization
…
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., due to faulty instruments, human or computer error, or transmission errors
incomplete: lacking feature values, lacking certain features of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
technology limitation
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
First sort data and partition into (equal-frequency) bins
Then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
How to Handle Noisy Data (cont.)
Regression
smooth by fitting the data to regression functions
How to Handle Noisy Data (cont.)
Clustering
detect and remove outliers
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Feature Engineering
Feature Extraction / Construction aims to reduce the number
of features in a dataset by creating new features from the existing
ones (and then discarding the original features).
e.g., PCA (see the sketch below)
Data compression
Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distances between points, which are critical to clustering and
outlier analysis, become less meaningful (see the sketch below)
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
Duplicate much or all of the information contained in
one or more other features
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant features
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Clustering
Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
Can be very effective if data is clustered but not if data
is “smeared”
Hierarchical clustering is also possible, with clusters stored in
multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Sampling
Stratified sampling:
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
Sampling: With or without Replacement
Sampling: Cluster or Stratified Sampling
Chapter 3: Data Preprocessing
Data Quality
Data Cleaning
Data Reduction
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
min-max normalization: to [new_minA, new_maxA]
v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
Then $73,600 is mapped to (73600 − 12000) / (98000 − 12000) × (1.0 − 0) + 0 = 0.716
z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then (73600 − 54000) / 16000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1