Data Mining: Concepts and Techniques

— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing

• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Reduction
• Data Transformation and Data Discretization

Data Quality: Why Preprocess the Data?

• Measures for data quality: a multidimensional view
  - Accuracy: is the data accurate, or noisy (containing errors, or values that deviate from the expected)?
  - Completeness: is anything not recorded (missing attribute values, missing attributes of interest, …)?
  - Consistency: e.g., are there discrepancies in the department codes used to categorize items?
  - Timeliness: is the data updated in a timely fashion?
  - Believability: how much do users trust the data?
  - Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data reduction
  - Dimensionality reduction
  - Numerosity reduction (e.g., sampling)
• Data transformation and data discretization
  - Normalization
  - …

Major Tasks in Data Preprocessing

[Figure: the forms of data preprocessing: cleaning, integration, reduction, and transformation]

Data Cleaning

• Data in the real world is dirty: there is a lot of potentially incorrect data, e.g., from faulty instruments, human or computer error, or transmission errors
  - Incomplete: lacking feature values, lacking certain features of interest, or containing only aggregate data
    - e.g., Occupation=“ ” (missing data)
  - Noisy: containing noise, errors, or outliers
    - e.g., Salary=“−10” (an error)
  - Inconsistent: containing discrepancies in codes or names, e.g.,
    - Age=“42”, Birthday=“03/07/2010”
    - Was rating “1, 2, 3”, now rating “A, B, C”
    - Discrepancies between duplicate records
  - Intentional (e.g., disguised missing data)
    - Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data

• Data is not always available
  - E.g., many tuples have no recorded value for several features, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - deletion because of inconsistency with other recorded data
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (in classification); not effective when the percentage of missing values per feature varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically (see the sketch below) with
  - a global constant: e.g., “unknown” (a new class?!)
  - the feature mean
  - the feature mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
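
A minimal pandas sketch of the automatic fill strategies; the income and class columns are hypothetical toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [50000.0, None, 72000.0, None, 61000.0],  # hypothetical feature
    "class":  ["A", "A", "B", "B", "A"],                 # hypothetical class label
})

# Global constant: flag missing values with a sentinel such as -1 / "unknown"
df["income_const"] = df["income"].fillna(-1)

# Feature mean over all samples
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: feature mean within each class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
```
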
Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect feature values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistent naming conventions
• Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?

• Binning
  - First sort the data and partition it into (equal-frequency) bins
  - Then smooth by bin means, bin medians, bin boundaries, etc., as in the sketch below
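
A small sketch of equal-frequency binning with smoothing by bin means, using pandas on a sorted list of prices:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

# Equal-frequency (equal-depth) partitioning into 3 bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace every value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```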

How to Handle Noisy Data (cont.)

• Regression
  - Smooth by fitting the data to a regression function, e.g., the linear fit sketched below
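
A minimal sketch of regression-based smoothing: fit a straight line with NumPy and use the fitted values in place of the noisy ones (the synthetic data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)  # noisy linear data

# Fit y = slope * x + intercept and use the fitted line as the smoothed values
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept
```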

How to Handle Noisy Data (cont.)

• Clustering
  - Detect and remove outliers, as in the sketch below
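
A sketch of clustering-based outlier removal with scikit-learn: cluster the data, then flag points that lie unusually far from their own centroid. The 3-sigma cutoff is an illustrative choice, not a prescribed rule:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(8.0, 1.0, (50, 2)),
               [[20.0, 20.0]]])                      # one far-away point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to the centroid of its assigned cluster
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points whose distance is unusually large, then drop them
outliers = dist > dist.mean() + 3.0 * dist.std()
X_clean = X[~outliers]
```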

Data Cleaning as a Process

• Data discrepancy detection
  - Use metadata (e.g., domain, range, dependencies, distribution)
  - Check field overloading
  - Check uniqueness rules, consecutive rules, and null rules
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers); a small sketch follows
• Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  - Iterative and interactive (e.g., Potter’s Wheel)
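
A minimal pandas sketch of simple audit checks (uniqueness rule, null rule, duplicate records); the file name and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# Uniqueness rule: a key column must not repeat
violating_keys = df[df["customer_id"].duplicated(keep=False)]

# Null rule: how many missing values does each column contain?
null_report = df.isna().sum()

# Duplicate records: fully identical rows
duplicate_rows = df[df.duplicated(keep=False)]

print(null_report)
```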

Feature Engineering

• Feature extraction / construction: aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features)
  - e.g., PCA
• Feature selection: instead of creating new features, focuses on choosing the subset of the existing features that contributes most significantly to the problem
  - This eliminates irrelevant or redundant features while preserving the important ones
  - e.g., feature subset selection
• Feature creation / generation: create new features that capture the important information in a data set more effectively than the original ones

Data Reduction Strategies

• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Why data reduction? A database or data warehouse may store terabytes of data; complex analysis may take a very long time to run on the complete data set
• Data reduction strategies
  - Dimensionality reduction, e.g., removing unimportant features
    - Principal Components Analysis (PCA)
    - Feature subset selection, feature creation
  - Numerosity reduction (which some simply call data reduction)
    - Regression and log-linear models
    - Histograms, clustering, sampling
    - Data cube aggregation
  - Data compression

Dimensionality Reduction

• Curse of dimensionality
  - As dimensionality increases, the data becomes increasingly sparse
  - Density and the distances between points, which are critical to clustering and outlier analysis, become less meaningful
  - The number of possible subspace combinations grows exponentially
• Dimensionality reduction
  - Avoids the curse of dimensionality
  - Helps eliminate irrelevant features and reduce noise
  - Reduces the time and space required in data mining
  - Allows easier visualization
• Dimensionality reduction techniques
  - Principal Component Analysis (PCA; see the sketch below)
  - Supervised and nonlinear techniques (e.g., feature selection)
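
A minimal PCA sketch with scikit-learn, assuming standardized numeric data; keeping enough components to reach a 95% explained-variance target is one common choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))              # 200 samples, 10 features (toy data)

# Standardize first: PCA is sensitive to feature scale
X_std = StandardScaler().fit_transform(X)

# Keep the smallest number of components explaining 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```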

Feature Subset Selection

• Another way to reduce the dimensionality of the data
• Redundant features
  - Duplicate much or all of the information contained in one or more other features
  - E.g., the purchase price of a product and the amount of sales tax paid
• Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
• A simple correlation-based filter for redundant features is sketched below
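
One simple heuristic, sketched here with pandas: drop one feature from every pair whose absolute correlation exceeds a threshold. The 0.95 cutoff and the columns are illustrative; the tax column mirrors the price/sales-tax example above:

```python
import numpy as np
import pandas as pd

def drop_redundant(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair with |correlation| above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({"price": [10, 20, 30, 40],
                   "tax":   [0.8, 1.6, 2.4, 3.2],  # proportional to price
                   "age":   [23, 35, 41, 29]})
print(drop_redundant(df).columns.tolist())          # ['price', 'age']
```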

Clustering

• Partition the data set into clusters based on similarity, and store only a representation of each cluster (e.g., its centroid and diameter), as sketched below
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Clusterings can be hierarchical and can be stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms
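
A small sketch of clustering as numerosity reduction with scikit-learn: replace the full data set by centroids plus per-cluster counts. The choice of k = 50 clusters is arbitrary and illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 3))                    # toy data

km = KMeans(n_clusters=50, n_init=10, random_state=1).fit(X)

# Store 50 centroids and their sizes instead of 10,000 raw points
centroids = km.cluster_centers_                     # shape (50, 3)
counts = np.bincount(km.labels_, minlength=50)      # points per cluster
```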

Sampling

• Sampling: obtaining a small sample s to represent the whole data set N
• Allows a mining algorithm to run in time that is potentially sub-linear in the size of the data
• Key principle: choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
  - Adaptive sampling methods, e.g., stratified sampling, address this
• Note: sampling may not reduce database I/Os (data is read a page at a time)

Types of Sampling

• Simple random sampling
  - There is an equal probability of selecting any particular item
• Sampling without replacement
  - Once an object is selected, it is removed from the population
• Sampling with replacement
  - A selected object is not removed from the population
• Stratified sampling
  - Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  - Used in conjunction with skewed data
• All three variants are sketched below
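
A pandas sketch of the three variants on a skewed toy table; the 90/10 stratum split and the sample sizes are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000),
                   "stratum": ["A"] * 900 + ["B"] * 100})  # skewed 90% / 10%

# Simple random sampling without replacement (SRSWOR)
srswor = df.sample(n=100, replace=False, random_state=0)

# Simple random sampling with replacement (SRSWR)
srswr = df.sample(n=100, replace=True, random_state=0)

# Stratified sampling: draw ~10% from each stratum, preserving proportions
stratified = df.groupby("stratum").sample(frac=0.10, random_state=0)
```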

Sampling: With or without Replacement

[Figure: samples drawn from the raw data with and without replacement]

Sampling: Cluster or Stratified Sampling

[Figure: raw data vs. a cluster/stratified sample]

Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
  - Smoothing: remove noise from the data
  - Attribute/feature construction
    - New attributes constructed from the given ones
  - Aggregation: summarization, data cube construction
  - Normalization: scale values to fall within a smaller, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling
  - Discretization: concept hierarchy climbing (see the sketch below)
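
A one-line discretization sketch with pandas: map a numeric age onto one level of a concept hierarchy. The cut points and labels are illustrative:

```python
import pandas as pd

age = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

# Map raw ages onto one level of a concept hierarchy
age_band = pd.cut(age, bins=[0, 21, 45, 100],
                  labels=["youth", "middle_aged", "senior"])
```
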
Normalization

• Min-max normalization to [new_min_A, new_max_A]:

  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

  - Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

  v' = (v − μ_A) / σ_A

  - Ex.: Let μ = 54,000 and σ = 16,000. Then 73,600 maps to (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
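
A NumPy sketch of all three normalizations on the income example above:

```python
import numpy as np

income = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]: 73600 -> 0.716
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's parameters: 73600 -> 1.225
mu, sigma = 54000.0, 16000.0
zscore = (income - mu) / sigma

# Decimal scaling: smallest j with max(|v'|) < 1, here j = 5, so 73600 -> 0.736
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal = income / 10 ** j
```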