Swetha Unit 1 Part 2 Data Preprocessing
— Chapter 2 —
Measures of data quality (a multidimensional view):
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Data Characteristics
Central tendency and dispersion of the data
Descriptive statistics are of great help in understanding the distribution of the data
Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values before averaging
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
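The following Python sketch computes these measures on a small made-up sample; the data values, weights, and 10% trim fraction are illustrative assumptions rather than figures from the slides.

```python
# Illustrative sketch of the measures above (data, weights, and trim fraction are made up).
data = [4, 8, 15, 16, 23, 42]
weights = [1, 1, 2, 2, 1, 1]

n = len(data)
mean = sum(data) / n                                   # arithmetic mean

weighted_mean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

def trimmed_mean(values, frac=0.1):
    """Chop off the lowest and highest `frac` of values, then average the rest."""
    k = int(len(values) * frac)
    kept = sorted(values)[k:len(values) - k] if k else sorted(values)
    return sum(kept) / len(kept)

variance = sum((x - mean) ** 2 for x in data) / n      # population variance
print(mean, weighted_mean, trimmed_mean(data, 0.1), variance ** 0.5)
```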
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
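As a concrete illustration of the first task, here is a minimal sketch that fills missing numeric values with the attribute mean; the records and the income attribute are hypothetical.

```python
# Sketch: fill missing values in one numeric attribute with the attribute mean.
# The records and the "income" attribute are hypothetical.
records = [
    {"income": 31000}, {"income": None}, {"income": 58000}, {"income": 45000},
]

known = [r["income"] for r in records if r["income"] is not None]
mean_income = sum(known) / len(known)

for r in records:
    if r["income"] is None:
        r["income"] = mean_income        # replace the missing value with the mean

print(records)
```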
Missing Data
Possible causes: technology limitations, incomplete data, inconsistent data
Clustering: detect and remove outliers
[Figure: regression smoothing, showing the data point (X1, Y1), its smoothed value Y1', and the fitted line y = x + 1]
Data integration: combines data from multiple sources into a coherent store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
χ² (chi-square) test: $\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Gender    Preferred reading
Male      Fiction
Female    Non-fiction
Male      Non-fiction
Female    Fiction
Male      Fiction
Male      Non-fiction
Female    Fiction
Male      Fiction
Male      Non-fiction
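Below is a minimal sketch of the χ² computation, using counts tallied from the small gender/preferred-reading table above; it is an illustrative calculation, not the textbook's worked example.

```python
# Sketch: chi-square statistic for the small gender / preferred-reading table above.
observed = {                      # observed counts from the 9 example records
    ("Male", "Fiction"): 3, ("Male", "Non-fiction"): 3,
    ("Female", "Fiction"): 2, ("Female", "Non-fiction"): 1,
}

total = sum(observed.values())
genders = {"Male", "Female"}
prefs = {"Fiction", "Non-fiction"}
row_sum = {g: sum(observed[(g, p)] for p in prefs) for g in genders}
col_sum = {p: sum(observed[(g, p)] for g in genders) for p in prefs}

chi2 = 0.0
for g in genders:
    for p in prefs:
        expected = row_sum[g] * col_sum[p] / total   # expected count under independence
        chi2 += (observed[(g, p)] - expected) ** 2 / expected

print(chi2)   # compare against the chi-square critical value for 1 degree of freedom
```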
Z-score normalization (μ: mean, σ: standard deviation): $v' = \frac{v - \mu}{\sigma}$
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
Normalization by decimal scaling: $v' = \frac{v}{10^{j}}$, where j is the smallest integer such that Max(|v'|) < 1
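A short sketch of z-score and decimal-scaling normalization on one attribute; the salary values are made up for illustration.

```python
# Sketch: z-score and decimal-scaling normalization of one numeric attribute.
values = [73600.0, 54000.0, 42000.0, 98000.0]    # illustrative salaries

n = len(values)
mu = sum(values) / n
sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5

z_scores = [(v - mu) / sigma for v in values]    # z-score normalization

j = 0                                            # decimal scaling: divide by 10^j
while max(abs(v) for v in values) / (10 ** j) >= 1:
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print(z_scores, decimal_scaled)
```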
Chapter 2: Data Preprocessing
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Data Compression
Attribute subset selection: select a minimum set of features; fewer attributes produce patterns that are easier to understand
Heuristic methods (due to exponential # of choices):
Step-wise forward selection
Decision-tree induction
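The sketch below shows a greedy step-wise forward selection loop; the scoring function score(subset) is a hypothetical placeholder (for example, validation accuracy of a model trained on that subset) and must be supplied by the caller.

```python
# Sketch: step-wise forward selection over a set of candidate attributes.
# `score(subset)` is a hypothetical evaluation function (e.g., model accuracy
# on a validation set); any real implementation must supply it.

def forward_selection(attributes, score, max_attrs=None):
    selected = []
    best_score = float("-inf")
    while attributes and (max_attrs is None or len(selected) < max_attrs):
        # Try adding each remaining attribute and keep the best single addition.
        candidate, cand_score = None, best_score
        for a in attributes:
            s = score(selected + [a])
            if s > cand_score:
                candidate, cand_score = a, s
        if candidate is None:          # no attribute improves the score: stop
            break
        selected.append(candidate)
        attributes = [a for a in attributes if a != candidate]
        best_score = cand_score
    return selected

# Example call (hypothetical): forward_selection(["age", "income", "zip"], my_score_fn)
```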
[Figure: data compression, with the Original Data on one side and a lossy, Approximated reconstruction on the other]
Each input data vector can be expressed as a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be reduced by eliminating the weak components (those with low variance)
[Figure: principal components Y1 and Y2 plotted against the original axes X1 and X2]
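A compact numpy sketch of the PCA steps described above; the toy data matrix and the choice of k = 1 components are assumptions for illustration.

```python
import numpy as np

# Sketch: principal components analysis on a small 2-D data set (toy values).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

X_centered = X - X.mean(axis=0)               # normalize input data (zero mean)
cov = np.cov(X_centered, rowvar=False)        # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)        # orthonormal eigenvectors

order = np.argsort(eigvals)[::-1]             # sort components by decreasing strength
components = eigvecs[:, order]

k = 1                                         # keep only the strongest component(s)
reduced = X_centered @ components[:, :k]      # project the data onto k components
print(reduced)
```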
Parametric methods: assume the data fits some model (e.g., regression or log-linear models), estimate the model parameters, and store only the parameters
Non-parametric methods: do not assume models
Linear regression: Y = wX + b
The two regression coefficients, w and b, specify the line and are estimated from the data (e.g., by the least-squares criterion)
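A minimal sketch of estimating w and b with the least-squares criterion; the (x, y) pairs are invented example data.

```python
# Sketch: least-squares estimates of w and b for Y = wX + b (toy data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# w = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  b = y_mean - w * x_mean
w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - w * x_mean

print(w, b)   # the two coefficients that specify the fitted line
```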
Log-Linear Models
Approximate discrete multidimensional probability
distributions.
Given a set of tuples in n dimensions (e.g., described by
n attributes), we can consider each tuple as a point in an
n-dimensional space.
Log-linear models can be used to estimate the probability
of each point in a multidimensional space for a set of
discretized attributes
Regression can be computationally intensive when applied to high-dimensional data
Log-linear models show good scalability for up to 10
dimensions
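As a minimal illustration, the sketch below uses the simplest log-linear model (the independence model), estimating each cell's probability from lower-dimensional marginals; the two discretized attributes and their tuples are made up.

```python
from collections import Counter

# Sketch: the simplest log-linear model (independence model) for two
# discretized attributes, estimating joint cell probabilities from marginals.
# Hypothetical tuples of (age_group, income_band).
tuples = [("young", "low"), ("young", "mid"), ("old", "mid"),
          ("old", "high"), ("young", "low"), ("old", "high")]

n = len(tuples)
age_counts = Counter(a for a, _ in tuples)
income_counts = Counter(b for _, b in tuples)

# Estimated probability of each cell as the product of its marginal probabilities.
estimated = {
    (a, b): (age_counts[a] / n) * (income_counts[b] / n)
    for a in age_counts for b in income_counts
}
print(estimated)
```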
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
V-optimal: with the least histogram variance (weighted sum of the
original values that each bucket represents)
MaxDiff: consider the difference between each pair of adjacent values; bucket boundaries are placed where the differences are largest
Highly effective in handling sparse and dense data
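A brief sketch of the first two partitioning rules (equal-width and equal-frequency); the values and bucket count are illustrative.

```python
# Sketch: equal-width and equal-frequency bucketing of a numeric attribute.
values = sorted([5, 7, 8, 12, 15, 18, 21, 22, 30, 35])
num_buckets = 3

# Equal-width: every bucket spans the same range of values.
lo, hi = values[0], values[-1]
width = (hi - lo) / num_buckets
equal_width = [[v for v in values if lo + i * width <= v < lo + (i + 1) * width]
               for i in range(num_buckets)]
equal_width[-1].append(hi)                 # the maximum falls on the last boundary

# Equal-frequency (equal-depth): every bucket holds roughly the same count.
depth = len(values) // num_buckets
equal_freq = [values[i * depth:(i + 1) * depth] for i in range(num_buckets - 1)]
equal_freq.append(values[(num_buckets - 1) * depth:])

print(equal_width)   # e.g. [[5, 7, 8, 12], [15, 18, 21, 22], [30, 35]]
print(equal_freq)    # e.g. [[5, 7, 8], [12, 15, 18], [21, 22, 30, 35]]
```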
[Figure: simple random sampling of the Raw Data, showing SRSWOR (simple random sample without replacement) and SRSWR (with replacement)]
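A small sketch of both simple random sampling schemes using Python's standard library; the raw data and sample size are placeholders.

```python
import random

# Sketch: simple random sampling of 4 tuples from the raw data (toy values).
raw_data = list(range(20))    # stand-in for the raw tuples
sample_size = 4

srswor = random.sample(raw_data, sample_size)                    # without replacement
srswr = [random.choice(raw_data) for _ in range(sample_size)]    # with replacement

print(srswor, srswr)
```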
Sampling: Cluster or Stratified Sampling
Binning
A top-down splitting technique based on a specified number of bins; it is an unsupervised discretization technique
The resulting bins can be used for smoothing by bin means or smoothing by bin medians, as sketched below
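A minimal sketch of equal-depth binning followed by smoothing by bin means; the price values are illustrative.

```python
# Sketch: equal-depth (equal-frequency) binning, then smoothing by bin means.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
num_bins = 3
depth = len(prices) // num_bins

bins = [prices[i * depth:(i + 1) * depth] for i in range(num_bins)]

# Smooth each value by replacing it with the mean of its bin.
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
print(bins)      # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0, 9.0], [22.8, ...], [29.2, ...]]
```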
Histogram Analysis
Unsupervised, top-down splitting
Histograms partition the values for an attribute, A, into
disjoint ranges called buckets.
Equal-width histogram
Equal-frequency histogram
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is
$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$
Entropy is calculated based on class distribution of the samples in the
set. Given m classes, the entropy of S1 is
$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
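A compact sketch that evaluates candidate boundaries T and keeps the one minimizing I(S, T); the labeled samples are invented for illustration.

```python
import math

# Sketch: pick the split boundary T that minimizes the weighted entropy I(S, T).
# Each sample is (attribute_value, class_label); the values are made up.
samples = [(1, "no"), (3, "no"), (4, "yes"), (6, "yes"), (7, "yes"), (9, "no")]

def entropy(subset):
    n = len(subset)
    counts = {}
    for _, label in subset:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

best_t, best_info = None, float("inf")
values = sorted({v for v, _ in samples})
for lo, hi in zip(values, values[1:]):          # candidate boundaries: midpoints
    t = (lo + hi) / 2
    s1 = [s for s in samples if s[0] <= t]
    s2 = [s for s in samples if s[0] > t]
    info = (len(s1) / len(samples)) * entropy(s1) + (len(s2) / len(samples)) * entropy(s2)
    if info < best_info:
        best_t, best_info = t, info

print(best_t, best_info)    # the chosen boundary and its weighted entropy
```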