9-2 Data analysis and pre-processing part 2.pdf
Dong-Kyu Chae
❑ Data Visualization
❑ Summary
Graphic Displays of Basic Statistical Descriptions
❑ Boxplot: graphic display of five-number summary
[Figure: boxplot marking Q1, the median, and Q3]
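As a minimal sketch of the five-number summary that a boxplot draws, the following Python computes the minimum, Q1, median, Q3, and maximum. The `prices` sample is made up for illustration, and the quartiles use the standard library's "inclusive" convention, which is only one of several in common use.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # The "inclusive" method interpolates between data points and treats
    # the sample minimum and maximum as the 0th and 100th percentiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical unit prices; 120 sits far above the rest of the sample.
prices = [40, 43, 47, 52, 56, 60, 65, 70, 74, 78, 120]
summary = five_number_summary(prices)
print(summary)
```

The box of a boxplot spans Q1 to Q3, with the median drawn as a line inside it.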
Quantile-Quantile (Q-Q) Plot
❑ Displays the quantiles of one univariate distribution of a dataset
against the corresponding quantiles of another dataset
❑ Example: Shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile.
❑ Unit prices of items sold at Branch 1 tend to be cheaper than those at
Branch 2.
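[Figure: Q-Q plot of unit prices at Branch 1 vs. Branch 2, with the reference line y = x and the Q1, median, and Q3 points marked]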
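The points of a Q-Q plot can be sketched as pairs of matching quantiles from the two datasets. In the Python below, `branch1` and `branch2` are invented price samples, and `qq_pairs` is a hypothetical helper name; plotting is left out so the example stays dependency-free.

```python
import statistics

def qq_pairs(sample_a, sample_b, n=4):
    """Pair the n-quantile cut points of two samples (n=4 gives quartiles)."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

# Hypothetical unit prices; the two samples need not have the same size.
branch1 = [40, 45, 50, 60, 65, 70, 80, 90, 100]
branch2 = [42, 50, 58, 66, 72, 80, 88, 98, 110, 120]

pairs = qq_pairs(branch1, branch2)
for q1_value, q2_value in pairs:
    # A pair with q1_value < q2_value lies off the y = x line,
    # on the side where Branch 2 is more expensive.
    print(q1_value, q2_value)
```

If every pair has a smaller Branch 1 value, all points fall on one side of the line y = x, which is the pattern described in the example above.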
Scatter Plot
❑ Each pair of feature values is treated as the coordinates and
plotted as a point in the 2D space
❑ It shows correlation between two features
❑ Also provides an overview to see clusters of points, outliers, etc.
Scatter Plot
❑ Examples of positively and negatively correlated data
❑ In the example below, the left half is positively correlated, and
the right half is negatively correlated
Scatter Plot
❑ Uncorrelated data
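The three cases (positive, negative, uncorrelated) can also be checked numerically. The slides do not name a correlation measure, so the sketch below assumes Pearson's correlation coefficient; the data values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # y rises with x
r_neg = pearson(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson(x, [2, 1, 3, 1, 2])   # no linear trend
print(r_pos, r_neg, r_none)
```

A coefficient near +1 or -1 corresponds to points lying close to a rising or falling line in the scatter plot; a value near 0 corresponds to no linear trend.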
Contents
❑ Data Objects and feature Types
❑ Data Visualization
❑ Summary
Similarity and Dissimilarity
❑ Similarity
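$$d(i, j) = \frac{p - m}{p}, \qquad sim(i, j) = \frac{m}{p}$$
where $m$ is the number of features on which objects $i$ and $j$ match, and $p$ is the total number of features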
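A minimal sketch of this matching-based measure, assuming $p$ counts all nominal features and $m$ counts the features on which the two objects agree. The function name and the sample objects are hypothetical.

```python
def simple_matching(obj_i, obj_j):
    """Return (dissimilarity, similarity) for two tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of features")
    p = len(obj_i)                                       # total features
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)   # matching features
    return (p - m) / p, m / p

# Hypothetical objects described by (color, size, shape).
item_a = ("red", "small", "round")
item_b = ("red", "large", "round")
d, sim = simple_matching(item_a, item_b)
print(d, sim)  # two of three features match
```

By construction the two values always sum to 1.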
❑ Method 2: Use multiple binary features to express one
nominal feature
❑ Creating a new binary feature for each of the M nominal states
❑ E.g., a nominal feature COLOR: [red, yellow, blue, green] can be
expressed by four binary features, each taking the value 0 or 1:
red => {0, 1}; yellow => {0, 1}; blue => {0, 1}; green => {0, 1}
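Method 2 can be sketched as follows: each nominal state gets its own binary feature, and exactly one of them is set to 1 for a given object. The helper name `to_binary_features` is an assumption made for this example.

```python
def to_binary_features(value, states):
    """Encode one nominal value as len(states) binary features."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    # One binary feature per state; only the matching state is 1.
    return [1 if value == state else 0 for state in states]

COLOR_STATES = ["red", "yellow", "blue", "green"]
blue_encoded = to_binary_features("blue", COLOR_STATES)
print(blue_encoded)  # [0, 0, 1, 0]
```

After this encoding, proximity measures for binary features can be applied to the original nominal feature.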
Proximity Measure for Binary Features
❑ A contingency table for two different objects:
[Table: contingency table of Obj i against Obj j]
Distance Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Distance on Numeric Data: Minkowski Distance
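❑ Minkowski distance: a popular distance measure
$$d(i, j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ip} - x_{jp}|^h \right)^{1/h}$$
where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two $p$-dimensional data objects, and $h$ is the order (the distance
so defined is also called the L-h norm)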
❑ $h = 1$, Manhattan (L1): $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
❑ $h = 2$, Euclidean (L2): $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
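The Euclidean (L2) and Supremum tables above can be reproduced with a short Minkowski distance function. The slides' data points are not shown in this extract, so the four 2-D points below are an assumption, chosen because they yield the same table entries.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

# Assumed points (not given in the slides) that match the tables.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def lower_triangle(h):
    """Lower-triangular distance matrix, rounded to two decimals."""
    names = list(points)
    return [
        [round(minkowski(points[names[r]], points[names[c]], h), 2)
         for c in range(r + 1)]
        for r in range(len(names))
    ]

euclidean = lower_triangle(2)
supremum = lower_triangle(math.inf)
for row in euclidean:
    print(row)
```

Calling `lower_triangle(1)` gives the Manhattan (L1) matrix for the same assumed points.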
Distance on Numeric Data: Minkowski Distance
❑ Minkowski distance : A popular distance measure
❑ Properties
❑ $d(i, j) > 0$ if $i \neq j$, and $d(i, i) = 0$ (Positive definiteness)
❑ $d(i, j) = d(j, i)$ (Symmetry)
❑ $d(i, j) \leq d(i, k) + d(k, j)$ (Triangle inequality)
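These three properties can be checked by brute force on a small set of points. The sketch below is an illustration under assumptions: the sample points and the helper name `is_metric_on` are hypothetical, and passing the check on one sample is not a proof. Squared Euclidean distance is included as a contrast because it fails the triangle inequality on these points.

```python
import itertools
import math

def is_metric_on(dist, pts, tol=1e-12):
    """Check the three metric properties of dist on every combination of pts."""
    for i, j in itertools.product(pts, repeat=2):
        if i == j and dist(i, j) != 0:
            return False                      # d(i, i) must be 0
        if i != j and dist(i, j) <= 0:
            return False                      # d(i, j) > 0 for i != j
        if abs(dist(i, j) - dist(j, i)) > tol:
            return False                      # symmetry
    for i, j, k in itertools.product(pts, repeat=3):
        if dist(i, j) > dist(i, k) + dist(k, j) + tol:
            return False                      # triangle inequality
    return True

def squared_euclidean(a, b):
    return math.dist(a, b) ** 2

sample = [(1, 2), (3, 5), (2, 0), (4, 5)]  # hypothetical points
euclidean_ok = is_metric_on(math.dist, sample)
squared_ok = is_metric_on(squared_euclidean, sample)
print(euclidean_ok, squared_ok)
```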
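$d_1 \cdot d_2 = 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25$
$\|d_1\| = (5 \times 5 + 0 \times 0 + 3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 0 \times 0)^{0.5} = (42)^{0.5} = 6.481$
$\|d_2\| = (3 \times 3 + 0 \times 0 + 2 \times 2 + 0 \times 0 + 1 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 0 + 1 \times 1)^{0.5} = (17)^{0.5} = 4.12$
$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} = 0.94$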
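The arithmetic above can be reproduced in a few lines of Python. The vectors `d1` and `d2` are read off the products in the worked example; the function name is an assumption for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

cos_d1_d2 = cosine_similarity(d1, d2)
print(round(cos_d1_d2, 2))  # 0.94
```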
Distance for Ordinal Features
❑ Order is important, e.g., grade, size, etc…
❑ Such values can be expressed by rank (integers)
❑ Replace the value $x_{if}$ of the i-th object in the f-th feature by its rank $r_{if} \in \{1, \dots, M_f\}$
❑ Finally, the rank values are mapped onto [0, 1] by:
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
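The rank mapping can be sketched as below, assuming the ordered states of a feature are given from lowest to highest. The grade labels and the function name are made up for illustration.

```python
def normalize_ordinal(value, ordered_states):
    """Map an ordinal value to [0, 1] using z = (r - 1) / (M - 1)."""
    m_f = len(ordered_states)             # number of ordered states, M_f
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    rank = ordered_states.index(value) + 1  # ranks start at 1
    return (rank - 1) / (m_f - 1)

GRADES = ["fair", "good", "excellent"]    # hypothetical ordinal feature
z_values = [normalize_ordinal(g, GRADES) for g in GRADES]
print(z_values)  # [0.0, 0.5, 1.0]
```

Once mapped to [0, 1], the values can be compared with any numeric distance, such as the Minkowski distances described earlier.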
❑ $\delta_{ij}^{(f)}$: importance weight on each feature type $f$
Summary
❑ Data feature types: nominal, binary, ordinal, numerical (interval-scaled, ratio-scaled)
❑ Gain insight into the data by:
❑ Basic statistical data description: central tendency, dispersion
❑ Five-number summary, visualized by a boxplot
❑ Visualizations
❑ Scatter plot, QQ plot, histogram, etc…