9-2 Data analysis and pre-processing part 2.pdf

Chapter 2 discusses various aspects of data analysis, including data objects and feature types, basic statistical descriptions, and data visualization techniques such as boxplots, histograms, and scatter plots. It also covers measuring data similarity and dissimilarity using methods like Minkowski distance and cosine similarity. The chapter emphasizes the importance of understanding different feature types and visualizing data to gain insights.

Chapter 2: Getting to Know Your Data

Dong-Kyu Chae

PI of the Data Intelligence Lab @HYU


Department of Computer Science & Data Science
Hanyang University
Contents
❑ Data Objects and Feature Types

❑ Basic Statistical Descriptions of Data

❑ Data Visualization

❑ Measuring Data Similarity and Dissimilarity

❑ Summary
Graphic Displays of Basic Statistical Descriptions
❑ Boxplot: graphic display of five-number summary

❑ Histogram: x-axis: values/ranges, y-axis: frequencies

❑ Quantile plot: each value xi is paired with fi, indicating that
approximately (100 × fi)% of the data are ≤ xi

❑ Quantile-quantile (q-q) plot: the quantiles of one univariate
distribution against the corresponding quantiles of another

❑ Scatter plot: each pair of feature values is plotted as a
point in 2D space
Histogram Analysis
❑ Histogram: graph display of frequencies shown as bars
❑ It shows what proportion of cases fall into each of several categories
❑ The categories are usually specified as non-overlapping intervals of some variable
❑ The categories (bars) must be adjacent
[Figure: histogram of values (10000–90000) vs. frequency (0–40)]
Quantile Plot
❑ Sort data in increasing order, and display all the data points
❑ Allows users to assess both the overall behavior and unusual occurrences
❑ fi indicates that approximately (100 × fi)% of the data are below or equal to the value xi
[Figure: quantile plot with Q1, Median, and Q3 marked]
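The (fi, xi) pairing behind a quantile plot can be sketched without any plotting library; the sample values below are assumed purely for illustration:

```python
# Minimal quantile-plot computation (no plotting); data values are made up.
data = sorted([12, 7, 3, 9, 15, 5, 11, 8])
n = len(data)

# f_i = (i - 0.5) / n, so roughly (100 * f_i)% of the data are <= x_i
f = [(i - 0.5) / n for i in range(1, n + 1)]

pairs = list(zip(f, data))  # the (f_i, x_i) points of the quantile plot
print(pairs[0])             # (0.0625, 3): the smallest value gets the smallest f_i
```

Plotting `f` on the x-axis against `data` on the y-axis gives the quantile plot described above.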
Quantile-Quantile (Q-Q) Plot
❑ Displays the quantiles of one univariate distribution of a dataset against the corresponding quantiles of another dataset
❑ Example: shows the unit price of items sold at Branch 1 vs. Branch 2 at each quantile
❑ Unit prices of items sold at Branch 1 tend to be cheaper than those at Branch 2
[Figure: Q-Q plot of Branch 1 vs. Branch 2 unit prices, with Q1, Median, and Q3 marked against the y = x reference line]
Scatter Plot
❑ Each pair of feature values is treated as coordinates and plotted as a point in 2D space
❑ It shows the correlation between two features
❑ It also provides an overview revealing clusters of points, outliers, etc.
Scatter Plot
❑ Examples of positively and negatively correlated data

❑ In the example below, the left half is positively correlated, and the right half is negatively correlated
Scatter Plot
❑ Uncorrelated data
Contents
❑ Data Objects and Feature Types

❑ Basic Statistical Descriptions of Data

❑ Data Visualization

❑ Measuring Data Similarity and Dissimilarity

❑ Summary
Similarity and Dissimilarity
❑ Similarity

❑ Numerical measure of how much alike two data objects are

❑ This value is higher when objects are more alike

❑ Often falls in the range [0,1]

❑ Dissimilarity (e.g., distance)


❑ Numerical measure of how different two data objects are

❑ Lower when objects are more alike

❑ Minimum dissimilarity is often 0

❑ Proximity refers to either similarity or dissimilarity


Matrix for Data
❑ Data matrix (n-by-p)
❑ n data points with p dimensions (features)
❑ Two modes

    | x11  ...  x1f  ...  x1p |
    | ...  ...  ...  ...  ... |
    | xi1  ...  xif  ...  xip |
    | ...  ...  ...  ...  ... |
    | xn1  ...  xnf  ...  xnp |

❑ Dissimilarity (distance) matrix (n-by-n)
❑ n data points, but registers only the distances
❑ A triangular matrix
❑ Single mode

    |   0                          |
    | d(2,1)    0                  |
    | d(3,1)  d(3,2)    0          |
    |   :       :       :          |
    | d(n,1)  d(n,2)   ...  ...  0 |
Proximity Measure for Nominal Features
❑ Nominal features can take 2 or more states
❑ E.g., COLOR: [red, yellow, blue, green, etc…]

❑ Method 1: Simple matching


❑ m: # of matches, p: total # of nominal features

    d(i, j) = (p − m) / p        sim(i, j) = m / p
❑ Method 2: Use multiple binary features to express one
nominal feature
❑ Creating a new binary feature for each of the M nominal states
❑ E.g., a nominal feature COLOR: [red, yellow, blue, green] can be
expressed by four binary features:
red => [1, 0, 0, 0]; yellow => [0, 1, 0, 0]; blue => [0, 0, 1, 0]; green => [0, 0, 0, 1]
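Both methods can be sketched in a few lines; the two objects and their feature names below are assumed for illustration:

```python
# Sketch of the two proximity methods for nominal features.
# The objects and their features are made-up examples.
p = 3  # total number of nominal features

obj_i = {"COLOR": "red", "SHAPE": "circle", "SIZE": "small"}
obj_j = {"COLOR": "red", "SHAPE": "square", "SIZE": "small"}

# Method 1: simple matching
m = sum(obj_i[f] == obj_j[f] for f in obj_i)  # number of matching features (2 here)
d = (p - m) / p                               # dissimilarity: (3 - 2) / 3
sim = m / p                                   # similarity: 2 / 3

# Method 2: one binary feature per state (one-hot encoding)
states = ["red", "yellow", "blue", "green"]
def one_hot(color):
    return [1 if s == color else 0 for s in states]

print(one_hot("red"))  # [1, 0, 0, 0]
```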
Proximity Measure for Binary Features
❑ A contingency table for two different objects (q, r, s, t count the features with each combination of values):

                    Obj j
                   1     0
    Obj i    1     q     r
             0     s     t

❑ Proximity measure for symmetric binary features:

    sim(i, j) = (q + t) / (q + r + s + t)

❑ Used when the two outcomes (0, 1) are equally important (similar to Method 1)

❑ Proximity measure for asymmetric binary features:

    sim(i, j) = q / (q + r + s)

❑ Used when one outcome is much more important than the other


Dissimilarity between Asymmetric Binary Features
❑ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
❑ Gender is a symmetric feature (ignored in this case)
❑ Assume that the remaining features are asymmetric
❑ Let the values Y and P be 1, and the value N be 0

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
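The three dissimilarities above can be reproduced with a small sketch, mapping Y and P to 1 and N to 0 as the slide does:

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s);
# t (both 0) is ignored, since matching 0s carry no information here.
def asym_binary_dissim(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))  # both 1
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))  # 1 in a, 0 in b
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))  # 0 in a, 1 in b
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender excluded)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```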
Proximity Measure for Numeric Features
❑ First of all, we need to standardize / normalize
numeric data (will be introduced in the next chapter)
❑ To match the scale of each feature

❑ Z-score:

    z = (x − μ) / σ
▪ x: raw score to be standardized, μ: mean of the population, σ:
standard deviation
▪ Meaning: the distance between the raw score and the
population mean in units of the standard deviation
▪ “−” when the raw score is below the mean
▪ “+” when the raw score is above the mean
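A minimal z-score sketch, using an assumed sample and the population standard deviation, as in the formula above:

```python
# Z-score standardization of one numeric feature (sample values are made up).
import statistics

x = [2.0, 4.0, 6.0, 8.0]
mu = statistics.mean(x)       # population mean of the feature
sigma = statististics_err = statistics.pstdev(x)  # population standard deviation

z = [(v - mu) / sigma for v in x]
# Each z-value is the distance from the mean in units of sigma:
# negative below the mean, positive above it.
```

After standardization the feature has mean 0 and standard deviation 1, so features on different scales become comparable.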
Matrix for Numeric Data
Data Matrix

    point   attribute1   attribute2
    x1          1            2
    x2          3            5
    x3          2            0
    x4          4            5

Distance Matrix
(with Euclidean Distance)

          x1      x2      x3      x4
    x1    0
    x2    3.61    0
    x3    2.24    5.1     0
    x4    4.24    1       5.39    0
Distance on Numeric Data: Minkowski Distance
❑ Minkowski distance: A popular distance measure

    d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called the L-h norm)

If h = 2, it is equal to the Euclidean distance (L-2 norm)
Special Cases of Minkowski Distance
❑ h = 1: Manhattan (city block, L1 norm) distance

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

❑ h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)

❑ h → ∞: Supremum (Lmax norm, L∞ norm) distance

    d(i, j) = max over f of |xif − xjf|

❑ This is the maximum difference between any component
(feature) of the vectors
Example: Minkowski Distance
    point   attribute1   attribute2
    x1          1            2
    x2          3            5
    x3          2            0
    x4          4            5

Manhattan (L1)

    L1    x1    x2    x3    x4
    x1    0
    x2    5     0
    x3    3     6     0
    x4    6     1     7     0

Euclidean (L2)

    L2    x1    x2    x3    x4
    x1    0
    x2    3.61  0
    x3    2.24  5.1   0
    x4    4.24  1     5.39  0

Supremum (L∞)

    L∞    x1    x2    x3    x4
    x1    0
    x2    3     0
    x3    2     5     0
    x4    3     1     5     0
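The table entries above can be reproduced with a short sketch of the three special cases:

```python
# Minkowski distance and its special cases, applied to the four 2D points above.
def minkowski(a, b, h):
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):  # the h -> infinity limit (L-max norm)
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

print(minkowski(x1, x2, 1))            # Manhattan: |1-3| + |2-5| = 5.0
print(round(minkowski(x1, x2, 2), 2))  # Euclidean: sqrt(4 + 9) = 3.61
print(supremum(x1, x2))                # Supremum: max(2, 3) = 3
```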
Distance on Numeric Data: Minkowski Distance
❑ Minkowski distance: A popular distance measure

❑ Properties
❑ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
❑ d(i, j) = d(j, i) (Symmetry)
❑ d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)

❑ A distance that satisfies these properties is a metric


Cosine Similarity
❑ A document can be represented by a vector of thousands of words, each entry
recording the frequency of a particular word in the document

❑ Documents are thus represented as n-dimensional vectors

❑ Cosine measure: If A and B are two vectors, then:

    cos(A, B) = (A • B) / (||A|| ||B||)

where • indicates the dot product and ||A|| is the length
(norm) of vector A
Example: Cosine Similarity
❑ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
❑ It focuses on vector direction, rather than coordinate distance
❑ Ex: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
    ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
    ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123

    cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
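The computation above can be checked with a short sketch:

```python
# Cosine similarity between the two document term-frequency vectors above.
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))   # 25
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42)
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(17)

cos = dot / (norm1 * norm2)
print(round(cos, 2))  # 0.94
```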
Distance for Ordinal Features
❑ Order is important, e.g., grade, size, etc…
❑ Such values can be expressed by rank (integers)
❑ Replace the value xif of the i-th object in the f-th feature by its rank rif:

    rif ∈ {1, ..., Mf}

❑ Finally, the rank values are mapped onto [0, 1] by:

    zif = (rif − 1) / (Mf − 1)

❑ Then we can use any distance/dissimilarity measures


for numeric data
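A minimal sketch of the rank-to-[0, 1] mapping, using an assumed three-level grade scale:

```python
# Ordinal feature -> rank -> [0, 1]; the grade scale is a made-up example.
ranks = {"C": 1, "B": 2, "A": 3}  # r_if in {1, ..., M_f}, order matters
M_f = len(ranks)

def z_ordinal(value):
    # z_if = (r_if - 1) / (M_f - 1)
    return (ranks[value] - 1) / (M_f - 1)

print(z_ordinal("C"), z_ordinal("B"), z_ordinal("A"))  # 0.0 0.5 1.0
```

Once mapped, the values can be fed into any of the numeric distance measures above (e.g., Euclidean distance).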
Features of Mixed Type
❑A database may contain various feature types
❑ Nominal, symmetric binary, asymmetric binary, numeric, ordinal

❑ One may use a weighted average to combine their effects:

    d(i, j) = [ Σ from f=1 to p of δij(f) · dij(f) ] / [ Σ from f=1 to p of δij(f) ]

❑ δij(f): importance weight on each feature type f
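A minimal sketch of the weighted combination; the per-feature dissimilarities and weights below are assumed values, not from the slide:

```python
# Weighted average of per-feature dissimilarities for mixed-type data.
# d_f[k] is d_ij^(f) for feature f, delta_f[k] is its weight delta_ij^(f).
d_f     = [0.2, 1.0, 0.5]  # e.g., from nominal, binary, and numeric measures
delta_f = [1.0, 1.0, 2.0]  # assumed importance weights

d_ij = sum(w * d for w, d in zip(delta_f, d_f)) / sum(delta_f)
print(round(d_ij, 2))  # 0.55
```

Each feature type contributes through its own dissimilarity measure, and the weights let more important feature types dominate the combined distance.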
Summary
❑ Data feature types: nominal, binary, ordinal, numerical (interval-
scaled, ratio-scaled)
❑ Gain insight into the data by:
❑ Basic statistical data description: central tendency, dispersion
❑ Five-number summary, visualized by a boxplot

❑ Visualizations
❑ Scatter plot, QQ plot, histogram, etc…

❑ Measure data similarity


❑ Minkowski Distance and its special cases
❑ Cosine similarity
❑ ….
Thank You
