9-2 Data analysis and pre-processing part 2.pdf

Chapter 2 discusses various aspects of data analysis, including data objects and feature types, basic statistical descriptions, and data visualization techniques such as boxplots, histograms, and scatter plots. It also covers measuring data similarity and dissimilarity using methods like Minkowski distance and cosine similarity. The chapter emphasizes the importance of understanding different feature types and visualizing data to gain insights.

Chapter 2: Getting to Know Your Data

Dong-Kyu Chae

PI of the Data Intelligence Lab @HYU


Department of Computer Science & Data Science
Hanyang University
Contents
❑ Data Objects and Feature Types

❑ Basic Statistical Descriptions of Data

❑ Data Visualization

❑ Measuring Data Similarity and Dissimilarity

❑ Summary
Graphic Displays of Basic Statistical Descriptions
❑ Boxplot: graphic display of five-number summary

❑ Histogram: x-axis: values/ranges, y-axis: frequencies

❑ Quantile plot: each value xi is paired with fi, indicating that
approximately (100 × fi)% of the data are ≤ xi

❑ Quantile-quantile (q-q) plot: the quantiles of one univariate
distribution against the corresponding quantiles of another

❑ Scatter plot: each pair of feature values is plotted as a
point in 2D space
Histogram Analysis
❑ Histogram: graph display of frequencies shown as bars
❑ It shows what proportion of cases fall into each of several categories
❑ The categories are usually specified as non-overlapping intervals of some variable
❑ The categories (bars) must be adjacent
[Figure: histogram of values (10000–90000) vs. frequency (0–40)]
Quantile Plot
❑ Sort data in increasing order, and display all the data points
❑ Allows users to assess both the overall behavior and unusual occurrences
❑ fi indicates that approximately (100 × fi)% of the data are below or equal to the value xi
[Figure: quantile plot with Q1, Median, and Q3 marked]
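The (fi, xi) pairing behind a quantile plot can be sketched without any plotting library; the sample values below are assumed purely for illustration:

```python
# Minimal quantile-plot computation (no plotting); data values are made up.
data = sorted([12, 7, 3, 9, 15, 5, 11, 8])
n = len(data)

# f_i = (i - 0.5) / n, so roughly (100 * f_i)% of the data are <= x_i
f = [(i - 0.5) / n for i in range(1, n + 1)]

pairs = list(zip(f, data))  # the (f_i, x_i) points of the quantile plot
print(pairs[0])             # (0.0625, 3): the smallest value gets the smallest f_i
```

Plotting `f` on the x-axis against `data` on the y-axis gives the quantile plot described above.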
Quantile-Quantile (Q-Q) Plot
❑ Displays the quantiles of one univariate distribution of a dataset against the corresponding quantiles of another dataset
❑ Example: shows the unit price of items sold at Branch 1 vs. Branch 2 at each quantile
❑ Unit prices of items sold at Branch 1 tend to be cheaper than those at Branch 2
[Figure: Q-Q plot of Branch 1 vs. Branch 2 unit prices, with Q1, Median, and Q3 marked against the y = x reference line]
Scatter Plot
❑ Each pair of feature values is treated as coordinates and plotted as a point in 2D space
❑ It shows the correlation between two features
❑ It also provides an overview revealing clusters of points, outliers, etc.
Scatter Plot
❑ Examples of positively and negatively correlated data

❑ In the example below, the left half is positively correlated, and the right half is negatively correlated
Scatter Plot
❑ Uncorrelated data
Contents
❑ Data Objects and Feature Types

❑ Basic Statistical Descriptions of Data

❑ Data Visualization

❑ Measuring Data Similarity and Dissimilarity

❑ Summary
Similarity and Dissimilarity
❑ Similarity

❑ Numerical measure of how much alike two data objects are

❑ This value is higher when objects are more alike

❑ Often falls in the range [0,1]

❑ Dissimilarity (e.g., distance)


❑ Numerical measure of how different two data objects are

❑ Lower when objects are more alike

❑ Minimum dissimilarity is often 0

❑ Proximity refers to either similarity or dissimilarity


Matrix for Data
❑ Data matrix (n-by-p)
❑ n data points with p dimensions (features)
❑ Two modes

    | x11  ...  x1f  ...  x1p |
    | ...  ...  ...  ...  ... |
    | xi1  ...  xif  ...  xip |
    | ...  ...  ...  ...  ... |
    | xn1  ...  xnf  ...  xnp |

❑ Dissimilarity (distance) matrix (n-by-n)
❑ n data points, but registers only the distances
❑ A triangular matrix
❑ Single mode

    |   0                          |
    | d(2,1)    0                  |
    | d(3,1)  d(3,2)    0          |
    |   :       :       :          |
    | d(n,1)  d(n,2)   ...  ...  0 |
Proximity Measure for Nominal Features
❑ Nominal features can take 2 or more states
❑ E.g., COLOR: [red, yellow, blue, green, etc…]

❑ Method 1: Simple matching


❑ m: # of matches, p: total # of nominal features

    d(i, j) = (p − m) / p        sim(i, j) = m / p
❑ Method 2: Use multiple binary features to express one
nominal feature
❑ Creating a new binary feature for each of the M nominal states
❑ E.g., a nominal feature COLOR: [red, yellow, blue, green] can be
expressed by four binary features:
red => [1, 0, 0, 0]; yellow => [0, 1, 0, 0]; blue => [0, 0, 1, 0]; green => [0, 0, 0, 1]
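Both methods can be sketched in a few lines; the two objects and their feature names below are assumed for illustration:

```python
# Sketch of the two proximity methods for nominal features.
# The objects and their features are made-up examples.
p = 3  # total number of nominal features

obj_i = {"COLOR": "red", "SHAPE": "circle", "SIZE": "small"}
obj_j = {"COLOR": "red", "SHAPE": "square", "SIZE": "small"}

# Method 1: simple matching
m = sum(obj_i[f] == obj_j[f] for f in obj_i)  # number of matching features (2 here)
d = (p - m) / p                               # dissimilarity: (3 - 2) / 3
sim = m / p                                   # similarity: 2 / 3

# Method 2: one binary feature per state (one-hot encoding)
states = ["red", "yellow", "blue", "green"]
def one_hot(color):
    return [1 if s == color else 0 for s in states]

print(one_hot("red"))  # [1, 0, 0, 0]
```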
Proximity Measure for Binary Features
❑ A contingency table for two different objects (q, r, s, t count the features with each combination of values):

                    Obj j
                   1     0
    Obj i    1     q     r
             0     s     t

❑ Proximity measure for symmetric binary features:

    sim(i, j) = (q + t) / (q + r + s + t)

❑ Used when the two outcomes (0, 1) are equally important (similar to Method 1)

❑ Proximity measure for asymmetric binary features:

    sim(i, j) = q / (q + r + s)

❑ Used when one outcome is much more important than the other


Dissimilarity between Asymmetric Binary Features
❑ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
❑ Gender is a symmetric feature (ignored in this case)
❑ Assume that the remaining features are asymmetric
❑ Let the values Y and P be 1, and the value N be 0

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
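The three dissimilarities above can be reproduced with a small sketch, mapping Y and P to 1 and N to 0 as the slide does:

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s);
# t (both 0) is ignored, since matching 0s carry no information here.
def asym_binary_dissim(a, b):
    q = sum(x == 1 and y == 1 for x, y in zip(a, b))  # both 1
    r = sum(x == 1 and y == 0 for x, y in zip(a, b))  # 1 in a, 0 in b
    s = sum(x == 0 and y == 1 for x, y in zip(a, b))  # 0 in a, 1 in b
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Gender excluded)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```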
Proximity Measure for Numeric Features
❑ First of all, we need to standardize / normalize
numeric data (will be introduced in the next chapter)
❑ To match the scale of each feature

❑ Z-score:

    z = (x − μ) / σ
▪ x: raw score to be standardized, μ: mean of the population, σ:
standard deviation
▪ Meaning: the distance between the raw score and the
population mean in units of the standard deviation
▪ “−” when the raw score is below the mean
▪ “+” when the raw score is above the mean
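A minimal z-score sketch, using an assumed sample and the population standard deviation, as in the formula above:

```python
# Z-score standardization of one numeric feature (sample values are made up).
import statistics

x = [2.0, 4.0, 6.0, 8.0]
mu = statistics.mean(x)       # population mean of the feature
sigma = statististics_err = statistics.pstdev(x)  # population standard deviation

z = [(v - mu) / sigma for v in x]
# Each z-value is the distance from the mean in units of sigma:
# negative below the mean, positive above it.
```

After standardization the feature has mean 0 and standard deviation 1, so features on different scales become comparable.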
Matrix for Numeric Data
Data Matrix

    point   attribute1   attribute2
    x1          1            2
    x2          3            5
    x3          2            0
    x4          4            5

Distance Matrix
(with Euclidean Distance)

          x1      x2      x3      x4
    x1    0
    x2    3.61    0
    x3    2.24    5.1     0
    x4    4.24    1       5.39    0
Distance on Numeric Data: Minkowski Distance
❑ Minkowski distance: A popular distance measure

    d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called the L-h norm)

If h = 2, it is equal to the Euclidean distance (L-2 norm)
Special Cases of Minkowski Distance
❑ h = 1: Manhattan (city block, L1 norm) distance

    d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

❑ h = 2: Euclidean (L2 norm) distance

    d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)

❑ h → ∞: Supremum (Lmax norm, L∞ norm) distance

    d(i, j) = max over f of |xif − xjf|

❑ This is the maximum difference between any component
(feature) of the vectors
Example: Minkowski Distance
    point   attribute1   attribute2
    x1          1            2
    x2          3            5
    x3          2            0
    x4          4            5

Manhattan (L1)

    L1    x1    x2    x3    x4
    x1    0
    x2    5     0
    x3    3     6     0
    x4    6     1     7     0

Euclidean (L2)

    L2    x1    x2    x3    x4
    x1    0
    x2    3.61  0
    x3    2.24  5.1   0
    x4    4.24  1     5.39  0

Supremum (L∞)

    L∞    x1    x2    x3    x4
    x1    0
    x2    3     0
    x3    2     5     0
    x4    3     1     5     0
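The table entries above can be reproduced with a short sketch of the three special cases:

```python
# Minkowski distance and its special cases, applied to the four 2D points above.
def minkowski(a, b, h):
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):  # the h -> infinity limit (L-max norm)
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2, x3, x4 = (1, 2), (3, 5), (2, 0), (4, 5)

print(minkowski(x1, x2, 1))            # Manhattan: |1-3| + |2-5| = 5.0
print(round(minkowski(x1, x2, 2), 2))  # Euclidean: sqrt(4 + 9) = 3.61
print(supremum(x1, x2))                # Supremum: max(2, 3) = 3
```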
Distance on Numeric Data: Minkowski Distance
❑ Minkowski distance: A popular distance measure

❑ Properties
❑ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
❑ d(i, j) = d(j, i) (Symmetry)
❑ d(i, j) ≤ d(i, k) + d(k, j) (Triangle inequality)

❑ A distance that satisfies these properties is a metric


Cosine Similarity
❑ A document can be represented by a vector of thousands of words, each entry
recording the frequency of a particular word in the document

❑ Documents are thus represented as n-dimensional vectors

❑ Cosine measure: If A and B are two vectors, then:

    cos(A, B) = (A • B) / (||A|| ||B||)

where • indicates the dot product and ||A|| is the length
(norm) of vector A
Example: Cosine Similarity
❑ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
❑ It focuses on vector direction, rather than coordinate distance
❑ Ex: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

    d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
    ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
    ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123

    cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
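The computation above can be checked with a short sketch:

```python
# Cosine similarity between the two document term-frequency vectors above.
import math

d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(d1, d2))   # 25
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42)
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(17)

cos = dot / (norm1 * norm2)
print(round(cos, 2))  # 0.94
```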
Distance for Ordinal Features
❑ Order is important, e.g., grade, size, etc…
❑ Such values can be expressed by rank (integers)
❑ Replace the value xif of the i-th object in the f-th feature by its rank rif:

    rif ∈ {1, ..., Mf}

❑ Finally, the rank values are mapped onto [0, 1] by:

    zif = (rif − 1) / (Mf − 1)

❑ Then we can use any distance/dissimilarity measures


for numeric data
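A minimal sketch of the rank-to-[0, 1] mapping, using an assumed three-level grade scale:

```python
# Ordinal feature -> rank -> [0, 1]; the grade scale is a made-up example.
ranks = {"C": 1, "B": 2, "A": 3}  # r_if in {1, ..., M_f}, order matters
M_f = len(ranks)

def z_ordinal(value):
    # z_if = (r_if - 1) / (M_f - 1)
    return (ranks[value] - 1) / (M_f - 1)

print(z_ordinal("C"), z_ordinal("B"), z_ordinal("A"))  # 0.0 0.5 1.0
```

Once mapped, the values can be fed into any of the numeric distance measures above (e.g., Euclidean distance).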
Features of Mixed Type
❑A database may contain various feature types
❑ Nominal, symmetric binary, asymmetric binary, numeric, ordinal

❑ One may use a weighted average to combine their effects:

    d(i, j) = [ Σ from f=1 to p of δij(f) · dij(f) ] / [ Σ from f=1 to p of δij(f) ]

❑ δij(f): importance weight on each feature type f
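A minimal sketch of the weighted combination; the per-feature dissimilarities and weights below are assumed values, not from the slide:

```python
# Weighted average of per-feature dissimilarities for mixed-type data.
# d_f[k] is d_ij^(f) for feature f, delta_f[k] is its weight delta_ij^(f).
d_f     = [0.2, 1.0, 0.5]  # e.g., from nominal, binary, and numeric measures
delta_f = [1.0, 1.0, 2.0]  # assumed importance weights

d_ij = sum(w * d for w, d in zip(delta_f, d_f)) / sum(delta_f)
print(round(d_ij, 2))  # 0.55
```

Each feature type contributes through its own dissimilarity measure, and the weights let more important feature types dominate the combined distance.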
Summary
❑ Data feature types: nominal, binary, ordinal, numerical (interval-
scaled, ratio-scaled)
❑ Gain insight into the data by:
❑ Basic statistical data description: central tendency, dispersion
❑ Five-number summary, visualized by a boxplot

❑ Visualizations
❑ Scatter plot, QQ plot, histogram, etc…

❑ Measure data similarity


❑ Minkowski Distance and its special cases
❑ Cosine similarity
❑ ….
Thank You
