
Lab 2

The document discusses different types of attributes in data mining and their properties. It also covers various similarity and dissimilarity measures that can be used to calculate distances between data objects, including Euclidean distance, Mahalanobis distance, Minkowski distance, and cosine similarity.

Uploaded by

samira abdullah
Copyright: © All Rights Reserved

Data Mining

Lab 2
Lab Content
🠶 What is Data?
🠶 Types of Attributes
🠶 Properties of Attribute Values
❑ Dissimilarity (Distance) Measures
✓ Euclidean distance
✓ Mahalanobis distance
✓ Minkowski distance
✓ Supremum distance
❑ Similarity Measures
✓ Cosine similarity
❑ Assignment
What is Data?
🠶 Collection of data objects and their attributes
🠶 An attribute is a property or characteristic of an object
🠶 Examples: eye color of a person, temperature, etc.
🠶 Attribute is also known as variable, field, characteristic, or feature
🠶 A collection of attributes describes an object
🠶 Object is also known as record, point, case, sample, entity, or instance

Example data set (columns are attributes, rows are objects):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
There are different types of attributes
🠶 Nominal
Examples: ID numbers, hair color {red, brown, black} , zip codes
🠶 Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
🠶 Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
🠶 Ratio
Examples: length, time, number of words

Properties of Attribute Values
🠶 The type of an attribute depends on which of the following properties it possesses:
❖ Distinctness: = ≠
❖ Order: < >
❖ Addition: + -
❖ Multiplication: * /

🠶 Nominal attribute: distinctness


🠶 Ordinal attribute: distinctness & order
🠶 Interval attribute: distinctness, order & addition
🠶 Ratio attribute: all 4 properties
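The nesting of the four attribute types can be sketched in a few lines of Python. This is only an illustration: the names `PROPERTIES`, `LEVEL`, `supported_properties` and `supports` are hypothetical helpers, not part of the lab material.

```python
# The four properties, ordered from weakest to strongest.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Each attribute type possesses the first n properties in the list above.
LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    """Return the list of properties that an attribute type possesses."""
    return PROPERTIES[:LEVEL[attribute_type]]


def supports(attribute_type, prop):
    """Return True if the attribute type possesses the given property."""
    return prop in supported_properties(attribute_type)


print(supported_properties("interval"))
print(supports("interval", "multiplication"))
```

For example, a temperature in Celsius is an interval attribute: differences are meaningful, but ratios are not, so `supports("interval", "multiplication")` is false.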
Similarity and Dissimilarity
🠶 Similarity
▪ Numerical measure of how alike two data objects are.
▪ Is higher when objects are more alike.
▪ Often falls in the range [0,1]

🠶 Dissimilarity
▪ Numerical measure of how different two data objects are
▪ Lower when objects are more alike
▪ Minimum dissimilarity is often 0
▪ Upper limit varies

🠶 Proximity refers to a similarity or dissimilarity


Dissimilarity (Distance) Measures

❑ Euclidean distance

❑ Mahalanobis Distance

❑ Minkowski distance

❑ Supremum distance
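Euclidean Distance

d(x, y) = sqrt( sum_k (x_k - y_k)^2 ), where the sum runs over all n attributes (k = 1, ..., n)

Example
x = (0, 1, 0, 1), y = (1, 0, 1, 0)

Euclidean distance = sqrt((0-1)^2 + (1-0)^2 + (0-1)^2 + (1-0)^2) = sqrt(4) = 2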
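The same calculation can be written as a short Python function. This is a minimal sketch using only the standard library; the function name `euclidean` is an illustrative choice.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Square each component difference, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


print(euclidean((0, 1, 0, 1), (1, 0, 1, 0)))  # 2.0
```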
Euclidean Distance (cont.)

Example
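Mahalanobis Distance

mahalanobis(x, y) = sqrt( (x - y)^T Σ^-1 (x - y) ), where Σ is the covariance matrix of the data

Example

x = (2, 3), y = (3, 4)

No covariance matrix is given for these points, so the Mahalanobis distance itself cannot be evaluated here. The sum of absolute differences, |2-3| + |3-4| = 1 + 1 = 2, is the Manhattan (city block, L1) distance between x and y.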
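For reference, the Mahalanobis distance is usually defined as sqrt((x - y)^T Σ^-1 (x - y)), where Σ is the covariance matrix of the data. The sketch below handles only the two-attribute case and inverts the 2x2 matrix by hand. The covariance matrices passed in are assumptions made for illustration, since the lab gives none, and `mahalanobis_2d` is a hypothetical helper name.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T * inv * (x - y).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)


# Assumed covariance: the identity matrix, under which the Mahalanobis
# distance reduces to the Euclidean distance.
identity = ((1.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((2, 3), (3, 4), identity))

# Assumed covariance with a larger variance on the first attribute,
# which down-weights differences along that attribute.
stretched = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((2, 3), (3, 4), stretched))
```

With the identity matrix the result is sqrt(2), the Euclidean distance between the two points. It differs from the value 2 obtained by summing absolute differences, which is the Manhattan distance.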


Minkowski distance (MD)

d(x, y) = ( sum_k |x_k - y_k|^r )^(1/r), where r is a parameter and the sum runs over all n attributes (k = 1, ..., n)

Minkowski Distance: Examples

🠶 r = 1: City block (Manhattan, taxicab, L1 norm) distance.
🠶 A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
🠶 r = 2: Euclidean distance
🠶 r → ∞: “supremum” (Lmax norm, L∞ norm) distance.
🠶 This is the maximum difference between any component of the vectors


Minkowski Distance: Examples (cont.)

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1    p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

L2    p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

L∞    p1   p2   p3   p4
p1    0    2    3    5
p2    2    0    1    3
p3    3    1    0    2
p4    5    3    2    0

Distance Matrix
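The three distance matrices for the points p1 to p4 can be reproduced with a small Python sketch. It assumes the standard Minkowski definition with parameter r, taking r = infinity as the supremum distance; `minkowski` and `distance_matrix` are illustrative names.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance with parameter r (r = math.inf gives the supremum)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: the largest component difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(points, r):
    """Pairwise distances, rounded to 3 decimals, in the order of `points`."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 3) for b in names]
            for a in names]


for label, r in (("L1", 1), ("L2", 2), ("Lmax", math.inf)):
    print(label)
    for row in distance_matrix(points, r):
        print(row)
```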
For the following vectors, x and y, calculate the distance measures.

🠶 x = (2, 0), y = (5, 1); compute:

1. Supremum distance
2. Euclidean distance
3. Manhattan (city block) distance
Solutions

• x = (2, 0), y = (5, 1)

1. Supremum distance = max(|2-5|, |0-1|) = 3
2. Euclidean distance = sqrt((2-5)^2 + (0-1)^2) = sqrt(10) = 3.162
3. Manhattan (city block) distance = |2-5| + |0-1| = 4
Similarity: Cosine Similarity
🠶 If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)
where • indicates vector dot product and ||d|| is the length of vector d.

🠶 Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
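The worked example can be checked with a minimal Python sketch of the cosine formula. The function name `cosine_similarity` is an illustrative choice, and raising an error for a zero-length vector is an assumption the slides do not address.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: the cosine is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```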
Extended Jaccard Coefficient (Tanimoto)

🠶 The binary Jaccard coefficient measures the degree of overlap between two sets and is computed as the ratio of the number of shared attributes (words) of p AND q to the number possessed by p OR q.
Extended Jaccard Coefficient (Tanimoto) (cont …)

🠶 Example

d1 = (0, 1, 0, 1)
d2 = (1, 0, 1, 0)

d1 • d2 = 0*1 + 1*0 + 0*1 + 1*0 = 0

||d1||^2 = 0*0 + 1*1 + 0*0 + 1*1 = 2

||d2||^2 = 1*1 + 0*0 + 1*1 + 0*0 = 2

Jaccard(d1, d2) = (d1 • d2) / (||d1||^2 + ||d2||^2 − d1 • d2) = 0 / (2 + 2 − 0) = 0
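The extended Jaccard (Tanimoto) coefficient divides the dot product by the sum of the squared lengths minus the dot product. A minimal Python sketch under that definition follows; `extended_jaccard` is an illustrative name, and raising an error when both vectors are all zeros is an assumption.

```python
def extended_jaccard(d1, d2):
    """Extended Jaccard (Tanimoto) coefficient of two numeric vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    sq1 = sum(a * a for a in d1)  # squared length of d1
    sq2 = sum(b * b for b in d2)  # squared length of d2
    denom = sq1 + sq2 - dot
    if denom == 0:
        # Assumption: the coefficient is undefined when both vectors are zero.
        raise ValueError("coefficient is undefined for two zero vectors")
    return dot / denom


print(extended_jaccard((0, 1, 0, 1), (1, 0, 1, 0)))              # 0.0
print(extended_jaccard((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)))  # 0.6
```

For binary vectors this reduces to the binary Jaccard coefficient: the second call has 3 shared attributes out of 5 present in either vector.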
Examples
🠶 For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2) compute cosine, Euclidean

(b) x = (0,−1, 0, 1) , y = (1, 0,−1, 0) compute cosine, Euclidean

(c) x = (1, 1, 0, 1, 0, 1) , y = (1, 1, 1, 0, 0, 1) compute cosine, Jaccard

(d) x = (2,−1, 0, 2, 0,−3), y = (−1, 1,−1, 0, 0,−1) compute cosine


Examples (cont…)

Solutions

(a) cos(x, y) = 1 , Euclidean(x, y) = 2


(b) cos(x, y) = 0 , Euclidean(x, y) = 2
(c) cos(x, y) = 0.75 , Jaccard(x, y) = 0.6
(d) cos(x, y) = 0
