Lecture#7

MIS 502: Decision Support Systems

“Don't entrust your future on others' hands. Rather make decisions by yourself with the help of God's guidance…”. Hark Herald Sarmiento

Lecture 07: Types of Data in Cluster Analysis



Outline

• Types of data in cluster analysis
– Interval-scaled variables
– Binary variables


Lecture Objectives
At the end of this lecture, you will be able to:

• Know interval-scaled variables and their applications
• Know binary variables (symmetric and asymmetric) and their applications


Types of Data in Cluster Analysis
Interval-scaled Variables:
• They are continuous measurements on a linear scale, e.g. weight, height, temperature, coordinates of a location.
• Changing the measurement units (e.g. meters to inches, or kg to lbs) may lead to a different clustering structure.
• Standardization is needed to avoid dependence on the choice of measurement units.
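
To see why the choice of units matters, the following minimal sketch finds the nearest neighbour of a person A under Euclidean distance, first with height in meters and then in centimeters. The three people and all of their measurements are hypothetical values chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(target, others):
    """Name of the object in `others` that is closest to `target`."""
    return min(others, key=lambda name: euclidean(target, others[name]))

# Hypothetical (height in meters, weight in kg) measurements.
a_m = (1.60, 60.0)
others_m = {"B": (1.90, 61.0), "C": (1.62, 64.0)}

# Same people, with height converted to centimeters.
a_cm = (a_m[0] * 100, a_m[1])
others_cm = {name: (h * 100, w) for name, (h, w) in others_m.items()}

nearest_in_meters = nearest(a_m, others_m)
nearest_in_cm = nearest(a_cm, others_cm)
print(nearest_in_meters, nearest_in_cm)
```

With height in meters the weight difference dominates and B is closest to A; with height in centimeters the height difference dominates and C is closest, so a clustering built on these distances could group the people differently.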

• Standardization is important in multivariate analysis because variables measured on different scales do not contribute equally to the analysis, e.g. a variable that ranges between 0 and 100 may outweigh a variable that ranges between 0 and 1.
• Standardizing measurements attempts to give all variables an equal weight.
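
As a rough illustration with made-up numbers, take two objects that differ by 10 on the variable ranging between 0 and 100 and by 0.5 on the variable ranging between 0 and 1. Their Euclidean distance is then determined almost entirely by the first variable:

$$ d = \sqrt{10^2 + 0.5^2} = \sqrt{100.25} \approx 10.01 $$

The second variable contributes only 0.25 of the 100.25 under the square root (about 0.25%), even though a difference of 0.5 covers half of its range.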

• In some applications we may want to give more weight to a certain set of variables, e.g. height when clustering basketball player candidates.
• Standardization may or may not be useful in a particular application, so the choice of whether and how to perform standardization should be left to the user.
• One choice of standardization is to convert the original measurements to unitless variables.
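
One possible way to give selected variables more weight is a weighted Euclidean distance, in which each squared difference is multiplied by a weight. The lecture does not prescribe a particular method, so this is only a sketch under that assumption; the weights and the standardized values below are hypothetical.

```python
import math

def weighted_euclidean(a, b, weights):
    """Euclidean distance with one non-negative weight per variable."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical standardized (height, weight) values for two candidates.
player_1 = (1.0, 0.0)
player_2 = (0.0, 1.0)

# Equal weights: both variables contribute the same amount.
equal = weighted_euclidean(player_1, player_2, (1.0, 1.0))

# Height weighted four times as heavily as weight.
height_heavy = weighted_euclidean(player_1, player_2, (4.0, 1.0))
print(equal, height_heavy)
```

Raising the weight on height makes a height difference count for more of the total distance, so objects end up grouped mainly by height.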
• Given measurements for a variable f, the standardization can be performed as follows:

• Calculate the mean absolute deviation, s_f:

$$ s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right) $$

• where x_{1f}, …, x_{nf} are the n measurements of f, and m_f is the mean value of f, i.e.

$$ m_f = \frac{1}{n}\left( x_{1f} + x_{2f} + \cdots + x_{nf} \right) $$

Note: The mean absolute deviation is more robust to outliers than the standard deviation, because the deviations from the mean are not squared.
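
The robustness note can be checked with a small sketch. It compares the mean absolute deviation with the population standard deviation on hypothetical measurements that contain one outlier; the data and the helper name `mean_abs_deviation` are assumptions made for illustration.

```python
import statistics

def mean_abs_deviation(values):
    """Mean absolute deviation s_f of a list of measurements."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical measurements with one outlier (100).
x = [1, 2, 3, 4, 100]
m = sum(x) / len(x)                  # mean m_f = 22.0

s_mad = mean_abs_deviation(x)        # (21 + 20 + 19 + 18 + 78) / 5
s_std = statistics.pstdev(x)         # population standard deviation

# Standardized value of the outlier under each measure of spread.
z_outlier_mad = (100 - m) / s_mad
z_outlier_std = (100 - m) / s_std
print(s_mad, s_std, z_outlier_mad, z_outlier_std)
```

Because the outlier's deviation is not squared, it inflates s_f less than it inflates the standard deviation, so the outlier's standardized value stays larger and the outlier remains easier to detect.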


• Calculate the standardized measurement, or z-score:

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

• The dissimilarity (or similarity) between the objects described by interval-scaled variables is typically computed based on the distance between each pair of objects.
• Refer to the Euclidean, Manhattan and Minkowski distance measures covered earlier.
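
The two steps, standardizing each variable and then measuring distance, can be sketched as follows. The data are hypothetical, and the function names `standardize` and `minkowski` are illustrative; the sketch assumes no variable is constant, since s_f would then be zero.

```python
def standardize(values):
    """Return z-scores using the mean absolute deviation as the spread."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n   # assumed to be non-zero
    return [(x - m) / s for x in values]

def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical data: heights in cm and weights in kg for four people.
heights = [160, 170, 170, 180]
weights = [50, 60, 60, 70]

z_heights = standardize(heights)
z_weights = standardize(weights)

# Each object becomes a tuple of unitless z-scores.
objects = list(zip(z_heights, z_weights))
manhattan = minkowski(objects[0], objects[1], 1)
euclidean = minkowski(objects[0], objects[1], 2)
print(z_heights, manhattan, euclidean)
```

After standardization both variables are unitless and have the same spread, so neither dominates the distance merely because of its units.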
Binary Variables:
• A binary variable has only two states, 0 or 1, where 0 means that the variable is absent and 1 means that it is present.
• Given the variable allergy describing a patient,
– 1 indicates that the patient has an allergy
– 0 indicates that the patient does not have one.
• Treating binary variables as if they were interval-scaled can lead to misleading clustering results.
• Therefore, methods specific to binary data are necessary for computing dissimilarities.
• One approach involves computing a dissimilarity matrix from the given binary data.
• If all binary variables are thought of as having the same weight, we have the 2-by-2 contingency table for objects i and j:

                   object j
                   1      0      sum
object i     1     q      r      q+r
             0     s      t      s+t
           sum     q+s    r+t    p


• q is the number of variables that equal 1 for both objects i and j,
• r is the number of variables that equal 1 for object i but 0 for object j,
• s is the number of variables that equal 0 for object i but 1 for object j,
• t is the number of variables that equal 0 for both objects i and j, and
• p is the total number of variables, p = q + r + s + t.
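
The four counts defined above can be obtained by comparing two objects variable by variable. The sketch below assumes each object is a list containing only 0s and 1s; the function name `contingency_counts` and the example vectors are hypothetical.

```python
def contingency_counts(obj_i, obj_j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a == 1 and b == 1:
            q += 1          # 1 for both objects
        elif a == 1 and b == 0:
            r += 1          # 1 for i, 0 for j
        elif a == 0 and b == 1:
            s += 1          # 0 for i, 1 for j
        else:
            t += 1          # 0 for both (inputs assumed to be 0/1 only)
    return q, r, s, t

# Hypothetical binary descriptions of two objects over five variables.
obj_i = [1, 1, 0, 0, 1]
obj_j = [1, 0, 1, 0, 0]

q, r, s, t = contingency_counts(obj_i, obj_j)
p = q + r + s + t
print(q, r, s, t, p)
```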
Binary Variables (symmetric):
• A binary variable is symmetric if both of its states are equally valuable and carry the same weight, i.e. there is no preference as to which state is coded as 0 or 1.
– Example: the attribute gender having the states male and female.
• Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
• The dissimilarity between objects i and j for symmetric binary variables is computed as:

$$ d(i, j) = \frac{r + s}{q + r + s + t} $$
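
As a small worked example with hypothetical counts, suppose two objects are compared on p = 5 symmetric binary variables and the contingency counts are q = 1, r = 2, s = 1 and t = 1. Assuming the usual simple-matching form of the symmetric binary dissimilarity, in which the mismatches are divided by all p variables:

$$ d(i, j) = \frac{r + s}{q + r + s + t} = \frac{2 + 1}{1 + 2 + 1 + 1} = \frac{3}{5} = 0.6 $$

Positive matches (q) and negative matches (t) both appear in the denominator with equal weight, which is what makes the measure symmetric.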

Binary Variables (asymmetric):
• A binary variable is asymmetric if the outcomes of the states are not equally important.
– Example: the positive and negative outcomes of an HIV test.
• By convention, we code the most important outcome, which is usually the rarest one, as 1 (e.g. HIV positive) and the other as 0.
• Given two asymmetric binary variables, the agreement of two 1s (a positive match) is then considered more significant than that of two 0s (a negative match).
• Therefore, such binary variables are often considered “monary” (as if having one state).
• The dissimilarity based on such variables is called asymmetric binary dissimilarity.
• In asymmetric binary dissimilarity, the number of negative matches, t, is considered unimportant and is thus ignored in the computation:

$$ d(i, j) = \frac{r + s}{q + r + s} $$
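
The effect of ignoring t can be seen by holding q, r and s fixed while the number of negative matches grows. This is a sketch with hypothetical counts; the symmetric form used for comparison is assumed to be the usual simple-matching dissimilarity (r + s) / (q + r + s + t), and the asymmetric form is undefined when q + r + s = 0.

```python
def symmetric_dissimilarity(q, r, s, t):
    """Mismatches over all variables (negative matches count)."""
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(q, r, s, t):
    """Mismatches over variables where at least one object is 1 (t ignored)."""
    return (r + s) / (q + r + s)

# Hypothetical counts: few positives, a growing number of shared negatives.
q, r, s = 1, 1, 1
for t in (0, 7, 97):
    print(t,
          symmetric_dissimilarity(q, r, s, t),
          asymmetric_dissimilarity(q, r, s, t))
```

As t grows, the symmetric value shrinks toward 0 because the shared absences make the objects look alike, while the asymmetric value stays at 2/3. For rare outcomes such as a positive test, the asymmetric measure is therefore not swamped by the many variables on which both objects are simply negative.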
• The asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as:

$$ sim(i, j) = \frac{q}{q + r + s} = 1 - d(i, j) $$

• The coefficient sim(i, j) is called the Jaccard coefficient.
• When both symmetric and asymmetric binary variables occur in the same data set, the mixed variables approach can be applied (will be covered later).
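
The Jaccard coefficient, q / (q + r + s), can also be read as a set computation: q is the number of variables that are 1 for both objects (the intersection), and q + r + s is the number that are 1 for at least one of them (the union). The sketch below takes that view; the function name and the example vectors are hypothetical.

```python
def jaccard_similarity(obj_i, obj_j):
    """Jaccard coefficient q / (q + r + s) for two 0/1 vectors."""
    ones_i = {k for k, v in enumerate(obj_i) if v == 1}
    ones_j = {k for k, v in enumerate(obj_j) if v == 1}
    union = ones_i | ones_j                      # q + r + s variables
    if not union:
        raise ValueError("undefined when neither object has a 1")
    return len(ones_i & ones_j) / len(union)     # q / (q + r + s)

# Hypothetical objects with q = 1, r = 2, s = 1, t = 1.
obj_i = [1, 1, 0, 0, 1]
obj_j = [1, 0, 1, 0, 0]

sim = jaccard_similarity(obj_i, obj_j)   # 1 / 4
dissim = 1 - sim                         # (r + s) / (q + r + s) = 3 / 4
print(sim, dissim)
```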

• Suppose that a patient record table contains the attributes:
– name: an object identifier
– gender: a symmetric attribute
– fever, cough, test-1, test-2, test-3, test-4: the asymmetric attributes


• For the asymmetric attribute values, let the values Y (yes) and P (positive) be set to 1, and the value N (no or negative) be set to 0.
• Suppose that the distance between objects (patients) is computed based only on the asymmetric variables.
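
The coding step can be sketched as a simple lookup. The patient record below is hypothetical and is not taken from the lecture's table; the names `CODES`, `ASYMMETRIC_ATTRIBUTES` and `encode` are illustrative.

```python
# Mapping from recorded values to binary codes: Y and P become 1, N becomes 0.
CODES = {"Y": 1, "P": 1, "N": 0}

ASYMMETRIC_ATTRIBUTES = ["fever", "cough", "test-1", "test-2", "test-3", "test-4"]

def encode(record):
    """Return the 0/1 vector of a patient's asymmetric attributes.

    The identifier (name) and the symmetric attribute (gender) are skipped,
    since the distance is based only on the asymmetric variables.
    """
    return [CODES[record[attr]] for attr in ASYMMETRIC_ATTRIBUTES]

# Hypothetical patient record, for illustration only.
patient = {
    "name": "Ann", "gender": "F",
    "fever": "N", "cough": "Y",
    "test-1": "P", "test-2": "N", "test-3": "N", "test-4": "P",
}
vector = encode(patient)
print(vector)
```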

• The distance between each pair of the three patients, Jack, Mary, and Jim, is:
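
The patient table and the computed distances did not survive in this copy of the slides. The sketch below therefore assumes the attribute values from the commonly reproduced textbook version of this example; they are an assumption, not values recovered from the lecture. It applies the asymmetric binary dissimilarity d(i, j) = (r + s) / (q + r + s) to each pair.

```python
from itertools import combinations

# ASSUMED values: the slide's table was lost, so these rows follow the
# commonly reproduced textbook version of the example.
# Columns: fever, cough, test-1, test-2, test-3, test-4 (Y/P -> 1, N -> 0).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

def asymmetric_dissimilarity(a, b):
    """d(i, j) = (r + s) / (q + r + s); negative matches t are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)   # r + s
    return mismatches / (q + mismatches)

# Dissimilarity for every pair of patients, rounded to two decimals.
distances = {
    (i, j): round(asymmetric_dissimilarity(patients[i], patients[j]), 2)
    for i, j in combinations(patients, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

Under these assumed values, d(Jack, Mary) = 1/3 ≈ 0.33, d(Jack, Jim) = 2/3 ≈ 0.67 and d(Mary, Jim) = 3/4 = 0.75.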


• These measurements suggest that:
– Mary and Jim are unlikely to have a similar disease because they have the highest dissimilarity value among the three pairs.
– Of the three patients, Jack and Mary are the most likely to have a similar disease.

Thank You
