CS583 Unsupervised Learning (2) (3)
Road map
Basic concepts
K-means algorithm
Representation of clusters
Hierarchical clustering
Distance functions
Data standardization
Handling mixed attributes
Which clustering algorithm to use?
Cluster evaluation
Discovering holes and data regions
Summary
CS583, Bing Liu, UIC 2
Supervised vs. unsupervised learning
Supervised learning: learn models or
classifiers from the data that relate data
attributes to a target class attribute.
These models are then used to predict the values
of the class attribute in test or future data
instances.
Unsupervised learning: The data have no
target/class attribute.
We want to explore the data to find some intrinsic
structures in them.
Clustering
Clustering is one main approach to
unsupervised learning.
It finds similarity groups in data, called clusters; that is,
it groups data instances that are similar to (near) each
other into one cluster and data instances that are very
different (far away) from each other into different clusters.
Clustering is often considered synonymous
with unsupervised learning.
But association rule mining is also unsupervised.
This chapter focuses on clustering.
$$
SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} dist(\mathbf{x}, \mathbf{m}_j)^2 \tag{1}
$$
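The sum of squared error can be computed directly from its definition. The sketch below is only an illustration, assuming Euclidean distance for dist and taking each centroid m_j as given; the function name `sse` and the toy data are hypothetical and not part of the course material.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def sse(clusters, centroids):
    """Sum, over all clusters, of squared distances from points to their centroid."""
    total = 0.0
    for points, m_j in zip(clusters, centroids):
        for x in points:
            total += dist(x, m_j) ** 2
    return total


# Two hypothetical 2-D clusters and their centroids (assumed, for illustration).
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(5.0, 5.0), (7.0, 5.0)],
]
centroids = [(1.0, 2.0), (6.0, 5.0)]

sse_value = sse(clusters, centroids)
print(sse_value)
```

Every point in this toy data lies at distance 1 from its centroid, so each of the four points contributes 1 to the total.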
Use the centroid of each cluster to represent the cluster.
Compute the radius and
standard deviation of the cluster to determine its
spread in each dimension.
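A minimal sketch of such a cluster summary follows. It assumes the radius is the largest Euclidean distance from the centroid to any point in the cluster (the slides do not define it here) and uses the population standard deviation per dimension; the function name `summarize_cluster` and the sample points are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def summarize_cluster(points):
    """Return (centroid, radius, per-dimension standard deviation) for one cluster."""
    dims = list(zip(*points))  # one tuple of values per dimension
    centroid = tuple(mean(d) for d in dims)
    # Assumed definition: radius = distance from centroid to the farthest point.
    radius = max(dist(p, centroid) for p in points)
    # Spread in each dimension, measured as population standard deviation.
    spread = tuple(pstdev(d) for d in dims)
    return centroid, radius, spread


# Hypothetical cluster: the four corners of a 2 x 4 rectangle.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0), (2.0, 4.0)]
centroid, radius, spread = summarize_cluster(points)
print(centroid, radius, spread)
```

Here the spread is larger in the second dimension than in the first, which the centroid and radius alone would not reveal.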
Cluster quality is hard to evaluate because
we do not know the ground truth, i.e., the correct
clusters.
Some methods are used:
User inspection
Centroids and spreads
Rules from a decision tree.
For text documents,
read some documents in each cluster and/or
inspect the most frequent words in each cluster
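Inspecting the most frequent words per cluster is easy to automate. The sketch below is one possible way to do it, assuming documents are already assigned to clusters and that splitting on whitespace is an acceptable tokenization; the function name `top_words` and the sample documents are hypothetical.

```python
from collections import Counter


def top_words(docs_by_cluster, n=3):
    """Map each cluster id to its n most frequent words."""
    result = {}
    for cluster_id, docs in docs_by_cluster.items():
        # Naive tokenization (assumption): lowercase, split on whitespace.
        counts = Counter(word for doc in docs for word in doc.lower().split())
        result[cluster_id] = [word for word, _ in counts.most_common(n)]
    return result


# Hypothetical clustering of six short documents into two clusters.
docs_by_cluster = {
    0: ["goal match team", "team goal score", "goal keeper"],
    1: ["stock market price", "market price rise", "market index"],
}
summary = top_words(docs_by_cluster, n=2)
print(summary)
```

In practice one would likely remove stop words first, since otherwise words such as "the" tend to dominate every cluster.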