UNSUPERVISED LEARNING
CO-4 SESSION 37
AIM
To familiarize students with the concepts of unsupervised machine learning, its
differences from supervised machine learning, and the uses of unsupervised learning, particularly clustering
INSTRUCTIONAL OBJECTIVES
This session is designed to:
1. Introduce unsupervised learning
2. Explain the k-means algorithm
3. Describe the representation of clusters
LEARNING OUTCOMES
At the end of this session, you should be able to:
1. Distinguish supervised learning from unsupervised learning
2. Explain what a clustering algorithm does
3. Apply k-means clustering
4. Describe common ways to represent clusters
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data
attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target
attribute in future data instances.
Unsupervised learning: The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
Clustering
• Clustering is a technique for finding similarity groups in data, called
clusters. I.e.,
it groups data instances that are similar to (near) each other in one
cluster and data instances that are very different (far away) from each
other into different clusters.
• Clustering is called an unsupervised learning task because no class
values denoting an a priori grouping of the data instances are given,
as they are in supervised learning.
• For historical reasons, clustering is often considered synonymous
with unsupervised learning;
in fact, association rule mining is also unsupervised.
An illustration
[Figure omitted: a data set with three natural groups of data points, i.e., 3 natural clusters.]
What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes together to make “small”,
“medium” and “large” T-shirts.
Tailor-made for each person: too expensive
One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers according to their
similarities
To do targeted marketing.
What is clustering for? (cont.…)
• Example 3: Given a collection of text documents, we want to organize
them according to their content similarities,
To produce a topic hierarchy
• In fact, clustering is one of the most utilized machine learning
techniques.
It has a long history and has been used in almost every field, e.g., medicine,
psychology, botany, sociology, biology, archeology, marketing,
insurance, libraries, etc.
In recent years, due to the rapid increase of online documents, text
clustering has become important.
Aspects of clustering
• A clustering algorithm
Partitional clustering
Hierarchical clustering
…
• A distance (similarity, or dissimilarity) function
• Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
• The quality of a clustering result depends on the algorithm, the
distance function, and the application.
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued
space X ⊆ R^r, and r is the number of attributes (dimensions) in the
data.
• The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called centroid.
k is specified by the user
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids,
cluster centers
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
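A minimal NumPy sketch of these four steps (function and variable names are illustrative, not from the slides):

import numpy as np

def kmeans(X, k, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute each centroid as the mean of its current members
        #    (an empty cluster keeps its old centroid, a simplification).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # 4) Convergence criterion: stop when the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels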
K-means algorithm – (cont.…)
Stopping/convergence criterion
• no (or minimum) re-assignments of data points to different clusters
• no (or minimum) change of centroids, or
• minimum decrease in the sum of squared error (SSE):

SSE = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2

where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean
vector of all the data points in C_j), and dist(x, m_j) is the distance
between data point x and centroid m_j.
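As a small illustration (names assumed, not from the slides), the SSE above can be computed as:

import numpy as np

def sse(X, labels, centroids):
    # Sum over clusters of squared Euclidean distances to each centroid.
    return sum(np.sum((X[labels == j] - m) ** 2)
               for j, m in enumerate(centroids))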
An example
[Figures omitted: successive k-means iterations on a small 2-D data set.]
An example distance function
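The worked figure for this slide is omitted. A typical choice for the example distance function, and the one assumed in the sketches here, is the Euclidean distance:

\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{f=1}^{r} (x_{if} - x_{jf})^2}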
A disk version of k-means
• K-means can be implemented with data on disk
In each iteration, it scans the data once,
as the centroids can be computed incrementally.
• It can be used to cluster large datasets that do not fit in main memory
• We need to control the number of iterations
In practice, a limit is set (e.g., < 50 iterations).
• Not the best method. There are other scale-up algorithms, e.g., BIRCH.
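A sketch of one disk-based iteration, assuming the data arrives in chunks (e.g., read block by block from disk); the chunk reader and all names are assumptions:

import numpy as np

def disk_kmeans_step(chunks, centroids):
    # One sequential scan; per-cluster sums and counts are accumulated
    # incrementally, so no chunk needs to stay in memory.
    k, r = centroids.shape
    sums, counts = np.zeros((k, r)), np.zeros(k)
    for X in chunks:
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            sums[j] += members.sum(axis=0)
            counts[j] += len(members)
    # Recompute centroids; a cluster that received no points keeps the old one.
    out = centroids.copy()
    nz = counts > 0
    out[nz] = sums[nz] / counts[nz, None]
    return out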
Strengths of k-means
• Strengths:
Simple: easy to understand and to implement
Efficient: Time complexity: O(tkn), where n is the number of data points,
k is the number of clusters, and t is the number of iterations.
Since both k and t are typically small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used; the global
optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean is defined.
For categorical data, k-modes is used instead: the centroid is represented
by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
Outliers are data points that are very far away from other data points.
Outliers could be errors in the data recording or some special data
points with very different values.
Weaknesses of k-means: Problems with outliers
[Figure omitted.]
Weaknesses of k-means: To deal with outliers
• One method is to remove, during the clustering process, data points that are much
further away from the centroids than other data points.
To be safe, we may want to monitor these possible outliers over a few iterations
and then decide whether to remove them (a sketch appears after this list).
• Another method is to perform random sampling. Since sampling chooses only a
small subset of the data points, the chance of selecting an outlier is very small.
The remaining data points are then assigned to the clusters by distance or similarity
comparison, or by classification.
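A minimal sketch of the first method, flagging points unusually far from their own centroid (the 3-standard-deviation cutoff is an assumption for illustration):

import numpy as np

def flag_outliers(X, labels, centroids, z=3.0):
    # Distance from each point to its assigned centroid.
    d = np.linalg.norm(X - centroids[labels], axis=1)
    # Flag points far beyond the typical distance as possible outliers.
    return d > d.mean() + z * d.std()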
Weaknesses of k-means (cont.…)
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont.…)
• If we use different seeds, we may get good results.
• There are methods to help choose good seeds; one is sketched below.
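One well-known seeding method (not named on the slides) is k-means++, which picks each new seed with probability proportional to its squared distance from the nearest seed chosen so far. A minimal sketch:

import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]        # first seed: uniform at random
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        # Points far from all current seeds are more likely to be picked.
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)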
Weaknesses of k-means (cont.…)
• The k-means algorithm is not suitable for
discovering clusters that are not hyper-ellipsoids (or
hyper-spheres).
Common ways to represent clusters
• Use the centroid of each cluster to represent the cluster.
Compute the radius and standard deviation of the cluster to determine
its spread in each dimension (see the sketch below).
The centroid representation alone works well if the clusters are
hyper-spherical.
If clusters are elongated or have other shapes, centroids alone are not
sufficient.
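A small helper illustrating this representation (names assumed, not from the slides):

import numpy as np

def describe_cluster(points):
    centroid = points.mean(axis=0)
    d = np.linalg.norm(points - centroid, axis=1)
    return {"centroid": centroid,
            "radius": d.max(),            # furthest member from the centroid
            "std": points.std(axis=0)}    # spread in each dimension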
Using classification model
• All the data points in a cluster are regarded as having the same class label,
e.g., the cluster ID.
Run a supervised learning algorithm on the data to find a classification
model (sketched below).
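A sketch of this idea using a decision tree as the supervised learner (scikit-learn is one assumed choice; the slides do not prescribe a learner):

from sklearn.tree import DecisionTreeClassifier, export_text

def cluster_rules(X, labels, depth=3):
    # Fit a shallow tree that predicts the cluster ID of each point;
    # its rules give a compact description of the clusters.
    tree = DecisionTreeClassifier(max_depth=depth).fit(X, labels)
    return export_text(tree)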
Use frequent values to represent cluster
• This method is mainly for clustering of categorical data (e.g., k-modes
clustering).
• Main method used in text clustering, where a small set of frequent
words in each cluster is selected to represent the cluster.
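A sketch of this representation, assuming each cluster is given as a list of tokenized documents:

from collections import Counter

def top_words(cluster_docs, n=5):
    # Count word occurrences across all documents in the cluster and
    # keep the n most frequent words as the cluster's representation.
    counts = Counter(word for doc in cluster_docs for word in doc)
    return [w for w, _ in counts.most_common(n)]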
Clusters of arbitrary shapes
• Hyper-elliptical and hyper-spherical clusters are
usually easy to represent, using their centroid
together with spreads.
• Irregularly shaped clusters are hard to represent, and they may not be
useful in some applications.
Using centroids is not suitable in general (upper figure, omitted).
K-means clusters may be more useful (lower figure, omitted), e.g., for
making 2 sizes of T-shirts.
Combining individual distances
• This approach computes individual attribute distances and then combines them:

\mathrm{dist}(x_i, x_j) = \frac{\sum_{f=1}^{r} \delta_{ij}^{f} d_{ij}^{f}}{\sum_{f=1}^{r} \delta_{ij}^{f}}

where d_{ij}^{f} is the distance between x_i and x_j on attribute f, and
\delta_{ij}^{f} is an indicator that is 1 when attribute f is comparable for
the two points (e.g., neither value is missing) and 0 otherwise.
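Read directly from the formula above (names assumed, not from the slides):

import numpy as np

def combined_dist(d, delta):
    # d[f]: distance between the two points on attribute f;
    # delta[f]: 1 if attribute f is comparable for both points, else 0.
    return np.sum(delta * d) / np.sum(delta)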
Summary
• Clustering has a long history and is still an active area of research.
There are a huge number of clustering algorithms,
and more appear every year.
We introduced only the main algorithms here; there are many others,
e.g.,
density-based algorithms, sub-space clustering, scale-up methods,
neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in practice. This partially
explains why there are still many clustering algorithms being devised
every year.
• Clustering is highly application dependent and to some extent
subjective.
THANK YOU
TEAM AI&ML