
Introduction to Data Science

Dr. Irfan Yousuf


Department of Computer Science (New Campus)
UET, Lahore
(Week # 10; March 18 - 22, 2024)
Outline
• K Nearest Neighbors
• K-means
Machine Learning Algorithms
Supervised Machine Learning Algorithms
• Classification: A classification problem is one where the output
variable is a category, such as "red" or "blue", or "disease"
and "no disease".
• Regression: A regression problem is one where the output
variable is numeric, such as "age" or "weight".
Supervised Machine Learning Algorithms
k Nearest Neighbors
• K-Nearest Neighbors (kNN) is one of the simplest Machine
Learning algorithms, based on the Supervised Machine
Learning technique.
• A case is classified by a majority vote of its neighbors: the
case is assigned to the class most common amongst its K
nearest neighbors, as measured by a distance function.
• The algorithm assumes similarity between the new case and
the available cases and puts the new case into the most
suitable category.
• The algorithm stores all the available data and classifies a
new data point based on that similarity.
k Nearest Neighbors
• It can be used for regression as well as classification, but it
is mostly used for classification problems.
• kNN is a non-parametric algorithm, which means it makes no
assumptions about the underlying data.
• It is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead, it stores the
dataset and performs the work at classification time.
• At the training phase, the kNN algorithm just stores the
dataset, and when it gets new data, it classifies that data into
the category most similar to it.
Why kNN?
How kNN Works?
The working of K-NN can be explained with the following steps
(a code sketch follows the list):
Step-1: Select the number of neighbors, K.
Step-2: Calculate the Euclidean distance of all the data points
from the point in question.
Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
Step-4: Among these K neighbors, count the number of data
points in each category.
Step-5: Assign the new data point to the category with the
maximum number of neighbors.
Step-6: Our model is ready.
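
As a minimal sketch, the steps above can be implemented directly
in Python; the data points reused here are the ones from the worked
example later in this lecture, and the choice of K = 3 is illustrative:

```python
import math
from collections import Counter

def knn_classify(training_data, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbors.

    training_data: list of (features, label) pairs.
    """
    # Step-2: compute the Euclidean distance from each training point.
    distances = []
    for features, label in training_data:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, new_point)))
        distances.append((d, label))

    # Step-3: take the K nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]

    # Steps 4-5: count labels among the neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Data from the worked example later in the lecture.
data = [((7, 7), "A"), ((7, 4), "A"), ((3, 4), "B"), ((1, 4), "B")]
print(knn_classify(data, (3, 7), k=3))  # -> "B"
```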
How to select the value of K in kNN?
Below are some points to remember while selecting the value
of K in the K-NN algorithm (see the sketch after this list):
• There is no way to determine the best value for "K", so we
need to try several values and pick the best among them. The
most preferred value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy
and expose the model to the effects of outliers.
• Larger values for K are generally more robust, but too large
a value can blur the boundaries between categories.
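
One common way to compare candidate values of K in practice is
cross-validation. A minimal sketch using scikit-learn follows; the
dataset (Iris) and the range of K values are assumptions made here
for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K with 5-fold cross-validation and keep the best.
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```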
kNN Example

X Y Label
7 7 A
7 4 A
3 4 B
1 4 B

New Point = (3, 7)


kNN Example
New Point = (3, 7)

X Y Label  Squared Distance
7 7 A      (3-7)² + (7-7)² = 16
7 4 A      (3-7)² + (4-7)² = 25
3 4 B      (3-3)² + (4-7)² = 9
1 4 B      (3-1)² + (4-7)² = 13

Sorted by distance:

X Y Label  Squared Distance
3 4 B      9
1 4 B      13
7 7 A      16
7 4 A      25

With K = 3, the three nearest neighbors are (3, 4, B), (1, 4, B)
and (7, 7, A), so the new point (3, 7) is classified as B.
Advantages and Disadvantages of kNN
Advantages:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective when the training data is large.

Disadvantages:
• The value of K always needs to be determined, which can be
complex at times.
• The computation cost is high because the distance to every
training sample must be calculated.
kNN Implementation

Implement kNN Algorithm


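The implementation demo itself is not reproduced in this document;
below is a minimal sketch of a kNN classifier using scikit-learn, with
the dataset (Iris), test split, and K = 5 (the commonly preferred value
mentioned above) chosen here as illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit kNN with K = 5 and evaluate on the held-out data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```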
Machine Learning Algorithms
Clustering
• Clustering is one of the most common exploratory data
analysis techniques, used to get an intuition about the
structure of the data.
• It can be defined as the task of identifying subgroups in the
data such that data points in the same subgroup (cluster) are
very similar, while data points in different clusters are very
different.
• We try to find homogeneous subgroups within the data such
that data points in each cluster are as similar as possible
according to a similarity measure such as Euclidean-based
distance or correlation-based distance.
Clustering
• Clustering is considered an unsupervised learning method
since we don’t have the ground truth to compare the output of
the clustering algorithm to the true labels to evaluate its
performance.
K-Means
• K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms.
• Unsupervised algorithms make inferences from datasets
using only input vectors, without referring to known or
labelled outcomes.
• The objective of K-means is to group similar data points
together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters
in a dataset.
• A cluster refers to a collection of data points aggregated
together because of certain similarities.
K-Means
• We define a target number k, which refers to the number of
centroids we need in the dataset. A centroid is the imaginary
or real location representing the center of a cluster.
• Every data point is allocated to one of the clusters by
minimizing the within-cluster sum of squares.
• The K-means algorithm is an iterative algorithm that tries to
partition the dataset into K pre-defined, distinct, non-
overlapping subgroups (clusters), where each data point
belongs to only one group.
K-Means Algorithm
• The first step in k-means is to pick the number of
clusters, K.
• Next, we randomly select the centroid for each cluster. Let’s
say we want to have 2 clusters, so k is equal to 2 here. We
then randomly select the two centroids.
• Once we have initialized the centroids, we assign each point
to the closest cluster centroid.
K-Means Algorithm
• Specify the number of clusters, K.
• Initialize centroids by first shuffling the dataset and then
randomly selecting K data points for the centroids without
replacement.
• Compute the squared distance between each data point and
all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroids for the clusters by taking the average
of all data points that belong to each cluster.
• Keep iterating until there is no change to the centroids, i.e.,
the assignment of data points to clusters isn’t changing.
A code sketch of these steps follows.
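
A minimal from-scratch sketch of these steps in Python with NumPy;
the toy data, iteration cap, and random seed are illustrative
assumptions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups following the steps above."""
    rng = np.random.default_rng(seed)
    # Initialize centroids: pick K data points at random without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: attach each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Toy usage: two well-separated blobs.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```

The assignment and update steps here are exactly the E-step and
M-step discussed next.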
Expectation Maximization
• The approach k-means follows to solve the problem is called
Expectation-Maximization.
• The E-step is assigning the data points to the closest cluster.
The M-step is computing the centroid of each cluster.
K-Means Example

Source: https://www.saedsayad.com/clustering_kmeans.htm
K-Means Implementation

Implement k-Means Algorithm


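As with kNN, the implementation demo itself is not reproduced here;
below is a minimal sketch using scikit-learn, where the synthetic
dataset and parameter choices are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 runs k-means from 10 different random initializations
# and keeps the best result.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(model.cluster_centers_)
print(model.inertia_)  # within-cluster sum of squares (WCSS)
```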
Optimal Value of K in K-means
• Elbow Method
• Silhouette Method
Elbow Method
• The elbow method runs k-means clustering on the dataset for
a range of values of k (say, 1 to 10) and then, for each value
of k, computes an average score across all clusters.
• We can compute the Within-Cluster Sum of Squares (WCSS):
the sum of squared distances from each point to its assigned
center.
• We then plot k vs. WCSS and look for the "elbow", the point
after which increasing k yields only diminishing reductions
in WCSS (see the sketch below).
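
A sketch of the elbow method using scikit-learn, whose inertia_
attribute is the WCSS described above; the synthetic dataset is an
assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run k-means for k = 1..10 and record the WCSS for each.
ks = range(1, 11)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# Plot k vs. WCSS and look for the "elbow" where the curve flattens.
plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()
```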
Silhouette Method
• The silhouette coefficient for a particular data point i is
calculated as:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean distance from i to the other points in
its own cluster, and b(i) is the mean distance from i to the
points in the nearest other cluster.

The value of the silhouette coefficient lies in [-1, 1]. A
score of 1 is the best, meaning that the data point is very
compact within the cluster to which it belongs and far away
from the other clusters. The worst value is -1. Values near 0
denote overlapping clusters.
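
A minimal sketch of choosing k with the mean silhouette score in
scikit-learn; the synthetic data and range of k are assumptions
(note that k must be at least 2 for the score to be defined):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# silhouette_score averages the per-point coefficient
# s = (b - a) / max(a, b) over the whole dataset.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```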
Summary
- K Nearest Neighbors
- K-means
