
Introduction to Data Science

Dr. Irfan Yousuf


Department of Computer Science (New Campus)
UET, Lahore
(Week # 10; March 18 - 22, 2024)
Outline
• K Nearest Neighbors
• K-means
Machine Learning Algorithms
Supervised Machine Learning Algorithms
• Classification: A classification problem is one where the output
variable is a category, such as "red" or "blue", or "disease"
and "no disease".
• Regression: A regression problem is one where the output
variable is numeric, such as "age" or "weight".
Supervised Machine Learning Algorithms
k Nearest Neighbors
• K-Nearest Neighbors (kNN) is one of the simplest Machine
Learning algorithms, based on the Supervised Machine
Learning technique.
• A case is classified by a majority vote of its neighbors: the
case is assigned to the class most common amongst its K
nearest neighbors, as measured by a distance function.
• The algorithm assumes similarity between the new case and
the available cases and puts the new case into the most
suitable category.
• The algorithm stores all the available data and classifies a
new data point based on that similarity.
k Nearest Neighbors
• It can be used for regression as well as classification, but it
is mostly used for classification problems.
• kNN is a non-parametric algorithm, which means it makes no
assumptions about the underlying data.
• It is also called a lazy learner algorithm because it does not
learn from the training set immediately; instead, it stores the
dataset and performs the work at classification time.
• At the training phase, the kNN algorithm just stores the
dataset, and when it gets new data, it classifies that data into
the category most similar to it.
Why kNN?
How kNN Works?
The working of K-NN can be explained with the following steps
(a code sketch follows the list):
Step-1: Select the number of neighbors, K.
Step-2: Calculate the Euclidean distance of all the data points
from the point in question.
Step-3: Take the K nearest neighbors as per the calculated
Euclidean distance.
Step-4: Among these K neighbors, count the number of data
points in each category.
Step-5: Assign the new data point to the category with the
maximum number of neighbors.
Step-6: Our model is ready.
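
As a minimal sketch, the steps above can be implemented directly
in Python; the data points reused here are the ones from the worked
example later in this lecture, and the choice of K = 3 is illustrative:

```python
import math
from collections import Counter

def knn_classify(training_data, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbors.

    training_data: list of (features, label) pairs.
    """
    # Step-2: compute the Euclidean distance from each training point.
    distances = []
    for features, label in training_data:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, new_point)))
        distances.append((d, label))

    # Step-3: take the K nearest neighbors.
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]

    # Steps 4-5: count labels among the neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Data from the worked example later in the lecture.
data = [((7, 7), "A"), ((7, 4), "A"), ((3, 4), "B"), ((1, 4), "B")]
print(knn_classify(data, (3, 7), k=3))  # -> "B"
```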
How to select the value of K in kNN?
Below are some points to remember while selecting the value
of K in the K-NN algorithm (see the sketch after this list):
• There is no way to determine the best value for "K", so we
need to try several values and pick the best among them. The
most preferred value for K is 5.
• A very low value for K, such as K=1 or K=2, can be noisy
and expose the model to the effects of outliers.
• Larger values for K are generally more robust, but too large
a value can blur the boundaries between categories.
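
One common way to compare candidate values of K in practice is
cross-validation. A minimal sketch using scikit-learn follows; the
dataset (Iris) and the range of K values are assumptions made here
for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K with 5-fold cross-validation and keep the best.
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```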
kNN Example

X Y Label
7 7 A
7 4 A
3 4 B
1 4 B

New Point = (3, 7)


kNN Example
New Point = (3, 7)

X Y Label  Squared Distance
7 7 A      (3-7)² + (7-7)² = 16
7 4 A      (3-7)² + (4-7)² = 25
3 4 B      (3-3)² + (4-7)² = 9
1 4 B      (3-1)² + (4-7)² = 13

Sorted by distance:

X Y Label  Squared Distance
3 4 B      9
1 4 B      13
7 7 A      16
7 4 A      25

With K = 3, the three nearest neighbors are (3, 4, B), (1, 4, B)
and (7, 7, A), so the new point (3, 7) is classified as B.
Advantages and Disadvantages of kNN
Advantages:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective when the training data is large.

Disadvantages:
• The value of K always needs to be determined, which can be
complex at times.
• The computation cost is high because the distance to every
training sample must be calculated.
kNN Implementation

Implement kNN Algorithm


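The implementation demo itself is not reproduced in this document;
below is a minimal sketch of a kNN classifier using scikit-learn, with
the dataset (Iris), test split, and K = 5 (the commonly preferred value
mentioned above) chosen here as illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit kNN with K = 5 and evaluate on the held-out data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```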
Machine Learning Algorithms
Clustering
• Clustering is one of the most common exploratory data
analysis techniques, used to get an intuition about the
structure of the data.
• It can be defined as the task of identifying subgroups in the
data such that data points in the same subgroup (cluster) are
very similar, while data points in different clusters are very
different.
• We try to find homogeneous subgroups within the data such
that data points in each cluster are as similar as possible
according to a similarity measure such as Euclidean-based
distance or correlation-based distance.
Clustering
• Clustering is considered an unsupervised learning method
since we don’t have the ground truth to compare the output of
the clustering algorithm to the true labels to evaluate its
performance.
K-Means
• K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms.
• Unsupervised algorithms make inferences from datasets
using only input vectors, without referring to known or
labelled outcomes.
• The objective of K-means is to group similar data points
together and discover underlying patterns. To achieve this
objective, K-means looks for a fixed number (k) of clusters
in a dataset.
• A cluster refers to a collection of data points aggregated
together because of certain similarities.
K-Means
• We define a target number k, which refers to the number of
centroids we need in the dataset. A centroid is the imaginary
or real location representing the center of a cluster.
• Every data point is allocated to one of the clusters by
minimizing the within-cluster sum of squares.
• The K-means algorithm is an iterative algorithm that tries to
partition the dataset into K pre-defined, distinct, non-
overlapping subgroups (clusters), where each data point
belongs to only one group.
K-Means Algorithm
• The first step in k-means is to pick the number of
clusters, K.
• Next, we randomly select the centroid for each cluster. Let’s
say we want to have 2 clusters, so k is equal to 2 here. We
then randomly select the two centroids.
• Once we have initialized the centroids, we assign each point
to the closest cluster centroid.
K-Means Algorithm
• Specify the number of clusters, K.
• Initialize centroids by first shuffling the dataset and then
randomly selecting K data points for the centroids without
replacement.
• Compute the squared distance between each data point and
all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroids for the clusters by taking the average
of all data points that belong to each cluster.
• Keep iterating until there is no change to the centroids, i.e.,
the assignment of data points to clusters isn’t changing.
A code sketch of these steps follows.
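
A minimal from-scratch sketch of these steps in Python with NumPy;
the toy data, iteration cap, and random seed are illustrative
assumptions:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Cluster the rows of X into k groups following the steps above."""
    rng = np.random.default_rng(seed)
    # Initialize centroids: pick K data points at random without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: attach each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Toy usage: two well-separated blobs.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```

The assignment and update steps here are exactly the E-step and
M-step discussed next.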
Expectation Maximization
• The approach k-means follows to solve the problem is called
Expectation-Maximization.
• The E-step is assigning the data points to the closest cluster.
The M-step is computing the centroid of each cluster.
K-Means Example

Source: https://www.saedsayad.com/clustering_kmeans.htm
K-Means Implementation

Implement k-Means Algorithm


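As with kNN, the implementation demo itself is not reproduced here;
below is a minimal sketch using scikit-learn, where the synthetic
dataset and parameter choices are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 runs k-means from 10 different random initializations
# and keeps the best result.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(model.cluster_centers_)
print(model.inertia_)  # within-cluster sum of squares (WCSS)
```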
Optimal Value of K in K-means
• Elbow Method
• Silhouette Method
Elbow Method
• The elbow method runs k-means clustering on the dataset for
a range of values of k (say, 1 to 10) and then, for each value
of k, computes an average score across all clusters.
• We can compute the Within-Cluster Sum of Squares (WCSS):
the sum of squared distances from each point to its assigned
center.
• We then plot k vs. WCSS and look for the "elbow", the point
after which increasing k yields only diminishing reductions
in WCSS (see the sketch below).
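
A sketch of the elbow method using scikit-learn, whose inertia_
attribute is the WCSS described above; the synthetic dataset is an
assumption:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run k-means for k = 1..10 and record the WCSS for each.
ks = range(1, 11)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)

# Plot k vs. WCSS and look for the "elbow" where the curve flattens.
plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()
```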
Silhouette Method
• The silhouette coefficient for a particular data point i is
calculated as:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean distance from i to the other points in
its own cluster, and b(i) is the mean distance from i to the
points in the nearest other cluster.

The value of the silhouette coefficient lies in [-1, 1]. A
score of 1 is the best, meaning that the data point is very
compact within the cluster to which it belongs and far away
from the other clusters. The worst value is -1. Values near 0
denote overlapping clusters.
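
A minimal sketch of choosing k with the mean silhouette score in
scikit-learn; the synthetic data and range of k are assumptions
(note that k must be at least 2 for the score to be defined):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# silhouette_score averages the per-point coefficient
# s = (b - a) / max(a, b) over the whole dataset.
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```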
Summary
- K Nearest Neighbors
- K-means
