Classification and K Nearest Neighbour Algorithm

This document covers the fundamentals of classification in machine learning, focusing on the K-Nearest Neighbors (KNN) algorithm. It discusses the formulation of classification problems, the importance of decision boundaries, and various performance metrics such as accuracy, precision, and recall. Additionally, it highlights the role of data visualization using Matplotlib and provides insights on selecting the appropriate value of K in KNN.

Machine Learning

Classification and KNN


Lecture – 5 to 6

Instructor: Qamar Askari


Topics
• What is classification?
• A few examples
• KNN Algorithm
• Confusion matrix and performance measures
• Visualizing Data using Matplotlib
What is classification?
Classification

Classification Datasets – Examples
Formulating a classification problem
Problem statement:
Can we predict whether a student will pass or fail in a subject based on
his/her session and midterm marks?

Solution:
Yes, we can formulate this as a classification problem. See the steps on the next slides.
Student results prediction problem
Step 1: Training dataset collection

X1 = Session marks | X2 = Midterm marks | Y = Result after final exam
-------------------|--------------------|----------------------------
        20         |         25         | Pass
        15         |         10         | Pass
        12         |         15         | Pass
        18         |         12         | Pass
         4         |          7         | Fail
         6         |          8         | Fail
         3         |          5         | Fail
There are two classes:
Pass = Class 1
Fail = Class 0
Student results prediction problem
Step 2: Finding decision boundary to correctly classify the data

X1 = Session marks | X2 = Midterm marks | Y = Result after final exam
-------------------|--------------------|----------------------------
        20         |         25         | Pass
        15         |         10         | Pass
        12         |         15         | Pass
        18         |         12         | Pass
         4         |          7         | Fail
         6         |          8         | Fail
         3         |          5         | Fail

(Figure: the data plotted in the X1-X2 plane, with a decision boundary separating the two classes.)
Student results prediction problem
Step 2: Finding decision boundary to correctly classify the data
Decision boundary:
A decision boundary is a line or a non-linear curve that separates the data points of the two classes. If the data points of the two classes can be separated by a straight line, the problem is called linearly separable; if a non-linear curve is required to separate them, the problem is called non-linearly separable.

How a decision boundary is determined will be discussed in upcoming slides.

(Figure: the decision boundary between the two classes.)
Student results prediction problem
Step 3: Predicting the class of unseen data
What would be the result of a student having 10 session marks and 4 midterm marks?

• Find on which side of the decision boundary the data point is situated (see the red star in the plot).

• Since the data point lies on the left side of the boundary, and all the data points of failed students also lie on the left side, the algorithm predicts that the student will fail the final exam.

• Similarly, if a data point lies on the right side of the boundary, the algorithm predicts that the student will pass (a small code sketch of this check follows).
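For illustration, a minimal sketch of this side-of-the-boundary check in Python, assuming a hypothetical linear boundary w1*x1 + w2*x2 + b = 0 (the coefficients below are made up for illustration, not derived from the slides):

import numpy as np

# Hypothetical linear decision boundary: w . x + b = 0.
# These coefficients are illustrative only.
w = np.array([1.0, 1.0])   # weights for (session marks, midterm marks)
b = -20.0                  # boundary roughly at x1 + x2 = 20

def predict(point):
    """Return 'Pass' if the point falls on the positive side of the boundary."""
    return "Pass" if np.dot(w, point) + b > 0 else "Fail"

print(predict(np.array([10, 4])))   # negative side of the boundary -> 'Fail'
print(predict(np.array([18, 12])))  # positive side of the boundary -> 'Pass'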
Classification algorithms
There are many algorithms; however, we'll cover the following in this course.
1. Logistic regression
2. KNN
3. Artificial Neural Networks
4. Naïve Bayes
5. Support Vector Machine
And a lot more…
K Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors Algorithm
KNN is a very simple ML algorithm that can be used for both classification and regression problems. In this lecture, however, we will use KNN for classification.
Inspiration
Animated demonstration of KNN algorithm
How are nearest neighbors determined?
Depending on the nature of the data, we can use different similarity and dissimilarity (distance) measures; a code sketch follows the list below.

For example:
1. Euclidean distance formula to find dissimilarity
2. Manhattan distance formula to find dissimilarity
3. Hamming distance formula to find dissimilarity
4. Cosine similarity measure to find similarity
And a lot more…
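A minimal sketch of these measures in Python, using scipy.spatial.distance (note that distance.cosine returns the cosine *distance*, so the similarity is 1 minus it):

import numpy as np
from scipy.spatial import distance

a = np.array([20.0, 25.0])   # e.g. one student's (session, midterm) marks
b = np.array([15.0, 10.0])

print(distance.euclidean(a, b))                 # sqrt(5^2 + 15^2) ~ 15.81
print(distance.cityblock(a, b))                 # Manhattan: |5| + |15| = 20.0
print(distance.hamming([1, 0, 1], [1, 1, 1]))   # fraction of mismatches = 1/3
print(1 - distance.cosine(a, b))                # cosine similarity ~ 0.953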
A few similarity and dissimilarity measures
Example:
Is Smith's record more similar to Kohli's or to Lara's? (The records are shown in a table on the slide.)
An example

Assuming K = 3 and using the Euclidean distance formula, find the class for the sample (Age = 48, Loan = $142,000).

(The training table and worked solution appear on the slides; a sketch follows.)
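A hedged sketch of this computation; the training table below is hypothetical stand-in data (the real table is on the slide). Note that Age and Loan are on very different scales, so the loan amount dominates the Euclidean distance unless the features are normalized first:

import numpy as np
from collections import Counter

# Hypothetical training data: (age, loan) -> default class. Illustrative only.
X = np.array([[25, 40000], [35, 60000], [45, 80000],
              [23, 95000], [60, 100000], [52, 18000],
              [33, 150000], [48, 220000]], dtype=float)
y = np.array(["N", "N", "N", "Y", "Y", "N", "Y", "Y"])

query = np.array([48, 142000], dtype=float)
k = 3

# Euclidean distances from the query to every training sample.
dists = np.linalg.norm(X - query, axis=1)

# Indices of the K nearest neighbors, then a majority vote on their labels.
nearest = dists.argsort()[:k]
prediction = Counter(y[nearest]).most_common(1)[0][0]
print(prediction)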
Does K matter?
The answer is YES. For an illustration, see the following animation:
Does K matter?
• If K is small, then
  • Predictions are influenced by noise/outliers
  • The model tends to overfit
  • Computation is cheaper
• If K is large, then
  • Class boundaries are less precise
  • The model tends to underfit
  • Computation is more expensive

• Rule of thumb
  • K = sqrt(n), where n is the number of training samples
How to choose K?
The short answer is that there is no rule or formula to derive the value of K. One value of K may work wonders on one dataset but fail on another. However, we do have some guidelines and considerations for estimating a good value of K:

• To begin with, you may choose K = the square root of the number of observations in the dataset. It is also advisable to choose an odd value of K to avoid ties between the most frequent neighbor classes.
• Based on this value of K, you can run the KNN algorithm on the test set and evaluate the predictions using one of the many available metrics in ML.
• You may then increase and decrease K until you can't improve the prediction accuracy any further, as in the sketch below.
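A minimal sketch of such a K sweep with scikit-learn, using the iris dataset purely as a stand-in for whatever labeled data is at hand:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Any labeled dataset works; iris is used here only as an example.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Start near sqrt(n) and try odd values of K around it.
start = max(1, int(np.sqrt(len(X_train))))
for k in range(1, 2 * start, 2):  # odd K values: 1, 3, 5, ...
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"K={k:2d}  accuracy={acc:.3f}")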
Pros and cons
Dataset Splitting/Partitioning
Implementation of KNN
1. Import the dataset
2. Find the Euclidean distances between the test sample and all training samples
   • distances = np.linalg.norm(X - new_data_point, axis=1)
3. Sort and take the K nearest neighbors
   • nearest_neighbor_ids = distances.argsort()[:k]
   • nearest_neighbor_rings = y[nearest_neighbor_ids]
4. Make the prediction (majority vote)
   • prediction = scipy.stats.mode(nearest_neighbor_rings).mode

(A complete runnable sketch follows.)
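Putting the steps together, a minimal runnable sketch using the student pass/fail data from the earlier slides (the slide's variable name nearest_neighbor_rings suggests the abalone tutorial dataset; generic labels are used here):

import numpy as np
from scipy import stats

# Training data: the student pass/fail table from earlier slides.
X = np.array([[20, 25], [15, 10], [12, 15], [18, 12],
              [4, 7], [6, 8], [3, 5]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0])  # 1 = Pass, 0 = Fail

new_data_point = np.array([10, 4], dtype=float)
k = 3

# Step 2: Euclidean distance from the new point to every training sample.
distances = np.linalg.norm(X - new_data_point, axis=1)

# Step 3: indices and labels of the K nearest neighbors.
nearest_neighbor_ids = distances.argsort()[:k]
nearest_neighbor_labels = y[nearest_neighbor_ids]

# Step 4: majority vote. scipy.stats.mode returns a ModeResult,
# so the predicted label is its .mode attribute (keepdims needs scipy >= 1.9).
prediction = stats.mode(nearest_neighbor_labels, keepdims=False).mode
print("Pass" if prediction == 1 else "Fail")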

Discuss in Google Colaboratory, if accessible.

Also discuss Assignment – 1.
Performance/Error metrics
• Accuracy
• Precision
• Recall
• F1-score
• Specificity
Confusion matrix

A confusion matrix tabulates predicted vs. actual classes in terms of TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives); the metrics below are defined from these counts.

Examples:
• Spam filtering
• Cancer diagnosis
Accuracy
• How many emails/patients did we correctly label out of all the
emails/patients?

Accuracy = (TP+TN)/(TP+FP+FN+TN)
Accuracy and skewed classes

function y = predictCancer(x)
  y = 0;  % ignore x!
return

Suppose a classifier achieves 1% error on the test set (99% correct diagnoses). But only 0.50% of patients actually have cancer, so the trivial function above, which always predicts "no cancer", achieves just 0.5% error, "beating" the classifier while never detecting a single case. (A quick Python demonstration follows.)
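A quick sketch of this pitfall with synthetic labels (0.5% positives, purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic test labels: roughly 0.5% of patients have cancer (label 1).
y_true = (rng.random(10_000) < 0.005).astype(int)

# The "predictCancer" strategy: always predict 0, ignoring the input.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
print(f"accuracy = {accuracy:.4f}")  # ~0.995, yet recall on cancer cases is 0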
Precision
• Of all the patients we labeled as cancerous, how many are actually cancerous?

Precision = TP/(TP+FP)
Recall/Sensitivity
• Of all the people who are cancerous, how many did we correctly predict?

Recall = TP/(TP+FN)
Trading off precision and recall

• Suppose we want to predict cancer (y = 1) only if we are very confident: precision rises, but recall falls.

• Suppose we want to avoid missing too many cases of cancer (avoid false negatives): recall rises, but precision falls.

(Figure: a precision-recall trade-off curve, with precision on the vertical axis and recall on the horizontal axis.)
Trading off precision and recall

How do we compare precision/recall numbers?

            | Precision (P) | Recall (R) | Average
Algorithm 1 |     0.5       |    0.4     |  0.45
Algorithm 2 |     0.7       |    0.1     |  0.4
Algorithm 3 |     0.02      |    1.0     |  0.51

Average = (P + R) / 2. Note that the plain average ranks Algorithm 3 highest despite its near-useless precision.
F1-score / F-Score / F-Measure
• It is the harmonic mean of precision and recall.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)


Trading off precision and recall

How do we compare precision/recall numbers?

            | Precision (P) | Recall (R) | Average | F1 Score
Algorithm 1 |     0.5       |    0.4     |  0.45   |  0.444
Algorithm 2 |     0.7       |    0.1     |  0.4    |  0.175
Algorithm 3 |     0.02      |    1.0     |  0.51   |  0.0392

F1 Score: unlike the plain average, F1 ranks Algorithm 1 highest, penalizing the extreme precision/recall imbalance of Algorithms 2 and 3. (A quick check of these values follows.)
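The table's F1 values, computed directly from the formula above:

# Verify the F1 scores in the table from the precision/recall pairs.
algorithms = {"Algorithm 1": (0.5, 0.4),
              "Algorithm 2": (0.7, 0.1),
              "Algorithm 3": (0.02, 1.0)}

for name, (p, r) in algorithms.items():
    f1 = 2 * p * r / (p + r)
    print(f"{name}: F1 = {f1:.4f}")
# Algorithm 1: 0.4444, Algorithm 2: 0.1750, Algorithm 3: 0.0392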
Specificity
• Of all the people who are healthy, how many did we correctly predict?
Specificity = TN/(TN+FP)
Important notes
• Accuracy is a great measure, but only when you have symmetric datasets (the false negative and false positive counts are close).

• If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.

• Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
Important notes
• Choose recall if the occurrence of false negatives is intolerable.

• Choose precision if you want to be more confident of your true positives.

• Choose specificity if you want to cover all true negatives, meaning you don't want any false alarms.
Important Python Library to compute metrics
• sklearn.metrics
• Important metrics: Confusion Matrix, Accuracy, Precision, Recall,
Precision_Recall curve, F1 score, and many other classification, regression,
and clustering metrics.
Example of Accuracy (a sketch follows)

Link to the library: https://scikit-learn.org/stable/modules/model_evaluation.html
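A minimal sketch with sklearn.metrics; the label vectors are illustrative:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Illustrative ground-truth and predicted labels (1 = positive class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))        # rows: actual, cols: predicted
print("accuracy :", accuracy_score(y_true, y_pred))   # 0.8
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1 score :", f1_score(y_true, y_pred))         # 0.75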


Visualizing Data
• Matplotlib is an important library in Python for plotting lines, curves, bar graphs, histograms, scatter plots, etc. A small example follows.
(The subsequent slides walk through Matplotlib plotting examples as figures.)
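A minimal Matplotlib sketch plotting the student dataset from the earlier slides (colors and markers are arbitrary choices):

import matplotlib.pyplot as plt
import numpy as np

# Student dataset from the earlier slides: (session marks, midterm marks).
passed = np.array([[20, 25], [15, 10], [12, 15], [18, 12]])
failed = np.array([[4, 7], [6, 8], [3, 5]])

plt.scatter(passed[:, 0], passed[:, 1], c="green", marker="o", label="Pass")
plt.scatter(failed[:, 0], failed[:, 1], c="red", marker="x", label="Fail")
plt.scatter([10], [4], c="blue", marker="*", s=150, label="Unseen student")

plt.xlabel("Session marks (X1)")
plt.ylabel("Midterm marks (X2)")
plt.title("Student results: Pass vs Fail")
plt.legend()
plt.show()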
Important Links to read about MatplotLib
• https://matplotlib.org/
• https://www.simplilearn.com/tutorials/python-tutorial/matplotlib
