Classification and K Nearest Neighbour Algorithm
Solution:
Yes, we can formulate this problem as a classification problem. See the steps on the next slides.
Student results prediction problem
Step 1: Training dataset collection
X1 = Session marks    X2 = Midterm marks    Y = Result after final exam
20                    25                    Pass
15                    10                    Pass
12                    15                    Pass
18                    12                    Pass
4                     7                     Fail
6                     8                     Fail
3                     5                     Fail
[Figure: the training data plotted, with the decision boundary labeled]
Student results prediction problem
Step 2: Finding a decision boundary to correctly classify the data
Decision boundary:
A decision boundary is a line or a non-linear curve that separates the data points of the two classes. If the data points of both classes can be separated by a line, the problem is called linearly separable; if a non-linear curve is required to separate them, the problem is called non-linearly separable.
[Figure: linear decision boundary separating the Pass and Fail points]
Student results prediction problem
Step 3: Predicting the class of unseen data
What would be the result of a student having 10 session marks and 4 midterm marks?
• Since this data point lies on the left side of the boundary, where all the data points of failed students also lie, the algorithm will predict that the student will fail the final exam.
• Similarly, if a data point lies on the right side of the boundary, the algorithm will predict that the student will pass (see the sketch below).
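A minimal sketch of this side-of-the-boundary rule, assuming a hypothetical linear boundary (the weights below were eyeballed to separate the seven training points above; a real classifier would learn them from data):

import numpy as np

# Hypothetical linear boundary: session + midterm = 20.
# These weights are illustrative, not learned from data.
w = np.array([1.0, 1.0])   # weights for (session marks, midterm marks)
b = -20.0                  # intercept

def predict(x):
    """Predict Pass/Fail from the side of the boundary x falls on."""
    return "Pass" if np.dot(w, x) + b > 0 else "Fail"

print(predict(np.array([10, 4])))    # 10 + 4 = 14 < 20  -> Fail
print(predict(np.array([20, 25])))   # 20 + 25 = 45 > 20 -> Pass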
Classification algorithms
There are many classification algorithms; however, we’ll cover the following in this course.
1. Logistic regression
2. KNN
3. Artificial Neural Networks
4. Naïve Bayes
5. Support Vector Machine
And a lot more…
K-Nearest Neighbors (KNN) Algorithm
KNN is a very simple ML algorithm that can be used for classification as well as regression problems. However, in this lecture we’re going to use KNN for classification.
Inspiration
[Animation: demonstration of the KNN algorithm]
How are nearest neighbors determined?
Depending on the nature of the data, we can use different similarity and dissimilarity (distance) measures (a short sketch of the first four follows the list below).
For example:
1. Euclidean distance formula to find dissimilarity
2. Manhattan distance formula to find dissimilarity
3. Hamming distance formula to find dissimilarity
4. Cosine similarity measure to find similarity
And a lot more…
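As a quick sketch, the first four measures can be computed with NumPy in a few lines (the vectors a, b, u, v below are made-up examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of the absolute differences
manhattan = np.sum(np.abs(a - b))

# Hamming distance: number of positions where two vectors disagree
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
hamming = np.sum(u != v)

# Cosine similarity: cosine of the angle between the two vectors
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, hamming, cosine_sim)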
A few similarity and dissimilarity measures
Example:
Is the record of Smith more similar to that of Kohli or that of Lara?
An example
Assuming K = 3 and using the Euclidean distance formula, find the class for the sample (Age = 48, Loan = $142,000).
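The slide’s training table is not reproduced in this text, so the sketch below uses a small hypothetical (Age, Loan) dataset purely to show the mechanics: compute the Euclidean distance to every training row, take the 3 nearest, and predict by majority vote.

import numpy as np
from collections import Counter

# Hypothetical training data; the slide's actual table is not
# reproduced here, so these rows are for illustration only.
X = np.array([[25,  40_000],
              [35,  60_000],
              [45,  80_000],
              [20,  20_000],
              [55, 150_000],
              [60, 100_000]], dtype=float)
y = np.array(["N", "N", "N", "N", "Y", "Y"])   # made-up class labels

query = np.array([48, 142_000])                # Age = 48, Loan = $142,000
k = 3

distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to each row
nearest = distances.argsort()[:k]              # indices of the 3 nearest rows
prediction = Counter(y[nearest]).most_common(1)[0][0]
print(prediction)

Note that the Loan feature dominates the distance here because of its much larger scale; in practice, features are usually scaled before applying KNN.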
Does K matter?
The answer is YES.
[Animation: how the choice of K changes the classification]
Does K matter?
• If K is small, then
  • predictions are easily influenced by noise/outliers
  • the model tends to be overfitted
  • it is computationally cheaper
• If K is large, then
  • predictions are less precise
  • the model tends to be underfitted
  • it is computationally expensive
• Rule of thumb
  • K = sqrt(n), where n is the number of training samples
How to choose K?
Well, the short answer is that there is no rule or formula to derive the value of K. One value of K may work wonders on one type of dataset but fail on another. But we do have some guidelines and considerations for estimating a good value of K:
• To begin with, you may choose K = the square root of the number of observations in the dataset. It is also advisable to choose an odd value of K to avoid ties between the most frequent neighbor classes.
• Based on this value of K, you can run the KNN algorithm on the test set and evaluate the predictions using one of the many available metrics in ML.
• You may then try increasing and decreasing the value of K until you can’t improve the prediction accuracy any further (a sketch of this sweep follows the list).
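A sketch of this procedure with scikit-learn, assuming a generic dataset (the Iris data below is just a stand-in) and accuracy as the evaluation metric:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Start near sqrt(n) and sweep odd values of K around it.
start = int(np.sqrt(len(X_train)))
for k in range(1, 2 * start, 2):               # odd values only, to avoid ties
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"K={k:2d}  accuracy={acc:.3f}")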
Pros and cons
Dataset Splitting/Partitioning
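A minimal sketch of one common partitioning scheme, assuming a feature matrix X and labels y (the 60/20/20 proportions are a common convention, not taken from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

# First carve out 40% of the data, then split that half-and-half,
# giving a 60/20/20 train/validation/test partition.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 for Iris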
Implementation of KNN
1. Import the dataset
2. Find the Euclidean distances between the test sample and all training samples
• distances = np.linalg.norm(X - new_data_point, axis=1)
3. Sort and get the K nearest neighbors
• nearest_neighbor_ids = distances.argsort()[:k]
• nearest_neighbor_rings = y[nearest_neighbor_ids]
4. Make the prediction (the mode, i.e. the most common label, among the neighbors)
• prediction = scipy.stats.mode(nearest_neighbor_rings).mode
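Putting the four steps together, a minimal end-to-end sketch (the tiny dataset is the student-results table from earlier, used only as a placeholder; the name nearest_neighbor_rings follows the snippet above):

import numpy as np
from scipy import stats

# 1. Import the dataset (placeholder: the student-results table)
X = np.array([[20, 25], [15, 10], [12, 15], [18, 12],
              [4, 7], [6, 8], [3, 5]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0])        # 1 = Pass, 0 = Fail
new_data_point = np.array([10.0, 4.0])
k = 3

# 2. Euclidean distances between the test sample and all training samples
distances = np.linalg.norm(X - new_data_point, axis=1)

# 3. Sort and take the K nearest neighbors
nearest_neighbor_ids = distances.argsort()[:k]
nearest_neighbor_rings = y[nearest_neighbor_ids]

# 4. Predict by majority vote (mode of the neighbors' labels)
prediction = stats.mode(nearest_neighbor_rings).mode
print(prediction)                          # -> 0 (Fail)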
Examples:
• Spam filtering
• Cancer diagnosis
Accuracy
• How many emails/patients did we correctly label out of all the
emails/patients?
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Accuracy and skewed classes

function y = predictCancer(x)
  y = 0; % ignore x! Always predict "no cancer"
return

Only 0.50% of patients have cancer. Suppose your learned classifier gets 1% error on the test set (99% correct diagnoses). The trivial function above, which always predicts "no cancer", gets only 0.5% error, so on skewed classes raw accuracy can make a useless classifier look better than a learned one.
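A tiny simulation of this situation (the 0.5% prevalence matches the slide; everything else is made up):

import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.005).astype(int)   # ~0.5% actually have cancer
y_pred = np.zeros_like(y_true)                       # always predict "no cancer"

accuracy = np.mean(y_pred == y_true)
print(f"accuracy = {accuracy:.4f}")   # ~0.995, yet no cancer case is ever caught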
Precision
• How many of those who we labeled as cancerous are actually
cancerous?
Precision = TP/(TP+FP)
Recall/Sensitivity
• Of all the people who are cancerous, how many did we correctly predict?
Recall = TP/(TP+FN)
Trading off precision and recall
• Suppose we want to predict cancer only if we are very confident: this raises precision but lowers recall.
• Suppose we want to avoid missing too many cases of cancer (avoid false negatives): this raises recall but lowers precision.
[Figure: precision (y-axis) plotted against recall (x-axis), showing the trade-off]
Trading off precision and recall
Average: (Precision + Recall)/2
F1-score/ F-Score/ F-Measure
• It is the harmonic mean (average) of the precision and recall.
F1 Score = (2 × Precision × Recall)/(Precision + Recall)
Specificity
• Of all the people who are healthy, how many of those did we
correctly predict?
Specificity = TN/(TN+FP)
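All of the measures above can be computed straight from the four confusion-matrix counts; a short sketch with made-up counts:

# Made-up confusion-matrix counts for illustration
TP, FP, FN, TN = 80, 20, 10, 890

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                 # also called sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)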
Important notes
• Accuracy is a great measure, but only when you have symmetric datasets (the false negative and false positive counts are close).
• If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.
• Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
Important notes
• Choose Recall if the occurrence of false negatives is intolerable.