Classification and K Nearest Neighbour Algorithm
Solution:
Yes, we can formulate this problem as a classification problem. See the steps on the next slides.
Student results prediction problem
Step 1: Training dataset collection
X1 = Session marks    X2 = Midterm marks    Y = Result after final exam
20                    25                    Pass
15                    10                    Pass
12                    15                    Pass
18                    12                    Pass
4                     7                     Fail
6                     8                     Fail
3                     5                     Fail
[Figure: the training data plotted, with the decision boundary labeled]
Student results prediction problem
Step 2: Finding a decision boundary to correctly classify the data
Decision boundary:
A decision boundary is a line or a non-linear curve that separates the data points of the two classes. If the data points of both classes can be separated by a line, the problem is called linearly separable; if a non-linear curve is required to separate them, the problem is called non-linearly separable.
[Figure: linear decision boundary separating the Pass and Fail points]
Student results prediction problem
Step 3: Predicting the class of unseen data
What would be the result of a student having 10 session marks and 4 midterm marks?
• Since this data point lies on the left side of the boundary, where all the data points of failed students also lie, the algorithm will predict that the student will fail the final exam.
• Similarly, if a data point lies on the right side of the boundary, the algorithm will predict that the student will pass (see the sketch below).
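A minimal sketch of this side-of-the-boundary rule, assuming a hypothetical linear boundary (the weights below were eyeballed to separate the seven training points above; a real classifier would learn them from data):

import numpy as np

# Hypothetical linear boundary: session + midterm = 20.
# These weights are illustrative, not learned from data.
w = np.array([1.0, 1.0])   # weights for (session marks, midterm marks)
b = -20.0                  # intercept

def predict(x):
    """Predict Pass/Fail from the side of the boundary x falls on."""
    return "Pass" if np.dot(w, x) + b > 0 else "Fail"

print(predict(np.array([10, 4])))    # 10 + 4 = 14 < 20  -> Fail
print(predict(np.array([20, 25])))   # 20 + 25 = 45 > 20 -> Pass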
Classification algorithms
There are many classification algorithms; however, we’ll cover the following in this course.
1. Logistic regression
2. KNN
3. Artificial Neural Networks
4. Naïve Bayes
5. Support Vector Machine
And a lot more…
K-Nearest Neighbors (KNN) Algorithm
KNN is a very simple ML algorithm that can be used for classification as well as regression problems. However, in this lecture we’re going to use KNN for classification.
Inspiration
[Animation: demonstration of the KNN algorithm]
How are nearest neighbors determined?
Depending on the nature of the data, we can use different similarity and dissimilarity (distance) measures (a short sketch of the first four follows the list below).
For example:
1. Euclidean distance formula to find dissimilarity
2. Manhattan distance formula to find dissimilarity
3. Hamming distance formula to find dissimilarity
4. Cosine similarity measure to find similarity
And a lot more…
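As a quick sketch, the first four measures can be computed with NumPy in a few lines (the vectors a, b, u, v below are made-up examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of the absolute differences
manhattan = np.sum(np.abs(a - b))

# Hamming distance: number of positions where two vectors disagree
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])
hamming = np.sum(u != v)

# Cosine similarity: cosine of the angle between the two vectors
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, hamming, cosine_sim)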
A few similarity and dissimilarity measures
Example:
Is the record of Smith more similar to that of Kohli or that of Lara?
An example
Assuming K = 3 and using the Euclidean distance formula, find the class for the sample (Age = 48, Loan = $142,000).
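The slide’s training table is not reproduced in this text, so the sketch below uses a small hypothetical (Age, Loan) dataset purely to show the mechanics: compute the Euclidean distance to every training row, take the 3 nearest, and predict by majority vote.

import numpy as np
from collections import Counter

# Hypothetical training data; the slide's actual table is not
# reproduced here, so these rows are for illustration only.
X = np.array([[25,  40_000],
              [35,  60_000],
              [45,  80_000],
              [20,  20_000],
              [55, 150_000],
              [60, 100_000]], dtype=float)
y = np.array(["N", "N", "N", "N", "Y", "Y"])   # made-up class labels

query = np.array([48, 142_000])                # Age = 48, Loan = $142,000
k = 3

distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to each row
nearest = distances.argsort()[:k]              # indices of the 3 nearest rows
prediction = Counter(y[nearest]).most_common(1)[0][0]
print(prediction)

Note that the Loan feature dominates the distance here because of its much larger scale; in practice, features are usually scaled before applying KNN.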
Does K matter?
The answer is YES.
[Animation: how the choice of K changes the classification]
Does K matter?
• If K is small, then
  • predictions are easily influenced by noise/outliers
  • the model tends to be overfitted
  • it is computationally cheaper
• If K is large, then
  • predictions are less precise
  • the model tends to be underfitted
  • it is computationally expensive
• Rule of thumb
  • K = sqrt(n), where n is the number of training samples
How to choose K?
Well, the short answer is that there is no rule or formula to derive the value of K. One value of K may work wonders on one type of dataset but fail on another. But we do have some guidelines and considerations for estimating a good value of K:
• To begin with, you may choose K = the square root of the number of observations in the dataset. It is also advisable to choose an odd value of K to avoid ties between the most frequent neighbor classes.
• Based on this value of K, you can run the KNN algorithm on the test set and evaluate the predictions using one of the many available metrics in ML.
• You may then try increasing and decreasing the value of K until you can’t improve the prediction accuracy any further (a sketch of this sweep follows the list).
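A sketch of this procedure with scikit-learn, assuming a generic dataset (the Iris data below is just a stand-in) and accuracy as the evaluation metric:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Start near sqrt(n) and sweep odd values of K around it.
start = int(np.sqrt(len(X_train)))
for k in range(1, 2 * start, 2):               # odd values only, to avoid ties
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"K={k:2d}  accuracy={acc:.3f}")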
Pros and cons
Dataset Splitting/Partitioning
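A minimal sketch of one common partitioning scheme, assuming a feature matrix X and labels y (the 60/20/20 proportions are a common convention, not taken from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

# First carve out 40% of the data, then split that half-and-half,
# giving a 60/20/20 train/validation/test partition.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 for Iris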
Implementation of KNN
1. Import the dataset
2. Find the Euclidean distances between the test sample and all training samples
• distances = np.linalg.norm(X - new_data_point, axis=1)
3. Sort and get the K nearest neighbors
• nearest_neighbor_ids = distances.argsort()[:k]
• nearest_neighbor_rings = y[nearest_neighbor_ids]
4. Make the prediction (the mode, i.e. the most common label, among the neighbors)
• prediction = scipy.stats.mode(nearest_neighbor_rings).mode
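Putting the four steps together, a minimal end-to-end sketch (the tiny dataset is the student-results table from earlier, used only as a placeholder; the name nearest_neighbor_rings follows the snippet above):

import numpy as np
from scipy import stats

# 1. Import the dataset (placeholder: the student-results table)
X = np.array([[20, 25], [15, 10], [12, 15], [18, 12],
              [4, 7], [6, 8], [3, 5]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0])        # 1 = Pass, 0 = Fail
new_data_point = np.array([10.0, 4.0])
k = 3

# 2. Euclidean distances between the test sample and all training samples
distances = np.linalg.norm(X - new_data_point, axis=1)

# 3. Sort and take the K nearest neighbors
nearest_neighbor_ids = distances.argsort()[:k]
nearest_neighbor_rings = y[nearest_neighbor_ids]

# 4. Predict by majority vote (mode of the neighbors' labels)
prediction = stats.mode(nearest_neighbor_rings).mode
print(prediction)                          # -> 0 (Fail)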
Examples:
• Spam filtering
• Cancer diagnosis
Accuracy
• How many emails/patients did we correctly label out of all the
emails/patients?
Accuracy = (TP+TN)/(TP+FP+FN+TN)
Accuracy and skewed classes

function y = predictCancer(x)
  y = 0; % ignore x! Always predict "no cancer"
return

Only 0.50% of patients have cancer. Suppose your learned classifier gets 1% error on the test set (99% correct diagnoses). The trivial function above, which always predicts "no cancer", gets only 0.5% error, so on skewed classes raw accuracy can make a useless classifier look better than a learned one.
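A tiny simulation of this situation (the 0.5% prevalence matches the slide; everything else is made up):

import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.005).astype(int)   # ~0.5% actually have cancer
y_pred = np.zeros_like(y_true)                       # always predict "no cancer"

accuracy = np.mean(y_pred == y_true)
print(f"accuracy = {accuracy:.4f}")   # ~0.995, yet no cancer case is ever caught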
Precision
• How many of those who we labeled as cancerous are actually
cancerous?
Precision = TP/(TP+FP)
Recall/Sensitivity
• Of all the people who are cancerous, how many did we correctly predict?
Recall = TP/(TP+FN)
Trading off precision and recall
• Suppose we want to predict cancer only if we are very confident: this raises precision but lowers recall.
• Suppose we want to avoid missing too many cases of cancer (avoid false negatives): this raises recall but lowers precision.
[Figure: precision (y-axis) plotted against recall (x-axis), showing the trade-off]
Trading off precision and recall
Average: (Precision + Recall)/2
F1-score/ F-Score/ F-Measure
• It is the harmonic mean (average) of the precision and recall.
F1 Score = (2 × Precision × Recall)/(Precision + Recall)
Specificity
• Of all the people who are healthy, how many of those did we
correctly predict?
Specificity = TN/(TN+FP)
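All of the measures above can be computed straight from the four confusion-matrix counts; a short sketch with made-up counts:

# Made-up confusion-matrix counts for illustration
TP, FP, FN, TN = 80, 20, 10, 890

accuracy    = (TP + TN) / (TP + FP + FN + TN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                 # also called sensitivity
specificity = TN / (TN + FP)
f1          = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)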
Important notes
• Accuracy is a great measure, but only when you have symmetric datasets (the false negative and false positive counts are close).
• If the costs of false positives and false negatives are different, then F1 is your savior. F1 is best if you have an uneven class distribution.
• Precision is how sure you are of your true positives, whilst recall is how sure you are that you are not missing any positives.
Important notes
• Choose Recall if the occurrence of false negatives is intolerable.