
CHAPTER # 10

SUPERVISED LEARNING & ITS ALGORITHMS


PART # 01

COURSE INSTRUCTORS:
• ENGR. FARHEEN QAZI
• ENGR. SAJJAD IMAM
• ENGR. SUNDUS ZEHRA

DEPARTMENT OF SOFTWARE ENGINEERING


SIR SYED UNIVERSITY OF ENGINEERING & TECHNOLOGY
AGENDA

• Supervised Learning
• Categories of Supervised Learning
• K-Nearest Neighbors Algorithm
• Working of K-Nearest Neighbors Algorithm
• Graphical Example
• Applications of K-NN
• Advantages
• Disadvantages
• Summary
SUPERVISED LEARNING

• Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.

• Basically, supervised learning is learning in which we teach or train the machine using data that is well labeled, meaning the data is already tagged with the correct answers.
• After that, the machine is provided with a new set of examples (data), so that the supervised learning algorithm, having analyzed the training data (the set of training examples), produces a correct outcome from the labeled data.
CONTD….

• For instance, suppose you are given a basket filled with different kinds of fruit. The first step is to train the machine on all the different fruits, one by one, like this:

• If the shape of the object is rounded with a depression at the top and its color is red, then it is labeled as Apple.
• If the shape of the object is a long curving cylinder and its color is green-yellow, then it is labeled as Banana.
CONTD….

• Now suppose that, after training, the machine is given a new fruit from the basket, say a banana, and is asked to identify it.

• Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It first classifies the fruit by its shape and color, confirms the fruit's name as BANANA, and puts it in the Banana category.
• Thus the machine learns from the training data (the basket of fruit) and then applies that knowledge to the test data (the new fruit).
CATEGORIES OF SUPERVISED LEARNING

Supervised learning is classified into two categories of algorithms:

• Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, or “disease” and “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”.
CATEGORIES OF SUPERVISED LEARNING

Supervised Learning
• Classification: Decision Tree, Naïve Bayes, K-Nearest Neighbor
• Regression: Linear Regression
OVERFITTING AND UNDERFITTING IN MACHINE LEARNING
• Overfitting and underfitting are the two main problems that occur in machine learning and degrade the performance of machine learning models.

OVERFITTING
• Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset.
• Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model.
• An overfitted model has low bias and high variance.
• Overfitting is the main problem that occurs in supervised learning.
OVERFITTING EXAMPLE

• The concept of overfitting can be understood from the graph of a linear regression output:

• As we can see from the graph, the model tries to cover all the data points present in the scatter plot. It may look efficient, but in reality it is not.
• Because the goal of a regression model is to find the best-fit line, and here we have not found the best fit, the model will generate prediction errors.
UNDERFITTING

• Underfitting occurs when our machine learning model is not able to capture the underlying trend of the data.
• To avoid overfitting, the feeding of training data can be stopped at an early stage, but as a result the model may not learn enough from the training data.
• Consequently, it may fail to find the best fit for the dominant trend in the data.
• In the case of underfitting, the model is not able to learn enough from the training data, which reduces its accuracy and produces unreliable predictions.
• An underfitted model has high bias and low variance.
UNDERFITTING EXAMPLE

• We can understand underfitting from the output of a linear regression model:

• As we can see from the diagram, the model is unable to capture the data points present in the plot.
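As a brief illustrative sketch (not from the slides; the synthetic dataset and the polynomial degrees are assumptions), the following Python snippet contrasts underfitting and overfitting by fitting polynomials of increasing degree to the same noisy data:

    # Sketch: underfitting vs. overfitting with polynomial regression.
    # The synthetic dataset and the chosen degrees are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy trend

    for degree in (1, 3, 15):  # degree 1 underfits, 3 fits well, 15 overfits
        coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
        y_hat = np.polyval(coeffs, x)
        mse = np.mean((y - y_hat) ** 2)     # training error keeps shrinking with degree
        print(f"degree={degree:2d}  train MSE={mse:.4f}")

The degree-15 fit drives the training error toward zero while generalizing poorly, which is exactly the low-bias, high-variance behavior described above; the degree-1 fit shows the opposite, high-bias, low-variance case.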
K-NEAREST NEIGHBORS (KNN) ALGORITHM

The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems. However, it is mainly used for classification problems in industry. The following two properties define KNN well:
• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a specialized training phase; it uses all of the training data at classification time.
• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm because it does not assume anything about the underlying data.
WORKING OF K-NEAREST NEIGHBORS ALGORITHM

The K-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points, which means that a new data point is assigned a value based on how closely it matches the points in the training set. We can understand its working with the help of the following steps:
• Step 1 − To implement any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.
• Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.
• Step 3 − For each point in the test data, do the following:
3.1 − Calculate the distance between the test point and each row of the training data using one of the distance metrics, namely Euclidean, Manhattan, or Hamming distance. The most commonly used metric is Euclidean distance.
CONTD….

3.2 − Based on the distance values, sort the rows in ascending order.
3.3 − Choose the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class among these K rows.
Step 4 − End
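A minimal Python sketch of Steps 1−4 (the function name and the data layout are assumptions chosen for illustration, not slide material):

    # Minimal KNN classifier following Steps 1-4 above.
    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        """train: list of (features, label) pairs; query: sequence of feature values."""
        # Step 3.1: Euclidean distance from the query to every training row
        distances = [(math.dist(features, query), label) for features, label in train]
        # Steps 3.2 and 3.3: sort in ascending order and keep the top K rows
        nearest = sorted(distances)[:k]
        # Step 3.4: assign the most frequent class among those K neighbors
        return Counter(label for _, label in nearest).most_common(1)[0][0]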
DISTANCE MEASURE
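The distance-measure slide itself is a figure that is not reproduced in this text; as a reference, here are the standard definitions of the three metrics named in Step 3.1, sketched in Python:

    # Standard distance measures used by KNN.
    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def hamming(p, q):
        # Count of positions where the components differ (for categorical/binary data).
        return sum(a != b for a, b in zip(p, q))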
GRAPHICAL EXAMPLE
The following example illustrates the concept of K and the working of the KNN algorithm.
Suppose we have a dataset that can be plotted as follows:
CONTD….
Now we need to classify a new data point, shown as a black dot (at point (60, 60)), into the blue or red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as shown in the diagram:

We can see in the diagram the three nearest neighbors of the black-dot data point. Among those three, two lie in the red class, hence the black dot is also assigned to the red class.
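The slide's scatter data is only given as a figure, so the sketch below reproduces the idea with synthetic red and blue clusters (the cluster centers are assumptions) using scikit-learn's KNeighborsClassifier:

    # Classify the black dot at (60, 60) with K = 3 on synthetic red/blue clusters.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(1)
    red = rng.normal(loc=(65.0, 65.0), scale=5.0, size=(20, 2))   # cluster near the query
    blue = rng.normal(loc=(30.0, 30.0), scale=5.0, size=(20, 2))  # cluster far away
    X = np.vstack([red, blue])
    y = ["red"] * 20 + ["blue"] * 20

    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    print(clf.predict([[60, 60]])[0])  # expected: red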
EXAMPLE 1
Question: We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

a = Acid Durability   b = Strength          Y = Classification
(seconds)             (kg/square meter)
7                     7                     Bad
7                     4                     Bad
3                     4                     Good
1                     4                     Good

Now the factory produces a new paper tissue that passes the laboratory test with a = 3 and b = 7. Without another expensive survey, can we guess what the classification of this new tissue is?
CONTD….
1. Determine the parameter K = the number of nearest neighbors.
• Suppose we use K = 3.
2. Calculate the distance between the query instance and all the training samples.
• The coordinate of the query instance is (3, 7). (In practice the squared distance is often used instead, since it avoids the square root and is faster to compute without changing the ranking; the full Euclidean distances are shown here.)

a = Acid Durability   b = Strength          Euclidean Distance to Query Instance (3, 7)
(seconds)             (kg/square meter)
7                     7                     d = √((3 − 7)² + (7 − 7)²) = 4
7                     4                     d = √((3 − 7)² + (7 − 4)²) = 5
3                     4                     d = √((3 − 3)² + (7 − 4)²) = 3
1                     4                     d = √((3 − 1)² + (7 − 4)²) ≈ 3.6

CONTD….
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance.

a = Acid Durability   b = Strength          Euclidean Distance to              Rank (Minimum   Included in 3-Nearest
(seconds)             (kg/square meter)     Query Instance (3, 7)              Distance)       Neighbors?
7                     7                     d = √((3 − 7)² + (7 − 7)²) = 4     3               Yes
7                     4                     d = √((3 − 7)² + (7 − 4)²) = 5     4               No
3                     4                     d = √((3 − 3)² + (7 − 4)²) = 3     1               Yes
1                     4                     d = √((3 − 1)² + (7 − 4)²) ≈ 3.6   2               Yes
CONTD….
4. Gather the categories of the nearest neighbors. Notice in the second row that the category of the nearest neighbor (Y) is not included, because the rank of that sample is greater than 3 (= K).

a = Acid Durability   b = Strength          Euclidean Distance to              Rank (Minimum   Included in 3-Nearest   Y = Category of
(seconds)             (kg/square meter)     Query Instance (3, 7)              Distance)       Neighbors?              Nearest Neighbor
7                     7                     d = √((3 − 7)² + (7 − 7)²) = 4     3               Yes                     Bad
7                     4                     d = √((3 − 7)² + (7 − 4)²) = 5     4               No                      -
3                     4                     d = √((3 − 3)² + (7 − 4)²) = 3     1               Yes                     Good
1                     4                     d = √((3 − 1)² + (7 − 4)²) ≈ 3.6   2               Yes                     Good
CONTD….

5. Use a simple majority of the categories of the nearest neighbors as the prediction for the query instance.
• We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passes the laboratory test with a = 3 and b = 7 falls into the Good category.
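The hand computation above can be checked with a short Python sketch that mirrors the worked steps (the code itself is not slide material):

    # Reproduce Example 1: classify the new tissue (a = 3, b = 7) with K = 3.
    import math
    from collections import Counter

    train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
    query = (3, 7)

    # Distances, sorted ascending; keep the K = 3 nearest.
    nearest = sorted((math.dist(p, query), label) for p, label in train)[:3]
    print(nearest)  # [(3.0, 'Good'), (3.6055..., 'Good'), (4.0, 'Bad')]
    print(Counter(label for _, label in nearest).most_common(1)[0][0])  # Good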
EXAMPLE 2
Question: Consider a dataset with two variables, weight (kg) and height (cm), where each point is classified as Normal or Underweight.

Weight (X1)   Height (Y1)   Class
51            167           Underweight
62            182           Normal
69            176           Normal
64            173           Normal
65            172           Normal
56            174           Underweight
58            169           Normal
57            173           Normal
55            170           Normal

On the basis of the given data, we have to classify a new, unlabeled sample as Normal or Underweight using KNN (K = 3).
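The new sample's measurements are not reproduced in this text, so the sketch below uses a hypothetical query of 57 kg and 170 cm; substitute the actual values from the slide:

    # Example 2 sketch: (57, 170) is a HYPOTHETICAL (weight kg, height cm) query,
    # since the actual new sample is not given in the text above.
    import math
    from collections import Counter

    train = [
        ((51, 167), "Underweight"), ((62, 182), "Normal"), ((69, 176), "Normal"),
        ((64, 173), "Normal"), ((65, 172), "Normal"), ((56, 174), "Underweight"),
        ((58, 169), "Normal"), ((57, 173), "Normal"), ((55, 170), "Normal"),
    ]
    query = (57, 170)  # hypothetical

    nearest = sorted((math.dist(p, query), label) for p, label in train)[:3]
    print(Counter(label for _, label in nearest).most_common(1)[0][0])  # Normal for this query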
APPLICATIONS OF K-NN
The following are some of the areas in which KNN can be applied successfully:
• Banking System
KNN can be used in a banking system to predict whether an individual is fit for loan approval, i.e. whether that individual has characteristics similar to those of defaulters.
• Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing it with those of persons having similar traits.
• Politics
With the help of KNN algorithms, we can classify a potential voter into various classes such as “Will Vote”, “Will Not Vote”, “Will Vote for Party ‘PTI’”, or “Will Vote for Party ‘PMLN’”.
• Other areas in which the KNN algorithm can be used include speech recognition, handwriting detection, image recognition, and video recognition.
ADVANTAGES

• It is a very simple algorithm to understand and interpret.

• It is very useful for nonlinear data because the algorithm makes no assumptions about the data.
• It is a versatile algorithm, as we can use it for classification as well as regression.
• It has relatively high accuracy, although there are much better supervised learning models than KNN.
DISADVANTAGES

• It is a computationally somewhat expensive algorithm, because it stores all of the training data.

• It requires high memory storage compared to other supervised learning algorithms.
• It is very sensitive to the scale of the data as well as to irrelevant features.
SUMMARY

• KNN is conceptually simple, yet able to solve complex problems.

• It can work with relatively little information.
• Learning is simple (there is no explicit training phase at all).
• Memory and CPU costs are high.
• Feature selection is a problem.
• It is sensitive to the data representation.
