CLASSIFICATION
Arvind Deshpande
K-Nearest Neighbours (KNN)
• Eager learning - the classification algorithm constructs a
classification model before receiving new data to classify.
• Lazy learning (instance-based learning) - the training data is
simply stored (or given only minor processing), and the learner
waits until it is given a test tuple to classify.
• The KNN algorithm is an instance-based, lazy learner.
• Unlike eager learning methods, lazy learners do less work
when a training tuple is presented and more work when
making a classification or numeric prediction.
KNN classifier
• KNN classifiers are based on learning by analogy, that is,
by comparing a given test tuple with training tuples that
are similar to it.
• The training tuples are described by n attributes.
• Each tuple represents a point in an n-dimensional space.
In this way, all the training tuples are stored in an n-
dimensional pattern space.
• When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training
tuples that are closest to the unknown tuple.
• These k training tuples are the k “nearest neighbors” of
the unknown tuple.
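To make this concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier; the Iris dataset, the train/test split, and k = 3 are illustrative choices, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" only stores the tuples; the real work happens at predict time,
# when the classifier searches the pattern space for the k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # fraction of correctly classified test tuples
```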
KNN classifier
• “Closeness” is defined in terms of a distance metric, such
as Euclidean distance. The Euclidean distance between
two points or tuples, say, $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and
$X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$, is

$$\mathit{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
• For categorical attributes, the two tuples are compared value
by value: if the two values of an attribute are identical, the
difference is taken as 0; otherwise the difference is 1, as in the
sketch below.
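A small sketch of such a mixed distance function, combining squared numeric differences with the 0/1 rule for categorical attributes described above; the function name and example values are illustrative.

```python
import math

def mixed_distance(x1, x2, categorical):
    """Euclidean-style distance: squared differences for numeric attributes,
    plus a 0/1 mismatch term for categorical ones. `categorical` is the set
    of attribute indices to treat as categorical."""
    total = 0.0
    for i, (a, b) in enumerate(zip(x1, x2)):
        if i in categorical:
            total += 0.0 if a == b else 1.0  # identical -> 0, different -> 1
        else:
            total += (a - b) ** 2
    return math.sqrt(total)

print(mixed_distance((1.0, 2.0, "red"), (4.0, 6.0, "blue"), categorical={2}))
# sqrt(9 + 16 + 1) = sqrt(26) ≈ 5.10
```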
KNN classifier
• Typically, we normalize the values of each attribute before
using the equation. This helps prevent attributes with
initially large ranges (e.g., income) from outweighing
attributes with initially smaller ranges (e.g., binary
attributes).
• Min-max normalization, for example, can be used to
transform a value v of a numeric attribute A to v’ in the
range [0, 1] by computing
$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}$$

where $\mathit{min}_A$ and $\mathit{max}_A$ are the minimum and maximum
values of attribute A.
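A minimal sketch of min-max normalization as defined above; the income values are made up for illustration, and the handling of a constant attribute is an added assumption.

```python
def min_max_normalize(values):
    """Rescale numeric attribute values to [0, 1] via
    v' = (v - min_A) / (max_A - min_A)."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30000, 48000, 54000, 120000]
print(min_max_normalize(incomes))  # [0.0, 0.2, 0.266..., 1.0]
```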
KNN Algorithm
1. Calculate d(x, xᵢ) for i = 1, 2, …, n, where d denotes the
Euclidean distance between the points.
2. Arrange the calculated Euclidean distances in non-
decreasing order.
3. Let k be a positive integer; take the first k distances from
this sorted list.
4. Find the k points corresponding to these k distances.
5. Let kᵢ denote the number of points belonging to the ith
class among these k points (so kᵢ ≥ 0).
6. If kᵢ > kⱼ for all j ≠ i, put x in class i (see the sketch after
these steps).
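A short Python sketch of steps 1-6, assuming numeric attributes and plain Euclidean distance; the function name and toy data are illustrative.

```python
import math
from collections import Counter

def knn_classify(x, training_points, labels, k=3):
    """Steps 1-6 above: compute Euclidean distances, sort them,
    take the k nearest points, and return the majority class."""
    # Step 1: d(x, xi) for every training point xi
    dists = [math.dist(x, xi) for xi in training_points]
    # Steps 2-4: sort indices by distance and keep the k nearest points
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # Steps 5-6: count class membership among the k neighbours and
    # assign x to the class with the largest count
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train = [(1, 1), (1, 2), (4, 4), (5, 4), (5, 5)]
y = ["A", "A", "B", "B", "B"]
print(knn_classify((2, 2), train, y, k=3))  # "A": 2 of the 3 nearest are class A
```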
Choosing the right value of K
• Run the KNN algorithm several times with different values of K
and choose the K that minimizes the number of errors while
preserving the algorithm's ability to make accurate predictions
on data it has not seen before (see the sketch after this list).
• As we decrease the value of K towards 1, our predictions
become less stable.
• Conversely, as we increase the value of K, our predictions
become more stable due to majority voting / averaging, and are
thus more likely to be accurate (up to a certain point).
• Eventually, we begin to witness an increasing number of errors.
It is at this point that we know we have pushed the value of K too far.
• In cases where we take a majority vote among labels (e.g.,
picking the mode in a classification problem), we usually
make K an odd number to have a tiebreaker.
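A sketch of this K sweep using scikit-learn; the dataset, the hold-out split, and the range of K values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_k, best_score = None, -1.0
for k in range(1, 26, 2):  # odd K values act as a tiebreaker in voting
    score = (KNeighborsClassifier(n_neighbors=k)
             .fit(X_train, y_train)
             .score(X_val, y_val))
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)  # K with the lowest validation error
```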
Advantages
• The algorithm is simple and easy to implement.
• It is a lazy learning algorithm and therefore requires no
training prior to making real-time predictions.
• There's no need to build a model, tune several
parameters, or make additional assumptions. This makes
the KNN algorithm much faster than algorithms that
require training, such as SVM or linear regression.
• Since the algorithm requires no training before making
predictions, new data can be added seamlessly.
• Only two parameters are required to implement
KNN: the value of K and the distance function.
• The algorithm is versatile. It can be used for classification,
regression and search.
Disadvantages
• It doesn't work well with high-dimensional data because,
with a large number of dimensions, it becomes difficult for
the algorithm to calculate meaningful distances in each dimension.
• The KNN algorithm has a high prediction cost for large
datasets, because the cost of calculating the distance
between a new point and every existing point grows with
the dataset size.
• The KNN algorithm doesn't work well with categorical features,
since it is difficult to define a distance between dimensions
with categorical values.
• The algorithm gets significantly slower as the number of
examples and/or predictors (independent variables)
increases.
Disadvantages
• When making a classification or numeric prediction, lazy
learners can be computationally expensive.
• They require efficient storage techniques and are well
suited to implementation on parallel hardware.
• They offer little explanation or insight into the data’s
structure.
• They can suffer from poor accuracy when given noisy or
irrelevant attributes.
• Nearest-neighbor classifiers can be extremely slow when
classifying test tuples.