K-Nearest Neighbors Classification Explained


Similar instances have similar classifications.
 Training phase (model construction): a model is constructed from the training instances.
◦ the classification algorithm finds relationships between predictors and targets
◦ these relationships are summarised in a model
 Testing phase:
◦ test the model on a test sample whose class labels are known but not used for training the model
 Usage phase (model usage):
◦ use the model for classification on new data whose class labels are unknown
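As a concrete illustration of these three phases, here is a minimal sketch using scikit-learn's KNeighborsClassifier (an assumption for illustration only; the dataset, split, and query point are not from the original slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a labeled test sample that is not used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Training phase: construct the model from the training instances
# (for kNN this amounts to storing them).
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing phase: evaluate on instances whose labels are known but were not used for training.
print("test accuracy:", model.score(X_test, y_test))

# Usage phase: classify new data whose class labels are unknown.
print("predicted class:", model.predict([[5.0, 3.4, 1.5, 0.2]]))
```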

 No clear separation between these phases of classification
 Also called lazy classification, as opposed to eager classification
 Examples:
◦ Rote-learner
 Memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
◦ Nearest neighbor
 Uses the k “closest” points (nearest neighbors) to perform classification

Eager classification:
 Model is computed before classification
 Model is independent of the test instance
 Test instance is not included in the training data
 Avoids too much work at classification time
 Model is not accurate for each individual instance

Lazy classification:
 Model is computed during classification
 Model is dependent on the test instance
 Test instance is included in the training data
 High accuracy at the level of each individual instance

 Basic idea:
◦ If it walks like a duck and quacks like a duck, then it’s probably a duck

[Figure: given a test record, compute its distance to the training records and choose the k “nearest” records]

Requires three things:
– The set of labeled records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
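These three steps map directly onto a short implementation. A minimal sketch in Python/NumPy (the function and variable names are illustrative, not from the original slides):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # 1. Compute the (Euclidean) distance from the query to every training record.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # 3. Take the majority vote over the neighbors' class labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative example: two classes in 2-D.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"
```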

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Voronoi Diagram

Predict the same value/class as the nearest instance in the training set

 Problem: measure similarity between instances
◦ different types of data: numbers, colours, geolocation, booleans, etc.

 Solution: convert all features of the instances into numerical values
◦ represent instances as vectors of features in an n-dimensional space

 Euclidean distance:

   d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

 Manhattan distance:

   d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

 Minkowski distance:

   d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
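A quick sketch of these three metrics in NumPy (the vectors are made-up examples, not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def euclidean(x, y):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # general form: p = 1 gives Manhattan, p = 2 gives Euclidean
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(euclidean(x, y))        # ~3.606
print(manhattan(x, y))        # 5.0
print(minkowski(x, y, p=2))   # same as Euclidean
```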
 Determine the class from the nearest-neighbor list
◦ Take the majority vote of class labels among the k nearest neighbors
◦ Weigh the vote according to distance
 weight factor: w = 1/d²

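A minimal sketch of the distance-weighted vote with w = 1/d² (names are illustrative; the small eps guards against division by zero when a neighbor coincides with the query point):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_vote(distances, labels, k=3, eps=1e-12):
    """Distance-weighted kNN vote: each of the k nearest neighbors votes with weight 1/d^2."""
    order = np.argsort(distances)[:k]
    scores = defaultdict(float)
    for i in order:
        scores[labels[i]] += 1.0 / (distances[i] ** 2 + eps)
    return max(scores, key=scores.get)

# Two "A" neighbors far away are outvoted by one very close "B" neighbor.
print(weighted_knn_vote(np.array([0.1, 2.0, 2.1]), np.array(["B", "A", "A"]), k=3))  # -> "B"
```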
 Let k be the number of nearest neighbors and D be the set of training examples.
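One compact way to state the resulting decision rule (a standard textbook formulation; the notation D_z and I(·) is introduced here rather than taken from the slide): a test instance x' receives the majority label among its k closest training examples D_z ⊆ D,

   y' = \arg\max_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)

where I(·) is 1 if its argument is true and 0 otherwise.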

 Illustration of kNN for a 3-class problem with k=5

 Choosing the value of k: classification is sensitive to the correct selection of k
◦ if k is too small ⇒ overfitting
 the algorithm performs too well on the training set compared to its true performance on unseen test data
➔ small k → less stable, influenced by noise
➔ large k → less precise, higher bias
◦ rule of thumb: k ≈ √n, where n is the number of training examples
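In practice, k is often chosen by cross-validation rather than by the rule of thumb alone. A minimal sketch with scikit-learn (assumed available; the dataset and candidate list are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate values of k with 5-fold cross-validation
# and keep the one with the best mean accuracy.
candidates = [1, 3, 5, 7, 9, 11]
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```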
 Scaling issues
◦ Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
◦ Example:
 height of a person may vary from 1.5 m to 1.8 m
 weight of a person may vary from 90 lb to 300 lb
 income of a person may vary from $10K to $1M
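A minimal sketch of min-max scaling for attributes on very different ranges (the numbers mirror the example above; scikit-learn's MinMaxScaler or StandardScaler would serve the same purpose):

```python
import numpy as np

# Columns: height (m), weight (lb), income ($)
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0, 1_000_000.0],
              [1.7, 160.0,    55_000.0]])

# Min-max scaling maps every attribute to [0, 1], so no single attribute
# (here: income) dominates the Euclidean distance.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```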

 Selection of the right similarity measure is critical:

pair 1: 111111111110 vs 011111111111
pair 2: 000000000001 vs 100000000000

Euclidean distance = 1.4142 (= √2) for both pairs, even though the vectors in the first pair have many 1s in common and those in the second pair have none

Solution: normalize the vectors to unit length
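A quick check of the effect of unit-length normalization on these two pairs (illustrative sketch):

```python
import numpy as np

def bits(s):
    return np.array([int(c) for c in s], dtype=float)

def unit(v):
    return v / np.linalg.norm(v)

a, b = bits("111111111110"), bits("011111111111")
c, d = bits("000000000001"), bits("100000000000")

# Raw Euclidean distance: both pairs look equally far apart (sqrt(2) ~ 1.4142).
print(np.linalg.norm(a - b), np.linalg.norm(c - d))
# After normalizing to unit length, the first pair is clearly closer (~0.43 vs ~1.41).
print(np.linalg.norm(unit(a) - unit(b)), np.linalg.norm(unit(c) - unit(d)))
```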

 Pros:
◦ Simple to implement and use
◦ Robust to noisy data by averaging over the k nearest neighbors
◦ kNN classification is based solely on local information
◦ The decision boundaries can be of arbitrary shapes

 Cons:
◦ Curse of dimensionality: distance can be dominated by irrelevant attributes
◦ O(n) distance computations for each instance to be classified
◦ More expensive to classify a new instance than with a pre-built (eager) model
