Similar instances have similar classifications
Training phase (Model construction): a model is constructed from the training instances.
◦ the classification algorithm finds relationships between predictors and targets
◦ these relationships are summarised in a model
Testing phase:
◦ test the model on a test sample whose class labels
are known but not used for training the model
Usage phase (Model usage):
◦ use the model for classification on new data whose
class labels are unknown
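A minimal sketch of the three phases, using scikit-learn's KNeighborsClassifier on the Iris data (the library and dataset are illustrative assumptions, not prescribed by the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Training phase: construct the model from the training instances.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Testing phase: evaluate on held-out instances whose labels are known.
print("test accuracy:", model.score(X_test, y_test))

# Usage phase: classify new data whose labels are unknown.
new_instance = [[5.1, 3.5, 1.4, 0.2]]
print("predicted class:", model.predict(new_instance))
```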
For some classifiers there is no clear separation between these phases; this is called lazy classification, as opposed to eager classification
Examples:
◦ Rote-learner
Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples
◦ Nearest neighbor
Uses the k “closest” points (nearest neighbors) to perform classification
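As a toy illustration of the rote-learner idea (a hypothetical sketch, not code from the course):

```python
# Memorize the training data and classify only on an exact attribute match.
def rote_learner(train_X, train_y):
    memory = {tuple(x): y for x, y in zip(train_X, train_y)}
    def classify(x):
        # Returns None when no training example matches exactly.
        return memory.get(tuple(x))
    return classify

classify = rote_learner([(1, 0), (0, 1)], ["A", "B"])
print(classify((1, 0)))  # "A": exact match found
print(classify((1, 1)))  # None: no exact match, so no classification
```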
Eager Classification:
◦ Model is computed before classification
◦ Model is independent of the test instance
◦ Test instance is not included in the training data
◦ Avoids too much work at classification time
◦ Model is not accurate for each individual instance

Lazy Classification:
◦ Model is computed during classification
◦ Model is dependent on the test instance
◦ Test instance is included in the training data
◦ High accuracy at the level of each individual instance
Basic idea:
◦ If it walks like a duck and quacks like a duck, then it’s probably a duck
[Diagram: compute the distance from the test record to the training records, then choose the k “nearest” records]
Requires three things:
– The set of labeled training records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
– Compute its distance to all training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
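These requirements and steps translate almost line by line into code; a minimal NumPy sketch (Euclidean distance and the function names here are assumptions for illustration):

```python
import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    """Classify one unknown record x by majority vote of its k nearest neighbors.
    Sketch only: assumes numeric features and Euclidean distance."""
    train_X = np.asarray(train_X, dtype=float)
    # 1. Compute the distance from x to every training record.
    dists = np.sqrt(((train_X - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over the neighbors' class labels.
    labels = [train_y[i] for i in nearest]
    return max(set(labels), key=labels.count)

X = [[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]]
y = ["duck", "duck", "goose", "goose"]
print(knn_classify(X, y, [1.1, 0.9], k=3))  # "duck"
```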
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Voronoi Diagram: the 1-nearest-neighbor rule partitions the space into cells, predicting the same value/class as the nearest instance in the training set
Problem: measure similarity between instances
◦ different types of data: numbers, colours, geolocation,
booleans, etc.
Solution: convert all features of the instances into
numerical values
◦ represent instances as vectors of features in an n-
dimensional space
Euclidean distance:
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Manhattan distance:
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Minkowski distance:
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
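The three measures above in plain Python, showing Euclidean and Manhattan as the p = 2 and p = 1 special cases of the Minkowski distance (a sketch; the function names are chosen here for illustration):

```python
def minkowski(x, y, p):
    # General Minkowski distance between two equal-length vectors.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)  # p = 2

def manhattan(x, y):
    return minkowski(x, y, 1)  # p = 1

print(euclidean([0, 0], [3, 4]))     # 5.0
print(manhattan([0, 0], [3, 4]))     # 7.0
print(minkowski([0, 0], [3, 4], 3))  # ~4.498
```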
Determine the class from the nearest-neighbor list
◦ Take the majority vote of class labels among the k nearest neighbors
◦ Or weigh each vote according to distance, e.g. with weight factor $w = 1/d^2$
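A minimal sketch of the distance-weighted vote with $w = 1/d^2$ (the epsilon guard is an added assumption to handle exact matches with d = 0):

```python
from collections import defaultdict

def weighted_vote(neighbor_labels, neighbor_dists):
    """Pick the class whose neighbors carry the largest total weight 1/d^2."""
    scores = defaultdict(float)
    for label, d in zip(neighbor_labels, neighbor_dists):
        scores[label] += 1.0 / (d ** 2 + 1e-12)  # epsilon avoids division by zero
    return max(scores, key=scores.get)

# One very close "A" neighbor outweighs two distant "B" neighbors.
print(weighted_vote(["A", "B", "B"], [0.1, 1.0, 1.2]))  # "A"
```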
Let k be the number of nearest neighbors and D be the set of training examples. For each test example, compute its distance to every example in D, select the k closest, and assign the class receiving the most (possibly distance-weighted) votes.
[Figure: illustration of kNN for a 3-class problem with k = 5]
Choosing the value of k: classification is sensitive to the correct selection of k
◦ if k is too small ⇒ overfitting
the algorithm performs too well on the training set compared to its true performance on unseen test data
➔ small k → less stable, influenced by noise
➔ large k → less precise, higher bias
◦ a common rule of thumb: $k = \sqrt{n}$
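Consistent with this trade-off, k is often chosen empirically; a minimal sketch selecting k by accuracy on held-out validation data (scikit-learn is used here for brevity and is an assumption, not prescribed by the slides):

```python
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    """Return the candidate k with the best validation accuracy."""
    def val_accuracy(k):
        return KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    return max(candidates, key=val_accuracy)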
Scaling issues
◦ Attributes may have to be scaled to prevent
distance measures from being dominated by one of
the attributes
◦ Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
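A sketch of one common remedy, min-max scaling of every attribute to [0, 1] (the helper below is a hypothetical illustration, using the slide's height/weight/income example):

```python
import numpy as np

def min_max_scale(X):
    """Rescale each attribute to [0, 1] so no single attribute
    (e.g., income in dollars) dominates the distance computation."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Constant columns are left at 0 rather than dividing by zero.
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

X = [[1.5, 90, 10_000], [1.8, 300, 1_000_000]]  # height (m), weight (lb), income ($)
print(min_max_scale(X))  # every column now spans [0, 1]
```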
Selection of the right similarity measure is critical:
111111111110 vs 011111111111
000000000001 vs 100000000000
The Euclidean distance is 1.4142 for both pairs
Solution: normalize the vectors to unit length
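A quick sketch confirming the fix: after normalizing to unit length, the two pairs are no longer equidistant (NumPy is used here for illustration):

```python
import numpy as np

def unit_normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = unit_normalize([1] * 11 + [0]); b = unit_normalize([0] + [1] * 11)
c = unit_normalize([0] * 11 + [1]); d = unit_normalize([1] + [0] * 11)

print(np.linalg.norm(a - b))  # ~0.43: the vectors share many 1s
print(np.linalg.norm(c - d))  # ~1.41: the vectors share no 1s
```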
Pros:
◦ Simple to implement and use
◦ Robust to noisy data, since the prediction averages over the k nearest neighbors
◦ kNN classification is based solely on local information
◦ The decision boundaries can have arbitrary shapes
Cons:
◦ Curse of dimensionality: the distance can be dominated by irrelevant attributes
◦ O(n) distance computations for each instance to be classified
◦ Classifying a new instance is more expensive than with an eager, model-based classifier