
K-Nearest Neighbor

Carla P. Gomes
CS4700
1-Nearest Neighbor

One of the simplest of all machine learning classifiers


Simple idea: label a new point the same as the closest known point

Label it red (in the figure, the closest known point to the new query point is red).
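A minimal 1-NN sketch in Python (the function and variable names are illustrative, not from the slide):

import math

def nearest_neighbor_label(query, training_points, training_labels):
    # Index of the training point closest to the query (Euclidean distance)
    best = min(range(len(training_points)),
               key=lambda i: math.dist(query, training_points[i]))
    # 1-NN: predict the label of that single closest point
    return training_labels[best]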

Distance Metrics
Different metrics can change the decision surface

Dist(a,b) = (a1 – b1)^2 + (a2 – b2)^2        Dist(a,b) = (a1 – b1)^2 + (3a2 – 3b2)^2
(the second metric rescales the second dimension by 3, which produces a different decision surface)

Standard Euclidean distance metric:
– Two-dimensional: Dist(a,b) = sqrt((a1 – b1)^2 + (a2 – b2)^2)
– Multivariate: Dist(a,b) = sqrt(∑i (ai – bi)^2)

Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
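The multivariate form translates directly; a quick sketch assuming numeric attribute vectors (the function name is illustrative):

def euclidean_dist(a, b):
    # Square root of the sum of squared per-dimension differences
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5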

1-NN’s Aspects as an Instance-Based Learner:

A distance metric
– Euclidean
– When different units are used for each dimension,
  normalize each dimension by its standard deviation
– For discrete data, can use Hamming distance (see the sketch after this slide):
  D(x1,x2) = number of features on which x1 and x2 differ
– Others (e.g., normal, cosine)

How many nearby neighbors to look at?
– One

How to fit with the local points?
– Just predict the same output as the nearest neighbor.

Adapted from “Instance-Based Learning” lecture slides by Andrew Moore, CMU.
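Illustrating the normalization-by-standard-deviation and Hamming-distance points from the list above (a sketch; the function names and the use of Python's statistics module are my choices, not the slide's):

from statistics import stdev

def normalize_columns(points):
    # Rescale each dimension by its standard deviation so that units do not dominate the distance
    columns = list(zip(*points))
    scales = [stdev(col) for col in columns]
    return [tuple(x / s for x, s in zip(p, scales)) for p in points]

def hamming_dist(x1, x2):
    # D(x1, x2) = number of features on which x1 and x2 differ
    return sum(a != b for a, b in zip(x1, x2))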
k-Nearest Neighbor

Generalizes 1-NN to smooth away noise in the labels


A new point is now assigned the most frequent label of its k nearest
neighbors

Label it red when k = 3 (in the figure, the 3 nearest neighbors are mostly red)

Label it blue when k = 7 (the 7 nearest neighbors are mostly blue)
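A sketch of the majority-vote rule just described (names are illustrative; assumes numeric points):

import math
from collections import Counter

def knn_label(query, training_points, training_labels, k):
    # Indices of the k training points closest to the query
    nearest = sorted(range(len(training_points)),
                     key=lambda i: math.dist(query, training_points[i]))[:k]
    # Assign the most frequent label among those k neighbors
    votes = Counter(training_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

As in the red/blue figure, the same query can get different labels for k = 3 and k = 7, because the vote is taken over a different neighborhood.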


KNN Example
     Food (3)    Chat (2)   Fast (2)   Price (3)   Bar (2)   BigTip
1    great       yes        yes        normal      no        yes
2    great       no         yes        normal      no        yes
3    mediocre    yes        no         high        no        no
4    great       yes        yes        normal      yes       yes

Similarity metric: number of matching attributes (k = 2)

New examples:
– Example 1 (great, no, no, normal, no) → Yes
   Most similar: example 2 (1 mismatch, 4 matches) → yes
   Second most similar: example 1 (2 mismatches, 3 matches) → yes
– Example 2 (mediocre, yes, no, normal, no) → Yes/No (tie)
   Most similar: example 3 (1 mismatch, 4 matches) → no
   Second most similar: example 1 (2 mismatches, 3 matches) → yes

Selecting the Number of Neighbors

Increase k:
– Makes KNN less sensitive to noise

Decrease k:
– Allows capturing finer structure of space

Pick k not too large, but not too small (depends on data)
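One common way to settle on k in practice (not spelled out on the slide) is to compare a few candidate values with leave-one-out validation; a sketch reusing the knn_label function from the earlier snippet:

def choose_k(points, labels, candidates=(1, 3, 5, 7)):
    # Leave-one-out accuracy: classify each point using all the other points, for each candidate k
    def accuracy(k):
        hits = 0
        for i in range(len(points)):
            rest_points = points[:i] + points[i + 1:]
            rest_labels = labels[:i] + labels[i + 1:]
            hits += knn_label(points[i], rest_points, rest_labels, k) == labels[i]
        return hits / len(points)
    return max(candidates, key=accuracy)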

Curse-of-Dimensionality

Prediction accuracy can quickly degrade when the number of attributes grows.
– Irrelevant attributes easily “swamp” information from relevant attributes
– With many irrelevant attributes, the similarity/distance measure becomes less reliable

Remedy
– Try to remove irrelevant attributes in a pre-processing step
– Weight attributes differently
– Increase k (but not too much)
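A sketch of the “weight attributes differently” remedy (the weights are assumed to come from elsewhere, e.g. domain knowledge or validation):

def weighted_dist(a, b, weights):
    # Small weights on irrelevant attributes keep them from swamping the relevant ones
    return sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)) ** 0.5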

Advantages and Disadvantages of KNN

Need a distance/similarity measure and attributes that “match” the target function.

For large training sets, must make a pass through the entire dataset for each classification; this can be prohibitive.

Prediction accuracy can quickly degrade when the number of attributes grows.

Simple to implement;
Requires little tuning;
Often performs quite well!
(Try it first on a new learning problem.)
