Nearest-Neighbor Classifier (Instance-Based Learning)

Ref & Acknowledgments
1. Dr B S Panda, IIT Delhi
2. R. Zemel, R. Urtasun, S. Fidler, University of Toronto
3. Dr Sudeshna Sarkar, IIT Kharagpur

Lazy vs. Eager Learning
• Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
• Eager learning: given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify, e.g., Naïve Bayes, decision trees, SVM.
• Time: lazy methods spend less time in training but more time in predicting.
• Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space.

Nearest Neighbors: Decision Boundaries
• The nearest-neighbor algorithm does not explicitly compute decision boundaries, but these can be inferred.
• Decision boundaries: the Voronoi diagram visualization shows how the input space is divided into classes.
• Each line segment of the boundary is equidistant between two points of opposite classes.
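The Voronoi picture described above can be reproduced in a few lines. The sketch below is only an illustration (the synthetic data, grid resolution and plotting choices are assumptions, not part of the slides): it colours each point of a 2-D grid by the label of its nearest training point, which makes the implicit 1-NN decision boundary visible.

# Sketch: visualize the implicit 1-NN decision boundary (Voronoi regions) in 2-D.
# Synthetic data; each grid point is coloured by the class of its nearest training point.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(20, 2))            # 20 training points in 2-D
y = (X[:, 0] + X[:, 1] > 1).astype(int)        # two classes split by a diagonal

xx, yy = np.meshgrid(np.linspace(0, 1, 300), np.linspace(0, 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
dist = np.linalg.norm(grid[:, None, :] - X[None, :, :], axis=2)   # grid-to-training distances
labels = y[dist.argmin(axis=1)].reshape(xx.shape)                 # label of the nearest point

plt.contourf(xx, yy, labels, alpha=0.3)        # Voronoi-style class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.title("1-NN decision regions (boundary segments equidistant between classes)")
plt.show()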
Classification: Parametric vs. Non-parametric
• Linear regression relates two variables with a straight line; nonlinear regression relates the variables using a curve.
• When the characteristics (parameters) of such a line or curve must be estimated, the classification method is parametric.
• The alternative is non-parametric: typically simple methods for approximating discrete-valued or real-valued target functions (they work for classification or regression problems).

Instance-Based Learning
• One way of solving the task of approximating discrete- or real-valued target functions.
• We have training examples (x_n, f(x_n)), n = 1..N.
• Key idea:
  – just store the training examples;
  – when a test example is given, find the closest matches.
Nearest Neighbors: Decision Boundaries
[Figure: example of a 2-D decision boundary]
Instance-Based Classifiers
• Store the training records as a set of stored cases, each with attributes Atr1, …, AtrN and a class label (e.g., A, B, B, C, A, C, B).
• Use the stored training records to predict the class label of unseen cases (records with attributes Atr1, …, AtrN but no label).

Inductive Assumption
• Similar inputs map to similar outputs.
  – If this is not true, learning is impossible.
  – If it is true, learning reduces to defining "similar".
• Not all similarities are created equal:
  – predicting a person's weight may depend on different attributes than predicting their IQ.
Nearest Neighbors: Decision Boundaries
[Figure: example of a 3-D decision boundary]

Nearest Neighbors: Multi-modal Data
• Nearest-neighbor approaches can work with multi-modal data.
• Multi-modal data: data that spans different types and contexts (e.g., imaging, text, or genetics).

Instance-Based Classifiers
• Examples:
  – Rote-learner: memorizes the entire training data and classifies a record only if its attributes match one of the training examples exactly.
  – Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification.

Nearest-Neighbor Classifiers
• Requires three things:
  – the set of stored records;
  – a distance metric to compute the distance between records;
  – the value of k, the number of nearest neighbors to retrieve.
• To classify an unknown record:
  – compute its distance to the training records;
  – identify the k nearest neighbors;
  – use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck.
• [Diagram: training records → compute distance to the test record → choose the k "nearest" records]

Definition of Nearest Neighbor
• [Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
• The k nearest neighbors of a record x are the data points that have the k smallest distances to x.

Nearest Neighbors [Pic by Olga Veksler]
• Nearest neighbors are sensitive to mis-labeled data ("class noise"). Solution?
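One way to see this sensitivity concretely: the small sketch below (synthetic 1-D data; all values and helper names are illustrative assumptions) flips one training label near the query and compares k = 1 with k = 3, where the majority vote absorbs the noise.

# Sketch: sensitivity to class noise (mis-labeled training points).
# With k = 1 a single flipped label changes the prediction; a larger k smooths it out.
import numpy as np

def knn_predict(X, y, x_query, k):
    nn = np.argsort(np.linalg.norm(X - x_query, axis=1))[:k]
    return np.bincount(y[nn]).argmax()                 # majority vote among the k nearest

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
y_noisy = y.copy()
y_noisy[2] = 1                                         # mislabel one point near the query

print(knn_predict(X, y_noisy, np.array([0.25]), k=1))  # 1 -> fooled by the noisy label
print(knn_predict(X, y_noisy, np.array([0.25]), k=3))  # 0 -> majority vote recovers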
k-Nearest Neighbors
[Figure: Pic by Olga Veksler]

Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points.
  – If k is too large, the neighborhood may include points from other classes.
  – We can use cross-validation to find k (see the sketch after this slide).
  – A rule of thumb is k < sqrt(n), where n is the number of training examples.

K-NN: Issues (Complexity) & Remedies
• Expensive at test time: to find one nearest neighbor of a query point x, we must compute the distance to all N training examples; the complexity is O(kdN) for kNN. Remedies:
  – use a subset of the dimensions;
  – pre-sort the training examples into fast data structures (e.g., kd-trees, also used in the sketch below);
  – compute only an approximate distance (e.g., LSH);
  – remove redundant data (e.g., condensing).
• Storage requirements: all training data must be stored.
  – Remove redundant data (e.g., condensing).
• High-dimensional data: the "curse of dimensionality".
  – The required amount of training data increases exponentially with the dimension.
  – The computational cost also increases.
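A sketch of the cross-validation and kd-tree remedies above (scikit-learn is an assumed dependency, and the synthetic dataset, fold count and range of k are illustrative, not part of the slides):

# Sketch: choose k by cross-validation, using a kd-tree to speed up neighbor search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

best_k, best_score = None, -np.inf
for k in range(1, int(np.sqrt(len(X)))):                # rule of thumb: k < sqrt(n)
    clf = KNeighborsClassifier(n_neighbors=k, algorithm="kd_tree")  # kd-tree index
    score = cross_val_score(clf, X, y, cv=5).mean()     # 5-fold cross-validation
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, cross-validated accuracy = {best_score:.3f}")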
Remedies: Remove Redundancy
• If all of a sample's Voronoi neighbors have the same class, the sample is redundant: remove it.

Nearest Neighbor Classification
• Compute the distance between two points (see the sketch below):
  – Euclidean distance: d(p, q) = sqrt( Σ_i (p_i − q_i)² )
  – Manhattan distance: d(p, q) = Σ_i |p_i − q_i|
  – q-norm (Minkowski) distance: d(p, q) = ( Σ_i |p_i − q_i|^q )^(1/q)

Nearest Neighbor Classification: Issues
• Scaling issues:
  – If some attributes (coordinates of x) have larger ranges, they are treated as more important.
  – Example:
    • the height of a person may vary from 1.5 m to 1.8 m;
    • the weight of a person may vary from 60 kg to 100 kg;
    • the income of a person may vary from Rs 10K to Rs 2 lakh.
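The three distance measures can be written directly in NumPy. The sketch below is illustrative (function names and example vectors are assumptions); the exponent is called q_exp to avoid clashing with the vector q:

# Sketch: Euclidean, Manhattan and q-norm (Minkowski) distances from the slide.
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(p - q))

def q_norm(p, q, q_exp):
    # q_exp = 2 gives the Euclidean distance, q_exp = 1 the Manhattan distance.
    return np.sum(np.abs(p - q) ** q_exp) ** (1.0 / q_exp)

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 4.0])
print(euclidean(p, q), manhattan(p, q), q_norm(p, q, q_exp=3))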
Nearest Neighbor Classification
• Determine the class from the nearest-neighbor list:
  – Take the majority vote of the class labels among the k nearest neighbors:
    y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i),
    where D_z is the set of the k closest training examples to z and I(·) is the indicator function.
  – Or weigh each vote according to distance (see the sketch after this group of slides):
    y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} w_i · I(v = y_i),
    with weight factor w_i = 1/d(x′, x_i)².

Nearest Neighbor Classification: Scaling Issue
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
• Normalize the scale:
  – Simple option: linearly scale the range of each feature to lie, e.g., in [0, 1].
  – Alternatively, linearly scale each dimension to have zero mean and unit variance: compute the mean µ and variance σ² of attribute x_j and scale it as (x_j − µ)/σ.

Nearest Neighbor Classification…
• k-NN classifiers are lazy learners:
  – they do not build a model explicitly, unlike eager learners such as decision tree induction and rule-based systems;
  – they naturally form complex decision boundaries and adapt to the data density;
  – if we have lots of samples, kNN typically works well.
• Problems:
  – sensitive to class noise;
  – sensitive to the scales of the attributes;
  – distances are less meaningful in high dimensions;
  – classifying unknown records is relatively expensive.
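A sketch of the distance-weighted vote combined with zero-mean, unit-variance scaling (the synthetic data, the small constant added to 1/d², and all names below are illustrative assumptions, not code from the slides):

# Sketch: distance-weighted kNN vote (w = 1/d^2) on z-score-scaled features.
import numpy as np
from collections import defaultdict

def zscore(x, mean, std):
    return (x - mean) / std

def weighted_knn_predict(X_train, y_train, x_query, k=3):
    d = np.linalg.norm(X_train - x_query, axis=1)        # Euclidean distances
    nearest = np.argsort(d)[:k]                          # indices of the k closest records
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12)   # weight factor w = 1/d^2
    return max(votes, key=votes.get)                     # class with the largest weighted vote

rng = np.random.default_rng(1)
X = np.c_[rng.normal(170, 10, 100), rng.normal(70, 15, 100)]   # height (cm), weight (kg)
y = (X[:, 0] > 170).astype(int)
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = zscore(X, mean, std)                          # so neither attribute dominates
query = zscore(np.array([175.0, 80.0]), mean, std)
print(weighted_knn_predict(X_scaled, y, query, k=5))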
Nearest Neighbor Classification: Issues
• Irrelevant and correlated attributes add noise to the distance measure:
  – eliminate some attributes, or
  – vary and possibly adapt the weights of the attributes.
• Non-metric attributes (symbols):
  – use the Hamming distance.

The kNN Classification Algorithm
Let k be the number of nearest neighbors and D be the set of training examples.
1. for each test example z = (x′, y′) do
2.   Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
3.   Select D_z ⊆ D, the set of the k closest training examples to z.
4.   y′ = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
5. end for
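A direct, runnable translation of the five steps above (a sketch only: the Euclidean metric, the array-based representation of D, and the toy data are assumptions):

# Sketch: direct translation of the kNN classification algorithm above.
# D is represented by the arrays X (examples) and y (labels).
import numpy as np

def knn_classify(X, y, X_test, k):
    preds = []
    for z in X_test:                                   # 1. for each test example z
        d = np.linalg.norm(X - z, axis=1)              # 2. distance to every example in D
        Dz = np.argsort(d)[:k]                         # 3. the k closest training examples
        values, counts = np.unique(y[Dz], return_counts=True)
        preds.append(values[np.argmax(counts)])        # 4. y' = argmax_v sum I(v = y_i)
    return np.array(preds)                             # 5. end for

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_classify(X, y, np.array([[1.1, 0.9], [4.8, 5.2]]), k=3))   # -> [0 1]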
KNN Classification
[Figure: scatter plot of loan amount ($0–$2,50,000) vs. age (0–70), with Default and Non-Default classes]

NN Classification: Issue with Distance Measure
• Problems with the Euclidean measure:
  – High-dimensional data:
    • curse of dimensionality: all vectors are almost equidistant to the query vector.
  – It can produce undesirable results, e.g., for binary vectors:
    • 111111111110 vs 011111111111: d = 1.4142
    • 100000000000 vs 000000000001: d = 1.4142
    The first pair has ten 1-bits in common and the second pair has none, yet both Euclidean distances are identical.
• Solution: normalize the vectors to unit length (see the sketch below).
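A quick numerical check of the example above (a sketch; NumPy only, helper names are illustrative) shows that after unit-length normalization the two pairs are no longer equidistant:

# Sketch: the binary-vector example, before and after unit-length normalization.
import numpy as np

def bits(s):
    return np.array([int(c) for c in s], dtype=float)

pairs = [("111111111110", "011111111111"),   # differ in 2 bits but share ten 1-bits
         ("100000000000", "000000000001")]   # differ in 2 bits and share nothing

for a_str, b_str in pairs:
    a, b = bits(a_str), bits(b_str)
    raw = np.linalg.norm(a - b)                               # plain Euclidean distance
    an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)     # normalize to unit length
    print(f"{a_str} vs {b_str}: d = {raw:.4f}, normalized d = {np.linalg.norm(an - bn):.4f}")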