Slide 2 ML Basics

24CSA524: Machine Learning

Remya Rajesh
K Nearest Neighbors Classification
SCATTER PLOT
Points from the visualization (scatter plot):
• The two dimensions are the two features of the dataset (number_of_malignant_nodes, age)
• Target: colour-coded – survived (blue), did not survive (red)
• number_of_malignant_nodes – range of values – (0, 25]
• Age – range of values – (0, 60]
• Each point in the plot corresponds to a patient
• Number of points (50) = number of patients (50) in the dataset
• Each patient is identified by the values of the two features (number_of_malignant_nodes, age)
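A minimal sketch for producing such a plot (assuming a pandas DataFrame df with columns number_of_malignant_nodes, age, and a binary survived column; these names are illustrative):

import matplotlib.pyplot as plt

# colour each patient by outcome: blue = survived, red = did not survive
colors = df['survived'].map({1: 'blue', 0: 'red'})
plt.scatter(df['number_of_malignant_nodes'], df['age'], c=colors)
plt.xlabel('number_of_malignant_nodes')
plt.ylabel('age')
plt.show()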
K Nearest Neighbors Classification
[Figure slides illustrating KNN classification]
What is Needed to Select a KNN Model?

• Correct value for 'K'
• How to measure closeness of neighbors?
Decision Boundary
Measurement of Distance
Euclidean Distance (L2 Distance)
Manhattan Distance (L1 or City Block Distance)
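For reference, a minimal NumPy sketch of both metrics (the two points are illustrative):

import numpy as np

x = np.array([3.0, 45.0])   # e.g., (number_of_malignant_nodes, age)
y = np.array([5.0, 52.0])

# Euclidean (L2): square root of the summed squared differences
l2 = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (L1): sum of the absolute differences
l1 = np.sum(np.abs(x - y))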
KNN for Classification
• Load the data
• Preprocess the data
• Choose the value of K and define the distance metric
• Compute distances between the test point and all training points using the
chosen distance metric.
• Sort the distances in ascending order.
• Select the K nearest neighbors (smallest distances).
• Vote for the most frequent class among the K neighbors (majority rule).
• Assign the class label of the majority as the prediction.
• Evaluate the model
• Optimize the model
For regression
• Compute distances between the test point and all training points.
• Sort the distances in ascending order.
• Select the K nearest neighbors.
• Calculate the average (or weighted average) of the target values of
the K neighbors.
• Use the computed value as the prediction.
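A minimal from-scratch sketch of the steps above (assuming X_train is an (n, d) NumPy array and y_train a NumPy array of labels or targets; names are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3, classify=True):
    # distances from the test point to every training point (Euclidean here)
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # sort ascending and keep the K nearest neighbors
    nearest = np.argsort(dists)[:k]
    if classify:
        # majority vote among the K neighbors
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # regression: average of the K neighbors' target values
    return y_train[nearest].mean()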
Feature Scaling is important
Comparison of Feature Scaling Methods

• Standard Scaler: mean-center the data and scale to unit variance
v' = (v − μ_A) / σ_A

• Minimum-Maximum Scaler: scale data to a fixed range (usually 0–1)
v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A
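A quick worked example with illustrative numbers: for v = 70 with mean μ_A = 60 and standard deviation σ_A = 10, the standard scaler gives v' = (70 − 60)/10 = 1.0; with min_A = 20, max_A = 70 and a target range of 0–1, the min-max scaler gives v' = (70 − 20)/(70 − 20) = 1.0.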
Python

• NumPy, SciPy, Pandas: numerical computation
• Matplotlib, Seaborn: data visualization
• Scikit-learn: machine learning
Example:
Import the class containing the scaling method
from sklearn.preprocessing import StandardScaler
Create an instance of the class
StdSc = StandardScaler()
Fit the scaling parameters and then transform the data
StdSc = StdSc.fit(X_data)
X_scaled = StdSc.transform(X_data)

Other scaling method: MinMaxScaler


K Nearest Neighbors: The Syntax

Import the class containing the classification method


from sklearn.neighbors import KNeighborsClassifier
Create an instance of the class
KNN = KNeighborsClassifier(n_neighbors=3)
Fit the instance on the data and then predict the expected value
KNN = KNN.fit(X_data, y_data)
y_predict = KNN.predict(X_data)

Regression can be done with KNeighborsRegressor
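A minimal evaluation sketch (assuming held-out X_test and y_test from the train/test split introduced later):

from sklearn.metrics import accuracy_score

y_test_pred = KNN.predict(X_test)
print(accuracy_score(y_test, y_test_pred))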


Characteristics of KNN model

• KNN is a non-parametric algorithm (it learns no fixed set of model parameters; it stores the training data instead)
• Fast to train, because fitting simply stores the data
• Slow to predict, because prediction requires many distance calculations
• Can require a lot of memory if the dataset is large
Hyperparameters to Tune

• K: Number of neighbors.
• Distance metric (e.g., Euclidean, Manhattan, etc.).
• Weighting scheme (uniform vs. distance-based), e.g., w = 1/distance
• Neighbor search algorithm (brute force, k-d tree, ball tree).

• Extra points: Use of TreeSet in Java
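A minimal tuning sketch using scikit-learn's grid search (the parameter values tried are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9],
    'metric': ['euclidean', 'manhattan'],
    'weights': ['uniform', 'distance'],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid = grid.fit(X_data, y_data)
print(grid.best_params_)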


What We Talk About When We Talk About "Learning"
• Learning general models from data consisting of particular examples
• Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.
• Build a model that is a good and useful approximation/representation of the data!
• Describe/summarize the data in the form of a model

Learning: Knowledge iterates to improve

[Diagram: prior knowledge + data → Learning → knowledge; additional data feeds the next iteration]
Regression vs Classification
• Regression: the label y ∈ ℝ is a continuous variable
• e.g., price prediction
• Classification: the label y is a discrete variable
• e.g., predicting the type of residence:
(living room size, parking area size) → y ∈ {mansion, villa}?
[Diagram: Data + Algorithm → Model]
Neural Network model
Training and Test Splits
Using training and test data
Train and Test Splitting: The Syntax
• Import the train and test split function
from sklearn.model_selection import train_test_split
• Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)
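• For a reproducible split, a fixed seed can be passed:
train, test = train_test_split(data, test_size=0.3, random_state=42)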
Requirements for an ML Model
• Hypothesis Function - represents the mathematical model that maps
input features (X) to output predictions (Y). Different models have
different hypothesis functions.
• Examples:
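• Standard instances: linear regression, hθ(x) = θ0 + θ1·x; logistic regression, hθ(x) = 1 / (1 + e^(−θᵀx))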

• Cost Function - represents how well the hypothesis function fits the
data. It quantifies the error between predicted and actual values.
Different models have different cost functions.
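• Standard instance: for linear regression the cost is the mean squared error, J(θ) = (1/2m) Σ_i (hθ(x_i) − y_i)²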
Supervised Learning
Classification
• Example: loan payment
• Differentiating between low-risk and high-risk customers from their income and savings

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
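A minimal sketch of this discriminant as code (θ1 and θ2 are illustrative thresholds):

def risk(income, savings, theta1=50_000, theta2=20_000):
    # IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
    return 'low-risk' if income > theta1 and savings > theta2 else 'high-risk'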
Class C
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)
is a class rule for the positive examples
Hypothesis class H – the set of all possible rectangles

Choose a hypothesis h that predicts well on unseen examples (the "test set")

h(x) = 1 if h says x is positive
h(x) = 0 if h says x is negative

Generalization – how well the hypothesis classifies unseen data that is not part of the training set
Example:

Price       Engine Power   Y   h(x)
10,000,00   150            1   0
20,000,00   192            0   1
15,000,00   170            1   1
19,000,00   187            0   0

Empirical error of h – the proportion of training instances on which h(x) does not match the true label Y. In the table above, h(x) ≠ Y on 2 of the 4 instances, so the empirical error is 2/4 = 0.5.
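A minimal check of this computation on the table above:

import numpy as np

y = np.array([1, 0, 1, 0])   # true labels Y from the table
h = np.array([0, 1, 1, 0])   # predictions h(x) from the table
err = np.mean(h != y)        # 2 of 4 disagree -> 0.5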
Noise and Outliers
Noise – due to wrong data collection, wrong labelling, or other hidden (latent) attributes not considered here
Outliers – extreme cases

Linear Regression
Triple Trade-Off

• There is a trade-off between three factors (Dietterich, 2003):
1. Complexity, C, of the hypothesis class H
2. Training set size, N
3. Generalization error, Er, on new data
• As N increases, Er decreases.
• As C(H) increases, Er first decreases and then increases.
