Slide 2 ML Basics
Remya Rajesh
K Nearest Neighbors Classification
Scatter Plot
Points from the visualization (scatter plot):
• The two dimensions are the two features of the dataset (number_of_malignant_nodes, age).
• Target: coloured – Survived (blue), Did Not Survive (red).
• number_of_malignant_nodes – range of values: (0, 25]
• age – range of values: (0, 60]
• Each point in the plot corresponds to a patient.
• Number of points (50) = number of patients (50) in the dataset.
• Each patient is identified by the values of the two features (number_of_malignant_nodes, age).
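The dataset behind the scatter plot can be mocked up as follows. This is a minimal sketch with synthetic data: the feature names and value ranges follow the slide, but the actual values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patients = 50
# Two features, with the ranges stated on the slide:
# number_of_malignant_nodes in (0, 25], age in (0, 60]
number_of_malignant_nodes = rng.uniform(0.1, 25.0, n_patients)
age = rng.uniform(1.0, 60.0, n_patients)
# Binary target: 1 = Survived (blue), 0 = Did Not Survive (red)
survived = rng.integers(0, 2, n_patients)

# Each patient is one (number_of_malignant_nodes, age) point
X = np.column_stack([number_of_malignant_nodes, age])
print(X.shape)  # (50, 2): one row per patient, one column per feature
```

Plotting `X` with colours given by `survived` would reproduce a scatter plot of this shape.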
What is Needed to Select a KNN Model?
• K: number of neighbors.
• Distance metric (e.g., Euclidean, Manhattan).
• Weighting scheme (uniform vs. distance-based, e.g., w = 1/distance).
• Neighbor search algorithm (brute force, k-d tree, ball tree).
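In scikit-learn these four choices map directly onto `KNeighborsClassifier` parameters. A minimal sketch (the toy data is invented):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: two features per point, binary labels
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(
    n_neighbors=3,        # K: number of neighbors
    metric="euclidean",   # distance metric (could also be "manhattan", ...)
    weights="distance",   # weighting scheme: "uniform" or "distance" (w = 1/distance)
    algorithm="kd_tree",  # neighbor search: "brute", "kd_tree", "ball_tree", "auto"
)
knn.fit(X_train, y_train)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # → [0 1]
```

Each query point is assigned the class of its 3 nearest training points, with closer neighbors weighted more heavily.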
Learning: the model iterates over the data to improve its knowledge
Regression vs Classification
• Regression: the label 𝑦 ∈ ℝ is a continuous variable
  • e.g., price prediction
• Classification: the label is a discrete variable
  • e.g., the task of predicting the type of residence: 𝑦 = mansion or villa?
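The distinction shows up directly in the type of the target. A small illustrative sketch with invented data, using scikit-learn for both tasks:

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Regression: y ∈ ℝ is continuous (e.g., price prediction)
X = [[50], [80], [120]]          # e.g., floor area
prices = [100.0, 160.0, 240.0]   # continuous target
reg = LinearRegression().fit(X, prices)
print(reg.predict([[110]]))      # a real number (here the data is exactly linear: 220.0)

# Classification: y is a discrete label (e.g., type of residence)
labels = ["villa", "villa", "mansion"]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
print(clf.predict([[110]]))      # one of the discrete labels: "mansion"
```

The regressor can output any real number; the classifier can only output one of the labels seen in training.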
[Diagram: Data + Algorithm → Model (e.g., a neural network model)]
Training and Test Splits
Using training and test data
Train and Test Splitting: The Syntax
• Import the train and test split function
from sklearn.model_selection import train_test_split
• Split the data and put 30% into the test set
train, test = train_test_split(data, test_size=0.3)
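In practice the split is usually applied to features and labels together, and `random_state` pins the shuffle so the split is reproducible. A small sketch with invented data:

```python
from sklearn.model_selection import train_test_split

# 100 toy samples: one feature column and a binary label
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Put 30% into the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # → 70 30
```

The model is then fit on `(X_train, y_train)` and evaluated on `(X_test, y_test)`.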
Requirements for an ML Model
• Hypothesis function – the mathematical model that maps input features (X) to output predictions (Y). Different models have different hypothesis functions (e.g., linear regression: h(x) = θᵀx).
• Cost function – measures how well the hypothesis function fits the data by quantifying the error between predicted and actual values. Different models have different cost functions (e.g., mean squared error for linear regression).
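For simple linear regression the two pieces can be written out directly. A minimal sketch (the parameter values and data are invented):

```python
# Hypothesis function: maps an input feature x to a prediction
def hypothesis(x, w, b):
    return w * x + b

# Cost function: mean squared error between predictions and actual values
def cost(xs, ys, w, b):
    n = len(xs)
    return sum((hypothesis(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(cost(xs, ys, w=2.0, b=0.0))  # perfect fit → 0.0
print(cost(xs, ys, w=1.0, b=0.0))  # errors 1, 2, 3 → (1 + 4 + 9) / 3
```

Training a model amounts to searching for the parameters (here `w`, `b`) that minimize the cost.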
Supervised Learning
Classification
• Example: loan payment – differentiating between low-risk and high-risk customers from their income and savings
Hypothesis class H – the set of all possible rectangles
h(x) = 1 if h says x is positive, 0 if h says x is negative
x₁          x₂    true label r   h(x)
10,000,00   150   1              0
20,000,00   192   0              1
15,000,00   170   1              1
19,000,00   187   0              0
Empirical error of h – the proportion of training instances where h(x) does not match the required value
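A rectangle hypothesis and its empirical error can be computed directly. A minimal sketch: the rectangle bounds and the training instances below are invented for illustration, not taken from the table above.

```python
# A hypothesis h from class H: an axis-aligned rectangle over two features.
# h(x) = 1 if x falls inside the rectangle, 0 otherwise.
def h(x, x1_lo=5, x1_hi=20, x2_lo=100, x2_hi=200):
    return 1 if (x1_lo <= x[0] <= x1_hi and x2_lo <= x[1] <= x2_hi) else 0

# Training set: (features, true label r)
training = [
    ((10, 150), 1),
    ((25, 192), 0),
    ((15, 170), 1),
    ((22, 187), 0),
    ((8, 210), 1),   # this one falls outside the rectangle: a mismatch
]

# Empirical error: proportion of training instances where h(x) != r
mismatches = sum(1 for x, r in training if h(x) != r)
print(mismatches / len(training))  # → 0.2 (1 mismatch out of 5)
```

Learning within this hypothesis class means searching for the rectangle bounds that minimize this proportion.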
Noise and Outliers
Noise – due to wrong data collection, wrong labelling, or other hidden (latent) attributes not considered here
Outliers – extreme cases
Linear Regression
Triple Trade-Off