Unit-3-ML

ML

What is ML?
• According to Arthur Samuel's 1959 definition, machine learning is the process of feeding data to computer systems in such a way that the computer learns the ability to process and perform the activity in the future without being explicitly programmed or being fed similar or extra data.
• Samuel developed a checkers-playing program to demonstrate this.
• In 1962, Robert Nealey, a self-proclaimed checkers master, played against it on an IBM 7094 computer and lost to the machine.
• Definition:
• Machine learning is a field of artificial intelligence that allows
systems to learn and improve from experience without being
explicitly programmed
Types of ML
• Supervised Machine Learning
• Learns from given "right answers" (labeled examples)
• Maps inputs (x) to outputs (y)
Types of Supervised Machine Learning Algorithms: Regression and Classification

• Regression
• Regression algorithms are used when there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, as in weather forecasting, market-trend analysis, etc. Linear regression (covered later in this unit) is a popular regression algorithm under supervised learning.
Types of Supervised Machine Learning Algorithms:
Classification
• Classification predicts categories.
• Classifiers choose among a small number of possible discrete outputs; they cannot produce arbitrary continuous values such as 0.5 or 0.7.
Types of Supervised Machine Learning Algorithms:
• Classification algorithms are used when the output variable is categorical, meaning there are two or more classes such as Yes/No, Male/Female, True/False, etc.
• K-NN
• Decision Trees
• Logistic Regression
• Support vector Machines
Unsupervised Machine Learning
• Unsupervised learning is helpful for finding useful insights from data.
• Unsupervised learning is much like how a human learns to think through their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
• In the real world we do not always have input data with corresponding outputs, so to solve such cases we need unsupervised learning.
Types of Unsupervised Learning
• Clustering: Clustering groups objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with objects in other groups. Cluster analysis finds the commonalities between data objects and categorizes them by the presence or absence of those commonalities.
• Association: An association rule is an unsupervised learning method used to find relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
Unsupervised Learning algorithms:

• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
Supervised Vs Unsupervised Learning
| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Algorithms are trained using labeled data. | Algorithms are trained using unlabeled data. |
| The model takes direct feedback to check whether it is predicting the correct output. | The model does not take any feedback. |
| The model predicts the output. | The model finds hidden patterns in the data. |
| Input data is provided to the model along with the output. | Only input data is provided to the model. |
| The goal is to train the model so that it can predict the output when given new data. | The goal is to find hidden patterns and useful insights in an unknown dataset. |
| Needs supervision to train the model. | Does not need any supervision to train the model. |
| Can be categorized into Classification and Regression problems. | Can be classified into Clustering and Association problems. |
| Used where we know the inputs as well as the corresponding outputs. | Used where we have only input data and no corresponding output data. |
| Produces more accurate results. | May give less accurate results compared to supervised learning. |
| Not close to true AI: we first train the model on each dataset, and only then can it predict the correct output. | Closer to true AI, as it learns much as a child learns daily-routine things from experience. |
| Includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, multi-class classification, Decision Trees, Bayesian logic, etc. | Includes algorithms such as K-means clustering, hierarchical clustering, and the Apriori algorithm. |
Reinforcement Learning (Learns from Mistakes)
• RL falls between supervised and unsupervised learning.
• Unlike supervised learning, where the feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishments as signals for positive and negative behavior.
• Compared to unsupervised learning, reinforcement learning differs in its goals: while the goal of unsupervised learning is to find similarities and differences between data points, the goal of reinforcement learning is to find a suitable action model that maximizes the agent's total cumulative reward. (The original slide illustrates this with the action-reward feedback loop of a generic RL model.)
• The most commonly used RL algorithms are:
• Deep Q-Networks (DQN)
• SARSA (State-Action-Reward-State-Action)
Importance of ML
• Finds relationships within data
• Helps make data-driven decisions
• Predicts future outcomes
Applications
Steps in ML
• Identify the Problem statement
• Data Collection
• Data Cleaning
• Building ML Models
• Improving ML models
Data Collection
• This step includes the below tasks:
• Identify various data sources
• Collect data
• Integrate the data obtained from different sources
• By performing the above tasks, we get a coherent set of data, also called a dataset, which will be used in the further steps.
Data preparation
• Data exploration:
It is used to understand the nature of the data we have to work with: its characteristics, format, and quality.
A better understanding of the data leads to a more effective outcome. Here we look for correlations, general trends, and outliers.
• Data pre-processing:
The next step is preprocessing the data for analysis.
Data Wrangling
• Data wrangling is the process of cleaning and converting raw data into a usable format: cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step.
• In real-world applications, collected data may have various issues, including:
• Missing values
• Duplicate data
• Invalid data
• Noise
• So, we use various filtering techniques to clean the data.
• It is mandatory to detect and remove these issues because they can negatively affect the quality of the outcome.
Data Analysis
• Now the cleaned and prepared data is passed on to the analysis step.
This step involves:
• Selection of analytical techniques
• Building models
• Review the result
• We select machine learning techniques such as classification, regression, cluster analysis, association, etc., then build the model using the prepared data and evaluate it.
Train Model
We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.

Test Model
• Testing the model determines the percentage accuracy of the model, as per the requirements of the project or problem.
Deployment
• If the prepared model produces accurate results as per our requirements, with acceptable speed, then we deploy the model in the real system.
• But before deploying the project, we check whether the model keeps improving its performance using the available data.
• The deployment phase is similar to preparing the final report for a project.
Regression applications
• Evaluating trends and sales estimates
• Analysing pricing elasticity
• Assessing risk for an insurance company
• Sports analysis
• Predicting a person's age
• Predicting house price based on area
• Predicting the number of copies a music album will sell next month
Linear Regression Model

Types of Regression Models
• By number of explanatory variables: a simple regression model has 1 explanatory variable; a multiple regression model has 2+ explanatory variables.
• In either case, the model may be linear or non-linear.
Linear Regression Model
The relationship between the variables is a linear function:

$y = \beta_0 + \beta_1 x + \varepsilon$

where $y$ is the dependent (response) variable, $x$ is the independent (explanatory) variable, $\beta_0$ is the population y-intercept, $\beta_1$ is the population slope, and $\varepsilon$ is the random error.
Line of Means
• Slope: $\beta_1 = \dfrac{\text{change in } y}{\text{change in } x}$
• Intercept: $\beta_0$ is the y-intercept, i.e. the value of the line at $x = 0$
Linear Regression Model
• Each observed value: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i$ is the random error for observation $i$
• Line of means: $E(y) = \beta_0 + \beta_1 x$
Sample Linear Regression Model
• $y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i$, where $\hat\varepsilon_i$ is the estimated random error (residual)
• Fitted line: $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$; an unsampled observation is predicted from this line
Estimating Parameters:
Least Squares Method
Regression Modeling Steps
1. Hypothesize the deterministic component
2. Estimate the unknown model parameters
3. Specify the probability distribution of the random error term
• Estimate the standard deviation of the error
4. Evaluate the model
5. Use the model for prediction and estimation
Scattergram
1. Plot of all (xi, yi) pairs
2. Suggests how well the model will fit

(The slide shows a scatter plot with y on the vertical axis and x on the horizontal axis, both ranging from 0 to 60.)
Thinking Challenge
• How would you draw a line through the points?
• How do you determine which line 'fits best'?

(Same scatter plot as on the previous slide.)
Least Squares
• 'Best fit' means the differences between the actual y values and the predicted y values are at a minimum.
• But positive differences offset negative ones, so we square them:

$\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} \hat\varepsilon_i^2$

• Least Squares minimizes the Sum of Squared Errors (SSE).
Least Squares Graphically
• LS minimizes $\sum_{i=1}^{n} \hat\varepsilon_i^2 = \hat\varepsilon_1^2 + \hat\varepsilon_2^2 + \hat\varepsilon_3^2 + \hat\varepsilon_4^2$
• For example, $y_2 = \hat\beta_0 + \hat\beta_1 x_2 + \hat\varepsilon_2$: each residual $\hat\varepsilon_i$ is the vertical distance from an observed point to the fitted line $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$.
Derivation for linear regression using the least squares method
• To fit the line $y = a + bx$ to a given set of observations $(x_1, y_1), (x_2, y_2), \dots, (x_5, y_5)$.
• For any $x_i$, the observed value is $y_i$ and the expected value is $\hat y_i = a + b x_i$.
• Therefore the error is $e_i = y_i - \hat y_i$, and the sum of the squares of these errors is

$E = e_1^2 + e_2^2 + \dots + e_5^2 = (y_1 - (a + b x_1))^2 + (y_2 - (a + b x_2))^2 + \dots + (y_5 - (a + b x_5))^2$
Continued…
• For $E$ to be a minimum, we need $\dfrac{\partial E}{\partial a} = 0$:

$\frac{\partial E}{\partial a} = -2\left[(y_1 - (a + b x_1)) + (y_2 - (a + b x_2)) + \dots + (y_5 - (a + b x_5))\right] = 0$

which gives

$5a + b(x_1 + x_2 + \dots + x_5) = y_1 + y_2 + \dots + y_5$

Thus,

$a = \frac{\sum y - b \sum x}{5} \quad (1)$

• Similarly, setting $\dfrac{\partial E}{\partial b} = 0$ and simplifying as above gives

$a \sum x + b \sum x^2 = \sum xy \quad (2)$

Substituting (1) in (2),

$b = \frac{5 \sum xy - \sum x \sum y}{5 \sum x^2 - (\sum x)^2}$

These are the expressions for a and b. They are substituted in $y = a + bx$ to get the regression equation for the given data.
For n data points,

Intercept: $a = \beta_0 = \dfrac{\sum y - b \sum x}{n} = \bar y - b \bar x$, where $\bar x = \dfrac{\sum x}{n}$ is the mean of the x values (and $\bar y$ likewise).

Slope: $b = \beta_1 = \dfrac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \dfrac{\sum (x - \bar x)(y - \bar y)}{\sum (x - \bar x)^2}$
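These two formulas translate directly into code. Below is a minimal NumPy sketch; the function name `least_squares_fit` and the variable names are illustrative, not from the slides:

```python
# Closed-form least-squares fit for y = a + b*x, per the formulas above.
import numpy as np

def least_squares_fit(xs, ys):
    """Return (intercept a, slope b) of the least-squares line."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    n = len(xs)
    b = (n * np.sum(xs * ys) - np.sum(xs) * np.sum(ys)) / \
        (n * np.sum(xs ** 2) - np.sum(xs) ** 2)
    a = (np.sum(ys) - b * np.sum(xs)) / n  # equivalently: ys.mean() - b * xs.mean()
    return a, b
```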
Coefficient Equations

Prediction equation: $\hat y = \hat\beta_0 + \hat\beta_1 x$

Slope: $\hat\beta_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}$

y-intercept: $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$
Computation Table

| $x_i$ | $y_i$ | $x_i^2$ | $y_i^2$ | $x_i y_i$ |
| --- | --- | --- | --- | --- |
| $x_1$ | $y_1$ | $x_1^2$ | $y_1^2$ | $x_1 y_1$ |
| $x_2$ | $y_2$ | $x_2^2$ | $y_2^2$ | $x_2 y_2$ |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $x_n$ | $y_n$ | $x_n^2$ | $y_n^2$ | $x_n y_n$ |
| $\Sigma x_i$ | $\Sigma y_i$ | $\Sigma x_i^2$ | $\Sigma y_i^2$ | $\Sigma x_i y_i$ |
Interpretation of Coefficients
1. Slope ($\hat\beta_1$)
• The estimated $y$ changes by $\hat\beta_1$ for each 1-unit increase in $x$.
• If $\hat\beta_1 = 2$, then Sales (y) is expected to increase by 2 for each 1-unit increase in Advertising (x).
2. Y-intercept ($\hat\beta_0$)
• The average value of $y$ when $x = 0$.
• If $\hat\beta_0 = 4$, then average Sales (y) is expected to be 4 when Advertising (x) is 0.
Least Squares Example
You're a marketing analyst for Hasbro Toys. You gather the following data:

| Ad $ | Sales (Units) |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 2 |
| 5 | 4 |

Find the least squares line relating sales and advertising.
Scattergram
Sales vs. Advertising
(Scatter plot of Sales, 0 to 4 units, against Advertising, 0 to 5.)
Parameter Estimation Solution Table

| $x_i$ | $y_i$ | $x_i^2$ | $y_i^2$ | $x_i y_i$ |
| --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 4 | 1 | 2 |
| 3 | 2 | 9 | 4 | 6 |
| 4 | 2 | 16 | 4 | 8 |
| 5 | 4 | 25 | 16 | 20 |
| 15 | 10 | 55 | 26 | 37 |
Parameter Estimation Solution

$\hat\beta_1 = \dfrac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \dfrac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = \dfrac{7}{10} = 0.70$

$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 2 - (0.70)(3) = -0.10$

$\hat y = -0.1 + 0.7x$
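As a sanity check, feeding the advertising data into the `least_squares_fit` sketch from earlier reproduces these coefficients:

```python
# Verifying the worked example above with the earlier sketch.
a, b = least_squares_fit([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(a, b)  # -0.1 0.7, i.e. y_hat = -0.1 + 0.7x
```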
Parameter Estimation Computer Output

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | T for H0: Param=0 | p-value |
| --- | --- | --- | --- | --- | --- |
| INTERCEP ($\hat\beta_0$) | 1 | -0.1000 | 0.6350 | -0.157 | 0.8849 |
| ADVERT ($\hat\beta_1$) | 1 | 0.7000 | 0.1914 | 3.656 | 0.0354 |

$\hat y = -0.1 + 0.7x$
Coefficient Interpretation Solution
1. Slope ($\hat\beta_1$)
• Sales volume (y) is expected to increase by 0.7 units for each $1 increase in Advertising (x).
2. Y-intercept ($\hat\beta_0$)
• The average Sales volume (y) is -0.10 units when Advertising (x) is 0.
• This is difficult to explain to a marketing manager: we would expect some sales without advertising.
Regression Line Fitted to the Data
(Scatter plot of Sales vs. Advertising with the fitted line $\hat y = -0.1 + 0.7x$ drawn through the points.)
Least Squares Thinking Challenge
You're an economist for the county cooperative. You gather the following data:

| Fertilizer (lb.) | Yield (lb.) |
| --- | --- |
| 4 | 3.0 |
| 6 | 5.5 |
| 10 | 6.5 |
| 12 | 9.0 |

Find the least squares line relating crop yield and fertilizer.
Scattergram
Crop Yield vs. Fertilizer*
(Scatter plot of Yield, 0 to 10 lb., against Fertilizer, 0 to 15 lb.)
Parameter Estimation Solution Table*

| $x_i$ | $y_i$ | $x_i^2$ | $y_i^2$ | $x_i y_i$ |
| --- | --- | --- | --- | --- |
| 4 | 3.0 | 16 | 9.00 | 12 |
| 6 | 5.5 | 36 | 30.25 | 33 |
| 10 | 6.5 | 100 | 42.25 | 65 |
| 12 | 9.0 | 144 | 81.00 | 108 |
| 32 | 24.0 | 296 | 162.50 | 218 |
Parameter Estimation Solution*

$\hat\beta_1 = \dfrac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \dfrac{218 - \frac{(32)(24)}{4}}{296 - \frac{(32)^2}{4}} = \dfrac{26}{40} = 0.65$

$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 6 - (0.65)(8) = 0.80$

$\hat y = 0.8 + 0.65x$
Coefficient Interpretation Solution*
1. Slope ($\hat\beta_1$)
• Crop Yield (y) is expected to increase by 0.65 lb. for each 1 lb. increase in Fertilizer (x).
2. Y-intercept ($\hat\beta_0$)
• The average Crop Yield (y) is expected to be 0.8 lb. when no Fertilizer (x) is used.
Regression Line Fitted to the Data*
(Scatter plot of Yield vs. Fertilizer with the fitted line $\hat y = 0.8 + 0.65x$.)
Comments on coefficients
• The expression for the regression coefficient (slope) is

$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{\sum (x - \bar x)(y - \bar y)}{\sum (x - \bar x)^2}$

• Thus, for a feature x and a target variable y, the regression coefficient is given by $\dfrac{\sigma_{xy}}{\sigma_{xx}}$: the covariance between x and y divided by the variance of x.
Zero-mean data
• Intercept: $a = \dfrac{\sum y - b \sum x}{n} = \bar y - b \bar x$
• From this expression, it is seen that the regression line passes through $(\bar x, \bar y)$.
• Adding a constant to all x-values (a translation) affects only the intercept, not the regression coefficient (the slope of the line does not change).
• If we zero-centre the x-values by subtracting $\bar x$, the intercept becomes $a = \bar y$.
• If the y-values are zero-centred as well, a zero intercept is obtained.
What Is R-Squared?
• R-squared (R2) is a statistical measure that represents the proportion
of the variance for a dependent variable that’s explained by an
independent variable in a regression model.
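A minimal sketch of this definition, assuming arrays `y_true` (actual values) and `y_pred` (model predictions); the function name is illustrative:

```python
# R-squared = 1 - SS_res / SS_tot, per the definition above.
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```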
MAE
• MAE (Mean Absolute Error) is the average of the absolute differences between the predicted and actual values: $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat y_i|$
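The same metric as a short sketch (scikit-learn's `mean_absolute_error` computes the same quantity):

```python
# Mean Absolute Error, per the formula above.
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```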
Normalize the data
• If we normalize x by dividing all its values by x's variance, we can take the covariance between the normalized feature and the target variable as the regression coefficient.
Outliers → MLE
• The sum of the residuals of the least-squares solution is zero.
• This also makes linear regression susceptible to outliers: points that are far removed from the regression line, often because of measurement errors.
• Solution: take as our estimate the values of a and b that maximize the probability of the residuals: the MAXIMUM LIKELIHOOD ESTIMATE (MLE).
Linear Regression
• LR using scikit-learn: libraries required

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report, confusion_matrix  # used for classification tasks
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
```
Functions used from sklearn

```python
model = LinearRegression().fit(X, y)

r_sq = model.score(X, y)
print(f"coefficient of determination: {r_sq}")

c = model.intercept_
m = model.coef_
print(f"intercept: {model.intercept_}")
print(f"coefficients: {model.coef_}")

y_pred = model.predict(X)
print(f"predicted response:\n{y_pred}")
```
Feature Scaling
• Feature scaling in machine learning is a method used to normalize or standardize the range of independent variables (features) of the data.
• It is one of the data preprocessing steps.
• Real-world datasets often contain features that vary in magnitude, range, and units. For machine learning models to interpret these features on the same scale, we need to perform feature scaling.
• Feature scaling plays an important role, but it does not matter for every algorithm: it helps for linear regression, KNN, and neural networks, whereas it makes no difference for decision-tree and random-forest algorithms.
• Example of features with very different scales:

| CTRP | Spend | Revenue |
| --- | --- | --- |
| 133 | 111600 | 1197576 |
| 111 | 104400 | 1053648 |
| 129 | 97200 | 1124172 |
| 117 | 79200 | 987144 |
| 130 | 126000 | 1283616 |
| 154 | 108000 | 1295100 |
| 149 | 147600 | 1407444 |
| 90 | 104400 | 922416 |
| 118 | 169200 | 1272012 |
| 131 | 75600 | 1064856 |
| 141 | 133200 | 1269960 |
| 119 | 133200 | 1064760 |
| 115 | 176400 | 1207488 |
Types of feature scaling
• Normalization
• Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
• Min-Max scaling equation: $X' = \dfrac{X - X_{min}}{X_{max} - X_{min}}$
• Standardization
• Standardization is another scaling method, in which the values are centered around the mean with a unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
• Standardization equation: $X' = \dfrac{X - \mu}{\sigma}$, where $\mu$ is the mean of the feature values and $\sigma$ is their standard deviation.
• Standardized values do not fall in a fixed range.
• Standardization is useful if the data contains outliers.
| Normalization | Standardization |
| --- | --- |
| Rescales values to a range between 0 and 1 | Centers data around the mean and scales to a standard deviation of 1 |
| Useful when the distribution of the data is unknown or not Gaussian | Useful when the distribution of the data is Gaussian or unknown |
| Sensitive to outliers | Less sensitive to outliers |
| Retains the shape of the original distribution | Changes the shape of the original distribution |
| May not preserve the relationships between the data points | Preserves the relationships between the data points |
| Equation: (x - min)/(max - min) | Equation: (x - mean)/standard deviation |

• The choice between normalization and standardization depends on the problem and the machine learning algorithm being used; there is no hard and fast rule for when to normalize or standardize your data.
• You can always start by fitting your model to raw, normalized, and standardized data and compare the performance for the best results.
• It is good practice to fit the scaler on the training data and then use it to transform the test data; this avoids data leakage during model testing.
• Scaling of target values is generally not required.
Python Functions for Normalization

```python
# data normalization with sklearn
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Output is a NumPy array; convert it back into a DataFrame
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
```
Python Functions for Standardization

```python
# data standardization with sklearn
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Output is a NumPy array; convert it back into a DataFrame
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)

# To get the data back in its original scale
X_train_inv = scaler.inverse_transform(X_train_scaled)
X_train_inv = pd.DataFrame(X_train_inv, columns=X_train.columns)
```
Logistic Regression
KNN Algorithm - Finding K-Nearest Neighbors
• It's a type of supervised ML algorithm which can be used for both classification and regression predictive problems.
• However, in industry it is mainly used for classification problems.
• The following two properties define KNN well:
• Lazy learning algorithm: KNN is a lazy learning algorithm because it has no specialized training phase; it uses all of the data while classifying.
• Non-parametric learning algorithm: KNN is also non-parametric because it assumes nothing about the underlying data.
• There is only memorization of the training data, no actual learning.
KNN algorithm
• The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new point to the training points.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: Our model is ready (a scikit-learn sketch of these steps follows below).
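A minimal scikit-learn sketch of these steps; the toy feature vectors and class labels are illustrative:

```python
# KNN classification of a new point, following the steps above.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]  # toy feature vectors
y = ['A', 'A', 'A', 'B', 'B', 'B']                    # two categories

scaler = StandardScaler()                  # scale features (see data preparations below)
X_scaled = scaler.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5)  # Step 1: choose K
knn.fit(X_scaled, y)                       # lazy learner: memorizes the data

new_point = scaler.transform([[5, 5]])     # Steps 2-5 happen inside predict()
print(knn.predict(new_point))              # majority vote among the 5 neighbors
```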
KNN - Example
• Suppose we have a new data point and we need to put it in the required category.
• Firstly, we choose the number of neighbors; consider k = 5.
• Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry.
• By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B.
• As the 3 nearest neighbors (the majority) are from category A, this new data point must belong to category A.
KNN… continued
• How to choose the value of K in the KNN algorithm?
• There is no particular way of determining the best value of K, so we need to try several values to find the best among them. The most preferred value of K is 5. Very low values of K, such as K = 1 or K = 2, can be noisy and subject to the effects of outliers. Large values of K might be good, but can run into difficulties of their own. (A cross-validation sketch for choosing K follows below.)
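One common way to try several values of K is a cross-validated grid search; this sketch uses a synthetic dataset purely for illustration:

```python
# Choosing K by 5-fold cross-validation over a range of candidates.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': range(1, 16)}, cv=5)
search.fit(X, y)
print("best K:", search.best_params_['n_neighbors'])
```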
• Data preparations required:
1. Data scaling: To locate a data point in the multidimensional feature space, it helps if all features are on the same scale; hence, normalization or standardization of the data will help.
2. Dimensionality reduction: KNN may not work well if there are too many features, so dimensionality-reduction techniques such as feature selection or principal component analysis can be applied.
3. Missing value treatment: If, out of M features, one feature value is missing for a particular example in the training set, we cannot locate that point or calculate distances from it; therefore, deleting that row or imputing the value is required.
KNN
Advantages:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective when the training data is large.
Disadvantages:
• We always need to determine the value of K, which can sometimes be complex.
• The computation cost is high because distances must be calculated from the query point to all training samples.
Classification Terminologies
• Signal: the true underlying pattern of the data that helps the machine learning model learn from the data.
• Noise: unnecessary and irrelevant data that reduces the performance of the model.
• Bias: a prediction error introduced in the model by oversimplifying the machine learning algorithm; i.e., the difference between the predicted values and the actual values.
• Variance: occurs when the model performs well on the training dataset but not on the test dataset. It is the variability in the model's predictions: how much the learned function changes depending on the given dataset.
Epoch, batch, iterations
• Mathematically, we can understand it as follows:
• Total number of training examples (feature vectors) = 3000
• Assume each batch size = 500
• Then the total number of iterations = total number of training examples / batch size = 3000/500
• Total number of iterations = 6
• And 1 epoch = 6 iterations
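The same arithmetic as a tiny sketch:

```python
# iterations per epoch = training examples / batch size
total_examples = 3000
batch_size = 500
iterations_per_epoch = total_examples // batch_size
print(iterations_per_epoch)  # 6, so 1 epoch = 6 iterations
```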
Batch and Batch Size
For example, with 1000 training examples (the case illustrated in the figure on the original slide):
• If the batch size is 1000, an epoch completes in one iteration.
• If the batch size is 500, an epoch completes in 2 iterations.
• Similarly, if the batch size is smaller, say 100, the epoch completes in 10 iterations.
So, for each epoch, the number of iterations times the batch size gives the number of data points. We can run multiple epochs when training a machine learning model.
Classification Terminologies - Overfitting, Underfitting and Best Fit
Feature Selection
Regularization
• What happens if $\lambda$ is set extremely high?
• Each $\theta_j$ must then be made very small (close to zero) to minimize the error function.
• If $\theta_1, \dots, \theta_n = 0$, then $h(x) = \theta_0$: the model reduces to a constant and underfits.
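This shrinking effect is easy to see with ridge regression, where scikit-learn's `alpha` plays the role of $\lambda$; the synthetic data below is illustrative:

```python
# With a huge penalty, all coefficients are driven toward zero and the
# prediction collapses to (roughly) the intercept theta_0.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(size=100)

for alpha in (0.001, 1e6):
    m = Ridge(alpha=alpha).fit(X, y)
    print(alpha, m.coef_.round(4), round(m.intercept_, 4))
```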
Classification Parameters
SVM - Support Vector Machine
Introduction to Support Vector Machines

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Thanks: Andrew Moore (CMU) and Martin Law (Michigan State University)
History of SVM
• SVM is related to statistical learning theory [3]
• SVM was first introduced in 1992 [1]
• SVM became popular because of its success in handwritten digit recognition
• 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4
• See Section 5.11 in [2] or the discussion in [3] for details
• SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning
• Note: the meaning of "kernel" here is different from the "kernel" function for Parzen windows

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 5:144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
Introduction to SVM
Definition of SVM
• Support Vector Machine or SVM is one of the most
popular Supervised Learning algorithms, which is used for
Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine
Learning.
Significance of SVM
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily place new data points in the correct category in the future. This best decision boundary is called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. (The original slide shows two categories separated by a decision boundary/hyperplane.)
• Hyperplane and Support Vectors in the SVM algorithm:
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps classify the data points. This best boundary is known as the hyperplane of SVM.
• The dimensionality of the hyperplane depends on the number of features in the dataset: with 2 features the hyperplane is a straight line; with 3 features it is a two-dimensional plane.
• We always create the hyperplane with the maximum margin, i.e., the maximum distance between it and the nearest data points.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its position are termed support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
• Linear SVM: used for linearly separable data. If a dataset can be classified into two classes with a single straight line, it is termed linearly separable, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified with a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
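A minimal sketch contrasting the two on data that is not linearly separable (concentric circles, chosen purely for illustration):

```python
# Linear vs. non-linear (RBF-kernel) SVM on non-linearly separable data.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf').fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # near chance
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # close to 1.0
```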
How does a Linear Classifier work? Estimation:

$f(x, w, b) = \text{sign}(w \cdot x + b)$

where $w$ is the weight vector and $x$ is the data vector; points with output +1 fall on one side of the boundary and points with output -1 on the other.

(The original slides repeat the question "How would you classify this data?" over several scatter plots of +1 and -1 points, each with a different candidate separating line.) Any of these lines would be fine... but which is best?
Classifier Margin
Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.
Maximum Margin
• The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
• This is the simplest kind of SVM (called an LSVM): a Linear SVM.
• Support vectors are those data points that the margin pushes up against.
Why Maximum Margin?
• Hence, the SVM algorithm helps find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes; these points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
• The hyperplane with the maximum margin is called the optimal hyperplane.
How to calculate the distance from a point to a line?
• The separating line is $w \cdot x + b = 0$, where $x$ is the data vector, $w$ is the normal vector, and $b$ is a scale value.
• See http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
• In our two-dimensional case, $w_1 x_1 + w_2 x_2 + b = 0$; thus $w = (w_1, w_2)$ and $x = (x_1, x_2)$.
Estimate the Margin
• What is the distance expression for a point $x$ to the line $w \cdot x + b = 0$?

$d(x) = \dfrac{|x \cdot w + b|}{\|w\|_2} = \dfrac{|x \cdot w + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$
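A quick numeric check of this formula, with illustrative values for w, b, and x:

```python
# Distance from point x to the hyperplane w.x + b = 0.
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

d = abs(np.dot(w, x) + b) / np.linalg.norm(w)
print(d)  # |3*2 + 4*1 - 5| / 5 = 1.0
```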
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible.
• We should maximize the margin, m.
• The distance between the origin and the line $w^T x = -b$ is $|b| / \|w\|$.
(The slide shows the Class 1 and Class 2 point clouds separated by a boundary with margin m.)
Finding the Decision Boundary
• Let $\{x_1, \dots, x_n\}$ be our data set and let $y_i \in \{1, -1\}$ be the class label of $x_i$.
• The decision boundary should classify all points correctly.
• To see this: when $y_i = -1$ we want $w \cdot x_i + b \le -1$; when $y_i = 1$ we want $w \cdot x_i + b \ge 1$. For support vectors, $y_i (w \cdot x_i + b) = 1$.
• The decision boundary can be found by solving the following constrained optimization problem: minimize $\frac{1}{2} \|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i$.
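In practice this optimization is solved by an SVM library. Here is a sketch of fitting a (nearly) hard-margin linear SVM on toy data and recovering $w$, $b$, and the margin width $2/\|w\|$; the data points are illustrative:

```python
# Fit a linear SVM and inspect the learned boundary and margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 2], [2, 0], [4, 4], [5, 5], [5, 3]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)  # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w:", w, "b:", b)
print("margin width 2/||w||:", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```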
A Geometrical Interpretation
(The slide shows the two classes and the fitted boundary, with a Lagrange multiplier $\alpha_i$ attached to each training point. Most points have $\alpha_i = 0$; only the points the margin pushes against, here $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, and $\alpha_8 = 0.6$, have nonzero multipliers: these are the support vectors.)
