MACHINE LEARNING
WHY MACHINE LEARNING WAS INTRODUCED
Statistics: How to efficiently train large complex models?
Computer Science & Artificial Intelligence: How to build more robust versions of AI systems?
Neuroscience: How to design operational models of the brain?
CAN YOU RECOGNIZE THESE PICTURES?
If yes, how do you recognize them?
ORIGIN OF MACHINE LEARNING
……… lies in the very effort of understanding intelligence.
What is intelligence?
It can be defined as the ability to comprehend; to understand and profit from experience; the capability to acquire and apply knowledge.
LEARNING? 2300 YEARS AGO……
Plato (427 – 347 BC)
Abstract concepts are known to us a priori, through a mystical connection with the world.
He concluded that the ability to think lies in a priori knowledge of these concepts.
LEARNING?
Plato’s Pupil
Aristotle (384 – 322 BC)
Criticized his teacher's theory for not taking into account an important aspect:
the ability to learn and adapt to a changing world.
MACHINE LEARNING
Machine Learning is a subset of AI techniques that uses statistical methods to enable machines to improve with experience.
• Learning –
– A computer program is said to learn from
• experience E
• with respect to some class of tasks T
• and performance measure P
– if its performance at tasks in T, as measured by P, improves with experience E." (Mitchell, 1997)
LEARNING ALGORITHMS…
• General Tasks
– Classification, Regression, Transcription, Machine Translation, etc.
• Performance measures
– Depends on the type of problem; examples include:
• accuracy, error rate, etc.
– Performance is measured on a dataset called the test set, which is different from the dataset used to train the algorithm.
– Often difficult to choose a performance measure that corresponds well to the
desired behavior of the system.
• Experience
– Algorithms are termed supervised or unsupervised learning algorithms based on the experience they are allowed to have on datasets.
EXAMPLE (HANDWRITING RECOGNITION LEARNING PROBLEM)
Task T: Recognizing and classifying handwritten words within images
Performance Measure P: Percentage of words correctly classified.
Training experience E: A database of handwritten words with given classifications
MACHINE LEARNING
• Learning from experience on data to make predictions.
[Diagram: data → machine learning algorithm → training → trained model; unseen data → trained model → prediction]
BRANCHES OF MACHINE LEARNING
Source: [Link]
SUPERVISED MACHINE LEARNING
SUPERVISED MACHINE LEARNING APPROACH
For each specific task:
We collect lots of examples with their known outcomes.
Learn a function that maps inputs to outputs.
These programs tend to be data-centric, i.e. driven by the learning examples, and try to learn a hypothesis function that describes the mapping as closely as possible.
SUPERVISED MACHINE LEARNING APPROACH
We collect lots of examples with their known outcomes.
Learn a function that maps inputs to outputs.
Supervised learning models try to find parameter values that allow them to perform well on historical data. They are then used for making predictions on unseen data that was not part of the training dataset.
There are two main problems that can be solved with Supervised
Learning:
Regression: Linear Regression, Multiple Linear Regression, Polynomial Linear Regression, Support Vector Regression, Decision Tree Regression, Random Forest Regression
Classification: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, Decision Tree Classification, Random Forest Classification
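As a concrete illustration of the two problem types, here is a minimal Python sketch (assuming NumPy and scikit-learn are installed; the toy data is invented for illustration) that fits one model from each column:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy supervised data (invented): inputs X with known outcomes.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Regression: continuous target.
y_reg = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
reg = LinearRegression().fit(X, y_reg)
print("regression prediction at x=6:", reg.predict([[6.0]]))

# Classification: discrete target (two classes).
y_clf = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_clf)
print("predicted class at x=6:", clf.predict([[6.0]]))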
SUPERVISED EXAMPLE & USE CASES
UNSUPERVISED EXAMPLES & USE CASES
UNSUPERVISED MACHINE LEARNING APPROACH
Finding patterns in data
Draw inferences from unlabeled data (without reference to known or labeled outcomes).
Models based on this type of algorithm can be used to discover unknown data patterns and the structure of the data itself.
CLUSTERING
ASSOCIATION RULE MINING
Source: [Link]
DIMENSION REDUCTION METHOD
Clustering: K-Means, Hierarchical, DBSCAN
Association Rule Mining: Apriori, FP-Growth, Eclat
Dimension Reduction: PCA, LDA
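As a minimal illustration, the sketch below (assuming scikit-learn is available; the toy points are invented) runs one clustering method and one dimension reduction method from the lists above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy unlabeled data: two loose groups in 2-D (invented for illustration).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Clustering: K-Means discovers the two groups without any labels.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("cluster labels:", labels)

# Dimension reduction: PCA projects the points onto their main axis of variation.
X_1d = PCA(n_components=1).fit_transform(X)
print("1-D projection:", X_1d.ravel())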
UNSUPERVISED EXAMPLE & USE CASES
REINFORCEMENT LEARNING
Reinforcement learning is a type of machine learning where an agent learns to behave in an environment by performing actions and seeing the results.
Exploration (trial and error)
Exploitation (knowledge gained from the environment)
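The slides give no code here, but a standard way to see the exploration/exploitation trade-off is an ε-greedy multi-armed bandit; the sketch below (arm payout probabilities invented for illustration) explores with probability ε and otherwise exploits the current best value estimate:

import random

# Invented 3-armed bandit: each arm pays 1 with the given probability.
arm_probs = [0.2, 0.5, 0.8]
values = [0.0] * 3   # estimated value of each arm (knowledge gained)
counts = [0] * 3     # number of pulls per arm
epsilon = 0.1        # exploration rate

random.seed(0)
for step in range(5000):
    if random.random() < epsilon:                   # exploration (trial and error)
        arm = random.randrange(3)
    else:                                           # exploitation
        arm = max(range(3), key=lambda a: values[a])
    reward = 1.0 if random.random() < arm_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print("estimated arm values:", [round(v, 2) for v in values])   # roughly [0.2, 0.5, 0.8]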
DEEP LEARNING
• The difference in artificial intelligence approaches over
the two decades (1997-2017)
– 1997: The IBM chess computer Deep Blue was explicitly programmed to win against the grandmaster Garry Kasparov.
– 2017: AlphaGo was not preprogrammed to play Go.
– It learned using a general-purpose algorithm that allowed it to
interpret the game’s patterns.
• The AlphaGo program applied deep learning.
DEEP LEARNING
Deep learning is a newer area of machine learning research, introduced with the objective of moving machine learning closer to its original goal: Artificial Intelligence.
It is inspired by the functionality of our brain cells, called neurons, which led to the concept of artificial neural networks.
DEEP LEARNING
Source: [Link]
MACHINE LEARNING VS DEEP LEARNING
Deep Learning IS Machine Learning
They differ in: data dependency, hardware requirements, execution time, feature engineering, interpretability, and problem-solving approach.
REGRESSION
SUPERVISED LEARNING
Learning a discrete function (classification): the algorithm attempts to estimate the mapping function from the input variables to discrete or categorical output variables.
Learning a continuous function (regression): the algorithm attempts to estimate the mapping function from the input variables to numeric or continuous output variables.
CLASSIFICATION VS REGRESSION
Source: [Link]
SUPERVISED LEARNING
Image Source: [Link]
WHAT IS REGRESSION
It is used to predict target variables on a continuous scale.
[Diagram: dataset → regression → map x to y, identifying the relationship]
SALARY AFTER COMPLETING THE COURSE
How much will your salary be?
Depends on x = performance in course, quality of projects, etc….
TWEET POPULARITY
How many people will retweet your tweet? (y)
Depends on x = # of followers, # of followers of followers, features of the text tweeted, popularity of the hashtag, # of past retweets, …
REGRESSION ANALYSIS
Regression analysis is a statistical tool for investigating the relationship between a dependent variable and one or more independent (explanatory) variables.
Regression analysis is widely used for prediction and
forecasting
INDEPENDENT AND DEPENDENT VARIABLE
Independent Variable (Explanatory Variable):
A variable whose value does not change under the effect of other variables and is used to manipulate the dependent (target) variable. It is often denoted by X.
Dependent Variable
A variable whose value changes when there is any manipulation in the
values of independent variable. It is often denoted by Y
CASE STUDY: PREDICTING HOUSE PRICE
Size of the house (sq. ft) is the independent variable, also known as the control variable.
Price of the house is the dependent (response) variable.
BIVARIATE AND MULTIVARIATE MODEL
Bivariate or simple regression model:
Size of house (X) → Price (Y)
Multivariate or multiple regression model:
Size of house (X1), # of bedrooms (X2), Age of house (X3) → Price (Y)
SIMPLE/BIVARIATE LINEAR REGRESSION
Simple linear regression is a linear regression model with a single explanatory
variable.
It concerns two-dimensional sample points with one independent variable and one dependent variable, and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent-variable values as a function of the independent variable.
The adjective simple refers to the fact that the outcome variable is related to a
single predictor.
HOW MUCH IS MY HOUSE WORTH?
LOOK AT RECENT SALES IN MY NEIGHBORHOOD
How much did they sell for?
[Scatter plot: each recent sale is a data point (x⁽ⁱ⁾, y⁽ⁱ⁾)]
REGRESSION (HOUSE PRICE PREDICTION)
A scatter plot is a mathematical diagram that displays values of two variables for a set of data.
Size of the house (sq. ft) is the independent (control) variable; price of the house is the dependent (response) variable.
[Scatter plot: points (x⁽ⁱ⁾, y⁽ⁱ⁾), size of house on the x-axis, price on the y-axis]
Scatter plots are used to investigate the relationship between the variables.
SIMPLE LINEAR REGRESSION
House Price Prediction
We want to fit the best line (a linear function y = f(x)) to explain the data.
SIMPLE LINEAR REGRESSION
The equation that describes how the dependent variable (y) is related to the independent variable (x) is referred to as the regression equation:
y = mx + c
The simple linear regression model is:
h_θ(x) = θ0 + θ1·x
• x is the independent variable
• The parameters (regression coefficients) are θ0 (intercept) and θ1 (slope)
REGRESSION
The simple linear regression equation h_θ(x) = θ0 + θ1·x represents the relationship between input (x) and output (y).
[Plot: fitted line over the data, size of house (x) vs. house price (y)]
1. The regression equation is a straight line
2. θ0 is the intercept of the regression line
3. θ1 is the slope of the regression line
4. h_θ(x) is the hypothesis of the model
ESTIMATION PROCESS
[Diagram]
Regression equation: h_θ(x) = θ0 + θ1·x, with θ0, θ1 unknown
↓ fit on sample data (x, y)
Estimated regression equation: h_θ(x) = θ0 + θ1·x, with θ0, θ1 now known
GOAL OF REGRESSION MODEL
Our goal is to learn the model parameters that minimize the error in the model's predictions.
[Plot: fitted line h_θ(x) = θ0 + θ1·x over the data; for each point, the predicted value h_θ(x⁽ⁱ⁾) and the actual value y⁽ⁱ⁾, size of house (x) vs. house price (y)]
To find the best parameters:
Define a cost function, or loss function, that measures how inaccurate our model's predictions are.
The error on the i-th example is y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾) (equivalently h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾ with the opposite sign).
[Plot: vertical offsets between each point y⁽ⁱ⁾ and the line value h_θ(x⁽ⁱ⁾)]
SIMPLE LINEAR REGRESSION
Parameters: the regression coefficients θ0 and θ1
h_θ(x) = θ0 + θ1·x
EFFECTS OF PARAMETERS ON LINE PLACEMENT
h_θ(x) = 1.5 + 0·x
h_θ(x) = 0 + 0.5·x
h_θ(x) = 1 + 0.5·x
Data: (x, y) = (1, 1), (2, 2), (3, 3)
[Plot: the three lines over the data points]
EFFECTS OF PARAMETERS ON LINE PLACEMENT
h_θ(x) = 1.5 + 0·x
h_θ(x) = 0 + 0.5·x
h_θ(x) = 1 + 0.5·x
Data: (x, y) = (1, 1), (2, 2), (3, 3)
[Plot: the three lines over the data points]
Example
Suppose x = 2.5 and h_θ(x) = 1 + 0.5·x.
Predict the outcome: h_θ(2.5) = 1 + 0.5·2.5 = 2.25
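A one-line Python check of this prediction (a sketch that simply encodes the hypothesis h_θ(x) = θ0 + θ1·x):

def h(x, theta0=1.0, theta1=0.5):
    # simple linear regression hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

print(h(2.5))   # 2.25, matching the worked example above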
ESTIMATION PROCESS
[Scatter plot: house price (y) vs. size of house (x)]
LEAST SQUARE METHOD
One of the most common estimation techniques for linear regression is least squares estimation.
The least squares method is a statistical procedure that finds the best fit for a set of data points by minimizing the sum of the offsets (residuals) of the points from the plotted curve.
[Plot: fitted line with residual offsets, size of house (x) vs. price (y)]
Least Squares Method
y⁽ⁱ⁾ = θ0 + θ1·x⁽ⁱ⁾ + ε⁽ⁱ⁾
ε⁽ⁱ⁾ = y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾) is the residual error in the i-th observation.
J(θ0, θ1) = (y⁽¹⁾ − h_θ(x⁽¹⁾))² + (y⁽²⁾ − h_θ(x⁽²⁾))² + (y⁽³⁾ − h_θ(x⁽³⁾))² + … (including all training houses)
So our aim is to minimize the total error:
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
minimize over θ0, θ1: J(θ0, θ1)   (the cost function)
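A minimal NumPy sketch of this cost function (the toy data matches the two-point example used on the next slides):

import numpy as np

def cost(theta0, theta1, x, y):
    # J(theta0, theta1) = (1/(2m)) * sum_i (y_i - h_theta(x_i))^2
    m = len(x)
    residuals = y - (theta0 + theta1 * x)
    return (residuals ** 2).sum() / (2 * m)

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])
print(cost(0.0, 1.0, x, y))   # 0.0: the line y = x fits exactly
print(cost(0.0, 1.5, x, y))   # 0.3125, as in the theta1 = 1.5 example below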
EXAMPLE
Let's take only one parameter, θ1 (so h_θ(x) = θ1·x):
J(θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
Goal: minimize over θ1: J(θ1)
EXAMPLE
Data: (x, y) = (1, 1), (2, 2)
h_θ(x) = θ1·x: for a fixed θ1, this is a function of x; J(θ1) is a function of θ1.
For θ1 = 1:
[Left plot: the line h_θ(x) = x passes through both data points; right plot: J(θ1) vs. θ1]
J(θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − θ1·x⁽ⁱ⁾)² = (1/(2·2))·(0² + 0²) = 0
EXAMPLE
For θ1 = 1.5:
[Left plot: the line h_θ(x) = 1.5·x over the data; right plot: J(θ1) vs. θ1]
J(θ1) = (1/(2·2))·((1 − 1.5)² + (2 − 3)²) = 1.25/4 = 0.3125
EXAMPLE
For θ1 = 0.75:
[Left plot: the line h_θ(x) = 0.75·x over the data; right plot: J(θ1) vs. θ1]
J(θ1) = (1/(2·2))·((1 − 0.75)² + (2 − 1.5)²) = 0.3125/4 ≈ 0.078
COST FUNCTION SURFACE PLOT
CONTOUR PLOT
A contour plot is also known as a level plot.
It is used to visualize the change in J(θ0, θ1) as a function of the two inputs θ0 and θ1:
J(θ0, θ1) = f(θ0, θ1)
For a function f(θ0, θ1) of two variables, assign different colors to different values of f.
Pick some values to plot; the result will be contours, curves in the graph along which the value of f(θ0, θ1) is constant.
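A short sketch (assuming NumPy and matplotlib are available; the two-point data is the toy example from earlier slides) of how such a contour plot of J(θ0, θ1) can be drawn:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])

# Evaluate J(theta0, theta1) on a grid of parameter values.
t0, t1 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-1, 3, 100))
J = np.zeros_like(t0)
for xi, yi in zip(x, y):
    J += (yi - (t0 + t1 * xi)) ** 2
J /= 2 * len(x)

plt.contour(t0, t1, J, levels=20)   # curves along which J is constant
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.show()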
EXAMPLE
[Left: h_θ(x), for fixed θ0, θ1, a function of x; right: J(θ0, θ1), a function of the parameters θ0, θ1]
EXAMPLE
[Left: h_θ(x), for fixed θ0, θ1, a function of x; right: J(θ0, θ1), a function of the parameters θ0, θ1]
EXAMPLE
[Left: h_θ(x), for fixed θ0, θ1, a function of x; right: J(θ0, θ1), a function of the parameters θ0, θ1]
SUMMARY
Hypothesis: h_θ(x) = θ0 + θ1·x
Parameters: θ0, θ1
Cost function: J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
Goal: minimize over θ0, θ1: J(θ0, θ1)
CONVEX AND CONCAVE FUNCTION
Convex function: g″(z) ≥ 0; the slope is 0 at the (unique) minimum.
Concave function: g″(z) < 0; the slope is 0 at the (unique) maximum.
[Plots: a convex and a concave function g(z) over an interval [a, b]]
Example
g(z) = 5 − (z − 10)²
dg(z)/dz = −2(z − 10) = −2z + 20
Set dg(z)/dz = 0 ⇒ z = 10
FINDING MAXIMUM VIA HILL CLIMBING
At the maximum max g(θ), the derivative is 0.
How do we know whether to move θ to the right or to the left (increase or decrease θ)?
If dg(θ)/dθ > 0 (positive slope), increase θ; if dg(θ)/dθ < 0 (negative slope), decrease θ.
While not converged:
  θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ + α·dg(θ⁽ᵗ⁾)/dθ
where t is the iteration and α is the step size.
FINDING MINIMUM VIA HILL DESCENT
At the minimum min g(θ), the derivative is 0.
When the derivative is positive, we want to decrease θ; when the derivative is negative, we want to increase θ.
[Plot: g(θ) with negative slope (dg(θ)/dθ < 0) left of the minimum and positive slope (dg(θ)/dθ > 0) right of it]
While not converged:
  θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − α·dg(θ⁽ᵗ⁾)/dθ
where t is the iteration and α is the step size.
STEP SIZE/LEARNING RATE (𝛼)
With a fixed learning rate, we slowly reach the optimum position.
STEP SIZE/LEARNING RATE (𝛼)
With a fixed learning rate:
Small step size: advantage: will converge to the global optimum; disadvantage: slow convergence.
Large step size: advantage: moves fast toward the optimum; disadvantage: may overshoot the optimum point.
STEP SIZE/LEARNING RATE (𝛼)
Decreasing step size: the step size is scheduled. Common choices:
α_t = β/t
α_t = β/√t
CONVERGENCE CRITERIA
For a convex function, the optimum occurs when
dg(θ)/dθ = 0
In practice, stop when
|dg(θ)/dθ| < ε
While not converged:
  θ⁽ᵗ⁺¹⁾ ← θ⁽ᵗ⁾ − α·dg(θ⁽ᵗ⁾)/dθ
where t is the iteration and α is the step size.
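Putting the update rule and the stopping criterion together, a minimal sketch for the convex function g(θ) = (θ − 10)² (chosen here only for illustration):

def dg(theta):
    # derivative of g(theta) = (theta - 10)^2
    return 2.0 * (theta - 10.0)

theta, alpha, eps = 0.0, 0.1, 1e-6
while abs(dg(theta)) >= eps:           # stop when |dg/dtheta| < epsilon
    theta = theta - alpha * dg(theta)  # hill descent: step against the slope
print(round(theta, 4))                 # ~10.0, the minimizer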
FINDING THE LEAST SQUARES LINE
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
The cost function is convex in θ0, θ1, so the solution is unique and gradient descent will converge to the minimum.
Goal: minimize over θ0, θ1: J(θ0, θ1)
COMPUTE THE GRADIENT
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))², with h_θ(x⁽ⁱ⁾) = θ0 + θ1·x⁽ⁱ⁾
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))²
∂J(θ0, θ1)/∂θ0 = ∂/∂θ0 [(1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))²]
              = (1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))·(−1)
∂J(θ0, θ1)/∂θ1 = ∂/∂θ1 [(1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))²]
              = (1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))·(−x⁽ⁱ⁾)
COMPUTE THE GRADIENT
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
Putting it together:
∇J(θ0, θ1) = [ −(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾)) ,
               −(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))·x⁽ⁱ⁾ ]
APPROACH 1 : SET GRADIENT = 0
Set ∇J(θ0, θ1) = [ −(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾)) ,
                   −(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − (θ0 + θ1·x⁽ⁱ⁾))·x⁽ⁱ⁾ ] = 0
Top term:
θ0 = (Σ y⁽ⁱ⁾)/m − θ1·(Σ x⁽ⁱ⁾)/m
Bottom term:
Σ y⁽ⁱ⁾x⁽ⁱ⁾ − θ0·Σ x⁽ⁱ⁾ − θ1·Σ (x⁽ⁱ⁾)² = 0
⇒ θ1 = [Σ y⁽ⁱ⁾x⁽ⁱ⁾ − (Σ y⁽ⁱ⁾)(Σ x⁽ⁱ⁾)/m] / [Σ (x⁽ⁱ⁾)² − (Σ x⁽ⁱ⁾)²/m]
QUESTION 1
Find the least squares regression line for the following data.
Also estimate the value of y when x = 10.
X Y
0 2
1 3
2 5
3 4
4 6
SOLUTION
h_θ(x) = 2.2 + 0.9·x
At x = 10:
h_θ(10) = 2.2 + 0.9·10 = 11.2
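As a sanity check, a NumPy sketch of the closed-form estimates from Approach 1, applied to the Question 1 data:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
m = len(x)

# Normal-equation solutions obtained by setting the gradient to zero.
theta1 = (np.sum(x * y) - np.sum(x) * np.sum(y) / m) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
theta0 = np.mean(y) - theta1 * np.mean(x)

print(theta0, theta1)         # 2.2 0.9
print(theta0 + theta1 * 10)   # 11.2, the estimate at x = 10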
APPROACH 2: GRADIENT DESCENT
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function that minimize a cost function.
GRADIENT DESCENT
The gradient descent algorithm produces estimated parameters (intercept and slope), which are used to form predictions.
Have some function
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))², with h_θ(x⁽ⁱ⁾) = θ0 + θ1·x⁽ⁱ⁾
Goal: minimize over θ0, θ1: J(θ0, θ1)
Outline:
Start with some θ0, θ1.
Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum.
While not converged {
  for j = 0 to 1:
    θj := θj − α·∂J(θ0, θ1)/∂θj
}
GRADIENT DESCENT ALGORITHM
When the slope of the line is negative, ∂J(θ1)/∂θ1 < 0:
θ1 := θ1 − α·(negative value)
⇒ increase the value of θ1 by some quantity.
GRADIENT DESCENT ALGORITHM
When the slope of the line is positive, ∂J(θ1)/∂θ1 > 0:
θ1 := θ1 − α·(positive value)
⇒ decrease the value of θ1 by some quantity.
GRADIENT DESCENT ALGORITHM
When the slope of the line is 0, ∂J(θ1)/∂θ1 = 0:
θ1 := θ1 − α·0
⇒ no change.
GRADIENT DESCENT ALGORITHM
While not converged {
  θ0 := θ0 + α·(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))
  θ1 := θ1 + α·(1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))·x⁽ⁱ⁾
}
LINEAR REGRESSION WITH GRADIENT DESCENT
Linear regression model:
h_θ(x⁽ⁱ⁾) = θ0 + θ1·x⁽ⁱ⁾
J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾))²
Gradient descent algorithm:
While not converged {
  for j = 0 to 1:
    θj := θj − α·∂J(θ0, θ1)/∂θj
}
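A minimal NumPy sketch of this loop, run on the Question 1 data (the learning rate and iteration count are illustrative choices, not values from the slides):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1, alpha = 0.0, 0.0, 0.05
for _ in range(20000):
    err = y - (theta0 + theta1 * x)        # y_i - h_theta(x_i)
    theta0 += alpha * err.mean()           # update rule for theta0
    theta1 += alpha * (err * x).mean()     # update rule for theta1

print(round(theta0, 3), round(theta1, 3))  # ~2.2 0.9, matching the closed form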
GRADIENT DESCENT ALGORITHM
Types of Gradient Descent Algorithm
Stochastic gradient descent (SGD):
SGD randomly picks one data point from the whole dataset at each iteration.
Batch gradient descent:
Every step of gradient descent uses all the training examples.
Mini-batch gradient descent:
A balance between the robustness of batch gradient descent and the speed of SGD; it samples a small number of data points, instead of just one, at each step (see the sketch below).
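A sketch of the mini-batch variant on the same toy data (batch size, learning rate, and epoch count are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

theta0, theta1, alpha, batch = 0.0, 0.0, 0.05, 2
for epoch in range(5000):
    order = rng.permutation(len(x))          # shuffle once per epoch
    for start in range(0, len(x), batch):
        idx = order[start:start + batch]     # a small random sample of points
        err = y[idx] - (theta0 + theta1 * x[idx])
        theta0 += alpha * err.mean()
        theta1 += alpha * (err * x[idx]).mean()

print(theta0, theta1)   # approximately 2.2 and 0.9 (mini-batch noise makes it approximate)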
COEFFICIENT OF DETERMINATION (r²)
Quantifies the goodness of a fit.
r² is a measure of how closely each data point fits the regression line.
In other words, it represents the fraction of variance in the dependent variable (response) that is explained by the regression model.
R-squared is a way of measuring how much better than the mean line you have done, based on the summed squared error.
Our objective is to do better than the mean: a good regression line gives a lower sum of squared errors than the horizontal (mean) line.
Ideally you would have zero regression error, i.e. your regression line would perfectly match the data; in that case you would get an r-squared value of 1.
SS_Regression = Σ (y_i − y_regression)²   (residual: actual minus predicted)
SS_Total = Σ (y_i − ȳ)²
SS_Explained = Σ (y_regression − ȳ)²
r² = 1 − SS_Regression / SS_Total
[Plot: an actual point y_i, its fitted value y_regression, the mean line ȳ, and the intercept]
EXAMPLE
Regression line: y = 6x − 5
x    y    (y − ȳ)²    ŷ = 6x − 5    y − ŷ    (y − ŷ)²
0    0      169           −5          5        25
1    1      144            1          0         0
2    4       81            7         −3         9
3    9       16           13         −4        16
4   16        9           19         −3         9
5   25      144           25          0         0
6   36      529           31          5        25
Average ȳ = 13
SS_Total = Σ(y − ȳ)² = 1092; SS_Regression = Σ(y − ŷ)² = 84
R-squared = 1 − 84/1092 ≈ 0.923
Source: [Link]
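A quick NumPy check of this r² computation (using r² = 1 − SS_Regression / SS_Total as above):

import numpy as np

x = np.arange(7.0)          # x = 0 .. 6
y = x ** 2                  # y = 0, 1, 4, 9, 16, 25, 36, the table's Y column
y_hat = 6 * x - 5           # the regression line y = 6x - 5

ss_total = np.sum((y - y.mean()) ** 2)   # 1092
ss_resid = np.sum((y - y_hat) ** 2)      # 84
print(1 - ss_resid / ss_total)           # ~0.923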