
Om Namo Bhagavate Vasudevaya

23ECE216 Machine Learning


Logistic Regression

Dr. Binoy B Nair


Logistic Regression
• Logistic regression is the appropriate regression analysis to
conduct when the dependent variable is dichotomous (binary).
• Like all regression analyses, the logistic regression is a predictive
analysis.
• Logistic regression is used to describe data and to explain the
relationship between one dependent binary variable and one or
more nominal, ordinal, interval or ratio-level independent
variables.
• Remember: although the name of the algorithm contains ‘regression’,
it is used for classification.
What is logistic regression?
• Logistic regression gets its name from the logistic function (also
known as the logit function or the sigmoid function) that it uses to
model the probability of a binary outcome.
• This function maps any real-valued number into a value between 0 and
1, which is ideal for predicting probabilities.
• The term "regression" is used because logistic regression, like linear
regression, is a type of regression analysis.
• It estimates the parameters of a model to best fit the data.
• However, instead of predicting a continuous outcome, logistic
regression predicts the probability of a binary outcome (e.g., yes/no,
true/false)
Multiple Linear Regression vs. Logistic Regression (LOGIT)

Multiple Linear Regression:
• Used to make predictions about an unknown event from known evidence
• Decision variable is continuous
• Assumes a linear relationship
• Uses least squares estimation (computes coefficients that minimize the residuals over all cases)
• Assumes normally distributed variables
• Assumes equal variance

Logistic Regression (LOGIT):
• Used to determine which variables affect the probability of a particular outcome
• Decision variable is categorical
• Doesn’t assume a linear relationship, but rather a logit transformation
• Uses maximum likelihood estimation (when the dependent variable is not normally distributed, ML estimates are preferred to OLS estimates because they are unbiased)
• Doesn’t assume a normal distribution or equal variance
• Less stringent requirements
Types of Logistic Regression
• Binary Logistic Regression
• The categorical response has only two possible outcomes. Example: Spam or Not Spam.
• Multinomial Logistic Regression
• Three or more categories without ordering. Example: Predicting which food is preferred (Veg, Non-Veg, Vegan).
• Ordinal Logistic Regression
• Three or more categories with ordering. Example: Movie rating from 1 to 5.
Logistic Function
• Logistic regression is named for the function used at the core of
the method, the logistic function.
• The logistic function, also called the logit function or the
sigmoid function, was developed by statisticians to describe
properties of population growth in ecology, rising quickly and
maxing out at the carrying capacity of the environment.
Logistic Function
• It’s an S-shaped curve that can take any real-valued number and
map it into a value between 0 and 1, but never exactly at those
limits.

$$f(x) = \frac{1}{1 + e^{-x}}$$
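As a quick illustration (a minimal sketch, not part of the original slides; only the Python standard library is assumed), the function below evaluates the logistic function for a few sample inputs to show how it squashes any real number into (0, 1):

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Large negative inputs approach 0, large positive inputs approach 1,
# and 0 maps to exactly 0.5 -- the limits 0 and 1 are never reached.
for x in [-6, -2, 0, 2, 6]:
    print(f"f({x:>2}) = {logistic(x):.4f}")
```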
Logistic Regression from the logistic function
• Input values (x) are combined linearly using weights or coefficient
values (referred to by the Greek letter $\beta$) to predict an output value (y).
• A key difference from linear regression is that the output value being
modeled is a binary value (0 or 1) rather than a numeric value.

• Below is an example logistic regression equation:

$$y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
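For instance, the short sketch below plugs assumed coefficient values into this equation to turn an input x into a predicted probability. The values $\beta_0 = -4$ and $\beta_1 = 1$ are hypothetical, chosen only to illustrate how the equation is evaluated:

```python
import math

def predict_probability(x, beta0, beta1):
    """Single-feature logistic regression: y = 1 / (1 + e^-(beta0 + beta1*x))."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Hypothetical coefficients, purely for illustration.
beta0, beta1 = -4.0, 1.0
for x in [1, 3, 5, 7]:
    print(f"x = {x}: P(y=1) = {predict_probability(x, beta0, beta1):.3f}")
```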
Important Points
• Logistic regression models the probability of the default class (e.g. the first class).

• Note: we do not explicitly consider the second class, since the idea is that if the
probability of the sample belonging to the first class is small, it means, by default,
that the sample has a correspondingly higher probability of belonging to the second class.

• For example, if we are modeling people’s gender as male or female from their height,
then the first class could be male, and the logistic regression model could be written
as the probability of male given a person’s height, or more formally:

$$P(\text{gender} = \text{male} \mid \text{height})$$

• Written another way, we are modeling the probability that an input (X) belongs to the
default class ($y = 1$); we can write this formally as:

$$P(X) = P(Y = 1 \mid X)$$
Simple Linear Regression

Example Reference: Super Data Science


Where linear regression does not work
• A company has sent an offer by email to its customers. Apply linear
regression to this problem. Does it work?
Logistic Regression
Logistic Regression – Logit Function
Logistic Regression
Logistic Regression – Probabilities
Logistic Regression – Prediction
Advantages
• Logistic Regression performs well when the dataset is linearly
separable.
• Logistic regression is less prone to over-fitting but it can overfit in
high dimensional datasets.
• Logistic Regression not only gives a measure of how relevant a
predictor (coefficient size) is, but also its direction of association
(positive or negative).
• Logistic regression is easy to implement and interpret, and very
efficient to train.
Disadvantages
• Main limitation of Logistic Regression is the assumption of linearity between
the dependent variable and the independent variables.
• In the real world, the data is rarely linearly separable.
• If the number of observations is less than the number of features, Logistic
Regression should not be used; otherwise, it may lead to overfitting.
• Logistic Regression can only be used to predict discrete outcomes; therefore,
the dependent variable of Logistic Regression is restricted to a discrete set
of values.
Summary of Logistic Regression
1. Understanding Logistic Regression
Logistic regression is used for binary classification problems, where the outcome is either 0 or 1
(e.g., spam or not spam).

2. Logistic Function (Sigmoid Function)

The logistic function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of the input features.

3. Model Representation
In logistic regression, we predict the probability that the output $y$ is 1 given the input features $\boldsymbol{x}$:

$$P(y = 1 \mid \boldsymbol{x}) = \sigma(\boldsymbol{w}^T \boldsymbol{x} + b) = \frac{1}{1 + e^{-(\boldsymbol{w}^T \boldsymbol{x} + b)}}$$

where:
• $\boldsymbol{w}$ is the weight vector.
• $\boldsymbol{x}$ is the feature vector.
• $b$ is the bias term.
Summary of Logistic Regression
3. Model Representation (continued)
Note that $P(y = 1 \mid \boldsymbol{x})$ can also be represented using the letter $p$.

Taking natural logarithms, the function $P(y = 1 \mid \boldsymbol{x}) = \frac{1}{1 + e^{-(\boldsymbol{w}^T \boldsymbol{x} + b)}}$ can also be written as:

$$\ln\left(\frac{p}{1 - p}\right) = \boldsymbol{w}^T \boldsymbol{x} + b$$

where:
• $\boldsymbol{w}$ is the weight vector.
• $\boldsymbol{x}$ is the feature vector.
• $b$ is the bias term.
• $\frac{p}{1 - p}$ is called the odds (the ratio of the probability of success, i.e. getting 1, to the probability of failure).

• You may find this representation in a few textbooks.
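A small numerical check of this equivalence can be sketched as below (the values of w, x and b are arbitrary, chosen only for illustration): taking the log-odds of the sigmoid output recovers $\boldsymbol{w}^T \boldsymbol{x} + b$.

```python
import numpy as np

w = np.array([0.8, -0.3])   # arbitrary example weights
x = np.array([1.5, 2.0])    # arbitrary example features
b = 0.25

z = w @ x + b                       # linear combination w^T x + b
p = 1.0 / (1.0 + np.exp(-z))        # P(y = 1 | x)
log_odds = np.log(p / (1.0 - p))    # ln(p / (1 - p))

print(z, log_odds)  # both print the same value: ln(p/(1-p)) = w^T x + b
```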


Summary of Logistic Regression
4. Cost Function

The cost function for logistic regression is the log-loss (binary cross-entropy):

$$J(\boldsymbol{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln h_w(x_i) + (1 - y_i) \ln\big(1 - h_w(x_i)\big) \right]$$

where:
• $m$ = number of training samples
• $h_w(x) = \sigma(\boldsymbol{w}^T \boldsymbol{x} + b) = \dfrac{1}{1 + e^{-(\boldsymbol{w}^T \boldsymbol{x} + b)}}$ is the hypothesis.
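As a concrete sketch (NumPy assumed; the function and variable names are ours, not from the slides), the log-loss above can be computed over a batch of predictions like this:

```python
import numpy as np

def log_loss(y, h):
    """Binary cross-entropy J(w, b) averaged over m samples.

    y : array of true labels (0 or 1)
    h : array of predicted probabilities h_w(x_i)
    """
    m = len(y)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# With every prediction at 0.5 (as in iteration 1 of the worked example below),
# the cost is -ln(0.5) = 0.6931 regardless of the labels.
y = np.array([0, 0, 1, 1])
h = np.array([0.5, 0.5, 0.5, 0.5])
print(log_loss(y, h))  # ~0.6931
```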
Summary of Logistic Regression

5. Gradient Descent

To minimize the cost function, we use gradient descent.

The updates for the weights and bias are:

$$\boldsymbol{w} = \boldsymbol{w} - \alpha \frac{\partial J(\boldsymbol{w}, b)}{\partial \boldsymbol{w}}$$

$$b = b - \alpha \frac{\partial J(\boldsymbol{w}, b)}{\partial b}$$

where $\alpha$ is the learning rate.


Summary of Logistic Regression
6. Partial Derivatives
The partial derivatives of the cost function with respect to $\boldsymbol{w}$ and $b$ are:

$$\frac{\partial J(\boldsymbol{w}, b)}{\partial \boldsymbol{w}} = \frac{1}{m} \sum_{i=1}^{m} \big( h_w(x_i) - y_i \big)\, x_i$$

$$\frac{\partial J(\boldsymbol{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \big( h_w(x_i) - y_i \big)$$
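Putting sections 5 and 6 together, a single batch gradient-descent update can be sketched as follows (NumPy assumed; this mirrors the formulas above for the single-feature case, so w is a scalar here):

```python
import numpy as np

def gradient_step(x, y, w, b, alpha):
    """One batch gradient-descent update for single-feature logistic regression."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(w * x + b)))   # hypothesis h_w(x_i) for every sample
    dw = (1.0 / m) * np.sum((h - y) * x)     # dJ/dw
    db = (1.0 / m) * np.sum(h - y)           # dJ/db
    return w - alpha * dw, b - alpha * db    # updated parameters
```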

7. Prediction
To make a prediction, we compute the probability:

$$P(y = 1 \mid x) = \sigma(wx + b)$$

If the probability is greater than 0.5, we predict $y = 1$; otherwise, we predict $y = 0$.
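This decision rule can be written as a short helper (a sketch; the 0.5 threshold is the one stated above):

```python
import math

def predict(x, w, b, threshold=0.5):
    """Return 1 if P(y = 1 | x) exceeds the threshold, else 0."""
    p = 1.0 / (1.0 + math.exp(-(w * x + b)))
    return 1 if p > threshold else 0
```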


Example

Q. Let's consider the simple example with one feature (x) and a binary
outcome (y) below. Use logistic regression to model this dataset and find
the class to which each of the samples in the training set is assigned by
the model (in-sample testing).

Feature (x)   Class (y)
2             0
3             0
5             1
7             1

(Note: the two classes are represented using the numbers 0 and 1.)

Steps:

1. Initialize the weights and bias: $w = 0$, $b = 0$.

2. Initialize the learning rate: $\alpha = 0.1$.

3. The hypothesis: $h_w(x) = \sigma(wx + b) = \dfrac{1}{1 + e^{-(wx + b)}}$
Example
Iteration 1:

Step 1.1: Compute z for each sample:
$z_1 = w x_1 + b = 0 \times 2 + 0 = 0$
$z_2 = w x_2 + b = 0 \times 3 + 0 = 0$
$z_3 = w x_3 + b = 0 \times 5 + 0 = 0$
$z_4 = w x_4 + b = 0 \times 7 + 0 = 0$

Step 1.2: Compute the hypothesis h for each $z_i$:
$h_1 = \sigma(z_1) = \dfrac{1}{1 + e^{-0}} = 0.5$
$h_2 = \sigma(z_2) = \dfrac{1}{1 + e^{-0}} = 0.5$
$h_3 = \sigma(z_3) = \dfrac{1}{1 + e^{-0}} = 0.5$
$h_4 = \sigma(z_4) = \dfrac{1}{1 + e^{-0}} = 0.5$
Example

Step 1.3: Compute the cost function:

$J(w, b) = -\dfrac{1}{4} \sum_{i=1}^{4} \left[ y_i \ln h_i + (1 - y_i) \ln(1 - h_i) \right]$
$= -\dfrac{1}{4} \big[ 0 \cdot \ln 0.5 + (1 - 0)\ln(1 - 0.5) + 0 \cdot \ln 0.5 + (1 - 0)\ln(1 - 0.5) + 1 \cdot \ln 0.5 + (1 - 1)\ln(1 - 0.5) + 1 \cdot \ln 0.5 + (1 - 1)\ln(1 - 0.5) \big]$
$= -\dfrac{1}{4} \left[ \ln 0.5 + \ln 0.5 + \ln 0.5 + \ln 0.5 \right]$
$= 0.6931$
Example

Step 1.4: Compute the gradients:

$\dfrac{\partial J(w, b)}{\partial w} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)\, x_i$
$= \dfrac{1}{4} \left[ (0.5 - 0) \times 2 + (0.5 - 0) \times 3 + (0.5 - 1) \times 5 + (0.5 - 1) \times 7 \right] = \dfrac{1}{4} (1 + 1.5 - 2.5 - 3.5) = -0.875$

$\dfrac{\partial J(w, b)}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)$
$= \dfrac{1}{4} \left[ (0.5 - 0) + (0.5 - 0) + (0.5 - 1) + (0.5 - 1) \right] = \dfrac{1}{4} (0.5 + 0.5 - 0.5 - 0.5) = 0$
Example

Step 1.5: Update the weights:

$w = w - \alpha \dfrac{\partial J(w, b)}{\partial w} = 0 - 0.1 \times (-0.875) = 0.0875$

$b = b - \alpha \dfrac{\partial J(w, b)}{\partial b} = 0 - 0.1 \times 0 = 0$

Now use these updated weights in the next iteration.
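As a cross-check, the first iteration above can be reproduced in a few lines of NumPy (a sketch; the printed values should match the hand-computed h = 0.5, J ≈ 0.6931, ∂J/∂w = −0.875, ∂J/∂b = 0, w = 0.0875, b = 0):

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0])   # feature values from the example
y = np.array([0.0, 0.0, 1.0, 1.0])   # class labels
w, b, alpha = 0.0, 0.0, 0.1          # initial values and learning rate

h = 1.0 / (1.0 + np.exp(-(w * x + b)))                 # all 0.5 in iteration 1
J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))  # cost, ~0.6931
dw = np.mean((h - y) * x)                              # -0.875
db = np.mean(h - y)                                    # 0.0
w, b = w - alpha * dw, b - alpha * db                  # 0.0875, 0.0

print(h, J, dw, db, w, b)
```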


Example
Iteration 2:

Step 2.1: Compute z for each sample:
$z_1 = w x_1 + b = 0.0875 \times 2 + 0 = 0.175$
$z_2 = w x_2 + b = 0.0875 \times 3 + 0 = 0.2625$
$z_3 = w x_3 + b = 0.0875 \times 5 + 0 = 0.4375$
$z_4 = w x_4 + b = 0.0875 \times 7 + 0 = 0.6125$

Step 2.2: Compute the hypothesis h for each $z_i$:
$h_1 = \sigma(z_1) = \dfrac{1}{1 + e^{-0.175}} = 0.5436$
$h_2 = \sigma(z_2) = \dfrac{1}{1 + e^{-0.2625}} = 0.5652$
$h_3 = \sigma(z_3) = \dfrac{1}{1 + e^{-0.4375}} = 0.6077$
$h_4 = \sigma(z_4) = \dfrac{1}{1 + e^{-0.6125}} = 0.6485$
Example

Step 2.3: Compute the cost:

$J(w, b) = -\dfrac{1}{4} \sum_{i=1}^{4} \left[ y_i \ln h_i + (1 - y_i) \ln(1 - h_i) \right]$
$= -\dfrac{1}{4} \big[ 0 \cdot \ln 0.5436 + (1 - 0)\ln(1 - 0.5436) + 0 \cdot \ln 0.5652 + (1 - 0)\ln(1 - 0.5652) + 1 \cdot \ln 0.6077 + (1 - 1)\ln(1 - 0.6077) + 1 \cdot \ln 0.6485 + (1 - 1)\ln(1 - 0.6485) \big]$
$= -\dfrac{1}{4} \left[ \ln 0.4564 + \ln 0.4348 + \ln 0.6077 + \ln 0.6485 \right]$
$= 0.6372$
Example

Step 2.4: Compute the gradients:

$\dfrac{\partial J(w, b)}{\partial w} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)\, x_i$
$= \dfrac{1}{4} \left[ (0.5436 - 0) \times 2 + (0.5652 - 0) \times 3 + (0.6077 - 1) \times 5 + (0.6485 - 1) \times 7 \right]$
$= \dfrac{1}{4} (1.0872 + 1.6956 - 1.9615 - 2.4545) = -0.4083$

$\dfrac{\partial J(w, b)}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)$
$= \dfrac{1}{4} \left[ (0.5436 - 0) + (0.5652 - 0) + (0.6077 - 1) + (0.6485 - 1) \right] = 0.09075$
Example

Step 2.5: Update the weights:

$w = w - \alpha \dfrac{\partial J(w, b)}{\partial w} = 0.0875 - 0.1 \times (-0.4083) = 0.1283$

$b = b - \alpha \dfrac{\partial J(w, b)}{\partial b} = 0 - 0.1 \times 0.09075 = -0.0091$

Iteration 3:

Step 3.1: Compute z for each sample:
$z_1 = w x_1 + b = 0.1283 \times 2 - 0.0091 = 0.2475$
$z_2 = w x_2 + b = 0.1283 \times 3 - 0.0091 = 0.3758$
$z_3 = w x_3 + b = 0.1283 \times 5 - 0.0091 = 0.6224$
$z_4 = w x_4 + b = 0.1283 \times 7 - 0.0091 = 0.8690$
Example

Step 3.2: Compute the hypothesis h for each $z_i$:
$h_1 = \sigma(z_1) = \dfrac{1}{1 + e^{-0.2475}} = 0.5615$
$h_2 = \sigma(z_2) = \dfrac{1}{1 + e^{-0.3758}} = 0.5928$
$h_3 = \sigma(z_3) = \dfrac{1}{1 + e^{-0.6224}} = 0.6508$
$h_4 = \sigma(z_4) = \dfrac{1}{1 + e^{-0.8690}} = 0.7045$
Example

Step 3.3: Compute the cost:

$J(w, b) = -\dfrac{1}{4} \sum_{i=1}^{4} \left[ y_i \ln h_i + (1 - y_i) \ln(1 - h_i) \right]$
$= -\dfrac{1}{4} \big[ 0 \cdot \ln 0.5615 + (1 - 0)\ln(1 - 0.5615) + 0 \cdot \ln 0.5928 + (1 - 0)\ln(1 - 0.5928) + 1 \cdot \ln 0.6508 + (1 - 1)\ln(1 - 0.6508) + 1 \cdot \ln 0.7045 + (1 - 1)\ln(1 - 0.7045) \big]$
$= -\dfrac{1}{4} \left[ \ln 0.4385 + \ln 0.4072 + \ln 0.6508 + \ln 0.7045 \right]$
$= 0.6257$
Example

Step 3.4: Compute the gradients:

$\dfrac{\partial J(w, b)}{\partial w} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)\, x_i$
$= \dfrac{1}{4} \left[ (0.5615 - 0) \times 2 + (0.5928 - 0) \times 3 + (0.6508 - 1) \times 5 + (0.7045 - 1) \times 7 \right] = -0.2283$

$\dfrac{\partial J(w, b)}{\partial b} = \dfrac{1}{m} \sum_{i=1}^{m} (h_i - y_i)$
$= \dfrac{1}{4} \left[ (0.5615 - 0) + (0.5928 - 0) + (0.6508 - 1) + (0.7045 - 1) \right] = 0.1247$
Example

Step 3.5: Update the weights:

$w = w - \alpha \dfrac{\partial J(w, b)}{\partial w} = 0.1283 - 0.1 \times (-0.2283) = 0.1511$

$b = b - \alpha \dfrac{\partial J(w, b)}{\partial b} = -0.0091 - 0.1 \times 0.1247 = -0.02157$

• And so on…

• After about 149 iterations, the values are $w = 0.5771$ and $b = -1.9037$.

• Note that the 149th iteration is taken just for illustration; the steepest
descent algorithm converges at a much later iteration.
Example

Final Step: Prediction:

$P(y = 1 \mid x) = \dfrac{1}{1 + e^{-(wx + b)}}$. Assign $\hat{y} = 1$ if $P(y = 1 \mid x) > 0.5$, else $\hat{y} = 0$.

• $P(y = 1 \mid x = 2) = \dfrac{1}{1 + e^{-(0.5771 \times 2 - 1.9037)}} = \dfrac{1}{1 + e^{0.7495}} = 0.3209$. Since $0.3209 < 0.5$, we have $\hat{y} = 0$.
• $P(y = 1 \mid x = 3) = \dfrac{1}{1 + e^{-(0.5771 \times 3 - 1.9037)}} = \dfrac{1}{1 + e^{0.1724}} = 0.4570$. Since $0.4570 < 0.5$, we have $\hat{y} = 0$.
• $P(y = 1 \mid x = 5) = \dfrac{1}{1 + e^{-(0.5771 \times 5 - 1.9037)}} = \dfrac{1}{1 + e^{-0.9818}} = 0.7275$. Since $0.7275 > 0.5$, we have $\hat{y} = 1$.
• $P(y = 1 \mid x = 7) = \dfrac{1}{1 + e^{-(0.5771 \times 7 - 1.9037)}} = \dfrac{1}{1 + e^{-2.1360}} = 0.8944$. Since $0.8944 > 0.5$, we have $\hat{y} = 1$.

We see from the above results that the logistic regression model is able to
correctly predict the class for all four samples.
Hence, in-sample testing is complete.
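For completeness, the whole worked example can be reproduced with the short training loop sketched below (NumPy assumed; variable names are ours). Run for the roughly 149 iterations mentioned above, it should give parameters close to the reported w = 0.5771 and b = −1.9037 and the same four class assignments, though the exact digits may differ slightly with rounding:

```python
import numpy as np

x = np.array([2.0, 3.0, 5.0, 7.0])   # feature values from the example
y = np.array([0.0, 0.0, 1.0, 1.0])   # class labels

w, b, alpha = 0.0, 0.0, 0.1          # initialisation used in the example

for _ in range(149):                          # iteration count quoted in the slides
    h = 1.0 / (1.0 + np.exp(-(w * x + b)))    # hypothesis for every sample
    w -= alpha * np.mean((h - y) * x)         # gradient step for w
    b -= alpha * np.mean(h - y)               # gradient step for b

p = 1.0 / (1.0 + np.exp(-(w * x + b)))        # in-sample probabilities
y_hat = (p > 0.5).astype(int)                 # decision rule with 0.5 threshold

print(f"w = {w:.4f}, b = {b:.4f}")            # expected near 0.5771 and -1.9037
print("P(y=1|x):", np.round(p, 4))
print("predicted classes:", y_hat)            # expected: [0 0 1 1]
```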
References
• Tushar B. Kute, Logistic Regression, http://tusharkute.com
• https://machinelearningmastery.com
