Logistic Regression - Class 3
Manoj Kumar
Given data
Blood Pressure Level (mm Hg)    Diabetes Status
80                              Not Diabetic
90                              Not Diabetic
100                             Not Diabetic
110                             Not Diabetic
120                             Diabetic
130                             Diabetic
140                             Diabetic
150                             Diabetic
Why is Linear Regression not good for classification?
Consider fitting ordinary linear regression to the table above, treating the labels as the numbers 0 (Not Diabetic) and 1 (Diabetic). The fitted line is unbounded: it produces outputs below 0 and above 1, which cannot be read as probabilities. It is also sensitive to outliers, so a few extreme blood pressure readings can shift the line, and with it the implied classification threshold. A bounded, S-shaped function of the input is a better fit for a binary target.
Sigmoid/Logistic Function

Sigmoid Function: σ(a) = 1 / (1 + exp(−a))

• Bounded:
  σ(a) = 1 / (1 + exp(−a)) ∈ (0, 1)
• Symmetric:
  1 − σ(a) = exp(−a) / (1 + exp(−a)) = 1 / (exp(a) + 1) = σ(−a)
• Gradient:
  σ′(a) = exp(−a) / (1 + exp(−a))² = σ(a)(1 − σ(a))

(Plot: σ(a) against a; the curve passes through 0.5 at a = 0 and saturates at 0 and 1.)
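These three properties can be checked numerically. Below is a minimal sketch in Python (NumPy assumed; the names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a))"""
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-10.0, 10.0, 1001)
s = sigmoid(a)

assert np.all((s > 0) & (s < 1))           # bounded: sigma(a) in (0, 1)
assert np.allclose(1.0 - s, sigmoid(-a))   # symmetric: 1 - sigma(a) = sigma(-a)

# gradient identity sigma'(a) = sigma(a) * (1 - sigma(a)),
# checked against a central finite difference
eps = 1e-5
fd = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
assert np.allclose(s * (1.0 - s), fd, atol=1e-8)
```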
Interpretation of the Sigmoid Function
Question
Given a logistic regression model with parameter vector θ = (3, 4, 1)ᵀ and a test data point x = (1, 1, 1)ᵀ, what is the probability of Y = 1 for the given data point x?
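Assuming the standard model P(Y = 1|x) = σ(θᵀx), the computation is θᵀx = 3 + 4 + 1 = 8, so P(Y = 1|x) = σ(8) = 1 / (1 + e⁻⁸) ≈ 0.9997.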
Question
Suppose that you have trained a logistic regression classifier hθ (x ) = σ(1 − x ) where σ(·) is the
logistic/sigmoid function. What does its output on a new example x = 2 mean? Check all that
apply.
• □ Your estimate for P(y = 1|x ; θ) is about 0.73.
• □ Your estimate for P(y = 0|x ; θ) is about 0.27.
• □ Your estimate for P(y = 1|x ; θ) is about 0.27.
• □ Your estimate for P(y = 0|x ; θ) is about 0.73.
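A worked evaluation: hθ(2) = σ(1 − 2) = σ(−1) = 1 / (1 + e) ≈ 0.27, so the output is the model's estimate of P(y = 1|x; θ) ≈ 0.27, which also implies P(y = 0|x; θ) ≈ 0.73.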
Question
Let σ(a) = 1/3. Using the properties of the sigmoid function, calculate the value of the expression σ′(−a), where ′ denotes the derivative.
1 2/9
2 −2/9
3 1/9
4 −1/9
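A worked application of the two sigmoid properties: σ(−a) = 1 − σ(a) = 2/3, and σ′(−a) = σ(−a)(1 − σ(−a)) = (2/3)(1/3) = 2/9.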
Question
Q3-2: Which of the following statements is true about outliers in linear regression?
1 Linear regression is sensitive to outliers
2 Linear regression is NOT sensitive to outliers
3 Can’t say
4 None of these
Question
Suppose we have trained a logistic regression classifier for a binary classification task. The table
below provides the true labels y and the predicted probabilities P(Y = 1 | x ) for a set of data
points. We want to evaluate the accuracy of the classifier for the following thresholds:
• Model A: T = 0.25
• Model B: T = 0.5
• Model C: T = 0.75
Calculate the accuracy for each model and determine which threshold results in the highest
accuracy.
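The referenced table is not reproduced here, but the evaluation procedure is mechanical. A minimal Python sketch (the y_true and p_hat arrays below are hypothetical placeholders, not the original data):

```python
import numpy as np

# hypothetical stand-ins for the missing table
y_true = np.array([0, 0, 1, 1, 1, 0])                 # true labels y
p_hat  = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])    # predicted P(Y = 1 | x)

for name, T in [("Model A", 0.25), ("Model B", 0.50), ("Model C", 0.75)]:
    y_pred = (p_hat >= T).astype(int)   # classify as 1 when probability >= threshold
    acc = np.mean(y_pred == y_true)     # fraction of correct predictions
    print(f"{name} (T = {T}): accuracy = {acc:.2f}")
```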
Decision Boundary (Using Logistic)
• Predict Y = 1 if P(Y = 1|X) ≥ 0.5.
• For 1 input feature (X), let z = w0 + w1 X. Then

  1 / (1 + e^(−z)) ≥ 1/2
  1 + e^(−(w0 + w1 X)) ≤ 2
  w0 + w1 X ≥ 0

  so Y = 1 if X ≥ −w0/w1 (assuming w1 > 0).

(Plot: the real line split at the decision boundary X = −w0/w1, with Y = 0 to its left and Y = 1 to its right.)
• For 2 input features (X1, X2), let z = w0 + w1 X1 + w2 X2. Then

  1 / (1 + e^(−z)) ≥ 0.5

  Decision Boundary: Y = 1 if w0 + w1 X1 + w2 X2 ≥ 0

(Plot: the line w0 + w1 X1 + w2 X2 = 0 in the (X1, X2)-plane, with Y = 1 on one side and Y = 0 on the other.)
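A minimal sketch of this decision rule in Python (the weights are illustrative):

```python
import numpy as np

# illustrative weights for z = w0 + w1*X1 + w2*X2
w0, w1, w2 = -6.0, 1.0, 1.0

def predict(X1, X2):
    """Predict Y = 1 exactly when sigma(z) >= 0.5, i.e. when z >= 0."""
    z = w0 + w1 * X1 + w2 * X2
    return (z >= 0).astype(int)

X1 = np.array([1.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 3.0])
print(predict(X1, X2))  # [0 0 1]: only (5, 3) satisfies X1 + X2 >= 6
```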
Question
Suppose you train a logistic regression classifier and the learned hypothesis function is:
hθ (x ) = σ(θ0 + θ1 x1 + θ2 x2 ),
where θ0 = 6, θ1 = 0, θ2 = −1. Which of the following represents the decision boundary for
hθ (x )?
(Four candidate plots A, B, C, D of the (x1, x2)-plane with axes from 0 to 10: A has a horizontal boundary with y = 1 above and y = 0 below; B has a vertical boundary with y = 1 on the left and y = 0 on the right; C has a vertical boundary with y = 0 on the left and y = 1 on the right; D has a horizontal boundary with y = 0 above and y = 1 below.)
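A worked reduction: with these parameters, hθ(x) = σ(6 − x2), and σ(6 − x2) ≥ 0.5 exactly when 6 − x2 ≥ 0. The boundary is therefore the horizontal line x2 = 6, with y = 1 predicted below it (x2 ≤ 6) and y = 0 above it.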
Recap: Likelihood
How to Choose Parameters?
• Using MSE
• Maximising Conditional Likelihood
Under the logistic model,

P(Y = 0|X, W) = 1 / (1 + exp(w0 + Σi wi Xi))
P(Y = 1|X, W) = exp(w0 + Σi wi Xi) / (1 + exp(w0 + Σi wi Xi))

The conditional log-likelihood over the training examples l is

l(W) ≡ ln Πl P(Yˡ | Xˡ, W)
     = Σl [ Yˡ (w0 + Σi wi Xiˡ) − ln(1 + exp(w0 + Σi wi Xiˡ)) ]
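A minimal sketch of this objective in Python (names are illustrative; X has one row per training example):

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, Y):
    """l(W) = sum_l [ Y_l * z_l - ln(1 + exp(z_l)) ] with z_l = w0 + w . x_l."""
    z = w0 + X @ w                          # z_l for every training example l
    # np.logaddexp(0, z) computes ln(1 + exp(z)) without overflow
    return np.sum(Y * z - np.logaddexp(0.0, z))

X = np.array([[80.0], [100.0], [120.0], [150.0]])   # e.g. blood pressure readings
Y = np.array([0.0, 0.0, 1.0, 1.0])
print(conditional_log_likelihood(-23.0, np.array([0.2]), X, Y))
```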
Cross-Entropy Loss

For a target t ∈ {0, 1} and a predicted probability y:

L_CE(y, t) = −t log y − (1 − t) log(1 − y)
Question
Consider the following three rows from our training data, along with their predicted probabilities
ŷ for some choice of θ:
hue abv y ŷ
−0.17 0.24 0 0.45
−1.18 1.61 0 0.19
1.25 −0.97 1 0.80
What is the mean cross-entropy loss on just the above three rows of our training data?
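A quick computation of this quantity in Python (only the y and ŷ columns matter for the loss):

```python
import numpy as np

t     = np.array([0.0, 0.0, 1.0])       # true labels y
y_hat = np.array([0.45, 0.19, 0.80])    # predicted probabilities

# mean of -t*log(y_hat) - (1 - t)*log(1 - y_hat) over the three rows
losses = -t * np.log(y_hat) - (1 - t) * np.log(1 - y_hat)
print(losses.mean())  # approximately 0.344
```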
Suppose you are given the following classification task: predict the target Y ∈ {0, 1} given two
real-valued features X1 ∈ R and X2 ∈ R. After some training, you learn the following decision
rule:
1. Plot the decision boundary and label the region where we would predict Y = 1 and Y = 0.
2. Suppose that we learned the above weights using logistic regression. Using this model, what would be our prediction for P(Y = 1|X1, X2)? (You may want to use the sigmoid function σ(x) = 1 / (1 + exp(−x)).)
Question
We consider the following models of logistic regression for a binary classification with a sigmoid function g(z) = 1 / (1 + e^(−z)):
• Model 1: P(Y = 1|X , w1 , w2 ) = g(w1 X1 + w2 X2 )
• Model 2: P(Y = 1|X , w1 , w2 ) = g(w0 + w1 X1 + w2 X2 )
If the label of the third example is changed to -1 , does it affect the learned weights w = (w1 , w2 )
in Model 1 and Model 2?
• It affects both Model 1 and 2
• Neither Model 1 nor 2.
• Model 1
• Model 2
Homework
I have a dataset with R records in which the i-th record has one real-valued input attribute xi and one
real-valued output attribute yi .
We have the following model with one unknown parameter w which we want to learn from data.
yi ∼ N(exp(wxi ), 1)
Note that the variance is known and equal to one.
(b) Suppose you decide to do a maximum likelihood estimation of w . You do the math and figure out
that you need w to satisfy one of the following equations. Which one?
A. Σi xi exp(wxi) = Σi xi yi exp(wxi)
B. Σi xi exp(2wxi) = Σi xi yi exp(wxi)
C. Σi xi² exp(wxi) = Σi xi yi exp(wxi)
D. Σi xi² exp(wxi) = Σi xi yi exp(wxi²)
E. Σi exp(wxi) = Σi yi² exp(wxi)
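As a sanity check, the derivation is short. Up to an additive constant, the log-likelihood under yi ∼ N(exp(wxi), 1) is

ln L(w) = −(1/2) Σi (yi − exp(wxi))²

Setting the derivative with respect to w to zero:

Σi (yi − exp(wxi)) · xi exp(wxi) = 0  ⟹  Σi xi exp(2wxi) = Σi xi yi exp(wxi)

which matches option B.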
Decision Boundary
Question
We consider here a discriminative approach for solving the classification problem illustrated in Figure 1.
Figure 1: The 2-dimensional labeled training set, where ‘+’ corresponds to class y = 1 and ‘O’
corresponds to class y = 0.
We attempt to solve the binary classification task depicted in Figure 1 with the simple linear logistic
regression model:
P(y = 1|x⃗, w⃗) = g(w0 + w1 x1 + w2 x2) = 1 / (1 + exp(−w0 − w1 x1 − w2 x2))
Question
We maximize the penalized conditional log-likelihood

Σᵢ₌₁ⁿ log P(yi | xi, w0, w1, w2) − Cwj²

for very large C. The regularization penalties used in penalized conditional log-likelihood estimation are −Cwj², where j ∈ {0, 1, 2}.
1. How does the training error change with regularization of each parameter wj ? State whether the
training error increases or stays the same (zero) for each wj for very large C . Provide a brief justification
for each of your answers.
(a) By regularizing w2 [ ]
(b) By regularizing w1 [ ]
(c) By regularizing w0 [ ]
Log odds or Logits

The logit (log-odds) of a probability p is logit(p) = ln(p / (1 − p)). Logistic regression models the log-odds as a linear function of the features: ln(P(Y = 1|X) / P(Y = 0|X)) = Xᵀβ.
Generative vs. Discriminative Classifiers
Question
In which of the following situations can logistic regression be used? Select all that apply.
1 Predicting whether an email is a spam email or not based on its contents.
2 Predicting the rainfall depth for a given day in a certain city based on the city’s
historical weather data.
3 Predicting the cost of a house based on features of the house.
4 Predicting if a patient has a disease or not based on the patient’s symptoms and
medical history.
Question
Suppose you are given θ for the logistic regression model to predict whether a tumor is
malignant (y = 1) or benign (y = 0) based on features of the tumor x . If you get a new
patient x∗ and find that x∗ᵀθ > 0, what can you say about the tumor? Select only one.
1 The tumor is benign
2 The tumor is more likely benign
3 The tumor is more likely to be malignant
4 The tumor is malignant
Question
Bubble the expression that describes the odds ratio P(Y = 1|X) / P(Y = 0|X) of a logistic regression model.
• ⃝ Xᵀβ
• ⃝ −Xᵀβ
• ⃝ exp(Xᵀβ)
• ⃝ σ(Xᵀβ)
• ⃝ None of these
Question
Bubble the expression that describes P(Y = 0|X) for a logistic regression model.
• ⃝ σ(−Xᵀβ)
• ⃝ 1 − log(1 + exp(Xᵀβ))
• ⃝ 1 + log(1 + exp(−Xᵀβ))
• ⃝ None of these
Question
For a logistic regression model P(Y = 1|X) = σ(−2 − 3X), where X is a scalar random variable, what values of x would give P(Y = 0|X = x) ≥ 3/4?
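One way to work this out: P(Y = 0|X = x) ≥ 3/4 means σ(−2 − 3x) ≤ 1/4, and σ(a) ≤ 1/4 exactly when a ≤ ln(1/3) = −ln 3. So −2 − 3x ≤ −ln 3, i.e. x ≥ (ln 3 − 2)/3 ≈ −0.30.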
Question
Logistic regression:
• ⃝ Minimizes cross-entropy loss
• ⃝ Has a simple, closed-form analytical solution
• ⃝ Models the log-odds as a linear function
• ⃝ Is a classification method to estimate class posterior probabilities
Question
[1 pt] In the discriminative approach to solving classification problems, we model the conditional probability of the labels given the observations.
• ⃝ True
• ⃝ False
Question
Stanford and Berkeley students are trying to solve the same logistic regression problem for a
dataset. The Stanford group claims that their initialization point will lead to a much better
optimum than Berkeley’s initialization point. Stanford is correct.
• ⃝ True
• ⃝ False
Question
In logistic regression, we model the odds ratio p/(1 − p) as a linear function.
• ⃝ True
• ⃝ False
Question
Select always, sometimes, or never to describe when each statement below is true about a
logistic regression model P(Y = 1|X ) = σ(X T β), where Y is binary and X is vector-valued.
X and β are finite.
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) > X T β
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) = P(Y = 0| − X )
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) < 1
• ⃝ Always ⃝ Sometimes ⃝ Never: σ(X T β) ≤ σ(X T (2 · β))
Question
If no regularization is used in logistic regression and the training data is linearly separable, the
optimal model parameters will tend towards positive or negative infinity.
□ True □ False
Question
Suppose we use the following regression model with a single model weight θ and loss function:

fθ(x) = σ(θ − 2)
ℓ(θ, x, y) = −y log fθ(x) − (1 − y) log(1 − fθ(x)) + ½θ²

Derive the stochastic gradient descent update rule for this model and loss function, assuming that the learning rate α = 1. Your answer may only use the following variables: θ^(t+1), θ^(t), y, and the sigmoid function σ. Show all your work within the space provided and draw a box around your final answer.
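A worked sketch of the derivation (note that fθ(x) does not actually depend on x here): with u = θ − 2, the derivative of −y log σ(u) − (1 − y) log(1 − σ(u)) with respect to θ is σ(u) − y, so

dℓ/dθ = σ(θ − 2) − y + θ

and with α = 1 the update is

θ^(t+1) = θ^(t) − (σ(θ^(t) − 2) − y + θ^(t)) = y − σ(θ^(t) − 2)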
Question
Suppose you have a logistic regression model for spam detection, using a dataset with a binary
outcome that indicates whether an email is spam (1) or not spam (0). The predictor variables
x1 , x2 , and x3 are boolean values (0 or 1) that indicate whether the email contains the words
"free", "order", and "homework", respectively. The model has four parameters: weights w1 ,
w2 , w3 , and offset b. You find that emails containing the words "free" and "order" have a
higher probability of being spam, while emails containing the word "homework" have a lower
probability of being spam. Given this information, which of the following signs is most likely
for the weights w1 , w2 , and w3 ?
(A) All positive
(B) All negative
(C) w1 and w2 are positive, w3 is negative
(D) w1 and w2 are negative, w3 is positive
Question
You have been tasked with performing an analysis on customers of a credit card company.
Specifically, you will be developing a classification model to classify whether or not specific
customers will fail to pay their next credit card payment. You decide to approach this problem
with a logistic regression classifier. The first 5 rows of our data are shown below.
The numerical data in the education and marriage columns correspond to the following categories:
• Education: 1 - graduate school; 2 - university; 3 - high school; 4 - other
• Marriage: 1 - married; 2 - single; 3 - other
Question
Our response variable, labeled as failed payment, can have values of 0 (makes their next
payment) or 1 (fails to make their next payment). You use the logistic regression model
ŷ = P(Y = 1|x) = σ(xᵀθ). Assume that the following value of θ̂ minimizes unregularized mean cross-entropy loss for this dataset:
This specific customer fortunately made their next payment on time! Compute the cross-entropy loss of the prediction in part (a). Leave your answer in terms of σ.
Question
How does a one-unit increase in age impact the log-odds of making a failed payment? Give a
precise, numerical answer, not just it increases or it decreases.
Question
Let’s consider all customers who are married and whose highest level of education is high
school. What is the minimum age of such a customer, such that they are more likely to fail
their next payment than make their next payment, under our logistic regression model?
Question
Suppose you choose a threshold T = 0.8. The decision boundary of the resulting classifier is
of the form:
Suppose with the above threshold you achieve a training accuracy of 100%. Can you conclude
your training data was linearly separable in the feature space? Answer yes or no, and explain
in one sentence.
Thank you!