Logistic Regression

Manoj Kumar

21st October, 2024

Manoj Kumar
Logistic Regression 1 / 66
Given data

Blood Pressure Level (mm Hg)   Diabetes Status
80                             Not Diabetic
90                             Not Diabetic
100                            Not Diabetic
110                            Not Diabetic
120                            Diabetic
130                            Diabetic
140                            Diabetic
150                            Diabetic

Table 1: Blood Pressure Levels and Diabetes Status

Why is Linear Regression not good for classification?

• The output ŷ can fall outside the label range {0, 1}.
• What would a response value of −2 mean?
• Possible classification rule: assign ŷ > 0.5 to 1, and 0 otherwise.
• The resulting boundary is very sensitive to outliers.
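These points are easy to see numerically. Below is a minimal sketch (assuming NumPy is available) that fits ordinary least squares to the blood-pressure table above and shows predictions landing outside [0, 1]:

```python
import numpy as np

# Blood pressure readings and diabetes labels from the table above
x = np.array([80, 90, 100, 110, 120, 130, 140, 150], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares fit: y_hat = w1 * x + w0
w1, w0 = np.polyfit(x, y, deg=1)

y_hat = w0 + w1 * x
print(y_hat[0])   # below 0: meaningless as a probability
print(y_hat[-1])  # above 1: also meaningless
```

On this data the fitted line predicts about −0.17 at 80 mm Hg and about 1.17 at 150 mm Hg, illustrating the first bullet; adding a single extreme reading would also shift the 0.5 crossing noticeably.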

Sigmoid/Logistic Function

• Bounded:

  σ(a) = 1 / (1 + exp(−a)) ∈ (0, 1)

• Symmetric:

  1 − σ(a) = exp(−a) / (1 + exp(−a)) = 1 / (exp(a) + 1) = σ(−a)

• Gradient:

  σ′(a) = exp(−a) / (1 + exp(−a))² = σ(a)(1 − σ(a))

(Figure: plot of σ(a), an S-shaped curve rising from 0 to 1 and crossing 0.5 at a = 0.)
Interpretation of the Sigmoid Function

Note: For a given input x, the probability that the class label Y equals 1 given x is

  P(Y = 1|x) = 1 / (1 + e^(−(w0 + w1 x)))

• P(Y = 0|x) + P(Y = 1|x) = 1
• P(Y = 0|x) = 1 − 1 / (1 + e^(−(w0 + w1 x)))

By default, we take threshold = 0.5:
• Y = 1 if P(Y = 1|x) ≥ 0.5
• Y = 0 if P(Y = 1|x) < 0.5

(Figure: sigmoid curve σ(x) rising from 0 toward 1, crossing 0.5 at x = 0.)
Question

Given a logistic regression model with parameter vector θ = (3, 4, 1)ᵀ and a test data point x = (1, 1, 1)ᵀ, what is the probability of Y = 1 for the given data point x?
Question

Suppose that you have trained a logistic regression classifier hθ (x ) = σ(1 − x ) where σ(·) is the
logistic/sigmoid function. What does its output on a new example x = 2 mean? Check all that
apply.
• □ Your estimate for P(y = 1|x ; θ) is about 0.73.
• □ Your estimate for P(y = 0|x ; θ) is about 0.27.
• □ Your estimate for P(y = 1|x ; θ) is about 0.27.
• □ Your estimate for P(y = 0|x ; θ) is about 0.73.
Question

Consider the sigmoid function f(x) = 1 / (1 + e^(−x)). The derivative f′(x) is:
• f(x) ln f(x) + (1 − f(x)) ln(1 − f(x))
• f(x)(1 − f(x))
• f(x) ln(1 − f(x))
• f(x)(1 + f(x))
Question

Let σ(a) = 1/3. Using the properties of the sigmoid function, calculate the value of the expression σ′(−a), where ′ denotes the derivative.
1. 2/9
2. −2/9
3. 1/9
4. −1/9
Question

Which of the following statements is true about outliers in linear regression?
1 Linear regression is sensitive to outliers
2 Linear regression is NOT sensitive to outliers
3 Can’t say
4 None of these
Question

Suppose we have trained a logistic regression classifier for a binary classification task. The table
below provides the true labels y and the predicted probabilities P(Y = 1 | x ) for a set of data
points. We want to evaluate the accuracy of the classifier for the following thresholds:
• Model A: T = 0.25
• Model B: T = 0.5
• Model C: T = 0.75
Calculate the accuracy for each model and determine which threshold results in the highest
accuracy.

True Label y   P(Y = 1 | x)
1              0.9
1              0.6
0              0.6
0              0.55
1              0.4
0              0.3
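A quick Python sketch to check the arithmetic for this table at the three thresholds:

```python
# True labels and predicted P(Y = 1 | x) from the table above
y_true = [1, 1, 0, 0, 1, 0]
p_hat = [0.9, 0.6, 0.6, 0.55, 0.4, 0.3]

def accuracy(threshold):
    """Fraction of points whose thresholded prediction matches the true label."""
    preds = [1 if p >= threshold else 0 for p in p_hat]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)

for t in (0.25, 0.5, 0.75):
    print(f"T = {t}: accuracy = {accuracy(t):.3f}")
```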
Decision Boundary (Using Logistic)

• Y = 1 if P(Y = 1|X) ≥ 0.5
• For 1 input feature (X):

  z = w0 + w1 X
  1 / (1 + e^(−z)) ≥ 1/2
  1 + e^(−(w0 + w1 X)) ≤ 2
  w0 + w1 X ≥ 0

  Y = 1 if X ≥ −w0/w1

(Figure: the threshold X = −w0/w1 splits the input axis into a Y = 0 region and a Y = 1 region.)
• For 2 input features (X1, X2):

  z = w0 + w1 X1 + w2 X2
  P(Y = 1|X1, X2) ≥ 0.5
  1 / (1 + e^(−z)) ≥ 0.5

  Y = 1 if w0 + w1 X1 + w2 X2 ≥ 0

(Figure: the decision boundary w0 + w1 X1 + w2 X2 = 0 is a line in the (X1, X2) plane, with Y = 1 on one side and Y = 0 on the other.)
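The chain of equivalences says thresholding the probability at 0.5 is the same as checking the sign of the linear score. A minimal sketch with hypothetical weights (w0 = −6, w1 = w2 = 1 are illustrative choices, not from the slides):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical weights for illustration only
w0, w1, w2 = -6.0, 1.0, 1.0

def predict(x1, x2):
    """Threshold P(Y=1|X1,X2) at 0.5; equivalent to the sign of w0 + w1*x1 + w2*x2."""
    p = sigmoid(w0 + w1 * x1 + w2 * x2)
    return 1 if p >= 0.5 else 0

# Points on opposite sides of the line x1 + x2 = 6
print(predict(4, 4))  # 1: above the boundary
print(predict(1, 1))  # 0: below the boundary
```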
Question

Suppose you train a logistic regression classifier and the learned hypothesis function is:

hθ (x ) = σ(θ0 + θ1 x1 + θ2 x2 ),
where θ0 = 6, θ1 = 0, θ2 = −1. Which of the following represents the decision boundary for
hθ (x )?

(Figure: four candidate plots A–D of the (x1, x2) plane, each showing a straight-line decision boundary with the regions y = 0 and y = 1 labeled on opposite sides.)
Recap: Likelihood

How to Choose Parameters?

Using MSE
Maximising Conditional Likelihood
How to Choose Parameters?

Maximizing Conditional Likelihood

  P(Y = 0|X, W) = 1 / (1 + exp(w0 + Σᵢ wᵢ Xᵢ))

  P(Y = 1|X, W) = exp(w0 + Σᵢ wᵢ Xᵢ) / (1 + exp(w0 + Σᵢ wᵢ Xᵢ))

  l(W) ≡ Σ_l ln P(Y^l | X^l, W)
       = Σ_l [ Y^l (w0 + Σᵢ wᵢ Xᵢ^l) − ln(1 + exp(w0 + Σᵢ wᵢ Xᵢ^l)) ]

Good news: −l(W) is a convex function of W.

Bad news: there is no closed-form solution that maximizes l(W).
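With no closed-form maximizer, l(W) is optimized iteratively in practice. A minimal gradient-ascent sketch on the conditional log-likelihood (assuming NumPy; the toy data, learning rate, and iteration count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: class 1 tends to have larger x
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
Y = np.concatenate([np.zeros(50), np.ones(50)])

w0, w1 = 0.0, 0.0
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(w0 + w1 * X)))   # P(Y = 1 | X, W)
    # Ascend the gradient of the average conditional log-likelihood
    w0 += lr * np.sum(Y - p) / len(X)
    w1 += lr * np.sum((Y - p) * X) / len(X)

p = 1 / (1 + np.exp(-(w0 + w1 * X)))
acc = np.mean((p >= 0.5) == Y)
print(w1, acc)  # w1 ends up positive; accuracy is high on this well-separated data
```

Because −l(W) is convex, this simple ascent converges to the global optimum regardless of initialization (given a suitable step size).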
Cross-Entropy Loss

Interpret y ∈ [0, 1] as the estimated probability that t = 1.

Heavily penalize the extreme misclassification cases where t = 0, y = 1 or t = 1, y = 0.

Cross-entropy loss (a.k.a. log loss) captures this intuition:

  LCE(y, t) = −log y        if t = 1
              −log(1 − y)   if t = 0
            = −t log y − (1 − t) log(1 − y)
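A direct transcription of LCE, showing how confident mistakes are punished (a sketch using Python's math module):

```python
import math

def cross_entropy(y, t):
    """Log loss for predicted probability y in (0, 1) and true label t in {0, 1}."""
    return -t * math.log(y) - (1 - t) * math.log(1 - y)

# Confident, correct prediction: small loss
print(cross_entropy(0.99, 1))   # ~0.01
# Confident, wrong prediction: the loss blows up
print(cross_entropy(0.99, 0))   # ~4.6
```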
Question

Consider the following three rows from our training data, along with their predicted probabilities
ŷ for some choice of θ:

hue abv y ŷ
−0.17 0.24 0 0.45
−1.18 1.61 0 0.19
1.25 −0.97 1 0.80
What is the mean cross-entropy loss on just the above three rows of our training data?

◦ −(1/3)(log(0.45) + log(0.19) + log(0.20))
◦ −(1/3)(log(0.55) + log(0.19) + log(0.80))
◦ −(1/3)(log(0.45) + log(0.81) + log(0.80))
◦ −(1/3)(log(0.55) + log(0.81) + log(0.80))
◦ None of the above
Question

Suppose you are given the following classification task: predict the target Y ∈ {0, 1} given two
real-valued features X1 ∈ R and X2 ∈ R. After some training, you learn the following decision
rule:

Predict Y = 1 if w1 X1 + w2 X2 + w0 ≥ 0 and Y = 0 otherwise


where w1 = 3, w2 = 5, and w0 = −15.

1. Plot the decision boundary and label the region where we would predict Y = 1 and Y = 0.

2. Suppose that we learned the above weights using logistic regression. Using this model, what would be our prediction for P(Y = 1|X1, X2)? (You may want to use the sigmoid function σ(x) = 1 / (1 + exp(−x)).)
Question

We consider the following models of logistic regression for a binary classification with a sigmoid function g(z) = 1 / (1 + e^(−z)):
• Model 1: P(Y = 1|X, w1, w2) = g(w1 X1 + w2 X2)
• Model 2: P(Y = 1|X, w1, w2) = g(w0 + w1 X1 + w2 X2)

We have three training examples:

  x^(1) = (1, 1)ᵀ, x^(2) = (1, 0)ᵀ, x^(3) = (0, 0)ᵀ

  y^(1) = 1, y^(2) = −1, y^(3) = 1

If the label of the third example is changed to -1 , does it affect the learned weights w = (w1 , w2 )
in Model 1 and Model 2?
• It affects both Model 1 and 2
• Neither Model 1 nor 2.
• Model 1
• Model 2
Homework

I have a dataset with R records in which the i-th record has one real-valued input attribute xi and one
real-valued output attribute yi .
We have the following model with one unknown parameter w which we want to learn from data.

yi ∼ N(exp(wxi ), 1)
Note that the variance is known and equal to one.

(a) Is the task of estimating w :


A. a linear regression problem?
B. a non-linear regression problem?

(b) Suppose you decide to do a maximum likelihood estimation of w . You do the math and figure out
that you need w to satisfy one of the following equations. Which one?
A. Σᵢ xᵢ exp(wxᵢ) = Σᵢ xᵢ yᵢ exp(wxᵢ)
B. Σᵢ xᵢ exp(2wxᵢ) = Σᵢ xᵢ yᵢ exp(wxᵢ)
C. Σᵢ xᵢ² exp(wxᵢ) = Σᵢ xᵢ yᵢ exp(wxᵢ)
D. Σᵢ xᵢ² exp(wxᵢ) = Σᵢ xᵢ yᵢ exp(wxᵢ²)
E. Σᵢ exp(wxᵢ) = Σᵢ yᵢ² exp(wxᵢ)
Decision Boundary

Question

We consider here a discriminative approach for solving the classification problem illustrated in Figure 1.

Figure 1: The 2-dimensional labeled training set, where ‘+’ corresponds to class y = 1 and ‘O’
corresponds to class y = 0.
We attempt to solve the binary classification task depicted in Figure 1 with the simple linear logistic
regression model:
  P(y = 1|x⃗, w⃗) = g(w0 + w1 x1 + w2 x2) = 1 / (1 + exp(−w0 − w1 x1 − w2 x2))
Question

Consider training regularized logistic regression models where we try to maximize:


  Σᵢ₌₁ⁿ log P(yᵢ|xᵢ, w0, w1, w2) − C wⱼ²

for very large C. The regularization penalties used in penalized conditional log-likelihood estimation are −C wⱼ², where j ∈ {0, 1, 2}.
1. How does the training error change with regularization of each parameter wj ? State whether the
training error increases or stays the same (zero) for each wj for very large C . Provide a brief justification
for each of your answers.

(a) By regularizing w2 [ ]

(b) By regularizing w1 [ ]

(c) By regularizing w0 [ ]
Log odds or Logits

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f : X → Y , or P(Y |X ).


Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(Y ), P(X |Y )
• Estimate parameters of P(X |Y ), P(Y ) directly from training data
• Use Bayes rule to calculate P(Y = y |X = x )
Discriminative classifiers (e.g., Logistic regression)
• Assume some functional form for P(Y |X )
• Estimate parameters of P(Y |X ) directly from training data

Question

In which of the following situations can logistic regression be used? Select all that apply.
1 Predicting whether an email is a spam email or not based on its contents.
2 Predicting the rainfall depth for a given day in a certain city based on the city’s
historical weather data.
3 Predicting the cost of a house based on features of the house.
4 Predicting if a patient has a disease or not based on the patient’s symptoms and
medical history.
Question

What is the purpose of the sigmoid function in logistic regression?


1 It converts continuous input into categorical data.
2 It standardizes the input to have zero mean and variance 1.
3 It optimizes the weights to reduce loss.
4 It transforms the output to a probability.
Question

Consider the following binary classification dataset.

Are these linearly separable?


1 Yes
2 No
Question

Suppose you are given θ for the logistic regression model to predict whether a tumor is
malignant (y = 1) or benign (y = 0) based on features of the tumor x . If you get a new
patient x∗ and find that x∗T θ > 0, what can you say about the tumor? Select only one.
1 The tumor is benign
2 The tumor is more likely benign
3 The tumor is more likely to be malignant
4 The tumor is malignant
Question

Which of the following situations would justify applying regularization to a logistic regression model? Select all that apply.
1 The training error is too high.
2 The test error is too low.
3 The data are high-dimensional.
4 There is a large class imbalance.
5 None of the above justify regularization for logistic regression.
Question

Bubble the expression that describes the odds ratio P(Y = 1|X) / P(Y = 0|X) of a logistic regression model.
⃝ X^T β
⃝ −X^T β
⃝ exp(X^T β)
⃝ σ(X^T β)
⃝ None of these
Question

Bubble the expression that describes P(Y = 0|X) for a logistic regression model.
⃝ σ(−X^T β)
⃝ 1 − log(1 + exp(X^T β))
⃝ 1 + log(1 + exp(−X^T β))
⃝ None of these
Question

For a logistic regression model P(Y = 1|X) = σ(−2 − 3X), where X is a scalar random variable, what values of x would give P(Y = 0|X = x) ≥ 3/4?
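One way to sanity-check an algebraic answer here is numerically (a sketch in Python): since P(Y = 0|X = x) = 1 − σ(−2 − 3x), we can evaluate it around the candidate boundary x = (ln 3 − 2)/3, where σ(−2 − 3x) = 1/4 exactly.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def p_y0(x):
    """P(Y = 0 | X = x) = 1 - sigma(-2 - 3x) for this model."""
    return 1 - sigmoid(-2 - 3 * x)

# Candidate boundary from solving sigma(-2 - 3x) = 1/4
boundary = (math.log(3) - 2) / 3
print(boundary)                    # ~ -0.30
print(p_y0(boundary))              # 0.75 exactly at the boundary
print(p_y0(boundary + 1) >= 0.75)  # True: p_y0 increases with x
```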
Question

Logistic regression:
• ⃝ Minimizes cross-entropy loss
• ⃝ Has a simple, closed-form analytical solution
• ⃝ Models the log-odds as a linear function
• ⃝ Is a classification method to estimate class posterior probabilities
Question

Which of the following is true of logistic regression?


• ⃝ It can be motivated by "log odds"
• ⃝ It can be used with L1 regularization
• ⃝ The optimal weight vector can be found using MLE
• ⃝ All of the above
• ⃝ None of the above
Question

[1 pt] In the discriminative approach to solving classification problems, we model the condi-
tional probability of the labels given the observations.
• ⃝ True
• ⃝ False
Question

Stanford and Berkeley students are trying to solve the same logistic regression problem for a
dataset. The Stanford group claims that their initialization point will lead to a much better
optimum than Berkeley’s initialization point. Stanford is correct.
• ⃝ True
• ⃝ False
Question

In logistic regression, we model the odds ratio p/(1 − p) as a linear function.
• ⃝ True
• ⃝ False
Question

Select always, sometimes, or never to describe when each statement below is true about a
logistic regression model P(Y = 1|X ) = σ(X T β), where Y is binary and X is vector-valued.
X and β are finite.
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) > X T β
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) = P(Y = 0| − X )
• ⃝ Always ⃝ Sometimes ⃝ Never: P(Y = 1|X ) < 1
• ⃝ Always ⃝ Sometimes ⃝ Never: σ(X T β) ≤ σ(X T (2 · β))
Question

If no regularization is used in logistic regression and the training data is linearly separable, the
optimal model parameters will tend towards positive or negative infinity.
□ True □ False
Question

Suppose we use the following regression model with a single model weight θ and loss function:
fθ (x ) = σ(θ − 2)
ℓ(θ, x, y) = −y log fθ(x) − (1 − y) log(1 − fθ(x)) + ½ θ²
Derive the stochastic gradient descent update rule for this model and loss function, assuming
that the learning rate α = 1. Your answer may only use the following variables: θ(t+1) , θ(t) ,
y , and the sigmoid function σ. Show all your work within the space provided and draw a box
around your final answer.
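A cheap way to check any gradient you derive for this loss is a finite-difference comparison (a Python sketch; the analytic form in `grad` is a candidate to be verified against the numeric derivative, not the official solution):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(theta, y):
    """Cross-entropy on f_theta(x) = sigma(theta - 2), plus the (1/2) theta^2 penalty."""
    f = sigmoid(theta - 2)
    return -y * math.log(f) - (1 - y) * math.log(1 - f) + 0.5 * theta ** 2

def grad(theta, y):
    # Candidate analytic gradient: d/dtheta of the cross-entropy term is
    # sigma(theta - 2) - y, and the L2 term contributes theta
    return sigmoid(theta - 2) - y + theta

# Finite-difference check at a few points
h = 1e-6
for theta in (-1.0, 0.0, 2.0):
    for y in (0, 1):
        numeric = (loss(theta + h, y) - loss(theta - h, y)) / (2 * h)
        assert abs(numeric - grad(theta, y)) < 1e-5
print("gradient check passed")
```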
Question

Suppose you have a logistic regression model for spam detection, using a dataset with a binary
outcome that indicates whether an email is spam (1) or not spam (0). The predictor variables
x1 , x2 , and x3 are boolean values (0 or 1) that indicate whether the email contains the words
"free", "order", and "homework", respectively. The model has four parameters: weights w1 ,
w2 , w3 , and offset b. You find that emails containing the words "free" and "order" have a
higher probability of being spam, while emails containing the word "homework" have a lower
probability of being spam. Given this information, which of the following signs is most likely
for the weights w1 , w2 , and w3 ?
(A) All positive
(B) All negative
(C) w1 and w2 are positive, w3 is negative
(D) w1 and w2 are negative, w3 is positive
Question

You have been tasked with performing an analysis on customers of a credit card company.
Specifically, you will be developing a classification model to classify whether or not specific
customers will fail to pay their next credit card payment. You decide to approach this problem
with a logistic regression classifier. The first 5 rows of our data are shown below.

ID Education Marriage Age Failed Payment


28465 1 1 40 1
27622 1 2 23 0
28376 2 1 36 0
10917 3 1 54 0
27234 1 1 35 0

The numerical data in the education and marriage columns correspond to the following cate-
gories:
• Education: 1 - graduate school; 2 - university; 3 - high school; 4 - other
• Marriage: 1 - married; 2 - single; 3 - other
Question

Our response variable, labeled as failed payment, can have values of 0 (makes their next
payment) or 1 (fails to make their next payment). You use the logistic regression model
ŷ = P(Y = 1|x ) = σ(x T θ). Assume that the following value of θ̂ minimizes un-regularized
mean cross-entropy loss for this dataset:

θ̂ = [−1.30, 0.08, −0.08, 0.001]T


Here, -1.30 is the intercept term, 0.08 corresponds to education, -0.08 corresponds to marriage
status, and 0.001 corresponds to age.
Consider a customer who is 50 years old, married, and only has a high school education.
Compute the chance that they fail to pay their next credit card payment. Give your answer as
a probability in terms of σ.
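A quick Python sketch to check the dot product before wrapping it in σ; the feature order follows the θ̂ given above (intercept, education, marriage, age):

```python
import math

theta = [-1.30, 0.08, -0.08, 0.001]  # intercept, education, marriage, age

# 50 years old, married (1), high-school education (3); leading 1 for the intercept
x = [1, 3, 1, 50]

z = sum(t * xi for t, xi in zip(theta, x))
p = 1 / (1 + math.exp(-z))
print(z)  # the argument passed to sigma
print(p)  # the corresponding probability
```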
Question

This specific customer fortunately made their next payment on time! Compute the cross-
entropy loss of the prediction in part (a). Leave your answer in terms of σ.
Question

How does a one-unit increase in age impact the log-odds of making a failed payment? Give a
precise, numerical answer, not just it increases or it decreases.
Question

Let’s consider all customers who are married and whose highest level of education is high
school. What is the minimum age of such a customer, such that they are more likely to fail
their next payment than make their next payment, under our logistic regression model?
Question

Suppose you choose a threshold T = 0.8. The decision boundary of the resulting classifier is
of the form:

A · education + B · marriage + C · age + D = 0


What are the values of A, B, C , and D? Your answers may contain a log, but should not
contain σ. Show your work.
Question

Suppose with the above threshold you achieve a training accuracy of 100%. Can you conclude
your training data was linearly separable in the feature space? Answer yes or no, and explain
in one sentence.
Thank you!
