Chap4 Logistic Regression

The document provides an overview of logistic regression and its application in classification problems, particularly in predicting binary outcomes. It discusses the limitations of linear regression for qualitative response variables and introduces the logistic model, which uses the logistic function to estimate probabilities. The document also covers model estimation, interpretation of odds ratios, and making predictions using logistic regression with both continuous and categorical predictors.


STAT452

Logistic Regression and Classification

Hoàng Văn Hà


University of Science, VNU - HCM
[email protected]

V. H. Hoang Logistic Regression 1 / 44


Outline

1 Introduction

2 Why Not Linear Regression?

3 Logistic Regression

4 Classification

V. H. Hoang Logistic Regression 2 / 44


Introduction

Classification: Definition

Given a set of records (called the training set)


Each record contains a set of attributes. One of the attributes is the class.
Find a model for the class attribute as a function of the values of other attributes.
Goal: Previously unseen records should be assigned to a class as accurately as
possible.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it. The accuracy of the model is determined on the test set.

V. H. Hoang Logistic Regression 4 / 44


Introduction

Classification: Definition

Consider a qualitative target or response variable Y and associated predictor variables X1, X2, . . . , Xp.
The classification task is to build a function or rule in terms of X1, X2, . . . , Xp that takes these predictors as input and predicts the value (or category) of Y.
Examples of qualitative variables, which take values in an unordered set C:
Eye color ∈ {brown, blue, green}
Species ∈ {versicolor, virginica, setosa}
Insurance claim ∈ {fraudulent, legitimate}

V. H. Hoang Logistic Regression 5 / 44


Introduction

Classification Problems

Classification problems occur often, perhaps even more so than regression problems.
Some examples include:
1 A person arrives at the emergency room with a set of symptoms that could possibly
be attributed to one of three medical conditions.
Which of the three conditions does the individual have?
2 An online banking service must determine whether a transaction is fraudulent based
on the user’s IP address, transaction history, etc.
3 A biologist analyzing DNA sequences wants to determine which mutations are
disease-causing and which are not.

V. H. Hoang Logistic Regression 6 / 44


Introduction

Example: Exploring the Default Dataset in R

Load dataset from the ISLR package:

library(ISLR)
attach(Default)
dim(Default)
# [1] 10000 4

Dataset has 10,000 observations and 4 variables.


Preview of the dataset:

head(Default)
# default student balance income
#1 No No 729.53 44361.62
#2 No Yes 817.18 12106.13
#3 No No 1073.55 31767.14
#4 No No 529.25 35704.49
#5 No No 785.66 38463.50
#6 No Yes 919.59 7491.56

V. H. Hoang Logistic Regression 7 / 44


Introduction

Example: Exploring the Default Dataset in R

Goal: Predict whether an individual will default on a credit card payment.


Predictors: Annual income and monthly credit card balance.
Data: Simulated dataset of 10,000 individuals.
Left panel (figure not reproduced here): scatter plot of the individuals, defaulters (orange) vs. non-defaulters (blue).
Right panel: boxplots of balance and income by default status (an R sketch for similar plots follows below).
Observation: Defaulters tend to have higher balances.
Since default (Y ) is qualitative, linear regression is not appropriate.
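
A minimal R sketch for plots along these lines (our illustration, assuming the Default data from the ISLR package is loaded):

library(ISLR)
col <- ifelse(Default$default == "Yes", "orange", "blue")
par(mfrow = c(1, 3))
plot(Default$balance, Default$income, col = col, pch = 20,
     xlab = "Balance", ylab = "Income")              # scatter: defaulters in orange
boxplot(balance ~ default, data = Default, ylab = "Balance")
boxplot(income ~ default, data = Default, ylab = "Income")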

V. H. Hoang Logistic Regression 8 / 44


Why Not Linear Regression?

Can we use Linear Regression when Y is qualitative?

To build a linear regression model between default and balance, we need to


encode the default response into numbers:

Set Code:
Y = 0, if default is No
Y = 1, if default is Yes

Question: Can we simply perform a linear regression of Y on X and classify as ‘Yes’


if Ŷi > 0.5?

V. H. Hoang Logistic Regression 10 / 44


Why Not Linear Regression?

Limitations of using Linear Regression when Y is binary

Consider Y = default and X = balance.
In the simple linear regression model, Yi = β0 + β1 xi + εi, or

E(Yi) = β0 + β1 xi.

For binary outcomes:

E (Yi ) = P(Y = 1) · 1 + P(Y = 0) · 0 = P(Y = 1) = P(Default).

This implies
P(Default) = β0 + β1 xi = E (Yi ).

NOTE: Linear regression might produce probabilities less than 0 or greater than 1.

V. H. Hoang Logistic Regression 11 / 44


Why Not Linear Regression?

Limitations of using Linear Regression when Y is binary

Classification using the Default data: estimated probability of default using linear
regression. Some estimated probabilities are negative!
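
A quick R check of this point (a sketch, not taken from the slides): fit a linear model to the 0/1 coding of default and inspect the fitted values.

library(ISLR)
y <- as.numeric(Default$default == "Yes")    # 0/1 coding of the response
lm_fit <- lm(y ~ balance, data = Default)
range(fitted(lm_fit))    # the lower end is negative for small balances, as noted above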

V. H. Hoang Logistic Regression 12 / 44


Logistic Regression

What is Logistic Regression?

A statistical model for predicting a binary outcome.


Appropriate when the response variable Y ∈ {0, 1}.
Models the probability π(x) = P(Y = 1 | X = x) directly.
The logistic model uses the logistic function, which maps any real number into the interval (0, 1):

y = e^x / (1 + e^x).

[Figure: the logistic curve y = e^x / (1 + e^x), increasing from 0 to 1 and passing through 0.5 at x = 0.]
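
A one-line R sketch that reproduces this curve:

curve(exp(x) / (1 + exp(x)), from = -10, to = 10, xlab = "x", ylab = "y")   # logistic function, values in (0, 1)
abline(h = 0.5, lty = 2)    # crosses 0.5 at x = 0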

V. H. Hoang Logistic Regression 14 / 44


Logistic Regression

The Logistic Regression Model

The model form:


 
log( π(X) / (1 − π(X)) ) = β0 + β1 X1 + · · · + βp Xp = Xᵀβ.

Equivalently, on the probability scale:

π(X) = exp(β0 + β1 X1 + · · · + βp Xp) / (1 + exp(β0 + β1 X1 + · · · + βp Xp)) = exp(Xᵀβ) / (1 + exp(Xᵀβ)).

V. H. Hoang Logistic Regression 15 / 44


Logistic Regression

Estimating the Model

For a sample of size n, the likelihood for a binary logistic regression is given by:

L(β; y, X) = ∏_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1−y_i} = ∏_{i=1}^{n} [ exp(X_iᵀβ) / (1 + exp(X_iᵀβ)) ]^{y_i} [ 1 / (1 + exp(X_iᵀβ)) ]^{1−y_i}.

This yields the log-likelihood:

ℓ(β) = Σ_{i=1}^{n} [ y_i log(π_i) + (1 − y_i) log(1 − π_i) ] = Σ_{i=1}^{n} [ y_i X_iᵀβ − log(1 + exp(X_iᵀβ)) ].

Maximizing the likelihood (or log-likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares (IRLS) is used to find an estimate of the regression coefficients, β̂.
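
The slides do not spell IRLS out; below is a minimal Newton-Raphson/IRLS sketch in R (the function name irls_logistic, the tolerance, and the example call are our own illustration; it assumes a design matrix X with an intercept column and a 0/1 response y, and omits the safeguards that glm() provides).

irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)             # linear predictor X beta
    p   <- 1 / (1 + exp(-eta))          # fitted probabilities pi_i
    W   <- p * (1 - p)                  # IRLS weights
    z   <- eta + (y - p) / W            # working response
    beta_new <- solve(crossprod(X, W * X), crossprod(X, W * z))   # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}
# Example (should be close to coef(glm(default ~ balance, data = Default, family = "binomial"))):
# X <- cbind(1, Default$balance); y <- as.numeric(Default$default == "Yes")
# irls_logistic(X, y)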

V. H. Hoang Logistic Regression 16 / 44


Logistic Regression

Odds and Log Odds

There are algebraically equivalent ways to write the logistic regression model.
First, express odds:
π / (1 − π) = exp(β0 + β1 X1 + · · · + βp Xp).

The odds is the ratio of the probability of success to the probability of failure.
Example: If the probability of success is 0.8, then odds = 0.8/(1 − 0.8) = 4, i.e., 4:1.

Second, take the log of odds:


 
log( π / (1 − π) ) = β0 + β1 X1 + · · · + βp Xp.

This is known as the logit transformation.


The log of the odds is a linear function of the predictors.

V. H. Hoang Logistic Regression 17 / 44


Logistic Regression

Odds: Example and Interpretation

Definition: Odds = P(success) / P(failure) = π / (1 − π).
Example: In a study, 100 individuals are surveyed to check if they default on credit:
20 individuals defaulted ⇒ P(default) = 0.2.
So, P(no default) = 0.8.
Odds of default = 0.2 / 0.8 = 0.25.

Interpretation:
Odds = 0.25 means: for every 1 person who defaults, there are 4 who do not.
If the probability of success is 0.5, then Odds = 1 (equal chance).
Odds > 1: Success is more likely than failure.
Odds < 1: Failure is more likely than success.

V. H. Hoang Logistic Regression 18 / 44


Logistic Regression

Interpreting the Odds Ratio

We compare two sets of predictors X(1) and X(2) , which differ in only one predictor
(i.e., one predictor changes by one unit).
This lets us assess how that single predictor affects the response.
The odds ratio can take any nonnegative value:
If OR = 1: no association between predictor and response.
If OR > 1: higher predictor values increase the odds of success.
If OR < 1: higher predictor values decrease the odds of success.
Specifically, the odds increase multiplicatively by exp(βj ) for every one-unit increase
in Xj .
This applies to both continuous predictors and categorical levels of factors.
Values further from 1 indicate a stronger degree of association.

V. H. Hoang Logistic Regression 19 / 44


Logistic Regression

Interpreting Log-Odds in Logistic Regression

Logistic regression models the log-odds of success as a linear combination of predictors:

log( π / (1 − π) ) = β0 + β1 X1 + · · · + βk Xk
Each coefficient βj represents the change in log-odds for a one-unit increase in Xj ,
holding other predictors constant.
A positive βj increases the log-odds ⇒ higher probability of success.
A negative βj decreases the log-odds ⇒ lower probability of success.
Exponentiating βj gives the odds ratio: exp(βj )

V. H. Hoang Logistic Regression 20 / 44


Logistic Regression

Computing Odds Ratios in R: the Default dataset

Given a logistic regression model in R:

model <- glm(default ~ balance + income,
             data = Default, family = "binomial")
summary(model)

To get the odds ratios:

exp(coef(model))

To get confidence intervals for the odds ratios:

exp(confint(model))

Each odds ratio shows the multiplicative change in odds for a one-unit increase in the
predictor.

V. H. Hoang Logistic Regression 21 / 44


Logistic Regression

Interpreting Logistic Regression Output

We fit a logistic regression model on the Default dataset to predict the probability of default = Yes from balance; the estimated slope is positive, so an increase in balance is associated with an increase in the probability of default.
Estimated model:

log( π / (1 − π) ) = β0 + β1 · balance.

Estimated coefficients:
Variable Coefficient Std. error Z-statistic P-value
Intercept −10.6513 0.3612 −29.5 <0.0001
balance 0.0055 0.0002 24.9 <0.0001
Table 1: Estimated coefficients of the logistic regression model that predicts the probability of
default using balance.

V. H. Hoang Logistic Regression 22 / 44


Logistic Regression

Interpreting the Odds Ratio

The estimated coefficient for balance is

β̂1 = 0.0055.

We interpret this in terms of the odds ratio:

Odds Ratio = exp(β̂1 ) = exp(0.0055) ≈ 1.0055.


Interpretation:
For each additional unit increase in balance, the odds of default increase by
approximately 0.55%.
That is, the odds are multiplied by 1.0055 for every 1-unit increase in balance.
Since the odds ratio is greater than 1, balance is positively associated with the
likelihood of default.

V. H. Hoang Logistic Regression 23 / 44


Logistic Regression

Effect of Balance on Probability of Default

[Figure: fitted logistic curve of the probability of default as a function of balance (0 to 2,500), with the cutoff P̂ = 0.5 marked.]
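
A short R sketch (assuming the one-predictor fit default ~ balance from the slides) that redraws this curve:

fit <- glm(default ~ balance, data = Default, family = "binomial")
grid <- data.frame(balance = seq(0, 2500, by = 10))
grid$prob <- predict(fit, newdata = grid, type = "response")
plot(grid$balance, grid$prob, type = "l", col = "blue",
     xlab = "balance", ylab = "Probability of default")
abline(h = 0.5, lty = 2)    # cutoff P-hat = 0.5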

V. H. Hoang Logistic Regression 24 / 44


Logistic Regression

Statistical Significance of Coefficients: Wald Test

The Wald test is used to test the significance of individual regression coefficients in
logistic regression (similar to the t-test in linear regression).
For maximum likelihood estimates, the test statistic is:

Z = β̂i / SE(β̂i).

Used to test the null hypothesis H0 : βi = 0.


Z is compared to the standard normal distribution to compute the p-value.
Variables with small p-values are more likely to be significant predictors.
In the Default data, since the p-value for balance is tiny (<0.0001), we reject H0 and conclude that there is a statistically significant association between balance and the probability of default.
Confidence interval:
β̂i ± z_{1−α/2} · SE(β̂i).
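
A short R sketch (assuming the glm fit above) showing where the Wald statistics and intervals come from; confint.default() gives the Wald-type (normal-approximation) interval, while confint() profiles the likelihood:

fit <- glm(default ~ balance, data = Default, family = "binomial")
coefs <- summary(fit)$coefficients              # estimate, std. error, z value, p-value
coefs[, "Estimate"] / coefs[, "Std. Error"]     # Wald z computed by hand
confint.default(fit)                            # Wald confidence intervals
exp(confint.default(fit))                       # the same intervals on the odds-ratio scale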

V. H. Hoang Logistic Regression 25 / 44


Logistic Regression

Making Predictions

Using estimated coefficients from Table 1:

p̂(X) = e^(β̂0 + β̂1 X) / (1 + e^(β̂0 + β̂1 X)) = e^(−10.6513 + 0.0055·1000) / (1 + e^(−10.6513 + 0.0055·1000)) ≈ 0.00576.

For a balance of $1,000, the predicted probability of default is ≈ 0.00576 (less than
1%)
For a balance of $2,000, the predicted probability is:

p̂(X = 2000) ≈ 0.586 (or 58.6%).

Shows how default risk increases non-linearly with balance.
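
A quick R check of these two predicted probabilities (assuming the single-predictor fit from Table 1):

fit <- glm(default ~ balance, data = Default, family = "binomial")
predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
# approximately 0.00576 and 0.586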

V. H. Hoang Logistic Regression 26 / 44


Logistic Regression

Prediction using Categorical Predictors


Using a logistic model with a qualitative predictor student:
Student variable coded as: 1 = student, 0 = non-student.
Estimated model:
Variable Coefficient Std. error Z-statistic P-value
Intercept −3.5041 0.0707 −49.55 <0.0001
student[Yes] 0.4049 0.1150 3.52 0.0004

Table 2: Logistic regression predicting the probability of default using student status as a dummy variable.

log( π / (1 − π) ) = −3.5041 + 0.4049 · student.

Predicted probabilities:

p̂(default | student = Yes) = e^(−3.5041 + 0.4049) / (1 + e^(−3.5041 + 0.4049)) ≈ 0.0431
p̂(default | student = No)  = e^(−3.5041) / (1 + e^(−3.5041)) ≈ 0.0292
Conclusion: Students have higher predicted probability of default than non-students.
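A short R sketch reproducing these numbers (our illustration; the object name fit_s is arbitrary):

fit_s <- glm(default ~ student, data = Default, family = "binomial")
coef(fit_s)
predict(fit_s, newdata = data.frame(student = c("Yes", "No")), type = "response")
# approximately 0.0431 (students) and 0.0292 (non-students)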
V. H. Hoang Logistic Regression 27 / 44
Logistic Regression

Evaluating Logistic Regression: Likelihood & Deviance

1. Log-Likelihood ℓ(β)
Measures goodness-of-fit
Higher ℓ(β) implies better model fit.

2. Deviance:

Deviance = −2 · ℓ(model) + 2 · ℓ(saturated model).

A lower deviance indicates a better fit
Deviance of the null model (intercept-only) is used as a baseline

3. Likelihood Ratio Test (LRT):

G² = −2 · (ℓ_null − ℓ_model) ∼ χ²_df.

Compares two nested models


Test if adding predictors significantly improves model fit

V. H. Hoang Logistic Regression 28 / 44


Logistic Regression

Evaluating Logistic Regression: AIC and Pseudo-R²

4. AIC (Akaike Information Criterion):

AIC = −2ℓ(β) + 2p.

Balances model fit and complexity
Penalizes overfitting
Lower AIC indicates a better model

5. Pseudo-R² (e.g., McFadden's):

R² = 1 − ℓ_model / ℓ_null.

Measures improvement over null model


Interpretation is less direct than in linear regression
Commonly used to communicate model quality
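
A small R sketch for McFadden's pseudo-R² (assuming the balance-only fit and an intercept-only null model):

fit  <- glm(default ~ balance, data = Default, family = "binomial")
null <- glm(default ~ 1, data = Default, family = "binomial")
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))    # McFadden's pseudo-R^2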

V. H. Hoang Logistic Regression 29 / 44


Logistic Regression

Evaluating Logistic Regression in R (Default Data)

Fit model:
model1 <- glm(default ~ balance, data = Default, family = "binomial")
summary(model1)
Selected output:
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
p-value for balance < 2e − 16 ⇒ highly significant
Likelihood Ratio Test:
model0 <- glm(default ~ 1, data = Default, family = "binomial")
anova(model0, model1, test = "Chisq")

Compare null vs. full model


Deviance reduction: 1324.1 (p < 2e − 16)
Conclusion: balance improves model fit significantly

V. H. Hoang Logistic Regression 30 / 44


Logistic Regression

Comparing Logistic Models in R

We compare two nested models:

model1 <- glm(default ~ balance, data = Default, family = "binomial")
model2 <- glm(default ~ balance + income, data = Default,
              family = "binomial")

Step 1: Compare AIC values


model1 AIC = 1600.5
model2 AIC = 1585 ⇒ model2 is slightly better (lower AIC)

Step 2: Likelihood Ratio Test (LRT)


> anova(model1, model2, test = "Chisq")
Analysis of Deviance Table
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 9998 1596.5
2 9997 1579.0 1 17.485 2.895e-05 ***

Test statistic = 17.485, df = 1, p = 2.895 × 10⁻⁵


⇒ significant improvement by adding income.
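
The AIC values quoted in Step 1 can be pulled directly (assuming model1 and model2 as fitted above):

AIC(model1, model2)    # lower AIC indicates the preferred model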

V. H. Hoang Logistic Regression 31 / 44


Logistic Regression

Model Selection: AIC vs. Likelihood Ratio Test

Two common criteria:


AIC (Akaike Information Criterion):
Balances goodness-of-fit and model complexity
Lower AIC = better model (penalizes extra predictors)
Can compare non-nested models
Likelihood Ratio Test (LRT):
Formal hypothesis test for nested models
Tests if additional predictors improve the model significantly
Based on χ2 distribution

When they disagree:


AIC prefers more complex model, but LRT is not significant:
Extra predictor does not improve fit enough to justify complexity
Prefer simpler model (especially for interpretability)
Guiding principle: Use LRT for significance, AIC for prediction.

V. H. Hoang Logistic Regression 32 / 44


Classification

Classification Using Logistic Regression

Logistic regression predicts the probability that Y = 1, given X :

π(X) = e^(β0 + β1 X) / (1 + e^(β0 + β1 X))
To classify a new observation:
If π(X ) ≥ 0.5 → predict class 1 (success)
If π(X ) < 0.5 → predict class 0 (failure)
This threshold of 0.5 can be changed depending on:
Application context (e.g., fraud detection, medical diagnosis)
Desired trade-off between false positives and false negatives
The boundary π(X) = 0.5 corresponds to:

log( π / (1 − π) ) = 0, which implies β0 + β1 X = 0
→ A linear decision boundary in predictor space.
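
For the balance-only fit, this boundary is the single balance value at which β̂0 + β̂1 x = 0; a quick R sketch (assuming the fit above):

fit <- glm(default ~ balance, data = Default, family = "binomial")
b <- coef(fit)
-b[1] / b[2]    # balance at which the predicted probability crosses 0.5 (approximately 1,937)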

V. H. Hoang Logistic Regression 34 / 44


Classification

Confusion Matrix

Definition: A table showing predicted vs. actual classifications:

Predicted: 1 Predicted: 0
Actual: 1 True Positive (TP) False Negative (FN)
Actual: 0 False Positive (FP) True Negative (TN)

Goal: Maximize TP and TN, minimize FP and FN


The threshold (0.5 by default) can be adjusted to trade off TP against TN.
Different applications weight different types of errors differently (e.g., fraud detection, medical diagnosis).

V. H. Hoang Logistic Regression 35 / 44


Classification

Classification Metrics

Based on Confusion Matrix:


Accuracy = (TP + TN) / (TP + TN + FP + FN).
True Positive Rate (Sensitivity or Recall): TPR = TP / (TP + FN).
False Positive Rate: FPR = FP / (FP + TN).
Precision = TP / (TP + FP).
F1 Score (harmonic mean of Precision and Recall): F1 = 2 · Precision · Recall / (Precision + Recall).
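
A small R helper (our own sketch, not from the slides) that computes these metrics from predicted and actual labels, with "Yes" treated as the positive class:

class_metrics <- function(pred, actual, positive = "Yes") {
  tp <- sum(pred == positive & actual == positive)
  tn <- sum(pred != positive & actual != positive)
  fp <- sum(pred == positive & actual != positive)
  fn <- sum(pred != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = recall,
    specificity = tn / (tn + fp),
    precision   = precision,
    F1          = 2 * precision * recall / (precision + recall))
}
# Example: class_metrics(pred, Default$default), once predicted labels pred are available
# (see the worked example on the later slides).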

V. H. Hoang Logistic Regression 36 / 44


Classification

ROC Curve (Receiver Operating Characteristic)

ROC curve plots TPR (sensitivity) vs. FPR (1 - specificity) across thresholds. ROC
helps visualize the trade-off between Sensitivity and Specificity.
The closer the curve lies to the top-left corner, the better the model; a perfect classifier hugs that corner.
A random classifier lies along the diagonal
[Figure: ROC curve of the logistic model, plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) and bowing toward the top-left corner; the diagonal marks a random classifier (AUC = 0.5).]
V. H. Hoang Logistic Regression 37 / 44
Classification

AUC and Threshold Selection

AUC = Area Under the ROC Curve


Measures the ability of the model to discriminate between classes
AUC = 1: perfect model, AUC = 0.5: no discrimination (random).
Choosing Optimal Threshold τ :
Depends on the cost of false positives and false negatives
Often chosen to maximize:
Youden’s Index = TPR − FPR.
Or to balance Precision and Recall (F1 score)
Tools in R: ROCR, pROC, yardstick (tidymodels).

V. H. Hoang Logistic Regression 38 / 44


Classification

Classification with Logistic Regression (Default Dataset)

Step 1: Fit model and predict probabilities


library(ISLR)
model <- glm(default ~ balance, data = Default, family = "binomial")
probs <- predict(model, type = "response")
Step 2: Classify with threshold τ = 0.5
pred <- ifelse(probs > 0.5, "Yes", "No")
Step 3: Confusion Matrix and Evaluation
table(Predicted = pred, Actual = Default$default)
Step 4: Compute metrics
library(caret)
confusionMatrix(as.factor(pred), Default$default, positive = "Yes")

V. H. Hoang Logistic Regression 39 / 44


Classification

Classification Evaluation Example (R Output)

Example Confusion Matrix:


Actual: No Actual: Yes
Pred: No 9625 233
Pred: Yes 42 100
Accuracy ≈ 97.25%
Sensitivity (TPR) ≈ 30.0% (100 / 333)
Specificity (TNR) ≈ 99.57%
Precision ≈ 70.42%
Conclusion: Good overall accuracy, but poor sensitivity → default cases are rare, and roughly 70% of actual defaulters are missed at the 0.5 threshold.

V. H. Hoang Logistic Regression 40 / 44


Classification

Adjusting the Threshold τ

Default threshold τ = 0.5 gave poor sensitivity


Try lower threshold to capture more defaulters:

pred_30 <- ifelse(probs > 0.3, "Yes", "No")


table(Predicted = pred_30, Actual = Default$default)
confusionMatrix(as.factor(pred_30), Default$default, positive = "Yes")

Lowering τ increases True Positives (TP)


But also increases False Positives (FP)
Trade-off: better sensitivity vs. worse specificity

V. H. Hoang Logistic Regression 41 / 44


Classification

ROC Curve and AUC for Model Evaluation

Using the pROC package:


library(pROC)
roc_obj <- roc(Default$default, probs)
plot(roc_obj, col = "blue", lwd = 2)
auc(roc_obj)

ROC curve: shows performance across all thresholds


AUC: Area Under the Curve
AUC = 0.5: random guessing
AUC = 1.0: perfect model
AUC > 0.8: strong discrimination
Use ROC to choose threshold that balances TPR and FPR.

V. H. Hoang Logistic Regression 42 / 44


Classification

Choosing Threshold from ROC Curve

Each point on the ROC curve corresponds to a threshold τ


Changing τ changes:
TPR (sensitivity)
FPR (1 - specificity)

Common strategy: Choose τ to maximize Youden’s Index:

Youden = TPR − FPR

This maximizes the vertical distance from the diagonal (random guessing)

Alternative: choose τ that balances Precision and Recall (maximize F1)


R function with pROC:
coords(roc_obj, "best", best.method = "youden")

V. H. Hoang Logistic Regression 43 / 44


Classification

Optimal Threshold via ROC (Youden)

Step-by-step in R:
library(ISLR)
library(pROC)

# Fit logistic regression model


model <- glm(default ~ balance, data = Default, family = "binomial")
probs <- predict(model, type = "response")

# Compute ROC object


roc_obj <- roc(Default$default, probs)

# Plot ROC curve


plot(roc_obj, col = "blue", lwd = 2)

# Find optimal threshold (Youden’s Index)


coords(roc_obj, "best", best.method = "youden")

V. H. Hoang Logistic Regression 44 / 44
