Multiple Logistic Regression (SPSS) 2021

Multiple logistic regression allows modeling of the relationship between a binary dependent variable and multiple independent variables. It can be used to establish which factors are associated with an outcome like disease status. The document describes multiple logistic regression, including odds, odds ratios, and the steps involved in building a model using a dataset on coronary artery disease. Variables are first explored using descriptive statistics and univariable analysis identifies potentially important predictors for the model.


Multiple Logistic Regression

Professor Dr. Syed Hatim Noor


Dr. Wan Arfah Nadiah Wan Abdul Jamil
Universiti Sultan Zainal Abidin
Introduction
• Multiple logistic regression is the estimation of the relationship between a dichotomous dependent variable and more than one independent variable or covariate
• Applied in exploratory studies, explanatory studies
• Independent variables are a combination of numerical and categorical variables
• Outcome is a binary categorical variable
• If the outcome is dichotomous (called Multiple Logistic Regression)
• If the outcome is polytomous (called Multinomial Logistic Regression)

Syed Hatim Noor 2


Introduction
• The goal of regression analysis is to establish a model that is
– Best fit
– Parsimonious
– Biologically sound (biological plausibility)
– Statistically significant
• To answer the research question: What are the factors associated with the dependent variable (event: yes/no)?
– E.g.: What are the factors associated with coronary artery disease (CAD)?
Odds
• The odds express the chance of an event occurring relative to the chance of it not occurring
• The odds of an event is the ratio of the number of ways the event can occur to the number of ways the event cannot occur
• E.g.: On average, 54 girls are born in every 100 births. What are the odds that any randomly chosen delivery is a girl?
• Number of girls / number of boys = 54/46
• So the odds of getting a baby girl are about 1.17
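The arithmetic in the baby-girl example can be checked with a short sketch. The function names `odds` and `odds_from_probability` are illustrative only and do not come from the slides.

```python
def odds(events, non_events):
    """Odds = number of ways the event occurs / number of ways it does not."""
    return events / non_events


def odds_from_probability(p):
    """Convert a probability into odds."""
    return p / (1 - p)


girls, boys = 54, 46  # per 100 births, from the slide
print(round(odds(girls, boys), 2))            # odds of a girl
print(round(odds_from_probability(0.54), 2))  # same odds, starting from the probability
```

Both routes give the same number, which shows that odds (54/46) and probability (54/100) are related but not identical quantities.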


Odds Ratio
• Odds ratio is calculated by dividing one odds by another
• E.g.: What is the odds ratio of men having CAD compared to women?
• The odds of men having CAD / the odds of women having CAD

             Have CAD = 1    No CAD = 0
Men = 1           a               b
Women = 0         c               d


• The odds for men having CAD = n of men having CAD (a) / n of men not having CAD (b) = a/b
• The odds for women having CAD = n of women having CAD (c) / n of women not having CAD (d) = c/d
• The odds ratio (OR) of men having CAD compared to women = (a/b) / (c/d)
• Thus, OR = ad/bc
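As a minimal sketch of the OR = ad/bc formula, the function below is applied to the gender counts reported in Table II of these slides (men: 393 with CAD, 2250 without; women: 228 with CAD, 1819 without). The function name `odds_ratio` is an assumption made for illustration.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = men with/without CAD; c, d = women with/without CAD."""
    return (a / b) / (c / d)


# Counts from Table II: men (CAD, no CAD) and women (CAD, no CAD)
a, b = 393, 2250
c, d = 228, 1819

print(round(odds_ratio(a, b, c, d), 3))
print(round((a * d) / (b * c), 3))  # cross-product form ad/bc
```

The result should agree closely with the crude odds ratio for gender that SPSS reports as Exp(B) in the simple logistic regression.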


The Dataset
• Using dataset: coronary.sav
• Objective of the study: to determine the factors associated with CAD
• The dependent variable:
– CAD: coronary artery disease
– Label: 0 – no CAD
         1 – has CAD


• Independent variables
– systolic blood pressure (sbp): mmHg
– diastolic blood pressure (dbp): mmHg
– serum cholesterol (chol): mmol/l
– body mass index (bmi): unit
– age of the patient (age): years
– race of the patient (race): 0 – Malay, 1 – Chinese, 2 – Indian
– gender of the patient (gender): 0 – women, 1 – men
How to Code Categorical Variables
• Always start with 0
• 0: reference group (low risk, non-diseased, normal)
• For a categorical variable with 2 levels, e.g.:
– smoking status: 0 (non-smoker), 1 (smoker)
– cancer status: 0 (no cancer), 1 (has cancer)
• For a categorical variable with 3 or more levels, e.g.:
– race: 0 (Malay), 1 (Chinese), 2 (Indian)
– income level: 0 (low), 1 (medium), 2 (high)
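The coding rule above can be illustrated outside SPSS. The sketch below expands a 0-coded categorical variable into indicator (dummy) variables, with code 0 as the reference group; this mirrors what a logistic regression procedure does internally with a categorical covariate. The helper `dummy_code` is hypothetical.

```python
# Coding of race as given in the slides
RACE_LABELS = {0: "Malay", 1: "Chinese", 2: "Indian"}


def dummy_code(value, n_levels):
    """Return indicators for levels 1..n_levels-1; the reference level 0 is all zeros."""
    if not 0 <= value < n_levels:
        raise ValueError(f"code {value} is outside 0..{n_levels - 1}")
    return [1 if value == level else 0 for level in range(1, n_levels)]


for code, label in RACE_LABELS.items():
    print(label, dummy_code(code, n_levels=3))
```

A variable with k levels needs k-1 indicators, which is why Table IV lists two coefficients for race (Chinese and Indian, each compared to Malay).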


Steps in Multiple Logistic Regression
(1) Data exploration & cleaning
(2) Univariable analysis (Simple Logistic Regression)
(3) Variable selection (Multiple Logistic Regression)
(preliminary main effect model)
(4) Checking multicollinearity & interaction
(preliminary final model)
(5) Checking assumptions (final model)
(6) Interpretation, conclusion & presentation



Step 1: Data exploration and cleaning


• Descriptive statistics
– Numerical independent variable: mean(SD)

Analyze > Descriptive Statistics > Explore



Put all the numerical variables in the Dependent List box:
• systolic blood pressure
• diastolic blood pressure
• serum cholesterol
• age of the patient
• body mass index

Put the dependent variable in the Factor List box:
• coronary artery disease


Summarize as below:

Table I: Descriptive statistics for numerical variables
                                    No CAD            Has CAD
Variables                           Mean (SD)         Mean (SD)
systolic blood pressure (mmHg)      130.78 (21.73)    145.73 (25.31)
diastolic blood pressure (mmHg)     81.32 (12.18)     90.45 (13.43)
serum cholesterol (mmol/l)          6.12 (1.32)       6.57 (1.26)
age (years)                         45.70 (8.41)      48.19 (8.76)
bmi (unit)                          36.91 (3.75)      36.73 (3.83)


– Categorical independent variable: n(%)

Analyze > Descriptive Statistics > Crosstabs



Put all the categorical variables in the Row(s) box:
• race of the patient
• gender of the patient

Put the dependent variable in the Column(s) box:
• coronary artery disease


Summarize as below:

Table II: Descriptive statistics for variable gender
             No CAD          Has CAD
Gender       n (%)           n (%)
Woman        1819 (88.9)     228 (11.1)
Man          2250 (85.1)     393 (14.9)

Table III: Descriptive statistics for variable race
             No CAD          Has CAD
Race         n (%)           n (%)
Malay        1315 (86.3)     209 (13.7)
Chinese      1375 (86.6)     212 (13.4)
Indian       1379 (87.3)     200 (12.7)


Step 2: Univariable analysis (Simple Logistic Regression)


• To screen for important independent variables
– Look for variables with p-value < 0.25 and/or
clinically important
Analyze > Regression > Binary Logistic



• For categorical independent variable



• N – the number of cases in the dataset
• Included in Analysis – the number of cases that were included in the analysis
• Missing Cases – the number of cases with missing values (not included in the analysis)
• Shows the coding of the dependent variable
– 0: no CAD
– 1: has CAD
• Our interest is the chance of someone having CAD (comparing has CAD to no CAD)
• Thus, no CAD is the reference group
This table shows the reference group defined in the analysis.

By choosing the option First and clicking Change, SPSS will analyze the data with coding 0 as the reference group.


• The Omnibus test tests the significance of the independent variable in the model


• B – regression coefficient
• S.E. – standard error
• Wald and Sig. – a test of the null hypothesis that the regression coefficient equals 0. The null hypothesis is rejected when the p-value (Sig.) is <0.05.
• Here, the independent variable is therefore significant in the model
• df – degrees of freedom
• Exp(B) – exponentiation of the B coefficient
– This is the ODDS RATIO
– Exp(0.332) = 1.394
• 95% C.I. for EXP(B) – 95% confidence interval for the odds ratio
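The link between B, the Wald statistic, Exp(B) and its confidence interval can be reproduced by hand. The sketch below assumes the usual Wald construction: Wald = (B / S.E.)^2, so S.E. = |B| / sqrt(Wald), and the 95% CI is exp(B ± 1.96 × S.E.). The inputs are the gender coefficient (0.332) and the Wald statistic (13.89) reported in these slides; the function name `wald_ci` is illustrative.

```python
import math


def wald_ci(b, wald, z=1.96):
    """Odds ratio and 95% CI from a coefficient and its Wald chi-square.

    Uses Wald = (b / SE)^2, so SE = |b| / sqrt(Wald).
    """
    se = abs(b) / math.sqrt(wald)
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Gender (men vs women): b = 0.332, Wald = 13.89 as reported in the slides
or_, lower, upper = wald_ci(0.332, 13.89)
print(round(or_, 3), round(lower, 2), round(upper, 2))
```

Because the inputs are rounded values from the output, the recovered interval should match the SPSS interval only to about two decimal places.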


• If the 95% CI of the odds ratio does not include 1, the p-value is significant (<0.05) and the odds ratio is interpretable
• If the 95% CI of the odds ratio includes 1, the p-value is not significant and the odds ratio is not interpretable


• Interpretation:
– The 95% confidence interval (1.17, 1.66) does not include 1, the coefficient is positive (0.332), and the variable gender is significant in the model (p<0.001). Men have 39.4% higher odds of having coronary artery disease compared to women, without adjustment for other confounders.
– If the odds ratio is less than 1, the factor is protective (lower risk)
• For numerical independent variable



• Interpretation:
– The 95% confidence interval (1.02, 1.03) does not include 1, the coefficient is positive (0.024), and the variable systolic blood pressure is significant in the model (p<0.001). Each 1 mmHg increase in systolic blood pressure is associated with 2.5% higher odds of having coronary artery disease, without adjustment for other confounders.
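For a numerical predictor, the odds ratio refers to a 1-unit increase, and a larger increase multiplies the coefficient before exponentiating. The sketch below uses the rounded coefficient 0.024 from the slide; the helper name is hypothetical. With the rounded value, the 1 mmHg odds ratio comes out near 1.024 rather than the reported 1.025, presumably because SPSS exponentiates the unrounded coefficient.

```python
import math

b_sbp = 0.024  # coefficient for sbp as printed on the slide (rounded)


def odds_ratio_for_increase(b, units):
    """Odds ratio associated with an increase of `units` in the predictor."""
    return math.exp(b * units)


for units in (1, 5, 10):
    print(units, round(odds_ratio_for_increase(b_sbp, units), 3))
```

Reporting the odds ratio per 10 mmHg is sometimes easier to read clinically than the per-1-mmHg figure; the underlying model is the same.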


Hands-on
• Try running Simple Logistic Regression on the other variables in the dataset:
– dbp
– chol
– age
– bmi
– race


Results from Simple Logistic Regression
Table IV: Associated factors of coronary artery disease by Simple Logistic Regression model
Variable                           Regression         Crude Odds Ratio      Wald statistic   p-value
                                   coefficient (b)    (95% CI)
systolic blood pressure (mmHg)     0.02               1.025 (1.02, 1.03)    203.49           <0.001
diastolic blood pressure (mmHg)    0.05               1.053 (1.05, 1.06)    245.58           <0.001
serum cholesterol (mmol/l)         0.25               1.28 (1.21, 1.37)     63.07            <0.001
age of the patient (years)         0.03               1.04 (1.02, 1.05)     45.50            <0.001
body mass index (unit)             -0.01              0.99 (0.97, 1.01)     1.15             0.285
gender of the patient
  women                            0                  1
  men                              0.33               1.39 (1.17, 1.66)     13.89            <0.001
race of the patient
  Malay                            0                  1
  Chinese                          -0.03              0.97 (0.79, 1.19)     0.08             0.772
  Indian                           -0.09              0.91 (0.74, 1.12)     0.74             0.389


Step 3: Variable selection (Multiple Logistic Regression)


• Review all the p-values from univariable analysis (simple logistic regression)
• Select the candidate variables with p-value <0.25
• May select a variable with p-value >0.25 if it is clinically important
• In this dataset, we select
– sbp
– dbp
– chol
– age
– bmi
– gender
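The screening rule can be written as a small filter over the Table IV p-values. This is only a sketch: "<0.001" is stored as 0.001 (an upper bound), the race entry uses the smaller of its two dummy-variable p-values because no overall p-value is listed, and treating bmi as the clinically important exception is an assumption inferred from its p-value (0.285) exceeding 0.25 while still being selected.

```python
# p-values from Table IV; "<0.001" is stored as 0.001 (an upper bound).
# For race, the smaller of the two dummy p-values is a stand-in (assumption).
P_VALUES = {"sbp": 0.001, "dbp": 0.001, "chol": 0.001, "age": 0.001,
            "bmi": 0.285, "gender": 0.001, "race": 0.389}

# Assumption: bmi is kept on clinical grounds, since its p-value exceeds 0.25.
CLINICALLY_IMPORTANT = {"bmi"}


def screen(p_values, threshold=0.25, keep=frozenset()):
    """Return variables with p < threshold, plus any forced in via `keep`."""
    return [name for name, p in p_values.items()
            if p < threshold or name in keep]


print(screen(P_VALUES, keep=CLINICALLY_IMPORTANT))
```

Without the `keep` argument, bmi would be dropped by the p-value rule alone, which is why the clinical-importance override matters here.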


Methods of Variable Selection
• Enter (manual)
– Manually enter or remove independent variables
• Forward selection
– Automatically enters the IMPORTANT independent variables into the model
• Backward elimination
– Automatically removes the UNIMPORTANT independent variables from the model


• Forward selection
– Conditional
• Stepwise selection method with entry testing based on
the significance of the score statistic
• Removal testing based on the probability of a
likelihood-ratio statistic based on conditional parameter
estimates
– Likelihood Ratio (LR)
• Stepwise selection method with entry testing based on
the significance of the score statistic
• Removal testing based on the probability of a
likelihood-ratio statistic based on the maximum partial
likelihood estimates



• Forward selection
– Wald
• Stepwise selection method with entry testing based on
the significance of the score statistic
• Removal testing based on the probability of the Wald
statistic



• Backward elimination
– Conditional
• Backward stepwise selection
• Removal testing based on the probability of a
likelihood-ratio statistic based on conditional parameter
estimates
– Likelihood Ratio (LR)
• Backward stepwise selection
• Removal testing based on the probability of a
likelihood-ratio statistic based on the maximum partial
likelihood estimates



• Backward elimination
– Wald
• Backward stepwise selection
• Removal testing based on the probability of the Wald
statistic



Backward Elimination (LR)

• sbp
• dbp
• chol
• age
• bmi
• gender
• Start with all variables
• Based on the removal probability of 0.10 set in Options, one variable is eliminated at each step
• At the final step, the variables dbp, chol, age and gender are retained in the model
Forward Selection (LR)

• sbp
• dbp
• chol
• age
• bmi
• gender
• dbp has the smallest p-value for the change in -2 log likelihood (LR test). It is included first (step 1)
• Followed by gender in step 2 and chol in step 3
• At the final step, the variables included in the model are dbp, chol and gender
Comparing Forward & Backward Results

Forward Selection (LR)

Backward Elimination (LR)


• Using backward elimination, the independent variable age is retained
• However, the p-value of age is 0.089 (>0.05)
• If the researcher uses a p-value of 0.05 as the selection criterion (the cut-off point), then age may need to be excluded from the model
• The researcher's decision on removal is based on
– p-value from the Wald statistic
– clinical importance
• At the variable selection step, the preliminary main effect model is obtained


Variable Selection Methods
• Researchers should use various methods
• Each model may differ from the others
• It is advisable to run both forward selection and backward elimination and compare the resulting models to decide which is best, considering
– which model is the most parsimonious and biologically plausible
– which model fits best (check assumptions)
Interpretation
• A person with a 1 mmHg increase in dbp has 1.05 times the odds of having CAD
(b=0.05, OR=1.05, 95% CI 1.04, 1.06, p<0.001)
• A person with a 1 mmol/l increase in chol has 1.15 times the odds of having CAD
(b=0.14, OR=1.15, 95% CI 1.07, 1.23, p<0.001)
• Men have 1.49 times the odds of having CAD compared to women
(b=0.40, OR=1.49, 95% CI 1.24, 1.78, p<0.001)
Step 4: Checking multicollinearity & interaction
• Check multicollinearity
– Assesses whether 2 or more independent variables are highly correlated
– Check correlation estimates
– Check standard errors
• May omit a variable if its standard error is large
• Decision is subjective (depends on the researcher)


• Based on the SPSS output, the correlations between variables are relatively small
– dbp & chol: -0.19
– dbp & gender: 0.07
– chol & gender: -0.05


• Based on the SPSS output, the standard errors of the variables are relatively small
– dbp: 0.003
– chol: 0.035
– gender: 0.092


• Check interaction
– Test biologically / clinically meaningful 2-way interaction terms
– Choose 2 independent variables in the model based on practical considerations
– Create an interaction term
– Add the terms to the model one at a time and check the p-value
• if <0.05, include the term in the model


• Possible 2-way interaction in this model:-
1. dbp & chol
2. dbp & gender
3. chol & gender

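The list of candidate 2-way interactions is simply every unordered pair of main effects, and each interaction term is the product of the two variables. The sketch below enumerates the pairs and computes the product for one made-up patient; the dictionary layout and the helper `interaction_value` are illustrative and not part of the SPSS workflow.

```python
from itertools import combinations

main_effects = ["dbp", "chol", "gender"]  # variables in the preliminary model

# Every unordered pair of main effects is a candidate 2-way interaction.
pairs = list(combinations(main_effects, 2))
print(pairs)


def interaction_value(row, a, b):
    """Product term for one observation; `row` is a plain dict (hypothetical layout)."""
    return row[a] * row[b]


example = {"dbp": 90, "chol": 6.5, "gender": 1}  # made-up patient
for a, b in pairs:
    print(f"{a}*{b} =", interaction_value(example, a, b))
```

With three main effects there are three pairs, matching the list above; with more variables the number of pairs grows quickly, which is why only clinically meaningful pairs are usually tested.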


Use the Ctrl key on the keyboard to select two variables at once


• The interaction term (cholesterol and diastolic
blood pressure) is not significant (p=0.053)



• The interaction term (diastolic blood pressure and gender) is not significant (p=0.203)
• The interaction term (cholesterol and gender) is not significant (p=0.745)
• In conclusion, the standard errors and correlations are relatively small for the three independent variables in the model
• There is no significant interaction effect in the model
• The preliminary final model is obtained


Step 5: Checking Assumptions
• Assessing the goodness of fit
1. The Hosmer-Lemeshow test
2. Classification table
3. Area under the Receiver Operating Characteristic
(ROC) curve



Hosmer-Lemeshow test
• It is based on grouping cases into deciles of risk
• It compares the observed probability with the expected probability within each decile
• Check the p-value. If it is >0.05, there is no significant difference between the observed probability and the expected probability
• Thus, the assumption is met
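To make the grouping idea concrete, here is a minimal sketch of the Hosmer-Lemeshow chi-square statistic in plain Python. It sorts cases by predicted probability, splits them into equal-sized groups, and sums (observed − expected)² / (n × mean_p × (1 − mean_p)) over groups. It computes only the statistic; the p-value would come from a chi-square distribution with (groups − 2) degrees of freedom, which is not computed here. The data are made up, and SPSS may form its groups slightly differently.

```python
def hosmer_lemeshow_statistic(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic (not the p-value).

    y: observed outcomes (0/1); p: predicted probabilities from the model.
    Cases are sorted by predicted risk and split into `groups` near-equal bins.
    Assumes no bin has a mean predicted probability of exactly 0 or 1.
    """
    pairs = sorted(zip(p, y), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(outcome for _, outcome in chunk)
        expected = sum(prob for prob, _ in chunk)
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat


# Made-up data: 5 low-risk and 5 high-risk patients, well calibrated
p_hat = [0.2] * 5 + [0.8] * 5
y_obs = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(hosmer_lemeshow_statistic(y_obs, p_hat, groups=2))
```

A statistic close to 0 means observed and expected events agree within each group, which corresponds to a large p-value and a model that fits.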


• Compare the discrepancy between the observed and expected probability within each decile
• The fit is better if the discrepancy is small


• The p-value is 0.214, which is >0.05, so the assumption is met
• The model fits


Classification table
• Default in SPSS logistic regression
• Overall correctly classified percentage is good
if above 70%
• You can manually calculate
– Sensitivity
– Specificity
– PPV
– NPV

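The manual calculations listed above follow directly from the four cells of the classification table. The sketch below uses made-up counts (they are not taken from the coronary.sav output) and a hypothetical function name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Summary measures from a 2x2 classification table.

    tp/fn: cases predicted as has CAD / no CAD;
    fp/tn: non-cases predicted as has CAD / no CAD.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # overall correctly classified
        "sensitivity": tp / (tp + fn),     # cases correctly identified
        "specificity": tn / (tn + fp),     # non-cases correctly identified
        "ppv": tp / (tp + fp),             # positive predictive value
        "npv": tn / (tn + fn),             # negative predictive value
    }


# Made-up counts, NOT taken from the coronary.sav output
m = classification_metrics(tp=40, fp=10, fn=60, tn=890)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the overall accuracy is high even though sensitivity is low, which can happen when the outcome is uncommon; it is one reason to look at sensitivity and specificity alongside the overall percentage.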


• In this context, the overall correctly classified percentage is 86.4%
• Assumption is met
• The model fits


Area under the ROC curve
• Ranges from 0 to 1
• Able to assess the model discrimination
• A value of 0.5 means the model is useless for
discrimination
• The recommended area under the ROC curve
is at least 0.70
• Values nearer to 1 are better


• Create predicted value



• Create ROC curve
Analyze > ROC Curve



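• Area under the ROC curve is 0.709 (95% CI 0.69, 0.73)
• It is significantly different from 0.5 (p-value <0.05)
• The model can discriminate between patients with and without CAD 70.9% of the time (for a randomly chosen pair, the patient with CAD receives the higher predicted probability)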
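The area under the ROC curve can be computed directly from the saved predicted probabilities as the proportion of case/non-case pairs in which the case has the higher predicted probability (ties counted as one half). The sketch below uses made-up values; the function name `auc` is illustrative, and this pairwise approach is meant for small examples rather than large datasets.

```python
def auc(y, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities for 3 cases and 3 non-cases
y_obs = [1, 1, 1, 0, 0, 0]
p_hat = [0.9, 0.6, 0.3, 0.5, 0.2, 0.1]
print(auc(y_obs, p_hat))
```

A model that assigns the same score to everyone gets 0.5 under this definition, which matches the "useless for discrimination" benchmark mentioned earlier.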
• In conclusion,
– Hosmer-Lemeshow test: p-value = 0.214, which is >0.05
– Classification table: overall correctly classified percentage is 86.4%, which is >70%
– ROC curve: area under the curve is 70.9%, which is >70%
• Assumptions are met
• The final model is achieved


Step 6: Interpretation, Conclusion & Presentation
• Establish final model



• Diastolic blood pressure, serum cholesterol and gender have a significant association with the presence of coronary artery disease


Table V: Associated factors of coronary artery disease by Multiple Logistic Regression model
Variable                           Regression         Adjusted Odds Ratio a   Wald statistic   p-value
                                   coefficient (b)    (95% CI)
diastolic blood pressure (mmHg)    0.05               1.05 (1.04, 1.06)       212.62           <0.001
serum cholesterol (mmol/l)         0.14               1.15 (1.07, 1.23)       15.66            <0.001
gender of the patient
  women                            0                  1
  men                              0.40               1.49 (1.24, 1.78)       18.55            <0.001
a Forward LR Multiple Logistic Regression model was applied
Multicollinearity and interaction term were checked and not found
Hosmer-Lemeshow test (p=0.214), classification table (overall correctly classified percentage=86.4%) and area under the ROC curve (70.9%) were applied to check the model fit


Table VI: Associated factors of coronary artery disease by simple and multiple logistic regression model
                                   Simple Logistic Regression                Multiple Logistic Regression a
Variable                           b      Crude OR (95% CI)     p           b      Adjusted OR (95% CI)   p
diastolic blood pressure (mmHg)    0.05   1.053 (1.05, 1.06)    <0.001      0.05   1.05 (1.04, 1.06)      <0.001
serum cholesterol (mmol/l)         0.25   1.28 (1.21, 1.37)     <0.001      0.14   1.15 (1.07, 1.23)      <0.001
gender
  women                            0      1                                 0      1
  men                              0.33   1.39 (1.17, 1.66)     <0.001      0.40   1.49 (1.24, 1.78)      <0.001
a Forward LR Multiple Logistic Regression model was applied
Multicollinearity and interaction term were checked and not found
Hosmer-Lemeshow test (p=0.214), classification table (overall correctly classified percentage=86.4%) and area under the ROC curve (70.9%) were applied to check the model fit


Interpretation
• A person with a 1 mmHg increase in diastolic blood pressure has 5% higher odds of having coronary artery disease (95% CI 1.04, 1.06, p<0.001) when adjusted for serum cholesterol and gender
• A person with a 1 mmol/l increase in serum cholesterol has 15% higher odds of having coronary artery disease (95% CI 1.07, 1.23, p<0.001) when adjusted for diastolic blood pressure and gender
• Men have 49% higher odds of having coronary artery disease compared to women (95% CI 1.24, 1.78, p<0.001) when adjusted for diastolic blood pressure and serum cholesterol
Prediction
• B (regression coefficient)
– These are the values in the logistic regression equation for predicting the dependent variable from the independent variables
– They are in log-odds units
• The prediction equation is
log(p/(1-p)) = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4


• where p is the probability of having coronary artery disease
• Expressed in terms of the variables used in this example, the logistic regression equation (logit equation) is
log(p/(1-p)) = -7.242 + 0.050*dbp + 0.137*chol + 0.398*gender(1)
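The logit equation can be turned into a predicted probability by inverting the log-odds: p = 1 / (1 + exp(-logit)). The sketch below uses the coefficients printed on the slide; the patient values and the function name `predicted_probability` are made up for illustration.

```python
import math

# Coefficients from the logit equation on the slide
B0, B_DBP, B_CHOL, B_MEN = -7.242, 0.050, 0.137, 0.398


def predicted_probability(dbp, chol, is_man):
    """Predicted probability of CAD from the fitted logit equation."""
    logit = B0 + B_DBP * dbp + B_CHOL * chol + B_MEN * (1 if is_man else 0)
    return 1 / (1 + math.exp(-logit))


# Made-up patients: dbp 90 mmHg, cholesterol 6.5 mmol/l
p_man = predicted_probability(90, 6.5, True)
p_woman = predicted_probability(90, 6.5, False)
print(round(p_man, 3), round(p_woman, 3))
```

Holding dbp and chol fixed, the ratio of the two patients' odds equals exp(0.398), which is the adjusted odds ratio for gender of about 1.49.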


Conclusion
• Binary logistic regression deals with a dichotomous outcome variable
• Odds ratios help the interpretation of the association between independent and dependent variables
• Follow the proper steps to ensure the best model can be obtained
• Proper coding must be practiced
• Understand both statistical significance and clinical importance
