Binary Logistic Regression
Prof Sami Abdo Radman
• Why use logistic regression?
• 1. Description - to measure the strength of the association between the outcome and factors of interest
• 2. Adjustment - for covariates/confounders
• 3. Predictors - to determine the important risk factors affecting the outcome
• 4. Prediction - to estimate the probability of the outcome for new cases (equation → probability)
• Logistic regression predicts the probability of having the outcome (0-100%)
• Logistic regression detects the relation between Y and X (predicts the change in Y when X changes)
• It predicts the probability of Y for a given X
Assumptions
• Dependent= categorical (dichotomous)(binary)
• Independent = categorical or continuous
Coding of the dependent
Dependent variable:
• Yes = 1 (yes should have the bigger code)
• No = 0
• SPSS will predict the value 1 (yes)
• If Yes = 1 and No = 2, SPSS will predict 2 (no)
• i.e., SPSS predicts the category with the bigger code
E.g., to predict systolic BP ≥ 180:
• Dependent = SBP ≥ 180 (yes, no)
• Independent =
• Race (Chinese, Indians, Malay, others)
• Smoking (yes, no)
• Age (continuous)
• Analyze → Regression → Binary Logistic
• Enter the dependent variable
• Enter the independent variables
• Click "Categorical" to define the categorical variables and select the reference category (the reference category will be coded 0 in the output)
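What the "Categorical" dialog does is dummy-code each k-category predictor into k - 1 indicator variables, with the chosen reference category coded as all zeros. A minimal Python sketch of that coding (the function name and labels are illustrative, not SPSS output):

```python
def dummy_code(value, categories, reference, label):
    """Dummy-code one categorical value; the reference category maps to all zeros."""
    others = [c for c in categories if c != reference]
    return {f"{label}({i + 1})": int(value == c) for i, c in enumerate(others)}

races = ["Chinese", "Indians", "Malay", "others"]
print(dummy_code("Indians", races, "Chinese", "Race"))  # {'Race(1)': 1, 'Race(2)': 0, 'Race(3)': 0}
print(dummy_code("Chinese", races, "Chinese", "Race"))  # reference group: all zeros
```

With Chinese as the reference, an Indian subject gets Race(1) = 1 and the other dummies 0, matching the Race(1)/Race(2)/Race(3) terms that appear in the SPSS output.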
Output
• For each independent categorical variable we must select one category as a reference group:
• Race: reference = Chinese = 0
• Smoking: reference = No = 0
• Don't forget: the given OR is the adjusted OR
• Exp(B) = OR
• B = beta coefficient
• Wald = identifies the most important variable in the model (here age is the most important predictor: Wald = 11.007)
• A smoker compared to a non-smoker is 9.9 (95% CI 1.4 to 68.4) times more likely to have SBP ≥ 180.
• Or: a smoker is 9.9 (95% CI 1.4 to 68.4) times more likely to have SBP ≥ 180 compared to a non-smoker.
• Or: the odds of having SBP ≥ 180 are 9.9 times greater for smokers compared to non-smokers.
• Since age is a quantitative (numerical) variable, an increase of one year in age gives a 23.3% (95% CI 8.9% to 39.5%) increase in the odds of having SBP ≥ 180: percent change = (Exp(B) for age - 1) × 100.
The best interpretation
• Controlling for the other explanatory variables, an increase of one year in age gives a 23.3% increase in the odds of having SBP ≥ 180 ((Exp(B) - 1) × 100).
• Controlling for the other explanatory variables, a smoker compared to a non-smoker is 9.9 times more likely to have SBP ≥ 180.
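The percent-change rule for a continuous predictor can be checked directly. Using the age coefficient B = 0.209 from the prediction equation later in these slides, (Exp(B) - 1) × 100 gives roughly the 23% figure reported (small rounding differences aside):

```python
import math

def pct_change_in_odds(b):
    """Percent change in odds per one-unit increase in a continuous predictor."""
    return (math.exp(b) - 1) * 100

print(round(pct_change_in_odds(0.209), 1))  # ~23.2 (slides round to 23.3%)
```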
• Interpretation of OR:
• The exposed have n times the likelihood of having the outcome compared to the non-exposed
• The exposed are n times more likely to have the outcome compared to the non-exposed
• If the OR is less than 1 (protective), for example OR = 0.32:
• You can change the reference group, or divide 1 by the OR (1/OR)
• Example: male compared to female, OR = 2
• Female compared to male, OR = 1/2 = 0.5
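Swapping the reference group is the same as taking the reciprocal of the OR, a quick sketch:

```python
def flip_reference(odds_ratio):
    """OR after swapping the exposed and reference groups = 1 / OR."""
    return 1 / odds_ratio

print(flip_reference(2.0))              # male vs female OR = 2 -> female vs male OR = 0.5
print(round(flip_reference(0.32), 2))   # protective OR 0.32 reads as ~3.12 the other way round
```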
Total model is significant (p < 0.001)
If the overall model is significant, this means there is at least one significant variable in the model (omnibus test, chi-square χ²); i.e., at least one beta is not equal to zero.
R square (pseudo R square)
• The Nagelkerke R square shows that about 50.6% of the variation in the outcome variable (SBP ≥ 180) is explained by this logistic model.
• The Cox and Snell R square shows that 35% of the variation in the outcome variable (SBP ≥ 180) is explained by this logistic model.
• So 35% to 51% of the variation in the outcome variable (SBP ≥ 180) is explained by this logistic model.
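For reference, the two pseudo-R-square formulas can be written out from the null and fitted model log-likelihoods. The values below are hypothetical, not the SPSS output behind the slides' 35%/50.6% figures:

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell R^2 = 1 - exp(2 * (LL_null - LL_model) / n)."""
    return 1 - math.exp(2 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke rescales Cox & Snell so its maximum possible value is 1."""
    return cox_snell_r2(ll_null, ll_model, n) / (1 - math.exp(2 * ll_null / n))

# Hypothetical log-likelihoods for a model fitted on n = 55 subjects
cs = cox_snell_r2(-34.4, -16.8, 55)
nk = nagelkerke_r2(-34.4, -16.8, 55)
print(round(cs, 3), round(nk, 3))  # Nagelkerke is always >= Cox & Snell
```

This is why SPSS reports Nagelkerke above Cox and Snell: the Cox and Snell statistic cannot reach 1 for a binary outcome, and Nagelkerke corrects for that.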
Classification table (classification accuracy, calibration)
• The overall accuracy of this model in predicting subjects having SBP ≥ 180 is 85.5%. (To be a good model, accuracy should be > 50%.)
• The sensitivity is 9/15 = 60%
• The specificity is 38/40 = 95%
• Positive predictive value (PPV) = 9/11 = 81.8%
• Negative predictive value (NPV) = 38/44 = 86.4%
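The fractions on this slide imply a 2×2 classification table with TP = 9, FN = 6, FP = 2, TN = 38 (n = 55); the metrics follow directly from those counts:

```python
tp, fn, fp, tn = 9, 6, 2, 38  # counts recovered from the slide's fractions

sensitivity = tp / (tp + fn)                 # 9/15  = 0.60
specificity = tn / (tn + fp)                 # 38/40 = 0.95
ppv = tp / (tp + fp)                         # 9/11  ~ 0.818
npv = tn / (tn + fn)                         # 38/44 ~ 0.864
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 47/55 ~ 0.855

print(f"accuracy = {accuracy:.1%}")  # accuracy = 85.5%
```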
• Hosmer-Lemeshow goodness of fit
• A p value > 0.05 is expected (non-significant = good fit)
• The model fits the data because p = 0.555
Multicollinearity (if high → the model is not acceptable)
• 1. Check the SE: if SE > 5 → multicollinearity (some books say > 2)
• 2. Check the correlation matrix
• If there is multicollinearity between two variables in the model, we should remove one of them (the one with the higher SE),
• or remove the one which is not significant,
• or remove the one which is less important (if both are significant).
Conclusion
• The logistic regression model was statistically significant, χ² = 35.1, p < .001.
• The model explained 51.0% (Nagelkerke R²) of the variance in SBP ≥ 180.
• The model correctly classified 85.5% of cases.
• The model fits the data: Hosmer-Lemeshow goodness of fit p > 0.05.
• Increasing age was associated with an increased likelihood of SBP ≥ 180: an increase of one year in age gives a 23.3% increase in the odds of having SBP ≥ 180.
• A smoker compared to a non-smoker is 9.9 times more likely to have SBP ≥ 180.
• Race is not a significant predictor.
Predicting equation (to predict probability)
First calculate z (the logit), then apply the exponential function:
• z = -14.462 + 0.209 × Age + 2.292 × Smoker(1) + 0.640 × Race(1) + 1.303 × Race(2) - 0.097 × Race(3)
• e denotes the exponential function
• https://www.medcalc.org/manual/exp_function.php
• z is the log odds (the logit transformation)
• For example, a 45-year-old non-smoking Chinese subject:
• Smoker(1) = 0
• Race(1) = Race(2) = Race(3) = 0, so
• z = -14.462 + 0.209 × 45 = -5.057
• -z = -(-5.057) = 5.057
• e^(-z) = 157.1
• Probability = 1/(1 + e^(-z))
• Probability = 1/(1 + 157.1) = 0.006
• Prob(SBP ≥ 180) = 0.006 (0.6%): it is very unlikely that this subject has SBP ≥ 180
• In general, a probability of less than 0.50 is considered unlikely
• Probability ranges from 0 to 1
• Another example: a 65-year-old Indian smoker:
• Smoker(1) = 1, Race(1) = 1, Race(2) = Race(3) = 0
• z = -14.462 + 0.209 × 65 + 2.292 × 1 + 0.640 × 1 = 2.055
• -z = -2.055
• e^(-z) = 0.128
• Probability = 1/(1 + 0.128) = 0.89
• Prob(SBP ≥ 180) = 0.89 = 89% → it is very likely that this subject has SBP ≥ 180
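Both worked examples can be reproduced in a few lines using the coefficients from the slide's equation:

```python
import math

def predict_prob(age, smoker, race1, race2, race3):
    """Probability of SBP >= 180 using the fitted coefficients from the slides."""
    z = (-14.462 + 0.209 * age + 2.292 * smoker
         + 0.640 * race1 + 1.303 * race2 - 0.097 * race3)
    return 1 / (1 + math.exp(-z))  # the logistic (sigmoid) transformation of z

# 45-year-old non-smoking Chinese subject (reference race: all dummies = 0)
print(round(predict_prob(45, 0, 0, 0, 0), 3))  # 0.006 -> very unlikely
# 65-year-old Indian smoker (Race(1) = 1)
print(round(predict_prob(65, 1, 1, 0, 0), 2))  # 0.89 -> very likely
```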
Hypothesis of logistic regression
• H0: β1 = β2 = β3 = ... = βn = 0
• H1: at least one regression coefficient is not equal to zero
• The function used in logistic regression is the sigmoid (logistic) function
• It is used to create probabilities:
• Probability ≥ 0.5 → class = 1
• Probability < 0.5 → class = 0
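The sigmoid function and the 0.5 classification cutoff can be sketched directly:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def classify(z, threshold=0.5):
    """Class = 1 when the predicted probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0))        # 0.5 (z = 0 is the decision boundary)
print(classify(2.055))   # 1: probability ~0.89 -> predicted to have the outcome
print(classify(-5.057))  # 0: probability ~0.006 -> predicted not to
```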
Sample size for logistic regression
• Enter technique → 5-10 subjects for each variable
• Backward/ forward → 20 subjects for each variable
Comparison to linear regression
• Linear regression predicts continuous numbers
• Logistic regression helps us predict whether a person will have the outcome or not (probability from 0 to 1)
• Probability ≥ 0.5 → has the outcome
• Probability < 0.5 → does not have the outcome
• Logistic regression is used to classify samples
In linear regression
• The line is fitted using least squares
In logistic regression
• Unlike linear regression, which outputs continuous values, logistic regression transforms its output using the logistic sigmoid function to return a probability value: the probability of having the outcome
References
• https://statistics.laerd.com/spss-tutorials/binomial-logistic-
regression-using-spss-statistics.php#procedure
• Check the file sent by email
SPSS example
• To predict Ischemic heart disease
• By
• Age
• Residency
• Level of education
• Diabetes status
Notes
• Binary logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables.
• Sample size needed: for each variable entered into the model we need at least 10 participants (cases), preferably 20.
• For example, if we have 5 variables in the model we need at least 50 participants (sample size = 50).
Table 5. Bivariate Analysis: Association between Breast Self
Examination (BSE) and Socio-Demographic Variables
Table 6. Results of the Binary Multiple Regression Predicting
BSE in a Sample of Urban Women in Shah Alam, Malaysia (n=
222)
Predictors of smoking among university
students