Advanced Quantitative Techniques
Logistic regressions
Difference between linear and logistic
regression
• Linear (OLS) regression: for an interval-ratio dependent variable; predicts the value of the dependent variable given values of the independent variables.
• Logistic regression: for a categorical (usually binary)* dependent variable; predicts the probability that the dependent variable belongs to a category given values of the independent variables.
*For this class, we are only using interval-ratio or binary variables. Count variables (non-negative integer outcomes) and categorical variables with more than two categories require more advanced regressions (Poisson regression and multinomial logistic regression, respectively).
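The contrast between the two prediction rules can be sketched numerically. This is an illustration only, with made-up coefficients (not from any dataset in this course):

```python
import math

def ols_predict(b0, b1, x):
    # Linear regression: predicted value of the dependent variable
    return b0 + b1 * x

def logistic_predict(b0, b1, x):
    # Logistic regression: predicted probability that y = 1, obtained by
    # passing the linear predictor through the logistic function
    xb = b0 + b1 * x
    return 1 / (1 + math.exp(-xb))

# An OLS prediction can be any number; a logistic prediction always lies in (0, 1)
print(ols_predict(-2.0, 0.5, 10))       # 3.0
print(logistic_predict(-2.0, 0.5, 10))  # ~0.953
```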
Logistic / logit
• Open divorce.dta
• list divorce positives
• scatter divorce positives
Logistic / logit
• logistic divorce positives
• predict preddiv
• scatter preddiv positives
Logistic / logit
Logit = ln(odds)
• In Stata, there are two commands for logistic regression: logit
and logistic.
• The logit command gives the regression coefficients to estimate
the logit score.
• The logistic command gives us the odds ratios we need to
interpret the effect size of the predictors.
• The logit is a function of the logistic regression: it is just a
different way of presenting the same relationship between
independent and dependent variables (see Acock, section 11.2)
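The relationship between probabilities, odds, and the logit can be checked directly. A small sketch (plain Python, just the definitions above):

```python
import math

def prob_to_odds(p):
    # Odds are the probability of the event over the probability of no event
    return p / (1 - p)

def logit(p):
    # The logit is the natural log of the odds
    return math.log(prob_to_odds(p))

def inv_logit(z):
    # Back from the logit scale to a probability
    return 1 / (1 + math.exp(-z))

p = 0.75
print(prob_to_odds(p))      # 3.0  (3-to-1 odds)
print(round(logit(p), 4))   # 1.0986
print(inv_logit(logit(p)))  # back to 0.75
```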
Logistic / logit
• Open nlsy97_chapter11.dta
• We want to test the impact of some variables on the
likelihood that a young person will drink alcohol
• summarize drank30 age97 pdrink97 dinner97 male if !missing(drank30, age97, pdrink97, dinner97, male)
Logistic
• Interpretation:
• The odds of drinking are multiplied by 1.169 for each additional year of age.
• The odds of drinking are multiplied by 1.329 for each one-unit increase in peer drinking (pdrink97).
• The odds of drinking are multiplied by 0.942 for each additional day the person has dinner with their family.
• LR chi2(4) = 78.01, p < 0.0001: the model as a whole is statistically significant.
Logit
Coefficients tell the amount of increase in the predicted log odds of drank30 = 1 that would be predicted by a 1-unit increase in the predictor, holding all other predictors constant.
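The logit coefficients and the odds ratios describe the same model: exponentiating a logit coefficient gives the corresponding odds ratio. Checking this with the age97 coefficient from the slides:

```python
import math

# Logit coefficient for age97, as reported on the slides
b_age = 0.15635

# exp(b) is the factor change in the odds for a one-unit increase in age97;
# it matches the odds ratio reported by the logistic command
odds_ratio = math.exp(b_age)
print(round(odds_ratio, 4))  # 1.1692
```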
Comparing effects of variables
• It is hard to compare the effects of two independent variables using odds ratios when the variables are measured on different scales.
• For example, the variable male is binary (0 or 1), so it is simple to interpret its effect in odds ratio terms.
• But it is hard to compare the effect of “male” with the effect of dinner97 (number of days the person has dinner with his or her family), which ranges from 0 to 7.
• While the odds ratio of “male” tells us how much more likely a male is to drink compared to a female, the odds ratio of dinner97 tells us the change in the odds for each additional day.
• Standardized (beta) coefficients put the effects on a common scale, allowing a comparison based on standard deviations.
Comparing effect of variables
• listcoef, help
• (If listcoef does not work, use findit listcoef to install the command.)
. listcoef, help
logit (N=1654): Factor Change in Odds
Odds of: 1 vs 0
------------------------------------------------------------------
 drank30 |        b        z    P>|z|      e^b  e^bStdX    SDofX
---------+--------------------------------------------------------
   age97 |  0.15635    2.672    0.008   1.1692   1.1578   0.9371
    male | -0.02072   -0.194    0.846   0.9795   0.9897   0.4985
pdrink97 |  0.28463    6.325    0.000   1.3293   1.4131   1.2149
dinner97 | -0.05966   -2.693    0.007   0.9421   0.8692   2.3494
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
e^b = exp(b) = factor change in odds for unit increase in X
e^bStdX = exp(b*SD of X) = change in odds for SD increase in X
SDofX = standard deviation of X
Comparing effect of variables
• listcoef, help percent
. listcoef, help percent
logit (N=1654): Percentage Change in Odds
Odds of: 1 vs 0
------------------------------------------------------------------
drank30 | b z P>|z| % %StdX SDofX
---------+--------------------------------------------------------
age97 | 0.15635 2.672 0.008 16.9 15.8 0.9371
male | -0.02072 -0.194 0.846 -2.1 -1.0 0.4985
pdrink97 | 0.28463 6.325 0.000 32.9 41.3 1.2149
dinner97 | -0.05966 -2.693 0.007 -5.8 -13.1 2.3494
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
% = percent change in odds for unit increase in X
%StdX = percent change in odds for SD increase in X
SDofX = standard deviation of X
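The columns of the two listcoef tables are simple transformations of b and SDofX, which can be verified by hand. Using the pdrink97 row from the output above:

```python
import math

# b and SDofX for pdrink97, from the listcoef output above
b, sd = 0.28463, 1.2149

factor_unit = math.exp(b)           # e^b: factor change in odds per unit of X
factor_sd = math.exp(b * sd)        # e^bStdX: factor change per SD of X
pct_unit = 100 * (factor_unit - 1)  # %: percent change in odds per unit
pct_sd = 100 * (factor_sd - 1)      # %StdX: percent change per SD

print(round(factor_unit, 4), round(factor_sd, 4))  # 1.3293 1.4131
print(round(pct_unit, 1), round(pct_sd, 1))        # 32.9 41.3
```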
Hypothesis testing
• 1. Wald chi-squared test: the z statistic reported by Stata in the logistic regression output.
• 2. Likelihood-ratio chi-squared test.
• Compare LR chi2 with and without the variable
you want to test.
• To test variable “age97”:
logistic drank30 male dinner97 pdrink97
estimates store a
logistic drank30 age97 male dinner97 pdrink97
lrtest a
Hypothesis testing
. logistic drank30 male dinner97 pdrink97
Logistic regression Number of obs = 1654
LR chi2(3) = 70.83
Prob > chi2 = 0.0000
Log likelihood = -1064.6372 Pseudo R2 = 0.0322
drank30 Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
male .9798341 .1045179 -0.19 0.849 .7949792 1.207673
dinner97 .9418512 .020823 -2.71 0.007 .9019105 .9835606
pdrink97 1.376461 .0594022 7.40 0.000 1.264823 1.497953
_cons .4153532 .0673672 -5.42 0.000 .3022449 .5707898
.
. estimates store a
.
. logistic drank30 age97 male dinner97 pdrink97
Logistic regression Number of obs = 1654
LR chi2(4) = 78.01
Prob > chi2 = 0.0000
Log likelihood = -1061.0474 Pseudo R2 = 0.0355
drank30 Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
age97 1.169241 .0684191 2.67 0.008 1.042546 1.311332
male .9794922 .1046935 -0.19 0.846 .7943646 1.207764
dinner97 .942086 .0208682 -2.69 0.007 .9020603 .9838878
pdrink97 1.329275 .0598174 6.33 0.000 1.217056 1.451841
_cons .0524677 .0415938 -3.72 0.000 .0110944 .2481314
.
. lrtest a
Likelihood-ratio test LR chi2(1) = 7.18
(Assumption: a nested in .) Prob > chi2 = 0.0074
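The LR chi2(1) = 7.18 reported by lrtest is just twice the difference between the two models' log likelihoods, which can be recomputed from the output above:

```python
# Log likelihoods from the two logistic models above
ll_reduced = -1064.6372  # model without age97
ll_full = -1061.0474     # model with age97

# The likelihood-ratio statistic is twice the gain in log likelihood
lr_chi2 = 2 * (ll_full - ll_reduced)
print(round(lr_chi2, 2))  # 7.18
```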
Hypothesis testing
• Same process, but for each of the variables
• lrdrop1
• (install command using ssc install lrdrop1)
. lrdrop1
Likelihood Ratio Tests: drop 1 term
logistic regression
number of obs = 1654
------------------------------------------------------------------------
drank30 Df Chi2 P>Chi2 -2*log ll Res. Df AIC
------------------------------------------------------------------------
Original Model 2122.09 1649 2132.09
-age97 1 7.18 0.0074 2129.27 1648 2137.27
-male 1 0.04 0.8463 2122.13 1648 2130.13
-dinner97 1 7.23 0.0072 2129.33 1648 2137.33
-pdrink97 1 40.62 0.0000 2162.72 1648 2170.72
------------------------------------------------------------------------
Terms dropped one at a time in turn.
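The -2*log ll and AIC columns in the lrdrop1 table follow from the original model's log likelihood (AIC = -2·ll + 2k, where k counts the 4 predictors plus the constant):

```python
# Log likelihood of the original (full) model, from the output above
ll_full = -1061.0474
k = 5  # parameters: 4 predictors + constant

minus2ll = -2 * ll_full
aic = minus2ll + 2 * k
print(round(minus2ll, 2), round(aic, 2))  # 2122.09 2132.09
```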
Marginal effects
• We will use the variable race97 and drop the variable male.
• We want to test the effect of a person being
black compared to being white. Thus, we will
drop observations where the person has other
racial background.
generate black = race97 - 1
replace black = . if race97 > 2
Marginal effects
label define black 0 "White" 1 "Black"
label define drank30 0 "No" 1 "Yes"
label values drank30 drank30
label values black black
logit drank30 age97 i.black pdrink97 dinner97
Marginal effects
. logit drank30 age97 i.black pdrink97 dinner97
Iteration 0: log likelihood = -935.86755
Iteration 1: log likelihood = -901.48553
Iteration 2: log likelihood = -901.37312
Iteration 3: log likelihood = -901.37311
Logistic regression Number of obs = 1413
LR chi2(4) = 68.99
Prob > chi2 = 0.0000
Log likelihood = -901.37311 Pseudo R2 = 0.0369
drank30 Coef. Std. Err. z P>|z| [95% Conf. Interval]
age97 .138153 .0635579 2.17 0.030 .0135818 .2627241
black
“Black” -.3804608 .1352133 -2.81 0.005 -.645474 -.1154476
pdrink97 .2822417 .048233 5.85 0.000 .1877067 .3767767
dinner97 -.069024 .0246204 -2.80 0.005 -.1172791 -.0207689
_cons -2.590308 .8609411 -3.01 0.003 -4.277722 -.9028946
Marginal effects
• The margins command tells us the difference in the probability of having drunk in the last 30 days if an individual is black compared with an individual who is white.
• Initially, we are setting the covariates at their means.
• So the command will tell us the difference between blacks and whites who are average on the other covariates.
Marginal effects
• margins, dydx(black) atmeans
. margins, dydx(black) atmeans
Conditional marginal effects Number of obs = 1413
Model VCE : OIM
Expression : Pr(drank30), predict()
dy/dx w.r.t. : 1.black
at : age97 = 13.67445 (mean)
0.black = .7523001 (mean)
1.black = .2476999 (mean)
pdrink97 = 2.112527 (mean)
dinner97 = 4.760793 (mean)
Delta-method
dy/dx Std. Err. z P>|z| [95% Conf. Interval]
black
“Black” -.0862436 .0296054 -2.91 0.004 -.144269 -.0282181
Note: dy/dx for factor levels is the discrete change from the base level.
• dy/dx: the derivative (here, a discrete change) at the selected point (where all other variables are at their means)
• Interpretation: a black individual who is 13.67 years old, etc., is estimated to be 8.6 percentage points less likely to drink than a white individual who is 13.67 years old, etc.
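The dy/dx of -.0862 can be reproduced by hand from the logit coefficients: compute the predicted probability at the covariate means for black = 1 and black = 0, and take the difference:

```python
import math

# Logit coefficients from the model above (logit drank30 age97 i.black pdrink97 dinner97)
b0, b_age, b_black, b_pdrink, b_dinner = (
    -2.590308, 0.138153, -0.3804608, 0.2822417, -0.069024)

# Covariate means reported by margins, atmeans
age, pdrink, dinner = 13.67445, 2.112527, 4.760793

def prob(black):
    # Predicted probability of drinking at the covariate means
    xb = (b0 + b_age * age + b_black * black
          + b_pdrink * pdrink + b_dinner * dinner)
    return 1 / (1 + math.exp(-xb))

# Discrete change from white (0) to black (1), other covariates at their means
dydx = prob(1) - prob(0)
print(round(dydx, 4))  # -0.0862
```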
Marginal effects
• We can also test marginal effects at points
other than the mean using the at( ) option.
• margins, at(pdrink97=(1 2 3 4 5)) atmeans
Marginal effects
For an individual with pdrink97 coded 2 we estimate a 36% probability that he or she
drank in the last 30 days
Marginal effects
Estimated probability that an adolescent drank in the last month, adjusted for age, race, and frequency of family meals (holding all of those at their means).
Marginal effects
For an individual that has dinner with his or her family 3 times a week, we estimate a
39% probability that he or she drank in the last 30 days
Example 1
• Use severity.dta
. logit severity liberal female
Iteration 0: log likelihood = -331.35938
Iteration 1: log likelihood = -217.1336
Iteration 2: log likelihood = -216.94677
Iteration 3: log likelihood = -216.94661
Iteration 4: log likelihood = -216.94661
Logistic regression Number of obs = 480
LR chi2(2) = 228.83
Prob > chi2 = 0.0000
Log likelihood = -216.94661 Pseudo R2 = 0.3453
severity Coef. Std. Err. z P>|z| [95% Conf. Interval]
liberal 1.055704 .090975 11.60 0.000 .8773958 1.234011
female .6526588 .2423695 2.69 0.007 .1776233 1.127694
_cons -3.547764 .3393913 -10.45 0.000 -4.212959 -2.882569
Example 1
• Use severity.dta
• We are trying to see what predicts whether an individual
thinks that prison sentences are too severe
. logistic severity liberal female
Logistic regression Number of obs = 480
LR chi2(2) = 228.83
Prob > chi2 = 0.0000
Log likelihood = -216.94661 Pseudo R2 = 0.3453
severity Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
liberal 2.873997 .2614619 11.60 0.000 2.404629 3.434981
female 1.920641 .4655047 2.69 0.007 1.194375 3.088527
_cons .0287889 .0097707 -10.45 0.000 .0148025 .0559907
Example 1
. margins, at(liberal=(1 3 5))
Predictive margins Number of obs = 480
Model VCE : OIM
Expression : Pr(severity), predict()
1._at : liberal = 1
2._at : liberal = 3
3._at : liberal = 5
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
_at
1 .1075282 .0224531 4.79 0.000 .063521 .1515354
2 .4887975 .0297587 16.43 0.000 .4304716 .5471234
3 .8833568 .0211126 41.84 0.000 .8419769 .9247367
Example 1
. margins, at(liberal=(1 3 5)) atmeans
Adjusted predictions Number of obs = 480
Model VCE : OIM
Expression : Pr(severity), predict()
1._at : liberal = 1
female = .5125 (mean)
2._at : liberal = 3
female = .5125 (mean)
3._at : liberal = 5
female = .5125 (mean)
Delta-method
Margin Std. Err. z P>|z| [95% Conf. Interval]
_at
1 .1036257 .0217902 4.76 0.000 .0609177 .1463337
2 .4884607 .0305729 15.98 0.000 .428539 .5483824
3 .8874787 .0202504 43.83 0.000 .8477885 .9271688
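These adjusted predictions can also be recomputed from the logit coefficients shown earlier (logit severity liberal female), plugging in each value of liberal with female held at its mean:

```python
import math

# Logit coefficients from the severity model above
b0, b_lib, b_fem = -3.547764, 1.055704, 0.6526588
fem_mean = 0.5125  # mean of female, as reported by margins

def adjusted_prediction(liberal):
    # Predicted probability with female fixed at its mean
    xb = b0 + b_lib * liberal + b_fem * fem_mean
    return 1 / (1 + math.exp(-xb))

for lib in (1, 3, 5):
    print(lib, round(adjusted_prediction(lib), 4))
# Matches the margins table: ~0.1036, ~0.4885, ~0.8875
```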