Binary Data
Learning outcomes:
Discrimination in mortgage application; linear regression model
Preamble
• Two people, identical but for their race, walk into a bank and apply
for a mortgage, a large loan so that each can buy an identical house.
• Does the bank treat them the same way?
• Are they both equally likely to have their mortgage application
accepted?
• By law, they must receive identical treatment.
• But whether or not they do is a matter of great concern among
bank regulators.
Do banks discriminate?
How can we test for discrimination?
• But a simple comparison of denial rates for black and white
applicants does not really answer the question of interest, because
the two groups of applicants are not necessarily “identical but for
their race”.
• Instead, we need a method for comparing rates of denial, holding
other applicant characteristics constant.
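The difference between a raw comparison and one that holds another characteristic constant can be sketched on simulated data. Everything here is an assumption made for illustration: the data are synthetic, `group` is a hypothetical 0/1 indicator, `pi_ratio` is a hypothetical payment-to-income ratio, and the probabilities are chosen by hand. None of it comes from the mortgage data used in these slides.

```python
import numpy as np

# Synthetic data only; every number below is an assumption made for illustration.
rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n).astype(float)  # hypothetical 0/1 group indicator

# Assume the two groups also differ in another characteristic (payment-to-income ratio).
pi_ratio = rng.uniform(0.1, 0.5, size=n) + 0.1 * group

# Assumed denial probability: depends on pi_ratio and, by construction, on group.
prob = 0.5 * pi_ratio + 0.05 * group
deny = (rng.uniform(size=n) < prob).astype(float)  # binary outcome: 1 = denied

# Raw comparison: difference in denial rates between the two groups.
raw_gap = deny[group == 1].mean() - deny[group == 0].mean()

# Adjusted comparison: coefficient on group in a regression that also includes pi_ratio.
X = np.column_stack([np.ones(n), pi_ratio, group])
coef, *_ = np.linalg.lstsq(X, deny, rcond=None)
adjusted_gap = coef[2]

print(f"raw difference in denial rates:     {raw_gap:.3f}")
print(f"difference holding pi_ratio fixed:  {adjusted_gap:.3f}")
```

In this simulation the raw gap mixes the assumed group effect with the effect of the groups' different payment-to-income ratios, while the regression coefficient isolates the part that remains once that ratio is held constant.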
How do we deal with a binary dependent variable?
• This sounds like a job for multiple regression analysis—and it is, but
with a twist.
• The twist is that the dependent variable—whether or not the
applicant is denied—is binary.
• Using binary variables as regressors does not cause particular problems.
• But when the dependent variable is binary, things are more difficult:
what does it mean to fit a line to a dependent variable that can
take on only two values, zero and one?
• The answer to this question is to interpret the regression function as
a predicted probability.
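One way to see why this interpretation works is to write out the conditional expectation of a zero/one variable. The symbols Y (the binary dependent variable), X (the regressor) and the coefficients β0 and β1 are notation assumed here for illustration; they do not appear on the slides.

```latex
\begin{align*}
E(Y \mid X) &= 0 \cdot \Pr(Y = 0 \mid X) + 1 \cdot \Pr(Y = 1 \mid X) = \Pr(Y = 1 \mid X), \\
\Pr(Y = 1 \mid X) &= \beta_0 + \beta_1 X \quad \text{(linear regression model)}.
\end{align*}
```

Because the population regression function is the conditional expectation of Y given X, the fitted line for a binary Y can be read as an estimate of the probability that Y equals one.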
Binary Dependent Variables and the Linear Regression Model
Data description
Relevant information
Scatterplot of Mortgage Application Denial and the Payment-to-Income Ratio
• The scatterplot on its own does not show a clear pattern, because
deny takes only the values zero and one. However, the superimposed
line does.
• The line plots the predicted value of deny as a function of the
regressor, the payment-to-income ratio, using a linear regression
model.
• For example, when P/I ratio = 0.3, the predicted value of deny is
0.2. But what, precisely, does it mean for the predicted value of the
binary variable deny to be 0.2?
• The key to answering this question is to interpret the regression as
modelling the probability that the dependent variable equals one.
• Thus, the predicted value of 0.2 is interpreted as meaning that,
when P/I ratio is 0.3, the probability of denial is estimated to be
20%. Said differently, if there were many applications with P/I ratio
= 0.3, then 20% of them would be denied.
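The reading of a fitted value as a probability can be checked on simulated data. This is a minimal sketch under stated assumptions: the data are synthetic (not the Boston HMDA data), the variable names `pi_ratio` and `deny` are chosen to mirror the slides, and the assumed denial probability is picked by hand so that it equals 0.2 at a ratio of 0.3.

```python
import numpy as np

# Synthetic data only: these numbers are NOT the Boston HMDA data.
rng = np.random.default_rng(0)
n = 2000
pi_ratio = rng.uniform(0.1, 0.8, size=n)   # hypothetical payment-to-income ratios
true_prob = -0.1 + 1.0 * pi_ratio          # assumed denial probability (0.2 at ratio 0.3)
deny = (rng.uniform(size=n) < true_prob).astype(float)  # binary outcome: 1 = denied

# Linear probability model: OLS of deny on a constant and pi_ratio.
X = np.column_stack([np.ones(n), pi_ratio])
(intercept, slope), *_ = np.linalg.lstsq(X, deny, rcond=None)

# The fitted value at P/I ratio = 0.3 is read as an estimated denial probability.
p_hat = intercept + slope * 0.3

# Compare with the share of denials among applications whose ratio is close to 0.3.
near = np.abs(pi_ratio - 0.3) < 0.05
share_denied = deny[near].mean()

print(f"predicted P(deny = 1 | P/I = 0.3): {p_hat:.3f}")
print(f"share denied for P/I near 0.3:     {share_denied:.3f}")
```

The two printed numbers should be close to each other, which is the sense in which the fitted value describes the fraction of similar applications that are denied.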
Application to the Boston HMDA data
The effect of race on the probability of denial
• The linearity that makes the linear regression model easy to use is
also its major flaw.
• Looking again at the figure, we see that the estimated line
representing the predicted probabilities drops below zero for very
low values of the P/I ratio and exceeds one for high values!
• But this is nonsense: a probability cannot be less than zero or
greater than one.
• This nonsensical feature is an inevitable consequence of the linear
regression model.
• To address this problem, we use nonlinear models specifically
designed for binary dependent variables, the probit and logit
regression models.
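The contrast between the linear model and the two nonlinear models can be sketched with a few lines of arithmetic. The coefficients `B0` and `B1` below are hypothetical values chosen for illustration, not estimates from the HMDA data, and the same index is reused in all three functions only to show how each maps it to a predicted value.

```python
import math

# Hypothetical coefficients chosen for illustration; not estimates from the HMDA data.
B0, B1 = -0.1, 0.6

def linear_prob(pi_ratio):
    """Linear probability model: the linear index itself is the predicted probability."""
    return B0 + B1 * pi_ratio

def logit_prob(pi_ratio):
    """Logit: pass the linear index through the logistic CDF."""
    z = B0 + B1 * pi_ratio
    return 1.0 / (1.0 + math.exp(-z))

def probit_prob(pi_ratio):
    """Probit: pass the linear index through the standard normal CDF."""
    z = B0 + B1 * pi_ratio
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Very low and very high ratios push the linear prediction outside [0, 1];
# the logit and probit predictions stay strictly between 0 and 1.
for x in (0.05, 0.3, 2.0):
    print(x, round(linear_prob(x), 3), round(logit_prob(x), 3), round(probit_prob(x), 3))
```

In practice the logit and probit coefficients are estimated by maximum likelihood and will differ from the linear ones; the point of the sketch is only that a cumulative distribution function keeps every prediction inside the unit interval.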
Associated files
• Data sets:
• “HDMA.dta”
• Do files:
• “STATA200303.do”