hw3 Spring2024 Solution
Homework #3
Do not copy and paste the answers from your classmates. Two identical homework
submissions will be treated as cheating. Do not copy and paste the entire output of your
statistical package. Report only the relevant part of the output. Please also submit your R-script
for the empirical part. Please put all your work in one single file and upload via Moodle.
a. makes little sense, because variables in the real world are related linearly.
c. is a concept that only applies to the case of a single or two explanatory variables
Answer: d
a. can only be applied when there are two binary variables, but not three or more.
d. allows the effect of changing one of the binary independent variables to depend
Answer: d
Yi = β0 + β1 ln(Xi) + ui is as follows:
Answer: b
a continuous variable and D is a binary variable, to test that the two regressions are
Answer: d
1.2 To test whether or not the population regression function is linear rather than
a polynomial of order r,
a. check whether the regression for the polynomial regression is higher than that of
c. look at the pattern of the coefficients: if they change from positive to negative
Answer: d
a. the actuals can only be 0 and 1, but the predicted are almost always different
from that.
Answer: d
1.4 The following tools from multiple regression analysis carry over in a meaningful
a. F-statistic.
d. regression R2 .
Answer: d
1.6 In the binary dependent variable model, a predicted value of 0.6 means that
a. the most likely value the dependent variable will take on is 60 percent.
b. given the values for the explanatory variables, there is a 60 percent probability
c. the model makes little sense, since the dependent variable can only be 0 or 1.
d. given the values for the explanatory variables, there is a 40 percent probability
Answer: b
function.
Answer: a
1.8 The following problems could be analyzed using probit and logit estimation with
Answer: b
Part 2 Short Questions (29 points in total)
Note: for each sub-question, the answer should not be longer than 7 lines.
(Non-graded exercise) Dr. Qin would like to analyze the Return to Education and
the Gender Gap. The equation below shows the regression result using the 2005 Current
Population Survey. LnEarnings refers to the logarithm of monthly earnings;
educ refers to years of education; DFemme is a dummy variable equal to 1 if the individual
is female; exper is work experience, measured in years; Midwest, South
and West are dummy variables indicating the region of residence, while Northeast is the
omitted region. Interpret the major results (discuss the estimates for all variables and
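$$
\begin{aligned}
\widehat{\mathit{LnEarnings}} ={}& \underset{(0.018)}{1.215} + \underset{(0.0011)}{0.0899} \times \mathit{educ} - \underset{(0.022)}{0.521} \times \mathit{DFemme} + \underset{(0.0016)}{0.0180} \times (\mathit{DFemme} \times \mathit{educ}) \\
& + \underset{(0.0008)}{0.0232} \times \mathit{exper} - \underset{(0.000018)}{0.000368} \times \mathit{exper}^2 - \underset{(0.006)}{0.058} \times \mathit{Midwest} - \underset{(0.006)}{0.0078} \times \mathit{South} - \underset{(0.006)}{0.030} \times \mathit{West}
\end{aligned}
$$
$$
n = 57{,}863, \qquad \bar{R}^2 = 0.242
$$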
Answer: The return to education for males is approximately 9% for each additional
year of education, and the estimate is statistically significant at the 1% level. For females, the
return to education is slightly higher, approximately 11% (0.0899 + 0.018). Since the
binary variable for females is interacted with the number of years of education, the gender
gap depends on the number of years of education. For the typical high school graduate
(12 years of education), the gender gap is about 30.5% (−0.521 + 0.018 × 12 = −0.305),
while for the typical college graduate (16 years of education) the gender gap narrows to
about 23.3% (−0.521 + 0.018 × 16 = −0.233). Earnings increase with experience,
which is to be expected given the shape of age-earnings profiles and the fact that potential
experience depends on the age of the individual. There is a declining marginal value
for each year of potential experience until it eventually becomes negative. Northeast is
the omitted region, and all other regions have lower (log) earnings, ranging from 0.8%
in the South to 5.8% in the Midwest. All coefficients are statistically significant except
the one on South (t = 0.0078/0.006 = 1.3).
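
The arithmetic behind this interpretation can be checked with a short Python sketch. It only re-uses the coefficients printed in the regression above; the function names are illustrative, and treating a high school graduate as having 12 years of education is an assumption (the text only states 16 years for a college graduate).

```python
# Coefficients copied from the reported log-earnings regression.
B_EDUC = 0.0899          # educ
B_FEMALE = -0.521        # DFemme
B_FEMALE_EDUC = 0.0180   # DFemme x educ
B_EXPER = 0.0232         # exper
B_EXPER_SQ = -0.000368   # exper squared


def return_to_education(female: bool) -> float:
    """Approximate proportional change in earnings per extra year of education."""
    return B_EDUC + (B_FEMALE_EDUC if female else 0.0)


def gender_gap(educ_years: float) -> float:
    """Female-minus-male difference in log earnings at a given education level."""
    return B_FEMALE + B_FEMALE_EDUC * educ_years


def experience_peak() -> float:
    """Years of experience at which the quadratic experience profile turns down."""
    return -B_EXPER / (2.0 * B_EXPER_SQ)


if __name__ == "__main__":
    print(f"return to education, males:   {return_to_education(False):.4f}")
    print(f"return to education, females: {return_to_education(True):.4f}")
    # 12 years for high school is an assumption; 16 years is given in the text.
    print(f"gender gap at 12 years: {gender_gap(12):.3f}")
    print(f"gender gap at 16 years: {gender_gap(16):.3f}")
    print(f"experience profile peaks after about {experience_peak():.1f} years")
```

The last function shows where the marginal value of experience becomes negative: the derivative 0.0232 − 2 × 0.000368 × exper is zero at roughly 31.5 years.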
(15 points) 2.1 Sports economics typically looks at winning percentages of sports
teams as one of various outputs, and estimates production functions by analyzing the
relationship between the winning percentage and inputs. In Major League Baseball
(MLB), the determinants of winning are quality pitching and batting. The sample covers
all 30 MLB teams for the 1999 season. Pitching quality is approximated by Team Earned Run
Average (teamera), and hitting quality by On Base Plus Slugging Percentage (ops).
(a) (5 points) Interpret the regression. Are the results statistically significant and
important?
Answer: Lowering the team ERA by one results in a winning percentage increase of
roughly ten percent. Increasing the OPS by 0.1 generates a higher winning percentage
winning percentages. Both slope coefficients are statistically significant, and given the
(b) (8 points) There are two leagues in MLB, the American League (AL) and the
National League (NL). One major difference is that the pitcher in the AL does not
have to bat. Instead there is a designated hitter in the hitting line-up. You are
concerned that, as a result, there is a different effect of pitching and hitting in the AL
from the NL. To test this hypothesis, you allow the AL regression to have a different
intercept and different slopes from the NL regression. You therefore create a binary
variable for the American League (DAL) and estimate the following specification:
How should you interpret the winning percentage for AL and NL? Can you tell the
different effect of pitching and hitting between AL and NL? If so, how much?
Answer: For AL, lowering the team ERA by one results in a winning percentage
increase of 9.2 percent (-0.1 + 0.008 = -0.092), while the number for NL is 10 percent. Increasing
the OPS by 0.1 will increase the winning percentage by 14.35 percent (0.1622 - 0.0187 = 0.1435)
for AL but about 16 percent for NL. However, the coefficient estimates of both interaction
terms are not statistically significant. It is difficult to conclude that there is a different
effect of pitching and hitting between the AL and the NL.
(2 points) (c) You remember that sequentially testing the significance of slope coefficients
is not the same as testing for their significance simultaneously. Hence you ask
your regression package to calculate the F-statistic that all three coefficients involving
the binary variable for the AL are zero. Your regression package gives a value of 0.35.
Looking at the critical value from the F-table, can you reject the null hypothesis at the
1% level?
Answer: The critical value of the F-statistic is 3.78 at the 1% level, and hence you
cannot reject the null hypothesis that all three coefficients are zero. However, the
sample size is too small (30 is much smaller than 100) and thus the F-statistic is not
2.2 A study analyzed the probability of Major League Baseball (MLB) players to
survive for another season, or, in other words, to play one more season. The re-
searchers had a sample of 4,728 hitters and 3,803 pitchers for the years 1901-1999. All
explanatory variables are standardized. The probit estimation yielded the results as
where the limited dependent variable takes on a value of one if the player had one
is measured in years, performance is the batting average for hitters and the earned run
average for pitchers, and average performance refers to performance over the career.
(Note that all variables are standardized, so that the mean is zero and the variance is 1.)
(4 points) (a) Interpret the two probit equations and calculate survival probabilities
for hitters and pitchers at the sample mean. Why are these so high?
Answer: Note that all variables are standardized, so that the mean is zero. This
results in a survival probability of 0.978 (Φ(2.01) = 0.9778) for hitters and 0.948
(Φ(1.63) = 0.9484) for pitchers. These results are so high because there is a high
(4 points) (b) Calculate the change in the survival probability for a player who has
a very bad year by performing two standard deviations below the average (assume also
that this player has been in the majors for many years so that his average performance
is hardly affected). How does this change the survival probability when compared to
Answer: Since the variables are standardized, this implies a change of two for the
performance variable. The result for hitters is a lowering of the survival probability to
0.66 (Φ(2.01 − 2 × 0.794) = Φ(0.42) = 0.6628), and for pitchers to 0.61
(Φ(1.625 − 2 × 0.677) = Φ(0.27) = 0.6064).
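
The probabilities in (a) and (b) can be reproduced with the standard normal CDF from Python's standard library. This is a minimal sketch: the probit equations themselves are not shown in this text, so only the intercepts and coefficient magnitudes quoted in the answers are used, and the function names are illustrative.

```python
import math


def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def survival_at_mean(intercept: float) -> float:
    """With standardized regressors, the probit index at the sample mean is the intercept."""
    return phi(intercept)


def survival_after_bad_year(intercept: float, coef_magnitude: float,
                            sd_drop: float = 2.0) -> float:
    """Survival probability when the index falls by sd_drop * coef_magnitude."""
    return phi(intercept - sd_drop * coef_magnitude)


if __name__ == "__main__":
    # Intercepts and coefficient magnitudes as quoted in the answers above.
    for label, intercept, coef in (("hitters", 2.01, 0.794),
                                   ("pitchers", 1.625, 0.677)):
        p_mean = survival_at_mean(intercept)
        p_bad = survival_after_bad_year(intercept, coef)
        print(f"{label}: at mean {p_mean:.4f}, after bad year {p_bad:.4f}, "
              f"change {p_bad - p_mean:+.4f}")
```

The answers round the index to two decimals before looking up Φ, so the printed values can differ from the quoted ones in the third decimal.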
(6 points) (c) Since the results seem similar, the researcher could consider combining
the two samples. Explain in some detail how this could be done and how you could
Answer: After combining the sample for hitters and pitchers, you would allow for
a different intercept and slopes by introducing a binary variable for pitchers if hitters
are the default. This binary variable would be introduced by itself and in combination
with each of the above variables, thereby allowing all coefficients to differ. You could
then conduct an F-test for the joint hypothesis that all coefficients involving the binary
variables are zero. If the hypothesis cannot be rejected, then there is no difference
Note: for each sub-question, the answer should not be longer than 10 lines.
(32 points) 3.1 Use the data set CollegeDistance.dta and read the description file
(3 points) (a) Run a regression of ed on dist, female, black, hispanic, dadcoll, momcoll,
tuition and report your result. Interpret the coefficient for tuition. Does it make
sense?
(3 points) (b) Run a regression of ln(ed) on dist, female, black, hispanic, dadcoll,
momcoll, ln(tuition) and report your result. Interpret the coefficient for tuition. Does it
make sense? (Note: ln(ed) is the (natural) logarithm of ed, and ln(tuition) is the (natural)
logarithm of tuition.)
(6 points) (c) Suppose we are interested in the causal effect of tuition on years of education
completed. Considering the available variables in the data, what are the variables that
might cause the omitted variables bias? Justify your answer by both economic logic
(4 points) (d) After adding the possible omitted variables (in (c)), what does the
(6 points) (e) Now we are interested in the effect of dist and parents' education on
years of education completed. Generate a dummy variable for those whose fathers are
not college graduates (named dadnoncoll) and a dummy variable for those whose
dist, female, black, hispanic, dadnoncoll, momnoncoll, tuition and report your result.
Interpret the coefficients for dist, dadnoncoll and momnoncoll.
mom's education background? Use regression(s) and test(s) to justify your discussion.
(6 points) (g) Now we are interested in the effect of ethnic group on years of
education completed. Based on the regression in (a), how should this effect be interpreted? Does this
effect depend on parents' education background? If so, how? Justify your answer by
regression(s)/test(s).
Table 1
              Model 1      Model 2      Model 3      Model 4      Model 5      Model 6
(Intercept)   13.530***    2.608***     2.585***     15.189***    13.238***    13.518***
              (0.117)      (0.004)      (0.006)      (0.136)      (0.133)      (0.119)
dist          -0.047***    -0.003***    -0.003***    -0.047***    -0.049***    -0.047***
              (0.013)      (0.001)      (0.001)      (0.013)      (0.015)      (0.013)
female        0.043        0.003        0.005        0.043        0.066        0.045
              (0.056)      (0.004)      (0.004)      (0.056)      (0.056)      (0.056)
black         -0.371***    -0.025***    -0.019***    -0.371***    -0.288***    -0.356***
              (0.069)      (0.005)      (0.005)      (0.069)      (0.070)      (0.075)
hispanic      -0.012       0.001        0.006        -0.012       0.058        0.013
              (0.085)      (0.006)      (0.006)      (0.085)      (0.086)      (0.094)
dadcoll       0.992***     0.071***     0.061***                  0.759***     1.018***
              (0.080)      (0.006)      (0.006)                   (0.100)      (0.094)
momcoll       0.667***     0.047***     0.042***                  0.667***     0.670***
              (0.091)      (0.006)      (0.006)                   (0.118)      (0.109)
tuition       0.149                                  0.149        0.124        0.154
              (0.103)                                (0.103)      (0.103)      (0.103)
lntuition                  0.013**      0.012**
                           (0.006)      (0.006)
incomehi                                0.029***                  0.407***
                                        (0.005)                   (0.069)
ownhome                                 0.018***                  0.248***
                                        (0.005)                   (0.072)
dadnoncoll                                           -0.992***
                                                     (0.080)
momnoncoll                                           -0.667***
                                                     (0.091)
ddedu                                                             0.069*
                                                                  (0.036)
dmedu                                                             -0.051
                                                                  (0.052)
bdedu                                                                          -0.147
                                                                               (0.248)
bmedu                                                                          0.047
                                                                               (0.241)
hdedu                                                                          -0.070
                                                                               (0.237)
hmedu                                                                          -0.134
                                                                               (0.313)
N             3796         3796         3796         3796         3796         3796
F statistic                                                       1.9949       0.1854
Pr(>F)                                                            0.1362       0.9461
R2            0.108        0.109        0.121        0.108        0.121        0.109
Standard errors (in parentheses) are heteroskedasticity robust. *** p < 0.01; ** p < 0.05; * p < 0.1.
(a) The result is reported in column (1). The coefficient of tuition is 0.149, suggesting
that, holding other variables constant, when the average state 4yr college tuition is
$1000 higher, the years of education completed is 0.149 year higher. It makes sense in
reality, since high average tuition usually indicates a large number of excellent universities,
so people prefer more years of education. However, the estimate is not statistically
significant.
(b) The result is reported in column (2). The coefficient of lntuition is 0.013. Holding
other variables constant, when the average state 4yr college tuition increases by 1%,
the years of education completed increase by about 0.013%. The estimate is statistically
significant at the 5% level.
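
Because both ed and tuition enter in logs in column (2), the coefficient 0.013 reads as an elasticity. The sketch below is illustrative only (the function name is hypothetical); it compares the usual linear approximation with the exact percentage change implied by the log-log form.

```python
B_LNTUITION = 0.013  # coefficient on ln(tuition) in column (2) of Table 1


def pct_change_in_ed(pct_change_in_tuition: float, exact: bool = False) -> float:
    """Percentage change in years of education for a given percentage change in tuition."""
    if exact:
        # Exact change implied by ln(ed) = ... + b * ln(tuition).
        return ((1.0 + pct_change_in_tuition / 100.0) ** B_LNTUITION - 1.0) * 100.0
    # Elasticity approximation: %change in ed is roughly b * %change in tuition.
    return B_LNTUITION * pct_change_in_tuition


if __name__ == "__main__":
    for pct in (1.0, 10.0, 50.0):
        print(f"tuition +{pct:.0f}%: approx {pct_change_in_ed(pct):.4f}%, "
              f"exact {pct_change_in_ed(pct, exact=True):.4f}%")
```

For small changes the two versions agree closely; the approximation drifts only for large percentage changes in tuition.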
(c) Variables indicating the income level might cause omitted variable bias. Two
variables are available to represent the income level, incomehi and ownhome. Adding
the two variables into the regression, we re-estimate it and report the result in column
(3). The coefficients of both variables are positive and statistically significant, while
the coefficient of lntuition becomes smaller and the significance level is also lower. It
suggests that, without the two income related variables, there is a positive bias. The
economic logic is that tuition fees in rich regions are higher while households in rich
(d) In column (3), the coefficient 0.012 means that when the average state 4yr
college tuition increases by 1%, the years of education completed will increase by
about 0.012%.
(e) The result is reported in column (4). The coefficient of dist is -0.047, which
suggests that if the individual lives 10 miles closer to a 4yr college, his/her years of
education completed will be 0.047 year higher. For those whose fathers are not college
graduates, their years of education completed is 0.99 year lower; for those whose mothers
are not college graduates, their years of education completed is 0.67 year lower. All
three estimates are statistically significant at the 1% level.
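
Column (4) is a re-parameterization of column (1): with dadnoncoll = 1 − dadcoll and momnoncoll = 1 − momcoll, the two slopes flip sign and the intercept absorbs the difference. The following sketch checks this with the reported estimates; the dictionary and function names are illustrative, and the other regressors are omitted because their coefficients are identical in the two columns.

```python
from itertools import product

# Estimates copied from columns (1) and (4) of Table 1.
MODEL1 = {"intercept": 13.530, "dadcoll": 0.992, "momcoll": 0.667}
MODEL4 = {"intercept": 15.189, "dadnoncoll": -0.992, "momnoncoll": -0.667}


def baseline_model1(dadcoll: int, momcoll: int) -> float:
    """Intercept plus parental-education terms in the column (1) specification."""
    return (MODEL1["intercept"]
            + MODEL1["dadcoll"] * dadcoll
            + MODEL1["momcoll"] * momcoll)


def baseline_model4(dadcoll: int, momcoll: int) -> float:
    """Same quantity in the column (4) specification, using the recoded dummies."""
    dadnoncoll = 1 - dadcoll
    momnoncoll = 1 - momcoll
    return (MODEL4["intercept"]
            + MODEL4["dadnoncoll"] * dadnoncoll
            + MODEL4["momnoncoll"] * momnoncoll)


if __name__ == "__main__":
    for dad, mom in product((0, 1), repeat=2):
        print(f"dadcoll={dad}, momcoll={mom}: "
              f"model 1 -> {baseline_model1(dad, mom):.3f}, "
              f"model 4 -> {baseline_model4(dad, mom):.3f}")
```

Both specifications give the same value for every combination of parental education, which is why the R2 and the remaining coefficients in the two columns coincide.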
(f) In column (5), I add the interaction terms between dist and dadcoll (ddedu) and
between dist and momcoll (dmedu) to the regression model. (It is totally fine for you
to choose any model as the baseline model to add these interaction terms.) As the result
shows, only the interaction term between dist and dadcoll is statistically significant at
10%, which suggests the impact of distance to a 4yr college might depend on the father's
education but not the mother's education. To investigate further, I conduct an F test of
whether the effect of distance to a 4yr college depends on either parent's education,
i.e., whether the coefficients of both interaction terms are jointly equal to zero. The p value is
0.1362, suggesting that I cannot reject the null hypothesis that the effect of dist does
(g) By the result of (a), the coefficient of black is -0.371, which means that given
other factors the same, if the individual is black, the years of education completed is
0.371 year less. To test whether the effect of ethnic group depends on parents' education,
I add four interaction terms into the regression in (a) (you can also use regression from
hmedu (hispanic × momcoll). Estimates of all four coefficients are not statistically
significant. Then, I conduct a joint hypothesis test of whether the four coefficients
are jointly equal to zero. The F statistic is 0.185 and the p value is 0.946 (as reported in the
bottom panel of Table 1), which suggests that we cannot reject the hypothesis that the
3.2 We try to study health insurance, health status, age, and employment using
a random sample of more than 8000 workers in the United States surveyed in 1996.
Please download the data set insurance.dta from Moodle to finish the question. Here
For the following questions, please use observations from those who report their
                  Model 1       Model 2       Model 3       Model 4
(Intercept)       0.3391***     -0.5272***    -0.3157***    0.4527***
                  (0.0577)      (0.1968)      (0.1125)      (0.0299)
selfemp1          -0.1795***    -1.2452***    -0.7091***    -0.2822***
                  (0.0144)      (0.0858)      (0.0492)      (0.0319)
age               0.0097***     0.0260***     0.0151***     0.0036***
                  (0.0028)      (0.0032)      (0.0018)      (0.0004)
age2              -0.0001**
                  (0.0000)
familysz          -0.0183***    -0.1041***    -0.0595***    -0.0183***
                  (0.0033)      (0.0208)      (0.0121)      (0.0033)
male              -0.0399***    -0.3069***    -0.1646***    -0.0395***
                  (0.0082)      (0.0634)      (0.0355)      (0.0082)
married           0.1441***     1.0212***     0.5754***     0.1348***
                  (0.0104)      (0.0731)      (0.0409)      (0.0104)
deg_ged           0.1470***     0.6801***     0.4106***     0.1485***
                  (0.0288)      (0.1488)      (0.0877)      (0.0287)
deg_hs            0.2444***     1.2873***     0.7625***     0.2461***
                  (0.0169)      (0.0835)      (0.0493)      (0.0168)
deg_ba            0.3072***     1.8568***     1.0765***     0.3123***
                  (0.0178)      (0.1126)      (0.0631)      (0.0177)
deg_ma            0.3256***     2.2679***     1.2812***     0.3282***
                  (0.0195)      (0.2030)      (0.1029)      (0.0194)
deg_phd           0.3548***     2.5232***     1.4108***     0.3553***
                  (0.0270)      (0.3858)      (0.1929)      (0.0272)
deg_oth           0.2819***     1.5989***     0.9232***     0.2853***
                  (0.0207)      (0.1456)      (0.0810)      (0.0206)
race_wht1         0.0306**      0.2192**
                  (0.0137)      (0.0919)
race_ot1          -0.0248       -0.1751
                  (0.0271)      (0.1784)
reg_ne            -0.0130       -0.1226       -0.0641       -0.0114
                  (0.0116)      (0.1020)      (0.0562)      (0.0116)
reg_so            -0.0449***    -0.3857***    -0.2093***    -0.0443***
                  (0.0105)      (0.0879)      (0.0486)      (0.0105)
reg_we            -0.0556***    -0.4365***    -0.2396***    -0.0541***
                  (0.0120)      (0.0932)      (0.0520)      (0.0120)
race_wht                                      0.1287**      0.0292**
                                              (0.0526)      (0.0137)
race_ot                                       -0.1077       -0.0256
                                              (0.1021)      (0.0271)
selfemp1:married                                            0.1384***
                                                            (0.0354)
N                 8173          8173          8173          8173
R2                0.1484                                    0.1503
AIC               6727.8681     6868.5232     6865.1613     6709.6267
BIC               6861.0314     6987.6692     6984.3073     6842.7899
Pseudo R2                       0.2164        0.2170
Standard errors (in parentheses) are heteroskedasticity robust. *** p < 0.01; ** p < 0.05; * p < 0.1.
(4 points) (a) Estimate a linear probability model with insured as the depen-
dent variable and the following regressors: selfemp age age2 deg_ged deg_hs deg_ba
deg_ma deg_phd deg_oth race_wh race_ot reg_ne reg_so reg_we male married.
How does health insurance status vary with age? Is there a nonlinear relationship
Answer: The coefficient on the linear term age is positive while the coefficient on the
quadratic term age2 is negative. The probability of being insured is higher as age
increases, and the effect of a change in age on health insurance status is declining with
age. The effect is greater for young people than for old people. There is a nonlinear
relationship between the probability of being insured and age since the coefficient on
(4 points) (b) Now please drop the variable age2 and estimate a logit model
using the remaining regressors. How does health insurance status vary with age in this model?
Are the self-employed less likely to have health insurance than wage earners? How does
a white individual differ from a black individual in terms of having insurance? (Note:
From the marginal effect calculation, if the individual is one year older, the probability
for this individual to have health insurance is higher by 0.3%. Compared with
wage earners, given other factors the same, self-employed individuals are 19.8% less
likely to have health insurance. Compared with a black individual and given other factors
the same, if the individual is white, the probability for him/her to have the health
(4 points) (c) Estimate a probit model using the same regressors as in (b). In
terms of having health insurance, how do white individuals aged 25 behave
differently when they are self-employed? How about white individuals aged
Answer: The regression outcome is reported in the third column of the regression
table. The following tables present the marginal effects based on the probit model.
For white individuals aged 25, the probability of having health
insurance is 22% lower if they are self-employed, compared with wage earners. For
white individuals aged 35, the probability of having health insurance is 20% lower
if they are self-employed, compared with wage earners. For married
individuals, the probability of having health insurance is 17.3% lower if they are
self-employed, compared with wage earners. The number is 23.7% lower for unmarried
insurance between married and unmarried individuals.
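
In a probit model the effect of self-employment is a difference between two values of Φ, so it depends on where the other regressors are held. The sketch below illustrates the mechanics with the Model 3 coefficients from the table. It is not a reproduction of the marginal-effect tables: the remaining regressors (family size, gender, degree, region) are collapsed into a hypothetical `other_index` argument that defaults to zero, so the printed numbers will differ from the 22%, 20%, 17.3% and 23.7% reported above.

```python
import math

# Probit coefficients copied from Model 3 of the insurance table.
PROBIT = {
    "intercept": -0.3157,
    "selfemp": -0.7091,
    "age": 0.0151,
    "race_wht": 0.1287,
    "married": 0.5754,
}


def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def insured_probability(age: float, selfemp: int, white: int, married: int,
                        other_index: float = 0.0) -> float:
    """Predicted probability of being insured.

    other_index is a hypothetical stand-in for the contribution of all
    regressors not modelled explicitly here.
    """
    index = (PROBIT["intercept"]
             + PROBIT["selfemp"] * selfemp
             + PROBIT["age"] * age
             + PROBIT["race_wht"] * white
             + PROBIT["married"] * married
             + other_index)
    return phi(index)


def selfemp_effect(age: float, white: int = 1, married: int = 0,
                   other_index: float = 0.0) -> float:
    """Discrete change in the insured probability from switching selfemp 0 -> 1."""
    return (insured_probability(age, 1, white, married, other_index)
            - insured_probability(age, 0, white, married, other_index))


if __name__ == "__main__":
    for age in (25, 35):
        print(f"white, unmarried, age {age}: {selfemp_effect(age):+.3f}")
    for married in (0, 1):
        print(f"white, age 25, married={married}: "
              f"{selfemp_effect(25, married=married):+.3f}")
```

Changing `other_index` (or any covariate) shifts the point on the normal curve at which the difference is taken, which is why the answer reports separate effects by age and by marital status.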
(3 points) (d) Use a linear probability model to answer the question: Is the effect of
self-employment on insurance different for married workers than for unmarried workers
Answer: The result is reported in the fourth column of the regression table. The
estimate for the interaction term between self-employment and marital status is
0.1384 (standard error 0.0354), which is statistically significant at the 1% level.
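
The linear probability model in the fourth column makes the comparison in (d) a matter of adding coefficients. The sketch below uses only the estimates printed in the table; the function names are illustrative, and the t-statistic is the simple ratio of the estimate to its robust standard error.

```python
# Estimates copied from Model 4 of the insurance table.
B_SELFEMP = -0.2822        # selfemp1
B_INTERACTION = 0.1384     # selfemp1:married
SE_INTERACTION = 0.0354    # robust standard error of the interaction term


def lpm_selfemp_effect(married: bool) -> float:
    """Change in the insured probability from self-employment, by marital status."""
    return B_SELFEMP + (B_INTERACTION if married else 0.0)


def t_statistic(estimate: float, std_error: float) -> float:
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error


if __name__ == "__main__":
    print(f"unmarried: {lpm_selfemp_effect(False):+.4f}")
    print(f"married:   {lpm_selfemp_effect(True):+.4f}")
    print(f"t-statistic on the interaction: "
          f"{t_statistic(B_INTERACTION, SE_INTERACTION):.2f}")
```

A t-statistic above 2.58 in absolute value corresponds to significance at the 1% level, which matches the three stars on the interaction term in the table.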