Multiple Linear Regression - Practice Problems
1. Use the data in discrimination.csv to answer this question. These are zip code–level
data on food and housing prices along with characteristics of the zip code popula-
tion, in New Jersey and Pennsylvania, USA. The idea is to see whether the housing
prices are lower in areas with a larger concentration of Black residents.
a. Estimate the effect of the percentage of the zip code's population that is Black,
prpblck, on the median housing price, hseval.
R code is given below.
rm(list = ls())
library(stargazer)
# Choose discrimination.csv in the file dialog.
discrimdata <- read.csv(file.choose())
model1 <- lm(hseval ~ prpblck, data = discrimdata)
stargazer(model1, type = "text")
b. Interpret the coefficient on prpblck. Do you think it is economically large?
A one percentage point increase in the Black share of the population is associated
with a decrease in median housing prices of USD 983.365. This is not economically
large: relative to the average housing price of USD 142,300, it is a change of
about 0.7%.
c. Is prpblck statistically significant at the 5% level against a two-sided alternative?
The t-statistic is $t = \frac{-983.4}{144.4} \approx -6.81$. Since $|t|$ is greater than the
two-sided 5% critical value (about 1.96), we can reject the null and conclude that the
coefficient on prpblck is statistically significant at the 5% level (and also at the
1% level).
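The arithmetic in (c) can be checked with a few lines of code. The sketch below uses Python rather than R, purely as a calculator; the variable names are illustrative, and 1.96 is the large-sample two-sided 5% critical value, used on the assumption that the sample is large enough for the normal approximation.

```python
# Estimate and standard error on prpblck as reported in the solution to (c).
estimate = -983.4
std_error = 144.4

# t statistic for H0: coefficient = 0.
t = (estimate - 0) / std_error

# Assumed large-sample two-sided 5% critical value.
CRIT_5PCT_TWO_SIDED = 1.96

# A two-sided test compares the absolute value of t with the critical value.
reject = abs(t) > CRIT_5PCT_TWO_SIDED
print(f"t = {t:.2f}, reject H0 at the 5% level: {reject}")
```

The comparison uses |t|, which is why a large negative statistic still leads to rejection.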
d. Now add the number of managers employed at the fast food restaurants in the zip
code as a control. How does your result change from (a)? Is the number of managers
a relevant variable? Comment on the coefficient and standard error on prpblck.
model2 <- lm(hseval ~ prpblck + nmgrs, data = discrimdata)
stargazer(model1, model2, type = "text")
The coefficient on prpblck does not change by much, but its standard error increases.
The number of managers does not seem to be a relevant variable: its coefficient is
not statistically significant.
e. Start again with the model in (a). Now add median income (income) as a control. How
does your result change from (a)? What does it tell you about the causal effect of
prpblck on housing prices?
model3 <- lm(hseval ~ prpblck + income, data = discrimdata)
stargazer(model1, model2, model3, type = "text")
We see that the coefficient on prpblck changes sign and magnitude. The interpretation
of the coefficient on prpblck is now "a one percentage point increase in the Black share
of the population increases housing prices by USD 164, keeping income constant." In
other words, once income is controlled for, the estimated effect of an increase in the
Black population share on housing prices is positive. But this effect could be solely
driven by income, and hence income can be seen as a bad control; the estimate should
not be read as a causal effect.
f. Now control for prpnoblck. Compare your estimates with (d). What is the coefficient
on prpnoblck? Why?
model4 <- lm(hseval ~ prpblck + prpnoblck, data = discrimdata)
stargazer(model1, model3, model4, type = "text")
The coefficient on prpnoblck drops out (R reports NA) because of perfect collinearity
between prpblck and prpnoblck: the two shares always sum to the same constant.
2. A problem of interest to health officials (and others) is to determine the effects of
smoking during pregnancy on infant health. One measure of infant health is birth
weight; a birth weight that is too low can put an infant at risk for contracting var-
ious illnesses. Since factors other than cigarette smoking that affect birth weight
are likely to be correlated with smoking, we should take those factors into account.
For example, higher income generally results in access to better prenatal care, as
well as better nutrition for the mother. An equation that recognizes this is
bwght = β0 + β1 cigs + β2 faminc + u.
a. What is the most likely sign for β2 ?
β2 > 0 as more income typically means better nutrition for the mother and bet-
ter prenatal care.
b. Do you think cigs and faminc are likely to be correlated? Explain why the corre-
lation might be positive or negative.
Yes, it is very likely that they are correlated. The direction of the relationship
is ambiguous. On the one hand, an increase in income generally increases the
consumption of a good, and therefore cigs and faminc could be positively cor-
related. On the other, family incomes are also higher for families with more ed-
ucation, and more education and cigarette smoking tend to be negatively corre-
lated. In the sample data, the correlation is found to be -0.173.
c. Now, estimate the equation with and without faminc using the data in birth-
weight.csv. Report the results in equation form, including the sample size and
R-squared. Discuss your results, focusing on whether adding faminc substan-
tially changes the estimated effect of cigs on bwght.
rm(list = ls())
library(stargazer)
library(car)
# Choose birthweight.csv in the file dialog.
birthdata <- read.csv(file.choose())
model1 <- lm(bwght ~ cigs + faminc, data = birthdata)
model2 <- lm(bwght ~ cigs, data = birthdata)
stargazer(model1, model2, type = "text")
The effect of cigarette smoking is slightly smaller when faminc is added to the
regression, but the difference is not great. This is due to the fact that cigs and
faminc are not highly correlated, and the coefficient on faminc is practically
small. The variable faminc is measured in thousands of dollars, so USD 1,000 more in
1988 income increases predicted birth weight by only 0.093 units.
3. The following model is a simplified version of the multiple regression model used
by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping
and working and to look at other factors affecting sleep:
sleep = β0 + β1 ∗ totwrk + β2 ∗ educ + β3 ∗ age + u
where sleep and totwrk (total work) are measured in minutes per week and educ
and age are measured in years.
a. If adults trade off sleep for work, what is the sign of β1 ?
If adults trade off sleep for work, more work implies less sleep (other things
equal), so β1 < 0.
b. What signs do you think β2 and β3 will have?
The signs of β2 and β3 are not obvious. One could argue that more educated
people like to get more out of life, and so, other things equal, they sleep less
(β2 < 0). The relationship between sleeping and age is more complicated than
this model suggests. It can go either way i.e. β3 > 0 or β3 < 0.
c. The estimated equation is
\[ \widehat{\text{sleep}} = 3638.25 - 0.148\,\text{totwrk} - 11.13\,\text{educ} + 2.20\,\text{age} \]
\[ R^2 = 0.113, \quad n = 706 \]
If someone works five more hours per week, by how many minutes is sleep
predicted to fall? Is this a large trade-off?
Since totwrk is in minutes, we must convert five hours into minutes i.e. 300
minutes. Then sleep is predicted to fall by 0.148 ∗ 300 = 44.4 minutes. For a
week, 45 minutes less sleep is not an overwhelming change.
d. Discuss the sign and magnitude of the estimated coefficient on educ.
More education implies less predicted time sleeping, but the effect is quite small.
An additional year of education decreases sleep by 11.13 minutes, holding other
things constant.
e. Do totwrk, educ, and age explain much of the variation in sleep? What other fac-
tors might affect the time spent sleeping? Are these likely to be correlated with
totwrk?
Not surprisingly, the three explanatory variables explain only about 11.3% of
the variation in sleep. One important factor in the error term is general health.
Another is marital status, and whether the person has children. Health (how-
ever we measure that), marital status, and number and ages of children would
generally be correlated with totwrk. (For example, less healthy people would
tend to work less.)
4. In a study relating college grade point average to time spent in various activities,
you distribute a survey to several students. The students are asked how many hours
they spend each week in four activities: studying, sleeping, working, and leisure.
Any activity is put into one of the four categories, so that for each student, the sum
of hours in the four activities must be 168.
a. In the model
GPA = β0 + β1 ∗ study + β2 ∗ sleep + β3 ∗ work + β4 ∗ leisure + u
does it make sense to hold sleep, work, and leisure fixed, while changing study?
No. By definition, study + sleep + work + leisure = 168. Therefore, if we change
study, we must change at least one of the other categories so that the sum is still
168.
b. Explain why this model violates the assumption of no perfect collinearity.
We can write, say, study as a perfect linear function of the other independent
variables: study = 168 - sleep - work - leisure. This holds for every observation,
so the assumption of no perfect collinearity is violated.
c. How could you reformulate the model so that its parameters have a useful in-
terpretation and it satisfies the assumption of no perfect collinearity?
Simply drop one of the independent variables, say leisure:
GPA = β0 + β1 ∗ study + β2 ∗ sleep + β3 ∗ work + u
Now, for example, β1 is interpreted as the change in GPA when study increases
by one hour, where sleep, work, and u are all held fixed. If we are holding sleep
and work fixed but increasing study by one hour, then we must be reducing
leisure by one hour. The other slope parameters have a similar interpretation.
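The collinearity argument in (b) and the fix in (c) can be illustrated numerically. The following Python sketch is an illustration only: the six students' hours are made-up numbers, and matrix_rank is a hypothetical helper written here, not a library function. It builds the design matrix with and without leisure and computes each rank exactly.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination on fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Look for a row at or below position `rank` with a non-zero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical weekly hours (study, sleep, work) for six students.
hours = [(20, 56, 10), (35, 49, 0), (15, 60, 20),
         (40, 50, 5), (25, 63, 15), (30, 42, 12)]

# Columns: intercept, study, sleep, work, leisure (= 168 minus the others).
X_full = [[1, s, sl, w, 168 - s - sl - w] for s, sl, w in hours]
# The same design matrix with the leisure column dropped.
X_reduced = [row[:4] for row in X_full]

rank_full = matrix_rank(X_full)
rank_reduced = matrix_rank(X_reduced)
print(f"5 columns with leisure: rank {rank_full}")
print(f"4 columns without leisure: rank {rank_reduced}")
```

With leisure included, the matrix has five columns but fewer than five independent ones, which is exactly the perfect collinearity in (b); once leisure is dropped, the number of columns equals the rank.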
5. Suppose the estimated equation in (3) with and without controls is
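\[ \widehat{\text{sleep}} = \underset{(112.28)}{3638.25} - \underset{(0.017)}{0.148}\,\text{totwrk} - \underset{(5.88)}{11.13}\,\text{educ} + \underset{(1.45)}{2.20}\,\text{age} \]
\[ R^2 = 0.113, \quad n = 706 \]
\[ \widehat{\text{sleep}} = \underset{(112.28)}{3586.377} - \underset{(0.016)}{0.150}\,\text{totwrk} \]
\[ R^2 = 0.1033, \quad n = 706 \]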
where we now report standard errors along with the estimates.
a. Is either educ or age individually significant at the 5% level against a two-sided
alternative? Show your work.
To test the coefficient on educ, we calculate $t = \frac{-11.13 - 0}{5.88} = -1.89$. Here
we take df = ∞, since n − k − 1 = 702 is very large, so the two-sided 5% critical value
is 1.96. Since $|t_{cal}| = 1.89 < 1.96$, we do not reject the null hypothesis that
education has no effect on sleep. Similarly, for age, $t = \frac{2.20 - 0}{1.45} = 1.52$,
so $t_{cal} < t_{crit}$ and we do not reject that the coefficient on age is 0 at the
5% significance level. Therefore, neither educ nor age is individually significant at
the 5% level.
b. Does including educ and age in the model greatly affect the estimated tradeoff
between sleeping and working?
In Q(3), we concluded that sleep should fall by 0.148 · 300 = 44.4 minutes per
week if someone works 5 more hours per week, controlling for educ and age. From the
second equation, which omits educ and age, sleep should fall by 0.150 · 300 = 45
minutes per week for the same increase in work. The two predictions differ by less
than a minute, so including educ and age in the model does not greatly affect
the estimated trade-off between sleeping and working.
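The comparison in (b) is simple enough to verify directly. The sketch below uses Python as a calculator; the variable names are illustrative, and the two slopes are the coefficients on totwrk from the two estimated equations given at the start of this problem.

```python
MINUTES_PER_HOUR = 60
extra_work = 5 * MINUTES_PER_HOUR  # five more hours of work per week, in minutes

# Coefficients on totwrk from the two estimated equations.
slope_with_controls = -0.148     # equation that includes educ and age
slope_totwrk_only = -0.150       # equation with totwrk as the only regressor

# Predicted change in weekly sleep (minutes) under each equation.
change_with_controls = slope_with_controls * extra_work
change_totwrk_only = slope_totwrk_only * extra_work
gap = abs(change_with_controls - change_totwrk_only)

print(f"with educ and age: {change_with_controls:.1f} minutes")
print(f"totwrk only:       {change_totwrk_only:.1f} minutes")
print(f"difference:        {gap:.1f} minutes")
```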
6. Suppose we want to estimate the effects of alcohol consumption (alcohol) on college
grade point average (colGPA). In addition to collecting information on grade point
averages and alcohol usage, we also obtain attendance information (say, percentage
of lectures attended, called attend), a standardized test score (say, SAT) and high
school GPA (hsGPA).
a. Should we include attend along with alcohol as explanatory variables in a multi-
ple regression model? (Think about how you would interpret βalcohol .)
The answer is not entirely obvious, but one must properly interpret the coeffi-
cient on alcohol in either case. If we include attend, then we are measuring the
effect of alcohol consumption on college GPA, holding attendance fixed. Be-
cause attendance is likely to be an important mechanism through which drink-
ing affects performance, we probably do not want to hold it fixed in the analy-
sis. If we do include attend, then we interpret the estimate of βalcohol as measur-
ing those effects on colGPA that are not due to attending class. (For example,
we could be measuring the effects that drinking alcohol has on study time.) To
get a total effect of alcohol consumption, we would leave attend out.
b. Should entrance exam scores and high school grades be included as explana-
tory variables? Explain.
We would want to include entrance exam scores and high school grades as con-
trols, as these measure student abilities and motivation. Drinking behavior in
college could be correlated with one’s performance in high school and on entrance
tests. Other factors, such as family background, would also be good controls.
7. Are rent rates influenced by the student population in a college town? Let rent be
the average monthly rent paid on rental units in a college town in the United States.
Let pop denote the total city population, avginc the average city income, and pctstu
the student population as a percentage of the total population. One model to test for
a relationship is
log(rent) = β0 + β1 log(pop) + β2 log(avginc) + β3 pctstu + u
(a) What signs do you expect for β1 and β2 ?
β1 > 0 and β2 > 0, since an increase in population and an increase in average
income should both increase rents.
(b) The equation estimated using 1990 data from RENTAL for 64 college towns is
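\[ \widehat{\text{lrent}} = \underset{(0.844)}{0.043} + \underset{(0.039)}{0.066}\,\text{lpop} + \underset{(0.081)}{0.507}\,\text{lavginc} + \underset{(0.0017)}{0.0056}\,\text{pctstu} \]
\[ R^2 = 0.458, \quad n = 64 \]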
What is wrong with the statement: “A 10% increase in population is associated
with about a 6.6% increase in rent”?
The coefficient on log(pop) is an elasticity, so the correct statement is "A 10%
increase in population is associated with about a 0.066 ∗ 10 = 0.66% increase in rent."
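The elasticity reading of a log-log coefficient can be checked against the exact change implied by the model. The sketch below uses Python as a calculator, with illustrative variable names; it compares the usual approximation (elasticity times percentage change) with the exact percentage change in rent.

```python
elasticity = 0.066    # coefficient on log(pop) in the estimated equation
pct_change_pop = 10   # a 10% increase in population

# Usual approximation: %change in rent = elasticity * %change in pop.
approx_pct = elasticity * pct_change_pop

# Exact change implied by the log-log form: rent scales by 1.10 ** elasticity.
exact_pct = 100 * ((1 + pct_change_pop / 100) ** elasticity - 1)

print(f"approximate: {approx_pct:.2f}%  exact: {exact_pct:.2f}%")
```

Both figures are well under 1%, an order of magnitude below the 6.6% in the quoted statement.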
(c) State the null hypothesis that size of the student body relative to the population
has no ceteris paribus effect on monthly rents. State the alternative that there is
an effect. Test the hypothesis stated at the 1% level.
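H0 : β3 = 0 against HA : β3 ≠ 0.
$t = \frac{\hat\beta_3 - \beta_3}{se(\hat\beta_3)} = \frac{0.0056 - 0}{0.0017} = 3.29$. This is greater than
the two-sided 1% critical value (about 2.66 with 64 − 3 − 1 = 60 degrees of freedom),
and therefore we can reject the null hypothesis and conclude that the size of the
student body relative to the population of the town has a statistically significant
effect on rents.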
8. Consider the following equation:
log(price) = β0 + β1 log(assess) + β2 log(lotsize) + β3 log(sqrft) + β4 bdrms + u
where price is the house price, assess is the assessed housing value (before the house
was sold), lotsize is the size of the plot, in square feet, sqrft is the square footage
and bdrms is the number of bedrooms.
(a) What signs do you expect for β1 , β2 , β3 and β4 ?
β1 > 0, β2 > 0, β3 > 0 and β4 > 0.
(b) Using data in HPRICE1, test the hypothesis H0 : β1 = 1 against HA : β1 ≠ 1.
rm(list = ls())
library(wooldridge)
library(stargazer)  # needed for stargazer() below
data("hprice1")
force(hprice1)
model1 <- lm(lprice ~ lassess + llotsize + lsqrft + bdrms, data = hprice1)
stargazer(model1, type = "text")
The estimated equation is
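\[ \widehat{\text{lprice}} = \underset{(0.570)}{0.264} + \underset{(0.151)}{1.043}\,\text{lassess} + \underset{(0.039)}{0.007}\,\text{llotsize} - \underset{(0.138)}{0.103}\,\text{lsqrft} + \underset{(0.022)}{0.034}\,\text{bdrms} \]
\[ R^2 = 0.773, \quad n = 88 \]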
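$t = \frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} = \frac{1.043 - 1}{0.151} = 0.28$. Since $|t|$ is far below the
two-sided 5% critical value (about 1.99 with 88 − 4 − 1 = 83 degrees of freedom), we
do not have enough evidence to reject the null hypothesis.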
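Because the null value here is 1 rather than 0, the t statistic printed in a standard regression table cannot be used directly. The Python sketch below shows the adjustment; t_stat is a hypothetical helper, and 1.99 is an assumed approximate two-sided 5% critical value for 83 degrees of freedom.

```python
def t_stat(estimate, std_error, null_value):
    """t statistic for H0: coefficient = null_value."""
    return (estimate - null_value) / std_error


# Coefficient and standard error on lassess from the estimated equation.
t = t_stat(1.043, 0.151, null_value=1.0)

# Assumed approximate two-sided 5% critical value, df = 88 - 4 - 1 = 83.
CRIT_5PCT_TWO_SIDED = 1.99

reject = abs(t) > CRIT_5PCT_TWO_SIDED
print(f"t = {t:.2f}, reject H0: beta1 = 1 at the 5% level: {reject}")
```

Testing against 0 instead would give 1.043/0.151, a very different statistic that answers a different question.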
(c) Suppose we would like to test whether the assessed housing price is a ratio-
nal valuation. If this is the case, then a 1% change in assess should be associ-
ated with a 1% change in price; that is, β1 = 1. In addition, lotsize, sqrft, and
bdrms should not help to explain log(price), once the assessed value has been
controlled for. Test these hypotheses.
H0 : β1 = 1, β2 = 0, β3 = 0, β4 = 0 against HA : at least one of these restrictions
does not hold.
library(car)
linearHypothesis(model1,
  c("lassess = 1", "llotsize = 0", "lsqrft = 0", "bdrms = 0"))
Since the p-value of the F test is greater than 5%, we cannot reject the null and
therefore use the restricted model.
9. The variable rdintens is expenditures on research and development (R&D) as a per-
centage of sales. Sales are measured in millions of dollars. The variable profmarg is
profits as a percentage of sales. Using the data in RDCHEM for 32 firms in the chem-
ical industry, the following equation is estimated:
\[ \widehat{\text{rdintens}} = \underset{(1.369)}{0.472} + \underset{(0.216)}{0.321}\,\log(\text{sales}) + \underset{(0.046)}{0.050}\,\text{profmarg} \]
\[ R^2 = 0.099, \quad n = 32 \]
(a) Interpret the coefficient on log(sales). In particular, if sales increases by 10%,
what is the estimated percentage point change in rdintens? Is this an economi-
cally large effect?
Holding profmarg fixed, Δrdintens ≈ (0.321/100) ∗ %Δsales. Therefore, if sales
increases by 10%, rdintens increases by about 0.321 ∗ 10/100 ≈ 0.032 percentage
points. For such a large percentage increase in sales, this seems like a
practically small effect.
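The level-log approximation used in (a) can be compared with the exact change implied by the fitted equation. The sketch below uses Python as a calculator, with illustrative variable names; the exact change multiplies the coefficient by the change in log(sales).

```python
import math

beta_log_sales = 0.321   # coefficient on log(sales)
pct_change_sales = 10    # a 10% increase in sales

# Approximation: change in rdintens = (beta / 100) * %change in sales.
approx_change = beta_log_sales * pct_change_sales / 100

# Exact: change in rdintens = beta * change in log(sales) = beta * log(1.10).
exact_change = beta_log_sales * math.log(1 + pct_change_sales / 100)

print(f"approximate: {approx_change:.4f}  exact: {exact_change:.4f} percentage points")
```

The two values are close, so the conclusion that the effect is practically small does not depend on the approximation.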
(b) Test the hypothesis that R&D intensity does not change with sales against the
alternative that it does increase with sales. Do the test at the 5% and 10% levels.
H0 : β1 = 0 against HA : β1 > 0. $t = \frac{0.321 - 0}{0.216} = 1.486$. The 5% critical value for
a one-tailed test, with df = 32 − 3 = 29, is 1.699, and so we cannot reject the null
at the 5% level of significance. But the 10% critical value is 1.311, and so we can
reject the null at the 10% level of significance.
(c) Interpret the coefficient on profmarg. Is it economically large?
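A one percentage point increase in profmarg is associated with a 0.05 percentage point
increase in rdintens, holding sales fixed. Since profmarg would have to rise by 20
percentage points to raise rdintens by one percentage point, this is arguably not
economically large.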
(d) Does profmarg have a statistically significant effect on rdintens?
t = 0.050/0.046 = 1.09. This is less than the two-sided t-critical value at the 5%
level (2.045 with 29 degrees of freedom) and so we do not reject the null of no
effect. In other words, profmarg does not have a statistically significant effect
on rdintens.
10. The researcher is interested in estimating the output elasticity of labour and capital.
The estimation is done by fitting 2017 data for 171 countries on aggregate output,
capital and labor to the Cobb-Douglas production function $Y = A K^{\alpha} L^{\beta}$, where Y is
the aggregate output, A is the total factor productivity, K is capital and L is labour.
(a) What do the parameters α and β capture in the production function?
α is the output elasticity of capital and β is the output elasticity of labour.
(b) Estimate the production function using OLS. What is the output elasticity of
labour? Is it statistically significant?
\[ \widehat{\log(Y)} = \underset{(0.247)}{0.374} + \underset{(0.023)}{0.218}\,\log(L) + \underset{(0.021)}{0.810}\,\log(K) \]
\[ R^2 = 0.972, \quad n = 171 \]
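The output elasticity of labour is the coefficient on log(L): $\hat\beta = 0.218$, with
$t = \frac{0.218 - 0}{0.023} = 9.48$. Yes, it is statistically significant at the 5% and 1%
levels.
(c) What is the output elasticity of capital? Is it statistically significant?
The output elasticity of capital is the coefficient on log(K): $\hat\alpha = 0.810$, with
$t = \frac{0.810 - 0}{0.021} = 38.57$. Yes, it is statistically significant at the 5% and 1%
levels.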
(d) Is the global economy characterised by constant returns to scale? Justify using
an appropriate test. [Hint: Use the code vcov(name of regression model) to obtain
the variance-covariance matrix.]
rm(list = ls())
library(stargazer)
library(car)  # for linearHypothesis() in part (e)
penndata <- read.csv(file.choose())
# Take logs of employment, output and the capital stock.
penndata$logemp <- log(penndata$emp)
penndata$logcgdpo <- log(penndata$cgdpo)
penndata$logcn <- log(penndata$cn)
cobbdouglasmodel <- lm(logcgdpo ~ logemp + logcn, data = penndata)
stargazer(cobbdouglasmodel, type = "text")
vcov(cobbdouglasmodel)
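H0 : α + β = 1 against HA : α + β ≠ 1.
\[ t_{cal} = \frac{\hat\alpha + \hat\beta - 1}{se(\hat\alpha + \hat\beta)} = \frac{0.218 + 0.810 - 1}{\sqrt{0.023^2 + 0.021^2 + 2 \times (-0.00038)}} = \frac{0.028}{0.0145} \approx 1.93 \]
where $se(\hat\alpha + \hat\beta) = \sqrt{Var(\hat\alpha) + Var(\hat\beta) + 2\,Cov(\hat\alpha, \hat\beta)}$ and the covariance is read from vcov().
With these rounded inputs, $t_{cal}$ is just below the two-sided 5% critical value (about
1.97 with 168 degrees of freedom) but above the 10% critical value (about 1.65).
Rounding the standard error down to 0.014 gives $t_{cal} = 2$, which would just exceed the
5% critical value. The evidence against constant returns to scale is therefore clear
at the 10% level and borderline at the 5% level; the unrounded regression output
should be used for the final decision.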
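The standard error of a sum of two coefficients needs their covariance as well as their individual standard errors. The Python sketch below reproduces the calculation from the rounded figures reported in this problem; the variable names are illustrative, and the unrounded regression output may give a slightly different statistic.

```python
import math

# Rounded estimates from the fitted Cobb-Douglas regression.
alpha_hat, se_alpha = 0.810, 0.021   # coefficient on log(K)
beta_hat, se_beta = 0.218, 0.023     # coefficient on log(L)
cov_alpha_beta = -0.00038            # off-diagonal entry of vcov()

# Var(a + b) = Var(a) + Var(b) + 2 * Cov(a, b)
var_sum = se_alpha ** 2 + se_beta ** 2 + 2 * cov_alpha_beta
se_sum = math.sqrt(var_sum)

# t statistic for H0: alpha + beta = 1 (constant returns to scale).
t_cal = (alpha_hat + beta_hat - 1) / se_sum

print(f"se(alpha + beta) = {se_sum:.4f}, t = {t_cal:.2f}")
```

Ignoring the negative covariance would overstate the standard error and understate the t statistic, so the vcov() entry matters for this test. With these rounded inputs the statistic comes out just under 2, so the comparison with the critical value is sensitive to rounding.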
(e) Conduct a test of overall significance of the regression model.
linearHypothesis(cobbdouglasmodel, c("logemp = 0", "logcn = 0"))
H0 : α = 0, β = 0 against HA : at least one of α and β is not 0.
The restricted model is $\widehat{\log(Y)} = \beta_0$, so $R^2_r = 0$, and the unrestricted model is the one
given in (b). Therefore,
\[ F = \frac{(R^2_{ur} - R^2_r)/q}{(1 - R^2_{ur})/(n - k - 1)} = \frac{(0.972 - 0)/2}{(1 - 0.972)/(171 - 2 - 1)} \approx 2916 \]
using the rounded R² (the unrounded regression output gives F = 2933.35). Since F
calculated is greater than F-critical, we can reject the null and conclude that the
variables log(K) and log(L) are jointly significant and belong in the model.
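The R-squared form of the F statistic is easy to evaluate by hand or in code. The Python sketch below uses a hypothetical helper, f_stat_from_r2, with the rounded R² reported for the unrestricted model; because R² is rounded to three decimals, the result differs slightly from the 2933.35 quoted in the solution.

```python
def f_stat_from_r2(r2_unrestricted, r2_restricted, q, df_resid):
    """F statistic from the R-squared of the unrestricted and restricted models."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / df_resid
    return numerator / denominator


n, k = 171, 2   # observations and number of slope coefficients
q = 2           # restrictions under H0: both slopes are zero

# The intercept-only restricted model has R-squared equal to 0.
f_stat = f_stat_from_r2(0.972, 0.0, q=q, df_resid=n - k - 1)
print(f"F = {f_stat:.1f}")
```

Either value is far above any conventional F critical value with (2, 168) degrees of freedom, so the rounding does not affect the conclusion.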
11. Use Smoke.csv dataset for this exercise. A model to estimate the effects of smoking
on annual income (perhaps through lost work days due to illness, or productivity
effects) is
log(income) = β0 + β1 cigs + β2 educ + β3 age + β4 age² + u
(a) Estimate β1 and interpret it.
rm(list = ls())
library(wooldridge)
library(stargazer)
library(car)
data("smoke")
force(smoke)
model <- lm(lincome ~ cigs + educ + age + agesq, data = smoke)
stargazer(model, type = "text")
The coefficient on cigs is 0.002. An additional cigarette smoked per day is
associated with about a 0.2% increase in income, holding educ and age fixed.
(b) Is β1 statistically significant at the 5% level?
$t = \frac{0.002 - 0}{0.002} = 1$. The calculated test statistic is less than the t-critical value (1.96)
and hence we cannot reject the null. In other words, cigarette consumption does
not have a statistically significant relationship with income.
(c) Find the marginal effect of age on income.
$\frac{\partial income}{\partial age} \cdot \frac{1}{income} = \beta_3 + 2\beta_4\, age$, or $\frac{\partial income}{\partial age} = income\,(\beta_3 + 2\beta_4\, age)$.
(d) What is the interpretation of β4 ?
It implies that income and age are non-linearly related. β4 is the coefficient on
age-squared, so the marginal effect of age on log(income), β3 + 2β4 ∗ age, changes
by 2β4 with each additional year of age.
(e) Comment on the relationship between income and age? At what age does in-
come peak?
# Turning point of the quadratic in age: -b_age / (2 * b_agesq)
turningpoint <- -coef(model)["age"] / (2 * coef(model)["agesq"])
print(turningpoint)
The relationship is non-linear since β3 > 0 and β4 < 0. Income first increases with
age, at a decreasing rate, but after a certain age income declines with increasing
age. The slope is $\frac{\partial \ln income}{\partial age} = \beta_3 + 2\beta_4\, age$. At the peak the slope is zero,
implying that β3 = −2β4 ∗ age, so income peaks at $age^{*} = -\frac{\beta_3}{2\beta_4} = 45.54$.
(f) Test the hypothesis H0 : β3 = β4 = 0.
linearHypothesis(model, c("age = 0", "agesq = 0"))
F calculated is 28.7, which is greater than F-critical, and therefore we can reject
the null and conclude that the age and age-squared terms are jointly significant
and belong in the model.