
Nanyang Business School

Tutorial Solutions: 10
Topics: Regression Analysis

NOTE: in the examination, students are advised to use "full numbers" as obtained
from R throughout their working, to avoid error propagation. Round the answer only
once, at the end, following the instructions of the question (e.g., round your answer
to 4 decimal places). But be assured that the examination answers will allow a
reasonable margin of error due to rounding.

1. (1) Calculating the OLS coefficients, which are obtained by minimizing the sum of
squared residuals (a residual being the observed value of Y minus the predicted value
of Y); see the sketch below.
(2) Like other statistics you know of, the coefficients are a function of the random sample
and would therefore change from sample to sample.
(3) The interpretation we give to a coefficient (i.e., the change in the average value of
the DV) is built on the premise that the values of all other IVs are held constant (fixed).
Mathematically, if an IV X has a coefficient of β, then
E(Y | X = a+1, …) − E(Y | X = a, …) = β,
where by "…" we hold constant all the other IVs in the regression (→ we say
"everything else being equal" for simplicity).
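
A minimal sketch of point (1), using simulated data (x and y below are illustrative,
not from any tutorial question): the coefficients returned by lm() coincide with the
closed-form minimizer of the residual sum of squares given by the normal equations.

set.seed(1)                        # simulated data, purely illustrative
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
coef(lm(y ~ x))                    # OLS coefficients from lm()

X <- cbind(1, x)                   # design matrix with an intercept column
solve(t(X) %*% X, t(X) %*% y)      # solves (X'X) b = X'y: same numbers as lm()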

2. (1) The regression coefficient of Income is 10. Since income is measured in $1,000s,
the model suggests that if income increases by $1,000, then debt payments are predicted
to increase by $10, holding the unemployment rate constant.

The regression coefficient of the unemployment rate is 0.6, implying that a 1-point
increase in the unemployment rate is associated with an increase in average debt
payments of $0.6, holding income constant.

The intercept of 20 is nominally the average debt payment when the income level and
unemployment rate of the study area are both zero. Evidently, such areas are unlikely
to exist in practice. The intercept often does not have a meaningful interpretation,
but it must be included in the regression for technical reasons.

(2) Predicted Debt = 20 + 10 × 80 + 0.6 × 7.5 = 824.5
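
As a quick check, the same arithmetic in R:

20 + 10 * 80 + 0.6 * 7.5   # 824.5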

3. (1) As the poverty rate increases by 1 percentage point, the average crime rate is
predicted to rise by 53.23, holding Income constant.
(2) Predicted Crime = −301.45 + 53.23 × 20 + 4.93 × 50 = 1009.65
(3) Poverty is significantly and positively associated with the crime rate at the 0.01
significance level, whereas Income is not statistically significant.


(4) The results indicate that "crime rate increases with poverty," but not that "crime
rate decreases with income" (because the Income coefficient is not statistically
significant).

4. (1) Observe that E(score | female) = b0 + b1 female, so E(score | female = 0) = b0
and E(score | female = 1) = b0 + b1.

b0 is the expected score of male students; b1 is the difference between the expected
score of female students and the expected score of male students.
(2) Import the file as "data". Then:

summary(lm(data$score ~ data$female))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.0500 2.5568 28.961 <2e-16 ***
data$female -0.3357 3.5726 -0.094 0.926
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The above result shows that 1) the sample mean score of female students is 0.3357 points
lower than that of male students, and 2) b1 is not statistically different from 0 (the
p-value is 0.926). This roughly says that E(score | female = 1) is statistically no
different from E(score | female = 0) = b0, which is about 74.05. As such, we conclude
that we do not reject H0: μ_male = μ_female, i.e., that male and female students have
the same mean test score.

This example also illustrates why we need "n − 1" 0-1 variables for an n-level categorical
variable. The intercept (b0) absorbs the omitted base group and captures its mean value,
which in this case is the mean of the "female = 0" group.
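
As a sketch (assuming the file has been imported as "data", as in part (2)), the
regression reproduces the two group means, and the test on b1 matches a pooled
two-sample t-test:

mean(data$score[data$female == 0])   # equals the intercept b0, about 74.05
mean(data$score[data$female == 1])   # equals b0 + b1, about 73.71
t.test(score ~ female, data = data, var.equal = TRUE)   # same p-value, 0.926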

5. This question demonstrates how the regression can be implemented without the factor()
command. We have to include n − 1 binary variables for an n-level categorical variable
(the "−1", namely the omitted one, is the baseline level, to which we compare the
other levels).

(1) install.packages("wooldridge")
library(wooldridge)
reg.results<-lm(salary~sales+finance+consprod+utility,data=ceosal1)
summary(reg.results)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.035e+03 1.791e+02 5.776 2.82e-08 ***
sales 1.249e-02 8.833e-03 1.414 0.1588
finance 2.371e+02 2.567e+02 0.924 0.3568
consprod 5.867e+02 2.374e+02 2.471 0.0143 *
utility -3.529e+02 2.791e+02 -1.264 0.2076

Residual standard error: 1336 on 204 degrees of freedom


Multiple R-squared: 0.07097, Adjusted R-squared: 0.05275
F-statistic: 3.896 on 4 and 204 DF, p-value: 0.004517

Predicted salary = 1035 + 0.012 sales + 237.1 finance + 586.7 consprod − 352.9 utility

Note that the industry variables are 0-1 variables. For example, the coefficient of
finance is 237.1, which means
E(salary | finance = 1, …) − E(salary | finance = 0, …) = $237.1 (k),
where by "…" the other IVs are held at arbitrary fixed values.

(2) Let β1 be the coefficient of sales. H0: β1 = 0; H1: β1 ≠ 0. The test result
(p-value = 0.1588) does not reject H0 at any common significance level.

(3) Since now we effectively want to use the finance category as the baseline group for
comparison, we proceed by estimating the following regression, where the finance
dummy is omitted:
salary = β0 + β1 sales + β2 consprod + β3 utility + β4 indus + ε

In R,
> reg.results1<-lm(salary~sales+consprod+utility+indus,data=ceosal1)
> summary(reg.results1)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.272e+03 2.036e+02 6.248 2.39e-09 ***
sales 1.249e-02 8.833e-03 1.414 0.1588
consprod 3.496e+02 2.625e+02 1.332 0.1844
utility -5.900e+02 2.978e+02 -1.981 0.0489 *
indus -2.371e+02 2.567e+02 -0.924 0.3568

The result shows that the average salary of a CEO from a utility firm is $590k lower than
that of a finance firm at the 5% significance level (holding sales constant).

• Here is an important observation. You can obtain the difference itself from the
model in part (1) as well, as 2.371e+02 − (−3.529e+02), because, given any arbitrary
sales level,
Avg finance CEO salary = Avg indus (base group) CEO salary + $237.1 k
Avg utility CEO salary = Avg indus (base group) CEO salary − $352.9 k

So the difference in avg salary is 237.1 − (−352.9) = $590 (k).

We can easily calculate the group difference for any pair…without needing to switch the
base group and re-run the regression.
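
For instance, a sketch pulling the gap straight from the part (1) model object
(reg.results):

b <- coef(reg.results)           # coefficients from the part (1) regression
b["finance"] - b["utility"]      # 237.1 - (-352.9) = 590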

• Another important observation is how we compute the expected salary for a
group. For example, based on the regression results right above,
Avg finance CEO salary, conditional on sales = 1, can be calculated as
(1.272e+03) + 1 × (1.249e-02) = 1272.012

For consprod, the avg CEO salary = (1.272e+03) + 1 × (1.249e-02) + 3.496e+02 = 1621.612
Observe that (1.272e+03) + 1 × (1.249e-02) is the average salary of the base group,
finance CEOs (given sales = 1).
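
The same conditional averages can also be obtained with predict() applied to the
part (3) model object (reg.results1); a sketch, holding sales = 1:

newd <- data.frame(sales = 1, consprod = c(0, 1), utility = 0, indus = 0)
predict(reg.results1, newdata = newd)   # ~1272 (finance) and ~1622 (consprod),
                                        # matching the hand calculation up to rounding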

• Finally, it is worth mentioning that typically we delegate the task of creating 0-1
variables to R. Normally, all four industry types would be stored in the same column
(e.g., = 1 for indus, = 2 for finance, = 3 for consumer products, and = 4 for utility).
Then use the factor() command in the regression (see Question 9), and R will
auto-generate all the needed 0-1 variables. This question is just to show you the
behind-the-scenes operations for categorical variables.

6. (1)
reg.results <-lm(bwght ~ cigs, data= bwght)
summary(reg.results)

Estimate Std. Error t value Pr(>|t|)


(Intercept) 119.77190 0.57234 209.267 < 2e-16 ***
cigs -0.51377 0.09049 -5.678 1.66e-08 ***

Residual standard error: 20.13 on 1386 degrees of freedom


Multiple R-squared: 0.02273, Adjusted R-squared: 0.02202
F-statistic: 32.24 on 1 and 1386 DF, p-value: 1.662e-08

Predicted bwght = 119.77190 − 0.51377 cigs

(2) When cigs = 0, the predicted birth weight is 119.77190 ounces. When cigs = 20, it
drops to 109.4965 ounces.
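
Equivalently, both predictions come straight from the fitted model with predict():

predict(reg.results, newdata = data.frame(cigs = c(0, 20)))
#        1        2
# 119.7719 109.4965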

7. (1)
> library(wooldridge)
> summary(lm(termGPA~ attend+priGPA+frosh,data=attend))

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.426845 0.117101 -3.645 0.000288 ***
attend 0.045701 0.003996 11.437 < 2e-16 ***
priGPA 0.702881 0.042060 16.711 < 2e-16 ***
frosh 0.063302 0.049023 1.291 0.197049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(2) From the help file (?attend in RStudio) or the summary statistics of attend, we know
that the total number of classes is 32. Thus, the estimated difference in average GPA
between a student who attended all 32 classes and one who attended none is
(32 − 0) × 0.045701 = 1.462432.
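
A sketch of the same computation done entirely in R (re-fitting the model from part (1)):

m <- lm(termGPA ~ attend + priGPA + frosh, data = attend)
(32 - 0) * coef(m)["attend"]     # 1.462432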

8. (1) You can either print the variables or use the class() command. For example,
printing ChickWeight$Diet will show its four levels at the end ("Levels: 1 2 3 4"),
which is exclusive to factor variables. Alternatively, use levels(ChickWeight$Diet)
to see a full list of the levels of the variable. When you apply levels() to a
continuous variable (for example, levels(ChickWeight$weight)), you will get NULL.
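
A sketch of these checks:

class(ChickWeight$Diet)       # "factor"
levels(ChickWeight$Diet)      # "1" "2" "3" "4"
levels(ChickWeight$weight)    # NULL -- weight is numeric, not a factor
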
(2) Use lm(weight ~ Time + Diet, data = ChickWeight) to obtain the coefficient estimates.

Results are:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.9244 3.3607 3.251 0.00122 **
Time 8.7505 0.2218 39.451 < 2e-16 ***
Diet2 16.1661 4.0858 3.957 8.56e-05 ***
Diet3 36.4994 4.0858 8.933 < 2e-16 ***
Diet4 30.2335 4.1075 7.361 6.39e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(3) Time is measured on a continuous scale, and thus its coefficient has the usual
interpretation: the change in average weight with one additional day, "ELBE"
(everything else being equal). The variable Diet is categorical, so its coefficients
represent the differences in average weight compared with the reference group, those
eating Diet type #1.

(4) First, factorize the Time variable by creating a new variable:

ChickWeight$Time_factor <- factor(ChickWeight$Time)

Then run the regression as usual: lm(weight ~ Time_factor + Diet, data = ChickWeight).

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.497 5.373 4.559 6.30e-06 ***
Time_factor2 8.160 7.141 1.143 0.25367
Time_factor4 18.561 7.178 2.586 0.00996 **
Time_factor6 32.908 7.178 4.585 5.60e-06 ***
Time_factor8 49.847 7.178 6.945 1.05e-11 ***
Time_factor10 66.439 7.178 9.256 < 2e-16 ***
Time_factor12 87.847 7.178 12.239 < 2e-16 ***
Time_factor14 102.062 7.216 14.145 < 2e-16 ***
Time_factor16 125.968 7.255 17.362 < 2e-16 ***
Time_factor18 148.074 7.255 20.409 < 2e-16 ***
Time_factor20 167.875 7.296 23.010 < 2e-16 ***
Time_factor21 176.461 7.338 24.046 < 2e-16 ***
Diet2 16.103 4.053 3.973 8.03e-05 ***
Diet3 36.436 4.053 8.989 < 2e-16 ***
Diet4 30.277 4.075 7.430 4.05e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
---
• Note that the Time factor advances in increments of 2 because the weights are
measured at two-day intervals in the original data, except for the last measurement
(0, 2, 4, 6, …, 20, 21).

• There is no right or wrong formulation. These two models just work under
different assumptions about the effect of "Time," so they have different
specifications.

• You will get the same result by using the following two-in-one command:
lm(weight ~ factor(Time) + Diet, data = ChickWeight). This command is handy
when you only need the factorized variable on a one-off basis.

• Since Time_factor is now a categorical variable, its coefficients are the changes in
average weight relative to the baseline (reference) group, Time = 0, i.e., the chicks'
weight at birth, ELBE.

• Finally, you are encouraged to compare part (4) of this question with Question 5
above (the CEO salary).
In Question 5, the regression involves a 4-level categorical variable, and the four 0-1
variables had already been created. We chose to omit one of them and include the other
three in the regression. This is how categorical variables are handled in regression:
you create several 0-1 variables and omit one.
However, we almost always will not save our data in this "0-1" way. We would, for
example, save the company's sector information in one column:

name  Salary  sales  Sector
A     100     50     Finance
B     80      35     Utility
C     85      37     Finance
...   ...     ...    ...


For this data, R wouldn't know that you want it to recognize "Sector" as categorical.
The factor() command tells R to treat a variable as a factor and to create all the
necessary 0-1 variables in the regression. This is what was done with "Time" in part (4).
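
A minimal sketch with the toy table above (three rows only, purely illustrative):

df <- data.frame(Salary = c(100, 80, 85),
                 sales  = c(50, 35, 37),
                 Sector = c("Finance", "Utility", "Finance"))
lm(Salary ~ sales + factor(Sector), data = df)
# factor() makes R generate the needed 0-1 variable(s) for Sector,
# with the first level ("Finance") serving as the base group.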
