D Linear Regression With R
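The examples in this appendix assume that the Weight at Birth data set has been loaded into R and that its columns (BWT, LWT, AGE, SMOKE) are directly accessible by name. A minimal sketch, in which the file name birthweight.txt and the object name birth are only placeholders:
> # The file name below is a placeholder; adapt it to your local copy of the data
> birth <- read.table("birthweight.txt", header = TRUE)
> attach(birth)   # makes BWT, LWT, AGE and SMOKE accessible by name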
> model1<-lm(BWT~LWT)
> model1
Call:
lm(formula = BWT ~ LWT)
Coefficients:
(Intercept) LWT
2369.672 9.765
The above R output gives the least squares estimates $\hat{\beta}_1 = 2369.672$ and $\hat{\beta}_2 = 9.765$.
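If needed, the estimated coefficients can also be extracted programmatically with the standard extractor coef(), for instance to reuse them in later computations:
> coef(model1)          # named vector of least squares estimates
> coef(model1)["LWT"]   # slope associated with the mother's weight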
We can now draw the regression line on the scatter plot, using the function abline()
> plot(BWT~LWT,xlab="Mother weight",ylab="Child weight")
> abline(model1,col="blue")
3. Tests on Parameters
Note that the function lm() performs a complete analysis of the linear model and that you can get
a summary of the calculations related to the data set with the function summary().
> summary(model1)
Call:
lm(formula = BWT ~ LWT)
Residuals:
Min 1Q Median 3Q Max
-2192.18 -503.63 -3.91 508.25 2075.53
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2369.672 228.431 10.374 <2e-16 ***
LWT 9.765 3.777 2.586 0.0105 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Adjusted R-squared: the adjusted coefficient of determination $R_a^2$ (of limited interest for simple linear regression).
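The quantities reported by summary() can also be retrieved from the summary object itself; a short sketch:
> s1 <- summary(model1)
> s1$r.squared       # coefficient of determination R^2
> s1$adj.r.squared   # adjusted R^2
> s1$sigma           # residual standard error (estimate of sigma)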
We can also propose a prediction interval at level $1-\alpha$ for $Y_0$, by finding two random bounds such that the random variable falls in the interval with probability $1-\alpha$:
$$\hat{Y}_0^p \pm t_{1-\alpha/2}(n-2)\,\hat{\sigma}\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
Note that the realization $\hat{y}_0^p = \hat{\beta}_1 + \hat{\beta}_2 x_0$ is called the prediction of the unobserved value $y_0 = \beta_1 + \beta_2 x_0 + \varepsilon_0$.
Similarly, note that an estimator of the fixed and unknown value $E(Y_0 \mid X = x_0) = \beta_1 + \beta_2 x_0$ is given by $\hat{E}(Y_0 \mid X = x_0) = \hat{Y}_0 = \hat{\beta}_1 + \hat{\beta}_2 x_0$.
A confidence interval at level $1-\alpha$ for this expected value is given by:
$$\hat{Y}_0 \pm t_{1-\alpha/2}(n-2)\,\hat{\sigma}\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
Both the prediction interval and the confidence interval for a new value $x_0$ are obtained with the function predict().
Using the Weight at Birth data, we compute the prediction of the weight of a baby whose mother weighs lwt = 56 kg.
> lwt0 <- 56
> predict(model1,data.frame(LWT=lwt0),interval="prediction")
fit lwr upr
1 2916.504 1495.699 4337.309
For the confidence interval of the mean value of the weight of babies with a mother weighing 56
kg:
> predict(model1,data.frame(LWT=lwt0),interval="confidence")
fit lwr upr
1 2916.504 2811.225 3021.783
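As a check, both intervals returned by predict() can be recomputed directly from the formulas given above; the sketch below uses the 0.975 quantile of Student's distribution, which corresponds to the default 95 % level:
> n <- length(LWT)
> sig <- summary(model1)$sigma                           # estimate of sigma
> y0 <- coef(model1)[1] + coef(model1)[2]*lwt0           # point prediction
> lev <- 1/n + (lwt0-mean(LWT))^2/sum((LWT-mean(LWT))^2)
> y0 + c(-1,1)*qt(0.975,n-2)*sig*sqrt(1+lev)             # prediction interval
> y0 + c(-1,1)*qt(0.975,n-2)*sig*sqrt(lev)               # confidence interval
Note that the prediction interval is much wider than the confidence interval, since it accounts for the variability of an individual observation and not only for the uncertainty on the estimated mean.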
We now plot the confidence interval and the prediction interval for a series of new values of the mother's weight:
> x <- seq(min(LWT),max(LWT),length=50)
> predint <- predict(model1,data.frame(LWT=x),interval=
+ "prediction")[,c("lwr","upr")]
> confint <- predict(model1,data.frame(LWT=x),interval=
+ "confidence")[,c("lwr","upr")]
> plot(BWT~LWT,xlab="Mother weight",ylab="Child weight")
> abline(model1)
> matlines(x,cbind(confint,predint),lty=c(2,2,3,3),
+ col=c("red","red","blue","blue"),lwd=c(2,2,1,1))
> legend("bottomright",lty=c(2,3),lwd=c(2,1),
+ c("confidence","prediction"),col=c("red","blue"))
2. Parameter Estimation
As for simple linear regression, the model is estimated using function lm():
> model2 <- lm(BWT~AGE+LWT+SMOKE)
> model2
Call:
lm(formula = BWT ~ AGE + LWT + SMOKE)
Coefficients:
(Intercept) AGE LWT SMOKE
2362.720 7.093 8.860 -267.213
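Confidence intervals for the individual parameters can be obtained with confint(), at the default 95 % level or at any other level:
> confint(model2)                # 95 % confidence intervals for the coefficients
> confint(model2, level = 0.99)  # same intervals at the 99 % level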
3. Tests on Parameters
Tests on parameters are performed by function summary().
> summary(model2)
Call:
lm(formula = BWT ~ AGE + LWT + SMOKE)
Residuals:
Min 1Q Median 3Q Max
-2069.89 -433.18 13.67 516.45 1813.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2362.720 300.687 7.858 3.11e-13 ***
AGE 7.093 9.925 0.715 0.4757
LWT 8.860 3.791 2.337 0.0205 *
SMOKE -267.213 105.802 -2.526 0.0124 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results output by summary() are presented in the same fashion as for simple linear
regression. Parameter estimates are given in the column Estimate.
The realizations of Student's test statistics associated with the hypotheses $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$
are given in column t value; the associated p-values are in column Pr(>|t|). Residual
standard error gives the estimate of $\sigma$ together with the associated number of degrees of freedom $n - p - 1$.
The coefficient of determination $R^2$ (Multiple R-squared) and an adjusted version
(Adjusted R-squared) are given, as are the realization of Fisher's global test statistic (F-statistic) and the associated p-value.
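As in the simple regression case, these quantities can be extracted from the summary object; for instance, the global F statistic and its p-value:
> s2 <- summary(model2)
> s2$coefficients   # estimates, standard errors, t values and p-values
> s2$fstatistic     # F statistic and its degrees of freedom
> pf(s2$fstatistic[1], s2$fstatistic[2], s2$fstatistic[3],
+    lower.tail = FALSE)   # p-value of Fisher's global test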
3. Interpreting Results from Study “Weight at Birth”
Given the result of Fisher's global test (p-value = 0.003781), we can conclude that at least one of
the explanatory variables is associated with child weight at birth, after adjusting for the other
variables. The individual Student tests indicate that:
• Mother weight is linearly associated with child weight, after adjusting for age and
smoking status, with a risk of error less than 5 % (p-value = 0.0205). For a given age and
smoking status, an increase of 1 kg in mother weight corresponds to an increase of 8.860
g in average child weight at birth.
• The age of the mother is not significantly linearly associated with child weight at birth
when mother weight and smoking status are already taken into account (p-value =
0.4757).
• Weight at birth is significantly lower for a child born to a mother who smokes than for
children born to non-smoking mothers of the same age and weight, with a risk of error less
than 5 % (p-value = 0.012). For a given age and mother weight, average child weight at birth is
267.213 g lower for a smoking mother than for a non-smoking mother.
4. Prediction of a New Value
Suppose we wish to predict the weight at birth of a child whose mother is 23 years old, weighs
57 kg and smokes. The function predict() gives a prediction, a prediction interval and a
confidence interval for the mean weight of children whose mothers have these characteristics.
> newdata <- data.frame(AGE=23,LWT=57,SMOKE=1)
> predict(model2,newdata,interval="pred")
fit lwr upr
1 2763.693 1355.943 4171.444
> predict(model2,newdata,interval="conf")
fit lwr upr
1 2763.693 2600.914 2926.472
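predict() also accepts several new individuals at once; the two profiles below are hypothetical examples:
> newmothers <- data.frame(AGE = c(23, 30), LWT = c(57, 65),
+                          SMOKE = c(1, 0))   # hypothetical profiles
> predict(model2, newmothers, interval = "confidence")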
5. Testing a Linear Sub-hypothesis: Partial Fisher Test
Fisher’s partial test is used to test the contribution of a subset of explanatory variables in a model
which already includes other explanatory variables. For example, consider the following two
models:
Model 1: $BWT = \beta_1 + \beta_2\,LWT + \varepsilon$
Model 2: $BWT = \beta_1 + \beta_2\,LWT + \gamma_2\,AGE + \gamma_3\,SMOKE + \varepsilon$
Fisher's partial test is used to test the joint contribution of the variables AGE and SMOKE in Model 2. The
hypotheses of the test are $H_0: \gamma_2 = \gamma_3 = 0$ and $H_1:$ at least one of the coefficients $\gamma_2$ or $\gamma_3$ is
non-zero. The following instructions are used for this test:
> anova(model1,model2)
Analysis of Variance Table