D Linear Regression With R

This document discusses simple and multiple linear regression. It presents the relevant R commands for performing linear regression and uses birth weight data to demonstrate key concepts. Simple linear regression is used to model birth weight based on mother's weight. The regression line is plotted and parameters are estimated. Diagnostics like residuals and confidence intervals are also examined.

Simple and Multiple Linear Regression

A. Simple linear regression with R


This chapter is a brief introduction to simple and multiple linear regression and how to use this
method in a real context. We present the relevant R commands and use a real data set as a
connecting thread as we present the key concepts for this method. We treat the case of qualitative
explanatory variables, as well as interaction of explanatory variables.
We discuss model validation with a study of residuals and mention the issue of collinearity. We
also present a few methods for variable selection.
We return to the data set Birth-weight. We wish to explain the variability of child weight
at birth as a function of characteristics of the mother, of family history and of behaviour during
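pregnancy. The explained variable is weight at birth (quantitative variable BWT, expressed in
grammes); the explanatory variables are the other characteristics recorded in the data file
(AGE, LWT, RACE, SMOKE, PTL, HT, UI and FVT), summarized below.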
This study focused on risks associated with low weight at birth; the data were collected at the
Baystate Medical Centre, Massachusetts, in 1986. Physicians have been interested in low weight
at birth for several years, because underweight babies have high rates of infant mortality and
infant anomalies. The behaviour of the mother-to-be during pregnancy (diet, smoking habits) can
have a significant impact on the chances of having a full-term pregnancy, and thus of giving
birth to a child of normal weight. The data file includes information on 189 women
(identification number: ID) who came to the centre for consultation. Weight at birth is
categorized as low if the child weighs less than 2,500 g.

Loading the data:


> mydata <- read.csv("wb.csv",header=TRUE,sep = "\t")
> summary(mydata)
ID AGE LWT
Min. : 4.0 Min. :14.00 Min. : 80.0
1st Qu.: 68.0 1st Qu.:19.00 1st Qu.:110.0
Median :123.0 Median :23.00 Median :121.0
Mean :121.1 Mean :23.24 Mean :129.8
3rd Qu.:176.0 3rd Qu.:26.00 3rd Qu.:140.0
Max. :226.0 Max. :45.00 Max. :250.0
RACE SMOKE PTL
Min. :1.000 Min. :0.0000 Min. :0.0000
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :1.000 Median :0.0000 Median :0.0000
Mean :1.847 Mean :0.3915 Mean :0.1958
3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :3.000 Max. :1.0000 Max. :3.0000
HT UI FVT
Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.06349 Mean :0.1481 Mean :0.7937
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :1.00000 Max. :1.0000 Max. :6.0000
BWT LOW
Min. : 709 Min. :0.0000
1st Qu.:2414 1st Qu.:0.0000
Median :2977 Median :0.0000
Mean :2945 Mean :0.3122
3rd Qu.:3475 3rd Qu.:1.0000
Max. :4990 Max. :1.0000
The weight of the mother is expressed in pounds. We first transform the data.frame to recode this
variable in kilogrammes (1 pound = 0.45359237kg).
> mydata <- transform(mydata,LWT=LWT*0.45359237)
> attach(mydata)
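As a quick check of the conversion, the sketch below applies the exact factor to a few LWT values printed by summary() above. The names lwt_lb, lb_to_kg and lwt_kg are illustrative choices and are not part of the original session.

```r
# LWT summary values in pounds, copied from the summary() output above
lwt_lb <- c(Min = 80, Median = 121, Mean = 129.8, Max = 250)

# 1 pound = 0.45359237 kg (exact definition)
lb_to_kg <- 0.45359237
lwt_kg <- lwt_lb * lb_to_kg

# Mother weights in kilogrammes, rounded to one decimal
print(round(lwt_kg, 1))
```

The mean mother weight is about 58.9 kg, so the value of 56 kg used later for prediction lies well inside the observed range.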
1. Graphical Inspection
To study the relationship between child weight at birth and weight of the mother, we first draw
the scatterplot of the points (child weight; mother weight) using the instruction plot(BWT~LWT)
> plot(BWT~LWT,xlab="Mother weight",ylab="Child weight at birth")
We observe a slight increase in child weight when mother weight increases, although this
relationship is not very clear.
2. Parameter Estimation
We now study the following model:
$BWT = \beta_1 + \beta_2\, LWT + \varepsilon$

> model1<-lm(BWT~LWT)
> model1

Call:
lm(formula = BWT ~ LWT)

Coefficients:
(Intercept) LWT
2369.672 9.765
The above R output gives the least squares estimates $\hat{\beta}_1 = 2369.672$ and $\hat{\beta}_2 = 9.765$.
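For simple linear regression, the least squares estimates have a well-known closed form: the slope is the ratio of the sum of cross-products to the sum of squares of x, and the intercept follows from the means. The sketch below checks this against lm() on synthetic data; the data, the seed and the names x, y, b1, b2 and fit are assumptions made for illustration and are not the birth-weight file.

```r
# Synthetic data for illustration only (not the birth-weight file)
set.seed(1)
n <- 50
x <- runif(n, 40, 110)                     # hypothetical mother weights (kg)
y <- 2400 + 10 * x + rnorm(n, sd = 700)    # hypothetical birth weights (g)

# Closed-form least squares estimates
b2 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b1 <- mean(y) - b2 * mean(x)                                     # intercept

# Same estimates computed by lm()
fit <- lm(y ~ x)
print(c(intercept = b1, slope = b2))
print(coef(fit))
```

Both print statements should show the same two numbers, up to floating-point error.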

We can now draw the regression line on the scatter plot, using the function abline()
> plot(BWT~LWT,xlab="Mother weight",ylab="Child weight")
> abline(model1,col="blue")

3. Tests on Parameters
Note that the function lm() performs a complete analysis of the linear model and that you can get
a summary of the calculations related to the data set with the function summary().

> summary(model1)

Call:
lm(formula = BWT ~ LWT)
Residuals:
Min 1Q Median 3Q Max
-2192.18 -503.63 -3.91 508.25 2075.53

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2369.672 228.431 10.374 <2e-16 ***
LWT 9.765 3.777 2.586 0.0105 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 718.2 on 187 degrees of freedom


Multiple R-squared: 0.03452, Adjusted R-squared: 0.02935
F-statistic: 6.686 on 1 and 187 DF, p-value: 0.01048

Here is a description of the information in this output.


Call: formula used in the model.
Residuals: descriptive analysis of the residuals $\hat{\varepsilon}_i = y_i - \hat{y}_i$; we shall see that residuals are used to
validate the assumptions of the regression model.
Coefficients: this table has four columns:
- Estimate gives the estimates of the parameters of the regression line.
- Std. Error gives the estimate of the standard deviation of the estimators of the regression
line.
- t value gives the realization of Student’s test statistic associated with the hypotheses
$H_0: \beta_i = 0$; $H_1: \beta_i \neq 0$.

- Pr(>|t|) gives the p-value of Student’s test.


Signif. codes: codes for significance levels.
Residual standard error: an estimate $\hat{\sigma}$ of the standard deviation $\sigma$ of the noise $\varepsilon$, and the associated
degrees of freedom n − 2.

Multiple R-squared: coefficient of determination $r^2$ (percentage of variation explained by the
regression).

Adjusted R-squared: adjusted coefficient of determination $r_a^2$ (of limited interest for simple linear regression).

F-statistic: realization of Fisher’s test statistic associated with the hypotheses
$H_0: \beta_2 = 0$; $H_1: \beta_2 \neq 0$. The associated degrees of freedom (1 and n − 2) are given, as is the
p-value.
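The quantities described above are linked by simple formulas, which can be checked directly from the printed output. The sketch below recomputes the Student statistic, its p-value, the Fisher statistic and the adjusted R-squared from the rounded values shown by summary(model1); small discrepancies in the last digit come from that rounding. The variable names are illustrative.

```r
# Values copied from the summary(model1) output above
n   <- 189       # number of observations
est <- 9.765     # estimate for LWT
se  <- 3.777     # its standard error
r2  <- 0.03452   # Multiple R-squared

t_val  <- est / se                                            # Student statistic
p_val  <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)  # two-sided p-value
f_val  <- (r2 / 1) / ((1 - r2) / (n - 2))                     # Fisher statistic from R^2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)                    # adjusted R-squared

print(round(c(t = t_val, p = p_val, F = f_val, adj_r2 = adj_r2), 4))
```

With a single explanatory variable, the Fisher statistic is the square of the Student statistic for the slope, which is why both tests report the same p-value.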
To obtain confidence intervals for the regression coefficients, we can use the function
confint().
> confint(model1)
2.5 % 97.5 %
(Intercept) 1919.039836 2820.30429
LWT 2.314692 17.21502
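These intervals are of the form estimate ± quantile × standard error, with a Student quantile on n − 2 degrees of freedom. As a rough check, the sketch below rebuilds them from the rounded values printed by summary(model1); the names est, se, t_crit and ci are illustrative, and the results agree with confint() only up to rounding of the inputs.

```r
# Estimates and standard errors copied from summary(model1)
est <- c("(Intercept)" = 2369.672, LWT = 9.765)
se  <- c("(Intercept)" = 228.431,  LWT = 3.777)
df  <- 187                      # residual degrees of freedom (n - 2)

t_crit <- qt(0.975, df)         # Student quantile for a 95% interval
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 3))
```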
4. Confidence and Prediction Intervals for a New Value
Consider a new observation $x_0$ of variable X for which we have not observed the corresponding
value $y_0$ of the response variable Y. This value $y_0$ is unknown, since it is not observed, and is a
realization of the random variable $Y_0 = \beta_1 + \beta_2 x_0 + \varepsilon_0$.

The predictor of $Y_0$ for the new value $x_0$ is given by $\hat{Y}_0^p = \hat{\beta}_1 + \hat{\beta}_2 x_0$.

We can also propose a prediction interval at level $1 - \alpha$ for $Y_0$, by finding two random bounds
such that the random variable falls in the interval with probability $1 - \alpha$:

$$\hat{Y}_0^p \pm t_{1-\alpha/2}^{(n-2)}\,\hat{\sigma}\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Note that the realization $\hat{y}_0^p = \hat{\beta}_1 + \hat{\beta}_2 x_0$ is called the prediction of the unobserved value
$y_0 = \beta_1 + \beta_2 x_0 + \varepsilon_0$.

Similarly, note that an estimator of the fixed and unknown value $E(Y_0 \mid X = x_0) = \beta_1 + \beta_2 x_0$ is
given by $\hat{E}(Y_0 \mid X = x_0) = \hat{Y}_0 = \hat{\beta}_1 + \hat{\beta}_2 x_0$.

We can also propose a confidence interval at level $1 - \alpha$ for $E(Y_0 \mid X = x_0)$:

$$\hat{Y}_0 \pm t_{1-\alpha/2}^{(n-2)}\,\hat{\sigma}\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

The function predict() computes the prediction interval and the confidence interval for a new
value x0.
Using the Birth-weight data, we calculate the prediction of the weight of a baby whose mother
weighs lwt = 56 kg.
> lwt0 <- 56
> predict(model1,data.frame(LWT=lwt0),interval="prediction")
fit lwr upr
1 2916.504 1495.699 4337.309
For the confidence interval of the mean value of the weight of babies with a mother weighing 56
kg:
> predict(model1,data.frame(LWT=lwt0),interval="confidence")
fit lwr upr
1 2916.504 2811.225 3021.783
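The two interval formulas differ only by the extra 1 under the square root, which accounts for the noise of the new observation. The sketch below evaluates both formulas by hand and compares them with predict(). It runs on synthetic data, since the quantities needed (mean and sum of squares of the mother weights) are not printed above; the data, the seed and all variable names are assumptions made for illustration.

```r
# Synthetic data for illustration only (not the birth-weight file)
set.seed(2)
n <- 60
x <- runif(n, 40, 110)
y <- 2400 + 10 * x + rnorm(n, sd = 700)
fit <- lm(y ~ x)

x0        <- 56                               # new value of the explanatory variable
y0hat     <- sum(coef(fit) * c(1, x0))        # point prediction
sigma_hat <- summary(fit)$sigma               # residual standard error
t_crit    <- qt(0.975, df = n - 2)            # Student quantile, level 95%
h0 <- 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2)

# Intervals computed from the formulas
pred_manual <- y0hat + c(-1, 1) * t_crit * sigma_hat * sqrt(1 + h0)
conf_manual <- y0hat + c(-1, 1) * t_crit * sigma_hat * sqrt(h0)

# Intervals computed by predict()
new <- data.frame(x = x0)
print(pred_manual)
print(predict(fit, new, interval = "prediction"))
print(conf_manual)
print(predict(fit, new, interval = "confidence"))
```

The prediction interval is always wider than the confidence interval, as in the output for the mother weighing 56 kg.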
We now represent the confidence interval and prediction interval for a series of new values of the
mother’s weight
> x <- seq(min(LWT),max(LWT),length=50)
> predint <- predict(model1,data.frame(LWT=x),interval=
+ "prediction")[,c("lwr","upr")]
> confint <- predict(model1,data.frame(LWT=x),interval=
+ "confidence")[,c("lwr","upr")]
> plot(BWT~LWT,xlab="Mother weight",ylab="Child weight")
> abline(model1)
> matlines(x,cbind(confint,predint),lty=c(2,2,3,3),
+ col=c("red","red","blue","blue"),lwd=c(2,2,1,1))
> legend("bottomright",lty=c(2,3),lwd=c(2,1),
+ c("confidence","prediction"),col=c("red","blue"))

B. Multiple Linear Regression


1. Graphical Inspection
Using the data from part A, we regress child weight at birth on the mother's age, weight and
smoking status during pregnancy.
Before estimating the model, we present a scatter plot of all pairs of variables:
> pairs(BWT~LWT+AGE+SMOKE)

2. Parameter Estimation
As for simple linear regression, the model is estimated using function lm():
> model2 <- lm(BWT~AGE+LWT+SMOKE)
> model2

Call:
lm(formula = BWT ~ AGE + LWT + SMOKE)

Coefficients:
(Intercept) AGE LWT SMOKE
2362.720 7.093 8.860 -267.213
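In matrix form, the least squares estimates solve the normal equations (X'X) b = X'y, where X is the design matrix with a column of ones for the intercept. The sketch below checks this against lm() on synthetic data with the same variable names; the data, the seed and the simulated coefficients are assumptions made for illustration and are not the birth-weight file.

```r
# Synthetic data for illustration only (not the birth-weight file)
set.seed(3)
n <- 80
d <- data.frame(AGE   = sample(15:45, n, replace = TRUE),
                LWT   = runif(n, 40, 110),
                SMOKE = rbinom(n, 1, 0.4))
d$BWT <- 2400 + 7 * d$AGE + 9 * d$LWT - 270 * d$SMOKE + rnorm(n, sd = 700)

fit <- lm(BWT ~ AGE + LWT + SMOKE, data = d)

# Normal equations: solve (X'X) b = X'y
X <- model.matrix(fit)                          # columns: intercept, AGE, LWT, SMOKE
b <- solve(crossprod(X), crossprod(X, d$BWT))

print(cbind(normal_eq = drop(b), lm = coef(fit)))
```

Passing data = d to lm() keeps the variables inside the data frame, so this sketch does not rely on attach().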

3. Tests on Parameters
Tests on parameters are performed by function summary().
> summary(model2)
Call:
lm(formula = BWT ~ AGE + LWT + SMOKE)

Residuals:
Min 1Q Median 3Q Max
-2069.89 -433.18 13.67 516.45 1813.75

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2362.720 300.687 7.858 3.11e-13 ***
AGE 7.093 9.925 0.715 0.4757
LWT 8.860 3.791 2.337 0.0205 *
SMOKE -267.213 105.802 -2.526 0.0124 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 708.8 on 185 degrees of freedom


Multiple R-squared: 0.06988, Adjusted R-squared: 0.05479
F-statistic: 4.633 on 3 and 185 DF, p-value: 0.003781

The results output by summary() are presented in the same fashion as for simple linear
regression. Parameter estimates are given in the column Estimate.
The realizations of Student’s test statistics associated with the hypotheses $H_0: \beta_i = 0$; $H_1: \beta_i \neq 0$
are given in column t value; the associated p-values are in column Pr(>|t|). Residual
standard error gives the estimate of $\sigma$ and the number of associated degrees of freedom n – p – 1,
where p is the number of explanatory variables (here p = 3).
The coefficient of determination R2 (multiple R-squared) and an adjusted version
(adjusted R-squared) are given, as are the realization of Fisher’s global test statistic (F-
statistic) and the associated p-value.
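The global Fisher statistic and the adjusted R-squared can both be recovered from the multiple R-squared and the degrees of freedom. The sketch below does this with the rounded values printed by summary(model2), so agreement with the output holds only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary(model2) output above
n  <- 189      # number of observations
p  <- 3        # explanatory variables: AGE, LWT, SMOKE
r2 <- 0.06988  # Multiple R-squared

df_res <- n - p - 1                                  # residual degrees of freedom
f_val  <- (r2 / p) / ((1 - r2) / df_res)             # global Fisher statistic
p_val  <- pf(f_val, p, df_res, lower.tail = FALSE)   # upper-tail p-value
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res            # adjusted R-squared

print(c(df = df_res, F = round(f_val, 3),
        p = signif(p_val, 4), adj_r2 = round(adj_r2, 5)))
```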
4. Interpreting Results from Study “Weight at Birth”
Given the result of Fisher’s global test (p-value = 0.003781), we can conclude that at least one of
the explanatory variables is associated with child weight at birth, after adjusting for the other
variables. The individual Student tests indicate that:
• Mother weight is linearly associated with child weight, after adjusting for age and
smoking status, with a risk of error below 5 % (p-value = 0.0205). At the same age and
smoking status, an increase of 1 kg in mother weight corresponds to an increase of 8.860
g in average child weight at birth.
• The age of the mother is not significantly linearly associated with child weight at birth
when mother weight and smoking status are already taken into account (p-value =
0.4757).
• Weight at birth is significantly lower for a child born to a mother who smokes, compared
to children born to non-smoking mothers of the same age and weight, with a risk of error
below 5 % (p-value = 0.0124). At the same age and mother weight, average child weight at
birth is 267.213 g lower for a mother who smokes than for a non-smoking mother.
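The phrase "at the same age and weight" can be made concrete by comparing fitted values that differ in one variable only. The sketch below does so with the rounded coefficients printed for model2; the helper fitted_bwt() and the vector beta are hypothetical names introduced for illustration, not part of the original session.

```r
# Coefficients copied from the model2 output above
beta <- c(intercept = 2362.720, AGE = 7.093, LWT = 8.860, SMOKE = -267.213)

# Fitted mean birth weight (g) for given characteristics of the mother
fitted_bwt <- function(age, lwt, smoke) {
  unname(beta["intercept"] + beta["AGE"] * age +
         beta["LWT"] * lwt + beta["SMOKE"] * smoke)
}

# Same age and weight, smoker versus non-smoker
smoke_effect <- fitted_bwt(23, 57, 1) - fitted_bwt(23, 57, 0)
# Same age and smoking status, one extra kilogramme of mother weight
lwt_effect <- fitted_bwt(23, 58, 1) - fitted_bwt(23, 57, 1)

print(c(smoke_effect = smoke_effect, lwt_effect = lwt_effect))
```

Whatever age and weight are chosen, the two differences equal the SMOKE and LWT coefficients, because the model contains no interaction terms.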
5. Confidence and Prediction Intervals for a New Value
Suppose we wish to predict the weight at birth of a child whose mother is 23 years old, weighs
57 kg and smokes. The function predict() gives a prediction, a prediction interval and a
confidence interval for the mean weight of children whose mothers have these characteristics.
> newdata <- data.frame(AGE=23,LWT=57,SMOKE=1)
> predict(model2,newdata,interval="pred")
fit lwr upr
1 2763.693 1355.943 4171.444
> predict(model2,newdata,interval="conf")
fit lwr upr
1 2763.693 2600.914 2926.472
6. Testing a Linear Sub-hypothesis: Partial Fisher Test
Fisher’s partial test is used to test the contribution of a subset of explanatory variables in a model
which already includes other explanatory variables. For example, consider the following two
models:
Model 1: $BWT = \beta_1 + \beta_2\, LWT + \varepsilon$

Model 2: $BWT = \beta_1 + \beta_2\, LWT + \beta_3\, AGE + \beta_4\, SMOKE + \varepsilon$

Fisher’s test is used to test the joint contribution of variables AGE and SMOKE in model 2. The
hypotheses of the test are $H_0: \beta_3 = \beta_4 = 0$ and $H_1$: at least one of the coefficients $\beta_3$ or $\beta_4$ is
non-zero. The following instructions are used for this test:
> anova(model1,model2)
Analysis of Variance Table

Model 1: BWT ~ LWT


Model 2: BWT ~ AGE + LWT + SMOKE
Res.Df RSS Df Sum of Sq F Pr(>F)
1 187 96468171
2 185 92935223 2 3532949 3.5164 0.03171 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value of the test (Pr(>F)=0.03171) indicates that at least one of the two variables
AGE or SMOKE gives extra information to predict child weight at birth, when mother weight has
already been taken into account.
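The partial Fisher statistic compares the drop in residual sum of squares between the two nested models with the residual variance of the larger model. The sketch below recomputes it from the values printed in the analysis of variance table; the variable names are illustrative.

```r
# Values copied from the anova(model1, model2) table above
rss1 <- 96468171   # residual sum of squares, model 1 (BWT ~ LWT)
rss2 <- 92935223   # residual sum of squares, model 2 (BWT ~ AGE + LWT + SMOKE)
df1  <- 187        # residual degrees of freedom, model 1
df2  <- 185        # residual degrees of freedom, model 2

q     <- df1 - df2                                 # number of coefficients tested
f_val <- ((rss1 - rss2) / q) / (rss2 / df2)        # partial Fisher statistic
p_val <- pf(f_val, q, df2, lower.tail = FALSE)     # upper-tail p-value

print(round(c(F = f_val, p = p_val), 5))
```

The result matches the F and Pr(>F) columns of the table, up to rounding of the printed sums of squares.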
