Inference in regression
Brian Caffo, Jeff Leek and Roger Peng
Johns Hopkins Bloomberg School of Public Health
Recall our model and fitted values
Consider the model

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where $\epsilon_i \sim N(0, \sigma^2)$.

- We assume that the true model is known.
- We assume that you've seen confidence intervals and hypothesis tests before.

$$\hat \beta_0 = \bar Y - \hat \beta_1 \bar X$$

$$\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}$$
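As a quick check of these two formulas (not part of the original derivation; the seed, coefficients, and sample size are arbitrary choices for illustration), the sketch below simulates data from the model and compares the results with lm():

set.seed(1)
xSim <- runif(100)
ySim <- 2 + 3 * xSim + rnorm(100, sd = 0.5)
b1 <- cor(ySim, xSim) * sd(ySim) / sd(xSim)   # slope via Cor(Y, X) Sd(Y) / Sd(X)
b0 <- mean(ySim) - b1 * mean(xSim)            # intercept via Ybar - b1 * Xbar
c(b0, b1)
coef(lm(ySim ~ xSim))                         # should agree with the two lines above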
Review
Statistics like $\frac{\hat \theta - \theta}{\hat \sigma_{\hat \theta}}$ often have the following properties.

1. They are normally distributed and have a finite-sample Student's T distribution if the estimated variance is replaced with a sample estimate (under normality assumptions).
2. They can be used to test $H_0 : \theta = \theta_0$ versus $H_a : \theta >, <, \neq \theta_0$.
3. They can be used to create a confidence interval for $\theta$ via $\hat \theta \pm Q_{1-\alpha/2} \hat \sigma_{\hat \theta}$, where $Q_{1-\alpha/2}$ is the relevant quantile from either a normal or T distribution.

In the case of regression with iid sampling assumptions and normal errors, our inferences will follow very similarly to what you saw in your inference class.

We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the way in which the $X$ values are collected, the iid sampling model, and the mean model, the normal results still hold, so the same intervals and tests can be used.
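For example, the familiar one-sample t interval for a mean has exactly this "estimate $\pm$ quantile $\times$ standard error" form. A minimal sketch with arbitrary simulated data:

set.seed(2)
z <- rnorm(30, mean = 10, sd = 3)
# manual interval: estimate +/- t quantile * standard error
mean(z) + c(-1, 1) * qt(.975, df = length(z) - 1) * sd(z) / sqrt(length(z))
t.test(z)$conf.int   # matches the manual calculation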
Standard errors (conditioned on X)
$$Var(\hat \beta_1) = Var\left( \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2} \right)$$

$$= \frac{Var\left( \sum_{i=1}^n Y_i (X_i - \bar X) \right)}{\left( \sum_{i=1}^n (X_i - \bar X)^2 \right)^2}$$

$$= \frac{\sum_{i=1}^n \sigma^2 (X_i - \bar X)^2}{\left( \sum_{i=1}^n (X_i - \bar X)^2 \right)^2}$$

$$= \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar X)^2}$$
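A rough Monte Carlo check of this result (my own illustration with an arbitrary fixed design, not part of the derivation): simulate many outcomes at the same $X$ values, re-estimate the slope each time, and compare the empirical variance with $\sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$.

set.seed(3)
xFix <- runif(50)        # fixed design, conditioned on throughout
sigmaTrue <- 2
beta1Hats <- replicate(5000, {
    ySim <- 1 + 2 * xFix + rnorm(50, sd = sigmaTrue)
    cor(ySim, xFix) * sd(ySim) / sd(xFix)
})
var(beta1Hats)                              # empirical variance of the slope estimates
sigmaTrue^2 / sum((xFix - mean(xFix))^2)    # theoretical variance from the derivation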
Results
$$\sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) = \sigma^2 / \sum_{i=1}^n (X_i - \bar X)^2$$

$$\sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) = \left( \frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2} \right) \sigma^2$$

In practice, $\sigma$ is replaced by its estimate.

It's probably not surprising that under iid Gaussian errors

$$\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}}$$

follows a $t$ distribution with $n - 2$ degrees of freedom and a normal distribution for large $n$.

This can be used to create confidence intervals and perform hypothesis tests.
Example diamond data set
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x                        # residuals
sigma <- sqrt(sum(e^2) / (n - 2))                 # residual standard deviation estimate, n - 2 df
ssx <- sum((x - mean(x))^2)                       # sum of squares of the centered x's
seBeta0 <- (1 / n + mean(x)^2 / ssx)^.5 * sigma   # standard error of the intercept
seBeta1 <- sigma / sqrt(ssx)                      # standard error of the slope
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1          # t statistics for H0: beta = 0
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE) # two-sided p-values
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
Example continued
coefTable

            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

fit <- lm(y ~ x)
summary(fit)$coefficients

            Estimate Std. Error t value  Pr(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40
Getting a confidence interval
sumCoef <- summary(fit)$coefficients
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
[1] -294.5 -224.8
sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]
[1] 3556 3886
With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.
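The per-0.1-carat numbers are just the slope interval above divided by 10:

# rescale the per-carat slope interval to a per-0.1-carat interval
(sumCoef[2, 1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10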
Prediction of outcomes
- Consider predicting $Y$ at a value of $X$:
  - Predicting the price of a diamond given the carat.
  - Predicting the height of a child given the height of the parents.
- The obvious estimate for prediction at point $x_0$ is $\hat \beta_0 + \hat \beta_1 x_0$.
- A standard error is needed to create a prediction interval.
- There's a distinction between intervals for the regression line at point $x_0$ and the prediction of what a $y$ would be at point $x_0$.
- Standard error for the line at $x_0$: $\hat \sigma \sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$.
- Standard error for a predicted $y$ at $x_0$: $\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}$ (a numerical check of both follows below).
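As a sketch using the quantities already computed for the diamond data (beta0, beta1, sigma, ssx, n, fit), both standard errors can be evaluated at a single hypothetical point, here x0 = 0.2 carats, and checked against predict():

x0 <- 0.2                                   # hypothetical carat value for illustration
yhat0 <- beta0 + beta1 * x0                 # fitted value at x0
seLine <- sigma * sqrt(1 / n + (x0 - mean(x))^2 / ssx)       # se for the line at x0
sePred <- sigma * sqrt(1 + 1 / n + (x0 - mean(x))^2 / ssx)   # se for a new y at x0
yhat0 + c(-1, 1) * qt(.975, df = n - 2) * seLine   # interval for the regression line
yhat0 + c(-1, 1) * qt(.975, df = n - 2) * sePred   # interval for a new observation
predict(fit, data.frame(x = 0.2), interval = "confidence")
predict(fit, data.frame(x = 0.2), interval = "prediction")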
Plotting the prediction intervals
plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
xVals <- seq(min(x), max(x), by = .01)
yVals <- beta0 + beta1 * xVals
se1 <- sigma * sqrt(1 / n + (xVals - mean(x))^2/ssx)
se2 <- sigma * sqrt(1 + 1 / n + (xVals - mean(x))^2/ssx)
lines(xVals, yVals + 2 * se1)
lines(xVals, yVals - 2 * se1)
lines(xVals, yVals + 2 * se2)
lines(xVals, yVals - 2 * se2)
(Figure: the diamond scatterplot, Carat vs. Dollars, with the fitted line, the narrower confidence bands, and the wider prediction bands drawn by the code above.)
Discussion
- Both intervals have varying widths.
  - Least width at the mean of the Xs (see the check below).
- We are quite confident in the regression line, so that interval is very narrow.
  - If we knew $\beta_0$ and $\beta_1$, this interval would have zero width.
- The prediction interval must incorporate the variability in the data around the line.
  - Even if we knew $\beta_0$ and $\beta_1$, this interval would still have width.
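A quick check of the width claim, reusing xVals, se1, and se2 from the plotting code: both standard errors are smallest at the grid point closest to mean(x).

xVals[which.min(se1)]   # where the confidence band is narrowest
xVals[which.min(se2)]   # where the prediction band is narrowest
mean(x)                 # both should be close to the mean of the observed x's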
In R
newdata <- data.frame(x = xVals)
p1 <- predict(fit, newdata, interval = ("confidence"))
p2 <- predict(fit, newdata, interval = ("prediction"))
plot(x, y, frame = FALSE, xlab = "Carat", ylab = "Dollars", pch = 21, col = "black", bg = "lightblue", cex = 2)
abline(fit, lwd = 2)
lines(xVals, p1[,2]); lines(xVals, p1[,3])
lines(xVals, p2[,2]); lines(xVals, p2[,3])
(Figure: the same Carat vs. Dollars plot, with the confidence and prediction bands produced by predict().)