The document outlines a statistical computing assignment focused on linear regression analysis using the least squares method to estimate model parameters. It includes steps for validating model assumptions, performing bootstrap procedures for parameter estimation, and constructing confidence intervals for various parameters related to newborn lengths based on parental heights and smoking status. Key results include estimates of model parameters, their significance, and confidence intervals, indicating the effects of parental heights and smoking on newborn length.

Statistical Computing CW3

November 2024

(a) The goal of the least squares method is to estimate the model parameters by minimizing the Residual
Sum of Squares (RSS). For the regression model:

yi = α + βmi + γfi + ϵi ,

we aim to find the parameter estimates (α̂, β̂, γ̂) that minimize the RSS. In R, this can be achieved
using the lm() function, which computes the least squares estimates for the regression coefficients.
In this linear model, the “smoker” variable is not included. After running the code in R, we obtained
the following least squares estimates for (α, β, γ):

α̂ = 6.65996, β̂ = 0.17165, γ̂ = 0.03128.

The F-statistic is 4.22 with a p-value of 0.02192, indicating that the model as a whole is statistically significant. The p-value for β is 0.013, which is significant at the 0.05 level, while that for γ is not. Hence the mother's height (data$mheight) has a statistically significant effect on the response, whereas the father's height (data$fheight) does not show statistical significance and may not contribute meaningfully to the model.

data <- read.table("birth_length(1).txt", header = TRUE, sep = "\t")


data
birthmodel<-lm(data$length ~ data$mheight + data$fheight, data = data)
summary(birthmodel)
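As a cross-check on lm(), the least squares estimates can be computed directly from the normal equations, β̂ = (XᵀX)⁻¹Xᵀy. The sketch below uses simulated heights rather than the coursework data file, so all variable names and numbers are illustrative only:

```r
# Sketch: verify lm() against the closed-form normal-equations solution
set.seed(1)
n <- 42
mheight <- rnorm(n, 64, 2)                  # simulated mothers' heights
fheight <- rnorm(n, 70, 2)                  # simulated fathers' heights
y <- 7 + 0.17 * mheight + 0.03 * fheight + rnorm(n, 0, 1)

fit <- lm(y ~ mheight + fheight)

X <- cbind(1, mheight, fheight)             # design matrix with intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^(-1) X'y

# the two solutions agree to numerical precision
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))
```

lm() uses a QR decomposition internally, which is numerically more stable than inverting XᵀX, but both minimize the same RSS.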

(b) The residuals from the least squares fit of the regression model are defined as:

ϵ̂i = yi − (α̂ + β̂mi + γ̂fi ),

where yi is the observed length of the new-born baby, and α̂ + β̂mi + γ̂fi is the fitted value based on the predictor variables mi (mother's height) and fi (father's height) in the linear model.
A key assumption of the linear regression model is the normality of the residuals. In Figure 1, the residuals appear approximately normal, and the Q-Q plot in Figure 2 confirms this: the points lie close to the reference line, consistent with normally distributed residuals.
This validation of the normality assumption ensures that the model's statistical inferences, such as the confidence intervals analyzed later, are reliable.

n<-length(data$length)
coefficient<-coef(birthmodel)
residual<-residuals(birthmodel)
birthlength_fitted<-fitted(birthmodel)

Figure 1: Histogram of Residuals

Figure 2: Q-Q Plot of the Distribution of Residuals against the Theoretical Normal Distribution
residual_vector<-numeric(n)
for (i in 1:n){
residual_vector[i]<-data$length[i]-birthlength_fitted[i]
}
residual_vector
mean(residual_vector)
hist(residual_vector)
qqnorm(residual_vector, main = "Q-Q Plot of Residuals")
qqline(residual_vector, col = "blue", lwd = 2)
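The graphical checks can be supplemented by a formal test; shapiro.test() in base R tests the null hypothesis that a sample is normally distributed. A minimal sketch on simulated residuals (for the real analysis one would pass residual_vector instead):

```r
# Sketch: Shapiro-Wilk test of normality on a residual vector
set.seed(2)
resid_sim <- rnorm(42)    # stand-in for the 42 model residuals
sw <- shapiro.test(resid_sim)
sw$p.value                # a large p-value gives no evidence against normality
```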

(c) In this question, the bootstrap procedure for this linear model is outlined as follows:
1. Fit the linear model to the birth-length data and obtain the least squares estimates (α̂, β̂, γ̂) of the parameters in the model:

yi = α + βmi + γfi + ϵi ,

where yi is the response variable, mi and fi are predictors, and ϵi is the random error term.
2. Obtain the residuals (ϵ̂1 , . . . , ϵ̂42 ) as computed previously in part (b).
3. Generate bootstrap samples by evaluating

y∗j = α̂ + β̂mj + γ̂fj + ϵ∗j ,

where each ϵ∗j is drawn uniformly at random, with replacement, from the residuals (ϵ̂1 , . . . , ϵ̂42 ).
4. Refit the linear model to each of the B bootstrap samples to obtain B bootstrap realizations γ̂∗b , b = 1, . . . , B.
5. Summarize the results by calculating the bias and standard error of γ̂, as well as the proportion Pr(γ̂∗ > 0).
The bias of γ̂ is estimated by

Bias(γ̂) = (1/B) Σb=1..B (γ̂∗b − γ̂).

The standard error of γ̂ is estimated by

SE(γ̂) = √[ (1/(B − 1)) Σb=1..B (γ̂∗b − γ̄∗)² ],

where

γ̄∗ = (1/B) Σb=1..B γ̂∗b .

The proportion Pr(γ̂∗ > 0) is estimated as:

Pr(γ̂∗ > 0) = #(γ̂∗b > 0) / B.

Figure 3 shows the histogram of the γ̂∗b values obtained from the bootstrap samples. After running the code, we calculated the following results:

Bias(γ̂) = 0.002054614, SE(γ̂) = 0.05668251, Pr(γ̂∗ > 0) = 0.715.

The estimated bias is small relative to the standard error, so there is little evidence that γ̂ is substantially biased. A histogram of the 1000 bootstrap estimates of γ̂ is given in Figure 3.

Figure 3: Histogram of Bootstrap Estimates γ̂∗b , b = 1, . . . , 1000

bootRes <- function(B) {
  birthmodel <- lm(data$length ~ data$mheight + data$fheight, data = data)
  residual <- residuals(birthmodel)
  birthlength_fitted <- fitted(birthmodel)
  gamma_resid <- numeric(B)   # one realization of gamma* per bootstrap sample
  for (i in 1:B) {
    residual_star <- sample(residual, n, replace = TRUE)
    birthlength_star <- birthlength_fitted + residual_star
    fit_star <- lm(birthlength_star ~ data$mheight + data$fheight)
    gamma_resid[i] <- coef(fit_star)[3]
  }
  gamma_resid
}
B <- 1000
gamma_resid_result <- bootRes(B)
hist(gamma_resid_result)

Bias_gamma <- mean(gamma_resid_result) - coefficient[3]
Bias_gamma
se_gamma <- sd(gamma_resid_result)   # sd() already divides by B - 1
se_gamma

gammal0 <- ifelse(gamma_resid_result > 0, 1, 0)
gammal0
Prgammal0 <- mean(gammal0)
Prgammal0
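The three summary formulas map directly onto one-line computations. A self-contained sketch, where gamma_hat and gamma_star are illustrative stand-ins for the fitted coefficient and the bootstrap realizations:

```r
# Sketch: bias, standard error and Pr(gamma* > 0) from bootstrap draws
set.seed(3)
gamma_hat  <- 0.03128                     # stand-in for the original estimate
gamma_star <- rnorm(1000, 0.033, 0.057)   # stand-in bootstrap realizations

bias_hat <- mean(gamma_star) - gamma_hat  # (1/B) * sum(gamma*_b - gamma_hat)
se_hat   <- sd(gamma_star)                # sqrt(sum((gamma*_b - mean)^2) / (B - 1))
pr_pos   <- mean(gamma_star > 0)          # #(gamma*_b > 0) / B
c(bias_hat, se_hat, pr_pos)
```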

4
(d) To produce a 95% confidence interval for the parameter β using the bootstrap-t confidence interval methodology, we proceed as follows:
1. Estimate the parameter and its standard error: Using the lm() function, we compute the least squares estimate β̂ for β, obtaining

β̂ = 0.17165,

along with the estimate of its standard error,

ŝe = 0.06593.

2. Generate bootstrap samples: For i = 1, . . . , B, generate a residual-bootstrap sample as in part (c), refit the linear model to obtain the bootstrap realization β̂∗i and its corresponding standard error ŝe∗i , and compute the Z-score for each bootstrap realization as:

Z∗i = (β̂∗i − β̂) / ŝe∗i .

3. Calculate quantiles of Z∗ : Determine the empirical 2.5% and 97.5% quantiles of the Z∗ values, denoted as:

t̂0.025 = −1.876585 and t̂0.975 = 2.071633.

4. Compute the 95% bootstrap-t confidence interval: The confidence interval is computed as:

( β̂ − t̂0.975 · ŝe , β̂ − t̂0.025 · ŝe ).

Substituting the values:

(0.17165 − 2.071633 · 0.06593, 0.17165 + 1.876585 · 0.06593) = (0.03507, 0.29536).

Interpretation: The resulting 95% bootstrap-t confidence interval for β is (0.03507, 0.29536). This interval provides a range of plausible values for β based on the observed data and accounts for uncertainty in the estimate. Since 0 does not fall in the confidence interval, β = 0 is not plausible at the 95% confidence level.

CIbeta <- function(B) {
  birthmodel <- lm(data$length ~ data$mheight + data$fheight, data = data)
  residual <- residuals(birthmodel)
  birthlength_fitted <- fitted(birthmodel)
  z_star <- numeric(B)
  beta_star <- numeric(B)
  se_star <- numeric(B)
  for (i in 1:B) {
    residual_star <- sample(residual, n, replace = TRUE)
    birthlength_star <- birthlength_fitted + residual_star
    fit_star <- lm(birthlength_star ~ mheight + fheight, data = data)
    beta_star[i] <- coef(fit_star)[2]   # coefficient of mheight
    # bootstrap z-score for mheight
    se_star[i] <- summary(fit_star)$coefficients["mheight", "Std. Error"]
    z_star[i] <- (beta_star[i] - coef(birthmodel)[2]) / se_star[i]
  }
  z_star
}

B <- 1000
CIbeta_result <- CIbeta(B)
CIbeta_result
CIbeta_quantile <- quantile(CIbeta_result, probs = c(0.025, 0.975))
CIbeta_quantile
se_bar <- summary(birthmodel)$coefficients["data$mheight", "Std. Error"]
CIt <- coef(birthmodel)[2] - c(CIbeta_quantile[2], CIbeta_quantile[1]) * se_bar
CIt
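Note the "quantile flip" in the bootstrap-t interval: the upper quantile of Z∗ builds the lower limit and vice versa. A sketch assembling the interval from the point estimate, standard error and quantiles quoted above:

```r
# Sketch: assembling a bootstrap-t interval from its ingredients
beta_hat <- 0.17165
se_hat   <- 0.06593
t_lo     <- -1.876585   # empirical 2.5% quantile of the z* values
t_hi     <-  2.071633   # empirical 97.5% quantile of the z* values

# lower limit uses t_hi, upper limit uses t_lo
ci <- c(beta_hat - t_hi * se_hat, beta_hat - t_lo * se_hat)
round(ci, 5)
```

Because the empirical quantiles need not be symmetric about zero, the resulting interval is in general asymmetric about β̂, unlike the textbook normal-theory interval.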

(e) In this question, the parameter of interest is E(M ) − E(F ), which we denote by θ. We estimate θ and construct a 95% confidence interval using a bootstrap-t procedure. The steps are as follows:
1. Estimate the parameter and its standard error: By computing the difference between the mean of the mothers' heights (M ) and the mean of the fathers' heights (F ), we obtain the point estimate θ̂ for θ:

θ̂ = −6.357143.

The standard error of θ̂ is estimated from the paired differences as:

ŝe = √( var(mheight − fheight) / 42 ) = 0.5029777.

2. Generate bootstrap samples: For i = 1, . . . , B, repeat the following steps:
(i) Treat the mothers' and fathers' heights as paired observations (M, F ) and randomly sample pairs with replacement to create a bootstrap sample.
(ii) Recompute the difference in means for the bootstrap sample to obtain θ̂∗i , along with its corresponding standard error ŝe∗i .
(iii) Calculate the Z-score for each bootstrap realization as:

Z∗i = (θ̂∗i − θ̂) / ŝe∗i .

3. Determine the quantiles of Z∗ : Compute the empirical 2.5% and 97.5% quantiles of the Z∗ values, denoted as:

t̂0.025 = −1.483659 and t̂0.975 = 2.131360.

4. Compute the 95% bootstrap-t confidence interval: The confidence interval is given by:

( θ̂ − t̂0.975 · ŝe , θ̂ − t̂0.025 · ŝe ).

Substituting the values:

(−6.357143 − 2.131360 · 0.5029777, −6.357143 + 1.483659 · 0.5029777) = (−7.429169, −5.610895).

Interpretation: The resulting 95% bootstrap-t confidence interval for θ is (−7.429169, −5.610895). This interval provides a range of plausible values for the difference between the mean heights of mothers and fathers. Since 0 does not fall in the confidence interval, we conclude that E(M ) − E(F ) = 0, i.e. E(M ) = E(F ), is not plausible at the 95% confidence level: on average, fathers are taller than mothers in these data.

n <- length(data$length)
Boopair <- function(B) {
  diff_MF <- numeric(B)
  z_star_d <- numeric(B)
  se_star_d <- numeric(B)
  for (i in 1:B) {
    ind <- sample(n, n, replace = TRUE)   # resample index pairs
    x1_star <- data$mheight[ind]
    x2_star <- data$fheight[ind]
    diff_MF[i] <- mean(x1_star) - mean(x2_star)
    se_star_d[i] <- sqrt(var(x1_star - x2_star) / n)
    z_star_d[i] <- (diff_MF[i] - (mean(data$mheight) - mean(data$fheight))) / se_star_d[i]
  }
  z_star_d
}
Wholecase_result <- Boopair(100)
wholecase_quantile <- quantile(Wholecase_result, probs = c(0.025, 0.975))
wholecase_quantile
wholecase_se_bar <- sqrt(var(data$mheight - data$fheight) / n)
wholecase_se_bar
CI_wholecase <- (mean(data$mheight) - mean(data$fheight)) -
  c(wholecase_quantile[2], wholecase_quantile[1]) * wholecase_se_bar
CI_wholecase
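The essential design choice in part (e) is resampling index pairs, so each bootstrap sample keeps mothers and fathers from the same family together and preserves their correlation. A self-contained sketch with simulated paired heights (names and numbers illustrative):

```r
# Sketch: paired bootstrap of mean(M) - mean(F), resampling couples jointly
set.seed(4)
n <- 42
M  <- rnorm(n, 64, 2)        # simulated mothers' heights
Fh <- M + rnorm(n, 6.4, 1)   # fathers' heights, correlated with mothers'
B <- 1000
theta_star <- numeric(B)
for (b in 1:B) {
  ind <- sample(n, n, replace = TRUE)   # one index vector for both columns
  theta_star[b] <- mean(M[ind]) - mean(Fh[ind])
}
sd(theta_star)   # bootstrap standard error of the mean difference
```

Resampling the two columns independently would break the pairing and overstate the standard error whenever M and F are positively correlated.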

(f) In this analysis, the parameter of interest, θ, is defined as the difference in expected baby length between non-smoking mothers and smoking mothers:

θ = E(Y |S = 0) − E(Y |S = 1).

We conduct the following procedure:
1. For i = 1, . . . , B, repeat the following steps: treat the baby lengths and smoker indicators as paired observations and resample their indices with replacement; for each resample, compute the bootstrap estimate

θ̂∗i = Ê(Y ∗ |S ∗ = 0) − Ê(Y ∗ |S ∗ = 1),

where (Y ∗ , S ∗ ) is the resampled data and Ê denotes the sample mean within each smoker group.


2. Using the bootstrap estimates, compute Pr(θ̂∗ > 0) as:

Pr(θ̂∗ > 0) = #(θ̂∗i > 0) / B.

From the results, we find that Pr(θ̂∗ > 0) = 0.925, indicating that the estimated mean length is greater for babies of non-smoking mothers in the large majority of bootstrap resamples.
In Figure 4, the bootstrap estimates form an approximately normal distribution centred at 0.5198, which supports the conclusion that babies of non-smoking mothers tend to be longer, although with 7.5% of resamples below zero the evidence falls just short of the conventional 95% level.

Figure 4: Histogram of Bootstrap Estimates θ̂∗i = (Ê(Y |S = 0) − Ê(Y |S = 1))∗i , i = 1, . . . , 1000

BooYS <- function(B) {
  n <- length(data$length)
  theta_star <- numeric(B)
  for (i in 1:B) {
    ind <- sample(n, n, replace = TRUE)
    Y_star <- data$length[ind]
    S_star <- data$smoker[ind]
    Y1_star <- Y_star[S_star == 1]
    Y0_star <- Y_star[S_star == 0]
    theta_star[i] <- mean(Y0_star) - mean(Y1_star)
  }
  return(theta_star)
}

B <- 1000
BooYS_result <- BooYS(B)
BooYSl0 <- ifelse(BooYS_result > 0, 1, 0)
BooYSl0
Pr_thetal0 <- mean(BooYSl0)
Pr_thetal0
hist(BooYS_result)
