Statistical_Computing
November 2024
(a) The goal of the least squares method is to estimate the model parameters by minimizing the Residual
Sum of Squares (RSS). For the regression model:
yi = α + βmi + γfi + ϵi ,
we aim to find the parameter estimates (α̂, β̂, γ̂) that minimize the RSS. In R, this can be achieved
using the lm() function, which computes the least squares estimates for the regression coefficients.
In this linear model, the “smoker” variable is not included. After running the code in R, we obtained
the following least squares estimates for (α, β, γ):
The F-statistic is 4.22 with a p-value of 0.02192, so the model as a whole is statistically significant at the 0.05 level. The p-value for β is 0.013, which is significant at the 0.05 level, whereas the p-value for γ is not. Hence data$mheight has a statistically significant effect on the dependent variable, while the predictor data$fheight does not show statistical significance and may not contribute meaningfully to the model.
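To illustrate what lm() computes, the following is a minimal sketch on simulated data. The data frame sim, its column values, and the "true" coefficients used to generate it are hypothetical stand-ins, since the report's data set is not reproduced here. The sketch compares the lm() coefficients with the closed-form least squares solution and checks that moving a coefficient away from the estimate does not reduce the RSS.

```r
# Hypothetical simulated data (assumption: not the report's data set)
set.seed(1)
n <- 42
sim <- data.frame(
  mheight = rnorm(n, mean = 164, sd = 6),
  fheight = rnorm(n, mean = 178, sd = 7)
)
sim$length <- 20 + 0.17 * sim$mheight + 0.02 * sim$fheight + rnorm(n, sd = 1)

# Least squares fit of length on mother's and father's height
fit <- lm(length ~ mheight + fheight, data = sim)

# Closed-form solution of the normal equations, (X'X)^{-1} X'y
X <- model.matrix(fit)
beta_closed <- solve(crossprod(X), crossprod(X, sim$length))

# Residual sum of squares as a function of the coefficient vector
rss <- function(b) sum((sim$length - X %*% b)^2)

rss_at_estimate <- rss(coef(fit))
rss_perturbed <- rss(coef(fit) + c(0, 0.01, 0))

coef(fit)
summary(fit)$fstatistic
```

Calling summary(fit) on such a fit prints the coefficient table, the per-coefficient p-values, and the overall F-statistic discussed above.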
(b) The residuals from the least squares fit for the regression model are defined as:
ϵ̂i = yi − (α̂ + β̂mi + γ̂fi ),
where yi is the observed length of the new-born baby, and α̂ + β̂mi + γ̂fi is the fitted value of the baby’s length given the predictor variables mi (mother’s height) and fi (father’s height) in the linear model.
A key assumption of the linear regression model is the normality of the errors, which we assess through the residuals. In Figure 1, the histogram of the residuals appears approximately normal, and the Q-Q plot in Figure 2 supports this.
Checking the normality assumption in this way supports the reliability of the model’s statistical inferences, such as the confidence intervals we analyze later.
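# Model from part (a): baby's length regressed on mother's and father's height
birthmodel <- lm(data$length ~ data$mheight + data$fheight, data = data)
n <- length(data$length)
coefficient <- coef(birthmodel)              # (alpha_hat, beta_hat, gamma_hat)
residual <- residuals(birthmodel)            # least squares residuals
birthlength_fitted <- fitted(birthmodel)     # fitted baby lengths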
[Figure 1: Histogram of residual_vector (x-axis: residual_vector, y-axis: Frequency)]
Figure 2: Q-Q Plot of the Distribution of Residuals with regards to the Theoretical Normal Distribution
# Residuals computed by hand as observed minus fitted values
residual_vector <- as.numeric(data$length - birthlength_fitted)
residual_vector
mean(residual_vector)
# Histogram and normal Q-Q plot of the residuals
hist(residual_vector)
qqnorm(residual_vector, main = "Q-Q Plot of Residuals")
qqline(residual_vector, col = "blue", lwd = 2)
(c) In this question, the Bootstrap procedure for this linear model is outlined as follows:
1. Fit the linear model to the birth-length data. Specifically, estimate the parameter γ in the model:
yi = α + βmi + γfi + ϵi ,
where yi is the response variable, mi and fi are predictors, and ϵi is the random error term.
2. Obtain the residuals (ϵ̂1 , . . . , ϵ̂42 ) as computed previously in part (b).
3. Generate bootstrap samples by evaluating
$$y_j^* = \hat\alpha + \hat\beta m_j + \hat\gamma f_j + \epsilon_j^*, \qquad j = 1, \dots, 42,$$
where each ϵ∗j is drawn uniformly at random, with replacement, from the residuals (ϵ̂1 , . . . , ϵ̂42 ).
4. Refit the linear model to each of the B bootstrap samples to obtain B bootstrap realizations γ̂b∗ , b = 1, . . . , B.
5. Summarize the results by calculating the bias and standard error of γ̂, as well as the proportion Pr(γ̂∗ > 0).
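The bias of γ̂ is estimated by
$$\mathrm{Bias}(\hat\gamma) = \frac{1}{B}\sum_{b=1}^{B}\left(\hat\gamma_b^* - \hat\gamma\right) = \bar\gamma^* - \hat\gamma,$$
and its standard error by the standard deviation of the bootstrap estimates,
$$\widehat{se}(\hat\gamma) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\gamma_b^* - \bar\gamma^*\right)^2},$$
where
$$\bar\gamma^* = \frac{1}{B}\sum_{b=1}^{B}\hat\gamma_b^*.$$
The proportion of positive bootstrap estimates is
$$\Pr(\hat\gamma^* > 0) = \frac{\#(\hat\gamma_b^* > 0)}{B}.$$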
Figure 3 shows the histogram of the 1000 bootstrap estimates γ̂b∗ . After running the code, we calculated the following results:
These results imply that the estimate γ̂ is biased.
[Figure 3: Histogram of gamma_resid_result (x-axis: gamma_resid_result, y-axis: Frequency)]
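# Bootstrap bias: mean of the bootstrap estimates minus the original estimate of gamma
Bias_gamma <- mean(gamma_resid_result) - coefficient[3]
Bias_gamma
# Bootstrap standard error: standard deviation of the bootstrap estimates
se_gamma <- sd(gamma_resid_result)
se_gamma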
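The residual bootstrap outlined in steps 1 to 5 can be sketched end to end as follows. This is an illustrative sketch only: the data frame sim is simulated (hypothetical, not the report's data), and the names gamma_star, bias_gamma, se_gamma and pr_positive are chosen here for the example.

```r
# Hypothetical simulated data (assumption: stands in for the birth-length data)
set.seed(2024)
n <- 42
sim <- data.frame(
  mheight = rnorm(n, mean = 164, sd = 6),
  fheight = rnorm(n, mean = 178, sd = 7)
)
sim$length <- 20 + 0.17 * sim$mheight + 0.02 * sim$fheight + rnorm(n, sd = 1)

# Steps 1-2: fit the model, keep the fitted values and residuals
fit <- lm(length ~ mheight + fheight, data = sim)
gamma_hat <- coef(fit)[["fheight"]]
fitted_vals <- fitted(fit)
res <- residuals(fit)

# Steps 3-4: resample residuals with replacement, rebuild y*, refit
B <- 1000
gamma_star <- numeric(B)
for (b in seq_len(B)) {
  y_star <- fitted_vals + sample(res, n, replace = TRUE)
  fit_star <- lm(y_star ~ mheight + fheight, data = sim)
  gamma_star[b] <- coef(fit_star)[["fheight"]]
}

# Step 5: bias, standard error, and proportion of positive estimates
bias_gamma <- mean(gamma_star) - gamma_hat
se_gamma <- sd(gamma_star)
pr_positive <- mean(gamma_star > 0)

c(bias = bias_gamma, se = se_gamma, pr_positive = pr_positive)
```

Comparing the magnitude of bias_gamma with se_gamma is one way to judge whether an estimated bias is large relative to the sampling variability of the estimate.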
(d) To produce a 95% confidence interval for the parameter β using the bootstrap-t confidence interval
methodology, we proceed as follows:
1. Estimate the parameter and standard error: Using the lm() function, we compute the least squares estimate β̂ for β, obtaining
β̂ = 0.17165,
along with the estimate of its standard error,
$$\widehat{se} = 0.06593.$$
2. Compute the bootstrap statistics: For each of the B bootstrap samples, refit the model to obtain the estimate and its standard error, and compute
$$Z_i^* = \frac{\hat\beta_i^* - \hat\beta}{\widehat{se}_i^*}.$$
3. Calculate quantiles of Z ∗ : Determine the empirical 2.5% and 97.5% quantiles of the Z ∗ values, denoted as:
t̂0.025 = −1.876585 and t̂0.975 = 2.071633.
4. Compute the 95% bootstrap-t confidence interval: The confidence interval is computed as:
$$\left(\hat\beta - \hat t_{0.975}\cdot\widehat{se},\ \ \hat\beta - \hat t_{0.025}\cdot\widehat{se}\right).$$
Interpretation: The resulting 95% bootstrap-t confidence interval for β is (0.03507, 0.29536). This interval provides a range of plausible values for β based on the observed data and accounts for uncertainty in the estimate. Since 0 does not fall in the confidence interval, we conclude that β = 0 is not plausible at the 95% confidence level.
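As a check on the arithmetic, the endpoints can be recomputed from the rounded values of β̂, the standard error, and the two quantiles quoted in steps 1 and 3. Because the inputs are rounded, a difference in the last digit relative to the reported interval is to be expected.

```latex
\begin{align*}
\hat\beta - \hat t_{0.975}\cdot\widehat{se}
  &= 0.17165 - 2.071633 \times 0.06593
   \approx 0.17165 - 0.13658 = 0.03507,\\
\hat\beta - \hat t_{0.025}\cdot\widehat{se}
  &= 0.17165 + 1.876585 \times 0.06593
   \approx 0.17165 + 0.12372 = 0.29537.
\end{align*}
```

Note that the upper quantile determines the lower endpoint and the lower (negative) quantile determines the upper endpoint, which is why the interval is not symmetric about β̂.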
# Bootstrap-t statistics for beta (coefficient of mheight) using residual resampling
CIbeta <- function(B) {
  birthmodel <- lm(length ~ mheight + fheight, data = data)
  residual <- residuals(birthmodel)
  birthlength_fitted <- fitted(birthmodel)
  n <- length(residual)
  z_star <- numeric(B)
  beta_star <- numeric(B)
  se_star <- numeric(B)
  for (i in 1:B) {
    # Resample residuals with replacement and rebuild the response
    residual_star <- sample(residual, n, replace = TRUE)
    birthlength_star <- birthlength_fitted + residual_star
    fit_star <- lm(birthlength_star ~ mheight + fheight, data = data)
    beta_star[i] <- coef(fit_star)[2]  # coefficient of mheight
    se_star[i] <- summary(fit_star)$coefficients["mheight", "Std. Error"]
    # Studentized bootstrap statistic for mheight
    z_star[i] <- (beta_star[i] - coef(birthmodel)[2]) / se_star[i]
  }
  z_star
}
B <- 1000
CIbeta_result <- CIbeta(B)
CIbeta_result
CIbeta_quantile <- quantile(CIbeta_result, probs = c(0.025, 0.975))
CIbeta_quantile
# Standard error of beta_hat from the original fit (second coefficient)
se_bar <- summary(birthmodel)$coefficients[2, "Std. Error"]
# The upper quantile gives the lower limit and the lower quantile the upper limit
CIt <- coef(birthmodel)[2] - c(CIbeta_quantile[2], CIbeta_quantile[1]) * se_bar
CIt
(e) In this question, the parameter of interest is E(M ) − E(F ), which we denote as θ. We estimate θ and
construct a 95% confidence interval using a bootstrap-t procedure. The steps are as follows:
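1. Estimate the parameter and standard error: By computing the difference between the mean of the mother’s height (M ) and the mean of the father’s height (F ), we obtain the point estimate θ̂ for θ:
θ̂ = −6.357143.
Its standard error is estimated from the paired differences as $\widehat{se} = \sqrt{\widehat{\mathrm{Var}}(M - F)/n}$.
2. Compute the bootstrap statistics: Resample the (M, F ) pairs with replacement and, for each of the B bootstrap samples, compute
$$Z_i^* = \frac{\hat\theta_i^* - \hat\theta}{\widehat{se}_i^*}.$$
3. Determine the quantiles of Z ∗ : Compute the empirical 2.5% and 97.5% quantiles of the Z ∗ values, denoted as:
t̂0.025 = −1.483659 and t̂0.975 = 2.131360.
4. Compute the 95% bootstrap-t confidence interval: The confidence interval is given by:
$$\left(\hat\theta - \hat t_{0.975}\cdot\widehat{se},\ \ \hat\theta - \hat t_{0.025}\cdot\widehat{se}\right),$$
from which we compute:
(−7.429169, −5.610895).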
Interpretation: The resulting 95% bootstrap-t confidence interval for θ is (−7.429169, −5.610895), the interval obtained from the quantiles above. This interval provides a range of plausible values for the difference between the mean heights of mothers and fathers. The relatively narrow interval suggests that the estimate θ̂ is reasonably precise. Since 0 does not fall in the confidence interval, we conclude that E(M ) − E(F ) = 0, that is E(M ) = E(F ), is not plausible at the 95% confidence level.
n <- length(data$length)
# Bootstrap-t statistics for E(M) - E(F), resampling (mother, father) pairs
Boopair <- function(B) {
  theta_hat <- mean(data$mheight) - mean(data$fheight)
  diff_MF <- numeric(B)
  z_star_d <- numeric(B)
  se_star_d <- numeric(B)
  for (i in 1:B) {
    ind <- sample(n, n, replace = TRUE)
    x1_star <- data$mheight[ind]
    x2_star <- data$fheight[ind]
    diff_MF[i] <- mean(x1_star) - mean(x2_star)
    # Standard error of the mean paired difference in the bootstrap sample
    se_star_d[i] <- sqrt(var(x1_star - x2_star) / n)
    z_star_d[i] <- (diff_MF[i] - theta_hat) / se_star_d[i]
  }
  z_star_d
}
Wholecase_result <- Boopair(100)
wholecase_quantile <- quantile(Wholecase_result, probs = c(0.025, 0.975))
wholecase_quantile
# Standard error of theta_hat from the original paired differences
wholecase_se_bar <- sqrt(var(data$mheight - data$fheight) / n)
wholecase_se_bar
CI_wholecase <- (mean(data$mheight) - mean(data$fheight)) - c(wholecase_quantile[2], wholecase_quantile[1]) * wholecase_se_bar
CI_wholecase
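(f) In this analysis, the parameter of interest, θ, is defined as the difference in expected baby length between non-smoking mothers and smoking mothers:
$$\theta = E(Y \mid S = 0) - E(Y \mid S = 1),$$
where Y is the baby’s length and S = 1 indicates a smoking mother. We resample the observations with replacement B times, compute the estimate for each bootstrap sample, and estimate the probability that it is positive by
$$\Pr(\hat\theta^* > 0) = \frac{\#(\hat\theta_i^* > 0)}{B}.$$
From the results, we find that Pr(θ̂∗ > 0) = 0.925, indicating that the baby’s length is likely to be greater for non-smoking mothers than for smoking mothers.
In Figure 4, the bootstrap estimates form an approximately normal distribution centered at 0.5198, which is consistent with babies of non-smoking mothers being longer on average. However, 7.5% of the bootstrap estimates are not positive, so this evidence falls short of the conventional 5% level.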
# Case-resampling bootstrap for theta = E(Y | S = 0) - E(Y | S = 1)
BooYS <- function(B) {
  n <- length(data$length)
  theta_star <- numeric(B)
  for (i in 1:B) {
    # Draw n observations with replacement
    ind <- sample(n, n, replace = TRUE)
    Y_star <- data$length[ind]
    S_star <- data$smoker[ind]
    Y1_star <- Y_star[S_star == 1]  # babies of smoking mothers
    Y0_star <- Y_star[S_star == 0]  # babies of non-smoking mothers
    theta_star[i] <- mean(Y0_star) - mean(Y1_star)
  }
  return(theta_star)
}
B <- 1000
BooYS_result <- BooYS(B)
# Indicator of a positive bootstrap estimate
BooYSl0 <- ifelse(BooYS_result[1:B] > 0, 1, 0)
BooYSl0
# Proportion of bootstrap estimates greater than zero
Pr_thetal0 <- mean(BooYSl0)
Pr_thetal0
hist(BooYS_result)
Figure 4: Histogram of Bootstrap Estimates (Ê(Y |S = 0) − Ê(Y |S = 1))∗i (x-axis: BooYS_result, y-axis: Frequency)