Full name: Phạm Thị Thảo Trang
Student ID: 31241027994
Course section code: 25C1BUS50321001
_____________________________________________________________________
17. The Excel file Cereal Data provides a variety of nutritional information about
67 cereals and their shelf location in a supermarket. Use regression analysis to
find the best model that explains the relationship between calories and the other
variables. Investigate the model assumptions and clearly explain your
conclusions. Keep in mind the principle of parsimony!
R Square: 0.726 → Sugars and Carbs together explain 72.6% of the variation in
Calories.
H₀: The regression coefficients of Sugars and Carbs are both 0
H₁: At least one of the two regression coefficients ≠ 0
ANOVA table: F = 84.702, p-value = 0.000
→ Reject H₀: Sugars and Carbs significantly explain variation in Calories.
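As a rough consistency check, the overall F statistic can be recomputed from R Square. The sketch below assumes n = 67 (every cereal in the file is used as one observation) and k = 2 predictors; the function name is only illustrative.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F statistic: (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# R Square = 0.726 with two predictors (Sugars, Carbs), as reported above.
# n = 67 is an assumption: it treats all 67 cereals as usable rows.
f_stat = f_from_r2(0.726, n=67, k=2)
print(round(f_stat, 2))
```

Under these assumptions the result is about 84.79, close to the reported F = 84.702; the small gap is what rounding R Square to three decimals would produce.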
Individual Coefficient testing:
For Sugars: t = 12.387, p-value = 0.000, CI [3.276; 4.535]
For Carbs: t = 9.325, p-value = 0.000, CI [2.639; 4.078]
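→ Both p-values are below 0.05 and neither confidence interval contains 0, so H₀ (coefficient = 0) is rejected for each variable: both are statistically significant.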
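The coefficient estimates themselves are not listed, but they can be approximated from the reported intervals and t-values. This is a minimal sketch, assuming each interval is symmetric around the estimate and each t-value is estimate / standard error; the helper name is hypothetical.

```python
def coef_and_se(ci_low: float, ci_high: float, t_stat: float):
    """Recover estimate, standard error and implied critical t from a symmetric CI."""
    coef = (ci_low + ci_high) / 2.0        # midpoint of the interval
    se = coef / t_stat                     # because t = coef / se when H0 is coef = 0
    t_crit = (ci_high - ci_low) / 2.0 / se # half-width = t_crit * se
    return coef, se, t_crit


# Intervals and t-values as reported above.
sugars = coef_and_se(3.276, 4.535, 12.387)
carbs = coef_and_se(2.639, 4.078, 9.325)

for name, (coef, se, t_crit) in {"Sugars": sugars, "Carbs": carbs}.items():
    print(f"{name}: coef ~ {coef:.3f}, se ~ {se:.3f}, implied t_crit ~ {t_crit:.2f}")
```

The implied critical value comes out near 2.0 for both variables, which is what a 95% interval with roughly 64 degrees of freedom would give, so the reported numbers are mutually consistent.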
18. The Excel file Salary Data provides information on current salary, beginning
salary, previous experience (in months) when hired, and total years of education
for a sample of 100 employees in a firm.
a. Develop a multiple regression model for predicting current salary as a function
of the other variables.
b. Find the best model for predicting current salary using the t-value criterion
R Square = 0.832 → Beginning Salary, Previous experience, and Education
together explain 83.2% of the variation in Current salary.
H₀: The regression coefficients of all independent variables = 0
H₁: At least one regression coefficient ≠ 0
ANOVA table: F = 130.521, p-value = 0.000
→ Reject H₀: The model significantly explains variation in Current salary.
Individual Coefficient testing:
- Beginning salary: t = 15.203, p = 0.000 → significant
- Experience: t = -1.404, p = 0.164 → not significant
- Education: t = 2.045, p = 0.044 → significant
→ Experience is not significant at the 5% level (p = 0.164 > 0.05), so it is dropped. The best model includes Beginning Salary and Education.
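One step of the t-value criterion can be sketched as follows: among the predictors whose p-value exceeds the significance level, drop the one with the smallest |t|. The 0.05 level and the function name are assumptions; in a full procedure the model would be refitted after each drop and the test repeated.

```python
ALPHA = 0.05  # assumed significance level


def variable_to_drop(results: dict, alpha: float = ALPHA):
    """results maps name -> (t, p). Return the insignificant predictor
    with the smallest |t|, or None if every predictor is significant."""
    candidates = {name: abs(t) for name, (t, p) in results.items() if p > alpha}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# t and p values as reported above.
salary_results = {
    "Beginning salary": (15.203, 0.000),
    "Experience": (-1.404, 0.164),
    "Education": (2.045, 0.044),
}

dropped = variable_to_drop(salary_results)
kept = [name for name in salary_results if name != dropped]
print("drop:", dropped, "| keep:", kept)
```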
21. The Excel file Major League Baseball provides data on the 2010 season.
a. Construct and examine the correlation matrix. Is multicollinearity a potential
problem?
b. Suggest an appropriate set of independent variables that predict the number
of wins by examining the correlation matrix.
c. Find the best multiple regression model for predicting the number of wins.
How good is your model? Does it use the same variables you thought were
appropriate in part (b)?
a. Construct and examine the correlation matrix. Is multicollinearity a
potential problem?
From the correlation matrix, we can observe:
- The dependent variable Won (number of wins) is highly correlated with several
independent variables such as
+ Runs (r = 0.785, p = 0.000 < 0.01)
+ Runs Batted In (r = 0.758, p = 0.000 < 0.01)
+ Earned Run Average (r = -0.682, p = 0.000 < 0.01)
+ Walks (r = 0.533, p = 0.002 < 0.01)
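- Several independent variables are also strongly correlated with each other:
+ Runs and Runs Batted In (r = 0.995)
+ Runs and Home Runs (r = 0.727)
+ Runs Batted In and Home Runs (r = 0.759)
+ Doubles and Hits (r = 0.434, only moderate)
→ Yes, multicollinearity is a potential problem, especially between Runs and Runs Batted In (r = 0.995).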
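The screening of predictor pairs can be sketched as below. The 0.7 cutoff is an assumed rule of thumb, not a value taken from the exercise, and the pairwise VIF formula 1 / (1 - r²) only describes a model that contains just those two predictors.

```python
THRESHOLD = 0.7  # assumed rule-of-thumb cutoff for a "high" correlation

# Correlations between independent variables, as listed above.
pairs = {
    ("Runs", "Runs Batted In"): 0.995,
    ("Runs", "Home Runs"): 0.727,
    ("Runs Batted In", "Home Runs"): 0.759,
    ("Doubles", "Hits"): 0.434,
}


def pairwise_vif(r: float) -> float:
    """Variance inflation factor when only the two correlated predictors are in the model."""
    return 1.0 / (1.0 - r ** 2)


flagged = {pair: r for pair, r in pairs.items() if abs(r) >= THRESHOLD}

for (a, b), r in flagged.items():
    print(f"{a} / {b}: r = {r:.3f}, pairwise VIF ~ {pairwise_vif(r):.1f}")
```

With this cutoff the three pairs involving Runs, Runs Batted In and Home Runs are flagged, while Doubles and Hits is not. The Runs / Runs Batted In pair gives a pairwise VIF of roughly 100.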
b. Suggest an appropriate set of independent variables that predict the
number of wins by examining the correlation matrix.
Based on the correlation matrix:
- The dependent variable Won is strongly and significantly correlated with:
+ Runs (r = 0.785, p = 0.000)
+ Runs Batted In (r = 0.758, p = 0.000)
+ Earned Run Average (ERA) (r = -0.682, p = 0.000)
+ Walks (r = 0.533, p = 0.002)
+ Home Runs (r = 0.438, p = 0.016)
- However, some of these independent variables are highly correlated with each other:
+ Runs and Runs Batted In: r = 0.995 → almost identical information.
+ Runs and Home Runs: r = 0.727
+ Runs Batted In and Home Runs: r = 0.759
→ Because Runs and Runs Batted In carry almost identical information, only one of them is needed as a predictor.
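One possible way to turn the two lists above into a candidate set is a greedy pass: take variables in order of |r| with Won and skip any that is highly correlated with one already chosen. This is only an illustrative sketch, not the method prescribed by the exercise. The 0.7 cutoff is assumed, and pairs not listed above are treated as uncorrelated, which is a simplification.

```python
MAX_PAIR_R = 0.7  # assumed cutoff for "too correlated to include together"

# Correlation of each candidate with Won, as listed above.
corr_with_won = {
    "Runs": 0.785,
    "Runs Batted In": 0.758,
    "Earned Run Average": -0.682,
    "Walks": 0.533,
    "Home Runs": 0.438,
}

# Correlations between candidates, as listed above.
between = {
    frozenset(("Runs", "Runs Batted In")): 0.995,
    frozenset(("Runs", "Home Runs")): 0.727,
    frozenset(("Runs Batted In", "Home Runs")): 0.759,
}


def pick_predictors(corr_with_target: dict, between: dict, max_pair_r: float) -> list:
    chosen = []
    ranked = sorted(corr_with_target, key=lambda v: abs(corr_with_target[v]), reverse=True)
    for name in ranked:
        # Pairs missing from `between` default to 0.0: an assumption, not data.
        if all(abs(between.get(frozenset((name, other)), 0.0)) < max_pair_r
               for other in chosen):
            chosen.append(name)
    return chosen


selected = pick_predictors(corr_with_won, between, MAX_PAIR_R)
print(selected)
```

Under these assumptions the pass keeps Runs, Earned Run Average and Walks, and skips Runs Batted In and Home Runs because of their overlap with Runs.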
c. Find the best multiple regression model for predicting the number of wins.
How good is your model? Does it use the same variables you thought were
appropriate in part (b)?
R Square = 0.941 → This means that 94.1% of the variation in the number of wins
(Won) can be explained by the predictors (Walks, Earned Run Average, Runs
Batted In, and Runs).
ANOVA table: F = 100.430, p-value = 0.000
→ Reject H₀: The regression model as a whole is significant, meaning that at least
one independent variable has a significant relationship with the dependent
variable.
Individual Coefficient testing:
- Runs: t = 1.898, p = 0.069 → not significant
- Runs Batted In: t = -0.355, p = 0.725 → not significant
- Earned Run Average: t = -10.875, p = 0.000 → significant
- Walks: t = -2.068, p = 0.049 → significant at the 5% level (p < 0.05)
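To judge how good the model is while respecting parsimony, adjusted R Square is a useful complement to R Square because it penalizes extra predictors. The sketch below assumes n = 30 observations (one per Major League team in the 2010 season) and k = 4 predictors; the function name is illustrative.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# R Square = 0.941 with four predictors, as reported above.
# n = 30 is an assumption about the number of rows in the data file.
adj = adjusted_r2(0.941, n=30, k=4)
print(round(adj, 3))
```

Under these assumptions adjusted R Square is about 0.932, only slightly below 0.941. Refitting without the insignificant predictors and comparing adjusted R Square would show whether the smaller model loses any explanatory power.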
24. The State of Ohio Department of Education has a mandated ninth-grade
proficiency test that covers writing, reading, mathematics, citizenship (social
studies), and science. The Excel file Ohio Education Performance provides data
on success rates (defined as the percent of students passing) in school districts in
the greater Cincinnati metropolitan area along with state averages.
a. Suggest the best regression model to predict math success as a function of
success in the other subjects by examining the correlation matrix; then run the
regression tool for this set of variables
b. Develop a multiple regression model to predict math success as a function of
success in all other subjects using the systematic approach described in this
chapter. Is multicollinearity a problem?
c. Compare the models in parts (a) and (b). Are they the same? Why or why not?
a. Suggest the best regression model to predict math success as a function of
success in the other subjects by examining the correlation matrix; then run
the regression tool for this set of variables
All four independent variables (Writing, Reading, Citizenship, and Science) show
strong positive correlations with the dependent variable (Math). Science has the
strongest correlation with Math (r = 0.895).
b. Develop a multiple regression model to predict math success as a function
of success in all other subjects using the systematic approach described in
this chapter. Is multicollinearity a problem?
R Square = 0.875 → This means that 87.5% of the variation in Math can be
explained by the predictor Science.
ANOVA table: F = 210.184, p-value = 0.000
→ Reject H₀: The regression model as a whole is significant, meaning that
Science success has a statistically significant relationship with Math success.
Individual Coefficient testing:
Science: t = 14.498, p = 0.000 → significant
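With a single predictor, the overall F test and the t-test on the slope are the same test, so F should equal t². A quick check using the two reported values:

```python
t_science = 14.498    # reported t for Science
f_reported = 210.184  # reported F from the ANOVA table

f_from_t = t_science ** 2
gap = abs(f_from_t - f_reported)
print(round(f_from_t, 3), round(gap, 3))
```

The squared t-value is about 210.19, which matches the reported F up to rounding.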