Workbook.regression.solutions
1. The table gives weight in pounds and length in inches for 3-month-old baby girls. Graph the points from the table in a scatterplot and describe the trend.

Weight (pounds)   Length (inches)
9.7               21.6
10.2              22.1
12.4              23.6
13.6              25.1
9.8               22.4
11.2              23.9
14.1              25.8
Solution:
[Scatterplot: 3-month-old baby girls — Length (inches) vs. Weight (pounds)]
The points rise from left to right and are fairly linear. We can say that there
is a strong positive linear correlation between the points. There do not
appear to be any outliers in the data.
2. The following values have been computed for a data set of 14 points. Calculate the line of best fit.

∑x = 86
∑y = 89.7
∑xy = 680.46
∑x² = 654.56
Solution:

We’re told that there are 14 items in the data set, so n = 14.

To find the line of best fit, we need its slope and y-intercept. The slope is given by

b = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

b = (14(680.46) − (86)(89.7)) / (14(654.56) − (86)²)

b = (9,526.44 − 7,714.2) / (9,163.84 − 7,396)

b = 1,812.24 / 1,767.84

b ≈ 1.0251

The y-intercept is

a = (∑y − b∑x) / n

a = (89.7 − 1.0251(86)) / 14

a = (89.7 − 88.1599) / 14

a = 1.5401 / 14

a ≈ 0.1100

So the line of best fit is

ŷ = bx + a

ŷ = 1.0251x + 0.1100
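As a quick check, here’s a minimal Python sketch (the variable names are just illustrative) that reproduces the slope and intercept from the given sums.

```python
# Line of best fit from the summary values given in problem 2.
sum_x, sum_y = 86, 89.7
sum_xy, sum_x2 = 680.46, 654.56
n = 14

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(round(b, 4), round(a, 4))  # expect roughly 1.0251 and 0.1100
```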
3. For the data set given in the table, calculate each of the following values: n, ∑x, ∑y, ∑xy, ∑x², and (∑x)².

Month         1   2   3   4   5   6   7   8   9   10  11  12
Temperature   73  73  75  75  77  79  79  81  81  81  77  75
Solution:
Month, x   Temperature, y   xy            x²
1          73               1(73) = 73    1² = 1
2          73               2(73) = 146   2² = 4
3          75               3(75) = 225   3² = 9
4          75               4(75) = 300   4² = 16
5          77               5(77) = 385   5² = 25
6          79               6(79) = 474   6² = 36
7          79               7(79) = 553   7² = 49
8          81               8(81) = 648   8² = 64
9          81               9(81) = 729   9² = 81
10         81               10(81) = 810  10² = 100
11         77               11(77) = 847  11² = 121
12         75               12(75) = 900  12² = 144
Summing the first column gives ∑x = 78.

Summing the second column gives ∑y = 926.

Summing the third column gives ∑xy = 6,090.

Summing the fourth column gives ∑x² = 650.

Squaring the sum from the first column gives (∑x)² = 78² = 6,084.
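If we’d rather let software do the bookkeeping, a short Python sketch (assuming plain lists for the months and temperatures) produces the same sums.

```python
# Compute n, the sums, and the squared sum for problem 3's data.
months = list(range(1, 13))
temps = [73, 73, 75, 75, 77, 79, 79, 81, 81, 81, 77, 75]

n = len(months)                                      # 12
sum_x = sum(months)                                  # 78
sum_y = sum(temps)                                   # 926
sum_xy = sum(x * y for x, y in zip(months, temps))   # 6,090
sum_x2 = sum(x ** 2 for x in months)                 # 650
sum_x_squared = sum_x ** 2                           # 6,084
print(n, sum_x, sum_y, sum_xy, sum_x2, sum_x_squared)
```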
4. Use the Average Global Sea Surface Temperatures data shown in the
table to create a line of best fit for the data. Consider 1910 as year 10. Use
the equation to predict the average global sea surface temperature in the
year 2050.
Year Temperature, F
1910 -1.11277
1920 -0.71965
1930 -0.58358
1940 -0.17977
1950 -0.55318
1960 -0.30358
1970 -0.30863
1980 0.077197
1990 0.274842
2000 0.232502
2010 0.612718
Solution:
Expand the table, using x = 10 for 1910, x = 20 for 1920, and so on.

Year   x     Temperature, y   xy         x²
1910   10    -1.11277         -11.1277   100
1920   20    -0.71965         -14.3930   400
1930   30    -0.58358         -17.5074   900
1940   40    -0.17977         -7.1908    1,600
1950   50    -0.55318         -27.6590   2,500
1960   60    -0.30358         -18.2148   3,600
1970   70    -0.30863         -21.6041   4,900
1980   80    0.077197         6.17576    6,400
1990   90    0.274842         24.73578   8,100
2000   100   0.232502         23.25020   10,000
2010   110   0.612718         67.39898   12,100

Summing the x column gives ∑x = 660.

Summing the temperature column gives ∑y = −2.5639.

Summing the xy column gives ∑xy = 3.86392.

Summing the x² column gives ∑x² = 50,600.

Squaring the sum of the x column gives (∑x)² = 660² = 435,600.
To find the regression line for the data, we need the slope and y-intercept of the line. The slope is

b = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)

b = (11(3.86392) − (660)(−2.5639)) / (11(50,600) − 435,600)

b = (42.50312 + 1,692.174) / (556,600 − 435,600)

b = 1,734.67712 / 121,000

b ≈ 0.0143
The y-intercept is

a = (∑y − b∑x) / n

a = (−2.5639 − 0.0143(660)) / 11

a = (−2.5639 − 9.4619) / 11

a = −12.0258 / 11

a ≈ −1.0933

ŷ = bx + a

ŷ = 0.0143x − 1.0933

To predict average global sea surface temperature in 2050, we’ll need to plug 150 into this equation.

ŷ = 0.0143(150) − 1.0933

ŷ = 2.145 − 1.0933

ŷ ≈ 1.05
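To sanity-check the regression and the 2050 prediction, here’s a short sketch using NumPy’s polyfit (with 1910 coded as x = 10, as in the problem); small differences from the hand calculation are just rounding.

```python
import numpy as np

# Years coded as 10, 20, ..., 110 for 1910 through 2010.
x = np.arange(10, 111, 10)
y = np.array([-1.11277, -0.71965, -0.58358, -0.17977, -0.55318, -0.30358,
              -0.30863, 0.077197, 0.274842, 0.232502, 0.612718])

b, a = np.polyfit(x, y, 1)       # slope and intercept of the least squares line
prediction_2050 = b * 150 + a    # 2050 is year 150
print(round(b, 4), round(a, 4), round(prediction_2050, 2))  # about 0.0143, -1.0933, 1.06
```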
[Scatterplot: Temperature vs. Year for the 1910–2010 data]
5. Average Global Sea Surface Temperatures, 1880-2010

[Scatterplot: Temperature vs. Year, 1880–2020]
Solution:
Adding in these extra three points makes the graph from 1880 to 2010 appear more scattered and less linear than the graph that only includes the points from 1910 to 2010.
Both data sets have a positive correlation because the general trend of the
scatterplot is to increase as we move from left to right, but we might
consider graphs that are exponential in shape instead of linear. If we use a
line of best fit for the data from 1880 to 2010, it might not be as accurate as
a line predicting only the points from 1910 to 2010.
In other words, the best fit line for 1880 to 2010 would have a weaker
correlation than the line for 1910 to 2010, because the additional points to
the left of the graph are more spread out.
But even though cutting off the points makes the line of best fit have a
stronger correlation, it would be good to include them in the data so that
our line of best fit is not misleading.
6. A small coffee shop wants to know how hot chocolate sales are affected by daily temperature. Find the rate of change of hot chocolate sales with respect to temperature.

Daily Temperature, F   Hot Chocolate Sales
28                     110
29                     115
31                     108
33                     103
45                     95
48                     93
55                     82
57                     76
Solution:
Create a scatterplot.
[Scatterplot: Hot Chocolate Sales by Temperature — Hot Chocolate Sales vs. Daily Temperature, F]
From the plot we can see there’s a relatively strong, negative linear relationship with no outliers. The rate of change is the slope, so we need to look at the slope of the line of best fit for the data set. The formula for the slope of the best-fit line is

b = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
Extend the table to find the values we need for the formula.

Daily Temperature, F   Hot Chocolate Sales   xy       x²
28                     110                   3,080    784
29                     115                   3,335    841
31                     108                   3,348    961
33                     103                   3,399    1,089
45                     95                    4,275    2,025
48                     93                    4,464    2,304
55                     82                    4,510    3,025
57                     76                    4,332    3,249
Sum:                   
326                    782                   30,743   14,278
b = (8(30,743) − (326)(782)) / (8(14,278) − (326)²)

b = (245,944 − 254,932) / (114,224 − 106,276)

b = −8,988 / 7,948

b ≈ −1.1309

The units of the slope are “hot chocolate sales per degree Fahrenheit.” So the shop can expect hot chocolate sales to decrease by about 1.13 cups for every one degree increase in temperature.
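A one-liner with NumPy gives the same rate of change, assuming the temperature/sales pairs from the table.

```python
import numpy as np

temp = np.array([28, 29, 31, 33, 45, 48, 55, 57])
sales = np.array([110, 115, 108, 103, 95, 93, 82, 76])

slope, intercept = np.polyfit(temp, sales, 1)   # least squares slope and intercept
print(round(slope, 4))                          # roughly -1.13 sales per degree
```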
CORRELATION COEFFICIENT AND THE RESIDUAL
1. What does the shape of this residual plot tell us about the line of best
fit that was created for the data?
Solution:
The shape of this graph tells us that the linear model is probably not the
best choice for our data set, and that we should consider another type of
regression curve, probably one that’s quadratic.
2. What does the shape of this residual plot tell us about the line of best
fit that was created for the data?
Solution:
The points in this residual plot are evenly spaced around the line y = 0. It
has about the same number of points on the left and right, and about the
same number of points above and below 0. It doesn’t appear to have
outliers or interesting features. So the line of best fit for the data is
probably a good one and can be useful for making predictions.
3. Calculate and interpret the correlation coefficient for the data set.
x y
54 0.162
57 0.127
62 0.864
77 0.895
81 0.943
93 1.206
Solution:

The correlation coefficient is given by

r = (1/(n − 1)) ∑ [((xi − x̄)/sx)((yi − ȳ)/sy)]

We need to start by finding the means and standard deviations for both x and y. The means are

x̄ = (54 + 57 + 62 + 77 + 81 + 93) / 6

x̄ ≈ 70.6667

and

ȳ ≈ 0.6995
and the standard deviations are

sx = √(∑(xi − x̄)² / (n − 1))

sx ≈ 15.3970

and

sy = √(∑(yi − ȳ)² / (n − 1))

sy ≈ 0.4467
Plugging everything into the formula gives

r = (1/(6 − 1)) [((54 − 70.6667)/15.3970)((0.162 − 0.6995)/0.4467) + ((57 − 70.6667)/15.3970)((0.127 − 0.6995)/0.4467)
+ ((62 − 70.6667)/15.3970)((0.864 − 0.6995)/0.4467) + ((77 − 70.6667)/15.3970)((0.895 − 0.6995)/0.4467)
+ ((81 − 70.6667)/15.3970)((0.943 − 0.6995)/0.4467) + ((93 − 70.6667)/15.3970)((1.206 − 0.6995)/0.4467)]

r = (1/5) [(−1.0825)(−1.2033) + (−0.8876)(−1.2816) + (−0.5629)(0.3683) + (0.4113)(0.4376) + (0.6711)(0.5451) + (1.4505)(1.1339)]

r = (1/5) (1.3026 + 1.1375 − 0.2073 + 0.1800 + 0.3658 + 1.6447)

r = (1/5) (4.4233)

r ≈ 0.88
The positive correlation coefficient tells us that the regression line has a
positive slope. The fact that the positive value is closer to 1 than it is to 0
tells us the data is strongly correlated, or that it most likely has a strong
linear relationship. If we looked at a scatterplot of the data and sketched in
the regression line, we’d see that this was true.
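For a quick software check of this r-value, NumPy’s corrcoef returns the same correlation (up to rounding).

```python
import numpy as np

x = np.array([54, 57, 62, 77, 81, 93])
y = np.array([0.162, 0.127, 0.864, 0.895, 0.943, 1.206])

r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(round(r, 2))            # roughly 0.88
```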
4. Calculate the residuals, draw the residual plot, and interpret the
results. Compare the results to the r-value in the previous problem. The
equation of the line of best fit for the data is
ŷ = 0.0257x − 1.1142
x y
54 0.162
57 0.127
62 0.864
77 0.895
81 0.943
93 1.206
Solution:
The predicted values and residuals are

x    Actual y   Predicted y   Residual
54   0.162      0.2736        -0.1116
57   0.127      0.3507        -0.2237
62   0.864      0.4792        0.3848
77   0.895      0.8647        0.0303
81   0.943      0.9675        -0.0245
93   1.206      1.2759        -0.0699

[Residual plot: residuals vs. x]
From the residual plot, it looks like the data had an outlier at x = 62. We
already have a somewhat strong positive linear correlation from the
correlation coefficient from the previous problem of r ≈ 0.88, so it’s likely
the relationship would be even stronger without the outlier.
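Here’s a minimal sketch of the same residual calculation, assuming the given line ŷ = 0.0257x − 1.1142.

```python
import numpy as np

x = np.array([54, 57, 62, 77, 81, 93])
y = np.array([0.162, 0.127, 0.864, 0.895, 0.943, 1.206])

predicted = 0.0257 * x - 1.1142
residuals = y - predicted
print(np.round(residuals, 4))   # the largest residual, about 0.38, sits at x = 62
```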
5. Calculate and interpret the correlation coefficient for the data set.
Year Temperature, F
1880 -0.47001
1890 -0.88758
1900 -0.48331
1910 -1.11277
1920 -0.71965
1930 -0.58358
1940 -0.17977
1950 -0.55318
1960 -0.30358
1970 -0.30863
1980 0.077197
1990 0.274842
2000 0.232502
2010 0.612718
Solution:
Since this is a larger set of data, it can be nice to use a program like Excel
to expand the table.
Add columns for (xi − x̄)² and (yi − ȳ)² to the table of years and temperatures.

The means are x̄ = 1945 and ȳ ≈ −0.3146, and the standard deviations are

sx = √(∑(xi − x̄)² / (n − 1)) = √(22,750/13) ≈ 41.8330

sy = √(∑(yi − ȳ)² / (n − 1)) = √(2.98986/13) ≈ 0.4796
Now that we have the means and standard deviations, we can find the correlation coefficient. If we expand the table with a column for ((xi − x̄)/sx)((yi − ȳ)/sy), we can pull just the one sum we need out of the table to plug into the correlation coefficient formula:

∑ ((xi − x̄)/sx)((yi − ȳ)/sy) = 11.00042
r = (1/(n − 1)) ∑ [((xi − x̄)/sx)((yi − ȳ)/sy)]

r = (1/(14 − 1)) (11.00042)

r = (1/13) (11.00042)

r ≈ 0.8462
There’s a strong positive linear relationship between the year and the
temperature of the ocean’s surface.
6. Calculate the residuals and create the residual plot for the data in the
table. Compare this with the r-value we calculated in the last question and
interpret the results. Use the equation for the regression line
ŷ = 0.0143x − 28.332.
Year Temperature, F
1880 -0.47001
1890 -0.88758
1900 -0.48331
1910 -1.11277
1920 -0.71965
1930 -0.58358
1940 -0.17977
1950 -0.55318
1960 -0.30358
1970 -0.30863
1980 0.077197
1990 0.274842
2000 0.232502
2010 0.612718
Solution:
Year   Actual y    Predicted y   Residual
1880   -0.47001    -1.448        0.978
1890   -0.88758    -1.305        0.417
1900   -0.48331    -1.162        0.679
1910   -1.11277    -1.019        -0.094
1920   -0.71965    -0.876        0.156
1930   -0.58358    -0.733        0.149
1940   -0.17977    -0.590        0.410
1950   -0.55318    -0.447        -0.106
1960   -0.30358    -0.304        0.000
1970   -0.30863    -0.161        -0.148
1980   0.077197    -0.018        0.095
1990   0.274842    0.125         0.150
2000   0.232502    0.268         -0.035
2010   0.612718    0.411         0.202

[Residual plot: residuals vs. year, 1880–2010]
COEFFICIENT OF DETERMINATION AND RMSE
1. Linda read an article about predicting high school students’ GPAs. The article studied three factors: the number of volunteer organizations each student participated in, the number of hours spent on homework, and the student’s individual scores on standardized tests.

The article concluded that the number of hours spent on homework is the best predictor of GPA, because 24 % of the variance in GPA was found to come from hours spent on homework, 15 % from the number of volunteer organizations, and 11.5 % from individual scores on standardized tests.
Solution:
In this context, the percent of the variance in GPA due to hours spent on
homework is 24 % . So, we’re talking about a least squares line where
r 2 = 0.24. This is a very weak positive relationship, so the line of best fit is
probably not a good predictor of the connection between hours spent on
homework and GPA.
2. For the data in the table, calculate the sum of the squared residuals
based on the mean of the y-values.
x y
1 3.1
2 3.4
3 3.7
4 3.9
5 4.1
Solution:

ȳ = 18.2 / 5

ȳ = 3.64

In this case, the predicted value is the mean of the y-values, ȳ = 3.64. Let’s expand the table and calculate the residuals.
x y e
1 3.1 -0.54
2 3.4 -0.24
3 3.7 0.06
4 3.9 0.26
5 4.1 0.46
Now we just need to find the squares of these residuals and add them together.

x   y     e       e²
1   3.1   -0.54   0.2916
2   3.4   -0.24   0.0576
3   3.7   0.06    0.0036
4   3.9   0.26    0.0676
5   4.1   0.46    0.2116

Sum: 0.632
3. Use the same data as the previous question to calculate the sum of
the squared residuals based on the least squares regression line,
ŷ = 0.25x + 2.89.
Solution:

x   Actual y   Predicted y   e
1   3.1        3.14          -0.04
2   3.4        3.39          0.01
3   3.7        3.64          0.06
4   3.9        3.89          0.01
5   4.1        4.14          -0.04

Now we just need to find the squares of these residuals and add them together.

x   Actual y   Predicted y   e       e²
1   3.1        3.14          -0.04   0.0016
2   3.4        3.39          0.01    0.0001
3   3.7        3.64          0.06    0.0036
4   3.9        3.89          0.01    0.0001
5   4.1        4.14          -0.04   0.0016

Sum: 0.007

So the sum of the squared residuals is 0.007.
4. Use the results of the two previous questions to calculate the coefficient of determination.

Solution:
The sum of the squared residuals for the mean of the y-values was

∑ residuals² = 0.632

The sum of the squared residuals for the line of best fit was

∑ residuals² = 0.007

This means using the line of best fit reduces the error by

0.632 − 0.007 = 0.625

This is

0.625 / 0.632 ≈ 0.9889 = 98.89 %

which is a 98.89 % reduction in error from using the least squares regression line. This is another way to calculate the coefficient of determination, so r² = 0.9889.
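The whole comparison fits in a few lines of Python, assuming the data and the least squares line ŷ = 0.25x + 2.89 from the previous questions.

```python
x = [1, 2, 3, 4, 5]
y = [3.1, 3.4, 3.7, 3.9, 4.1]

y_bar = sum(y) / len(y)
ss_mean = sum((yi - y_bar) ** 2 for yi in y)                            # 0.632
ss_line = sum((yi - (0.25 * xi + 2.89)) ** 2 for xi, yi in zip(x, y))   # 0.007
r_squared = (ss_mean - ss_line) / ss_mean                               # about 0.9889
print(round(ss_mean, 3), round(ss_line, 3), round(r_squared, 4))
```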
5. What is the RMSE of the data set and what does it mean?
x y
1 3.1
2 3.4
3 3.7
4 3.9
5 4.1
Solution:

RMSE = √(∑ residuals² / n)

x   Actual y   Predicted y   e       e²
1   3.1        3.14          -0.04   0.0016
2   3.4        3.39          0.01    0.0001
3   3.7        3.64          0.06    0.0036
4   3.9        3.89          0.01    0.0001
5   4.1        4.14          -0.04   0.0016

Sum: 0.007

RMSE = √(0.007/5) ≈ 0.0374
Since the RMSE we found is small, the data points are tightly clustered around the line of best fit, which points to a stronger correlation in the data.
6. Calculate the RMSE for the data set, given that the least squares line is
ŷ = 0.0028x + 1.2208.
x y
5 1.25
10 1.29
12 1.17
15 1.24
17 1.32
Solution:

RMSE = √(∑ residuals² / n)

x    Actual y   Predicted y   e         e²
5    1.25       1.2348        0.0152    0.00023104
10   1.29       1.2488        0.0412    0.00169744
12   1.17       1.2544        -0.0844   0.00712336
15   1.24       1.2628        -0.0228   0.00051984
17   1.32       1.2684        0.0516    0.00266256

Sum: 0.01223424

RMSE = √(0.01223424/5) ≈ 0.0495
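The same RMSE calculation in Python, assuming the least squares line ŷ = 0.0028x + 1.2208:

```python
from math import sqrt

x = [5, 10, 12, 15, 17]
y = [1.25, 1.29, 1.17, 1.24, 1.32]

squared_residuals = [(yi - (0.0028 * xi + 1.2208)) ** 2 for xi, yi in zip(x, y)]
rmse = sqrt(sum(squared_residuals) / len(x))
print(round(rmse, 4))   # roughly 0.0495
```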
CHI-SQUARE TESTS
            iPhone   Android   Other   Totals
Northeast   72       33        8       113
Southeast   48       26        7       81
Northwest   59       33        10      102
Southwest   61       27        9       97
Solution:
First, find the expected count for each cell by multiplying its row total by its column total and dividing by the overall total of 560.
Now we’ll check our sampling conditions. The problem told us that we
took a random sample, and all of our expected values are at least 5, so
we’ve met the random sampling and large counts conditions. And even
though we’re sampling without replacement, 560 is far less than 10 % of the
US population, so we’ve met the independence condition as well.
Calculate χ² by summing (observed − expected)²/expected over all of the cells. For example, the Northwest row contributes

(59 − 63.20)²/63.20 + (33 − 30.78)²/30.78 + (10 − 8.01)²/8.01

Adding up every cell’s contribution gives

χ² ≈ 2.6759
df = (5 − 1)(3 − 1)
df = (4)(2)
df = 8
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
7 9.04 9.80 10.75 12.02 14.07 16.01 16.62 18.48 20.28 22.04 24.32 26.02
8 10.22 11.03 12.03 13.36 15.51 17.53 18.17 20.09 21.95 23.77 26.12 27.87
9 11.39 12.24 13.29 14.68 16.92 19.02 19.68 21.67 23.59 25.46 27.88 29.67
Our χ² value of about 2.68 is off the chart on the left (smaller than every value in the df = 8 row), which means the p-value is much larger than the alpha level α = 0.05. Therefore, we’ll fail to reject the null hypothesis, and conclude that we don’t have evidence that geographic region of the country affects cell phone brand preference.
Beverage
A B C Totals
Men 35 34 31 100
Women 31 33 36 100
Totals 66 67 67 200
Solution:
Find the expected count for each cell by multiplying its row total by its column total and dividing by the overall total of 200. Each row has expected counts of 33 for beverage A and 33.5 for beverages B and C.
Now we’ll check our sampling conditions. The problem told us that we
took a random sample and that we sampled less than 10 % of the
population, so we’ve met the random sampling and independence
conditions. And all of our expected values are at least 5, so we’ve met the
large counts condition as well.
Calculate χ².

χ² = (35 − 33)²/33 + (34 − 33.5)²/33.5 + (31 − 33.5)²/33.5 + (31 − 33)²/33 + (33 − 33.5)²/33.5 + (36 − 33.5)²/33.5

χ² ≈ 0.6306
The degrees of freedom are
df = (2 − 1)(3 − 1)
df = (1)(2)
df = 2
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.32 1.64 2.07 2.71 3.84 5.02 5.41 6.63 7.88 9.14 10.83 12.12
2 2.77 3.22 3.79 4.61 5.99 7.38 7.82 9.21 10.60 11.98 13.82 15.20
3 4.11 4.64 5.32 6.25 7.81 9.35 9.84 11.34 12.84 14.32 16.27 17.73
Our χ² value of about 0.63 is off the chart on the left (smaller than every value in the df = 2 row), which means the p-value is much larger than the alpha level α = 0.1. Therefore, we’ll fail to reject the null hypothesis, and conclude that we don’t have evidence that gender affects beverage preference.
Bagel Muffin Totals
Coffee 38 34 72
Tea 25 29 54
Totals 63 63 126
Solution:
Find the expected count for each cell by multiplying its row total by its column total and dividing by the overall total of 126. That gives expected counts of 36 for each cell in the Coffee row and 27 for each cell in the Tea row.
Now we’ll check our sampling conditions. The problem told us that we
took a random sample and that we sampled less than 10 % of the
population, so we’ve met the random sampling and independence
conditions. And all of our expected values are at least 5, so we’ve met the
large counts condition as well.
Calculate χ².

χ² = (38 − 36)²/36 + (34 − 36)²/36 + (25 − 27)²/27 + (29 − 27)²/27

χ² = 4/36 + 4/36 + 4/27 + 4/27

χ² = 2/9 + 8/27

χ² ≈ 0.52
df = (2 − 1)(2 − 1)
df = (1)(1)
df = 1
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.32 1.64 2.07 2.71 3.84 5.02 5.41 6.63 7.88 9.14 10.83 12.12
2 2.77 3.22 3.79 4.61 5.99 7.38 7.82 9.21 10.60 11.98 13.82 15.20
3 4.11 4.64 5.32 6.25 7.81 9.35 9.84 11.34 12.84 14.32 16.27 17.73
Our χ² value of about 0.52 is off the chart on the left (smaller than every value in the df = 1 row), which means the p-value is much larger than the alpha level α = 0.01. Therefore, the coffee company will fail to reject the null hypothesis, and conclude that there’s no evidence that beverage preference affects pastry preference.
4. A school district wants to know whether or not GPA is affected by elective preference. They randomly sampled fewer than 10 % of their students, and recorded their elective preference and GPA. What can the school district conclude using a chi-square test at α = 0.1?
GPA range
<2 2 3 4+ Totals
Music 12 26 31 34 103
Theater 21 22 23 21 87
Art 36 29 29 32 126
Totals 69 77 83 87 316
Solution:

GPA range
          <2           2            3            4+           Totals
Music     12 (22.49)   26 (25.10)   31 (27.05)   34 (28.36)   103
Theater   21 (19.00)   22 (21.20)   23 (22.85)   21 (23.95)   87
Art       36 (27.51)   29 (30.70)   29 (33.09)   32 (34.69)   126
Totals    69           77           83           87           316

The expected counts are shown in parentheses.
Now we’ll check our sampling conditions. The problem told us that we
took a random sample and that we sampled less than 10 % of the
population, so we’ve met the random sampling and independence
conditions. And all of our expected values are at least 5, so we’ve met the
large counts condition as well.
Calculate χ² by summing (observed − expected)²/expected over all twelve cells. The Music row contributes

(12 − 22.49)²/22.49 + (26 − 25.10)²/25.10 + (31 − 27.05)²/27.05 + (34 − 28.36)²/28.36

and adding the Theater and Art rows’ contributions as well gives

χ² ≈ 10.65
df = (3 − 1)(4 − 1)

df = (2)(3)

df = 6
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
5 6.63 7.29 8.12 9.24 11.07 12.83 13.39 15.09 16.75 18.39 20.52 22.11
6 7.84 8.56 9.45 10.64 12.59 14.45 15.03 16.81 18.55 20.25 22.46 24.10
7 9.04 9.80 10.75 12.02 14.07 16.01 16.62 18.48 20.28 22.04 24.32 26.02
The χ² value of 10.65 just clears the critical value of 10.64 for α = 0.1, which means that the school district can reject the null hypothesis and conclude that elective choice is associated with GPA. If they had used a stricter significance level of α = 0.05 (a 95 % confidence level), they would not have been able to reject the null, since 10.65 is less than 12.59.
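If we have SciPy available, the whole test of independence can be cross-checked in a few lines; the observed counts below are the ones from the table.

```python
from scipy.stats import chi2_contingency

observed = [[12, 26, 31, 34],   # Music
            [21, 22, 23, 21],   # Theater
            [36, 29, 29, 32]]   # Art

chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 3), df)   # roughly 10.66, 0.1, and 6
```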
Solution:

With 18.42 (or 18,420,000) total flights, the expected number of flights in each quarter would be 18.42/4 = 4.605.

Calculate χ² by summing (observed − expected)²/expected over the four quarters.

χ² = 0.1533

df = 4 − 1 = 3
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
2 2.77 3.22 3.79 4.61 5.99 7.38 7.82 9.21 10.60 11.98 13.82 15.20
3 4.11 4.64 5.32 6.25 7.81 9.35 9.84 11.34 12.84 14.32 16.27 17.73
4 5.39 5.99 6.74 7.78 9.49 11.14 11.67 13.28 14.86 16.42 18.47 20.00
Our χ² value of 0.1533 is off the chart on the left (smaller than every value in the df = 3 row), which means the p-value is much larger than the alpha level α = 0.05. Therefore, the airline will fail to reject the null hypothesis, and conclude that there’s no evidence that the number of flights taken is affected by quarter.
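For a chi-square goodness-of-fit test like this one, SciPy’s chisquare function does the same arithmetic. The quarterly counts below are hypothetical stand-ins (the original observed values aren’t reproduced in this solution), so the statistic won’t match the 0.1533 above; the sketch just shows the mechanics.

```python
from scipy.stats import chisquare

observed = [4.55, 4.72, 4.61, 4.54]   # hypothetical flights per quarter, in millions
expected = [sum(observed) / 4] * 4    # 4.605 million per quarter if the total is 18.42

statistic, p_value = chisquare(observed, f_exp=expected)
print(round(statistic, 4), round(p_value, 3))
```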
Solution:

H0: Number of sandwiches sold is not affected by time of day.

Ha: Number of sandwiches sold is affected by time of day.
Calculate χ².

χ² = 1/214 + 36/214 + 49/214

χ² = 86/214

χ² ≈ 0.4019
Upper-tail probability p
df 0.25 0.20 0.15 0.10 0.05 0.025 0.02 0.01 0.005 0.0025 0.001 0.0005
1 1.32 1.64 2.07 2.71 3.84 5.02 5.41 6.63 7.88 9.14 10.83 12.12
2 2.77 3.22 3.79 4.61 5.99 7.38 7.82 9.21 10.60 11.98 13.82 15.20
3 4.11 4.64 5.32 6.25 7.81 9.35 9.84 11.34 12.84 14.32 16.27 17.73
Our χ² value of about 0.40 is off the chart on the left (smaller than every value shown), which means the p-value is much larger than the alpha level α = 0.1. Therefore, the sandwich company will fail to reject the null hypothesis, and conclude that there’s no evidence that the number of sandwiches sold is affected by time of day.