S1.4 Correlation and regression (3)
Statistics 1 for Edexcel
S1.4 Correlation and regression
1 of 58 © Boardworks Ltd 2005
Scatter graphs
Correlation
[Figure: scatter graph with Latitude on the horizontal axis.]
vertical axis.
[Figures: three scatter graphs of y against x, one for each case below.]
Strong negative correlation –
Weak negative correlation –
No correlation – the points are scattered across the graph area, indicating no relationship between the variables.
It can be shown that a line of best fit always passes through the
mean point, $(\bar{x}, \bar{y})$.
Example: A line of best fit can be added to the scatter
graph showing mean January temperatures and latitude.
[Figure: scatter graph with a line of best fit; the mean point is labelled.]
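The mean-point property can be checked numerically. The sketch below is a minimal illustration, assuming that "line of best fit" means the least-squares line; the data values are invented for the example and are not the temperature data from the slide. It solves the two least-squares normal equations directly for a and b, then compares the fitted value at the mean of x with the mean of y.

```python
# Illustrative data only (not taken from the slides).
xs = [3.0, 12.0, 25.0, 36.0, 41.0, 52.0]
ys = [27.0, 24.0, 15.0, 6.0, 2.0, -3.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xx = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Solve the two normal equations of least squares for a and b:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_xx
det = n * sum_xx - sum_x * sum_x
a = (sum_y * sum_xx - sum_x * sum_xy) / det
b = (n * sum_xy - sum_x * sum_y) / det

x_bar = sum_x / n
y_bar = sum_y / n

# The fitted value at x_bar should coincide with y_bar.
fitted_at_mean = a + b * x_bar
print(f"line: y = {a:.3f} + {b:.3f}x")
print(f"fitted value at x_bar: {fitted_at_mean:.6f}, y_bar: {y_bar:.6f}")
```

Because a and b come from the normal equations rather than from a formula that already uses the means, the agreement of the two printed values is a genuine check rather than a restatement.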
Product–moment correlation coefficient
–1 ≤ r ≤ 1
[Figures: three scatter graphs. The surviving caption fragments read "... positive correlation;", "... negative correlation;" and "... variables."]
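where:
\[ S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} \]
\[ S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} \]
\[ S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \]
Usually, the second version of each formula is used.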
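The two versions of each formula give the same value. The following is a minimal sketch that evaluates both on a small invented data set (the numbers are hypothetical, chosen only for illustration) and then forms r as S_xy divided by the square root of S_xx S_yy.

```python
from math import sqrt

# Hypothetical paired observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# First version: sums built from deviations about the means.
s_xy_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx_dev = sum((x - x_bar) ** 2 for x in xs)
s_yy_dev = sum((y - y_bar) ** 2 for y in ys)

# Second version: the same quantities from raw sums only.
s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(f"S_xy: {s_xy_dev:.4f} vs {s_xy:.4f}")
print(f"S_xx: {s_xx_dev:.4f} vs {s_xx:.4f}")
print(f"S_yy: {s_yy_dev:.4f} vs {s_yy:.4f}")
print(f"r = {r:.4f}")
```

The second version needs only the totals of x, y, x², y² and xy, which is why it suits questions that supply summary statistics instead of raw data.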
Product–moment correlation coefficient
[Figure: scatter graph of Brain size (kg) against Body mass (kg).]
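So:
\[ S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} = 3519 - \frac{31.02 \times 392.4}{6} = 1490 \]
\[ S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 255.78 - \frac{31.02^2}{6} = 95.41 \]
\[ S_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 50\,344 - \frac{392.4^2}{6} = 24\,681 \]
Therefore:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1490}{\sqrt{95.41 \times 24\,681}} = 0.971 \]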
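The arithmetic in this example can be reproduced from the quoted sums. The helper name `pmcc_from_sums` is hypothetical, introduced here only for illustration; the input values are the ones given in the working, with n = 6.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Return (S_xy, S_xx, S_yy, r) computed from raw sums."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    return s_xy, s_xx, s_yy, s_xy / sqrt(s_xx * s_yy)

# Summary statistics quoted in the body mass / brain size example (n = 6).
s_xy, s_xx, s_yy, r = pmcc_from_sums(
    n=6, sum_x=31.02, sum_y=392.4, sum_xx=255.78, sum_yy=50344, sum_xy=3519
)
print(f"S_xy = {s_xy:.1f}, S_xx = {s_xx:.2f}, S_yy = {s_yy:.0f}, r = {r:.3f}")
```

The function keeps full precision until the final division, so intermediate values may differ slightly from the rounded figures in the working while r still agrees to three decimal places.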
$\sum xc = 619.6$
Calculate the value of the product–moment correlation
coefficient and comment on the implications of your
answer.
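So:
\[ S_{xc} = \sum x_i c_i - \frac{\sum x_i \sum c_i}{n} = 619.6 - \frac{96.5 \times 54}{10} = 98.5 \]
\[ S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 2156.9 - \frac{96.5^2}{10} = 1225.675 \]
\[ S_{cc} = \sum c_i^2 - \frac{\left(\sum c_i\right)^2}{n} = 383.54 - \frac{54^2}{10} = 91.94 \]
Therefore, the product–moment correlation coefficient is:
\[ r = \frac{S_{xc}}{\sqrt{S_{xx} S_{cc}}} = \frac{98.5}{\sqrt{1225.675 \times 91.94}} = 0.293 \]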
Income shows weak positive correlation with CO2 emissions –
emissions are generally higher in wealthier countries. However,
as the correlation is low, the result is somewhat inconclusive.
Effect of coding on the correlation
u = ax + b
v = cy + d
where x = height of father (in cm) and y = height of son (in cm).
Calculate the value of the product–moment correlation
coefficient between the fathers’ and sons’ heights.
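The effect of a linear coding u = ax + b, v = cy + d on r can be explored numerically. The sketch below uses invented heights and invented coding constants (all hypothetical, not the slide's data or coding) and relies on the standard result that r is unchanged when a and c have the same sign and changes sign when they have opposite signs.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient from paired data."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical father/son heights in cm (not the data from the slide).
x = [165.0, 170.0, 172.0, 178.0, 183.0, 188.0]
y = [168.0, 169.0, 175.0, 176.0, 181.0, 185.0]

# Hypothetical coding u = ax + b, v = cy + d with a = 0.1, c = 0.5 (both positive).
u = [0.1 * xi - 16.0 for xi in x]
v = [0.5 * yi - 80.0 for yi in y]

r_xy = pmcc(x, y)
r_uv = pmcc(u, v)
# Negating v corresponds to a coding with c < 0, which reverses the sign of r.
r_flipped = pmcc(u, [-vi for vi in v])

print(f"r(x, y) = {r_xy:.6f}")
print(f"r(u, v) = {r_uv:.6f}")
print(f"r(u, -v) = {r_flipped:.6f}")
```

In practice this means r can be calculated from the coded values, which are usually smaller and easier to work with, and the answer carries over to the original variables.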
Regression – random on random
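The best fitting line is the one that minimizes the sum of the
squared deviations, $\sum d_i^2$, where $d_i$ is the vertical distance
between the $i$th point and the line.
[Figure: scatter graph with the vertical deviations $d_1$ to $d_6$ marked between the points and the line.]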
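The least-squares criterion can be illustrated by comparing the sum of squared vertical deviations for the fitted line against nearby lines. This is a minimal sketch on invented data; the function name `sum_squared_deviations` and the perturbation sizes are assumptions made for the example.

```python
def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = a + bx."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(xs)

s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
b = s_xy / s_xx                      # gradient of the least-squares line
a = sum(ys) / n - b * sum(xs) / n    # intercept, so the line passes through the mean point

best = sum_squared_deviations(xs, ys, a, b)

# Nudging either coefficient away from (a, b) should never reduce the total.
nudged = [
    sum_squared_deviations(xs, ys, a + da, b + db)
    for da in (-0.2, 0.0, 0.2)
    for db in (-0.05, 0.0, 0.05)
    if (da, db) != (0.0, 0.0)
]

print(f"least-squares line: y = {a:.3f} + {b:.3f}x, sum of squares = {best:.4f}")
print(f"smallest sum of squares among nudged lines = {min(nudged):.4f}")
```

Checking a handful of neighbouring lines does not prove minimality, but it shows what the criterion measures: every alternative tried has a larger total of squared vertical distances.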
Recall:
\[ S_{xy} = \sum xy - \frac{\sum x \sum y}{n} \quad \text{and} \quad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \]
City            Latitude (x)    Mean January temperature (y)
Kuala Lumpur    3               27
Madrid          40              5
New York        41              0
Reykjavik       30              –1
Tokyo           36              5

$\sum y^2 = 2494$, $\sum xy = 2000$

We then use these to calculate the gradient (b) and y-intercept (a)
for the regression line.
\[ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 11\,636 - \frac{312^2}{10} = 1901.6 \]
Therefore:
\[ b = \frac{S_{xy}}{S_{xx}} = \frac{-1369.6}{1901.6} = -0.720 \text{ (to 3 sig. figs.)} \]
a)
$\sum x^2 = 562\,397$, $\sum y^2 = 32\,890$, $\sum xy = 131\,541$
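These can be used to find the gradient of the regression line:
\[ S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 131\,541 - \frac{1623 \times 466}{9} = 47\,505.67 \]
\[ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 562\,397 - \frac{1623^2}{9} = 269\,716 \]
Therefore:
\[ b = \frac{S_{xy}}{S_{xx}} = \frac{47\,505.67}{269\,716} = 0.176 \text{ (to 3 sig. figs.)} \]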
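The gradient calculation can be reproduced from the quoted sums, and the same sums also give an intercept. The sketch below assumes the intercept is found as the mean of y minus b times the mean of x, the form used for the regression lines elsewhere in these slides; the function name `regression_from_sums` is hypothetical.

```python
def regression_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the least-squares line y = a + bx from raw sums."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    b = s_xy / s_xx
    a = sum_y / n - b * sum_x / n
    return a, b

# Sums quoted in the working above (n = 9).
a, b = regression_from_sums(n=9, sum_x=1623, sum_y=466, sum_xx=562397, sum_xy=131541)
print(f"y = {a:.2f} + {b:.3f}x")
```

Only the gradient is stated in the working, so the printed intercept is what this formula gives for these sums rather than a value quoted from the slide.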
Examination style question: regression
$\sum c = 888$, $\sum c^2 = 58\,362$, $\sum s = 943$, $\sum s^2 = 66\,445$, $\sum cs = 61\,878$
a) Calculate the regression line of s on c and the regression
line of c on s.
b) Caroline was absent for her C1 examination, but scored
51% in S1. Use the appropriate regression line to
estimate her percentage score in the C1 paper.
c) Calculate the product–moment correlation coefficient
between the marks in the two papers. Comment on the
implications of this for the accuracy of the estimate
found in b).
Predicting x from y – random on random
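\[ S_{cc} = 58\,362 - \frac{888^2}{15} = 5792.4 \qquad S_{ss} = 66\,445 - \frac{943^2}{15} = 7161.733 \]
\[ S_{cs} = 61\,878 - \frac{888 \times 943}{15} = 6052.4 \qquad \bar{c} = 59.2 \qquad \bar{s} = 62.8667 \]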
15
For the regression line of c on s:
\[ b = \frac{6052.4}{7161.733} = 0.8451 \qquad a = 59.2 - (0.8451 \times 62.8667) = 6.07 \]
So, the equation of the regression line of c on s is:
c = 6.07 + 0.845s
b) We wish to estimate the value of c when s = 51. Both
variables are random, so we use the regression line of c on s:
c = 6.07 + 0.845s = 6.07 + (0.845 × 51) = 49.2
So we estimate Caroline to have scored 49% in C1.
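The working for this question can be reproduced from the summary statistics. The sketch below computes both regression lines, then uses the line of c on s for the estimate; the variable names are chosen for this example, and the value s = 51 is the one used in the worked solution.

```python
n = 15
sum_c, sum_s = 888, 943
sum_cc, sum_ss, sum_cs = 58362, 66445, 61878

s_cc = sum_cc - sum_c ** 2 / n
s_ss = sum_ss - sum_s ** 2 / n
s_cs = sum_cs - sum_c * sum_s / n
c_bar, s_bar = sum_c / n, sum_s / n

# Regression line of c on s: c = a_cs + b_cs * s (used to estimate c from s).
b_cs = s_cs / s_ss
a_cs = c_bar - b_cs * s_bar

# Regression line of s on c: s = a_sc + b_sc * c (used to estimate s from c).
b_sc = s_cs / s_cc
a_sc = s_bar - b_sc * c_bar

estimate_c = a_cs + b_cs * 51   # S1 mark used in the worked solution

print(f"c on s: c = {a_cs:.2f} + {b_cs:.3f}s")
print(f"s on c: s = {a_sc:.2f} + {b_sc:.3f}c")
print(f"estimated C1 mark: {estimate_c:.1f}")
```

The two lines share the same S_cs but divide by different sums of squares, so they are different lines; choosing c on s here matches the fact that c is the quantity being estimated.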
c) The PMCC is calculated as follows:
\[ r = \frac{S_{cs}}{\sqrt{S_{cc} S_{ss}}} = \frac{6052.4}{\sqrt{5792.4 \times 7161.733}} = 0.94 \]
The PMCC indicates that there is very strong positive
correlation between the marks in C1 and S1 – the points on
the scatter graph would lie very close to a straight line.
This suggests that the mark estimated in b) is likely to be fairly
accurate.
$\sum x = 42$, $\sum x^2 = 364$, $\sum y = 60.79$, $\sum y^2 = 623.20$, $\sum xy = 447.74$
Also: $\bar{x} = 7$, $\bar{y} = 10.132$
Note: The intercept (7.91) represents the crop yield that might
be expected if no fertilizer were to be applied. The equation of
the line also shows that increasing the amount of fertilizer by
1 kg increases the expected crop yield by 0.317 kg.
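The intercept and gradient quoted in the note can be recovered from the sums, and their interpretation can be checked directly. In the sketch below, n = 6 is an inference from the sum of x being 42 and the mean of x being 7, and the function name `expected_yield` is hypothetical.

```python
# Sums quoted above; n = 6 is inferred from sum_x = 42 and x_bar = 7.
n = 6
sum_x, sum_y = 42, 60.79
sum_xx, sum_xy = 364, 447.74

s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n
b = s_xy / s_xx                 # expected extra yield per extra kg of fertilizer
a = sum_y / n - b * sum_x / n   # expected yield with no fertilizer

def expected_yield(fertilizer_kg):
    """Expected crop yield from the fitted line y = a + bx."""
    return a + b * fertilizer_kg

print(f"y = {a:.2f} + {b:.3f}x")
print(f"yield at 0 kg: {expected_yield(0):.2f}")
print(f"increase per extra kg: {expected_yield(1) - expected_yield(0):.3f}")
```

The yield at 0 kg is the intercept and the increase per extra kilogram is the gradient, which is the reading given in the note.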