6 Correlation and Regression
CHAPTER 6:
CORRELATION AND SIMPLE LINEAR REGRESSION
Correlation
▪ measures the strength of the relationship between two
variables
▪ involves a bivariate data / distribution
Regression
▪ a study to identify the relationship between two or
more variables using a mathematical equation
▪ is normally used for estimation purposes
Example:
A study of the relationship between ice cream sales
and temperature
▪ temperature is the independent variable since it can
be used to explain the sales of ice cream
▪ sales is the dependent variable since sales
depend on temperature
Univariate distribution
▪ data on a single characteristic are grouped together
▪ Example: height of students, price of items, etc.
Bivariate distribution
▪ data on two characteristics are grouped together
▪ Example: sales of ice cream and temperature; sales of
goods and advertising expenses
Chapter 6 – Page 1
Scatter Diagram
▪ a plot of paired observations ( X, Y )
▪ illustrates whether
Example:
The data below relates the weekly maintenance cost ($) to the age
(in months) of ten machines of similar type in a manufacturing
company.
Machine 1 2 3 4 5 6 7 8 9 10
Age (X) 5 10 15 20 30 30 30 50 50 60
Cost (Y) 190 240 250 300 310 335 300 300 350 395
Construct a scatter diagram and comment on it.
Solution:
[Figure: Scatter Diagram of Weekly Maintenance Cost and Age of Machine. x-axis: Age of machine (months); y-axis: Maintenance cost ($)]
Comment:
Two types of correlation
1. Linear correlation
✓ correlation is said to be linear if the relationship can be
represented by a straight line
2. Non-linear correlation (or curvilinear correlation)
✓ correlation is said to be non-linear if the relationship can be
represented by a curve
Correlation Coefficient (r)
▪ measures the strength of the linear relationship between two
variables
▪ takes values from −1 to +1, i.e. −1 ≤ r ≤ +1
[Figure: example scatter plots of y against x for different values of r, including the case where correlation is absent (r = 0)]
Product Moment Correlation Coefficient, r
$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$
Example:
Calculate the product moment correlation coefficient for the
following data. What does the value of the coefficient indicate?
X 5 6 7 9 8
Y 8 9 9 11 13
Solution:
X    Y    XY   X²   Y²
5    8    40   25   64
6    9    54   36   81
7    9    63   49   81
9    11
8    13

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} = \frac{5(360) - (35)(50)}{\sqrt{\left[5(255) - (35)^2\right]\left[5(516) - (50)^2\right]}} = 0.7906 $$
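The calculation above can be checked with a short script. This is only a sketch using plain Python; the function name `pearson_r` is an illustrative choice and is not part of the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment correlation coefficient computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

X = [5, 6, 7, 9, 8]
Y = [8, 9, 9, 11, 13]
r = pearson_r(X, Y)
print(round(r, 4))  # 0.7906
```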
Coefficient of Determination, r²
▪ is the square of the coefficient of correlation
Example:
If r = 0.8, then r² = 0.8² = 0.64
Example:
Using the machine data from the scatter diagram example (repeated
below), calculate the product moment correlation coefficient between
age and maintenance cost. Hence, find the coefficient of determination
and comment on the results.
Machine 1 2 3 4 5 6 7 8 9 10
Age (X) 5 10 15 20 30 30 30 50 50 60
Cost (Y) 190 240 250 300 310 335 300 300 350 395
Solution:
X Y XY X² Y²
5 190 950 25 36100
10 240 2400 100 57600
15 250 3750 225 62500
20 300 6000 400 90000
30 310 9300 900 96100
30 335 10050 900 112225
30 300 9000 900 90000
50 300 15000 2500 90000
50 350 17500 2500 122500
60 395 23700 3600 156025
ΣX =        ΣY =        ΣXY =        ΣX² =        ΣY² =

$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} $$
r =
Coefficient of Determination = r² =
Comment:
r = ____ indicates that there is a linear correlation between ____ and
____. As ____ increases, ____ would also increase.
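To check a hand calculation for this example, the column sums, r and r² can be computed directly from the table. The sketch below uses plain Python; the variable names are illustrative only.

```python
from math import sqrt

age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]                 # X, months
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]     # Y, $

n = len(age)
sum_x = sum(age)
sum_y = sum(cost)
sum_xy = sum(x * y for x, y in zip(age, cost))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in cost)

# Product moment correlation coefficient and coefficient of determination
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r_squared = r ** 2

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 4), round(r_squared, 4))
```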
✓ Causation ⇒ Correlation
E.g. The age of a machine causes its maintenance cost to increase.
Therefore, there is certainly a correlation between the age and
the maintenance cost of the machine.
✓ Correlation not always ⇒ Causation
E.g. There might be a strong positive correlation between ice
cream sales and umbrella sales, but they are NOT THE CAUSE
of each other. The real cause of the changes in both variables may
be the weather.
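$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \quad \text{with } -1 \le r_s \le +1 $$
where d is the difference between the two ranks of each item and n is the number of pairs.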
Example (data already ranked)
X and Y were judges at a beauty contest with 10 competitors.
Their rankings are shown below.
Competitor A B C D E F G H I J
X 4 9 2 5 3 10 6 7 8 1
Y 6 10 2 8 1 9 7 4 5 3
Solution:
Competitor     A    B    C    D    E    F    G    H    I    J
R1             4    9    2    5    3    10   6    7    8    1
R2             6    10   2    8    1    9    7    4    5    3
d = R1 − R2    −2   −1   0    −3   2    1    −1   3    3    −2
d²             4    1    0    9    4    1    1    9    9    4

n = 10, Σd² = 42

$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(42)}{10(10^2 - 1)} = 0.7455 $$
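The same result can be reproduced in a few lines of Python. This is a minimal sketch that assumes the data are already ranks with no ties; `spearman_from_ranks` is an illustrative name.

```python
def spearman_from_ranks(r1, r2):
    """Spearman's rank coefficient, r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_x = [4, 9, 2, 5, 3, 10, 6, 7, 8, 1]   # R1
judge_y = [6, 10, 2, 8, 1, 9, 7, 4, 5, 3]   # R2
rs = spearman_from_ranks(judge_x, judge_y)
print(round(rs, 4))  # 0.7455
```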
Solution:
Rate (X)   Rank of X (R1)   Rent (Y)   Rank of Y (R2)   d = R1 − R2   d²
1.68                        3.81
1.46                        4.19
1.57                        4.87
13.37                       22.85
3.18                        6.47
1.95                        6.48
1.07                        2.66
1.71                        6.49
1.22                        5.33
6.46                        15.23

Σd² =

n = 10,
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
r_s =
Note:
Sometimes two or more individuals or entries are tied in rank.
In this case, each is given the average of the ranks they would
otherwise occupy, as shown in the following example.
Salesman 1 2 3 4 5 6 7 8
Sales 20 35 25 20 35 40 20 10
Ranking
Solution:
X     R1    Y     R2    d = R1 − R2    d²
30          30          −5.5           30.25
31          14          5              25
32          30          −2             4
30          23          −1             1
46          32          −0.5           0.25
30          26          −3             9
19          20          −1             1
35          21          5              25
40          23          4.5            20.25
46          30          1.5            2.25
57          35          0              0
30          26          −3             9

Σd² = 127

n = 12,
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(127)}{12(12^2 - 1)} = 0.5559 $$
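Averaging tied ranks by hand is error-prone, so a small helper can be used to check the R1 and R2 columns. The sketch below assumes rank 1 goes to the smallest value, an ordering that reproduces the d column above; `average_ranks` is an illustrative name.

```python
def average_ranks(values):
    """Rank from smallest (rank 1) upward; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first position the value occupies
        count = ordered.count(v)               # how many times it is tied
        rank_of[v] = first + (count - 1) / 2   # mean of the occupied positions
    return [rank_of[v] for v in values]

X = [30, 31, 32, 30, 46, 30, 19, 35, 40, 46, 57, 30]
Y = [30, 14, 30, 23, 32, 26, 20, 21, 23, 30, 35, 26]

r1 = average_ranks(X)
r2 = average_ranks(Y)
n = len(X)
sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(rs, 4))  # 127.0 0.5559
```

Ranking both variables from largest to smallest instead flips the sign of every d but leaves d² and r_s unchanged.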
Spearman’s rank coefficient, r_s
▪ Only an approximation to the product moment coefficient
▪ Easier to use, with fewer calculations
LINEAR REGRESSION
▪ Regression is concerned with obtaining a mathematical
equation which describes the relationship between two
variables
▪ The equation can be used for comparison or estimation
purposes
Least squares regression line

$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$

$$ a = \bar{Y} - b\bar{X} \quad \text{or} \quad a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$
Notes:
For any set of bivariate data, the least squares regression line of Y
on X
1. is used to estimate a value of Y given a value of X
2. passes through the mean point (X̄, Ȳ) of the data
Example
Solution:
Let X = output in 000’s units; Y = total costs in RM’000.
X Y XY X²
20 82 1640 400
16 70 1120 256
24 90 2160 576
22 85 1870 484
18 73 1314 324
$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$
b =

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$
a =

∴ The regression line is Ŷ = a + bX =
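The two formulas translate directly into code, which is a convenient way to check the hand-worked values of a and b for the table above. This is a sketch in plain Python; `least_squares` is an illustrative name.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares regression line of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

output = [20, 16, 24, 22, 18]        # X, in 000's units
total_cost = [82, 70, 90, 85, 73]    # Y, in RM'000
a, b = least_squares(output, total_cost)
print(round(a, 4), round(b, 4))  # 28.0 2.6
```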
Example:
The data below relates the weekly maintenance cost ($) to the age
(in months) of machines of similar type in a manufacturing
company.
Machine 1 2 3 4 5 6 7 8 9 10
Age (X) 5 10 15 20 30 30 30 50 50 60
Cost (Y) 190 240 250 300 310 335 300 300 350 395
(a)
$$ b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - (\sum X)^2} $$
b =

$$ a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} $$
a =
300
Y=
(ii) When X = 80, Ŷ =
Comment:
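A fitted line can be used to estimate Y for any X, but it helps to check whether the chosen X lies inside the range of the observed data. The sketch below fits the machine data and flags X = 80 as lying outside the observed ages (5 to 60 months); the variable names are illustrative.

```python
age = [5, 10, 15, 20, 30, 30, 30, 50, 50, 60]                 # X, months
cost = [190, 240, 250, 300, 310, 335, 300, 300, 350, 395]     # Y, $

n = len(age)
sum_x, sum_y = sum(age), sum(cost)
sum_xy = sum(x * y for x, y in zip(age, cost))
sum_x2 = sum(x * x for x in age)

# Least squares coefficients of the line Y-hat = a + bX
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

x_new = 80
y_hat = a + b * x_new
is_extrapolation = not (min(age) <= x_new <= max(age))

print(round(a, 4), round(b, 4))
print(round(y_hat, 2), is_extrapolation)
```

An estimate flagged as an extrapolation relies on the linear pattern continuing beyond the observed data, so it is generally less reliable than one made within the range.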
Example:
If Ŷ = a + bX = 3.33 + 0.47X, then interpret the values of 'a' and
'b', where Y = sales ($'000) and X = advertising costs ($'00).
Solution:
Example:
If Ŷ = a + bX = 28 + 2.6X, then interpret the values of 'a' and 'b',
where Y = expenditure in $'000 and X = output in 000's units.
Solution:
a:
b:
Disadvantages:
(a) It assumes a linear relationship between the two variables,
whereas a non-linear relationship may exist.
Example
In Mr. Steve's physical fitness course, several fitness scores were taken. The
following sample is the number of push-ups and sit-ups done by ten randomly
selected students:
Student 1 2 3 4 5 6 7 8 9 10
Push-ups (X) 27 22 15 35 30 52 35 55 40 40
Sit-ups (Y) 30 26 25 42 38 40 32 54 50 43
Step 4: Labels in first row → Check this box if the first row of the
selected range contains the variable names.
Step 5: Output range → Enter the cell where the output should
start.
Figure 1
Scatter Diagram
Step 1: Highlight both columns of data. On the Insert tab, click the Scatter (X, Y)
chart command button. Select the Chart subtype that doesn’t include
any lines as shown in Figure 2.
Figure 2
Step 2: Right-click the x axis or y axis and click Format Axis. On the Format
Axis pane, set the desired Minimum and Maximum bounds as
appropriate. Additionally, you can change the Major units that control
the spacing between the gridlines.
Figure 3
Step 3: Add Axis Titles and a Trendline by clicking the Add Chart Element
Menu.
Figure 4
A plot of the data points (scatter plot) and the fitted regression line is shown in
Figure 5.
AAMS1773 QUANTITATIVE STUDIES
Tutorial 6 (CORRELATION AND SIMPLE LINEAR REGRESSION)
1. The following shows the number of price quotations issued and the number of
sales made by a random sample of 8 salesmen. The figures relate to the
same period of time:
Number of price quotations issued   105 213 96 157 114 103 237 185
Number of sales                     78 104 63 83 54 59 137 96
Employee A B C D E F G H
Weeks of experience 4 5 7 9 10 11 12 14
Number of rejects 21 22 15 18 14 14 11 13
=
a + bX in part (b).
(d) Calculate the Spearman's rank correlation coefficient.
4. The total monthly cost and the monthly output of electronic components in a
factory for ten months are tabulated below:
Output (‘000s) 21 3 5 24 19 15 11 9 14 9
Total cost ($’000) 65 30 31 71 54 52 40 33 45 38
(a) Calculate the product moment correlation coefficient and the coefficient of
determination. Interpret your answers.
(b) Find the least squares regression line of total cost on output.
(c) State the fixed cost of the factory and the average variable cost per unit of
production.
(d) Estimate the total cost if the production levels were:
(i) 8,500 units
(ii) 25,000 units
Comment on the accuracy of the estimates.
(c) Draw a scatter diagram and comment on the appropriateness of fitting a
straight line relationship. By the method of least squares draw the
regression line of best fit.
(d) What will be the expected total expenses when the volume of sales is 7500
units?
A 12 42
B 95 124
C 8 12
D 45 102
E 34 53
F 72 145
G 60 88
H 19 26
(a) Find the least squares regression line of number of complaints on selling
area. Interpret the values of the constants ‘a’ and ‘b’ of the regression line
Ŷ = a + bX.
(b) Hence, using the least squares line, estimate the number of complaints
received when the selling area of the store is:
(i) 5000 square meters; (ii) 10000 square meters.
(c) Which estimate obtained in part (b) would be more reliable? Give reasons.
Answers
1. (b) r = 0.9319
2. (a) r = 0.6370 (b) rs = 0.6339
3. (a) r = −0.8714, 75.93% (b) Ŷ = 24.8929 − 0.9881X, 23.9048 (d) −0.9107
4. (a) r = 0.9745, 94.97% (b) Ŷ = 19.5945 + 2.0235X
(c) $19,595, $2.02 per unit (d) (i) 36.7943 ($'000) (ii) 70.182 ($'000)
5. (a) r = 0.9963 (b) 99.26% (c) Ŷ = 51.3333 + 4.4X (d) 84.3333 ($'000)
6. (a) Ŷ = 12.9609 + 1.4154X (b) (i) 83.7309 (ii) 154.5009