Understanding Correlation in Statistics
CORRELATION
Definition:
Correlation is an analysis in which we study how closely two variables are related,
i.e. the degree of closeness of the relationship between the variables.
Explanation
When the values of two variables vary in such a way that movements (increases or
decreases) in one variable are connected with movements in the other variable, the
variables are said to be correlated. Correlation is therefore the degree of
simultaneous variation between the variables. In some cases the variables move in the
same direction, i.e. an increase in one variable is accompanied by an increase in the
other variable, and vice versa. When the variables move in the same direction, the
correlation is said to be positive. For example, an increase in the heights of
children is normally accompanied by an increase in their weights. Similarly, if the
variables move in opposite directions, the correlation is said to be negative or
inverse. For example, an increase in temperature during the winter is accompanied by
a decrease in the sale of warm clothes.
Business Mathematics and Statistics - Mr. Mwita B.C – Mwanza Campus. 2022 Page 1
Tanzania Institute of accountancy
o If both the variables tend to move in the same direction, such as, Y tends to
increase as X increases, or Y tends to decrease as X decreases, the correlation is
called positive or direct correlation.
o On the other hand, if the variables tend to move in opposite directions, such
as Y tends to decrease as X increases or vice versa, the correlation is said to be
negative or inverse correlation.
Interpretation
The coefficient of correlation lies between -1 and +1, i.e. it cannot be less than -1 or
greater than +1. Symbolically, it is written as $-1 \le r \le 1$. If there is perfect
positive correlation, r = +1, and if there is perfect negative correlation, r = -1.
A weaker degree of correlation gives a value of r closer to 0.
When r = 0, there is no linear correlation.
The interpretation of the coefficient of correlation as a measure of the strength of the
linear relationship between two variables is purely a mathematical interpretation and is
completely devoid of any cause or effect implications. The fact that two variables tend
to increase or decrease together does not imply that one has any direct or indirect
effect on the other. Both may be influenced by other variables in such a manner as to
give rise to a strong mathematical relationship. In that case the high correlation merely
reflects the common effect of the trend shared by the two variables.
Mean method.
$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}}$$
Example:
Compute the coefficient of linear correlation between the price (X) and the quantity
demanded (Y) for the following pair of values;
X 1 2 4 4 5 7 8 9
Y 1 3 4 6 8 9 11 14
Solution:
$$\bar{X} = \frac{\sum X}{n} = \frac{40}{8} = 5, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{56}{8} = 7$$
The computations needed to find the coefficient of correlation are given in the
following table.
X  Y  $(X - \bar{X})$  $(X - \bar{X})^2$  $(Y - \bar{Y})$  $(Y - \bar{Y})^2$  $(X - \bar{X})(Y - \bar{Y})$
1 1 -4 16 -6 36 24
2 3 -3 9 -4 16 12
4 4 -1 1 -3 9 3
4 6 -1 1 -1 1 1
5 8 0 0 1 1 0
7 9 2 4 2 4 4
8 11 3 9 4 16 12
9 14 4 16 7 49 28
Totals: $\sum X = 40$, $\sum Y = 56$, $\sum (X - \bar{X}) = 0$, $\sum (X - \bar{X})^2 = 56$,
$\sum (Y - \bar{Y}) = 0$, $\sum (Y - \bar{Y})^2 = 132$, $\sum (X - \bar{X})(Y - \bar{Y}) = 84$
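$$\begin{aligned}
r &= \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}} \\
  &= \frac{84}{\sqrt{(56)(132)}} \\
  &= \frac{84}{85.97} \\
  &= 0.977
\end{aligned}$$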
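The mean method above can be sketched in a few lines of Python. This is only an illustration of the table-based calculation; the function name `pearson_r` and the variable names are assumptions of this sketch, not part of the original notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Coefficient of correlation by the mean (deviation) method."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each observation from its mean
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))  # sum of (X - mean X)(Y - mean Y)
    sum_dx2 = sum(a * a for a in dx)               # sum of (X - mean X)^2
    sum_dy2 = sum(b * b for b in dy)               # sum of (Y - mean Y)^2
    return sum_dxdy / sqrt(sum_dx2 * sum_dy2)

price = [1, 2, 4, 4, 5, 7, 8, 9]        # X from the example
quantity = [1, 3, 4, 6, 8, 9, 11, 14]   # Y from the example
r = pearson_r(price, quantity)
print(round(r, 3))  # 0.977
```

The three sums inside the function correspond to the column totals 84, 56 and 132 in the table.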
Short Method:
There is a shorter formula for computing the coefficient of correlation. It involves
fewer computations than the previous formula because it works directly with the sums
of X, Y, XY, X² and Y², without first finding deviations from the means. The short
computational formula is as follows:
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}}$$
Example:
Find the coefficient of linear correlation, by using the short computational formula,
using the following pairs of values;
X 1 2 3 4 5 6 7 8
Y 3 4 6 8 10 12 14 15
Solution:
X  Y  X²  Y²  XY
1 3 1 9 3
2 4 4 16 8
3 6 9 36 18
4 8 16 64 32
5 10 25 100 50
6 12 36 144 72
7 14 49 196 98
8 15 64 225 120
∑X = 36  ∑Y = 72  ∑X² = 204  ∑Y² = 790  ∑XY = 401
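$$\begin{aligned}
r &= \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[n\sum X^2 - (\sum X)^2\right]\left[n\sum Y^2 - (\sum Y)^2\right]}} \\
  &= \frac{8(401) - (36)(72)}{\sqrt{\left[8(204) - (36)^2\right]\left[8(790) - (72)^2\right]}} \\
  &= \frac{3208 - 2592}{\sqrt{(1632 - 1296)(6320 - 5184)}} \\
  &= \frac{616}{\sqrt{(336)(1136)}} \\
  &= \frac{616}{617.8155} \\
  &= 0.997
\end{aligned}$$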
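As a cross-check on the arithmetic, the short computational formula can be sketched in Python. The function name `pearson_r_short` is an assumption made for this illustration.

```python
from math import sqrt

def pearson_r_short(xs, ys):
    """Coefficient of correlation by the short computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3, 4, 6, 8, 10, 12, 14, 15]
r_short = pearson_r_short(xs, ys)
print(round(r_short, 3))  # 0.997
```

Only the five column totals of the table are needed, which is why this form is convenient for hand calculation.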
Coefficient of Determination
One very convenient and useful way of interpreting the value of the coefficient of
correlation is to use its square. The square of the coefficient of correlation is
called the coefficient of determination.
Coefficient of determination = $r^2$
The coefficient of determination is the ratio of the explained variance to the total
variance. For example, suppose the value of r = 0.9; then $r^2 = 0.81 = 81\%$.
This means that 81% of the variation in the dependent variable is explained by
(determined by) the independent variable. The remaining 19% of the variation in the
dependent variable is not explained by the independent variable. This 19% is
called the coefficient of non-determination.
Coefficient of non-determination ($K^2$) = 1 - coefficient of determination
Example:
Calculate coefficient of determination and non-determination if coefficient of correlation
is 0.8
Solution:
r = 0.8
Coefficient of determination = $r^2 = (0.8)^2 = 0.64 = 64\%$
Coefficient of non-determination = $1 - r^2 = 1 - 0.64 = 0.36 = 36\%$
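The same split into explained and unexplained variation can be written as a small Python helper. The name `determination_split` is hypothetical and used only for this sketch.

```python
def determination_split(r):
    """Return (coefficient of determination, coefficient of non-determination)."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared

explained, unexplained = determination_split(0.8)
# Show both shares as whole percentages
print(f"{explained:.0%} explained, {unexplained:.0%} unexplained")
```

Note that the sign of r is lost when squaring, so r = 0.8 and r = -0.8 give the same coefficient of determination.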
Spearman’s rank correlation coefficient:
$$R = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
where D = difference of ranks between the two variables
n = number of pairs
Example:
Find the rank correlation coefficient between poverty and overcrowding from the
information given below:
Town: A B C D E F G H I J
Poverty: 17 13 15 16 6 11 14 9 7 12
Overcrowding: 36 46 35 24 12 18 27 22 2 8
Solution:
Here ranks are not given, so we have to assign ranks first (rank 1 goes to the largest
value of each variable).
Town  Poverty  Overcrowding  R1  R2  D = R2 - R1  D²
A 17 36 1 2 1 1
B 13 46 5 1 -4 16
C 15 35 3 3 0 0
D 16 24 2 5 3 9
E 6 12 10 8 -2 4
F 11 18 7 7 0 0
G 14 27 4 4 0 0
H 9 22 8 6 -2 4
I 7 2 9 10 1 1
J 12 8 6 9 3 9
∑D² = 44
Spearman’s rank correlation coefficient, where D = difference of ranks between the two
variables and n = 10:
$$R = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 44}{10(10^2 - 1)} = 1 - \frac{264}{990} = 1 - 0.2667 = +0.733$$
Therefore, there is a high degree of positive correlation between poverty and
overcrowding.
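The ranking and the formula can be sketched together in Python. This sketch assumes there are no tied values (true for this example) and gives rank 1 to the largest value; the function names are illustrative only.

```python
def ranks_descending(values):
    """Rank 1 for the largest value. Assumes all values are distinct (no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_r(xs, ys):
    """Spearman's rank correlation: R = 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

poverty = [17, 13, 15, 16, 6, 11, 14, 9, 7, 12]
overcrowding = [36, 46, 35, 24, 12, 18, 27, 22, 2, 8]
print(ranks_descending(poverty))                      # R1 column of the table
print(round(spearman_r(poverty, overcrowding), 3))    # 0.733
```

Because D is squared, it does not matter whether D is taken as R2 - R1 or R1 - R2.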
Example:
The following ranks were given by three judges in a beauty contest. Determine which
pair of judges has the nearest approach to common tastes in beauty.
Judge I: 1 6 5 10 3 2 4 9 7 8
Judge II: 3 5 8 4 7 10 2 1 6 9
Judge III: 6 4 9 8 1 2 3 10 5 7
Solution:
Spearman’s rank correlation coefficient
$$R = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
where D = difference of ranks between the two variables
n = 10
The rank correlation coefficient for judges I and III is greater than for the other two
pairs. Therefore, judges I and III have the highest similarity of thought and the nearest
approach to common tastes in beauty.
Example:
The coefficient of rank correlation of the marks obtained by 10 students in Statistics and
English was 0.2. It was later discovered that the difference in ranks of one of the
students was wrongly taken as 7 instead of 9. Find the correct result.
Solution:
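One way to approach this problem, sketched here in Python, is to recover ∑D² from the reported coefficient, replace the wrong squared difference with the right one, and recompute R. The function name and argument names are assumptions of this sketch.

```python
def corrected_rank_correlation(r_reported, n, wrong_d, right_d):
    """Recompute Spearman's R after correcting one rank difference."""
    denom = n * (n ** 2 - 1)
    # From R = 1 - 6*sum(D^2)/denom, solve for the reported sum of D^2
    sum_d2_reported = (1 - r_reported) * denom / 6
    # Swap the wrong squared difference for the right one
    sum_d2_corrected = sum_d2_reported - wrong_d ** 2 + right_d ** 2
    return 1 - 6 * sum_d2_corrected / denom

corrected = corrected_rank_correlation(0.2, n=10, wrong_d=7, right_d=9)
print(round(corrected, 4))
```

With these figures the reported ∑D² is 132 and the corrected ∑D² is 132 - 49 + 81 = 164, so the corrected coefficient is 1 - 984/990, which is very close to zero.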
Example:
The coefficient of rank correlation between marks in English and Maths obtained by a
group of students is 0.8. If the sum of the squares of the differences in ranks is given
to be 33, find the number of students in the group.
Solution:
$$R = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
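Rearranging the formula gives n(n² - 1) = 6∑D² / (1 - R), which can be solved for a whole number n by trial. A minimal Python sketch of that search follows; the function name and the `max_n` search limit are assumptions of this illustration.

```python
def students_from_rank_correlation(r, sum_d2, max_n=1000):
    """Find the integer n with n(n^2 - 1) = 6*sum_d2 / (1 - r), if one exists."""
    target = 6 * sum_d2 / (1 - r)
    for n in range(2, max_n + 1):
        # Small tolerance because target is a floating-point value
        if abs(n * (n ** 2 - 1) - target) < 1e-6:
            return n
    return None  # no whole-number solution within the search range

n_students = students_from_rank_correlation(0.8, 33)
print(n_students)  # 10
```

Here 6(33)/(1 - 0.8) = 990, and 10(10² - 1) = 990, so the group has 10 students.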
Example:
Obtain rank correlation co-efficient for the data:-
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
REGRESSION
Definition
Regression is a process by which we estimate one variable, the dependent variable, on
the basis of another variable, the independent variable.
Dependent Variable
The dependent variable is the one which is to be estimated or predicted. It is also
known as the regressand, predicted variable, response variable or explained variable.
The dependent variable, whose values are determined on the basis of the independent
variable, is called a random variable.
Independent Variable
The variable on the basis of which the dependent variable is predicted or estimated is
called the independent variable. The independent variable is also called the regressor,
predictor, regression variable or explanatory variable. The values of the independent
variable are chosen by the experimenter and are assumed to be fixed.
For Example
If we want to estimate the heights of children on the basis of their ages, then the height
will be the dependent variable and age will be the independent variable. Similarly, if we
want to calculate yield of crop on the basis of amount of fertilizer used, then the yield
will be dependent variable and the amount of fertilizer used will be independent
variable.
Explanations
If X is the independent variable and Y is the dependent variable, and the relationship
between them is linear, then the relation between the two variables is described by a
straight line. Mathematically, the straight line is represented as follows:
Y = a + bX
The above equation is called a linear equation. It is the usual practice to express the
relation between the variables in the form of an equation. Such an equation connects the
two variables.
To solve such an equation, we first need to collect data showing corresponding values
of the variables, which will then be used for solving the problem. Suppose X and Y
denote respectively the heights and weights of students. Then a sample of n students
gives the heights X1, X2, ..., Xn and the corresponding weights Y1, Y2, ..., Yn.
The relationship between two variables is divided into four categories, which are as
follows:
i. Positive Linear Relationship
ii. Negative Linear Relationship
[Figures: Graph 1, Graph 2, Graph 3, Graph 4]
Calculation of Regression
The problem of regression is solved by the method of least squares. Since the regression
is represented by a linear equation, the least squares equation is also linear,
i.e. Y = a + bX. The normal equations for regression are as follows. Summing both sides
of the linear equation over the n observations gives
$$\sum Y = \sum (a + bX) = na + b\sum X \qquad \text{(i)}$$
The second equation is obtained by multiplying the linear equation by X and then
summing both sides of the equation, i.e.
$$\sum XY = \sum (aX + bX^2) = a\sum X + b\sum X^2 \qquad \text{(ii)}$$
By solving Eq. (i) and (ii) simultaneously we obtain the formulas for the constants a
and b, which are provided below:
$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2}$$
$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}$$
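The elimination step that leads from the two normal equations to the formula for b is not shown above. One way to carry it out, given here as a sketch, is to multiply the first normal equation by ∑X and the second by n, then subtract:

```latex
\begin{aligned}
(\textstyle\sum X)(\sum Y) &= na\sum X + b(\sum X)^2 && \text{first normal equation} \times \textstyle\sum X \\
n\textstyle\sum XY &= na\sum X + nb\sum X^2 && \text{second normal equation} \times n \\
n\textstyle\sum XY - (\sum X)(\sum Y) &= b\left[n\sum X^2 - (\sum X)^2\right] && \text{subtracting} \\
b &= \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
\end{aligned}
```

Once b is known, the first normal equation gives $a = \dfrac{\sum Y - b\sum X}{n} = \bar{Y} - b\bar{X}$, which is algebraically the same as the formula for a stated above.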
For Example:
Find the least squares line of regression for the following values of X and Y:
i) taking X as the independent variable;
ii) estimate the productivity index of a worker whose test score is 92, using the
regression equation of Y on X.
X 1 3 5 6 7 9 10 13
Y 1 2 5 5 6 7 7 8
Solution:
For obtaining the least squares line of regression we will use the following table:
X  Y  XY  X²  Y²
1 1 1 1 1
3 2 6 9 4
5 5 25 25 25
6 5 30 36 25
7 6 42 49 36
9 7 63 81 49
10 7 70 100 49
13 8 104 169 64
∑X = 54  ∑Y = 41  ∑XY = 341  ∑X² = 470  ∑Y² = 253
n = 8
We will solve the problem by using the formulas for the constants a and b.
i.
$$\begin{aligned}
a &= \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n\sum X^2 - (\sum X)^2} \\
  &= \frac{(41)(470) - (54)(341)}{8(470) - (54)^2} \\
  &= \frac{19270 - 18414}{3760 - 2916} \\
  &= \frac{856}{844} \\
  &= 1.014
\end{aligned}$$
$$\begin{aligned}
b &= \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} \\
  &= \frac{8(341) - (54)(41)}{8(470) - (54)^2} \\
  &= \frac{2728 - 2214}{3760 - 2916} \\
  &= \frac{514}{844} \\
  &= 0.609
\end{aligned}$$
For X as independent variable, the least square line for regression will be
Y = 1.014 + 0.609 X
ii. From Y = 1.014 + 0.609X, substituting X = 92 gives Y = 1.014 + 0.609(92) ≈ 57.04.
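The formulas for a and b can be checked against the worked example with a short Python sketch. The function names `least_squares_line` and `predict` are assumptions made for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2  # common denominator of both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

def predict(a, b, x):
    """Estimate Y for a given X from the fitted line."""
    return a + b * x

xs = [1, 3, 5, 6, 7, 9, 10, 13]
ys = [1, 2, 5, 5, 6, 7, 7, 8]
a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))  # 1.014 0.609
print(predict(a, b, 92))         # estimate of Y at X = 92
```

The intermediate values match the table totals: the numerators are 856 and 514, and the common denominator is 844.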
You can also use the following Normal equations to determine the value of a and b.
SOLUTION
X Y X2 Y2 XY
10 15 100 225 150
DISCUSSION QUESTIONS
QUESTION 1:
1 1 1
2 2 1
3 3 2
4 4 2
5 5 4
Find the sample regression line and predict the sales revenue if the appliance store
spends 4.5 thousand TShs for advertising in a month.
QUESTION 2:
(a) From the following information draw a scatter diagram and by the method of
least squares draw the regression line of best fit.
(b) What will be the total expenses when the volume of sales is 7,500 units?
(c) If the selling price per unit is TShs 11, at what volume of sales will the total income
from sales equal the total expenses?
QUESTION 3:
The grades of a class of 9 students on a midterm report (x) and on the final examination
(y) are as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 68
(b) Estimate the final examination grade of a student who received a grade of
85 on the midterm report but was ill at the time of the final examination.
STUDY QUESTIONS
CORRELATION & REGRESSION
STQ 1:
The following data relate to the number of bicycles and motorcycles owned in seven
cities of Tanzania.
Cities 1 2 3 4 5 6 7
Bicycles per 1000 population 245 236 238 232 250 247 252
Motorcycles per 10,000 population 23 35 18 36 41 43 48
Calculate the value of Spearman’s rank correlation coefficient.
STQ 2
The following data shows the average rent and rates (sh. per square foot) for a selection
of areas. Calculate Spearman’s rank correlation to assess whether there is any
correlation between rates and rents.
Rates (x) 168 146 157 1337 318 195 107 171 122 646
Rent (y) 381 419 487 2285 647 648 266 649 533 1523
STQ 3
The following data relate to the number of vehicles owned and road deaths for the
populations of 12 countries.
Vehicles per 100 population: 30 31 32 30 46 30 19 35 40 46 57 30
Road deaths per 100,000 population: 30 14 30 23 32 26 20 21 23 30 35 26
REQUIRED:
Calculate Spearman’s rank correlation coefficient and comment on the result.
STQ 4
The following figures give (in units of sh.10m) the turnover and profit before taxation
for a firm.
Turnover 106 125 147 167 187 220
Profit 10 12 16 17 18 22
REQUIRED
Calculate the coefficient of determination for this data and comment on the result.
STQ 5
A cost accountant has derived the total cost (sh.0000) against output (000) of standard
size boxes from a factory over a period of ten weeks, yielding the following data.
Output 20 2 4 23 18 14 10 8 13 8
Cost 60 25 26 66 49 48 35 18 40 33
REQUIRED
i). Calculate the product moment coefficient of correlation
ii). Interpret the result with a view to future extrapolation
STQ 6
The following data shows median regional incomes for men aged 21 years and over in
full-time employment and average regional house purchase prices for a particular year
for twelve major regions of the URT
Median income (sh): 57 54 54 51 63 56 52 56 55 55 56 50
House purchase price (sh.000): 10 9 19 12 15 15 12 11 10 10 11 10
REQUIRED
Calculate:
a) The product moment correlation coefficient
b) Spearman’s rank correlation coefficient
c) Coefficient of determination
REQUIRED
Using Spearman’s rank correlation coefficient, determine whether the consumer
generally gets value for money.
QUESTION 2
Marks and Spencer plc
REQUIRED
i). Plot a scatter diagram showing the relationship between profit before taxation
and turnover
ii). Calculate the least squares regression line of profit before taxation on turnover
iii). Calculate the coefficient of correlation and the coefficient of determination
iv). Comment generally on your results
QUESTION 3
A company keeps extensive records on its sales people on the premise that sales should
increase with experience. A random sample of eight new sales people produced data on
experience and sales provided in the following table below:
Months on the job 2 4 8 12 1 5 9 7
Monthly sales (sh.000) 2.4 7.0 11.3 15.0 0.8 3.7 12.0 5.2
REQUIRED
i). Use the regression line to estimate quantitatively the relationship between the
number of months on the job and the level of monthly sales.
ii). Compute and interpret both the coefficient of correlation and that of
determination
iii). Estimate the level of sales in sh. if the experience of the sales people is exactly 10
months.
QUESTION 4
The following data gives the actual sales (in millions of shillings) of a company in each
of 8 regions of a country together with the forecast of sales by two different methods
QUESTION 5
Brand A B C D E F G H
Panel of Women (X) 5 4 3 6 7 8 1 2
Determine how closely men’s and women’s tastes in tea are related.
b) After investigation it has been found that the demand for automobiles in a town
depends mainly, if not entirely, upon the number of families residing in that
town. Below are given figures for the sales of automobiles in the five cities for the
year 1986, and the number of families residing in those cities
A 70 25.2
B 75 28.6
C 80 30.2
D 60 22.3
E 90 35.4
REQUIRED
Fit a linear equation of Y on X by the least square method and estimate the sales for the
year 1988 for city A which is estimated to have 300,000 families assuming that the same
relationship holds true.
QUESTION 6:
The following data give the experience of machine operators and their performance
ratings as given by the number of good parts turned out per 100 pieces.
Operator 1 2 3 4 5 6 7 8
Experience (X) 16 12 18 4 3 10 5 12
REQUIRED
(a) Calculate the regression lines of performance ratings on experience.
QUESTION 7:
Coca-Cola Company provides you with the following table showing the number of
units of goods produced and total cost incurred for the period of ten years as follows:
(sh.'000,000')
2003 10 40
2004 15 45
2005 30 50
2006 40 65
2007 55 70
2008 60 70
2009 70 80
2010 90 85
2011 95 90
2012 100 98
REQUIRED
i) Determine the linear regression equation
ii) Coca-Cola Company plans to produce 120,000 units in year 2013, estimate the
expected total cost of production.
iii) Estimate Pearson's correlation coefficient of the variables and interpret
your answer.
iv) Estimate the coefficient of determination and interpret your answer.