Correlation
Applied Mathematics-II
Module – II: CORRELATION
Introduction:
In today's business world we come across many activities that depend on each
other, and many business problems involve two or more variables. Identifying these
variables and how they depend on one another helps in resolving such problems.
Often two variables move in the same direction, both increasing or both
decreasing. At other times an increase in one variable is accompanied by a decline in
another. Examples include family income and expenditure, the price of a product and
its demand, and advertisement expenditure and sales volume. If two quantities vary in
such a way that movements in one are accompanied by movements in the other, then
these quantities are said to be correlated.
Meaning:
Correlation is a statistical technique to ascertain the association between two
or more variables; correlation analysis studies the degree and direction of that
relationship.
A correlation coefficient is a statistical measure of the degree to which changes
in the value of one variable predict changes in the value of another. When the
fluctuation of one variable reliably predicts a similar fluctuation in another variable,
there is often a tendency to think that the change in one causes the change in the
other. Correlation by itself, however, does not establish such a causal link.
Uses of correlations:
1. Correlation analysis helps in determining precisely the degree and the direction
of the relationship between variables.
2. The effect of correlation is to reduce the range of uncertainty of our prediction.
A prediction based on correlation analysis will be more reliable and nearer to
reality.
3. Correlation analysis contributes to the understanding of economic behavior,
aids in locating the critically important variables on which others depend, may
reveal to the economist the connections by which disturbances spread, and may
suggest the paths through which stabilizing forces may become effective.
4. Economic theory and business studies show relationships between variables
such as price and quantity demanded, or advertising expenditure and sales
promotion measures.
5. The coefficient of correlation is a relative measure of change.
GLA University Mathura BCSS 0152
Types of Correlation:
Correlation is described or classified in several different ways. Three of the
most important are:
I. Positive and Negative
II. Simple, Partial and Multiple
III. Linear and non-linear
I. Positive, Negative and Zero Correlation:
Whether correlation is positive (direct) or negative (inverse) depends upon
the direction of change of the variables.
Positive Correlation: If both the variables vary in the same direction, correlation is
said to be positive. That is, if one variable increases and the other, on average,
also increases, or if one variable decreases and the other, on average, also
decreases, then the correlation is said to be positive. For example, the
correlation between heights and weights of a group of persons is a positive
correlation.
Height (cm): X 158 160 163 166 168 171 174 176
Weight (kg): Y 60 62 64 65 67 69 71 72
Negative Correlation: If the two variables vary in opposite directions, the correlation
is said to be negative. That is, if one variable increases while the other
decreases, or one variable decreases while the other increases, then the
correlation is said to be negative. For example, the correlation between the
price of a product and its demand is a negative correlation.
Price of Product (Rs. Per Unit): X 6 5 4 3 2 1
Demand (In Units): Y 75 120 175 250 215 400
Zero Correlation: Strictly speaking, this is not a type of correlation, but it is still
called zero or no correlation. When we do not find any relationship between the
variables, the correlation is said to be zero. It means that a change in the value of one
variable is not accompanied by any systematic change in the value of the other. For
example, the correlation between weight of a person and intelligence is a zero or no
correlation.
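As a quick numerical check of the two small tables above, the following sketch computes the sign of the correlation coefficient for the height/weight and price/demand data. The helper name `pearson_r` is illustrative only and is not part of these notes; it uses the deviation form of Karl Pearson's coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the arithmetic means (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_x2 * sum_y2)

# Data copied from the two example tables in this section
height = [158, 160, 163, 166, 168, 171, 174, 176]
weight = [60, 62, 64, 65, 67, 69, 71, 72]
price = [6, 5, 4, 3, 2, 1]
demand = [75, 120, 175, 250, 215, 400]

r_height_weight = pearson_r(height, weight)
r_price_demand = pearson_r(price, demand)

# A positive sign indicates direct correlation, a negative sign inverse correlation
print("height vs weight:", "positive" if r_height_weight > 0 else "negative")
print("price vs demand:", "positive" if r_price_demand > 0 else "negative")
```

The sign alone gives the direction; the magnitude (the degree of correlation) is discussed later in the notes.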
II. Simple, Partial and Multiple Correlation:
The distinction between simple, partial and multiple correlation is based upon
the number of variables studied.
Simple Correlation: When only two variables are studied, it is a case of simple
correlation. For example, when one studies relationship between the marks secured
by student and the attendance of student in class, it is a problem of simple correlation.
Partial Correlation: In partial correlation, one studies three or more variables
but considers only two of them to be influencing each other, with the effect of the
other influencing variables held constant. For example, in the above example of the
relationship between student marks and attendance, other influencing variables
such as the effectiveness of teaching and the use of teaching aids like computers
and smart boards are assumed to be constant.
Illustration 01:
State in each case whether there is
(a) Positive Correlation
(b) Negative Correlation
(c) No Correlation
Sl. No   Particulars                                        Solution
1        Price of commodity and its demand                  Negative
2        Yield of crop and amount of rainfall               Positive
3        No. of fruits eaten and hunger of a person         Negative
4        No. of units produced and fixed cost per unit      Negative
5        No. of girls in the class and marks of boys        No Correlation
6        Ages of husbands and wives                         Positive
7        Temperature and sale of woolen garments            Negative
8        Number of cows and milk produced                   Positive
9        Weight of person and intelligence                  No Correlation
10       Advertisement expenditure and sales volume         Positive
Scatter Diagram:
This is a graphic method of measuring correlation. It is a diagrammatic
representation of bivariate data used to ascertain the relationship between two
variables. Under this method the given data are plotted on graph paper as dots, i.e.
for each pair of X and Y values we put a dot, and thus obtain as many points as the
number of observations. Usually, the independent variable is shown on the X-axis
and the dependent variable on the Y-axis. Once the values are plotted, the graph
reveals the type of correlation between the variables X and Y, that is, whether the
movements in one series are associated with those in the other series.
• Perfect Positive Correlation: In this case, the points lie on a straight line
rising from the lower left-hand corner to the upper right-hand corner.
• Perfect Negative Correlation: In this case, the points lie on a straight line
falling from the upper left-hand corner to the lower right-hand corner.
• High Degree of Positive Correlation: In this case, the plotted points fall in a
narrow band, wherein points show a rising tendency from the lower left-hand
corner to the upper right-hand corner.
• High Degree of Negative Correlation: In this case, the plotted points fall in a
narrow band, wherein points show a declining tendency from upper left-hand
corner to the lower right-hand corner.
• Low Degree of Positive Correlation: In this case, the plotted points are widely
scattered over the diagram but still show a rising tendency from the lower
left-hand corner to the upper right-hand corner.
• Low Degree of Negative Correlation: In this case, the plotted points are widely
scattered over the diagram but still show a declining tendency from the upper
left-hand corner to the lower right-hand corner.
• Zero (No) Correlation: When plotted points are scattered over the graph
haphazardly, then it indicates that there is no correlation or zero correlation
between two variables.
[Diagrams I to VII: scatter diagrams illustrating the seven cases listed above.]
Illustration 02:
Given the following pairs of values:
Capital Employed (Rs. In Crore) 1 2 3 4 5 7 8 9 11 12
Profit (Rs. In Lakhs) 3 5 4 7 9 8 10 11 12 14
(a) Draw a scatter diagram
(b) Do you think that there is any correlation between profits and capital
employed? Is it positive or negative? Is it high or low?
Solution:
From the scatter diagram, we can say that the variables are positively
correlated. In the diagram the points trend upward from the lower left-hand
corner to the upper right-hand corner, hence the correlation is positive.
The plotted points lie in a narrow band, which indicates that it is a case of
high degree of positive correlation.
[Scatter diagram: Capital Employed (Rs. in Crore) on the X-axis against Profit (Rs. in Lakhs) on the Y-axis.]
The formulas above can be used in different situations, depending upon the
information given in the problem.
Illustration 03:
From the following information, find the correlation coefficient between advertisement
expenses and sales volume using Karl Pearson's coefficient of correlation method.
Firm 1 2 3 4 5 6 7 8 9 10
Advertisement Exp. (Rs. In Lakhs) 11 13 14 16 16 15 15 14 13 13
Sales Volume (Rs. In Lakhs) 50 50 55 60 65 65 65 60 60 50
Solution:
Let us assume that advertisement expenses are variable X and sales volume are
variable Y.
Calculation of Karl Pearson’s coefficient of correlation
Firm     X     Y    x = X – X̄    x²    y = Y – Ȳ    y²    xy
1       11    50        -3        9        -8        64    24
2       13    50        -1        1        -8        64     8
3       14    55         0        0        -3         9     0
4       16    60         2        4         2         4     4
5       16    65         2        4         7        49    14
6       15    65         1        1         7        49     7
7       15    65         1        1         7        49     7
8       14    60         0        0         2         4     0
9       13    60        -1        1         2         4    -2
10      13    50        -1        1        -8        64     8
Total  140   580                 22                 360    70
        ∑X    ∑Y                ∑x²                ∑y²    ∑xy

\[ \bar{X} = \frac{\sum X}{n} = \frac{140}{10} = 14, \qquad \bar{Y} = \frac{\sum Y}{n} = \frac{580}{10} = 58 \]

\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} = \frac{70}{\sqrt{22 \times 360}} = \frac{70}{88.9944} = 0.7866 \]
Interpretation: The calculation shows a high degree of positive correlation
(r = 0.7866) between the two variables, i.e. higher advertisement expenses
tend to be accompanied by higher sales volume.
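The table-based working of Illustration 03 can be mirrored in a short sketch. The variable names are illustrative; the data and the deviation formula are those of the illustration.

```python
from math import sqrt

# Data from Illustration 03
X = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]  # advertisement expenses (Rs. in lakhs)
Y = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]  # sales volume (Rs. in lakhs)

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Deviations from the arithmetic means (the x and y columns of the table)
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)
sum_xy = sum(a * b for a, b in zip(x, y))

r = sum_xy / sqrt(sum_x2 * sum_y2)
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2}, sum xy = {sum_xy}")
print(f"r = {r:.4f}")
```

The printed totals can be compared directly with the bottom row of the table.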
Illustration 04:
Find the correlation coefficient between age and playing habits of the following
students using Karl Pearson’s coefficient of correlation method.
Age 15 16 17 18 19 20
Number of students 250 200 150 120 100 80
Regular Players 200 150 90 48 30 12
Solution:
To find the correlation between age and playing habits of the students, we need to
compute the percentages of students who are having the playing habit.
Percentage of playing habits = (No. of Regular Players / Total No. of Students) × 100
Now, let us assume that ages of the students are variable X and percentages of playing
habits are variable Y.
Interpretation: The calculation shows a high degree of negative correlation
(r = -0.9912) between age and playing habits, i.e. the playing habit among
students declines as their age increases.
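The intermediate working for Illustration 04 can be sketched as follows: first the percentages are computed, then the deviation form of Karl Pearson's formula is applied with age as X and percentage as Y. Variable names are illustrative.

```python
from math import sqrt

# Data from Illustration 04
age = [15, 16, 17, 18, 19, 20]
students = [250, 200, 150, 120, 100, 80]
players = [200, 150, 90, 48, 30, 12]

# Y: percentage of regular players in each age group
pct = [100 * p / s for p, s in zip(players, students)]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(pct) / n

x = [a - mean_x for a in age]
y = [p - mean_y for p in pct]

sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)

r = sum_xy / sqrt(sum_x2 * sum_y2)
print("percentages:", pct)
print(f"r = {r:.4f}")
```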
Illustration 05:
Find Karl Pearson’s coefficient of correlation between capital employed and profit
obtained from the following data.
Capital Employed (Rs. In Crore) 10 20 30 40 50 60 70 80 90 100
Profit (Rs. In Crore) 2 4 8 5 10 15 14 20 22 50
Solution:
Let us assume that capital employed is variable X and profit is variable Y.
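The rest of this solution is not reproduced in these notes. One way to carry it out, sketched below, is the raw-sums (direct) form of Pearson's formula, which is algebraically equivalent to the deviation form used in Illustration 03. The choice of this form, and the variable names, are assumptions.

```python
from math import sqrt

# Data from Illustration 05
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # capital employed (Rs. in crore)
Y = [2, 4, 8, 5, 10, 15, 14, 20, 22, 50]       # profit (Rs. in crore)

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(a * b for a, b in zip(X, Y))
sum_x2 = sum(a * a for a in X)
sum_y2 = sum(b * b for b in Y)

# r = (N*SXY - SX*SY) / sqrt((N*SXX - SX^2) * (N*SYY - SY^2))
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(f"r = {r:.4f}")
```

Under these assumptions the coefficient comes out at roughly 0.85, which would indicate a high degree of positive correlation between capital employed and profit.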
Illustration 06:
A computer while calculating the correlation coefficient between the variable X and Y
obtained the following results:
N = 30; ∑X = 120 ∑X2 = 600 ∑Y = 90 ∑Y2 = 250 ∑XY = 335
It was, however, later discovered at the time of checking that it had copied down two
pairs of observations as: (X, Y): (8, 10) (12, 7)
While the correct values were: (X, Y): (8, 12) (10, 8)
Obtain the correct value of the correlation coefficient between X and Y.
Solution:
Correct ∑X = 120 – 8 – 12 + 8 + 10 = 118
Correct ∑X² = 600 – 8² – 12² + 8² + 10² = 556
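The remaining corrections follow the same pattern: subtract the contribution of each wrongly copied pair and add that of the correct pair. The sketch below applies this to all five sums and then uses the raw-sums form of Pearson's formula. The working beyond ∑X and ∑X² is not shown in these notes, so the remaining steps are a reconstruction under that assumption.

```python
from math import sqrt

# Reported (incorrect) totals from Illustration 06
n = 30
sum_x, sum_x2 = 120, 600
sum_y, sum_y2 = 90, 250
sum_xy = 335

wrong_pairs = [(8, 10), (12, 7)]
correct_pairs = [(8, 12), (10, 8)]

# Remove the wrongly copied pairs from every total
for x, y in wrong_pairs:
    sum_x -= x
    sum_y -= y
    sum_x2 -= x * x
    sum_y2 -= y * y
    sum_xy -= x * y

# Add the correct pairs back in
for x, y in correct_pairs:
    sum_x += x
    sum_y += y
    sum_x2 += x * x
    sum_y2 += y * y
    sum_xy += x * y

r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)
print(f"corrected r = {r:.4f}")
```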
Illustration 07:
The coefficient of correlation between X and Y is 0.3. Their covariance is 9. The
variance of X is 16. Find the standard deviation of the Y series.
Solution:
Given information:
r = 0.3 Cov (X, Y) = 9 Var (X) = 16
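\[ r = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}} \]

\[ 0.3 = \frac{9}{\sqrt{16 \times \operatorname{Var}(Y)}} = \frac{9}{4\sqrt{\operatorname{Var}(Y)}} \]

Hence √Var(Y) = 9 / (4 × 0.3) = 7.5, so the standard deviation of the Y series is 7.5.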
Illustration 08:
Calculate correlation coefficient from the following two-way table, with X representing
the average salary of families selected at random in a given area and Y representing
the average expenditure on entertainment.
Expenditure on                        Average Salary (Rs. '000)
Entertainment (Rs. '000)   100-150   150-200   200-250   250-300   300-350
0 – 10                        5         4         5         2         4
10 – 20                       2         7         3         7         1
20 – 30                       -         6         -         4         5
30 – 40                       8         -         4         -         8
40 – 50                       -         7         3         5        10
Solution:
Let us assume that Average Salary is variable X and Expenditure on
Entertainment is variable Y.
In the case of grouped data, we follow the assumed mean method to
calculate Karl Pearson's coefficient of correlation. The steps are:
1. Identify the mid-points of the class intervals for variables X and Y.
2. Choose an assumed mean from the mid-points identified above, for both X and Y.
3. Compute the deviations of the mid-points from the assumed mean and, to simplify
further, divide them by a common factor to obtain the step deviations dx and dy.
4. Add the cell values row-wise and column-wise to obtain the frequencies (f).
The sum of either the row totals or the column totals gives N.
5. Multiply dx, dy and the corresponding frequency (f) in each cell. The figure
thus obtained, written alongside the frequency in each cell, is the value
of fdxdy.
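Each cell shows the frequency f, with the value of fdxdy in parentheses.

Y class   Mid-point   100-150   150-200   200-250   250-300   300-350    f    dy   fdy   fdy²   fdxdy
0 – 10        5       5 (20)    4 (8)     5 (0)     2 (-4)    4 (-16)    20   -2   -40    80      8
10 – 20      15       2 (4)     7 (7)     3 (0)     7 (-7)    1 (-2)     20   -1   -20    20      2
20 – 30      25       -         6 (0)     -         4 (0)     5 (0)      15    0     0     0      0
30 – 40      35       8 (-16)   -         4 (0)     -         8 (16)     20    1    20    20      0
40 – 50      45       -         7 (-14)   3 (0)     5 (10)    10 (40)    25    2    50   100     36
f                     15        24        15        18        28
dx                    -2        -1        0         1         2
fdxdy                 8         1         0         -1        38

N = 100    ∑fdy = 10    ∑fdy² = 220    ∑fdxdy = 46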
Interpretation: The calculation shows a low degree of positive correlation
(r = 0.2052) between salary and entertainment expenditure. It means that
average salary has only a slight association with expenditure on
entertainment.
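The grouped-data working can be checked with the sketch below. It assumes the usual step-deviation form of Pearson's formula for a two-way table, with assumed means 225 and 25 and common factors 50 and 10 (read off from the dx and dy values in the table). The ∑fdx and ∑fdx² totals are not shown in these notes and are computed here.

```python
from math import sqrt

# Two-way frequency table from Illustration 08 (0 marks an empty cell).
# Rows: expenditure classes 0-10 ... 40-50; columns: salary classes 100-150 ... 300-350.
freq = [
    [5, 4, 5, 2, 4],
    [2, 7, 3, 7, 1],
    [0, 6, 0, 4, 5],
    [8, 0, 4, 0, 8],
    [0, 7, 3, 5, 10],
]
x_mid = [125, 175, 225, 275, 325]
y_mid = [5, 15, 25, 35, 45]

# Step deviations from the assumed means, divided by the class width
dx = [(m - 225) // 50 for m in x_mid]
dy = [(m - 25) // 10 for m in y_mid]

fx = [sum(col) for col in zip(*freq)]  # column totals (frequencies of X)
fy = [sum(row) for row in freq]        # row totals (frequencies of Y)
N = sum(fy)

sum_fdx = sum(f * d for f, d in zip(fx, dx))
sum_fdx2 = sum(f * d * d for f, d in zip(fx, dx))
sum_fdy = sum(f * d for f, d in zip(fy, dy))
sum_fdy2 = sum(f * d * d for f, d in zip(fy, dy))
sum_fdxdy = sum(
    freq[i][j] * dx[j] * dy[i] for i in range(len(dy)) for j in range(len(dx))
)

r = (N * sum_fdxdy - sum_fdx * sum_fdy) / sqrt(
    (N * sum_fdx2 - sum_fdx ** 2) * (N * sum_fdy2 - sum_fdy ** 2)
)
print(f"N = {N}, sum fdx = {sum_fdx}, sum fdx^2 = {sum_fdx2}")
print(f"sum fdy = {sum_fdy}, sum fdy^2 = {sum_fdy2}, sum fdxdy = {sum_fdxdy}")
print(f"r = {r:.4f}")
```

Because r is unaffected by a change of origin and scale, the step deviations give the same coefficient as the original mid-points would.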
To find the correlation under this method, the following formula is used:

\[ R = 1 - \frac{6\sum D^2}{N^3 - N} \]

where D = difference of the ranks between paired items in the two series, and
N = number of pairs of ranks.
Illustration 09:
Find Spearman's coefficient of correlation between the two kinds of assessment of
graduate students' performance in a college.
Name of students A B C D E F G H I
Internal Exam 51 68 73 46 50 65 47 38 60
External Exam 49 72 74 44 58 66 50 30 35
Solution:
Calculation of Spearman’s Rank Coefficient of Correlation
Name   Internal Exam   Ranks (R1)   External Exam   Ranks (R2)   D = R1 – R2   D²
A           51             5             49             6            -1         1
B           68             2             72             2             0         0
C           73             1             74             1             0         0
D           46             8             44             7             1         1
E           50             6             58             4             2         4
F           65             3             66             3             0         0
G           47             7             50             5             2         4
H           38             9             30             9             0         0
I           60             4             35             8            -4        16
                                                                          ∑D² = 26
Calculation. …………………………………
Interpretation: The calculation shows a high degree of positive correlation
(R = 0.7833) between the two exams. It means that students who rank high in the
internal exam tend to rank high in the external exam as well.
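The ranking and the final calculation for Illustration 09 can be sketched as follows. The helper `ranks_desc` is illustrative and assumes there are no tied marks; student H's internal mark is taken as 38, as in the problem statement.

```python
def ranks_desc(values):
    """Rank 1 for the largest value; assumes no ties (illustrative helper)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

# Marks of students A to I from Illustration 09
internal = [51, 68, 73, 46, 50, 65, 47, 38, 60]
external = [49, 72, 74, 44, 58, 66, 50, 30, 35]

r1 = ranks_desc(internal)
r2 = ranks_desc(external)
d2 = [(a - b) ** 2 for a, b in zip(r1, r2)]

n = len(internal)
R = 1 - 6 * sum(d2) / (n ** 3 - n)
print("R1:", r1)
print("R2:", r2)
print(f"sum D^2 = {sum(d2)}, R = {R:.4f}")
```

With N = 9 this gives R = 1 − (6 × 26) / (9³ − 9) = 1 − 156/720, matching the value quoted in the interpretation.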
Illustration 10:
The coefficient of rank correlation of the marks obtained by 10 students in statistics
and accountancy was found to be 0.8. It was later discovered that the difference in
ranks in the two subjects obtained by one of the students was wrongly taken as 7
instead of 9. Find the correct coefficient of rank correlation.
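No solution is given here for Illustration 10. One possible approach, sketched below, is to invert the rank correlation formula to recover the ∑D² that was actually used, replace the wrong squared difference with the correct one, and recompute. Variable names are illustrative.

```python
n = 10
reported_r = 0.8
wrong_d, correct_d = 7, 9

# Invert R = 1 - 6*sum(D^2)/(N^3 - N) to recover the sum of D^2 that was used
sum_d2_wrong = (1 - reported_r) * (n ** 3 - n) / 6

# Swap the wrong squared difference for the correct one
sum_d2_correct = sum_d2_wrong - wrong_d ** 2 + correct_d ** 2

corrected_r = 1 - 6 * sum_d2_correct / (n ** 3 - n)
print(f"wrong sum D^2 = {sum_d2_wrong:.0f}, correct sum D^2 = {sum_d2_correct:.0f}")
print(f"corrected R = {corrected_r:.4f}")
```

Under this approach the incorrect ∑D² is 33, the corrected ∑D² is 33 − 49 + 81 = 65, and the corrected coefficient is about 0.6061.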
Illustration 11:
Ten competitors in a beauty contest are ranked by three judges in the following order:
1st Judge   1   6   5  10   3   2   4   9   7   8
2nd Judge   3   5   8   4   7  10   2   1   6   9
3rd Judge   6   4   9   8   1   2   3  10   5   7
Use the rank correlation coefficient to determine which pair of judges has the nearest
approach to common tastes in beauty.
Solution:
In order to find out which pair of judges has the nearest approach to common tastes in
beauty, we compare rank correlation between the judgements of
1. 1st Judge and 2nd Judge
2. 2nd Judge and 3rd Judge
3. 1st Judge and 3rd Judge
Calculation of Spearman’s Rank Coefficient of Correlation
Rank by 1st   Rank by 2nd   Rank by 3rd
Judge (R1)    Judge (R2)    Judge (R3)    D² = (R1–R2)²   D² = (R2–R3)²   D² = (R1–R3)²
     1             3             6              4               9              25
     6             5             4              1               1               4
     5             8             9              9               1              16
    10             4             8             36              16               4
     3             7             1             16              36               4
     2            10             2             64              64               0
     4             2             3              4               1               1
     9             1            10             64              81               1
     7             6             5              1               1               4
     8             9             7              1               4               1
  N = 10        N = 10        N = 10       ∑D² = 200       ∑D² = 214       ∑D² = 60
1. 1st Judge and 2nd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = 1 - \frac{1200}{990} = 1 - 1.2121 = -0.2121 \]
2. 2nd Judge and 3rd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = 1 - \frac{1284}{990} = 1 - 1.297 = -0.297 \]
3. 1st Judge and 3rd Judge:
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 1 - \frac{360}{990} = 1 - 0.3636 = 0.6364 \]
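The three comparisons can be reproduced, and the closest pair picked out, with the following sketch. The function and dictionary names are illustrative.

```python
from itertools import combinations

def spearman_from_ranks(a, b):
    """R = 1 - 6*sum(D^2)/(N^3 - N) for two untied rankings (illustrative helper)."""
    n = len(a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(a, b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

# Rankings from Illustration 11
judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

results = {
    pair: spearman_from_ranks(judges[pair[0]], judges[pair[1]])
    for pair in combinations(judges, 2)
}
closest_pair = max(results, key=results.get)

for pair, value in results.items():
    print(pair, f"{value:.4f}")
print("closest pair:", closest_pair)
```

Since the coefficient for the 1st and 3rd judges (0.6364) is the highest of the three, and the other two are negative, the 1st and 3rd judges have the nearest approach to common tastes.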
Illustration 12:
From the following data, compute the rank correlation.
X 82 68 75 61 68 73 85 68
Y 81 71 71 68 62 69 80 70
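This data set contains tied values (68 appears three times in X, 71 twice in Y), and no solution is given here. The sketch below assumes the standard textbook treatment of ties, which is not stated in these notes: tied items share the average of the ranks they would occupy, and a correction of (m³ − m)/12 is added to ∑D² for each group of m tied values. Helper names are illustrative.

```python
def average_ranks(values):
    """Rank 1 for the largest value; tied values share the mean of their positions."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1   # first position occupied by v
        count = order.count(v)       # number of items tied at v
        ranks.append(first + (count - 1) / 2)
    return ranks

def tie_correction(values):
    """Sum of (m^3 - m)/12 over each group of m equal values (0 when m = 1)."""
    return sum(
        (values.count(v) ** 3 - values.count(v)) / 12 for v in set(values)
    )

# Data from Illustration 12
X = [82, 68, 75, 61, 68, 73, 85, 68]
Y = [81, 71, 71, 68, 62, 69, 80, 70]

rx = average_ranks(X)
ry = average_ranks(Y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
correction = tie_correction(X) + tie_correction(Y)

n = len(X)
R = 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)
print("ranks of X:", rx)
print("ranks of Y:", ry)
print(f"sum D^2 = {sum_d2}, correction = {correction}, R = {R:.2f}")
```

Under these assumptions ∑D² = 18.5, the tie correction is 2 + 0.5 = 2.5, and R = 1 − 6(21)/504 = 0.75.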
REGRESSION
Meaning:
Regression is the study of the relationship between associated variables, in
which one variable depends on another, independent variable. The idea was
developed by Sir Francis Galton in 1877 to measure the relationship between the
heights of parents and their children.
Regression analysis is a statistical tool to study the nature and extent of
functional relationship between two or more variables and to estimate (or predict) the
unknown values of dependent variable from the known values of independent
variable.
The variable that forms the basis for predicting another variable is known as
the independent variable, and the variable that is predicted is known as the dependent
variable. For example, if we know that two variables, price (X) and demand (Y), are
closely related we can find out the most probable value of X for a given value of Y or
the most probable value of Y for a given value of X. Similarly, if we know that the
amount of tax and the rise in the price of a commodity are closely related, we can find
out the expected price for a certain amount of tax levy.
In the above two regression lines or regression equations, there are two
regression parameters, “a” and “b”. Here “a” is an unknown constant, and “b”,
also denoted “byx” or “bxy”, is another unknown constant popularly called the
regression coefficient. These two unknown constants (fixed numerical values)
determine the position of the line completely: if the value of either or both of
them is changed, another line is determined. The parameter “a” determines the
level of the fitted line (i.e. the distance of the line directly above or below the
origin). The parameter “b” determines the slope of the line (i.e. the change
in Y for a unit change in X).
If the values of the constants “a” and “b” are obtained, the line is completely
determined. But the question is how to obtain these values. The answer is provided by
the method of least squares. With a little algebra and differential calculus, it can be
shown that the following two normal equations, if solved simultaneously, will yield
the values of the parameters “a” and “b”.
Two normal equations:
For X on Y:
\[ \sum X = Na + b\sum Y, \qquad \sum XY = a\sum Y + b\sum Y^2 \]
For Y on X:
\[ \sum Y = Na + b\sum X, \qquad \sum XY = a\sum X + b\sum X^2 \]
The above method is popularly known as the direct method. It becomes quite
cumbersome when the values of X and Y are large. The work can be simplified if,
instead of dealing with the actual values of X and Y, we take the deviations of the
X and Y series from their respective means. In that case:
Regression equation Y on X:
Y = a + bX will change to (Y – Ȳ) = byx (X – X̄)
Regression equation X on Y:
X = a + bY will change to (X – X̄) = bxy (Y – Ȳ)
In this new form of the regression equation, we need to compute only one
parameter, “b”. This “b”, denoted either “byx” or “bxy”, is called the
regression coefficient.
Regression Coefficient:
The quantity “b” in the regression equation is called the regression
coefficient or slope coefficient. Since there are two regression equations, we
have two regression coefficients.
1. Regression Coefficient X on Y, symbolically written as “bxy”
2. Regression Coefficient Y on X, symbolically written as “byx”
Different formulas used to compute the regression coefficients:

Using the correlation coefficient (r) and the standard deviations (σ):
\[ b_{xy} = r\,\frac{\sigma_x}{\sigma_y}, \qquad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \]

Direct method, using the sums of X and Y:
\[ b_{xy} = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - (\sum Y)^2}, \qquad b_{yx} = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2} \]

When deviations are taken from the arithmetic means, where x = X – X̄ and y = Y – Ȳ:
\[ b_{xy} = \frac{\sum xy}{\sum y^2}, \qquad b_{yx} = \frac{\sum xy}{\sum x^2} \]
2. If one of the regression coefficients is greater than unity, the other must be less
than unity, since the value of the coefficient of correlation cannot exceed unity.
For example, if bxy = 1.2 and byx = 1.4, “r” would be √(1.2 × 1.4) = 1.29, which is
not possible.
3. Both regression coefficients have the same sign, i.e. they are either both
positive or both negative. In other words, it is not possible for one of the
regression coefficients to have a minus sign and the other a plus sign.
4. The coefficient of correlation has the same sign as the regression
coefficients, i.e. if the regression coefficients have a negative sign, “r” is also
negative, and if the regression coefficients have a positive sign, “r” is also
positive. For example, if bxy = -0.2 and byx = -0.8, then r = -√(0.2 × 0.8) = -0.4.
5. The average of the two regression coefficients is greater than or equal to the
value of the coefficient of correlation. In symbols, (bxy + byx) / 2 ≥ r. For example,
if bxy = 0.8 and byx = 0.4, then the average of the two values = (0.8 + 0.4) / 2 = 0.6
and the value of r = √(0.8 × 0.4) = 0.566, which is less than 0.6.
6. Regression coefficients are independent of change of origin but not of scale.
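The numerical examples in the properties above all rely on the relation r = ±√(bxy × byx), with r taking the common sign of the two coefficients. A small sketch that encodes this relation and the consistency checks from properties 2 to 5 might look as follows; the function name is illustrative.

```python
from math import sqrt, copysign

def r_from_coefficients(bxy, byx):
    """Return r = +/- sqrt(bxy * byx), rejecting inconsistent inputs."""
    if bxy * byx < 0:
        # Property 3: both coefficients must share the same sign
        raise ValueError("regression coefficients must have the same sign")
    r = copysign(sqrt(bxy * byx), bxy)  # property 4: r takes the common sign
    if abs(r) > 1:
        # Property 2: r cannot exceed unity in absolute value
        raise ValueError("inconsistent coefficients: |r| would exceed 1")
    return r

print(round(r_from_coefficients(-0.2, -0.8), 4))  # example from property 4
r5 = r_from_coefficients(0.8, 0.4)                # example from property 5
print(round(r5, 4), (0.8 + 0.4) / 2 >= r5)

try:
    r_from_coefficients(1.2, 1.4)                 # example from property 2
except ValueError as err:
    print("rejected:", err)
```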
Illustration 01:
Find the two regression equations, X on Y and Y on X, from the following data:
X : 10 12 16 11 15 14 20 22
Y : 15 18 23 14 20 17 25 28
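No solution is given here for this illustration. A sketch of one way to work it, using deviations from the arithmetic means (byx = ∑xy/∑x² and bxy = ∑xy/∑y²), is shown below. Variable names are illustrative.

```python
# Data from Illustration 01 (regression)
X = [10, 12, 16, 11, 15, 14, 20, 22]
Y = [15, 18, 23, 14, 20, 17, 25, 28]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)

byx = sum_xy / sum_x2  # regression coefficient of Y on X
bxy = sum_xy / sum_y2  # regression coefficient of X on Y

# Rewrite (Y - mean_y) = byx (X - mean_x) as Y = a_yx + byx * X, and likewise for X on Y
a_yx = mean_y - byx * mean_x
a_xy = mean_x - bxy * mean_y

print(f"Y on X: Y = {a_yx:.3f} + {byx:.3f} X")
print(f"X on Y: X = {a_xy:.3f} + {bxy:.3f} Y")
```

Under this working the means are 15 and 20, and the coefficients come out at roughly byx = 1.127 and bxy = 0.826.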
Illustration 02:
After investigation it has been found that the demand for automobiles in a city
depends mainly, if not entirely, upon the number of families residing in that city.
Given below are the figures for the sales of automobiles in five cities for the year
2019 and the number of families residing in those cities.
City No. of Families (in lakhs): X Sale of automobiles (in ‘000): Y
Belagavi 70 25.2
Bangalore 75 28.6
Hubli 80 30.2
Kalaburagi 60 22.3
Mangalore 90 35.4
Fit a linear regression equation of Y on X by the least squares method and estimate the
sales for the year 2020 for the city of Belagavi, which is estimated to have 100 lakh
families, assuming that the same relationship holds true.
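No solution is given here for this illustration. The sketch below solves the two normal equations for Y on X (∑Y = Na + b∑X and ∑XY = a∑X + b∑X²) by elimination and then evaluates the fitted line at X = 100. Variable names are illustrative.

```python
# Data from Illustration 02 (regression)
X = [70, 75, 80, 60, 90]            # number of families (in lakhs)
Y = [25.2, 28.6, 30.2, 22.3, 35.4]  # sale of automobiles (in '000)

n = len(X)
sum_x = sum(X)
sum_y = sum(Y)
sum_xy = sum(a * b for a, b in zip(X, Y))
sum_x2 = sum(a * a for a in X)

# Eliminating a from the two normal equations gives b; a then follows from the first
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

estimate = a + b * 100  # estimated sales (in '000) for 100 lakh families
print(f"Y = {a:.3f} + {b:.3f} X")
print(f"estimated sales at X = 100: {estimate:.3f} thousand")
```

Under this working the fitted line is approximately Y = -4.885 + 0.443X, giving an estimate of about 39.4 thousand automobiles.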
Illustration 03:
From the following data obtain the two regression lines:
Capital Employed (Rs. in lakh): 7 8 5 9 12 9 10 15
Sales Volume (Rs. in lakh): 4 5 2 6 9 5 7 12
Illustration 04:
From the following information find regression equations and estimate the production
when the capacity utilization is 70%.
Average (Mean) Standard Deviation
Production (in lakh units) 42 12.5
Capacity Utilization (%) 88 8.5
Correlation Coefficient (r) 0.72
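No solution is given here for this illustration. A sketch of one possible working, assuming production is taken as Y and capacity utilization as X, and using the coefficients based on r and the standard deviations:

```latex
\[ b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.72 \times \frac{12.5}{8.5} \approx 1.0588,
   \qquad
   b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = 0.72 \times \frac{8.5}{12.5} = 0.4896 \]

% Regression equation of Y on X
\[ Y - 42 = 1.0588\,(X - 88) \;\Rightarrow\; Y \approx 1.0588\,X - 51.18 \]

% Regression equation of X on Y
\[ X - 88 = 0.4896\,(Y - 42) \;\Rightarrow\; X \approx 0.4896\,Y + 67.44 \]

% Estimated production when capacity utilization is 70%
\[ Y \approx 42 + 1.0588\,(70 - 88) = 42 - 19.06 = 22.94 \text{ lakh units} \]
```

The estimate uses the equation of Y on X because production is the variable being predicted.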
Illustration 05:
The following data gives the age and blood pressure (BP) of 10 sports persons.
Name : A B C D E F G H I J
Age (X) : 42 36 55 58 35 65 60 50 48 51
BP (Y) : 98 93 110 85 105 108 82 102 118 99
i. Find regression equation of Y on X and X on Y (Use the method of deviation
from arithmetic mean)
ii. Find the correlation coefficient (r) using the regression coefficients.
iii. Estimate the blood pressure of a sports person whose age is 45.
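No solution is given here for this illustration. The sketch below follows the three parts in order: the regression coefficients from deviations about the means, r from the two coefficients (taking their common sign), and the estimate at age 45 from the equation of Y on X. Variable names are illustrative.

```python
from math import sqrt, copysign

# Data from Illustration 05 (regression)
age = [42, 36, 55, 58, 35, 65, 60, 50, 48, 51]
bp = [98, 93, 110, 85, 105, 108, 82, 102, 118, 99]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(bp) / n

x = [a - mean_x for a in age]
y = [b - mean_y for b in bp]

sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(d * d for d in x)
sum_y2 = sum(d * d for d in y)

# (i) regression coefficients from deviations about the means
byx = sum_xy / sum_x2
bxy = sum_xy / sum_y2

# (ii) r is the geometric mean of the coefficients, with their common sign
r = copysign(sqrt(byx * bxy), byx)

# (iii) estimate from the equation of Y on X: Y = mean_y + byx (X - mean_x)
bp_at_45 = mean_y + byx * (45 - mean_x)

print(f"byx = {byx:.4f}, bxy = {bxy:.4f}, r = {r:.4f}")
print(f"estimated BP at age 45: {bp_at_45:.2f}")
```

Under this working the coefficients are small and negative, so r is only about -0.13 and the estimate at age 45 stays close to the mean blood pressure of 100.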
Illustration 06:
There are two series of index numbers, P for price index and S for stock of commodity.
The mean and standard deviation of P are 100 and 8 and S are 103 and 4 respectively.
The correlation coefficient between the two series is 0.4. With these data, work out a
linear equation to read off values of P for various values of S. Can the same equation be
used to read off values of S for various values of P?
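No solution is given here for this illustration. A sketch of one possible working, using the coefficient formulas based on r and the standard deviations:

```latex
% To read off P for given S we need the regression equation of P on S
\[ b_{ps} = r\,\frac{\sigma_p}{\sigma_s} = 0.4 \times \frac{8}{4} = 0.8 \]
\[ P - 100 = 0.8\,(S - 103) \;\Rightarrow\; P = 0.8\,S + 17.6 \]

% To read off S for given P we need the regression equation of S on P
\[ b_{sp} = r\,\frac{\sigma_s}{\sigma_p} = 0.4 \times \frac{4}{8} = 0.2 \]
\[ S - 103 = 0.2\,(P - 100) \;\Rightarrow\; S = 0.2\,P + 83 \]
```

The two lines are different, so the equation of P on S should not be used to read off values of S. The two regression lines coincide only when r = ±1.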