Correlation
DR. K.M. EARFAN ALI
PROFESSOR
DEPARTMENT OF STATISTICS
Introduction
In the previous two topics, we concentrated
entirely on distributions and measures of one
variable;
but in reality, we normally collect data on
several items at once. We are interested in
links, or relationships, between the different
variables (or, sometimes, between variables
and attributes).
Bivariate Data
Definition: When two variables are measured on
each unit of observation so that the
relationship between them can be studied, the
data are called bivariate quantitative data.
Example
the number of fish and their feeds
fish production technique used
weather conditions prevailing
the surface temperature of the farm areas
the effect of other farmers operating nearby.
Scatter Plots and Correlation
A scatter plot (or scatter diagram) is used to
show the relationship between two variables.
Correlation analysis is used to measure the
strength of the association (linear
relationship) between two variables.
– Only concerned with strength of the
relationship
– No causal effect is implied
Correlation Coefficient
The correlation coefficient is a measure of the
strength and the direction of a linear
relationship between two variables. The symbol
𝑟 represents the sample correlation coefficient.
The formula for 𝑟 is
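$$ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} $$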
The range of the correlation coefficient is −1 to
1. If 𝑥 and 𝑦 have a strong positive linear
correlation, 𝑟 is close to 1.
If 𝑥 and 𝑦 have a strong negative linear
correlation, 𝑟 is close to −1. If there is no linear
correlation or a weak linear correlation, 𝑟 is
close to 0.
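As a minimal sketch, the raw-sum formula for 𝑟 can be written as a short Python function. The function name `pearson_r` and the two small data sets are illustrative assumptions, not part of the lecture; they only show the two extreme cases 𝑟 = +1 and 𝑟 = −1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from the raw-sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Illustrative values only (not from the lecture data).
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on an upward-sloping line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a downward-sloping line
print(r_up, r_down)
```

The function divides by zero if either variable is constant, which matches the formula: 𝑟 is undefined when one variable does not vary.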
Scatter Plot Examples
[Figure: example scatter plots of y against x in three panels: strong relationships, weak relationships, no relationship]
Scatter Plot Examples
Rectangular coordinate
Two quantitative variables
One variable is called independent (X) and the
second is called dependent (Y)
Points are not joined
No frequency table
Scatter diagram
It is the simplest diagrammatic
representation of bivariate data. For the
bivariate distribution (𝑥𝑖, 𝑦𝑖); 𝑖 = 1, 2, …, 𝑛,
if the values of the variables 𝑋 and 𝑌 are plotted
along the 𝑋-axis and 𝑌-axis respectively in the
𝑥𝑦-plane, the diagram of dots so obtained is
known as a scatter diagram.
Definition
Correlation is the study of statistical
relationship between two or more variables.
In other words, correlation is the degree or
intensity of association or inter-relationship
between two (or more) variables.
The correlation is a measure of how close the
relationship between 𝑥 and 𝑦 is to a straight
line.
Karl Pearson’s Correlation Coefficient
A measure of the intensity or degree of linear
relationship between two variables is called the
coefficient of correlation. Correlation is
measured by the coefficient of correlation,
which is denoted by ρ for a population.
It is also called Pearson's correlation or the
product moment correlation coefficient.
It measures the nature and strength of the
relationship between two quantitative variables.
Mathematical definition
If 𝑥 and 𝑦 are two random variables of a bivariate
population, then the correlation coefficient
between these variables is denoted by 𝜌𝑥𝑦
or 𝜌, and that between the variables 𝑥
and 𝑦 of a sample is denoted by 𝑟𝑥𝑦 or 𝑟.
The sign of 𝑟 denotes the nature of association
while the value of 𝑟 denotes the strength of
association.
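Theoretical formula:
$$ r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sqrt{V(x) \times V(y)}} $$
Mathematical formula:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
Calculated formula:
$$ r_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left[\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right]\left[\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right]}} $$
In terms of the sum of products and sums of squares:
$$ r_{xy} = \frac{SP(x, y)}{\sqrt{SS(x) \times SS(y)}} $$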
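The step from the mathematical formula to the calculated formula is not spelled out in the slides. A short derivation, using only the definitions of the sample means, is sketched here:

```latex
\begin{aligned}
SP(x, y) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  &= \sum x_i y_i - \bar{y}\sum x_i - \bar{x}\sum y_i + n\bar{x}\bar{y} \\
  &= \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}
  \qquad \text{since } \bar{x} = \frac{\sum x_i}{n},\ \bar{y} = \frac{\sum y_i}{n}. \\
SS(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}
  \qquad \text{(the same identity with } y = x\text{)}.
\end{aligned}
```

Substituting these expressions for SP(x, y), SS(x) and SS(y) into the mathematical formula gives the calculated formula.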
Types of correlation
There are three main types of correlation.
Depending on its extent and direction, these divide into five
types. Each type of correlation is
described mathematically and graphically below:
Positive Correlation
• i). Perfect Positive Correlation
• ii) Partial Positive Correlation
Negative Correlation
• i). Perfect Negative Correlation
• ii) Partial Negative Correlation
Zero Correlation
Perfect Positive correlation
[Fig 1: Perfect positive (r = +1); Y axis against X axis]
If the two variables always deviate in the same direction
and in a constant proportion, i.e., if every unit increase in
one variable results in the same corresponding increase in
the other variable, the correlation is said to be perfect
positive correlation.
In this case, the two variables denoted by X and Y are
fully correlated with each other and the correlation
coefficient is r = +1, i.e., both variables rise or fall in
the same proportion.
Example
Perfect correlations are not found in nature, but
some relationships approach that extent, such
as height and weight, age and height, and age and
weight of cattle up to a certain age.
𝑋 varies directly and proportionately with 𝑌
(𝑋 ∝ 𝑌). If all the data points lie exactly on an
upward-sloping line, then 𝑟 will be +1 (Figure 1).
Perfect Negative correlation
[Fig 2: Perfect negative (r = -1); Y axis against X axis]
If the two variables always deviate in opposite
directions and in a constant proportion,
i.e., if every unit increase in one variable results
in the same corresponding decrease in the other
variable, the correlation is said
to be perfect negative correlation.
Example
Perfect negative correlations are not found in
nature, but some relationships approach that extent,
such as mean weekly temperature and
number of colds in winter, or
pressure and volume of a gas at a particular
temperature. X varies inversely as Y (X ∝ 1/Y).
Partial positive & negative correlation
[Fig 4: Partial positive (0 < r < 1); Y axis against X axis]
[Fig 5: Partial negative (-1 < r < 0); Y axis against X axis]
Partial positive correlation
If the two variables deviate in the same
direction, i.e., if the increase (or decrease) in
one variable results in a corresponding increase
(or decrease) in the other variable, the correlation is
said to be partial or moderately positive.
In this case, the non-zero values of the coefficient
(𝑟) lie between 0 and +1,
i.e., 0 < 𝑟 < 1.
Example:
1. Infant mortality rate and overcrowding
2. Temperature and pulse rate;
3. Age and weight of fishes;
4. Plasma volume in ml and total circulating
albumin in gm.
5. Prices and supply of fish feed
6. Feed and yield of fish
Partial negative correlation
If the two variables deviate in
opposite directions,
i.e., if an increase (or decrease) in one variable
results in a corresponding decrease (or increase)
in the other variable, the correlation is said to be
inverse or negative. In this case, −1 < 𝑟 < 0.
Example
Age and vital capacity in adult cattle;
Income and infant mortality rate of cow;
Rainfall and grass
In such a moderately negative correlation, the
scatter diagram is of the same type as for partial
positive correlation, but the imaginary mean line
starts from the high values of one variable and
slopes downward (Fig 5).
Example
The following data on the number of boats operating
and the catch obtained are used to draw the scatter
diagram of weight of fish against number of boats.
Number of boats    Weight of fish (in kg)
67                 120
69                 125
85                 140
83                 160
74                 130
81                 180
97                 150
92                 140
114                200
85                 130
Total: 847         1475
Scatter diagram of weight of fish and number of
boats
[Figure: two scatter plots of weight of fish (in kg), vertical axis 0 to 250, against number of boats, horizontal axis 0 to 120; the second plot adds a linear trend line]
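The slides plot the boats data but do not compute 𝑟 for it. As a sketch, the SP/SS form of the formula can be applied to the ten pairs in the table; the variable names are illustrative, and the result is a computed value, not one quoted in the lecture.

```python
from math import sqrt

# Data from the example table: number of boats and weight of fish (kg).
boats = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]
weight = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]
n = len(boats)

# Sum of products and sums of squares about the means.
sp_xy = sum(x * y for x, y in zip(boats, weight)) - sum(boats) * sum(weight) / n
ss_x = sum(x * x for x in boats) - sum(boats) ** 2 / n
ss_y = sum(y * y for y in weight) - sum(weight) ** 2 / n

r = sp_xy / sqrt(ss_x * ss_y)
print(sp_xy, ss_x, ss_y)
print(round(r, 2))  # about 0.74
```

A value of about 0.74 is consistent with the upward trend line in the scatter diagram: a fairly strong partial positive correlation.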
No or Zero Correlation
[Fig 3: Zero correlation (r = 0), no relation]
If there is no relationship between the two
variables, so that a change in the value of one
variable is not accompanied by any corresponding
change in the other, the correlation is called no
or zero correlation.
Example
There is no correlation between a person's height
and the amount they earn.
Height and pulse rate of fish.
Assumptions
The variables concerned are linearly related,
i.e., plotting them on graph paper gives
a straight line.
There exists a cause-and-effect relationship
between the related variables.
A large number of independent causes
operate on both the correlated variables so as
to produce a normal distribution.
Both the variables are random
Since the variables are independent, there
exists regression of one variable on the other.
Properties of Correlation Coefficient
1. Correlation coefficient is independent of
change of origin and scale.
2. The value of correlation coefficient lies
between -1 and +1 i.e., -1 ≤ r ≤ +1.
3. Correlation coefficient is the geometric
mean of two regression coefficients.
4. Correlation coefficient is symmetric with
respect to the dependence of the variables.
5. The value of correlation coefficient is very
much influenced by large items, if they are
present in data.
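Properties 1 and 4 can be checked numerically. The sketch below reuses the boats and fish data from the earlier example; the function name `pearson_r` and the chosen shifts and scale factors are illustrative assumptions. Note that the scale factors are positive: multiplying one variable by a negative constant reverses the sign of 𝑟.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

x = [67, 69, 85, 83, 74, 81, 97, 92, 114, 85]           # number of boats
y = [120, 125, 140, 160, 130, 180, 150, 140, 200, 130]  # weight of fish (kg)

# Change of origin and scale: u = (x - 80) / 5, v = (y - 150) / 10.
u = [(xi - 80) / 5 for xi in x]
v = [(yi - 150) / 10 for yi in y]

r_xy = pearson_r(x, y)
r_uv = pearson_r(u, v)   # property 1: same value as r_xy
r_yx = pearson_r(y, x)   # property 4: symmetric in x and y
r_neg = pearson_r([-xi for xi in x], y)  # negative scale flips the sign
print(r_xy, r_uv, r_yx, r_neg)
```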
Necessity of Studying Correlation
1. The Pearson correlation coefficient is used
for assessing the linear (straight line)
association between an 𝑋 and a 𝑌 variable,
and requires interval or ratio measurement.
2. The symbol for the sample correlation coefficient
is 𝑟, which is the sample estimate of ρ that can
be obtained from a sample of pairs (𝑋, 𝑌) of
values for 𝑋 and 𝑌.
3. The correlation varies from negative one to
positive one (– 1 ≤ 𝑟 ≤ +1).
4. Correlation of +1 or –1 refers to a perfect
positive or negative 𝑋 , 𝑌 relationship,
respectively. Data falling exactly on a straight
line indicates that |𝑟| = 1.
Interpret r
The value of 𝑟 ranges between −1 and +1.
The value of 𝑟 denotes the strength of the
association, as illustrated by the following
scale.
−1 to −0.75: strong indirect
−0.75 to −0.25: intermediate indirect
−0.25 to 0: weak indirect
0: no relation
0 to 0.25: weak direct
0.25 to 0.75: intermediate direct
0.75 to 1: strong direct
−1 or +1: perfect correlation
Interpret r
If 𝑟 is very close to +1, we say there is a strong
positive correlation:
𝑦 increases as 𝑥 increases, and the points lie
close to a straight line.
If 𝑟 is close to −1, there is a strong negative
correlation: 𝑦 decreases as 𝑥 increases.
When 𝑟 is close to zero (either positive or
negative) there is very little linear relationship
between the two variables.
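The interpretation scale (cut-offs at 0.25 and 0.75) can be expressed as a small helper. This is a sketch only: the function name `describe_r` is hypothetical, and the treatment of values falling exactly on a cut-off is an assumption, since the diagram does not say which band the boundaries belong to.

```python
def describe_r(r):
    """Label a correlation coefficient using the 0.25 / 0.75 scale."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no relation"
    direction = "direct" if r > 0 else "indirect"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.75:        # assumption: 0.75 itself counts as strong
        strength = "strong"
    elif size >= 0.25:        # assumption: 0.25 itself counts as intermediate
        strength = "intermediate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

print(describe_r(0.82))   # strong direct
print(describe_r(-0.4))   # intermediate indirect
print(describe_r(0.1))    # weak direct
```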
Spearman’s rank correlation
Sometimes we come across statistical series in
which the variables under consideration are
not capable of quantitative measurement but
can be arranged in serial order.
This happens when we are dealing with
qualitative characteristics (attributes) such as
honesty, beauty, character, morality, etc.
Let the random variables X and Y denote the
ranks of the individuals in the characteristics A
and B respectively.
If we assume that there is no tie, i.e., if no two
individuals get the same rank in a
characteristic then, obviously, X and Y assume
numerical values ranging from 1 to N.
Spearman Rank Correlation Coefficient (rs)
1. It is a non-parametric measure of correlation.
2. This procedure makes use of the two sets of
ranks that may be assigned to the sample values
of X and Y.
3. Spearman rank correlation coefficient can be
computed in the following cases:
• i) Both variables are quantitative.
• ii) Both variables are qualitative ordinal.
• iii) One variable is quantitative and the other is
qualitative ordinal.
Example
Calculate Spearman’s rank correlation
coefficient between advertisement cost and
sales of fish from the following data:
Advertisement cost (‘000 Tk.):  39 65 62 90 82 75 25 98 36 78
Sales (lakhs Tk.):              47 53 58 86 62 68 60 91 51 84
Solution:
Let 𝑋 denote the advertisement cost (‘000 Tk.)
and 𝑌 denote the sales (lakhs Tk.).
𝑿𝒊    𝒀𝒊    Rank of 𝑿𝒊 (𝒙𝒊)   Rank of 𝒀𝒊 (𝒚𝒊)   𝒅𝒊 = 𝒙𝒊 − 𝒚𝒊   𝒅𝒊²
39    47    8                 10                -2              4
65    53    6                 8                 -2              4
62    58    7                 7                 0               0
90    86    2                 2                 0               0
82    62    3                 5                 -2              4
75    68    5                 4                 1               1
25    60    10                6                 4               16
98    91    1                 1                 0               0
36    51    9                 9                 0               0
78    84    4                 3                 1               1
Total                                           0               30
Here 𝑛 = 10.
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 30}{10 \times 99} = 0.82 $$
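The ranking and the formula above can be checked with a few lines of Python. This is a minimal sketch for data without ties; the helper names `ranks_descending` and `spearman_rs` are illustrative, and rank 1 is given to the largest value, as in the table.

```python
def ranks_descending(values):
    """Rank 1 for the largest value; assumes there are no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    """Return (r_s, sum of squared rank differences) for untied data."""
    n = len(xs)
    rank_x, rank_y = ranks_descending(xs), ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]   # advertisement cost ('000 Tk.)
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]  # sales (lakhs Tk.)

rs, sum_d2 = spearman_rs(cost, sales)
print(sum_d2, round(rs, 2))  # 30 0.82
```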
• The value of 𝑟𝑠 denotes the magnitude and
nature of association, giving the same
interpretation as simple 𝑟.
Comment:
There is a strong direct correlation between
advertisement cost and sales.
Problem
A psychologist wanted to compare two methods
𝐴 & 𝐵 of teaching. He selected a random sample of
22 students.
He grouped them into 11 pairs so that the students
in a pair have approximately equal scores in an
intelligence test.
In each pair one student was taught by method A
and the other by method B and examined after the
course.
The marks obtained by them are as follows.
Pair: 1 2 3 4 5 6 7 8 9 10 11
A: 24 29 19 14 30 19 27 30 20 28 11
B: 37 35 16 26 23 27 19 20 16 11 21
Solutions
A B RA RB D = RA − RB D²
24 37 6 1 5 25
29 35 3 2 1 1
19 16 8.5 9.5 -1 1
14 26 10 4 6 36
30 23 1.5 5 -3.5 12.25
19 27 8.5 3 5.5 30.25
27 19 5 8 -3 9
30 20 1.5 7 -5.5 30.25
20 16 7 9.5 -2.5 6.25
28 11 4 11 -7 49
11 21 11 6 5 25
In series A the values 19 and 30 each occur twice,
and in series B the value 16 occurs twice; tied
values are given the average of the ranks they occupy.
Apply the following formula:
$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$
The value of 𝑟𝑠 denotes the magnitude and
nature of association giving the same
interpretation as simple 𝑟.
Comment:
There is an indirect weak correlation between
the marks obtained under methods A and B.
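The slides stop at the formula, so the sketch below carries the calculation through with average ranks for the tied marks. The helper name `average_ranks` is illustrative. The second value uses the tie correction commonly taught alongside this formula, in which m(m² − 1)/12 is added to Σd² for each group of m tied values; that correction is an assumption here, since the slides do not show it.

```python
from collections import Counter

def average_ranks(values):
    """Rank 1 for the largest value; tied values share the average rank."""
    order = sorted(values, reverse=True)
    ranks = {}
    for v in set(values):
        first = order.index(v) + 1           # first position occupied by v
        count = order.count(v)               # number of tied values
        ranks[v] = first + (count - 1) / 2   # mean of the occupied positions
    return [ranks[v] for v in values]

a = [24, 29, 19, 14, 30, 19, 27, 30, 20, 28, 11]  # marks under method A
b = [37, 35, 16, 26, 23, 27, 19, 20, 16, 11, 21]  # marks under method B
n = len(a)

rank_a, rank_b = average_ranks(a), average_ranks(b)
sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_a, rank_b))

# Basic formula, as written in the slides.
rs_basic = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Assumed tie correction: add m(m^2 - 1)/12 for every group of m tied values.
tie_term = sum((m ** 3 - m) / 12
               for series in (a, b)
               for m in Counter(series).values() if m > 1)
rs_tied = 1 - 6 * (sum_d2 + tie_term) / (n * (n ** 2 - 1))

print(sum_d2, round(rs_basic, 3), round(rs_tied, 3))
```

Both versions give a value just below zero, which is consistent with describing the correlation as indirect and weak.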
Uses of correlation
1. It is used in physical and social sciences.
2. It is useful for economists to study the
relationship between variables like price and quantity.
Businessmen estimate costs, sales, price, etc.
using correlation.
3. It is helpful in measuring the degree of
relationship between the variables like income
and expenditure, price and supply, supply and
demand etc.
4. Sampling error can be calculated.
5. It is the basis for the concept of regression.