Module-I Regression (3)
Correlation: So far we have considered only univariate distributions. We know how to find
the averages and dispersion of a distribution. These measures give a complete idea of the
structure of a single distribution.
Sometimes it is necessary to know the relationship between two variables. For example, family
income and expenditure, price of a product and its demand, advertisement expenditure and sales
volume etc. If two quantities vary in such a way that movements in one are accompanied by
movements in the other, then these quantities are said to be correlated.
Correlation is a statistical technique to ascertain the association or relationship between two or
more variables. Correlation analysis is a statistical technique to study the degree and direction of
relationship between two or more variables. A correlation coefficient is a statistical measure of
the degree to which changes to the value of one variable predict change to the value of another.
When the fluctuation of one variable reliably predicts a similar fluctuation in another variable,
it is tempting to conclude that the change in one causes the change in the other. Correlation
alone, however, does not establish causation.
Types of Correlation:
Correlation is described or classified in several different ways. Three of the most important are:
I. Positive and Negative
II. Simple, Partial and Multiple
III. Linear and nonlinear
Positive Correlation: If both the variables vary in the same direction, correlation is said to be
positive. It means if one variable is increasing, the other on an average is also increasing or if
one variable is decreasing, the other on an average is also decreasing, then the correlation is
said to be positive correlation. For example, the correlation between heights and weights of a
group of persons is a positive correlation.
Height (cm): X 158 160 163 166 168 171 174 176
Weight (kg) : Y 60 62 64 65 67 69 71 72
Negative Correlation: If the two variables vary in opposite directions, the correlation is said
to be negative. It means that if one variable increases while the other decreases, or if one
variable decreases while the other increases, then the correlation is said to be negative
correlation. For example, the correlation between the price of a product and its demand is a
negative correlation.
Zero Correlation: Actually it is not a type of correlation but still it is called zero or no
correlation. When we don’t find any relationship between the variables then, it is said to be zero
correlation. It means a change in value of one variable doesn’t influence or change the value of
another variable. For example, the correlation between weight of a person and intelligence is a
zero or no correlation.
Scatter diagram patterns:
Perfect Positive Correlation: In this case, the points lie on a straight line rising
from the lower left hand corner to the upper right hand corner.
Perfect Negative Correlation: In this case, the points lie on a straight line falling
from the upper left hand corner to the lower right hand corner.
High Degree of Positive Correlation: In this case, the plotted points fall in a narrow
band, wherein points show a rising tendency from the lower left hand corner to the upper
right hand corner.
High Degree of Negative Correlation: In this case, the plotted points fall in a narrow
band, wherein points show a declining tendency from upper left hand corner to the lower
right hand corner.
Low Degree of Positive Correlation: In this case, the points are widely scattered over the
diagram but show a rising tendency from the lower left hand corner to the upper right hand
corner.
Low Degree of Negative Correlation: In this case, the points are widely scattered over the
diagram but show a declining tendency from the upper left hand corner to the lower right
hand corner.
Zero (No) Correlation: When plotted points are scattered over the graph haphazardly,
then it indicates that there is no correlation or zero correlation between two variables.
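Coefficient of Correlation: Karl Pearson’s method of calculating the coefficient of correlation is
based on the covariance of the two variables in a series. This method is widely used in practice,
and the coefficient of correlation is denoted by the symbol “r”. If the two variables under study
are X and Y, the following formulas suggested by Karl Pearson can be used for measuring the
degree of relationship:
$$\operatorname{Cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n}$$
$$r = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y} \qquad \text{(i)}$$
By substituting the values of $\sigma_x$ and $\sigma_y$, equation (i) becomes
$$r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} \qquad \text{(ii)}$$
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \sqrt{\frac{\sum (y_i - \bar{y})^2}{n}}} \qquad \text{(iii)}$$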
These formulas can be used in different situations, depending on the information
given in the problem.
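As a rough check that formulas (ii) and (iii) agree, the sketch below computes r both ways for the height and weight data given earlier. The function names are illustrative choices, not part of the notes.

```python
from math import sqrt


def pearson_r_deviations(xs, ys):
    """Formula (iii): r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    # The factors of n in formula (iii) cancel, leaving this ratio.
    return s_xy / sqrt(s_xx * s_yy)


def pearson_r_sums(xs, ys):
    """Formula (ii): r from the raw sums of x, y, x^2, y^2 and xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Height (cm) and weight (kg) data from the positive correlation example.
heights = [158, 160, 163, 166, 168, 171, 174, 176]
weights = [60, 62, 64, 65, 67, 69, 71, 72]

r_dev = pearson_r_deviations(heights, weights)
r_sum = pearson_r_sums(heights, weights)
print(round(r_dev, 4), round(r_sum, 4))  # both forms give the same value
```

Formula (ii) is convenient when only the sums are available; formula (iii) is easier when the means are whole numbers.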
The coefficient of determination: $r^2 = \dfrac{\text{Explained variation}}{\text{Total variation}}$

r               | r^2                | Comment
r = 1           | r^2 = 1            | Variation in the dependent variable Y can be completely explained by the independent variable X.
0.75 ≤ r < 0.9  | 0.56 ≤ r^2 < 0.81  | 56% of variation in Y can be explained by the presence of X. Hence, we say there exists definite positive correlation.
0.5 ≤ r < 0.75  | 0.25 ≤ r^2 < 0.56  | Unreliable positive correlation
0 < r < 0.5     | 0 ≤ r^2 < 0.25     | Poor positive correlation
r = 0           | r^2 = 0            | No linear correlation
Example 1: A set of data giving the number of police traffic patrols on duty and the number of
fatalities for the region was recorded, and a correlation of r = -0.81 was found. Interpret the value
of r.
Solution:
Since the value of r lies between -0.9 and -0.75, there exists definite negative correlation.
Here $r^2 = 0.6561$, so about 66% of the variation in fatalities can be explained by the number of patrols.
Example 2: Compute Karl Pearson’s correlation coefficient and comment on your result.
x 7 5 4 11 10 12 14 9
y 14 8 8 19 16 19 20 16
Solution:
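Here n = 8, $\sum x = 72$, $\sum y = 120$
So, $\bar{x} = 9$, $\bar{y} = 15$
We use the formula $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n \, \sigma_x \sigma_y}$
To calculate the required summations, we prepare the following table.
x  y  $x - \bar{x}$  $(x - \bar{x})^2$  $y - \bar{y}$  $(y - \bar{y})^2$  $(x - \bar{x})(y - \bar{y})$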
7 14 -2 4 -1 1 2
5 8 -4 16 -7 49 28
4 8 -5 25 -7 49 35
11 19 2 4 4 16 8
10 16 1 1 1 1 1
12 19 3 9 4 16 12
14 20 5 25 5 25 25
9 16 0 0 1 1 0
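Now, $\sum (x - \bar{x})^2 = 84$, $\sum (y - \bar{y})^2 = 158$, $\sum (x - \bar{x})(y - \bar{y}) = 111$
Standard deviation of x: $\sigma_x = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}} = \sqrt{\dfrac{84}{8}} = \sqrt{10.5} = 3.2403$
Standard deviation of y: $\sigma_y = \sqrt{\dfrac{\sum (y - \bar{y})^2}{n}} = \sqrt{\dfrac{158}{8}} = \sqrt{19.75} = 4.4441$
Now $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{n \, \sigma_x \sigma_y} = \dfrac{111}{8 \times 3.2403 \times 4.4441} = 0.9635$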
Hence there exists high positive Correlation.
Example 3: Compute the correlation coefficient for the following data.
x 14 8 10 11 9 13 5
y 14 9 11 13 11 12 4
Solution:
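We observe that n = 7, $\sum x = 70$, $\sum y = 74$
We use the formula $r = \dfrac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}}$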
To calculate the required summations, we prepare the following table.
x    y    $x^2$   $y^2$   xy
14   14   196     196     196
8    9    64      81      72
10   11   100     121     110
11   13   121     169     143
9    11   81      121     99
13   12   169     144     156
5    4    25      16      20
Here $\sum x^2 = 756$, $\sum y^2 = 848$, $\sum xy = 796$.
Substituting these values in the formula,
$$r = \frac{796 - \frac{70 \times 74}{7}}{\sqrt{\left(756 - \frac{(70)^2}{7}\right)\left(848 - \frac{(74)^2}{7}\right)}} = \frac{56}{\sqrt{56 \times 65.7143}}$$
r = 0.9231.
Example 4: A computer, while calculating the correlation coefficient between the variables X
and Y, obtained the following results: N = 30, $\sum X = 120$, $\sum X^2 = 600$, $\sum Y = 90$, $\sum Y^2 = 250$,
$\sum XY = 335$. It was, however, later discovered at the time of checking that it had copied
down two pairs of observations as (X, Y): (8, 10), (12, 7), while the correct values were (X, Y):
(8, 12), (10, 8). Obtain the correct value of the correlation coefficient between X and Y.
Solution:
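Correct $\sum X$ = 120 - 8 - 12 + 8 + 10 = 118
Correct $\sum X^2$ = $600 - 8^2 - 12^2 + 8^2 + 10^2$
= 600 - 64 - 144 + 64 + 100 = 556
Correct $\sum Y$ = 90 - 10 - 7 + 12 + 8 = 93
Correct $\sum Y^2$ = $250 - 10^2 - 7^2 + 12^2 + 8^2$
= 250 - 100 - 49 + 144 + 64 = 309
Correct $\sum XY$ = 335 - (8×10) - (12×7) + (8×12) + (10×8)
= 335 - 80 - 84 + 96 + 80 = 347
$$r = \frac{347 - \frac{118 \times 93}{30}}{\sqrt{\left(556 - \frac{(118)^2}{30}\right)\left(309 - \frac{(93)^2}{30}\right)}} = \frac{-18.8}{\sqrt{91.8667 \times 20.7}} = -0.4311$$
Hence there exists poor negative correlation.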
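The correction procedure in Example 4 can be sketched in a few lines: remove the contribution of each wrongly copied pair from every sum, add the contribution of the correct pair, and then apply the raw-sum formula for r. The helper name `corrected_sum` is a hypothetical choice for this illustration.

```python
from math import sqrt


def corrected_sum(total, wrong_terms, right_terms):
    """Remove the wrongly recorded terms from a total and add the correct ones."""
    return total - sum(wrong_terms) + sum(right_terms)


n = 30
wrong_pairs = [(8, 10), (12, 7)]   # pairs as copied down
right_pairs = [(8, 12), (10, 8)]   # pairs as they should have been

sum_x = corrected_sum(120, [x for x, _ in wrong_pairs], [x for x, _ in right_pairs])
sum_y = corrected_sum(90, [y for _, y in wrong_pairs], [y for _, y in right_pairs])
sum_x2 = corrected_sum(600, [x * x for x, _ in wrong_pairs], [x * x for x, _ in right_pairs])
sum_y2 = corrected_sum(250, [y * y for _, y in wrong_pairs], [y * y for _, y in right_pairs])
sum_xy = corrected_sum(335, [x * y for x, y in wrong_pairs], [x * y for x, y in right_pairs])

numerator = sum_xy - sum_x * sum_y / n
denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = numerator / denominator

print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)  # 118 556 93 309 347
print(round(r, 4))  # about -0.4311
```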
Regression analysis
Meaning: The study of the relationship between associated variables, wherein one
variable is dependent on another, independent variable, is called regression. It was developed by
Sir Francis Galton in 1877 to measure the relationship of height between parents and their
children.
Regression analysis is a statistical tool to study the nature and extent of functional
relationship between two or more variables and to estimate (or predict) the unknown values of
dependent variables from the known values of independent variables.
The variable that forms the basis for predicting another variable is known as the Independent
Variable and the variable that is predicted is known as the dependent variable. For example, if we
know that two variables price (X) and demand (Y) are closely related we can find out the most
probable value of X for a given value of Y or the most probable value of Y for a given value of
X. Similarly, if we know that the amount of tax and the rise in the price of a commodity are
closely related, we can find out the expected price for a certain amount of tax levy.
Uses of Regression Analysis:
1. It provides estimates of values of the dependent variables from values of independent variables.
2. It is used to obtain a measure of the error involved in using the regression line as a basis for estimation.
3. With the help of regression analysis, we can obtain a measure of the degree of association or correlation that exists between the two variables.
4. It is a highly valuable tool in economics and business research, since most of the problems of
economic analysis are based on cause and effect relationships.
Sl No | Correlation | Regression
1     | It measures the degree and direction of relationship between the variables. | It measures the nature and extent of average relationship between two or more variables in terms of the original units of the data.
7     | There may be zero correlation, such as between the weight of the wife and the income of the husband. | There is nothing like zero regression.
Regression Lines and Regression Equation: Regression lines and regression equations are
used synonymously. Regression equations are algebraic expressions of the regression lines. Let
us consider two variables, X and Y. If Y depends on X, the relationship is described by simple
regression. With two variables X and Y, we have two regression lines: the
regression line of X on Y and the regression line of Y on X. The regression line of Y on X gives
the most probable value of Y for a given value of X, and the regression line of X on Y gives the
most probable value of X for a given value of Y. However,
when there is either perfect positive or perfect negative correlation between the two variables,
the two regression lines coincide, i.e. we have only one line. If the variables are independent,
r is zero and the lines of regression are at right angles, i.e. parallel to the X axis and the Y axis.
Therefore, with the help of simple linear regression model we have the following two regression
lines
1. Regression line of Y on X: This line gives the probable value of Y (Dependent variable) for
any given value of X (Independent variable).
Regression line of Y on X: $Y - \bar{Y} = b_{yx}(X - \bar{X})$
OR: $Y = a + bX$
2. Regression line of X on Y: This line gives the probable value of X (Dependent variable) for
any given value of Y (Independent variable).
Regression line of X on Y: $X - \bar{X} = b_{xy}(Y - \bar{Y})$
OR: $X = a + bY$
In the above two regression lines or regression equations, there are two regression parameters,
which are “a” and “b”. Here “a” is an unknown constant and “b” which is also denoted as “byx”
or “bxy”, is also another unknown constant popularly called a regression coefficient. Hence,
these “a” and “b” are two unknown constants (fixed numerical values) which determine the
position of the line completely. If the value of either or both of them is changed, another line is
determined. The parameter “a” determines the level of the fitted line (i.e. the distance of the line
directly above or below the origin). The parameter “b” determines the slope of the line (i.e. the
change in Y for unit change in X).
If the values of constants “a” and “b” are obtained, the line is completely determined. But
the question is how to obtain these values. The answer is provided by the method of least
squares. With a little algebra and differential calculus, it can be shown that the following two
normal equations, if solved simultaneously, will yield the values of the parameters “a” and “b”.
$$\hat{Y} = a + bX$$
$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
Where
● b = slope of the best fitting estimating line
● X = Values of independent variables
● Y= Values of dependent variables
● $\bar{X}$ = Mean of the values of independent variables
● $\bar{Y}$ = Mean of the values of dependent variables
● n = Number of data points
$$a = \bar{Y} - b\bar{X}$$
Where
● a = Y-intercept
● b = slope of the best fitting estimating line
● $\bar{X}$ = Mean of the values of independent variables
● $\bar{Y}$ = Mean of the values of dependent variables
With these two equations, we can find the best fitting regression line for any two variable sets of
data points.
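The two formulas for b and a translate directly into code. The following is a minimal sketch; the function name `fit_line` and the small data set are hypothetical and chosen only so that the answer is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Least-squares line Y = a + bX using the slope and intercept formulas above."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (sum XY - n * Xbar * Ybar) / (sum X^2 - n * Xbar^2)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    # a = Ybar - b * Xbar
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X (not taken from the notes).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a, b = fit_line(xs, ys)
print(a, b)          # 1.0 2.0
print(a + b * 5)     # estimate of Y when X = 5: 11.0
```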
Ex- Suppose the Director of Chapel Hill Sanitation Department is interested in the relationship
between the age of a garbage truck and the annual repair expense. Determine this relationship. If
the city has a truck that is 4 years old, predict the annual repair expense for the same. Also
calculate Standard Error of Estimate.
Using this estimating equation, we can predict the annual repair expense for a truck that
is 4 years old.
Solution:
To calculate the annual profit if the firm spends $8 million on R&D in 1996, substitute X = 8.
2nd Method
Instead of solving the normal equations simultaneously, we can obtain the values of a, b, $a_1$,
$b_1$ as follows:
$$b = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{(\sum x)^2}{n}} \qquad b_1 = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum y^2 - \frac{(\sum y)^2}{n}}$$
$$a = \bar{y} - b\bar{x} \qquad a_1 = \bar{x} - b_1\bar{y}$$
Thus, after calculating the summations $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$, $\sum xy$, the values of the constants b, $b_1$,
a, $a_1$ can be obtained with the help of the above formulae, and then the regression equations can
be formed.
Example 2: Find the two regression equations and also estimate y when x = 13 and x when y =
10.
x 11 7 9 5 8 6 10
y 16 14 12 11 15 14 17
Solution:
x    y    $x^2$   $y^2$   xy
11   16   121     256     176
7    14   49      196     98
9    12   81      144     108
5    11   25      121     55
8    15   64      225     120
6    14   36      196     84
10   17   100     289     170
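Here n = 7, $\sum x = 56$, $\sum y = 99$, $\sum x^2 = 476$, $\sum y^2 = 1427$, $\sum xy = 811$, so $\bar{x} = 8$ and $\bar{y} = 14.1429$.
$b = \dfrac{811 - \frac{56 \times 99}{7}}{476 - \frac{(56)^2}{7}} = \dfrac{19}{28} = 0.6786$ and $a = \bar{y} - b\bar{x} = 14.1429 - 0.6786 \times 8 = 8.7141$
Regression equation of y on x: y = 8.7141 + 0.6786x
y = 8.7141 + (0.6786 × 13)
y = 17.5359 is the estimated value of y when x = 13
$b_1 = \dfrac{19}{1427 - \frac{(99)^2}{7}} = \dfrac{19}{26.8571} = 0.7074$ and $a_1 = \bar{x} - b_1\bar{y} = 8 - 0.7074 \times 14.1429 = -2.0047$
Regression equation of x on y: x = -2.0047 + 0.7074y
x = -2.0047 + (0.7074 × 10)
x = 5.0693 is the estimated value of x when y = 10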
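The sketch below recomputes Example 2 from the raw data with the "2nd Method" formulas. The function name `regression_lines` is an illustrative choice. Because the code carries no intermediate rounding, its estimates (about 17.5357 and 5.0691) differ from the hand-rounded values in the fourth decimal place.

```python
def regression_lines(xs, ys):
    """Return (a, b, a1, b1) for the lines y = a + b*x and x = a1 + b1*y."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    s_xy = sum_xy - sum_x * sum_y / n           # shared numerator
    b = s_xy / (sum_x2 - sum_x ** 2 / n)        # regression coefficient of y on x
    b1 = s_xy / (sum_y2 - sum_y ** 2 / n)       # regression coefficient of x on y
    a = sum_y / n - b * sum_x / n               # a = ybar - b * xbar
    a1 = sum_x / n - b1 * sum_y / n             # a1 = xbar - b1 * ybar
    return a, b, a1, b1


xs = [11, 7, 9, 5, 8, 6, 10]
ys = [16, 14, 12, 11, 15, 14, 17]

a, b, a1, b1 = regression_lines(xs, ys)
y_at_13 = a + b * 13      # estimate of y when x = 13
x_at_10 = a1 + b1 * 10    # estimate of x when y = 10
print(round(b, 4), round(b1, 4))              # 0.6786 0.7074
print(round(y_at_13, 4), round(x_at_10, 4))   # 17.5357 5.0691
```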
Sometimes the means and standard deviations of the two variables, also the value of the
coefficient of correlation are known. Then we need not study the entire set of values again.
Here, we can calculate b and 𝑏1 as follows.
                     x      y
Mean                 43     37
Standard deviation   3.1    2.8
Coefficient of correlation r = 0.65
Solution:
(i) For the regression equation of y on x,
$$b = \frac{r\sigma_y}{\sigma_x} = \frac{0.65 \times 2.8}{3.1} = 0.5871$$
The regression equation is
∴ $y = \bar{y} + b(x - \bar{x})$
∴ y = 37 + 0.5871 (x - 43)
∴ y = 37 + 0.5871x - 25.2453
∴ y = 0.5871x + 11.7547
To estimate y when x = 40,
∴ y = 0.5871 × 40 + 11.7547
y = 35.2387
x = 41.5608
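When only the means, standard deviations and r are known, the line of y on x follows from $b = r\sigma_y/\sigma_x$ and $a = \bar{y} - b\bar{x}$. The following sketch reproduces part (i) of this example; the function name `line_from_summary` is hypothetical.

```python
def line_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Regression line of y on x, y = a + b*x, from summary statistics only."""
    b = r * sd_y / sd_x        # regression coefficient of y on x
    a = mean_y - b * mean_x    # the line passes through (mean_x, mean_y)
    return a, b


# Values used in the worked solution: means 43 and 37, s.d. 3.1 and 2.8, r = 0.65.
a, b = line_from_summary(mean_x=43, mean_y=37, sd_x=3.1, sd_y=2.8, r=0.65)
y_at_40 = a + b * 40

print(round(b, 4), round(a, 4))   # 0.5871 11.7548
print(round(y_at_40, 4))          # 35.2387
```

The intercept prints as 11.7548 rather than 11.7547 because the code does not round b before using it.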
We have seen two regression equations and correspondingly two regression lines can be drawn,
one where X is independent and Y is dependent (Y on X) and the other, where Y is independent
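and X is dependent (X on Y). The regression coefficients b and $b_1$ represent the slopes of these lines:
$$b = \frac{r\sigma_y}{\sigma_x} \quad \text{and} \quad b_1 = \frac{r\sigma_x}{\sigma_y}$$
$$\therefore \; b \times b_1 = \frac{r\sigma_y}{\sigma_x} \times \frac{r\sigma_x}{\sigma_y} = r^2$$
$$\therefore \; r = \pm\sqrt{b \times b_1}$$
Note that r is positive if b and $b_1$ are positive, and r is negative if b and $b_1$ are negative.
Thus, r, the correlation coefficient, is the geometric mean of the regression coefficients b and $b_1$.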
This property can be used to obtain r, the correlation coefficient from the regression equations.
As an illustration, consider the following example.
Example 4: Consider the relationship between consumption expenditure (C) and income
(Y) in an economy as modeled by two different researchers:
Researcher 1: 2C-Y = 15
Researcher 2: 3C-4Y = -25
(i) Calculate the arithmetic means of income ($\bar{Y}$) and consumption ($\bar{C}$).
(ii) Compute the correlation coefficient between income and consumption based on these two
regression equations.
Solution:
(i) To find the mean values of C and Y.
The given lines are
2C - Y - 15 = 0 …(i)
3C - 4Y + 25 = 0 …(ii)
Since both regression lines pass through the point of means $(\bar{C}, \bar{Y})$,
$2\bar{C} - \bar{Y} - 15 = 0$ …(iii)
$3\bar{C} - 4\bar{Y} + 25 = 0$ …(iv)
Multiplying (iii) by 4 and subtracting it from (iv) gives $-5\bar{C} + 85 = 0$,
so $\bar{C} = 17$.
Substituting $\bar{C} = 17$ in equation (iii)
gives $\bar{Y} = 2(17) - 15 = 19$.
∴ $\bar{C} = 17$, $\bar{Y} = 19$
(ii) To find r, the coefficient of correlation. First we have to find the regression coefficients b and
$b_1$ from the equations. Let equation (i) be the regression equation of C on Y. Write it in the
standard form of C on Y:
$C = a_1 + b_1 Y$
Equation (i) is 2C - Y - 15 = 0
∴ 2C = Y + 15
∴ $C = \dfrac{Y}{2} + \dfrac{15}{2}$
Now, by comparing it with the standard form of the regression line of C on Y,
$b_1$ = coefficient of Y in the equation = $\dfrac{1}{2}$
Let equation (ii) be the regression equation of Y on C. Writing it in the standard form
Y = a + bC, we have
3C - 4Y + 25 = 0
∴ 4Y = 3C + 25
∴ $Y = \dfrac{3C}{4} + \dfrac{25}{4}$
Now, by comparing it with the standard form of the regression line of Y on C,
b = coefficient of C in the equation = $\dfrac{3}{4}$
Now, $r = \pm\sqrt{b \times b_1} = \pm\sqrt{\dfrac{3}{4} \times \dfrac{1}{2}} = \pm\sqrt{\dfrac{3}{8}} = \pm 0.6123$
As b and $b_1$ are positive, r is also positive.
So, r = 0.6123.
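The two steps of this example can be sketched as follows: solve the two lines simultaneously to get the means, then take the signed geometric mean of the regression coefficients. The helper `solve_2x2` uses Cramer's rule and is an illustrative choice; the assignment of equation (i) to C on Y and equation (ii) to Y on C follows the solution above.

```python
from math import copysign, sqrt


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*u + b1*v = c1 and a2*u + b2*v = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    u = (c1 * b2 - c2 * b1) / det
    v = (a1 * c2 - a2 * c1) / det
    return u, v


# 2C - Y = 15 and 3C - 4Y = -25; the solution is the point of means.
mean_c, mean_y = solve_2x2(2, -1, 15, 3, -4, -25)

b_c_on_y = 1 / 2   # from C = Y/2 + 15/2
b_y_on_c = 3 / 4   # from Y = 3C/4 + 25/4

# r takes the common sign of the two regression coefficients.
r = copysign(sqrt(b_c_on_y * b_y_on_c), b_y_on_c)

print(mean_c, mean_y)   # 17.0 19.0
print(round(r, 4))      # 0.6124
```

The printed 0.6124 is the rounded value of $\sqrt{3/8} = 0.61237\ldots$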
Solution:
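(i) To find the mean values of Q and P from the regression equations (with x = Q and y = P)
5x - 6y + 90 = 0 …(i)
15x - 8y - 180 = 0 …(ii)
Since both regression lines pass through the point of means $(\bar{Q}, \bar{P})$,
$5\bar{Q} - 6\bar{P} + 90 = 0$ …(iii)
$15\bar{Q} - 8\bar{P} - 180 = 0$ …(iv)
By solving them simultaneously we get
$\bar{Q} = 36$, $\bar{P} = 45$
(ii) To find r, the coefficient of correlation.
1. Solving each equation for y gives the slope of each line with respect to x:
from (i), $y = \dfrac{5}{6}x + 15$, ∴ $b = \dfrac{5}{6}$
from (ii), $y = \dfrac{15}{8}x - 22.5$, ∴ $b = \dfrac{15}{8}$
2. As $\dfrac{5}{6} < \dfrac{15}{8}$, choose the lesser one to be b and the corresponding equation to be that
of P on Q. Hence equation (i) is that of P on Q and $b = \dfrac{5}{6}$.
3. So equation (ii) is that of Q on P, and $b_1$ is the reciprocal of the slope obtained from
equation (ii):
∴ $b_1 = \dfrac{1}{15/8} = \dfrac{8}{15}$
Now, $r = \sqrt{b \times b_1} = \sqrt{\dfrac{5}{6} \times \dfrac{8}{15}} = \sqrt{\dfrac{40}{90}} = 0.6667$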
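(iii) To find the standard deviation of Q,
We know that $b = \dfrac{5}{6}$, $r = 0.6667 = \dfrac{2}{3}$, $\sigma_P = 1$.
Consider $b = \dfrac{r\sigma_P}{\sigma_Q}$
Substituting the above values,
$$\frac{5}{6} = \frac{2}{3} \times \frac{1}{\sigma_Q}$$
$$\therefore \; \sigma_Q = \frac{2}{3} \times \frac{6}{5} = \frac{4}{5} = 0.8$$
So the standard deviation of Q is 0.8.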
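Deciding which equation is the line of P on Q can also be done by checking that the product of the two regression coefficients does not exceed 1, since $b \times b_1 = r^2 \le 1$. The sketch below tries both assignments and keeps the valid one. The function name `identify_lines` and the (p, q, c) encoding of a line p·x + q·y + c = 0 are assumptions made for this illustration.

```python
from math import sqrt


def identify_lines(line1, line2):
    """Each line is (p, q, c), meaning p*x + q*y + c = 0.

    Returns (b_yx, b_xy, r) for the assignment whose product of
    regression coefficients lies between 0 and 1.
    """
    for y_on_x, x_on_y in ((line1, line2), (line2, line1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # slope when the line is solved for y
        b_xy = -x_on_y[1] / x_on_y[0]   # slope when the line is solved for x
        product = b_yx * b_xy
        if 0 <= product <= 1:
            r = sqrt(product) if b_yx >= 0 else -sqrt(product)
            return b_yx, b_xy, r
    raise ValueError("no valid assignment of the two regression lines")


# 5x - 6y + 90 = 0 and 15x - 8y - 180 = 0, with x = Q and y = P.
b_pq, b_qp, r = identify_lines((5, -6, 90), (15, -8, -180))

sigma_p = 1.0                  # value used in the solution above
sigma_q = r * sigma_p / b_pq   # rearranged from b = r * sigma_P / sigma_Q

print(round(b_pq, 4), round(b_qp, 4), round(r, 4))   # 0.8333 0.5333 0.6667
print(round(sigma_q, 4))                             # 0.8
```

Swapping the order of the two arguments gives the same result, because the first assignment tried would then have a product of 2.25 and be rejected.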
Multiple Regression
As we mentioned above, we can use more than one independent variable to estimate the
dependent variable and, in this way, attempt to increase the accuracy of the estimate. This
process is called multiple regression and correlation analysis. It is based on the same
assumptions and procedures we have encountered using simple regression. The general
regression equation with two independent variables is
$$\hat{Y} = a + b_1 X_1 + b_2 X_2$$
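For two independent variables, least squares leads to three normal equations in a, $b_1$ and $b_2$: $\sum Y = na + b_1\sum X_1 + b_2\sum X_2$, $\sum X_1Y = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2$, and $\sum X_2Y = a\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2$. The sketch below builds and solves them with plain Gaussian elimination. The function names and the data are hypothetical; the data were generated from Y = 1 + 2X1 + 3X2 so the result can be checked by eye.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_plane(x1s, x2s, ys):
    """Return (a, b1, b2) for Y = a + b1*X1 + b2*X2 from the normal equations."""
    n = len(ys)
    s1, s2, sy = sum(x1s), sum(x2s), sum(ys)
    s11 = sum(u * u for u in x1s)
    s22 = sum(v * v for v in x2s)
    s12 = sum(u * v for u, v in zip(x1s, x2s))
    s1y = sum(u * y for u, y in zip(x1s, ys))
    s2y = sum(v * y for v, y in zip(x2s, ys))
    matrix = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
    return tuple(solve_linear_system(matrix, [sy, s1y, s2y]))


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 (not from the notes).
x1s = [1, 2, 3, 4, 5]
x2s = [2, 1, 4, 3, 6]
ys = [9, 8, 19, 18, 29]

a, b1, b2 = fit_plane(x1s, x2s, ys)
prediction = a + b1 * 6 + b2 * 5   # estimate of Y at X1 = 6, X2 = 5
print(round(a, 6), round(b1, 6), round(b2, 6), round(prediction, 6))
```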
Example: Given the following set of data calculate (i) Multiple Regression plane (ii) Predict
Y when 𝑋1= 3 and 𝑋2= 2.7
Y 25 30 11 22 27 19
Solution:
Consider the normal equations for the equation of Y on 𝑋1 and 𝑋2
𝑌 = 23.82449