Unit 2 - (A) Correlation & Regression
Semester –I
Unit – II
Correlation and Regression
➢ Introduction:
So far we have studied series in which each item takes a value of a single variable. For such series, measures of central tendency and measures of dispersion are calculated for the purpose of comparison and analysis, and with the help of these measures the data can be easily understood. There can, however, also be series in which each item assumes the values of two or more variables. For example, if the heights and weights of a group of persons are measured, we get a series in which each member of the group assumes two values, one relating to height and the other relating to weight. Such a distribution is known as a bivariate distribution.
Sometimes the values of the variables so obtained appear to be interrelated. Such a relationship is likely in the two series relating to the heights and weights of a group of persons: weight tends to increase with height, so tall people are, on average, heavier than short people. Similarly, if data are collected about the prices of a commodity and the quantities sold at different prices, two series are obtained, and again we are likely to find some relationship between them: as the price of the commodity increases, the quantity sold generally decreases. We can thus conclude that there is some relationship between price and demand. Such relationships can be found in many types of series, for example, price and supply, heights and weights of persons, price of sugar and sugarcane, ages of husbands and wives, etc.
So, we can say that “The term correlation (or co-variation) indicates the relationship between two such variables in which, with changes in the values of one variable, the values of the other variable also change.” Thus correlation is a statistical tool for studying the relationship between two variables. For a correlation to be meaningful, the two phenomena should have a plausible cause-effect relationship. If no such relationship exists, an observed correlation may be purely coincidental and should not be interpreted.
➢ Types of correlation:
1) By direction of change (Positive and Negative)
Positive Correlation: While studying the relationship between any two related variables, if the values of the variables deviate in the same direction, i.e. if one variable increases (or decreases) and the corresponding value of the second variable also increases (or decreases), then it is called a positive correlation. E.g. height and weight of human beings, demand and supply, and amount of rainfall and yield of crop have positive correlation.
Negative Correlation: While studying the relationship between any two related variables, if the values of the variables deviate in opposite directions, i.e. if one variable increases (or decreases) and the corresponding value of the second variable decreases (or increases), then it is called a negative correlation. E.g. price and demand of a commodity, and temperature and sales of woollen clothes, have negative correlation.
• If the plotted dots lie on a straight line rising from the lower left-hand corner to the upper right-hand corner, the correlation is said to be perfect positive correlation.
• If the plotted dots lie on a straight line falling from the upper left-hand corner to the lower right-hand corner, the correlation is said to be perfect negative correlation.
• If the plotted dots fall in a narrow band showing a rising tendency from the lower left-hand corner to the upper right-hand corner, the correlation is high degree positive correlation. As the band becomes wider, the degree of correlation becomes lower, and it is called low degree positive correlation.
• If the plotted dots fall in a narrow band showing a decreasing tendency from the upper left-hand corner to the lower right-hand corner, the correlation is high degree negative correlation. As the band becomes wider, the degree of correlation becomes lower, and it is called low degree negative correlation.
• If the dots are widely scattered in a haphazard manner, it indicates no correlation between the two study variables.
[Table: layout of a bivariate frequency table, with cell frequencies f_xy, total frequencies of X (f_x) and grand total N.]
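Karl Pearson’s Correlation coefficient: It measures the degree of correlation between two variables x and y, and is denoted by $r_{xy}$ or simply $r$. It can be written as
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
Where,
$$\bar{x} = \frac{\sum x}{n} \quad \text{for without frequency data}$$
$$\bar{x} = \frac{\sum f_x x}{N} \quad \text{for with frequency data}$$
$$\bar{y} = \frac{\sum y}{n} \quad \text{for without frequency data}$$
$$\bar{y} = \frac{\sum f_y y}{N} \quad \text{for with frequency data}$$
$$\operatorname{cov}(x, y) = \frac{\sum xy}{n} - \bar{x}\,\bar{y} \quad \text{for without frequency data}$$
$$\operatorname{cov}(x, y) = \frac{\sum f_{xy}\, xy}{N} - \bar{x}\,\bar{y} \quad \text{for with frequency data}$$
$$\sigma_x^2 = \frac{\sum x^2}{n} - (\bar{x})^2 \quad \text{for without frequency data}$$
$$\sigma_x^2 = \frac{\sum f_x x^2}{N} - (\bar{x})^2 \quad \text{for with frequency data}$$
$$\sigma_y^2 = \frac{\sum y^2}{n} - (\bar{y})^2 \quad \text{for without frequency data}$$
$$\sigma_y^2 = \frac{\sum f_y y^2}{N} - (\bar{y})^2 \quad \text{for with frequency data}$$
If the correlation coefficient is close to −1, there is a strong negative relationship between the two variables. Similarly, a value close to +1 indicates a strong positive relationship.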
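As a minimal sketch of the definition r = cov(x, y) / (σx σy) for data without frequencies, the following Python code may help. The function name `pearson_r` and the five data pairs are illustrative assumptions and are not taken from these notes.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed as cov(x, y) / (sd_x * sd_y) for ungrouped data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # cov(x, y) = (sum of xy) / n - mean_x * mean_y
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    # variance = (sum of squares) / n - mean squared
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov_xy / math.sqrt(var_x * var_y)

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))  # 0.7746
```

Here cov(x, y) = 1.2, σx² = 2 and σy² = 1.2, so r = 1.2 / √2.4 ≈ 0.77, a fairly high positive correlation.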
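Formulas:
(a) For ungrouped bivariate data (without frequency)
$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$
$$r_{xy} = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}}$$
$$r_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
(b) For grouped bivariate data (with frequency)
$$r_{xy} = \frac{N\sum f_{xy}\, xy - \sum f_x x \sum f_y y}{\sqrt{\left[N\sum f_x x^2 - \left(\sum f_x x\right)^2\right]\left[N\sum f_y y^2 - \left(\sum f_y y\right)^2\right]}}$$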
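The grouped-data (with frequency) formula can be sketched in Python as follows. The small frequency table, the dictionary layout `(x, y) -> f_xy` and the name `grouped_r` are assumptions made for illustration; x and y stand for class mid-values.

```python
import math

# Hypothetical bivariate frequency table: (x, y) -> cell frequency f_xy.
# x and y are class mid-values; the numbers are made up for illustration.
freq = {(10, 1): 4, (10, 2): 2, (20, 2): 3, (20, 3): 1}

def grouped_r(freq):
    """Pearson's r for a bivariate frequency table."""
    N = sum(freq.values())
    # Summing f_xy * x over all cells equals the sum of f_x * x (marginal totals),
    # and likewise for y.
    sum_fx = sum(f * x for (x, y), f in freq.items())
    sum_fy = sum(f * y for (x, y), f in freq.items())
    sum_fxx = sum(f * x * x for (x, y), f in freq.items())
    sum_fyy = sum(f * y * y for (x, y), f in freq.items())
    sum_fxy = sum(f * x * y for (x, y), f in freq.items())
    num = N * sum_fxy - sum_fx * sum_fy
    den = math.sqrt((N * sum_fxx - sum_fx ** 2) * (N * sum_fyy - sum_fy ** 2))
    return num / den

r_grouped = grouped_r(freq)
print(round(r_grouped, 4))  # 0.7013
```

For this table N = 10, Σfx = 140, Σfy = 17, Σfx² = 2200, Σfy² = 33 and Σfxy = 260, so r = 220 / √(2400 × 41) ≈ 0.70.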
➢ Properties of Correlation Coefficient:
(i) Karl Pearson’s Correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1
(ii) Correlation coefficient is independent of the change of origin and scale.
(iii) Two independent variables are uncorrelated, i.e. $r_{xy} = 0$ for independent variables, but the converse is not true: $r_{xy} = 0$ does not imply that the variables are independent.
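Properties (ii) and (iii) can be checked numerically. The sketch below uses hypothetical data and an illustrative helper `pearson_r`; the change of origin and scale uses positive scale factors (u = (x − 3)/2, v = (y − 4)/10), and the second data set takes y = x², which depends entirely on x yet gives r = 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (ungrouped data)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Property (ii): change of origin and scale leaves r unchanged
us = [(x - 3) / 2 for x in xs]
vs = [(y - 4) / 10 for y in ys]
r_original = pearson_r(xs, ys)
r_coded = pearson_r(us, vs)

# Property (iii): y = x**2 is completely determined by x, yet r is zero
zs = [-2, -1, 0, 1, 2]
r_dependent = pearson_r(zs, [z * z for z in zs])

print(round(r_original, 4), round(r_coded, 4), r_dependent)
```

The first two values agree (about 0.7746) and the third is zero, which illustrates that zero correlation only rules out a linear relationship.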
Rank Correlation:
In statistics, a rank correlation is any of several statistics that measure an ordinal association—the relationship
between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is
the assignment of the ordering labels "first", "second", "third", etc. to different observations of a particular
variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used
to assess the significance of the relation between them.
If, for example, one variable is the identity of a college basketball program and another variable is the identity of
a college football program, one could test for a relationship between the poll rankings of the two types of program:
do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank
correlation coefficient can measure that relationship, and the measure of significance of the rank correlation
coefficient can show whether the measured relationship is small enough to likely be a coincidence.
If there is only one variable, the identity of a college football program, but it is subject to two different poll
rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings
can be measured with a rank correlation coefficient.
The Spearman correlation coefficient, $r_s$, can take values from +1 to −1.
An $r_s$ of +1 indicates a perfect positive association of ranks, an $r_s$ of zero indicates no association between ranks, and an $r_s$ of −1 indicates a perfect negative association of ranks.
The closer $r_s$ is to zero, the weaker the association between the ranks.
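Example: The marks of 10 students in two exams are given below.
Exam      Marks
English   56  75  45  71  62  64  58  80  76  61
Maths     66  70  40  60  65  56  59  77  67  63
Ranking each set of marks (rank 1 for the highest mark) and taking the difference d between the two ranks gives:
English   Maths   Rank (English)   Rank (Maths)   d   d²
56        66      9                4              5   25
75        70      3                2              1   1
45        40      10               10             0   0
71        60      4                7              3   9
62        65      6                5              1   1
64        56      5                9              4   16
58        59      8                8              0   0
80        77      1                1              0   0
76        67      2                3              1   1
61        63      7                6              1   1
Total                                                 ∑d² = 54
$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 54}{10(10^2 - 1)} = 1 - \frac{324}{990} = 0.67$$
as n = 10. Hence, we have a ρ (or $r_s$) of 0.67. This indicates a strong positive relationship between the ranks individuals obtained in the Maths and English exams. That is, the higher you ranked in Maths, the higher you tended to rank in English also, and vice versa.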
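The rank correlation for the exam marks above can be reproduced with a short Python sketch. The helper `ranks` is an illustrative assumption and only handles data without tied values, which is the case here.

```python
english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

def ranks(values):
    # Rank 1 goes to the highest value; assumes there are no tied values.
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rank_e = ranks(english)
rank_m = ranks(maths)

# Squared differences between the two ranks of each student
d_squared = [(a - b) ** 2 for a, b in zip(rank_e, rank_m)]

n = len(english)
rho = 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

print(sum(d_squared), round(rho, 2))  # 54 0.67
```

With tied marks the ranks would need to be averaged, which this sketch does not attempt.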
Regression Analysis:
The line of regression of Y on X is written as Y = a + bX.
It may be noted that in this equation ‘y’ is the dependent variable and ‘x’ is the independent variable.
‘a’ is the Y-intercept and
‘b’ is the slope of the line; it represents the change in the Y variable for a unit change in the X variable.
The values of the numerical constants ‘a’ and ‘b’ are obtained by fitting the line of best fit, which is based on the principle of least squares. The principle of least squares is that we minimize the sum of squares of the deviations, or errors of estimate. Here the deviations are the differences between the observed values of the variable and the corresponding estimated values given by the line of best fit.
Thus the line of regression of Y on X is written as
$$(y - \bar{y}) = b_{yx}\,(x - \bar{x})$$
where $b_{yx} = r_{xy}\,\dfrac{\sigma_y}{\sigma_x}$ is called the regression coefficient of y on x.
• Line of Regression of X on Y:
• X = c + dY
• $(x - \bar{x}) = b_{xy}\,(y - \bar{y})$
where $d = b_{xy} = r_{xy}\,\dfrac{\sigma_x}{\sigma_y}$ is called the regression coefficient of x on y.
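When only the means, standard deviations and r are known, both regression lines follow directly from the coefficient formulas. The sketch below is one way to write this in Python; the function name `regression_lines` and the summary statistics passed to it are hypothetical.

```python
def regression_lines(mean_x, mean_y, sd_x, sd_y, r):
    """Build both regression lines from summary statistics."""
    b_yx = r * sd_y / sd_x  # regression coefficient of y on x
    b_xy = r * sd_x / sd_y  # regression coefficient of x on y

    def y_on_x(x):
        # (y - mean_y) = b_yx * (x - mean_x)
        return mean_y + b_yx * (x - mean_x)

    def x_on_y(y):
        # (x - mean_x) = b_xy * (y - mean_y)
        return mean_x + b_xy * (y - mean_y)

    return b_yx, b_xy, y_on_x, x_on_y

# Hypothetical summary statistics, chosen only for illustration
b_yx, b_xy, y_on_x, x_on_y = regression_lines(10, 20, 2, 4, 0.5)
print(b_yx, b_xy)   # 1.0 0.25
print(y_on_x(12))   # 22.0
print(x_on_y(24))   # 11.0
```

Note that the line of Y on X is used to estimate y from a given x, and the line of X on Y to estimate x from a given y; the two are not interchangeable unless the correlation is perfect.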
➢ Regression coefficient: It gives the rate of change of the dependent variable when the independent variable changes by one unit. It is also called the slope of the line.
i.e. $b_{yx}$ measures by how many units the variable y changes when x changes by one unit,
and $b_{xy}$ measures by how many units the variable x changes when y changes by one unit.
➢ Formulas:
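(a) For ungrouped bivariate data (without frequency)
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} \quad \text{and} \quad b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$b_{xy} = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum y^2 - n\bar{y}^2} \quad \text{and} \quad b_{yx} = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum x^2 - n\bar{x}^2}$$
$$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2} \quad \text{and} \quad b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$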
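The raw-sum formulas for the two regression coefficients can be tried out with a few lines of Python. The data below are hypothetical; the intercepts use a = ȳ − b_yx x̄ and c = x̄ − b_xy ȳ, which follow from the fact that each line passes through the means.

```python
# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)

# Regression coefficients from the raw-sum formulas
b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b_xy = (n * sum_xy - sum_x * sum_y) / (n * sum_yy - sum_y ** 2)

mean_x, mean_y = sum_x / n, sum_y / n
a = mean_y - b_yx * mean_x   # intercept of Y = a + bX
c = mean_x - b_xy * mean_y   # intercept of X = c + dY

print(b_yx, b_xy)  # 0.6 1.0
print(f"Y = {a:.1f} + {b_yx:.1f} X")
print(f"X = {c:.1f} + {b_xy:.1f} Y")
```

For these data the lines are Y = 2.2 + 0.6X and X = −1.0 + 1.0Y.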
2. If one regression coefficient is greater than one, then the other regression coefficient must be less than one, i.e. $b_{xy} \cdot b_{yx} \le 1$.
3. The signs of both regression coefficients and of the correlation coefficient are ALWAYS the same.
4. The arithmetic mean of the regression coefficients is greater than or equal to the correlation coefficient, i.e. $\frac{1}{2}\,(b_{xy} + b_{yx}) \ge r$ (for positive r; in general the inequality holds for the absolute values).
5. Regression coefficients are independent of change of origin but not of scale.
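The properties above can be checked on a small data set. This is only a sketch: the data and the helper `b_coeff` are hypothetical, and r is recovered from the two coefficients using the standard relation r² = b_xy · b_yx, with the sign taken from the coefficients.

```python
import math

def b_coeff(dep, indep):
    """Regression coefficient of `dep` on `indep` (raw-sum formula)."""
    n = len(dep)
    num = n * sum(d * i for d, i in zip(dep, indep)) - sum(dep) * sum(indep)
    den = n * sum(i * i for i in indep) - sum(indep) ** 2
    return num / den

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_yx = b_coeff(ys, xs)
b_xy = b_coeff(xs, ys)
# r has the same sign as the coefficients and r**2 = b_xy * b_yx
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(b_yx * b_xy <= 1)        # True  (property 2)
print((b_yx + b_xy) / 2 >= r)  # True  (property 4)

# Property 5: a change of origin leaves b_yx unchanged, a change of scale does not
shifted = b_coeff([y - 7 for y in ys], [x + 100 for x in xs])
scaled = b_coeff(ys, [10 * x for x in xs])
print(round(shifted, 4), round(scaled, 4))  # 0.6 0.06
```

Multiplying x by 10 divides b_yx by 10, while shifting either variable has no effect.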
➢ Remarks:
1. The two lines of regression intersect at the point of the mean values of the variables X and Y, i.e. $(\bar{X}, \bar{Y})$.
For example, for X: 4, 5, 6, 7, 8, 9, 10 (mean 7) and Y: 10, 20, 30, 40, 50, 60, 70 (mean 40), the two lines intersect at (7, 40).
2. When the two regression lines are perpendicular to each other, there is no correlation between the two study variables, i.e. $r_{xy} = 0$.
3. When the two regression lines coincide, there is perfect correlation between the two study variables, i.e. $r_{xy} = \pm 1$.
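The X and Y values listed under remark 1 can be used to illustrate remarks 1 and 3 together. The sketch below computes the means and both regression coefficients; the variable names are illustrative, and r² is obtained from the standard relation r² = b_yx · b_xy.

```python
xs = [4, 5, 6, 7, 8, 9, 10]
ys = [10, 20, 30, 40, 50, 60, 70]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Sums of products and squares of deviations from the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

b_yx = s_xy / s_xx   # slope of the line of Y on X
b_xy = s_xy / s_yy   # slope of the line of X on Y
r_squared = b_yx * b_xy

print(mean_x, mean_y)         # 7.0 40.0
print(b_yx, b_xy, r_squared)  # 10.0 0.1 1.0
```

Here Y − 40 = 10(X − 7) and X − 7 = 0.1(Y − 40) describe the same straight line through (7, 40), which is what perfect correlation means for the two regression lines.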
The regression model can be written as Y = a + bX + ε, where the residual (error of estimate) is e = Y − Ŷ.
➢ Coefficient of Determination
It is useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R². In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. The coefficient of determination is the square of the coefficient of correlation, i.e. r². Thus,
$$\text{Coefficient of determination} = r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}$$
Remark: This is true for models with only one independent variable.
For example, suppose R² has a value of 0.6483. This means 64.83% of the variation in y is explained by the regression model. The remaining 35.17% is unexplained, i.e. due to error.
In general, the higher the value of R², the better the model fits the data.
R² = 1: Perfect match between the line and the data points.
R² = 0: There is no linear relationship between x and y.
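The ratio of explained to total variation can be computed directly from a fitted line. The following sketch uses hypothetical data and fits the least-squares line Y = a + bX before splitting the total variation into explained and unexplained parts.

```python
# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Least-squares line Y = a + bX
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
b = s_xy / s_xx
a = mean_y - b * mean_x
fitted = [a + b * x for x in xs]

total = sum((y - mean_y) ** 2 for y in ys)                   # total variation
explained = sum((f - mean_y) ** 2 for f in fitted)           # explained by the line
unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual variation

r_squared = explained / total
print(round(r_squared, 4))            # 0.6
print(round(unexplained / total, 4))  # 0.4
```

Here 60% of the variation in y is explained by the line and 40% is left unexplained; dividing each sum by n would turn the variations into variances without changing the ratio.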
➢ Correlation Analysis Vs. Regression Analysis
Exercise
Correlation
1. The following data refers to advertisement expense and no. of units sold in last six months.
Ad. Expense (in ‘000 Rs.) 14 21 26 22 15 19
3. From the following data, find out the correlation coefficient between heights of fathers and sons.
Heights of fathers(inches) 65 66 67 67 68 69 70 72
Heights of sons(inches) 67 68 65 68 72 72 69 71
4. Compute Karl Pearson’s coefficient of correlation in the following series relating to cost of living
and wages.
Wages (Rs.) 100 101 102 100 99 98 97 98 96 95
Cost of living 98 99 99 97 95 92 95 94 90 91
5. A prognostic test in Mathematics was given to 10 students who were about to begin a course in Statistics. The scores (X) in this test were examined in relation to the scores (Y) in the final examination in Statistics. The following results were obtained:
∑x = 71, ∑y = 70, ∑x² = 555, ∑y² = 526, and ∑xy = 527.
Find the coefficient of correlation between x and y.
6. Calculate correlation coefficient from the following results:
N = 10, ∑(x − 14)² = 180, ∑(y − 15)² = 215, and ∑(x − 14)(y − 15) = 60.
r = 0.32
cov(x, y) = 7.86, V(x) = 10, so sd(x) = √10 = 3.162
$$r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$
9. From the following data, compute the coefficient of correlation and interpret it.
                                           x      y
No. of pairs of observations               15     15
Arithmetic mean                            25     18
Standard deviation                         3.01   3.03
Sum of squares of deviations from mean     136    138
Sum of product of deviations of x and y from their respective means = 122
$$r_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} = \frac{122}{\sqrt{136}\,\sqrt{138}} = 0.89$$
10. The following table gives bivariate frequency distribution of age and marks of 100 students in a test.
12. Following is the distribution of students according to their heights and weights:
Height (in inches)   Weight x (in lbs.)
                     90-100   100-110   110-120   120-130
50-55                4        7         5         2
55-60                6        10        7         4
60-65                6        12        10        7
65-70                3        8         6         3
Find out the correlation coefficient between height and weight.
Regression
13. Given the following information:
Year                                 1999  2000  2001  2002  2003  2004
Research expense (in ‘000 Rs.) (X)   5     11    4     5     3     2
Annual profit (in ‘000 Rs.) (Y)      31    40    30    34    25    20
(i) Develop the estimating equation that best describes the given data. (Regression equation of Y on X.)
(ii) Estimate the annual profit when the research expense is Rs. 7,000.
(iii) How much of the variation in the annual profit (Y) is explained by the variation in the research expenditure (X)? (Coefficient of determination, r².)
14. From the following data of the age of husband and the age of wife, form the two regression lines. Estimate the husband’s age when the wife’s age is 16. Estimate the wife’s age when the husband’s age is 25.
Husband’s age   36  23  27  28  28  29  30  31  33  35
Wife’s age      29  18  20  22  27  21  29  27  29  28
15. Given the following results for the height (x) and weight (y) in appropriate units of 1000 students.
Mean of X = 68, mean of y = 150, σx =2.5, σy =20, and r=0.6.
Obtain the equations of the two regression lines. Estimate the height of a student whose weight is 200 units and also estimate the weight of a student whose height is 60 units.
16. Find out the regression equation showing the regression of capacity utilization on production from the following data.
                              Average   Standard deviation
Production (in lakh units)    35.6      10.5
Capacity utilization (in %)   84.8      8.5
r = 0.62
Estimate the production when capacity utilization is 70%.
17. To know what relationship exists between unemployment and suicide attempts, a sociologist surveyed twelve cities and obtained the following data.
City                                         1    2    3    4    5    6    7    8    9    10   11    12
Unemployment rate (percent)                  7.3  6.4  6.2  5.5  6.4  4.7  5.8  7.9  6.7  9.6  10.3  7.2
No. of suicide attempts per 1000 residents   22   17   9    8    12   5    7    19   13   29   33    18
(i) Develop the estimating equation that best describes the given data.
(ii) Estimate attempted suicide rate when unemployment rate happens to be 6%.
(iii) Calculate coefficient of determination and interpret it.
18. The equations of two regression lines between two variables are expressed as 2x − 3y = 0 and 4y − 5x − 8 = 0.
(i) Identify which of the two can be called regression of y on x and of x on y.
(ii) Find mean of x and mean of y.
(iii)Find coefficient of correlation between x and y.
19. Find the regression equation of x on y and the coefficient of correlation from the following data.
∑x = 60, ∑y = 40, ∑x² = 4160, ∑y² = 1720, ∑xy = 1150 and N = 10.
20. From the following data, find out the probable yield when the rainfall is 29”.
Rainfall Yield
Mean 25” 40 units per hectare
Standard deviation 3” 6 units per hectare
Correlation coefficient between rainfall and production = 0.8
21. The following are the two regression equations. Find the correlation coefficient and mean of the
variables. If s.d. of x is 1.2 then find variance of y.
8x - 10y + 61 = 0 and 40x -18 y – 2/4.
22. A student obtained the following two regression equations. Do you agree with him?
6X = 15Y + 21 and 21X + 14Y = 56
23. Calculate the lines of regression from the following data.
Sales revenue   Advertising Expenditure
                5-15   15-25   25-35   35-45
75-125          3      4       4       8
125-175         8      6       5       7
175-225         2      2       3       4
225-275         2      3       2       2
24. A Business Statistics student has taken a random sample of starting salaries and college grade-point averages for some recently graduated friends of his, to check whether good grades in college are important for earning a good salary. The data are as follows:
Starting salary ($ thousand)   36   30   30   24   27   33   21   27
Grade-point average            4.0  3.0  3.5  2.0  3.0  3.5  2.5  2.5
(i) Plot the scatter diagram and interpret it.
(ii) Develop the estimating equation that best describes these data.
(iii) Predict the starting salary for a student having grade point average 3.5.
(iv) $r_{xy} = \sqrt{b_{yx} \times b_{xy}}$ is known as the ________ property.
(v) If r =1, the relation between bxy and byx is ________.
(vi) If the regression coefficient $b_{xy} > 1$ then $b_{yx}$ is ________.
(vii) The paired values plotted on a graph marked by points leads to a ________
diagram.
(viii) The independent variables in regression equation are often called ________
variables.
(ix) The measure of change in the dependent variable corresponding to a unit change in the independent variable is called ________.
(x) If each value of both the variables X and Y is divided by 5, then $b'_{yx}$ from the coded values will be ________ as $b_{yx}$.
(xi) The range of Pearson’s coefficient of correlation is ________.
(xii) Product moment correlation is called ________.
(xiii) If simple correlation coefficient is zero then regression coefficient is equal to
________.
(xiv) If the regression line of Y on X is 2Y = 3X-6, the estimated value of Y for given
value of X=10 is ________.
(xv) If the line of regression of Y on X is 4X − 5Y + 33 = 0 and that of X on Y is 20X − 9Y − 107 = 0, the mean values of x and y are _______.