Unit-1 Correlation and Regression
4. CORRELATION ANALYSIS
“When the relationship is of a quantitative nature, the appropriate statistical tool for discovering
the existence of relation and measuring the intensity of relationship is known as correlation”
—CROXTON AND COWDEN
LEARNING OBJECTIVES
The statistical techniques discussed so far deal with only one variable. In many research
situations one has to consider two variables simultaneously to know whether they are related
linearly and, if so, what type of relationship exists between them. This leads to bivariate (two
variables) data analysis, namely correlation analysis. If two quantities vary in such a way that
movements (upward or downward) in one are accompanied by movements (upward or downward)
in the other, these quantities are said to be co-related or correlated.
The correlation concept will help to answer the following types of questions.
• Whether study time in hours is related with marks scored in the examination?
• Is it worth spending on advertisement for the promotion of sales?
• Whether a woman’s age and her systolic blood pressure are related?
• Is age of husband and age of wife related?
• Whether price of a commodity and demand related?
• Is there any relationship between rainfall and production of rice?
This chapter investigates the type and strength of the relationship that exists between two variables.
Progressive development in the methods of science and philosophy has been characterised by a
growing knowledge of such relationships.
In this chapter, we study simple correlation only; multiple correlation and partial correlation,
involving three or more variables, will be studied in higher classes.
The correlation between two variables may be of the following types:
1. Positive correlation
2. Negative correlation
3. Uncorrelated

1) Positive Correlation:
If the values of the two variables move in the same direction, the correlation is said to be positive.
In other words, if one variable increases, the other variable (on an average) also increases, or if one
variable decreases, the other variable (on an average) also decreases.
For example,
i) Income and savings
ii) Marks in Mathematics and marks in Statistics (i.e., a direct relationship pattern exists).
[Illustrations: the height of the lift rises or falls according as the height of the goods rises or falls; the starting position of writing depends on the height of the writer.]
2) Negative Correlation:
If the values of the two variables move in opposite directions, i.e., if one variable increases, the other variable (on an average) decreases, the correlation is said to be negative.
For example,
i) Price and demand
ii) Unemployment and purchasing power
3) Uncorrelated:
The variables are said to be uncorrelated if a change in one variable is not associated with any
consistent linear change in the other, that is, if the two variables do not vary together linearly.
Here r = 0.
Important note: Uncorrelated does not imply independence. Do not interpret r = 0 as the two
variables being independent; interpret it as the absence of any specific linear pattern, while a
non-linear relationship may still exist.
4) Perfect Positive Correlation
If the values of x and y increase or decrease proportionately then they are said to have
perfect positive correlation.
5) Perfect Negative Correlation
If the values of x and y increase or decrease proportionately but in opposite directions, then they are said to have perfect negative correlation.
The purpose of correlation analysis is to find whether a linear relationship exists between
the variables. However, the method of calculating the correlation coefficient depends on the type of
measurement scale, namely, ratio scale, ordinal scale or nominal scale.
1) Positive correlation
If the plotted points in the plane form a band showing a rising trend from the lower left-hand
corner to the upper right-hand corner, the two variables are positively correlated. In this case 0 < r < 1.
2) Negative correlation
If the plotted points in the plane form a band showing a falling trend from the upper left-hand
corner to the lower right-hand corner, the two variables are negatively correlated. In this case -1 < r < 0.
3) Uncorrelated
If the plotted points spread all over the plane without any trend, the two variables are
uncorrelated. In this case r = 0.
4) Perfect positive correlation
If all the plotted points lie on a straight line rising from the lower left-hand corner to the
upper right-hand corner, the two variables have perfect positive correlation. In this case r = +1.
5) Perfect negative correlation
If all the plotted points lie on a straight line falling from the upper left-hand corner to the
lower right-hand corner, the two variables have perfect negative correlation. In this case r = -1.
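The scatter diagrams described above can be reproduced for practice. Below is a minimal sketch (not part of the original text) in Python with matplotlib; the three data sets are hypothetical and generated only to show a rising band, a falling band and a patternless cloud.

```python
# Hypothetical data illustrating positive, negative and zero correlation
# on scatter diagrams (a sketch, not from the textbook).
import random
import matplotlib.pyplot as plt

random.seed(1)
x = [i + random.uniform(-1, 1) for i in range(30)]
y_pos = [xi + random.uniform(-3, 3) for xi in x]        # rising band: 0 < r < 1
y_neg = [30 - xi + random.uniform(-3, 3) for xi in x]   # falling band: -1 < r < 0
y_none = [random.uniform(0, 30) for _ in x]             # no linear pattern: r near 0

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
titles = ["Positive correlation", "Negative correlation", "Uncorrelated"]
for ax, y, title in zip(axes, [y_pos, y_neg, y_none], titles):
    ax.scatter(x, y)
    ax.set_title(title)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()
```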
4.4.2 Properties
1. The correlation coefficient between X and Y is same as the correlation coefficient between Y and
X (i.e, rxy = ryx ).
2. The correlation coefficient is free from the units of measurement of X and Y.
3. The correlation coefficient is unaffected by change of scale and origin.
Thus, if $u_i = \dfrac{x_i - A}{c}$ and $v_i = \dfrac{y_i - B}{d}$ with $c \neq 0$ and $d \neq 0$, $i = 1, 2, \ldots, n$, then

$r = \dfrac{n\sum_{i=1}^{n} u_i v_i - \left(\sum_{i=1}^{n} u_i\right)\left(\sum_{i=1}^{n} v_i\right)}{\sqrt{n\sum_{i=1}^{n} u_i^2 - \left(\sum_{i=1}^{n} u_i\right)^2}\;\sqrt{n\sum_{i=1}^{n} v_i^2 - \left(\sum_{i=1}^{n} v_i\right)^2}}$
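The invariance to a change of origin and scale can be checked numerically. The sketch below (not part of the original text) computes Pearson's r for the data of Example 4.1, which follows, both directly and after shifting the origins to A = 68, B = 69 and rescaling by positive constants; both give the same value.

```python
# Pearson's r is unchanged by u = (x - A)/c and v = (y - B)/d with c > 0, d > 0.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

x = [65, 66, 67, 67, 68, 69, 70, 72]   # heights of fathers (Example 4.1)
y = [67, 68, 65, 68, 72, 72, 69, 71]   # heights of sons

u = [(xi - 68) / 2 for xi in x]        # change of origin A = 68 and scale c = 2
v = [(yi - 69) / 3 for yi in y]        # change of origin B = 69 and scale d = 3

print(round(pearson_r(x, y), 3))       # 0.603
print(round(pearson_r(u, v), 3))       # 0.603 -- unchanged
```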
Example 4.1
The following data gives the heights(in inches) of father and his eldest son. Compute the
correlation coefficient between the heights of fathers and sons using Karl Pearson’s method.
Height of father 65 66 67 67 68 69 70 72
Height of son 67 68 65 68 72 72 69 71
Solution:

$r = \dfrac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\;\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$
Calculation
xi yi xi² yi² xiyi
65 67 4225 4489 4355
66 68 4356 4624 4488
67 65 4489 4225 4355
67 68 4489 4624 4556
68 72 4624 5184 4896
69 72 4761 5184 4968
70 69 4900 4761 4830
72 71 5184 5041 5112
Total 544 552 37028 38132 37560

$r = \dfrac{8(37560) - (544)(552)}{\sqrt{8(37028) - (544)^2}\;\sqrt{8(38132) - (552)^2}} = \dfrac{300480 - 300288}{\sqrt{288}\;\sqrt{352}} = \dfrac{192}{318.4} = 0.603$

Heights of father and son are positively correlated. It means that, on the average, if fathers are
tall then their sons will probably be tall, and if fathers are short, their sons will probably be short.
Short-cut method
Let A = 68 , B = 69, c = 1 and d = 1
xi yi ui = (xi – A)/c = xi – 68 vi = (yi – B)/d = yi – 69 ui² vi² uivi
65 67 -3 -2 9 4 6
66 68 -2 -1 4 1 2
67 65 -1 -4 1 16 4
67 68 -1 -1 1 1 1
68 72 0 3 0 9 0
69 72 1 3 1 9 3
70 69 2 0 4 0 0
72 71 4 2 16 4 8
Total 0 0 36 44 24
$r = \dfrac{n\sum u_i v_i - \left(\sum u_i\right)\left(\sum v_i\right)}{\sqrt{n\sum u_i^2 - \left(\sum u_i\right)^2}\;\sqrt{n\sum v_i^2 - \left(\sum v_i\right)^2}}$

$r = \dfrac{8 \times 24 - 0 \times 0}{\sqrt{8 \times 36 - 0}\;\sqrt{8 \times 44 - 0}} = \dfrac{192}{\sqrt{288}\;\sqrt{352}} = 0.603$
Note: The correlation coefficient computed by using direct method and short-cut method is the same.
Example 4.2
The following are the marks scored by 7 students in two tests in a subject. Calculate
coefficient of correlation from the following data and interpret.
Marks in test-1 12 9 8 10 11 13 7
Marks in test-2 14 8 6 9 11 12 3
Solution:
Let x denote marks in test-1 and y denote marks in test-2.
xi yi xi² yi² xiyi
12 14 144 196 168
9 8 81 64 72
8 6 64 36 48
10 9 100 81 90
11 11 121 121 121
13 12 169 144 156
7 3 49 9 21
Total 70 63 728 651 676
$r = \dfrac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\;\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$

Here n = 7, $\sum x_i = 70$, $\sum y_i = 63$, $\sum x_i^2 = 728$, $\sum y_i^2 = 651$ and $\sum x_i y_i = 676$.

$r = \dfrac{7(676) - (70)(63)}{\sqrt{7(728) - (70)^2}\;\sqrt{7(651) - (63)^2}} = \dfrac{4732 - 4410}{\sqrt{5096 - 4900}\;\sqrt{4557 - 3969}} = \dfrac{322}{14 \times 24.25} = \dfrac{322}{339.5} = 0.95$

Interpretation: Marks in test-1 and marks in test-2 are highly positively correlated; students who score high marks in test-1 also tend to score high marks in test-2.
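For readers who wish to verify such computations, the following check (not part of the textbook) reproduces Example 4.2 with NumPy.

```python
# Cross-check of Example 4.2 using NumPy's built-in correlation matrix.
import numpy as np

test1 = np.array([12, 9, 8, 10, 11, 13, 7])
test2 = np.array([14, 8, 6, 9, 11, 12, 3])

r = np.corrcoef(test1, test2)[0, 1]   # off-diagonal entry is r
print(round(r, 2))                    # 0.95
```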
Correlation does not imply a causal relationship; a high correlation between two variables does
not mean that a change in one variable causes a change in the other.
NOTE
1. Uncorrelated : Uncorrelated (r = 0) implies no ‘linear relationship’. But there may exist non-
linear relationship (curvilinear relationship).
Example: Age and health care are related. Children and elderly people need much more health
care than middle aged persons as seen from the following graph.
[Figure: health care needs plotted against age form a U-shaped curve, high in childhood and old age and low for middle-aged adults.]
However, if we compute the linear correlation r for such data, it may be zero implying
age and health care are uncorrelated, but non-linear correlation is present.
2. Spurious Correlation : The word ‘spurious’ from Latin means ‘false’ or ‘illegitimate’. Spurious
correlation means an association extracted from correlation coefficient that may not exist in reality.
Spearman's rank correlation coefficient is given by

$\rho = 1 - \dfrac{6\sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$

where $D_i = R_{1i} - R_{2i}$ is the difference between the ranks of the i-th pair.
Interpretation
Spearman’s rank correlation coefficient is a statistical measure of the strength of a
monotonic (increasing/decreasing) relationship between paired data. Its interpretation is similar
to that of Pearson’s. That is, the closer to the ±1 means the stronger the monotonic relationship.
0.01 to 0.19: “Very Weak Agreement” (-0.01) to (-0.19): “Very Weak Disagreement”
0.80 to 1.0: “Very Strong Agreement” (-0.80) to (-1.0): “Very Strong Disagreement”
Example 4.3
Two referees in a flower beauty competition rank the 10 types of flowers as follows:
Referee A 1 6 5 10 3 2 4 9 7 8
Referee B 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient and find out what degree of agreement is between the
referees.
Solution:
Here n = 10, and the differences between the paired ranks, $D_i = R_{1i} - R_{2i}$, give $\sum_{i=1}^{n} D_i^2 = 60$.

$\rho = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 60}{10(10^2 - 1)} = 1 - \dfrac{360}{990} = 1 - 0.364 = 0.636$
Interpretation: Degree of agreement between the referees ‘A’ and ‘B’ is 0.636 and they have “strong
agreement” in evaluating the competitors.
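The result can be cross-checked (this check is not part of the textbook) with SciPy, which computes the same rank correlation:

```python
# Cross-check of Example 4.3: Spearman's rho for the two referees' ranks.
from scipy.stats import spearmanr

referee_a = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
referee_b = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

rho, p_value = spearmanr(referee_a, referee_b)
print(round(rho, 3))   # 0.636
```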
Example 4.4
Calculate the Spearman’s rank correlation coefficient for the following data.
Candidates 1 2 3 4 5
Marks in Tamil 75 40 52 65 60
Marks in English 25 42 35 29 33
Solution:
Ranking the marks in each subject (rank 1 for the highest mark) and taking the differences $D_i$ of the paired ranks gives $\sum D_i^2 = 40$ with n = 5.

$\rho = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 40}{5(5^2 - 1)} = 1 - \dfrac{240}{120} = 1 - 2 = -1$
Interpretation: This perfect negative rank correlation (-1) indicates that the scorings in the two
subjects totally disagree. The student who is best in Tamil is weakest in English, and vice versa.
Example 4.5
Quotations of index numbers of equity share prices of a certain joint stock company and
the prices of preference shares are given below.
Years 2013 2014 2015 2016 2017 2018 2019
Equity shares 97.5 99.4 98.6 96.2 95.1 98.4 97.1
Preference shares 75.1 75.9 77.1 78.2 79 74.6 76.2
Using the method of rank correlation determine the relationship between equity shares
and preference shares prices.
Solution:
Ranking the equity share prices and the preference share prices separately and taking the rank differences gives $\sum D_i^2 = 90$ with n = 7.

$\rho = 1 - \dfrac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 90}{7(7^2 - 1)} = 1 - \dfrac{540}{336} = 1 - 1.6071 = -0.6071$
Interpretation: There is a negative rank correlation between equity share prices and preference
share prices; the two series show strong disagreement.
In the case of repeated (tied) ranks, the formula is

$\rho = 1 - \dfrac{6\left[\sum D_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{n(n^2 - 1)}$

where $m_i$ is the number of repetitions of the i-th rank.
Example 4.6
Compute the rank correlation coefficient for the following data of the marks obtained by
8 students in the Commerce and Mathematics.
Marks in Commerce 15 20 28 12 40 60 20 80
Marks in Mathematics 40 30 50 30 20 10 30 60
Solution:

$\rho = 1 - \dfrac{6\left[\sum D_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right]}{n(n^2 - 1)}$
Repetitions of ranks
In Commerce (X), the mark 20 is repeated two times, corresponding to ranks 3 and 4. Therefore, the
average rank 3.5 is assigned to ranks 3 and 4, with m1 = 2.
In Mathematics (Y), 30 is repeated three times corresponding to ranks 3, 4 and 5. Therefore,
4 is assigned for ranks 3,4 and 5 with m2=3.
Assigning these average ranks to the tied marks and computing the rank differences gives $\sum D_i^2 = 81.5$. Therefore,

$\rho = 1 - \dfrac{6\left[81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right]}{8(8^2 - 1)} = 1 - \dfrac{6\left[81.5 + 0.5 + 2\right]}{504} = 1 - \dfrac{504}{504} = 0$
Interpretation: Marks in Commerce and Mathematics are uncorrelated.
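As a check (not from the textbook), SciPy's spearmanr assigns average ranks to tied observations, the same device used above, and it reproduces ρ = 0 for this data:

```python
# Cross-check of Example 4.6: Spearman's rho with repeated (tied) marks.
from scipy.stats import spearmanr

commerce = [15, 20, 28, 12, 40, 60, 20, 80]
maths = [40, 30, 50, 30, 20, 10, 30, 60]

rho, p_value = spearmanr(commerce, maths)
print(round(rho, 3))   # 0.0 -- the marks are uncorrelated
```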
Yule's coefficient of association:

$Q = \dfrac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)}$
Note 1: The usage of the symbol α is not to be confused with level of significance.
Note 2: (AB): Number with attributes AB etc.
This coefficient ranges from –1 to +1. The values between –1 and 0 indicate inverse
relationship (association) between the attributes. The values between 0 and +1 indicate direct
relationship (association) between the attributes.
Example 4.7
Out of 1800 candidates appeared for a competitive examination 625 were successful; 300 had
attended a coaching class and of these 180 came out successful. Test for the association of attributes
attending the coaching class and success in the examination.
Solution:
N = 1800
A: Success in examination α: No success in examination
B: Attended the coaching class β: Not attended the coaching class
(A) = 625, (B) = 300, (AB) = 180
B β Total
A 180 445 625
α 120 1055 1175
Total 300 1500 N = 1800

$Q = \dfrac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)} = \dfrac{180 \times 1055 - 445 \times 120}{180 \times 1055 + 445 \times 120} = \dfrac{189900 - 53400}{189900 + 53400} = \dfrac{136500}{243300} = 0.56$

Since Q = 0.56 > 0, there is a positive (direct) association between attending the coaching class and success in the examination.
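A small helper (not part of the original text) computes Yule's Q for any 2 × 2 table of attribute frequencies; applied to this example it reproduces the value above.

```python
# Yule's coefficient of association for a 2x2 table of attribute frequencies.
def yules_q(n_ab, n_a_beta, n_alpha_b, n_alpha_beta):
    # Q = [(AB)(alpha beta) - (A beta)(alpha B)] / [(AB)(alpha beta) + (A beta)(alpha B)]
    num = n_ab * n_alpha_beta - n_a_beta * n_alpha_b
    den = n_ab * n_alpha_beta + n_a_beta * n_alpha_b
    return num / den

# Example 4.7: (AB) = 180, (A beta) = 445, (alpha B) = 120, (alpha beta) = 1055
print(round(yules_q(180, 445, 120, 1055), 3))   # 0.561 -- positive association
```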
Remark: Consistency in the data using contingency table may be found as under.
Construct a 2 × 2 contingency table for the given information. If at least one of the cell
frequencies is negative then there is inconsistency in the given data.
Example 4.8
Verify whether the given data: N = 100, (A) = 75, (B) = 60 and (AB) = 15 is consistent.
Solution:
The given information is presented in the following contingency table.
B β Total
A 15 60 75
α 45 -20 25
Total 60 40 N = 100

Since the cell frequency (αβ) = –20 is negative, the given data is inconsistent.
POINTS TO REMEMBER
Correlation study is about finding the linear relationship between two variables.
Correlation is not causation. Sometimes the correlation may be spurious.
Correlation coefficient lies between –1 and +1.
Pearson's correlation coefficient provides both the type and the intensity of the relationship,
for data measured on the ratio scale.
Spearman’s correlation measures the relationship between the two ordinal variables.
Yule’s coefficient of Association measures the association between two dichotomous
attributes.
8. If ∑D² = 0, the rank correlation coefficient is ______.
45. Find the Karl Pearson’s coefficient of correlation for the following data.
Wages 100 101 102 102 100 99 97 98 96 95
Cost of living 98 99 99 97 95 92 95 94 90 91
How are the wages and cost of living correlated?
46. Calculate the Karl Pearson’s correlation coefficient between the marks (out of 10) in statistics
and mathematics of 6 students.
Student 1 2 3 4 5 6
Statistics 7 4 6 9 3 8
Mathematics 8 5 4 8 3 6
48. Calculate the Spearman’s rank correlation coefficient between price and supply from the
following data.
Price 4 6 8 10 12 14 16 18
Supply 10 15 20 25 30 35 40 45
49. A random sample of 5 college students is selected and their marks in Tamil and English are
found to be:
Tamil 85 60 73 40 90
English 93 75 65 50 80
Calculate Spearman’s rank correlation coefficient.
50. Calculate Spearman’s coefficient of rank correlation for the following data.
x 53 98 95 81 75 71 59 55
y 47 25 32 37 30 40 39 45
51. Calculate the coefficient of correlation for the following data using ranks.
Mark in Tamil 29 24 25 27 30 31
Mark in English 29 19 30 33 37 36
52. From the following data calculate the rank correlation coefficient.
x 49 34 41 10 17 17 66 25 17 58
y 14 14 25 7 16 5 21 10 7 20
Yule’s coefficient
53. Can vaccination be regarded as a preventive measure for Hepatitis B from the data given below?
Of 1500 persons in a locality, 400 were attacked by Hepatitis B; 750 had been vaccinated, and among
them only 75 were attacked.
III 38. n = 10
40. r = 0.85
41. (αβ ) = −50 , The given data is inconsistent
45. r = 0.847 wages and cost of living are highly positively correlated.
46. r = 0.8081. Statistics and mathematics marks are highly positively correlated.
47. ρ = 0.8929 price of tea and coffee are highly positively correlated.
49. ρ = 0.8
52. ρ = +0.733
54. There is a positive association between not attacked and not vaccinated.
5. REGRESSION ANALYSIS
Francis Galton (1822-1911) was born in a wealthy family. The youngest of nine
children, he appeared as an intelligent child. Galton’s progress in education was
not smooth. He dabbled in medicine and then studied Mathematics at Cambridge.
In fact he subsequently freely acknowledged his weakness in formal Mathematics,
but this weakness was compensated by an exceptional ability to understand the
meaning of data. Many statistical terms, which are in current usage were coined
by Galton. For example, correlation is due to him, as is regression, and he was the
originator of terms and concepts such as quartile, decile and percentile, and of the use of median as
the midpoint of a distribution.
The concept of regression comes from genetics and was popularized by Sir Francis Galton
during the late 19th century with the publication of "Regression towards Mediocrity in Hereditary Stature".
Galton observed that extreme characteristics (e.g., height) in parents are not passed on completely to
their offspring. An examination of publications of Sir Francis Galton and Karl Pearson revealed that
Galton's work on inherited characteristics of sweet peas led to the initial conceptualization of linear
regression. Subsequent efforts by Galton and Pearson brought many techniques of multiple regression
and the product-moment correlation coefficient.
LEARNING OBJECTIVES
Introduction
The correlation coefficient is a useful statistical tool for describing the type (positive or
negative or uncorrelated) and intensity of the linear relationship (such as moderate or high) between
two variables. But it fails to give a mathematical functional relationship for prediction purposes.
Regression analysis is a vital statistical method for obtaining functional relationship between a
dependent variable and one or more independent variables. More specifically, regression analysis
helps one to understand how the typical value of the dependent variable (or ‘response variable’)
changes when any one of the independent variables (regressor(s) or predictor(s)) is varied, while
the other independent variables are held fixed. It helps to determine the impact of changes in
the value(s) of the independent variable(s) upon changes in the value of the dependent variable.
Regression analysis is widely used for prediction.
Types of ‘Regression’
Based on the kind of relationship between the dependent variable and the set of independent
variable(s), there arise two broad categories of regression, viz., linear regression and non-linear regression.
If the relationship is linear and there is only one independent variable, then the regression
is called simple linear regression. On the other hand, if the relationship is linear and the
number of independent variables is two or more, then the regression is called multiple linear
regression. If the relationship between the dependent variable and the independent variable(s) is
not linear, then the regression is called non-linear regression.
NOTE
There are many reasons for the presence of the error term in the linear regression model. It is also known as measurement error. In some situations, it indicates the presence of several variables other than the present set of regressors.

[Figure: the independent variable X is related to the dependent variable Y through the regression line Y = a + bX + e, the error term e accounting for the scatter about the line.]
The general form of the simple linear regression equation is Y = a + bX + e, where 'X' is the
independent variable, 'Y' is the dependent variable, 'a' is the intercept, 'b' is the slope of the line and
'e' is the error term. This equation can be used to estimate the value of the response variable (Y)
based on the given values of the predictor variable (X) within its domain.
Before going for further study, the following points are to be kept in mind.
• Both the independent and dependent variables must be measured at the interval scale.
• There must be linear relationship between independent and dependent variables.
• Linear regression is very sensitive to outliers (extreme observations). Outliers can distort the
regression line severely and, consequently, the estimated values of Y.
The method of least squares helps us to find the values of unknowns ‘a’ and ‘b’ in such a
way that the following two conditions are satisfied:
• Sum of the residuals is zero. That is, $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$.
• Sum of the squares of the residuals, $E(a, b) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, is the least.

i.e., $E(a, b) = \sum_{i=1}^{n} (y_i - a - bx_i)^2$.

Differentiating E(a, b) partially with respect to 'a' and 'b' and equating each derivative to zero,

$\dfrac{\partial E(a,b)}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - bx_i) = 0$

$\dfrac{\partial E(a,b)}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - bx_i) = 0$

These give

$na + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$

$a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$
These equations are popularly known as normal equations. Solving these equations for ‘a’
and ‘b’ yield the estimates â and b̂ .
$\hat{a} = \bar{y} - \hat{b}\bar{x}$

and

$\hat{b} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$
It may be seen that in the estimate of ‘b’, the numerator and denominator are respectively
the sample covariance between X and Y, and the sample variance of X. Hence, the estimate of ‘b’
may be expressed as
$\hat{b} = \dfrac{Cov(X, Y)}{V(X)}$
Further, it may be noted that for notational convenience the denominator of $\hat{b}$ above is
mentioned as the variance of X. But the definition of sample variance remains valid as defined in
Chapter 1, that is, $\dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
From Chapter 4, the above estimate can be expressed using, rXY , Pearson’s coefficient of the
simple correlation between X and Y, as
$\hat{b} = r_{XY}\,\dfrac{SD(Y)}{SD(X)}$
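A minimal sketch (not in the textbook) of these estimates in Python follows; the data used are those of Exercise 37 at the end of the chapter, for which y = 2 + x exactly.

```python
# Least squares estimates: b-hat = Cov(X, Y) / V(X), a-hat = y-bar - b-hat * x-bar.
def least_squares_line(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar
    var_x = sum(xi * xi for xi in x) / n - x_bar ** 2
    b_hat = cov_xy / var_x
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]                 # Exercise 37 data: y = 2 + x
print(least_squares_line(x, y))     # (2.0, 1.0)
```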
Example 5.1
Construct the simple linear regression equation of Y on X if n = 7, $\sum x_i = 113$, $\sum x_i^2 = 1983$, $\sum y_i = 182$ and $\sum x_i y_i = 3186$.
Solution:
The simple linear regression equation of Y on X to be fitted for given data is of the form
$\hat{Y} = a + bx$   (1)
The values of ‘a’ and ‘b’ have to be estimated from the sample data solving the following
normal equations.
$na + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$   (2)

$a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$   (3)
Substituting the given sample information in (2) and (3), the above equations can be
expressed as
7 a + 113 b = 182 (4)
113 a + 1983 b = 3186 (5)
(4) ×113 ⇒ 791 a + 12769 b = 20566
(5) ×7 ⇒ 791 a + 13881 b = 22302
(−) (−) (−)
−1112 b = −1736
1736
⇒b = = 1.56
1112
b = 1.56
Substituting this in (4) it follows that,
7 a + 113 × 1.56 = 182
7 a + 176.28 = 182
7 a = 182 – 176.28
= 5.72
Hence, a = 0.82

Therefore, the fitted simple linear regression equation of Y on X is $\hat{Y} = 0.82 + 1.56x$.
Example 5.2
Fit a simple linear regression equation of productivity (Y) on man-hours (X) for the following data.

Man-hours 3.6 4.8 7.2 6.9 10.7 6.1 7.9 9.5 5.4
Productivity (in units) 9.3 10.2 11.5 12 18.6 13.2 10.8 22.7 12.7
Solution:
The simple linear regression equation to be fitted for the given data is
$\hat{Y} = a + bx$

Here, the estimates of a and b can be calculated using their least squares estimates

$\hat{a} = \bar{y} - \hat{b}\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} y_i - \hat{b}\,\dfrac{1}{n}\sum_{i=1}^{n} x_i$

$\hat{b} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - (\bar{x} \times \bar{y})}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$

or equivalently

$\hat{b} = \dfrac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$
From the given data, the following calculations are made with n=9
Man-hours (xi) Productivity (yi) xi² xiyi
3.6 9.3 12.96 33.48
4.8 10.2 23.04 48.96
7.2 11.5 51.84 82.8
6.9 12 47.61 82.8
10.7 18.6 114.49 199.02
6.1 13.2 37.21 80.52
7.9 10.8 62.41 85.32
9.5 22.7 90.25 215.65
5.4 12.7 29.16 66.42
Total $\sum x_i = 62.1$  $\sum y_i = 121$  $\sum x_i^2 = 468.97$  $\sum x_i y_i = 894.97$
$\hat{b} = \dfrac{9(894.97) - (62.1)(121)}{9(468.97) - (62.1)^2} = \dfrac{8054.73 - 7514.1}{4220.73 - 3856.41} = \dfrac{540.63}{364.32} = 1.48$

Thus, $\hat{b} = 1.48$.
Now $\hat{a}$ can be calculated using $\hat{b}$ as

$\hat{a} = \dfrac{121}{9} - 1.48 \times \dfrac{62.1}{9} = 13.44 - 10.21 = 3.23$

Hence, $\hat{a} = 3.23$.
Therefore, the required simple linear regression equation fitted to the given data is

$\hat{Y} = 3.23 + 1.48x$
It should be noted that the value of Y can be estimated using the above fitted equation for
the values of x in its range i.e., 3.6 to 10.7.
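A quick cross-check of this fit (not part of the textbook) with NumPy's polyfit follows; the small difference in the intercept arises because the hand computation above rounds b-hat to 1.48 before finding a-hat.

```python
# Cross-check of Example 5.2 with NumPy, and a prediction inside the data range.
import numpy as np

man_hours = np.array([3.6, 4.8, 7.2, 6.9, 10.7, 6.1, 7.9, 9.5, 5.4])
productivity = np.array([9.3, 10.2, 11.5, 12, 18.6, 13.2, 10.8, 22.7, 12.7])

b_hat, a_hat = np.polyfit(man_hours, productivity, 1)   # slope, then intercept
print(round(a_hat, 2), round(b_hat, 2))                 # about 3.21 and 1.48

x_new = 8.0                                             # inside the range 3.6 to 10.7
print(round(a_hat + b_hat * x_new, 2))                  # estimated productivity
```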
In the estimated simple linear regression equation of Y on X,

$\hat{Y} = \hat{a} + \hat{b}x = (\bar{y} - \hat{b}\bar{x}) + \hat{b}x$

that is,

$\hat{Y} - \bar{y} = \hat{b}(x - \bar{x})$
It shows that the simple linear regression equation of Y on X has the slope b̂ and the
corresponding straight line passes through the point of averages ( x , y ) . The above representation
of straight line is popularly known in the field of Coordinate Geometry as ‘Slope-Point form’. The
above form can be applied in fitting the regression equation for given regression coefficient b̂
and the averages x and y .
As mentioned in Section 5.3, there may be two simple linear regression equations for each
X and Y. Since the regression coefficients of these regression equations are different, it is essential
to distinguish the coefficients with different symbols. The regression coefficient of the simple
linear regression equation of Y on X may be denoted as bYX and the regression coefficient of the
simple linear regression equation of X on Y may be denoted as bXY.
$b_{XY} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2}$

and the corresponding regression equation of X on Y is $\hat{X} - \bar{x} = b_{XY}(y - \bar{y})$.

Also, the relationships between Karl Pearson's coefficient of correlation and the regression coefficients are

$b_{XY} = r_{XY}\,\dfrac{SD(X)}{SD(Y)}$  and  $b_{YX} = r_{XY}\,\dfrac{SD(Y)}{SD(X)}$
1. The correlation coefficient is the geometric mean of the two regression coefficients, that is, $r_{XY} = \pm\sqrt{b_{YX}\,b_{XY}}$, taking the common sign of the regression coefficients.
2. It is clear from property 1 that both regression coefficients must have the same sign, i.e., either both are positive or both are negative.
3. If one of the regression coefficients is greater than unity, the other must be less than unity.
4. The correlation coefficient will have the same sign as that of the regression coefficients.
5. Arithmetic mean of the regression coefficients is greater than the correlation coefficient.
$\dfrac{b_{XY} + b_{YX}}{2} \geq r_{XY}$
6. Regression coefficients are independent of the change of origin but not of scale.
3. The angle between the two regression lines is $\tan^{-1}\left(\dfrac{m_1 - m_2}{1 + m_1 m_2}\right)$, where m1 and m2 are the
slopes of the regression lines X on Y and Y on X respectively.
4. The angle between the regression lines indicates the degree of dependence between the variables.
5. The two regression lines intersect at the point $(\bar{X}, \bar{Y})$.
Example 5.3
For the following data, find the regression equation of Y on X and estimate the likely demand (Y) when X = 25.

x 12 14 15 14 18 17
y 42 40 45 47 39 45
Solution:
xi ui = xi – 15 ui² yi vi = yi – 43 vi² uivi
12 -3 9 42 -1 1 3
14 -1 1 40 -3 9 3
15 0 0 45 2 4 0
14 -1 1 47 4 16 -4
18 3 9 39 -4 16 -12
17 2 4 45 2 4 4
Total 90 0 24 258 0 50 -6
$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{90}{6} = 15$,  $\bar{y} = \dfrac{\sum y_i}{n} = \dfrac{258}{6} = 43$

Since c = d = 1,

$b_{YX} = b_{VU} = \dfrac{n\sum u_i v_i - \left(\sum u_i\right)\left(\sum v_i\right)}{n\sum u_i^2 - \left(\sum u_i\right)^2} = \dfrac{6(-6) - 0 \times 0}{6(24) - 0^2} = \dfrac{-36}{144} = -0.25$

Since $\bar{u} = \bar{v} = 0$, the intercept $\bar{v} - b_{VU}\bar{u} = 0$, so the regression line of V on U is $\hat{V} = b_{VU}\,u = -0.25u$. In terms of the original variables,

$\hat{Y} - 43 = -0.25(x - 15)$, i.e., $\hat{Y} = 46.75 - 0.25x$

When x = 25, the likely demand is $\hat{Y} = 46.75 - 0.25 \times 25 = 40.5$.
Example 5.4
The following data gives the experience of machine operators and their performance
ratings as given by the number of good parts turned out per 50 pieces.
Operators 1 2 3 4 5 6 7 8
Experience (X) 8 11 7 10 12 5 4 6
Ratings (Y) 11 30 25 44 38 25 20 27
Obtain the regression equations and estimate the ratings corresponding to the experience
x=15.
Solution:
xi yi xiyi xi² yi²
8 11 88 64 121
11 30 330 121 900
7 25 175 49 625
10 44 440 100 1936
12 38 456 144 1444
5 25 125 25 625
4 20 80 16 400
6 27 162 36 729
Total 63 220 1856 555 6780
Regression equation of Y on X: $\hat{Y} - \bar{y} = b_{YX}(x - \bar{x})$

$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{63}{8} = 7.875$,  $\bar{y} = \dfrac{\sum y_i}{n} = \dfrac{220}{8} = 27.5$

$b_{YX} = \dfrac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \dfrac{8(1856) - (63)(220)}{8(555) - (63)^2} = \dfrac{14848 - 13860}{4440 - 3969} = \dfrac{988}{471} = 2.098$
$\hat{Y} - 27.5 = 2.098(x - 7.875)$
$\hat{Y} - 27.5 = 2.098x - 16.52$
$\hat{Y} = 2.098x + 10.98$

When x = 15,
$\hat{Y} = 2.098 \times 15 + 10.98 = 31.47 + 10.98 = 42.45$
Regression equation of X on Y: $\hat{X} - \bar{x} = b_{XY}(y - \bar{y})$

$b_{XY} = \dfrac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum y_i^2 - \left(\sum y_i\right)^2} = \dfrac{8(1856) - (63)(220)}{8(6780) - (220)^2} = \dfrac{14848 - 13860}{54240 - 48400} = \dfrac{988}{5840} = 0.169$

$\hat{X} - 7.875 = 0.169(y - 27.5)$
$\hat{X} = 0.169y - 4.648 + 7.875$
$\hat{X} = 0.169y + 3.227$
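The two coefficients of this example can also be used to verify the properties stated earlier. The check below (not part of the textbook) confirms that both coefficients share the sign of r, that r² equals bYX·bXY, and that their arithmetic mean is not less than r.

```python
# Numerical check of the regression-coefficient properties using Example 5.4 data.
import numpy as np

x = np.array([8, 11, 7, 10, 12, 5, 4, 6])        # experience
y = np.array([11, 30, 25, 44, 38, 25, 20, 27])   # ratings

cov_xy = np.cov(x, y, bias=True)[0, 1]           # population covariance
b_yx = cov_xy / np.var(x)
b_xy = cov_xy / np.var(y)
r = np.corrcoef(x, y)[0, 1]

print(round(b_yx, 3), round(b_xy, 3))                # 2.098 and 0.169
print(round(r, 3), round(np.sqrt(b_yx * b_xy), 3))   # both 0.596
print((b_yx + b_xy) / 2 >= r)                        # True
```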
Example 5.5
A random sample of 5 school students is selected and their marks in Statistics and Accountancy are found to be:

Statistics 85 60 73 40 90
Accountancy 93 75 65 50 80

Obtain the two regression equations.
Solution:
The two regression lines are:
Regression equation of Y on X: $\hat{Y} - \bar{y} = b_{YX}(x - \bar{x})$
Regression equation of X on Y: $\hat{X} - \bar{x} = b_{XY}(y - \bar{y})$
xi yi ui = xi – 60 vi = yi – 75 uivi ui² vi²
85 93 25 18 450 625 324
60 75 0 0 0 0 0
73 65 13 -10 -130 169 100
40 50 -20 -25 500 400 625
90 80 30 5 150 900 25
Total 348 363 48 -12 970 2094 1074

$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{348}{5} = 69.6$,  $\bar{y} = \dfrac{\sum y_i}{n} = \dfrac{363}{5} = 72.6$
Since the means are not integers and the numbers are large, we take deviations from the origins
A = 60 for x and B = 75 for y and then solve the problem.
Calculation of bYX

$b_{YX} = b_{VU} = \dfrac{n\sum u_i v_i - \left(\sum u_i\right)\left(\sum v_i\right)}{n\sum u_i^2 - \left(\sum u_i\right)^2} = \dfrac{5(970) - (48)(-12)}{5(2094) - (48)^2} = \dfrac{4850 + 576}{10470 - 2304} = \dfrac{5426}{8166} = 0.664$

$b_{YX} = b_{VU} = 0.664$
$\hat{Y} - 72.6 = 0.664(x - 69.6)$
$\hat{Y} - 72.6 = 0.664x - 46.214$
$\hat{Y} = 0.664x + 26.386$
Regression equation of X on Y,
X x bXY y y
^
Calculation of bXY

$b_{XY} = b_{UV} = \dfrac{n\sum u_i v_i - \left(\sum u_i\right)\left(\sum v_i\right)}{n\sum v_i^2 - \left(\sum v_i\right)^2} = \dfrac{5(970) - (48)(-12)}{5(1074) - (-12)^2} = \dfrac{4850 + 576}{5370 - 144} = \dfrac{5426}{5226} = 1.038$

$b_{XY} = b_{UV} = 1.038$
$\hat{X} - 69.6 = 1.038(y - 72.6)$
$\hat{X} - 69.6 = 1.038y - 75.359$
$\hat{X} = 1.038y - 5.759$
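A short check (not part of the original text) that the change of origin used above leaves both regression coefficients unchanged:

```python
# Example 5.5: shifting the origin (u = x - 60, v = y - 75) does not change bYX or bXY.
import numpy as np

x = np.array([85, 60, 73, 40, 90])   # marks in Statistics
y = np.array([93, 75, 65, 50, 80])   # marks in Accountancy
u, v = x - 60, y - 75

def pcov(a, b):
    return np.cov(a, b, bias=True)[0, 1]   # population covariance

print(round(pcov(x, y) / np.var(x), 3), round(pcov(u, v) / np.var(u), 3))  # bYX = 0.664 twice
print(round(pcov(x, y) / np.var(y), 3), round(pcov(u, v) / np.var(v), 3))  # bXY = 1.038 twice
```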
Example 5.6
Solution:
The regression coefficient of Y on X is bYX = –1.5
The regression coefficient of X on Y is bXY = 0.6
The two regression coefficients are of opposite signs, which contradicts the property that both
regression coefficients must have the same sign. So the given equations cannot be regression lines.
Example 5.7

Mean S.D.
Yield of wheat (kg per unit area) 10 8
Annual rainfall (inches) 8 2
Solution:
Let us denote the dependent variable yield by Y and the independent variable rainfall by X.
Regression equation of Y on X is given by
$\hat{Y} - \bar{y} = r_{XY}\,\dfrac{SD(Y)}{SD(X)}\,(x - \bar{x})$
Example 5.8
For 50 students of a class, the regression equation of marks in Statistics (X) on marks in
Accountancy (Y) is 3Y – 5X + 180 = 0. The mean marks in Accountancy is 50 and the variance of
marks in Statistics is 16/25 of the variance of marks in Accountancy. Find the mean marks in
Statistics and the coefficient of correlation between marks in the two subjects.
Solution:
We are given that:
n = 50, regression equation of X on Y: 3Y – 5X + 180 = 0,
$\bar{y} = 50$, $V(X) = \dfrac{16}{25}V(Y)$, and V(Y) = 25.

We have to find (i) $\bar{x}$ and (ii) $r_{XY}$.
(i) Calculation of $\bar{x}$

Since $(\bar{x}, \bar{y})$ is the point of intersection of the two regression lines, it lies on the regression
line 3Y – 5X + 180 = 0. Hence,

$3\bar{y} - 5\bar{x} + 180 = 0$
$3(50) - 5\bar{x} + 180 = 0$
$5\bar{x} = 150 + 180 = 330$
$\bar{x} = \dfrac{330}{5} = 66$
(ii) Calculation of the coefficient of correlation

$3Y - 5X + 180 = 0 \Rightarrow 5X = 3Y + 180 \Rightarrow X = 36 + 0.6\,Y$

Hence, $b_{XY} = 0.6$.

Since $b_{XY} = r_{XY}\,\dfrac{SD(X)}{SD(Y)}$,

$r_{XY}^2 = b_{XY}^2\,\dfrac{V(Y)}{V(X)} = 0.36 \times \dfrac{V(Y)}{V(X)}$   (1)

Given that V(Y) = 25 and $V(X) = \dfrac{16}{25}V(Y) = \dfrac{16}{25} \times 25 = 16$.

Substituting in (1),

$r_{XY}^2 = 0.36 \times \dfrac{25}{16} = 0.5625$, so $r_{XY} = 0.75$

(the positive square root is taken since $b_{XY}$ is positive).
Example 5.9
If the two regression coefficients are $b_{YX} = \dfrac{5}{6}$ and $b_{XY} = \dfrac{9}{20}$, what would be the value of $r_{XY}$?

Solution:
The correlation coefficient $r_{XY} = \pm\sqrt{b_{YX}\,b_{XY}} = \sqrt{\dfrac{5}{6} \times \dfrac{9}{20}} = \sqrt{0.375} = 0.61$

Since both $b_{YX}$ and $b_{XY}$ are positive, the correlation coefficient between X and Y is positive, and
hence the positive square root is taken.
Correlation vs Regression

1. Correlation: indicates only the nature and extent of the linear relationship.
   Regression: studies the impact of the independent variable on the dependent variable; it is used for prediction.

2. Correlation: if the linear correlation coefficient is positive / negative, then the two variables are positively / negatively correlated.
   Regression: if the regression coefficient is positive, then for every unit increase in x, the corresponding average increase in y is bYX. Similarly, if the regression coefficient is negative, then for every unit increase in x, the corresponding average decrease in y is bYX.

3. Correlation: one of the variables can be taken as x and the other as y.
   Regression: care must be taken in the choice of the independent and dependent variables. We cannot arbitrarily assign x as the independent variable and y as the dependent variable.

4. Correlation: it is symmetric in x and y, i.e., rXY = rYX.
   Regression: it is not symmetric in x and y; bXY and bYX have different meanings and interpretations.
POINTS TO REMEMBER
There are several types of regression: simple linear regression, multiple linear regression
and non-linear regression.
In simple linear regression there are two linear regression lines: Y on X and X on Y.
In the linear regression equation Y = a + bX + e, 'X' is the independent variable, 'Y' is the
dependent variable, 'a' is the intercept, 'b' is the slope of the line and 'e' is the error term.
Both regression lines pass through the point $(\bar{X}, \bar{Y})$.
The "method of least squares" gives the line of best fit.
Both regression coefficients have the same sign, either positive or negative.
The sign of the regression coefficients and the sign of the correlation coefficient are the
same.
4. In the regression equation X = a + bY + e, 'b' is
a) correlation coefficient of Y on X b) correlation coefficient of X on Y
c) regression coefficient of Y on X d) regression coefficient of X on Y
5. bYX =
a) $r_{XY}\dfrac{SD(X)}{SD(Y)}$  b) $r_{XY}\dfrac{SD(Y)}{SD(X)}$  c) $\dfrac{SD(X)}{SD(Y)}$  d) $\dfrac{SD(Y)}{SD(X)}$
a) cov(X, Y) b) SD(X)
c) correlation coefficient d) coefficient of variance
10. Regression analysis helps in establishing a functional relationship between ______ variables.
a) 2 or more variables b) 2 variables
c) 3 variables d) none of these
13. If the two lines of regression are perpendicular to each other then rXY =
a) 0 b) 1 c) –1 d) 0.5
c) $\tan^{-1}\left(\dfrac{m_1 - m_2}{1 + m_1 m_2}\right)$  d) none of the above
16. bXY =
a) $r_{XY}\dfrac{SD(Y)}{SD(X)}$  b) $r_{XY}\dfrac{SD(X)}{SD(Y)}$  c) $r_{XY}\,SD(X)\,SD(Y)$  d) $\dfrac{1}{b_{YX}}$
17. Regression equation of X on Y is
a) Y = a + bYX x + e b) Y = bXY x + a + e
c) X = a + bXY y + e d) X = bYX y + a + e
18. For the regression equation $2\hat{Y} = 0.605x + 351.58$, the regression coefficient of Y on X is
a) Y = 8 + 0.7 X b) X = 8 + 0.7 Y
c) Y = 0.7 + 8 X d) X = 0.7 + 8 Y
36. Given $\bar{x} = 90$, $\bar{y} = 70$, bXY = 1.36 and bYX = 0.61. When y = 50, find the most probable value of X.
37. Compute the two regression equations from the following data.
x 1 2 3 4 5
y 3 4 5 6 7
If x = 3.5, what will be the value of $\hat{Y}$?
43. The following table shows the age (X) and systolic blood pressure (Y) of 8 persons.
Age (X) 56 42 60 50 54 49 39 45
Blood pressure (Y) 160 130 125 135 145 115 140 120
Fit a simple linear regression model, Y on X and estimate the blood pressure of a person of
60 years.
44. Find the regression equation of X on Y given that n = 5, ∑x = 30, ∑y = 40, ∑xy = 214, ∑x2 = 220,
∑y2 = 340.
45. Given the following data, estimate the marks in statistics obtained by a student who has
scored 60 marks in English.
Mean of marks in Statistics = 80, Mean of marks in English = 50, S.D of marks in Statistics =
15, S.D of marks in English = 10 and Coefficient of correlation = 0.4.
46. Find the linear regression equation of percentage worms (Y) on size of the crop (X) based on
the following seven observations.
47. In a correlation analysis, between production (X) and price of a commodity (Y) we get the
following details.
Variance of X = 36.
The regression equations are:
12X – 15Y + 99 = 0 and 60 X – 27 Y =321
Calculate (a) The average value of X and Y.
(b) Coefficient of correlation between X and Y.