
Correlation Regression

This document provides lecture notes on correlation and regression within the context of probability and statistics, specifically for a Bachelor of Technology program in Computer Engineering/Information Technology. It explains the concept of correlation, its types (positive, negative, simple, multiple, partial, total, linear, and nonlinear), and introduces Karl Pearson's coefficient of correlation as a measure of the relationship between two variables. Additionally, it includes examples and exercises to illustrate the application of these concepts.


Lecture Notes on

Probability & Statistics:


Correlation & Regression

BACHELOR OF TECHNOLOGY

in

Computer Engineering/Information Technology

Prepared By

Dr Mihir Suthar

Mathematics & Humanities Department

Gandhinagar Institute of Technology


10000301 Probability & Statistics
Correlation and Regression
• Correlation and regression are the most commonly used techniques for investigating the relationship
between two quantitative variables.

1. Correlation
• Correlation is the relationship that exists between two or more variables. Two variables are said to be
correlated if a change in one variable is accompanied by a change in the other variable. Data connecting two
variables are called bivariate data.
• Correlation measures the closeness of the relationship between the variables.
• Some examples of a relationship are as follows:
▪ Relationship between heights and weights
▪ Relationship between price and demand of commodity
▪ Relationship between age of husband and age of wife

1.1 Types of Correlation


Correlation is classified into four types:
(i) Positive correlation and negative correlation
(ii) Simple correlation and multiple correlation
(iii) Partial correlation and total correlation
(iv) Linear correlation and nonlinear correlation

i) Positive and negative correlations

➢ Positive correlation
If the value of one variable increases, the value of the other variable also increases, or, if the value of one
variable decreases, the value of the other variable also decreases. This type of correlation is said to be
positive correlation.
e.g. The correlation between the heights and weights of a group of persons

Height (cm)   145   150   160   162   165   175
Weight (kg)    55    60    62    65    67    68

➢ Negative correlation
If the value of one variable increases, the value of the other variable decreases, or, if the value of one variable
decreases, the value of the other variable increases. This type of correlation is said to be negative
correlation.
e.g. The correlation between the price and demand of a commodity

Price (Rs per unit)    15    10     8     7     6     3
Demand (units)        150   200   220   260   300   320

©Gandhinagar Institute of Technology Page | 1


ii) Simple and multiple correlations

➢ Simple correlation
The relationship between only two variables is described as simple correlation.
e.g. The quantity of money and price level, demand and price

➢ Multiple correlation
The relationship between more than two variables is described as multiple correlation.
e.g. Relationship between price, demand and supply of a commodity

iii) Partial and total correlations

➢ Partial correlation
When more than two variables are studied excluding some other variables, the relationship is termed as
partial correlation.
➢ Total correlation
When more than two variables are studied without excluding any variables, the relationship is termed as
total correlation.

iv) Linear and nonlinear correlations

➢ Linear correlation
If the ratio of change between two variables is constant, the correlation is said to be linear.
The graph of a linear relationship will be a straight line.
e.g.
Milk (l)     5   10   15   20   25   30
Curd (kg)    2    4    6    8   10   12

➢ Nonlinear correlation
If the ratio of change between two variables is not constant, the correlation is said to be nonlinear.
The graph of a nonlinear relationship will be a curve.
e.g.
Price (Rs per unit)    15    10     8     7     6     3
Demand (units)        150   200   220   260   300   320


❖ Important
• The various relationships between two variables are represented by the following scatter diagrams.

Perfect positive correlation: If all the plotted points lie on a straight line rising from the lower left-hand corner to
the upper right-hand corner, the correlation is said to be perfect positive correlation.

Perfect negative correlation: If all the plotted points lie on a straight line from the upper left-hand corner to the
lower right-hand corner, the correlation is said to be perfect negative correlation.

High degree of positive correlation: If all the plotted points lie in a narrow strip rising from the lower left-hand
corner to the upper right-hand corner, it indicates a high degree of positive correlation.

High degree of negative correlation: If all the plotted points lie in a narrow strip falling from the upper left-hand
corner to the lower right-hand corner, it indicates a high degree of negative correlation.

No correlation: If all the plotted points lie on a straight line parallel to the x-axis or the y-axis, or are scattered
with no discernible trend, it indicates the absence of any relationship between the variables.

1.2 Karl Pearson’s Coefficient of Correlation

The coefficient of correlation measures the correlation between two random variables $X$ and $Y$ and is denoted
by $r$.

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} \qquad (1)$$

where

$\operatorname{cov}(X, Y)$ = the covariance of variables $X$ and $Y$ = $\frac{1}{n} \sum (x - \bar{x})(y - \bar{y})$

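$\sigma_X$ = the standard deviation of variable $X$ = $\sqrt{\frac{\sum (x - \bar{x})^2}{n}}$

$\sigma_Y$ = the standard deviation of variable $Y$ = $\sqrt{\frac{\sum (y - \bar{y})^2}{n}}$

So,

$$r = \frac{\frac{1}{n} \sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{\sum (x - \bar{x})^2}{n}} \, \sqrt{\frac{\sum (y - \bar{y})^2}{n}}}$$

By simplifying,

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \, \sqrt{n \sum y^2 - (\sum y)^2}}$$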

The above expression 𝑟 is called Karl Pearson’s coefficient of correlation.
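
As a minimal sketch, the simplified formula can be evaluated directly from the five sums $n$, $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$ and $\sum y^2$. The function name `pearson_r` is an illustrative choice and not part of these notes; the sample values are the height and weight data from Section 1.1.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r computed from raw sums (hypothetical helper name)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Height (cm) and weight (kg) data from the positive-correlation example.
heights = [145, 150, 160, 162, 165, 175]
weights = [55, 60, 62, 65, 67, 68]
r_hw = pearson_r(heights, weights)
print(f"r = {r_hw:.3f}")
```

Since weight rises with height throughout that table, a value close to +1 is to be expected.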

1.2.1 Properties of Coefficient of Correlation


• The coefficient of correlation lies between −1 and 1. i.e. −1 ≤ 𝑟 ≤ 1
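• The correlation coefficient is independent of change of origin and change of scale, i.e. $r_{xy} = r_{d_x d_y}$, where
$d_x = \frac{x - a}{h}$ and $d_y = \frac{y - b}{k}$ with $h, k > 0$. That is,

$$r = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{\sqrt{n \sum d_x^2 - (\sum d_x)^2} \, \sqrt{n \sum d_y^2 - (\sum d_y)^2}}$$

• Two independent variables are uncorrelated, i.e. $r = 0$. The converse need not hold: $r = 0$ only rules out a
linear relationship.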
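
The last two properties can be checked numerically. The sketch below assumes arbitrary illustrative constants `a`, `h`, `b`, `k` (not taken from the notes) for the change of origin and scale, reuses the price and demand data of Section 1.1, and adds a made-up symmetric sample in which $y = x^2$ depends entirely on $x$ yet gives $r = 0$.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from raw sums (illustrative helper)."""
    n = len(xs)
    num = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    den_x = math.sqrt(n * sum(x * x for x in xs) - sum(xs) ** 2)
    den_y = math.sqrt(n * sum(y * y for y in ys) - sum(ys) ** 2)
    return num / (den_x * den_y)

# Price and demand data from Section 1.1.
price = [15, 10, 8, 7, 6, 3]
demand = [150, 200, 220, 260, 300, 320]

# Change of origin (a, b) and scale (h, k); arbitrary choices with h, k > 0.
a, h, b, k = 8, 2, 200, 10
d_x = [(x - a) / h for x in price]
d_y = [(y - b) / k for y in demand]

r_original = pearson_r(price, demand)
r_coded = pearson_r(d_x, d_y)

# Dependent but uncorrelated: y = x**2 on values symmetric about zero.
sym_x = [-2, -1, 0, 1, 2]
sym_y = [x * x for x in sym_x]
r_symmetric = pearson_r(sym_x, sym_y)

print(r_original, r_coded, r_symmetric)
```

The first two printed values agree up to floating-point rounding, and the third is zero even though `sym_y` is a function of `sym_x`.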

Example 1: Calculate the coefficient of correlation for the following data.


𝑥 9 8 7 6 5 4 3 2 1
𝑦 15 16 14 13 11 12 10 8 9

Solution: Here, 𝑛 = 9
𝒙 𝒚 𝒙𝟐 𝒚𝟐 𝒙𝒚
9 15 81 225 135
8 16 64 256 128
7 14 49 196 98
6 13 36 169 78
5 11 25 121 55
4 12 16 144 48
3 10 9 100 30
2 8 4 64 16
1 9 1 81 9
∑𝒙 =45 ∑𝒚 =108 ∑𝒙𝟐 =285 ∑𝒚𝟐 =1356 ∑𝒙𝒚 =597


The coefficient of correlation is

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \, \sqrt{n \sum y^2 - (\sum y)^2}} = \frac{9(597) - (45)(108)}{\sqrt{9(285) - 45^2} \, \sqrt{9(1356) - 108^2}} = 0.95$$

Example 2: Calculate the coefficient of correlation for the following data.


𝑥 5 9 13 17 21
𝑦 12 20 25 33 35
Solution: Here, 𝑛 = 5
$$\bar{x} = \frac{\sum x}{n} = \frac{65}{5} = 13, \qquad \bar{y} = \frac{\sum y}{n} = \frac{125}{5} = 25$$

x     y     x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
5     12    -8      -13     64         169        104
9     20    -4      -5      16         25         20
13    25    0       0       0          0          0
17    33    4       8       16         64         32
21    35    8       10      64         100        80
∑x = 65, ∑y = 125, ∑(x − x̄) = 0, ∑(y − ȳ) = 0, ∑(x − x̄)² = 160, ∑(y − ȳ)² = 358, ∑(x − x̄)(y − ȳ) = 236

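The coefficient of correlation is

$$r = \frac{\frac{1}{n} \sum (x - \bar{x})(y - \bar{y})}{\sqrt{\frac{\sum (x - \bar{x})^2}{n}} \, \sqrt{\frac{\sum (y - \bar{y})^2}{n}}} = \frac{\frac{1}{5}(236)}{\sqrt{\frac{160}{5}} \, \sqrt{\frac{358}{5}}} = 0.986$$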
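
The same calculation can be sketched in code by following the definition $r = \operatorname{cov}(X, Y) / (\sigma_X \sigma_Y)$ step by step. The variable names are illustrative only; the data are those of Example 2.

```python
import math

x = [5, 9, 13, 17, 21]
y = [12, 20, 25, 33, 35]
n = len(x)

# Deviations from the actual means.
x_bar = sum(x) / n
y_bar = sum(y) / n
dev_x = [xi - x_bar for xi in x]
dev_y = [yi - y_bar for yi in y]

# Covariance and standard deviations, each with divisor n as in the notes.
cov_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / n
sigma_x = math.sqrt(sum(dx * dx for dx in dev_x) / n)
sigma_y = math.sqrt(sum(dy * dy for dy in dev_y) / n)

r = cov_xy / (sigma_x * sigma_y)
print(round(r, 3))  # 0.986
```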
Example 3: Calculate the coefficient of correlation for the following pairs of 𝑥 and 𝑦:
𝑥 17 19 21 26 20 28 26 27
𝑦 23 27 25 26 27 25 30 33

Solution: Here, $n = 8$. Let $a = 23$ and $b = 27$ be the assumed means of $x$ and $y$ respectively.


So, $d_x = x - a = x - 23$ and $d_y = y - b = y - 27$.

x     y     d_x   d_y   d_x²   d_y²   d_x d_y
17    23    -6    -4    36     16     24
19    27    -4    0     16     0      0
21    25    -2    -2    4      4      4
26    26    3     -1    9      1      -3
20    27    -3    0     9      0      0
28    25    5     -2    25     4      -10
26    30    3     3     9      9      9
27    33    4     6     16     36     24
∑x = 184, ∑y = 216, ∑d_x = 0, ∑d_y = 0, ∑d_x² = 124, ∑d_y² = 70, ∑d_x d_y = 48

The coefficient of correlation is

$$r = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{\sqrt{n \sum d_x^2 - (\sum d_x)^2} \, \sqrt{n \sum d_y^2 - (\sum d_y)^2}} = \frac{8(48)}{\sqrt{8(124)} \, \sqrt{8(70)}} = 0.515$$
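
A short sketch of the assumed-mean shortcut follows. It also illustrates that the choice of assumed means does not matter: the second call uses $a = 20$, $b = 25$, which are arbitrary values chosen here for illustration and do not come from the notes. The function name `r_assumed_mean` is hypothetical.

```python
import math

def r_assumed_mean(xs, ys, a, b):
    """Pearson's r from deviations d_x = x - a, d_y = y - b (illustrative helper)."""
    d_x = [x - a for x in xs]
    d_y = [y - b for y in ys]
    n = len(xs)
    num = n * sum(p * q for p, q in zip(d_x, d_y)) - sum(d_x) * sum(d_y)
    den = (math.sqrt(n * sum(p * p for p in d_x) - sum(d_x) ** 2)
           * math.sqrt(n * sum(q * q for q in d_y) - sum(d_y) ** 2))
    return num / den

x = [17, 19, 21, 26, 20, 28, 26, 27]
y = [23, 27, 25, 26, 27, 25, 30, 33]

r_at_means = r_assumed_mean(x, y, 23, 27)   # assumed means used in Example 3
r_other = r_assumed_mean(x, y, 20, 25)      # any other origin gives the same r
print(round(r_at_means, 3), round(r_other, 3))  # 0.515 0.515
```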

Exercise:
1. From the following information relating to the stock exchange quotations for two shares 𝐴 and 𝐵,
ascertain by using Pearson’s coefficient of correlation how shares 𝐴 and 𝐵 are correlated in their prices?
Price share (A) Rs. 160 164 172 182 166 170 178
Price share (B) Rs. 292 280 260 234 266 254 230
Ans. – 0.96
2. For the following data, show that $\operatorname{cov}(x, x^2) = 0$.
$x$      -3   -2   -1   0   1   2   3
$x^2$     9    4    1   0   1   4   9
3. The following data gave the growth of employment in lacs in the organized sector in India between 1988
and 1995.
Year 1988 1989 1990 1991 1992 1993 1994 1995
Public sector 98 101 104 107 113 120 125 128
Private sector 65 65 67 68 68 69 68 68
Find the coefficient of correlation between the employment in public and private sectors.
Ans. 0.77
4. The coefficient of correlation between two variables 𝑋 and 𝑌is 0.48. The covariance is 36. The variance
of 𝑋 is 16. Find the standard deviation of 𝑌. Ans. 18.75
5. Calculate the coefficient of correlation between 𝑥 and 𝑦 from the following data.
$n = 10$, $\sum x = 140$, $\sum y = 150$, $\sum (x - 10)^2 = 180$,
$\sum (y - 15)^2 = 215$, $\sum (x - 10)(y - 15) = 60$ Ans. 0.915


2. Rank Correlation
• Let a group of 𝑛 individuals be arranged in order of merit with respect to some characteristics. The same
group would give a different rank for different characteristics.
• Considering the orders corresponding to two characteristics 𝐴 and 𝐵, the correlation between these 𝑛 pairs
of ranks is called the rank correlation in the characteristics 𝐴 and 𝐵 for that group of individuals.

2.1 Spearman’s Rank Correlation Coefficient


• Let $x_i$, $y_i$ be the ranks of the $i$th individual in two characteristics $A$ and $B$ respectively, where $i = 1, 2, \dots, n$.
• Assuming that no two individuals have the same rank in either $x$ or $y$, each of the variables $x$ and $y$ takes
the values $1, 2, \dots, n$.
• Spearman’s rank correlation coefficient is defined by

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

Here, $d = x_i - y_i$ denotes the difference between the two ranks of the $i$th individual.

Example 1: Ten students got the following percentage of marks in Mathematics and English:
Mathematics(x) 8 36 98 25 75 82 92 62 65 35
English (y) 84 51 91 60 68 62 86 58 35 49
Find the rank correlation coefficient.
Solution: Here, 𝑛 = 10
x     y     Rank in Mathematics   Rank in English   d (rank difference)   d²
8     84    10                    3                 7                     49
36    51    7                     8                 -1                    1
98    91    1                     1                 0                     0
25    60    9                     6                 3                     9
75    68    4                     4                 0                     0
82    62    3                     5                 -2                    4
92    86    2                     2                 0                     0
62    58    6                     7                 -1                    1
65    35    5                     10                -5                    25
35    49    8                     9                 -1                    1
∑d = 0, ∑d² = 90

The rank correlation coefficient is

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6(90)}{10(10^2 - 1)} = 0.455$$
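
The ranking and the formula can be sketched as follows, assuming no tied values (as in this example) and giving rank 1 to the highest mark. The helper names `ranks_descending` and `spearman_no_ties` are illustrative, not taken from the notes.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes all values are distinct."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Spearman's r = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for untied data."""
    n = len(xs)
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

maths = [8, 36, 98, 25, 75, 82, 92, 62, 65, 35]
english = [84, 51, 91, 60, 68, 62, 86, 58, 35, 49]

r_s = spearman_no_ties(maths, english)
print(ranks_descending(maths))  # [10, 7, 1, 9, 4, 3, 2, 6, 5, 8]
print(round(r_s, 3))            # 0.455
```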

Example 2: Ten competitors in a musical test were ranked by three judges 𝐴, 𝐵 and 𝐶 in the following order:
Rank by 𝐴 1 6 5 10 3 2 4 9 7 8
Rank by 𝐵 3 5 8 4 7 10 2 1 6 9
Rank by 𝐶 6 4 9 8 1 2 3 10 5 7
Using the rank correlation method, find which pair of judges has the nearest approach to common
liking in music.
Solution: Here, 𝑛 = 10
Rank by A (x)   Rank by B (y)   Rank by C (z)   d1 = x − y   d2 = y − z   d3 = z − x   d1²   d2²   d3²
1               3               6               -2           -3           5            4     9     25
6               5               4               1            1            -2           1     1     4
5               8               9               -3           -1           4            9     1     16
10              4               8               6            -4           -2           36    16    4
3               7               1               -4           6            -2           16    36    4
2               10              2               -8           8            0            64    64    0
4               2               3               2            -1           -1           4     1     1
9               1               10              8            -9           1            64    81    1
7               6               5               1            1            -2           1     1     4
8               9               7               -1           2            -1           1     4     1
∑d1 = 0, ∑d2 = 0, ∑d3 = 0, ∑d1² = 200, ∑d2² = 214, ∑d3² = 60

The rank correlation coefficients are

$$r_{x,y} = 1 - \frac{6 \sum d_1^2}{n(n^2 - 1)} = 1 - \frac{6(200)}{10(10^2 - 1)} = -0.21$$

$$r_{y,z} = 1 - \frac{6 \sum d_2^2}{n(n^2 - 1)} = 1 - \frac{6(214)}{10(10^2 - 1)} = -0.297$$

$$r_{z,x} = 1 - \frac{6 \sum d_3^2}{n(n^2 - 1)} = 1 - \frac{6(60)}{10(10^2 - 1)} = 0.64$$

Since $r_{z,x}$ is the largest, the pair of judges $A$ and $C$ has the nearest approach to common liking in music.
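
The pairwise comparison can be sketched with a loop over all pairs of judges. The dictionary layout and the helper name `spearman_from_ranks` are illustrative assumptions; the ranks are those given in Example 2.

```python
from itertools import combinations

ranks = {
    "A": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "B": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "C": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def spearman_from_ranks(u, v):
    """Spearman's r for two untied rankings of the same n items."""
    n = len(u)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# One coefficient per unordered pair of judges.
pair_r = {
    (p, q): spearman_from_ranks(ranks[p], ranks[q])
    for p, q in combinations(ranks, 2)
}
best_pair = max(pair_r, key=pair_r.get)

for pair, value in pair_r.items():
    print(pair, round(value, 3))
print("closest pair:", best_pair)
```

The pair with the largest coefficient is reported as the closest, which matches the criterion used in the solution.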

2.2 Tied Ranks

• If there is a tie between the ranks of two or more individuals, the ranks are shared equally among them. e.g. if
two items tie for the fourth rank, the 4th and 5th ranks are divided between them equally and each of them is
given the rank $\frac{4 + 5}{2} = 4.5$.

• If three items tie for the 4th rank, each of them is given the rank $\frac{4 + 5 + 6}{3} = 5$.
• If $m$ is the number of items having equal ranks, then the factor $\frac{1}{12}(m^3 - m)$ is added to $\sum d^2$.
• If there is more than one case of this type, this factor is added for each case, i.e.

$$r = 1 - \frac{6 \left[ \sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots \right]}{n(n^2 - 1)}$$
Example 1: Obtain the rank correlation coefficient from the following data:
𝑥 10 12 18 18 15 40
𝑦 12 18 25 25 50 25
Solution: Here, 𝑛 = 6
x     y     Rank x   Rank y   d (rank difference)   d²
10    12    1        1        0                     0
12    18    2        2        0                     0
18    25    4.5      4        0.5                   0.25
18    25    4.5      4        0.5                   0.25
15    50    3        6        -3                    9
40    25    6        4        2                     4
∑d² = 13.5
• Here, there are two items in the $x$ series having equal values at rank 4. Each is given the rank
$\frac{4 + 5}{2} = 4.5$.
• Similarly, there are three items in the $y$ series having equal values at rank 3. Each of them is given the
rank $\frac{3 + 4 + 5}{3} = 4$.
So, $m_1 = 2$, $m_2 = 3$.
The rank correlation coefficient is

$$r = 1 - \frac{6 \left[ \sum d^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) \right]}{n(n^2 - 1)} = 1 - \frac{6 \left[ 13.5 + \frac{1}{12}(8 - 2) + \frac{1}{12}(27 - 3) \right]}{6(6^2 - 1)} = 0.5429$$
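
The tied-rank procedure can be sketched as below: tied values share the mean of the positions they occupy, and one correction term $\frac{1}{12}(m^3 - m)$ is added per group of ties. Ranks are assigned in ascending order, as in this example. The helper names are illustrative assumptions, not part of the notes.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def tie_correction(values):
    """Sum of (m^3 - m)/12 over every group of m > 1 equal values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))

x = [10, 12, 18, 18, 15, 40]
y = [12, 18, 25, 25, 50, 25]
r_tied = spearman_with_ties(x, y)
print(average_ranks(x))   # [1.0, 2.0, 4.5, 4.5, 3.0, 6.0]
print(round(r_tied, 4))   # 0.5429
```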

Exercise:
1. Two judges gave the following ranks to a series of eight one -act plays in a drama competition. Examine
the relationship between their judgments.
Judge A 8 7 6 3 2 1 5 4
Judge B 7 5 4 1 3 2 6 8
Ans. 0.62
2. Compute Spearman’s rank correlation coefficient from the following data:
𝑥 18 20 34 52 12
𝑦 39 23 35 18 46
Ans. – 0.9

3. Ten competitors in a voice test are ranked by three judges 𝐴, 𝐵 and 𝐶 in the following order:
Rank by 𝐴 6 10 2 9 8 1 5 3 4 7
Rank by 𝐵 5 4 10 1 9 3 8 7 2 6
Rank by 𝐶 4 8 2 10 7 6 9 1 3 6
Use the method of rank correlation to gauge which pairs of judges has the nearest approach to common
liking in voice. Ans. The first and the third judge
4. The following table gives the scores obtained by 11 students in English and Hindi translation. Find the
rank correlation coefficient.
Scores in English   40   46   54   60   70   80   82   85   85   90   95
Scores in Hindi     45   45   50   43   40   75   55   72   65   42   70
Ans. 0.36
5. Following are the scores of ten students in a class and their IQ.
Score 35 40 25 55 85 90 65 55 45 50
IQ 100 100 110 140 150 130 100 120 140 110
Calculate the rank correlation coefficient between the score and IQ.
Ans. 0.47

3. Regression
• Regression is defined as a method of estimating the value of one variable when that of the other is known
and the variables are correlated.
• Regression analysis is used to predict or estimate one variable in terms of the other variable.
• It is useful in statistical estimation of demand curves, supply curves, production function, cost function etc.

3.1 Type of Regression


Regression is classified into two types:
i) Simple regression and multiple regression
ii) Linear regression and nonlinear regression

i) Simple and multiple regression

➢ Simple regression: The regression analysis for studying only two variables at a time is known as simple
regression.
➢ Multiple regression: The regression analysis for studying more than two variables at a time is known as
multiple regression.

ii) Linear and nonlinear regression

➢ Linear regression: If the regression curve is a straight line, then the regression is said to be linear.


➢ Nonlinear regression: If the regression curve is not a straight line i.e. not a first-degree equation in the
variables 𝑥 and 𝑦, the regression is said to be nonlinear regression.

3.2 Lines of Regression

• If all the points in the scatter diagram cluster around a straight line, the line is called the line of regression.
• The line of regression is the line of best fit and is obtained by the principle of least squares.

3.2.1 Line of regression of 𝒚 on 𝒙


• It is the line which gives the best estimate for the values of 𝑦 for any given values of 𝑥.
• The regression equation of $y$ on $x$ is given by

$$y - \bar{y} = r \, \frac{\sigma_y}{\sigma_x} \, (x - \bar{x})$$

It is also written as $y = a + bx$.

3.2.2 Line of regression of 𝒙 on 𝒚


• It is the line which gives the best estimate for the values of 𝑥 for any given values of 𝑦.
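• The regression equation of $x$ on $y$ is given by

$$x - \bar{x} = r \, \frac{\sigma_x}{\sigma_y} \, (y - \bar{y})$$

It is also written as $x = a + by$, where the constants $a$ and $b$ generally differ from those of the line of $y$ on $x$.
Here, $\bar{x}$ and $\bar{y}$ are the means of the $x$ series and $y$ series respectively, $\sigma_x$ and $\sigma_y$ are the standard deviations of the
$x$ series and $y$ series respectively, and $r$ is the correlation coefficient between $x$ and $y$.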

3.3 Regression Coefficients


• The slope $b$ of the line of regression of $y$ on $x$ is called the coefficient of regression of $y$ on $x$.
• The regression coefficient of $y$ on $x$ is denoted by $b_{yx}$:

$$b_{yx} = r \, \frac{\sigma_y}{\sigma_x}$$

• The slope $b$ of the line of regression of $x$ on $y$ is called the coefficient of regression of $x$ on $y$.
• The regression coefficient of $x$ on $y$ is denoted by $b_{xy}$:

$$b_{xy} = r \, \frac{\sigma_x}{\sigma_y}$$

3.3.1 Expressions for Regression Coefficients

i) Using deviations from the actual means:

$$b_{yx} = r \, \frac{\sigma_y}{\sigma_x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

$$b_{xy} = r \, \frac{\sigma_x}{\sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$$

ii) Using the raw values:

$$b_{yx} = r \, \frac{\sigma_y}{\sigma_x} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$$

$$b_{xy} = r \, \frac{\sigma_x}{\sigma_y} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2}$$

iii) Using deviations $d_x$, $d_y$ from assumed means:

$$b_{yx} = r \, \frac{\sigma_y}{\sigma_x} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_x^2 - (\sum d_x)^2}$$

$$b_{xy} = r \, \frac{\sigma_x}{\sigma_y} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_y^2 - (\sum d_y)^2}$$

3.3.2 Properties of Regression Coefficients

➢ The coefficient of correlation is the geometric mean of the coefficients of regression,
i.e. $r = \pm\sqrt{b_{yx} \, b_{xy}}$, where the sign of $r$ is the common sign of the two regression coefficients.
➢ If one of the regression coefficients is greater than one, the other must be less than one,
i.e. if $b_{yx} > 1$ then $b_{xy} < 1$, because $b_{yx} \, b_{xy} = r^2 \le 1$.
➢ The arithmetic mean of the regression coefficients is greater than or equal to the coefficient of correlation
in absolute value, i.e. $\frac{1}{2} \left| b_{yx} + b_{xy} \right| \ge |r|$.
➢ Regression coefficients are independent of the change of origin but not of scale,
i.e. $b_{d_x d_y} = \frac{k}{h} \, b_{xy}$ and $b_{d_y d_x} = \frac{h}{k} \, b_{yx}$. Here, $d_x = \frac{x - a}{h}$ and $d_y = \frac{y - b}{k}$.
➢ Both regression coefficients have the same sign, i.e. either both are positive or both are negative.
➢ The sign of the correlation is the same as that of the regression coefficients, i.e. $r > 0$ if $b_{xy} > 0$ and $b_{yx} > 0$, and
$r < 0$ if $b_{xy} < 0$ and $b_{yx} < 0$.
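
A brief numerical check of these properties is sketched below on the data of Example 3 from Section 1.2. The variable names `s_xy`, `s_xx` and `s_yy` are illustrative shorthand for the numerator and the two denominators of the raw-value formulas.

```python
import math

x = [17, 19, 21, 26, 20, 28, 26, 27]
y = [23, 27, 25, 26, 27, 25, 30, 33]
n = len(x)

s_xy = n * sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y)
s_xx = n * sum(p * p for p in x) - sum(x) ** 2
s_yy = n * sum(q * q for q in y) - sum(y) ** 2

b_yx = s_xy / s_xx                  # regression coefficient of y on x
b_xy = s_xy / s_yy                  # regression coefficient of x on y
r = s_xy / math.sqrt(s_xx * s_yy)   # coefficient of correlation

product_is_r_squared = math.isclose(b_yx * b_xy, r ** 2)
mean_bounds_r = abs(b_yx + b_xy) / 2 >= abs(r)
same_sign = b_yx * r > 0 and b_xy * r > 0
print(product_is_r_squared, mean_bounds_r, same_sign)  # True True True
```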

Properties of Lines of Regression


➢ The two regression lines $x$ on $y$ and $y$ on $x$ always intersect at their means $(\bar{x}, \bar{y})$.
➢ As $r = \pm\sqrt{b_{yx} \, b_{xy}}$, the coefficients $r$, $b_{yx}$, $b_{xy}$ all have the same sign.
➢ If $r = 0$, the regression coefficients are zero.
➢ If $r = 0$, the regression lines are perpendicular to each other and if $r = \pm 1$, these lines are identical.
Example 1: The regression lines of a sample are 𝑥 + 6𝑦 = 6 and 3𝑥 + 2𝑦 = 10. Find (i) sample means 𝑥̅ and 𝑦̅
(ii) the coefficient of correlation between 𝑥 and 𝑦 (iii) Also, find the value of 𝑦 at 𝑥 = 12.
Solution:
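(i) The regression lines pass through the point $(\bar{x}, \bar{y})$, so the means satisfy both equations:

$$\bar{x} + 6\bar{y} = 6$$
$$3\bar{x} + 2\bar{y} = 10$$

Solving these equations, we get $\bar{x} = 3$, $\bar{y} = \frac{1}{2}$.

(ii) Let the line $x + 6y = 6$ be the regression line of $y$ on $x$. So,

$$y = -\frac{1}{6} x + 1$$

Comparing with the general form of the regression line of $y$ on $x$,

$$b_{yx} = -\frac{1}{6}$$

Again, let the line $3x + 2y = 10$ be the regression line of $x$ on $y$. So,

$$x = -\frac{2}{3} y + \frac{10}{3}$$

Comparing with the general form of the regression line of $x$ on $y$,

$$b_{xy} = -\frac{2}{3}$$

This assignment is admissible because $b_{yx} \, b_{xy} = \frac{1}{9} \le 1$; the opposite assignment would give a product of 9.
Thus,

$$|r| = \sqrt{b_{yx} \, b_{xy}} = \sqrt{\left(-\frac{1}{6}\right)\left(-\frac{2}{3}\right)} = \frac{1}{3}$$

Here, both $b_{yx}$ and $b_{xy}$ are negative. So, $r$ is also negative.
Therefore, the coefficient of correlation is $r = -\frac{1}{3}$.

(iii) From (ii), the regression line of $y$ on $x$ is $y = -\frac{1}{6} x + 1$. At $x = 12$,

$$y = -\frac{1}{6}(12) + 1 = -1$$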
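
The reasoning of this example can be sketched in code: the means are the intersection of the two lines, and the roles of the lines are decided by requiring $b_{yx} \, b_{xy} \le 1$. The tuple representation `(p, q, c)` for a line $px + qy = c$ and the function names are assumptions made for this illustration.

```python
import math
from fractions import Fraction

# Each line is stored as (p, q, c), meaning p*x + q*y = c.
line_1 = (Fraction(1), Fraction(6), Fraction(6))
line_2 = (Fraction(3), Fraction(2), Fraction(10))

def intersection(l1, l2):
    """Solve the two linear equations by Cramer's rule."""
    (p1, q1, c1), (p2, q2, c2) = l1, l2
    det = p1 * q2 - p2 * q1
    return (c1 * q2 - c2 * q1) / det, (p1 * c2 - p2 * c1) / det

def correlation_from_lines(l1, l2):
    """Try both role assignments; keep the one with 0 <= b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # slope after solving the line for y
        b_xy = -x_on_y[1] / x_on_y[0]   # slope after solving the line for x
        if 0 <= b_yx * b_xy <= 1:
            sign = -1 if b_yx < 0 else 1
            return sign * math.sqrt(b_yx * b_xy), b_yx, b_xy
    raise ValueError("no admissible assignment of the two lines")

x_bar, y_bar = intersection(line_1, line_2)
r, b_yx, b_xy = correlation_from_lines(line_1, line_2)
print(x_bar, y_bar, b_yx, b_xy, round(r, 4))
```

Exact fractions are used for the means and slopes so that only the final square root introduces floating-point rounding.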

Example 2: From the following results, obtain the two regression equations and estimate the yield when the
rainfall is 29 cm and the rainfall, when the yield is 600 kg:
        Yield in kg   Rainfall in cm
Mean    508.4         26.7
SD      36.8          4.6

The coefficient of correlation between yield and rainfall is 0.52.


Solution:
Let $x$ be the rainfall in cm and $y$ be the yield in kg. Here,
$\bar{x} = 26.7$, $\sigma_x = 4.6$, $\bar{y} = 508.4$, $\sigma_y = 36.8$ and $r = 0.52$.

The regression coefficients are

$$b_{yx} = r \, \frac{\sigma_y}{\sigma_x} = 0.52 \cdot \frac{36.8}{4.6} = 4.16$$

$$b_{xy} = r \, \frac{\sigma_x}{\sigma_y} = 0.52 \cdot \frac{4.6}{36.8} = 0.065$$

Now, the regression line of $y$ on $x$ is

$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$y - 508.4 = 4.16(x - 26.7)$$
$$\therefore \; y = 4.16x + 397.328$$

And the regression line of $x$ on $y$ is

$$x - \bar{x} = b_{xy}(y - \bar{y})$$
$$x - 26.7 = 0.065(y - 508.4)$$
$$\therefore \; x = 0.065y - 6.346$$

When the rainfall $x$ is 29 cm, the estimated yield $y$ is

$$y = 4.16(29) + 397.328 = 517.968 \text{ kg}$$

When the yield $y$ is 600 kg, the estimated rainfall $x$ is

$$x = 0.065(600) - 6.346 = 32.654 \text{ cm}$$
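
When only the means, standard deviations and $r$ are known, both lines follow directly from $b_{yx} = r \sigma_y / \sigma_x$ and $b_{xy} = r \sigma_x / \sigma_y$. The sketch below uses a hypothetical helper `regression_lines` that returns the two coefficients together with one prediction function per line.

```python
def regression_lines(x_bar, y_bar, sigma_x, sigma_y, r):
    """Build both regression lines from summary statistics (illustrative helper)."""
    b_yx = r * sigma_y / sigma_x
    b_xy = r * sigma_x / sigma_y

    def y_on_x(x):
        return y_bar + b_yx * (x - x_bar)

    def x_on_y(y):
        return x_bar + b_xy * (y - y_bar)

    return b_yx, b_xy, y_on_x, x_on_y

# x = rainfall in cm, y = yield in kg, as in Example 2.
b_yx, b_xy, predict_yield, predict_rainfall = regression_lines(
    x_bar=26.7, y_bar=508.4, sigma_x=4.6, sigma_y=36.8, r=0.52
)
yield_at_29 = predict_yield(29)
rainfall_at_600 = predict_rainfall(600)
print(round(b_yx, 2), round(b_xy, 3))
print(round(yield_at_29, 3), round(rainfall_at_600, 3))
```

Note that each question uses its own line: the yield is estimated from the line of $y$ on $x$, the rainfall from the line of $x$ on $y$.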
Example 3: The following data give the experience of machine operators and their performance rating as
given by the number of good parts turned out per 100 pieces.
Operators 1 2 3 4 5 6
Performance rating (𝑥) 23 43 53 63 73 83
Experience (𝑦) 5 6 7 8 9 10
Calculate the regression line of performance rating on experience and also estimate the probable
performance if an operator has 11 years of experience.

Solution: Here, 𝑛 = 6
x     y     y²     xy
23    5     25     115
43    6     36     258
53    7     49     371
63    8     64     504
73    9     81     657
83    10    100    830
∑x = 338, ∑y = 45, ∑y² = 355, ∑xy = 2735

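The regression coefficient of $x$ on $y$ is

$$b_{xy} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2} = \frac{6(2735) - (338)(45)}{6(355) - 45^2} = 11.429$$

Here, $\bar{x} = \frac{\sum x}{n} = 56.33$ and $\bar{y} = \frac{\sum y}{n} = 7.5$.

So, the equation of the regression line of $x$ on $y$ is

$$x - \bar{x} = b_{xy}(y - \bar{y})$$
$$x - 56.33 = 11.429(y - 7.5)$$
$$\therefore \; x = 11.429y - 29.3875$$

When an operator has 11 years of experience, the estimated performance rating is $x = 11.429(11) - 29.3875 = 96.33$.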

Example 4: The number of bacterial cells (y) per unit volume in a culture at different hours (x) is given
below:
𝑥 0 1 2 3 4 5 6 7 8 9
𝑦 43 46 82 98 123 167 199 213 245 272
Fit lines of regression of 𝑦 on 𝑥 and 𝑥 on 𝑦. Also, estimate the number of bacterial cells after
15 hours.
Solution: Here, 𝑛 = 10
𝒙 𝒚 𝒙𝟐 𝒙𝒚 𝒚𝟐
0 43 0 0 1849
1 46 1 46 2116
2 82 4 164 6724
3 98 9 294 9604
4 123 16 492 15129
5 167 25 835 27889
6 199 36 1194 39601
7 213 49 1491 45369
8 245 64 1960 60025
9 272 81 2448 73984
∑x = 45, ∑y = 1488, ∑x² = 285, ∑xy = 8924, ∑y² = 282290

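Here, $\bar{x} = \frac{\sum x}{n} = 4.5$ and $\bar{y} = \frac{\sum y}{n} = 148.8$.

The regression coefficients are

$$b_{xy} = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2} = 0.0366$$

and

$$b_{yx} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = 27.0061$$

The regression line of $y$ on $x$ is

$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$y - 148.8 = 27.0061(x - 4.5)$$
$$\therefore \; y = 27.0061x + 27.2726$$

The regression line of $x$ on $y$ is

$$x - \bar{x} = b_{xy}(y - \bar{y})$$
$$x - 4.5 = 0.0366(y - 148.8)$$
$$\therefore \; x = 0.0366y - 0.9461$$

Thus, at $x = 15$ hours, $y = 27.0061(15) + 27.2726 = 432.3641$.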
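
Both lines can be fitted from the raw data with the sum formulas of Section 3.3.1, as sketched below. The function name `fit_regression_lines` is an illustrative choice. Because the code keeps the coefficients unrounded, the estimate at 15 hours comes out near 432.36, marginally different from the hand calculation with rounded coefficients; 15 hours also lies outside the observed range of 0 to 9 hours, so the figure is an extrapolation.

```python
def fit_regression_lines(xs, ys):
    """Return (x_bar, y_bar, b_yx, b_xy) from raw data (illustrative helper)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = n * sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y
    b_yx = s_xy / (n * sum(x * x for x in xs) - sum_x ** 2)
    b_xy = s_xy / (n * sum(y * y for y in ys) - sum_y ** 2)
    return sum_x / n, sum_y / n, b_yx, b_xy

hours = list(range(10))                                   # x = 0, 1, ..., 9
cells = [43, 46, 82, 98, 123, 167, 199, 213, 245, 272]    # y

x_bar, y_bar, b_yx, b_xy = fit_regression_lines(hours, cells)

# Estimate from the line of y on x: y = y_bar + b_yx * (x - x_bar).
cells_at_15 = y_bar + b_yx * (15 - x_bar)
print(round(b_yx, 4), round(b_xy, 4), round(cells_at_15, 2))
```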
Example 5: Find the two lines of regression from the following data:
Age of Husband (x) 25 22 28 26 35 20 22 40 20 18
Age of wife (y) 18 15 20 17 22 14 16 21 15 14
Hence, estimate (i) the age of the husband when the age of the wife is 19, and (ii) the age of
the wife when the age of the husband is 30.
Solution: Let 𝑎 = 26 and 𝑏 = 17 be the assumed means of 𝑥 and 𝑦 series respectively.
𝑑𝑥 = 𝑥 − 𝑎 = 𝑥 − 26 and 𝑑𝑦 = 𝑦 − 𝑏 = 𝑦 − 17

Here, 𝑛 = 10
x     y     d_x   d_y   d_x²   d_y²   d_x d_y
25    18    -1    1     1      1      -1
22    15    -4    -2    16     4      8
28    20    2     3     4      9      6
26    17    0     0     0      0      0
35    22    9     5     81     25     45
20    14    -6    -3    36     9      18
22    16    -4    -1    16     1      4
40    21    14    4     196    16     56
20    15    -6    -2    36     4      12
18    14    -8    -3    64     9      24
∑x = 256, ∑y = 172, ∑d_x = −4, ∑d_y = 2, ∑d_x² = 450, ∑d_y² = 78, ∑d_x d_y = 172

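The means of $x$ and $y$ are

$$\bar{x} = \frac{\sum x}{n} = 25.6 \quad \text{and} \quad \bar{y} = \frac{\sum y}{n} = 17.2$$

The regression coefficients are

$$b_{xy} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_y^2 - (\sum d_y)^2} = \frac{10(172) - (-4)(2)}{10(78) - 2^2} = 2.227$$

and

$$b_{yx} = \frac{n \sum d_x d_y - \sum d_x \sum d_y}{n \sum d_x^2 - (\sum d_x)^2} = \frac{10(172) - (-4)(2)}{10(450) - (-4)^2} = 0.385$$

The regression line of $y$ on $x$ is

$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$y - 17.2 = 0.385(x - 25.6)$$
$$\therefore \; y = 0.385x + 7.344$$

The regression line of $x$ on $y$ is

$$x - \bar{x} = b_{xy}(y - \bar{y})$$
$$x - 25.6 = 2.227(y - 17.2)$$
$$\therefore \; x = 2.227y - 12.704$$

When the age of the wife is 19, the estimated age of the husband is

$$x = 2.227(19) - 12.704 = 29.609 \approx 30$$

So, the age of the husband is about 30 years.
When the age of the husband is 30, the estimated age of the wife is

$$y = 0.385(30) + 7.344 = 18.894 \approx 19$$

So, the age of the wife is about 19 years.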

Exercise:
1. The following are the lines of regression: $4y = x + 38$ and $9y = x + 288$. Estimate $y$ when $x = 99$ and $x$
when $y = 30$. Also, find the means of $x$ and $y$. Ans. $y = 43$, $x = 82$, $\bar{x} = 162$, $\bar{y} = 50$
2. In a partially destroyed laboratory record of an analysis of correlation data, the following results are legible:
variance of $x$ = 9; equations of the lines of regression $4x - 5y + 33 = 0$ and $20x - 9y - 107 = 0$. Find (i) the
mean values of $x$ and $y$ (ii) the standard deviation of $y$ and (iii) the coefficient of correlation between $x$ and
$y$. Ans. (i) $\bar{x} = 13$, $\bar{y} = 17$ (ii) $\sigma_y = 4$ (iii) $r = 0.6$
3. Find the likely production corresponding to a rainfall of 40 cm from the following data:
Rainfall (in cm) Output (in quintals)
Mean 30 50
SD 5 10
𝑟 = 0.8
Ans. 66 quintals

4. The following table gives the age of a car of a certain make and annual maintenance cost. Obtain the
equation of the line of regression of cost on age.
Age of a car 2 4 6 8
Maintenance 1 2 2.5 3
Ans. $x = 0.325y + 0.5$, where $x$ is the maintenance cost and $y$ is the age of the car
5. The following data give the heights in inches(x) and weights in lbs (y) of a random sample of 10 students:
𝑥 61 68 68 64 65 70 63 62 64 67
𝑦 112 123 130 115 110 125 100 113 116 126
Estimate the weight of a student of height 59 inches. Ans. 126.4 lbs
6. Find the regression equations of 𝑦 on 𝑥 from the data given below taking deviations from actual mean of 𝑥
and 𝑦.
Price in rupees (x) 10 12 13 12 16 15
Demand (y) 40 38 43 45 37 43
Estimate the demand when the price is Rs. 20. Ans. $y = -0.25x + 44.25$, $y = 39.25$

