Correlation and Regression
1. Introduction.
2. Covariance.
3. Correlation.
4. Rank Correlation.
5. Linear regression.
6. Equations of lines of regression.
7. Angle between two lines of regression.
8. Important points about regression coefficients bxy and byx.
1. Introduction.
“If it is proved true that in a large number of instances two variables tend always to fluctuate in the
same or in opposite directions, we consider that the fact is established and that a relationship exists. This
relationship is called correlation.”
(1) Univariate distribution: These are the distributions in which there is only one variable such as the
heights of the students of a class.
(2) Bivariate distribution: A distribution involving two discrete variables is called a bivariate distribution.
For example, the heights and the weights of the students of a class in a school.
(3) Bivariate frequency distribution: Let x and y be two variables. Suppose x takes the values $x_1, x_2, \ldots, x_n$ and y takes the values $y_1, y_2, \ldots, y_n$; then we record our observations in the form of ordered pairs $(x_i, y_j)$, where $1 \le i \le n$, $1 \le j \le n$. If a certain pair occurs $f_{ij}$ times, we say that its frequency is $f_{ij}$.
The function which assigns the frequencies $f_{ij}$ to the pairs $(x_i, y_j)$ is known as a bivariate frequency distribution.
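As an illustration, a bivariate frequency distribution can be built by counting how often each pair $(x_i, y_j)$ occurs. A minimal Python sketch on illustrative data (the heights and weights below are made up for the example):

    from collections import Counter

    # Illustrative paired observations (height in cm, weight in kg)
    heights = [150, 150, 155, 155, 155, 160]
    weights = [50, 50, 52, 52, 55, 58]

    # f_ij: frequency of each ordered pair (x_i, y_j)
    f = Counter(zip(heights, weights))
    for (x, y), freq in sorted(f.items()):
        print(f"pair ({x}, {y}) occurs {freq} time(s)")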
2. Covariance.
Let $(x_i, y_i);\ i = 1, 2, \ldots, n$ be a bivariate distribution, where $x_1, x_2, \ldots, x_n$ are the values of the variable x and $y_1, y_2, \ldots, y_n$ those of y. Then the covariance Cov(x, y) between x and y is given by
$\mathrm{Cov}(x, y) = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ or $\mathrm{Cov}(x, y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$, where $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$ are the means of x and y respectively.
Covariance is not affected by the change of origin, but it is affected by the change of scale.
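As a quick check that the two forms above agree, here is a small Python sketch on illustrative data:

    # Covariance computed in two equivalent ways
    x = [2.0, 4.0, 6.0, 8.0]
    y = [3.0, 7.0, 5.0, 9.0]
    n = len(x)

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Cov(x, y) = (1/n) * sum((x_i - x_bar)(y_i - y_bar))
    cov1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

    # Cov(x, y) = (1/n) * sum(x_i * y_i) - x_bar * y_bar
    cov2 = sum(xi * yi for xi, yi in zip(x, y)) / n - x_bar * y_bar

    print(cov1, cov2)  # both give the same value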
3. Correlation.
The relationship between two variables such that a change in one variable results in a positive or
negative change in the other variable is known as correlation.
(i) Perfect correlation: If the two variables vary in such a manner that their ratio is always constant, then
the correlation is said to be perfect.
(ii) Positive or direct correlation: If an increase or decrease in one variable corresponds to an increase or decrease, respectively, in the other, the correlation is said to be positive.
(iii) Negative or indirect correlation: If an increase or decrease in one variable corresponds to a decrease or increase, respectively, in the other, the correlation is said to be negative.
(2) Karl Pearson's coefficient of correlation: The correlation coefficient $r(x, y)$ between two variables x and y is given by
$r(x, y) = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x\,\sigma_y}$, or
$r(x, y) = \dfrac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\ \sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$.
(3) Modified formula: $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2\,\sum (y - \bar{y})^2}} = \dfrac{\sum dx\,dy}{\sqrt{\sum dx^2\,\sum dy^2}}$, where $dx = x - \bar{x}$ and $dy = y - \bar{y}$.
Also, $r_{xy} = \dfrac{\mathrm{Cov}(x, y)}{\sigma_x\,\sigma_y} = \dfrac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}}$.
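A short Python sketch of Karl Pearson's product-moment formula on illustrative data:

    from math import sqrt

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.0, 5.0, 4.0, 5.0]
    n = len(x)

    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)

    # r = (n*Sum(xy) - Sum(x)*Sum(y)) / (sqrt(n*Sum(x^2) - (Sum(x))^2) * sqrt(n*Sum(y^2) - (Sum(y))^2))
    r = (n * sxy - sx * sy) / (sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))
    print(r)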
4. Rank Correlation.
Let us suppose that a group of n individuals is arranged in order of merit or proficiency in possession of
two characteristics A and B.
These rank in two characteristics will, in general, be different.
For example, if we consider the relation between intelligence and beauty, it is not necessary that a
beautiful individual is intelligent also.
Rank correlation (Spearman's formula): $r = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$, where d is the difference between the ranks of the same individual in the two characteristics and n is the number of individuals.
If $r = 1$, it means that there is perfect correlation in the two characteristics, i.e., every individual gets the same rank in the two characteristics. Here the ranks are of the type $(1, 1), (2, 2), \ldots, (n, n)$.
If r is negative, it means that if the rank in one characteristic is high, then that in the other is low, or if the rank in one characteristic is low, then that in the other is high; e.g., if the two characteristics are richness and slimness in a person, then $r < 0$ means that the rich persons are not slim.
If $r = -1$, it means that there is perfect negative correlation in the two characteristics, i.e., an individual getting the highest rank in one characteristic gets the lowest rank in the other. Here the ranks in the two characteristics in a group of n individuals are of the type $(1, n), (2, n - 1), \ldots, (n, 1)$.
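A minimal Python sketch of Spearman's formula on already-ranked data (the ranks are illustrative and assumed to have no ties):

    # Ranks of 5 individuals in two characteristics (no ties assumed)
    rank_a = [1, 2, 3, 4, 5]
    rank_b = [2, 1, 4, 3, 5]
    n = len(rank_a)

    # d = difference of the two ranks for the same individual
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))

    # r = 1 - 6*Sum(d^2) / (n(n^2 - 1))
    r = 1 - 6 * d2 / (n * (n ** 2 - 1))
    print(r)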
Important Tips
$r(x, y) = \dfrac{\sum u_i v_i - \dfrac{1}{n}\sum u_i \sum v_i}{\sqrt{\sum u_i^2 - \dfrac{1}{n}\left(\sum u_i\right)^2}\ \sqrt{\sum v_i^2 - \dfrac{1}{n}\left(\sum v_i\right)^2}}$, where $u_i = x_i - A$ and $v_i = y_i - B$.
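A quick numerical check that shifting the origin (subtracting constants A and B, as in the tip above) leaves r unchanged; the helper pearson_r and the data below are illustrative:

    from math import sqrt

    def pearson_r(x, y):
        # Karl Pearson's product-moment formula
        n = len(x)
        sx, sy = sum(x), sum(y)
        sxy = sum(a * b for a, b in zip(x, y))
        sx2 = sum(a * a for a in x)
        sy2 = sum(b * b for b in y)
        return (n * sxy - sx * sy) / (
            sqrt(n * sx2 - sx ** 2) * sqrt(n * sy2 - sy ** 2))

    x = [12.0, 14.0, 17.0, 19.0, 23.0]
    y = [31.0, 30.0, 36.0, 38.0, 40.0]
    A, B = 15, 35  # assumed means (any constants)

    u = [xi - A for xi in x]
    v = [yi - B for yi in y]
    print(pearson_r(x, y), pearson_r(u, v))  # equal, up to floating-point rounding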
Regression
5. Linear Regression.
If a relation between two variates x and y exists, then the dots of the scatter diagram will be more or less concentrated around a curve, which is called the curve of regression. If this curve is a straight line, it is known as the line of regression, and the regression is called linear regression.
Line of regression: The line of regression is the straight line which, in the least-squares sense, gives the best fit to the given frequency distribution.
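As a sketch of the least-squares idea on illustrative data: the line of regression of y on x minimises the sum of squared vertical deviations, and its fitted slope works out to Cov(x, y)/Var(x).

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.0, 5.0, 4.0, 5.0]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n

    # Least-squares slope and intercept for the line of regression of y on x
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
            sum((a - x_bar) ** 2 for a in x)
    intercept = y_bar - slope * x_bar
    print(slope, intercept)  # fitted line: y = slope*x + intercept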
We have, $m_1 =$ slope of the line of regression of y on x $= b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x}$,
$m_2 =$ slope of the line of regression of x on y $= \dfrac{1}{b_{xy}} = \dfrac{\sigma_y}{r\,\sigma_x}$.
If $\theta$ is the angle between the two lines of regression, then
$\tan\theta = \left|\dfrac{m_2 - m_1}{1 + m_1 m_2}\right| = \left|\dfrac{\dfrac{\sigma_y}{r\sigma_x} - \dfrac{r\sigma_y}{\sigma_x}}{1 + \dfrac{r\sigma_y}{\sigma_x}\cdot\dfrac{\sigma_y}{r\sigma_x}}\right| = \left|\dfrac{(1 - r^2)\,\sigma_x\sigma_y}{r\,(\sigma_x^2 + \sigma_y^2)}\right|$.
Here the positive sign gives the acute angle $\theta$, because $r^2 \le 1$ and $\sigma_x, \sigma_y$ are positive.
$\therefore\ \tan\theta = \dfrac{1 - r^2}{|r|}\cdot\dfrac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}$ .....(i)
Note: If $r = 0$, from (i) we conclude $\tan\theta = \infty$, i.e., $\theta = \pi/2$, i.e., the two regression lines are at right angles.
If $r = \pm 1$, then $\tan\theta = 0$, i.e., $\theta = 0$ (since $\theta$ is acute), i.e., the two regression lines coincide.
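A small numerical illustration of formula (i); the values of r, sigma_x, sigma_y below are illustrative:

    from math import atan, degrees

    r = 0.6
    sigma_x = 2.0
    sigma_y = 3.0

    # tan(theta) = ((1 - r^2)/|r|) * (sigma_x*sigma_y / (sigma_x^2 + sigma_y^2))
    tan_theta = ((1 - r ** 2) / abs(r)) * (sigma_x * sigma_y / (sigma_x ** 2 + sigma_y ** 2))
    print(degrees(atan(tan_theta)))  # acute angle between the regression lines

    # Same angle recovered from the two slopes m1 and m2
    m1 = r * sigma_y / sigma_x        # slope of regression line of y on x
    m2 = sigma_y / (r * sigma_x)      # slope of regression line of x on y
    print(degrees(atan(abs((m2 - m1) / (1 + m1 * m2)))))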
(1) $r = \pm\sqrt{b_{yx}\,b_{xy}}$, i.e., the coefficient of correlation is the geometric mean of the two regression coefficients (a numerical check of this and of point (7) appears after this list).
(2) If $b_{yx} > 1$, then $b_{xy} < 1$, i.e., if one of the regression coefficients is greater than unity, the other will be less than unity.
(3) If the correlation between the variables is not perfect, then the regression lines intersect at $(\bar{x}, \bar{y})$.
(4) $b_{yx}$ is called the slope of the regression line of y on x, and $\dfrac{1}{b_{xy}}$ is called the slope of the regression line of x on y.
(5) $b_{yx} + b_{xy} \ge 2\sqrt{b_{yx}\,b_{xy}}$, i.e., $b_{yx} + b_{xy} \ge 2r$; the arithmetic mean of the regression coefficients is greater than (or equal to) the correlation coefficient.
(6) Regression coefficients are independent of change of origin but not of scale.
(7) The product of the gradients of the two lines of regression is $\dfrac{\sigma_y^2}{\sigma_x^2}$.
(8) If both the lines of regression coincide, then correlation will be perfect linear.
(9) If both $b_{yx}$ and $b_{xy}$ are positive, then r will be positive, and if both $b_{yx}$ and $b_{xy}$ are negative, then r will be negative.
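A short Python sketch verifying points (1) and (7) on illustrative data: the regression coefficients are computed from Cov and Var, and r is recovered as their geometric mean.

    from math import sqrt

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.0, 4.0, 5.0, 4.0, 5.0]
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n

    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    var_x = sum((a - x_bar) ** 2 for a in x) / n
    var_y = sum((b - y_bar) ** 2 for b in y) / n

    b_yx = cov / var_x       # regression coefficient of y on x
    b_xy = cov / var_y       # regression coefficient of x on y

    r = sqrt(b_yx * b_xy)    # taken with the common sign of the coefficients (positive here)
    print(b_yx, b_xy, r)

    # Product of the slopes of the two regression lines equals var_y / var_x
    print(b_yx * (1 / b_xy), var_y / var_x)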
Important Tips
If $r = 0$, then $\tan\theta$ is not defined, i.e., $\theta = \dfrac{\pi}{2}$. Thus the regression lines are perpendicular.
If $r = +1$ or $-1$, then $\tan\theta = 0$, i.e., $\theta = 0$. Thus the regression lines are coincident.
If the regression lines are $y = ax + b$ and $x = cy + d$, then $\bar{x} = \dfrac{bc + d}{1 - ac}$ and $\bar{y} = \dfrac{ad + b}{1 - ac}$.
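For example, with illustrative regression lines y = 0.5x + 2 and x = 0.8y + 1, the point of intersection (x-bar, y-bar) follows directly from the tip above:

    # Regression lines: y = a*x + b  and  x = c*y + d (illustrative coefficients)
    a, b = 0.5, 2.0
    c, d = 0.8, 1.0

    # The means are the coordinates of the point of intersection
    x_bar = (b * c + d) / (1 - a * c)
    y_bar = (a * d + b) / (1 - a * c)
    print(x_bar, y_bar)

    # Check: both lines pass through (x_bar, y_bar)
    print(abs(y_bar - (a * x_bar + b)) < 1e-12, abs(x_bar - (c * y_bar + d)) < 1e-12)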
(1) Standard error of prediction: The deviation of the predicted value from the observed value is known as the standard error of prediction, and is given by $S_y = \sqrt{\dfrac{\sum (y - y_p)^2}{n}}$, where y is the observed value and $y_p$ is the predicted value.
(2) Relation between probable error and standard error: If r is the correlation coefficient in a sample of n pairs of observations, then its standard error is $\mathrm{S.E.}(r) = \dfrac{1 - r^2}{\sqrt{n}}$ and its probable error is $\mathrm{P.E.}(r) = 0.6745\,(\mathrm{S.E.}) = 0.6745\,\dfrac{1 - r^2}{\sqrt{n}}$. The probable error or the standard error is used for interpreting the coefficient of correlation.
(i) If $r < \mathrm{P.E.}(r)$, there is no evidence of correlation.
(ii) If $r > 6\,\mathrm{P.E.}(r)$, the existence of correlation is certain.
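A small sketch of this interpretation rule; the values of r and n are illustrative:

    from math import sqrt

    r = 0.8    # sample correlation coefficient (illustrative)
    n = 64     # number of pairs of observations

    se = (1 - r ** 2) / sqrt(n)   # standard error of r
    pe = 0.6745 * se              # probable error of r

    if abs(r) < pe:
        print("no evidence of correlation")
    elif abs(r) > 6 * pe:
        print("existence of correlation is certain")
    else:
        print("inconclusive by this rule")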
The square of the coefficient of correlation for a bivariate distribution is known as the “Coefficient of
determination”.