
7.11. Measure of Skewness
The shape of a data set is described by skewness and kurtosis. Skewness measures the lack of symmetry of the data: a positive (right) skew means a longer right tail, and a negative (left) skew means a longer left tail.

7.11.1. Karl Pearson's skewness coefficient

Karl Pearson’s skewness coefficient is defined as Sk = (mean − mode) / (standard deviation), or equivalently Sk = 3(mean − median) / (standard deviation).

Theoretically, the value of Sk lies between ±3; in practice it usually lies between ±1. Any threshold or rule of thumb is arbitrary, but if the skewness is greater than 1.0 (or less than −1.0), the skewness is substantial and the distribution is far from symmetrical. In such cases, instead of the mean and standard deviation, we should preferably use the median and quartile deviation as the measures of central tendency and dispersion respectively. Alternatively, we may treat the outliers separately.
Examples of skewness: Salary data are often positively skewed, as many employees in a company earn relatively little while progressively fewer earn very high salaries. Failure-rate data are often left skewed: consider light bulbs, where very few burn out right away and the vast majority last quite a long time. The age of pensioners is positively skewed because there are very few pensioners aged 90–100 or above and a very large number aged 60–70 years.
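The second form of the coefficient can be sketched in a few lines of Python (a minimal illustration; the sample salary figures below are made up for demonstration):

```python
import statistics

def pearson_skewness(data):
    """Karl Pearson's second skewness coefficient: 3*(mean - median) / SD."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)  # population standard deviation
    return 3 * (mean - median) / sd

# A small right-skewed sample: most values low, one long right-tail value.
salaries = [20, 22, 23, 25, 26, 28, 30, 90]
print(round(pearson_skewness(salaries), 2))  # 1.03 -> substantial positive skew
```

A value above 1.0, as here, is the "substantial skewness" case discussed above, where the median and quartile deviation would be the preferred summary measures.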

7.12. Kurtosis – measure of relative flatness or peakedness of a data set


If we know the measures of central tendency, dispersion and skewness, we still cannot form a complete idea about the distribution, as is clear from the following figure, in which all three curves are symmetrical about the mean and have the same range. In addition to these measures we need one more, which Prof. Karl Pearson called the ‘Convexity of the Frequency Curve’ or Kurtosis. Kurtosis gives an idea of the ‘flatness or peakedness’ of the frequency curve. It is measured by the coefficient β2 (sample estimate b2) or its derived form γ2.

A standard normal distribution has mean µ = 0, standard deviation σ = 1, skewness = 0, and excess kurtosis γ2 = 0.

Karl Pearson’s coefficient of Kurtosis is given by the formula β2 = µ4 / µ2², where µ2 and µ4 are the second and fourth moments about the mean (the sample estimate is b2 = m4 / m2²), and γ2 = b2 − 3.
In probability theory and statistics, kurtosis is a measure of the "tailedness" of a probability distribution. A curve which is neither flat nor peaked is called the normal or mesokurtic curve; for such a curve β2 = 3 and γ2 = 0. A curve which is flatter than the normal curve is known as platykurtic; for such a curve β2 < 3 and γ2 < 0, and it is said to have negative kurtosis. A curve which is more peaked than the normal curve is called leptokurtic; for such a curve β2 > 3 and γ2 > 0, and it is said to have positive kurtosis.
Any threshold or rule of thumb is arbitrary, but if the excess kurtosis is greater than 2.0 (or less than −2.0), the kurtosis is substantial.
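As a sketch, b2 and the excess kurtosis g2 can be computed directly from the moments about the mean (the data set here is illustrative, not from the text):

```python
def kurtosis_coefficients(data):
    """Pearson's kurtosis b2 = m4 / m2**2 and excess kurtosis g2 = b2 - 3,
    where m2 and m4 are the 2nd and 4th sample moments about the mean."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    b2 = m4 / m2 ** 2
    return b2, b2 - 3

b2, g2 = kurtosis_coefficients([1, 2, 3, 4, 5])
print(round(b2, 2), round(g2, 2))  # 1.7 -1.3: flatter than normal, i.e. platykurtic
```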

7.13. Comparison among dispersion, skewness and kurtosis: Dispersion, skewness and kurtosis describe different characteristics of a frequency distribution. Dispersion studies the scatter of the items around a central value; it does not show the extent to which deviations cluster below or above an average. Skewness tells us about the clustering of the deviations above and below a measure of central tendency. Kurtosis studies the concentration of the items at the central part of a series: if items concentrate too much at the centre, the curve becomes ‘LEPTOKURTIC – Positive Kurtosis’, and if the concentration at the centre is comparatively less, the curve becomes ‘PLATYKURTIC – Negative Kurtosis’.

Multiple choice questions: choose the correct answer
1. The measures of dispersion based on every item of the series is:
(a) range
(b) standard deviation
(c) quartile deviation
(d) None of these [Ans. (b)]

2. One of the measures of dispersion which is more useful in case of open-end distributions:
(a) range
(b) mean deviation
(c) standard deviation
(d) quartile deviation [Ans. (d)]

3. Standard deviation is always computed from:


(a) mean
(b) mode
(c) median
(d) Geometric Mean [Ans. (a)]
4. Which of the following measures is least affected by extreme items:
(a) quartile deviation
(b) range
(c) standard deviation
(d) mean deviation [Ans. (a)]

5. Mean deviation is:


(a) less than S.D.
(b) more than S.D.
(c) not related to S.D.
(d) equal to Standard Deviation [Ans. (a)]

6. Coefficient of variation is given by:
(a) (SD / Mean) × 100
(b) (Mean / SD) × 100
(c) (QD / Mean) × 100
(d) (SD / Range) × 100 [Ans. (a)]

7. Which one of the following measures does not divide a set of observations into equal parts?
(a) Quartiles
(b) Standard Deviations
(c) Percentiles
(d) Deciles [Ans. (b)]

8. Which one is not a measure of dispersion?


(a) The Range
(b) 50th Percentile
(c) Inter-Quartile Range
(d) Variance [Ans. (b)]

9. In a statistical analysis, dispersion of data is measured by:


(a) Geometric Mean
(b) Arithmetic Mean
(c) Mode
(d) Range [Ans. (d)]

10. Which of the following is the most commonly used as measure of dispersion?
(a) Mean deviation
(b) Standard deviation
(c) Range
(d) Quartile deviation [Ans. (b)]

11. Which measure is not dependent on the value of each score?


(a) Mean
(b) Variance
(c) Standard Deviation
(d) Range [Ans. (d)]

12. All are measures of dispersion except:
(a) Range
(b) Mean Deviation
(c) Standard Deviation
(d) Median [Ans. (d)]

13. Which of the following is called semi-interquartile range?


(a) Q1 - Q3
(b) Q3 - Q1
(c) ½(Q3+Q1)
(d) ½(Q3- Q1) [Ans. (d)]

14. Which of the following is called Inter-quartile range?


(a) ½ (Q3- Q1)
(b) Q1 - Q2
(c) Q3 - Q1
(d) Q2 - Q3 [Ans. (c)]

15. The coefficient of Quartile Deviation is:
(a) (Q3 − Q1) / 2
(b) (Q3 + Q1) / (Q3 − Q1)
(c) (Q3 − Q1) / (Q3 + Q1)
(d) (Q3 + Q1) / 2 [Ans. (c)]

16. Which of the following is an example of a relative measure of dispersion?


(a) Variance
(b) Mean Deviation
(c) Standard deviation
(d) Coefficient of Variation [Ans. (d)]

17. The square of the variance of a distribution is the:


(a) Standard deviation
(b) Mean Deviation
(c) Absolute dispersion
(d) None of the above [Ans. (d)]

18. Standard deviation (σ) is calculated by one of the following formulae:
(a) √[ Σfixi² / Σfi − (Σfixi / Σfi)² ]
(b) √[ Σfixi² / Σfi − (Σfidi / Σfi)² ]
(c) √[ Σfixi² / Σfi + (Σfidi / Σfi)² ]
(d) √[ Σfixi² / Σpi − (Σfixi / Σfi)² ] [Ans. (a)]

19. 20 babies are born in a hospital on the same day. Each weighs 2.5 kg, the standard deviation is:
(a) 1
(b) 0
(c) 2.5
(d) 5 [Ans. (b)]
20. The general formula for an estimate of a variance is sum of squared deviations divided by:
(a) Population size
(b) Sample size
(c) Degrees of freedom
(d) Level of significance [Ans. (c)]

21. If a sample consists of three observations, 13, 14, 15, an estimate of the population standard deviation equals:
(a) √(2/3)
(b) Zero
(c) √(1/3)
(d) 1 [Ans. (a)]


22. In a series of 300 boys, the mean systolic blood pressure was 120 mm of Hg and the standard
deviation was found to be 20. The coefficient of variation is :
(a) 16.7%
(b) 8.3%
(c) 40%
(d) 30% [Ans. (a)]

23. Dispersion of a group of data can be graphically represented by:
(a) Normal curve
(b) Lorenz curve
(c) Curvilinear curve
(d) Cumulative frequency curve [Ans. (b)]
24. The semi-interquartile range is most closely related to the:
(a) Mean
(b) Median
(c) Mode
(d) Geometric Mean [Ans. (b)]
25. The properties of the standard deviation are most closely related to those of the :
(a) Mean
(b) Median
(c) Mode
(d) Range [Ans. (a)]
26. If one score in a distribution is changed to another value, it is certain that:
(a) The range has changed
(b) The standard deviation has changed
(c) The semi-interquartile range has changed
(d) The Question does not make sense [Ans. (b)]

TRY
Q1. What are the requisites of a good measure of Dispersion?

Q2. (a) What is a measure of variation used for? (b) What are the various measures of dispersion? Name them.

Q3. Compare Mean Deviation and Std. Deviation.

Q4. Give the merits and demerits of Standard Deviation (SD)

Q5. Define coefficient of variation

Q6. What do you mean by range? Give its merits and demerits.

Q7. Calculate coefficient of variation from the following data:


Life in hrs. 0-50 50-100 100-150 150-200 200-250
No. of bulbs 2 8 60 25 5

Q8. Calculate standard deviation:
Variable 20-30 30-40 40-50 50-60 60-70 70-80 80-90
frequency 3 61 132 153 140 51 2

Q9. Calculate Range and Coefficient of range from the following data:
Marks 25 30 35 60 75
No. of Students 4 8 10 6 3

Q10. Calculate quartile deviation and its coefficient from the data given below:
Variable 20-30 30-40 40-50 50-60 60-70
frequency 4 6 8 6 4

Q11. Calculate the mean deviation about mean for the following distribution:
Classes 20-40 40-80 80-100 100-120 120-140
frequency 3 6 20 12 9

Q12. Find the standard deviation from the following data:


Marks 0-10 10-20 20-30 30-40 40-50
No. of Students 10 15 10 10 5

Q13. Compute third quartile from the given data: 11, 12, 14, 18, 22, 26, 30 (1)

Chapter 8
Correlation and Regression
The word correlation is used in everyday life to denote some form of association. We might say that there
is a correlation between foggy days and attacks of breathlessness. However, in statistical terms correlation
denotes association between two quantitative variables. It measures the degree/extent of the
relationship between variables.
8.1. Types of Correlation
According to the direction of change in variables there are two types of correlation:
(a) Positive Correlation: Correlation between two variables is said to be positive if the values of one variable increase (or decrease) as the values of the other variable also increase (or decrease). Some examples of positive correlation are the correlation between: (i) heights and weights of a group of persons; (ii) household income and expenditure; (iii) expenditure on advertising and sales revenue; as the expenditure on advertising increases, sales revenue also increases. Thus the change is in the same direction, and hence the correlation is positive.
(b) Negative Correlation: Correlation between two variables is said to be negative if the values of one variable increase (or decrease) as the values of the other variable decrease (or increase). Some examples of negative correlation are the correlation between: (i) volume and pressure of a gas; (ii) price and demand of goods; (iii) literacy and poverty in a country; as the literacy level in a country goes up, poverty decreases. Thus the change in the values of the two variables is in opposite directions, and therefore the correlation between literacy and poverty in a country is negative.
8.1.1 Karl Pearson’s Coefficient of Correlation: It measures the degree/extent of linear relationship*/
association between two variables e.g. height and weight; income and income tax, etc.
The Karl Pearson’s coefficient of correlation is denoted by ‘r’ and is defined as:

r = Cov(X, Y) / (σx · σy)

Where: Cov(X, Y) = Σ(X − X̄)(Y − Ȳ) / N, called the covariance between X and Y;
σx = standard deviation of series X = √[ Σ(X − X̄)² / N ];
σy = standard deviation of series Y = √[ Σ(Y − Ȳ)² / N ];
N = number of pairs of observations.

A simple formula for calculation:

r = [ N·ΣXY − ΣX·ΣY ] / ( √[ N·ΣX² − (ΣX)² ] · √[ N·ΣY² − (ΣY)² ] )

Correlation lies between - 1 and + 1. When one variable increases as the other increases the correlation is
positive and when one variable decreases as the other increases it is negative. Complete absence of
correlation is represented by 0. Figure below gives graphical representations of some correlations.
*A linear relationship means a relationship that can be represented by a line on a graph paper (the word “linear”
means “a line”). In case of linear relationship, the highest power of the variables x and y is 1; e.g. y = 3x + 7

[Figure: (a) Perfect Positive Correlation; (b) Perfect Negative Correlation; (c) Absence of Correlation; (d) Non-Linear (Curvilinear) Correlation]

8.1.2 Calculation of the correlation coefficient: The table below lists a set of 15 values [N = 15] of variables X and Y; calculate the correlation coefficient between the two variables and interpret the result.

X        Y       X²        Y²        XY
0.5      41      0.25      1681      20.50
0.7      55      0.49      3025      38.50
2.5      41      6.25      1681      102.50
4.1      39      16.81     1521      159.90
5.9      50      34.81     2500      295.00
6.1      32      37.21     1024      195.20
7.0      41      49.00     1681      287.00
8.2      42      67.24     1764      344.40
10.0     26      100.00    676       260.00
10.1     35      102.01    1225      353.50
10.9     25      118.81    625       272.50
11.5     31      132.25    961       356.50
12.1     31      146.41    961       375.10
14.1     29      198.81    841       408.90
15.0     23      225.00    529       345.00
118.70   541.00  1235.35   20695.00  3814.50

r = [ N·ΣXY − ΣX·ΣY ] / ( √[ N·ΣX² − (ΣX)² ] · √[ N·ΣY² − (ΣY)² ] )
  = (15×3814.5 − 118.7×541) / ( √[15×1235.35 − (118.7)²] · √[15×20695 − (541)²] )
  = (57217.5 − 64216.7) / ( √(18530.25 − 14089.69) · √(310425 − 292681) )
  = −6999.2 / ( √4440.56 · √17744 )
  = −6999.2 / (66.637 × 133.207)
  = −6999.2 / 8876.489
  = −0.7885

Interpretation: r = −0.79 indicates a high negative correlation: as the value of variable X increases, the value of variable Y tends to decrease, and vice versa.
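The hand computation can be checked with a short Python sketch of the simple formula, using the 15 data pairs from the table:

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation via the computational formula
    r = (N*Sxy - Sx*Sy) / (sqrt(N*Sxx - Sx^2) * sqrt(N*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

X = [0.5, 0.7, 2.5, 4.1, 5.9, 6.1, 7.0, 8.2, 10.0, 10.1, 10.9, 11.5, 12.1, 14.1, 15.0]
Y = [41, 55, 41, 39, 50, 32, 41, 42, 26, 35, 25, 31, 31, 29, 23]
print(round(pearson_r(X, Y), 4))  # -0.7885, matching the hand calculation
```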

8.1.3. Scatter or dot diagram – Rough Measure of Correlation
Scatter or dot diagram is a graphic presentation showing the degree/extent of relationship (correlation) between two variables; it is also called a correlation diagram. To plot a scatter diagram, the given values of variables X and Y are marked in the XY-plane as dots. Normally the independent variable is plotted along the X-axis and the dependent variable along the Y-axis.

[Figure: Relationship between X and Y – scatter diagram. The points lie close to the down-sloping trend line y = −1.5762x + 48.54.]

Interpretation of the Scatter diagram: All the points lie around a down-sloping line so the correlation
between the variables X and Y is high and negative.
8.1.4. Short-Cut Method for Calculation of Correlation Coefficient
When the values of the variables are large and the actual means of X and Y are not whole numbers, calculation of the correlation coefficient is somewhat cumbersome; we can then use the shortcut method, in which deviations are taken from assumed means of the variables X and Y. The formula for the correlation coefficient by the shortcut method is:

r = [ n·Σdxdy − Σdx·Σdy ] / ( √[ n·Σdx² − (Σdx)² ] · √[ n·Σdy² − (Σdy)² ] )

where n = number of pairs of observations,
a = assumed mean of X, b = assumed mean of Y,
dx = x − a : deviation from the assumed mean a in the x-series,
dy = y − b : deviation from the assumed mean b in the y-series.
Note: a and b are chosen so that they lie approximately in the middle of the x and y series respectively.

Ex. 2: Calculate correlation coefficient from the following data by shortcut method:
x 10 12 14 18 20
y 5 6 7 10 12

Solution (assumed means a = 14, b = 8):

x    y    dx = x − 14   dy = y − 8   dx·dy   dx²   dy²
10   5    −4            −3           12      16    9
12   6    −2            −2           4       4     4
14   7    0             −1           0       0     1
18   10   4             2            8       16    4
20   12   6             4            24      36    16
74   40   4             0            48      72    34

Using r = [ n·Σdxdy − Σdx·Σdy ] / ( √[ n·Σdx² − (Σdx)² ] · √[ n·Σdy² − (Σdy)² ] ); n = 5

r = (5×48 − 4×0) / ( √[5×72 − (4)²] · √[5×34 − (0)²] )
  = 240 / ( √(360 − 16) · √(170 − 0) )
  = 240 / ( √344 · √170 )
  = 240 / (18.55 × 13.04)
  = 240 / 241.89
  = 0.992
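A quick sketch confirming the shortcut (assumed-mean) formula, and that the result does not depend on which assumed means are chosen (a = 14, b = 8 as in the table above):

```python
from math import sqrt

def r_shortcut(x, y, a, b):
    """Correlation via deviations dx = x - a, dy = y - b from assumed means a, b."""
    n = len(x)
    dx = [v - a for v in x]
    dy = [v - b for v in y]
    num = n * sum(p * q for p, q in zip(dx, dy)) - sum(dx) * sum(dy)
    den = sqrt(n * sum(d * d for d in dx) - sum(dx) ** 2) * \
          sqrt(n * sum(d * d for d in dy) - sum(dy) ** 2)
    return num / den

x = [10, 12, 14, 18, 20]
y = [5, 6, 7, 10, 12]
print(round(r_shortcut(x, y, 14, 8), 3))  # 0.992
print(round(r_shortcut(x, y, 0, 0), 3))   # 0.992 -- invariant to the assumed means
```

The invariance illustrates the change-of-origin property listed in §8.1.6: subtracting constants a and b from the series leaves r unchanged.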

8.1.5. Interpretation of Coefficient of Correlation ‘r’


The coefficient of correlation describes not only the magnitude of correlation but also its direction. Thus
r = + 1 means Perfect Positive Correlation or complete agreement in the same direction; r = - 1 means
Perfect Negative Correlation i.e. complete agreement in the opposite direction and r = 0 means No Linear
relation.
As a rule of thumb, correlation coefficients between 0.00 and 0.25 are considered weak, between 0.25 and
0.75 moderate and between 0.75 and 1.00 high. Value of r = + 0.80 means correlation is strong and positive
i.e. variables X and Y have strong direct relationship. Value - 0.26 means correlation is weak and
negative i.e. variables X and Y have weak inverse relationship.

Value of correlation coefficient    Correlation is
+1                                  Perfect Positive Correlation
−1                                  Perfect Negative Correlation
0                                   No Correlation
0 to 0.25                           Weak Positive Correlation
0.75 to +1                          Strong Positive Correlation
−0.25 to 0                          Weak Negative Correlation
−1 to −0.75                         Strong Negative Correlation

8.1.6. Properties of Correlation Coefficient
(a) Correlation coefficient lies between -1 and +1.
(b) Correlation coefficient is independent of change of origin and scale; it means if a number is added to or
subtracted from all the given values of the data, the correlation is not affected [Change of Origin]. Similarly if
all the items of the data set are multiplied or divided by the same number, the correlation is not affected.
[Change of Scale]
(c) If X and Y are two independent variables then correlation coefficient between X and Y is zero.

8.2. Spearman’s Rank Correlation Coefficient (ρ)


Rank correlation is used when variables under consideration are not capable of quantitative measurement
but can be arranged in serial order i.e. when we are dealing with qualitative characteristics (attributes) like
honesty, beauty, morality, etc. It is based on the ranks of the items rather than actual values. However,
it can be used even with the actual values after ordering/ranking them. Its examples are (i) correlation
between honesty & wisdom (ii) finding out the degree of agreement between scores (ranks) given to the
different Departments of an Organization based on their audit risks by two auditors. (iii) Correlation
between Ranking of trainees at the beginning (x) and at the end (y) of a certain course.

The formula for Spearman’s rank correlation coefficient is:

ρ = 1 − 6ΣD² / [ N(N² − 1) ]

Where D = difference between the ranks of an item in the two series, and
N = the number of observations.

8.2.1. Computation of Spearman’s Rank Correlation Coefficient


i) Give ranks to the values of items of both the series. Generally the item with the highest value is
ranked 1 and then the others are given ranks 2, 3, 4, . . . etc. according to their values in the
decreasing order.
ii) Find out the difference D = R1 - R2 where R1 = Rank of x and R2 = Rank of y [Note that ΣD = 0
(always)] for each pair of ranks.
iii) Calculate D2 for each rank and then find ΣD2
iv) Apply the formula.

8.2.2. Spearman’s rank correlation when there is a tie between two or more items: In case there is a tie, i.e. the same value is repeated for a variable, give the tied items the average of the ranks they would otherwise occupy. If m is the number of items sharing an equal rank, the correction factor (m³ − m)/12 is added to ΣD². If there is more than one such group of tied items, this factor is added once for each group, so we have:

ρ = 1 − 6[ ΣD² + (m1³ − m1)/12 + (m2³ − m2)/12 + (m3³ − m3)/12 + ··· ] / [ N(N² − 1) ]

Where N = total number of pairs of items and m1, m2, m3, etc. indicate the number of times each tied rank is repeated.

Calculation of Rank Correlation [No repetition of Ranks]

Sl. No.   Rank X = R1   Rank Y = R2   D = R1 − R2   D²
1         1             3             −2            4
2         3             1             2             4
3         7             4             3             9
4         5             5             0             0
5         4             6             −2            4
6         6             9             −3            9
7         2             7             −5            25
8         10            8             2             4
9         9             10            −1            1
10        8             2             6             36
Sum                                   0             96

ρ = 1 − 6ΣD² / [N(N² − 1)] = 1 − (6×96) / [10(10² − 1)]
  = 1 − 576/990 = 0.4182

8.2.3. Merits and Demerits of Rank Correlation Coefficient


Merits
(i) Spearman’s rank correlation coefficient can be interpreted in the same way as the Karl
Pearson’s correlation coefficient;
(ii) It is easy to understand and easy to calculate;
(iii) For finding out the association between qualitative characteristics, rank correlation coefficient
is the only method;
(iv) Rank correlation does not require the assumption of the normality of the population from
which the sample observations are taken.
Demerits
1. Correlation coefficient can be calculated for bivariate frequency distribution but rank
correlation coefficient cannot be calculated for bivariate frequency distribution; and
2. If the number of pairs of items n > 30, this formula is time consuming.

8.3. The Regression Analysis
Correlation analysis is concerned with measuring the strength (degree/extent) of the relationship between variables, while regression analysis is used to ascertain the probable form of that relationship, with the ultimate objective of predicting or estimating the value of one variable corresponding to a given value of the other variable(s). Regression analysis is often more useful than the correlation coefficient, as it enables us to predict the value of variable y for a given value of x and vice versa. For example, if we have a regression equation between tax and income, then for a given income we can find the tax amount. Similarly, if we have a regression equation between the number of teachers and their salaries, we can find the salary bill of a school or district if we know the number of teachers.
There are two types of variables in regression analysis. The variable used for prediction is called the independent variable; it is also known as the regressor, predictor or explanatory variable. The variable whose value is predicted from the independent variable is called the dependent variable; it is also known as the regressed or explained variable.
8.3.1. Types of Regression - If scatter diagram shows some relationship between independent variable X and
dependent variable Y, then the scatter diagram will be more or less concentrated round a curve, which may be
called the curve of regression. When the curve is a straight line, it is known as line of regression and the
regression is said to be linear regression. If the relationship between dependent and independent variables is
not a straight line but curve of any other type then regression is known as nonlinear or curvilinear.
Regression can also be classified according to number of variables being used. If only two variables are used
this is considered as simple regression whereas the involvement of more than two variables in regression is
categorized as multiple regression.
8.3.2 Regression Lines: The relationship between two variables can be represented by a simple equation
called the regression equation; in the simplest case this can be a straight line. For two variables X and Y,
we have two regression lines i.e. regression line of X on Y and regression line of Y on X. The regression
line of X on Y gives the most probable values/estimate of X for given values of Y whereas the regression
line of Y on X gives the most probable values of Y for given values of X.
When the two sets of observations increase or decrease together (positive correlation) the line slopes
upwards from left to right; when one set decreases as the other increases (negative correlation) the line
slopes downwards from left to right.
(a) Regression Equation of Y on X: Regression Equation of Y on X is expressed as: Y = a + byx*X
Where byx is regression coefficient of y on x; Y is a dependent variable; X independent variable
and ‘a’ is Y intercept, the value of Y when X is zero.

Using the least squares criterion*, the regression equation of Y on X becomes: (Y − Ȳ) = r·(σy/σx)·(X − X̄)

Where r is the correlation coefficient between variables X and Y, x is the standard deviation (SD) of
variable X; y is the SD of variable Y while x̅ and y̅ are the means of variables X and Y respectively.
*The Least Squares Criterion (the line of best fit): “The sum of the squared deviations (vertical for the regression line of Y on X, horizontal for the regression line of X on Y) of the observed data points from the line of best fit is smaller than the sum of the squared deviations of the data points from any other line,” i.e. Σ(X − Xe)² or Σ(Y − Ye)² is minimum, where Xe and Ye are the estimated values of variables X and Y based on the regression lines.

(b) Regression Equation of X on Y: Regression Equation of X on Y is expressed as: X = a + bxy*Y;
Where bxy is regression coefficient of x on y; X is a dependent variable, Y independent variable and
‘a’ is X intercept, the value of X when Y is zero.

Using the least squares criterion, the regression equation of X on Y becomes: (X − X̄) = r·(σx/σy)·(Y − Ȳ)

Where r is the correlation coefficient between the variables X and Y, σx is the standard deviation of variable X, σy is the standard deviation of variable Y, while X̄ and Ȳ are the means of variables X and Y respectively.

(c) Properties of Regression Lines:


(i) When there is perfect correlation i.e. when r = ± 1, the two regression lines coincide (become
one line).
(ii) The farther the two regression lines from each other, the lesser is the degree of correlation.
(iii) The two regression lines always meet at point (x̅, y̅)
(d) Regression Coefficients: byx = r·(σy/σx) and bxy = r·(σx/σy) are called the regression coefficients of Y on X and of X on Y respectively. Properties of regression coefficients are:
(i) byx · bxy = r² (the coefficient of determination), i.e. the geometric mean of the regression coefficients equals the correlation coefficient between the variables (in magnitude).
(ii) If one of the regression coefficients is greater than one, then the other must be less than one.
(iii) bxy, byx and r must have the same sign.
(e) Normal Equation Method for Regression Lines
(i) Regression Equation of Y on X: let the regression equation of Y on X be y = a + b·x; the normal equations to obtain the values of a and b are:
Σy = n·a + b·Σx ............(1)
Σ(x·y) = a·Σx + b·Σx² ............(2)
By solving equations (1) and (2) simultaneously, the values of a and b can be determined to get the regression line.
(ii) Regression Equation of X on Y: let the regression equation of X on Y be x = a + b·y; the normal equations to obtain the values of a and b are:
Σx = n·a + b·Σy ............(1)
Σ(x·y) = a·Σy + b·Σy² ............(2)
By solving equations (1) and (2) simultaneously, the values of a and b can be determined to get the regression line.
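Solving the two normal equations for y = a + b·x can be sketched directly: eliminating a between them gives the familiar slope formula, after which a follows by back-substitution. The data here are the small x, y series from Ex. 2:

```python
def fit_y_on_x(x, y):
    """Solve the normal equations
         Sum(y)   = n*a + b*Sum(x)
         Sum(x*y) = a*Sum(x) + b*Sum(x^2)
       for the regression line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope, from eliminating a
    a = (sy - b * sx) / n                          # intercept, by back-substitution
    return a, b

a, b = fit_y_on_x([10, 12, 14, 18, 20], [5, 6, 7, 10, 12])
print(round(a, 3), round(b, 3))  # -2.326 0.698, i.e. y = -2.326 + 0.698x
```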

8.4. Distinction between Correlation and Regression


Both correlation and regression have important role in the study of relationship between the variables but there
are some distinctions between them as explained under:
(i) Correlation analysis studies the extent or degree of relationship between two or more variables
while regression analysis tries to find out the type of relationship between two or more variables.
(ii) Correlation has limited application because it gives the strength of relationship while the purpose
of regression is to "predict" the value of the dependent variable for the given values of one or more
independent variables.

(iii) Correlation does not involve the concept of dependent and independent variables, while in regression analysis one variable is treated as the dependent variable and the other(s) as independent variable(s).

8.5. Coefficient of Determination


Ratio of explained variance to total variance gives the Coefficient of Determination, mathematically:
Coefficient of Determination (r2) = Explained Variance / Total variance
8.5.1. Properties of Coefficient of Determination:
(i) Gives the percentage variation in the dependent variable accounted for by the independent
variable.
(ii) It is given by the square of the correlation coefficient, i.e. r². For example, if r = 0.8, the coefficient of determination r² = 0.64, so 64% of the variation in the dependent variable is explained by the independent variable and the remaining 36% of the variation is due to other factors.
(iii) Always non–negative and doesn't tell about direction of relationship.
(iv) It is more useful than Coefficient of Correlation.

8.6. Multiple and Partial Correlation and Regression


Regression analysis is a statistical technique which allows us to assess the relationship between one
dependent variable (DV) and one or several independent variables (IVs). Multiple Regression is an
extension of bi-variate regression in which several independent variables (IVs) are combined to predict
the value of the dependent variable (DV). Regression may be assessed in a variety of manners, such as:
8.6.1. Partial regression and correlation - The relationship between two variables may be unclear because
of the confounding (confusing) influence of another variable; for example, if we calculate the correlation
between mental age and height in children 1 to 10 years of age; we may find a high correlation. Does that
mean that height causes intelligence? The key factor is age, not height. Once we control for age, the
relationship between height and mental age becomes trivial/insignificant.
One study was conducted to determine whether the number of hours spent studying was related to grades; the researchers found a negative correlation. This does not mean that studying less results in higher grades. Once they controlled for intelligence, the researchers found a significant positive relationship between grades and hours of study.
Partial correlation may be written as r12.3. This indicates that we are measuring the correlation between
variables 1 and 2 with the effect of variable 3 removed from both the variables 1 and 2. Consider the
example college grades (Variable-1), hours of study (Variable-2) and intelligence (variable 3). If we use
partial correlation to measure the correlation between hours of study and grades, the correlations between intelligence and grades (r13) and between intelligence and hours of study (r23) are removed.
The confounding influence of intelligence is thus removed statistically, and the relationship between
grades and hours of study can be measured accurately. This relationship would give the partial correlation.
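The text does not state the formula; the standard first-order partial correlation formula is r12.3 = (r12 − r13·r23) / √[(1 − r13²)(1 − r23²)]. A minimal Python sketch follows; the three input correlations below are hypothetical values chosen only for illustration, not figures from the study described above:

```python
from math import sqrt

def partial_r(r12, r13, r23):
    """First-order partial correlation between variables 1 and 2,
    controlling for (removing the linear effect of) variable 3."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical: grades-hours r12 = 0.5, grades-IQ r13 = 0.6, hours-IQ r23 = 0.7
print(round(partial_r(0.5, 0.6, 0.7), 2))  # 0.14: much weaker once IQ is removed
```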

8.6.2. Multiple regression and correlation: It studies the combined effect of all the variables acting on
the dependent variable. The multivariate regression equation is of the form:
Y = A + B1X1 + B2X2 + ... + BnXn + E; where:
Y = The predicted value or the Dependent Variable (DV),
A = The Y intercept, the value of Y when all X’s are zero,
X’s = The values of the Independent Variables (IVs),
B = The coefficients of regression and E = An error term.
The goal of the regression is to derive the regression coefficients, or beta coefficients (B values). The beta
coefficients allow the computation of reasonable Y values with the regression equation. When reporting
multiple correlations, R2 rather than R is often presented.
Although regression analysis reveals relationships between variables this does not imply that the
relationships are causal. Demonstration of causality is not a statistical problem, but an experimental and
logical problem.

8.7. Correlation and Causation


It is a common error to confuse correlation with causation. All that correlation shows is that the two variables are associated. There may be a third, confounding variable that is related to both of them and creates an apparent correlation between the two. The relationship between two variables may
be unclear because of the confounding influence of another variable; for example, if we calculate the
correlation between mental age and height in children 1 to 10 years of age; we may find a high correlation.
The key factor is age, not height. Once we control for age, the relationship between height and mental age
becomes trivial/insignificant.
Another example: as ice cream sales increase, the rate of drowning deaths increases sharply; therefore, the fallacious argument goes, ice cream consumption causes drowning. This reasoning fails to recognize the importance of time and
temperature in ice cream sales and swimming. Ice cream is sold during the hot summer months at a much
greater rate than during colder times and it is during these hot summer months that people are more likely
to engage in activities involving water, such as swimming. The increased drowning deaths are simply
caused by more exposure to water-based activities, not ice cream sales.
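The ice cream example can be sketched as a small simulation (illustrative only; all numbers are made up). Temperature acts as the confounder driving both variables; the raw correlation between them is strong, but it largely disappears once temperature is controlled for via residuals (the partial-correlation idea):

```python
import numpy as np

# Made-up simulation: temperature is the confounder that drives both
# ice cream sales and drowning deaths; neither causes the other.
rng = np.random.default_rng(42)
n = 1000
temperature = rng.normal(25, 8, n)                  # the confounding variable
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
drownings = 0.5 * temperature + rng.normal(0, 5, n)

# Raw correlation is strong even though there is no causal link.
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

# "Controlling" for temperature: correlate the residuals after regressing
# each variable on temperature.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_partial = np.corrcoef(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))[0, 1]
```

Here `r_raw` comes out clearly positive while `r_partial` is close to zero, mirroring the mental-age/height example above.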
Ex.1: The following data pertain to the chlorine residual in a swimming pool at various times after it has
been treated with chemicals:

No. of hours                            2     4     6     8     10    12
Chlorine residual (parts per million)   1.8   1.5   1.4   1.1   1.1   0.9

(a) Calculate the Karl Pearson’s coefficient of correlation between No. of hours and Chlorine residual.
(b) Fit a least squares line which will enable us to predict the chlorine residual content in terms of the
numbers of hours since the pool has been treated with chemicals.
(c) Use the equation of the least squares line to estimate the chlorine residual in the pool 5 hours after it
has been treated with chemicals.

Solution: Let No. of hours = x and Chlorine residual (parts per million) = y.

Sl. No.     x      y      x²       y²       xy
1           2      1.8    4        3.24     3.6
2           4      1.5    16       2.25     6.0
3           6      1.4    36       1.96     8.4
4           8      1.1    64       1.21     8.8
5           10     1.1    100      1.21     11.0
6           12     0.9    144      0.81     10.8
Total       42     7.8    364      10.68    48.6

Mean (x̄) = Σxᵢ/n = 42/6 = 7.0;  Mean (ȳ) = Σyᵢ/n = 7.8/6 = 1.3

σx = √[Σxᵢ²/n − (Σxᵢ/n)²] = √[364/6 − (42/6)²] = √(60.67 − 49) = 3.42

σy = √[Σyᵢ²/n − (Σyᵢ/n)²] = √[10.68/6 − (7.8/6)²] = √(1.78 − 1.69) = 0.3

(a) r = (N·ΣXY − ΣX·ΣY) / (√[N·ΣX² − (ΣX)²] · √[N·ΣY² − (ΣY)²])

      = (6×48.6 − 42×7.8) / (√[6×364 − (42)²] · √[6×10.68 − (7.8)²])
      = (291.6 − 327.6) / (√(2184 − 1764) · √(64.08 − 60.84))
      = −36 / (√420 · √3.24)
      = −36 / (20.49 × 1.8)
      = −36 / 36.88
      = −0.976
(b) To predict the chlorine residual content we take y = Chlorine residual as the dependent variable, so the line
of regression of y on x is used:

(Y − Ȳ) = r · (σy/σx) · (X − X̄)

Putting in the values calculated above:

Y − 1.3 = −0.976 × (0.3/3.42) × (X − 7)
Y = 1.3 − 0.0856(X − 7)
Y = 1.3 − 0.0856X + 0.599
Y = 1.899 − 0.0856X is the required line of regression, or line of best fit.

(c) To estimate the chlorine residual in the pool 5 hours after it has been treated with chemicals, put X = 5:
Y = 1.899 − 0.0856 × 5
  = 1.899 − 0.428
  = 1.471 parts per million
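The hand computation above can be checked numerically; this is a sketch assuming NumPy is available, and the results agree with the worked values up to rounding:

```python
import numpy as np

# Data from Ex.1: hours since treatment (x) and chlorine residual (y).
x = np.array([2, 4, 6, 8, 10, 12], dtype=float)
y = np.array([1.8, 1.5, 1.4, 1.1, 1.1, 0.9])

# (a) Karl Pearson's coefficient of correlation.
r = np.corrcoef(x, y)[0, 1]

# (b) Least squares line of y on x.
slope, intercept = np.polyfit(x, y, 1)

# (c) Estimated chlorine residual 5 hours after treatment.
prediction = slope * 5 + intercept
```

The exact slope is −6/70 ≈ −0.0857 and the exact intercept is 1.9; the small differences from the hand computation come from rounding σx and σy.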
Multiple choice questions: choose the correct answer
1. An association between two variables can be assessed by calculating:
(a) Coefficient of correlation
(b) Coefficient of regression
(c) Standard error of mean
(d) Standard deviation [Ans. (a)]
2. The coefficient of correlation lies between:
(a) 0 to +1
(b) -1 to 0
(c) -1 to +1
(d) -1.1 to +1.1 [Ans. (c)]
3. If r = -1 what does it mean?
(a) Weak correlation
(b) No correlation
(c) Perfect negative correlation
(d) Wrong calculation [Ans. (c)]
4. The correlation between Price and supply level of a commodity is 0.2. What does it indicate?
(a) Strong correlation
(b) Weak correlation
(c) Moderate correlation
(d) None of the above [Ans. (b)]
5. If r = 1.02 what does it indicate?
(a) Very strong +ve correlation
(b) Moderately +ve correlation
(c) Weak correlation
(d) Calculation of ‘r’ is wrong [Ans. (d)]
6. Not true about correlation coefficient ‘r’ is:
(a) Tells about Audit Risk of the Department
(b) -1 correlation shows perfect linear relationship
(c) Tells association between two variables
(d) Does not tell about causation [Ans. (a)]

7. If the correlation between height and weight is very strong, what is a possible value of the
coefficient of correlation?
(a) 0.2
(b) 0.1
(c) 0.9
(d) -0.9 [Ans. (c)]
8. A correlation between two variables measures the degree to which they are:
(a) Mutually exclusive
(b) Causally related
(c) Positively skewed
(d) Associated [Ans. (d)]
9. The formula for the Rank Correlation Coefficient 'ρ' is:
(a) 1 − 6ΣD²/[N(N² − 1)]
(b) 1 − 6ΣD²/[N(N² + 1)]
(c) 1 + 6ΣD²/[N(N² − 1)]
(d) 1 + 6ΣD²/[N(N² + 1)] [Ans. (a)]

10. If D2 = 0, the value of rank correlation ‘𝜌’ is


(a) 0
(b) +1
(c) 2
(d) -1 [Ans. (b)]
11. The unit of correlation coefficient between height in feet and weight in kg is:
(a) Kg/feet
(b) Percentage
(c) Non-existent
(d) Feet/kg [Ans. (c)]
12. If rxy is positive the relation between X and Y is of the type:
(a) When Y increases X increases
(b) When Y decreases X increases
(c) When Y increases X does not change
(d) Anything is possible [Ans. (a)]

13. Which of the following three measures can detect any type (linear or non-linear) of
relationship?
(a) Karl Pearson's coefficient of correlation
(b) Spearman's rank correlation
(c) Scatter diagram
(d) Both (a) and (b) are correct [Ans. (c)]
14. If precisely measured data are available, the simple correlation coefficient is:
(a) More accurate than rank correlation coefficient
(b) Less accurate than rank correlation coefficient
(c) As accurate as the rank correlation coefficient
(d) Can be more or less accurate [Ans. (a)]

TRY
Q1. Raw material used in the production of a synthetic fiber is stored in a place which has no humidity
control. Measurements of the relative humidity and the moisture content of samples of the raw material
(both in percentages) on 12 days yielded the following results:
Humidity           46  53  37  42  34  29  60  44  41  48  33  40
Moisture Content   12  14  11  13  10   8  17  12  10  15   9  13
(a) Calculate the Karl Pearson’s coefficient of correlation between Humidity and Moisture Content.
(b) Fit a least squares line which will enable us to predict the moisture content in terms of the relative
humidity.
(c) Use the result to estimate (predict) the moisture content when the relative humidity is 38 percent.

Q2. The following data pertain to X, the amount of fertilizer (in kgs.) which a farmer applies to his soil,
and Y, his yield of wheat (in quintals per acre):
X 112 92 72 66 112 88 42 126 72 52 28
Y 33 28 38 17 35 31 8 37 32 20 17
Assuming that the data can be looked upon as a random sample from a bivariate normal population,
calculate r. Also, draw a scatter diagram of these paired data and judge whether the assumption seems
reasonable.

Q3. Rankings of 10 trainees at the beginning (x) and at the end (y) of a certain course are given below:
Trainees A B C D E F G H I J
x 1 6 3 9 5 2 7 10 8 4
y 6 8 3 7 2 1 5 9 4 10
Calculate Spearman's rank correlation coefficient. [ρ = 0.394]
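The stated answer can be verified in a few lines of plain Python (a sketch using the ρ = 1 − 6ΣD²/[N(N² − 1)] formula, which is valid here because there are no tied ranks):

```python
# Rankings of the 10 trainees at the beginning (x) and end (y) of the course.
x = [1, 6, 3, 9, 5, 2, 7, 10, 8, 4]
y = [6, 8, 3, 7, 2, 1, 5, 9, 4, 10]

n = len(x)
sum_d_sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))  # sum of D^2 = 100
rho = 1 - 6 * sum_d_sq / (n * (n**2 - 1))               # 1 - 600/990 ≈ 0.394
```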

Q4. Find the coefficient of rank correlation between the marks obtained in Mathematics (x) and those in
Statistics (y) by 10 students of certain class out of a total of 50 marks in each subject. [ρ=0.95]
Student No. 1 2 3 4 5 6 7 8 9 10
X 12 18 32 18 25 24 25 40 38 22
Y 16 15 28 16 24 22 28 36 34 19

Q5. Comment: The correlation coefficient between the accidents in a particular year and the babies born
in that year was found to be 0.8, which seems to be quite high.

Q6. The production manager of a company maintains that the flow time in days (y) depends on the number
of operations (x) to be performed. The following data give the necessary information:
x 2 2 3 4 4 5 6 6 7 7
y 8 13 14 11 20 10 22 26 22 25
Plot a scatter diagram. Calculate the value of Karl Pearson's correlation coefficient. [r(x, y) = 0.78]
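The stated value of r can be checked numerically (a sketch assuming NumPy is available; the scatter-diagram part of the question is omitted):

```python
import numpy as np

# Number of operations (x) and flow time in days (y) from Q6.
x = np.array([2, 2, 3, 4, 4, 5, 6, 6, 7, 7], dtype=float)
y = np.array([8, 13, 14, 11, 20, 10, 22, 26, 22, 25], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # Karl Pearson's r, about 0.78
```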

Q7. From the following data of the age of husband and the age of wife, form two regression lines and
calculate the husband’s age when the wife’s age is 16.

Husband’s age 36 23 27 28 28 29 30 31 33 35
Wife’s age 29 18 20 22 27 21 29 27 29 28
[Husband’s age: x; Wife’s age: y; y = 0.95x - 3.5; x = 0.8y + 10; 22.8yrs.]

Q8. The following table gives the ages and blood pressure of 10 women.
Age(x) 56 42 36 47 49 42 60 72 63 55
Blood Pressure (y) 147 125 118 128 145 140 155 160 149 150
(i) Find the correlation coefficient between X and Y.
(ii) Determine the least square regression equation of Y on X.
(iii) Estimate the blood pressure of a woman whose age is 45 years.
[(i) r = 0.89, (ii) y = 83.758 + 1.11x (iii) When x=45, y=134]
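The answers given for Q8 can be verified numerically (a sketch assuming NumPy is available; small differences in the intercept arise from rounding the slope by hand):

```python
import numpy as np

# Ages (x) and blood pressures (y) of the 10 women from Q8.
age = np.array([56, 42, 36, 47, 49, 42, 60, 72, 63, 55], dtype=float)
bp = np.array([147, 125, 118, 128, 145, 140, 155, 160, 149, 150], dtype=float)

r = np.corrcoef(age, bp)[0, 1]             # (i) correlation coefficient ≈ 0.89
slope, intercept = np.polyfit(age, bp, 1)  # (ii) regression of Y on X
estimate = slope * 45 + intercept          # (iii) blood pressure at age 45 ≈ 134
```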

Q9. The following table gives the normal weight of a baby during the first six months of life:
Age in months 0 2 3 5 6
Weight in lbs. 5 7 8 10 12
Estimate the weight of a baby at the age of 4 months. [9.2982 lbs]
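The stated estimate for Q9 can be checked by fitting the least squares line of weight on age (a sketch assuming NumPy is available):

```python
import numpy as np

# Age in months (x) and weight in lbs (y) from Q9.
age = np.array([0, 2, 3, 5, 6], dtype=float)
weight = np.array([5, 7, 8, 10, 12], dtype=float)

slope, intercept = np.polyfit(age, weight, 1)  # least squares line
estimate = slope * 4 + intercept               # weight at 4 months ≈ 9.2982 lbs
```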

Q10. Does correlation imply causation? Explain with the help of an example.
[Hint: No. Correlation only implies co-variation.]
Q11. When is rank correlation more precise than the simple correlation coefficient?
[Hint: When precise measurements of the variables are not possible.]
