02 Basic Ideas of Regression Analysis
Historical origin of the term regression:
The term regression literally means stepping back towards the average. It was first introduced by
the British biometrician Sir Francis Galton (1822-1911), in connection with the inheritance of
stature. He found that although there was a tendency for tall parents to have tall children and for
short parents to have short children, the average height of children born of parents of a given
height tended to regress or step back toward the average height in the population as a whole. In
other words, the offspring of abnormally tall or short parents tend to regress or step back toward
the average height of the population.
Galton’s law of universal regression was confirmed by his friend Karl Pearson, who collected
more than a thousand records of heights of members of family groups. He found that the average
height of sons of a group of tall fathers was less than their fathers’ height and the average height
of sons of a group of short fathers was greater than their fathers’ height, thus regressing tall and
short sons alike toward the average height of all men. In the words of Galton, this was
“regression to mediocrity”.
A hypothetical example:
To understand the concept of regression analysis, consider the data given in table 2.1. The data in
the table refer to a total population of 60 families in a hypothetical community and their weekly
income X and weekly consumption expenditure Y, both in Tk.
The 60 families are divided into 10 income groups (from Tk. 800 to Tk. 2600) and the weekly
expenditures of each family in the various groups are shown in the table. Therefore, we have 10
fixed values of X and the corresponding Y values against each of the X values.
There is considerable variation in weekly consumption expenditure in each income group, which
can be seen clearly from figure 2.1. But the general picture that one gets is that despite the
variability of weekly consumption expenditure within each income bracket, on the average,
weekly consumption expenditure increases as income increases.
Table 2.1: Weekly family income X and weekly family consumption expenditure Y, both in Tk.
(A dash marks an empty cell; each income group contains between 5 and 7 families.)

X (weekly family income)           800  1000  1200  1400  1600  1800  2000  2200  2400  2600
Y (weekly family expenditure)      550   650   790   800  1020  1100  1200  1350  1370  1500
                                   600   700   840   930  1070  1150  1360  1370  1450  1520
                                   650   740   900   950  1100  1200  1400  1400  1550  1750
                                   700   800   940  1030  1160  1300  1440  1520  1650  1780
                                   750   850   980  1080  1180  1350  1450  1570  1750  1800
                                     -   880     -  1130  1250  1400     -  1600  1890  1850
                                     -     -     -  1150     -     -     -  1620     -  1910
Conditional means of Y, E(Y | X)   650   770   890  1010  1130  1250  1370  1490  1610  1730
To see this clearly, in table 2.1, we have given the mean or average expenditure corresponding to
each of the 10 levels of income. Thus, corresponding to the weekly income level Tk. 800, the
mean consumption expenditure is Tk. 650 and so on.
In all we have 10 mean values for the 10 subpopulations of Y. We call these mean values
conditional expected values, as they depend on the given values of the (conditioning) variable
X. Symbolically, we denote them as $E(Y \mid X)$, which is read as the expected value of Y given
the value of X.
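The conditional means in the last row of table 2.1 can be reproduced directly from the data. The following is a minimal sketch in Python; the names `population`, `conditional_means` and `cond_means` are illustrative choices, not part of the original notes.

```python
from statistics import mean

# Table 2.1: weekly income X (Tk.) -> weekly expenditures Y (Tk.)
# of every family in that income group.
population = {
    800:  [550, 600, 650, 700, 750],
    1000: [650, 700, 740, 800, 850, 880],
    1200: [790, 840, 900, 940, 980],
    1400: [800, 930, 950, 1030, 1080, 1130, 1150],
    1600: [1020, 1070, 1100, 1160, 1180, 1250],
    1800: [1100, 1150, 1200, 1300, 1350, 1400],
    2000: [1200, 1360, 1400, 1440, 1450],
    2200: [1350, 1370, 1400, 1520, 1570, 1600, 1620],
    2400: [1370, 1450, 1550, 1650, 1750, 1890],
    2600: [1500, 1520, 1750, 1780, 1800, 1850, 1910],
}


def conditional_means(pop):
    """Return {x: E(Y | X = x)} by averaging Y within each income group."""
    return {x: mean(ys) for x, ys in pop.items()}


cond_means = conditional_means(population)
for x, m in cond_means.items():
    print(f"E(Y | X = {x}) = {m}")
```

Each printed value should agree with the corresponding entry in the conditional-means row of the table.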
Figure 2.1: Conditional distribution of expenditure for various levels of income.
It is important to distinguish these conditional expected values from the unconditional expected
value of weekly consumption expenditure, $E(Y)$. If we add the weekly consumption expenditures
for all the 60 families in the population and divide this number by 60, we get the number Tk.
1212, which is the unconditional mean or expected value of weekly consumption expenditure,
$E(Y)$. It is unconditional in the sense that in arriving at this number we have disregarded the
income levels of the various families.
Obviously, the various conditional expected values of Y given in table 2.1 are different from the
unconditional expected value of Y of Tk. 1212. When we ask the question, “What is the expected
value of weekly consumption expenditure of a family?” we get the answer Tk. 1212 (the
unconditional mean). But if we ask the question, “What is the expected value of weekly
consumption expenditure of a family whose weekly income is Tk. 1400?” we get the answer Tk.
1010 (the conditional mean).
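The contrast between the two questions can be checked numerically. This is a small sketch, assuming the data of table 2.1 are entered as a Python dictionary; all variable names are illustrative.

```python
# Table 2.1: weekly income X (Tk.) -> weekly expenditures Y (Tk.).
population = {
    800:  [550, 600, 650, 700, 750],
    1000: [650, 700, 740, 800, 850, 880],
    1200: [790, 840, 900, 940, 980],
    1400: [800, 930, 950, 1030, 1080, 1130, 1150],
    1600: [1020, 1070, 1100, 1160, 1180, 1250],
    1800: [1100, 1150, 1200, 1300, 1350, 1400],
    2000: [1200, 1360, 1400, 1440, 1450],
    2200: [1350, 1370, 1400, 1520, 1570, 1600, 1620],
    2400: [1370, 1450, 1550, 1650, 1750, 1890],
    2600: [1500, 1520, 1750, 1780, 1800, 1850, 1910],
}

# Unconditional mean E(Y): pool all families and ignore their income.
all_expenditures = [y for ys in population.values() for y in ys]
n_families = len(all_expenditures)
unconditional_mean = sum(all_expenditures) / n_families

# Conditional mean E(Y | X = 1400): average only within that income group.
group_1400 = population[1400]
conditional_mean_1400 = sum(group_1400) / len(group_1400)

print(n_families, unconditional_mean, conditional_mean_1400)
```

Knowing the income level changes the answer from the pooled average to the average of the relevant subpopulation only.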
To put it differently, if we ask the question, “What is the best (mean) prediction of weekly
consumption expenditure of a family whose weekly income is Tk. 1400?” the answer would be
Tk. 1010. Thus, the knowledge of the income level may enable us to better predict the mean
value of consumption expenditure than if we do not have that knowledge. This probably is the
essence of regression analysis.
The dark circled points in figure 2.1 show the conditional mean values of Y against the various
X values. If we join these conditional mean values, we obtain what is known as the population
regression line (PRL) or more generally, the population regression curve. More simply, it is the
regression of Y on X.
The adjective population comes from the fact that we are dealing in this example with the entire
population of 60 families. Of course, in reality a population may have many more families.
Geometrically, then, a population regression curve is simply the locus of the conditional means
of the dependent variable for the fixed values of the explanatory variable(s). More simply, it is
the curve connecting the means of the subpopulations of Y corresponding to the given values of
the regressor X.
The population regression curve shows that for each X (income level), there is a population of
Y values (weekly consumption expenditures) that are spread around the (conditional) mean of
those Y values. For simplicity, we are assuming that these Y values are distributed symmetrically
around their respective (conditional) mean values. And the regression line (or curve) passes
through these (conditional) mean values.
The concept of population regression function (PRF):
From the preceding discussion, it is clear that each conditional mean $E(Y \mid X_i)$ is a function of $X_i$,
where $X_i$ is a given value of X. Symbolically,
$$E(Y \mid X_i) = f(X_i) \tag{2.2.1}$$
Here, $f(X_i)$ denotes some function of the explanatory variable X. In our example, $E(Y \mid X_i)$ is a
linear function of $X_i$. Equation (2.2.1) is known as the conditional expectation function (CEF) or
population regression function (PRF) or population regression (PR). It states merely that the
expected value of the distribution of Y given $X_i$ is functionally related to $X_i$. In simple terms, it
tells how the mean or average response of Y varies with X.
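The claim that the conditional means in this example are a linear function of income can be verified from the last row of table 2.1. The sketch below checks that the slope between every pair of neighbouring income levels is the same, using exact fractions to avoid rounding; the resulting slope and intercept are computed from the table and are not stated in the original notes.

```python
from fractions import Fraction

# Income levels and the conditional means E(Y | X) from table 2.1.
income_levels = [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]
cond_means = [650, 770, 890, 1010, 1130, 1250, 1370, 1490, 1610, 1730]

points = list(zip(income_levels, cond_means))

# Slope between each pair of neighbouring points; a straight line
# has exactly one distinct slope.
slopes = {
    Fraction(y2 - y1, x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
}
is_linear = len(slopes) == 1

# Only meaningful when is_linear is True.
slope = next(iter(slopes))
intercept = cond_means[0] - slope * income_levels[0]

print(is_linear, slope, intercept)
```

For these data the conditional means lie exactly on one straight line, which is what makes the example so convenient; real populations need not behave this way.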
What form does the function $f(X_i)$ assume? This is an important question because in real
situations, we do not have the entire population available for examination. The functional form
of the population regression function (PRF) is therefore an empirical question, although in
specific cases theory may have something to say.
For example, an economist might posit that consumption expenditure is linearly related to
income. Therefore, as a first approximation or a working hypothesis, we may assume that the
population regression function (PRF), $E(Y \mid X_i)$, is a linear function of $X_i$. That is,
$$E(Y \mid X_i) = \beta_1 + \beta_2 X_i \tag{2.2.2}$$
Here, $\beta_1$ and $\beta_2$ are unknown but fixed parameters known as the regression coefficients. They
are also known as intercept and slope coefficients, respectively. Equation (2.2.2) is known as the
linear population regression function (LPRF).
Some alternative expressions used in the literature are linear population regression model or
simply linear population regression. In the sequel, the terms regression, regression equation and
regression model will be used synonymously.
In regression analysis, our interest is in estimating the population regression functions (PRFs)
like equation (2.2.2), that is, estimating the values of the unknown parameters $\beta_1$ and $\beta_2$ on the
basis of observations on Y and X .
Stochastic specification of population regression function (PRF):
It is clear from figure 2.1 that, as family income increases, family consumption expenditure on
the average increases, too. What about the consumption expenditure of an individual family in
relation to its (fixed) level of income? It is obvious from table 2.1 and figure 2.1 that an
individual family’s consumption expenditure does not necessarily increase as the income level
increases.
For example, from table 2.1, we observe that corresponding to the income level of Tk. 1000 there
is one family whose consumption expenditure of Tk. 650 is less than the consumption
expenditures of two families whose weekly income is only Tk. 800. But notice that the average
consumption expenditure of families with a weekly income of Tk. 1000 is greater than the
average consumption expenditure of families with a weekly income of Tk. 800.
What, then, can we say about the relationship between an individual family’s consumption
expenditure and a given level of income? We see from figure 2.1 that, given the income level of
$X_i$, an individual family’s consumption expenditure is clustered around the average consumption
of all families at that $X_i$, that is, around its conditional expectation. Therefore, we can express
the deviation of an individual $Y_i$ around its expected value as follows:
$$u_i = Y_i - E(Y \mid X_i) \quad\text{or}\quad Y_i = E(Y \mid X_i) + u_i \tag{2.4.1}$$
Here, the deviation $u_i$ is an unobservable random variable taking positive or negative values.
Technically, $u_i$ is known as the stochastic disturbance or stochastic error term. We can say that
the expenditure of an individual family, given its income level, can be expressed as the sum of
two components:
- $E(Y \mid X_i)$ is simply the mean consumption expenditure of all the families with the same level of
  income. This component is known as the systematic or deterministic component.
- $u_i$ is the random or nonsystematic component.
We shall examine shortly the nature of the stochastic disturbance term, but for the moment assume
that it is a surrogate or proxy for all the
omitted or neglected variables that may affect Y but are not (or cannot be) included in the
regression model. If $E(Y \mid X_i)$ is assumed to be linear in $X_i$, then equation (2.4.1) may be written
as:
$$Y_i = E(Y \mid X_i) + u_i = \beta_1 + \beta_2 X_i + u_i \tag{2.4.2}$$
Equation (2.4.2) posits that the consumption expenditure of a family is linearly related to its
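income plus the disturbance term. Thus, the individual consumption expenditures, given X = Tk.
800, can be expressed as:
$$\begin{aligned}
Y_1 &= 550 = \beta_1 + \beta_2(800) + u_1, & Y_2 &= 600 = \beta_1 + \beta_2(800) + u_2, \\
Y_3 &= 650 = \beta_1 + \beta_2(800) + u_3, & Y_4 &= 700 = \beta_1 + \beta_2(800) + u_4, \\
Y_5 &= 750 = \beta_1 + \beta_2(800) + u_5
\end{aligned}$$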
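As a worked illustration (not part of the original notes): table 2.1 gives the conditional mean for X = Tk. 800 as Tk. 650, so the five disturbances for this income group can be read off directly as deviations from that mean.

```latex
\begin{aligned}
u_1 &= 550 - 650 = -100, & u_2 &= 600 - 650 = -50, \\
u_3 &= 650 - 650 = 0,    & u_4 &= 700 - 650 = 50, \\
u_5 &= 750 - 650 = 100,  & \textstyle\sum_{i=1}^{5} u_i &= 0 .
\end{aligned}
```

The disturbances take both signs and cancel out within the income group.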
Now, if we take the conditional expectation on both sides of equation (2.4.1), then we have that
$$\begin{aligned}
E(Y_i \mid X_i) &= E[E(Y \mid X_i)] + E(u_i \mid X_i) \\
                &= E(Y \mid X_i) + E(u_i \mid X_i)
\end{aligned} \tag{2.4.4}$$
The second line uses the fact that, once the value of $X_i$ is fixed, $E(Y \mid X_i)$ is a constant, and the
expected value of a constant is the constant itself.
Notice carefully that in equation (2.4.4), we have taken the conditional expectation, conditional
upon the given X’s. Since $E(Y_i \mid X_i)$ is the same thing as $E(Y \mid X_i)$, equation (2.4.4) implies
that $E(u_i \mid X_i) = 0$. Thus, the assumption that the regression line passes through the conditional
means of Y implies that the conditional mean values of $u_i$ (conditional upon the given X’s) are
zero.
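The zero conditional mean of the disturbances can be confirmed for every income group of table 2.1. The following is a minimal sketch, with illustrative variable names, that computes each disturbance as the deviation of Y from its own group mean.

```python
from statistics import mean

# Table 2.1: weekly income X (Tk.) -> weekly expenditures Y (Tk.).
population = {
    800:  [550, 600, 650, 700, 750],
    1000: [650, 700, 740, 800, 850, 880],
    1200: [790, 840, 900, 940, 980],
    1400: [800, 930, 950, 1030, 1080, 1130, 1150],
    1600: [1020, 1070, 1100, 1160, 1180, 1250],
    1800: [1100, 1150, 1200, 1300, 1350, 1400],
    2000: [1200, 1360, 1400, 1440, 1450],
    2200: [1350, 1370, 1400, 1520, 1570, 1600, 1620],
    2400: [1370, 1450, 1550, 1650, 1750, 1890],
    2600: [1500, 1520, 1750, 1780, 1800, 1850, 1910],
}

# Disturbance of each family: deviation of Y from its conditional mean.
disturbances = {}
for x, ys in population.items():
    cond_mean = mean(ys)
    disturbances[x] = [y - cond_mean for y in ys]

# Conditional mean of the disturbances within each income group.
mean_disturbance = {x: mean(us) for x, us in disturbances.items()}

for x, m in mean_disturbance.items():
    print(f"E(u | X = {x}) = {m}")
```

Because each disturbance is measured from its own group mean, the averages are zero by construction, which is exactly the content of equation (2.4.4).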
From the previous discussion, it is clear that equation (2.2.2) and equation (2.4.2) are equivalent
forms if $E(u_i \mid X_i) = 0$. But the stochastic specification in equation (2.4.2) has the advantage that it
clearly shows that there are other variables besides income that affect consumption expenditure
and that an individual family’s consumption expenditure cannot be fully explained only by the
variable(s) included in the regression model.
The concept of sample regression function (SRF):
By confining our discussion so far to the population of Y values corresponding to the fixed X’s,
we have deliberately avoided sampling considerations (note that the data of table 2.1 represent
the population, not a sample).
But it is about time to face up to the sampling problems, for in most practical situations what we
have is but a sample of Y values corresponding to some fixed X’s. Therefore, our task now is to
estimate the population regression function (PRF) on the basis of the sample information.
As an illustration, pretend that the population of table 2.1 was not known to us and the only
information we had was a randomly selected sample of Y values for the fixed X’s as given in
table 2.4.
Table 2.4: A random sample from the population of table 2.1
Y 700 650 900 950 1100 1150 1200 1400 1550 1500
X 800 1000 1200 1400 1600 1800 2000 2200 2400 2600
Unlike table 2.1, we now have only one Y value corresponding to each given X. Each Y (given
$X_i$) in table 2.4 is chosen randomly from the Y’s corresponding to the same $X_i$ in the
population of table 2.1.
From the sample of table 2.4, can we predict the average weekly consumption expenditure Y in
the population as a whole corresponding to the chosen X’s? In other words, can we estimate the
population regression function (PRF) from the sample data? We may not be able to estimate the
population regression function (PRF) accurately because of sampling fluctuations. To see this,
suppose we draw another random sample from the population of table 2.1, as presented in table
2.5.
Table 2.5: A random sample from the population of table 2.1
Y 550 880 900 800 1180 1200 1450 1350 1450 1750
X 800 1000 1200 1400 1600 1800 2000 2200 2400 2600
Plotting the data of tables 2.4 and 2.5, we obtain the scattergram given in figure 2.4. In the
scattergram, two sample regression lines are drawn so as to fit the scatters reasonably well. SRF1
is based on the first sample and SRF2 is based on the second sample.
Which of the two regression lines represents the true population regression line? There is no way
we can be absolutely sure that either of the regression lines shown in figure 2.4 represents the
true population regression line (curve). The regression lines in figure 2.4 are known as the
sample regression lines (SRL).
Supposedly they represent the population regression line, but because of sampling fluctuations
they are at best approximations of the true population regression line (PRL). In general, we
would get N different sample regression functions (SRFs) for N different samples and these
SRFs are not likely to be the same.
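To see the sampling fluctuation in numbers, one can fit a straight line to each of the two samples. The notes do not say how the lines in figure 2.4 were drawn; the sketch below assumes ordinary least squares as the fitting rule, and the function name `fit_line` is an illustrative choice.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a line y = b1 + b2 * x.

    Assumption: ordinary least squares is used as the fitting rule.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


income = [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]
sample_1 = [700, 650, 900, 950, 1100, 1150, 1200, 1400, 1550, 1500]   # table 2.4
sample_2 = [550, 880, 900, 800, 1180, 1200, 1450, 1350, 1450, 1750]   # table 2.5

b1_first, b2_first = fit_line(income, sample_1)
b1_second, b2_second = fit_line(income, sample_2)

print(f"SRF1: Y_hat = {b1_first:.2f} + {b2_first:.4f} X")
print(f"SRF2: Y_hat = {b1_second:.2f} + {b2_second:.4f} X")
```

The two fitted lines differ from each other even though both samples come from the same population, which is the point made in the text.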
Now, analogously to the population regression function (PRF) that underlies the population
regression line, we can develop the concept of the sample regression function (SRF) to represent
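the sample regression line. The sample regression function (SRF) corresponding to the
population regression function (PRF) $E(Y \mid X_i) = \beta_1 + \beta_2 X_i$ is given by:
$$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i \tag{2.6.1}$$
Here, $\hat{Y}_i$ is the estimator of $E(Y \mid X_i)$, and $\hat{\beta}_1$ and $\hat{\beta}_2$ are the estimators of $\beta_1$ and $\beta_2$, respectively.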
Now, just as we expressed the population regression function (PRF) in two equivalent forms,
equation (2.2.2) and equation (2.4.2), we can express the sample regression function (SRF) in
equation (2.6.1) in its stochastic form as follows:
$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i \tag{2.6.2}$$
Here, $\hat{u}_i$ denotes the (sample) residual term. Conceptually, $\hat{u}_i$ is analogous to $u_i$ and can be regarded
as an estimate of $u_i$. It is introduced in the sample regression function (SRF) for the same
reasons as $u_i$ was introduced in the population regression function (PRF).
To sum up, then, we find our primary objective in regression analysis is to estimate the
population regression function (PRF) $Y_i = \beta_1 + \beta_2 X_i + u_i$ on the basis of the sample regression
function $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$. In terms of the sample regression function (SRF), the observed $Y_i$ can
be expressed as:
$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i \tag{2.6.3}$$
In terms of the population regression function (PRF), the observed $Y_i$ can be expressed as:
$$Y_i = \beta_1 + \beta_2 X_i + u_i = E(Y \mid X_i) + u_i \tag{2.6.4}$$
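The two decompositions can be placed side by side for the sample of table 2.4. In this hypothetical example the population is known, so the disturbances can be computed from the conditional means of table 2.1; in practice only the residuals are available. The fitted line is obtained by ordinary least squares, which is an assumption of this sketch rather than a method stated in the notes so far.

```python
income = [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]
cond_mean = [650, 770, 890, 1010, 1130, 1250, 1370, 1490, 1610, 1730]  # E(Y | X), table 2.1
sample_y = [700, 650, 900, 950, 1100, 1150, 1200, 1400, 1550, 1500]    # table 2.4

# PRF decomposition: Y_i = E(Y | X_i) + u_i.
u = [y - m for y, m in zip(sample_y, cond_mean)]

# SRF decomposition: Y_i = Y_hat_i + u_hat_i.
# Assumption: the estimates come from ordinary least squares.
n = len(income)
x_bar = sum(income) / n
y_bar = sum(sample_y) / n
b2_hat = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(income, sample_y))
    / sum((x - x_bar) ** 2 for x in income)
)
b1_hat = y_bar - b2_hat * x_bar

y_hat = [b1_hat + b2_hat * x for x in income]
u_hat = [y - yh for y, yh in zip(sample_y, y_hat)]

for x, ui, uhi in zip(income, u, u_hat):
    print(f"X = {x}: u = {ui:5d}, u_hat = {uhi:8.2f}")
```

The residuals differ from the true disturbances because the fitted line is only an approximation of the population regression line.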
The critical question now is: Granted that the sample regression function (SRF) is but an
approximation of the population regression function (PRF), can we devise a rule or a method that
will make this approximation as close as possible? In other words, how should the sample
regression function (SRF) be constructed so that $\hat{\beta}_1$ is as close as possible to the true $\beta_1$ and $\hat{\beta}_2$
is as close as possible to the true $\beta_2$ even though we will never know the true $\beta_1$ and $\beta_2$?
We can develop procedures that tell us how to construct the sample regression function (SRF) to
mirror the population regression function (PRF) as faithfully as possible. It is fascinating to
consider that this can be done even though we never actually determine the population regression
function (PRF) itself.
Deterministic or non-stochastic or mathematical model:
A model is said to be a deterministic, non-stochastic or mathematical model if, for a set of values
of the independent variables, there is one and only one corresponding value of the dependent
variable. Thus, the model is:
$$Y_i = \beta_1 + \beta_2 X_{2i} + \dots + \beta_k X_{ki}; \quad i = 1, 2, \dots, n$$
Now if the values of $\beta_1, \beta_2, \dots, \beta_k$ are known, then for a set of values of the independent variables,
there is one and only one corresponding value of the dependent variable. The above model is
therefore a deterministic or non-stochastic or mathematical model.
Stochastic or statistical model:
A model is said to be a stochastic or statistical model if, for a set of values of the independent
variables, there is a whole probability distribution of values of the dependent variable. Thus the
model is:
$$Y_i = \beta_1 + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i; \quad i = 1, 2, \dots, n$$
Here, $u_i$ is the random error component or disturbance term. Now even if the values of $\beta_1, \beta_2, \dots, \beta_k$
are known, for a set of values of the independent variables there are different possible values of the
dependent variable because of the presence of the random error term or disturbance term. The
above model is therefore a stochastic or statistical model.
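The difference between the two kinds of model can be illustrated with a short simulation. This is only a sketch under stated assumptions: a single regressor is used, the coefficient values `BETA_1` and `BETA_2` are illustrative, and the disturbance is assumed to be normally distributed with a hypothetical standard deviation `SIGMA`; none of these choices comes from the notes.

```python
import random

BETA_1, BETA_2 = 170.0, 0.6   # illustrative coefficient values (assumption)
SIGMA = 100.0                 # hypothetical standard deviation of the disturbance


def deterministic_model(x):
    """Exactly one value of Y for each value of X."""
    return BETA_1 + BETA_2 * x


def stochastic_model(x, rng):
    """Deterministic part plus a random disturbance u drawn from N(0, SIGMA^2)."""
    return deterministic_model(x) + rng.gauss(0.0, SIGMA)


rng = random.Random(0)  # seeded so that the run is reproducible
x = 1400

deterministic_values = [deterministic_model(x) for _ in range(5)]
stochastic_values = [stochastic_model(x, rng) for _ in range(5)]

print("deterministic:", deterministic_values)
print("stochastic:   ", [round(v, 1) for v in stochastic_values])
```

Repeated evaluation of the deterministic model returns the same number each time, whereas the stochastic model returns a different draw from the distribution of Y at that X.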
Regression versus causation:
Although regression analysis deals with the dependence of one variable on other variables, it
does not necessarily imply causation. A statistical relationship, however strong and however
suggestive, can never establish a causal connection. Our ideas of causation must come from
outside statistics, ultimately from some theory or other. To ascribe causality, one must appeal to
a priori or theoretical considerations. For example, common sense suggests that crop yield
depends on rainfall, and not that rainfall depends on crop yield. This is a statement about
causation. Statistically, however, there is no reason to assume that rainfall does not depend on
crop yield.
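The symmetry of the statistics can be seen by running a regression in both directions. The notes give no rainfall data, so the sketch below reuses the income and expenditure sample of table 2.4 and assumes least-squares fitting; the helper names `slope` and `correlation` are illustrative.

```python
from math import sqrt

income = [800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600]
expenditure = [700, 650, 900, 950, 1100, 1150, 1200, 1400, 1550, 1500]  # table 2.4


def _cross_deviation_sum(xs, ys):
    """Sum of products of deviations from the respective means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    return _cross_deviation_sum(xs, ys) / _cross_deviation_sum(xs, xs)


def correlation(xs, ys):
    """Sample correlation coefficient between xs and ys."""
    return _cross_deviation_sum(xs, ys) / sqrt(
        _cross_deviation_sum(xs, xs) * _cross_deviation_sum(ys, ys)
    )


b_yx = slope(income, expenditure)   # expenditure regressed on income
b_xy = slope(expenditure, income)   # income regressed on expenditure
r = correlation(income, expenditure)

print(f"slope of Y on X = {b_yx:.4f}")
print(f"slope of X on Y = {b_xy:.4f}")
print(f"r squared = {r ** 2:.4f}, product of slopes = {b_yx * b_xy:.4f}")
```

Both regressions fit equally well, since the product of the two slopes equals the squared correlation. The data alone therefore cannot say which variable depends on which; that judgement has to come from theory.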