Linear Regression Analysis. Statistics 2 Notes
Definition: Linear Regression Analysis is a forecasting procedure that uses the least squares approach on one or more independent variables to develop a forecasting model, for example for the capacity required by a business in the future.
Synopsis
We begin this chapter with the objective of determining the normal capacity required by a company for the medium to long term. The decision will be based on a demand forecast. When forecasting future demand for capacity planning purposes, fluctuations in demand are expected but, to some extent, ignored. The aim is to get a general picture of future demand. Take as an example the forecast demand pattern shown in Figure 1.1 below.
Where a relationship between variables exists, it can be used for forecasting. The relationship is expressed mathematically by means of a Regression Equation.
1.0 Introduction
1.1.0 Regression Analysis
Linear regression analysis is a very useful tool for today's manager. Two purposes of regression analysis are to understand the relationship between variables and to predict the value of one variable based on another (such as the cost of advertising and sales). Regression has been used to model such things as the relationship between level of education and income, or the price of a house and its square footage.
In any regression model, the variable to be predicted is called the dependent variable
or response variable. The value of this is said to be dependent upon the value of an
independent variable, which is sometimes called an explanatory variable or a
predictor variable.
Table 1.1: Tororo Construction Company Sales and Local Incomes (Payroll)

TCC Sales ($100,000s)    Local Payroll ($100,000,000s)
6                        3
8                        4
9                        6
5                        4
4.5                      2
9.5                      5

The Table 1.1 data for TCC are used to draw the scatter diagram in Figure 1.1 below.
This graph indicates that higher values for local income (payroll) seem to result in higher sales for the company. The relationship is not perfect, because not all the points lie on a straight line, but a relationship clearly exists. A line has been drawn through the data to help show the relationship between income (payroll) and sales. The points do not all lie on the line, so there would be some error involved if we tried to predict sales based on payroll using this or any other line. The question is how to find the line that best fits the data; regression analysis provides the answer.
1.1.4 Estimation of the Slope and Intercept
Estimates of the slope and intercept are found from sample data. The underlying linear model is

Y = β₀ + β₁X + ε

where β₀ is the intercept, β₁ is the slope, and ε is the error. The true values of β₀ and β₁ are not known, so they are estimated using sample data. The regression equation based on sample data is

Ŷ = b₀ + b₁X

Where:
Ŷ = predicted value of Y
b₀ = estimate of β₀, based on sample results
b₁ = estimate of β₁, based on sample results.
Therefore, in the case of Tororo Construction Company, we are trying to predict sales, so the dependent variable (Y) is sales. The variable we use to help predict sales is the Podut area income (payroll), so this is the independent variable (X). Although any number of lines can be drawn through the points in Figure 1.1 above to show a relationship between X and Y, the line that is chosen is the one that in some way minimises the errors. Error is defined as

ε = Y − Ŷ
2.0 Minimisation of Squared Errors
Since errors may be positive or negative, the average error could be zero even though there are extremely large errors, both positive and negative. To eliminate the difficulty of negative errors cancelling positive errors, the errors are squared.
Thus the best regression line is defined as the one with the minimum sum of the squared errors. For this reason, regression analysis is sometimes called least-squares regression.
2.1.0 Statisticians
Statisticians have developed formulas that we can use to find the equation of a straight
line that would minimise the sum of the squared errors. The simple linear regression
equation is
Ŷ = b₀ + b₁X

The following formulas can be used to compute the intercept and the slope:

X̄ = ΣX / n = average (mean) of X values
Ȳ = ΣY / n = average (mean) of Y values
b₁ = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

b₀ = Ȳ − b₁X̄
Table 1.2: Regression Calculations for Tororo Construction Company

Y         X         (X − X̄)²         (X − X̄)(Y − Ȳ)
6         3         (3 − 4)² = 1     (3 − 4)(6 − 7) = 1
8         4         (4 − 4)² = 0     (4 − 4)(8 − 7) = 0
9         6         (6 − 4)² = 4     (6 − 4)(9 − 7) = 4
5         4         (4 − 4)² = 0     (4 − 4)(5 − 7) = 0
4.5       2         (2 − 4)² = 4     (2 − 4)(4.5 − 7) = 5
9.5       5         (5 − 4)² = 1     (5 − 4)(9.5 − 7) = 2.5
ΣY = 42   ΣX = 24   Σ(X − X̄)² = 10   Σ(X − X̄)(Y − Ȳ) = 12.5
Ȳ = 42/6 = 7   X̄ = 24/6 = 4
X̄ = ΣX / n = 24/6 = 4

Ȳ = ΣY / n = 42/6 = 7

b₁ = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5/10 = 1.25

b₀ = Ȳ − b₁X̄

Substituting:

b₀ = 7 − (1.25)(4) = 2

Ŷ = 2 + 1.25X
Or
Sales = 2 + 1.25(Payroll)
If the payroll next year is $600 million (X = 6), then the predicted value would be

Ŷ = 2 + 1.25(6) = 9.5

or sales of $950,000, since sales are measured in $100,000s.
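As a check on the arithmetic, the slope and intercept formulas can be applied in a few lines of code. The following is a minimal Python sketch using only the standard library; the variable names are illustrative, not part of the original example.

```python
# Least-squares slope and intercept for the Tororo Construction data
# (sales in $100,000s, payroll in $100,000,000s, from Table 1.1).
sales = [6, 8, 9, 5, 4.5, 9.5]     # Y values
payroll = [3, 4, 6, 4, 2, 5]       # X values

n = len(sales)
x_bar = sum(payroll) / n           # mean of X = 4
y_bar = sum(sales) / n             # mean of Y = 7

# b1 = sum of (X - Xbar)(Y - Ybar) divided by sum of (X - Xbar)^2
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(payroll, sales)) \
     / sum((x - x_bar) ** 2 for x in payroll)    # 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar                          # 7 - 1.25 * 4 = 2

print(f"Sales = {b0} + {b1} * Payroll")          # Sales = 2.0 + 1.25 * Payroll
print("Prediction for X = 6:", b0 + b1 * 6)      # 9.5, i.e. $950,000
```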
2.1.1 Deduction
One of the purposes of regression is to understand the relationship among variables. This model tells us that for each $100 million increase in the payroll (X), we would expect sales to increase by $125,000, since b₁ = 1.25 ($100,000s). For example, if payroll rises from $400 million to $500 million, predicted sales rise from Ŷ = 2 + 1.25(4) = 7 to Ŷ = 2 + 1.25(5) = 8.25, an increase of 1.25 units, i.e. $125,000. This model helps Tororo Construction Company see how the local economy and company sales are related.
3.1.1 The SST Measures the Total Variability in Y about the Mean
Consider the deviations of each Y value from the mean, Y − Ȳ. Simply summing these values would be misleading, because the negatives would cancel out the positives, making it appear that the numbers are closer to the mean than they actually are. To prevent this problem, we use the Sum of Squares Total (SST) to measure the total variability in Y:

SST = Σ(Y − Ȳ)²
3.1.2 The SSE Measures the Variability in Y about the Regression Line
If we did not use X to predict Y, we would simply use the mean of Y as the prediction, and the SST would measure the accuracy of our predictions. However, a regression line may be used to predict the value of Y, and while there are still errors involved, the sum of these squared errors will be less than the total sum of squares just computed. The Sum of Squares Error (SSE) is

SSE = Σe² = Σ(Y − Ŷ)²
Table 1.3: Sum of Squares Calculations for Tororo Construction Company

Y      X     (Y − Ȳ)²    Ŷ       (Y − Ŷ)²     (Ŷ − Ȳ)²
6      3     1           5.75    0.0625       1.5625
8      4     1           7.00    1.0000       0.0000
9      6     4           9.50    0.2500       6.2500
5      4     4           7.00    4.0000       0.0000
4.5    2     6.25        4.50    0.0000       6.2500
9.5    5     6.25        8.25    1.5625       1.5625
Totals:      SST = 22.5          SSE = 6.875  SSR = 15.625

Table 1.3 provides these calculations for the Tororo Construction example. The mean (Ȳ = 7) is compared to each value, and we get

SST = 22.5

The prediction (Ŷ) for each observation is computed and compared to the actual value. This results in

SSE = 6.875
The SSE is much lower than the SST. Using the regression line has reduced the variability in the sum of squares by 22.5 − 6.875 = 15.625. This is called the Sum of Squares due to Regression (SSR) and indicates how much of the total variability in Y is explained by the regression model. Mathematically,

SSR = Σ(Ŷ − Ȳ)² = SST − SSE

Table 1.3 indicates that SSR = 15.625.
There is a very important relationship between the sums of squares that we have computed:

(Sum of squares total) = (Sum of squares due to regression) + (Sum of squares error)

SST = SSR + SSE
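The three sums of squares, and the identity relating them, can be verified directly. Below is a short Python sketch under the same assumptions as before (the fitted line Ŷ = 2 + 1.25X); the names are illustrative.

```python
# SST, SSE and SSR for the Tororo Construction data,
# using the fitted line Yhat = 2 + 1.25 * X.
sales = [6, 8, 9, 5, 4.5, 9.5]
payroll = [3, 4, 6, 4, 2, 5]
y_bar = sum(sales) / len(sales)                          # 7

y_hat = [2 + 1.25 * x for x in payroll]                  # predictions
sst = sum((y - y_bar) ** 2 for y in sales)               # 22.5
sse = sum((y - yh) ** 2 for y, yh in zip(sales, y_hat))  # 6.875
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # 15.625

print(sst, sse, ssr)                    # 22.5 6.875 15.625
print(abs(sst - (ssr + sse)) < 1e-9)    # True: SST = SSR + SSE
```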
Figure 1.2: Deviations from the Regression Line and from the Mean
Figure 1.2 above displays the data for Tororo Construction Company. The regression line is shown, as is a line representing the mean of the Y values. The errors used in computing the sums of squares are shown on this graph. Notice how the sample points are closer to the regression line than they are to the mean.
The proportion of the variability in Y explained by the regression equation is called the coefficient of determination, r²:

r² = SSR / SST = 15.625 / 22.5 = 0.6944
This means that about 69% of the variability in sales (Y) is explained by the regression equation based on payroll (X).
If every point in the sample were on the regression line (meaning all errors are 0), then 100% of the variability in Y could be explained by the regression equation, so r² = 1 and SSE = 0. The lowest possible value of r² is 0, indicating that X explains 0% of the variability in Y. Thus, r² can range from a low of 0 to a high of 1. In developing regression equations, a good model will have an r² value close to 1.
The correlation coefficient, r, is the square root of r² and takes the sign of the slope. Since b₁ = 1.25 is positive,

r = +√0.6944 = 0.8333
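Continuing the sums-of-squares sketch above, r² and r follow in two lines (the variables ssr and sst are those computed in the earlier snippet):

```python
# Coefficient of determination and correlation coefficient,
# from the ssr and sst values computed in the previous sketch.
r_squared = ssr / sst        # 15.625 / 22.5 = 0.6944...
r = r_squared ** 0.5         # 0.8333..., positive because b1 > 0
print(round(r_squared, 4), round(r, 4))   # 0.6944 0.8333
```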
Figure 1.3 (a, b, c, d): Possible Scatter Diagrams for Different Values of r
Using Computer Software (Excel) for Regression Analysis
Software such as QM for Windows and Excel QM is often used for regression calculations. We will rely on Excel for most of the calculations in the rest of this chapter. When using Excel to develop a regression model, the input and output for Excel 2007 and 2010 are the same.
The Tororo Construction Company example will be used to illustrate how to develop a regression model in Excel 2010. Go to Data Analysis in the Excel menu. When the Data Analysis window opens, scroll down to and highlight Regression and click OK (Table 1.4).
Table 1.4: How to Access the Regression Option in Excel 2007 or 2010
The Regression window will open, and you can input the X and Y ranges (values). Check the Labels box, because the cells with the variable names were included in the first row of the X and Y ranges (Table 1.5).
Table 1.5: Data Input for Regression in Excel
To have the output presented on this page rather than on a new worksheet, select Output
Range and give a cell address for the start of the output. Click the OK button, and the
output appears in the output range specified.
Table 1.6: Excel Output for the Tororo Construction Company example
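For readers working outside Excel, the same output can be reproduced in Python. The sketch below assumes the third-party statsmodels library is installed; any regression package would serve equally well.

```python
# Simple linear regression with statsmodels, reproducing the main
# numbers in the Excel output for the Tororo Construction data.
import statsmodels.api as sm

payroll = [3, 4, 6, 4, 2, 5]           # X
sales = [6, 8, 9, 5, 4.5, 9.5]         # Y

X = sm.add_constant(payroll)           # adds the intercept column
model = sm.OLS(sales, X).fit()

print(model.params)                    # intercept 2.0, slope 1.25
print(model.rsquared)                  # 0.6944...
print(model.fvalue, model.f_pvalue)    # F statistic and Significance F
```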
Assumptions of the Regression Model
If we can make certain assumptions about the errors in a regression model, we can
perform statistical tests to determine if the model is useful. The following assumptions
are made about the errors:
1. The errors are independent
2. The errors are normally distributed
3. The errors have a mean of zero
4. The errors have a constant variance (regardless of the value of X)
Plotting Errors
It is possible to check the data to see whether these assumptions are met, and a plot of the errors (residuals) may highlight problems with the model. Often a residual plot will reveal any glaring violations of the assumptions. When the errors are plotted against the independent variable, the pattern should appear random, as shown in the sketch below.
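A residual plot of this kind takes only a few lines of code. The sketch below assumes matplotlib is available and reuses the TCC model Ŷ = 2 + 1.25X.

```python
# Residuals-versus-X plot for checking the regression assumptions.
import matplotlib.pyplot as plt

payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
residuals = [y - (2 + 1.25 * x) for x, y in zip(payroll, sales)]

plt.scatter(payroll, residuals)
plt.axhline(0, linestyle="--")         # reference line at zero error
plt.xlabel("Payroll ($100,000,000s)")
plt.ylabel("Residual (Y - Yhat)")
plt.title("Residuals vs X")            # the pattern should look random
plt.show()
```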
The Figure 1.4 series presents some typical error patterns. Figure 1.4A displays the pattern expected when the assumptions are met and the model is appropriate: the errors are random, and no discernible pattern is present.
Figure 1.4A: Pattern of Errors Indicating Randomness
Figure 1.4B demonstrates an error pattern in which the errors increase as X increases, violating the constant variance assumption.
Figure 1.4C: Pattern of Errors Indicating that the Relationship Is Not Linear
Estimating the Variance
While the errors are assumed to have constant variance (σ²), this is usually not known. It can be estimated from the sample results. The estimate of σ² is the mean squared error (MSE), denoted s². The MSE is the sum of squares due to error divided by its degrees of freedom:

s² = MSE = SSE / (n − k − 1)
Where
n = number of observations in the sample
k = number of independent variables
In this example, n = 6 and k = 1, so

s² = MSE = SSE / (n − k − 1) = 6.8750 / (6 − 1 − 1) = 6.8750 / 4 = 1.7188
From this we can estimate the standard deviation as

s = √MSE

This is called the standard error of the estimate or the standard deviation of the regression. In the example shown,

s = √MSE = √1.7188 = 1.31
This is used in many of the statistical tests about the model. It is also used to find interval estimates for both Y and the regression coefficients. (The MSE is a common measure of accuracy in forecasting. When used with techniques besides regression, it is common to divide the SSE by n rather than by n − k − 1.) The computation is sketched below.
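The arithmetic above is easy to script. A minimal Python sketch, assuming the SSE value from Table 1.3:

```python
# MSE and standard error of the estimate for the TCC model.
sse = 6.875                    # from Table 1.3
n, k = 6, 1                    # observations, independent variables
mse = sse / (n - k - 1)        # 6.875 / 4 = 1.71875
s = mse ** 0.5                 # standard error of the estimate
print(round(mse, 4), round(s, 2))   # 1.7188 1.31
```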
Testing the Model for Significance (An F test is used to determine whether there is a relationship between X and Y)
Both the MSE and r² (the coefficient of determination) provide a measure of accuracy in a regression model. However, when the sample size is too small, it is possible to get good values for both of these even if there is no relationship between the variables in the regression model. To determine whether these values are meaningful, it is necessary to test the model for significance.
To see if there is a linear relationship between X and Y, a statistical hypothesis test is
performed. The underlying linear model was given earlier as
Y = β₀ + β₁X + ε
If β₁ = 0, then Y does not depend on X in any way. The null hypothesis says there is no linear relationship between the two variables (i.e. β₁ = 0). The alternate hypothesis is that there is a linear relationship (i.e. β₁ ≠ 0). If the null hypothesis can be rejected, we have statistically significant evidence that a linear relationship exists, so X is helpful in predicting Y. The F distribution is used to test this hypothesis.
The F statistic used in the hypothesis test is based on the MSE and the mean squared regression
(MSR). The MSR is calculated as
MSR = SSR / k

where
k = number of independent variables in the model
The F statistic is
F = MSR / MSE
Based on the assumptions regarding the errors in a regression model, this calculated F
statistic is described by the F distribution with
• degrees of freedom for the numerator = df1 = k
• degrees of freedom for the denominator = df2 = n – k – 1
Where
k = the number of independent (X) variables.
If there is very little error, the denominator (MSE) of the F statistic is very small relative to the numerator (MSR), and the resulting F statistic is large. This is an indication that the model is useful. A significance level related to the value of the F statistic is then found. Whenever the F value is large, the significance level (p-value) will be low, indicating that it is extremely unlikely that this could have occurred by chance. When the F value is large (with a resulting small significance level), we can reject the null hypothesis that there is no linear relationship. This means that there is a linear relationship and the values of MSE and r² are meaningful. The hypothesis test just described can be summarised step by step as follows.
Steps in Hypothesis Test for a Significant Regression Model
1. Specify the null and alternative hypotheses: H₀: β₁ = 0 (no linear relationship) versus H₁: β₁ ≠ 0 (a linear relationship exists).
2. Select the level of significance, α.
3. Calculate the test statistic F = MSR/MSE, with df₁ = k and df₂ = n − k − 1 degrees of freedom.
4. Make the decision: reject H₀ if the calculated F value exceeds the critical value from the F table (equivalently, if the p-value is less than α); otherwise, do not reject H₀.
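These steps can be carried out numerically for the TCC example. The sketch below assumes SciPy is available for the F-distribution tail probability; the figures come from the sums of squares computed earlier.

```python
# F test for overall significance of the TCC regression model.
from scipy import stats

ssr, sse = 15.625, 6.875       # from Table 1.3
n, k = 6, 1
msr = ssr / k                  # 15.625
mse = sse / (n - k - 1)        # 1.71875
f_stat = msr / mse             # about 9.09

# p-value = P(F > f_stat) with df1 = k and df2 = n - k - 1
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(round(f_stat, 2), round(p_value, 4))   # 9.09, roughly 0.039
# p-value < 0.05, so reject H0: beta1 = 0; the model is significant.
```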
Figure 1.5: F Distribution for KYU Construction Company Test for Significance
The Analysis of Variance (ANOVA) Table
When software such as Excel or QM for Windows is used to develop regression models, the output provides the observed significance level, or p-value, for the calculated F value. This is then compared to the level of significance (α) to make the decision.
Table 1.7: Analysis of Variance (ANOVA) Table for Regression

Source       DF           SS     MS                       F          Significance F
Regression   k            SSR    MSR = SSR/k              MSR/MSE    P(F > MSR/MSE)
Residual     n − k − 1    SSE    MSE = SSE/(n − k − 1)
Total        n − 1        SST
Table 1.7 summarises the general form of the ANOVA table, showing how the numbers in the last three columns are computed. The last column, labelled Significance F, is the p-value, or observed significance level, which can be used in the hypothesis test about the regression model.
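If statsmodels is used (as in the earlier sketch, an assumption rather than part of the original notes), the full ANOVA table can be printed directly:

```python
# ANOVA table for the TCC regression via the statsmodels formula API.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({"sales": [6, 8, 9, 5, 4.5, 9.5],
                   "payroll": [3, 4, 6, 4, 2, 5]})
model = smf.ols("sales ~ payroll", data=df).fit()
print(sm.stats.anova_lm(model))   # df, sum_sq, mean_sq, F, PR(>F)
```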