Linear Regression Analysis. Statistics 2 Notes


BUSITEMA UNIVERSITY

Faculty of Engineering & Technology

Department of Agricultural Mechanization & Irrigation Engineering


Course: Engineering Statistics
Programme: AMI Year IV

Definition: Linear Regression Analysis is a forecasting procedure that uses the least
squares approach on one or more independent variables to develop a forecasting model for
the capacity requirements of future business.

Synopsis
We begin this chapter with the objective of determining the normal capacity
required by a company over the medium to long term. The decision will be based on a demand
forecast. When forecasting future demand for capacity-planning purposes, fluctuations in
demand are expected, but to some extent ignored: the aim is to get a general
picture of future demand. Take as an example the forecast demand pattern shown in Figure
1.1 below.

Figure 1.1: Demand Forecast


The average demand is stable at X, but some variations about this level are expected. In this
case, when planning future capacity, a capacity level of X could be established, with the
expectation that some means would be available to deal with the short-term variations.
However, if it were known in advance that this was not possible, then a higher level of capacity,
e.g. Y, might be fixed. So knowing whether and how demand fluctuations can be dealt with
will influence the normal capacity that is fixed, given the forecast of future demand.

To develop forecasting techniques based on associative predictions or economic indicators
can be a lengthy business, usually involving an extensive examination of the statistical
relationship between past sales data and various likely indices. The closeness of the statistical
association of variables can be measured by calculating a correlation coefficient. Once
indices bearing a close correlation with historical data have been found, the relationship can
be used for forecasting. The relationship is expressed mathematically by means of a
Regression Equation.

Let us now consider Regression Analysis in its own right.

1.0 Introduction
1.1.0 Regression Analysis
Linear regression analysis is a very useful tool for today’s manager. Two purposes of
regression analysis are to understand the relationship between variables and to predict
the value of one based on another (such as cost of advertising and sales). Regression
has been used to model such things as the relationship between level of education and
income, or the price of a house and its square footage.

In any regression model, the variable to be predicted is called the dependent variable
or response variable. The value of this is said to be dependent upon the value of an
independent variable, which is sometimes called an explanatory variable or a
predictor variable.

1.1.2 Investigation of relationship between variables.


To investigate the relationship between variables, it is helpful to look at the data
available to you. The available data can be plotted on a Chart/ Graph for interpretation.
Such a graph is often called a scatter diagram or a scatter plot. Normally the
independent variable is plotted on the horizontal (X-axis) and the dependent variable
is plotted on the vertical (Y-axis).

Example 1: Tororo Construction Company (TCC) renovates old homes in Podut
village on the Iganga-Malaba highway. Over time, the company has found that its yearly
volume of renovation income depends on the payroll of the Podut-area
residents. The figures for Tororo Construction Company’s revenue and the amount of
money earned by wage earners who live in Podut for the past six (6) years are presented
in Table 1.1. Economists have predicted the local area payroll to be $600
million in the coming year, and Tororo Construction Company (TCC) wants to plan
accordingly.

Table 1.1: Tororo Construction Company Sales and Local Incomes (Payroll)
TCC Sales Local Payroll
($100,000s) ($100,000,000s)
6 3
8 4
9 6
5 4
4.5 2
9.5 5

The Table 1.1 data for TCC are used to draw the scatter diagram in Fig. 1.1 below.

This graph indicates that higher values for the local income (payroll) seem to result in
higher sales for the company. This is not a perfect relationship, because not all the points
lie on a straight line, but there is a relationship. A line has been drawn through the data
to help show the relationship that exists between the income (payroll) and sales. The
points do not all lie on the line, so there would be some error involved if we tried to
predict sales based on payroll using this or any other line. Regression analysis finds the
line that minimises this error.

Figure 1.1: Scatter Diagram of Tororo Construction Company Data

1.1.3 Simple Linear Regression


In any regression model there is an implicit assumption (which can be tested) that a
relationship exists between the variables. There is also some random error that cannot
be predicted. The underlying simple linear regression model is given by this formula:
Y = β₀ + β₁X + ε
Where:
Y = dependent variable (or response variable)
X = independent variable (or predictor or explanatory variable)
β₀ = intercept (value of Y when X = 0)
β₁ = slope of the regression line
ε = random error

1.1.4 Estimation of the Slope and Intercept
Estimates of the slope and intercept are found from sample data. The true values of
the intercept and slope are not known, and therefore they are estimated from sample
data. The regression equation based on sample data is
Ŷ = b₀ + b₁X
Where:
Ŷ = predicted value of Y
b₀ = estimate of β₀, based on sample results
b₁ = estimate of β₁, based on sample results

Therefore, in the case of Tororo Construction Company, we are trying to predict the
sales, so the dependent variable (Y) would be sales. The variable we use to help
predict sales is the Podut area Incomes (Payroll), so this is the independent variable
(X). Although any number of lines can be drawn through these points to show a
relationship between X and Y in figure 1.1 above, the line that will be chosen is the one
that in some way minimises the errors. Error is defined as

Error = (Actual value) − (Predicted value)

e = Y − Ŷ
2.0 Minimisation of Square Errors
Since errors may be positive or negative, the average error could be zero even though
there are extremely large errors –both positive and negative. To eliminate the difficulty
of negative errors cancelling positive errors, the errors can be squared.

Thus the best regression line is defined as the one with the minimum sum of the squared
errors. For this reason, regression analysis is sometimes called Least-squares
regression.

2.1.0 Statisticians
Statisticians have developed formulas that we can use to find the equation of a straight
line that would minimise the sum of the squared errors. The simple linear regression
equation is
Ŷ = b₀ + b₁X

The following formulas can be used to compute the intercept and the slope:

X̄ = ∑X / n = average (mean) of X values

Ȳ = ∑Y / n = average (mean) of Y values

b₁ = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)²

b₀ = Ȳ − b₁X̄
Table 1.2: Regression Calculations for Tororo Construction Company

Y     X     (X − X̄)²        (X − X̄)(Y − Ȳ)
6     3     (3 − 4)² = 1    (3 − 4)(6 − 7) = 1
8     4     (4 − 4)² = 0    (4 − 4)(8 − 7) = 0
9     6     (6 − 4)² = 4    (6 − 4)(9 − 7) = 4
5     4     (4 − 4)² = 0    (4 − 4)(5 − 7) = 0
4.5   2     (2 − 4)² = 4    (2 − 4)(4.5 − 7) = 5
9.5   5     (5 − 4)² = 1    (5 − 4)(9.5 − 7) = 2.5
∑Y = 42     ∑X = 24     ∑(X − X̄)² = 10     ∑(X − X̄)(Y − Ȳ) = 12.5
Ȳ = 42/6 = 7     X̄ = 24/6 = 4

X̄ = ∑X / n = 24/6 = 4

Ȳ = ∑Y / n = 42/6 = 7

b₁ = ∑(X − X̄)(Y − Ȳ) / ∑(X − X̄)² = 12.5 / 10 = 1.25

b₀ = Ȳ − b₁X̄

Substituting,

b₀ = 7 − (1.25)(4) = 2

The estimated regression equation therefore is

Ŷ = 2 + 1.25X
Or
Sales = 2 + 1.25(Payroll)

If the payroll next year is $600 million (X = 6), then the predicted sales value would be

Ŷ = 2 + 1.25(6) = 9.5, or $950,000
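The whole calculation above can be reproduced in a few lines of code. The following is a minimal Python sketch (Python is our own choice for illustration; the chapter itself uses Excel, and the variable names are ours) that computes the least-squares slope and intercept for the TCC data and reproduces the prediction:

```python
# Least-squares fit for the TCC data, using only the standard library.
X = [3, 4, 6, 4, 2, 5]      # local payroll ($100,000,000s)
Y = [6, 8, 9, 5, 4.5, 9.5]  # TCC sales ($100,000s)

n = len(X)
x_bar = sum(X) / n          # mean of X = 4
y_bar = sum(Y) / n          # mean of Y = 7

# b1 = sum((X - x_bar)(Y - y_bar)) / sum((X - x_bar)^2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) \
     / sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar     # intercept

print(b0, b1)               # 2.0 1.25
print(b0 + b1 * 6)          # predicted sales for a $600m payroll: 9.5
```

The sums in the two generator expressions are exactly the column totals of Table 1.2 (12.5 and 10).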

2.1.1 Deduction
One of the purposes of regression is to understand the relationship among variables.
This model tells us that for each $100 million increase in the payroll (one unit of X),
we would expect sales to increase by $125,000, since b₁ = 1.25 (in $100,000s). This model
helps Tororo Construction Company see how the local economy and company sales are
related.

Possible Questions involved in measuring the fit of the Regression Model

A regression equation can be developed for any variables X and Y, even random
numbers. We certainly would not have any confidence in the ability of one random
number to predict the value of another random number. This leads to the questions:
How do we know that the model is actually helpful in predicting Y based on X?
Should we have confidence in this model?
Does the model provide better predictions (smaller errors) than simply using the
average of the Y values?

3.0 Deviations May Be Positive or Negative

In the Tororo Construction Company case, the sales figures (Y) varied from a low of 4.5 to a
high of 9.5, and the mean was 7. If each sales value is compared with the mean, we see
how far it deviates from the mean, and we could compute a measure of the total
variability in sales. Because Y is sometimes higher and sometimes lower than the
mean, there may be both positive and negative deviations.

3.1.1 The SST Measures the Total Variability in Y about the Mean
Simply summing up these values would be misleading, because the negatives would
cancel out the positives, making it appear that the numbers are closer to the mean than
they actually are. To prevent this problem, we use the Sum of Squares Total
(SST) to measure the total variability in Y:

SST = ∑(Y − Ȳ)²

3.1.2 The SSE measures the variability in Y about the regression line
If we did not use X to predict Y, we would simply use the mean of Y as the prediction,
and the SST would measure the accuracy of our predictions. However, when a regression
line is used to predict the value of Y there are still errors involved, but the
sum of these squared errors will be less than the total sum of squares just computed.
The Sum of Squares Error (SSE) is

SSE = ∑e² = ∑(Y − Ŷ)²

Table 1.3: Sum of Squares for Tororo Construction Company

Y     X     (Y − Ȳ)²            Ŷ                     (Y − Ŷ)²    (Ŷ − Ȳ)²
6     3     (6 − 7)² = 1        2 + 1.25(3) = 5.75    0.0625      1.5625
8     4     (8 − 7)² = 1        2 + 1.25(4) = 7.00    1           0
9     6     (9 − 7)² = 4        2 + 1.25(6) = 9.50    0.25        6.25
5     4     (5 − 7)² = 4        2 + 1.25(4) = 7.00    4           0
4.5   2     (4.5 − 7)² = 6.25   2 + 1.25(2) = 4.50    0           6.25
9.5   5     (9.5 − 7)² = 6.25   2 + 1.25(5) = 8.25    1.5625      1.5625
Ȳ = 7       ∑(Y − Ȳ)² = 22.5                          ∑(Y − Ŷ)² = 6.875    ∑(Ŷ − Ȳ)² = 15.625
            SST = 22.5                                SSE = 6.875          SSR = 15.625

Table 1.3 provides the calculations for the Tororo Construction example. The mean
(Ȳ = 7) is compared to each value, and we get

SST = 22.5

The prediction (Ŷ) for each observation is computed and compared to the actual value.
This results in

SSE = 6.875

The SSE is much lower than the SST. Using the regression line has reduced the
variability in the sum of squares by 22.5 − 6.875 = 15.625. This is called the Sum of
Squares due to Regression (SSR) and indicates how much of the total variability in Y
is explained by the regression model. Mathematically, it can be calculated as

SSR = SST − SSE

Table 1.3 indicates that SSR = 15.625.

There is a very important relationship between the sums of squares that we have
computed:

(Sum of squares total) = (Sum of squares due to regression) + (Sum of squares Error)

SST = SSR + SSE
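These three sums of squares are easy to verify numerically. Below is a small Python sketch (our own illustration, using the intercept and slope fitted earlier) that recomputes them and confirms the identity SST = SSR + SSE for the TCC data:

```python
# Sums of squares for the TCC data, using the fitted line Y-hat = 2 + 1.25X.
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
b0, b1 = 2.0, 1.25
y_bar = sum(Y) / len(Y)                              # mean of Y = 7
Y_hat = [b0 + b1 * x for x in X]                     # predictions

SST = sum((y - y_bar) ** 2 for y in Y)               # total variability
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained (error)
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained by regression

print(SST, SSE, SSR)   # 22.5 6.875 15.625, and SST == SSR + SSE
```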

Figure 1.2: Deviations from the Regression Line and from the Mean

Figure 1.2 above displays the data for Tororo Construction Company. The
regression line is shown, as is a line representing the mean of the Y values. The errors
used in computing the sums of squares are shown on this graph. Notice how the sample
points are closer to the regression line than they are to the mean.

3.1.3 Coefficient of Determination (r2)


The SSR is sometimes called the explained variability in Y, while the SSE is the
unexplained variability in Y. The proportion of the variability in Y that is explained
by the regression equation is called the coefficient of determination and is denoted by
r². Thus

r² = SSR / SST = 1 − SSE / SST
Thus, r² can be found using either the SSR or the SSE. For Tororo Construction
Company, we have

r² = 15.625 / 22.5 = 0.6944

This means that about 69% of the variability in sales (Y) is explained by the regression
equation based on payroll (X).

If every point in the sample were on the regression line (meaning all errors are 0), then
100% of the variability in Y could be explained by the regression equation, so r² = 1
and SSE = 0. The lowest possible value of r² is 0, indicating that X explains 0% of
the variability in Y. Thus, r² can range from a low of 0 to a high of 1. In developing
regression equations, a good model will have an r² value close to 1.
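As a quick numerical check, both forms of the r² formula give the same value for the TCC example. A short Python sketch (our own illustration):

```python
# Coefficient of determination from the sums of squares in Table 1.3.
SST, SSE, SSR = 22.5, 6.875, 15.625

r2_from_ssr = SSR / SST        # explained / total
r2_from_sse = 1 - SSE / SST    # equivalent form

print(round(r2_from_ssr, 4))   # 0.6944
```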

4.0 Correlation Coefficient


This measure expresses the degree or strength of the linear relationship. It is usually
denoted r and can be any number between −1 and +1, inclusive. Figure 1.3 (a, b, c,
d) illustrates possible scatter diagrams for different values of r. The value of r is the
square root of r²; it is negative if the slope is negative, and positive if the slope is
positive. Thus,

r = ±√r²

For the Tororo Construction Company example, with r² = 0.6944,

r = +√0.6944 = 0.8333

We know it is positive because the slope is +1.25.
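This sign rule can be sketched directly in code (our own illustration), taking the sign of r from the sign of the fitted slope:

```python
import math

r_squared = 0.6944   # coefficient of determination from the previous section
b1 = 1.25            # fitted slope (positive)

# r has the same sign as the slope of the regression line.
r = math.copysign(math.sqrt(r_squared), b1)
print(round(r, 4))   # 0.8333
```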

Figure 1.3 (a, b, c, d): Possible scatter diagrams for different values of r

Using Computer Software (Excel) for Regression Analysis
Software such as QM for Windows and Excel QM is often used for regression
calculations. We will rely on Excel for most of the calculations in the rest of
this chapter. When using Excel to develop a regression model, the input and
output for Excel 2007 and 2010 are the same.

The Tororo Construction Company example will be used to illustrate how to develop a
regression model in Excel 2010. Go to Data Analysis on the Data tab of the Excel ribbon.
When the Data Analysis window opens, scroll down to and highlight Regression and
click OK. (Table 1.4)

Table 1.4: How to Access the Regression Option in Excel 2007 or 2010

The Regression window will open, and you can input the X and Y ranges (values).
Check the Labels box because the cells with the variable names were included in the first
row of the X and Y ranges. (Table 1.5)

Table 1.5: Showing Data Input for Regression in Excel

To have the output presented on this page rather than on a new worksheet, select Output
Range and give a cell address for the start of the output. Click the OK button, and the
output appears in the output range specified.

Errors are also called Residuals

The sums of squares are shown in the column headed SS. Another name for error is
residual; in Excel, the sum of squares error is shown as the sum of squares residual. The
values in this output are the same values shown in Table 1.3 above:
Sum of squares regression = SSR = 15.625
Sum of squares error (residual) = SSE = 6.8750
Sum of squares total = SST = 22.5
The coefficient of determination (r²) is shown to be 0.6944. The coefficient of correlation is
called Multiple R in the Excel output, and this is 0.8333.

Table 1.6: Excel Output for the Tororo Construction Company example
Assumption of the Regression Model
If we can make certain assumptions about the errors in a regression model, we can
perform statistical tests to determine if the model is useful. The following assumptions
are made about the errors:
1. The errors are independent
2. The errors are normally distributed
3. The errors have a mean of zero
4. The errors have a constant variance (regardless of the value of X)

Plotting Errors
A Plot of the errors may highlight problems with the model. It is possible to check the
data to see if these assumptions are met. Often a plot of the residuals will highlight any
glaring violations of the assumptions. When the errors (residuals) are plotted against
the independent variable, the pattern should appear random.
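Before plotting, the residuals themselves can be computed directly. The sketch below (our own Python illustration, standard library only) lists the residuals for the TCC fit; for a least-squares line with an intercept they always sum to zero, and plotted against X they should look patternless:

```python
# Residuals e = Y - Y-hat for the TCC fit Y-hat = 2 + 1.25X.
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
residuals = [y - (2 + 1.25 * x) for x, y in zip(X, Y)]

print(residuals)       # [0.25, 1.0, -0.5, -2.0, 0.0, 1.25]
print(sum(residuals))  # 0.0 -- least-squares residuals sum to zero
```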

The Figure 1.4 series presents some typical error patterns. Figure 1.4A displays the
pattern that is expected when the assumptions are met, meaning the model is appropriate:
the errors are random and no discernible pattern is present.

Figure 1.4A: Pattern of Errors Indicating Randomness

Figure 1.4B demonstrates an error pattern in which the errors increase as X increases,
violating the constant-variance assumption.

Figure 1.4B: Pattern of Errors Indicating Non-constant Error Variance


Figure 1.4C shows errors consistently increasing at first, and then consistently
decreasing. A pattern such as this would indicate that the model is not linear and some
other form (perhaps quadratic) should be used. In general, patterns in the plot of the
errors indicate problems with the assumptions or the model specification.

Figure 1.4C: Pattern of Errors Indicating that the Errors Relationship is Not Linear

Estimating the Variance
While the errors are assumed to have constant variance (σ²), this is usually not known. It can
be estimated from the sample results. The estimate of σ² is the mean squared error (MSE),
denoted S². The MSE is the sum of squares due to error divided by its degrees
of freedom:

S² = MSE = SSE / (n − k − 1)
Where
n = number of observations in the sample
k = number of independent variables
In this example, n = 6 and k = 1, so

S² = MSE = SSE / (n − k − 1) = 6.8750 / (6 − 1 − 1) = 6.8750 / 4 = 1.7188

From this we can estimate the standard deviation as

S = √MSE

This is called the standard error of the estimate or the standard deviation of the regression.
In the example shown,

S = √MSE = √1.7188 = 1.31

This is used in many of the statistical tests about the model. It is also used to find interval
estimates for both Y and the regression coefficients. (The MSE is a common measure of
accuracy in forecasting. When used with techniques besides regression, it is common to
divide the SSE by n rather than by n − k − 1.)
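Both estimates are one line each in code. A quick Python check of the TCC numbers (our own illustration):

```python
import math

# MSE and standard error of the estimate for the TCC example.
SSE, n, k = 6.875, 6, 1

MSE = SSE / (n - k - 1)             # 6.875 / 4 = 1.71875
S = math.sqrt(MSE)                  # standard error of the estimate

print(round(MSE, 4), round(S, 2))   # 1.7188 1.31
```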

Testing the Model for Significance (an F test is used to determine whether there is a
relationship between X and Y)

Both the MSE and r² (the coefficient of determination) provide measures of accuracy in a
regression model. However, when the sample size is too small, it is possible to get good values
for both of these even if there is no relationship between the variables in the regression model.
To determine whether these values are meaningful, it is necessary to test the model for
significance.
To see if there is a linear relationship between X and Y, a statistical hypothesis test is
performed. The underlying linear model was given earlier as

Y = β₀ + β₁X + ε

If β₁ = 0, then Y does not depend on X in any way. The null hypothesis says there is no linear
relationship between the two variables (i.e. β₁ = 0). The alternative hypothesis is that there is a
linear relationship (i.e. β₁ ≠ 0). If the null hypothesis can be rejected, then we have evidence
that a linear relationship exists, so X is helpful in predicting Y. The F distribution is used
to test this hypothesis.

The F statistic used in the hypothesis test is based on the MSE and the mean squared regression
(MSR). The MSR is calculated as

MSR = SSR / k

Where
k = number of independent variables in the model
The F statistic is

F = MSR / MSE
Based on the assumptions regarding the errors in a regression model, this calculated F
statistic is described by the F distribution with
• degrees of freedom for the numerator = df1 = k
• degrees of freedom for the denominator = df2 = n – k – 1
Where
k = the number of independent (X) variables.

If there is very little error, the denominator (MSE) of the F statistic is very small relative to
the numerator (MSR), and the resulting F statistic will be large. This is an indication
that the model is useful. A significance level related to the value of the F statistic is then found.
Whenever the F value is large, the significance level (p-value) will be low, indicating that it
is extremely unlikely that this result could have occurred by chance. When the F value is large
(with a resulting small significance level), we can reject the null hypothesis that there is no
linear relationship. This means that there is a linear relationship and the values of MSE and r²
are meaningful. The hypothesis test just described can be summarised step by step as follows.

Steps in Hypothesis Test for a Significant Regression Model

1) Specify the null and alternative hypotheses:
H₀: β₁ = 0
H₁: β₁ ≠ 0
2) Select the level of significance (α). Common values are 0.01 and 0.05.
3) Calculate the value of the test statistic using the formula

F = MSR / MSE

4) Make a decision using one of the following methods:
a) Reject the null hypothesis if the test statistic is greater than the F value from the table
in Appendix D. Otherwise, do not reject the null hypothesis:
Reject if F(calculated) > F(α, df1, df2)
df1 = k
df2 = n − k − 1
b) Reject the null hypothesis if the observed significance level, or p-value, is less than
the level of significance (α). Otherwise, do not reject the null hypothesis:
p-value = P(F > calculated test statistic)
Reject if p-value < α
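The steps above can be applied to the TCC numbers. The Python sketch below (our own illustration) computes the F statistic and its degrees of freedom; the p-value itself requires an F-distribution table or software, so it is not recomputed here:

```python
# F test inputs for the TCC regression, from the sums of squares above.
SSR, SSE, n, k = 15.625, 6.875, 6, 1

MSR = SSR / k             # mean squared regression = 15.625
MSE = SSE / (n - k - 1)   # mean squared error = 1.71875
F = MSR / MSE

df1, df2 = k, n - k - 1   # degrees of freedom: 1 and 4
print(round(F, 4), df1, df2)   # 9.0909 1 4
```

This matches the F value that the Excel output reports for this example, together with its significance level of 0.0394.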

Figure 1.5: F Distribution for the Tororo Construction Company Test for Significance

The Analysis of Variance (ANOVA) Table
When software such as Excel or QM for Windows is used to develop regression models, the
output provides the observed significance level, or p-value, for the calculated F value. This is
then compared to the level of significance (α) to make the decision.
Table 1.7: Analysis of variance (ANOVA) table for regression

            DF           SS     MS                        F           Significance F
Regression  k            SSR    MSR = SSR / k             MSR / MSE   P(F > MSR / MSE)
Residual    n − k − 1    SSE    MSE = SSE / (n − k − 1)
Total       n − 1        SST

Table 1.7 provides summary information about the ANOVA table. This shows how the
numbers in the last three columns of the table are computed. The last column of this table,
labelled Significance F, is the p-value, or observed significance level, which can be used in
the hypothesis test about the regression model.

Tororo Construction Company Example

The Excel output that includes the ANOVA table for the Tororo Construction Company is
shown in Table 1.6. The observed significance level for F = 9.0909 is given as 0.0394.
This means

P(F > 9.0909) = 0.0394

Because this probability is less than α = 0.05, we reject the hypothesis of no linear
relationship and conclude that there is a linear relationship between X and Y. Note in Figure
1.5 above that the area under the curve to the right of 9.09 is clearly less than 0.05, which is
the area to the right of the F value associated with a 0.05 level of significance.
