ECO 391-007 Lecture Handout For Chapter 15 SPRING 2003 Regression Analysis Sections 15.1, 15.2
REGRESSION ANALYSIS
Regression analysis is a statistical tool that allows us to look at the impact of one variable on
another while controlling for potential confounding effects. (It holds other things constant, or, in
Latin, "Ceteris Paribus")
Examples: Which one of the following is a dependent variable and which are independent ones?
Rain , Agricultural Output
Dependent Variables: (also called endogenous variables) are the variables whose values are
influenced by the value of the independent variable.
Examples:
(1) n is the sample size and i represents the observation number.
(2)
Observation        Dependent Variable    Independent Variable    Independent Variable
Number (i), n = 3  Yearly Income ($)     Education (years)       Years of Work Experience
1                  12,000                8                       0
2                  20,000                10                      5
3                  30,000                12                      10
(4) Dependent Variable - The grade you will receive in this class:
List potential independent variables:
C. MBA Admissions Example
The Dean of B&E college needs help to determine which applicants to accept to our MBA program. He
hires you to predict how each applicant would do academically in our MBA program.
1) What factors (variables) on the applicants would you want data on?
2) How can we measure the impact of each of these variables on MBA academic performance?
Regression Analysis: A statistical technique that attempts to explain changes in the dependent
variable as a function of changes in independent (explanatory) variables, through the
quantification of an equation. (holding all else constant)
1) To quantify theories.
(Describe economic reality.)
Y=
The points are (Xi, Yi) where i denotes the observation number.
Points to plot:
(X1, Y1) = (5, 1.2)
(X2, Y2) = (70, 17.3)
(X3, Y3) = (15, 2.6)
(X4, Y4) = (30, 4.7)
(X5, Y5) = (50, 7.5)
[Graph: blank axes for plotting the points, with X from 0 to 70 on the horizontal axis and Y on the vertical axis]
β0: constant term (or Y-intercept term).
β0 tells us the value of Y when X is zero.
(Graphically, value of Y where the line hits the Y axis.)
Yi = β0 + β1Xi
[Graph: a straight line with Y-intercept β0 and slope β1]
Looking ahead:
Regression analysis allows us to estimate the values of β0 and β1 that characterize the relationship between
X and Y.
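As a preview, here is a minimal Python sketch of what "estimating" the intercept and slope can look like for the five points listed above. It assumes the standard least-squares formulas (the handout defers the actual method to Section 15.3), and the names `points`, `least_squares_line`, `b0`, and `b1` are illustrative only.

```python
# Five (X, Y) points from the handout's plotting example.
points = [(5, 1.2), (70, 17.3), (15, 2.6), (30, 4.7), (50, 7.5)]

def least_squares_line(data):
    """Return (b0, b1) for the line Y = b0 + b1*X with the smallest squared errors."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Slope: co-movement of X and Y divided by the variation in X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    b1 = sxy / sxx
    # Intercept: chosen so the line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = least_squares_line(points)
print(f"estimated intercept b0 = {b0:.3f}, estimated slope b1 = {b1:.3f}")
```

The printed slope is the estimated change in Y for a one-unit increase in X, and the intercept is the estimated value of Y when X is zero.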
Example: Expenditure on food (at constant prices) as a function of the quantities of goods.
1
2
3
III. Deterministic and Stochastic Relationships
A deterministic relationship is one in which each value of X is paired with only one Y value. It’s an
exact relationship, of the same nature as discussed in the previous section.
A stochastic relationship is one in which one value of X may be associated with several different
values of Y for different data points. In short, there is an underlying linear relation between X and
Y, but Y is subject to some external “noise”.
Example:
Y = yearly family expenditures on recreational activity.
X = yearly family income.
The stochastic error term accounts for all of the other variables besides X that determine the value of
Y.
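The contrast between the two kinds of relationship can be sketched in a few lines of Python. Everything numeric here is a hypothetical assumption chosen for illustration (the intercept `BETA_0`, the slope `BETA_1`, and the noise spread `noise_sd` do not come from the handout); the point is only that the same X always gives the same Y in the deterministic case, but not in the stochastic one.

```python
import random

BETA_0 = 500.0   # hypothetical intercept, in dollars (not from the handout)
BETA_1 = 0.10    # hypothetical slope: recreation dollars per dollar of income

def deterministic_y(x):
    """Exact relationship: each X is paired with exactly one Y."""
    return BETA_0 + BETA_1 * x

def stochastic_y(x, rng, noise_sd=1500.0):
    """Same underlying line, plus random 'noise' standing in for everything else."""
    return deterministic_y(x) + rng.gauss(0.0, noise_sd)

rng = random.Random(391)  # fixed seed so the run is repeatable
income = 60_000
exact = [deterministic_y(income) for _ in range(3)]
noisy = [stochastic_y(income, rng) for _ in range(3)]
print("deterministic:", exact)
print("stochastic:   ", [round(y) for y in noisy])
```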
εi accounts for:
4)Randomness-unpredictable occurrences
Note: Some dependent variables will have more inherent error than others.
Regression analysis: A method of estimating stochastic relationships and analyzing the estimates.
One-variable Stochastic Relationships are best illustrated by a scatter plot diagram:
Example: height-weight stochastic relationship.
[Scatter plots: Sample A and Sample B, each with Y on the vertical axis and X on the horizontal axis]
[Scatter plots: Sample A (X = income, Y = consumption) and Sample B (X = price of cars, Y = # of cars sold)]
3) If the relationship between two variables is linear or nonlinear.
[Scatter plots: a Linear panel and a Nonlinear panel, each with Y on the vertical axis and X on the horizontal axis]
4) Something about the strength of a relationship between two variables.
[Scatter plots: Sample A and Sample B, each with Y on the vertical axis and X on the horizontal axis]
IV. The Simple Linear Regression Model.
Recall that a stochastic relationship between two variables is one in which the explanatory, independent
variable explains some of the value of the dependent variable, but it is not the sole determinant of Y.
Since other variables and error in data collection might also be affecting the value of Y, we include a
random error term, εi, that accounts for everything that X does not.
Yi = βo + β1Xi + εi
where: βo and β1 are coefficients
εi is the random or stochastic error term
and i denotes the observation number.
This equation shows the behavioral relationship between X and Y and if we estimate the specific values
of βo and β1 then we have statistically quantified the relationship.
The goal of linear regression analysis is to estimate the values of βo and β1 using sample
data.
For example,
Let Xi be a family’s income and let Yi be the family’s spending on recreational activities.
Two families who both have an income of $60,000 per year, (X1 = $60,000 and X2 = $60,000), may have
different levels of recreational spending. (Y1 = $5,000 and Y2 = $10,000)
For any given value of X, Y is said to be a random variable meaning that Y can take on any one in a
distribution of possible values. We expect this distribution to have a mean or expected value. For
instance, ten different families who all earn $60,000 may all spend different amounts on recreation,
but we may say that on average, families who earn $60,000 per year spend $7,000 on recreation.
E(Yi | X = Xi) or E(Yi | Xi) is called the conditional expected value of the random variable Yi when X
takes on a specific value. Below is a distribution showing the different values the random variable Yi can
take on given that Xi takes a specific value. (here: Xi = $60,000)
[Figure: distribution of Yi given Xi = $60,000, with E(Yi | Xi) marked on the Yi axis]
For a linear regression model
E(Yi | Xi) = βo + β1Xi
The mean of the Y distribution at each value of X falls on the population regression line.
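A small simulation can make this statement concrete. The sketch below assumes hypothetical population values (`BETA_0 = 1000` and `BETA_1 = 0.10`, picked only so that the line gives $7,000 at an income of $60,000, as in the example) and normally distributed errors. It draws many Y values at each of three X values and compares their average with the value on the line.

```python
import random
import statistics

# Hypothetical population coefficients; they give E(Y | X = 60,000) = 7,000.
BETA_0, BETA_1 = 1000.0, 0.10

def draw_y(x, rng, noise_sd=2000.0):
    """One observed Y: the value on the population line plus a random error."""
    return BETA_0 + BETA_1 * x + rng.gauss(0.0, noise_sd)

rng = random.Random(15)  # fixed seed so the run is repeatable
means = {}
for x in (40_000, 60_000, 80_000):
    ys = [draw_y(x, rng) for _ in range(20_000)]
    means[x] = statistics.fmean(ys)
    line_value = BETA_0 + BETA_1 * x
    print(f"X = {x}: line gives {line_value:.0f}, simulated mean of Y = {means[x]:.0f}")
```

Individual draws scatter widely, but at each X the simulated mean sits close to the value on the line.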
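[Figure: distributions f(Y) of Y at X1, X2, and X3, with the mean of each distribution on the population regression line]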
The actual (observed) data points and the population regression line:
E(Yi | Xi) = βo + β1Xi
True Population Regression Line
Note that the actual data points from a sample do not all actually fall directly on the true population
regression line.
The difference between the data points and the line is represented by the random error term.
The random error term, εi = Yi - E(Yi | Xi)
εi = Yi - (βo + β1Xi) or
Yi = βo + β1Xi + εi (The Stochastic Equation)
Thus,
1) The (βo + β1Xi) portion of the above equation is the systematic or deterministic component of the
stochastic equation. If Y depended solely upon this part of the equation, then each value of X would
only be associated with one value of Y.
2) εi is the random error term. This accounts for any part of the Y value that is explained by factors
other than X. This is the part of the equation that allows one X value to be associated with more than
one Y value. (i.e. “the garbage collector”)
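The decomposition can be checked with the numbers from the recreation example: two families with an income of $60,000 spend $5,000 and $10,000, while the average for that income is $7,000. The sketch below treats that $7,000 average as the conditional mean E(Y | X), which is an assumption made for illustration; the variable names are also illustrative.

```python
# Values taken from the handout's recreation-spending example.
expected_spending = 7_000            # treated here as E(Y | X = $60,000)
observed_spending = [5_000, 10_000]  # Y1 and Y2 for the two families

# Error term: the part of each Y not explained by the systematic component.
errors = [y - expected_spending for y in observed_spending]
print(errors)  # [-2000, 3000]

# Systematic part plus error recovers each observed Y.
recovered = [expected_spending + e for e in errors]
print(recovered)  # [5000, 10000]
```

Both families share the same systematic component; only the error term differs, which is how one X value ends up paired with two Y values.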
Again, we do not observe the entire population to get the values of βo and β1. We need to estimate these
values using samples.
Sample Information:
1) Ŷi = bo + b1Xi is called the sample regression equation (estimated regression equation) that
shows the behavioral relationship between X and Y for the sample data. This equation serves as an
estimate of the true population regression line that we cannot actually measure.
The actual (observed) data points and the sample regression line:
E(Yi | Xi) = βo + β1Xi
Population Regression Line
Ŷi = bo + b1Xi
Sample Regression Line
ei (the residual) is an estimate of εi and it represents the difference between the actual observed Yi value
and the Ŷi value that is predicted by plugging Xi into the estimated regression line formula.
There will be n residual values, one for each data point pair.
ei = Yi - Ŷi or ei = Yi - bo - b1Xi
Example:
2 15 8
3 8 5
4 12 8
5 14 10
Suppose that we take these data points and estimate the sample regression equation.
(We would be using formulas and techniques that you will learn in 15.3.) We would estimate:
Yi = βo + β1Xi + εi using Ŷi = bo + b1Xi
After using the method of least squares that we will learn, we find that bo = 2 and b1 = .5, or
Ŷi = 2 + .5Xi
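As a rough check on the arithmetic, the Python sketch below plugs X values into Ŷi = 2 + .5Xi and computes the residuals ei = Yi - Ŷi. It uses only the rows visible in the example table, and it assumes that the second column is X and the third column is Y; that column reading, and the names `rows`, `predict`, and `table`, are assumptions rather than something the handout states.

```python
B0, B1 = 2.0, 0.5  # estimates stated in the handout: b0 = 2, b1 = .5

# Rows visible in the example table: (observation, X, Y).
# Assumption: second column is X, third column is Y.
rows = [(2, 15, 8), (3, 8, 5), (4, 12, 8), (5, 14, 10)]

def predict(x):
    """Predicted value Y-hat from the sample regression equation."""
    return B0 + B1 * x

table = []
for obs, x, y in rows:
    y_hat = predict(x)
    residual = y - y_hat  # e_i = Y_i - Y-hat_i
    table.append((obs, x, y, y_hat, residual))
    print(f"obs {obs}: X = {x}, Y = {y}, Y-hat = {y_hat}, e = {residual}")
```

A negative residual means the observed point lies below the sample regression line, and a positive one means it lies above.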
IN CLASS EXERCISE:
1) Graph the sample regression line. Return to the previous table and for each value of X, calculate the predicted
value of Yi, or Ŷi. Plot each of these five predicted values on the graph below. Connect these points and you
have the sample regression line. You will be graphing the points (Xi, Ŷi). As we plug each of the values of X into
the sample regression equation, we will calculate the predicted value Ŷi. This is the value of Y if we fit it perfectly
into the behavioral relationship defined by the sample regression line. Complete the fourth column of the table.
2) Plot the five original, observed data points. Label the actual, observed data points 1, 2, 3, 4, and 5.
3) On the graph, mark the distance between the sample regression line and the
actual observed data points. These distances represent the residuals. In the table above, calculate the value of the
residuals to complete the last column.
Recall that the residual is calculated as ei = Yi - Ŷi.
[Blank graph for the exercise: Y on the vertical axis (marked 10 to 14), X on the horizontal axis (marked 2 to 20)]
Next time we will study how to estimate the β's using the sample data above (actually, we will look for the bo and b1 that
minimize the sum of squared residuals). For now, let’s take for granted that the best estimates are bo = 2 and b1 = 0.5.
bo = 2: means that….
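To give a feel for what "minimizing the sum of squared residuals" means before the formulas arrive, here is a hedged Python sketch that simply tries many candidate lines and keeps the one with the smallest sum. It reuses the five points from the earlier plotting example rather than the in-class table, and the grid ranges and step sizes are arbitrary assumptions; a brute-force grid search like this only stands in for the least-squares formulas of Section 15.3.

```python
# Five (X, Y) points from the handout's earlier plotting example.
data = [(5, 1.2), (70, 17.3), (15, 2.6), (30, 4.7), (50, 7.5)]

def ssr(b0, b1, points):
    """Sum of squared residuals for the candidate line Y-hat = b0 + b1*X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in points)

# Candidate coefficients: b0 from -3.0 to 3.0 in steps of 0.1,
# b1 from 0.00 to 0.50 in steps of 0.01 (ranges chosen arbitrarily).
candidates = [(i / 10, j / 100) for i in range(-30, 31) for j in range(0, 51)]

# Keep the candidate line whose squared residuals add up to the least.
best_b0, best_b1 = min(candidates, key=lambda c: ssr(c[0], c[1], data))
best_ssr = ssr(best_b0, best_b1, data)
print(f"grid minimum near b0 = {best_b0}, b1 = {best_b1}, SSR = {best_ssr:.3f}")
```

Squaring keeps positive and negative residuals from cancelling each other and penalizes large misses more heavily than small ones.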
1) To test your understanding of linear relationships, try graphing the following linear equations:
a) Y = 4 + 2X
b) Y = 4 - 2X
c) Y = 2 + 2X
d) Y = 2 + 3X
Note that larger values of the slope make the graph of the line appear steeper.
e) Try to verbally interpret the coefficients.
2) Suppose that a company installs and repairs copying machines. The company studied the relationship
between repair costs for a sample of six machines and the number of pages copied by each machine. The
goal is to identify machines whose costs are too high relative to their copying volumes. The repair costs
in dollars and the pages copied in thousands for the six machines are as follows:
Machine        1     2      3     4     5      6
Repair Cost    85    120    70    165   125    90
Pages Copied   900   1350   550   850   1500   800
a) Which variable is the dependent variable and which is the independent variable? Why?
d) Does there appear to be any relationship between repair costs and the number of pages copied? (i.e.
direct or inverse, linear or nonlinear, weak or strong.)
e) Can you think of any other independent variables that might be influencing this dependent variable?
3) a) Based on lecture to this point, write your own definition of regression analysis that makes sense to
you and memorize it.
b) What are the three primary uses for regression analysis? Give one specific example of each that we
did not discuss in class.
4) If the points (3,18) and (6,9) are two points on a straight line,
a) What is the slope of that line?
d) Based on the information you have been given, can you find the value of the
Y-intercept term? If so, find it.
5) Consider the following related variable pairs. Which pairs show deterministic relationships and which
show stochastic relationships? Explain.
X                                                     Y
Number of hamburgers consumed per week by person i    Person i’s weight
7) Explain the four factors that contribute to the random error term.
8) Which dependent variable, people’s annual income or attendance at UK basketball games, would you
expect to exhibit more random (unexplainable) inherent variation and why?
9) Along with this practice sheet you were given a copy of UK’s MBA program admission application.
In an earlier class, we considered the variables that might determine an applicant’s academic success in
the program. Looking at the application, you will see that the class came up with most of the same
variables that the admissions office actually considers. If we wanted to estimate a student’s MBA GPA
as a function of these potential determinants, list the variables from the application form that we can
actually quantify (measure and use numerically) in our estimation. What unit of measure would we use
for each of these independent variables? For each variable discuss how reliable you think the data are. (An
important bit of info for this class - the word data is plural.)
10) Each year top American cities are ranked according to their ability to provide high-quality and low-
cost labor to companies that are relocating. One important measure used to form the rankings is the labor
stress index, which indicates the availability of workers in the city. (The higher the index, the tighter the
job market - i.e. the more difficult for employers to find employees.) Note that one of the determinants
of this measure is the unemployment rate. The values of these two variables for each of the top 10 cities
are listed below in the table.
Obs. #                          1      2      3      4      5      6      7      8      9      10
Labor Market Stress Index (Y)   107    107    100    100    80     100    100    93     87     80
Unemployment Rate (X)           4.5%   3.8%   5.1%   4.9%   5.4%   4.8%   5.5%   4.3%   5.7%   4.6%
(When calculating your statistics, treat the percentages as whole numbers, i.e. enter 4.5% as the number
4.5 rather than .045. The results should be comparable, but your calculations by hand will be less
tedious.)
e) Based on your scatter plot diagram, what is your initial conclusion about the relationship between the
labor market stress index and the unemployment rate? (Relationship positive or negative, linear or
nonlinear, strong or weak?)
11) a) Graph the true regression line and the estimated regression line assuming that o > o and
1 < 1, with each being positive. Clearly denote each line.
b) In the graph, plot one observation (data point) that is below both lines. Show for that observation the
residual, e, and the stochastic error term, ε. (2 points)
a) One drawback of conducting controlled experiments is the potential for confounding effects.
b) Regression analysis is used to test theories, quantify theories, and make forecasts.
Lecture I: An Overview of Regression Analysis KEY KEY KEY
Questions for Practice
1) When graphing a linear equation there are a few things to keep in mind. The most obvious place to
start is with the intercept term. The Y-intercept, or βo, tells the value of Y when X is zero. This is the
number that appears as the constant term in the equation. So for a) we know that one point on the line is
the point (0,4). Find another point that satisfies the equation. For instance if X = 2, Y = 4 + 2(2) = 4 + 4
= 8. So another point on the line is the point (2,8). All you need to graph a linear function are two
points. Graph these two points and draw a line that runs through both.
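The two-point recipe works the same way for all four equations. The short Python sketch below applies it to each one; the dictionary `equations` and the helper `two_points` are illustrative names, and using X = 2 as the second point simply follows the worked example for a).

```python
# The four practice equations, written as (intercept, slope) pairs.
equations = {"a": (4, 2), "b": (4, -2), "c": (2, 2), "d": (2, 3)}

def two_points(intercept, slope, x_other=2):
    """Return the Y-intercept point and one more point on Y = intercept + slope*X."""
    return (0, intercept), (x_other, intercept + slope * x_other)

for label, (intercept, slope) in equations.items():
    print(label, two_points(intercept, slope))
```

For a) this reproduces the points (0,4) and (2,8) found above.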
[Graph: lines a), b), c), and d) drawn on the same axes, with X from 1 to 10 and Y from 1 to 8]
Note that larger values of the slope make the graph of the line appear steeper.
e) A verbal interpretation of the slope coefficient: as X increases by one unit, Y changes by the value
of the slope. For instance, in part a), as X increases by one unit, Y increases by 2.
2) a) The repair cost is dependent because it depends upon the level of use. Pages copied would then be
the independent variable.
b) Scatter diagram: [Scatter plot: Repair Costs on the vertical axis (70 to 170) against # of copies on the horizontal axis (500 to 1,500)]
c) The maintenance cost of machine 4 seems to be out of line. It stands out from the other points in the
diagram.
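One quick numeric cross-check, offered only as a rough screen and not as the method this question is asking for, is to divide each machine's repair cost by its pages copied. The Python sketch below uses the six data pairs from the question; the ratio itself and the variable names are assumptions introduced for illustration.

```python
machines = [1, 2, 3, 4, 5, 6]
repair_cost = [85, 120, 70, 165, 125, 90]   # dollars, from the question's table
pages = [900, 1350, 550, 850, 1500, 800]    # pages copied, as listed in the table

# Repair cost per unit of pages copied for each machine.
cost_ratio = {m: c / p for m, c, p in zip(machines, repair_cost, pages)}
for m, ratio in cost_ratio.items():
    print(f"machine {m}: {ratio:.3f}")

# The machine with the highest cost relative to its copying volume.
highest = max(cost_ratio, key=cost_ratio.get)
print(highest)  # 4
```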
d) Given the appearance of the scatter diagram it would seem that the variables are positively and linearly
related. The relationship appears to be very strong.
e) Other independent variables that might be influencing the repair cost could be i) how often the user
cleans the machine and how well they maintain service for the machine; ii) Do they give the machine a
rest in between running big jobs; iii) do they use the appropriate type of paper; iv) do they use the
machine as the backboard in the office’s big Nerf basketball championship; etc.
b) Regression analysis may be used to test theories, quantify relationships, and make predictions or
forecasts. I will let you work on the examples.
4) If the points (3,18) and (6,9) are two points on a straight line,
b) Since the slope is negative, we can assume that the variables are negatively related.
d) Based on the information you have been given, you can find the value of the
Y-intercept term. Try a little simple algebra. We know that a linear equation can be written as Y = βo +
β1X. We know that β1 = -3. Plug in the X and Y values from one of the points. We know these points
“satisfy” the equation.
18 = βo - 3(3) or 18 = βo - 9 or 27 = βo
Also, if you draw the graph, plot the two points, and draw the line going through them, you can usually
see where it hits the Y-axis (although this is not always the most accurate approach).
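The same algebra can be written as a few lines of Python. The helper name `line_through` is illustrative; it computes the slope as rise over run and then solves for the intercept using one of the points.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)  # rise over run
    intercept = y1 - slope * x1    # solve Y = intercept + slope*X at the first point
    return intercept, slope

intercept, slope = line_through((3, 18), (6, 9))
print(intercept, slope)  # 27.0 -3.0
```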
5) i) The relationship between hamburger consumption and human weight is stochastic. While
hamburger consumption certainly might have an impact on weight, other factors besides hamburger
consumption are also important in determining weight.
ii) The relationship between ticket sales and ticket revenues is deterministic because the number of
tickets sold (as long as we know the price) completely determines the revenue from selling the tickets.
iii) The relationship between study time and GPA is stochastic because other factors in addition to study
time are essential in determining the value of GPA.
consider. Is age a factor? Did each girl sell in their home neighborhood? How many doors did each girl
knock upon? Did they use the phone to try to make sales? Is Ingrid more pleasant looking or more
outgoing? Is Constance less motivated? Does Ingrid come from a very big family with LOTS of
relatives?
10) a) The independent variable is the unemployment rate. This variable is one of the determinants of the
stress index that tells us how tight the job market is in an area.
b) The dependent variable is the stress index. Its value is determined by, or is a function of, the unemployment
rate.
c) This is a stochastic relationship. The value of the stress index varies for other reasons besides just the
level of unemployment. (i.e. unemployment is not the sole determinant of the stress index.)
d) [Scatter plot: Stress Index on the vertical axis (80 to 110) against Unemployment Rate on the horizontal axis (3.8 to 5.8)]
e) Although it is somewhat difficult to see given this scatter plot, it would appear that there is some sort
of linear, negative relationship although it does not look very strong.
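Since the picture is hard to read by eye, one numeric summary that can back it up is the sample correlation coefficient. This statistic is not introduced in this handout, so treat it as an outside tool: it runs from -1 to +1, its sign gives the direction of a linear relationship, and values nearer 0 indicate a weaker one. The Python sketch below computes it for the ten observations, with the percentages entered as whole numbers as the question suggests; the function name `correlation` is illustrative.

```python
from math import sqrt

unemployment = [4.5, 3.8, 5.1, 4.9, 5.4, 4.8, 5.5, 4.3, 5.7, 4.6]  # X, in percent
stress_index = [107, 107, 100, 100, 80, 100, 100, 93, 87, 80]      # Y

def correlation(xs, ys):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(unemployment, stress_index)
print(f"correlation between unemployment rate and stress index: {r:.2f}")
```

A negative value well short of -1 is consistent with the reading of the scatter plot above: a negative relationship that is not very strong.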
11) a) and b) [Graph: the Estimated Regression Line and the True Population Regression Line, with the residual ei and the stochastic error term εi marked for one observation]
12) a) False: Controlled experiments allow you to avoid the problems related to
confounding effects by controlling for potential confounding factors.
b) True: These are the reasons we discussed for using regression analysis.