
Statistics Micro Mini

Multiple Regression

January 5-9, 2008

Beth Ayers

Tuesday 9am-12pm Session
• Critique of An Experiment in Grading Papers

• Review of simple linear regression

• Introduction to multiple regression


‒ Assumptions
‒ Model checking
‒ R2
‒ Multicollinearity

Simple Linear Regression
• Both the response and explanatory variable are
quantitative

• Graphical Summary
‒ Scatter plot

• Numerical Summary
‒ Correlation
‒ R2
‒ Regression equation
‒ Response = β0 + β1 · explanatory

• Test of significance
‒ Test significance of regression equation coefficients
Scatter plot
• Shows relationship between two
quantitative variables
‒ y-axis = response variable
‒ x-axis = explanatory variable

Correlation and R2
• Correlation indicates the strength and
direction of the linear relationship between
two quantitative variables
‒ Values between -1 and +1

• R2 is the fraction of the variability in the response that can be explained by the linear relationship with the explanatory variable
‒ Values between 0 and +1

• Correlation2 = R2

• What counts as a large value for each depends on the field


Linear Regression Equation
• Linear Regression Equation
‒ Response = β0 + β1 · explanatory

‒ β0 is the intercept
‒ the value of the response variable when the explanatory variable is 0

‒ β1 is the slope
‒ For each 1 unit increase in the explanatory variable, the response variable increases by β1

• β0 and β1 are most often found using least squares estimation
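A minimal sketch of the least squares fit in Python (the data values here are hypothetical; assumes the statsmodels package is available):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: typing speed (words per minute) and efficiency (minutes)
speed = np.array([40.0, 55, 60, 72, 85, 90, 100])
efficiency = np.array([65.0, 58, 55, 48, 42, 40, 34])

X = sm.add_constant(speed)           # adds the column of 1s for the intercept
results = sm.OLS(efficiency, X).fit()

print(results.params)                # least squares estimates of b0 and b1
print(results.rsquared)              # R2
```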
Assumptions of linear regression
• Linearity
‒ Check by looking at either the observed vs. predicted or the residual vs. predicted plot
‒ If non-linear, predictions will be wrong

• Independence of errors
‒ Can often be checked by knowing how the data were collected. If not sure, autocorrelation plots can be used.

• Homoscedasticity (constant variance)


‒ Look at residuals versus predicted plot
‒ If the variance is non-constant, predictions will have wrong confidence intervals and estimated coefficients may be wrong

• Normality of errors
‒ Look at normal probability plot
‒ If errors are non-normal, confidence intervals and estimated coefficients will be wrong
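These checks can be produced with two plots; a minimal sketch (assuming matplotlib and a fitted statsmodels `results` object like the one in the earlier sketch):

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Residual vs. predicted plot: look for no pattern and constant spread
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# Normal probability (Q-Q) plot: points should fall close to the line
sm.qqplot(results.resid, line="s")
plt.show()
```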

Assumptions of linear regression

• If the assumptions are not met, the estimates of β0, β1, their standard deviations, and estimates of R2 will be incorrect

• It may be possible to apply transformations to either the explanatory or response variable to make the relationship linear, as sketched below
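A minimal sketch of one such transformation (the data here are hypothetical and chosen to be roughly exponential, so taking the log of the response straightens the relationship):

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 54.6, 148.4, 403.4])   # curved in x

# Fit log(y) on x instead of y on x
results_log = sm.OLS(np.log(y), sm.add_constant(x)).fit()
print(results_log.rsquared)   # close to 1 once the relationship is linear
```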

Hypothesis testing
• Want to test if there is a significant
linear relationship between the variables
‒ H0: there is no linear relationship between the variables (β1 = 0)
‒ H1: there is a linear relationship between the variables (β1 ≠ 0)

• Testing β0 = 0 may or may not be interesting and/or valid
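For reference, the test statistic used here (not written out on the slides) is the estimate divided by its standard error, compared to a t distribution:

```latex
t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}, \qquad df = n - 2
```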

Monday’s Example
• Curious if typing speed (words per
minute) affects efficiency (as measured
by number of minutes required to finish
a paper)
• Graphical display

Sample Output
• Below is sample output for this
regression

Numerical Summary
• Numerical summary
‒ Correlation = -0.946
‒ R2 = 0.8944
‒ Efficiency = 85.99 – 0.52*speed

• For each additional word per minute typed, the number of minutes needed to complete an assignment decreases by 0.52 minutes

• The intercept does not make sense since it corresponds to a speed of zero words per minute
Interpretation of r and R2
• r = -0.946
‒ This indicates a strong negative linear
relationship

• R2 = 0.8944
‒ 89.44% of the variability in efficiency can be
explained by words per minute typed

Hypothesis test
• To test the significance of β1
‒ H0: there is no linear relationship between speed and efficiency (β1 = 0)
‒ H1: there is a linear relationship between speed and efficiency (β1 ≠ 0)

• Test statistic: t = -20.16


• P-value = 0.000

• In this case, testing β0 = 0 is not interesting; however, it may be in some experiments
Checking Assumptions
• Checking assumptions
‒ Plot on left: residual vs. predicted
‒ Want to see no pattern
‒ Plot on right: normal probability plot
‒ Want to see points fall on line

Another Example
• Suppose we have an explanatory and
response variable and would like to
know if there is a significant linear
relationship
• Graphical display

Numerical Summary
• Numerical summary
‒ Correlation = 0.971
‒ R2 = 0.942
‒ Response = -21.19 + 19.63*explanatory

• For each additional unit of the explanatory variable, the response variable increases by 19.63 units

• When the explanatory variable has a value of 0, the response variable has a value of -21.19
Hypothesis testing
• To test the significance of β1
‒ H0: there is no linear relationship between the explanatory and response variables (β1 = 0)
‒ H1: there is a linear relationship between the explanatory and response variables (β1 ≠ 0)

• Test statistic: t = 49.145


• P-value = 0.000

• It appears as though there is a significant linear relationship between the variables
Sample Output
• Sample output for this example; we can see that both coefficients are highly significant

Checking Assumptions
• Checking assumptions
‒ Plot on left: residual vs. predicted
‒ Want to see no pattern
‒ Plot on right: normal probability plot
‒ Want to see points fall on line

Another Example (cont)
• Checking assumptions
‒ In the residual vs. predicted plot we see that the
residual values are higher for lower and higher
predicted values and lower for values in the middle
‒ In the normal probability plot we see that the points fall off the line at the two ends

• This indicates that one of the assumptions was not met!

• In this case there is a quadratic relationship between the variables
• With experience you’ll be able to determine what
relationships are present given the residual versus
predicted plot
Data with Linear Prediction Line

• When we add the predicted linear relationship, we can clearly see the misfit

Multiple Linear Regression
• Use more than one explanatory variable
to explain the variability in the response
variable

• Regression Equation
‒ Y = β0 + β1·X1 + β2·X2 + . . . + βN·XN

• βj is the change in the response variable (Y) when Xj increases by 1 unit and all the other explanatory variables remain fixed
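A minimal sketch of fitting such a model with the statsmodels formula interface (the data frame and column names here are hypothetical):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: predict efficiency from words per minute and GPA
df = pd.DataFrame({
    "efficiency": [65, 58, 55, 48, 42, 40],
    "wpm":        [40, 55, 60, 72, 85, 90],
    "gpa":        [2.1, 2.8, 3.0, 3.4, 3.9, 4.2],
})

results = smf.ols("efficiency ~ wpm + gpa", data=df).fit()
print(results.summary())   # coefficients, t-tests, overall F-test, R2, R2adj
```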

Exploratory Analysis
• Graphical Display
‒ Look at the scatter plot of the response
versus each of the explanatory variables

• Numerical Summary
‒ Look at the correlation matrix of the response
and all of the explanatory variables
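Both summaries are one-liners if the data are in a pandas data frame (reusing the hypothetical `df` from the previous sketch):

```python
import pandas as pd
import matplotlib.pyplot as plt

print(df.corr())                 # correlation matrix of all variables
pd.plotting.scatter_matrix(df)   # scatter plot of every pair of variables
plt.show()
```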

Assumptions of Multiple Linear
Regression
• Same as simple linear regression!
‒ Linearity
‒ Independence of errors
‒ Homoscedasticity (constant variance)
‒ Normality of errors

• Methods of checking assumptions are also the same

R2adj
• R2 is the fraction of the variation in the
response variable that can be explained by
the model

• When variables are added to the model, R2 will increase or stay the same (it will not decrease!)
‒ Use R2adj, which adjusts for the number of variables
‒ Check to see if there is a significant increase

• R2adj is a measure of the predictive power of our model: how well the explanatory variables collectively predict the response
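For reference, the usual adjustment (with n observations and p explanatory variables) is:

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

Unlike R2, this quantity can decrease when a variable that adds little is included.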
Inference in Multiple Regression

• Step 1
‒ Does the data provide evidence that any of
the explanatory variables are important in
predicting Y?
‒ No – none of the variables are important, the
model is useless
‒ Yes – at least one variable is important, move
to step 2

• Step 2
‒ For each explanatory variable Xj: does the data provide evidence that Xj has a significant linear effect on Y, controlling for all the other variables?
Step 1
• Test the overall hypothesis that at least
one of the variables is needed
‒ H0: none of the explanatory variables are
important in predicting the response variable
‒ H1: at least one of the explanatory variables
is important in predicting the response
variable

• Formally done with an F-test


‒ We will skip the calculation of the F-statistic and p-value as they are given in the output
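With statsmodels, the overall F-test appears in `results.summary()`, or can be read directly (continuing from the multiple regression sketch above):

```python
# H0: all slope coefficients are zero; H1: at least one is nonzero
print(results.fvalue)     # F-statistic
print(results.f_pvalue)   # its p-value
```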

Step 2
• If H0 is rejected, test the significance of
each of the explanatory variables in the
presence of all of the other explanatory
variables

• Perform a T-test for the individual effects
‒ H0: Xj is not significant to the model
‒ H1: Xj is significant to the model

Example
• Earlier we looked at how typing speed
and efficiency are linearly related

• Now we want to see if adding GPA (on a 0-5 point scale) as an explanatory variable will make the model more predictive of efficiency

Graphical displays

Numerical Summary

                    Efficiency   Words per minute    GPA
Efficiency             1.00          -0.95          -0.92
Words per minute                      1.00           0.96
GPA                                                  1.00

Sample Output

Step 1 – Overall Model Check
• For our example with words per minute
and GPA, the F-test yields
‒ F-statistic: 207.4
‒ P-value = 0.0000

• Interpretation: at least one of the variables (words per minute or GPA) is important in predicting efficiency

Step 2
• Test significance of words per minute
‒ T-statistic: -4.67
‒ P-value = 0.0000

• Test significance of GPA


‒ T-statistic: -1.33
‒ P-value = 0.1900

• Conclusions
‒ Words per minute is significant but GPA is not
‒ In this case we ended up with a simple linear
regression with words per minute as the only
explanatory variable
Looking at R2adj
• R2adj (wpm and GPA) = 89.39

• R2adj (wpm) = 89.22

• Adding GPA to the model only raised the R2adj by 0.17 percentage points, not nearly enough to justify adding GPA to the model
‒ This agrees with the hypothesis testing on the previous slide
Automatic methods
• Model Selection – compare models to
determine which best fits the data

• Uses one of several criteria (R2adj, AIC score, BIC score) to compare models

• Often use stepwise regression


‒ Start with no variables, add variables one at
a time until there is no significant change in
the selection criteria
‒ Start with all variables, remove variables one
at a time until there is no significant change
in the selection criteria

• Packages have built-in methods for this; a crude sketch of the idea is below
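A toy backward elimination sketch in Python (the function name is made up, and p-values are used as the criterion purely for illustration; real packages typically use AIC or BIC):

```python
import statsmodels.formula.api as smf

def backward_select(df, response, candidates, alpha=0.05):
    """Repeatedly drop the least significant variable until every
    remaining p-value is below alpha."""
    selected = list(candidates)
    while selected:
        formula = response + " ~ " + " + ".join(selected)
        results = smf.ols(formula, data=df).fit()
        pvalues = results.pvalues.drop("Intercept")
        worst = pvalues.idxmax()           # least significant variable
        if pvalues[worst] < alpha:
            break                          # everything left is significant
        selected.remove(worst)
    return selected

# e.g. backward_select(df, "efficiency", ["wpm", "gpa"])
```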


Multicollinearity
• Collinearity refers to the linear
relationship between two explanatory
variables

• Multicollinearity is more general and refers to the linear relationship between two or more explanatory variables

Multicollinearity
• Perfect multicollinearity – one of the variables is a perfect linear function of the other explanatory variables; one of the variables must be dropped
‒ Example: using both inches and feet

• Near-perfect multicollinearity – occurs when there are strong, but not perfect, linear relationships among the explanatory variables
‒ Example: height and arm span

Collinearity Example
• An instructor wants to predict final exam
grade and has the following explanatory
variables
‒ Midterm 1
‒ Midterm 2
‒ Diff = Midterm 2 – Midterm 1

• Diff is a perfect linear function of Midterm 1 and Midterm 2
‒ Drop Diff from the model, or
‒ Use Diff but neither Midterm 1 nor Midterm 2

Indicators of Multicollinearity
• Moderate to high correlations among the
explanatory variables in the correlation
matrix

• The estimates of the regression coefficients have surprising and/or counterintuitive values

• Highly inflated standard errors

Indicators of Multicollinearity
• The correlation matrix alone isn’t always
enough

• Can calculate the tolerance, a more reliable measure of multicollinearity
‒ Run the regression with Xj as the response versus the rest of the explanatory variables
‒ Let R2j be the R2 value from this regression
‒ Tolerance (Xj) = 1 – R2j
‒ Variance Inflation Factor (VIF) = 1/Tolerance

• Do more checking if the tolerance is less than 0.20 or the VIF is greater than 5
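A minimal sketch of computing tolerance and VIF with statsmodels (hypothetical, deliberately correlated data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

wpm = np.array([40.0, 55, 60, 72, 85, 90])
gpa = np.array([2.1, 2.8, 3.0, 3.4, 3.9, 4.2])   # tracks wpm closely

X = sm.add_constant(np.column_stack([wpm, gpa]))
for j, name in enumerate(["wpm", "gpa"], start=1):   # skip the constant
    vif = variance_inflation_factor(X, j)
    print(name, "VIF =", round(vif, 2), "tolerance =", round(1 / vif, 2))
```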
Back to Example
• Use GPA as the response and words per minute as the explanatory variable
‒ R2 = 0.91
‒ Tolerance (GPA) = 0.09
‒ Well below the 0.20 cutoff!

• Adding GPA to the regression equation does not add to the predictive power of the model

What can be done?
• Drop the correlated variables!

• Interpretations of coefficients will be incorrect if you leave all the variables in the regression.

• Do model selection (as described on the Automatic methods slide)

Example
• Suppose we have an online math tutor and
classroom performance variables and we’d like
to predict final exam scores.

• Math tutor variables


‒ Time spent on the tutor (minutes)
‒ Number of problems solved correctly

• Classroom variable
‒ Pre-test score

• Response variable
‒ Final exam score

Example
• Exploratory analysis – correlation matrix
‒ The correlation between pretest and number correct seems high

                  Final Score   Pretest   Number Correct   Time
Final Score          1.00        0.85         0.82         0.37
Pretest                          1.00         0.90         0.01
Number Correct                                1.00         0.03
Time                                                       1.00

Example
• Exploratory analysis
‒ The linear relationship between time and final score is not strong

Example
• Run the linear regression using pretest,
number correct, and time as linear
predictors of final score

Step 1
• Test the overall hypothesis that at least one of
the variables is needed
‒ H0: none of the explanatory variables are
important in predicting the response variable
‒ H1: at least one of the explanatory variables is
important in predicting the response variable

• F-statistic = 95.56
• P-value = 0.0000

• At least one of the three explanatory variables is important in predicting final exam score

Step 2
• Test significance of pretest score
‒ T-statistic: 4.88
‒ P-value = 0.0000

• Test significance of number correct


‒ T-statistic: 1.99
‒ P-value = 0.0524

• Test significance of time


‒ T-statistic: 6.45
‒ P-value = 0.0000

• Conclusions
‒ Pretest score and time are significant but number
correct is not
Example
• This is not surprising given the high
correlation (0.90) between pretest score
and number correct

• Formally show
‒ Number Correct ~ Pretest + Time
‒ R2 = 0.8044
‒ Tolerance = 1 – 0.8044 = 0.1956
‒ Lower than 0.20
‒ VIF = 1/0.1956 = 5.11
‒ VIF is greater than 5

Model Selection
• Why was number correct, and not pretest, chosen as insignificant? It depends on which variable adds more to the predictive power of the regression equation

• Doing stepwise regression will yield more information

• Depending on the criteria used, some model selection procedures dropped number correct and others kept all three variables
‒ If we decide to drop number correct we will have
to rerun the regression

Rerunning the regression
• New output

Steps 1 and 2
• Step 1
‒ F-statistic = 133
‒ P-value = 0.0000

• Step 2
‒ Test significance of pretest score
‒ T-statistic: 14.93
‒ P-value = 0.0000

‒ Test significance of time


‒ T-statistic: 6.34
‒ P-value = 0.0000

Example
• Conclusion – both pretest score and time
are important predictors of final exam
score

• R2adj = 84.34
‒ 84% of the variability in final exam score is
explained by pretest score and time

Check Assumptions
• There may be a slight pattern to the
residual vs. fitted plot, but overall the
plots look good

Interpretation
• The final regression equation is:

Final = -8.16 + 0.59 · pretest + 0.29 · time

• For each additional point on the pretest, a student's predicted final exam score increases by 0.59 points, holding time on the tutor constant

• For each additional minute on the tutor, a student's predicted final exam score increases by 0.29 points, holding pretest score constant

Notes on Example
• If either pretest or time was found to be
non-significant, we would have rerun the
regression again

• Multiple regression often takes several regressions before we are done

• The built-in automatic model selection in statistical packages will do these in one step!

Alternate Ending
• What if we had dropped pretest instead of
number correct?

• The regression equation would be:

Final = 12.58 + 0.43 · number correct + 0.29 · time

Steps 1 and 2
• Step 1
‒ F-statistic = 88.52
‒ P-value = 0.0000

• Step 2
‒ Test significance of number correct
‒ T-statistic: 12.09
‒ P-value = 0.0000

‒ Test significance of time


‒ T-statistic: 5.19
‒ P-value = 0.0000

Check the Assumptions
• On the residual vs. predicted there is a
slight pattern. I’d recommend dropping
the outlier and rerunning the regression.

Notes
• We can see that both number correct and time are significant but that the assumptions might be questionable

• However, when we compare the R2adj of this model with the previous model we see the difference
‒ R2adj (pretest, time) = 84.34
‒ R2adj (number correct, time) = 78.13

• The model with pretest describes more of the variability in final exam scores
Another Example
• Suppose we have 4 explanatory variables (X1, X2, X3, X4) and we have our response variable Y

• X1 and X3 appear to be highly correlated

        Y      X1      X2      X3      X4
Y      1.00   -0.36    0.76   -0.38    0.54
X1             1.00   -0.33    0.98    0.09
X2                     1.00   -0.34   -0.12
X3                             1.00    0.08
X4                                     1.00
Exploratory Analysis
• Appears reasonable that each of the 4
explanatory variables may have a linear
relationship with the response variable

Example
• Start by running the regression with all
four explanatory variables

Steps 1 and 2
• Step 1
‒ F-statistic = 1900
‒ P-value = 0.0000

• Step 2
‒ Test significance of X1
‒ T-statistic: -9.04
‒ P-value = 0.0000

‒ Test significance of X2
‒ T-statistic: 207.21
‒ P-value = 0.0000

‒ Test significance of X3
‒ T-statistic: 0.88
‒ P-value = 0.3817

‒ Test significance of X4
‒ T-statistic: 181.57
‒ P-value = 0.0000
Conclusions
• Variable X3 is not significant in predicting
Y

• Calculate the tolerance for X3


‒ X3 ~ X1 + X2 + X4
‒ R2 = 0.96
‒ Tolerance = 0.04
‒ VIF = 25

• Remove X3 from the regression and rerun!

Updated Regression
• R2adj = 99.94
‒ Note that the R2adj is the same as the
regression with all four variables

Steps 1 and 2
• Step 1
‒ F-statistic = 2675
‒ P-value = 0.0000

• Step 2
‒ Test significance of X1
‒ T-statistic: -42.62
‒ P-value = 0.0000

‒ Test significance of X2
‒ T-statistic: 208.82
‒ P-value = 0.0000

‒ Test significance of X4
‒ T-statistic: 181.46
‒ P-value = 0.0000
Things to Note
• When we reran the regression without X3,
the changes in the regression equation
and step 2 of the analysis were mostly to
X1

• This is not surprising since it was X1 and X3 which were highly correlated

Check Assumptions
• I would probably delete the two lowest observations in the residual vs. fitted plot and rerun

After removing observations
• Step 1 significant
• All three variables significant in Step 2

Y = 16.51 - 4.98 · X1 + 9.96 · X2 + 15.02 · X4

Outliers
• Removing observations in a linear
regression is often subjective

• Many packages will indicate observations which are possible outliers

• Running a regression with and without the observations and comparing them is best
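Rather than deleting points by eye, influence diagnostics can flag candidates; a sketch using a fitted statsmodels `results` object (the 4/n cutoff for Cook's distance is a common rule of thumb, not the only choice):

```python
# Cook's distance measures how much each observation moves the fit
influence = results.get_influence()
cooks_d = influence.cooks_distance[0]
flagged = [i for i, d in enumerate(cooks_d) if d > 4 / len(cooks_d)]
print("Observations worth re-examining:", flagged)
```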

