06 Regression

Prediction is a key task of statistics

Predict the height of a son who is chosen at random from 928 sons. The average height of sons, 68.1 in, is the 'best' predictor.

[Figure: histogram of the heights of 928 sons]

Predict the height of a son whose father is 72 in tall. This additional information about the father should allow us to make a better prediction. Regression does just that.

[Figure: scatterplot of son's height against father's height]
The correlation coefficient

[Figure: two scatterplots. Income vs. Education: direction sloping up, form exponential, strength weak. Son's height vs. Father's height: direction sloping up, form linear, strength weak.]

The scatterplot visualizes the relationship between two quantitative variables. It may have a direction (sloping up or down), form (a scatter that clusters around a line is called linear) and strength (how closely do the points follow the form?).
If the form is linear, then a good measure of strength is the correlation coefficient r. Our data are (xi, yi), i = 1, ..., n:

    r = (1/n) × Σ [(xi − x̄)/sx] × [(yi − ȳ)/sy], where the sum runs over i = 1, ..., n

The first factor in each term is the z-value of xi and the second is the z-value of yi. (Divide by n − 1 instead of n if this is also done for the standard deviations sx, sy.)
Correlation measures linear association
A numerical summary of these pairs of data is given by: x̄, sx, ȳ, sy, r.

By convention, the variable on the horizontal axis is called the explanatory variable or predictor; it is also called the independent variable because it is the variable that is manipulated by the researcher. The variable on the vertical axis is called the response variable, also called the dependent variable because it depends on the changes caused by the explanatory variable.

r is always between −1 and 1. The sign of r gives the direction of the association and its absolute value gives the strength:

[Figure: example scatterplots with r = −0.9, −0.6, 0, 0.2 and 1]

Since both x and y were standardized when computing r, r has no units and is not affected by changing the center or the scale of either variable.
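A minimal R sketch of this computation (using the n − 1 convention), where x and y are hypothetical data vectors; the last line illustrates that shifting or rescaling a variable leaves r unchanged:

    n <- length(x)
    zx <- (x - mean(x)) / sd(x)    # z-values for x (sd divides by n - 1)
    zy <- (y - mean(y)) / sd(y)    # z-values for y
    sum(zx * zy) / (n - 1)         # matches cor(x, y)
    cor(2 * x + 10, y)             # same value: r ignores center and scale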
Correlation measures linear association
Keep in mind that r is only useful for measuring linear association:

[Figure: a scatter with a strong non-linear pattern, yet r = 0]

Also remember that correlation does not mean causation:

Among school children there is a high correlation between shoe size and reading ability. Both are driven by the lurking variable 'age'.

[Figure: Reading score vs. Shoe size]

The lurking variable is the one that actually produces the correlation between two variables that were thought to have a causal relation.
The regression line
If the scatterplot shows a linear association, then this relationship can be summarized
by a line.
[Figure: two scatterplots of Percent body fat vs. Age. In the first, the average of the y-values near x = 40, e.g. (19+21+26+31+32)/5, is approximately the predicted value ŷ at x = 40. In the second, the regression line passes through the point of averages B = (x̄, ȳ), and a data point A = (x, y) is marked together with its vertical distance y − ŷ from the line.]

To find this line for n pairs of data (x1, y1), ..., (xn, yn), recall that the equation of a line produces the y-value ŷi = a + bxi. The idea is to choose the line that minimizes the sum of the squared distances between the observed yi and the ŷi. In other words, find a and b that minimize

    Σ (yi − ŷi)² = Σ (yi − (a + bxi))², where both sums run over i = 1, ..., n
The method of least squares

For n pairs of data (x1, y1), ..., (xn, yn), find a and b that minimize

    Σ (yi − ŷi)² = Σ (yi − (a + bxi))², where both sums run over i = 1, ..., n

This is the method of least squares. It turns out (the minimization is done with calculus) that b = r × (sy/sx) and a = ȳ − b x̄. This line ŷ = a + bx is called the regression line.

There is another interpretation of the regression line: it computes the average value of y when the first coordinate is near x (see the first picture on the previous slide). Remember that oftentimes an average is the 'best' predictor. This shows how the regression line incorporates the information given by x to produce a good predictor of y.
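These formulas are easy to check in R; a minimal sketch with hypothetical data vectors x and y:

    b <- cor(x, y) * sd(y) / sd(x)   # slope: b = r * sy/sx
    a <- mean(y) - b * mean(x)       # intercept: a = ybar - b * xbar
    c(intercept = a, slope = b)
    coef(lm(y ~ x))                  # lm returns the same line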
Regression to the mean

The main use of regression is to predict y from x (see this video: https://www.youtube.com/watch?v=1tSqSMOyNFE):

Given x, predict y to be ŷ = a + bx.

The prediction for y at x = x̄ is simply ŷ = ȳ, since ŷ = a + bx = ȳ + b(x − x̄).

But b = r × (sy/sx) means that if x is one standard deviation sx above x̄, then the predicted ŷ is only r × sy above ȳ. Since r is between −1 and 1, the prediction is 'towards the mean': ŷ is fewer standard deviations away from ȳ than x is from x̄.
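For example, with the exam scores used below (r = 0.67), a student who scores two standard deviations above average on the midterm is predicted to score only 2 × 0.67 = 1.34 standard deviations above average on the final.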

Regression to the mean is a statistical principle that reflects how extreme values are often followed by less extreme ones over repeated measurements; this can be influenced by various factors such as luck, random variation, and inherent variability in the data.

[Figure: Final score vs. Midterm score. The marked points have y-values not far from ȳ even though their x-values are considerably far from x̄; when this happens, we talk about regression 'to the mean'.]
Regression to the mean

This is called regression to the mean (or: the regression effect). It can be observed in data whose scatter is football-shaped (i.e. shaped like an American football), such as the exam scores: in such a test-retest situation, the top group on the test will drop down somewhat on the retest, while the bottom group moves up.

A heuristic explanation is this: to score among the very top on the midterm requires excellent preparation as well as some luck. This luck may not be there anymore on the final exam, and so we expect this group to fall back a bit.

This effect is simply a consequence of there being scatter around the line. Erroneously assuming that this occurs due to some action (e.g. 'the top scorers on the midterm slackened off') is the regression fallacy.
Predicting y from x and x from y
If we are given x, then we use the regression line ŷ = a + bx to predict y.
To find this regression line we need only x̄, ȳ, sx , sy and r.
We can use software to compute this line, e.g. ‘lm’ in R, but it can also be done
quickly by hand:
midterm = 49.5, final = 69.1, smid = 10.2, sfinal = 11.8, r = 0.67.
Predict the final exam score of a student who scored 41 on the midterm.
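A worked answer: b = r × (sfinal/smid) = 0.67 × (11.8/10.2) ≈ 0.775, so the predicted final score is ŷ = 69.1 + 0.775 × (41 − 49.5) ≈ 62.5.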
Predict x from y

Predict the midterm score of a student who scored 89 on the final.

When predicting x from y it is a mistake to use the regression line ŷ = a + bx, derived for regressing y on x, and solve for x. This is because regressing x on y will result in a different regression line. To avoid confusing these, always put the predictor on the x-axis and proceed as on the previous slide:

    x̂ − 49.5 = 0.67 × (10.2/11.8) × (89 − 69.1), so x̂ = 61.02.

[Figure: the same scatter with both regression lines: one for predicting y from x, one for predicting x from y]
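A minimal R sketch showing that the two regressions really give different lines, using simulated (hypothetical) data:

    set.seed(1)
    x <- rnorm(100)
    y <- 0.6 * x + rnorm(100)   # simulated correlated data
    coef(lm(y ~ x))             # regression line for predicting y from x
    coef(lm(x ~ y))             # regressing x on y gives a different line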
Normal approximation in regression

Regression requires that the scatter is football-shaped. Then one may use normal approximation for the y-values conditional on x. That is, the observations whose first coordinate is near that x have y-values that approximately follow the normal curve.

Remember that the predicted value ŷ at a value x is the average of the y-values of observations near that x (see slide 5).

To standardize, subtract off the predicted value ŷ, then divide by √(1 − r²) × sy, which is the standard deviation of that normal curve.

Among the students who scored around 41 on the midterm, what percentage scored above 60 on the final?
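A worked answer, sketched in R with the summary statistics from the previous slides (pnorm gives the area under the normal curve):

    r <- 0.67; sy <- 11.8
    y.hat <- 62.5                      # predicted final score at midterm = 41 (previous slide)
    sd.cond <- sqrt(1 - r^2) * sy      # SD of the normal curve: about 8.8
    1 - pnorm(60, mean = y.hat, sd = sd.cond)   # about 0.61, i.e. roughly 61%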
Residuals
The differences between observed and predicted y-values are called residuals:
ei = yi − ŷi , i = 1, . . . , n
Residuals are used to check whether the use of regression is appropriate. The residual
plot is a scatterplot of the residuals against the x-values. It should show an
unstructured horizontal band.
[Figure: Final score vs. Midterm score with the regression line, and the residual plot of the residuals against the Midterm score, which shows an unstructured horizontal band]
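A minimal R sketch of this check, where midterm and final are hypothetical vectors holding the scores:

    fit <- lm(final ~ midterm)            # regress final score on midterm score
    plot(midterm, resid(fit),             # residuals against the x-values
         xlab = "Midterm score", ylab = "Residuals")
    abline(h = 0)                         # look for an unstructured horizontal band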


Residual plots
A curved pattern suggests that the scatter is not linear:
[Figure: Income vs. Education with a curved scatter, and the residual plot against Education, which shows a curved pattern]

But it may still be possible to analyze these data with regression! Regression may be applicable after transforming the data, e.g. regress √income or log(income) on Education.
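A minimal R sketch of such a transformation, with hypothetical vectors income and education:

    fit <- lm(log(income) ~ education)    # or: lm(sqrt(income) ~ education)
    plot(education, resid(fit))           # re-check the residual plot
    abline(h = 0)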
Transformations of the variables

Another violation of the football-shaped assumption about the scatter arises if the scatter is heteroscedastic: the standard deviation of the response, monitored over different values of the explanatory variable, is not constant. (A scatter with constant spread is called homoscedastic.)

[Figure: residual plot whose spread changes across x]

A transformation of the y-variables may produce a homoscedastic scatter, i.e. result in equal spread of the residuals across x. (However, it may also result in a non-linear scatter, which may require a second transformation of the x-values to fix!)
Transformation of the variables
2000 Presidential Election in Florida, by county (without Palm Beach):

[Figure: Buchanan votes vs. Bush votes, and the residual plot against the Bush votes]

The residual plot looks heteroscedastic. Taking log of both variables produces a residual
plot that is very satisfactory:
[Figure: log(Buchanan) vs. log(Bush), and the residual plot against log(Bush)]
Outliers

Points with very large residuals (outliers) should be examined: they may represent typos or interesting phenomena. (Contrast this with influential points, discussed below, which may have small residuals.)

2000 Presidential Election in Florida, by county (this time including Palm Beach):

[Figure: Buchanan votes vs. Bush votes, and the residual plot; one county stands out as an extreme outlier]
Leverage and influential points
A point whose x-value is far from the mean of the x-values has high leverage: it has the potential to cause a big change in the regression line.

[Figure: scatterplot with one high-leverage point; if we drop it, a noticeably different line (shown in red) is fitted]

Whether it does change the line a lot (→ influential point) or not can only be
determined by refitting the regression without the point. An influential point may have
a small residual (because it is influential!), so a residual plot is not helpful for this
analysis.
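A minimal R sketch of this refit, with hypothetical vectors x and y and the suspect point at (hypothetical) index i:

    fit.all  <- lm(y ~ x)            # fit with every point
    fit.drop <- lm(y[-i] ~ x[-i])    # refit without the high-leverage point
    coef(fit.all)
    coef(fit.drop)                   # a large change in the slope marks an influential point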
Some other issues

• Avoid predicting y by extrapolation, i.e. at x-values that are outside the range of the x-values that were used for the regression: the linear relationship often breaks down outside a certain range. In other words, the regression line is only known to work over the range of the data used to create it.
• Beware of data that are summaries (e.g. averages of some data). Those are less variable than individual observations, and correlations between averages tend to overstate the strength of the relationship.
• Regression analyses often report 'R-squared': R² = r². It gives the fraction of the variation in the y-values that is explained by the regression line, so 1 − r² is the fraction of the variation in the y-values that is left in the residuals. (See the sketch below.)
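A minimal R sketch of this identity, with hypothetical vectors x and y:

    fit <- lm(y ~ x)
    summary(fit)$r.squared    # R-squared from the regression output
    cor(x, y)^2               # the same number: the squared correlation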
