06 Regression

Prediction is a key task of statistics

Predict the height of a son who is chosen at random from 928 sons. The average height of sons, 68.1 in, is the 'best' predictor.

[Figure: histogram of the heights of 928 sons]

Predict the height of a son whose father is 72 in tall. This additional information about the father should allow us to make a better prediction. Regression does just that.

[Figure: scatterplot of son's height against father's height]
The correlation coefficient

[Figure: two scatterplots. Income vs. Education: direction sloping up, form exponential, strength weak. Son's height vs. Father's height: direction sloping up, form linear, strength weak.]

The scatterplot visualizes the relationship between two quantitative variables. It may have a direction (sloping up or down), form (a scatter that clusters around a line is called linear) and strength (how closely do the points follow the form?).
If the form is linear, then a good measure of strength is the correlation coefficient r. Our data are (xi, yi), i = 1, ..., n:

    r = (1/n) × Σ [(xi − x̄)/sx] × [(yi − ȳ)/sy], where the sum runs over i = 1, ..., n

The first factor in each term is the z-value of xi and the second is the z-value of yi. (Divide by n − 1 instead of n if this is also done for the standard deviations sx, sy.)
Correlation measures linear association
A numerical summary of these pairs of data is given by: x̄, sx, ȳ, sy, r.

By convention, the variable on the horizontal axis is called the explanatory variable or predictor; it is also called the independent variable because it is the variable that is manipulated by the researcher. The variable on the vertical axis is called the response variable, also called the dependent variable because it depends on the changes caused by the explanatory variable.

r is always between −1 and 1. The sign of r gives the direction of the association and its absolute value gives the strength:

[Figure: example scatterplots with r = −0.9, −0.6, 0, 0.2 and 1]

Since both x and y were standardized when computing r, r has no units and is not affected by changing the center or the scale of either variable.
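A minimal R sketch of this computation (using the n − 1 convention), where x and y are hypothetical data vectors; the last line illustrates that shifting or rescaling a variable leaves r unchanged:

    n <- length(x)
    zx <- (x - mean(x)) / sd(x)    # z-values for x (sd divides by n - 1)
    zy <- (y - mean(y)) / sd(y)    # z-values for y
    sum(zx * zy) / (n - 1)         # matches cor(x, y)
    cor(2 * x + 10, y)             # same value: r ignores center and scale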
Correlation measures linear association
Keep in mind that r is only useful for measuring linear association:

[Figure: a scatter with a strong non-linear pattern, yet r = 0]

Also remember that correlation does not mean causation:

Among school children there is a high correlation between shoe size and reading ability. Both are driven by the lurking variable 'age'.

[Figure: Reading score vs. Shoe size]

The lurking variable is the one that actually produces the correlation between two variables that were thought to have a causal relation.
The regression line
If the scatterplot shows a linear association, then this relationship can be summarized
by a line.
[Figure: two scatterplots of Percent body fat vs. Age. In the first, the average of the y-values near x = 40, e.g. (19+21+26+31+32)/5, is approximately the predicted value ŷ at x = 40. In the second, the regression line passes through the point of averages B = (x̄, ȳ), and a data point A = (x, y) is marked together with its vertical distance y − ŷ from the line.]

To find this line for n pairs of data (x1, y1), ..., (xn, yn), recall that the equation of a line produces the y-value ŷi = a + bxi. The idea is to choose the line that minimizes the sum of the squared distances between the observed yi and the ŷi. In other words, find a and b that minimize

    Σ (yi − ŷi)² = Σ (yi − (a + bxi))², where both sums run over i = 1, ..., n
The method of least squares

For n pairs of data (x1, y1), ..., (xn, yn), find a and b that minimize

    Σ (yi − ŷi)² = Σ (yi − (a + bxi))², where both sums run over i = 1, ..., n

This is the method of least squares. It turns out (the minimization is done with calculus) that b = r × (sy/sx) and a = ȳ − b x̄. This line ŷ = a + bx is called the regression line.

There is another interpretation of the regression line: it computes the average value of y when the first coordinate is near x (see the first picture on the previous slide). Remember that oftentimes an average is the 'best' predictor. This shows how the regression line incorporates the information given by x to produce a good predictor of y.
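These formulas are easy to check in R; a minimal sketch with hypothetical data vectors x and y:

    b <- cor(x, y) * sd(y) / sd(x)   # slope: b = r * sy/sx
    a <- mean(y) - b * mean(x)       # intercept: a = ybar - b * xbar
    c(intercept = a, slope = b)
    coef(lm(y ~ x))                  # lm returns the same line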
Regression to the mean

The main use of regression is to predict y from x (see this video: https://www.youtube.com/watch?v=1tSqSMOyNFE):

Given x, predict y to be ŷ = a + bx.

The prediction for y at x = x̄ is simply ŷ = ȳ, since ŷ = a + bx = ȳ + b(x − x̄).

But b = r × (sy/sx) means that if x is one standard deviation sx above x̄, then the predicted ŷ is only r × sy above ȳ. Since r is between −1 and 1, the prediction is 'towards the mean': ŷ is fewer standard deviations away from ȳ than x is from x̄.
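For example, with the exam scores used below (r = 0.67), a student who scores two standard deviations above average on the midterm is predicted to score only 2 × 0.67 = 1.34 standard deviations above average on the final.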

Regression to the mean is a statistical principle that reflects how extreme values are often followed by less extreme ones over repeated measurements; this can be influenced by various factors such as luck, random variation, and inherent variability in the data.

[Figure: Final score vs. Midterm score. The marked points have y-values not far from ȳ even though their x-values are considerably far from x̄; when this happens, we talk about regression 'to the mean'.]
Regression to the mean

This is called regression to the mean (or: the regression effect). It can be observed in data whose scatter is football-shaped (i.e. shaped like an American football), such as the exam scores: in such a test-retest situation, the top group on the test will drop down somewhat on the retest, while the bottom group moves up.

A heuristic explanation is this: to score among the very top on the midterm requires excellent preparation as well as some luck. This luck may not be there anymore on the final exam, and so we expect this group to fall back a bit.

This effect is simply a consequence of there being scatter around the line. Erroneously assuming that this occurs due to some action (e.g. 'the top scorers on the midterm slackened off') is the regression fallacy.
Predicting y from x and x from y
If we are given x, then we use the regression line ŷ = a + bx to predict y.
To find this regression line we need only x̄, ȳ, sx , sy and r.
We can use software to compute this line, e.g. ‘lm’ in R, but it can also be done
quickly by hand:
midterm = 49.5, final = 69.1, smid = 10.2, sfinal = 11.8, r = 0.67.
Predict the final exam score of a student who scored 41 on the midterm.
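A worked answer: b = r × (sfinal/smid) = 0.67 × (11.8/10.2) ≈ 0.775, so the predicted final score is ŷ = 69.1 + 0.775 × (41 − 49.5) ≈ 62.5.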
Predict x from y

Predict the midterm score of a student who scored 89 on the final.

When predicting x from y it is a mistake to use the regression line ŷ = a + bx, derived for regressing y on x, and solve for x. This is because regressing x on y will result in a different regression line. To avoid confusing these, always put the predictor on the x-axis and proceed as on the previous slide:

    x̂ − 49.5 = 0.67 × (10.2/11.8) × (89 − 69.1), so x̂ = 61.02.

[Figure: the same scatter with both regression lines: one for predicting y from x, one for predicting x from y]
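A minimal R sketch showing that the two regressions really give different lines, using simulated (hypothetical) data:

    set.seed(1)
    x <- rnorm(100)
    y <- 0.6 * x + rnorm(100)   # simulated correlated data
    coef(lm(y ~ x))             # regression line for predicting y from x
    coef(lm(x ~ y))             # regressing x on y gives a different line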
Normal approximation in regression

Regression requires that the scatter is football-shaped. Then one may use normal approximation for the y-values conditional on x. That is, the observations whose first coordinate is near that x have y-values that approximately follow the normal curve.

Remember that the predicted value ŷ at a value x is the average of the y-values of observations near that x (see slide 5).

To standardize, subtract off the predicted value ŷ, then divide by √(1 − r²) × sy, which is the standard deviation of that normal curve.

Among the students who scored around 41 on the midterm, what percentage scored above 60 on the final?
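A worked answer, sketched in R with the summary statistics from the previous slides (pnorm gives the area under the normal curve):

    r <- 0.67; sy <- 11.8
    y.hat <- 62.5                      # predicted final score at midterm = 41 (previous slide)
    sd.cond <- sqrt(1 - r^2) * sy      # SD of the normal curve: about 8.8
    1 - pnorm(60, mean = y.hat, sd = sd.cond)   # about 0.61, i.e. roughly 61%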
Residuals
The differences between observed and predicted y-values are called residuals:
ei = yi − ŷi , i = 1, . . . , n
Residuals are used to check whether the use of regression is appropriate. The residual
plot is a scatterplot of the residuals against the x-values. It should show an
unstructured horizontal band.
[Figure: Final score vs. Midterm score with the regression line, and the residual plot of the residuals against the Midterm score, which shows an unstructured horizontal band]
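A minimal R sketch of this check, where midterm and final are hypothetical vectors holding the scores:

    fit <- lm(final ~ midterm)            # regress final score on midterm score
    plot(midterm, resid(fit),             # residuals against the x-values
         xlab = "Midterm score", ylab = "Residuals")
    abline(h = 0)                         # look for an unstructured horizontal band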


Residual plots
A curved pattern suggests that the scatter is not linear:
[Figure: Income vs. Education with a curved scatter, and the residual plot against Education, which shows a curved pattern]

But it may still be possible to analyze these data with regression! Regression may be applicable after transforming the data, e.g. regress √income or log(income) on Education.
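A minimal R sketch of such a transformation, with hypothetical vectors income and education:

    fit <- lm(log(income) ~ education)    # or: lm(sqrt(income) ~ education)
    plot(education, resid(fit))           # re-check the residual plot
    abline(h = 0)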
Transformations of the variables

Another violation of the football-shaped assumption about the scatter arises if the scatter is heteroscedastic: the standard deviation of the response, monitored over different values of the explanatory variable, is not constant. (A scatter with constant spread is called homoscedastic.)

[Figure: residual plot whose spread changes across x]

A transformation of the y-variables may produce a homoscedastic scatter, i.e. result in equal spread of the residuals across x. (However, it may also result in a non-linear scatter, which may require a second transformation of the x-values to fix!)
Transformation of the variables
2000 Presidential Election in Florida, by county (without Palm Beach):

[Figure: Buchanan votes vs. Bush votes, and the residual plot against the Bush votes]

The residual plot looks heteroscedastic. Taking log of both variables produces a residual
plot that is very satisfactory:
[Figure: log(Buchanan) vs. log(Bush), and the residual plot against log(Bush)]
Outliers

Points with very large residuals (outliers) should be examined: they may represent typos or interesting phenomena. (Contrast this with influential points, discussed below, which may have small residuals.)

2000 Presidential Election in Florida, by county (this time including Palm Beach):

[Figure: Buchanan votes vs. Bush votes, and the residual plot; one county stands out as an extreme outlier]
Leverage and influential points
A point whose x-value is far from the mean of the x-values has high leverage: it has the potential to cause a big change in the regression line.

[Figure: scatterplot with one high-leverage point; if we drop it, a noticeably different line (shown in red) is fitted]

Whether it does change the line a lot (→ influential point) or not can only be
determined by refitting the regression without the point. An influential point may have
a small residual (because it is influential!), so a residual plot is not helpful for this
analysis.
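A minimal R sketch of this refit, with hypothetical vectors x and y and the suspect point at (hypothetical) index i:

    fit.all  <- lm(y ~ x)            # fit with every point
    fit.drop <- lm(y[-i] ~ x[-i])    # refit without the high-leverage point
    coef(fit.all)
    coef(fit.drop)                   # a large change in the slope marks an influential point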
Some other issues

• Avoid predicting y by extrapolation, i.e. at x-values that are outside the range of the x-values that were used for the regression: the linear relationship often breaks down outside a certain range. In other words, the regression line is only known to work over the range of the data used to create it.
• Beware of data that are summaries (e.g. averages of some data). Those are less variable than individual observations, and correlations between averages tend to overstate the strength of the relationship.
• Regression analyses often report 'R-squared': R² = r². It gives the fraction of the variation in the y-values that is explained by the regression line, so 1 − r² is the fraction of the variation in the y-values that is left in the residuals. (See the sketch below.)
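A minimal R sketch of this identity, with hypothetical vectors x and y:

    fit <- lm(y ~ x)
    summary(fit)$r.squared    # R-squared from the regression output
    cor(x, y)^2               # the same number: the squared correlation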
