Preliminary note: Random variable and realization 1/2
Preliminary note: Random variable and realization 2/2
f(x) = β0 + β1 x

so that

Y = β0 + β1 x + ε

where
- β0 is the intercept,
- β1 is the slope,
- ε is the random error component.

β0 and β1 are unknown constants, also known as the model parameters; ε is assumed to have mean zero and unknown variance σ².
The equation

Y = β0 + β1 x + ε

may be viewed as a population regression model.
Assume we have a sample of n pairs of data, say (x1, y1), (x2, y2), . . . , (xn, yn).
Each yi (i = 1, . . . , n) is assumed to be a realization of the random variable

Yi = β0 + β1 xi + εi

which represents the sample regression model. Further, we assume that the errors are uncorrelated, i.e. the value of one error does not depend on the value of any other error.
Parameter estimates
We denote by β̂0, β̂1 the estimates of β0, β1 based on the training data.
sales ≈ β0 + β1 × newspaper
[Figure: scatterplot of sales versus newspaper (Advertising data)]
Given the training data, we can produce estimates β̂0 and β̂1 for β0, β1.
sales ≈ β0 + β1 × TV
[Figure: scatterplot of sales versus TV (Advertising data)]
The least squares estimates β̂0, β̂1 minimize the sum of squares S(β0, β1) = ∑ᵢ₌₁ⁿ (yᵢ − β0 − β1 xᵢ)², i.e.

∂S/∂β0 |β0=β̂0, β1=β̂1 = −2 ∑ᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1 xᵢ) = 0

∂S/∂β1 |β0=β̂0, β1=β̂1 = −2 ∑ᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1 xᵢ) xᵢ = 0
β̂1 = σxy / σx² = Sxy / SSx   and   β̂0 = ȳ − β̂1 x̄,

where

x̄ = (1/n) ∑ᵢ₌₁ⁿ xᵢ        ȳ = (1/n) ∑ᵢ₌₁ⁿ yᵢ

σx² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = SSx/n        σxy = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) = Sxy/n

SSx = ∑ᵢ₌₁ⁿ (xᵢ − x̄)²        Sxy = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ).
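As a quick check of these formulas, the closed-form estimates can be computed directly in R and compared with lm(); a minimal sketch, with made-up toy data (any numeric vectors x and y of equal length will do):

x <- c(1, 2, 3, 4, 5)                       # toy predictor values (illustrative only)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)             # toy responses
Sxy <- sum((x - mean(x)) * (y - mean(y)))   # corrected cross-product
SSx <- sum((x - mean(x))^2)                 # corrected sum of squares of x
beta1.hat <- Sxy / SSx                      # slope estimate
beta0.hat <- mean(y) - beta1.hat * mean(x)  # intercept estimate
c(beta0.hat, beta1.hat)
coef(lm(y ~ x))                             # lm() returns the same values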
Residuals
Define the residual as the difference between the observed value yᵢ and the corresponding fitted value f̂(xᵢ) = β̂0 + β̂1 xᵢ:

eᵢ = yᵢ − ŷᵢ = yᵢ − (β̂0 + β̂1 xᵢ),   i = 1, . . . , n.
Consider

f̂(x + 1) − f̂(x) = (β̂0 + β̂1 (x + 1)) − (β̂0 + β̂1 x) = β̂1

thus the slope β̂1 measures the average variation in Y associated with a one-unit increase in X.
- β̂0 is the intercept, i.e. the expected value of Y for x = 0, provided that x = 0 is in the range of X in the training data.
[Figures: least squares fits of sales on newspaper and of sales on TV]
lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)
[Figure: two scatterplots of Y versus X with least squares fits; points A and B are marked]
The slope in the least squares fit depends heavily on the points A and B: they are influential observations.
Situations such as this often require corrective action, such as further analysis and possible deletion of the unusual points, or estimation of the model parameters with some robust technique.
In this case one of the 18 observations is very remote in x space. The slope
is largely determined by the extreme point.
[Figure: two scatterplots of Y versus X with the remote point C marked]
[Figure: two scatterplots of Y versus X with point D marked]
If this point is really an outlier, then the estimate of the intercept may be
incorrect.
or, equivalently,

TSS = SSf + RSS

where

TSS = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²        SSf = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²        RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)².

Meaning of R²
R² = SSf/TSS = 1 − RSS/TSS gives the proportion of the variability of Y explained by the regressor X.

Remark
The statistic R² should be used with caution, since it is always possible to make R² large by adding enough terms to the model.
Degrees of freedom
Consider the identity

TSS = SSf + RSS.

- The total sum of squares TSS = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² has νT = n − 1 degrees of freedom, because one degree of freedom is lost as a result of the constraint ∑ᵢ₌₁ⁿ (yᵢ − ȳ) = 0;
- the model sum of squares SSf = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)² has νf = 1 degree of freedom, because SSf is completely determined by the regression parameter β̂1;
- the residual sum of squares RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² has νR = n − 2 degrees of freedom, because two constraints are imposed as a result of estimating β̂0 and β̂1.

Thus:
νT = νf + νR
n − 1 = 1 + (n − 2)
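The decomposition TSS = SSf + RSS, the resulting R², and the degrees of freedom can all be verified numerically; a minimal sketch, reusing the toy vectors x and y from the sketch above:

fit <- lm(y ~ x)
y.hat <- fitted(fit)
TSS <- sum((y - mean(y))^2)        # total sum of squares, n - 1 df
SSf <- sum((y.hat - mean(y))^2)    # model sum of squares, 1 df
RSS <- sum((y - y.hat)^2)          # residual sum of squares, n - 2 df
all.equal(TSS, SSf + RSS)          # the identity TSS = SSf + RSS
SSf / TSS                          # R^2, matching summary(fit)$r.squared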
lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)
eᵢ = yᵢ − ŷᵢ,   i = 1, . . . , n
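In R the residuals of a fitted model are returned by residuals(); a minimal sketch (fit is the lm object from the sketch above):

e <- residuals(fit)                # the residuals e_i = y_i - y.hat_i
all.equal(e, y - fitted(fit))      # the same quantities computed by hand
sum(e)                             # essentially zero when the model has an intercept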
[Figure: data with fitted regression line and corresponding residual plot]
[Figure: data with regression model and corresponding residual plot (non-linearity)]
- Small departures from the normality assumption do not affect the model greatly, but gross non-normality is potentially more serious, as inference on the parameters depends on the normality assumption.
- A very simple method of checking the normality assumption is to construct a Q-Q plot of the residuals, a graph designed so that the cumulative normal distribution plots as a straight line. In other words, the Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess whether a set of data plausibly came from some theoretical distribution such as a normal or exponential.
- Recall that by a quantile we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point below which 30% of the data fall and above which 70% fall.
- Q-Q plots take your sample data, sort them in ascending order, and then plot them versus quantiles calculated from a theoretical distribution. The number of quantiles is selected to match the size of your sample data.
- R → qqnorm(), qqline()
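A minimal sketch of this check in R (fit is any lm object):

e <- residuals(fit)
qqnorm(e)    # empirical quantiles of the residuals versus theoretical normal quantiles
qqline(e)    # reference line: points far from it suggest non-normality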
[Figures: densities of the residual distribution and corresponding normal Q-Q plots, for several simulated error distributions]
[Figures: residual plots and normal Q-Q plots of the residuals (empirical quantiles versus theoretical quantiles)]
Y = f (x) + ε
Y = β0 + β1 x + ε
Note
The true relationship is generally not known for real data, but the least squares
line can always be computed using the training data.
[Figures: simulated data sets generated from Y = β0 + β1 x + ε, each with its fitted least squares line]
Var(B1) = σ² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = σ² / SSx

Var(B0) = σ² (1/n + x̄² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²) = σ² (1/n + x̄² / SSx).
In summary
Roughly speaking, se is the average amount that the response will deviate from the true regression line.
V̂ar(B1) = s²e / ∑ᵢ₌₁ⁿ (xᵢ − x̄)² = s²e / SSx

V̂ar(B0) = s²e (1/n + x̄² / ∑ᵢ₌₁ⁿ (xᵢ − x̄)²) = s²e (1/n + x̄² / SSx).
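These standard errors can be reproduced by hand and checked against the summary() table; a minimal sketch, continuing the toy example (fit <- lm(y ~ x) as above):

n <- length(x)
SSx <- sum((x - mean(x))^2)                      # corrected sum of squares of x
se2 <- sum(residuals(fit)^2) / (n - 2)           # s_e^2, the estimate of sigma^2
SE.b1 <- sqrt(se2 / SSx)                         # estimated standard error of the slope
SE.b0 <- sqrt(se2 * (1 / n + mean(x)^2 / SSx))   # estimated standard error of the intercept
c(SE.b0, SE.b1)
summary(fit)$coefficients[, "Std. Error"]        # the same values reported by summary()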
lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)
X Y1 Y2 Y3 Z W
10 8.04 9.14 7.46 8 6.58
8 6.95 8.14 6.77 8 5.76
13 7.58 8.74 12.74 8 7.71
9 8.81 8.77 7.11 8 8.84
11 8.33 9.26 7.81 8 8.47
14 9.96 8.1 8.84 8 7.04
6 7.24 6.13 6.08 8 5.25
4 4.26 3.1 5.39 8 5.56
12 10.84 9.13 8.15 8 7.91
7 4.82 7.26 6.42 8 6.89
5 5.68 4.74 5.73 19 12.5
[Figures: scatterplots of Y1, Y2, Y3 versus X and of W versus Z, with fitted least squares lines]
Conclusions:
- The first model seems to be perfectly appropriate;
- Figure 2 suggests that Y has a smooth curved relation with X, possibly quadratic;
- in Figure 3, all but one of the observations lie close to a straight line;
- Figure 4 shows that one observation has played a critical role.
Conclusion
A good statistical analysis always begins with a good graphical analysis.
Prediction 1/3
We use the least squares line

ŷ = β̂0 + β̂1 x

in order to predict the response Y on the basis of a set of values for the predictor X.
However, there are three sorts of uncertainty associated with this prediction:
1. reducible errors,
2. model bias,
3. irreducible errors.
Prediction - 2/3
Reducible errors
The coefficient estimates β̂0, β̂1 are estimates for β0, β1. That is, the least squares line

ŷ = β̂0 + β̂1 x

is only an estimate for the true population regression line

f(x) = β0 + β1 x.
Prediction - 3/3
Irreducible errors
Even if we knew f (x) – that is, even if we knew the true values for β0 , β1 – the
response value cannot be predicted perfectly because of the random error ε
in the model
Y = β0 + β1 x + ε.
[Figure: standard normal density; P(−1.96 ≤ Z ≤ 1.96) = 0.95]
[Figure: standard normal density; P(−2 ≤ Z ≤ 2) = 0.9545]

β̂1 ± 2 SE(β̂1)
Footnote: the Gamma function Γ is a positive function defined for x > 0 with two main properties: Γ(x) = (x − 1)Γ(x − 1); if x = n is a positive integer, then Γ(n) = (n − 1)!.
[Figure: densities of t(1), t(2), t(3), t(10), t(20) compared with N(0,1)]

ν = 10: let Q ∼ t(10); then P(−2.23 ≤ Q ≤ 2.23) = 0.95.
[Figures: central intervals of t(ν) distributions for increasing degrees of freedom ν. For Q ∼ t(10), P(−2 ≤ Q ≤ 2) = 0.9266; as ν grows, the 95% interval shrinks from ±2.23 through ±2.04 and ±2 to ±1.98, while P(−2 ≤ Q ≤ 2) rises through 0.9454 and 0.9518 toward the normal value 0.9545.]
- Given training data (x1, y1), (x2, y2), . . . , (xn, yn) of size n, get the estimates β̂0, β̂1 of β0, β1;
- compute the residual standard error se and afterwards SE(β̂0), SE(β̂1);
- compute the intervals

[β̂0 − 2 SE(β̂0), β̂0 + 2 SE(β̂0)] = β̂0 ± 2 SE(β̂0)

[β̂1 − 2 SE(β̂1), β̂1 + 2 SE(β̂1)] = β̂1 ± 2 SE(β̂1)

The interval β̂0 ± 2 SE(β̂0) may or may not contain the true value β0; analogously the interval β̂1 ± 2 SE(β̂1) may or may not contain the true value β1.
(B0 − β0) / SE(B0) ∼ t(n − 2)   and   (B1 − β1) / SE(B1) ∼ t(n − 2)

where t(n − 2) denotes a t distribution with n − 2 degrees of freedom and n is the sample size.
Then:
- approximately 95% of the intervals β̂0 ± 2 SE(β̂0) will contain β0,
- approximately 95% of the intervals β̂1 ± 2 SE(β̂1) will contain β1.
lm(sales ~ TV) -> lmTV.fit
confint(lmTV.fit)
                  2.5 %     97.5 %
(Intercept) 6.12971927 7.93546783
TV          0.04223072 0.05284256

lm(sales ~ newspaper) -> lmNEWS.fit
confint(lmNEWS.fit)
                   2.5 %      97.5 %
(Intercept) 11.12595560 13.57685854
newspaper    0.02200549  0.08738071
[Figures: sales versus TV and sales versus newspaper with fitted least squares lines and confidence bands]
Even if we knew f (x) – that is, even if we knew the true values for β0 , β1 – the
response value cannot be predicted perfectly because of the random error ε
in the model
Y = β0 + β1 x + ε.
This error is referred to as irreducible error.
How much will Y vary from Ŷ?
We use prediction intervals on a future observation at x to answer this question.
Prediction intervals are always wider than confidence intervals, because they
incorporate both the error in the estimate for f (x) (the reducible error) and the
uncertainty as to how much an individual point will differ from the population
regression plane (the irreducible error).
or, equivalently,

ŷ ± 2 × sŶ|x,

where

sŶ|x = √( s²e (1 + 1/n + (x − x̄)² / SSx) )

and

s²e = ∑ᵢ₌₁ⁿ eᵢ² / (n − 2).
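In R both kinds of intervals are produced by predict(); a minimal sketch (the toy fit <- lm(y ~ x) from above, with predictor named x):

new <- data.frame(x = 3.5)                   # a new value of the predictor
predict(fit, new, interval = "confidence")   # interval for the mean response at x
predict(fit, new, interval = "prediction")   # wider interval for a single new observation at x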
[Figures: sales versus TV and sales versus newspaper with fitted least squares lines and prediction bands]
H0 : β1 = 0
H1 : β1 ≠ 0.
t* = (β̂1 − 0) / SE(β̂1)

which measures the number of standard deviations that β̂1 is away from 0.
Thus the p-value, under the null hypothesis β1 = 0, measures the probability of observing a test statistic larger (in absolute value) than the observed value t*.
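A minimal sketch of the computation behind the t value and p-value columns of the summary() table (fit and n as in the toy example above):

b <- coef(summary(fit))                              # estimates and standard errors
t.star <- b["x", "Estimate"] / b["x", "Std. Error"]  # t* for H0: beta1 = 0
p.value <- 2 * pt(-abs(t.star), df = n - 2)          # two-sided p-value
c(t.star, p.value)                                   # match the x row of summary(fit)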
In summary, the smaller the p-value, the smaller the evidence in favour of the null hypothesis (or the larger the evidence against it).
[Figures: two-tailed rejection regions ±t′ of the t distribution, with tail areas 0.025 each (5% level) and 0.005 each (1% level)]
[Figures: two simulated data sets of y versus x with fitted least squares lines]
lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)
Conclusion
We can conclude that β0 and β1 are statistically different from 0.
Introduction

Y = β0 + β1 x1 + · · · + βp xp + ε,

where
- Y is the response,
- X1, X2, . . . , Xp are the predictors,
- β1, . . . , βp are the regression coefficients, where βj quantifies the association between the variable Xj (j = 1, . . . , p) and the response Y,
- ε is the random error term, where we assume again ε ∼ N(0, σ²).
f(x1, x2) = β0 + β1 x1 + β2 x2,

so

f(x1 + 1, x2) = β0 + β1 (x1 + 1) + β2 x2.

Thus:

f(x1 + 1, x2) − f(x1, x2) = β1,

i.e. β1 measures the average change in the response for a one-unit increase in x1, holding x2 fixed.
The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression: let β̂0, β̂1, . . . , β̂p denote the estimates of β0, β1, . . . , βp and set

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp.
Thus, it can be proved that the estimated parameters β̂0, β̂1, . . . , β̂p are given by

β̂ = (X′X)⁻¹ X′y

where X′ denotes the transpose of X.
Thus the fitted values are given by

ŷ = Xβ̂ = X(X′X)⁻¹ X′y.
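A minimal sketch of this matrix computation in R, with made-up toy predictors (the names X1, X2 are illustrative only):

set.seed(1)
X1 <- runif(20); X2 <- runif(20)             # toy predictors
y <- 1 + 2 * X1 - X2 + rnorm(20, sd = 0.1)   # toy response
X <- cbind(1, X1, X2)                        # design matrix with a column of ones
beta.hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) beta = X'y
drop(beta.hat)
coef(lm(y ~ X1 + X2))                        # lm() gives the same estimates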
[Figure: least squares plane over the predictors X1 and X2]
Call:
lm(formula = sales ~ TV + radio + newspaper)
- This illustrates that the simple and multiple regression coefficients can be quite different.
- This difference stems from the fact that in the simple regression case, the slope term represents the average effect of a $1,000 increase in newspaper advertising, ignoring other predictors such as TV and radio.
- In contrast, in the multiple regression setting, the coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed.
- Notice that the correlation between radio and newspaper is 0.35. This reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising.
where

TSS = ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²        SSf = ∑ᵢ₌₁ⁿ (ŷᵢ − ȳ)²        RSS = ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)².
νT = νf + νR
n − 1 = p + (n − p − 1)
In the simple linear regression setting we test
H0 : β1 = 0
H1 : β1 ≠ 0.
In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero. Thus, we test the hypothesis system
H0 : β1 = β2 = · · · = βp = 0
H1 : at least one βj is non-zero.
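The F statistic for this global test is part of the standard lm output; a minimal sketch (assuming the Advertising variables sales, TV, radio, newspaper are available, as in the slides):

fit <- lm(sales ~ TV + radio + newspaper)
f <- summary(fit)$fstatistic                                # F value with its two df
f
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)  # p-value of the F test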
[Figure: densities of the F distributions F(2,5), F(5,10), F(10,5), F(10,20)]
It is possible that all of the predictors are associated with the response, but it is more often the case that the response is only related to a subset of the predictors.
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.
Ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of predictors.
Unfortunately, there are a total of 2ᵖ models that contain subsets of p variables.
2. Backward selection.
   - We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
   - The new (p − 1)-variable model is fit, and the variable with the largest p-value is removed.
   - This procedure continues until a stopping rule is reached. For instance, we may stop when all remaining variables have a p-value below some threshold.
step(lm(sales ~ TV + radio + newspaper), direction = "backward")
step(lm(sales ~ TV + radio + newspaper), direction = "forward")
step(lm(sales ~ TV + radio + newspaper), direction = "both")
The most common numerical measure of model fit is the R², the fraction of variance explained. This quantity is computed and interpreted in the same fashion as for simple linear regression:

R² = (TSS − RSS)/TSS = 1 − RSS/TSS = 1 − ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / ∑ᵢ₌₁ⁿ (yᵢ − ȳ)² = 1 − ∑ᵢ₌₁ⁿ eᵢ² / ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²

where:
- TSS measures the Total Sum of Squares of Y,
- RSS measures the Sum of Squares of the Residuals, i.e. the amount of variability that is left unexplained.
An R² value close to 1 indicates that the model explains a large proportion of the variance in the response variable.
predictor(s)           R²
newspaper              0.0512
radio                  0.3320
TV                     0.6119
TV+radio               0.8971943
TV+newspaper           0.6458
radio+newspaper        0.3327
TV+radio+newspaper     0.8972106

- The model that uses all three advertising media to predict sales has an R² of 0.8972106;
- the model using only TV and radio has an R² value of 0.8971943 (approximately the same);
- in other words, there is a small increase in R² if we include newspaper advertising in the model that already contains TV and radio advertising;
- it turns out that R² never decreases when a variable is added to the model, even if this variable is only weakly associated with the response.
We remind that:
- RSS/(n − p − 1) is an unbiased estimate of the error variance σ²,
- TSS/(n − 1) is an unbiased estimate of the total variance of Y.
predictor(s)           R²          R²Adj
newspaper              0.0512      0.0473
radio                  0.3320      0.3287
TV                     0.6119      0.6099
TV+radio               0.8971943   0.8962
TV+newspaper           0.6458      0.6422
radio+newspaper        0.3327      0.3259
TV+radio+newspaper     0.8972106   0.8956
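The adjusted R² penalizes the number of predictors p; a standard formula relating the two measures is

R²Adj = 1 − (1 − R²)(n − 1)/(n − p − 1).

A minimal sketch in R (fit is any lm object):

r2 <- summary(fit)$r.squared
n <- nrow(model.frame(fit))               # number of observations
p <- length(coef(fit)) - 1                # number of predictors
1 - (1 - r2) * (n - 1) / (n - p - 1)      # matches summary(fit)$adj.r.squared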
Note that the model TV+radio has an R²Adj larger than the model TV+radio+newspaper.
Y = β0 + β1 x1 + · · · + βp xp + ε,

where ε ∼ N(0, σ²).
In general, if the model has p predictors the estimate s²e of σ² is given by

s²e = ∑ᵢ₌₁ⁿ eᵢ² / (n − p − 1) = RSS / (n − p − 1)

where the eᵢ are the residuals, i.e. eᵢ = yᵢ − ŷᵢ.

RSE = se = √( RSS / (n − p − 1) )

is known as the residual standard error, where n − p − 1 is the number of degrees of freedom.
[Figure: least squares regression plane of Sales on TV and Radio]
We see that some observations lie above and some observations lie below
the least squares regression plane.
Notice that there is a clear pattern of negative residuals, followed by positive
residuals, followed by negative residuals.
The positive residuals (those visible above the surface) tend to lie along the 45-degree line, where the TV and Radio budgets are split evenly. The negative residuals (most not visible) tend to lie away from this line, where budgets are more lopsided.
We use

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp

in order to predict the response Y on the basis of a set of values for the predictors X1, X2, . . . , Xp. However, like in the simple regression model, there are three sorts of uncertainty associated with this prediction:
1. reducible errors,
2. model bias,
3. irreducible errors.
f(x) = β0 + β1 x1 + · · · + βp xp.

Model bias
Of course, in practice assuming a linear model for f(x) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias. So when we use a linear model, we are in fact estimating the best linear approximation to the true line.
Irreducible errors
Even if we knew f(x) – that is, even if we knew the true values for β0, β1, . . . , βp – the response value cannot be predicted perfectly because of the random error ε in the model

Y = β0 + β1 x1 + · · · + βp xp + ε.

This error is referred to as irreducible error.
How much will Y vary from Ŷ?
We use prediction intervals to answer this question, like in the case of simple regression models.
Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).
[Figure: scatterplot matrix of the Credit data: Balance, Age, Cards, Education, Income, Limit, Rating]
and use this variable as a predictor in the regression equation (xᵢ = 1 if the ith person is a female, xᵢ = 0 if male). This results in the model

Yi = β0 + β1 xi + εi = { β0 + β1 + εi   if the ith person is a female
                         β0 + εi        if the ith person is a male.
attach(Credit)
levels(factor(Gender))
[1] " Male" "Female"
contrasts(Gender)
Female
Male 0
Female 1
Call:
lm(formula = Balance ~ Gender)
Now:
- β0 can be interpreted as the average credit card balance among males,
- β0 + β1 as the average credit card balance among females, and β1 as the average difference in credit card balance between females and males.
However, we notice that the p-value for the dummy variable is very high. This
indicates that there is no statistical evidence of a difference in average credit card
balance between females and males.
Now:
- β0 can be interpreted as the average credit card balance for African Americans,
- β1 can be interpreted as the difference in the average balance between the Asian and African American categories, and
- β2 can be interpreted as the difference in the average balance between the Caucasian and African American categories.
Then both of these variables can be used in the regression equation, in order to obtain the model

Yi = β0 + β1 xi1 + β2 xi2 + εi = { β0 + β1 + εi   if the ith person is Asian
                                   β0 + β2 + εi   if the ith person is Caucasian
                                   β0 + εi        if the ith person is African American.
There will always be one fewer dummy variable than the number of levels.
The level with no dummy variable – African American in this example – is
known as the baseline.
levels(factor(Ethnicity))
[1] "African American" "Asian" "Caucasian"
contrasts(Ethnicity)
Asian Caucasian
African American 0 0
Asian 1 0
Caucasian 0 1
Call:
lm(formula = Balance ~ Ethnicity)
Non-linear relationships
The linear model

Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε

can also accommodate non-linear relationships between a predictor and the response. Consider, for instance, the polynomial model

Y = β0 + β1 x + β2 x² + β3 x³ + ε.

If we let

x1 = x    x2 = x²    x3 = x³

then we can rewrite it as

Y = β0 + β1 x1 + β2 x2 + β3 x3 + ε,

which is again a multiple linear regression model.
[Figure: mpg versus horsepower (Auto data)]
lm(mpg ~ horsepower) -> lm.fit
summary(lm.fit)

Call: lm(formula = mpg ~ horsepower)
Residuals:
     Min       1Q   Median       3Q      Max
-13.5710  -3.2592  -0.3435   2.7630  16.9240
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Figure: mpg versus horsepower with the fitted straight line]
Clearly, there is a large bias because the straight line does not fit the data pattern.
lm(mpg ~ horsepower + I(horsepower^2)) -> lm.fit
summary(lm.fit)

[Figure: mpg versus horsepower with the fitted quadratic curve]
[Figure: mpg versus horsepower with linear, degree-2, and degree-5 polynomial fits]
The linear regression fit is shown in orange. The linear regression fit for a model that includes horsepower² is shown as a blue curve. The linear regression fit for a model that includes all polynomials of horsepower up to fifth degree is shown in green.
Y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε.

If we let

x3 = x1 x2   and   β3 = β12

then we can write

Y = β0 + β1 x1 + β2 x2 + β3 x3 + ε

which is a linear regression model.
Note that we can also write

Y = β0 + β̃1 x1 + β2 x2 + ε

where β̃1 = β1 + β3 x2.
Since β̃1 changes with x2, the effect of X1 on Y is no longer constant: adjusting X2 will change the impact of X1 on Y.
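In R an interaction model can be specified either with an explicit product term or with the * shorthand; a minimal sketch with made-up data (the names x1, x2 are illustrative only):

set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50)                       # toy predictors
y <- 1 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(50)   # toy response with interaction
coef(lm(y ~ x1 * x2))   # x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)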
Consider again the Advertising data. A linear model that uses radio, TV and an interaction between the two to predict sales takes the form

sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε.
These results strongly suggest that the model that includes the interaction term is superior to
the model that contains only main effects.
The p-value for the interaction term, radio × TV, is extremely low, indicating that there is strong evidence for H1 : β3 ≠ 0.
In other words, it is clear that the true relationship is not additive.
The R² for this model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term. This means that

(96.8 − 89.7)/(100 − 89.7) = 69%

of the variability in sales that remains after fitting the additive model has been explained by the interaction term.

An increase in TV advertising of $1,000 will be associated with an increase in sales of

(β̂1 + β̂3 × radio) × 1,000 = 19 + 1.1 × radio units.

And an increase in radio advertising of $1,000 will be associated with an increase in sales of

(β̂2 + β̂3 × TV) × 1,000 = 29 + 1.1 × TV units.
Y = β0 + β1 x + β2 x² + β3 x³ + ε
Y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε
Consider the Credit data, and suppose that we wish to predict balance using the income (quantitative) and student (qualitative) variables. In the absence of an interaction term, the model takes the form

balanceᵢ ≈ β0 + β1 × incomeᵢ + { β2   if the ith person is a student
                                 0    if the ith person is not a student
Notice that this amounts to fitting two parallel lines to the data, one for
students and one for non-students. The lines for students and non-students
have different intercepts, β0 + β2 versus β0 , but the same slope, β1 .
[Figure: balance versus income with parallel fitted lines for students and non-students]
The fact that the lines are parallel means that the average effect on balance of a
one-unit increase in income does not depend on whether or not the individual is a
student.
This represents a potentially serious limitation of the model, since in fact a change in
income may have a very different effect on the credit card balance of a student versus
a non-student.
This limitation can be addressed by adding an interaction variable, created by
multiplying income with the dummy variable for student.
With the interaction term, the model becomes

balanceᵢ ≈ β0 + β1 × incomeᵢ + { β2 + β3 × incomeᵢ   if the ith person is a student
                                 0                    if not a student

         ≈ { (β0 + β2) + (β1 + β3) × incomeᵢ   if student
             β0 + β1 × incomeᵢ                 if not a student

Once again, we have two different regression lines for the students and the non-students.
Those regression lines have different intercepts, β0 + β2 versus β0, as well as different slopes, β1 + β3 versus β1.
This allows for the possibility that changes in income may affect the credit card balances of students and non-students differently.
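A minimal sketch of this model in R, assuming the Credit data frame comes from the ISLR package (variables Balance, Income, Student):

library(ISLR)                                          # assumption: ISLR provides Credit
fit <- lm(Balance ~ Income * Student, data = Credit)   # main effects plus interaction
coef(fit)   # StudentYes shifts the intercept; Income:StudentYes shifts the slope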
[Figure: balance versus income with separate fitted lines for students and non-students]
The figure shows the estimated relationships between income and balance for students
and non-students in this model.
We note that the slope for students is lower than the slope for non-students.
This suggests that increases in income are associated with smaller increases in credit
card balance among students as compared to non-students.
Introduction
When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.
Some of these issues have already been analysed for simple regression models. Others are typical of multiple linear regression models.
mpg = β0 + β1 × horsepower + ε
[Figure: residual plot for the linear fit of mpg on horsepower]
The red line is a smooth fit to the residuals, which is displayed in order to
make it easier to identify any trends.
The residuals exhibit a clear U-shape, which provides a strong indication of
non-linearity in the data.
[Figure: residual plot for the quadratic fit of mpg on horsepower]
- In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant.
- In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
[Figure: residuals over time for data generated with ρ = 0.0]
The residuals from a linear regression fit to data generated with uncorrelated
errors. There is no evidence of a time-related trend in the residuals.
[Figure: residuals over time for data generated with ρ = 0.5]
The residuals illustrate a more moderate case in which the residuals had a correlation of 0.5. There is still evidence of tracking, but the pattern is less clear.
[Figure: ρ = 0.9, residuals vs. observation index.]
The residuals are from a data set in which adjacent errors had a correlation
of 0.9.
There is a clear pattern in the residuals (note that adjacent residuals tend to
take on similar values).
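A minimal R sketch of how such plots can be generated, simulating AR(1) errors with correlation rho (data and coefficients are hypothetical):

set.seed(7)
n <- 100; rho <- 0.9
eps <- as.numeric(arima.sim(model = list(ar = rho), n = n))  # errors with correlation rho
x <- 1:n
y <- 1 + 0.1 * x + eps
plot(resid(lm(y ~ x)), type = "l", ylab = "Residual")        # adjacent residuals track each other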
Another important assumption of the linear regression model is that the error
terms have a constant variance, Var(εi ) = σ 2 . The standard errors,
confidence intervals, and hypothesis tests associated with the linear model
rely upon this assumption.
Unfortunately, it is often the case that the variances of the error terms are
non-constant. For instance, the variances of the error terms may increase
with the value of the response.
One can identify non-constant variances in the errors, or heteroscedasticity,
from the presence of a funnel shape in the residual plot.
[Figure: residuals vs. fitted values showing a funnel shape; observations 998, 975 and 845 labelled.]
In this example the magnitude of the residuals tends to increase with the fitted values.
When faced with this problem, one possible solution is to transform the response Y
using a concave function such as log Y or √Y.
Such a transformation results in a greater amount of shrinkage of the larger responses,
leading to a reduction in heteroscedasticity.
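A minimal R sketch of the effect of the log transform, using simulated data whose error variance grows with the response (all values are hypothetical):

set.seed(42)
x <- runif(200, 1, 10)
y <- exp(0.3 + 0.2 * x + rnorm(200, sd = 0.25))  # multiplicative errors: variance grows with Y
plot(fitted(lm(y ~ x)), resid(lm(y ~ x)))        # funnel shape
fit.log <- lm(log(y) ~ x)
plot(fitted(fit.log), resid(fit.log))            # roughly constant spread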
[Figure: residuals vs. fitted values after the log transform; observations 671 and 437 labelled.]
This figure displays the residual plot after transforming the response using
log Y.
The residuals now appear to have constant variance, though there is some
evidence of a slight non-linear relationship in the data.
The matrix
H = X(X′X)^{−1} X′
is called the hat matrix, and thus
ŷ = Hy.
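A minimal R sketch of these identities on hypothetical data:

X <- cbind(1, c(1, 2, 3, 4, 5))          # design matrix with an intercept column
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)          # hypothetical responses
H <- X %*% solve(t(X) %*% X) %*% t(X)    # H = X(X′X)^{−1} X′
H %*% y                                  # fitted values ŷ = Hy
diag(H)                                  # diagonal elements h_ii (the leverages)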
Outliers 1/4
An outlier is a point for which yi is far from the value predicted by the model.
Outliers can arise for a variety of reasons, such as incorrect recording
of an observation during data collection.
[Figure: scatterplot of Y vs. X with the least squares fit; observation 20 is an outlier.]
Outliers 2/4
[Figure: residuals vs. fitted values; observation 20 stands out.]
Outliers 3/4
To address this problem, instead of plotting the residuals, we can plot the
studentized residuals, computed by dividing each residual ei by its estimated
standard error
ri = ei / (RSE √(1 − hii))

where hii is the ith diagonal element of the hat matrix H and RSE √(1 − hii) is
the estimated standard error of ei.
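A minimal R sketch, on simulated data, checking this formula against R's built-in rstandard():

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
h   <- hatvalues(fit)                        # diagonal elements h_ii of H
rse <- summary(fit)$sigma                    # residual standard error
r   <- residuals(fit) / (rse * sqrt(1 - h))  # r_i = e_i / (RSE √(1 − h_ii))
all.equal(r, rstandard(fit))                 # TRUE: matches the built-in version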
[Figure: studentized residuals vs. fitted values; observation 20 labelled.]
Outliers 4/4
[Figure: residuals vs. fitted values.]
[Figure: scatterplot of Y vs. X.]
Observation 41 has high leverage, in that the predictor value for this
observation is large relative to the other observations.
The red solid line is the least squares fit to the data, while the blue dashed
line is the fit produced when observation 41 is removed.
[Figure: two panels comparing the least squares fits before and after removing a single point; left: the outlier (observation 20); right: the high leverage observation (observation 41).]
! Comparing the effect of an outlier and a high leverage point, we observe that
removing the high leverage observation has a much more substantial impact on
the least squares line than removing the outlier.
! In fact, high leverage observations tend to have a sizable impact on the
estimated regression line.
! It is cause for concern if the least squares line is heavily affected by just a couple
of observations, because any problems with these points may invalidate the
entire fit.
[Figure: scatterplot of X2 vs. X1; most observations fall inside a blue dashed ellipse, while one red observation lies outside it.]
The example shows a data set with two predictors, X1 and X2 . Most of the
observations’ predictor values fall within the blue dashed ellipse, but the red
observation is well outside of this range.
But neither its value for X1 nor its value for X2 is unusual on its own. So if we examine
just X1 or just X2, we will fail to notice this high leverage point; it stands out only
when X1 and X2 are examined simultaneously.
hi = 1/n + (xi − x̄)² / Σ_{j=1}^{n} (xj − x̄)²
It is clear from this equation that hi increases with the distance of xi from x̄.
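A minimal R sketch of this formula on hypothetical data, checked against hatvalues():

x <- c(1, 2, 3, 4, 10)                        # the last value lies far from x̄
h <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
y <- 2 + 0.5 * x + rnorm(5)                   # hypothetical responses
all.equal(h, unname(hatvalues(lm(y ~ x))))    # TRUE: same leverages
h                                             # h_i is largest for x = 10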
[Figure: left panel, Y vs. X with observations 20 and 41 labelled; right panel, studentized residuals vs. leverage with the same observations labelled.]
[Figure: added-variable plot of mpg | others against horsepower | others; observations 321, 328 and 116 labelled.]
Collinearity 1/7
Collinearity refers to the situation in which two or more predictor variables are
closely related to one another.
Consider the Credit data and look at the correlation matrix of quantitative variables:
str(Credit)
Credit.num <- subset(Credit, select = Income:Education)
cor(Credit.num)
Collinearity 2/7
! The two predictors limit and age appear to have no obvious relationship.
Income Limit Rating Cards Age Education
Income 1.0000 0.7921 0.7914 -0.0183 0.1753 -0.0277
Limit 0.7921 1.0000 0.9969 0.0102 0.1009 -0.0235
Rating 0.7914 0.9969 1.0000 0.0532 0.1032 -0.0301
Cards -0.0183 0.0102 0.0532 1.0000 0.0429 -0.0511
Age 0.1753 0.1009 0.1032 0.0429 1.0000 0.0036
Education -0.0277 -0.0235 -0.0301 -0.0511 0.0036 1.0000
[Figure: scatterplot of Age vs. Limit, showing no obvious relationship.]
Collinearity 3/7
In contrast, the matrix above shows a strong correlation (0.9969) between the variables Rating and Limit.
[Figure: scatterplot of Rating vs. Limit, showing a strong linear relationship.]
! We say that the predictors limit and rating are very highly correlated with
each other, and we say that they are collinear.
The presence of collinearity can pose problems in the regression context, since it
can be difficult to separate out the individual effects of collinear variables on the
response.
! In other words, since limit and rating tend to increase or decrease
together, it can be difficult to determine how each one separately is associated
with the response, balance.
Collinearity 4/7
Recall that the t-statistic for each predictor is calculated by dividing β̂j by its
standard error. Since collinearity inflates the standard errors, it results in a
decline in the t-statistic. As a result, in the presence of collinearity, we may fail
to reject H0: βj = 0. This means that the power of the hypothesis test, the
probability of correctly detecting a non-zero coefficient, is reduced by collinearity.
Collinearity 5/7
Look at the results of multiple regression of balance on age and limit.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -173.411 43.828 -3.957 9.01e-05 ***
age -2.292 0.672 -3.407 0.000723 ***
limit 0.173 0.005 34.496 <2e-16 ***
Here, both age and limit are highly significant with very small p-values.
Consider now the results of multiple regression of balance on rating and limit.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -377.53680 45.25418 -8.343 1.21e-15 ***
Rating 2.20167 0.95229 2.312 0.0213 *
Limit 0.02451 0.06383 0.384 0.7012
Here the collinearity between rating and limit has caused the standard error for the limit
coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701.
In other words, the importance of the limit variable has been masked due to the presence of
collinearity. To avoid such a situation, it is desirable to identify and address potential collinearity
problems while fitting the model.
VIF(β̂j) = 1 / (1 − R²_{Xj|X−j})

where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other
predictors.
! If R²_{Xj|X−j} is close to one, then collinearity is present, and so the VIF will
be large.
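A minimal R sketch, assuming the Credit data is loaded, computing the VIF of Limit directly from this formula for the two-predictor model Balance ~ Rating + Limit:

r2 <- summary(lm(Limit ~ Rating, data = Credit))$r.squared  # R² of Limit on the other predictor
1 / (1 - r2)                                                # about 160.5; compare vif() below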
library(car)
lm(Balance ~ Age + Limit) -> lm.fit
vif(lm.fit)
     Age    Limit
1.010283 1.010283
lm(Balance ~ Rating + Limit) -> lm.fit
vif(lm.fit)
  Rating    Limit
160.4933 160.4933
that estimates the standard deviation of the response from the population
regression line. For the Advertising data, the RSE is 1.686 units
Residual standard error: 1.686 on 196 degrees of freedom
while the mean value for the response is
mean(sales) → ȳ = 14.0225
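A minimal R sketch, assuming the Advertising data frame is available, relating the RSE to the mean response to get a rough percentage error:

fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(fit)$sigma                            # RSE: about 1.686
summary(fit)$sigma / mean(Advertising$sales)  # about 0.12, i.e. a 12% error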
To answer this question, we can examine the p-values associated with each
predictor’s t-statistic:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 < 2e-16 ***
TV 0.045765 0.001395 32.809 < 2e-16 ***
radio 0.188530 0.008611 21.893 < 2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
The p-values for TV and radio are low, but the p-value for newspaper is not.
This suggests that only TV and radio are related to sales.
lm(sales ~ TV + radio + newspaper) -> lm.sales.fit.tot
confint(lm.sales.fit.tot)
2.5 % 97.5 %
(Intercept) 2.32376 3.55402
TV 0.04301 0.04852
radio 0.17155 0.20551
newspaper -0.01262 0.01054
! The confidence intervals for TV and radio are narrow and far from
zero, providing evidence that these media are related to sales.
! But the interval for newspaper includes zero, indicating that the
variable is not statistically significant given the values of TV and radio.
Collinearity can result in very large standard errors. Could collinearity be the
reason that the confidence interval associated with newspaper is so wide?
Consider the VIF scores:
vif(lm.sales.fit.tot)
TV radio newspaper
1.00461 1.14495 1.14519
! The VIF scores are around 1 for the three variables, suggesting no
evidence of collinearity.
[Figure: 3D plot of Sales as a function of TV and Radio.]