
Data Analysis and Statistical Learning: 02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Regression and Model Building


Regression analysis is a statistical technique for investigating and modeling
the relationship between variables.
Applications of regression are numerous and occur in almost every field,
including engineering, the physical and social sciences, economics,
management, and the life and biological sciences.

! The variable sales is the output variable (other names: response, dependent variable) → Y.
! The variables TV, radio and newspaper are the input variables (other names: predictors, independent variables) → X1, X2, X3.

In general, we assume that the relationship between Y and X can be written as

Y = f(x) + ε

where f is some fixed but unknown function of x = (x1, . . . , xd)′ ∈ Rd and ε is
a random variable called the error term, independent of X, with expectation
equal to zero, i.e. E(ε) = 0, and finite variance.

7 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note:
Random variable and realization 1/2

! Consider the experiment of tossing a die: the sample space is {1, 2, 3, 4, 5, 6}. If the die is fair, then the experiment is described by a random variable X with uniform distribution on {1, 2, 3, 4, 5, 6}, i.e. P(X = i) = 1/6, for i = 1, 2, . . . , 6. Thus we write X ∼ U(6).
! Now let’s toss the die and the result is 3. Then we say that 3 is a realization of the random variable X.
! Let’s toss the die again and now the result is 5. Then we say that 5 is another realization of the random variable X.
! Consider the experiment of flipping a coin. If the coin is fair, then the experiment is described by a (categorical) random variable W assuming values in {H, T}, with H = ’head’ and T = ’tail’.
! Let’s flip the coin and get ’tail’. Then we say that ’tail’ is a realization of the random variable W.
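
A minimal R sketch of the distinction (not part of the original slides; the seed and the specific outcomes are illustrative):

# random variable vs. realization: simulate tosses of a fair die and flips of a fair coin
set.seed(1)                                  # for reproducibility
x <- sample(1:6, size = 3, replace = TRUE)   # three tosses of a die, X ~ U(6)
x                                            # three realizations of X
w <- sample(c("H", "T"), size = 1)           # one flip of a fair coin
w                                            # a realization of W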

11 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note:
Random variable and realization 2/2

! Let us consider an urn U with 40 white balls and 60 black balls. Consider a random sample of n = 10 with replacement from U and let X be the number of white balls in the sample. Then X ∼ B(10, 0.4).
! Now consider an observed sample: (B, W, W, B, W, W, W, W, B, W), containing 7 white balls. Then x = 7 is a realization of the random variable X.
! Consider another observed sample: (W, W, B, W, B, B, W, W, B, B), containing 5 white balls. Then x = 5 is a realization of the random variable X.
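
A minimal R sketch of this experiment (illustrative, not from the slides):

# the number of white balls in 10 draws with replacement is X ~ B(10, 0.4)
set.seed(2)
rbinom(2, size = 10, prob = 0.4)   # two realizations of X, analogous to x = 7 and x = 5 above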

14 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Simple Linear Regression 1/2

The Simple Linear Regression Model assumes that x ∈ R and

f (x) = β0 + β1 x

so that
Y = β0 + β1 x + ε
where
! β0 is the intercept
! β1 is the slope
! ε is the random error component.
β0 and β1 are unknown constants, also known as the model parameters; ε is
assumed to have mean zero and unknown variance σ².

15 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia


Simple Linear Regression 2/2

The equation
Y = β0 + β1 x + ε
may be viewed as a population regression model.
Assume we have a sample of n pairs of data, say (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ).
Each yi (i = 1, . . . , n) is assumed to be a realization of the random variable

Yi = β0 + β1 xi + εi

which represents the sample regression model. We further assume that the
errors are uncorrelated, i.e. the value of one error does not depend on the
value of any other error.

Parameter estimates
We denote by β!0 , β!1 the estimates of β0 , β1 based on the training data.

17 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 3/18

The simple linear regression model assumes that the relationship between the
quantitative response and the predictor is approximately linear. For example,
for the Advertising data we assume that

sales ≈ β0 + β1 × newspaper

[Scatterplot of sales (5 to 25) against newspaper (0 to 100).]

What can we say about these data?
Given the training data, we can produce estimates β̂0 and β̂1 for β0, β1.

20 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 4/18

The simple linear regression model assumes that the relationship between the
quantitative response and the predictor is approximately linear. For example,
for the Advertising data we assume that

sales ≈ β0 + β1 × TV

[Scatterplot of sales (5 to 25) against TV (0 to 300).]

What can we say about these data?
Given the training data, we can produce estimates β̂0 and β̂1 for β0, β1.

23 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Coefficients 1/3


Assume we have a sample of n pairs of data, say (x1, y1), (x2, y2), . . . , (xn, yn),
i.e. the training data. The least-squares criterion requires minimizing the quantity

S(β0, β1) = Σ_{i=1}^{n} (yi − (β0 + β1 xi))².

The least-squares estimators of β0, β1, say β̂0, β̂1, must satisfy

(β̂0, β̂1) = argmin_{β0, β1} S(β0, β1) = argmin_{β0, β1} Σ_{i=1}^{n} (yi − (β0 + β1 xi))²,

i.e.

∂S/∂β0 |_{β0 = β̂0} = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) = 0
∂S/∂β1 |_{β1 = β̂1} = −2 Σ_{i=1}^{n} (yi − β̂0 − β̂1 xi) xi = 0

24 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Coefficients 2/3

Simplifying these two equations yields

n β̂0 + β̂1 Σ_{i=1}^{n} xi = Σ_{i=1}^{n} yi
β̂0 Σ_{i=1}^{n} xi + β̂1 Σ_{i=1}^{n} xi² = Σ_{i=1}^{n} xi yi

These equations are called the least-squares normal equations.

25 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Coefficients 3/3


The solutions are in closed form:

β̂1 = σxy / σx² = Sxy / SSx   and   β̂0 = ȳ − β̂1 x̄,

where

x̄ = (1/n) Σ_{i=1}^{n} xi,   ȳ = (1/n) Σ_{i=1}^{n} yi,
σx² = (1/n) Σ_{i=1}^{n} (xi − x̄)² = SSx / n,   σxy = (1/n) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = Sxy / n,
SSx = Σ_{i=1}^{n} (xi − x̄)²,   Sxy = Σ_{i=1}^{n} (xi − x̄)(yi − ȳ).

Finally we get the estimated model

f̂(x) = β̂0 + β̂1 x.
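
The closed-form solution can be checked directly in R; the following sketch (with simulated x and y, purely illustrative) computes β̂0, β̂1 by hand and compares them with lm():

# least-squares estimates by hand vs. lm()
set.seed(3)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
SSx <- sum((x - mean(x))^2)
b1  <- Sxy / SSx                  # slope:     Sxy / SSx
b0  <- mean(y) - b1 * mean(x)     # intercept: ybar - b1 * xbar
c(b0, b1)
coef(lm(y ~ x))                   # should agree with b0, b1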

26 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residuals

Define the residual as the difference between the observed value yi and the
corresponding fitted value f̂(xi) = β̂0 + β̂1 xi:

ei = yi − f̂(xi) = yi − ŷi = yi − (β̂0 + β̂1 xi),   i = 1, 2, . . . , n,

where ŷi = f̂(xi) = β̂0 + β̂1 xi.

Residuals play an important role in investigating model adequacy and in
detecting departures from the underlying assumptions.

27 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Meaning of β̂0, β̂1

! Compute the estimated model at x and x + 1:

f̂(x + 1) = β̂0 + β̂1 (x + 1) = β̂0 + β̂1 x + β̂1
f̂(x) = β̂0 + β̂1 x.

Consider

f̂(x + 1) − f̂(x) = (β̂0 + β̂1 x + β̂1) − (β̂0 + β̂1 x) = β̂1,

thus the slope β̂1 measures the average variation in Y associated with a
one-unit increase in X.
! β̂0 is the intercept, i.e. the expected value of Y at x = 0, provided that
x = 0 is in the range of X in the training data.

28 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 5/18

[Scatterplot of sales (5 to 25) against newspaper (0 to 100) with the fitted regression line.]

sales = 12.35 + 0.0547 × newspaper

According to the estimated model, an additional $1,000 spent on newspaper
advertising is associated with selling approximately 54.7 additional units of
the product.

Homoscedasticity: Var(ε) = σ² approximately constant with x.

31 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 6/18

[Scatterplot of sales (5 to 25) against TV (0 to 300) with the fitted regression line.]

sales = 7.03 + 0.0475 × TV

According to the estimated model, an additional $1,000 spent on TV
advertising is associated with selling approximately 47.5 additional units of
the product.

Heteroscedasticity: Var(ε) = σ² depends on x.

34 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 7/18

lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
TV           0.05469    0.01658    3.30  0.00115 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148

37 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Some considerations in the use of regression 1/3


Regression analysis is widely used and, unfortunately, frequently misused.
There are several common abuses of regression that should be mentioned.
The disposition of the x values plays an important role in the least-squares fit.
While all points have equal weight in determining the intercept β̂0 of the line,
the slope β̂1 is more strongly influenced by remote values of X.

[Two scatterplots of Y against X, each containing a single remote point (A in one panel, B in the other).]

The slope of the least-squares fit depends heavily on the points A and B.
The points A and B are influential observations.
Situations such as this often require corrective action, such as further
analysis and possible deletion of the unusual points, or estimation of the
model parameters with some robust technique.
38 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Some considerations in the use of regression 2/3

In this case one of the 18 observations is very remote in x space. The slope
is largely determined by the extreme point.

[Two scatterplots of Y against X: the remote point C largely determines the fitted slope.]

39 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Some considerations in the use of regression 3/3


Outliers or bad values can seriously disturb the least-squares fit. Observation
D seems to be an ”outlier” or ”bad value” because it falls far from the line
implied by the rest of the data.

[Two scatterplots of Y against X: observation D lies far from the line implied by the remaining points.]

If this point is really an outlier, then the estimate of the intercept may be
incorrect.

40 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

2.2 Model Adequacy Checking

41 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Adequacy Checking 1/2


We never know for certain whether the Simple Regression Model describes
the population. We only observe a sample and the fitted least-squares
regression line.

The regression line from the sample is not the regression line from the population.
What we want to do:
! Assess how well the line describes the plot.
! Guess the slope of the population line.
! Guess what value Y would take for a given X value.

44 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Adequacy Checking 2/2

The major assumptions that we have made in the study of regression analysis are as follows:
1. The model explains the variability of Y.
2. The error term ε has zero mean and constant variance σ 2 .
3. The errors are uncorrelated.
4. The errors are normally distributed.

45 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The coefficient of determination R2 - 1/2


! Assumption 1: The model explains the variability of Y.
Some of the variation in Y can be explained by variation in the X’s and some
cannot. Consider the identity

yi − ȳ = (ŷi − ȳ) + (yi − ŷi)

and, after some algebra, we get

Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)²

or, equivalently,

TSS = SSf + RSS

where

TSS = Σ_{i=1}^{n} (yi − ȳ)²,   SSf = Σ_{i=1}^{n} (ŷi − ȳ)²,   RSS = Σ_{i=1}^{n} (yi − ŷi)².

! TSS is the total sum of squares and measures the total variability in Y;
! RSS measures the amount of variability that is left unexplained after
performing the regression.
46 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The coefficient of determination R2 - 2/2


The coefficient of determination R² is defined as

R² = 1 − Σ_{i=1}^{n} (yi − ŷi)² / Σ_{i=1}^{n} (yi − ȳ)² = 1 − Σ_{i=1}^{n} ei² / Σ_{i=1}^{n} (yi − ȳ)² = 1 − RSS/TSS.

Meaning of R²
R² gives the proportion of the variability in Y explained by the regressor X.

! An R² statistic close to 1 indicates that a large proportion of the
variability in the response has been explained by the regression;
! a number near 0 indicates that the regression did not explain much of
the variability in the response: this might occur because the model is
wrong, or because the inherent error σ² is high, or both.

Remark
The statistic R² should be used with caution, since it is always possible to
make R² large by adding enough terms to the model.
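
A short R check of this definition (a sketch, reusing the simulated x and y from the earlier sketch, or any pair of numeric vectors):

# R^2 = 1 - RSS/TSS computed by hand
fit <- lm(y ~ x)
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)
1 - RSS / TSS                 # matches summary(fit)$r.squared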
48 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Degrees of freedom
Consider the identity
TSS = SSf + RSS.

! The total sum of squares TSS = Σ_{i=1}^{n} (yi − ȳ)² has νT = n − 1 degrees
of freedom, because one degree of freedom is lost as a result of the
constraint Σ_{i=1}^{n} (yi − ȳ) = 0;
! the model sum of squares SSf = Σ_{i=1}^{n} (ŷi − ȳ)² has νf = 1 degree of
freedom, because SSf is completely determined by the regression
parameter β̂1;
! the residual sum of squares RSS = Σ_{i=1}^{n} (yi − ŷi)² has νR = n − 2
degrees of freedom, because two constraints are imposed as a result of
estimating β̂0 and β̂1.
Thus:

νT = νf + νR
n − 1 = 1 + (n − 2)

49 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 8/18

lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
TV           0.05469    0.01658    3.30  0.00115 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom


Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148

50 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 1/10

The residuals are defined as

ei = yi − ŷi i = 1, . . . , n

where yi is an observation and ŷi is the corresponding fitted value.


Since a residual may be viewed as the deviation between the data and the fit,
it is also a measure of the variability in the response variable not explained by
the regression model.
It is also convenient to think of the residuals as the realized or observed
values of the model errors. Thus, any departure from the assumption on the
errors should show up in the residuals, i.e. the realizations of the error term ε.

51 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 2/10

Analysis of residuals is an effective way to discover several types of model inadequacies.
Plotting residuals is a very effective way to investigate how the regression
model fits the data and to check the assumptions on the errors.
Two kinds of plots:
1. Plot of the residuals against the fitted values ŷi;
2. Quantile-Quantile (Q-Q) plot.

52 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 3/10


Plot of the residuals against the fitted values ŷi
! Assumption 2: The error term ε has zero mean and constant variance σ².

[Left: data with the fitted regression line. Right: residuals against fitted values, with constant spread around zero.]

Data with regression model and corresponding residual plot (homoscedasticity).

53 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 4/10


Plot of the residuals against the fitted values ŷi
! Assumption 2: The error term ε has zero mean and constant variance σ².

[Left: data with the fitted regression line. Right: residuals against fitted values, with spread increasing with the fitted values.]

Data with regression model and corresponding residual plot (heteroscedasticity).

54 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 5/10


Plot of the residuals against the fitted values ŷi
! Assumption 3: The errors are uncorrelated.

[Left: data with the fitted regression line. Right: residuals against fitted values, showing a systematic curved pattern.]

Data with regression model and corresponding residual plot (nonlinearity).
55 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 6/10 - Q-Q plot


! Assumption 4: The errors are normally distributed.

! Small departures from the normality assumption do not affect the model
greatly, but gross non-normality is potentially more serious, as inference
on the parameters depends on the normality assumption.
! A very simple method for checking the normality assumption is to
construct a Q-Q plot of the residuals, which is a graph designed so that
the cumulative normal distribution plots as a straight line. In other
words, the Q-Q plot, or quantile-quantile plot, is a graphical tool to help
us assess whether a set of data plausibly came from some theoretical
distribution such as a normal or an exponential.
! We recall that by a quantile we mean the fraction (or percent) of points
below the given value. That is, the 0.3 (or 30%) quantile is the point below
which 30% of the data fall and above which 70% fall.
! Q-Q plots take your sample data, sort them in ascending order, and then
plot them against quantiles calculated from a theoretical distribution. The
number of quantiles is selected to match the size of your sample data.
! R → qqnorm(), qqline() (a minimal sketch follows below).
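
A minimal sketch of both diagnostic plots in R, assuming the fit lmTV.fit from the earlier slides:

# residuals vs. fitted values, and normal Q-Q plot of the residuals
plot(fitted(lmTV.fit), resid(lmTV.fit),
     xlab = "fitted values", ylab = "residuals")   # look for constant spread around zero
abline(h = 0, lty = 2)
qqnorm(resid(lmTV.fit))                            # points close to the line support normality
qqline(resid(lmTV.fit))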

60 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 7/10 - Q-Q plot


Ideal case: sample from the normal distribution

[Left: histogram with the fitted density. Right: normal Q-Q plot of the sample; the points lie close to a straight line.]

61 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 8/10 - Q-Q plot


Ideal case: sample from a heavy-tailed distribution

[Left: histogram with the fitted density. Right: normal Q-Q plot; the points deviate from the straight line in both tails.]

62 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 9/10 - Q-Q plot


Ideal case: sample from a positively skewed distribution

[Left: histogram with the fitted density. Right: normal Q-Q plot; the points bend away from the straight line, indicating right skew.]

63 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Residual analysis 10/10 - Q-Q plot


Ideal case: sample from a negatively skewed distribution

[Left: histogram with the fitted density. Right: normal Q-Q plot; the points bend away from the straight line, indicating left skew.]

64 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Labo activity with R

Labo activity 2.R

65 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 9/18

[Left: residuals against fitted sales. Right: normal Q-Q plot of the residuals (empirical versus theoretical quantiles).]

sales = 7.03 + 0.0475 × TV

66 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example A: Advertising data 10/18

[Left: residuals against fitted sales. Right: normal Q-Q plot of the residuals (empirical versus theoretical quantiles).]

sales = 12.35 + 0.0547 × newspaper

67 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Assessing the Accuracy of the Coefficient Estimates


! The true relationship between X and Y is

Y = f(x) + ε.

! If f is to be approximated by a linear function, we can write this
relationship as the population regression model

Y = β0 + β1 x + ε.

! Based on a sample (x1, y1), (x2, y2), . . . , (xn, yn) we have the sample
regression model

Yi = β0 + β1 xi + εi

where y1, y2, . . . , yn are realizations of Y1, Y2, . . . , Yn.

Note
The true relationship is generally not known for real data, but the least squares
line can always be computed using the training data.

68 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Simulated data B: 1/3


6 different samples with n = 100 from the model Y = 2 + 3x + ε with ε ∼ N(0, 1.5²):

[Six scatterplots of y against x (x between −2 and 2), each with its fitted regression line:]

f̂(x) = 2.19 + 2.94x   f̂(x) = 1.98 + 3.21x   f̂(x) = 2.04 + 2.85x
f̂(x) = 1.93 + 3.07x   f̂(x) = 2.02 + 3.11x   f̂(x) = 1.91 + 2.77x

69 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Simulated data B: 2/3


6 different samples with n = 100 from the model Y = 2 + 3x + ε with ε ∼ N(0, 2.5²):

[Six scatterplots of y against x (x between −2 and 2), each with its fitted regression line:]

f̂(x) = 2.32 + 2.91x   f̂(x) = 1.97 + 3.35x   f̂(x) = 2.07 + 2.75x
f̂(x) = 1.88 + 3.11x   f̂(x) = 2.03 + 3.18x   f̂(x) = 1.86 + 2.61x

70 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Simulated data B: 3/3 (comments)

! Each sample generates a different regression line, i.e. different estimates β̂0, β̂1 of β0, β1.
! Some regression lines have a larger slope estimate β̂1 than β1; some regression lines have a smaller value β̂1 than β1.
! Some regression lines have a larger intercept estimate β̂0 than β0; some regression lines have a smaller value β̂0 than β0.
! The variability of the estimates β̂0, β̂1 increases with the variance σ² of ε.
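
A sketch of the simulation behind these panels (the seed and the uniform design for x are assumptions, not taken from the slides):

# one simulated sample of size n = 100 from Y = 2 + 3x + eps
set.seed(123)
n <- 100
x <- runif(n, -2, 2)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)   # use sd = 2.5 for the noisier set of panels
coef(lm(y ~ x))                       # the estimates fluctuate around (2, 3) from sample to sample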

74 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Properties of least-squares estimators 1/4

! For a given training set (x1, y1), (x2, y2), . . . , (xn, yn) of size n, we get an estimate β̂0, β̂1 of β0, β1;
! the values y1, y2, . . . , yn can be considered as realizations of Y1, Y2, . . . , Yn, where Yi = β0 + β1 xi + εi;
! theoretically, we could consider a huge number of realizations y1, y2, . . . , yn of Y1, Y2, . . . , Yn: each realization yields an estimate β̂0, β̂1 of β0, β1;
! hence, each estimate β̂0, β̂1 can be considered as the realization of two random variables B0, B1, which are the estimators of β0, β1, respectively;
! it can be proved that the expected value of B0 is equal to β0 and the expected value of B1 is equal to β1; in symbols,

E(B0) = β0   and   E(B1) = β1;

79 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Properties of least-squares estimators 2/4


! From a practical point of view, this means that if we could average the
values β̂1 over a huge number of samples of size n, this average would
exactly equal β1; analogously, if we could average the values β̂0 over a
huge number of samples of size n, this average would exactly equal β0.
! From a statistical point of view, this property means that the least-squares
estimators are unbiased estimators, i.e. they do not systematically over- or
under-estimate the true parameters.
! In a similar vein, we can wonder how close β̂0 and β̂1 are to the true
values β0, β1. This amounts to computing the variances of B0 and B1. It
can be proved that

Var(B1) = σ² / Σ_{i=1}^{n} (xi − x̄)² = σ² / SSx

Var(B0) = σ² ( 1/n + x̄² / Σ_{i=1}^{n} (xi − x̄)² ) = σ² ( 1/n + x̄² / SSx ).

82 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Properties of least-squares estimators 3/4


! In these formulas we do not know the value of the variance σ²: we can
get an estimate se² from the residuals ei = yi − ŷi:

se² = Σ_{i=1}^{n} ei² / (n − 2) = RSS / (n − 2)

where RSS = Σ_{i=1}^{n} ei² is the residual sum of squares.
! The quantity

se = √(se²) = √( Σ_{i=1}^{n} ei² / (n − 2) ) = √( RSS / (n − 2) )

is known as the residual standard error, where n − 2 is the number of
degrees of freedom.

In summary
Roughly speaking, se is the average amount by which the response deviates from
the true regression line.

85 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Properties of least-squares estimators 4/4

! Then we get the estimated variances of B0, B1:

V̂ar(B1) = se² / Σ_{i=1}^{n} (xi − x̄)² = se² / SSx

V̂ar(B0) = se² ( 1/n + x̄² / Σ_{i=1}^{n} (xi − x̄)² ) = se² ( 1/n + x̄² / SSx )

! and afterwards the standard errors of β̂0, β̂1, by taking square roots:

SE(β̂1) = √( V̂ar(B1) ) = √( se² / SSx )

SE(β̂0) = √( V̂ar(B0) ) = √( se² ( 1/n + x̄² / SSx ) ).
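
These formulas can be verified numerically; a sketch, reusing x and y from the simulation above:

# standard errors of the coefficients by hand vs. summary()
fit   <- lm(y ~ x)
se2   <- sum(resid(fit)^2) / (length(y) - 2)            # s_e^2 = RSS / (n - 2)
SSx   <- sum((x - mean(x))^2)
se_b1 <- sqrt(se2 / SSx)                                # SE of the slope
se_b0 <- sqrt(se2 * (1 / length(y) + mean(x)^2 / SSx))  # SE of the intercept
c(se_b0, se_b1)
summary(fit)$coefficients[, "Std. Error"]               # should match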

87 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example B: Advertising data 11/18

lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
TV           0.05469    0.01658    3.30  0.00115 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom


Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148

88 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example B: Advertising data 12/18

lm(sales ~ TV) -> lmTV.fit
summary(lmTV.fit)

Call:
lm(formula = sales ~ TV)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
TV           0.05469    0.01658    3.30  0.00115 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom


Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148

89 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Anscombe Data 1/5


Consider the following dataset¹:

X Y1 Y2 Y3 Z W
10 8.04 9.14 7.46 8 6.58
8 6.95 8.14 6.77 8 5.76
13 7.58 8.74 12.74 8 7.71
9 8.81 8.77 7.11 8 8.84
11 8.33 9.26 7.81 8 8.47
14 9.96 8.1 8.84 8 7.04
6 7.24 6.13 6.08 8 5.25
4 4.26 3.1 5.39 8 5.56
12 10.84 9.13 8.15 8 7.91
7 4.82 7.26 6.42 8 6.89
5 5.68 4.74 5.73 19 12.5

and the following regression lines: Y1 vs. X, Y2 vs. X, Y3 vs. X, W vs. Z.


¹ Anscombe F.J. (1973). Graphs in Statistical Analysis, The American Statistician, 27(1), 17-21.
90 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Anscombe Data 2/5

We have the following regression models:

dataset   variables   regression model             R²       s
a)        X, Y1       f̂1(x) = 3.0001 + 0.5001x    0.6665   1.237
b)        X, Y2       f̂2(x) = 3.0009 + 0.5000x    0.6662   1.237
c)        X, Y3       f̂3(x) = 3.0025 + 0.4997x    0.6663   1.236
d)        Z, W        f̂4(z) = 3.0017 + 0.4999z    0.6665   1.236

i.e. the four datasets share the same regression model.

91 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Anscombe Data 3/5


Look at the data:

[Four scatterplots: Y1 vs. X, Y2 vs. X, Y3 vs. X, and W vs. Z.]

92 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Anscombe Data 4/5


Look at the data + regression models:

[The same four scatterplots, each with the fitted regression line superimposed.]
93 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia


Example C: Anscombe Data 5/5

Conclusions:
! The first model seems to be perfectly appropriate;
! Figure 2 suggests that Y has a smooth, curved relationship with X, possibly quadratic;
! in Figure 3, all but one of the observations lie close to a straight line;
! Figure 4 shows that one observation has played a critical role.

Conclusion
A good statistical analysis always begins with a good graphical analysis.
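
These four regressions can be reproduced with R's built-in anscombe data frame (its columns x1..x4, y1..y4 correspond to the slide's X, Y1, Y2, Y3 and Z, W); a sketch:

# Anscombe's quartet: same fitted line, very different data
fits <- list(lm(y1 ~ x1, data = anscombe), lm(y2 ~ x2, data = anscombe),
             lm(y3 ~ x3, data = anscombe), lm(y4 ~ x4, data = anscombe))
sapply(fits, coef)            # intercepts near 3.0 and slopes near 0.5 in all four cases
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[i]], anscombe[[i + 4]], xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}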

95 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

2.3 Confidence Intervals and Hypothesis Testing

96 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Prediction 1/3

Once we have fit the linear regression model, it is straightforward to apply

ŷ = β̂0 + β̂1 x

in order to predict the response Y on the basis of a set of values for the
predictor X.
However, there are three sorts of uncertainty associated with this prediction:
1. reducible errors,
2. model bias,
3. irreducible errors.

98 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Prediction - 2/3
Reducible errors
The coefficient estimates β!0 , β!1 are estimates for β0 , β1 . That is, the least
squares line
ŷ = β!0 + β!1 x
is only an estimate for the true population regression line

f (x) = β0 + β1 x.

The inaccuracy in the coefficient estimates is related to the reducible error.


Assuming the random error to be Gaussian, i.e. ε ∼ N(0, σ²), we can compute
a confidence interval in order to determine how close ŷ = f̂(x) will be to f(x).

Model bias
Of course, in practice assuming a linear model for f(x) is almost always an
approximation of reality, so there is an additional source of potentially
reducible error which we call model bias. So when we use a linear model, we
are in fact estimating the best linear approximation to the true line.

100 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Prediction - 3/3

Irreducible errors
Even if we knew f (x) – that is, even if we knew the true values for β0 , β1 – the
response value cannot be predicted perfectly because of the random error ε
in the model
Y = β0 + β1 x + ε.

This random error is referred to as irreducible error.

101 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: Normal distribution N(0, 1) 1/2


A standard normal distribution is denoted by N(0, 1) and has density

f(z) = (1/√(2π)) e^(−z²/2).

Let Z ∼ N(0, 1) then P(− 1.96 ≤ Z ≤ 1.96) = 0.95:

probability = 0.95

−1.96 0 1.96

102 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: Normal distribution N(0, 1) 2/2


Let Z ∼ N(0, 1) then P(− 2 ≤ Z ≤ 2) = 0.9545:

probability = 0.9545

−2 0 2

For this reason, in the textbook confidence intervals of the kind

β̂1 ± 2 SE(β̂1)

are considered rather than β̂1 ± 1.96 SE(β̂1).


103 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution 1/9

The Student t distribution t(ν) with ν degrees of freedom has density²

fν(x) = Γ((ν + 1)/2) / ( Γ(ν/2) √(νπ) ) · (1 + x²/ν)^(−(ν+1)/2).

! The t distribution has a bell shape.
! For large values of ν (approximately larger than 30) it is quite similar to
the normal distribution.

² The Gamma function Γ is a positive function defined for x > 0 with two
main properties: Γ(x) = (x − 1)Γ(x − 1); if x = n is a positive integer, then
Γ(n) = (n − 1)!.
104 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution 2/9

[Plot of the Student t densities with ν = 1, 2, 3, 10, 20 degrees of freedom, compared with the standard normal density (ν = ∞); as ν grows, t(ν) approaches N(0, 1).]
105 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution 3/9

ν = 10
Let Q ∼ t(10) then P(− 2.23 ≤ Q ≤ 2.23) = 0.95:

probability = 0.95

−2.23 0 2.23

106 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution 4/9

Let Q ∼ t(10): then P(− 2 ≤ Q ≤ 2) = 0.9266:

probability = 0.9266

−2 0 2

107 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution t(30) 5/9

Let Q ∼ t(30) then P(− 2.04 ≤ Q ≤ 2.04) = 0.95:

probability = 0.95

−2.04 0 2.04

108 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution t(30) 6/9

Let Q ∼ t(30) then P(− 2 ≤ Q ≤ 2) = 0.9454:

probability = 0.9454

−2 0 2

109 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution t(60) 7/9

Let Q ∼ t(60) then P(− 2 ≤ Q ≤ 2) = 0.95:

probability = 0.95

−2 0 2

110 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution t(100) 8/9

Let Q ∼ t(100) then P(− 1.98 ≤ Q ≤ 1.98) = 0.95:

probability = 0.95

−1.98 0 1.98

111 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Preliminary note: t distribution t(100) 9/9

Let Q ∼ t(100) then P(− 2 ≤ Q ≤ 2) = 0.9518:

probability = 0.9518

−2 0 2

112 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Confidence intervals for β0 , β1 1/3


Fundamental assumption
Inference for β0 , β1 assumes Gaussian distributed errors, i.e. ε ∼ N(0, σ 2 ).

! Given training data (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) of size n, get the estimate
β!0 , β!1 of β0 , β1 ;
! compute the residual standard error se and afterwards SE(β!0 ), SE(β!1 ).
! Compute the intervals

[β̂0 − 2 SE(β̂0), β̂0 + 2 SE(β̂0)] = β̂0 ± 2 SE(β̂0)
[β̂1 − 2 SE(β̂1), β̂1 + 2 SE(β̂1)] = β̂1 ± 2 SE(β̂1)

The interval β!0 ± 2 SE(β!0 ) may or may not contain the true value β0 ;
analogously the interval β!1 ± 2 SE(β!1 ) may or may not contain the true
value β1 .

115 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Confidence intervals for β0 , β1 2/3


! Consider a huge number of realizations y1 , y2 , . . . , yn of Y1 , Y2 , . . . , Yn .
! For each of them get the estimates β!0 , β!1 , the corresponding standard errors
SE(β!0 ), SE(β!1 ) and afterwards the confidence intervals β!0 ± 2 SE(β!0 ) and
β!1 ± 2 SE(β!1 ).
! As we stated above, each estimate β!0 , β!1 can be then considered as the
realization of the corresponding random variable B0 , B1 , that are the estimators
of β0 , β1 .
! It can be proved that

(B0 − β0) / SE(B0) ∼ t(n − 2)   and   (B1 − β1) / SE(B1) ∼ t(n − 2)
where t(n − 2) denotes a t distribution with n − 2 degrees of freedom and n is
the sample size.
! Then:
! approximately the 95% of the intervals β!0 ± 2 SE(β!0 ) will contain β0 ,
! approximately the 95% of the intervals β!1 ± 2 SE(β!1 ) will contain β1 .

120 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Confidence intervals for β0 , β1 3/3


Given training data (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), we say that:
! the interval β!0 ± 2 SE(β!0 ) is a confidence interval for β0 with 95%
confidence level,
! the interval β!1 ± 2 SE(β!1 ) is a confidence interval for β1 with 95%
confidence level.
We have two operational definitions of confidence intervals
! Suppose many experiments have been performed to evaluate β1.
If the outcomes of each experiment are used to calculate a level-95%
confidence interval β̂1 ± 2 SE(β̂1) for β1, then approximately 95% of
the intervals would bracket the actual value of β1.

! If during his professional life a statistician calculates many level 95%


confidence intervals, then he will be right about 95% of the time.
! Analogous considerations hold for β!0 .

124 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Advertising data 13/18

lm(sales∼ TV)->lmTV.fit
confint(lmTV.fit)

2.5 % 97.5 %
(Intercept) 6.12971927 7.93546783
x 0.04223072 0.05284256

lm(sales∼ newspaper)->lmNEWS.fit
confint(lmNEWS.fit)

2.5 % 97.5 %
(Intercept) 11.12595560 13.57685854
x 0.02200549 0.08738071

125 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Advertising data 14/18

β̂0 = 7.0326   SE(β̂0) = 0.4578   ⇒   CI(β̂0) = [6.130, 7.935]
β̂1 = 0.0475   SE(β̂1) = 0.0027   ⇒   CI(β̂1) = [0.042, 0.053]

[Scatterplot of sales (5 to 25) against TV (0 to 300) with the fitted regression line.]

127 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Advertising data 15/18

β̂0 = 12.3514   SE(β̂0) = 0.6214   ⇒   CI(β̂0) = [11.126, 13.577]
β̂1 = 0.0547   SE(β̂1) = 0.0165   ⇒   CI(β̂1) = [0.022, 0.087]

[Scatterplot of sales (5 to 25) against newspaper (0 to 100) with the fitted regression line.]

129 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Prediction Intervals 1/2

Even if we knew f (x) – that is, even if we knew the true values for β0 , β1 – the
response value cannot be predicted perfectly because of the random error ε
in the model
Y = β0 + β1 x + ε.
This error is referred to as irreducible error.
How much will Y vary from Ŷ?
We use prediction intervals on a future observation at x to answer this
question.

Prediction intervals are always wider than confidence intervals, because they
incorporate both the error in the estimate for f (x) (the reducible error) and the
uncertainty as to how much an individual point will differ from the population
regression plane (the irreducible error).

131 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Prediction Intervals 2/2

The (approximate) 95% prediction interval on a future observation at x is

( ŷ − 2 sŶ|x , ŷ + 2 sŶ|x )

or, equivalently,

ŷ ± 2 sŶ|x ,

where

sŶ|x = √( se² ( 1 + 1/n + (x − x̄)²/SSx ) )   and   se² = Σ_{i=1}^{n} ei² / (n − 2).
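
In R both kinds of intervals are returned by predict(); a sketch, assuming the Advertising data is available in a data frame called Advertising:

# confidence interval for the mean response vs. prediction interval for a new observation
fitTV <- lm(sales ~ TV, data = Advertising)
new   <- data.frame(TV = 100)
predict(fitTV, newdata = new, interval = "confidence")   # interval for f(x)
predict(fitTV, newdata = new, interval = "prediction")   # wider: also covers the irreducible error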

132 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Advertising data 16/18

sales = β̂0 + β̂1 × TV

[Scatterplot of sales (5 to 25) against TV (0 to 300) with the fitted regression line.]

134 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Example C: Advertising data 17/18

sales = β̂0 + β̂1 × newspaper

[Scatterplot of sales (5 to 25) against newspaper (0 to 100) with the fitted regression line.]

136 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Hypothesis testing on regression parameters 1/3


The most common hypothesis test involves testing the null hypothesis of

H0 : There is no relationship between X and Y

versus the alternative hypothesis:

H1 : There is some relationship between X and Y

Mathematically, we have the following hypothesis system:

H0 : β1 = 0
H1 : β1 ̸= 0.

If β1 = 0 then the model Y = β0 + β1 x + ε reduces to Y = β0 + ε and X is not
associated with Y.
If we cannot be sure that β1 ≠ 0, then there is no point in using X as one of our
predictors.

137 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Hypothesis testing on regression parameters 2/3

! To test the null hypothesis, we need to determine whether the estimate


β!1 of β1 is sufficiently far from zero that we can be confident that β1 is
non-zero. How far is far enough?
! This of course depends on the accuracy of β!1 , i.e. it depends on SE(β!1 ).
! If SE(β̂1) is small, then even relatively small values of β̂1 may provide
strong evidence that β1 ≠ 0, and hence that there is a relationship between X
and Y.
! In contrast, if SE(β!1 ) is large, then β!1 must be large in absolute value in
order for us to reject the null hypothesis.
! Analogous ideas for the estimate β!0 of β0 .

142 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Hypothesis testing on regression parameters 3/3

Under the assumptions of the linear regression model, we recall that
(B0 − β0) / SE(B0) ∼ t(n − 2)   and   (B1 − β1) / SE(B1) ∼ t(n − 2)
where t(n − 2) denotes a t distribution with n − 2 degrees of freedom and n is
the sample size.
Thus, under the null hypothesis for β1 , we compute the t-statistic given by

t* = (β̂1 − 0) / SE(β̂1)

which measures the number of standard deviations that β!1 is away from 0.
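
The t-statistic and the corresponding p-value can be computed by hand; a sketch, assuming the fit fitTV <- lm(sales ~ TV, data = Advertising) from the earlier sketch:

# t* = beta1_hat / SE(beta1_hat) and its two-sided p-value
ct     <- summary(fitTV)$coefficients
t_star <- ct["TV", "Estimate"] / ct["TV", "Std. Error"]
t_star                                                            # compare with ct["TV", "t value"]
2 * pt(abs(t_star), df = df.residual(fitTV), lower.tail = FALSE)  # two-sided p-value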

143 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The p-value 1/5


Under the null hypothesis H0: β1 = 0, consider the random variable

Q = B1 / SE(B1)

which is called the test statistic.
The test statistic ”tends” to be small if the null hypothesis H0 is true, and large
if H0 is false.
Once we compute the test statistic on the training set, we get the value t*.
Thus we need a criterion for assessing whether t* is small or large.

The p-value is defined as

P(|Q| > |t*|) = P( |B1 / SE(B1)| > |β̂1 / SE(β̂1)| ).

Thus the p-value, under the null hypothesis β1 = 0, measures the probability
of observing a test statistic larger (in absolute value) than the observed
value t*.
145 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The p-value 2/5

! A small p-value indicates that it is unlikely to observe such a substantial
association between the predictor and the response due to chance, in
the absence of any real association between the predictor and the
response (i.e. under the null hypothesis H0: β1 = 0);
! in other words, if we see a small p-value, then we can infer that there is
an association between the predictor and the response;
! if the p-value is small enough, we reject the null hypothesis, that is we
declare a relationship to exist between X and Y.
! Typical p-value cutoffs for rejecting the null hypothesis are 5% or 1%.
! The same applies to β0 .

148 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The p-value 3/5


Based on similar arguments of confidence intervals, consider a huge number
of realizations y1 , y2 , . . . , yn of Y1 , Y2 , . . . , Yn .
For each of them get the estimate β!1 and the standard error SE(β!1 ) and
afterwards the test statistics t∗ = β!1 /SE(β!1 ).
Assume that the null hypothesis H0 :β1 = 0 is actually true:
! If the p-value is exactly equal to 0.05, then rejection of the null
hypothesis is the wrong decision in approximately 5% of cases, (type I
error);
! if the p-value is exactly equal to 0.01, then rejection of the null
hypothesis is the wrong decision in approximately 1% of cases;
! If the p-value is exactly equal to 0.001, then rejection of the null
hypothesis is the wrong decision in approximately 0.1% of cases.

In summary, the smaller the p-value, the smaller the evidence in favour of the null
hypothesis (or the larger the evidence against the null hypothesis).

154 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The p-value 4/5

Case A
[t density with the observed value ±t′ marked and the rejection tails shaded.]
p-value = 0.0995: the null hypothesis is not rejected at significance level α = 0.05 (and hence not at α = 0.01 either).

Case B
[t density with the observed value ±t′ marked and the rejection tails shaded.]
p-value = 0.0357: the null hypothesis is rejected at significance level α = 0.05, but it is not rejected at significance level α = 0.01.

Case C
[t density with the observed value ±t′ marked and the rejection tails shaded.]
p-value = 0.0069: the null hypothesis is rejected at significance level α = 0.05 and also at α = 0.01.
162 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

The p-value 5/5

The null hypothesis H0: β1 = 0:
! means that there is no linear relationship between X and Y,
! but it does not mean that there is no relationship between X and Y at all.

[Two scatterplots of y against x; in each, the fitted slope is close to zero:]

β̂1 = 0.006, p-value = 0.602          β̂1 = 0.069, p-value = 0.891
164 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia


Example C: Advertising data 18/18


lm(y∼ x)->lmTV.fit
summary(lmTV.fit)
Call:
lm(formula = sales∼ TV)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
TV           0.05469    0.01658    3.30  0.00115 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom


Multiple R-squared: 0.05212, Adjusted R-squared: 0.04733
F-statistic: 10.89 on 1 and 198 DF, p-value: 0.001148

Conclusion
We can conclude that β0 and β1 are statistically different from 0.

167 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Labo activity with R

Labo activity 2.R

168 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

2.4 Multiple Linear Regression

169 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Introduction

Simple linear regression is a useful approach for predicting a response on the


basis of a single predictor variable. However, in practice we often have more than one predictor.
In general, suppose that we have p distinct predictors X1 , . . . , Xp . Then the
multiple linear regression model takes the form

Y = β0 + β1 x1 + · · · + βp xp + ε ,

where
! Y is the response,
! X1 , X2 , . . . , Xp are the predictors,
! β1 , . . . , βp are the regression coefficients, where βj quantifies the association between the variable Xj (j = 1, . . . , p) and the response Y.
! ε is the random error term, where we assume again ε ∼ N(0, σ 2 ).

170 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Meaning of the regression coefficients

For simplicity, consider the case p = 2:

f (x1 , x2 ) = β0 + β1 x1 + β2 x2 ,

and compute f (·) in (x1 , x2 ) and (x1 + 1, x2 ):

f (x1 , x2 ) = β0 + β1 x1 + β2 x2
f (x1 + 1, x2 ) = β0 + β1 (x1 + 1) + β2 x2 .

Thus:

f (x1 + 1, x2 ) − f (x1 , x2 ) = β0 + β1 (x1 + 1) + β2 x2 − (β0 + β1 x1 + β2 x2 )


= β1 ,

Hence we interpret the regression coefficient βj as the average effect on Y of


a one unit increase in Xj , holding all other predictors fixed.

171 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Regression Coefficients 1/3


As was the case in the simple linear regression setting, the regression
coefficients β0 , β1 , . . . , βp are unknown, and must be estimated.
Assume we are provided with a training set, i.e. a sample of size n of
(X1 , . . . , Xp , Y):

(x11 , . . . , x1p , y1 ), (x21 , . . . , x2p , y2 ), . . . , (xn1 , . . . , xnp , yn ) .

The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression: let β̂0 , β̂1 , . . . , β̂p denote the estimates of β0 , β1 , . . . , βp and set

ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip .

We choose β̂0 , β̂1 , . . . , β̂p to minimize the sum of squared residuals

S(β̂0 , β̂1 , . . . , β̂p ) = Σ_{i=1}^n (yi − ŷi )² = Σ_{i=1}^n [yi − (β̂0 + β̂1 xi1 + · · · + β̂p xip )]² .

174 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Regression Coefficients 2/3

We can formulate the problem using matrix notation. Let us set


y = (y1 , y2 , . . . , yn )′ ,    X = the n × (p + 1) matrix whose ith row is (1, xi1 , xi2 , . . . , xip ),    β̂ = (β̂0 , β̂1 , . . . , β̂p )′ .

Thus, it can be proved that the estimated parameters β̂0 , β̂1 , . . . , β̂p are given by

β̂ = (X′ X)⁻¹ X′ y

where X′ denotes the transpose of X. Thus the fitted values are given by

ŷ = Xβ̂ = X(X′ X)⁻¹ X′ y.
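A minimal sketch, with simulated data (the variable names x1, x2, y are made up for illustration), checking the matrix formula against the coefficients returned by lm():

# Least squares via the matrix formula, compared with lm()
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                         # design matrix with a leading column of 1s
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
drop(beta.hat)

coef(lm(y ~ x1 + x2))                         # same estimates, up to rounding
y.hat <- drop(X %*% beta.hat)                 # fitted values X(X'X)^{-1} X'y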

176 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Estimating the Regression Coefficients 3/3


[Figure: three-dimensional scatterplot of the response Y against the predictors X1 and X2, with the fitted least squares plane]

In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane (in general, for p predictors it becomes a hyperplane in R^(p+1)).
The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.
177 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 1/11


Consider the model

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

Call:
lm(formula = sales ∼ TV + radio + newspaper)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422  < 2e-16 ***
TV           0.045765   0.001395  32.809  < 2e-16 ***
radio        0.188530   0.008611  21.893  < 2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1.686 on 198 degrees of freedom


Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

179 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 2/11

We interpret these results as follows:


! for a given amount of TV and newspaper advertising, spending an
additional $1,000 on radio advertising leads to an increase in sales by
approximately 189 units.
! Comparing these coefficient estimates to those concerning the simple
regression models, we notice that the multiple regression coefficient
estimates for TV and radio are pretty similar to the simple linear
regression coefficient estimates.
! However, while the newspaper regression coefficient estimate in the
simple regression model was significantly non-zero, the coefficient
estimate for newspaper in the multiple regression model is close to
zero, and the corresponding p-value is no longer significant, with a
value around 0.86.

182 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 3/11

! This illustrates that the simple and multiple regression coefficients can
be quite different.
! This difference stems from the fact that in the simple regression case,
the slope term represents the average effect of a $1,000 increase in
newspaper advertising, ignoring other predictors such as TV and radio.
! In contrast, in the multiple regression setting, the coefficient for
newspaper represents the average effect of increasing newspaper
spending by $1,000 while holding TV and radio fixed.

185 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 4/11


! Does it make sense for the multiple regression to suggest no
relationship between sales and newspaper while the simple linear
regression implies the opposite?
! In fact it does. Consider the correlation matrix for the three predictor
variables and response variable
W<-cbind(sales,TV,radio,newspaper)
cor(W)

sales TV radio newspaper


sales 1.00 0.78 0.58 0.23
TV 0.78 1.00 0.05 0.06
radio 0.58 0.05 1.00 0.35
newspaper 0.23 0.06 0.35 1.00

! Notice that the correlation between radio and newspaper is 0.35. This
reveals a tendency to spend more on newspaper advertising in markets
where more is spent on radio advertising.

187 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 5/ 11

! Now suppose that the multiple regression is correct and newspaper


advertising has no direct impact on sales, but radio advertising does
increase sales.
! Then in markets where we spend more on radio, our sales will tend to
be higher, and as our correlation matrix shows, we also tend to spend
more on newspaper advertising in those same markets.
! Hence, in a simple linear regression which only examines sales versus
newspaper, we will observe that higher values of newspaper tend to be
associated with higher values of sales, even though newspaper
advertising does not actually affect sales.
! So newspaper sales are a surrogate for radio advertising; newspaper
gets ”credit” for the effect of radio on sales.

188 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Degrees of freedom 1/2

Also for multiple regression models we can consider the identity

TSS = SSf + RSS

where

TSS = Σ_{i=1}^n (yi − ȳ)² ,    SSf = Σ_{i=1}^n (ŷi − ȳ)² ,    RSS = Σ_{i=1}^n (yi − ŷi )² .

189 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Degrees of freedom 2/2


As for the degrees of freedom:
! The total sum of squares TSS = Σ_{i=1}^n (yi − ȳ)² has again νT = n − 1 degrees of freedom, because one degree of freedom is lost as a result of the constraint Σ_{i=1}^n (yi − ȳ) = 0;
! The model sum of squares SSf = Σ_{i=1}^n (ŷi − ȳ)² has νf = p degrees of freedom, because SSf is completely determined by the regression parameters β̂1 , . . . , β̂p ;
! the residual sum of squares RSS = Σ_{i=1}^n (yi − ŷi )² has νR = n − (p + 1) degrees of freedom, because p + 1 constraints are imposed as a result of estimating β̂0 , β̂1 , . . . , β̂p .
Thus:

νT = νf + νR
n − 1 = p + (n − p − 1)
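The identity and the degrees of freedom can be checked numerically on any fitted model; a minimal sketch with simulated data (names chosen for illustration):

# Checking TSS = SSf + RSS and the degrees of freedom on a fitted model
set.seed(2)
n <- 50; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

TSS <- sum((y - mean(y))^2)
SSf <- sum((fitted(fit) - mean(y))^2)
RSS <- sum(residuals(fit)^2)
all.equal(TSS, SSf + RSS)                    # TRUE up to rounding
c(nu.T = n - 1, nu.f = p, nu.R = n - p - 1)  # n - 1 = p + (n - p - 1)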

190 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Some Important Questions

When we perform multiple linear regression, we usually are interested in


answering a few important questions.
1. Is at least one of the predictors X1 , X2 , . . . , Xp useful in predicting the
response?
2. Do all the predictors help to explain Y, or is only a subset of the
predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict,
and how accurate is our prediction?

191 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Is There a Relationship Between the Response and


Predictors? 1/4

Recall that in the simple linear regression setting, in order to determine


whether there is a relationship between the response and the predictor we
can simply check the hypothesis system

H0 : β1 = 0
H1 : β1 ̸= 0.

In the multiple regression setting with p predictors, we need to ask whether all
of the regression coefficients are zero. Thus, we test the hypothesis system

H0 : β1 = β2 = · · · = βp = 0
H1 : at least one βj is non-zero.

192 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Is There a Relationship Between the Response and


Predictors? 2/4

We assumed ε ∼ N(0, σ 2 ). Given a training set

(y1 , x11 , x12 , . . . , x1p ), (y2 , x21 , x22 , . . . , x2p ), . . . , (yn , xn1 , xn2 , . . . , xnp ),

after the estimation of the parameters, compute the residuals e1 , e2 , . . . , en and then the quantities

Σ_{i=1}^n e_i² / (n − p − 1) = RSS / (n − p − 1)    and    [ Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n e_i² ] / p = (TSS − RSS) / p ,

where TSS is the Total Sum of Squares of Y and RSS is the Residual Sum of Squares, given by:

TSS = Σ_{i=1}^n (yi − ȳ)²    and    RSS = Σ_{i=1}^n e_i² .

193 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Is There a Relationship Between the Response and


Predictors? 3/4
Consider a huge number of realizations y1 , y2 , . . . , yn of Y1 , Y2 , . . . , Yn . For each of them get the parameter estimates and then compute the residuals e1 , e2 , . . . , en .
It can be proved that the average of the ratio between RSS and n − p − 1 over these realizations is equal to the variance of the error term, i.e.

E[ Σ_{i=1}^n e_i² / (n − p − 1) ] = E[ RSS / (n − p − 1) ] = σ² .

Moreover, under the null hypothesis H0 : β1 = β2 = · · · = βp = 0, it can also be proved that

E[ ( Σ_{i=1}^n (yi − ȳ)² − Σ_{i=1}^n e_i² ) / p ] = E[ (TSS − RSS) / p ] = σ² .

195 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Is There a Relationship Between the Response and


Predictors? 4/4

Consider the F-statistic:

F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ] .

! When there is no relationship between the response and the predictors (the null hypothesis H0 : β1 = β2 = · · · = βp = 0), under the Gaussian assumption for the errors, it can be proved that one would expect the F-statistic to take on a value close to 1;
! On the other hand, if the alternative hypothesis H1 : at least one βj is non-zero is true, then we expect F to be significantly greater than 1.
! Under the assumption ε ∼ N(0, σ 2 ), we compute the p-value of the observed F and take a decision through the so-called F-distribution.
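As a sketch (assuming the Advertising variables sales, TV, radio and newspaper are attached, as in the labo activity), the F-statistic and its p-value can be reproduced by hand:

# F-statistic for H0: beta1 = ... = betap = 0, computed by hand
fit <- lm(sales ~ TV + radio + newspaper)
n <- length(sales); p <- 3
RSS <- sum(residuals(fit)^2)
TSS <- sum((sales - mean(sales))^2)
F.stat <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
F.stat                                                     # matches the F-statistic of summary(fit)
pf(F.stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)   # p-value from the F(p, n-p-1) distribution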

196 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Theoretical note: the F-distribution

Let V and W be two independent random variables with V ∼ χ2 (n) and


W ∼ χ2 (m). The distribution of the random variable

U = (V/n) / (W/m)

is called the F distribution with n, m degrees of freedom and we write U ∼ F(n, m).

[Figure: density curves of the F(2,5), F(5,10), F(10,5) and F(10,20) distributions]

197 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 6/11


Consider the model

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

Call:
lm(formula = sales ∼ TV + radio + newspaper)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422  < 2e-16 ***
TV           0.045765   0.001395  32.809  < 2e-16 ***
radio        0.188530   0.008611  21.893  < 2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1.686 on 198 degrees of freedom


Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

198 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Deciding on Important Variables 1/5

It is possible that all of the predictors are associated with the response, but it
is more often the case that the response is only related to a subset of the
predictors.
The task of determining which predictors are associated with the response, in
order to fit a single model involving only those predictors, is referred to as
variable selection.
Ideally, we would like to perform variable selection by trying out a lot of
different models, each containing a different subset of predictors.
Unfortunately, there are a total of 2^p models that contain subsets of the p variables.

199 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Deciding on Important Variables 2/5

Three classical approaches:


1. Forward selection.
! We begin with a model that contains the intercept.
! We then sequentially add into the model the predictor that most improves the fit: we fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.
! We then add to that model the variable that results in the lowest
RSS for the new two-variable model.
! This approach is continued until some stopping rule is satisfied.

200 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Deciding on Important Variables 3/5

2. Backward selection.
! We start with all variables in the model, and remove the variable with the largest p-value, that is, the variable that is the least statistically significant.
! The new (p − 1)-variable model is fit, and the variable with the
largest p-value is removed.
! This procedure continues until a stopping rule is reached. For
instance, we may stop when all remaining variables have a
p-value below some threshold.

201 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Deciding on Important Variables 4/5

3. Mixed selection (stepwise regression). This is a combination of forward


and backward selection.
! We start with no variables in the model and, as with forward selection, we add the variable that provides the best fit.
! We continue to add variables one-by-one. Of course, as we noted
with the Advertising example, the p-values for variables can
become larger as new predictors are added to the model.
! Hence, if at any point the p-value for one of the variables in the
model rises above a certain threshold, then we remove that
variable from the model.
! We continue to perform these forward and backward steps until all
variables in the model have a sufficiently low p-value, and all
variables outside the model would have a large p-value if added to
the model

202 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Deciding on Important Variables 5/5

step(lm(sales∼ TV+radio+newspaper),direction="backward")

step(lm(sales∼ TV+radio+newspaper),direction="forward")

step(lm(sales∼ TV+radio+newspaper),direction="both")

AIC: Akaike Information Criterion


The AIC is a criterion for model selection. The model is chosen according to
the smallest value, see later on.
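A short sketch of how the AIC can be inspected directly (assuming the Advertising variables are attached). AIC() and step() use slightly different additive constants, but the ranking of the models is the same:

# Comparing two candidate models by AIC (smaller is better)
fit.full <- lm(sales ~ TV + radio + newspaper)
fit.red  <- lm(sales ~ TV + radio)
AIC(fit.full, fit.red)                     # smaller value indicates the preferred model
extractAIC(fit.full); extractAIC(fit.red)  # the version of AIC reported by step()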

204 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: the R2 statistic 1/6

The most common numerical measure of model fit is the R2 , the fraction of variance explained. This quantity is computed and interpreted in the same fashion as for simple linear regression:

R² = (TSS − RSS) / TSS = 1 − RSS / TSS = 1 − Σ_{i=1}^n (yi − ŷi )² / Σ_{i=1}^n (yi − ȳ)² = 1 − Σ_{i=1}^n e_i² / Σ_{i=1}^n (yi − ȳ)²

where:
! TSS measures the Total Sum of Squares of Y,
! RSS measures the Sum of Squares of the Residuals, i.e. the amount of variability that is left unexplained.
An R2 value close to 1 indicates that the model explains a large proportion of
the variance in the response variable.

205 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 7/11


Consider the relationship between sales and TV, radio, newspaper.

predictor(s) R2
newspaper 0.0512
radio 0.3320
TV 0.6119
TV+radio 0.8971943
TV+newspaper 0.6458
radio+newspaper 0.3327
TV+radio+newspaper 0.8972106

! The model that uses all three advertising media to predict sales has
an R2 of 0.8972106;
! the model that uses only TV and radio has an R2 value of 0.8971943 (approximately the same);
! in other words, there is a small increase in R2 if we include newspaper
advertising in the model that already contains TV and radio advertising;
! it turns out that R2 does not decrease when a variable is added to the
model, even if this variable is only weakly associated with the response.
206 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 7/11


Consider the relationship between sales and TV, radio, newspaper.

predictor(s) R2
newspaper 0.0512
radio 0.3320
TV 0.6119
TV+radio 0.8971943
TV+newspaper 0.6458
radio+newspaper 0.3327
TV+radio+newspaper 0.8972106

! In contrast, the model containing only TV as a predictor had an R2 of 0.6119.
! Adding radio to the model leads to a substantial improvement in R2 .
! This implies that a model that uses TV and radio expenditures to predict
sales is substantially better than one that uses only TV advertising.
! We could further quantify this improvement by looking at the p-value for
the radio coefficient in a model that contain only TV and radio.
208 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Adjusted R2 2/6

As we stated before, R2 never decreases when a regressor is added to the


model, regardless of the value of the contribution of that variable.
We can use an adjusted R2 statistic defined as

R2Adj = 1 − [ RSS/(n − p − 1) ] / [ TSS/(n − 1) ] .

We remind that:
! RSS/(n − p − 1) is an unbiased estimate of the error variance σ 2 ,
! TSS/(n − 1) is an unbiased estimate of the variance of Y.
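Both quantities are easy to reproduce by hand from a fitted model; a sketch assuming the Advertising variables are attached:

# R2 and adjusted R2 computed from RSS and TSS
fit <- lm(sales ~ TV + radio + newspaper)
n <- length(sales); p <- 3
RSS <- sum(residuals(fit)^2)
TSS <- sum((sales - mean(sales))^2)
R2 <- 1 - RSS / TSS
R2.adj <- 1 - (RSS / (n - p - 1)) / (TSS / (n - 1))
c(R2 = R2, R2.adj = R2.adj)        # match the values printed by summary(fit)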

209 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Adjusted R2 2/6


As we stated before, R2 never decreases when a regressor is added to the
model, regardless of the value of the contribution of that variable.
We can use an adjusted R2 statistic defined as

R2Adj = 1 − [ RSS/(n − p − 1) ] / [ TSS/(n − 1) ] .
predictor(s) R2 R2Adj
newspaper 0.0512 0.0473
radio 0.3320 0.3287
TV 0.6119 0.6099
TV+radio 0.8971943 0.8962
TV+newspaper 0.6458 0.6422
radio+newspaper 0.3327 0.3259
TV+radio+newspaper 0.8972106 0.8956

Since RSS/(n − p − 1) is the residual mean square and TSS/(n − 1) is constant


regardless of how many variables are in the model, R2Adj will only increase
on adding a variable to the model if the addition of the variable reduces the
residual mean square.

211 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Adjusted R2 2/6


As we stated before, R2 never decreases when a regressor is added to the
model, regardless of the value of the contribution of that variable.
We can use an adjusted R2 statistic defined as

R2Adj = 1 − [ RSS/(n − p − 1) ] / [ TSS/(n − 1) ] .

predictor(s) R2 R2Adj
newspaper 0.0512 0.0473
radio 0.3320 0.3287
TV 0.6119 0.6099
TV+radio 0.8971943 0.8962
TV+newspaper 0.6458 0.6422
radio+newspaper 0.3327 0.3259
TV+radio+newspaper 0.8972106 0.8956

Note that the model TV+radio has an R2Adj larger than the model
TV+radio+newspaper.

212 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 8/11


Consider the model

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

Call:
lm(formula = sales ∼ TV + radio + newspaper)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422  < 2e-16 ***
TV           0.045765   0.001395  32.809  < 2e-16 ***
radio        0.188530   0.008611  21.893  < 2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1.686 on 198 degrees of freedom


Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

213 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Residual standard error 3/6

Consider the linear regression model:

Y = β0 + β1 x1 + · · · + βp xp + ε ,

where ε ∼ N(0, σ 2 ).
In general, if the model has p predictors, the estimate s²e of σ² is given by

s²e = Σ_{i=1}^n e_i² / (n − p − 1) = RSS / (n − p − 1)

where the ei are the residuals, i.e. ei = yi − ŷi . The quantity

RSE = sqrt( s²e ) = sqrt( RSS / (n − p − 1) )

is known as the residual standard error, where n − p − 1 is the number of degrees of freedom.

214 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11


Graphical summaries can reveal problems with a model that are not visible
from numerical statistics.
Consider the three-dimensional plot of TV and radio versus sales.

[Figure: three-dimensional plot of sales against TV and radio with the fitted least squares regression plane]

We see that some observations lie above and some observations lie below
the least squares regression plane.
Notice that there is a clear pattern of negative residuals, followed by positive
residuals, followed by negative residuals.
216 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11


Graphical summaries can reveal problems with a model that are not visible
from numerical statistics.
Consider the three-dimensional plot of TV and radio versus sales.


In particular, the linear model seems to overestimate sales for instances in


which most of the advertising money was spent exclusively on either TV or ra-
dio. It underestimates sales for instances where the budget was split between
the two media.
217 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11


Graphical summaries can reveal problems with a model that are not visible
from numerical statistics.
Consider the three-dimensional plot of TV and radio versus sales.


This pronounced non-linear pattern cannot be modeled accurately using linear


regression.

218 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11


Graphical summaries can reveal problems with a model that are not visible
from numerical statistics.
Consider the three-dimensional plot of TV and radio versus sales.


The positive residuals (those visible above the surface), tend to lie along the
45-degree line, where TV and Radio budgets are split evenly. The negative
residuals (most not visible), tend to lie away from this line, where budgets are
more lopsided.
219 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11


Graphical summaries can reveal problems with a model that are not visible
from numerical statistics.
Consider the three-dimensional plot of TV and radio versus sales.


It suggests a synergy or interaction effect between the advertising media,


whereby combining the media together results in a bigger boost to sales than
using any single medium.

220 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Prediction 4/6

Once we have fit the multiple regression model, it is straightforward to apply

ŷ = β!0 + β!1 x1 + · · · + β!p xp

in order to predict the response Y on the basis of a set of values for the
predictors X1 , X2 , . . . , Xp . However, like in the simple regression model, there
are three sorts of uncertainty associated with this prediction:
1. reducible errors,
2. model bias,
3. irreducible errors.

221 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Prediction 5/6


Reducible errors
The coefficient estimates β̂0 , β̂1 , . . . , β̂p are estimates for β0 , β1 , . . . , βp . That is, the least squares equation

ŷ = β̂0 + β̂1 x1 + · · · + β̂p xp

is only an estimate for the true population regression function

f (x) = β0 + β1 x1 + · · · + βp xp .

The inaccuracy in the coefficient estimates is related to the reducible error.


Assuming the random error to be Gaussian, i.e ε ∼ N(0, σ 2 ), we can compute
a confidence interval in order to determine how close ŷ = f̂ (x) will be to f (x).

Model bias
Of course, in practice assuming a linear model for f (x) is almost always an
approximation of reality, so there is an additional source of potentially
reducible error which we call model bias. So when we use a linear model,
we are in fact estimating the best linear approximation to the true line.

222 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Model Fit: Prediction 6/6

Irreducible errors
Even if we knew f (x) – that is, even if we knew the true values for β0 , β1 , . . . , βp – the response value cannot be predicted perfectly because of the random error ε in the model

Y = β0 + β1 x1 + · · · + βp xp + ε.
This error is referred to as irreducible error.
How much will Y vary from Ŷ?
We use prediction intervals to answer this question, like in the case of simple
regression models.
Prediction intervals are always wider than confidence intervals, because they
incorporate both the error in the estimate for f(X) (the reducible error) and the
uncertainty as to how much an individual point will differ from the population
regression plane (the irreducible error).
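In R, both intervals are obtained with predict(); a sketch assuming a model fitted on the Advertising data and a hypothetical new budget:

# Confidence interval (for f(x)) versus prediction interval (for a new Y)
fit <- lm(sales ~ TV + radio)
new.budget <- data.frame(TV = 100, radio = 20)               # hypothetical new budget values
predict(fit, newdata = new.budget, interval = "confidence")  # uncertainty about f(x)
predict(fit, newdata = new.budget, interval = "prediction")  # always wider: adds the irreducible error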

223 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Labo activity with R

Labo activity 2.R

224 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 1/5

In multiple regression models, often some predictors are qualitative.

Consider the dataset credit.csv


’data.frame’: 400 obs. of 12 variables:
$ Income : num 14.9 106 104.6 148.9 55.9 ...
$ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 ...
$ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
$ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
$ Age : int 34 82 71 36 68 77 37 87 66 41 ...
$ Education: int 11 15 11 11 16 10 12 9 13 19 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 ...
$ Student : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 ...
$ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 ...
$ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1
$ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...

226 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 1/10


[Figure: scatterplot matrix of Balance, Age, Cards, Education, Income, Limit and Rating]

227 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 2/5 - Predictors with Two Levels

If a qualitative predictor (also known as a factor) only has two levels, or possible values, then incorporating it into a regression model is very simple. We simply create an indicator or dummy variable that takes on two possible numerical values. For example, based on the gender variable, we can create a new variable of the form

xi = 1 if the ith person is a female, 0 if the ith person is a male,

and use this variable as a predictor in the regression equation. This results in the model

Yi = β0 + β1 xi + εi , i.e.
Yi = β0 + β1 + εi if the ith person is a female,
Yi = β0 + εi if the ith person is a male.
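The dummy can also be built by hand, which makes the coding explicit; a sketch assuming the Credit data are attached:

# Building the 0/1 dummy explicitly and comparing with the factor coding used by lm()
female <- ifelse(Gender == "Female", 1, 0)   # x_i = 1 for females, 0 for males
coef(lm(Balance ~ female))
coef(lm(Balance ~ Gender))                   # same fit; the slope is labelled GenderFemale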

228 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 2/10

attach(Credit)
levels(factor(Gender))
[1] " Male" "Female"

contrasts(Gender)
Female
Male 0
Female 1

229 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 3/10


lm(Balance∼ Gender)->lm.fit
summary(lm.fit)

Call:
lm(formula = Balance∼ Gender)

Residuals:
    Min      1Q  Median      3Q     Max
-529.54 -455.35  -60.17  334.71 1489.20

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    509.80      33.13  15.389   <2e-16 ***
GenderFemale    19.73      46.05   0.429    0.669
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 460.2 on 398 degrees of freedom


Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685

230 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 4/10


Estimate Std. Error t value Pr(>|t|)
Intercept 509.80 33.13 15.389 <2e-16 ***
GenderFemale 19.73 46.05 0.429 0.669

! β0 can be interpreted as the average credit card balance among males,


! β0 + β1 as the average credit card balance among females, and β1 as the
average difference in credit card balance between females and males.

The average credit card debt


! for males is estimated to be $509.80,
! whereas females are estimated to carry $19.73 in additional debt for a total of
$509.80 + $19.73 = $529.53.

However, we notice that the p-value for the dummy variable is very high. This
indicates that there is no statistical evidence of a difference in average credit card
balance between females and males.

234 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 3/5 -


Predictors with More than Two Levels
When a qualitative predictor has more than two levels, a single dummy
variable cannot represent all possible values. In this situation, we can create
additional dummy variables.
For example, for the ethnicity variable we create two dummy variables.
The first could be

xi1 = 1 if the ith person is Asian, 0 if the ith person is not Asian,

and the second

xi2 = 1 if the ith person is Caucasian, 0 if the ith person is not Caucasian.

235 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 4/5 -


Predictors with More than Two Levels
Then both of these variables can be used in the regression equation, in order
to obtain the model


Yi = β0 + β1 xi1 + β2 xi2 + εi , i.e.
Yi = β0 + β1 + εi if the ith person is Asian,
Yi = β0 + β2 + εi if the ith person is Caucasian,
Yi = β0 + εi if the ith person is African American.

Now:
! β0 can be interpreted as the average credit card balance for African
Americans,
! β1 can be interpreted as the difference in the average balance between
the Asian and African American categories, and
! β2 can be interpreted as the difference in the average balance between
the Caucasian and African American categories.

236 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 4/5 -


Predictors with More than Two Levels

Then both of these variables can be used in the regression equation, in order
to obtain the model


Yi = β0 + β1 xi1 + β2 xi2 + εi , i.e.
Yi = β0 + β1 + εi if the ith person is Asian,
Yi = β0 + β2 + εi if the ith person is Caucasian,
Yi = β0 + εi if the ith person is African American.

There will always be one fewer dummy variable than the number of levels.
The level with no dummy variable – African American in this example – is
known as the baseline.

237 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Qualitative Predictors 5/5 -


Predictors with More than Two Levels

levels(factor(Ethnicity))
[1] "African American" "Asian" "Caucasian"

contrasts(Ethnicity)
Asian Caucasian
African American 0 0
Asian 1 0
Caucasian 0 1

238 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 5/10


lm(Balance∼ Ethnicity)->lm.fit
summary(lm.fit)

Call:
lm(formula = Balance∼ Ethnicity)

Residuals:
    Min      1Q  Median      3Q     Max
-531.00 -457.08  -63.25  339.25 1480.50

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          531.00      46.32  11.464   <2e-16 ***
EthnicityAsian       -18.69      65.02  -0.287    0.774
EthnicityCaucasian   -12.50      56.68  -0.221    0.826
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 460.2 on 398 degrees of freedom


Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685

239 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Credit data 6/10


Estimate Std. Error t value Pr(>|t|)
(Intercept) 531.00 46.32 11.464 <2e-16 ***
EthnicityAsian -18.69 65.02 -0.287 0.774
EthnicityCaucasian -12.50 56.68 -0.221 0.826

! Look at the Estimates. We see that:


the estimated balance for the baseline, African American, is $531.00.
the Asian category will have $18.69 less debt than the African American category,
and that the Caucasian category will have $12.50 less debt than the African
American category.

! Look at the p-values. We see that:


the p-values associated with the coefficient estimates for the two dummy variables
are very large, suggesting no statistical evidence of a real difference in credit card
balance between the ethnicities.

243 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

2.5 Extension of the Linear Model

244 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Non-linear relationships

Models that are more complex in structure than

Y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε.

may often still be analyzed by multiple linear regression techniques.


For example, consider the cubic polynomial model

Y = β0 + β1 x + β2 x2 + β3 x3 + ε.

If we let
x1 = x x2 = x2 x3 = x3
then we can rewrite

Y = β0 + β1 x1 + β2 x2 + β3 x3 + ε

which is a multiple linear regression model with three regressor variables.
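In R the polynomial terms are created with I() or poly(); a sketch anticipating the Auto example on the following slides (variables mpg and horsepower, assumed attached):

# Cubic polynomial fitted as a multiple linear regression
fit.cubic <- lm(mpg ~ horsepower + I(horsepower^2) + I(horsepower^3))
summary(fit.cubic)
# equivalent model built with raw polynomial terms
fit.poly <- lm(mpg ~ poly(horsepower, 3, raw = TRUE))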

245 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 1/9

Consider the dataset Auto


attach(Auto)
str(Auto)
’data.frame’: 392 obs. of 9 variables:
$ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
$ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
$ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
$ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
$ weight : num 3504 3693 3436 3433 3449 ...
$ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
$ year : num 70 70 70 70 70 70 70 70 70 70 ...
$ origin : num 1 1 1 1 1 1 1 1 1 1 ...
$ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36
231 14 161 141 54 223 241 2 ..

246 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 2/9


Consider the relationship between the variables horsepower and mpg

[Figure: scatterplot of mpg versus horsepower]

247 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 3/9


Consider the model
mpg = β0 + β1 × horsepower + ε

lm(mpg∼ horsepower)->lm.fit
summary(lm.fit)
Call: lm(formula = mpg∼ horsepower )
Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom


Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16

248 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 4/9

[Figure: scatterplot of mpg versus horsepower with the fitted regression line]

Clearly, there is a large bias because the straight line does not fit the data
pattern.

249 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 5/9


Consider the model

mpg = β0 + β1 × horsepower + β2 × horsepower2 + ε

lm(mpg∼ horsepower+I(horsepower^2))->lm.fit
summary(lm.fit)

Call: lm(formula = mpg∼ horsepower + I(horsepower^2))


Residuals:
Min 1Q Median 3Q Max
-14.7135 -2.5943 -0.0859 2.2868 15.8961
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
I(horsepower^2)  0.0012305  0.0001221  10.08 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 4.374 on 389 degrees of freedom


Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
250 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 6/9

[Figure: scatterplot of mpg versus horsepower with the fitted quadratic curve]

Now the model fits the data pattern better.

251 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Auto data 7/9


Comparison among different models

[Figure: miles per gallon versus horsepower with linear, degree-2 and degree-5 polynomial fits]

The linear regression fit is shown in orange. The linear regression fit for a
model that includes horsepower2 is shown as a blue curve. The linear
regression fit for a model that includes all polynomials of horsepower up to
fifth-degree is shown in green.
252 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Interaction effects 1/2


Models that include interaction effects may also be analyzed by multiple
linear regression methods. For example, suppose that the model is

Y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε.

If we let
x3 = x1 x2 and β3 = β12
then we can write
Y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
which is a linear regression model.
Note that we can also write

Y = β0 + (β1 + β3 x2 )x1 + β2 x2 + ε = β0 + β˜1 x1 + β2 x2 + ε

where β˜1 = β1 + β3 x2 .
Since β˜1 changes with X2 , the effect of X1 on Y is no longer constant:
adjusting X2 will change the impact of X1 on Y.

253 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 9/11

Consider again the Advertising data. A linear model that uses radio, TV
and an interaction between the two to predict sales takes the form

sales = β0 + β1 × TV + β2 × radio + β3 × (radio × TV) + ε


= β0 + (β1 + β3 × radio) × TV + β2 × radio + ε

We can interpret β3 as the increase in the effectiveness of TV advertising for


a one unit increase in radio advertising (or vice-versa).

254 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 10/11


lm(sales∼ TV+radio+TV*radio)->lm.fit
summary(lm.fit)

Call: lm(formula = sales∼ TV + radio + TV * radio )


Residuals:
Min 1Q Median 3Q Max
-6.3366 -0.4028 0.1831 0.5948 1.5246
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
radio 2.886e-02 8.905e-03 3.241 0.0014 **
TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.9435 on 196 degrees of freedom


Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673
F-statistic: 1963 on 3 and 196 DF, p-value: < 2.2e-16

255 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 11/11

Estimate Std. Error t value Pr(>|t|)


(Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
radio 2.886e-02 8.905e-03 3.241 0.0014 **
TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***

These results strongly suggest that the model that includes the interaction term is superior to
the model that contains only main effects.
The p-value for the interaction term, radio × TV, is extremely low, indicating that there is strong evidence for H1 : β3 ≠ 0.
In other words, it is clear that the true relationship is not additive.

258 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Analysis of Advertising data 11/11


Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.750e+00 2.479e-01 27.233 <2e-16 ***
TV 1.910e-02 1.504e-03 12.699 <2e-16 ***
radio 2.886e-02 8.905e-03 3.241 0.0014 **
TV:radio 1.086e-03 5.242e-05 20.727 <2e-16 ***
Multiple R-squared: 0.9678, Adjusted R-squared: 0.9673

The R2 for this model is 96.8%, compared to only 89.7% for the model that predicts sales
using TV and radio without an interaction term. This means that
(96.8 − 89.7)/(100 − 89.7) = 69%
of the variability in sales that remains after fitting the additive model has been explained by
the interaction term.

The coefficient estimates suggest that an increase in TV advertising of $1,000 is associated


with increased sales of
(β̂1 + β̂3 × radio) × 1,000 = 19 + 1.1 × radio units.

And an increase in radio advertising of $1,000 will be associated with an increase in sales of

(β̂2 + β̂3 × TV) × 1,000 = 29 + 1.1 × TV units.

261 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Remark: meaning of the term linear


The curve generated by the model

Y = β0 + β1 x + β2 x2 + β3 x3 + ε

and the shape of the surface generated by

Y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε

are clearly not linear.


The term linear in linear regression model does not refer to the predictors but to the parameters:
What does linear model mean?
Any regression model that is linear in the parameters (the β’s) is a linear regression model, regardless of the shape of the surface that it generates.
For example:
! the model Y = β0 + β1 ln(x) + β2 x + ε is linear
! the model Y = β0 + β1 ln(x) + β1 β2 x + ε is not linear.

262 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Interaction effects - Qualitative predictors 2/2

The concept of interactions applies just as well to qualitative variables, or to a


combination of quantitative and qualitative variables.
In fact, an interaction between a qualitative variable and a quantitative
variable has a particularly nice interpretation.

Consider the Credit data, and suppose that we wish to predict balance
using the income (quantitative) and student (qualitative) variables. In the
absence of an interaction term, the model takes the form
balancei ≈ β0 + β1 × incomei + β2 if the ith person is a student,
balancei ≈ β0 + β1 × incomei if the ith person is not a student.

264 / 352
Data Analysis and Statistical Learning:02 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia

Interaction effects - Qualitative predictors 2/2

Consider the Credit data, and suppose that we wish to predict balance
using the income (quantitative) and student (qualitative) variables. In the
absence of an interaction term, the model takes the form
balancei ≈ β0 + β1 × incomei + β2 if the ith person is a student,
balancei ≈ β0 + β1 × incomei if the ith person is not a student.

Notice that this amounts to fitting two parallel lines to the data, one for
students and one for non-students. The lines for students and non-students
have different intercepts, β0 + β2 versus β0 , but the same slope, β1 .



Analysis of Credit data 7/10

[Figure: Balance versus Income for the Credit data, with parallel least squares lines for students and non-students]

The fact that the lines are parallel means that the average effect on balance of a
one-unit increase in income does not depend on whether or not the individual is a
student.
This represents a potentially serious limitation of the model, since in fact a change in
income may have a very different effect on the credit card balance of a student versus
a non-student.
This limitation can be addressed by adding an interaction variable, created by
multiplying income with the dummy variable for student.


Analysis of Credit data 8/10

The model becomes

balancei ≈ β0 + β1 × incomei + ⎧ β2 + β3 × incomei   if student
                               ⎩ 0                   if not a student

         = ⎧ (β0 + β2) + (β1 + β3) × incomei   if student
           ⎩ β0 + β1 × incomei                 if not a student

Once again, we have two different regression lines for the students and the
non-students.
Those regression lines have different intercepts, β0 + β2 versus β0 , as well as
different slopes, β1 + β3 versus β1 .
This allows for the possibility that changes in income may affect the credit
card balances of students and non-students differently.
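In R this corresponds to adding an interaction between the quantitative and the qualitative
predictor; a sketch, assuming the Credit data frame (for instance the one in the ISLR package)
with variables Balance, Income and Student:

library(ISLR)                                              # provides the Credit data
lm(Balance ~ Income + Student, data = Credit) -> fit.add   # additive model: parallel lines
lm(Balance ~ Income * Student, data = Credit) -> fit.int   # different intercepts and slopes
summary(fit.int)                                           # the Income:StudentYes row corresponds to β3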




Analysis of Credit data 9/10

[Figure: Balance versus Income for the Credit data, with separate least squares lines for students and non-students]

The figure shows the estimated relationships between income and balance for students
and non-students in this model.
We note that the slope for students is lower than the slope for non-students.
This suggests that increases in income are associated with smaller increases in credit
card balance among students as compared to non-students.


Labo activity with R

Labo activity 2.R


2.6 Potential Problems and Regression Diagnostics


Introduction

When we fit a linear regression model to a particular data set, many problems
may occur. Most common among these are the following:
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.
Some issues have already been analysed for simple regression models.
Others are typical of multiple linear regression models.


Non-linearity of the response-predictor relationships


! The linear regression model assumes that there is a straight-line
relationship between the predictors and the response. If the true
relationship is far from linear, then virtually all of the conclusions that we
draw from the fit are suspect. In addition, the prediction accuracy of the
model can be significantly reduced.
! Residual plots are a useful graphical tool for identifying non-linearity.
! Given a simple linear regression model, we can plot the residuals,
ei = yi − ŷi , versus the predictor xi .
! In the case of a multiple regression model since there are multiple
predictors, we instead plot the residuals versus the predicted (or fitted)
values ŷi .
! Ideally, the residual plot will show no discernible pattern.
! The presence of a pattern may indicate a problem with some aspect of
the linear model.


Analysis of Auto data 8/9


Consider the model

mpg = β0 + β1 × horsepower + ε
[Figure: left, mpg versus horsepower with the linear fit; right, residual plot for the linear fit (residuals versus fitted values)]

The red line is a smooth fit to the residuals, which is displayed in order to
make it easier to identify any trends.
The residuals exhibit a clear U-shape, which provides a strong indication of
non-linearity in the data.


Analysis of Auto data 9/9

Consider the model which contains a quadratic term

mpg = β0 + β1 × horsepower + β2 × horsepower² + ε


[Figure: left, mpg versus horsepower with the quadratic fit; right, residual plot for the quadratic fit (residuals versus fitted values)]

There appears to be little pattern in the residuals, suggesting that the
quadratic term improves the fit to the data.
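These two fits and their residual plots can be reproduced along the following lines (a sketch
assuming the Auto data frame, e.g. the one in the ISLR package):

library(ISLR)                                                    # provides the Auto data
fit.lin  <- lm(mpg ~ horsepower, data = Auto)                    # linear fit
fit.quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # quadratic fit
plot(fitted(fit.lin),  resid(fit.lin),  xlab = "Fitted values", ylab = "Residuals")   # U-shape
plot(fitted(fit.quad), resid(fit.quad), xlab = "Fitted values", ylab = "Residuals")   # little pattern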


Correlation of Error Terms 1/6

! An important assumption of the linear regression model is that the error


terms, ε1 , ε2 , . . . , εn , are uncorrelated. What does this mean?
! For instance, if the errors are uncorrelated, then the fact that εi is
positive provides little or no information about the sign of εi+1.
! The standard errors that are computed for the estimated regression
coefficients or the fitted values are based on the assumption of
uncorrelated error terms. If in fact there is correlation among the error
terms, then the estimated standard errors will tend to underestimate the
true standard errors.
! As a result, confidence and prediction intervals will be narrower than
they should be. For example, a 95% confidence interval may in reality
have a much lower probability than 0.95 of containing the true value of
the parameter.


Correlation of Error Terms 2/6

! In addition, p-values associated with the model will be lower than they
should be; this could cause us to erroneously conclude that a
parameter is statistically significant.
! In short, if the error terms are correlated, we may have an unwarranted
sense of confidence in our model.


Correlation of Error Terms 3/6


! Why might correlations among the error terms occur? Such correlations
frequently occur in the context of time series data, which consists of
observations for which measurements are obtained at discrete points in
time.
! In many cases, observations that are obtained at adjacent time points
will have positively correlated errors.
! In order to determine if this is the case for a given data set, we can plot
the residuals from our model as a function of time (see the sketch after this list).
! If the errors are uncorrelated, then there should be no discernible
pattern. On the other hand, if the error terms are positively correlated,
then we may see tracking in the residuals – that is, adjacent residuals
may have similar values.
! In general, the assumption of uncorrelated errors is extremely
important for linear regression as well as for other statistical
methods, and good experimental design is crucial in order to
mitigate the risk of such correlations.
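A minimal sketch of such a check, assuming a fitted model lm.fit whose observations are ordered
in time:

res <- resid(lm.fit)
plot(res, type = "b", xlab = "Observation (time order)", ylab = "Residual")   # look for tracking
abline(h = 0, lty = 2)
cor(res[-length(res)], res[-1])   # rough estimate of the lag-1 correlation between adjacent residuals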


Correlation of Error Terms 4/6

[Figure: residuals versus observation index for simulated data with ρ = 0.0]

The residuals from a linear regression fit to data generated with uncorrelated
errors. There is no evidence of a time-related trend in the residuals.


Correlation of Error Terms 5/6

[Figure: residuals versus observation index for simulated data with ρ = 0.5]

These residuals illustrate a more moderate case in which adjacent errors had a
correlation of 0.5.
There is still evidence of tracking, but the pattern is less clear.


Correlation of Error Terms 6/6

[Figure: residuals versus observation index for simulated data with ρ = 0.9]

The residuals are from a data set in which adjacent errors had a correlation
of 0.9.
There is a clear pattern in the residuals (note that adjacent residuals tend to
take on similar values).


Non-constant Variance of Error Terms 1/3

Another important assumption of the linear regression model is that the error
terms have a constant variance, Var(εi ) = σ 2 . The standard errors,
confidence intervals, and hypothesis tests associated with the linear model
rely upon this assumption.
Unfortunately, it is often the case that the variances of the error terms are
non-constant. For instance, the variances of the error terms may increase
with the value of the response.
One can identify non-constant variances in the errors, or heteroscedasticity,
from the presence of a funnel shape in the residual plot.


Non-constant Variance of Error Terms 2/3


[Figure: residual plot for the response Y (residuals versus fitted values); the spread of the residuals increases with the fitted values, giving a funnel shape]

In this example the magnitude of the residuals tends to increase with the fitted values.
When faced with this problem, one possible solution is to transform the response Y
using a concave function such as log Y or √Y.
Such a transformation results in a greater amount of shrinkage of the larger responses,
leading to a reduction in heteroscedasticity.
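A sketch of this remedy in R, assuming a positive response y and a predictor x in a data frame d:

fit     <- lm(y ~ x, data = d)
plot(fitted(fit), resid(fit))           # a funnel shape suggests heteroscedasticity
fit.log <- lm(log(y) ~ x, data = d)     # concave transformation of the response
plot(fitted(fit.log), resid(fit.log))   # the spread should now be more nearly constant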

Non-constant Variance of Error Terms 3/3


[Figure: residual plot for the response log(Y) (residuals versus fitted values)]

This figure displays the residual plot after transforming the response using
log Y.
The residuals now appear to have constant variance, though there is some
evidence of a slight non-linear relationship in the data.

Preliminary note: Back to the regression model.


The Hat matrix

Given a multiple regression model, recall that the parameter estimate is given by

β̂ = (X′X)⁻¹ X′y

and the fitted values are given by

ŷ = Xβ̂ = X(X′X)⁻¹ X′y.

The matrix

H = X(X′X)⁻¹ X′

is called the hat matrix, and thus

ŷ = Hy.
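The hat matrix can also be computed explicitly from a fitted lm object; a sketch, assuming some
fitted model lm.fit:

X <- model.matrix(lm.fit)                            # design matrix (including the intercept column)
H <- X %*% solve(t(X) %*% X) %*% t(X)                # hat matrix H = X (X'X)^(-1) X'
y.hat <- H %*% model.response(model.frame(lm.fit))   # equals fitted(lm.fit), i.e. y-hat = H y
diag(H)                                              # diagonal elements h_ii, the leverages (hatvalues(lm.fit))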


Outliers 1/4
An outlier is a point for which yi is far from the value predicted by the model.
Outliers can arise for a variety of reasons, such as incorrect recording
of an observation during data collection.

[Figure: Y versus X for simulated data; observation 20 is an outlier]



The red point (observation 20) illustrates a typical outlier.


The red solid line is the least squares regression fit, while the blue dashed
line is the least squares fit after removal of the outlier.
In this case, removing the outlier has little effect on the least squares line: it
leads to almost no change in the slope, and a minuscule reduction in the
intercept.
It is typical for an outlier that does not have an unusual predictor value to
have little effect on the least squares fit.

Outliers 2/4

[Figure: residuals versus fitted values; the outlier (observation 20) stands out]

The outlier is clearly visible in the residual plot.


But in practice, it can be difficult to decide how large a residual needs to be
before we consider the point to be an outlier.


Outliers 3/4
To address this problem, instead of plotting the residuals, we can plot the
studentized residuals, computed by dividing each residual ei by its estimated
standard error:

ri = ei / ( RSE · √(1 − hii) )

where hii is the ith diagonal element of the hat matrix H and RSE·√(1 − hii) is
the estimated standard error of ei.
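In R these quantities are available directly; a sketch, assuming a fitted model lm.fit:

ri <- rstandard(lm.fit)         # internally studentized (standardized) residuals
ti <- rstudent(lm.fit)          # externally studentized residuals
plot(fitted(lm.fit), ri, xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = c(-3, 3), lty = 2)   # observations beyond these lines are possible outliers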


[Figure: studentized residuals versus fitted values]

Observations whose studentized residuals are greater than 3 in absolute


value are possible outliers.
In this plot the outlier’s studentized residual exceeds 6, while all other
observations have studentized residuals between − 2 and 2.


Outliers 4/4

If we believe that an outlier has occurred due to an error in data collection or


recording, then one solution is to simply remove the observation.
However, care should be taken, since an outlier may instead indicate a
deficiency with the model, such as a missing predictor.


Analysis of Auto data 8/9


lm(mpg ~ horsepower, data = Auto) -> lm.fit
summary(lm.fit)
stdres <- rstandard(lm.fit)        # standardized (internally studentized) residuals
hatf <- lm.fit$fitted.values       # fitted values
plot(hatf, stdres, xlab = "fitted values", ylab = "Standardized Residuals",
     pch = 20, col = "blue")
[Figure: standardized residuals versus fitted values for the linear fit of mpg on horsepower]



High Leverage Points 1/5


We just saw that outliers are observations for which the response yi is
unusual given the predictor xi .
In contrast, observations with high leverage have an unusual value for xi .
[Figure: Y versus X for simulated data; observation 41 has high leverage]

Observation 41 has high leverage, in that the predictor value for this
observation is large relative to the other observations.
The red solid line is the least squares fit to the data, while the blue dashed
line is the fit produced when observation 41 is removed.

High Leverage Points 2/5


[Figure: left, the outlier (observation 20); right, the high leverage point (observation 41); each panel shows the least squares fit with and without the highlighted observation]
! Comparing the effect of an outlier and a high leverage point, we observe that
removing the high leverage observation has a much more substantial impact on
the least squares line than removing the outlier.
! In fact, high leverage observations tend to have a sizable impact on the
estimated regression line.
! It is cause for concern if the least squares line is heavily affected by just a couple
of observations, because any problems with these points may invalidate the
entire fit.

High Leverage Points 3/5


In a simple linear regression, high leverage observations are fairly easy to
identify, since we can simply look for observations for which the predictor
value is outside of the normal range of the observations.
But in a multiple linear regression with many predictors, it is possible to have
an observation that is well within the range of each individual predictor’s
values, but that is unusual in terms of the full set of predictors.

[Figure: X2 versus X1 for simulated data with two predictors]



The example shows a data set with two predictors, X1 and X2. Most of the
observations' predictor values fall within the blue dashed ellipse, but the red
observation is well outside of this range.
Yet neither its value for X1 nor its value for X2 is unusual on its own. So if we
examine just X1 or just X2, we will fail to notice this high leverage point.



This problem is more pronounced in multiple regression settings with more


than two predictors, because then there is no simple way to plot all
dimensions of the data simultaneously.



High Leverage Points 4/5


In order to quantify an observation’s leverage, we compute the leverage
statistic. A large value of this statistic indicates an observation with high
leverage.
For a simple linear regression,

hi = 1/n + (xi − x̄)² / Σj (xj − x̄)²

It is clear from this equation that hi increases with the distance of xi from x̄.

! There is a simple extension of hi to the case of multiple predictors.


! The leverage statistic hi is always between 1/n and 1, and the average
leverage for all the observations is always equal to (p + 1)/n.
! So if a given observation has a leverage statistic that greatly exceeds
(p + 1)/n, then we may suspect that the corresponding point has high
leverage.
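A sketch of how this rule of thumb can be checked in R, assuming a fitted model lm.fit:

h <- hatvalues(lm.fit)          # leverage statistics h_i
p <- length(coef(lm.fit)) - 1   # number of predictors
n <- length(h)
mean(h)                         # always equal to (p + 1)/n
which(h > 2 * (p + 1) / n)      # one common convention: flag leverages well above the average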


High Leverage Points 5/5

[Figure: left, Y versus X with observations 20 and 41 highlighted; right, studentized residuals versus leverage]

Observation 41 stands out as having a very high leverage statistic as well as


a high studentized residual.
In other words, it is an outlier as well as a high leverage observation. This is a
particularly dangerous combination!


Analysis of Auto data 9/9


library(car)                        # provides leveragePlots()
lm(mpg ~ horsepower, data = Auto) -> lm.fit
hatvalues(lm.fit)                   # returns the leverages
leveragePlots(lm.fit)

[Figure: leverage plot of mpg | others versus horsepower | others; observations 321, 328 and 116 are labelled]


Collinearity 1/7

Collinearity refers to the situation in which two or more predictor variables are
closely related to one another.

Consider the Credit data and look at the correlation matrix of quantitative variables:

str(Credit)
Credit.num <- subset(Credit, select = Income:Education)
cor(Credit.num)

Income Limit Rating Cards Age Education


Income 1.0000 0.7921 0.7914 -0.0183 0.1753 -0.0277
Limit 0.7921 1.0000 0.9969 0.0102 0.1009 -0.0235
Rating 0.7914 0.9969 1.0000 0.0532 0.1032 -0.0301
Cards -0.0183 0.0102 0.0532 1.0000 0.0429 -0.0511
Age 0.1753 0.1009 0.1032 0.0429 1.0000 0.0036
Education -0.0277 -0.0235 -0.0301 -0.0511 0.0036 1.0000



Collinearity 2/7
! The two predictors limit and age appear to have no obvious relationship.

[Figure: Age versus Limit for the Credit data]


Collinearity 3/7

In contrast, there is a strong correlation between the variables Rating and limit.


[Figure: Rating versus Limit for the Credit data]

! We say that the predictors limit and rating are very highly correlated with
each other, and we say that they are collinear.
The presence of collinearity can pose problems in the regression context, since it
can be difficult to separate out the individual effects of collinear variables on the
response.
! In other words, since limit and rating tend to increase or decrease
together, it can be difficult to determine how each one separately is associated
with the response, balance.

Collinearity 4/7

Since collinearity reduces the accuracy of the estimates of the regression
coefficients, it causes the standard error for β̂j to grow.

Recall that the t-statistic for each predictor is calculated by dividing β̂j by its
standard error. Consequently, collinearity results in a decline in the t-statistic.
As a result, in the presence of collinearity, we may fail to reject H0: βj = 0.
This means that the power of the hypothesis test – the probability of correctly
detecting a non-zero coefficient – is reduced by collinearity.



Collinearity 5/7
Look at the results of multiple regression of balance on age and limit.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -173.411 43.828 -3.957 9.01e-05 ***
age -2.292 0.672 -3.407 0.000723 ***
limit 0.173 0.005 34.496 <2e-16 ***
Here, both age and limit are highly significant with very small p-values.
Consider now the results of multiple regression of balance on rating and limit.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -377.53680 45.25418 -8.343 1.21e-15 ***
Rating 2.20167 0.95229 2.312 0.0213 *
Limit 0.02451 0.06383 0.384 0.7012
Here the collinearity between rating and limit has caused the standard error for the limit
coefficient estimate to increase by a factor of 12 and the p-value to increase to 0.701.
In other words, the importance of the limit variable has been masked due to the presence of
collinearity. To avoid such a situation, it is desirable to identify and address potential collinearity
problems while fitting the model.


Collinearity 6/7 - Variance Inflation Factor (VIF)

! A simple way to detect collinearity is to look at the correlation matrix of


the predictors. An element of this matrix that is large in absolute value
indicates a pair of highly correlated variables, and therefore a
collinearity problem in the data.
! Unfortunately, not all collinearity problems can be detected by
inspection of the correlation matrix: it is possible for collinearity to exist
between three or more variables even if no pair of variables has a
particularly high correlation.
! We call this situation multicollinearity.


Collinearity 7/7 - Variance Inflation Factor (VIF)


! Instead of inspecting the correlation matrix, a better way to assess
multicollinearity is to compute the variance inflation factor
(VIF). The VIF is the ratio of the variance of β̂j when fitting the full
model divided by the variance of β̂j if fit on its own.
! The smallest possible value for VIF is 1, which indicates the complete
absence of collinearity. Typically in practice there is a small amount of
collinearity among the predictors.
! As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a
problematic amount of collinearity.
! The VIF for each variable can be computed using the formula

VIF(β̂j) = 1 / ( 1 − R²Xj|X−j )

where R²Xj|X−j is the R² from a regression of Xj onto all of the other
predictors.
! If R²Xj|X−j is close to one, then collinearity is present, and so the VIF will
be large.
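As a check of this definition, the VIF of one predictor can be computed by hand and compared
with car::vif(); a sketch for the Credit data (assuming the Credit data frame, e.g. from the
ISLR package):

r2 <- summary(lm(Limit ~ Rating + Age, data = Credit))$r.squared   # R² of Limit on the other predictors
1 / (1 - r2)                                                       # VIF of Limit from the formula above
library(car)
vif(lm(Balance ~ Limit + Rating + Age, data = Credit))             # should give the same value for Limit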

Analysis of Credit data 10/10

library(car)
lm(Balance ~ Age + Limit, data = Credit) -> lm.fit
vif(lm.fit)

     Age    Limit
1.010283 1.010283

lm(Balance ~ Rating + Limit, data = Credit) -> lm.fit
vif(lm.fit)

  Rating    Limit
160.4933 160.4933


Labo activity with R

Labo activity 2.R


Back to Questions for Advertising data:


The Marketing Plan

1. Is there a relationship between advertising budget and sales?


2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?


1. Is there a relationship between advertising budget


and sales?
! This question can be answered by fitting a multiple regression model of sales
onto TV, radio, and newspaper:

sales = β0 + β1 × TV + β2 × radio + β3 × newspaper + ε

and testing the hypothesis

H0 :βTV = βradio = βnewspaper = 0.


! The F-statistic

F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
can be used to determine whether or not we should reject this null hypothesis.
! In this case we get
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
and thus the p-value corresponding to the F-statistic is very low, indicating clear
evidence of a relationship between advertising and sales.
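A sketch of how this fit and test are obtained in R, assuming the Advertising data frame is loaded:

lm(sales ~ TV + radio + newspaper, data = Advertising) -> lm.sales.fit.tot
summary(lm.sales.fit.tot)   # the last line of the output reports the F-statistic and its p-value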


2. How strong is the relationship between advertising


budget and sales? 1/2

We considered two measures of model accuracy.


! First, the residual standard error

RSE = s = √( Σi ei² / (n − p − 1) ) = √( SSe / (n − p − 1) )

that estimates the standard deviation of the response from the population
regression line. For the Advertising data, the RSE is 1.686 units
Residual standard error: 1.686 on 196 degrees of freedom
while the mean value for the response is

mean(sales) → ȳ = 14.0225

indicating a percentage error of 1.686/14.0225 ≈ 0.12 = 12%.


2. How strong is the relationship between advertising


budget and sales? 2/2

! Second, the R² statistic

R² = (TSS − RSS)/TSS = 1 − RSS/TSS
   = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²  =  1 − Σi ei² / Σi (yi − ȳ)²

that gives the percentage of variability in the response that is explained


by the predictors.
For the Advertising data, the predictors explain almost 90% of the
variance in sales
Multiple R-squared: 0.8972.


3. Which media contribute to sales?

To answer this question, we can examine the p-values associated with each
predictor’s t-statistic:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.938889 0.311908 9.422 < 2e-16 ***
TV 0.045765 0.001395 32.809 < 2e-16 ***
radio 0.188530 0.008611 21.893 < 2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86

The p-values for TV and radio are low, but the p-value for newspaper is not.
This suggests that only TV and radio are related to sales.


4. How accurately can we estimate the effect of each


medium on sales? 1/2
The standard error of β̂j can be used to construct confidence intervals for βj. For
the Advertising data:

lm(sales ~ TV + radio + newspaper, data = Advertising) -> lm.sales.fit.tot
confint(lm.sales.fit.tot)

2.5 % 97.5 %
(Intercept) 2.32376 3.55402
TV 0.04301 0.04852
radio 0.17155 0.20551
newspaper -0.01262 0.01054

! The confidence intervals for TV and radio are narrow and far from
zero, providing evidence that these media are related to sales.
! But the interval for newspaper includes zero, indicating that the
variable is not statistically significant given the values of TV and radio.


4. How accurately can we estimate the effect of each


medium on sales? 2/2

Collinearity can result in very wide standard errors. Could collinearity be the
reason that the confidence interval associated with newspaper is so wide?
Consider the VIF scores:
vif(lm.sales.fit.tot)

TV radio newspaper
1.00461 1.14495 1.14519

! The VIF scores are around 1 for the three variables, suggesting no
evidence of collinearity.


5. How accurately can we predict future sales?

The response can be predicted using

ŷ = β̂0 + β̂1 x1 + β̂2 x2 + · · · + β̂p xp.

The accuracy associated with this estimate depends on whether we wish


! to predict an individual response

Y = f (x) + ε → use a prediction interval

! predict the average response

f (x) → use a confidence interval

Prediction intervals will always be wider than confidence intervals because


they account for the uncertainty associated with the irreducible error ε.
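Both kinds of interval can be obtained from predict(); a sketch for the model fitted above, using
a hypothetical budget allocation (values in $1,000):

new.budget <- data.frame(TV = 100, radio = 20, newspaper = 0)              # hypothetical new observation
predict(lm.sales.fit.tot, newdata = new.budget, interval = "confidence")   # interval for the average response f(x)
predict(lm.sales.fit.tot, newdata = new.budget, interval = "prediction")   # wider interval for an individual response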


6. Is the relationship linear?


Residual plots can be used in order to identify non-linearity. If the
relationships are linear, then the residual plots should display no pattern. In
the case of the Advertising data

[Figure: Sales as a function of TV and Radio for the Advertising data]

we observe a non-linear effect


7. Is there synergy among the advertising media?

! The standard linear regression model assumes an additive relationship


between the predictors and the response. An additive model is easy to
interpret because the effect of each predictor on the response is
unrelated to the values of the other predictors.
! However, the additive assumption may be unrealistic for certain data
sets.
! Interaction terms can be included in the regression model in order to
accommodate non-additive relationships. A small p-value associated
with the interaction term indicates the presence of such relationships.
! For Advertising data the relationship may not be additive. Including an
interaction term in the model results in a substantial increase in R2 , from
around 90% to almost 97%.


Labo activity with R

Labo activity 2.R (ggplot2)
