Regression Models

Serge Nyawa

October 2023
Roadmap

▶ Linear Regression
▶ Logistic Regression
Introductory examples

Given that Digital Transformation has been shown to have remarkable impacts on economic development, the factors that drive Digitalization are of great interest to researchers and policy makers. We are interested in the relationship between the digitalization level, as measured by an index of digitalization, and a set of economic, socio-demographic and institutional factors. A linear regression model is appropriate for this goal.
Introductory examples

In a linear regression model, there is a linear relation between the dependent variable (the variable to be explained) and the independent variables (the explanatory variables):
Digitalization = a0 + a1 GDP + a2 Population + a3 School + a4 Internet + ε

In the previous example, a0, a1, a2, a3 and a4 are parameters to be estimated. ε is a random error term that represents the difference between the linear model and a particular observed value of the outcome variable.
Objectives

▶ Use R to estimate a regression model


▶ Measure the impact of covariates (predictors) on a dependent variable (outcome)
▶ Use a regression model for prediction purposes
Linear Regression
▶ The model

Y = a0 + a1 X1 + a2 X2 + ... + ap Xp + ε
In this model, Y is the outcome variable, X1, ..., Xp are input variables, and ai, i = 0, 1, ..., p, are parameters to be estimated. ε is a random error term, normally distributed with zero mean and constant variance: ε ∼ N(0, σ²).
▶ Ordinary least Squares (OLS) is a common technique to
estimate parameters. The aim is to minimize the sum of
squared residuals:

Σ_{i=1}^{n} [Yi − (a0 + a1 X1i + a2 X2i + ... + ap Xpi)]²
Linear Regression

▶ The solution to this optimization problem is:

(â0, â1, ..., âp)^T = (X^T X)^{−1} X^T Y

where X = (1, X1, ..., Xp) is the design matrix (a hand-computed check of this formula is sketched after this slide).
▶ After estimating the model, it is important to confirm the
goodness of fit of the model and the statistical significance of
the estimated parameters. This includes checking the
R-squared, analysing the pattern of residuals and hypothesis
testing. Statistical significance is checked by an F-test of the
overall fit, followed by t-tests of individual parameters.
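As a quick check of the closed-form solution, here is a minimal R sketch on simulated data (all variable names and values below are illustrative, not from the case study): it computes (X^T X)^{−1} X^T Y by hand and compares it with lm().

# Minimal sketch: OLS estimates by hand versus lm(), on simulated data
set.seed(1)
n <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
Y <- 1 + 2*X1 - 0.5*X2 + rnorm(n)

X <- cbind(1, X1, X2)                      # design matrix with intercept column
a_hat <- solve(t(X) %*% X) %*% t(X) %*% Y  # (X^T X)^{-1} X^T Y

cbind(a_hat, coef(lm(Y ~ X1 + X2)))        # the two columns should match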
Linear regression with R: case study

Given that Digital Transformation has been shown to have remarkable impacts on economic development, the factors that drive Digitalization are of great interest to researchers and policy makers. We are interested in the relationship between the digitalization level, as measured by an index of digitalization, and a set of economic, socio-demographic and institutional factors. A linear regression model is appropriate for this goal.
Linear regression with R: case study

▶ Variables
▶ Digitalization: measures countries’ digital adoption across
three dimensions of the economy: people, government, and
business;
▶ GDP: a monetary measure of the market value of all the final
goods and services produced in a specific time period by
countries;
▶ Population: all residents regardless of legal status or
citizenship;
▶ School: percentage of the population with successful
completion of education at the secondary level;
▶ Internet: number of individuals who have used the Internet
(from any location) in the last 3 months.
Linear regression with R: case study

▶ The regression equation to be estimated:

Digitalization = a0 + a1 GDP + a2 Population + a3 School + a4 Internet + ε


Linear regression with R: a case study

▶ Load packages

library('readxl')
library('lattice')
Linear regression with R: a case study

▶ Data loading

data_regr <- read_excel("C:/Users/[Link]/Documents/Regression_data.xlsx")

data_regr$Population<-log(data_regr$Population)

data_regr$GDP<-log(data_regr$GDP)
Linear regression with R: a case study

▶ A summary of the dataset

summary(data_regr)

## country Digital_Index Population GDP


## Length:125 Min. :0.1599 Min. :11.46 Min. :20.76
## Class :character 1st Qu.:0.4298 1st Qu.:15.23 1st Qu.:23.50
## Mode :character Median :0.6049 Median :16.16 Median :24.98
## Mean :0.5735 Mean :16.20 Mean :25.12
## 3rd Qu.:0.7142 3rd Qu.:17.33 3rd Qu.:26.48
## Max. :0.8706 Max. :21.05 Max. :30.55
## NA’s :3 NA’s :2 NA’s :2
## School Internet
## Min. : 3.506 Min. : 2.40
## 1st Qu.: 20.922 1st Qu.:31.45
## Median : 48.039 Median :60.26
## Mean : 47.057 Mean :56.86
## 3rd Qu.: 67.253 3rd Qu.:79.46
## Max. :131.541 Max. :98.24
## NA’s :2 NA’s :2
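The summary shows a few missing values. lm() drops incomplete rows by default, which is why the regression output below reports deleted observations. A quick count, as a sketch:

sum(!complete.cases(data_regr))   # rows with at least one NA (5 here, matching
                                  # the note in the regression output below)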
Linear regression with R: a case study
▶ Pair-wise relationships of the variables: the scatterplot matrix

splom(~data_regr[c(2:6)], groups=NULL,data=data_regr)

[Scatterplot matrix of Digital_Index, Population, GDP, School and Internet]


Linear regression with R: a case study
▶ Estimation of the model

results <- lm(Digital_Index ~ GDP+Population + School + Internet, data_regr)


summary(results)

##
## Call:
## lm(formula = Digital_Index ~ GDP + Population + School + Internet,
## data = data_regr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.279285 -0.044079 0.004405 0.042522 0.176924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1850669 0.0977502 -1.893 0.06084 .
## GDP 0.0387886 0.0116399 3.332 0.00116 **
## Population -0.0288345 0.0122183 -2.360 0.01996 *
## School 0.0007503 0.0003494 2.148 0.03384 *
## Internet 0.0038918 0.0006219 6.258 6.89e-09 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.06919 on 115 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8684, Adjusted R-squared: 0.8639
## F-statistic: 189.8 on 4 and 115 DF, p-value: < 2.2e-16
Linear regression with R: a case study

▶ Confidence Intervals on the Parameters

confint(results, level = .95)

## 2.5 % 97.5 %
## (Intercept) -3.786912e-01 0.008557346
## GDP 1.573210e-02 0.061845093
## Population -5.303663e-02 -0.004632407
## School 5.830613e-05 0.001442306
## Internet 2.660068e-03 0.005123600
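These intervals are the usual estimate ± t-quantile × standard error. As a sketch, the 95% interval for GDP can be reproduced by hand from the summary output above:

est <- 0.0387886                 # GDP coefficient from summary(results)
se  <- 0.0116399                 # its standard error
df  <- 115                       # residual degrees of freedom
est + c(-1, 1) * qt(0.975, df) * se
# should reproduce the 1.573210e-02 and 6.184509e-02 bounds above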
Linear regression with R: a case study

▶ Prediction: 95% confidence interval of the Digital Adoption Index for Chad
GDP <- 23.05
Population <- 16.49
School <- 7
Internet<-5
prediction_info <- data.frame(GDP, Population, School, Internet)
Linear regression with R: a case study

▶ Prediction: 95% confidence interval on the expected Digital Adoption Index

conf_int_Digital <- predict(results, prediction_info, level = .95, interval = "confidence")


conf_int_Digital

## fit lwr upr


## 1 0.2582403 0.2316145 0.2848661
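Note that interval = "confidence" gives an interval for the expected index. For a single new observation, a prediction interval is wider; a minimal sketch:

# Prediction interval for one new observation (wider than the
# confidence interval on the mean response)
predict(results, prediction_info, level = .95, interval = "prediction")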
Linear regression with R: a case study
▶ Diagnostics
▶ Evaluating the Residuals: centered on zero with a constant
variance
with(results, {plot(fitted.values, residuals, ylim=c(-0.3,0.3))
points(c(min(fitted.values), max(fitted.values)), c(0,0), type="l")})
[Plot of residuals against fitted values, scattered around zero]
Linear regression with R: a case study
▶ Diagnostics
▶ Evaluating the Normality Assumption of residuals

hist(results$residuals, main="Histogram of residuals")

[Histogram of residuals]
Linear regression with R: a case study
▶ Evaluating the Normality Assumption of residuals

qqnorm(results$residuals, ylab="Residuals")
qqline(results$residuals)

[Normal Q–Q plot: Residuals against Theoretical Quantiles]
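As a formal complement to these visual checks, one can also apply a normality test; a minimal sketch using the Shapiro–Wilk test (an addition, not part of the original workflow):

shapiro.test(results$residuals)   # H0: residuals are normally distributed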
Logistic Regression

▶ Dependent variable Y: categorical (binary or multinomial)


▶ One or more predictor variables that may be either continuous
or categorical
▶ The goal is to model the probability of a random variable Y
being 0 or 1 given experimental data
Logistic Regression

▶ Description of the model

Y = 1 if α0 + α1 X1 + α2 X2 + ... + αp Xp + ε > 0, and Y = 0 otherwise,

where ε follows a logistic distribution, whose CDF is F(x) = 1 / (1 + e^{−x}).
▶ Y and (X1, ..., Xp) are observed;
▶ There is no direct functional link between Y and X;
▶ S(X, α) = α0 + α1 X1 + α2 X2 + ... + αp Xp is the score function.
Logistic Regression
▶ It can be shown that

hα(X) ≡ P(Y = 1 | X, α) = 1 / (1 + e^{−S(X,α)})
▶ Also we can check that

log( hα(X) / (1 − hα(X)) ) = α0 + α1 X1 + α2 X2 + ... + αp Xp
▶ The log-odds, i.e. the natural logarithm of the odds of “success”, is a linear function of the values of the predictors;
▶ This latter equation is useful for the interpretation of coefficients;
▶ MLE is used to estimate the model:

(α̂0, α̂1, ..., α̂p) = arg max_α ∏_i hα(xi)^{yi} (1 − hα(xi))^{(1−yi)}
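The logistic CDF is available in R as plogis(). A minimal sketch of the link between score and probability (the score values are illustrative):

s <- c(-3, 0, 3)        # illustrative score values S(X, alpha)
p <- plogis(s)          # 1 / (1 + exp(-s)), a probability in (0, 1)
p
## [1] 0.04742587 0.50000000 0.95257413
log(p / (1 - p))        # the log-odds recovers the scores
## [1] -3  0  3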


Logistic regression with R: Predictive Maintenance

Manufacturers have only begun to capitalize on artificial intelligence capabilities on the factory floor. Today, AI's key roles in manufacturing expand beyond robotics and automation: AI now plays a pivotal role in predictive maintenance. A company has collected data on the machine failures it has encountered and wants to fit a model able to predict machine failure.
Logistic regression with R: Predictive Maintenance

▶ Variable description
▶ Air_temperature: air temperature
▶ Process_temperature: process temperature
▶ Rotational_speed: speed of rotation
▶ Torque: torque values
▶ Tool_wear: tool wear in the process
▶ Machine_failure: binary label indicating whether the machine has failed
Logistic regression with R

▶ Load the dataset and take a first look

Maintenance_data<-read.csv2("C:/Users/[Link]/Documents/Maintenance_data.csv")

summary(Maintenance_data)

## Product.ID Type Air_temperature Process_temperature


## Length:1000 Length:1000 Min. :295.6 Min. :306.1
## Class :character Class :character 1st Qu.:297.6 1st Qu.:308.5
## Mode :character Mode :character Median :298.3 Median :309.0
## Mean :299.0 Mean :309.3
## 3rd Qu.:299.2 3rd Qu.:309.9
## Max. :304.4 Max. :313.7
## Rotational_speed Torque Tool_wear Machine_failure
## Min. :1181 Min. : 3.80 Min. : 0 Min. :0.000
## 1st Qu.:1370 1st Qu.:34.80 1st Qu.: 61 1st Qu.:0.000
## Median :1459 Median :43.50 Median :122 Median :0.000
## Mean :1524 Mean :43.07 Mean :120 Mean :0.339
## 3rd Qu.:1585 3rd Qu.:51.60 3rd Qu.:184 3rd Qu.:1.000
## Max. :2886 Max. :76.60 Max. :253 Max. :1.000
Logistic regression with R

▶ Correcting variable types after import

Maintenance_data$Type<-factor(Maintenance_data$Type)
Maintenance_data$Machine_failure<-factor(Maintenance_data$Machine_failure)
summary(Maintenance_data)

## Product.ID Type Air_temperature Process_temperature


## Length:1000 H: 99 Min. :295.6 Min. :306.1
## Class :character L:633 1st Qu.:297.6 1st Qu.:308.5
## Mode :character M:268 Median :298.3 Median :309.0
## Mean :299.0 Mean :309.3
## 3rd Qu.:299.2 3rd Qu.:309.9
## Max. :304.4 Max. :313.7
## Rotational_speed Torque Tool_wear Machine_failure
## Min. :1181 Min. : 3.80 Min. : 0 0:661
## 1st Qu.:1370 1st Qu.:34.80 1st Qu.: 61 1:339
## Median :1459 Median :43.50 Median :122
## Mean :1524 Mean :43.07 Mean :120
## 3rd Qu.:1585 3rd Qu.:51.60 3rd Qu.:184
## Max. :2886 Max. :76.60 Max. :253
Logistic regression with R

▶ The dataset is divided in two: a training set and a test set


Maintenance_data_training<-Maintenance_data[1:(dim(Maintenance_data)[1]-100),]
Maintenance_data_test<-Maintenance_data[(dim(Maintenance_data)[1]-100+1):dim(Maintenance_data)[1],]
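This split takes the first 900 rows for training and the last 100 for testing. If the rows are not already in random order, a random hold-out is often preferable; a minimal sketch (the seed and the _r names are illustrative):

set.seed(123)                                        # arbitrary seed, for reproducibility
idx <- sample(nrow(Maintenance_data), size = 100)    # hold out 100 random rows
Maintenance_data_training_r <- Maintenance_data[-idx, ]
Maintenance_data_test_r     <- Maintenance_data[idx, ]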
Logistic regression with R
logit_model <- glm (Machine_failure~Type+Air_temperature+Process_temperature+
Rotational_speed+Torque+Tool_wear,
data=Maintenance_data_training,binomial(link="logit"))
summary(logit_model)

##
## Call:
## glm(formula = Machine_failure ~ Type + Air_temperature + Process_temperature +
## Rotational_speed + Torque + Tool_wear, family = binomial(link = "logit"),
## data = Maintenance_data_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2171 -0.2198 -0.0553 0.0362 3.4500
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.550e+02 6.750e+01 -8.222 < 2e-16 ***
## TypeL 1.022e+00 6.251e-01 1.634 0.102
## TypeM 6.482e-01 6.721e-01 0.964 0.335
## Air_temperature 2.100e+00 2.260e-01 9.292 < 2e-16 ***
## Process_temperature -3.905e-01 2.753e-01 -1.419 0.156
## Rotational_speed 1.760e-02 1.996e-03 8.818 < 2e-16 ***
## Torque 3.728e-01 3.609e-02 10.331 < 2e-16 ***
## Tool_wear 2.487e-02 3.217e-03 7.729 1.08e-14 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1161.6 on 899 degrees of freedom
## Residual deviance: 269.3 on 892 degrees of freedom
## AIC: 285.3
##
Logistic regression with R

▶ Interpretation: a one-unit increase in Torque is associated
with an increase of about 0.373 in the log odds of machine
failure; a one-degree increase in Air_temperature increases the
log odds of failure by about 2.100. Process_temperature and
machine Type are not statistically significant at the 5% level.
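Coefficients are also commonly read as odds ratios; a minimal sketch using the fitted logit_model above:

exp(coef(logit_model))    # multiplicative change in the odds of failure
                          # for a one-unit increase in each predictor
# e.g. exp(0.3728) is about 1.45: one extra unit of Torque multiplies
# the odds of machine failure by about 1.45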
Logistic regression with R

▶ Which variable is the most important?

install.packages("caret")

library("caret")

varImp(logit_model)

## Overall
## TypeL 1.6344863
## TypeM 0.9644174
## Air_temperature 9.2923369
## Process_temperature 1.4188504
## Rotational_speed 8.8181380
## Torque 10.3305166
## Tool_wear 7.7290194
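For a glm, caret's varImp() reports the absolute value of each coefficient's z statistic, so it can be checked directly against the summary output; a minimal sketch:

# varImp() for a glm is |z value|; compare with summary(logit_model)
abs(summary(logit_model)$coefficients[-1, "z value"])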
Logistic regression with R

▶ Diagnostics
▶ Pseudo-R²: how well the fitted model explains the data
compared to the default model with no predictor variables and
only an intercept term; values closer to one indicate that the
model has good predictive power

Pseudo-R² = 1 − Residual deviance / Null deviance = 1 − 269.3 / 1161.6 ≈ 0.768
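The same quantity can be computed directly from the fitted object; a minimal sketch:

1 - logit_model$deviance / logit_model$null.deviance   # McFadden's pseudo R-squared
# gives approximately 0.768 for this model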
Logistic regression with R
▶ Diagnostics
▶ Classification Rate: how well does the model predict the
dependent variable on out-of-sample observations?

prediction_test <- predict(logit_model, newdata = Maintenance_data_test,
                           type = "response")

prop.table(table(Maintenance_data_test$Machine_failure,
                 prediction_test > 0.5))

##
## FALSE TRUE
## 0 0.72 0.01
## 1 0.02 0.25

▶ The results show that 72% of the predicted observations are true negatives and
25% are true positives;
▶ The Type II error rate is 2%: in those cases, the model predicts the machine will
not fail but it did;
▶ The Type I error rate is 1%: in those cases, the model predicts the machine will
fail but it never did.
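The overall classification rate is the sum of the diagonal of the table above; a quick check as a sketch:

# Overall out-of-sample accuracy: share of correct predictions
mean((prediction_test > 0.5) == (Maintenance_data_test$Machine_failure == "1"))
# about 0.97 here (0.72 + 0.25 from the table above)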
