3. Multiple Linear Regression
library(tidyverse)
First, as a predictive analysis, multiple linear regression is used to explain the relationship between one
continuous dependent variable and two or more independent variables. The independent variables can be
continuous or categorical.
Second, it can be used to forecast the effects or impacts of changes. That is, multiple linear regression analysis
helps us understand how much the dependent variable will change when we change the independent
variables.
Third, multiple linear regression analysis predicts trends and future values, and it can be used to obtain point
estimates.
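The export does not show the cell that creates the data object. A minimal sketch of a plausible loading step: the file name sales_data.csv is hypothetical, and the file is assumed to already contain the sales, price, quantity_ordered and quarter_id columns used below.

# Hypothetical loading step: the file name is an assumption, and the notebook
# itself does not show how the data object was created.
data <- read_csv("sales_data.csv")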
In [12]:
head(data)
A data.frame: 6 × 25
(most of the 25 columns and rows 2-3 were lost in this export; the visible columns appear to be an order number, quantity ordered, price each, an order line number, sales, and a truncated order date)

1   10107   30   95.70    2    2871.00   2/24/20…
4   10145   45   83.26    6    3746.70   8/25/20…
5   10159   49   100.00   14   5205.27   10/10/20…
6   10168   36   96.66    1    3479.76   10/28/20…
Building model
sales (dependent) = b0 + b1*price + b2*quantity_ordered + b3*quarter_id
In [14]:
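The code for this cell was not preserved in the export. A minimal sketch of the fitting step implied by the formula above and by the later references to summary(model); the variable names are taken from that formula:

# Fit the multiple linear regression of sales on price, quantity ordered and
# quarter, then print the Call, residual summary and coefficient table.
model <- lm(sales ~ price + quantity_ordered + quarter_id, data = data)
summary(model)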
Call:
Residuals:
Coefficients:
(numeric values in the summary output were not preserved in this export)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation:
It can be seen that the p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that at least
one of the predictor variables is significantly related to the outcome variable.
To see which predictor variables are significant, we examine the coefficients table, which shows the estimates of the
regression beta coefficients and the associated t-statistic p-values.
In [15]:
summary(model)$coefficients
For a given predictor, the t-statistic evaluates whether or not there is a significant association between the
predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different
from zero.
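As an illustration, the t value in the coefficients table is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with the model's residual degrees of freedom. A sketch for a single predictor, using the price term from the formula above:

# Recompute the t value and p-value for the price coefficient from the
# Estimate and Std. Error columns of the coefficients table.
coefs <- summary(model)$coefficients
t_value <- coefs["price", "Estimate"] / coefs["price", "Std. Error"]
p_value <- 2 * pt(-abs(t_value), df = df.residual(model))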
It can be seen that quantity ordered and price are each significantly associated with changes in sales, while
quarter is not significantly associated with sales.
We found that quarter is not significant in the multiple regression model. This means that, for fixed values of
quantity ordered and price, changes in quarter do not significantly affect sales.
As the quarter_id variable is not significant, it is possible to remove it from the model:
In [16]:
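The code for this cell was also lost in the export. A minimal sketch of the reduced model implied by the text (the name model2 is an assumption):

# Refit the model without the non-significant quarter_id term.
model2 <- lm(sales ~ price + quantity_ordered, data = data)
summary(model2)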
Call:
Residuals:
Coefficients:
(numeric values in the summary output were not preserved in this export)
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
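As mentioned in the introduction, the fitted model can be used to get point estimates. A sketch using predict() on the reduced model; the new price and quantity values below are made up for illustration:

# Predict sales for two hypothetical order lines.
new_orders <- data.frame(price = c(85, 100), quantity_ordered = c(30, 45))
predict(model2, newdata = new_orders)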
R-squared
In multiple linear regression, R represents the correlation coefficient between the observed values of the
outcome variable (y) and the fitted (i.e., predicted) values of y, and R2 is its square. For this reason, the value of R
will always be positive and will range from zero to one.
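This relationship can be checked directly; a sketch, assuming the reduced model model2 and the sales column from the formula above:

# R-squared equals the squared correlation between observed and fitted sales.
cor(data$sales, fitted(model2))^2
summary(model2)$r.squared  # should match the line above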
R2 represents the proportion of variance, in the outcome variable y, that may be predicted by knowing the value
of the x variables. An R2 value close to 1 indicates that the model explains a large portion of the variance in the
outcome variable.
A problem with R2 is that it will always increase when more variables are added to the model, even if
those variables are only weakly associated with the response. A solution is to adjust R2 by taking into
account the number of predictor variables.
The adjustment in the “Adjusted R-squared” value in the summary output is a correction for the number of x
variables included in the prediction model.
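Both values can be read from the summary object, and the adjustment is a simple function of R2, the number of observations n, and the number of predictors p. A sketch, assuming the reduced model model2:

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
s  <- summary(model2)
r2 <- s$r.squared
n  <- nrow(model.frame(model2))       # observations used in the fit
p  <- length(coef(model2)) - 1        # predictors, excluding the intercept
1 - (1 - r2) * (n - 1) / (n - p - 1)  # should match s$adj.r.squared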