From Equations to Predictions: Understanding the Mathematics and Machine Learning of Multiple Linear Regression
Vesna Knights1* and Marija Prchkovska2
1 University "St Kliment Ohridski" - Bitola, Faculty of Technology and Technical Sciences - Veles, 7000 Bitola, Republic of North Macedonia
2 Mother Teresa University, Faculty of Computer Science, Informatics, 1000 Skopje, Republic of North Macedonia
ABSTRACT
In this paper, the core concepts of multiple linear regression are explored, with a focus on its mathematical foundations and integration with machine
learning principles. The objective is to bridge the gap between theory and practical application, providing readers with a comprehensive understanding
of this versatile method and highlighting its synergy with traditional statistical approaches and modern computational methods. The paper begins by
applying multiple linear regression to predict wine quality based on physicochemical attributes, using a comprehensive dataset. The least squares method
is used to estimate regression coefficients, facilitating the construction of a predictive model. The study also encompasses the testing of assumptions
such as homoscedasticity and normality of residuals, along with the assessment of autocorrelation to ensure model robustness. To illustrate the practical
implementation of multiple linear regression, a demonstration using PyTorch, a popular deep learning framework, is provided. A linear model is defined,
and the significance of gradient descent in optimizing model parameters is elucidated. Additionally, the paper covers topics such as data preprocessing,
model evaluation, and insights into interpreting regression results.
Furthermore, the performance of linear regression is evaluated in comparison to decision trees, random forests, and support vector regression, showcasing
the versatility of this classic technique. By presenting a holistic view of multiple linear regression, emphasizing its mathematical foundations, practical
implementation, and integration with machine learning, researchers and practitioners are empowered to leverage the potential of linear regression across
various domains.
*Corresponding author
Vesna Knights, University "St Kliment Ohridski" - Bitola, Faculty of Technology and Technical Sciences - Veles, 7000 Bitola, Republic of North Macedonia.
Received: March 19, 2024; Accepted: March 21, 2024, Published: April 03, 2024
Keywords: Linear Regression, Machine Learning, Mathematical Foundations, Model Implementation, Predictive Modeling

Introduction
Multiple linear regression, a foundational statistical technique, plays a pivotal role in modeling the intricate relationships that exist between a dependent variable (response) and one or more independent variables (predictors) [1-3]. This method involves fitting a linear equation to observed data, enabling us to comprehend, quantify, and predict associations among variables. Its versatility extends across a multitude of domains, including economics, marketing, and scientific research, where it serves as an invaluable tool for making predictions and unraveling intricate variable connections [4-6].

At its core, multiple linear regression is a supervised learning algorithm. It is particularly adept at handling continuous, real-numbered target variables [7-9]. This method establishes relationships between the dependent variable, denoted as 'y', and one or more independent variables, collectively represented as 'x', through the creation of a best-fit line. This process operates under the fundamental principle of ordinary least squares (OLS) or mean square error (MSE) [10-12]. OLS serves as a method to estimate the unknown parameters of the linear regression function, with its primary objective being the minimization of the sum of squared differences between the observed dependent variable and the values predicted by the linear regression function [10,11].

This paper embarks on an exploration of the intricate world of multiple linear regression, aiming to bridge the chasm between theoretical understanding and practical application. The following sections delve into the mathematical foundations of this method, in alignment with the insights presented by Kutner, Nachtsheim, Neter, and Li [13]. The discussion extends further, encompassing the synergistic relationship between traditional statistical approaches and contemporary computational methods. Our journey begins with the practical application of multiple linear regression to predict wine quality based on physicochemical attributes, employing an extensive dataset [14]. Leveraging the least squares method, we estimate regression coefficients, paving the way for the construction of a predictive model [15]. Assumptions such as homoscedasticity and normality of residuals are rigorously tested. Additionally, we assess autocorrelation, ensuring the robustness of our model.
On the practical implementation of multiple linear regression, this paper provides a hands-on demonstration using PyTorch, a well-regarded deep learning framework [16-18]. Within this context, a linear model is defined, emphasizing the critical role of gradient descent in optimizing model parameters [18]. Subsequent sections of the paper delve into essential topics such as data preprocessing, model evaluation, and insightful approaches for interpreting regression results [19].

Furthermore, this study broadens its scope by evaluating the performance of linear regression against other contemporary machine learning techniques, including decision trees, random forests, and support vector regression [17,20,21]. This comparative analysis underscores the enduring adaptability of this time-honored method within the domain of predictive modeling. By offering a comprehensive perspective on multiple linear regression, emphasizing its mathematical foundations, practical applications, and integration with modern machine learning, this work aims to empower researchers and practitioners, equipping them to leverage the substantial potential of linear regression across various fields [22].

Material and Methods
Material
For the purpose of this study, a database from Cortez et al. (2009) was utilized [14]. The dataset includes the following attribute information:

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g/dm^3)
2 - volatile acidity (acetic acid - g/dm^3)
3 - citric acid (g/dm^3)
4 - residual sugar (g/dm^3)
5 - chlorides (sodium chloride - g/dm^3)
6 - free sulfur dioxide (mg/dm^3)
7 - total sulfur dioxide (mg/dm^3)
8 - density (g/cm^3)
9 - pH
10 - sulphates (potassium sulphate - g/dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Methods
The Collection of the Data
The data for this study were obtained from the dataset provided by Cortez et al. in 2009 [14] (Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, 47(4), 547-553). The dataset contains information on physicochemical attributes of wine, making it suitable for the analysis and implementation of multiple linear regression.

Statistical Analysis
The statistical analysis in this study primarily involves the implementation of Multiple Linear Regression.

Implementation of Multiple Linear Regression
Objective: The objective of Multiple Linear Regression is to find the estimates of the regression coefficients (β0, β1, β2, ..., βp) that minimize the sum of the squared differences between the observed values (y) and the values predicted by the linear regression model.

Loss Function: Multiple Linear Regression employs a loss function that measures the squared differences between the observed and predicted values. The ultimate goal is to minimize the sum of squared residuals.

Assumptions
Multiple Linear Regression assumes that the errors (residuals) are normally distributed with constant variance (homoscedasticity) and does not require a specific probabilistic model for the errors.

Linear Regression Model
In simple linear regression, with one independent variable (X) and one dependent variable (Y), the model is defined as:
Y = β0 + β1X
For Multiple Linear Regression, where there are multiple independent variables (x1, x2, ..., xp), the model is represented as:
Y(yi) = β0 + β1x1 + β2x2 + ... + βpxp
where Y(yi) represents the observed value.

In order to make predictions, the model is expressed as:
Ŷ = β0 + β1X1 + β2X2 + ... + βpXp + ε
Ŷ represents the predicted value of the dependent variable Y for a given set of independent variables.
β0 is the y-intercept, representing the expected value of Y when all independent variables are 0.
β1, β2, ..., βp are the coefficients (slopes) for the independent variables.
ε (Error or Residual) is the difference between the actual observed value (Y(yi)) and the predicted value (Ŷ). Mathematically:
ε = yi - ŷi

The primary objective of linear regression is to determine the coefficients that minimize the sum of squared errors (SSE) and provide an accurate model for predicting the target variable based on the input features. This is achieved through methods like the least squares approach, optimizing the coefficients to create a predictive model.

In the context of machine learning, this approach allows us to find the best-fitting linear model that captures the relationship between the independent variables and the dependent variable, facilitating accurate predictions on new, unseen data.

Results
The dataset comprises m = 1599 examples and n = 11 independent variables (Table 1). The target variable, 'quality', falls within a range of 0 to 10, while the remaining eleven variables represent various physicochemical attributes. Given the presence of multiple independent variables, we are tasked with fitting a multiple linear regression model.

The equation for multiple linear regression can be expressed as:
Y(yi) = β0 + β1 * fixed acidity + β2 * volatile acidity + β3 * citric acid + β4 * residual sugar + β5 * chlorides + β6 * free sulfur dioxide + β7 * total sulfur dioxide + β8 * density + β9 * pH + β10 * sulphates + β11 * alcohol   (1)
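As a point of reference, the dataset and the `data` frame used in the fitting code later in this section can be loaded along the following lines. This is a minimal sketch rather than the authors' code: the file name winequality-red.csv, the ';' separator, and the column renaming are assumptions based on the publicly available UCI distribution of the Cortez et al. dataset.

import pandas as pd

# Load the red-wine dataset (Cortez et al., 2009); the file name and the
# ';' separator follow the public UCI distribution, adjust the path as needed.
data = pd.read_csv("winequality-red.csv", sep=";")

# Replace spaces in the column names with underscores so they match the
# regression formula used below (e.g., "fixed acidity" -> "fixed_acidity").
data.columns = data.columns.str.replace(" ", "_")

print(data.shape)               # expected: (1599, 12) -> 11 predictors + quality
print(data.columns.tolist())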
Before making predictions with linear regression, it is essential to estimate the coefficients β0 and βj from the available data. The estimate of each coefficient βj can be calculated using the following formula:

βj = Σi (xij - x̄j)(yi - Ȳ) / Σi (xij - x̄j)^2   (2)
Where xij is the value of the j-th feature for the i-th data point (e.g., fixed acidity, volatile acidity, citric acid, etc.), x̄j is the mean of the j-th feature across all data points, and Ȳ is the mean of the dependent variable (quality) across all data points.

The intercept term (β0) can be computed as:

β0 = Ȳ - (β1x̄1 + β2x̄2 + ... + βpx̄p)   (3)

Instead of performing complex calculations manually using the given formulas to estimate the coefficients (β0, β1, β2, β3, ..., β11), we leveraged machine learning techniques and libraries to automate this process. The coefficients were computed using the following code:

Table 2: Code for Computed Coefficients
Code
import statsmodels.formula.api as smf

# Update the formula to encompass the relevant variables
formula = ("quality ~ fixed_acidity + volatile_acidity + citric_acid "
           "+ residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide "
           "+ density + pH + sulphates + alcohol")

# Fit the regression model
est = smf.ols(formula=formula, data=data).fit()

# Display the summary of the regression analysis
print(est.summary())

By utilizing this approach, we achieved a more efficient and automated means of estimating the coefficients, allowing us to focus on the interpretation and insights drawn from the results.

The results of the multiple linear regression analysis are summarized in the following table:

Table 3: The Results of the Multiple Linear Regression Analysis
Variable                Coefficient   P-value
Intercept               21.9652       0.300
Fixed Acidity           0.0250        0.336
Volatile Acidity        -1.0836       0.000
Citric Acid             -0.1826       0.215
Chlorides               -1.8742       0.000
Free Sulfur Dioxide     0.0044        0.045
Total Sulfur Dioxide    -0.0033       0.000
Density                 -17.8812      0.409
pH                      -0.4137       0.031
Sulphates               0.9163        0.000
Alcohol                 0.2762        0.000

These coefficients represent the estimated associations between each independent variable and the dependent variable, quality. For instance, the coefficient for volatile acidity (-1.0836) indicates that an increase in volatile acidity is correlated with a decrease in wine quality. Conversely, the coefficient for alcohol (0.2762) suggests that a higher alcohol content tends to be associated with higher wine quality.

This comprehensive analysis contributes valuable insights into the collective impact of these physicochemical attributes on wine quality.

The next step is preparing data for a machine-learning model by performing:
• Separating the features (X) and the target variable (y - quality).
• Standardizing the features using `StandardScaler`, by performing the following transformations on each feature: it calculates the mean (μ) and standard deviation (σ) of each feature in the training data.
• For each feature, it subtracts the mean (μ) and then divides by the standard deviation (σ):
Xstandardized = (X - μ) / σ
where X is the original feature value and Xstandardized is the standardized feature value.
• Splitting the data into training and testing sets using train_test_split(X, y, random_state=0, test_size=0.25). (A code sketch of these preparation steps is given after equation (4) below.)

Once these coefficients have been calculated, they can be used to make predictions for new data points by plugging in the values of the independent variables into the linear regression equation:
ŷi - predicted values based on the linear model
ŷi = β0 + β1xi1 + β2xi2 + ... + βpxip   (4)
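The data-preparation and prediction steps just described can be sketched with scikit-learn as follows. This is an illustrative sketch rather than the authors' code; the variable names (X, y, X_train_scaled, y_pred) are chosen here so that the residual analysis below can reuse them.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Separate the features (X) and the target variable (y = quality).
X = data.drop(columns="quality")
y = data["quality"]

# Split into training and testing sets, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, test_size=0.25
)

# Standardize each feature: (X - mu) / sigma, with mu and sigma
# estimated on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit an ordinary least squares model and predict on the training set,
# producing the y_pred used in the residual analysis below.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_train_scaled)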
The error term (e), known as a residual, represents the difference between the actual observed values (yi) and the predicted values (ŷi) for each data point (i).

Table 4: Code for Calculated Residuals
Code
residuals = y_train.values - y_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))
Mean of Residuals 1.2741174994864182e-16

Residuals are calculated by subtracting the predicted values (y_pred) from the actual values (y_train). These residuals represent the differences between the observed (actual) values and the values predicted by the linear regression model for each data point in the training dataset.

mean_residuals calculates the mean (average) of the residuals.

The output that is provided, "Mean of Residuals 1.2741174994864182e-16", indicates that the mean of the residuals is extremely close to zero but not exactly zero; the value is approximately 1.27 x 10^-16, which is a very small number. In theory, the mean of residuals should ideally be exactly zero for a well-fitted linear regression model. A mean this close to zero indicates that the linear regression model is reasonably well calibrated on the training data and, on average, does not exhibit systematic bias in its predictions.

In the context of regression analysis, homoscedasticity indicates that the residuals exhibit consistent or nearly consistent variance along the regression line. To assess this, we can create a scatter plot of the error terms against the predicted values, ensuring that there is no discernible pattern in the residuals.

Table 5: Code for Testing Heteroscedasticity
Code
residuals = y_train.values - y_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))
Mean of Residuals 1.2741174994864182e-16

In statistical analysis, the Goldfeld-Quandt test is commonly employed to assess homoscedasticity, a concept denoting the assumption that the variance of the errors (residuals) in a regression model remains consistent irrespective of the levels of the independent variables. Homoscedasticity holds significance in regression analysis as it signifies that the model's errors exhibit uniform variability, thereby contributing to the reliability of the model's performance.

When interpreting the Goldfeld-Quandt test results, the pivotal element is the p-value. In the context of the obtained p-value in the wine analysis (0.9197664304253765), it relates to the following hypotheses:

Null Hypothesis (H0): The error terms exhibit homoscedasticity, implying they possess a constant variance.
Alternative Hypothesis (Ha): The error terms display heteroscedasticity, indicating varying variance.

In our specific case, the calculated p-value (0.9197664304253765) significantly exceeds the conventional significance level of 0.05. When the p-value surpasses the significance level, it implies that there is insufficient evidence to support the conclusion that the error terms exhibit heteroscedasticity. The null hypothesis, which states that the error terms maintain homoscedasticity, is therefore not rejected.

Homoscedasticity is a fundamental assumption in linear regression models. When this assumption is met, it signifies that the errors in the model vary consistently across different levels of the independent variables. This uniformity ensures that the model's predictions maintain reliability across the entire spectrum of predictor values, and it facilitates a clearer interpretation of the relationship between the dependent and independent variables.
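The heteroscedasticity checks described above can be sketched with matplotlib and statsmodels as follows. This is an illustrative sketch, not the authors' exact code; it reuses the residuals, y_pred, y_train, and X_train_scaled variables from the earlier sketches.

import matplotlib.pyplot as plt
import statsmodels.stats.api as sms

# Scatter plot of residuals against predicted values: under
# homoscedasticity no clear pattern or funnel shape should appear.
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted quality")
plt.ylabel("Residuals")
plt.show()

# Goldfeld-Quandt test: H0 = homoscedastic errors. A large p-value
# (e.g., about 0.92 as reported above) means H0 is not rejected.
f_stat, p_value, _ = sms.het_goldfeldquandt(y_train, X_train_scaled)
print("Goldfeld-Quandt F statistic:", f_stat)
print("Goldfeld-Quandt p-value:", p_value)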
Figure 3: Autocorrelation Function

The autocorrelation function (ACF) is used to plot the correlation between a time series and its lagged values at various lags.
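Figure 3 itself is not reproduced in this excerpt. A sketch of how an autocorrelation plot of the residuals, together with the related Durbin-Watson statistic, can be produced with statsmodels is given below; it assumes the residuals array computed earlier and is illustrative rather than the authors' code.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

# Autocorrelation function of the residuals: significant spikes at
# nonzero lags would indicate autocorrelated errors.
plot_acf(residuals, lags=40)
plt.show()

# Durbin-Watson statistic: values near 2 suggest little autocorrelation.
print("Durbin-Watson:", durbin_watson(residuals))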
The least squares method minimizes the sum of squared differences between the observed Y values and the predicted Ŷ values. The relationship between ε and SSE is expressed by the formula for SSE:

SSE = Σi (yi - ŷi)^2

R2 = 1 - SSE / SST

where SSE is the Sum of Squared Errors and SST is the Total Sum of Squares.

Table 6: Multiple Linear Regression Model Loss Function
Metric    y_train    y_test
MAE       0.48949    0.53303
MSE       0.38888    0.490888
RMSE      0.62360    0.700634
R2        0.38123    0.303635
VIF       1.6161     1.43602

Lower MAE, MSE and RMSE values indicate a better fit of the model to the data, because the predicted values are closer to the actual values.

The Variance Inflation Factor (VIF) is a measure that helps us understand how much the variance of an estimated regression coefficient is inflated due to multicollinearity among the independent variables.
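The metrics in Table 6 can be computed along the following lines with scikit-learn and statsmodels. This is a sketch that reuses the fitted model and the train/test split from the earlier sketch; the authors' exact evaluation code is not shown in this excerpt.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

y_pred_test = model.predict(X_test_scaled)

# Error metrics on the test set (the same calls apply to the training set).
mae = mean_absolute_error(y_test, y_pred_test)
mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_test)
print(f"MAE={mae:.5f}  MSE={mse:.5f}  RMSE={rmse:.5f}  R2={r2:.5f}")

# Variance Inflation Factor for each predictor; a summary value such as
# the mean VIF can then be reported, as in Table 6.
vif = [variance_inflation_factor(X_train_scaled, i)
       for i in range(X_train_scaled.shape[1])]
print("Mean VIF:", np.mean(vif))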
The Random Forest Regressor, on the other hand, employs an ensemble approach, combining multiple decision trees to enhance prediction accuracy. This model often strikes a balance between complexity and generalization, making it a popular choice for various regression tasks.

Lastly, the Support Vector Machine, or SVM, is a powerful algorithm that excels in capturing intricate patterns within data. While it may exhibit a lower accuracy on the training set compared to other models, it can provide robust predictions and is particularly adept at handling non-linear relationships.

In this comparative analysis, we present the results of these models based on metrics such as accuracy, R-squared, and various error measures. By understanding the strengths and limitations of each model, we aim to guide the selection process towards the algorithm best suited for the specific nuances of our dataset and objectives.
Table 7: Comparing Linear Regression Problem Solving with Different Types of Machine Learning Models for the Wine Dataset
Model Performance Comparison of Regression Models
Model                    Accuracy   R2      MAE     MSE     RMSE
DecisionTreeRegressor    1.0        1.0     0.000   0.000   0.00
RandomForestRegressor    0.929      0.929   0.158   0.047   0.217
SVM                      0.556      0.556   0.380   0.295   0.543
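A sketch of how the three models in Table 7 could be trained and scored with scikit-learn follows. The hyperparameters (defaults) and the evaluation on the training set (consistent with the perfect decision-tree scores in Table 7) are assumptions rather than details given in the paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "SVM": SVR(),
}

for name, reg in models.items():
    reg.fit(X_train_scaled, y_train)
    pred = reg.predict(X_train_scaled)   # training-set scores, as assumed above
    mse = mean_squared_error(y_train, pred)
    print(
        name,
        "R2=%.3f" % r2_score(y_train, pred),
        "MAE=%.3f" % mean_absolute_error(y_train, pred),
        "MSE=%.3f" % mse,
        "RMSE=%.3f" % np.sqrt(mse),
    )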
Gradient descent is used for optimization. The code runs for 5,000 iterations, updating the coefficients with small steps in the direction of the negative gradient. This process iteratively refines the coefficients to improve the model's accuracy.

These assumptions collectively help ensure that a multiple linear regression model is appropriate for the given data and that the model's predictions are reliable. Violations of these assumptions may require further analysis or potential model adjustments.
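The PyTorch gradient-descent fitting referred to above is not listed in this excerpt. The following minimal sketch illustrates how a linear model can be defined and optimized for 5,000 iterations; the optimizer, learning rate, and loss function are assumptions rather than details taken from the paper, and the standardized training data from the earlier sketch are reused.

import torch

# Convert the standardized training data to tensors.
X_t = torch.tensor(X_train_scaled, dtype=torch.float32)
y_t = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1, 1)

# Linear model: y_hat = X @ W + b, one weight per physicochemical feature.
model = torch.nn.Linear(X_t.shape[1], 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # assumed learning rate

# Gradient descent: 5,000 small steps in the direction that reduces the loss.
for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), y_t)
    loss.backward()
    optimizer.step()

print("Final training MSE:", loss.item())
print("Learned coefficients:", model.weight.data.numpy().ravel())
print("Intercept:", model.bias.item())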