Unit – IV: Model Development
Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.
Unit – V: Model Evaluation
Generalization Error – Out-of-Sample Evaluation Metrics – Cross-Validation – Overfitting – Underfitting and Model Selection – Prediction by using Ridge Regression – Testing Multiple Parameters by using Grid Search.
Simple Linear Regression in Machine Learning:
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence the name Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value.
The Simple Linear Regression algorithm has mainly two objectives:
•Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
•Forecast new observations, such as forecasting the weather according to temperature, or the revenue of a company according to the investments made in a year.
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (the value of y obtained by putting x = 0),
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing,
ε = the error term (for a good model it will be negligible).
The above equation is equivalent to:
y = mx + c
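As a quick illustration (a minimal sketch with made-up numbers, not the salary dataset used below), the intercept a0 and slope a1 can be estimated with the ordinary least-squares formulas directly in NumPy:
import numpy as nm

# Made-up illustrative data: x = years of experience, y = salary (in thousands)
x = nm.array([1, 2, 3, 4, 5], dtype=float)
y = nm.array([40, 45, 52, 60, 63], dtype=float)

# Ordinary least-squares estimates of the slope (a1) and intercept (a0)
a1 = nm.sum((x - x.mean()) * (y - y.mean())) / nm.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(f"y = {a0:.2f} + {a1:.2f}x")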
Implementation of Simple Linear Regression Algorithm using Python:
Problem Statement example for Simple Linear Regression:
Dataset: a salary dataset with two variables:
1. Salary (dependent variable)
2. Experience (independent variable)
The goals of this problem are:
•To find out if there is any correlation between these two variables.
•To find the best-fit line for the dataset.
•To see how the dependent variable changes as the independent variable changes.
We will create a Simple Linear Regression model to find the best-fitting line representing the relationship between these two variables.
To implement the Simple Linear Regression model in machine learning using Python, follow the steps below:
Step-1: Data Pre-processing
The first step in creating the Simple Linear Regression model is data pre-processing.
•First, import the three important libraries, which help with loading the dataset, plotting graphs, and creating the Simple Linear Regression model.
import numpy as nm              # numerical operations
import matplotlib.pyplot as mtp # plotting
import pandas as pd             # loading and handling the dataset
•Next, load the dataset:
data_set= pd.read_csv('Salary_Data.csv')
The loaded dataset has two variables: Salary and Experience.
Extract the dependent and independent variables from the given dataset.
The independent variable is years of experience, and the dependent variable is salary.
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
For the x variable, the -1 index removes the last column (Salary) from the dataset, keeping only Experience.
For the y variable, the index 1 extracts the second column (Salary), since indexing starts from zero.
Split both variables into a training set and a test set.
There are 30 observations in total: take 20 observations for the training set and 10 for the test set.
We split the dataset so that we can train the model on the training set and then evaluate it on the test set.
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)
Step-2: Fitting the Simple Linear Regression to the Training Set:
Import the LinearRegression class of the linear_model module from scikit-learn, and create an object of the class named regressor. The code for this is given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
The fit() method fits our Simple Linear Regression object to the training set.
In fit(), we pass x_train and y_train, the training data for the independent and dependent variables, so that the model can learn the correlations between the predictor and target variables.
Fitting the model produces output such as:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
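As a quick check (a sketch, assuming the regressor fitted above), the learned intercept and slope can be inspected directly:
# Inspect the learned parameters of the fitted model
print("Intercept (a0):", regressor.intercept_)
print("Slope (a1):", regressor.coef_[0])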
Step-3: Prediction of Test Set Results:
Having fitted the model on the dependent variable (Salary) and the independent variable (Experience), our model is ready to predict the output for new observations.
In this step, we provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.
We create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively.
#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)
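As an optional sanity check (a sketch, not part of the original walkthrough), the predicted and actual test-set salaries can be compared side by side:
# Compare actual vs. predicted salaries on the test set
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)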
Step-4: Visualizing the Training Set Results:
Now we visualize the training set results.
We use the scatter() function of the pyplot library, which we already imported in the pre-processing step. The scatter() function creates a scatter plot of the observations.
On the x-axis we plot the years of experience of the employees, and on the y-axis their salaries.
In the function, we pass the real values of the training set: the years of experience (x_train), the training-set salaries (y_train), and the colour of the observations.
Next, to plot the regression line, we use the plot() function of the pyplot library, passing the years of experience for the training set, the predicted salaries for the training set (x_pred), and the colour of the line.
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Step-5: Visualizing the Test Set Results:
Here we use x_test and y_test instead of x_train and y_train for the scatter plot; the fitted regression line stays the same.
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Model Evaluation using Visualization:
Residual Plot:
For regression, there are numerous methods to evaluate the goodness of fit, i.e. how well the model fits the data.
R² values are one such measure, but they are not always enough on their own to make us confident about a model.
Residuals
A residual is a measure of how far a point is vertically from the regression line.
Simply, it is the error between a predicted value and the observed actual value.
The figure shows how to visualize residuals against the line of best fit: the vertical lines are the residuals.
Residual Plots
A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
The figure below is a good example of what a typical residual plot looks like.
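As a sketch (assuming the x_test, y_test, and regressor objects from the walkthrough above), such a residual plot can be drawn with matplotlib:
# Residuals = observed values minus predicted values on the test set
residuals = y_test - regressor.predict(x_test)

mtp.scatter(x_test, residuals, color="purple")
mtp.axhline(y=0, color="black", linestyle="--")  # reference line at zero residual
mtp.title("Residual Plot")
mtp.xlabel("Years of Experience")
mtp.ylabel("Residual")
mtp.show()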
Residual Plot Analysis:
The most important assumption of a linear regression
model is that the errors are independent and normally
distributed.
Every regression model inherently has some degree of
error since we can never predict something 100%
accurately.
More importantly, randomness and unpredictability are
always a part of the regression model.
Hence, a regression model can be expressed as: actual value = predicted value + error (residual).
Characteristics of Good Residual Plots:
A few characteristics of a good residual plot are as follows:
1. It has a high density of points close to the origin and a low density of points away from the origin.
2. It is symmetric about the origin.
To see why Fig. 3 is a good residual plot based on the characteristics above, project all the residuals onto the y-axis.
As seen in Fig. 3(b), we end up with a normally distributed curve, satisfying the assumption of normality of the residuals.
Fig 3: Good residual plot
Finally, one other reason this is a good residual plot is that, independent of the value of the independent variable (x-axis), the residual errors are distributed in approximately the same manner.
In other words, we do not see any patterns in the value of the residuals as we move along the x-axis.
Hence, this satisfies our earlier assumption that regression model residuals are independent and normally distributed.
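As a sketch (assuming the residuals array computed above and the seaborn library), projecting the residuals onto the y-axis amounts to plotting their distribution:
import seaborn as sns

# The distribution of residuals should look approximately normal
sns.histplot(residuals, kde=True)
mtp.title("Distribution of Residuals")
mtp.xlabel("Residual")
mtp.show()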
Using the characteristics described above, we can see why Figure 4 is a bad residual plot.
This plot has high density far away from the origin and low density close to the origin.
Also, when we project the residuals onto the y-axis, we can see that the distribution curve is not normal.
Fig 4(a): Example of a bad residual plot. Fig 4(b): Projection onto the y-axis.
Distribution Plots / Density Plots:
A density plot is like a smoother version of a histogram.
Generally, a kernel density estimate is used in density plots to show the probability density function of the variable.
A continuous curve (the kernel density estimate) is drawn to give a smooth density estimate over the whole data.
To plot a density plot of the variable ‘petal.length’, use the pandas df.plot() function (built on matplotlib) or the seaborn library’s sns.kdeplot() function.
Many features such as shading and the type of distribution can be set using the parameters available in these functions.
By default, the kernel used is Gaussian (this produces a Gaussian bell curve).
Other graph smoothing techniques/filters can also be applied.
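As a minimal sketch (assuming the Iris data is available locally as iris.csv with a 'petal.length' column; the file name is an assumption), the density plot can be drawn as follows:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mtp

df = pd.read_csv('iris.csv')  # assumed local copy of the Iris dataset

# Kernel density estimate of petal length; fill=True shades the area under the curve
sns.kdeplot(df['petal.length'], fill=True)
mtp.title("Density Plot of Petal Length")
mtp.show()

# Equivalent pandas one-liner (requires scipy):
# df['petal.length'].plot(kind='density')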
Polynomial Regression
•Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.
•The Polynomial Regression equation is given below:
•y = b0 + b1x1 + b2x1² + b3x1³ + ... + bnx1ⁿ
•It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
•It is a linear model with some modification in order to increase the accuracy.
•The dataset used in Polynomial Regression for training is of a non-linear nature.
•It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
•Hence, "In Polynomial Regression, the original features are converted into polynomial features of the required degree (2, 3, ..., n) and then modelled using a linear model." A sketch of this feature conversion is shown below.
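As a small sketch of this feature conversion (using scikit-learn's PolynomialFeatures, with made-up inputs), each value x is expanded into the columns x⁰, x¹, x², x³:
import numpy as nm
from sklearn.preprocessing import PolynomialFeatures

x = nm.array([[2], [3]])
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]  -> columns are x^0, x^1, x^2, x^3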
Need for Polynomial Regression:
The need for Polynomial Regression in ML can be understood from the points below:
•If we apply a linear model to a linear dataset, it gives a good result, as we saw in Simple Linear Regression. But if we apply the same model without any modification to a non-linear dataset, it produces drastically worse output.
•As a result, the loss function will increase, the error rate will be high, and accuracy will decrease.
•So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model.
•We can understand this better using the comparison diagram of a linear dataset and a non-linear dataset below.
•In the above image, we have a dataset that is arranged non-linearly. If we try to cover it with a linear model, we can clearly see that it hardly covers any data points. A curve, on the other hand, covers most of the data points; this is the Polynomial model.
•Hence, if a dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.
Equation of the Polynomial Regression Model:
Simple Linear Regression equation: y = b0 + b1x ..........(a)
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ..........(b)
Polynomial Regression equation: y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ ..........(c)
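As a sketch tying equation (c) to code (made-up, roughly quadratic data; scikit-learn's Pipeline chains the polynomial feature expansion with an ordinary linear model, matching the "Pipelines" topic in the unit outline):
import numpy as nm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up non-linear data: y roughly follows a quadratic in x
x = nm.arange(1, 11).reshape(-1, 1)
y = nm.array([1, 4, 9, 15, 26, 35, 50, 63, 82, 100])

# Step 1: expand x into [1, x, x^2]; Step 2: fit a linear model on the expanded features
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("linear", LinearRegression()),
])
poly_model.fit(x, y)

print(poly_model.predict([[11]]))  # prediction for a new observation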
Depending on the number of variables analysed at a time, analysis in the Data Science process can be Univariate, Bi-variate, or Multivariate.
•Univariate: analyse only one variable at a time.
•Bi-variate: compare two variables.
•Multivariate: compare more than two variables.