Unit – IV: Model Development
Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.
Unit – V: Model Evaluation
Generalization Error – Out-of-Sample Evaluation Metrics – Cross-Validation – Overfitting – Underfitting and Model Selection – Prediction by using Ridge Regression – Testing Multiple Parameters by using Grid Search.
Simple Linear Regression in Machine Learning:
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence the name Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value.
The Simple Linear Regression algorithm has mainly two objectives:
•Model the relationship between the two variables, such as the relationship between income and expenditure, or experience and salary.
•Forecast new observations, such as forecasting the weather according to temperature, or the revenue of a company according to the investments made in a year.
y = a0 + a1x + ε
Where,
a0 = the intercept of the regression line (the value of y obtained by putting x = 0),
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing,
ε = the error term (for a good model it will be negligible).
The above equation is equivalent to:
y = mx + c
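As a quick illustration (a minimal sketch with made-up numbers, not the salary dataset used below), the intercept a0 and slope a1 can be estimated with the ordinary least-squares formulas directly in NumPy:
import numpy as nm

# Made-up illustrative data: x = years of experience, y = salary (in thousands)
x = nm.array([1, 2, 3, 4, 5], dtype=float)
y = nm.array([40, 45, 52, 60, 63], dtype=float)

# Ordinary least-squares estimates of the slope (a1) and intercept (a0)
a1 = nm.sum((x - x.mean()) * (y - y.mean())) / nm.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print(f"y = {a0:.2f} + {a1:.2f}x")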
Implementation of Simple Linear Regression Algorithm using Python:
Problem Statement example for Simple Linear Regression:
Dataset: a salary dataset with two variables:
1. Salary (dependent variable)
2. Experience (independent variable)
The goals of this problem are:
•To find out if there is any correlation between these two variables.
•To find the best-fit line for the dataset.
•To see how the dependent variable changes as the independent variable changes.
We will create a Simple Linear Regression model to find the best-fitting line representing the relationship between these two variables.
To implement the Simple Linear Regression model in machine learning using Python, follow the steps below:
Step-1: Data Pre-processing
The first step in creating the Simple Linear Regression model is data pre-processing.
•First, import the three important libraries, which help with loading the dataset, plotting graphs, and creating the Simple Linear Regression model.
import numpy as nm              # numerical operations
import matplotlib.pyplot as mtp # plotting
import pandas as pd             # loading and handling the dataset
•Next, load the dataset:
data_set= pd.read_csv('Salary_Data.csv')
The loaded dataset has two variables: Salary and Experience.
Extract the dependent and independent variables from the given dataset.
The independent variable is years of experience, and the dependent variable is salary.
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
For the x variable, the -1 index removes the last column (Salary) from the dataset, keeping only Experience.
For the y variable, the index 1 extracts the second column (Salary), since indexing starts from zero.
Split both variables into a training set and a test set.
There are 30 observations in total: take 20 observations for the training set and 10 for the test set.
We split the dataset so that we can train the model on the training set and then evaluate it on the test set.
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)
Step-2: Fitting the Simple Linear Regression to the Training Set:
Import the LinearRegression class of the linear_model module from scikit-learn, and create an object of the class named regressor. The code for this is given below:
#Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)
The fit() method fits our Simple Linear Regression object to the training set.
In fit(), we pass x_train and y_train, the training data for the independent and dependent variables, so that the model can learn the correlations between the predictor and target variables.
Fitting the model produces output such as:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
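As a quick check (a sketch, assuming the regressor fitted above), the learned intercept and slope can be inspected directly:
# Inspect the learned parameters of the fitted model
print("Intercept (a0):", regressor.intercept_)
print("Slope (a1):", regressor.coef_[0])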
Step-3: Prediction of Test Set Results:
Having fitted the model on the dependent variable (Salary) and the independent variable (Experience), our model is ready to predict the output for new observations.
In this step, we provide the test dataset (new observations) to the model to check whether it can predict the correct output or not.
We create prediction vectors y_pred and x_pred, which will contain the predictions for the test dataset and the training set, respectively.
#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)
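As an optional sanity check (a sketch, not part of the original walkthrough), the predicted and actual test-set salaries can be compared side by side:
# Compare actual vs. predicted salaries on the test set
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)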
Step-4: Visualizing the Training Set Results:
Now we visualize the training set results.
We use the scatter() function of the pyplot library, which we already imported in the pre-processing step. The scatter() function creates a scatter plot of the observations.
On the x-axis we plot the years of experience of the employees, and on the y-axis their salaries.
In the function, we pass the real values of the training set: the years of experience (x_train), the training-set salaries (y_train), and the colour of the observations.
Next, to plot the regression line, we use the plot() function of the pyplot library, passing the years of experience for the training set, the predicted salaries for the training set (x_pred), and the colour of the line.
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Step-5: Visualizing the Test Set Results:
Here we use x_test and y_test instead of x_train and y_train for the scatter plot; the fitted regression line stays the same.
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()
Model Evaluation using Visualization:
Residual Plot:
For regression, there are numerous methods to evaluate the goodness of fit, i.e. how well the model fits the data.
R² values are one such measure, but they are not always enough on their own to make us confident about a model.
Residuals
A residual is a measure of how far a point is vertically from the regression line.
Simply, it is the error between a predicted value and the observed actual value.
The figure shows how to visualize residuals against the line of best fit: the vertical lines are the residuals.
Residual Plots
A typical residual plot has the residual values on the y-axis and the independent variable on the x-axis.
The figure below is a good example of what a typical residual plot looks like.
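As a sketch (assuming the x_test, y_test, and regressor objects from the walkthrough above), such a residual plot can be drawn with matplotlib:
# Residuals = observed values minus predicted values on the test set
residuals = y_test - regressor.predict(x_test)

mtp.scatter(x_test, residuals, color="purple")
mtp.axhline(y=0, color="black", linestyle="--")  # reference line at zero residual
mtp.title("Residual Plot")
mtp.xlabel("Years of Experience")
mtp.ylabel("Residual")
mtp.show()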
Residual Plot Analysis:
The most important assumption of a linear regression
model is that the errors are independent and normally
distributed.
Every regression model inherently has some degree of
error since we can never predict something 100%
accurately.
More importantly, randomness and unpredictability are
always a part of the regression model.
Hence, a regression model can be expressed as: actual value = predicted value + error (residual).
Characteristics of Good Residual Plots:
A few characteristics of a good residual plot are as follows:
1. It has a high density of points close to the origin and a low density of points away from the origin.
2. It is symmetric about the origin.
To see why Fig. 3 is a good residual plot based on the characteristics above, project all the residuals onto the y-axis.
As seen in Fig. 3(b), we end up with a normally distributed curve, satisfying the assumption of normality of the residuals.
Fig 3: Good residual plot
Finally, one other reason this is a good residual plot is that, independent of the value of the independent variable (x-axis), the residual errors are distributed in approximately the same manner.
In other words, we do not see any patterns in the value of the residuals as we move along the x-axis.
Hence, this satisfies our earlier assumption that regression model residuals are independent and normally distributed.
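As a sketch (assuming the residuals array computed above and the seaborn library), projecting the residuals onto the y-axis amounts to plotting their distribution:
import seaborn as sns

# The distribution of residuals should look approximately normal
sns.histplot(residuals, kde=True)
mtp.title("Distribution of Residuals")
mtp.xlabel("Residual")
mtp.show()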
Using the characteristics described above, we can see why Figure 4 is a bad residual plot.
This plot has high density far away from the origin and low density close to the origin.
Also, when we project the residuals onto the y-axis, we can see that the distribution curve is not normal.
Fig 4(a): Example of a bad residual plot. Fig 4(b): Projection onto the y-axis.
Distribution Plots / Density Plots:
A density plot is like a smoother version of a histogram.
Generally, a kernel density estimate is used in density plots to show the probability density function of the variable.
A continuous curve (the kernel density estimate) is drawn to give a smooth density estimate over the whole data.
To plot a density plot of the variable ‘petal.length’, use the pandas df.plot() function (built on matplotlib) or the seaborn library’s sns.kdeplot() function.
Many features such as shading and the type of distribution can be set using the parameters available in these functions.
By default, the kernel used is Gaussian (this produces a Gaussian bell curve).
Other graph smoothing techniques/filters can also be applied.
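As a minimal sketch (assuming the Iris data is available locally as iris.csv with a 'petal.length' column; the file name is an assumption), the density plot can be drawn as follows:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as mtp

df = pd.read_csv('iris.csv')  # assumed local copy of the Iris dataset

# Kernel density estimate of petal length; fill=True shades the area under the curve
sns.kdeplot(df['petal.length'], fill=True)
mtp.title("Density Plot of Petal Length")
mtp.show()

# Equivalent pandas one-liner (requires scipy):
# df['petal.length'].plot(kind='density')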
Polynomial Regression
•Polynomial Regression is a regression algorithm that models the relationship between a dependent variable (y) and an independent variable (x) as an nth-degree polynomial.
•The Polynomial Regression equation is given below:
•y = b0 + b1x1 + b2x1² + b3x1³ + ... + bnx1ⁿ
•It is also called a special case of Multiple Linear Regression in ML, because we add some polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial Regression.
•It is a linear model with some modification in order to increase the accuracy.
•The dataset used in Polynomial Regression for training is of a non-linear nature.
•It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
•Hence, "In Polynomial Regression, the original features are converted into polynomial features of the required degree (2, 3, ..., n) and then modelled using a linear model." A sketch of this feature conversion is shown below.
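As a small sketch of this feature conversion (using scikit-learn's PolynomialFeatures, with made-up inputs), each value x is expanded into the columns x⁰, x¹, x², x³:
import numpy as nm
from sklearn.preprocessing import PolynomialFeatures

x = nm.array([[2], [3]])
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]  -> columns are x^0, x^1, x^2, x^3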
Need for Polynomial Regression:
The need for Polynomial Regression in ML can be understood from the points below:
•If we apply a linear model to a linear dataset, it gives a good result, as we saw in Simple Linear Regression. But if we apply the same model without any modification to a non-linear dataset, it produces drastically worse output.
•As a result, the loss function will increase, the error rate will be high, and accuracy will decrease.
•So for such cases, where data points are arranged in a non-linear fashion, we need the Polynomial Regression model.
•We can understand this better using the comparison diagram of a linear dataset and a non-linear dataset below.
•In the above image, we have a dataset that is arranged non-linearly. If we try to cover it with a linear model, we can clearly see that it hardly covers any data points. A curve, on the other hand, covers most of the data points; this is the Polynomial model.
•Hence, if a dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.
Equation of the Polynomial Regression Model:
Simple Linear Regression equation: y = b0 + b1x ..........(a)
Multiple Linear Regression equation: y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn ..........(b)
Polynomial Regression equation: y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ ..........(c)
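As a sketch tying equation (c) to code (made-up, roughly quadratic data; scikit-learn's Pipeline chains the polynomial feature expansion with an ordinary linear model, matching the "Pipelines" topic in the unit outline):
import numpy as nm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up non-linear data: y roughly follows a quadratic in x
x = nm.arange(1, 11).reshape(-1, 1)
y = nm.array([1, 4, 9, 15, 26, 35, 50, 63, 82, 100])

# Step 1: expand x into [1, x, x^2]; Step 2: fit a linear model on the expanded features
poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2)),
    ("linear", LinearRegression()),
])
poly_model.fit(x, y)

print(poly_model.predict([[11]]))  # prediction for a new observation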
Depending on the number of variables analysed at a time, analysis in the Data Science process can be Univariate, Bi-variate, or Multivariate.
•Univariate: analyse only one variable at a time.
•Bi-variate: compare two variables.
•Multivariate: compare more than two variables.