
Regression Analysis

in
Machine learning
Regression Analysis
• Regression analysis is a statistical method.
What is Statistics?
• Statistics is the science concerned with developing
and studying methods for collecting, analyzing,
interpreting and presenting empirical data.
• Empirical data is information acquired by scientists
through experimentation and observation, and it
is essential to the scientific process.
• Use of the scientific method involves making an
observation, developing an idea, testing the idea,
getting results, and making a conclusion.
• Example: a study of the impact of social media usage
Other definitions of Statistics
• Statistics is the discipline that concerns the collection,
organization, analysis, interpretation, and presentation of
data.
• A collection of numerical facts or measurements, such as those about people, business conditions, or weather.
• Statistics is the branch of mathematics for collecting, analysing
and interpreting data.
• Statistics can be used to predict the future, determine the
probability that a specific event will happen, or help answer
questions about a survey.
• Statistics is used in many different fields such as business,
medicine, biology, psychology and social sciences.
Types of Statistics

• The two main branches of statistics are:
1. Descriptive Statistics
2. Inferential Statistics
1. Descriptive Statistics – Through graphs, tables, or numerical calculations, descriptive statistics uses the data to provide descriptions of the population.
2. Inferential Statistics – Based on the data sample taken from the population, inferential statistics makes predictions and inferences.
• Both types of statistics are equally employed in the field of statistical analysis.
Characteristics of Statistics
• The important characteristics of Statistics are
as follows:
• Statistics are numerically expressed.
• Statistics are an aggregate of facts.
• Data are collected in a systematic order.
• Data should be comparable to each other.
• Data are collected for a planned purpose.
Importance of Statistics
• Statistics helps in gathering information about the
appropriate quantitative data
• It depicts the complex data in graphical form, tabular form
and in diagrammatic representation to understand it easily
• It provides the exact description and a better understanding
• It helps in designing the effective and proper planning of the
statistical inquiry in any field
• It gives valid inferences with the reliability measures about
the population parameters from the sample data
• It helps to understand the variability pattern through the
quantitative observations
What are statistical methods?
• Statistical models are tools to help you analyze sets of data.
• Experts use statistical models as part of statistical analysis,
which is gathering and interpreting quantitative data.
• Using a statistical model can help you evaluate the
characteristics of a sample within a given population
and apply your findings to the larger group.
• While statisticians and data analysts may use statistical
models more than others, many can benefit from
understanding statistical models, including marketing
representatives, business executives and government
officials.
• Why are statistical methods important?
• Many organizations now have a lot of data available about their
customers, operations, services or products and related factors.
• A statistical model can make all of this data more
comprehensible.
• When businesses have a way to analyze and understand all of their data, they can perform tasks such as:
• Designing more effective surveys for customers or employees
• Creating experimental studies, such as a study to test a new product in development
• Assessing the value of a potential investment
• Conducting scientific experiments
Statistical Methods
• Here are six types of statistical models:
1. Mean
• The mean is the total sum of all your numbers divided by the total
numbers in the set.
• For example, a data set comprises the numbers 2, 5, 9 and 3. You first add
all of these numbers to get a total of 19, and then you divide that total by
four to get a mean of 4.75.
• More often called the average, the mean tries to provide you with
information about the overall trend of your data set.
• The mean is most helpful in analyzing data sets that have few outliers,
meaning data points that differ greatly from the rest of the data set.
• Calculating the mean is a fast and relatively simple way to analyze your
data.
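• As a quick Python sketch of this calculation, using the example numbers from this slide:
data = [2, 5, 9, 3]
mean = sum(data) / len(data)   # total sum divided by the count of numbers
print(mean)                    # 4.75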
• 2. Standard deviation
• The standard deviation evaluates the data spread around your mean.
• To calculate the standard deviation, you subtract the mean from each
value within the data set and square the answer.
• Then you find the mean of all the squared answers and take the square root of that result.
• If you get a high standard deviation, it means that the data points are spread widely around your mean.
• A low standard deviation, meanwhile, shows that more of your data
points align with your mean, which is also sometimes called the
expected value.
• Like the mean, standard deviation works best for data sets with a low
number of outliers.
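• A minimal Python sketch of the (population) standard deviation described above, reusing the same example numbers:
import math

data = [2, 5, 9, 3]
mean = sum(data) / len(data)
# Average of the squared deviations from the mean (the variance) ...
variance = sum((x - mean) ** 2 for x in data) / len(data)
# ... and its square root is the standard deviation
std_dev = math.sqrt(variance)
print(round(std_dev, 3))       # about 2.681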
• 3. Hypothesis testing
• Hypothesis testing evaluates if a certain premise or characteristic is true for your
data set.
• A hypothesis test analyzes whether your data set could have occurred randomly or whether it
reveals underlying patterns about the wider population from which the sample was drawn.
• A common hypothesis test is the t-test, which examines the relationship between two sets
of random variables within your data set.
• While a hypothesis test can be more complex than a mean or standard deviation
method, the hypothesis test is better at testing underlying assumptions about the
connections between your data points.
• For example, a company may operate under the assumption that a product of
higher quality takes more time to develop and ultimately generates more revenue.
• A hypothesis test evaluates the truth of this assumption by reviewing the
company's previous products' quality and time for development in relation to
profit.
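• As an illustrative sketch only (the revenue figures below are invented, not from this slide), a two-sample t-test in SciPy could compare the revenue of higher-quality products against standard ones:
from scipy import stats

# Hypothetical revenue figures for higher-quality vs. standard products
high_quality = [120, 135, 128, 140, 132]
standard = [110, 118, 105, 122, 115]

# Two-sample t-test: is the difference in mean revenue statistically significant?
t_stat, p_value = stats.ttest_ind(high_quality, standard)
print(t_stat, p_value)   # a small p-value suggests the means really differ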
• 4. Regression
• A linear regression model evaluates the relationship between
your dependent variable and your independent variable.
• An independent variable is the data used to predict your
dependent variable, and a dependent variable is the data you
want to measure.
• Regression is commonly used to determine whether one variable influences
or helps change another.
• For example, a marketing company might use a regression
model to determine whether the frequency of their social
media ads increases the number of customers who visit their
website.
• 5. Sample size determination
• With big data increasingly common among businesses, some companies
decide to reduce the size of their data set to a more manageable amount.
• People refer to this process as sample size determination.
• Sample size determination helps you determine how large a sample must be
for your data to be accurate in relation to the larger population.
• While there's no exact formula for sample size determination, you may
find it helpful to use proportions and standard deviation.
• For example, if you're a large corporation and want to research your
target consumers, that may be too big of a population for you to
accurately gather data about.
• Instead, you might use sample size determination to reduce the number
of people involved in your analysis while still getting accurate results.
• 6. Analysis of variance
• The analysis of variance, also known as ANOVA, determines if your results
or findings are statistically significant.
• ANOVA evaluates if your independent variables influence your dependent
variable, and if so, how much influence the independent variables wield
over the dependent one.
• For example, a company wants to evaluate a new drug treatment.
• It might design a study that compares how many patient symptoms
improve with its treatment versus an existing treatment.
• The ANOVA can help it analyze how much its new treatment improves or
mitigates a patient's symptoms.
• If the results are statistically significant and positive, that means the
ANOVA determined its treatment helps patients more than the existing
treatment.
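• A hedged sketch of a one-way ANOVA with SciPy; the symptom-improvement scores below are invented purely for illustration:
from scipy import stats

# Hypothetical symptom-improvement scores for three treatment groups
new_drug = [8, 9, 7, 8, 9]
existing_drug = [6, 7, 6, 5, 7]
placebo = [3, 4, 2, 4, 3]

# One-way ANOVA: do the group means differ significantly?
f_stat, p_value = stats.f_oneway(new_drug, existing_drug, placebo)
print(f_stat, p_value)   # a small p-value indicates a statistically significant difference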
Regression Analysis
• Regression analysis is a statistical method to model
the relationship between a dependent (target) variable
and one or more independent (predictor) variables.
• More specifically, Regression analysis helps us to
understand how the value of the dependent variable
is changing corresponding to an independent variable
when other independent variables are held fixed.
• It predicts continuous/real values such
as temperature, age, salary, price, etc.
• We can understand the concept of regression analysis using the below example:
• Example: Suppose there is a marketing company A that spends a varying amount on advertisement every year and gets sales in return. The list below shows the advertisement spend of the company in the last 5 years and the corresponding sales:
• Now, the company wants to spend $200 on advertisement in the year 2019 and wants to know the prediction about the sales for this year.
• So to solve such type of prediction problems in machine
learning, we need regression analysis.
• Regression is a supervised learning technique which helps in
finding the correlation between variables and enables us to
predict a continuous output variable based on one or
more predictor variables.
• It is mainly used for prediction, forecasting, time series
modeling, and determining the cause-and-effect relationship
between variables.
• In Regression, we plot a graph between the variables that best
fits the given datapoints; using this plot, the machine learning
model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through
all the datapoints on the target-predictor graph in such a way that
the vertical distance between the datapoints and the regression
line is minimum." The distance between the datapoints and the line tells
whether the model has captured a strong relationship or not.
• Some examples of regression can be as:
• Prediction of rain using temperature and other factors
• Determining Market trends
• Prediction of road accidents due to rash driving.
• Terminologies Related to the Regression Analysis:
• Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
• Independent Variable: The factors which affect the dependent variable, or which are
used to predict its values, are called independent variables, also known as predictors.
• Outliers: An outlier is an observation that contains either a very low or a very high
value in comparison to the other observed values. An outlier may hamper the result, so it
should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other,
then such a condition is called Multicollinearity. It should not be
present in the dataset, because it creates problems when ranking the most influential
variable.
• Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is
called underfitting.
• Why do we use Regression Analysis?
• As mentioned above, Regression analysis helps in the prediction of a continuous
variable. There are various real-world scenarios where we need future
predictions, such as weather conditions, sales, or marketing trends, and for such
cases we need a technique that can make predictions accurately. Regression analysis
is such a statistical method, used in both machine learning and data science.
Below are some other reasons for using Regression analysis:
• Regression estimates the relationship between the target and the independent
variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing the regression, we can confidently determine the most
important factor, the least important factor, and how each factor is affecting
the other factors.
• Types of Regression
• There are various types of regressions which are used in data science and
machine learning. Each type has its own importance on different scenarios,
but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here we are discussing some
important types of regression which are given below:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression:
• Linear Regression:
• Linear regression is a statistical regression method which is used for predictive
analysis.
• It is one of the simplest and easiest algorithms for regression; it
models the relationship between continuous variables.
• It is used for solving the regression problem in machine learning.
• Linear regression shows the linear relationship between the independent
variable (X-axis) and the dependent variable (Y-axis), hence called linear
regression.
• If there is only one input variable (x), then such linear regression is
called simple linear regression. And if there is more than one input variable,
then such linear regression is called multiple linear regression.
• The relationship between variables in the linear regression model can be
explained using the below image. Here we are predicting the salary of an
employee on the basis of the year of experience.
• Below is the mathematical equation for Linear regression:
Y= aX+b
• Here, Y = dependent variables (target variables),
X= Independent variables (predictor variables),
a and b are the linear coefficients
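• As a minimal sketch (assuming NumPy and made-up experience/salary numbers), the coefficients a and b of Y = aX + b can be estimated with a least-squares fit:
import numpy as np

# Placeholder data: years of experience (X) and salary (Y)
X = np.array([1, 2, 3, 4, 5])
Y = np.array([30000, 35000, 41000, 44000, 50000])

# np.polyfit with degree 1 returns the slope a and intercept b of Y = aX + b
a, b = np.polyfit(X, Y, 1)
print(a, b)
print(a * 6 + b)  # predicted salary for 6 years of experience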
• Some popular applications of linear regression are:
• Analyzing trends and sales estimates
• Salary forecasting
• Real estate prediction
• Arriving at ETAs in traffic.
• Logistic Regression:
• Logistic regression is another supervised learning algorithm
which is used to solve the classification problems.
In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
• Logistic regression algorithm works with the categorical variable
such as 0 or 1, Yes or No, True or False, Spam or not spam, etc.
• It is a predictive analysis algorithm which works on the concept
of probability.
• Logistic regression is a type of regression, but it differs from
the linear regression algorithm in terms of how it is used.
• Logistic regression uses the sigmoid function, or
logistic function, to model the data. The
function can be represented as:
f(x) = 1 / (1 + e^(-x))
• f(x) = output between the 0 and 1 value
• x = input to the function
• e = base of the natural logarithm
• When we provide the input values (data) to
the function, it gives the S-curve as follows:
• It uses the concept of threshold levels: values
above the threshold level are rounded up to 1,
and values below the threshold level are
rounded down to 0.
• There are three types of logistic regression:
• Binary (0/1, pass/fail)
• Multinomial (cats, dogs, lions)
• Ordinal (low, medium, high)
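• A small sketch of the sigmoid function and thresholding described above; the input values are arbitrary:
import numpy as np

def sigmoid(x):
    # Logistic (sigmoid) function: squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-x))

inputs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
probabilities = sigmoid(inputs)
# Threshold at 0.5: values at or above become class 1, values below become class 0
classes = (probabilities >= 0.5).astype(int)
print(probabilities)
print(classes)   # [0 0 1 1 1]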
• Polynomial Regression:
• Polynomial Regression is a type of regression which models the non-
linear dataset using a linear model.
• It is similar to multiple linear regression, but it fits a non-linear curve
between the value of x and corresponding conditional values of y.
• Suppose there is a dataset which consists of datapoints which are
present in a non-linear fashion, so for such case, linear regression
will not best fit to those datapoints. To cover such datapoints, we
need Polynomial regression.
• In Polynomial regression, the original features are transformed into
polynomial features of a given degree and then modeled using a
linear model, which means the datapoints are best fitted using a
polynomial line.
• The equation for polynomial regression is also derived
from the linear regression equation; that is, the linear
regression equation Y = b0 + b1x is transformed into the
polynomial regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
• Here Y is the predicted/target output, b0, b1, ..., bn are
the regression coefficients, and x is
our independent/input variable.
• The model is still considered linear because the coefficients
remain linear; only the features are raised to higher degrees (quadratic, cubic, and so on).
• Note: This is different from Multiple Linear
regression in such a way that in Polynomial
regression, a single element has different
degrees instead of multiple variables with the
same degree.
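• A hedged scikit-learn sketch of this idea, with synthetic data: the feature x is expanded into polynomial features of degree 2 and then fitted with an ordinary linear model:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic non-linear data: y follows a quadratic curve
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])

# Transform x into [1, x, x^2] and fit a linear model on those features
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print(model.predict(poly.transform([[6]])))   # close to 36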
Support Vector Regression:
• Support Vector Machine is a supervised learning algorithm which can be used for
regression as well as classification problems. So if we use it for regression problems,
then it is termed as Support Vector Regression.
• Support Vector Regression is a regression algorithm which works for continuous
variables. Below are some keywords which are used in Support Vector Regression:
• Kernel: It is a function used to map lower-dimensional data into higher-dimensional
data.
• Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is
the line which helps to predict the continuous variable and covers most of the datapoints.
• Boundary line: Boundary lines are the two lines on either side of the hyperplane, which create a
margin for the datapoints.
• Support vectors: Support vectors are the datapoints which are nearest to the
hyperplane and the boundary lines.
• In SVR, we always try to determine a hyperplane with a maximum margin, so that the
maximum number of datapoints are covered within that margin. The main goal of SVR is to
consider the maximum number of datapoints within the boundary lines, and the hyperplane (best-
fit line) must contain a maximum number of datapoints. Consider the below image:
• Here, the blue line is called hyperplane, and the other two
lines are known as boundary lines.
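• A minimal Support Vector Regression sketch with scikit-learn; the data, kernel and parameters below are chosen only for illustration:
import numpy as np
from sklearn.svm import SVR

# Synthetic training data
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0, 13.1])

# The RBF kernel maps the data into a higher-dimensional space;
# epsilon controls the width of the margin around the hyperplane
model = SVR(kernel="rbf", C=100, epsilon=0.5)
model.fit(x, y)
print(model.predict([[7]]))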
• Decision Tree Regression:
• Decision Tree is a supervised learning algorithm which can be used
for solving both classification and regression problems.
• It can solve problems for both categorical and numerical data
• Decision Tree regression builds a tree-like structure in which each
internal node represents a "test" on an attribute, each branch
represents the result of the test, and each leaf node represents the
final decision or result.
• A decision tree is constructed starting from the root node/parent
node (dataset), which splits into left and right child nodes (subsets of
dataset). These child nodes are further divided into their children
node, and themselves become the parent node of those nodes.
Consider the below image:
• The above image shows an example of Decision Tree regression;
here, the model is trying to predict the choice of a person
between a sports car and a luxury car.
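• A short sketch of Decision Tree regression with scikit-learn (the experience/salary numbers are synthetic, not taken from the slide's car example):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: years of experience vs. salary (in thousands)
x = np.array([[1], [2], [3], [4], [5], [6], [7]])
y = np.array([30, 35, 40, 48, 55, 63, 70])

# Each internal node tests a feature threshold; each leaf stores a predicted value
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(x, y)
print(tree.predict([[4.5]]))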
• Random Forest Regression:
• Random forest is one of the most powerful supervised learning
algorithms, capable of performing both regression and
classification tasks.
• The Random Forest regression is an ensemble learning method
which combines multiple decision trees and predicts the final
output based on the average of each tree's output. The
combined decision trees are called base models, and the ensemble can
be represented more formally as:
• g(x) = f0(x) + f1(x) + f2(x) + ....
• Random forest uses the Bagging or Bootstrap
Aggregation technique of ensemble learning,
in which the aggregated decision trees run in
parallel and do not interact with each other.
• With the help of Random Forest regression,
we can prevent overfitting in the model by
creating random subsets of the dataset.
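• A minimal Random Forest regression sketch with scikit-learn, using the same kind of synthetic data:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.array([[1], [2], [3], [4], [5], [6], [7]])
y = np.array([30, 35, 40, 48, 55, 63, 70])

# Bagging: each of the n_estimators trees is trained on a bootstrap sample
# of the data, and the final prediction is the average of the tree outputs
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(x, y)
print(forest.predict([[4.5]]))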
• Ridge Regression:
• Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long
term predictions.
• The amount of bias added to the model is known as the Ridge Regression
penalty. We can compute this penalty term by multiplying lambda
by the squared weight of each individual feature.
• The equation (cost function) for ridge regression will be:
Cost = sum of (yi - ŷi)^2 + lambda * sum of (wj)^2
• A general linear or polynomial regression will fail if there is high
collinearity between the independent variables; to solve such
problems, Ridge regression can be used.
• Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
• It helps to solve problems where we have more parameters than samples.
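• A small scikit-learn sketch of Ridge regression (synthetic data; alpha plays the role of lambda in the penalty term):
import numpy as np
from sklearn.linear_model import Ridge

x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 6.1, 8.3, 10.2])

# Larger alpha adds more bias (stronger shrinkage of the squared weights)
ridge = Ridge(alpha=1.0)
ridge.fit(x, y)
print(ridge.coef_, ridge.intercept_)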
• Lasso Regression:
• Lasso regression is another regularization technique to
reduce the complexity of the model.
• It is similar to Ridge Regression, except that the penalty
term contains only the absolute values of the weights instead of the
squares of the weights.
• Since it takes absolute values, it can shrink a
slope all the way to 0, whereas Ridge Regression can only shrink it
near to 0.
• It is also called L1 regularization. The equation (cost function) for Lasso
regression will be:
Cost = sum of (yi - ŷi)^2 + lambda * sum of |wj|
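• A small scikit-learn sketch of Lasso regression with synthetic data; the second feature is uninformative, so the L1 penalty can shrink its coefficient toward exactly 0:
import numpy as np
from sklearn.linear_model import Lasso

# Two features, but only the first one actually drives y
x = np.array([[1, 5], [2, 3], [3, 8], [4, 1], [5, 6]])
y = np.array([2.0, 4.1, 6.0, 8.2, 10.1])

lasso = Lasso(alpha=0.5)
lasso.fit(x, y)
print(lasso.coef_)   # the coefficient of the uninformative feature is driven toward 0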
Linear Regression in Machine Learning
• Linear regression is one of the easiest and most popular Machine
Learning algorithms. It is a statistical method that is used for predictive
analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a
dependent (y) variable and one or more independent (x) variables, hence
it is called linear regression. In other words, it finds how the value of the
dependent variable changes according to the value of the independent
variable.
• The linear regression model provides a sloped straight line
representing the relationship between the variables. Consider the
below image:
• Mathematically, we can represent a linear regression as:
• y= a0+a1x+ ε
• Here,
• Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of
freedom)
a1 = Linear regression coefficient (scale factor to each input
value).
ε = random error
• The values for x and y variables are training datasets for Linear
Regression model representation.
Types of Linear Regression
• Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
• Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
• Linear Regression Line
• A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:
• Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases
on X-axis, then such a relationship is termed as a Positive linear relationship.
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and
independent variable increases on the X-axis, then such a
relationship is called a negative linear relationship.
• Finding the best fit line:
• When working with linear regression, our main goal is to find the best fit
line, which means the error between the predicted values and actual values
should be minimized. The best fit line will have the least error.
• The different values for weights or the coefficient of lines (a0, a1) gives a
different line of regression, so we need to calculate the best values for
a0 and a1 to find the best fit line, so to calculate this we use cost function.
• Cost function-
• The different values for weights or coefficient of lines (a0, a1) gives the
different line of regression, and the cost function is used to estimate the
values of the coefficient for the best fit line.
• Cost function optimizes the regression coefficients or weights. It
measures how a linear regression model is performing.
• We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also
known as Hypothesis function.
• For Linear Regression, we use the Mean Squared Error (MSE) cost function, which
is the average of the squared errors between the predicted values and the actual
values.
• For the above linear equation, MSE can be calculated as:
MSE = (1/N) * sum of (Yi - (a1xi + a0))^2
• Where,
• N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
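• A short Python sketch of this MSE calculation, using hypothetical actual and predicted values:
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])      # Yi
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])   # a1*xi + a0

# Mean Squared Error: average of the squared differences
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)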
• Residuals: The distance between an actual
value and the predicted value is called a residual. If
the observed points are far from the
regression line, the residuals will be high,
and so the cost function will be high. If the scatter
points are close to the regression line, the
residuals will be small and hence so will the cost
function.
• Gradient Descent:
• Gradient descent is used to minimize the MSE by
calculating the gradient of the cost function.
• A regression model uses gradient descent to
update the coefficients of the line by reducing
the cost function.
• It starts with randomly selected coefficient
values and then iteratively updates them
to reach the minimum of the cost function.
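• A hedged sketch of gradient descent for the MSE of y = a0 + a1·x; the data and learning rate below are arbitrary:
import numpy as np

# Arbitrary training data, roughly following y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

a0, a1 = 0.0, 0.0        # start from arbitrary coefficient values
learning_rate = 0.01

for _ in range(2000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # Gradients of the MSE cost with respect to a0 and a1
    grad_a0 = 2 * error.mean()
    grad_a1 = 2 * (error * x).mean()
    # Move the coefficients against the gradient to reduce the cost
    a0 -= learning_rate * grad_a0
    a1 -= learning_rate * grad_a1

print(a0, a1)            # approaches 1 and 2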
• Model Performance:
• The Goodness of fit determines how the line
of regression fits the set of observations. The
process of finding the best model out of
various models is called optimization. It can
be achieved by below method:
1. R-squared method:
• R-squared is a statistical method that determines the goodness of fit.
• It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
• A high value of R-squared indicates a smaller difference between the
predicted values and actual values and hence represents a good model.
• It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
• It can be calculated from the below formula:
R^2 = 1 - (sum of squared residual errors / total sum of squares)
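• A short sketch computing R-squared both manually and with scikit-learn, using hypothetical values:
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.3, 6.9, 9.4])

# R-squared = 1 - (sum of squared residuals / total sum of squares)
ss_res = np.sum((y_actual - y_predicted) ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_actual, y_predicted))   # the two values match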
• Assumptions of Linear Regression
• Below are some important assumptions of Linear Regression. These are some formal
checks while building a Linear Regression model, which ensures to get the best possible
result from the given dataset.
• Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and
independent variables.
• Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to
multicollinearity, it may be difficult to find the true relationship between the predictors and the
target variable; or we can say it is difficult to determine which predictor variable is
affecting the target variable and which is not. So, the model assumes either little or no
multicollinearity between the features or independent variables.
• Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern
distribution of data in the scatter plot.
• Normal distribution of error terms:
Linear regression assumes that the error term follows the
normal distribution. If the error terms are not normally
distributed, then the confidence intervals will become either too wide
or too narrow, which may cause difficulty in finding the coefficients.
It can be checked using a Q-Q plot: if the plot shows a straight
line without any deviation, the error is normally
distributed.
• No autocorrelations:
The linear regression model assumes no autocorrelation in the error
terms. If there is any correlation in the error terms, it
will drastically reduce the accuracy of the model. Autocorrelation
usually occurs if there is a dependency between the residual errors.
Simple Linear Regression
• Simple Linear Regression is a type of Regression algorithms that models the
relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is
linear or a sloped straight line, hence it is called Simple Linear Regression.
• The key point in Simple Linear Regression is that the dependent variable
must be a continuous/real value. However, the independent variable can
be measured on continuous or categorical values.
• Simple Linear regression algorithm has mainly two objectives:
• Model the relationship between the two variables. Such as the
relationship between Income and expenditure, experience and Salary, etc.
• Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year,
etc.
• Simple Linear Regression Model:
• The Simple Linear Regression model can be
represented using the below equation:
• y= a0+a1x+ ε
• Where,
• a0= It is the intercept of the Regression line (can be
obtained putting x=0)
a1= It is the slope of the regression line, which tells
whether the line is increasing or decreasing.
ε = The error term. (For a good model it will be negligible)
Implementation of Simple Linear Regression
Algorithm using Python
• Problem Statement example for Simple Linear Regression:
• Here we are taking a dataset that has two variables: salary (dependent
variable) and experience (independent variable). The goals of this
problem are:
• To find out if there is any correlation between these two variables
• To find the best fit line for the dataset
• To see how the dependent variable changes when the independent variable changes
• In this section, we will create a Simple Linear Regression model to find out
the best fitting line for representing the relationship between these two
variables.
• To implement the Simple Linear regression model in machine learning
using Python, we need to follow the below steps:
• Step-1: Data Pre-processing
• The first step for creating the Simple Linear Regression model is
data pre-processing. We have already done it earlier in this tutorial. But
there will be some changes, which are given in the below steps:
• First, we will import the three important libraries, which will help us
load the dataset, plot the graphs, and create the Simple Linear
Regression model (a likely sketch of these imports is shown below).
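• The import statements are not shown on the slide; based on the aliases used later in this tutorial (pd for pandas, mtp for matplotlib.pyplot), they are likely something like the following (the numpy alias nm is an assumption):
import numpy as nm               # numerical computations (alias assumed)
import matplotlib.pyplot as mtp  # plotting the graphs
import pandas as pd              # loading the dataset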
• Next, we will load the dataset into our code:
data_set= pd.read_csv('Salary_Data.csv')
• By executing the above line of code
(ctrl+ENTER), we can read the dataset on our
Spyder IDE screen by clicking on the variable
explorer option.
• The above output shows the dataset, which has two variables:
Salary and Experience.
• Note: In Spyder IDE, the folder containing the code file must
be saved as a working directory, and the dataset or csv file
should be in the same folder.
• After that, we need to extract the dependent and
independent variables from the given dataset. The
independent variable is years of experience, and the
dependent variable is salary. Below is code for it:
x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values
• In the above lines of code, for the x variable we
have used -1, since we want to drop
the last column from the dataset. For the y
variable we have passed 1 as the index,
since we want to extract the
second column, and indexing starts from
zero.
• By executing the above lines of code, we will get the output for the X and Y variables.
• In the output, we can see that the X (independent) variable and Y (dependent) variable have been extracted from the dataset.
• Next, we will split both variables into the test
set and training set. We have 30 observations,
so we will take 20 observations for the training
set and 10 observations for the test set. We
are splitting our dataset so that we can train
our model using a training dataset and then
test the model using a test dataset. The code
for this is given below:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

• By executing the above code, we will get x-test, x-train and y-test, y-train
dataset. Consider the below images:
Training Dataset:
• For simple linear Regression, we will not use
Feature Scaling. Because Python libraries take
care of it for some cases, so we don't need to
perform it here. Now, our dataset is well
prepared to work on it and we are going to
start building a Simple Linear Regression
model for the given problem.
• Step-2: Fitting the Simple Linear Regression
to the Training Set:
• Now the second step is to fit our model to the
training dataset. To do so, we will import
the LinearRegression class of
the linear_model library from the scikit learn.
After importing the class, we are going to
create an object of the class named as
a regressor. The code for this is given below:
# Fitting the Simple Linear Regression model to the training dataset
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

• In the above code, we have used the fit() method to fit our
Simple Linear Regression object to the training set. In the fit()
function, we have passed x_train and y_train, which are our
training data for the independent and dependent variables
respectively. We have fitted our regressor object to the training
set so that the model can easily learn the correlations
between the predictor and target variables. After executing
the above lines of code, we will get the below output.
• Output:
• Out[7]: LinearRegression(copy_X=True, fit_intercept=True,
n_jobs=None, normalize=False)
• Step: 3. Prediction of test set result:
• Our model has now been trained on the relationship between the
dependent (salary) and independent variable (Experience).
So, now, our model is ready to predict the output for new
observations. In this step, we will provide the test dataset (new
observations) to the model to check whether it can predict the
correct output or not.
• We will create two prediction vectors, y_pred and x_pred, which
will contain the predictions for the test dataset and the
training set respectively.
#Prediction of Test and Training set result
y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)

• On executing the above lines of code, two variables named y_pred and
x_pred will be generated in the variable explorer; they contain the salary
predictions for the test set and the training set respectively.
• Output:
• You can check the variable by clicking on the variable explorer option in
the IDE, and also compare the result by comparing values from y_pred
and y_test. By comparing these values, we can check how good our model
is performing.
• Step: 4. visualizing the Training set results:
• Now in this step, we will visualize the training set result. To do so, we will use the
scatter() function of the pyplot library, which we have already imported in the pre-
processing step. The scatter () function will create a scatter plot of observations.
• In the x-axis, we will plot the Years of Experience of employees and on the y-axis,
salary of employees. In the function, we will pass the real values of training set,
which means a year of experience x_train, training set of Salaries y_train, and color
of the observations. Here we are taking a green color for the observation, but it can
be any color as per the choice.
• Now, we need to plot the regression line, so for this, we will use the plot()
function of the pyplot library. In this function, we will pass the years of experience
for training set, predicted salary for training set x_pred, and color of the line.
• Next, we will give the title for the plot. So here, we will use the title() function of
the pyplot library and pass the name "Salary vs Experience (Training Dataset)".
• After that, we will assign labels for x-axis and
y-axis using xlabel() and ylabel() function.
• Finally, we will represent all above things in a
graph using show(). The code is given below:
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

• Output:
• By executing the above lines of code, we will
get the below graph plot as an output
• In the above plot, we can see the real observation
values as green dots, and the predicted values are
covered by the red regression line. The regression line
shows a correlation between the dependent and
independent variables.
• The good fit of the line can be observed by calculating
the difference between actual values and predicted
values. But as we can see in the above plot, most of the
observations are close to the regression line, hence our
model is good for the training set.
• Step: 5. visualizing the Test set results:
• In the previous step, we have visualized the
performance of our model on the training set. Now,
we will do the same for the Test set. The complete
code will remain the same as the above code,
except in this, we will use x_test, and y_test instead
of x_train and y_train.
• Here we are also changing the color of observations
and regression line to differentiate between the two
plots, but it is optional.
#visualizing the Test set results
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

• Output:
• By executing the above line of code, we will
get the output as:
