EXP-4 DMusingPYTHON
EXP-4 DMusingPYTHON
As a result, a linear function that predicts the values of the dependent variable as
a function of the independent variable is discovered for two-dimensional sample
points with one independent variable and one dependent variable.
The below graph explains the relation between Salary and Years of Experience
Equation : y = mx + c
This is the simple linear regression equation where c is the constant and m is
the slope and describes the relationship between x (independent
variable) and y (dependent variable). The coefficient can be positive or negative
and is the degree of change in the dependent variable for every 1 unit of change
in the independent variable.
β0 (y-intercept) and β1 (slope) are the coefficients whose values represent the
accuracy of predicted values with the actual values.
Implement Simple Linear Regression in Python
In this example, we will use the salary data concerning the experience of
employees. In this dataset, we have two columns YearsExperience and Salary
Step 1: Import the required python packages
We need Pandas for data manipulation, NumPy for mathematical calculations,
and MatplotLib, and Seaborn for visualizations. Sklearn libraries are used for
machine learning operations
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.core.common import random_state
from sklearn.linear_model import LinearRegression
Step 2: Load the dataset
Download the dataset from here and upload it to your notebook and read it into
the pandas dataframe.
# Get dataset
df_sal = pd.read_csv(r"C:\Users\ayyap\OneDrive\Desktop\DMDW#LAB\
Salary_Data.csv")
df_sal.head()
Step 3: Data analysis
Now that we have our data ready, let's analyze and understand its trend in detail.
To do that we can first describe the data below –
# Describe data
df_sal.describe()
Here, we can see Salary ranges from 37731 to 122391 and a median of65237.We
can also find how the data is distributed visually using Seaborn Histplot
# Data distribution
plt.title('Salary Distribution Plot')
sns.Histplot(df_sal['Salary'])
plt.show()
It is clearly visible now, our data varies linearly. That means, that an individual
receives more Salary as they gain Experience.
Step 4: Split the dataset into dependent/independent variables
Experience (X) is the independent variable
Salary (y) is dependent on experience
# Splitting variables
X = df_sal.iloc[:, :1] # independent
y = df_sal.iloc[:, 1:] # dependent
Step 4: Split data into Train/Test sets
Further, split your data into training (80%) and test (20%) sets
using train_test_split
# Regressor model
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step 6: Predict the result
Here comes the interesting part, when we are all set and ready to predict any
value of y (Salary) dependent on X (Experience) with the trained model
using regressor.predict
# Prediction result
y_pred_test = regressor.predict(X_test) # predicted value of y_test
y_pred_train = regressor.predict(X_train) # predicted value of y_train
Step 7: Plot the training and test results
Its time to test our predicted results by plotting graphs
Plot training set data vs predictions
First we plot the result of training sets (X_train, y_train) with X_train and
predicted value of y_train (regressor.predict(X_train))
# Prediction on training set
plt.scatter(X_train, y_train, color = 'lightcoral')
plt.plot(X_train, y_pred_train, color = 'firebrick')
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend(['X_train/Pred(y_test)', 'X_train/y_train'], title = 'Sal/Exp', loc='best',
facecolor='white')
plt.box(False)
plt.show()
Plot test set data vs predictions
Secondly, we plot the result of test sets (X_test, y_test) with X_train and
predicted value of y_train (regressor.predict(X_train))