Salary Prediction LinearRegression

The document presents a linear regression model to predict salary based on years of experience. It loads and explores the dataset, prepares the data by separating the feature and the target, trains a linear regression model on 20 of the 30 rows (test_size=0.33), tests it on the remaining 10 rows, and achieves an R² score of about 0.928 (printed as 92.78) when comparing predicted and actual salaries on the test data. Finally, it plots the actual and predicted salaries to visualize the model performance.

Uploaded by

Yagnesh Vyas
Copyright
© All Rights Reserved
hf66etmrt

June 7, 2023

1 Predicting Salary according to Years of experience


1.0.1 Importing necessary libraries

[66]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

1.0.2 Loading the dataset

[25]: data = pd.read_csv('data.csv') # Loading the dataset and displaying the first 5 rows
data.head(5)

[25]:    YearsExperience   Salary
      0              1.1  39343.0
      1              1.3  46205.0
      2              1.5  37731.0
      3              2.0  43525.0
      4              2.2  39891.0

1.0.3 Exploring the Dataset

[31]: data.shape # Dataset contains 30 rows and 2 columns.

[31]: (30, 2)

[32]: data.info() # Checking information about the dataset such as columns, non-null counts, and datatypes.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   YearsExperience  30 non-null     float64
 1   Salary           30 non-null     float64
dtypes: float64(2)
memory usage: 608.0 bytes

[29]: data.describe()

[29]:        YearsExperience         Salary
      count        30.000000      30.000000
      mean          5.313333   76003.000000
      std           2.837888   27414.429785
      min           1.100000   37731.000000
      25%           3.200000   56720.750000
      50%           4.700000   65237.000000
      75%           7.700000  100544.750000
      max          10.500000  122391.000000

The describe() function is a convenient method in pandas that provides a statistical summary of a
DataFrame or Series. It calculates various descriptive statistics for each numerical column in the
dataset, including count, mean, standard deviation, minimum value, 25th percentile (Q1), median
(50th percentile or Q2), 75th percentile (Q3), and maximum value.
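
As a rough illustration of what describe() reports, the sketch below recomputes the same statistics with Python's standard statistics module for the five Salary values shown by data.head(5). The helper name `summarize` is hypothetical, and the sketch assumes that the "inclusive" quantile method corresponds to the linear interpolation pandas uses by default.

```python
import statistics

def summarize(values):
    """Return describe()-style statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

# Salary values from the first five rows printed by data.head(5)
salaries = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]
summary = summarize(salaries)
for name, value in summary.items():
    print(f"{name:>5}: {value:.2f}")
```

Because only five rows are used, these numbers differ from the full 30-row summary shown in the notebook output.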

[33]: data.isnull().sum() # Checking if the dataset contains any null values.

[33]: YearsExperience 0
Salary 0
dtype: int64

[40]: num_duplicates = data.duplicated().sum() # Checking if there are any duplicate rows in the dataset.

if num_duplicates > 0:
    print(f"The dataset contains {num_duplicates} duplicate rows.")
    data = data.drop_duplicates()
    print("Dropped duplicates.")
    # Recount after dropping so the reported number reflects the cleaned data.
    print("Number of duplicate rows after dropping:", data.duplicated().sum())
else:
    print("The dataset doesn't contain any duplicate values.")

The dataset doesn't contain any duplicate values.

1.0.4 Preparing the data

[50]: X = data.iloc[:,:-1] # Independent feature
X.head(5)

[50]: YearsExperience
0 1.1
1 1.3
2 1.5
3 2.0
4 2.2

[53]: Y = data.iloc[:,-1] # Dependent variable (target)
Y.head(5)

[53]: 0 39343.0
1 46205.0
2 37731.0
3 43525.0
4 39891.0
Name: Salary, dtype: float64

1.0.5 Plotting the data to look at its distribution

[54]: plt.scatter(X,Y)
plt.title("Salary according to Experience")
plt.xlabel("Years of experience") # X holds YearsExperience
plt.ylabel("Salary") # Y holds Salary

[54]: Text(0, 0.5, 'Salary')

1.0.6 Splitting the dataset into train and test

[56]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=51)

[57]: print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(20, 1)
(10, 1)
(20,)
(10,)
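
The 20/10 split printed above follows from test_size=0.33 applied to 30 rows. A minimal sketch of the arithmetic, assuming scikit-learn rounds a fractional test size up and gives the remaining rows to the training set (the helper `split_sizes` is hypothetical and not part of scikit-learn):

```python
import math

def split_sizes(n_samples, test_size):
    """Estimate train/test row counts for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(30, 0.33)
print(n_train, n_test)
```

Under that assumption the test set holds one third of the rows, so the split is closer to 67/33 than to 80/20.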

1.0.7 Training the model

[58]: linear = LinearRegression()

[59]: linear.fit(X_train, Y_train)

[59]: LinearRegression()
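
For a single feature, the least-squares line can be written in closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below applies these formulas to the five rows shown by data.head(5) only, so its numbers will differ from the model fitted on the 20 training rows. The function name `fit_line` is hypothetical, and this illustrates the textbook formula rather than scikit-learn's internal solver.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# First five rows of the dataset, as printed by data.head(5)
years = [1.1, 1.3, 1.5, 2.0, 2.2]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]

slope, intercept = fit_line(years, salary)
residuals = [y - (intercept + slope * x) for x, y in zip(years, salary)]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"sum of residuals={sum(residuals):.6f}")
```

A least-squares fit with an intercept always leaves residuals that sum to zero (up to floating-point error), which is a quick sanity check on the formulas.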

1.0.8 Checking intercept and coefficient (slope)

[60]: linear.coef_

[60]: array([9523.14578831])

[61]: linear.intercept_

[61]: 24006.035761469633
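
Together, the coefficient and intercept define the fitted line: salary ≈ intercept + slope × years of experience. A small sketch using the two values printed above (the function name `predict_salary` is hypothetical) evaluates the line at 1.5 years of experience, which is row 2 of the dataset:

```python
SLOPE = 9523.14578831           # linear.coef_[0] as printed in the notebook
INTERCEPT = 24006.035761469633  # linear.intercept_ as printed in the notebook

def predict_salary(years_experience):
    """Evaluate the fitted line at a given number of years."""
    return INTERCEPT + SLOPE * years_experience

prediction = predict_salary(1.5)
print(round(prediction, 2))  # 38290.75
```

Read this way, the model estimates a starting salary of about 24,006 at zero experience and an increase of about 9,523 per additional year.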

1.0.9 Testing the model

[62]: Y_pred = linear.predict(X_test)

[64]: Y_pred

[64]: array([106857.40411979,  54480.10228407,  38290.75444394, 102095.83122563,
        54480.10228407, 115428.23532927,  70669.4501242 ,  80192.59591251,
        36386.12528628,  81144.91049134])

[65]: Y_test

[65]: 24 109431.0
8 64445.0
2 37731.0
23 113812.0
7 54445.0
27 112635.0
15 67938.0
18 81363.0
1 46205.0
19 93940.0
Name: Salary, dtype: float64

[70]: score = r2_score(Y_test, Y_pred)
print(f"Score: {score *100}")

Score: 92.78148083974355
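
The score reported by r2_score is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. As a cross-check, the sketch below recomputes it in plain Python from the Y_test and Y_pred values printed in the notebook (the function name `r_squared` is hypothetical):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Values copied from the notebook output for Y_test and Y_pred
y_test = [109431.0, 64445.0, 37731.0, 113812.0, 54445.0,
          112635.0, 67938.0, 81363.0, 46205.0, 93940.0]
y_pred = [106857.40411979, 54480.10228407, 38290.75444394, 102095.83122563,
          54480.10228407, 115428.23532927, 70669.4501242, 80192.59591251,
          36386.12528628, 81144.91049134]

score = r_squared(y_test, y_pred)
print(f"Score: {score * 100:.2f}")  # Score: 92.78
```

An R² of about 0.93 means the fitted line accounts for roughly 93% of the variance in the ten test salaries; with so few test rows, the figure is sensitive to which rows land in the test set.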

1.0.10 Plotting the graph


[79]: # Plotting the scatter plot of actual data points
plt.scatter(X_test, Y_test, color='blue', label='Actual')

# Plotting the predicted line
plt.plot(X_test, Y_pred, color='red', linewidth=2, label='Predicted')

plt.title("Salary Prediction")
plt.xlabel("Years of experience") # X_test holds YearsExperience
plt.ylabel("Salary") # Y_test and Y_pred hold salaries
plt.legend()

plt.show()
