LinearRegression HandsOn
Data Description:
The dataset has 9 variables, including the name of the car and various attributes such as horsepower, weight, and region of origin. Missing values
in the data are marked with question marks ('?').
Import Libraries
import pandas as pd
import numpy as np
# for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns
# For randomized data splitting
from sklearn.model_selection import train_test_split
# To build the linear regression model
import statsmodels.api as sm
# To check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error
Loading data into Google Colab. (If running locally on Jupyter this is not necessary - just use 'cData = pd.read_csv("auto-mpg.csv")'.)
from google.colab import files
import io

# upload the file only once per Colab session
try:
    uploaded
except NameError:
    uploaded = files.upload()

cData = pd.read_csv(io.BytesIO(uploaded['auto-mpg.csv']))
# cData = pd.read_csv("auto-mpg.csv")
# let's check the shape of the data
cData.shape
(398, 9)
# let's check the first 5 rows of the data
cData.head()
    mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin                   car name
0  18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
...

# let's check column types and number of values
cData.info()
Most of the columns in the data are numeric ('int64' or 'float64' type).
The horsepower and car name columns are string columns ('object' type); horsepower is read as a string because its missing values are marked with '?'.
# the car name column is a free-text identifier, so drop it before modeling
cData = cData.drop(["car name"], axis=1)
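Because horsepower uses '?' for missing values, it has to be converted to a numeric column before modeling. Below is a minimal sketch of one common approach (coercing '?' to NaN and imputing the median); the exact treatment used in this hands-on may differ.
# coerce the '?' markers to NaN so the column can be cast to float
cData["horsepower"] = pd.to_numeric(cData["horsepower"], errors="coerce")
# fill the resulting missing values with the column median (an assumed, simple imputation)
cData["horsepower"] = cData["horsepower"].fillna(cData["horsepower"].median())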
Bivariate Analysis
A bivariate analysis among the different variables can be done using a scatter matrix plot. Seaborn can draw such a grid of pairwise scatter
plots, giving a useful overview of the relationships between the dimensions. The result can be stored as a .png file.
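A minimal sketch of how such a scatter matrix could be produced and saved (the exact plotting options and the filename are illustrative assumptions):
# scatter matrix of all attributes, with kernel density estimates on the diagonal
pair_grid = sns.pairplot(cData, diag_kind="kde")
# save the figure for later reference (illustrative filename)
pair_grid.savefig("pairplot.png")
plt.show()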
So we create 3 simple true/false columns, with titles equivalent to "Is this car American?", "Is this car European?" and "Is this car Asian?".
These will be used as independent variables without imposing any kind of ordering between the three regions.
We will also drop one of those three columns to ensure there is no linear dependency between them (the dropped category is implied when the other two are false).
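A minimal sketch of how these dummy columns could be created with pandas (the mapping of origin codes to regions is an assumption: 1 = america, 2 = europe, 3 = asia):
# map the numeric origin codes to region labels (assumed coding)
cData["origin"] = cData["origin"].replace({1: "america", 2: "europe", 3: "asia"})
# one-hot encode the region; drop_first=True drops one column to avoid linear dependence
cData = pd.get_dummies(cData, columns=["origin"], drop_first=True)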
Split Data
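A minimal sketch of a typical split for this workflow (the test size, random seed, and the use of mpg as the target are assumptions); note that statsmodels' OLS does not add an intercept automatically, so a constant column is added explicitly:
# separate the independent variables from the target (mpg)
X = cData.drop(["mpg"], axis=1)
y = cData["mpg"]
# add the intercept term expected by statsmodels' OLS
X = sm.add_constant(X)
# hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)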
# build and fit the OLS (ordinary least squares) model on the training data
olsmod = sm.OLS(y_train, X_train)
olsres = olsmod.fit()
# let's print the regression summary
print(olsres.summary())
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.45e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
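The second warning points at multicollinearity among the predictors (attributes such as displacement, weight, cylinders and horsepower are typically highly correlated). One way to investigate it, not necessarily the one used in this hands-on, is to compute variance inflation factors:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF for each predictor; values well above 5-10 suggest strong multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X_train.values.astype(float), i) for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif)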
Interpretation of R-squared
An R-squared of 0.814 tells us that our model can explain 81.4% of the variance in mpg on the training set.
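R-squared above is reported on the training data; the metrics imported earlier can be used to check how the model generalises to the held-out rows. A minimal sketch, assuming X_test and y_test come from the split step:
# predict mpg for the held-out rows
y_pred = olsres.predict(X_test)
# mean absolute error and root mean squared error on the test set
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))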