LinearRegression HandsOn

The document describes a linear regression model to predict a car's miles per gallon (mpg) using its attributes. It provides details on the dataset, which includes variables like cylinders, displacement, horsepower, weight, and origin for 398 cars. It describes preparing the data, creating dummy variables for origin, splitting the data into train and test sets, fitting a linear model on the training set, and reporting that the model can explain 81.4% of the variance in mpg.


We will construct a linear model that can predict a car's mileage (mpg) by using its other attributes.

Data Description:
The dataset has 9 variables, including the name of the car and various attributes like horsepower, weight, region of origin, etc. Missing values
in the data are marked with question marks ('?').

A detailed description of the variables is given below.

1. mpg: miles per gallon


2. cylinders: number of cylinders
3. displacement: engine displacement in cubic inches
4. horsepower: horsepower of the car
5. weight: weight of the car in pounds
6. acceleration: time taken, in seconds, to accelerate from 0 to 60 mph
7. model year: year of manufacture of the car (modulo 100)
8. origin: region of origin of the car (1 - American, 2 - European, 3 - Asian)
9. car name: name of the car

Import Libraries

import pandas as pd
import numpy as np

# for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

# For randomized data splitting
from sklearn.model_selection import train_test_split

# To build the linear regression model
import statsmodels.api as sm

# To check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error

Load and explore the data

Loading the data into Google Colab. (If running locally in Jupyter, this step is not necessary; just use 'cData = pd.read_csv("auto-mpg.csv")'.)

from google.colab import files
import io

try:
    uploaded
except NameError:
    uploaded = files.upload()

cData = pd.read_csv(io.BytesIO(uploaded['auto-mpg.csv']))
#cData = pd.read_csv("auto-mpg.csv")

# let's check the shape of the data
cData.shape

(398, 9)

# let's check the first 5 rows of the data
cData.head()
     mpg  cylinders  displacement horsepower  weight  acceleration  model year  origin                   car name
0   18.0          8         307.0        130    3504          12.0          70       1  chevrolet chevelle malibu
1   15.0          8         350.0        165    3693          11.5          70       1          buick skylark 320
...

# let's check column types and number of values
cData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object
 4   weight        398 non-null    int64
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64
 7   origin        398 non-null    int64
 8   car name      398 non-null    object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB

Most of the columns in the data are numeric in nature ('int64' or 'float64' type).
The horsepower and car name columns are string columns ('object' type).

We will be dropping the 'car name' column for prediction purposes.

cData = cData.drop(["car name"], axis=1)

Dealing with Missing Values


(11 cells hidden in this export)
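The hidden cells deal with the '?' markers noted in the data description. A minimal sketch of one common treatment (coerce to numeric, then impute with the median), shown on a hypothetical miniature frame; the notebook's actual approach may differ:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for cData: horsepower is read as an
# 'object' column because missing values are recorded as '?'
df = pd.DataFrame({"horsepower": ["130", "?", "165", "?", "150"]})

# Coerce to numeric: the '?' entries become NaN
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")

# Impute the missing entries with the column median
df["horsepower"] = df["horsepower"].fillna(df["horsepower"].median())
```

Median imputation is robust to the skew typically seen in horsepower values, which is why it is often preferred over the mean here.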

Bivariate Analysis
A bivariate analysis among the different variables can be done using a scatter matrix plot. Seaborn's pairplot creates a grid of pairwise plots that surfaces useful
information about the relationships between the dimensions. The result can be stored as a .png file.

(2 cells hidden in this export)

Create Dummy Variables


Values like 'america' cannot be read into an equation directly. Using numeric substitutes like 1 for America, 2 for Europe, and 3 for Asia would imply
that European cars fall exactly halfway between American and Asian cars. We don't want to impose such a baseless assumption!

So we create 3 simple true or false columns with titles equivalent to "Is this car American?", "Is this car European?" and "Is this car Asian?".
These will be used as independent variables without imposing any kind of ordering between the three regions.

We will also be dropping one of those three columns to ensure there is no linear dependency between the three columns.

(1 cell hidden in this export)
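The steps above can be sketched with pandas' get_dummies. The mapping of codes to region labels and the use of drop_first (which drops the America column, leaving origin_asia and origin_europe as in the regression summary) are assumptions about what the hidden cell does:

```python
import pandas as pd

# Miniature stand-in for the origin column of cData
df = pd.DataFrame({"origin": [1, 2, 3, 1]})

# Map the numeric codes to region labels
df["origin"] = df["origin"].replace({1: "america", 2: "europe", 3: "asia"})

# One-hot encode; drop_first removes one dummy to avoid linear dependency
df = pd.get_dummies(df, columns=["origin"], drop_first=True)
```

A row with both dummies equal to 0 is then an American car, so no information is lost by dropping the third column.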

Split Data
(6 cells hidden in this export)

Fit Linear Model

olsmod = sm.OLS(y_train, X_train)
olsres = olsmod.fit()

# let's print the regression summary
print(olsres.summary())

OLS Regression Results


==============================================================================
Dep. Variable: mpg R-squared: 0.814
Model: OLS Adj. R-squared: 0.809
Method: Least Squares F-statistic: 147.3
Date: Tue, 08 Jun 2021 Prob (F-statistic): 1.20e-93
Time: 23:03:45 Log-Likelihood: -734.21
No. Observations: 278 AIC: 1486.
Df Residuals: 269 BIC: 1519.
Df Model: 8
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const -21.2847 5.679 -3.748 0.000 -32.465 -10.104
cylinders -0.3948 0.423 -0.933 0.352 -1.228 0.439
displacement 0.0289 0.010 2.870 0.004 0.009 0.049
horsepower -0.0218 0.016 -1.330 0.185 -0.054 0.010
weight -0.0074 0.001 -8.726 0.000 -0.009 -0.006
acceleration 0.0619 0.118 0.524 0.601 -0.171 0.295
model year 0.8369 0.064 13.149 0.000 0.712 0.962
origin_asia 2.3953 0.684 3.503 0.001 1.049 3.741
origin_europe 3.0013 0.704 4.262 0.000 1.615 4.388
==============================================================================
Omnibus: 13.244 Durbin-Watson: 2.244
Prob(Omnibus): 0.001 Jarque-Bera (JB): 16.958
Skew: 0.386 Prob(JB): 0.000208
Kurtosis: 3.932 Cond. No. 8.45e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.45e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation of R-squared
The R-squared value tells us that our model can explain 81.4% of the variance in mpg on the training set.
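Beyond R-squared, the sklearn metrics imported earlier can quantify prediction error on held-out data. A sketch with hypothetical target and prediction arrays (not the notebook's actual test-set values):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical test targets and model predictions, in mpg
y_test = np.array([18.0, 25.0, 30.0, 22.0])
y_pred = np.array([17.5, 26.0, 29.0, 23.0])

mae = mean_absolute_error(y_test, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error
```

Both metrics are in the units of the target (mpg), which makes them easier to interpret than R-squared alone.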
