ML - Practical - 1 - Jupyter Notebook

This document loads and analyzes an Uber trip dataset with 200,000 rows and 9 columns using Python libraries like Pandas and Seaborn. It cleans the data by dropping null rows, removes unnecessary columns, and converts types. Descriptive statistics and distribution plots of fare amounts, latitudes and longitudes are generated to identify outliers.

Uploaded by

Devaki Borse
11/13/22, 12:12 PM ML_Practical_1 - Jupyter Notebook

In [55]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [56]: df = pd.read_csv(r"C:\Users\Dell\Downloads\Uber.csv")

In [57]: df.head()

Out[57]:
   Unnamed: 0                            key  fare_amount          pickup_datetime  pickup_longitude  pickup_latitude
0    24238194    2015-05-07 19:52:06.0000003          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354
1    27835199    2009-07-17 20:04:56.0000002          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225
2    44984355   2009-08-24 21:45:00.00000061         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770
3    25894730    2009-06-26 08:22:21.0000001          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844
4    17610152  2014-08-28 17:47:00.000000188         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085

In [58]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 200000 non-null int64
1 key 200000 non-null object
2 fare_amount 200000 non-null float64
3 pickup_datetime 200000 non-null object
4 pickup_longitude 200000 non-null float64
5 pickup_latitude 200000 non-null float64
6 dropoff_longitude 199999 non-null float64
7 dropoff_latitude 199999 non-null float64
8 passenger_count 200000 non-null int64
dtypes: float64(5), int64(2), object(2)
memory usage: 13.7+ MB

In [59]: df.shape

Out[59]: (200000, 9)

localhost:8888/notebooks/ML_Practical_1.ipynb 1/15

In [60]: df.isnull().sum()

Out[60]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64

In [61]: df.dropna(inplace = True)

In [62]: df.isnull().sum()

Out[62]: Unnamed: 0 0
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 0
dropoff_latitude 0
passenger_count 0
dtype: int64

In [63]: df.drop(labels='Unnamed: 0',axis=1,inplace=True)
df.drop(labels='key',axis=1,inplace=True)

In [64]: df.head()

Out[64]:
   fare_amount          pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitu
0          7.5  2015-05-07 19:52:06 UTC        -73.999817        40.738354         -73.999512         40.7232
1          7.7  2009-07-17 20:04:56 UTC        -73.994355        40.728225         -73.994710         40.7503
2         12.9  2009-08-24 21:45:00 UTC        -74.005043        40.740770         -73.962565         40.7726
3          5.3  2009-06-26 08:22:21 UTC        -73.976124        40.790844         -73.965316         40.8033
4         16.0  2014-08-28 17:47:00 UTC        -73.925023        40.744085         -73.973082         40.7612

In [65]: df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])



In [66]: df.dtypes

Out[66]: fare_amount                      float64
         pickup_datetime      datetime64[ns, UTC]
         pickup_longitude                 float64
         pickup_latitude                  float64
         dropoff_longitude                float64
         dropoff_latitude                 float64
         passenger_count                    int64
         dtype: object

In [67]: df.describe()

Out[67]:
         fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passen
count  199999.000000     199999.000000    199999.000000      199999.000000     199999.000000    1999
mean       11.359892        -72.527631        39.935881         -72.525292         39.923890
std         9.901760         11.437815         7.720558          13.117408          6.794829
min       -52.000000      -1340.648410       -74.015515       -3356.666300       -881.985513
25%         6.000000        -73.992065        40.734796         -73.991407         40.733823
50%         8.500000        -73.981823        40.752592         -73.980093         40.753042
75%        12.500000        -73.967154        40.767158         -73.963658         40.768001
max       499.000000         57.418457      1644.421482        1153.572603        872.697628       2
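
The summary above contains coordinates that cannot be real, such as a latitude of 1644.42 and a longitude of -1340.65. One possible way to screen such rows is sketched below. It is not a step from the notebook: the helper name `valid_coordinates` and the toy frame `trips` are hypothetical, and the ranges are the general geographic limits (latitude within ±90, longitude within ±180).

```python
import pandas as pd

# Toy frame standing in for the Uber data; the rows are illustrative only.
trips = pd.DataFrame({
    "pickup_latitude": [40.738354, 1644.421482, 40.728225],
    "pickup_longitude": [-73.999817, -73.994355, -1340.648410],
    "dropoff_latitude": [40.723217, 40.750325, 40.772647],
    "dropoff_longitude": [-73.999512, -73.994710, -73.962565],
})


def valid_coordinates(frame: pd.DataFrame) -> pd.Series:
    """Return a boolean mask that is True where all four coordinates are in range."""
    lat_ok = frame[["pickup_latitude", "dropoff_latitude"]].abs().le(90).all(axis=1)
    lon_ok = frame[["pickup_longitude", "dropoff_longitude"]].abs().le(180).all(axis=1)
    return lat_ok & lon_ok


# Keep only the rows whose coordinates are geographically possible.
clean_trips = trips[valid_coordinates(trips)]
print(len(trips), "->", len(clean_trips))
```

A tighter bounding box around the city of interest would remove more rows, but choosing one is a modelling decision that the notebook does not make.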


In [68]: # identify outliers
# data visualization: plotting distribution plots
import warnings
warnings.filterwarnings("ignore")
sns.distplot(df['fare_amount'])

Out[68]: <AxesSubplot:xlabel='fare_amount', ylabel='Density'>

In [69]: sns.distplot(df['pickup_latitude'])

Out[69]: <AxesSubplot:xlabel='pickup_latitude', ylabel='Density'>


In [70]: sns.distplot(df['pickup_longitude'])

Out[70]: <AxesSubplot:xlabel='pickup_longitude', ylabel='Density'>

In [71]: sns.distplot(df['dropoff_longitude'])

Out[71]: <AxesSubplot:xlabel='dropoff_longitude', ylabel='Density'>


In [72]: sns.distplot(df['dropoff_latitude'])

Out[72]: <AxesSubplot:xlabel='dropoff_latitude', ylabel='Density'>

In [73]: # creating a function to identify outliers
def find_outliers_IQR(df):
    # flag values more than 1.5 * IQR below the first or above the third quartile
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    IQR = q3 - q1
    outliers = df[((df < (q1 - 1.5 * IQR)) | (df > (q3 + 1.5 * IQR)))]
    return outliers
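
To make the 1.5 × IQR rule concrete, here is a small worked sketch on made-up fares. The helper `iqr_bounds` and the sample values are hypothetical; it returns the two fences instead of the flagged values, which can be convenient when the same bounds are reused for filtering.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the k * IQR rule for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up fares: five ordinary values plus one very high and one negative fare.
fares = pd.Series([6.0, 7.5, 8.5, 9.0, 12.5, 499.0, -52.0])

low, high = iqr_bounds(fares)
flagged = fares[(fares < low) | (fares > high)]

print("fences:", low, high)
print("flagged values:", sorted(flagged.tolist()))
```

For this sample the quartiles are 6.75 and 10.75, so the fences are 0.75 and 16.75 and only the two extreme fares are flagged.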


In [74]: #getting outlier details for column "fare_amount" using the above function

outliers = find_outliers_IQR(df["fare_amount"])
print("number of outliers: "+ str(len(outliers)))
print("max outlier value: "+ str(outliers.max()))
print("min outlier value: "+ str(outliers.min()))
outliers

number of outliers: 17166
max outlier value: 499.0
min outlier value: -52.0

Out[74]: 6 24.50
30 25.70
34 39.50
39 29.00
48 56.80
...
199976 49.70
199977 43.50
199982 57.33
199985 24.00
199997 30.90
Name: fare_amount, Length: 17166, dtype: float64

In [75]: #you can also pass two columns as argument to the function (here "passenger_count

outliers = find_outliers_IQR(df[["passenger_count","fare_amount"]])
outliers

Out[75]:
passenger_count fare_amount

0 NaN NaN

1 NaN NaN

2 NaN NaN

3 NaN NaN

4 5.0 NaN

... ... ...

199995 NaN NaN

199996 NaN NaN

199997 NaN 30.9

199998 NaN NaN

199999 NaN NaN

199999 rows × 2 columns


In [76]: upper_limit = df['fare_amount'].mean() + 3*df['fare_amount'].std()
print(upper_limit)
lower_limit = df['fare_amount'].mean() - 3*df['fare_amount'].std()
print(lower_limit)
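
The cell above computes the mean ± 3 standard deviation limits but does not apply them. A minimal sketch of how such limits could be turned into a row filter follows; the Series `fares` is made-up data, and using `Series.between` is one option among several.

```python
import pandas as pd

# Made-up fares: fifty ordinary trips and one extreme value.
fares = pd.Series([10.0] * 50 + [499.0])

upper_limit = fares.mean() + 3 * fares.std()
lower_limit = fares.mean() - 3 * fares.std()

# True for rows inside the limits (bounds inclusive by default).
inside = fares.between(lower_limit, upper_limit)

print("rows outside the limits:", int((~inside).sum()))
kept = fares[inside]
```

Because the mean and standard deviation are themselves pulled by extreme values, these limits can be much wider than the IQR fences on the same data.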

41.06517154774204
-18.3453884488253

In [77]: corrMatrix = df.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()
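
At this point `df` still holds the `pickup_datetime` column, and how `DataFrame.corr` treats non-numeric columns has changed between pandas versions. A version-independent sketch, on a small made-up frame, is to select the numeric columns explicitly before computing the correlation. The frame `frame` below is an assumption for illustration only.

```python
import pandas as pd

# Small made-up frame with two numeric columns and one datetime column.
frame = pd.DataFrame({
    "fare_amount": [7.5, 7.7, 12.9, 5.3, 16.0],
    "passenger_count": [1, 1, 1, 3, 5],
    "pickup_datetime": pd.to_datetime(
        ["2015-05-07 19:52:06", "2009-07-17 20:04:56", "2009-08-24 21:45:00",
         "2009-06-26 08:22:21", "2014-08-28 17:47:00"],
        utc=True,
    ),
})

# Keep only numeric dtypes so the datetime column never reaches corr().
numeric = frame.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix)
```

The resulting matrix can be passed to `sns.heatmap` exactly as in the cell above.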

In [78]: import calendar
df['day']=df['pickup_datetime'].apply(lambda x:x.day)
df['hour']=df['pickup_datetime'].apply(lambda x:x.hour)
df['month']=df['pickup_datetime'].apply(lambda x:x.month)
df['year']=df['pickup_datetime'].apply(lambda x:x.year)
df['weekday']=df['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekday()
df.drop(['pickup_datetime'],axis=1,inplace=True)

In [79]: df.weekday = df.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,'Thu
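
The same date parts can be extracted without `apply` by using the pandas `.dt` accessor, sketched here on three timestamps taken from the rows shown earlier. The weekday mapping in the cell above is cut off in the printout, so the sketch assumes it continues in order (Thursday 4, Friday 5, Saturday 6). Since `dt.dayofweek` counts Monday as 0, the expression `(dayofweek + 1) % 7` reproduces a Sunday = 0 numbering.

```python
import pandas as pd

# Three timestamps taken from the head() output earlier in the notebook.
stamps = pd.Series(pd.to_datetime(
    ["2015-05-07 19:52:06", "2009-07-17 20:04:56", "2014-08-28 17:47:00"],
    utc=True,
))

features = pd.DataFrame({
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
    "month": stamps.dt.month,
    "year": stamps.dt.year,
    # dt.dayofweek is Monday=0 ... Sunday=6; shift it so that Sunday=0 (assumed mapping).
    "weekday": (stamps.dt.dayofweek + 1) % 7,
})
print(features)
```

The vectorised accessor avoids calling a Python lambda once per row, which tends to matter on a frame with about 200,000 rows.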


In [80]: df.head()

Out[80]:
   fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_co
0          7.5        -73.999817        40.738354         -73.999512         40.723217
1          7.7        -73.994355        40.728225         -73.994710         40.750325
2         12.9        -74.005043        40.740770         -73.962565         40.772647
3          5.3        -73.976124        40.790844         -73.965316         40.803349
4         16.0        -73.925023        40.744085         -73.973082         40.761247

In [81]: df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199999 entries, 0 to 199999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fare_amount 199999 non-null float64
1 pickup_longitude 199999 non-null float64
2 pickup_latitude 199999 non-null float64
3 dropoff_longitude 199999 non-null float64
4 dropoff_latitude 199999 non-null float64
5 passenger_count 199999 non-null int64
6 day 199999 non-null int64
7 hour 199999 non-null int64
8 month 199999 non-null int64
9 year 199999 non-null int64
10 weekday 199999 non-null int64
dtypes: float64(5), int64(6)
memory usage: 18.3 MB

In [82]: from sklearn.model_selection import train_test_split


In [83]: x=df.drop("fare_amount", axis=1)
x

Out[83]:
        pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  day
0             -73.999817        40.738354         -73.999512         40.723217                1    7
1             -73.994355        40.728225         -73.994710         40.750325                1   17
2             -74.005043        40.740770         -73.962565         40.772647                1   24
3             -73.976124        40.790844         -73.965316         40.803349                3   26
4             -73.925023        40.744085         -73.973082         40.761247                5   28
...                  ...              ...                ...               ...              ...   ..
199995        -73.987042        40.739367         -73.986525         40.740297                1   28
199996        -73.984722        40.736837         -74.006672         40.739620                1   14
199997        -73.986017        40.756487         -73.858957         40.692588                2   29
199998        -73.997124        40.725452         -73.983215         40.695415                1   20
199999        -73.984395        40.720077         -73.985508         40.768793                1   15

199999 rows × 10 columns

In [84]: y=df["fare_amount"]

In [85]: x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1

In [86]: x_train.head()

Out[86]:
        pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  day
80768         -73.983703        40.725752         -73.972000         40.793888                1   22
111783        -73.961175        40.760667         -73.976507         40.747570                1    7
24615         -73.947784        40.783111         -73.955408         40.779405                1   17
46932         -73.980596        40.733797         -73.972092         40.747297                1   15
86655         -73.963035        40.758380         -73.987877         40.745477                2   28


In [87]: x_test.head()

Out[87]:
        pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count  day
13588         -73.982810        40.771687         -73.977065         40.763200                1   25
29803         -73.991985        40.725763         -73.995762         40.759797                1   20
138265        -73.985730        40.767882         -73.998525         40.760667                1   20
82856         -73.973200        40.748100         -73.973500         40.748200                1   17
162747        -74.007432        40.716580         -73.986858         40.761328                1   10

In [88]: y_train.head()

Out[88]: 80768     19.7
         111783     7.7
         24615      4.5
         46932      4.5
         86655     10.0
         Name: fare_amount, dtype: float64

In [89]: y_test.head()

Out[89]: 13588      5.5
         29803     11.3
         138265     6.5
         82856     18.1
         162747    11.3
         Name: fare_amount, dtype: float64

In [90]: print(x_train.shape)
print(x_test.shape)
print(y_test.shape)
print(y_train.shape)

(159999, 10)
(40000, 10)
(40000,)
(159999,)

In [91]: from sklearn.linear_model import LinearRegression
lrmodel=LinearRegression()
lrmodel.fit(x_train, y_train)

Out[91]: LinearRegression()

In [92]: predictedvalues = lrmodel.predict(x_test)


In [93]: from sklearn.metrics import mean_squared_error
# scikit-learn's argument order is (y_true, y_pred); MSE is symmetric, so the value is unchanged
lrmodelrmse = np.sqrt(mean_squared_error(y_test, predictedvalues))
print("RMSE value for Linear regression is", lrmodelrmse)

RMSE value for Linear regression is 9.806687708433811

In [94]: from sklearn.ensemble import RandomForestRegressor
rfrmodel = RandomForestRegressor(n_estimators=100, random_state=101)

In [95]: rfrmodel.fit(x_train,y_train)
rfrmodel_pred= rfrmodel.predict(x_test)

In [96]: rfrmodel_rmse=np.sqrt(mean_squared_error(y_test, rfrmodel_pred))
print("RMSE value for Random forest regression is ",rfrmodel_rmse)

RMSE value for Random forest regression is 4.756011315216782

In [115]: test= pd.read_csv("https://raw.githubusercontent.com/piyushpandey758/Uber-Fare-Pr

In [116]: test.head()

Out[116]:
   Unnamed: 0  Unnamed: 0.1  Unnamed: 0.1.1                            key          pickup_datetime  pickup_longitude  pickup
0           0         37338        31401407  2011-02-10 19:06:00.000000169  2011-02-10 19:06:00 UTC        -73.951662      40
1           1        160901        33158465  2011-06-23 09:24:00.000000157  2011-06-23 09:24:00 UTC        -73.951007      40
2           2         40428        10638355  2012-07-14 10:37:00.000000149  2012-07-14 10:37:00 UTC        -73.996473      40
3           3         63353         3836845    2014-10-19 22:27:05.0000002  2014-10-19 22:27:05 UTC        -73.997934      40
4           4        165491        27114503    2015-05-25 22:54:43.0000001  2015-05-25 22:54:43 UTC        -73.952583      40

In [117]: # drop the helper columns by name; 'Unnamed: 0.1.1' is not in the list and stays
test.drop(columns=['Unnamed: 0','Unnamed: 0.1','key'],inplace=True)
test.isnull().sum()

Out[117]: Unnamed: 0.1.1       0
          pickup_datetime      0
          pickup_longitude     0
          pickup_latitude      0
          dropoff_longitude    0
          dropoff_latitude     0
          passenger_count      0
          dtype: int64

In [118]: test["pickup_datetime"] = pd.to_datetime(test["pickup_datetime"])


In [119]: #splitting column "pickup_datetime" into 5 columns: "day", "hour", "month", "year
#for a simplified view
#label encoding weekdays

test['day']=test['pickup_datetime'].apply(lambda x:x.day)
test['hour']=test['pickup_datetime'].apply(lambda x:x.hour)
test['month']=test['pickup_datetime'].apply(lambda x:x.month)
test['year']=test['pickup_datetime'].apply(lambda x:x.year)
test['weekday']=test['pickup_datetime'].apply(lambda x: calendar.day_name[x.weekd

test.weekday = test.weekday.map({'Sunday':0,'Monday':1,'Tuesday':2,'Wednesday':3,

test.drop(['pickup_datetime'], axis = 1, inplace = True)

test.head(5)

Out[119]:
   Unnamed: 0.1.1  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_coun
0        31401407        -73.951662        40.790710         -73.947570         40.756220
1        33158465        -73.951007        40.771508         -73.974075         40.763553
2        10638355        -73.996473        40.747930         -73.990298         40.756152
3         3836845        -73.997934        40.716890         -73.952617         40.727149
4        27114503        -73.952583        40.714039         -73.906128         40.711281

Dividing the dataset into training and testing sets

In [128]: from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size = 0.33)
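
This split is called without `random_state`, so each run of the cell can produce a different partition and therefore slightly different metrics. The sketch below, on a tiny made-up array, shows how fixing `random_state` makes the split repeatable. The seed value 42 is an arbitrary choice for illustration and is not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny made-up feature matrix (10 rows, 2 columns) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed return the same partition.
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```

With 10 rows and `test_size=0.33`, scikit-learn rounds the test share up to 4 rows and leaves 6 for training.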

In [129]: from sklearn.linear_model import LinearRegression
regression = LinearRegression()

In [130]: regression.fit(X_train,y_train)

Out[130]: LinearRegression()

In [131]: regression.intercept_ #To find the linear intercept

Out[131]: -1290.1588587396827

In [132]: regression.coef_ #To find the linear coefficients

Out[132]: array([ 0.01184457,  0.00171658, -0.00631054, -0.01103023,  0.09269442,
                  0.00212464, -0.03322793,  0.11060264,  0.64719083, -0.04152989])

In [133]: prediction = regression.predict(X_test) #To predict the target values


In [134]: print(prediction)

[11.92493543  8.81713487  9.95240115 ... 13.74024677 11.81925646
 10.25274095]

In [135]: y_test

Out[135]: 194959    32.33
          71279      8.90
          177324     5.70
          97137     15.50
          134768     8.50
                    ...
          84415     28.50
          169022     7.50
          90775      7.50
          72711     10.50
          117466    27.47
          Name: fare_amount, Length: 66000, dtype: float64

Metrics Evaluation using R2, Mean Squared Error, Root Mean Squared Error

In [137]: from sklearn.metrics import r2_score

In [138]: r2_score(y_test,prediction)

Out[138]: 0.01603948315081527

In [139]: from sklearn.metrics import mean_squared_error

In [140]: MSE = mean_squared_error(y_test,prediction)

In [141]: MSE

Out[141]: 98.78999628170801

In [142]: RMSE = np.sqrt(MSE)

In [143]: RMSE

Out[143]: 9.939315684779713
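
The three metrics above are closely related: RMSE is the square root of MSE, and R2 compares the squared residuals with the spread of the target around its mean. A minimal NumPy sketch of these definitions follows. The function name `regression_report` and the sample numbers are hypothetical, and it is meant to illustrate the formulas, not to replace the scikit-learn functions used in the notebook.

```python
import numpy as np


def regression_report(y_true, y_pred) -> dict:
    """Compute R2, MSE and RMSE from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = float(np.sum(residual ** 2))                   # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    mse = ss_res / len(y_true)
    return {"r2": 1.0 - ss_res / ss_tot, "mse": mse, "rmse": mse ** 0.5}


# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
report = regression_report([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(report)
```

In this toy case the residuals are 1, 0, -1 and 0, so MSE is 0.5 and R2 is 1 - 2/20 = 0.9. By the same relation, the notebook's RMSE of about 9.94 is the square root of its MSE of about 98.79.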

Random Forest Regression

In [144]: from sklearn.ensemble import RandomForestRegressor


In [145]: rf = RandomForestRegressor(n_estimators=100) #Here n_estimators means number of trees

In [146]: rf.fit(X_train,y_train)

Out[146]: RandomForestRegressor()

In [147]: y_pred = rf.predict(X_test)

In [148]: y_pred

Out[148]: array([34.5784, 10.2875, 6.294 , ..., 7.79 , 10.3497, 28.6942])

Metrics evaluation for Random Forest

In [149]: R2_Random = r2_score(y_test,y_pred)

In [150]: R2_Random

Out[150]: 0.7617634409141242

In [151]: MSE_Random = mean_squared_error(y_test,y_pred)

In [152]: MSE_Random

Out[152]: 23.919037789875

In [153]: RMSE_Random = np.sqrt(MSE_Random)

In [154]: RMSE_Random

Out[154]: 4.8907093340204755
