
Data Cleaning and Fill Missing Values

The document cleans and prepares a small vehicle dataset for analysis. It loads the data from an Excel file into a pandas DataFrame, inspects it with shape, describe() and head(), then drops unneeded columns and rows and renames columns for brevity. A Total column is added, the frame is checked for missing values, re-indexed by state, plotted, and grouped. A second sheet with gaps is then loaded and its missing values are filled with the column means.

Uploaded by

Nazakat ali

Data cleaning and fill missing values

January 22, 2021

[1]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.style.use('ggplot')

[3]: df_2=pd.read_excel('C:/Users/Nazakat ali/Desktop/Stat711/New folder/Book1.xlsx')


df_2

[3]: Vehicle fm Mileage lh lc mc State ggg year


0 1 0 863 1.1 66.30 697.23 MS 55 2000
1 2 10 4644 2.4 233.03 119.66 CA NaN 2000
2 3 15 16330 4.2 325.08 175.46 WI f 2000
3 4 0 13 1.0 66.64 0.00 OR fg 2000
4 5 13 22537 4.5 328.66 175.46 AZ sd 2000
5 6 21 40931 3.1 205.28 175.46 FL gh 2000
6 7 11 34762 0.7 49.17 145.20 LA sd 2000
7 8 5 11051 2.9 208.80 270.04 GA sd 2000
8 9 8 7003 3.4 212.06 119.66 WA sd 2000
9 10 1 11 0.7 44.43 0.00 PA sd 2000
10 11 17 24879 3.5 260.29 119.66 TX sd 2000
11 12 3 5339 3.2 236.93 440.13 LA sd 2000
12 13 14 29782 10.0 695.10 228.12 FL sd 2000
13 14 19 56111 2.0 116.00 183.31 OH sd 2000
14 15 13 21946 3.8 312.36 175.46 MA sd 2000
15 16 8 3101 3.1 220.61 119.66 VA sd 2000
16 17 15 41965 0.9 66.25 119.66 OH sd 2000
17 18 3 15365 2.0 158.94 175.46 CO sd 2000
18 19 12 44865 4.9 319.51 119.66 FL sd 2000

[4]: df_2.shape

[4]: (19, 9)

[5]: df_2.info  # no parentheses, so this only shows the bound method; df_2.info() prints the column summary

[5]: <bound method DataFrame.info of Vehicle fm Mileage lh lc mc State ggg year
0 1 0 863 1.1 66.30 697.23 MS 55 2000
1 2 10 4644 2.4 233.03 119.66 CA NaN 2000
2 3 15 16330 4.2 325.08 175.46 WI f 2000
3 4 0 13 1.0 66.64 0.00 OR fg 2000
4 5 13 22537 4.5 328.66 175.46 AZ sd 2000
5 6 21 40931 3.1 205.28 175.46 FL gh 2000
6 7 11 34762 0.7 49.17 145.20 LA sd 2000
7 8 5 11051 2.9 208.80 270.04 GA sd 2000
8 9 8 7003 3.4 212.06 119.66 WA sd 2000
9 10 1 11 0.7 44.43 0.00 PA sd 2000
10 11 17 24879 3.5 260.29 119.66 TX sd 2000
11 12 3 5339 3.2 236.93 440.13 LA sd 2000
12 13 14 29782 10.0 695.10 228.12 FL sd 2000
13 14 19 56111 2.0 116.00 183.31 OH sd 2000
14 15 13 21946 3.8 312.36 175.46 MA sd 2000
15 16 8 3101 3.1 220.61 119.66 VA sd 2000
16 17 15 41965 0.9 66.25 119.66 OH sd 2000
17 18 3 15365 2.0 158.94 175.46 CO sd 2000
18 19 12 44865 4.9 319.51 119.66 FL sd 2000>
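The output above is the bound method itself rather than a summary, because `info` was not called. A minimal sketch of the difference, using a small made-up frame (the column values here are illustrative, not the full dataset):

```python
import io

import pandas as pd

# Small illustrative frame, not the notebook's Excel data
df = pd.DataFrame({"Vehicle": [1, 2], "State": ["MS", "CA"]})

# Without parentheses we only get a reference to the method
method = df.info

# With parentheses the summary is written out; a buffer captures it here
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()

print(summary)
```

In a notebook cell, `df.info()` on its own prints the same summary directly.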

[10]: df_2.describe()

[10]: Vehicle fm Mileage lh lc mc \


count 19.000000 19.000000 19.000000 19.000000 19.000000 19.000000
mean 10.000000 9.894737 20078.842105 3.021053 217.128421 187.331053
std 5.627314 6.462668 17281.824384 2.139205 151.779089 155.020900
min 1.000000 0.000000 11.000000 0.700000 44.430000 0.000000
25% 5.500000 4.000000 4991.500000 1.550000 91.320000 119.660000
50% 10.000000 11.000000 16330.000000 3.100000 212.060000 175.460000
75% 14.500000 14.500000 32272.000000 3.650000 286.325000 179.385000
max 19.000000 21.000000 56111.000000 10.000000 695.100000 697.230000

year
count 19.0
mean 2000.0
std 0.0
min 2000.0
25% 2000.0
50% 2000.0
75% 2000.0
max 2000.0

[11]: df_2.head(2)

[11]: Vehicle fm Mileage lh lc mc State ggg year


0 1 0 863 1.1 66.30 697.23 MS 55 2000
1 2 10 4644 2.4 233.03 119.66 CA NaN 2000

1 Drop some columns from the data frame

2 Remember: axis=0 means a row is deleted

3 and axis=1 means a column is deleted
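The axis rule can be tried on a tiny frame first. This is only a sketch with made-up values; it also shows the `columns=` keyword, which some readers find easier to remember than `axis=1`, and it leaves the original frame untouched instead of using `inplace=True`.

```python
import pandas as pd

# Illustrative three-row frame (values chosen for the example)
df = pd.DataFrame(
    {"Vehicle": [1, 2, 3], "Mileage": [863, 4644, 16330], "year": [2000, 2000, 2000]}
)

# axis=1 drops columns by name; columns=... is an equivalent spelling
no_year = df.drop(["year"], axis=1)
same_no_year = df.drop(columns=["year"])

# axis=0 drops rows by index label
no_last_row = df.drop([2], axis=0)

print(no_year.columns.tolist())
print(no_last_row.index.tolist())
```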


[12]: df_2.drop(['ggg', 'year', 'lh'], axis=1, inplace=True)

[13]: df_2

[13]: Vehicle fm Mileage lc mc State


0 1 0 863 66.30 697.23 MS
1 2 10 4644 233.03 119.66 CA
2 3 15 16330 325.08 175.46 WI
3 4 0 13 66.64 0.00 OR
4 5 13 22537 328.66 175.46 AZ
5 6 21 40931 205.28 175.46 FL
6 7 11 34762 49.17 145.20 LA
7 8 5 11051 208.80 270.04 GA
8 9 8 7003 212.06 119.66 WA
9 10 1 11 44.43 0.00 PA
10 11 17 24879 260.29 119.66 TX
11 12 3 5339 236.93 440.13 LA
12 13 14 29782 695.10 228.12 FL
13 14 19 56111 116.00 183.31 OH
14 15 13 21946 312.36 175.46 MA
15 16 8 3101 220.61 119.66 VA
16 17 15 41965 66.25 119.66 OH
17 18 3 15365 158.94 175.46 CO
18 19 12 44865 319.51 119.66 FL

[14]: # Delete rows 18 and 17 using axis=0


df_2.drop([18,17], axis=0, inplace=True)

[15]: df_2

[15]: Vehicle fm Mileage lc mc State


0 1 0 863 66.30 697.23 MS
1 2 10 4644 233.03 119.66 CA
2 3 15 16330 325.08 175.46 WI
3 4 0 13 66.64 0.00 OR
4 5 13 22537 328.66 175.46 AZ
5 6 21 40931 205.28 175.46 FL
6 7 11 34762 49.17 145.20 LA
7 8 5 11051 208.80 270.04 GA
8 9 8 7003 212.06 119.66 WA
9 10 1 11 44.43 0.00 PA

10 11 17 24879 260.29 119.66 TX
11 12 3 5339 236.93 440.13 LA
12 13 14 29782 695.10 228.12 FL
13 14 19 56111 116.00 183.31 OH
14 15 13 21946 312.36 175.46 MA
15 16 8 3101 220.61 119.66 VA
16 17 15 41965 66.25 119.66 OH

4 Rename columns of Data frame


[16]: df_2.rename(columns={'Vehicle':'VCL', 'Mileage':'MLG', 'State':'country'}, inplace=True)
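One thing to watch with `rename`: by default a misspelled column name is silently ignored. The sketch below, on a made-up two-column frame, shows that behaviour and the `errors="raise"` option that turns the typo into a `KeyError`.

```python
import pandas as pd

# Illustrative frame (not the notebook's data)
df = pd.DataFrame({"Vehicle": [1, 2], "Mileage": [863, 4644]})

# Normal rename, returning a new frame
renamed = df.rename(columns={"Vehicle": "VCL", "Mileage": "MLG"})

# A misspelled key ("Milage") is ignored by default, so nothing changes
unchanged = df.rename(columns={"Milage": "MLG"})

# With errors="raise" the same typo is reported
try:
    df.rename(columns={"Milage": "MLG"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(renamed.columns.tolist(), unchanged.columns.tolist(), typo_caught)
```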

[17]: df_2

[17]: VCL fm MLG lc mc country


0 1 0 863 66.30 697.23 MS
1 2 10 4644 233.03 119.66 CA
2 3 15 16330 325.08 175.46 WI
3 4 0 13 66.64 0.00 OR
4 5 13 22537 328.66 175.46 AZ
5 6 21 40931 205.28 175.46 FL
6 7 11 34762 49.17 145.20 LA
7 8 5 11051 208.80 270.04 GA
8 9 8 7003 212.06 119.66 WA
9 10 1 11 44.43 0.00 PA
10 11 17 24879 260.29 119.66 TX
11 12 3 5339 236.93 440.13 LA
12 13 14 29782 695.10 228.12 FL
13 14 19 56111 116.00 183.31 OH
14 15 13 21946 312.36 175.46 MA
15 16 8 3101 220.61 119.66 VA
16 17 15 41965 66.25 119.66 OH

5 Add columns and Row

6 A new column (one total per row) uses axis=1

7 A new row (one total per column) uses axis=0


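[18]: # Add a column of row totals (numeric columns only; recent pandas raises on the text column otherwise)
df_2['Total']=df_2.sum(axis=1, numeric_only=True)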
df_2

[18]: VCL fm MLG lc mc country Total


0 1 0 863 66.30 697.23 MS 1627.53

1 2 10 4644 233.03 119.66 CA 5008.69
2 3 15 16330 325.08 175.46 WI 16848.54
3 4 0 13 66.64 0.00 OR 83.64
4 5 13 22537 328.66 175.46 AZ 23059.12
5 6 21 40931 205.28 175.46 FL 41338.74
6 7 11 34762 49.17 145.20 LA 34974.37
7 8 5 11051 208.80 270.04 GA 11542.84
8 9 8 7003 212.06 119.66 WA 7351.72
9 10 1 11 44.43 0.00 PA 66.43
10 11 17 24879 260.29 119.66 TX 25286.95
11 12 3 5339 236.93 440.13 LA 6031.06
12 13 14 29782 695.10 228.12 FL 30732.22
13 14 19 56111 116.00 183.31 OH 56443.31
14 15 13 21946 312.36 175.46 MA 22461.82
15 16 8 3101 220.61 119.66 VA 3465.27
16 17 15 41965 66.25 119.66 OH 42182.91
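The heading mentions adding a row as well, but only the column case is shown. A possible sketch of both, on a made-up three-row frame: the column total sums across each row (axis=1), and the row total sums down each column (axis=0) and is appended with `pd.concat`. The label `"sum"` and the placeholder `"ALL"` are choices made for this example.

```python
import pandas as pd

# Illustrative frame (values chosen for the example)
df = pd.DataFrame(
    {"fm": [0, 10, 15], "Mileage": [863, 4644, 16330], "State": ["MS", "CA", "WI"]}
)
numeric_cols = ["fm", "Mileage"]

# New column: one total per row
df["Total"] = df[numeric_cols].sum(axis=1)

# New row: one total per numeric column, built as a one-row frame
total_row = df[numeric_cols + ["Total"]].sum(axis=0).to_frame().T
total_row["State"] = "ALL"      # placeholder for the text column
total_row.index = ["sum"]       # label for the appended row

with_total = pd.concat([df, total_row])
print(with_total)
```

Selecting the numeric columns explicitly also keeps identifier columns out of the total if that is not wanted.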

[19]: # Check for missing values (True marks a missing cell)


df_2.isnull()

[19]: VCL fm MLG lc mc country Total


0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
5 False False False False False False False
6 False False False False False False False
7 False False False False False False False
8 False False False False False False False
9 False False False False False False False
10 False False False False False False False
11 False False False False False False False
12 False False False False False False False
13 False False False False False False False
14 False False False False False False False
15 False False False False False False False
16 False False False False False False False

[32]: # Count missing values per column


df_2.isnull().sum()

[32]: VCL 0
fm 0
MLG 0
lc 0
mc 0

country 0
Total 0
dtype: int64
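Since this frame has no gaps, the counts are all zero. A short sketch on a made-up frame that does contain NaN values shows what the same calls return, plus one way to list the affected rows:

```python
import numpy as np
import pandas as pd

# Illustrative frame with two gaps
df = pd.DataFrame(
    {"Mileage": [863.0, np.nan, 16330.0], "mc": [697.23, 119.66, np.nan]}
)

per_column = df.isnull().sum()            # missing cells per column
total_missing = int(per_column.sum())     # missing cells overall
rows_with_gaps = df[df.isnull().any(axis=1)]  # rows with at least one gap

print(per_column)
print(rows_with_gaps)
```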

[34]: # Select one column (two equivalent ways; only the last line is displayed)


df_2['MLG']
df_2.MLG

[34]: 0 863
1 4644
2 16330
3 13
4 22537
5 40931
6 34762
7 11051
8 7003
9 11
10 24879
11 5339
12 29782
13 56111
14 21946
15 3101
16 41965
Name: MLG, dtype: int64
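Both spellings return the same Series here, but attribute access does not work for every column name. The sketch below uses two hypothetical column names, `fuel cost` and `count`, to show where only the bracket form is safe.

```python
import pandas as pd

# Hypothetical column names chosen to show the edge cases
df = pd.DataFrame(
    {"MLG": [863, 4644], "fuel cost": [66.30, 233.03], "count": [1, 2]}
)

# For a simple name the two forms agree
same = df["MLG"].equals(df.MLG)

# A name with a space can only be reached with brackets
fuel = df["fuel cost"]

# A name that clashes with a DataFrame method: the attribute is the method
count_is_method = callable(df.count)
count_column = df["count"]

print(same, count_is_method, count_column.tolist())
```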

[38]: # Select and reorder columns with .loc


df_3=df_2.loc[:,['country', 'VCL', 'MLG', 'fm', 'lc', 'mc', 'Total']]
df_3

[38]: country VCL MLG fm lc mc Total


0 MS 1 863 0 66.30 697.23 1627.53
1 CA 2 4644 10 233.03 119.66 5008.69
2 WI 3 16330 15 325.08 175.46 16848.54
3 OR 4 13 0 66.64 0.00 83.64
4 AZ 5 22537 13 328.66 175.46 23059.12
5 FL 6 40931 21 205.28 175.46 41338.74
6 LA 7 34762 11 49.17 145.20 34974.37
7 GA 8 11051 5 208.80 270.04 11542.84
8 WA 9 7003 8 212.06 119.66 7351.72
9 PA 10 11 1 44.43 0.00 66.43
10 TX 11 24879 17 260.29 119.66 25286.95
11 LA 12 5339 3 236.93 440.13 6031.06
12 FL 13 29782 14 695.10 228.12 30732.22
13 OH 14 56111 19 116.00 183.31 56443.31
14 MA 15 21946 13 312.36 175.46 22461.82

15 VA 16 3101 8 220.61 119.66 3465.27
16 OH 17 41965 15 66.25 119.66 42182.91

[39]: # Set the country column as the index


df_3.set_index('country', inplace=True)

[40]: df_3

[40]: VCL MLG fm lc mc Total


country
MS 1 863 0 66.30 697.23 1627.53
CA 2 4644 10 233.03 119.66 5008.69
WI 3 16330 15 325.08 175.46 16848.54
OR 4 13 0 66.64 0.00 83.64
AZ 5 22537 13 328.66 175.46 23059.12
FL 6 40931 21 205.28 175.46 41338.74
LA 7 34762 11 49.17 145.20 34974.37
GA 8 11051 5 208.80 270.04 11542.84
WA 9 7003 8 212.06 119.66 7351.72
PA 10 11 1 44.43 0.00 66.43
TX 11 24879 17 260.29 119.66 25286.95
LA 12 5339 3 236.93 440.13 6031.06
FL 13 29782 14 695.10 228.12 30732.22
OH 14 56111 19 116.00 183.31 56443.31
MA 15 21946 13 312.36 175.46 22461.82
VA 16 3101 8 220.61 119.66 3465.27
OH 17 41965 15 66.25 119.66 42182.91

[46]: # Remove the index name


df_3.index.name=None

[49]: df_3

[49]: VCL MLG fm lc mc Total


MS 1 863 0 66.30 697.23 1627.53
CA 2 4644 10 233.03 119.66 5008.69
WI 3 16330 15 325.08 175.46 16848.54
OR 4 13 0 66.64 0.00 83.64
AZ 5 22537 13 328.66 175.46 23059.12
FL 6 40931 21 205.28 175.46 41338.74
LA 7 34762 11 49.17 145.20 34974.37
GA 8 11051 5 208.80 270.04 11542.84
WA 9 7003 8 212.06 119.66 7351.72
PA 10 11 1 44.43 0.00 66.43
TX 11 24879 17 260.29 119.66 25286.95
LA 12 5339 3 236.93 440.13 6031.06
FL 13 29782 14 695.10 228.12 30732.22

OH 14 56111 19 116.00 183.31 56443.31
MA 15 21946 13 312.36 175.46 22461.82
VA 16 3101 8 220.61 119.66 3465.27
OH 17 41965 15 66.25 119.66 42182.91

[55]: # Show all rows for one index label


df_3.loc['FL']

[55]: VCL MLG fm lc mc Total


FL 6 40931 21 205.28 175.46 41338.74
FL 13 29782 14 695.10 228.12 30732.22

[57]: df_3.loc['LA']

[57]: VCL MLG fm lc mc Total


LA 7 34762 11 49.17 145.20 34974.37
LA 12 5339 3 236.93 440.13 6031.06

[59]: # Pick specific columns for one index label


df_3.loc['LA',['lc', 'mc', 'Total']]

[59]: lc mc Total
LA 49.17 145.20 34974.37
LA 236.93 440.13 6031.06

[60]: df_3.loc['WA','Total']

[60]: 7351.72
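The results above differ in shape because the index has repeated labels: 'FL' and 'LA' appear twice, 'WA' once. A small sketch with four rows taken from the table above makes the return types explicit.

```python
import pandas as pd

# Four rows copied from the table above, indexed by state
df = pd.DataFrame(
    {
        "MLG": [40931, 34762, 29782, 7003],
        "Total": [41338.74, 34974.37, 30732.22, 7351.72],
    },
    index=["FL", "LA", "FL", "WA"],
)

fl_rows = df.loc["FL"]            # repeated label: a DataFrame with two rows
fl_total = df.loc["FL", "Total"]  # repeated label, one column: a Series
wa_total = df.loc["WA", "Total"]  # unique label, one column: a scalar

print(df.index.is_unique)
print(type(fl_rows).__name__, type(fl_total).__name__, wa_total)
```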

[61]: df_3.plot(kind='line')

[61]: <AxesSubplot:>

[62]: df_3.plot(kind='box')

[62]: <AxesSubplot:>

[64]: df_3.plot(kind='bar', figsize=(15,7))

[64]: <AxesSubplot:>

[66]: df_3.plot(kind='hist')

[66]: <AxesSubplot:ylabel='Frequency'>

[67]: df_3.plot(kind='scatter', x='VCL', y='Total')

[67]: <AxesSubplot:xlabel='VCL', ylabel='Total'>

[71]: df_31=df_2.loc[:,['country', 'VCL', 'MLG', 'fm', 'lc', 'mc', 'Total']]


df_31.set_index('country', inplace=True)
df_31

[71]: VCL MLG fm lc mc Total


country
MS 1 863 0 66.30 697.23 1627.53
CA 2 4644 10 233.03 119.66 5008.69
WI 3 16330 15 325.08 175.46 16848.54
OR 4 13 0 66.64 0.00 83.64
AZ 5 22537 13 328.66 175.46 23059.12
FL 6 40931 21 205.28 175.46 41338.74
LA 7 34762 11 49.17 145.20 34974.37
GA 8 11051 5 208.80 270.04 11542.84
WA 9 7003 8 212.06 119.66 7351.72
PA 10 11 1 44.43 0.00 66.43
TX 11 24879 17 260.29 119.66 25286.95
LA 12 5339 3 236.93 440.13 6031.06
FL 13 29782 14 695.10 228.12 30732.22
OH 14 56111 19 116.00 183.31 56443.31

MA 15 21946 13 312.36 175.46 22461.82
VA 16 3101 8 220.61 119.66 3465.27
OH 17 41965 15 66.25 119.66 42182.91

[73]: df_312=df_31.groupby('country').sum()


df_312

[73]: VCL MLG fm lc mc Total


country
AZ 5 22537 13 328.66 175.46 23059.12
CA 2 4644 10 233.03 119.66 5008.69
FL 19 70713 35 900.38 403.58 72070.96
GA 8 11051 5 208.80 270.04 11542.84
LA 19 40101 14 286.10 585.33 41005.43
MA 15 21946 13 312.36 175.46 22461.82
MS 1 863 0 66.30 697.23 1627.53
OH 31 98076 34 182.25 302.97 98626.22
OR 4 13 0 66.64 0.00 83.64
PA 10 11 1 44.43 0.00 66.43
TX 11 24879 17 260.29 119.66 25286.95
VA 16 3101 8 220.61 119.66 3465.27
WA 9 7003 8 212.06 119.66 7351.72
WI 3 16330 15 325.08 175.46 16848.54
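Note that the grouped sum also adds up the VCL identifiers (for example 6 + 13 = 19 for FL), which is probably not a meaningful quantity. One way around this, sketched on the FL and LA rows from the table above, is to select the columns to aggregate before summing:

```python
import pandas as pd

# FL and LA rows copied from the table above, indexed by state
df = pd.DataFrame(
    {"VCL": [6, 7, 12, 13], "MLG": [40931, 34762, 5339, 29782]},
    index=pd.Index(["FL", "LA", "LA", "FL"], name="country"),
)

# Grouping by the index name sums every column, identifiers included
all_summed = df.groupby("country").sum()

# Selecting columns first keeps the identifier out of the result
mlg_only = df.groupby("country")[["MLG"]].sum()

# level=0 is an equivalent way to group by the index
by_level = df.groupby(level=0)[["MLG"]].sum()

print(all_summed)
print(mlg_only)
```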

[79]: df_31['Total'].plot(kind='pie',
figsize=(15,6),
autopct='%1.0f%%',
startangle=150,
shadow=True,
labels=None,
pctdistance=1.12,)
plt.legend(labels=df_31.index, loc='upper left')

[79]: <matplotlib.legend.Legend at 0x258700991c0>

[82]: # Second method
df_31['Total'].plot(kind='pie',
figsize=(15,6),
autopct='%1.0f%%',
startangle=90,
shadow=True,)

[82]: <AxesSubplot:ylabel='Total'>
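Both pies are drawn from df_31, where FL, LA and OH each appear twice, so those states get two separate slices. If one slice per state is wanted, a possible variant is to sum by state first. The sketch below uses the FL and LA totals from the table above and assumes a non-interactive matplotlib backend so it can run outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# FL and LA totals copied from the table above
totals = pd.Series(
    [41338.74, 34974.37, 30732.22, 6031.06],
    index=["FL", "LA", "FL", "LA"],
    name="Total",
)

# One value per state, then one slice per state
by_state = totals.groupby(level=0).sum()
ax = by_state.plot(kind="pie", autopct="%1.0f%%", startangle=90, figsize=(6, 6))
ax.set_ylabel("")  # hide the default 'Total' axis label

plt.close(ax.figure)
```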

[20]: df_21=pd.read_excel('C:/Users/Nazakat ali/Desktop/Stat711/New folder/Book2.xlsx')

df_21

[20]: Vehicle fm Mileage lh lc mc


0 1 0 863.0 1.1 66.30 697.23
1 2 10 4644.0 2.4 233.03 119.66
2 3 15 16330.0 4.2 325.08 175.46
3 4 0 13.0 1.0 66.64 NaN
4 5 13 22537.0 4.5 328.66 175.46
5 6 21 NaN 3.1 205.28 175.46
6 7 11 34762.0 0.7 49.17 145.20
7 8 5 11051.0 2.9 208.80 270.04
8 9 8 7003.0 3.4 212.06 NaN

[53]: df_21.iloc[5]  # executed after the mean fill in cell [30], so Mileage already shows the filled value

[53]: Vehicle 6.000


fm 21.000

Mileage 12150.375
lh 3.100
lc 205.280
mc 175.460
Name: 5, dtype: float64

[27]: # Count missing values per column


df_21.isnull().sum()

[27]: Vehicle 0
fm 0
Mileage 1
lh 0
lc 0
mc 2
dtype: int64

8 Fill Missing Values


[30]: df_21.fillna(df_21.mean(), inplace=True)
df_21

[30]: Vehicle fm Mileage lh lc mc


0 1 0 863.000 1.1 66.30 697.230000
1 2 10 4644.000 2.4 233.03 119.660000
2 3 15 16330.000 4.2 325.08 175.460000
3 4 0 13.000 1.0 66.64 251.215714
4 5 13 22537.000 4.5 328.66 175.460000
5 6 21 12150.375 3.1 205.28 175.460000
6 7 11 34762.000 0.7 49.17 145.200000
7 8 5 11051.000 2.9 208.80 270.040000
8 9 8 7003.000 3.4 212.06 251.215714

[31]: df_21.isnull().sum()

[31]: Vehicle 0
fm 0
Mileage 0
lh 0
lc 0
mc 0
dtype: int64
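The mean fill can be reproduced on the mc column alone. The sketch below copies the mc values from the table above and returns a new frame instead of using `inplace=True`. The median variant is an alternative added here for comparison, not something the notebook does; it may be worth considering because the 697.23 value pulls the mean well above most of the other entries.

```python
import numpy as np
import pandas as pd

# mc values copied from the Book2 table above (two gaps)
df = pd.DataFrame(
    {"mc": [697.23, 119.66, 175.46, np.nan, 175.46, 175.46, 145.20, 270.04, np.nan]}
)

# Fill each gap with the column mean (NaN is skipped when computing it)
mean_filled = df.fillna(df.mean())

# Alternative for comparison: fill with the column median instead
median_filled = df.fillna(df.median())

print(mean_filled["mc"].iloc[3], median_filled["mc"].iloc[3])
```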
