AM19 EDA Assignment3
[2]: import numpy as np
import pandas as pd
df = pd.read_excel('Bengaluru_House_Data.xlsx')
df
[13320 rows x 9 columns]
[3]: df.isnull().sum()
[3]: area_type 0
availability 0
location 1
size 16
society 5502
total_sqft 0
bath 73
balcony 609
price 0
dtype: int64
[4]: df.isnull().sum().sum()
[4]: 6201
[5]: df.isna().sum()
[5]: area_type 0
availability 0
location 1
size 16
society 5502
total_sqft 0
bath 73
balcony 609
price 0
dtype: int64
[6]: df.isna().sum().sum()
[6]: 6201
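The counts above show that `isnull()` and `isna()` report the same numbers. A minimal sketch of the same checks on a small made-up frame (the values and the reduced column set are illustrative assumptions, not rows from the housing file), with a per-column percentage added:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "society": ["Coomee", None, "Soiewre", None],
    "bath": [2.0, np.nan, 3.0, 2.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

missing_count = toy.isna().sum()              # missing cells per column
missing_pct = toy.isna().mean() * 100         # share of missing cells per column, in percent
total_missing = int(toy.isna().sum().sum())   # grand total over the whole frame

# isnull() is an alias of isna(), so the two boolean masks are identical.
same_mask = toy.isna().equals(toy.isnull())

print(missing_count)
print(missing_pct)
print(total_missing, same_mask)
```

The percentage view is often easier to act on than raw counts, since it shows at a glance which columns are mostly empty.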
[7]: df[df['society'].isnull()]
13311 Plot Area Ready To Move Ramamurthy Nagar 7 Bedroom
13312 Super built-up Area Ready To Move Bellandur 2 BHK
13316 Super built-up Area Ready To Move Richards Town 4 BHK
13319 Super built-up Area Ready To Move Doddathoguru 1 BHK
[8]: df.fillna(0)
[13320 rows x 9 columns]
[9]: df.fillna(1)
[10]: df.ffill()
size society total_sqft bath balcony price
0 2 BHK Coomee 1056 2.0 1.0 39.07
1 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 3 BHK Theanmp 1440 2.0 3.0 62.00
3 3 BHK Soiewre 1521 3.0 1.0 95.00
4 2 BHK Soiewre 1200 2.0 1.0 51.00
… … … … … … …
13315 5 Bedroom ArsiaEx 3453 4.0 0.0 231.00
13316 4 BHK ArsiaEx 3600 5.0 0.0 400.00
13317 2 BHK Mahla T 1141 2.0 1.0 60.00
13318 4 BHK SollyCl 4689 4.0 1.0 488.00
13319 1 BHK SollyCl 550 1.0 1.0 17.00
[11]: df.bfill()
[12]: df.ffill(axis=1)
[12]: area_type availability location \
0 Super built-up Area 2024-12-19 00:00:00 Electronic City Phase II
1 Plot Area Ready To Move Chikka Tirupathi
2 Built-up Area Ready To Move Uttarahalli
3 Super built-up Area Ready To Move Lingadheeranahalli
4 Super built-up Area Ready To Move Kothanur
… … … …
13315 Built-up Area Ready To Move Whitefield
13316 Super built-up Area Ready To Move Richards Town
13317 Built-up Area Ready To Move Raja Rajeshwari Nagar
13318 Super built-up Area 2024-06-18 00:00:00 Padmanabhanagar
13319 Super built-up Area Ready To Move Doddathoguru
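The three directional fills used above differ only in where the replacement value comes from. A small sketch on a made-up two-column frame (column names `a` and `b` are placeholders) makes the difference visible:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 20.0, np.nan, np.nan],
})

forward = toy.ffill()        # copy the last valid value downwards within each column
backward = toy.bfill()       # copy the next valid value upwards within each column
across = toy.ffill(axis=1)   # copy from the column to the left, row by row

print(forward)
print(backward)
print(across)
```

Note that `ffill()` cannot fill a gap at the top of a column and `bfill()` cannot fill one at the bottom, so some missing values can survive either call. With `axis=1`, values move between unrelated columns, which is rarely meaningful for a table like the housing data.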
[13]: df.fillna({'society':'abc','balcony':'xyz'},inplace=True)
df.head(15)
size society total_sqft bath balcony price
0 2 BHK Coomee 1056 2.0 1.0 39.07
1 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 3 BHK abc 1440 2.0 3.0 62.00
3 3 BHK Soiewre 1521 3.0 1.0 95.00
4 2 BHK abc 1200 2.0 1.0 51.00
5 2 BHK DuenaTa 1170 2.0 1.0 38.00
6 4 BHK Jaades 2732 4.0 xyz 204.00
7 4 BHK Brway G 3300 4.0 xyz 600.00
8 3 BHK abc 1310 3.0 1.0 63.25
9 6 Bedroom abc 1020 6.0 xyz 370.00
10 3 BHK abc 1800 2.0 2.0 70.00
11 4 Bedroom Prrry M 2785 5.0 3.0 295.00
12 2 BHK Shncyes 1000 2.0 1.0 38.00
13 2 BHK abc 1100 2.0 2.0 40.00
14 3 Bedroom Skityer 2250 3.0 2.0 148.00
[14]: 0 1.0
1 3.0
2 3.0
3 1.0
4 1.0
…
13315 0.0
13316 2.0
13317 1.0
13318 1.0
13319 1.0
Name: balcony, Length: 13320, dtype: float64
[15]: df["balcony"].fillna(value=df["balcony"].max())
[15]: 0 1.0
1 3.0
2 3.0
3 1.0
4 1.0
…
13315 0.0
13316 3.0
13317 1.0
13318 1.0
13319 1.0
Name: balcony, Length: 13320, dtype: float64
[16]: df["balcony"].fillna(value=df["balcony"].min())
[16]: 0 1.0
1 3.0
2 3.0
3 1.0
4 1.0
…
13315 0.0
13316 0.0
13317 1.0
13318 1.0
13319 1.0
Name: balcony, Length: 13320, dtype: float64
[17]: 0 1.0
1 3.0
2 3.0
3 1.0
4 1.0
…
13315 0.0
13316 2.0
13317 1.0
13318 1.0
13319 1.0
Name: balcony, Length: 13320, dtype: float64
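Filling with the maximum or minimum, as in cells [15] and [16], pushes every missing entry to an extreme. A hedged sketch comparing those choices with the median and the mode on a short made-up series (the numbers are illustrative, and this is not necessarily what the unlabelled cells above computed):

```python
import numpy as np
import pandas as pd

balcony = pd.Series([1.0, 3.0, np.nan, 1.0, 2.0, np.nan, 0.0], name="balcony")

filled_max = balcony.fillna(balcony.max())            # gaps become 3.0
filled_min = balcony.fillna(balcony.min())            # gaps become 0.0
filled_median = balcony.fillna(balcony.median())      # gaps become the middle value
filled_mode = balcony.fillna(balcony.mode().iloc[0])  # gaps become the most frequent value

print(pd.DataFrame({
    "max": filled_max,
    "min": filled_min,
    "median": filled_median,
    "mode": filled_mode,
}))
```

None of these calls modifies `balcony` itself; each returns a new series, so the result has to be assigned back if it should persist.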
[18]: df.dropna()
size society total_sqft bath balcony price
0 2 BHK Coomee 1056 2.0 1.0 39.07
1 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 3 BHK abc 1440 2.0 3.0 62.00
3 3 BHK Soiewre 1521 3.0 1.0 95.00
4 2 BHK abc 1200 2.0 1.0 51.00
… … … … … … …
13314 3 BHK SoosePr 1715 3.0 3.0 112.00
13315 5 Bedroom ArsiaEx 3453 4.0 0.0 231.00
13317 2 BHK Mahla T 1141 2.0 1.0 60.00
13318 4 BHK SollyCl 4689 4.0 1.0 488.00
13319 1 BHK abc 550 1.0 1.0 17.00
[19]: df1=df.dropna(how='all')
df1.isnull().sum()
[19]: area_type 0
availability 0
location 1
size 16
society 0
total_sqft 0
bath 73
balcony 609
price 0
dtype: int64
[20]: df2=df.dropna(how='any')
df2
0 2 BHK Coomee 1056 2.0 1.0 39.07
1 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 3 BHK abc 1440 2.0 3.0 62.00
3 3 BHK Soiewre 1521 3.0 1.0 95.00
4 2 BHK abc 1200 2.0 1.0 51.00
… … … … … … …
13314 3 BHK SoosePr 1715 3.0 3.0 112.00
13315 5 Bedroom ArsiaEx 3453 4.0 0.0 231.00
13317 2 BHK Mahla T 1141 2.0 1.0 60.00
13318 4 BHK SollyCl 4689 4.0 1.0 488.00
13319 1 BHK abc 550 1.0 1.0 17.00
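`how='all'` and `how='any'` sit at opposite ends: the first drops a row only when every cell is missing, the second when at least one is. A minimal sketch on a made-up frame, including the `subset` and `thresh` parameters of `DataFrame.dropna` as middle-ground options:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "size": ["2 BHK", None, "3 BHK", None],
    "bath": [2.0, np.nan, np.nan, np.nan],
    "price": [39.07, np.nan, 62.0, 51.0],
})

drop_all = toy.dropna(how="all")            # drops only the fully empty row
drop_any = toy.dropna(how="any")            # keeps only fully populated rows
drop_subset = toy.dropna(subset=["price"])  # looks at the price column only
keep_thresh = toy.dropna(thresh=2)          # keeps rows with at least 2 non-missing cells

print(len(drop_all), len(drop_any), len(drop_subset), len(keep_thresh))
```

This also explains why `dropna(how='all')` left the missing counts unchanged in cell [19]: no row in the housing data is empty in every column.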
[21]: df3=df.replace(to_replace=np.nan,value=200)
df3.head(10)
[22]: df4=df.replace(to_replace=1,value=100)
df4.head(10)
2 Built-up Area Ready To Move Uttarahalli
3 Super built-up Area Ready To Move Lingadheeranahalli
4 Super built-up Area Ready To Move Kothanur
5 Super built-up Area Ready To Move Whitefield
6 Super built-up Area 2024-05-18 00:00:00 Old Airport Road
7 Super built-up Area Ready To Move Rajaji Nagar
8 Super built-up Area Ready To Move Marathahalli
9 Plot Area Ready To Move Gandhi Bazar
[23]: df5=df.copy()
df5['balcony']=df['balcony'].interpolate(method='linear')
df5.head(15)
4 2 BHK abc 1200 2.0 1.0 51.00
5 2 BHK DuenaTa 1170 2.0 1.0 38.00
6 4 BHK Jaades 2732 4.0 1.0 204.00
7 4 BHK Brway G 3300 4.0 1.0 600.00
8 3 BHK abc 1310 3.0 1.0 63.25
9 6 Bedroom abc 1020 6.0 1.5 370.00
10 3 BHK abc 1800 2.0 2.0 70.00
11 4 Bedroom Prrry M 2785 5.0 3.0 295.00
12 2 BHK Shncyes 1000 2.0 1.0 38.00
13 2 BHK abc 1100 2.0 2.0 40.00
14 3 Bedroom Skityer 2250 3.0 2.0 148.00
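In the output above, rows 6 and 7 become 1.0 and row 9 becomes 1.5, which is what linear interpolation between the neighbouring valid values produces. A small sketch reproducing that pattern on a stand-alone series (the values mimic rows 5 to 10 but are typed in by hand):

```python
import numpy as np
import pandas as pd

# Same shape as balcony rows 5..10 above: two gaps between equal values, one between 1 and 2.
s = pd.Series([1.0, np.nan, np.nan, 1.0, np.nan, 2.0])
linear = s.interpolate(method="linear")
print(linear.tolist())

# A gap at the very start has no left neighbour, so it stays missing by default.
leading = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate(method="linear")
print(leading.tolist())
```

Linear interpolation treats the row order as meaningful. For listings that are not ordered in any relevant way, an interpolated balcony count of 1.5 is a numerical convenience rather than an estimate of a real property.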
[24]: df.duplicated().sum()
[24]: 530
[25]: df_dr_dup=df.drop_duplicates(keep='first')
df_dr_dup
[26]: df_dr_dup=df.drop_duplicates(keep='last')
df_dr_dup
[27]: df_dr_dup=df.drop_duplicates(keep=False)
df_dr_dup
size society total_sqft bath balcony price
0 2 BHK Coomee 1056 2.0 1.0 39.07
1 4 Bedroom Theanmp 2600 5.0 3.0 120.00
2 3 BHK abc 1440 2.0 3.0 62.00
3 3 BHK Soiewre 1521 3.0 1.0 95.00
4 2 BHK abc 1200 2.0 1.0 51.00
… … … … … … …
13314 3 BHK SoosePr 1715 3.0 3.0 112.00
13315 5 Bedroom ArsiaEx 3453 4.0 0.0 231.00
13316 4 BHK abc 3600 5.0 NaN 400.00
13317 2 BHK Mahla T 1141 2.0 1.0 60.00
13318 4 BHK SollyCl 4689 4.0 1.0 488.00
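The three `keep` settings decide which member of each duplicate group survives. A minimal sketch on a made-up frame with one triplicated row:

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["Whitefield", "Kothanur", "Whitefield", "Whitefield"],
    "price": [60.0, 51.0, 60.0, 60.0],
})

n_dupes = int(toy.duplicated().sum())            # rows that repeat an earlier row
keep_first = toy.drop_duplicates(keep="first")   # first occurrence of each group survives
keep_last = toy.drop_duplicates(keep="last")     # last occurrence of each group survives
keep_none = toy.drop_duplicates(keep=False)      # every row that has a duplicate is dropped

print(n_dupes)
print(keep_first.index.tolist(), keep_last.index.tolist(), keep_none.index.tolist())
```

`keep=False` removes more rows than the other two settings because it discards the original along with its copies, so it suits cases where duplicated records are considered untrustworthy altogether.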
[30]: df1
550067 1006039 P00371644 F 46-50 0 B
[31]: df2
1 0 0 3
2 4+ 1 5
3 4+ 1 4
4 1 0 4
… … … …
233594 4+ 1 8
233595 4+ 1 5
233596 4+ 1 1
233597 4+ 0 10
233598 4+ 1 4
Product_Category_2 Product_Category_3
0 11.0 NaN
1 5.0 NaN
2 14.0 NaN
3 9.0 NaN
4 5.0 12.0
… … …
233594 NaN NaN
233595 8.0 NaN
233596 5.0 12.0
233597 16.0 NaN
233598 5.0 NaN
[51]: df=pd.concat([df1,df2],axis=0)
df
4 4+ 0 8
… … … …
233594 4+ 1 8
233595 4+ 1 5
233596 4+ 1 1
233597 4+ 0 10
233598 4+ 1 4
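After `pd.concat(..., axis=0)` the last index label is still 233598 even though the combined frame is longer, because each input keeps its own row labels. A hedged sketch with two tiny made-up frames; `extra_col` is a hypothetical column that exists in only one of them:

```python
import pandas as pd

df1 = pd.DataFrame({"Gender": ["F", "M"], "Age": ["0-17", "55+"], "extra_col": [1.0, 2.0]})
df2 = pd.DataFrame({"Gender": ["M", "F"], "Age": ["26-35", "18-25"]})

stacked = pd.concat([df1, df2], axis=0)                        # original labels are kept
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(stacked["extra_col"].isna().sum())  # rows from df2 have no value for extra_col
```

Repeated labels mean `stacked.loc[0]` returns two rows rather than one. Passing `ignore_index=True` avoids that, and a column present in only one input is filled with NaN for the rows of the other.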
[52]: df['Gender']=df['Gender'].map({'F':0,'M':1})
df
233595 4+ 1 5
233596 4+ 1 1
233597 4+ 0 10
233598 4+ 1 4
[53]: df['Age'].unique()
[54]: df['Age']=df['Age'].map({'0-17':1,'18-25':2,'26-35':3,'36-45':4,'46-50':5,'51-55':6,'55+':7})
df
4 4+ 0 8
… … … …
233594 4+ 1 8
233595 4+ 1 5
233596 4+ 1 1
233597 4+ 0 10
233598 4+ 1 4
[55]: df['Age'].unique()
[56]: df['City_Category']=df['City_Category'].map({'A':1,'B':2,'C':3})
df
3 2 0 12
4 4+ 0 8
… … … …
233594 4+ 1 8
233595 4+ 1 5
233596 4+ 1 1
233597 4+ 0 10
233598 4+ 1 4
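`Series.map` with a dictionary replaces each value by its dictionary entry and yields NaN for anything not listed. A short sketch using the same age codes as cell [54] on a made-up series that includes one unexpected label:

```python
import pandas as pd

age_codes = {"0-17": 1, "18-25": 2, "26-35": 3, "36-45": 4,
             "46-50": 5, "51-55": 6, "55+": 7}

age = pd.Series(["0-17", "26-35", "55+", "unknown"])  # "unknown" is not a key
encoded = age.map(age_codes)
print(encoded.tolist())

# Mapping a second time finds no matching keys, because the values are now numbers.
twice = encoded.map(age_codes)
print(twice.isna().all())
```

The second call mirrors what happens when a mapping cell is re-run in a notebook: the column silently turns into all NaN. Checking `unique()` right after each mapping, as done in cells [55] and [57], catches this.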
[57]: df['City_Category'].unique()
[58]: df['Stay_In_Current_City_Years'].unique()
[59]: df['Stay_In_Current_City_Years'].value_counts()
[59]: Stay_In_Current_City_Years
1 276425
2 145427
3 135428
4+ 120671
0 105716
Name: count, dtype: int64
[60]: df['Stay_In_Current_City_Years']=df['Stay_In_Current_City_Years'].str.replace("+","",regex=False)
[61]: df['Stay_In_Current_City_Years'].value_counts()
[61]: Stay_In_Current_City_Years
1 276425
2 145427
3 135428
4 120671
0 105716
Name: count, dtype: int64
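Removing the `+` leaves digit strings, not numbers. A minimal sketch of the full clean-up on a made-up series; `regex=False` is passed explicitly because `+` is a regular-expression quantifier and the default of the `regex` parameter has differed between pandas versions:

```python
import pandas as pd

stay = pd.Series(["2", "4+", "1", "0", "4+"])

cleaned = stay.str.replace("+", "", regex=False)  # treat "+" as a literal character
as_int = cleaned.astype(int)                      # digit strings -> integers

print(cleaned.tolist())
print(as_int.tolist())
```

After this step `4` means "four or more" years, so the column is ordinal rather than an exact count.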
[62]: df['Product_ID']=df['Product_ID'].str[1:]
df['Product_ID']
[62]: 0 00069042
1 00248942
2 00087842
3 00085442
4 00285442
…
233594 00118942
233595 00254642
233596 00031842
233597 00124742
233598 00316642
Name: Product_ID, Length: 783667, dtype: object
[63]: 0 69042
1 248942
2 87842
3 85442
4 285442
…
233594 118942
233595 254642
233596 31842
233597 124742
233598 316642
Name: Product_ID, Length: 783667, dtype: int64
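The two outputs above differ in dtype: slicing with `.str[1:]` keeps strings such as `00069042`, while the integer version drops the leading zeros. A sketch of one way to get from the first to the second on made-up IDs (this assumes an `astype(int)` conversion, which the source does not show):

```python
import pandas as pd

product_id = pd.Series(["P00069042", "P00248942", "P00087842"])

digits = product_id.str[1:]   # drop the leading "P", still strings
as_int = digits.astype(int)   # numeric version, leading zeros are lost

# If the original width matters, zero-padding restores it.
restored = as_int.astype(str).str.zfill(8)

print(digits.tolist())
print(as_int.tolist())
print(restored.tolist())
```

An integer product ID is only an identifier. Feeding it to a model as a numeric feature implies an ordering between products that does not exist.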
[65]: df_city=pd.get_dummies(df['City_Category'],prefix='City_Category',drop_first=True)
df_city
2 False False
3 False False
4 False True
… … …
233594 True False
233595 True False
233596 True False
233597 False True
233598 True False
[66]: df_city=pd.get_dummies(df['City_Category'],prefix='City_Category',drop_first=False)
df_city
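The only difference between cells [65] and [66] is `drop_first`. A small sketch on a made-up series with the three city codes shows what each setting returns:

```python
import pandas as pd

city = pd.Series([1, 2, 3, 2], name="City_Category")

full = pd.get_dummies(city, prefix="City_Category", drop_first=False)    # one column per category
reduced = pd.get_dummies(city, prefix="City_Category", drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

With `drop_first=True`, a row whose indicators are all false belongs to the dropped category, so no information is lost. The reduced form avoids perfectly collinear columns, which matters for linear models with an intercept.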
0.1 Part B
[67]: df = pd.read_csv("WDICountry.csv")
df
266 VEN Venezuela Venezuela, RB
.. … …
262 2015 …
263 2010 …
264 2019 …
265 Original chained constant price data are resca… …
266 1997 …
0 NaN
1 NaN
2 Integrated household survey (IHS), 2016/17
3 NaN
4 Integrated household survey (IHS), 2008/09
.. …
262 Expenditure survey/budget survey (ES/BS), 2014/15
263 Integrated household survey (IHS), 2015
264 Integrated household survey (IHS), 2011/12
265 Expenditure survey/budget survey (ES/BS), 2010
266 Integrated household survey (IHS), 2015
[68]: df.isnull().sum()
Currency Unit 48
Region 48
Income Group 50
WB-2 code 1
National accounts base year 56
National accounts reference year 192
SNA price valuation 58
Lending category 122
Other groups 208
System of National Accounts 57
Alternative conversion factor 267
PPP survey year 267
Balance of Payments Manual in use 70
External debt Reporting status 146
System of trade 94
Government Accounting concept 108
IMF data dissemination standard 77
Latest population census 51
Latest household survey 112
Source of most recent Income and expenditure data 97
Vital registration complete 146
Latest agricultural census 137
Latest industrial data 118
Latest trade data 74
dtype: int64
[69]: df.isnull().sum().sum()
[69]: 2606
[70]: df.dtypes
Alternative conversion factor float64
PPP survey year float64
Balance of Payments Manual in use object
External debt Reporting status object
System of trade object
Government Accounting concept object
IMF data dissemination standard object
Latest population census object
Latest household survey object
Source of most recent Income and expenditure data object
Vital registration complete object
Latest agricultural census object
Latest industrial data float64
Latest trade data float64
dtype: object
df['Region'] = df['Region'].fillna(df['Region'].mode().iloc[0])
df['Income Group'] = df['Income Group'].fillna(df['Income Group'].mode().iloc[0])
df['National accounts base year'] = df['National accounts base year'].ffill()
df['National accounts reference year'] = df['National accounts reference year'].ffill()
[75]: df.isnull().sum()
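Part B fills categorical columns with their most frequent value and year columns with a forward fill. A hedged sketch of both on a made-up four-row frame (only the column names come from the data, the rows are invented), using `Series.ffill()`, the replacement pandas recommends for the deprecated `fillna(method='ffill')`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Region": ["South Asia", None, "Sub-Saharan Africa", "Sub-Saharan Africa"],
    "National accounts base year": [np.nan, "2015", np.nan, "2010"],
})

# Categorical column: replace gaps with the most frequent label.
toy["Region"] = toy["Region"].fillna(toy["Region"].mode().iloc[0])

# Forward fill: each gap takes the value from the row above it.
toy["National accounts base year"] = toy["National accounts base year"].ffill()

print(toy)
print(toy.isna().sum())
```

The first row of the year column stays missing because nothing precedes it. A forward fill also assigns one country's base year to the next country in file order, so it is worth checking whether that is acceptable for the analysis.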
[76]: df.head(2)
[76]: Country Code Short Name Table Name \
0 ABW Aruba Aruba
1 AFE Africa Eastern and Southern Africa Eastern and Southern
Latest household survey Source of most recent Income and expenditure data \
0 NaN NaN
1 NaN NaN
[2 rows x 29 columns]
df['Balance of Payments Manual in use'] = df['Balance of Payments Manual in use'].fillna(df['Balance of Payments Manual in use'].mode().iloc[0])
[79]: 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: Latest industrial data, dtype: float64
df['Latest industrial data'] = df['Latest industrial data'].bfill()
df['Latest trade data'] = df['Latest trade data'].bfill()
[81]: df['Latest industrial data'].head()
[81]: 0 2013.0
1 2013.0
2 2013.0
3 2013.0
4 2013.0
Name: Latest industrial data, dtype: float64
[82]: df
265 Australian dollar East Asia & Pacific
266 Venezuelan bolivar fuerte Latin America & Caribbean
4 Enhanced General Data Dissemination System (e-…
.. …
262 Special Data Dissemination Standard (SDDS)
263 Enhanced General Data Dissemination System (e-…
264 Enhanced General Data Dissemination System (e-…
265 Special Data Dissemination Standard (SDDS)
266 Enhanced General Data Dissemination System (e-…
Latest industrial data Latest trade data
0 2013.0 2018.0
1 2013.0 2018.0
2 2013.0 2018.0
3 2013.0 2018.0
4 2013.0 2018.0
.. … …
262 2010.0 2018.0
263 1994.0 2018.0
264 2013.0 2018.0
265 2013.0 2018.0
266 1998.0 2013.0
[83]: df = df.drop_duplicates()
df
2 Afghan afghani South Asia
3 Euro Europe & Central Asia
4 Angolan kwanza Sub-Saharan Africa
.. … …
261 Yemeni rial Middle East & North Africa
262 South African rand Sub-Saharan Africa
263 New Zambian kwacha Sub-Saharan Africa
264 Zimbabwean Dollar Sub-Saharan Africa
266 Venezuelan bolivar fuerte Latin America & Caribbean
IMF data dissemination standard \
0 Enhanced General Data Dissemination System (e-…
1 Enhanced General Data Dissemination System (e-…
2 Enhanced General Data Dissemination System (e-…
3 Enhanced General Data Dissemination System (e-…
4 Enhanced General Data Dissemination System (e-…
.. …
261 Enhanced General Data Dissemination System (e-…
262 Special Data Dissemination Standard (SDDS)
263 Enhanced General Data Dissemination System (e-…
264 Enhanced General Data Dissemination System (e-…
266 Enhanced General Data Dissemination System (e-…
.. … …
261 2012.0 2015.0
262 2010.0 2018.0
263 1994.0 2018.0
264 2013.0 2018.0
266 1998.0 2013.0