
L-4 (Handling of Missing Values).Ipynb - Colab

The document provides a comprehensive guide on handling missing values in Python using Pandas, including methods for adding, finding, counting, dropping, and filling missing values. It demonstrates the use of functions like isnull(), dropna(), and fillna() with practical examples. Additionally, it covers reindexing data and the implications of introducing missing values during this process.

Uploaded by

ashishpal2804

3/14/25, 4:37 PM L-4 (Handling of Missing Values).ipynb - Colab

Python DataFrames Part 3: Handling of Missing Data


Adding missing Values in Data

import numpy as np
import pandas as pd

# Dictionary of lists; np.nan marks the missing entries
# (named scores so that the built-in dict is not shadowed)
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# Create a DataFrame from the dictionary
df = pd.DataFrame(scores)
print(df)

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
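
As a small aside that goes beyond what the notebook shows, Python's None can also be used to mark a gap in a numeric column; pandas stores it as NaN. The frame name and values below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: one gap written as None, one as np.nan
sample_scores = pd.DataFrame({'First Score': [100, None, 95],
                              'Second Score': [30, 45, np.nan]})

print(sample_scores)
# NaN is a float, so columns holding it are stored as floating point
print(sample_scores.dtypes)
```

Both spellings end up as the same missing-value marker, so isnull(), dropna() and fillna() treat them alike.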

# Alternate way of adding NaN Values


from numpy import nan as NA
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

0 1 2

0 0.360383 NaN NaN

1 -0.726426 NaN NaN

2 -1.007049 NaN 0.392787

3 0.534430 NaN -0.783299

4 -0.366662 -0.652505 -0.041234

5 0.607733 -0.336879 -1.796508

6 -1.766997 2.214246 0.573730


 

Finding and Counting Missing Values

# Using the isnull() function
scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}

# Create a DataFrame from the dictionary
df = pd.DataFrame(scores)
print(df)
print(df.isnull())  # True wherever a value is missing

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False

# Using NumPy's isnan() function (works here because every column is numeric)
print(np.isnan(df))

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
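
The two checks above agree because every column is numeric. The sketch below, using a hypothetical frame with a string column, shows the pandas spellings isna() and notna() alongside np.isnan applied to the numeric column only, since np.isnan does not accept non-numeric input.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one numeric column and one string column
mixed = pd.DataFrame({'score': [100.0, np.nan, 95.0],
                      'color': ['red', None, 'blue']})

# isna() is an alias of isnull(); notna() is its complement
missing_mask = mixed.isna()
present_mask = mixed.notna()
print(missing_mask)

# np.isnan only handles numeric data, so restrict it to the numeric column
numeric_mask = np.isnan(mixed['score'])
print(numeric_mask)
```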


s1 = pd.isnull(df["First Score"])
# displaying data only with First Score = NaN
df[s1]

First Score Second Score Third Score

2 NaN 56.0 80.0


 

# Counting the NaN values in each column


print(df.isnull().sum())

First Score 1
Second Score 1
Third Score 1
dtype: int64

# Counting the total NaN values in the DataFrame


print("Total Number of null values in the DataFrame : " +
str(df.isnull().sum().sum()))

Total Number of null values in the DataFrame : 3
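
The same boolean mask can be summarised in other directions. The sketch below rebuilds the scores frame under the assumed name df_scores and counts NaN values per row, computes the share of NaN values per column, and selects the rows that contain any NaN.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                          'Second Score': [30, 45, 56, np.nan],
                          'Third Score': [np.nan, 40, 80, 98]})

nan_per_row = df_scores.isnull().sum(axis=1)   # NaN count for each row
nan_fraction = df_scores.isnull().mean()       # share of NaN in each column
rows_with_nan = df_scores[df_scores.isnull().any(axis=1)]  # rows holding any NaN

print(nan_per_row)
print(nan_fraction)
print(rows_with_nan)
```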

# Count the NaN values in one or more columns of a pandas DataFrame

# NaN count in a single column, selected by position with iloc
# (positions start at 0, so position 1 is the second column, 'Second Score')
print("Number of null values in the column at position 1 : " +
      str(df.iloc[:, 1].isnull().sum()))

Number of null values in the column at position 1 : 1

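# The frame has only three columns, so a slice starting at position 1
# returns 'Second Score' and 'Third Score' (the stop position is exclusive)
print("\n Number of null values in the columns at positions 1 and 2 : \n" +
      str(df.iloc[:, 1:3].isnull().sum()))

 Number of null values in the columns at positions 1 and 2 :
Second Score    1
Third Score     1
dtype: int64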

# Counting the NaN values in a single row (iloc positions start at 0,
# so iloc[0] is the first row)
print("Number of null values in row 1 : " +
      str(df.iloc[0].isnull().sum()))
print("Number of null values in row 2 : " +
      str(df.iloc[1].isnull().sum()))
print("Number of null values in row 4 : " +
      str(df.iloc[3].isnull().sum()))

Number of null values in row 1 : 1
Number of null values in row 2 : 0
Number of null values in row 4 : 1

# The same count, selecting the column by label with loc
print("Number of null values in column 'Second Score' : " +
      str(df.loc[:, "Second Score"].isnull().sum()))

Number of null values in column 'Second Score' : 1

Drop Missing Values

# DataFrame.dropna() drops every row that contains at least one null value
print(df)
new_df = df.dropna(inplace=False)  # returns a new DataFrame; the original is NOT changed
print(df)
print(new_df)

df.dropna(inplace=True)  # modifies the original DataFrame in place
print(df)

   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2          NaN          56.0         80.0
3         95.0           NaN         98.0
   First Score  Second Score  Third Score
1         90.0          45.0         40.0
   First Score  Second Score  Third Score
1         90.0          45.0         40.0

df.dropna(thresh=2)  # keep only rows that have at least 2 non-NaN values

First Score Second Score Third Score

1 90.0 45.0 40.0

new_df = df.dropna(axis=1, how='all')  # drop only the columns in which every value is NaN
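
Because df holds a single complete row by this point, the options above have little visible effect. The sketch below applies them to a fresh copy of the scores frame (the name df_scores and the choice of subset column are assumptions for illustration).

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                          'Second Score': [30, 45, 56, np.nan],
                          'Third Score': [np.nan, 40, 80, 98]})

at_least_two = df_scores.dropna(thresh=2)               # keep rows with >= 2 non-NaN values
first_known = df_scores.dropna(subset=['First Score'])  # only 'First Score' must be present
all_missing = df_scores.dropna(how='all')               # drop rows where every value is NaN
complete_cols = df_scores.dropna(axis=1)                # drop columns containing any NaN

print(first_known)
print(complete_cols.shape)
```

Every row here has at least two valid values and every column has one gap, so thresh=2 and how='all' keep all four rows while axis=1 removes all three columns.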

Fill Missing Values

from numpy import nan as NA


df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

0 1 2

0 -0.222188 NaN NaN

1 -0.263588 NaN NaN

2 1.044907 NaN 1.393626

3 -0.118102 NaN -0.489017

4 0.555326 -0.375695 -0.969556

5 -0.501783 0.397232 -1.451455

6 1.179618 -1.341373 0.336935

newdf = df.fillna(0)  # replace every NaN with 0; df itself is unchanged
newdf

0 1 2

0 -0.222188 0.000000 0.000000

1 -0.263588 0.000000 0.000000

2 1.044907 0.000000 1.393626

3 -0.118102 0.000000 -0.489017

4 0.555326 -0.375695 -0.969556

5 -0.501783 0.397232 -1.451455

6 1.179618 -1.341373 0.336935

# A dictionary gives a separate fill value per column: 0.5 for column 1, 0 for column 2
newdf = df.fillna({1: 0.5, 2: 0})
newdf

0 1 2

0 -0.222188 0.500000 0.000000

1 -0.263588 0.500000 0.000000

2 1.044907 0.500000 1.393626

3 -0.118102 0.500000 -0.489017

4 0.555326 -0.375695 -0.969556

5 -0.501783 0.397232 -1.451455

6 1.179618 -1.341373 0.336935

df.fillna(0, inplace=True)
df


0 1 2

0 -0.222188 0.000000 0.000000

1 -0.263588 0.000000 0.000000

2 1.044907 0.000000 1.393626

3 -0.118102 0.000000 -0.489017

4 0.555326 -0.375695 -0.969556

5 -0.501783 0.397232 -1.451455

6 1.179618 -1.341373 0.336935
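
A constant is not the only possible fill value. As one common alternative, sketched here on the scores frame under the assumed name df_scores, fillna() accepts a Series of per-column values such as the column means.

```python
import numpy as np
import pandas as pd

df_scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                          'Second Score': [30, 45, 56, np.nan],
                          'Third Score': [np.nan, 40, 80, 98]})

col_means = df_scores.mean()          # NaN values are skipped when averaging
filled = df_scores.fillna(col_means)  # each gap gets the mean of its own column

print(col_means)
print(filled)
```

Whether a mean is a sensible replacement depends on the data; this only illustrates the mechanics.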

from numpy import nan as NA


df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

0 1 2

0 0.502151 -1.590141 -0.122283

1 0.032070 0.346746 -0.137181

2 -0.409375 NaN -1.549798

3 0.434145 NaN -0.531255

4 -0.100843 NaN NaN

5 -1.227151 NaN NaN

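# ffill() forward-fills: each NaN takes the last valid value above it in the same column.
# (df.fillna(method='ffill') is the older spelling; recent pandas versions deprecate it.)
# Note: the output below was captured from a different random draw than the frame printed above.
df.ffill()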

0 1 2

0 1.394587 2.423929 0.144844

1 0.718246 -0.800276 0.230170

2 -0.185163 -0.800276 -1.160679

3 -0.549160 -0.800276 -0.804154

4 -1.994827 -0.800276 -0.804154

5 -0.147952 -0.800276 -0.804154

# bfill() backward-fills: each NaN takes the next valid value below it in the same column.
# Here the last row of columns 1 and 2 is NaN, so there is nothing to copy back and the frame is unchanged.
df.bfill()

0 1 2

0 0.502151 -1.590141 -0.122283

1 0.032070 0.346746 -0.137181

2 -0.409375 NaN -1.549798

3 0.434145 NaN -0.531255

4 -0.100843 NaN NaN

5 -1.227151 NaN NaN

# Give the last row valid values so that a backward fill has something to propagate
df.iloc[5, 1] = 0.230171
df.iloc[5, 2] = 0.430171
df

0 1 2

0 1.394587 2.423929 0.144844

1 0.718246 -0.800276 0.230170

2 -0.185163 NaN -1.160679

3 -0.549160 NaN -0.804154

4 -1.994827 NaN NaN

5 -0.147952 0.230171 0.430171


df.bfill()  # the NaNs above the last row now take the values set in row 5

0 1 2

0 1.394587 2.423929 0.144844

1 0.718246 -0.800276 0.230170

2 -0.185163 0.230171 -1.160679

3 -0.549160 0.230171 -0.804154

4 -1.994827 0.230171 0.430171

5 -0.147952 0.230171 0.430171
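
Forward and backward filling can be hard to follow with random numbers. The sketch below uses a small hand-written Series (an assumption, not data from the notebook) and also shows the limit argument, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

forward = s.ffill()                 # every gap takes the last valid value before it
forward_limited = s.ffill(limit=1)  # fill at most one consecutive NaN per gap
backward = s.bfill()                # every gap takes the next valid value after it

print(forward)
print(forward_limited)
print(backward)
```

The trailing NaN survives the backward fill because no valid value follows it, which is the same situation as the unchanged frame shown earlier.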

REINDEXING
An important method on pandas objects is reindex, which creates a new object with the data conformed to a new index. The examples below use it to reorder existing data to match a new set of labels and to insert missing-value (NaN) markers at labels that had no data.

# Create a series
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
print(type(obj))

d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
<class 'pandas.core.series.Series'>

# Calling reindex on this Series rearranges the data according to the new index,
# introducing missing values if any index values were not already present:

obj1 = obj.reindex(['a', 'b', 'c', 'd', 'e'])


print(obj1)

a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64

# For ordered data like time series, it may be desirable to interpolate or fill values when reindexing.
# The method option does this, using a method such as ffill, which forward-fills the values:
obj2 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 4, 9])
print(obj2)

0 blue
4 purple
9 yellow
dtype: object

obj2a = obj2.reindex(range(12))  # no fill method given, so the new labels are NaN
print(obj2a)

0 blue
1 NaN
2 NaN
3 NaN
4 purple
5 NaN
6 NaN
7 NaN
8 NaN
9 yellow
10 NaN
11 NaN
dtype: object

print("\nobj2\n", obj2)
obj2b = obj2.reindex(range(12), method='bfill')  # backward fill: each new label takes the next existing value
print("\nnew obj2\n", obj2b)

obj2
0 blue
4 purple
9 yellow
dtype: object

new obj2
0 blue
1 purple
2 purple
3 purple
4 purple
5 yellow
6 yellow
7 yellow
8 yellow
9 yellow
10 NaN
11 NaN
dtype: object

print("\nobj2\n", obj2)
obj2c = obj2.reindex(range(12), method='ffill')  # forward fill: each new label takes the previous existing value
print("\nnew obj2c\n", obj2c)

obj2
0 blue
4 purple
9 yellow
dtype: object

new obj2c
0 blue
1 blue
2 blue
3 blue
4 purple
5 purple
6 purple
7 purple
8 purple
9 yellow
10 yellow
11 yellow
dtype: object
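
One detail worth checking is that filling during reindex looks only at the index labels, not at the values. In the sketch below (a hypothetical Series with a NaN already in it), reindex(method='ffill') leaves that original NaN alone and even copies it to the next new label, whereas reindexing first and then calling ffill() fills every gap.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose middle value is already missing
obj3 = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])

via_method = obj3.reindex(range(6), method='ffill')  # fills by label position only
via_ffill = obj3.reindex(range(6)).ffill()           # fills by looking at the values

print(via_method)
print(via_ffill)
```

Which behaviour is wanted depends on whether an existing NaN should be treated as real data or as a gap.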

# With a DataFrame, reindex can alter the (row) index, the columns, or both.
# When passed only a sequence, it reindexes the rows in the result:

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Delhi', 'Mumbai', 'Kolkata'])
print(frame)

Delhi Mumbai Kolkata


a 0 1 2
c 3 4 5
d 6 7 8

frame2 = frame.reindex(['a', 'b', 'd'])


print(frame2)

Delhi Mumbai Kolkata


a 0.0 1.0 2.0
b NaN NaN NaN
d 6.0 7.0 8.0

# The columns can be reindexed with the columns keyword:


cities = ['Mumbai', 'Chennai', 'Kolkata']
frame.reindex(columns=cities)

Mumbai Chennai Kolkata

a 1 NaN 2

c 4 NaN 5

d 7 NaN 8

# fill_value sets the value used for newly introduced labels instead of NaN
cities = ['Mumbai', 'Chennai', 'Kolkata']
frame.reindex(columns=cities, fill_value=100)


Mumbai Chennai Kolkata

a 1 100 2

c 4 100 5

d 7 100 8
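
Rows and columns can also be reindexed in a single call. The sketch below rebuilds the same frame and passes both index and columns, with fill_value=0 chosen arbitrarily for the new row 'b' and the new column 'Chennai'.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Delhi', 'Mumbai', 'Kolkata'])

# Reindex rows and columns together; labels not in frame are filled with 0
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Mumbai', 'Chennai', 'Kolkata'],
                     fill_value=0)
print(both)
```

Columns left out of the new list, here 'Delhi', are dropped from the result.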

Sorting

# Create a sample DataFrame with details of citizens
# (np.nan is used for missing colours; the np.NAN alias was removed in NumPy 2.0)
df_Citizens = pd.DataFrame({
    'name': ['Vinish', 'Savita', 'Ritu', 'Sita', 'Ritu', 'Anu', 'Darvid'],
    'age': [22, 22, 13, 21, 12, 21, 17],
    'section': ['B', 'A', 'C', 'B', 'B', 'A', 'A'],
    'city': ['Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'grade': [92, 94, 95, 79, 87, 94, 90],
    'favourite_color': ['red', np.nan, 'yellow', np.nan, 'black', 'green', 'red']})
df_Citizens.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(df_Citizens)

name age section city gender grade favourite_color


a Vinish 22 B Gurgaon M 92 red
b Savita 22 A Delhi F 94 NaN
c Ritu 13 C Mumbai F 95 yellow
d Sita 21 B Delhi M 79 NaN
e Ritu 12 B Mumbai M 87 black
f Anu 21 A Delhi M 94 green
g Darvid 17 A Mumbai F 90 red

# Sort the DataFrame by 'age' (descending), breaking ties by 'grade' (ascending)
print(df_Citizens.sort_values(['age', 'grade'], ascending=[False, True]))
# print("\n\n", df_Citizens)  # the original DataFrame remains unchanged

name age section city gender grade favourite_color


a Vinish 22 B Gurgaon M 92 red
b Savita 22 A Delhi F 94 NaN
d Sita 21 B Delhi M 79 NaN
f Anu 21 A Delhi M 94 green
g Darvid 17 A Mumbai F 90 red
c Ritu 13 C Mumbai F 95 yellow
e Ritu 12 B Mumbai M 87 black

# Sort the DataFrame by 'name' and then 'favourite_color', both in ascending order
df_Citizens.sort_values(['name', 'favourite_color'], ascending=[True, True])

name age section city gender grade favourite_color

f Anu 21 A Delhi M 94 green

g Darvid 17 A Mumbai F 90 red

e Ritu 12 B Mumbai M 87 black

c Ritu 13 C Mumbai F 95 yellow

b Savita 22 A Delhi F 94 NaN

d Sita 21 B Delhi M 79 NaN

a Vinish 22 B Gurgaon M 92 red

# In-place sort of the DataFrame by 'grade' (ascending), then 'favourite_color' (descending).
# With inplace=True, DataFrame.sort_values() returns None and modifies the DataFrame itself.
# na_position: {'first', 'last'}, default 'last'. 'first' puts NaNs at the beginning, 'last' at the end.

df_Citizens.sort_values(["grade", "favourite_color"],
                        axis=0, ascending=[True, False],
                        inplace=True, na_position='first')
print(df_Citizens)

name age section city gender grade favourite_color


d Sita 21 B Delhi M 79 NaN
e Ritu 12 B Mumbai M 87 black
g Darvid 17 A Mumbai F 90 red
a Vinish 22 B Gurgaon M 92 red
b Savita 22 A Delhi F 94 NaN
f Anu 21 A Delhi M 94 green
c Ritu 13 C Mumbai F 95 yellow
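
The effect of na_position is easier to see on a single column. The sketch below uses a hypothetical Series of colours with two gaps and sorts it both ways; without inplace=True the original Series keeps its order.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', np.nan, 'yellow', np.nan, 'black'],
                   index=['a', 'b', 'c', 'd', 'e'])

nan_last = colors.sort_values()                      # default: NaNs go to the end
nan_first = colors.sort_values(na_position='first')  # NaNs go to the beginning

print(nan_last)
print(nan_first)
```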


# Getting frequency counts of a column in a pandas DataFrame
# using Series.value_counts(): frequency count of the 'city' column
count = df_Citizens['city'].value_counts()
print(count)

Delhi 3
Mumbai 3
Gurgaon 1
Name: city, dtype: int64
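
On a column with gaps, value_counts() leaves NaN out by default. The sketch below uses a stand-alone Series with the same colour values as the favourite_color column (the name fav is an assumption) to show dropna=False, which counts the missing entries too, and normalize=True, which returns shares instead of counts.

```python
import numpy as np
import pandas as pd

fav = pd.Series(['red', np.nan, 'yellow', np.nan, 'black', 'green', 'red'])

counts = fav.value_counts()                       # NaN is excluded by default
counts_with_nan = fav.value_counts(dropna=False)  # NaN gets its own entry
shares = fav.value_counts(normalize=True)         # fractions of the non-NaN values

print(counts)
print(counts_with_nan)
print(shares)
```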
