
Data Cleaning with Python and Pandas:

Detecting Missing Values​


Sources of Missing Values​
Before we dive into code, it's important to understand the sources of
missing data. Here are some typical reasons why data is missing:
• User forgot to fill in a field.​
• Data was lost while transferring manually from a legacy database.​
• There was a programming error.​
• Users chose not to fill out a field tied to their beliefs about how the results would be used or
interpreted.​
As you can see, some of these sources are just simple random mistakes.
Other times, there can be a deeper reason why data is missing.​
It’s important to understand these different types of missing data from a
statistics point of view. The type of missing data will influence how you
deal with filling in the missing values.​
Today we’ll learn how to detect missing values, and do some basic
imputation. For a detailed statistical approach for dealing with missing
data, check out these awesome slides from data scientist Matt Brems.​
Keep in mind, imputing with a median or mean value is usually a bad idea,
so be sure to check out Matt’s slides for the correct approach.​
Getting Started​
Before you start cleaning a data set, it’s a good idea to just get a general
feel for the data. After that, you can put together a plan to clean the data.​
I like to start by asking the following questions:​
• What are the features?​
• What are the expected types (int, float, string, boolean)?​
• Is there obvious missing data (values that Pandas can detect)?
• Are there other types of missing data that aren't so obvious (values that Pandas can't easily detect)?
To show you what I mean, let’s start working through the example.​
The data we're going to work with is a very small real estate dataset.
Head on over to our GitHub page to grab a copy of the CSV file so that you
can code along.
Here’s a quick look at the data:​

This is a much smaller dataset than what you’ll typically work with. Even
though it’s a small dataset, it highlights a lot of real-world situations that
you will encounter.​
A good way to get a quick feel for the data is to take a look at the first few
rows. Here’s how you would do that in Pandas:​
# Importing libraries
import pandas as pd
import numpy as np

# Read the CSV file into a Pandas dataframe
df = pd.read_csv("property data.csv")

# Take a look at the first few rows
print(df.head())

Out:
   ST_NUM    ST_NAME OWN_OCCUPIED  NUM_BEDROOMS
0   104.0     PUTNAM            Y           3.0
1   197.0  LEXINGTON            N           3.0
2     NaN  LEXINGTON            N           3.0
3   201.0   BERKELEY          NaN           1.0
4   203.0   BERKELEY            Y           3.0
I know that I said we'll be working with Pandas, but you can see that I
also imported NumPy. We'll use it a little later on to rename some
missing values, so we might as well import it now.
After importing the libraries we read the csv file into a Pandas dataframe.
You can think of the dataframe as a spreadsheet.​
With the .head() method, we can easily see the first few rows.
Now I can answer my original question, what are my features? It’s pretty
easy to infer the following features from the column names:​
• ST_NUM : Street number​
• ST_NAME : Street name​

• OWN_OCCUPIED : Is the residence owner occupied​

• NUM_BEDROOMS : Number of bedrooms​

We can also answer, what are the expected types?​


• ST_NUM : float or int… some sort of numeric type​
• ST_NAME : string​

• OWN_OCCUPIED : string… Y (“Yes”) or N (“No”)​

• NUM_BEDROOMS : float or int, a numeric type​
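
To check these expectations against what Pandas actually inferred, df.dtypes is handy. Here's a minimal sketch, using a small hand-built frame that mirrors the dataset's columns (the values are illustrative, not the full file):

```python
import pandas as pd
import numpy as np

# A tiny frame mirroring the dataset's columns (illustrative values)
df = pd.DataFrame({
    "ST_NUM": [104.0, 197.0, np.nan],
    "ST_NAME": ["PUTNAM", "LEXINGTON", "LEXINGTON"],
    "OWN_OCCUPIED": ["Y", "N", "N"],
    "NUM_BEDROOMS": [3.0, 3.0, 3.0],
})

# Compare actual dtypes to the expected types listed above
print(df.dtypes)
```

Note that string columns show up as object, and any numeric column containing a missing value is promoted to float64, which is why ST_NUM isn't an int here.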

To answer the next two questions, we'll need to start getting more in-
depth with Pandas. Let's start looking at examples of how to detect
missing values.
Standard Missing Values​
So what do I mean by “standard missing values”? These are missing
values that Pandas can detect.
Going back to our original dataset, let’s take a look at the “Street
Number” column.​

In the third row there’s an empty cell. In the seventh row there’s an
“NA” value.​
Clearly these are both missing values. Let’s see how Pandas deals with
these.​
# Looking at the ST_NUM column
print(df['ST_NUM'])
print(df['ST_NUM'].isnull())

Out:
0 104.0​
1 197.0​
2 NaN​
3 201.0​
4 203.0​
5 207.0​
6 NaN​
7 213.0​
8 215.0​
Out:​
0 False​
1 False​
2 True​
3 False​
4 False​
5 False​
6 True​
7 False​
8 False​
Taking a look at the column, we can see that Pandas filled in both the
blank cell and the "NA" entry with NaN. Using the isnull() method, we can
confirm that both were recognized as missing values: both boolean
responses are True.
This is a simple example, but highlights an important point. Pandas will
recognize both empty cells and “NA” types as missing values. In the next
section, we’ll take a look at some types that Pandas won’t recognize.​
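One subtlety worth knowing: it's read_csv that translates tokens like "NA" into NaN at parse time. Once the data is in memory, a literal string "NA" is not considered missing. A quick sketch with pd.isna shows the difference:

```python
import pandas as pd
import numpy as np

# pd.isna works on scalars as well as Series
print(pd.isna(np.nan))   # True: NaN is a missing value
print(pd.isna("NA"))     # False: a literal string is not missing
```

So if a column somehow ends up holding the string "NA" after import, isnull() won't flag it; it has to be converted to NaN first.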
Non-Standard Missing Values​
Sometimes missing values come in formats that Pandas doesn't recognize.
Let’s take a look at the “Number of Bedrooms” column to see what I
mean.​
In this column, there are four missing values:
• n/a​
• NA​
• —​
• na​
From the previous section, we know that Pandas will recognize “NA” as a
missing value, but what about the others? Let’s take a look.​
# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

Out:
0 3​
1 3​
2 n/a​
3 1​
4 3​
5 NaN​
6 2​
7 --​
8 na​
Out:​
0 False​
1 False​
2 False​
3 False​
4 False​
5 True​
6 False​
7 False​
8 False​
Just like before, Pandas recognized the “NA” as a missing value.
Unfortunately, the other types weren’t recognized.​
If multiple users are manually entering data, then this is a common
problem. Maybe I like to use "n/a" but you like to use "na".
An easy way to detect these various formats is to put them in a list. Then
when we import the data, Pandas will recognize them right away. Here’s
an example of how we would do that.​
# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values=missing_values)
Now let’s take another look at this column and see what happens.​
# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())

Out:
0 3.0​
1 3.0​
2 NaN​
3 1.0​
4 3.0​
5 NaN​
6 2.0​
7 NaN​
8 NaN​
Out:​
0 False​
1 False​
2 True​
3 False​
4 False​
5 True​
6 False​
7 True​
8 True​
This time, all of the different formats were recognized as missing values.​
You might not be able to catch all of these right away. As you work through
the data and see other types of missing values, you can add them to the list.​
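One quick way to discover candidates for that list is to inspect a column's unique values before converting anything. This sketch rebuilds the raw NUM_BEDROOMS column by hand (as it would look when read without na_values) and flags entries that don't parse as numbers:

```python
import pandas as pd

# Raw NUM_BEDROOMS values as they would be read without na_values
bedrooms = pd.Series(["3", "3", "n/a", "1", "3", None, "2", "--", "na"])

# Unique entries that don't look numeric are candidates for na_values
candidates = [v for v in bedrooms.dropna().unique()
              if not v.replace(".", "", 1).isdigit()]
print(candidates)  # ['n/a', '--', 'na']
```

Anything this surfaces can go straight into the missing_values list used with read_csv.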
It's important to recognize these non-standard types of missing values
when summarizing and transforming them. If you try to count the missing
values before converting the non-standard types, you could drastically
undercount them.
In the next section we’ll take a look at a more complicated, but very
common, type of missing value.​
Unexpected Missing Values​
So far we’ve seen standard missing values, and non-standard missing
values. What if we have an unexpected type?​
For example, if our feature is expected to be a string, but there’s a
numeric type, then technically this is also a missing value.​
Let’s take a look at the “Owner Occupied” column to see what I’m
talking about.​
From our previous examples, we know that Pandas will detect the empty
cell in row seven as a missing value. Let’s confirm with some code.​
# Looking at the OWN_OCCUPIED column
print(df['OWN_OCCUPIED'])
print(df['OWN_OCCUPIED'].isnull())
Out:​
0 Y​
1 N​
2 N​
3 12​
4 Y​
5 Y​
6 NaN​
7 Y​
8 Y​
Out:​
0 False​
1 False​
2 False​
3 False​
4 False​
5 False​
6 True​
7 False​
8 False​
In the fourth row, there’s the number 12. The response for Owner
Occupied should clearly be a string (Y or N), so this numeric type should be
a missing value.​
This example is a little more complicated so we’ll need to think through a
strategy for detecting these types of missing values. There’s a number of
different approaches, but here’s the way that I’m going to work through
this one.​
1. Loop through the OWN_OCCUPIED column​
2. Try and turn the entry into an integer​
3. If the entry can be changed into an integer, enter a missing value​
4. If the entry can't be turned into an integer, we know it's a string, so keep going
Let’s take a look at the code and then we’ll go through it in detail.​
# Detecting numbers
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass
    cnt += 1
In the code we're looping through each entry in the "Owner Occupied"
column. To try to change the entry to an integer, we're using int(row).
If the value can be changed to an integer, we change the entry to a missing
value using NumPy's np.nan.
On the other hand, if it can’t be changed to an integer, we pass and
keep going.​
You'll notice that I used try and except ValueError. This is called
exception handling, and we use it to handle errors.
If we were to try to change an entry into an integer and it couldn't be
changed, a ValueError would be raised and the code would stop. To deal
with this, we use exception handling to recognize these errors and keep
going.
Another important bit of the code is the .loc method. This is the preferred
Pandas method for modifying entries in place. For more info on this you
can check out the Pandas documentation.​
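As an aside, the loop above can also be written without explicit iteration. One vectorized alternative, sketched here on a hand-built copy of the column, uses pd.to_numeric with errors='coerce': entries that parse as numbers come back non-NaN, which gives us a mask of exactly the unexpected values:

```python
import pandas as pd
import numpy as np

# The OWN_OCCUPIED column from the article, including the stray 12
df = pd.DataFrame({
    "OWN_OCCUPIED": ["Y", "N", "N", "12", "Y", "Y", np.nan, "Y", "Y"]
})

# Entries that parse as numbers are the unexpected ones
numeric_mask = pd.to_numeric(df["OWN_OCCUPIED"], errors="coerce").notna()
df.loc[numeric_mask, "OWN_OCCUPIED"] = np.nan

print(df["OWN_OCCUPIED"].isnull().sum())  # 2
```

Both approaches give the same result; the vectorized version just avoids a Python-level loop.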
Now that we’ve worked through the different ways of detecting missing
values, we’ll take a look at summarizing, and replacing them.​
Summarizing Missing Values​
After we’ve cleaned the missing values, we will probably want to
summarize them. For instance, we might want to look at the total number
of missing values for each feature.​
# Total missing values for each feature
print(df.isnull().sum())

Out:
ST_NUM 2​
ST_NAME 0​
OWN_OCCUPIED 2​
NUM_BEDROOMS 4​
Other times we might want to do a quick check to see if we have any
missing values at all.​
# Any missing values?
print(df.isnull().values.any())

Out:
True​
We might also want to get a total count of missing values.​
# Total number of missing values
print(df.isnull().sum().sum())

Out:
8​
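The same .isnull() trick extends to percentages, which are often easier to read than raw counts. A small sketch (the frame here is hand-built for illustration):

```python
import pandas as pd
import numpy as np

# Illustrative frame: ST_NUM is half missing, ST_NAME is complete
df = pd.DataFrame({
    "ST_NUM": [104.0, np.nan, 201.0, np.nan],
    "ST_NAME": ["PUTNAM", "LEXINGTON", "BERKELEY", "BERKELEY"],
})

# isnull().mean() is the fraction of missing values per column
print((df.isnull().mean() * 100).round(1))
```

This works because the boolean True/False values are treated as 1/0 when averaged.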
Now that we’ve summarized the number of missing values, let’s take a
look at doing some simple replacements.​
Replacing​
Often you'll have to figure out how you want to handle missing values.
Sometimes you'll simply want to delete those rows; other times you'll
replace them.
As I mentioned earlier, this shouldn’t be taken lightly. We’ll go over
some basic imputations, but for a detailed statistical approach for dealing
with missing data, check out these awesome slides from data scientist Matt
Brems.​
That being said, maybe you just want to fill in missing values with a single
value.​
# Replace missing values with a number
df['ST_NUM'] = df['ST_NUM'].fillna(125)
More likely, you might want to do a location based imputation. Here’s
how you would do that.​
# Location based replacement​
df.loc[2,'ST_NUM'] = 125​
A very common way to replace missing values is using a median.​
# Replace using median
median = df['NUM_BEDROOMS'].median()
df['NUM_BEDROOMS'] = df['NUM_BEDROOMS'].fillna(median)
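A median only makes sense for numeric columns. For a categorical column like OWN_OCCUPIED, one common (and equally rough) fallback is the mode, sketched here on a hand-built copy of the column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "OWN_OCCUPIED": ["Y", "N", "N", np.nan, "Y", "Y", np.nan, "Y", "Y"]
})

# .mode() returns a Series of the most frequent value(s); take the first
mode_value = df["OWN_OCCUPIED"].mode()[0]
df["OWN_OCCUPIED"] = df["OWN_OCCUPIED"].fillna(mode_value)

print(df["OWN_OCCUPIED"].isnull().sum())  # 0
```

As with the median, this is a blunt instrument; it just keeps the column usable until you apply a proper imputation strategy.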
We’ve gone over a few simple ways to replace missing values, but be sure
to check out Matt’s slides for the proper techniques.​
Conclusion​
Dealing with messy data is inevitable. Data cleaning is just part of the
process on a data science project.​
In this article we went over some ways to detect, summarize, and replace
missing values.​
