Kenny-230722-Data Cleaning With Python and Pandas - Detecting Missing Values
This is a much smaller dataset than what you’ll typically work with. Even
though it’s a small dataset, it highlights a lot of real-world situations that
you will encounter.
A good way to get a quick feel for the data is to take a look at the first few
rows. Here’s how you would do that in Pandas:
# Importing libraries
import pandas as pd
import numpy as np
# Read csv file into a pandas dataframe
df = pd.read_csv("property data.csv")
# Take a look at the first few rows
print(df.head())
Out:
ST_NUM ST_NAME OWN_OCCUPIED NUM_BEDROOMS
0 104.0 PUTNAM Y 3.0
1 197.0 LEXINGTON N 3.0
2 NaN LEXINGTON N 3.0
3 201.0 BERKELEY NaN 1.0
4 203.0 BERKELEY Y 3.0
I know that I said we’ll be working with Pandas, but you can see that I
also imported Numpy. We’ll use this a little bit later on to rename some
missing values, so we might as well import it now.
After importing the libraries we read the csv file into a Pandas dataframe.
You can think of the dataframe as a spreadsheet.
With the .head() method, we can easily see the first few rows.
Now I can answer my original question, what are my features? It’s pretty
easy to infer the following features from the column names:
• ST_NUM : Street number
• ST_NAME : Street name
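If you would rather not rely on eyeballing the printout, a quick sketch like the one below lists the column names and the type Pandas inferred for each. The CSV text is a small hypothetical stand-in for the first rows of the file, built in memory so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first few rows of "property data.csv"
csv_text = """ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS
104,PUTNAM,Y,3
197,LEXINGTON,N,3
,LEXINGTON,N,3
201,BERKELEY,,1
203,BERKELEY,Y,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Column names are the features
column_names = list(df.columns)
print(column_names)

# The inferred types hint at what each feature is expected to hold
print(df.dtypes)
```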
To answer the next two questions, we’ll need to start getting more
in-depth with Pandas. Let’s start looking at examples of how to detect
missing values.
Standard Missing Values
So what do I mean by “standard missing values”? These are missing
values that Pandas can detect.
Going back to our original dataset, let’s take a look at the “Street
Number” column.
In the third row there’s an empty cell. In the seventh row there’s an
“NA” value.
Clearly these are both missing values. Let’s see how Pandas deals with
these.
# Looking at the ST_NUM column
print(df['ST_NUM'])
print(df['ST_NUM'].isnull())
Out:
0 104.0
1 197.0
2 NaN
3 201.0
4 203.0
5 207.0
6 NaN
7 213.0
8 215.0
Out:
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
Taking a look at the column, we can see that Pandas filled in the blank
space with “NaN”. Using the isnull() method, we can confirm that both the
empty cell and the “NA” were recognized as missing values. Both boolean
responses are True.
This is a simple example, but highlights an important point. Pandas will
recognize both empty cells and “NA” types as missing values. In the next
section, we’ll take a look at some types that Pandas won’t recognize.
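Here is a minimal, self-contained sketch of the same check. The four-row CSV is made up for illustration (it is not the real file); it has one empty cell and one “NA”, and the sum of the boolean mask gives a quick count of missing values.

```python
import io
import pandas as pd

# Made-up sample: row 1 has an empty cell, row 2 has the text "NA"
csv_text = """ST_NUM,ST_NAME
104,PUTNAM
,LEXINGTON
NA,WASHINGTON
213,TREMONT
"""
df = pd.read_csv(io.StringIO(csv_text))

# True wherever Pandas read the cell as missing
mask = df["ST_NUM"].isnull()
print(mask.tolist())

# Summing the mask counts the missing entries (True counts as 1)
n_missing = int(mask.sum())
print(n_missing)
```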
Non-Standard Missing Values
Sometimes missing values show up in different formats.
Let’s take a look at the “Number of Bedrooms” column to see what I
mean.
In this column, there are four missing values.
• n/a
• NA
• --
• na
From the previous section, we know that Pandas will recognize “NA” as a
missing value, but what about the others? Let’s take a look.
# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
Out:
0 3
1 3
2 n/a
3 1
4 3
5 NaN
6 2
7 --
8 na
Out:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
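Just like before, Pandas recognized the “NA” as a missing value.
Unfortunately, the other types weren’t recognized. (Recent versions of
Pandas also treat “n/a” as missing by default, so row 2 may already show
up as NaN for you; “na” and “--” still won’t.)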
If there are multiple users manually entering data, then this is a common
problem. Maybe I like to use “n/a” but you like to use “na”.
An easy way to detect these various formats is to put them in a list. Then
when we import the data, Pandas will recognize them right away. Here’s
an example of how we would do that.
# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("property data.csv", na_values = missing_values)
Now let’s take another look at this column and see what happens.
# Looking at the NUM_BEDROOMS column
print(df['NUM_BEDROOMS'])
print(df['NUM_BEDROOMS'].isnull())
Out:
0 3.0
1 3.0
2 NaN
3 1.0
4 3.0
5 NaN
6 2.0
7 NaN
8 NaN
Out:
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 True
8 True
This time, all of the different formats were recognized as missing values.
You might not be able to catch all of these right away. As you work through
the data and see other types of missing values, you can add them to the list.
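If you spot a new marker after the data is already loaded, you don’t have to re-read the file. One option, sketched below on a made-up sample, is to swap the markers for NaN with replace(). The extra_missing name and the sample values are my own assumptions for illustration.

```python
import io
import numpy as np
import pandas as pd

# Made-up sample with several ways of writing "missing"
csv_text = """ST_NUM,NUM_BEDROOMS
104,3
197,n/a
201,NA
203,--
207,na
213,2
"""
df = pd.read_csv(io.StringIO(csv_text))  # loaded without na_values

# Markers noticed after loading (hypothetical list)
extra_missing = ["n/a", "na", "--"]
df["NUM_BEDROOMS"] = df["NUM_BEDROOMS"].replace(extra_missing, np.nan)

missing_mask = df["NUM_BEDROOMS"].isnull().tolist()
print(missing_mask)
```

Passing the list to na_values at import time, as shown earlier, is still the simpler route when you know the markers up front.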
It’s important to recognize these non-standard types of missing values for
purposes of summarizing and transforming missing values. If you try and
count the number of missing values before converting these non-standard
types, you could end up missing a lot of missing values.
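To make the undercounting concrete, here is a small sketch that counts missing values in the same made-up column twice: once with the default settings and once with the extra markers passed through na_values. The sample data is invented for illustration.

```python
import io
import pandas as pd

# Made-up sample: four of the six entries are really missing
csv_text = """ST_NUM,NUM_BEDROOMS
104,3
197,n/a
201,NA
203,--
207,na
213,2
"""

# Default import: only the markers Pandas knows about are caught
df_raw = pd.read_csv(io.StringIO(csv_text))
raw_count = int(df_raw["NUM_BEDROOMS"].isnull().sum())

# Import with the extra markers declared
missing_values = ["n/a", "na", "--"]
df_clean = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)
clean_count = int(df_clean["NUM_BEDROOMS"].isnull().sum())

# The default count comes up short; the exact number depends on your Pandas version
print(raw_count, clean_count)
```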
In the next section we’ll take a look at a more complicated, but very
common, type of missing value.
Unexpected Missing Values
So far we’ve seen standard missing values, and non-standard missing
values. What if we have an unexpected type?
For example, if our feature is expected to be a string, but there’s a
numeric type, then technically this is also a missing value.
Let’s take a look at the “Owner Occupied” column to see what I’m
talking about.
From our previous examples, we know that Pandas will detect the empty
cell in row seven as a missing value. Let’s confirm with some code.
# Looking at the OWN_OCCUPIED column
print(df['OWN_OCCUPIED'])
print(df['OWN_OCCUPIED'].isnull())
Out:
0 Y
1 N
2 N
3 12
4 Y
5 Y
6 NaN
7 Y
8 Y
Out:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
In the fourth row, there’s the number 12. The response for Owner
Occupied should clearly be a string (Y or N), so this numeric type should be
a missing value.
This example is a little more complicated so we’ll need to think through a
strategy for detecting these types of missing values. There are a number of
different approaches, but here’s the way that I’m going to work through
this one.
1. Loop through the OWN_OCCUPIED column
2. Try and turn the entry into an integer
3. If the entry can be changed into an integer, enter a missing value
4. If the entry can’t be changed into an integer, we know it’s a string, so keep going
Let’s take a look at the code and then we’ll go through it in detail.
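# Detecting numbers
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        # If the entry converts to an integer, mark it as missing
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        # "Y"/"N" strings (and existing NaN values) can't be converted
        pass
    cnt += 1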
In the code we’re looping through each entry in the “Owner Occupied”
column. To try and change the entry to an integer, we’re using
int(row).
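For comparison, here is a loop-free sketch of the same idea using pd.to_numeric with errors="coerce". This is an alternative I’m adding for illustration, not the approach used above, and it is slightly broader: it also flags decimal entries such as “12.5”, which int() would reject. The column values are copied from the output shown earlier.

```python
import numpy as np
import pandas as pd

# OWN_OCCUPIED values as they appear in the printed column
df = pd.DataFrame(
    {"OWN_OCCUPIED": ["Y", "N", "N", "12", "Y", "Y", np.nan, "Y", "Y"]}
)

# Entries that parse as numbers stay non-null; everything else becomes NaN
as_number = pd.to_numeric(df["OWN_OCCUPIED"], errors="coerce")
is_numeric = as_number.notnull()

# Overwrite the numeric entries with a missing value
df.loc[is_numeric, "OWN_OCCUPIED"] = np.nan

result_mask = df["OWN_OCCUPIED"].isnull().tolist()
print(result_mask)
```

Because this version uses a boolean mask instead of a running counter, it doesn’t depend on the dataframe having the default 0, 1, 2, ... index.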