NumPy: Replace NaN (np.nan) using np.nan_to_num() and np.isnan()
In NumPy, to replace NaN (np.nan) in an array (ndarray) with another value such as 0, use np.nan_to_num(). Additionally, while np.isnan() is primarily used to identify NaN, its result can also be used to replace NaN. With either approach, you can also replace NaN with the mean of the non-NaN values.
To delete the rows or columns containing NaN instead of replacing the values, see the following article.
For handling missing values in pandas, see the following article.
The NumPy version used in this article is as follows. Note that functionality may vary between versions.
import numpy as np
print(np.__version__)
# 1.26.1
NaN (np.nan) in NumPy
When you read a CSV file with np.genfromtxt(), missing data is represented as NaN (Not a Number) by default. These values are displayed as nan when output with print().
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
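The sample file itself is not reproduced here, so the following is a minimal sketch that builds equivalent input in memory. The CSV text is an assumption reconstructed from the printed output above, with empty fields standing in for missing data.

```python
import io

import numpy as np

# Assumed contents of sample_nan.csv (reconstructed from the printed output);
# empty fields represent missing data.
csv_text = "11,12,,14\n21,,,24\n31,32,33,34\n"

# np.genfromtxt() accepts a file-like object as well as a file path.
a = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(a)
print(a.shape)
# (3, 4)
```

Reading from io.StringIO is convenient for experimenting with the examples in this article without creating a file on disk.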
If you want to generate NaN explicitly, use np.nan or float('nan'). You can also import the math module of the standard library and use math.nan. All three are equivalent.
a_nan = np.array([0, 1, np.nan, float('nan')])
print(a_nan)
# [ 0. 1. nan nan]
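As a small illustration of that equivalence, the sketch below also includes math.nan, which the example above does not use. The variable names are chosen for illustration only.

```python
import math

import numpy as np

# np.nan, float('nan'), and math.nan are all plain Python floats.
candidates = [np.nan, float('nan'), math.nan]
print([type(x).__name__ for x in candidates])
# ['float', 'float', 'float']

# Any of them can be used when building an ndarray.
a_mixed = np.array([0, np.nan, float('nan'), math.nan])
print(a_mixed.dtype)
# float64
```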
Since comparing NaN with == always returns False, even when both operands are NaN, use np.isnan() to check whether a value is NaN.
print(np.nan == np.nan)
# False
print(np.isnan(np.nan))
# True
np.isnan() can also check whether each element of an ndarray is NaN.
print(a_nan == np.nan)
# [False False False False]
print(np.isnan(a_nan))
# [False False True True]
Replace NaN using np.genfromtxt() with filling_values
To fill missing data in a CSV file, use the filling_values argument with np.genfromtxt().
For example, fill NaN with 0:
a_fill = np.genfromtxt('data/src/sample_nan.csv', delimiter=',',
                       filling_values=0)
print(a_fill)
# [[11. 12. 0. 14.]
# [21. 0. 0. 24.]
# [31. 32. 33. 34.]]
Note that filling with the mean of the non-NaN values is not possible during the initial read with np.genfromtxt(), because the mean is not known until the data has been loaded. For this, use the methods described in the following sections.
Replace NaN using np.nan_to_num()
You can use np.nan_to_num() to replace NaN.
Note that np.nan_to_num() also replaces infinity (inf). See the following article for details.
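To illustrate the handling of infinity, here is a minimal sketch with a hand-made array. By default, positive and negative infinity are replaced with the largest and smallest finite values of the array's dtype. The posinf and neginf arguments (available from NumPy 1.17) let you choose other replacements, and the values 999 and -999 are arbitrary.

```python
import numpy as np

b = np.array([np.nan, np.inf, -np.inf, 1.0])

# Default: NaN -> 0, +inf -> largest finite float, -inf -> smallest finite float.
replaced_default = np.nan_to_num(b)
print(replaced_default[1] == np.finfo(np.float64).max)
# True
print(replaced_default[2] == np.finfo(np.float64).min)
# True

# posinf and neginf set the replacement values for +inf and -inf.
replaced_custom = np.nan_to_num(b, posinf=999.0, neginf=-999.0)
print(replaced_custom.tolist())
# [0.0, 999.0, -999.0, 1.0]
```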
When you specify the array (ndarray) as the first argument to np.nan_to_num(), by default, a new ndarray is generated with NaN replaced by 0. The original ndarray remains unchanged.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
print(np.nan_to_num(a))
# [[11. 12. 0. 14.]
# [21. 0. 0. 24.]
# [31. 32. 33. 34.]]
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
Setting the second argument (copy) to False modifies the original ndarray in place.
np.nan_to_num(a, copy=False)
print(a)
# [[11. 12. 0. 14.]
# [21. 0. 0. 24.]
# [31. 32. 33. 34.]]
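Because copy=False works in place, every other reference to the same array sees the change as well. The following is a minimal sketch with a small hand-made float array (not the sample file), and a_ref is just a second name introduced for illustration.

```python
import numpy as np

a = np.array([[11.0, np.nan], [np.nan, 24.0]])
a_ref = a  # another name for the same array object

result = np.nan_to_num(a, copy=False)

# The returned array uses the same memory as the original.
print(np.shares_memory(result, a))
# True

# The change is visible through the other reference too.
print(np.isnan(a_ref).any())
# False
```

If other parts of your code still need the NaN markers, keep the default copy=True.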
From NumPy version 1.17, the third argument (nan) allows you to specify the value to replace NaN.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
print(np.nan_to_num(a, nan=-1))
# [[11. 12. -1. 14.]
# [21. -1. -1. 24.]
# [31. 32. 33. 34.]]
You can use np.nanmean() to replace NaN with the mean of non-NaN values. This replacement can be done for the entire array or separately for each row or column.
print(np.nanmean(a))
# 23.555555555555557
print(np.nan_to_num(a, nan=np.nanmean(a)))
# [[11. 12. 23.55555556 14. ]
# [21. 23.55555556 23.55555556 24. ]
# [31. 32. 33. 34. ]]
print(np.nanmean(a, axis=0, keepdims=True))
# [[21. 22. 33. 24.]]
print(np.nan_to_num(a, nan=np.nanmean(a, axis=0, keepdims=True)))
# [[11. 12. 33. 14.]
# [21. 22. 33. 24.]
# [31. 32. 33. 34.]]
print(np.nanmean(a, axis=1, keepdims=True))
# [[12.33333333]
# [22.5 ]
# [32.5 ]]
print(np.nan_to_num(a, nan=np.nanmean(a, axis=1, keepdims=True)))
# [[11. 12. 12.33333333 14. ]
# [21. 22.5 22.5 24. ]
# [31. 32. 33. 34. ]]
If you specify an ndarray as the third argument (nan) in np.nan_to_num(), it will be broadcast to match the shape of the ndarray specified as the first argument.
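If keepdims is set to True in np.nanmean(), the reduced axis is kept with length 1, so the resulting array is broadcast correctly against the original. While keepdims=False (default) happens to work for axis=0, it fails for axis=1 on a non-square array, so it is less error-prone to always set keepdims=True regardless of the axis.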
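The difference is easiest to see from the shapes. The sketch below rebuilds the sample array by hand (an assumption matching the printed output) and shows that the row means without keepdims cannot be broadcast against the (3, 4) array.

```python
import numpy as np

a = np.array([[11, 12, np.nan, 14],
              [21, np.nan, np.nan, 24],
              [31, 32, 33, 34]])

row_means_flat = np.nanmean(a, axis=1)                 # shape (3,)
row_means_kept = np.nanmean(a, axis=1, keepdims=True)  # shape (3, 1)
print(row_means_flat.shape, row_means_kept.shape)
# (3,) (3, 1)

# Shape (3,) cannot be broadcast against (3, 4), so this raises ValueError.
try:
    np.nan_to_num(a, nan=row_means_flat)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
# True

# Shape (3, 1) broadcasts along each row as intended.
filled = np.nan_to_num(a, nan=row_means_kept)
print(filled[1].tolist())
# [21.0, 22.5, 22.5, 24.0]
```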
For versions before 1.17, where the nan argument is not implemented, use the np.isnan() method described in the next section to replace NaN with values other than 0.
Identify and replace NaN using np.isnan()
You can use np.isnan() to check if values in an ndarray are NaN.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(a)
# [[11. 12. nan 14.]
# [21. nan nan 24.]
# [31. 32. 33. 34.]]
print(np.isnan(a))
# [[False False True False]
# [False True True False]
# [False False False False]]
Using the result of np.isnan() as a boolean mask, you can assign a specific value to the NaN positions. This modifies the original ndarray.
a[np.isnan(a)] = 0
print(a)
# [[11. 12. 0. 14.]
# [21. 0. 0. 24.]
# [31. 32. 33. 34.]]
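The same mask assignment works for any replacement value, not only 0, and it does not depend on the nan argument of np.nan_to_num(). The following minimal sketch uses a hand-made copy of the sample array (an assumption matching the printed output) and fills NaN with -1 on a copy so the original stays unchanged.

```python
import numpy as np

a = np.array([[11, 12, np.nan, 14],
              [21, np.nan, np.nan, 24],
              [31, 32, 33, 34]])

# Work on a copy so the original array keeps its NaN values.
a_filled = a.copy()
a_filled[np.isnan(a_filled)] = -1
print(a_filled[1].tolist())
# [21.0, -1.0, -1.0, 24.0]

# The original still contains three NaN values.
print(np.isnan(a).sum())
# 3
```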
You can also use np.nanmean() to replace NaN with the mean of the non-missing values.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
a[np.isnan(a)] = np.nanmean(a)
print(a)
# [[11. 12. 23.55555556 14. ]
# [21. 23.55555556 23.55555556 24. ]
# [31. 32. 33. 34. ]]
To replace with the mean value for each row or column, use np.where(), which broadcasts the means against the original array and returns a new ndarray.
a = np.genfromtxt('data/src/sample_nan.csv', delimiter=',')
print(np.where(np.isnan(a), np.nanmean(a, axis=0, keepdims=True), a))
# [[11. 12. 33. 14.]
# [21. 22. 33. 24.]
# [31. 32. 33. 34.]]
print(np.where(np.isnan(a), np.nanmean(a, axis=1, keepdims=True), a))
# [[11. 12. 12.33333333 14. ]
# [21. 22.5 22.5 24. ]
# [31. 32. 33. 34. ]]
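The np.where() pattern above can be wrapped in a small function. The following is a minimal sketch, and fill_nan_with_mean is a hypothetical helper name, not part of NumPy. It assumes the input can be converted to a float array.

```python
import numpy as np


def fill_nan_with_mean(arr, axis=None):
    """Return a new array with NaN replaced by the mean of non-NaN values.

    Hypothetical helper (not a NumPy function).
    axis=None uses the overall mean, axis=0 the column means,
    and axis=1 the row means.
    """
    arr = np.asarray(arr, dtype=float)
    # keepdims=True keeps the reduced axis so the means broadcast correctly.
    means = np.nanmean(arr, axis=axis, keepdims=True)
    return np.where(np.isnan(arr), means, arr)


a = np.array([[11, 12, np.nan, 14],
              [21, np.nan, np.nan, 24],
              [31, 32, 33, 34]])

print(fill_nan_with_mean(a, axis=0)[0].tolist())
# [11.0, 12.0, 33.0, 14.0]
print(fill_nan_with_mean(a, axis=1)[1].tolist())
# [21.0, 22.5, 22.5, 24.0]
```

Note that if a row or column consists entirely of NaN, np.nanmean() returns NaN for it (with a RuntimeWarning), so those elements remain NaN.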