pandas: How to use astype() to cast dtype of DataFrame
`pandas.Series` has a single data type (`dtype`), while `pandas.DataFrame` can have a different data type for each column.

You can specify `dtype` in various contexts, such as when creating a new object with a constructor or when reading from a CSV file. Additionally, you can cast an existing object to a different `dtype` using the `astype()` method.

See the following article on how to extract columns by `dtype`.

See the following article about `dtype` and `astype()` in NumPy.

Please note that the sample code in this article is based on pandas version `2.0.3`, and behavior may vary with different versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.0.3
List of basic data types (dtype) in pandas

The following is a list of basic data types (`dtype`) in pandas.
dtype | character code | description
---|---|---
int8 | i1 | 8-bit signed integer
int16 | i2 | 16-bit signed integer
int32 | i4 | 32-bit signed integer
int64 | i8 | 64-bit signed integer
uint8 | u1 | 8-bit unsigned integer
uint16 | u2 | 16-bit unsigned integer
uint32 | u4 | 32-bit unsigned integer
uint64 | u8 | 64-bit unsigned integer
float16 | f2 | 16-bit floating-point number
float32 | f4 | 32-bit floating-point number
float64 | f8 | 64-bit floating-point number
float128 | f16 | 128-bit floating-point number
complex64 | c8 | 64-bit complex floating-point number
complex128 | c16 | 128-bit complex floating-point number
complex256 | c32 | 256-bit complex floating-point number
bool | ? | Boolean (True or False)
unicode | U | Unicode string
object | O | Python objects
Note that the numbers in `dtype` names represent bits, whereas those in character codes represent bytes. The character code for the `bool` type is `?`. It does not mean unknown; the character `?` is literally assigned.

You can specify `dtype` in various ways. For example, any of the following representations can be used for `float64`:

- np.float64
- 'float64'
- 'f8'
s = pd.Series([0, 1, 2], dtype=np.float64)
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='float64')
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='f8')
print(s.dtype)
# float64
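The equivalence of these spellings can be checked directly; `np.dtype()` normalizes any of them to the same dtype object (a quick check, not part of the original example):

```python
import numpy as np

# All three spellings resolve to the same NumPy dtype object.
print(np.dtype(np.float64) == np.dtype('float64') == np.dtype('f8'))
# True
```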
You can also specify data types using Python types like `int`, `float`, or `str`, without specifying bit precision. In such cases, the type is converted to the equivalent `dtype`. The following examples are for a 64-bit Python 3 environment. Although `uint` is not a native Python type, it is included in the table for convenience.
Python type | Example of equivalent dtype
---|---
int | int64
float | float64
str | object (each element is str)
(uint) | uint64
You can specify these either as types like `int` and `float`, or as the strings `'int'` and `'float'`. However, since `uint` is not a native Python type, it can only be specified as the string `'uint'`, not as a bare name.
s = pd.Series([0, 1, 2], dtype='float')
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype=float)
print(s.dtype)
# float64
s = pd.Series([0, 1, 2], dtype='uint')
print(s.dtype)
# uint64
You can check the range of possible values (minimum and maximum) for integer and floating-point types with `np.iinfo()` and `np.finfo()`.
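For example (a minimal sketch using NumPy directly):

```python
import numpy as np

# iinfo() reports the range of an integer dtype, finfo() of a float dtype.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255
print(np.finfo(np.float16).max)                        # 65504.0
```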
The data types discussed here are primarily based on NumPy, but pandas has also added some extension types of its own.
object type and strings

This section explains the `object` type and strings (`str`).

Note that `StringDtype` was introduced in pandas version `1.0.0` as a dedicated data type for strings. It may become the standard string type in the future, but it is not covered here; see the official documentation for details.
The special data type: object

The `object` type is a special data type that can store references to arbitrary Python objects. Each element may be of a different type.

The data type of a `Series` or of `DataFrame` columns containing strings is `object`. However, each element keeps its own distinct type, so not all elements are necessarily strings.

Here are some examples. The built-in function `type()` is applied to each element using the `map()` method to check its type. `np.nan` represents a missing value.

- pandas: Apply functions to values, rows, columns with map(), apply()
- Get and check the type of an object in Python: type(), isinstance()
- Missing values in pandas (nan, None, pd.NA)
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.map(type))
# 0 <class 'int'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
If `str` is specified in the `astype()` method (see below for details), all elements, including `NaN`, are converted to `str`. The `dtype` remains `object`.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'str'>
# dtype: object
If `str` is specified in the `dtype` argument of the constructor, `NaN` remains `float`. Note that in version `0.22.0`, `NaN` was converted to `str`.
s_str_constructor = pd.Series([0, 'abcde', np.nan], dtype=str)
print(s_str_constructor)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_str_constructor.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
Note: String methods

Note that even when the `dtype` is `object`, the result of string methods (accessed via the `str` accessor) can differ depending on the type of each element.

For example, applying `str.len()`, which returns the number of characters, yields `NaN` for elements of numeric type.
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.str.len())
# 0 NaN
# 1 5.0
# 2 NaN
# dtype: float64
If the result of a string method includes `NaN`, it may indicate that not all elements are of type `str`, even though the column's data type is `object`. In such cases, you can apply `astype(str)` before using the string method.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.str.len())
# 0 1
# 1 5
# 2 3
# dtype: int64
See also the following articles for string methods.
- pandas: Handle strings (replace, strip, case conversion, etc.)
- pandas: Extract rows that contain specific strings from a DataFrame
- pandas: Split string columns by delimiters or regular expressions
Note: NaN

You can detect the missing value `NaN` with `isnull()` or remove it with `dropna()`.
- pandas: Detect and count NaN (missing values) with isnull(), isna()
- pandas: Remove NaN (missing values) with dropna()
s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_object.map(type))
# 0 <class 'int'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
print(s_object.isnull())
# 0 False
# 1 False
# 2 True
# dtype: bool
print(s_object.dropna())
# 0 0
# 1 abcde
# dtype: object
Note that when cast to `str`, `NaN` becomes the string `'nan'` and is no longer treated as a missing value.
s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0 0
# 1 abcde
# 2 nan
# dtype: object
print(s_str_astype.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'str'>
# dtype: object
print(s_str_astype.isnull())
# 0 False
# 1 False
# 2 False
# dtype: bool
print(s_str_astype.dropna())
# 0 0
# 1 abcde
# 2 nan
# dtype: object
You can handle missing values before casting, or replace the string `'nan'` with `NaN` afterward using `replace()`.
s_str_astype_nan = s_str_astype.replace('nan', np.nan)
print(s_str_astype_nan)
# 0 0
# 1 abcde
# 2 NaN
# dtype: object
print(s_str_astype_nan.map(type))
# 0 <class 'str'>
# 1 <class 'str'>
# 2 <class 'float'>
# dtype: object
print(s_str_astype_nan.isnull())
# 0 False
# 1 False
# 2 True
# dtype: bool
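As a sketch of the first approach, dropping missing values before casting avoids producing the string `'nan'` at all:

```python
import pandas as pd
import numpy as np

s_object = pd.Series([0, 'abcde', np.nan])

# Drop NaN first, then cast; no 'nan' string is ever created.
s_clean = s_object.dropna().astype(str)
print(s_clean)
# 0        0
# 1    abcde
# dtype: object
print(s_clean.isnull().any())
# False
```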
Cast data type (dtype) with astype()

You can cast the data type (`dtype`) with the `astype()` method of `DataFrame` and `Series`.

- pandas.DataFrame.astype — pandas 2.0.3 documentation
- pandas.Series.astype — pandas 2.0.3 documentation

`astype()` returns a new `Series` or `DataFrame` with the specified `dtype`; the original object is not changed.

Cast data type of pandas.Series

Specify the desired data type (`dtype`) as the argument of `astype()`.
s = pd.Series([1, 2, 3])
print(s)
# 0 1
# 1 2
# 2 3
# dtype: int64
s_f = s.astype('float64')
print(s_f)
# 0 1.0
# 1 2.0
# 2 3.0
# dtype: float64
As mentioned above, you can specify `dtype` in various forms.
s_f = s.astype('float')
print(s_f.dtype)
# float64
s_f = s.astype(float)
print(s_f.dtype)
# float64
s_f = s.astype('f8')
print(s_f.dtype)
# float64
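As noted above, `astype()` returns a new object and leaves the original unchanged, which can be confirmed directly:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s_f = s.astype('float64')

# The original Series keeps its dtype; only the returned object is float64.
print(s.dtype)
# int64
print(s_f.dtype)
# float64
```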
Cast data type of all columns of pandas.DataFrame

A `DataFrame` has a data type (`dtype`) for each column. You can check them all at once with the `dtypes` attribute.
df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
# a b c
# 0 11 12 13
# 1 21 22 23
# 2 31 32 33
print(df.dtypes)
# a int64
# b int64
# c int64
# dtype: object
If you pass a single data type (`dtype`) to `astype()`, the data types of all columns are changed to it.
df_f = df.astype('float64')
print(df_f)
# a b c
# 0 11.0 12.0 13.0
# 1 21.0 22.0 23.0
# 2 31.0 32.0 33.0
print(df_f.dtypes)
# a float64
# b float64
# c float64
# dtype: object
Cast data type of any column of pandas.DataFrame individually

You can change the data type (`dtype`) of individual columns by passing a dictionary of {column name: data type} to `astype()`.
df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
# a b c
# 0 11 12 13
# 1 21 22 23
# 2 31 32 33
print(df.dtypes)
# a int64
# b int64
# c int64
# dtype: object
df_fcol = df.astype({'a': float})
print(df_fcol)
# a b c
# 0 11.0 12 13
# 1 21.0 22 23
# 2 31.0 32 33
print(df_fcol.dtypes)
# a float64
# b int64
# c int64
# dtype: object
df_fcol2 = df.astype({'a': 'float32', 'c': 'int8'})
print(df_fcol2)
# a b c
# 0 11.0 12 13
# 1 21.0 22 23
# 2 31.0 32 33
print(df_fcol2.dtypes)
# a float32
# b int64
# c int8
# dtype: object
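Columns not listed in the dictionary keep their original dtype. One caution worth noting (this is the behavior of the underlying NumPy cast, so it may warn or differ in other versions): casting to a narrower integer type does not check for overflow, and out-of-range values wrap around silently.

```python
import pandas as pd

# 200 does not fit in int8 (-128 to 127); the cast wraps around silently.
s_overflow = pd.Series([100, 200]).astype('int8')
print(s_overflow)
# 0    100
# 1    -56
# dtype: int8
```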
Specify data type (dtype) when reading CSV files with read_csv()

In pandas, `pd.read_csv()` is used to read CSV files, and you can set column data types with its `dtype` argument.

The following CSV file is used as an example.
,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z
If the `dtype` argument is omitted, a data type is automatically inferred for each column.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 NaN y
# THREE 3 300 300.0 z
print(df.dtypes)
# a int64
# b int64
# c float64
# d object
# dtype: object
Specify the same data type (dtype) for all columns

If you pass a single data type to the `dtype` argument, all columns are converted to that type. If a column cannot be converted to the specified data type, an error is raised.
# pd.read_csv('data/src/sample_header_index_dtype.csv',
# index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'
If you set `dtype=str`, all columns are converted to strings. However, in this case, missing values (`NaN`) remain of type `float`.
df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype=str)
print(df_str)
# a b c d
# ONE 1 001 100 x
# TWO 2 020 NaN y
# THREE 3 300 300 z
print(df_str.dtypes)
# a object
# b object
# c object
# d object
# dtype: object
print(df_str.applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'float'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
If you read the file without specifying `dtype` and then cast it to `str` with `astype()`, `NaN` values are also converted to the string `'nan'`.
df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
# a b c d
# ONE 1 1 100.0 x
# TWO 2 20 nan y
# THREE 3 300 300.0 z
print(df.astype(str).applymap(type))
# a b c d
# ONE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# TWO <class 'str'> <class 'str'> <class 'str'> <class 'str'>
# THREE <class 'str'> <class 'str'> <class 'str'> <class 'str'>
Specify data type (dtype) for each column

As with `astype()`, you can use a dictionary to specify the data type for each column in `read_csv()`.
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={'a': float, 'b': str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
The dictionary keys can also be column numbers. Be careful: if you are using an index column, the column numbers must be counted including the index column.
df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
index_col=0, dtype={1: float, 2: str})
print(df_col)
# a b c d
# ONE 1.0 001 100.0 x
# TWO 2.0 020 NaN y
# THREE 3.0 300 300.0 z
print(df_col.dtypes)
# a float64
# b object
# c float64
# d object
# dtype: object
Implicit type conversions

In addition to explicit type conversions with `astype()`, data types may also be converted implicitly by certain operations.

As an example, consider a `DataFrame` with an integer (`int`) column and a floating-point (`float`) column.
df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]}, index=['A', 'B', 'C'])
print(df_mix)
# col_int col_float
# A 0 0.0
# B 1 0.1
# C 2 0.2
print(df_mix.dtypes)
# col_int int64
# col_float float64
# dtype: object
Implicit type conversion by arithmetic operations

For example, adding an `int` column to a `float` column with the `+` operator produces a `float` result.
print(df_mix['col_int'] + df_mix['col_float'])
# A 0.0
# B 1.1
# C 2.2
# dtype: float64
Similarly, operations with scalar values implicitly convert the data type. The result of division with the `/` operator is always `float`.
print(df_mix / 1)
# col_int col_float
# A 0.0 0.0
# B 1.0 0.1
# C 2.0 0.2
print((df_mix / 1).dtypes)
# col_int float64
# col_float float64
# dtype: object
For arithmetic operations like `+`, `-`, `*`, `//`, and `**`, operations involving only integers return `int`, while those involving at least one floating-point number return `float`. This matches the implicit type conversion rules of the NumPy array `ndarray`.
print(df_mix * 1)
# col_int col_float
# A 0 0.0
# B 1 0.1
# C 2 0.2
print((df_mix * 1).dtypes)
# col_int int64
# col_float float64
# dtype: object
print(df_mix * 1.0)
# col_int col_float
# A 0.0 0.0
# B 1.0 0.1
# C 2.0 0.2
print((df_mix * 1.0).dtypes)
# col_int float64
# col_float float64
# dtype: object
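For instance, floor division between integer values keeps `int`, while true division always produces `float` (a quick check of the rule above):

```python
import pandas as pd

df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]},
                      index=['A', 'B', 'C'])

# Integer-only floor division stays int64; true division is always float64.
print((df_mix['col_int'] // 2).dtype)
# int64
print((df_mix['col_int'] / 2).dtype)
# float64
```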
Implicit type conversion by transposition, etc.

The data type may change when you select a row as a `Series` with `loc` or `iloc`, or when you transpose a `DataFrame` with `T` or `transpose()`, because values from columns with different dtypes are combined into a single Series with a common dtype.
print(df_mix.loc['A'])
# col_int 0.0
# col_float 0.0
# Name: A, dtype: float64
print(df_mix.T)
# A B C
# col_int 0.0 1.0 2.0
# col_float 0.0 0.1 0.2
print(df_mix.T.dtypes)
# A float64
# B float64
# C float64
# dtype: object
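This upcasting happens because a row (or a transposed column) mixes values from columns of different dtypes. When all columns already share a dtype, no conversion occurs (a sketch to illustrate the contrast):

```python
import pandas as pd

# All columns are int64, so selecting a row or transposing keeps int64.
df_int = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['X', 'Y'])
print(df_int.loc['X'].dtype)
# int64
print(df_int.T.dtypes)
# X    int64
# Y    int64
# dtype: object
```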
Implicit type conversion by assignment to elements

The data type may also be converted implicitly when assigning a value to an element.

For example, assigning a `float` value to an element of an `int` column converts that column to `float`, while assigning an `int` value to an element of a `float` column keeps the column as `float` (the assigned value is stored as a float).
df_mix.at['A', 'col_int'] = 10.1
df_mix.at['A', 'col_float'] = 10
print(df_mix)
# col_int col_float
# A 10.1 10.0
# B 1.0 0.1
# C 2.0 0.2
print(df_mix.dtypes)
# col_int float64
# col_float float64
# dtype: object
When a string value is assigned to an element of a numeric column, the data type of the column is cast to `object`.
df_mix.at['A', 'col_float'] = 'abc'
print(df_mix)
# col_int col_float
# A 10.1 abc
# B 1.0 0.1
# C 2.0 0.2
print(df_mix.dtypes)
# col_int float64
# col_float object
# dtype: object
print(df_mix.applymap(type))
# col_int col_float
# A <class 'float'> <class 'str'>
# B <class 'float'> <class 'float'>
# C <class 'float'> <class 'float'>
The sample code above is based on version `2.0.3`. In version `0.22.0`, the column's type remained unchanged after assigning an element of a different type; instead, the type of the assigned element itself was converted. Note that the behavior may differ depending on the version.