pandas: How to use astype() to cast dtype of DataFrame

pandas.Series has a single data type (dtype), while pandas.DataFrame can have a different data type for each column.

You can specify dtype in various contexts, such as when creating a new object using a constructor or when reading from a CSV file. Additionally, you can cast an existing object to a different dtype using the astype() method.

Please note that the sample code used in this article is based on pandas version 2.0.3 and behavior may vary with different versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.0.3

List of basic data types (dtype) in pandas

The following is a list of basic data types (dtype) in pandas.

| dtype | character code | description |
|---|---|---|
| int8 | i1 | 8-bit signed integer |
| int16 | i2 | 16-bit signed integer |
| int32 | i4 | 32-bit signed integer |
| int64 | i8 | 64-bit signed integer |
| uint8 | u1 | 8-bit unsigned integer |
| uint16 | u2 | 16-bit unsigned integer |
| uint32 | u4 | 32-bit unsigned integer |
| uint64 | u8 | 64-bit unsigned integer |
| float16 | f2 | 16-bit floating-point number |
| float32 | f4 | 32-bit floating-point number |
| float64 | f8 | 64-bit floating-point number |
| float128 | f16 | 128-bit floating-point number |
| complex64 | c8 | 64-bit complex floating-point number |
| complex128 | c16 | 128-bit complex floating-point number |
| complex256 | c32 | 256-bit complex floating-point number |
| bool | ? | Boolean (True or False) |
| unicode | U | Unicode string |
| object | O | Python objects |

Note that the numbers in dtype represent bits, whereas those in character codes represent bytes. The character code for the bool type is ?. It does not mean unknown; rather, ? is literally assigned.
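
The bits-versus-bytes distinction can be checked directly with NumPy. The following is a minimal sketch; the variable name dt is only for illustration.

```python
import numpy as np

# The digits in a dtype name count bits; the digits in a character code count bytes.
dt = np.dtype('i8')
print(dt.name)      # int64
print(dt.itemsize)  # 8 (bytes per element)

# Both spellings describe the same dtype.
print(np.dtype('int64') == np.dtype('i8'))  # True

# '?' is the character code assigned to bool.
print(np.dtype('?') == np.dtype(bool))  # True
```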

You can specify dtype in various ways. For example, any of the following representations can be used for float64:

  • np.float64
  • 'float64'
  • 'f8'

s = pd.Series([0, 1, 2], dtype=np.float64)
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='float64')
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='f8')
print(s.dtype)
# float64

You can also specify data types using Python types like int, float, or str, without specifying bit-precision.

In such cases, the type is converted to the equivalent dtype. Examples for Python 3 in a 64-bit environment are as follows. Although uint is not a Python type, it is included in the table for convenience.

| Python type | Example of equivalent dtype |
|---|---|
| int | int64 |
| float | float64 |
| str | object (each element is str) |
| (uint) | uint64 |

You can pass the types int and float themselves or the strings 'int' and 'float'. Because uint is not a Python type, the bare name uint cannot be used; pass the string 'uint' instead.

s = pd.Series([0, 1, 2], dtype='float')
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype=float)
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='uint')
print(s.dtype)
# uint64

You can check the range of possible values (minimum and maximum) for integer types with np.iinfo() and for floating-point types with np.finfo().
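
As a brief sketch of how these two functions are used (the variable names are illustrative):

```python
import numpy as np

# Integer types: np.iinfo() exposes the exact minimum and maximum.
ii = np.iinfo(np.int8)
print(ii.min, ii.max)  # -128 127

ui = np.iinfo(np.uint16)
print(ui.min, ui.max)  # 0 65535

# Floating-point types: np.finfo() exposes max, min, eps, and so on.
fi = np.finfo(np.float32)
print(fi.max)
print(fi.eps)
```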

The data types discussed here are primarily based on NumPy, but pandas has extended some of its own data types.
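
One example of a pandas-specific dtype is the nullable integer type 'Int64' (with a capital I). The sketch below only illustrates that such extension types exist; they are not covered further in this article.

```python
import pandas as pd

# 'Int64' is a pandas extension dtype that can hold missing values,
# unlike NumPy's 'int64'.
s_nullable = pd.Series([1, 2, None], dtype='Int64')
print(s_nullable.dtype)            # Int64
print(s_nullable.isna().tolist())  # [False, False, True]

# With default inference, the missing value forces float64 instead.
s_default = pd.Series([1, 2, None])
print(s_default.dtype)  # float64
```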

object type and string

This section explains the object type and the string (str).

Note that StringDtype was introduced in pandas version 1.0.0 as a data type for strings. This type might become the standard in the future, but it is not covered here. See the official documentation for details.

The special data type: object

The object type is a special data type that can store references to any Python objects. Each element may be of a different type.

The data type for Series and DataFrame columns containing strings is object. However, each element can have its own distinct type, meaning not all elements need to be strings.

Here are some examples. The built-in function type() is applied to each element using the map() method to check its type. np.nan represents a missing value.

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.map(type))
# 0      <class 'int'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

If str is specified in the astype() method (see below for details), all elements, including NaN, are converted to str. The dtype remains as object.

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.map(type))
# 0    <class 'str'>
# 1    <class 'str'>
# 2    <class 'str'>
# dtype: object

If str is specified in the dtype argument of the constructor, NaN remains float. Note that, in version 0.22.0, NaN was converted to str.

s_str_constructor = pd.Series([0, 'abcde', np.nan], dtype=str)
print(s_str_constructor)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_str_constructor.map(type))
# 0      <class 'str'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

Note: String methods

Note that even when the dtype is object, the result of string methods (accessed via the str accessor) can differ based on the type of each element.

For example, str.len() returns the number of characters for str elements, but returns NaN for elements of numeric type.

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.str.len())
# 0    NaN
# 1    5.0
# 2    NaN
# dtype: float64

If the result of a string method includes NaN, some elements may not be of type str, even though the column's data type is object. In such cases, you can apply astype(str) before using the string method.

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.str.len())
# 0    1
# 1    5
# 2    3
# dtype: int64
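
Before casting everything with astype(str), it can help to see which elements are not str. The following is one possible sketch using map() with isinstance(); the name is_str is illustrative.

```python
import numpy as np
import pandas as pd

s_object = pd.Series([0, 'abcde', np.nan])

# True where the element is a str, False otherwise.
is_str = s_object.map(lambda x: isinstance(x, str))
print(is_str.tolist())  # [False, True, False]

# Show only the elements that are not str.
print(s_object[~is_str])
```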

Note: NaN

You can detect the missing value NaN with isnull() or remove it with dropna().

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.map(type))
# 0      <class 'int'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

print(s_object.isnull())
# 0    False
# 1    False
# 2     True
# dtype: bool

print(s_object.dropna())
# 0        0
# 1    abcde
# dtype: object

Note that when a Series is cast to string (str), NaN becomes the string 'nan' and is no longer treated as a missing value.

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.map(type))
# 0    <class 'str'>
# 1    <class 'str'>
# 2    <class 'str'>
# dtype: object

print(s_str_astype.isnull())
# 0    False
# 1    False
# 2    False
# dtype: bool

print(s_str_astype.dropna())
# 0        0
# 1    abcde
# 2      nan
# dtype: object

You can treat it as a missing value before casting, or replace the string 'nan' with NaN using replace().

s_str_astype_nan = s_str_astype.replace('nan', np.nan)
print(s_str_astype_nan)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_str_astype_nan.map(type))
# 0      <class 'str'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

print(s_str_astype_nan.isnull())
# 0    False
# 1    False
# 2     True
# dtype: bool
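
As for handling the missing value before casting, one possible approach (an assumption, not the only way) is to cast with astype(str) and use where() to keep the original NaN wherever the source was missing. The name s_str_keep_nan is illustrative.

```python
import numpy as np
import pandas as pd

s_object = pd.Series([0, 'abcde', np.nan])

# Keep the original value (NaN) where it is missing;
# take the str-cast value everywhere else.
s_str_keep_nan = s_object.where(s_object.isna(), s_object.astype(str))

print(s_str_keep_nan.map(type).tolist())
print(s_str_keep_nan.isna().tolist())  # [False, False, True]
```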

Cast data type (dtype) with astype()

You can cast the data type (dtype) with the astype() method of DataFrame and Series.

astype() returns a new Series or DataFrame with the specified dtype. The original object is not changed.
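
Because the original object is left unchanged, the result has to be assigned to a variable to have any effect. A small sketch with illustrative variable names:

```python
import pandas as pd

s_int = pd.Series([1, 2, 3], dtype='int64')

# Calling astype() without assigning the result does not modify s_int.
s_int.astype('float64')
print(s_int.dtype)  # int64

# Assign the returned object to keep the converted data.
s_float = s_int.astype('float64')
print(s_float.dtype)  # float64
```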

Cast data type of pandas.Series

Pass the data type (dtype) to astype().

s = pd.Series([1, 2, 3])
print(s)
# 0    1
# 1    2
# 2    3
# dtype: int64

s_f = s.astype('float64')
print(s_f)
# 0    1.0
# 1    2.0
# 2    3.0
# dtype: float64

As mentioned above, you can specify dtype in various forms.

s_f = s.astype('float')
print(s_f.dtype)
# float64

s_f = s.astype(float)
print(s_f.dtype)
# float64

s_f = s.astype('f8')
print(s_f.dtype)
# float64

Cast data type of all columns of pandas.DataFrame

A DataFrame has a data type (dtype) for each column. You can check each dtype with the dtypes attribute.

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
#     a   b   c
# 0  11  12  13
# 1  21  22  23
# 2  31  32  33

print(df.dtypes)
# a    int64
# b    int64
# c    int64
# dtype: object

If you pass a single data type (dtype) to astype(), the data types of all columns are changed.

df_f = df.astype('float64')
print(df_f)
#       a     b     c
# 0  11.0  12.0  13.0
# 1  21.0  22.0  23.0
# 2  31.0  32.0  33.0

print(df_f.dtypes)
# a    float64
# b    float64
# c    float64
# dtype: object

Cast data type of any column of pandas.DataFrame individually

You can change the data type (dtype) of any column individually by specifying a dictionary of {column name: data type} to astype().

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
#     a   b   c
# 0  11  12  13
# 1  21  22  23
# 2  31  32  33

print(df.dtypes)
# a    int64
# b    int64
# c    int64
# dtype: object

df_fcol = df.astype({'a': float})
print(df_fcol)
#       a   b   c
# 0  11.0  12  13
# 1  21.0  22  23
# 2  31.0  32  33

print(df_fcol.dtypes)
# a    float64
# b      int64
# c      int64
# dtype: object

df_fcol2 = df.astype({'a': 'float32', 'c': 'int8'})
print(df_fcol2)
#       a   b   c
# 0  11.0  12  13
# 1  21.0  22  23
# 2  31.0  32  33

print(df_fcol2.dtypes)
# a    float32
# b      int64
# c       int8
# dtype: object
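
One thing to watch with the dictionary form: every key must be an existing column name, otherwise astype() raises a KeyError. The sketch below shows this, along with one way to keep only valid keys. The dictionary wanted and the column name 'x' are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32]})

# Hypothetical mapping that mentions a column ('x') that does not exist in df.
wanted = {'a': 'float64', 'x': 'float64'}

try:
    df.astype(wanted)
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# Keep only the keys that are actual columns before casting.
valid = {col: dt for col, dt in wanted.items() if col in df.columns}
df_cast = df.astype(valid)
print(df_cast.dtypes)
```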

Specify data type (dtype) when reading CSV files with read_csv()

In pandas, pd.read_csv() is used to read CSV files, and you can set data types using the dtype argument.

Use the following CSV file as an example.

,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z

If the dtype argument is omitted, a data type is automatically chosen for each column.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    NaN  y
# THREE  3  300  300.0  z

print(df.dtypes)
# a      int64
# b      int64
# c    float64
# d     object
# dtype: object

Specify the same data type (dtype) for all columns

If you specify a data type for the dtype argument, all columns are converted to that type. If there are columns that cannot be converted to the specified data type, an error will be raised.

# pd.read_csv('data/src/sample_header_index_dtype.csv',
#             index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'

If you set dtype=str, all columns are converted to strings. However, in this case, the missing value (NaN) will still be of type float.

df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype=str)
print(df_str)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str.dtypes)
# a    object
# b    object
# c    object
# d    object
# dtype: object

print(df_str.applymap(type))
#                    a              b                c              d
# ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>

If you read the file without specifying dtype and then cast it to str with astype(), NaN values are also converted to the string 'nan'.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    nan  y
# THREE  3  300  300.0  z

print(df.astype(str).applymap(type))
#                    a              b              c              d
# ONE    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>

Specify data type (dtype) for each column

As with astype(), you can use a dictionary to specify the data type for each column in read_csv().

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={'a': float, 'b': str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object

The dictionary keys can also be column numbers. Note that the numbering counts the index column as well: with index_col=0, column 0 is the index, so column a is number 1 and column b is number 2.

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={1: float, 2: str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object
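
The read_csv() examples above can also be tried without a file on disk by reading the same text from an in-memory buffer. This is a minimal sketch that uses io.StringIO in place of the file path; the variable names are illustrative.

```python
import io
import pandas as pd

# Same content as the sample CSV file shown earlier.
csv_text = ''',a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z
'''

# Without dtype, column b is inferred as integer and the leading zeros are lost.
df_default = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_default['b'].tolist())  # [1, 20, 300]

# With dtype={'b': str}, the values are kept as the original strings.
df_b_str = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={'b': str})
print(df_b_str['b'].tolist())  # ['001', '020', '300']
```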

Implicit type conversions

In addition to explicit type conversions using astype(), data types may also be converted implicitly during certain operations.

Consider a DataFrame with an integer (int) column and a floating-point (float) column as an example.

df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]}, index=['A', 'B', 'C'])
print(df_mix)
#    col_int  col_float
# A        0        0.0
# B        1        0.1
# C        2        0.2

print(df_mix.dtypes)
# col_int        int64
# col_float    float64
# dtype: object

Implicit type conversion by arithmetic operations

For example, adding an int column to a float column with the + operator yields a float result.

print(df_mix['col_int'] + df_mix['col_float'])
# A    0.0
# B    1.1
# C    2.2
# dtype: float64

Similarly, operations with scalar values implicitly convert the data type. The result of division by the / operator is float.

print(df_mix / 1)
#    col_int  col_float
# A      0.0        0.0
# B      1.0        0.1
# C      2.0        0.2

print((df_mix / 1).dtypes)
# col_int      float64
# col_float    float64
# dtype: object

For arithmetic operations like +, -, *, //, and **, operations involving only integers return int, while those involving at least one floating-point number return float. This matches the implicit type conversion of NumPy arrays (ndarray).

print(df_mix * 1)
#    col_int  col_float
# A        0        0.0
# B        1        0.1
# C        2        0.2

print((df_mix * 1).dtypes)
# col_int        int64
# col_float    float64
# dtype: object

print(df_mix * 1.0)
#    col_int  col_float
# A      0.0        0.0
# B      1.0        0.1
# C      2.0        0.2

print((df_mix * 1.0).dtypes)
# col_int      float64
# col_float    float64
# dtype: object
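
For comparison, the same rules can be observed on a plain NumPy array. A short sketch, with the dtype given explicitly so that the result does not depend on the platform default:

```python
import numpy as np

a_int = np.array([0, 1, 2], dtype=np.int64)

print((a_int * 1).dtype)    # int64
print((a_int // 1).dtype)   # int64
print((a_int * 1.0).dtype)  # float64
print((a_int / 1).dtype)    # float64
```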

Implicit type conversion by transposition, etc.

The data type may change when you select a row as a Series using loc or iloc, or when you transpose a DataFrame with T or transpose().

print(df_mix.loc['A'])
# col_int      0.0
# col_float    0.0
# Name: A, dtype: float64

print(df_mix.T)
#              A    B    C
# col_int    0.0  1.0  2.0
# col_float  0.0  0.1  0.2

print(df_mix.T.dtypes)
# A    float64
# B    float64
# C    float64
# dtype: object
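
The same kind of conversion happens when the columns have no common numeric type. In the following sketch (df_num_str is a hypothetical example), selecting a row from a DataFrame with an int column and a string column yields a Series of dtype object, while the column itself keeps its original dtype.

```python
import pandas as pd

df_num_str = pd.DataFrame(
    {'col_int': pd.Series([0, 1], index=['A', 'B'], dtype='int64'),
     'col_str': ['a', 'b']},
    index=['A', 'B'],
)

# A row has to hold an int and a str together, so it becomes object.
row = df_num_str.loc['A']
print(row.dtype)  # object

# Selecting a column does not change its dtype.
print(df_num_str['col_int'].dtype)  # int64
```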

Implicit type conversion by assignment to elements

The data type may also be implicitly converted when assigning a value to an element.

For example, assigning a float value to an element in the int column converts that column to float, while assigning an int value to an element in the float column retains the float type for that element.

df_mix.at['A', 'col_int'] = 10.1
df_mix.at['A', 'col_float'] = 10
print(df_mix)
#    col_int  col_float
# A     10.1       10.0
# B      1.0        0.1
# C      2.0        0.2

print(df_mix.dtypes)
# col_int      float64
# col_float    float64
# dtype: object

When a string value is assigned to an element in the numeric column, the data type of the column is cast to object.

df_mix.at['A', 'col_float'] = 'abc'
print(df_mix)
#    col_int col_float
# A     10.1       abc
# B      1.0       0.1
# C      2.0       0.2

print(df_mix.dtypes)
# col_int      float64
# col_float     object
# dtype: object

print(df_mix.applymap(type))
#            col_int        col_float
# A  <class 'float'>    <class 'str'>
# B  <class 'float'>  <class 'float'>
# C  <class 'float'>  <class 'float'>

The sample code above is based on version 2.0.3, and the behavior depends on the version. In version 0.22.0, the column type remained unchanged after assigning an element of a different type, though the type of the assigned element itself changed. From version 2.1, assignments that would change the column's data type, like the float and string assignments above, emit a FutureWarning.