pandas: Get the mode (the most frequent value) with mode()
In pandas, the mode() method is used to find the mode, the most frequent value, of a column or row in a DataFrame. This method is also available on Series.
To get unique values and their counts, use the unique(), value_counts(), and nunique() methods.
The describe() method computes summary statistics, which for non-numeric columns include the mode (top) and its frequency (freq).
The pandas version used in this article is as follows. Note that functionality may vary between versions.
import pandas as pd
print(pd.__version__)
# 2.1.4
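As a quick illustration of the unique(), nunique(), and value_counts() methods mentioned above, here is a minimal sketch on a small Series. The data is made up for illustration only.

```python
import pandas as pd

s = pd.Series(['X', 'X', 'X', 'Y'])  # illustrative data

# Unique values, in order of first appearance
uniques = s.unique().tolist()
print(uniques)
# ['X', 'Y']

# Number of distinct values
n_unique = s.nunique()
print(n_unique)
# 2

# Count of each distinct value, most frequent first
counts = s.value_counts().to_dict()
print(counts)
# {'X': 3, 'Y': 1}
```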
pandas.Series.mode()
mode() on a Series returns a Series, even if there is only one mode.
s = pd.Series(['X', 'X', 'X', 'Y'])
print(s)
# 0 X
# 1 X
# 2 X
# 3 Y
# dtype: object
print(s.mode())
# 0 X
# dtype: object
print(type(s.mode()))
# <class 'pandas.core.series.Series'>
print(s.mode()[0])
# X
print(type(s.mode()[0]))
# <class 'str'>
If there are multiple modes, the result is as follows. A Series can be converted to a list using the tolist() method.
s_multi = pd.Series(['X', 'X', 'Y', 'Y'])
print(s_multi)
# 0 X
# 1 X
# 2 Y
# 3 Y
# dtype: object
print(s_multi.mode())
# 0 X
# 1 Y
# dtype: object
print(s_multi.mode()[0])
# X
print(s_multi.mode().tolist())
# ['X', 'Y']
print(type(s_multi.mode().tolist()))
# <class 'list'>
By default, the missing value NaN is excluded. If you set the dropna argument to False, NaN will also be counted.
s_nan = pd.Series(['X', float('nan'), float('nan'), float('nan')])
print(s_nan)
# 0 X
# 1 NaN
# 2 NaN
# 3 NaN
# dtype: object
print(s_nan.mode())
# 0 X
# dtype: object
print(s_nan.mode(dropna=False))
# 0 NaN
# dtype: object
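One consequence of excluding NaN by default is that mode() returns an empty Series when every value is missing, so indexing the result with [0] fails. The following is a minimal sketch of a guard for that case. first_mode is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def first_mode(s, default=None):
    # Hypothetical helper (not part of pandas): return one mode,
    # or `default` when mode() yields an empty Series.
    modes = s.mode()
    return default if modes.empty else modes.iloc[0]

s_all_nan = pd.Series([float('nan'), float('nan')])

print(s_all_nan.mode().empty)
# True

print(first_mode(s_all_nan))
# None

print(first_mode(pd.Series(['X', 'X', 'Y'])))
# X
```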
pandas.DataFrame.mode()
Consider the following DataFrame.
df = pd.DataFrame({'col1': ['X', 'X', 'X', 'Y'],
                   'col2': ['X', 'X', 'Y', 'Y']},
                  index=['row1', 'row2', 'row3', 'row4'])
print(df)
# col1 col2
# row1 X X
# row2 X X
# row3 X Y
# row4 Y Y
Get the mode for each column
By default, the mode() method on a DataFrame returns a DataFrame with the modes of each column as elements. Even if there is only one mode, a one-row DataFrame is returned.
If the number of modes varies by column, the empty part is filled with the missing value NaN.
print(df.mode())
# col1 col2
# 0 X X
# 1 NaN Y
print(type(df.mode()))
# <class 'pandas.core.frame.DataFrame'>
The number of modes in each column can be obtained using the count() method, which counts the number of non-NaN elements.
print(df.mode().count())
# col1 1
# col2 2
# dtype: int64
The first row of the resulting DataFrame shows the mode for each column, or one of the modes if multiple exist. The first row can be obtained using iloc[0].
print(df.mode().iloc[0])
# col1 X
# col2 X
# Name: 0, dtype: object
Calling mode() on a DataFrame may produce missing values NaN as padding, whereas selecting a column first and calling mode() on the resulting Series does not.
print(df.mode()['col1'])
# 0 X
# 1 NaN
# Name: col1, dtype: object
print(df['col1'].mode())
# 0 X
# Name: col1, dtype: object
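If you already have the result of DataFrame.mode() and want the modes of one column without the padding, one option is to call dropna() on the selected column. The sketch below rebuilds the example DataFrame (without the row labels) so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['X', 'X', 'X', 'Y'],
                   'col2': ['X', 'X', 'Y', 'Y']})

modes = df.mode()

# Remove the NaN padding from the shorter column
col1_modes = modes['col1'].dropna().tolist()
print(col1_modes)
# ['X']

# Same result as calling mode() on the column directly
print(col1_modes == df['col1'].mode().tolist())
# True
```

Note that this also removes a genuine NaN mode if mode() was called with dropna=False, so in that case calling mode() on the column directly is the safer choice.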
By applying the mode() and tolist() methods to each column using apply(), you can obtain a Series where each element is a list of modes.
s_list = df.apply(lambda x: x.mode().tolist())
print(s_list)
# col1 [X]
# col2 [X, Y]
# dtype: object
print(s_list.at['col2'])
# ['X', 'Y']
print(type(s_list.at['col2']))
# <class 'list'>
Get the mode for each row: axis
Setting the axis argument to 1 or 'columns' allows you to obtain the mode for each row. The count() method, which counts the number of non-NaN elements, also has the axis argument.
print(df.mode(axis=1))
# 0 1
# row1 X NaN
# row2 X NaN
# row3 X Y
# row4 Y NaN
print(df.mode(axis=1).count(axis=1))
# row1 1
# row2 1
# row3 2
# row4 1
# dtype: int64
Note that in pandas, each column has a single data type (dtype), on the assumption that a column holds the same kind of data. If the same kind of data is instead arranged along rows, transposing the DataFrame may be more suitable.
print(df.T)
# row1 row2 row3 row4
# col1 X X X Y
# col2 X X Y Y
print(df.T.mode())
# row1 row2 row3 row4
# 0 X X X Y
# 1 NaN NaN Y NaN
Specify whether to include missing values NaN: dropna
By default, missing values NaN are excluded. If you set the dropna argument to False, NaN will also be counted.
df_nan = df.copy()
df_nan.iloc[1:, 1] = float('nan')
print(df_nan)
# col1 col2
# row1 X X
# row2 X NaN
# row3 X NaN
# row4 Y NaN
print(df_nan.mode())
# col1 col2
# 0 X X
print(df_nan.mode(dropna=False))
# col1 col2
# 0 X NaN
Specify whether to target only numerical columns: numeric_only
By default, mode() targets both numerical and non-numerical columns. Setting the numeric_only argument to True restricts it to numerical columns.
df_num = df.copy()
df_num['col3'] = [1, 1, 1, 0]
print(df_num)
# col1 col2 col3
# row1 X X 1
# row2 X X 1
# row3 X Y 1
# row4 Y Y 0
print(df_num.mode())
# col1 col2 col3
# 0 X X 1.0
# 1 NaN Y NaN
print(df_num.mode(numeric_only=True))
# col3
# 0 1
Note that col3 appears as 1.0 in the first result because padding the column with NaN converts it to a floating-point dtype. With numeric_only=True, no padding is needed, so the column remains an integer.
To exclusively target non-numeric columns, you can use the select_dtypes() method, which filters columns based on their data type (dtype).
print(df_num.select_dtypes(exclude='number').mode())
# col1 col2
# 0 X X
# 1 NaN Y
Get the frequency (number of occurrences) of the mode
The frequency (number of occurrences) of the mode can be obtained using the value_counts() method of a Series.
value_counts() returns a Series with unique elements as labels and their counts as elements. By default, it is sorted in descending order of frequency, so the first value of the returned Series is the frequency of the mode.
df = pd.DataFrame({'col1': ['X', 'X', 'X', 'Y'],
                   'col2': ['X', 'X', 'Y', 'Y']},
                  index=['row1', 'row2', 'row3', 'row4'])
print(df)
# col1 col2
# row1 X X
# row2 X X
# row3 X Y
# row4 Y Y
print(df['col1'].value_counts())
# col1
# X 3
# Y 1
# Name: count, dtype: int64
print(df['col1'].value_counts().iat[0])
# 3
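The elements of the original Series become the index labels of the resulting Series. In the above example, where the labels are strings, [number] also works as a positional lookup, although recent pandas versions deprecate this fallback. When the labels are integers, [number] is interpreted as a label rather than a position, so it is better to use iat[number], which always selects by position.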
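The difference matters as soon as the counted values are integers. The following minimal sketch uses a made-up integer Series to show that [0] looks up the label 0, while iat[0] returns the first (most frequent) entry.

```python
import pandas as pd

s_int = pd.Series([1, 1, 1, 0])  # illustrative data
counts = s_int.value_counts()

# The counted values become the index labels
print(counts.index.tolist())
# [1, 0]

# [0] is a label lookup: the number of occurrences of the value 0
print(counts[0])
# 1

# iat[0] is positional: the frequency of the mode
print(counts.iat[0])
# 3
```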
The describe() method, which calculates summary statistics for each column, can also provide the mode and its frequency.
print(df.describe())
# col1 col2
# count 4 4
# unique 2 2
# top X X
# freq 3 2
top represents the mode and freq its frequency. If there are multiple modes, only one of them is returned. Since the result is a DataFrame, rows and individual elements can be accessed using loc or at.
print(df.describe().loc['freq'])
# col1 3
# col2 2
# Name: freq, dtype: object
print(df.describe().at['freq', 'col2'])
# 2
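Because describe() reports only one mode per column, it can hide ties. If you need every mode together with its frequency, one possible approach is to filter the result of value_counts() by its maximum. modes_with_freq below is a hypothetical helper written for this sketch, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['X', 'X', 'X', 'Y'],
                   'col2': ['X', 'X', 'Y', 'Y']})

def modes_with_freq(s):
    # Hypothetical helper (not part of pandas): map every mode of a
    # Series to its number of occurrences.
    counts = s.value_counts()
    # Keep only the values whose count equals the highest count;
    # sort_index() makes the order of tied modes deterministic.
    return counts[counts == counts.max()].sort_index().to_dict()

result = {col: modes_with_freq(df[col]) for col in df.columns}
print(result)
# {'col1': {'X': 3}, 'col2': {'X': 2, 'Y': 2}}
```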
describe() does not have the axis argument, so to apply it to rows, transpose the DataFrame first.
print(df.T.describe())
# row1 row2 row3 row4
# count 2 2 2 2
# unique 1 1 2 1
# top X X X Y
# freq 2 2 1 2