pandas: Get summary statistics for each column with describe()
In pandas, the describe() method on DataFrame and Series allows you to get summary statistics such as the mean, standard deviation, maximum, minimum, and mode for each column.
- pandas.DataFrame.describe — pandas 2.1.4 documentation
- pandas.Series.describe — pandas 2.1.4 documentation
- Basic usage of describe()
- Specify target types: include, exclude
- Calculate top, freq, and more for numerical columns
- Calculate mean, std, and more for strings of numbers
- Details of items by describe() and corresponding methods
- Specify percentiles to calculate in describe(): percentiles
- For datetime64[ns] type
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.DataFrame({'int': [1, 2, 3, 1],
                   'float': [0.1, 0.2, 0.3, float('nan')],
                   'str': ['X', 'Y', 'X', 'Z'],
                   'str_num': ['1', '2', '3', '1'],
                   'bool': [True, True, False, True]})
print(df)
# int float str str_num bool
# 0 1 0.1 X 1 True
# 1 2 0.2 Y 2 True
# 2 3 0.3 X 3 False
# 3 1 NaN Z 1 True
print(df.dtypes)
# int int64
# float float64
# str object
# str_num object
# bool bool
# dtype: object
Basic usage of describe()
The describe() method on Series returns a Series.
print(df['float'].describe())
# count 3.00
# mean 0.20
# std 0.10
# min 0.10
# 25% 0.15
# 50% 0.20
# 75% 0.25
# max 0.30
# Name: float, dtype: float64
print(df['str'].describe())
# count 4
# unique 3
# top X
# freq 2
# Name: str, dtype: object
print(type(df['float'].describe()))
# <class 'pandas.core.series.Series'>
The items calculated by describe() vary depending on the data type (dtype). Details about these items will be provided later.
- For numerical columns: count, mean, std, min, 25%, 50%, 75%, max
- For object columns (such as strings): count, unique, top, freq
The describe() method on DataFrame returns a DataFrame.
print(df.describe())
# int float
# count 4.000000 3.00
# mean 1.750000 0.20
# std 0.957427 0.10
# min 1.000000 0.10
# 25% 1.000000 0.15
# 50% 1.500000 0.20
# 75% 2.250000 0.25
# max 3.000000 0.30
print(type(df.describe()))
# <class 'pandas.core.frame.DataFrame'>
Rows, columns, and values can be accessed using loc and at.
print(df.describe().loc['std'])
# int 0.957427
# float 0.100000
# Name: std, dtype: float64
print(df.describe().at['std', 'int'])
# 0.9574271077563381
In a DataFrame containing mixed column types, calculations are applied only to numerical columns by default. The include and exclude arguments, which will be explained next, allow you to specify the data types to include and exclude.
Specify target types: include, exclude
To target specific data types with the describe() method in a DataFrame, use the include and exclude arguments. Remember to specify the data type, not the column name.
Target non-numerical columns
Numerical types can be represented as 'number'. Setting exclude='number' will calculate results for non-numerical columns, such as those containing strings.
print(df.describe(exclude='number'))
# str str_num bool
# count 4 4 4
# unique 3 3 2
# top X 1 True
# freq 2 2 3
In the case of a DataFrame without numerical columns, calculations are applied to the available data types by default.
print(df[['str', 'str_num', 'bool']])
# str str_num bool
# 0 X 1 True
# 1 Y 2 True
# 2 X 3 False
# 3 Z 1 True
print(df[['str', 'str_num', 'bool']].describe())
# str str_num bool
# count 4 4 4
# unique 3 3 2
# top X 1 True
# freq 2 2 3
Target all types of columns
Setting include='all' includes all types of columns. Since the calculated items vary between numerical and other column types, the values for items not calculated will be missing values (NaN).
print(df.describe(include='all'))
# int float str str_num bool
# count 4.000000 3.00 4 4 4
# unique NaN NaN 3 3 2
# top NaN NaN X 1 True
# freq NaN NaN 2 2 3
# mean 1.750000 0.20 NaN NaN NaN
# std 0.957427 0.10 NaN NaN NaN
# min 1.000000 0.10 NaN NaN NaN
# 25% 1.000000 0.15 NaN NaN NaN
# 50% 1.500000 0.20 NaN NaN NaN
# 75% 2.250000 0.25 NaN NaN NaN
# max 3.000000 0.30 NaN NaN NaN
Include and exclude specific types
Any data type can be specified for the include and exclude arguments. A DataFrame is returned as the result, even when it contains only a single column.
print(df.describe(include=int))
# int
# count 4.000000
# mean 1.750000
# std 0.957427
# min 1.000000
# 25% 1.000000
# 50% 1.500000
# 75% 2.250000
# max 3.000000
print(type(df.describe(include=int)))
# <class 'pandas.core.frame.DataFrame'>
Multiple types can be specified in a list. The items to be calculated are automatically determined based on the selected types.
print(df.describe(include=[object, bool]))
# str str_num bool
# count 4 4 4
# unique 3 3 2
# top X 1 True
# freq 2 2 3
print(df.describe(exclude=['f8', 'object']))
# int bool
# count 4.000000 4
# unique NaN 2
# top NaN True
# freq NaN 3
# mean 1.750000 NaN
# std 0.957427 NaN
# min 1.000000 NaN
# 25% 1.000000 NaN
# 50% 1.500000 NaN
# 75% 2.250000 NaN
# max 3.000000 NaN
You can specify data types using type objects (like int, bool, or np.float64), type names as strings (like 'int', 'bool', or 'float64'), or type codes as strings (for example, 'f8' for 'float64'). For more details, refer to the following article.
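As a rough sketch (using the example df above, not taken from the original article), the following calls should be equivalent ways of selecting the float64 column, since np.float64, 'float64', and 'f8' all refer to the same data type.
import numpy as np
# These three specifications are assumed to be equivalent; all select the float column only.
print(df.describe(include=np.float64))  # type object
print(df.describe(include='float64'))   # type name as a string
print(df.describe(include='f8'))        # NumPy type code as a string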
Calculate top, freq, and more for numerical columns
For example, with categorical data represented numerically, such as when males are coded as 0 and females as 1, or when geographical names are assigned numbers, you might prefer to check the mode and its frequency rather than the mean or standard deviation.
In such cases, you can use the astype() method to convert the type to object, either for all columns or for specific columns.
print(df.astype(object).describe())
# int float str str_num bool
# count 4 3.0 4 4 4
# unique 3 3.0 3 3 2
# top 1 0.1 X 1 True
# freq 2 1.0 2 2 3
print(df.astype({'int': object}).describe(exclude='number'))
# int str str_num bool
# count 4 4 4 4
# unique 3 3 3 2
# top 1 X 1 True
# freq 2 2 2 3
Calculate mean, std, and more for strings of numbers
To calculate the mean and standard deviation for strings of numbers, first convert them to a numerical data type using the astype() method.
print(df.astype({'str_num': int, 'bool': int}).describe())
# int float str_num bool
# count 4.000000 3.00 4.000000 4.00
# mean 1.750000 0.20 1.750000 0.75
# std 0.957427 0.10 0.957427 0.50
# min 1.000000 0.10 1.000000 0.00
# 25% 1.000000 0.15 1.000000 0.75
# 50% 1.500000 0.20 1.500000 1.00
# 75% 2.250000 0.25 2.250000 1.00
# max 3.000000 0.30 3.000000 1.00
Details of items by describe() and corresponding methods
This section details the items calculated by describe() and the methods for obtaining them individually. While you can access the values for each item by selecting the rows in the DataFrame returned by describe(), using individual methods is more efficient when you do not require other items.
Note that describe() excludes missing values (NaN) in its calculations, but some individual methods provide an option to include or exclude NaN.
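As a small illustration of the latter point (a sketch using the example df, whose 'float' column contains one NaN), nunique() accepts a dropna argument, so missing values can optionally be counted as a distinct value.
print(df['float'].nunique())
# 3
print(df['float'].nunique(dropna=False))
# 4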
count: Number of non-NaN values
count represents the number of non-NaN values, which can be obtained by the count() method.
print(df.count())
# int 4
# float 3
# str 4
# str_num 4
# bool 4
# dtype: int64
unique: Number of unique values
unique represents the number of unique values, which can be obtained by the nunique() method.
print(df.nunique())
# int 3
# float 3
# str 3
# str_num 3
# bool 2
# dtype: int64
top: Mode
top represents the mode (the most frequent value), which can be obtained by the mode() method.
mode() on DataFrame returns a DataFrame. If there are multiple modes, any unfilled positions in the result are represented as missing values (NaN).
print(df.mode())
# int float str str_num bool
# 0 1.0 0.1 X 1 True
# 1 NaN 0.2 NaN NaN NaN
# 2 NaN 0.3 NaN NaN NaN
Selecting the first row with iloc[0] allows you to obtain at least one mode for each column.
print(df.mode().iloc[0])
# int 1.0
# float 0.1
# str X
# str_num 1
# bool True
# Name: 0, dtype: object
Note that top in describe() returns only one of the modes if there are multiple, and it does not necessarily match the first row of mode().
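To illustrate with a hypothetical Series that has two equally frequent values (a sketch, not from the original article): describe() reports only one of the modes as top, while mode() returns all of them.
s = pd.Series(['X', 'X', 'Y', 'Y'])
print(s.describe()['top'])
# X  (only one of the two modes; which one is reported is not guaranteed)
print(s.mode())
# 0 X
# 1 Y
# dtype: object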
freq: Frequency of the mode
freq represents the frequency of the mode, which can be obtained by the value_counts() method on Series.
print(df['str'].value_counts())
# str
# X 2
# Y 1
# Z 1
# Name: count, dtype: int64
print(df['str'].value_counts().iat[0])
# 2
For more details on obtaining the frequency of the mode for each column, refer to the following article.
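As a rough sketch of one way to do this for every column at once (an assumption, not necessarily the linked article's approach), apply() can run value_counts() on each column and take the first, i.e. largest, count.
print(df.apply(lambda s: s.value_counts().iat[0]))
# int 2
# float 1
# str 2
# str_num 2
# bool 3
# dtype: int64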
mean: Arithmetic mean
mean represents the arithmetic mean, which can be obtained by the mean() method.
Setting the numeric_only argument to True targets only numerical columns. The same applies to the methods described in the following subsections.
print(df.mean(numeric_only=True))
# int 1.75
# float 0.20
# bool 0.75
# dtype: float64
In describe(), bool type columns are excluded, whereas in mean(), True is treated as 1 and False as 0. The same applies to the methods described in the following subsections.
std: Sample standard deviation
std represents the sample standard deviation, which can be obtained by the std() method.
print(df.std(numeric_only=True))
# int 0.957427
# float 0.100000
# bool 0.500000
# dtype: float64
min: Minimum value
min represents the minimum value, which can be obtained by the min() method.
print(df.min(numeric_only=True))
# int 1
# float 0.1
# bool False
# dtype: object
max: Maximum value
max represents the maximum value, which can be obtained by the max() method.
print(df.max(numeric_only=True))
# int 3
# float 0.3
# bool True
# dtype: object
50%: Median (50th percentile)
50% represents the median (50th percentile), which can be obtained by the median() method.
- pandas.DataFrame.median — pandas 2.1.4 documentation
print(df.median(numeric_only=True))
# int 1.5
# float 0.2
# bool 1.0
# dtype: float64
25%, 75%: 25th and 75th percentiles
25% and 75% represent the 25th and 75th percentiles, which can be obtained by the quantile() method.
As of pandas version 2.1.4, even when setting the numeric_only argument to True, an error occurs if there are bool columns.
# print(df.quantile(q=[0.25, 0.75], numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
The following example uses a DataFrame that excludes the bool column. quantile() can calculate the minimum value, 25th percentile, median, 75th percentile, and maximum value all at once.
print(df.iloc[:, :-1])
# int float str str_num
# 0 1 0.1 X 1
# 1 2 0.2 Y 2
# 2 3 0.3 X 3
# 3 1 NaN Z 1
print(df.iloc[:, :-1].quantile(q=[0, 0.25, 0.5, 0.75, 1], numeric_only=True))
# int float
# 0.00 1.00 0.10
# 0.25 1.00 0.15
# 0.50 1.50 0.20
# 0.75 2.25 0.25
# 1.00 3.00 0.30
You can specify the percentiles to calculate in describe() with the percentiles argument.
Specify percentiles to calculate in describe(): percentiles
By default, describe() calculates the minimum value (0th percentile), median (50th percentile), and maximum value (100th percentile), along with the 25th and 75th percentiles.
The minimum, median, and maximum values are always calculated, but the other percentiles can be specified with the percentiles argument. Specify a list of values ranging from 0.0 to 1.0.
print(df.describe(percentiles=[0.2, 0.4, 0.6, 0.8]))
# int float
# count 4.000000 3.00
# mean 1.750000 0.20
# std 0.957427 0.10
# min 1.000000 0.10
# 20% 1.000000 0.14
# 40% 1.200000 0.18
# 50% 1.500000 0.20
# 60% 1.800000 0.22
# 80% 2.400000 0.26
# max 3.000000 0.30
For datetime64[ns] type
pandas has the datetime64[ns] type for representing dates and times.
df['dt'] = pd.to_datetime(['2023-12-01', '2023-12-08', '2023-12-15', '2023-12-22'])
print(df)
# int float str str_num bool dt
# 0 1 0.1 X 1 True 2023-12-01
# 1 2 0.2 Y 2 True 2023-12-08
# 2 3 0.3 X 3 False 2023-12-15
# 3 1 NaN Z 1 True 2023-12-22
print(df.dtypes)
# int int64
# float float64
# str object
# str_num object
# bool bool
# dt datetime64[ns]
# dtype: object
In previous versions, for columns of type datetime64[ns], additional calculations like first (the earliest datetime) and last (the latest datetime) were performed along with count, unique, top, and freq.
As of pandas version 2.1.4, datetime64[ns] columns are treated similarly to numeric columns. min and max correspond to first and last, respectively. To calculate items such as unique, top, and freq for datetime64[ns] columns, you can convert their type using astype().
print(df.describe())
# int float dt
# count 4.000000 3.00 4
# mean 1.750000 0.20 2023-12-11 12:00:00
# min 1.000000 0.10 2023-12-01 00:00:00
# 25% 1.000000 0.15 2023-12-06 06:00:00
# 50% 1.500000 0.20 2023-12-11 12:00:00
# 75% 2.250000 0.25 2023-12-16 18:00:00
# max 3.000000 0.30 2023-12-22 00:00:00
# std 0.957427 0.10 NaN
print(df.astype(object).describe())
# int float str str_num bool dt
# count 4 3.0 4 4 4 4
# unique 3 3.0 3 3 2 4
# top 1 0.1 X 1 True 2023-12-01 00:00:00
# freq 2 1.0 2 2 3 1
For information on calculating percentiles for datetime, refer to the following article.
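As a minimal sketch (not necessarily the linked article's approach), quantile() can also be called directly on a datetime64[ns] Series in recent pandas versions and returns a Timestamp.
print(df['dt'].quantile(0.5))
# 2023-12-11 12:00:00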