pandas: Get summary statistics for each column with describe()

Tags: Python, pandas

In pandas, the describe() method on DataFrame and Series allows you to get summary statistics such as the mean, standard deviation, maximum, minimum, and mode for each column.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.DataFrame({'int': [1, 2, 3, 1],
                   'float': [0.1, 0.2, 0.3, float('nan')],
                   'str': ['X', 'Y', 'X', 'Z'],
                   'str_num': ['1', '2', '3', '1'],
                   'bool': [True, True, False, True]})

print(df)
#    int  float str str_num   bool
# 0    1    0.1   X       1   True
# 1    2    0.2   Y       2   True
# 2    3    0.3   X       3  False
# 3    1    NaN   Z       1   True

print(df.dtypes)
# int          int64
# float      float64
# str         object
# str_num     object
# bool          bool
# dtype: object

Basic usage of describe()

The describe() method on Series returns a Series.

print(df['float'].describe())
# count    3.00
# mean     0.20
# std      0.10
# min      0.10
# 25%      0.15
# 50%      0.20
# 75%      0.25
# max      0.30
# Name: float, dtype: float64

print(df['str'].describe())
# count     4
# unique    3
# top       X
# freq      2
# Name: str, dtype: object

print(type(df['float'].describe()))
# <class 'pandas.core.series.Series'>

The items calculated by describe() vary depending on the data type (dtype). Details about these items will be provided later.

  • For numerical columns: count, mean, std, min, 25%, 50%, 75%, max
  • For object columns (such as strings): count, unique, top, freq

The describe() method on DataFrame returns a DataFrame.

print(df.describe())
#             int  float
# count  4.000000   3.00
# mean   1.750000   0.20
# std    0.957427   0.10
# min    1.000000   0.10
# 25%    1.000000   0.15
# 50%    1.500000   0.20
# 75%    2.250000   0.25
# max    3.000000   0.30

print(type(df.describe()))
# <class 'pandas.core.frame.DataFrame'>

Rows, columns, and values can be accessed using loc and at.

print(df.describe().loc['std'])
# int      0.957427
# float    0.100000
# Name: std, dtype: float64

print(df.describe().at['std', 'int'])
# 0.9574271077563381
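
If you prefer one row per column, you can also transpose the result with the T attribute and then select a column's statistics by label. A quick sketch:

print(df.describe().T)
#        count  mean       std  min   25%  50%   75%  max
# int      4.0  1.75  0.957427  1.0  1.00  1.5  2.25  3.0
# float    3.0  0.20  0.100000  0.1  0.15  0.2  0.25  0.3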

In a DataFrame containing mixed column types, calculations are applied only to numerical columns by default. The include and exclude arguments, which will be explained next, allow you to specify the data types to include and exclude.

Specify target types: include, exclude

To target specific data types with the describe() method in a DataFrame, use the include and exclude arguments. Remember to specify the data type, not the column name.

Target non-numerical columns

Numerical types can be represented as 'number'. Setting exclude='number' will calculate results for non-numerical columns, such as those containing strings.

print(df.describe(exclude='number'))
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3

For a DataFrame without numerical columns, calculations are applied to the existing columns by default.

print(df[['str', 'str_num', 'bool']])
#   str str_num   bool
# 0   X       1   True
# 1   Y       2   True
# 2   X       3  False
# 3   Z       1   True

print(df[['str', 'str_num', 'bool']].describe())
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3

Target all types of columns

Setting include='all' includes all types of columns. Since the calculated items vary between numerical and other column types, the values for items not calculated will be missing values (NaN).

print(df.describe(include='all'))
#              int  float  str str_num  bool
# count   4.000000   3.00    4       4     4
# unique       NaN    NaN    3       3     2
# top          NaN    NaN    X       1  True
# freq         NaN    NaN    2       2     3
# mean    1.750000   0.20  NaN     NaN   NaN
# std     0.957427   0.10  NaN     NaN   NaN
# min     1.000000   0.10  NaN     NaN   NaN
# 25%     1.000000   0.15  NaN     NaN   NaN
# 50%     1.500000   0.20  NaN     NaN   NaN
# 75%     2.250000   0.25  NaN     NaN   NaN
# max     3.000000   0.30  NaN     NaN   NaN

Include and exclude specific types

Any data type can be specified for the include and exclude arguments. A DataFrame is returned as the result, even when it contains only a single column.

print(df.describe(include=int))
#             int
# count  4.000000
# mean   1.750000
# std    0.957427
# min    1.000000
# 25%    1.000000
# 50%    1.500000
# 75%    2.250000
# max    3.000000

print(type(df.describe(include=int)))
# <class 'pandas.core.frame.DataFrame'>

Multiple types can be specified in a list. The items to be calculated are automatically determined based on the selected types.

print(df.describe(include=[object, bool]))
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3

print(df.describe(exclude=['f8', 'object']))
#              int  bool
# count   4.000000     4
# unique       NaN     2
# top          NaN  True
# freq         NaN     3
# mean    1.750000   NaN
# std     0.957427   NaN
# min     1.000000   NaN
# 25%     1.000000   NaN
# 50%     1.500000   NaN
# 75%     2.250000   NaN
# max     3.000000   NaN

You can specify data types using type objects (like int, bool, or np.float64), type names as strings (like 'int', 'bool', or 'float64'), or type codes as strings (for example, 'f8' for 'float64'). For more details, refer to the following article.
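
For instance, the type name 'float64' passed as a string selects the same column as the type code 'f8' used above. A brief sketch:

print(df.describe(include='float64'))
#        float
# count   3.00
# mean    0.20
# std     0.10
# min     0.10
# 25%     0.15
# 50%     0.20
# 75%     0.25
# max     0.30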

Calculate top, freq, and more for numerical columns

For example, with categorical data represented numerically, such as when males are coded as 0 and females as 1, or when geographical names are assigned numbers, you might prefer to check the mode and its frequency rather than the mean or standard deviation.

In such cases, you can use the astype() method to convert the type to object, either for all columns or for specific columns.

print(df.astype(object).describe())
#         int  float str str_num  bool
# count     4    3.0   4       4     4
# unique    3    3.0   3       3     2
# top       1    0.1   X       1  True
# freq      2    1.0   2       2     3

print(df.astype({'int': object}).describe(exclude='number'))
#         int str str_num  bool
# count     4   4       4     4
# unique    3   3       3     2
# top       1   X       1  True
# freq      2   2       2     3

Calculate mean, std, and more for strings of numbers

To calculate the mean and standard deviation for strings of numbers, first convert them to a numerical data type using the astype() method.

print(df.astype({'str_num': int, 'bool': int}).describe())
#             int  float   str_num  bool
# count  4.000000   3.00  4.000000  4.00
# mean   1.750000   0.20  1.750000  0.75
# std    0.957427   0.10  0.957427  0.50
# min    1.000000   0.10  1.000000  0.00
# 25%    1.000000   0.15  1.000000  0.75
# 50%    1.500000   0.20  1.500000  1.00
# 75%    2.250000   0.25  2.250000  1.00
# max    3.000000   0.30  3.000000  1.00

Details of the items calculated by describe() and corresponding methods

This section details the items calculated by describe() and the methods for obtaining them individually. While you can access the values for each item by selecting the rows in the DataFrame returned by describe(), using individual methods is more efficient when you do not require other items.

Note that describe() excludes missing values (NaN) in its calculations, but some individual methods provide an option to include or exclude NaN.
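
For example, nunique() has a dropna argument and mean() has a skipna argument. A quick sketch with the example DataFrame:

print(df.nunique(dropna=False))
# int        3
# float      4
# str        3
# str_num    3
# bool       2
# dtype: int64

print(df['float'].mean(skipna=False))
# nan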

count: Number of non-NaN values

count represents the number of non-NaN values, which can be obtained by the count() method.

print(df.count())
# int        4
# float      3
# str        4
# str_num    4
# bool       4
# dtype: int64

unique: Number of unique values

unique represents the number of unique values, which can be obtained by the nunique() method.

print(df.nunique())
# int        3
# float      3
# str        3
# str_num    3
# bool       2
# dtype: int64

top: Mode

top represents the mode (the most frequent value), which can be obtained by the mode() method.

mode() on DataFrame returns a DataFrame. If there are multiple modes, any unfilled positions in the result are represented as missing values (NaN).

print(df.mode())
#    int  float  str str_num  bool
# 0  1.0    0.1    X       1  True
# 1  NaN    0.2  NaN     NaN   NaN
# 2  NaN    0.3  NaN     NaN   NaN

Selecting the first row with iloc[0] allows you to obtain one of the modes for each column.

print(df.mode().iloc[0])
# int         1.0
# float       0.1
# str           X
# str_num       1
# bool       True
# Name: 0, dtype: object

Note that top in describe() returns only one of the modes if there are multiple, and it does not necessarily match the first row of mode().

freq: Frequency of the mode

freq represents the frequency of the mode, which can be obtained by the value_counts() method on Series.

print(df['str'].value_counts())
# str
# X    2
# Y    1
# Z    1
# Name: count, dtype: int64

print(df['str'].value_counts().iat[0])
# 2

For more details on obtaining the frequency of the mode for each column, refer to the following article.
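
As one rough sketch, applying value_counts() to each column gives the frequency of the mode for all columns at once:

print(df.apply(lambda s: s.value_counts().iat[0]))
# int        2
# float      1
# str        2
# str_num    2
# bool       3
# dtype: int64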

mean: Arithmetic mean

mean represents the arithmetic mean, which can be obtained by the mean() method.

Setting the numeric_only argument to True targets only numerical columns. The same applies to methods described in the following subsections.

print(df.mean(numeric_only=True))
# int      1.75
# float    0.20
# bool     0.75
# dtype: float64

In describe(), bool type columns are excluded, whereas in mean(), True is treated as 1 and False as 0. The same applies to methods described in the following subsections.

std: Sample standard deviation

std represents the sample standard deviation, which can be obtained by the std() method.

print(df.std(numeric_only=True))
# int      0.957427
# float    0.100000
# bool     0.500000
# dtype: float64

min: Minimum value

min represents the minimum value, which can be obtained by the min() method.

print(df.min(numeric_only=True))
# int          1
# float      0.1
# bool     False
# dtype: object

max: Maximum value

max represents the maximum value, which can be obtained by the max() method.

print(df.max(numeric_only=True))
# int         3
# float     0.3
# bool     True
# dtype: object

50%: Median (50th percentile)

50% represents the median (50th percentile), which can be obtained by the median() method.

pandas.DataFrame.median — pandas 2.1.4 documentation

print(df.median(numeric_only=True))
# int      1.5
# float    0.2
# bool     1.0
# dtype: float64

25%, 75%: 25th and 75th percentiles

25% and 75% represent the 25th and 75th percentiles, which can be obtained by the quantile() method.

As of pandas version 2.1.4, even when setting the numeric_only argument to True, an error occurs if there are bool columns.

# print(df.quantile(q=[0.25, 0.75], numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

The following example uses a DataFrame that excludes the bool column. quantile() can calculate the minimum value, 25th percentile, median, 75th percentile, and maximum value all at once.

print(df.iloc[:, :-1])
#    int  float str str_num
# 0    1    0.1   X       1
# 1    2    0.2   Y       2
# 2    3    0.3   X       3
# 3    1    NaN   Z       1

print(df.iloc[:, :-1].quantile(q=[0, 0.25, 0.5, 0.75, 1], numeric_only=True))
#        int  float
# 0.00  1.00   0.10
# 0.25  1.00   0.15
# 0.50  1.50   0.20
# 0.75  2.25   0.25
# 1.00  3.00   0.30
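
Alternatively, you can select only the numerical columns with select_dtypes() before calling quantile(), which avoids the error without dropping columns by position. A brief sketch:

print(df.select_dtypes(include='number').quantile(q=[0.25, 0.75]))
#        int  float
# 0.25  1.00   0.15
# 0.75  2.25   0.25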

You can specify the percentiles to calculate in describe() with the percentiles argument.

Specify percentiles to calculate in describe(): percentiles

By default, describe() calculates the minimum value (0th percentile), median (50th percentile), and maximum value (100th percentile), along with the 25th and 75th percentiles.

The minimum, median, and maximum values are always calculated, but others can be specified with the percentiles argument. Specify a list of values ranging from 0.0 to 1.0.

print(df.describe(percentiles=[0.2, 0.4, 0.6, 0.8]))
#             int  float
# count  4.000000   3.00
# mean   1.750000   0.20
# std    0.957427   0.10
# min    1.000000   0.10
# 20%    1.000000   0.14
# 40%    1.200000   0.18
# 50%    1.500000   0.20
# 60%    1.800000   0.22
# 80%    2.400000   0.26
# max    3.000000   0.30

For datetime64[ns] type

pandas has the datetime64[ns] type for representing dates and times.

df['dt'] = pd.to_datetime(['2023-12-01', '2023-12-08', '2023-12-15', '2023-12-22'])
print(df)
#    int  float str str_num   bool         dt
# 0    1    0.1   X       1   True 2023-12-01
# 1    2    0.2   Y       2   True 2023-12-08
# 2    3    0.3   X       3  False 2023-12-15
# 3    1    NaN   Z       1   True 2023-12-22

print(df.dtypes)
# int                 int64
# float             float64
# str                object
# str_num            object
# bool                 bool
# dt         datetime64[ns]
# dtype: object

In previous versions, for columns of type datetime64[ns], additional calculations like first (the earliest datetime) and last (the latest datetime) were performed along with count, unique, top, and freq.

As of pandas version 2.1.4, datetime64[ns] columns are treated similarly to numeric columns. min and max correspond to first and last, respectively. To calculate count or unique for datetime64[ns] columns, you can convert their type using astype().

print(df.describe())
#             int  float                   dt
# count  4.000000   3.00                    4
# mean   1.750000   0.20  2023-12-11 12:00:00
# min    1.000000   0.10  2023-12-01 00:00:00
# 25%    1.000000   0.15  2023-12-06 06:00:00
# 50%    1.500000   0.20  2023-12-11 12:00:00
# 75%    2.250000   0.25  2023-12-16 18:00:00
# max    3.000000   0.30  2023-12-22 00:00:00
# std    0.957427   0.10                  NaN

print(df.astype(object).describe())
#         int  float str str_num  bool                   dt
# count     4    3.0   4       4     4                    4
# unique    3    3.0   3       3     2                    4
# top       1    0.1   X       1  True  2023-12-01 00:00:00
# freq      2    1.0   2       2     3                    1

For information on calculating percentiles for datetime, refer to the following article.
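
As a simple example, quantile() can also be called directly on a datetime64[ns] Series; the values below match the percentiles in the describe() output above.

print(df['dt'].quantile([0.25, 0.5, 0.75]))
# 0.25   2023-12-06 06:00:00
# 0.50   2023-12-11 12:00:00
# 0.75   2023-12-16 18:00:00
# Name: dt, dtype: datetime64[ns]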
