pandas: Get quantiles and percentiles with quantile()

Modified: 2023-12-14 | Tags: Python, pandas

In pandas, use the quantile() method to get quantiles and percentiles of a DataFrame or Series.

Quantiles and percentiles are defined as follows.

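For a real number q (0.0 to 1.0), the q-quantile is the value that divides the distribution in the ratio q : 1 - q.
...
The q/100-quantile is called the q-th percentile.
Source: 分位数 - Wikipedia (the Japanese Wikipedia article on quantiles)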

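For example, the 1/4-quantile is the same as the 25th percentile. As the Wikipedia page above also notes, some quantiles have special names: the 1/4-quantile is called the first quartile and the 3/4-quantile the third quartile.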
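The relationship between percentiles and quantiles can be sketched in code as follows. The helper name percentile() is hypothetical and not part of pandas; it only divides by 100 before calling quantile().

```python
import pandas as pd

s = pd.Series(range(11))


# Hypothetical helper (not a pandas API): convert a percentile given
# on a 0-100 scale into the 0.0-1.0 fraction that quantile() expects.
def percentile(data, p):
    return data.quantile(p / 100)


q1 = percentile(s, 25)  # first quartile = 1/4-quantile
q3 = percentile(s, 75)  # third quartile = 3/4-quantile
print(q1, q3)
# 2.5 7.5
```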

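Contents

Basic usage of quantile()
Specify the quantiles/percentiles to get: the q argument
Specify the interpolation method: the interpolation argument
- Differences in dtype depending on the interpolation method
Specify rows or columns: the axis argument
Specify whether to process non-numeric data: the numeric_only argument
quantile() on strings
quantile() on datetime
quantile() on bool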

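There is also the describe() method, which calculates summary statistics including quartiles. It is more convenient when you only want an overview of the data.

Related article: Get summary statistics for each column with describe() in pandas (mean, standard deviation, etc.)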
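As a rough illustration, the quartile rows of describe() can be compared with quantile() directly. This sketch assumes the default row labels '25%', '50%', and '75%' that describe() uses for numeric columns.

```python
import pandas as pd

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})

desc = df.describe()

# The '25%' row of describe() should match the 1/4-quantile of each column.
same_q1 = bool((desc.loc['25%'] == df.quantile(0.25)).all())
same_q3 = bool((desc.loc['75%'] == df.quantile(0.75)).all())
print(same_q1, same_q3)
# True True
```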

If you only need the median (1/2-quantile, 50th percentile), median() is also an option.

Related article: Get the median with median() in pandas
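A small check of this equivalence, using the same kind of data as the rest of the article:

```python
import pandas as pd

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})

# median() and quantile(0.5) give the same value for every column.
same_for_frame = bool((df.median() == df.quantile(0.5)).all())
same_for_series = bool(df['col_1'].median() == df['col_1'].quantile())
print(same_for_frame, same_for_series)
# True True
```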

The sample code in this article was run with the pandas version shown below. Note that behavior may differ between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})
print(df)
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

source: pandas_quantile.py

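Basic usage of quantile()

By default, quantile() on a DataFrame returns the median (1/2-quantile, 50th percentile) of each column as a Series. DataFrames containing non-numeric columns are discussed later.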

print(df.quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(type(df.quantile()))
# <class 'pandas.core.series.Series'>

source: pandas_quantile.py

Calling quantile() on a Series returns the median as a scalar.

print(df['col_1'].quantile())
# 5.0

print(type(df['col_1'].quantile()))
# <class 'numpy.float64'>

source: pandas_quantile.py

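The type of the result depends on the original dtype and on the interpolation argument described below.

Specify the quantiles/percentiles to get: the q argument

Specify the quantile you want as a value from 0.0 to 1.0 in the first argument, q.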

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

source: pandas_quantile.py

You can also specify multiple values as a list. In this case, the return value is a DataFrame.

print(df.quantile([0, 0.25, 0.5, 0.75, 1.0]))
#       col_1  col_2
# 0.00    0.0    0.0
# 0.25    2.5    6.5
# 0.50    5.0   25.0
# 0.75    7.5   56.5
# 1.00   10.0  100.0

print(type(df.quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.frame.DataFrame'>

source: pandas_quantile.py

For a Series, specifying multiple values returns a Series.

print(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0]))
# 0.00     0.0
# 0.25     2.5
# 0.50     5.0
# 0.75     7.5
# 1.00    10.0
# Name: col_1, dtype: float64

print(type(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.series.Series'>

source: pandas_quantile.py
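Because the requested q values become the index of the result, individual values can be looked up with .loc. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})

quartiles = df.quantile([0.25, 0.5, 0.75])

# Row label = q, column label = original column name.
print(quartiles.loc[0.25, 'col_2'])
# 6.5
print(quartiles.loc[0.75, 'col_1'])
# 7.5
```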

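Specify the interpolation method: the interpolation argument

The interpolation method is specified with the interpolation argument. The default is 'linear', which linearly interpolates between the two neighboring values.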

print(df.quantile(0.21))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

print(df.quantile(0.21, interpolation='linear'))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

source: pandas_quantile.py

'lower' uses the smaller of the two neighboring values, 'higher' the larger, and 'nearest' the closer one.

print(df.quantile(0.21, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='higher'))
# col_1    3
# col_2    9
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='nearest'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

source: pandas_quantile.py

'midpoint' uses the midpoint (mean) of the two neighboring values.

print(df.quantile(0.21, interpolation='midpoint'))
# col_1    2.5
# col_2    6.5
# Name: 0.21, dtype: float64

source: pandas_quantile.py
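The interpolation methods above can be reproduced by hand, which may help clarify what each option does. This is a sketch under the assumption that the fractional position is q * (n - 1) in the sorted data, which matches the outputs shown above; it is not the internal pandas implementation.

```python
import math

import pandas as pd

s = pd.Series([i**2 for i in range(11)])
q = 0.21

values = sorted(s.tolist())
pos = q * (len(values) - 1)  # fractional position in the sorted data (about 2.1)
lo = math.floor(pos)         # index of the lower neighbor
hi = math.ceil(pos)          # index of the higher neighbor
frac = pos - lo              # how far pos is from the lower neighbor

linear = values[lo] + (values[hi] - values[lo]) * frac
midpoint = (values[lo] + values[hi]) / 2

# isclose() is used because of floating-point rounding.
print(math.isclose(linear, s.quantile(q)))
# True
print(values[lo], values[hi], midpoint)
# 4 9 6.5
```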

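Differences in dtype depending on the interpolation method

Because the default is linear interpolation, the result is converted to float when the original dtype is int. Note that the dtype changes even when the result equals one of the original values.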

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

source: pandas_quantile.py

With 'lower', 'higher', or 'nearest', an original value is used as-is, so the dtype is preserved.

print(df.quantile(0.2, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.2, dtype: int64

source: pandas_quantile.py
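The dtype difference can be checked programmatically. The final cast is only one possible approach and assumes that rounding to the nearest integer is acceptable for your use case.

```python
import pandas as pd

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})

linear = df.quantile(0.2)
lower = df.quantile(0.2, interpolation='lower')

print(pd.api.types.is_float_dtype(linear))
# True
print(pd.api.types.is_integer_dtype(lower))
# True

# Assumption: rounding is acceptable. round() guards against
# floating-point error before the cast back to int.
as_int = linear.round().astype(int)
print(as_int.tolist())
# [2, 4]
```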

Specify rows or columns: the axis argument

By default, each column is processed; setting the axis argument to 1 or 'columns' processes each row instead.

print(df.quantile(axis=1))
# 0      0.0
# 1      1.0
# 2      3.0
# 3      6.0
# 4     10.0
# 5     15.0
# 6     21.0
# 7     28.0
# 8     36.0
# 9     45.0
# 10    55.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py
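As a quick sketch, axis=1 and axis='columns' give the same result, and passing a list for q together with axis=1 yields a DataFrame whose index is q and whose columns are the original row labels.

```python
import pandas as pd

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})

by_row = df.quantile(axis=1)
print(by_row.equals(df.quantile(axis='columns')))
# True

# Multiple quantiles per row: one row per q, one column per original row.
multi = df.quantile([0.25, 0.75], axis=1)
print(multi.shape)
# (2, 11)
```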

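Specify whether to process non-numeric data: the numeric_only argument

The numeric_only argument specifies whether non-numeric columns are processed. With numeric_only=True, only numeric columns are included; with numeric_only=False, columns of all types are included.

In pandas 2.0, the default value of numeric_only changed to False. In earlier versions it was True, so be aware that the behavior differs by version.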
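Since the default differs between versions, one way to get consistent behavior is to state the intent explicitly. The sketch below shows two options: passing numeric_only=True, or selecting numeric columns beforehand with select_dtypes(). The DataFrame df_mixed and its column names are made up for this illustration.

```python
import pandas as pd

# Hypothetical example data with one numeric and one string column.
df_mixed = pd.DataFrame({'num': range(11), 'text': list('abcdefghijk')})

# Option 1: pass numeric_only explicitly instead of relying on the default.
explicit = df_mixed.quantile(numeric_only=True)
print(explicit.index.tolist())
# ['num']

# Option 2: keep only numeric columns first, then call quantile().
selected = df_mixed.select_dtypes(include='number').quantile()
print(selected.index.tolist())
# ['num']
```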

quantile() on strings

The following DataFrame, with a string column added, is used as an example.

df_str = df.copy()
df_str['col_3'] = list('abcdefghijk')
print(df_str)
#     col_1  col_2 col_3
# 0       0      0     a
# 1       1      1     b
# 2       2      4     c
# 3       3      9     d
# 4       4     16     e
# 5       5     25     f
# 6       6     36     g
# 7       7     49     h
# 8       8     64     i
# 9       9     81     j
# 10     10    100     k

print(df_str.dtypes)
# col_1     int64
# col_2     int64
# col_3    object
# dtype: object

source: pandas_quantile.py

With numeric_only=True, only numeric columns are processed and the string column is excluded.

print(df_str.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

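When numeric_only is False (the default since pandas 2.0) and a string column is included, setting interpolation to 'linear' (the default) or 'midpoint' raises an error. With 'lower', 'higher', or 'nearest', the result is one of the two neighboring values in lexicographic order.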

# print(df_str.quantile())
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

# print(df_str.quantile(interpolation='midpoint'))
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      3      9     d
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

source: pandas_quantile.py
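What 'lower' and 'higher' mean for strings can be illustrated in plain Python, without pandas. This is only a sketch of the idea (sort lexicographically, then pick the neighbor below or above the fractional position q * (n - 1)); it is not how pandas implements it internally.

```python
import math

letters = list('kjihgfedcba')  # deliberately unsorted input
ordered = sorted(letters)      # lexicographic order: a, b, c, ...

q = 0.21
pos = q * (len(ordered) - 1)   # fractional position (about 2.1)

lower = ordered[math.floor(pos)]   # neighbor below the position
higher = ordered[math.ceil(pos)]   # neighbor above the position
print(lower, higher)
# c d
```

No subtraction between strings is needed here, which is why these options work while 'linear' and 'midpoint' do not.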

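quantile() on datetime

The following DataFrame, with a datetime column added, is used as an example.

Related article: How to specify the frequency (the freq argument) for time series data in pandas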

df_dt = df.copy()
df_dt['col_3'] = pd.date_range('2023-01-01', '2023-01-11')
print(df_dt)
#     col_1  col_2      col_3
# 0       0      0 2023-01-01
# 1       1      1 2023-01-02
# 2       2      4 2023-01-03
# 3       3      9 2023-01-04
# 4       4     16 2023-01-05
# 5       5     25 2023-01-06
# 6       6     36 2023-01-07
# 7       7     49 2023-01-08
# 8       8     64 2023-01-09
# 9       9     81 2023-01-10
# 10     10    100 2023-01-11

print(df_dt.dtypes)
# col_1             int64
# col_2             int64
# col_3    datetime64[ns]
# dtype: object

source: pandas_quantile.py

With numeric_only=True, only numeric columns are processed and the datetime column is excluded.

print(df_dt.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

Datetime columns are interpolated correctly even when interpolation is 'linear' (the default) or 'midpoint'. Of course, 'lower', 'higher', and 'nearest' also work.

print(df_dt.quantile([0.2, 0.21, 0.3]))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.1    4.5 2023-01-03 02:24:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='midpoint'))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.5    6.5 2023-01-03 12:00:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      3      9 2023-01-04
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

source: pandas_quantile.py
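The same applies when quantile() is called on a datetime Series directly; the scalar result is a Timestamp. A minimal sketch:

```python
import pandas as pd

s = pd.Series(pd.date_range('2023-01-01', '2023-01-11'))

# Median of 11 consecutive days is the 6th day.
median = s.quantile(0.5)
print(median)
# 2023-01-06 00:00:00

# 'lower' returns one of the original values without interpolation.
lower = s.quantile(0.21, interpolation='lower')
print(lower)
# 2023-01-03 00:00:00
```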

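For how to process a DataFrame or Series as time series data, see the following article.

Related article: Process pandas.DataFrame and Series as time series data

quantile() on bool

The following DataFrame, with a bool column added, is used as an example.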

df_bool = df.copy()
df_bool['col_3'] = [True, False, True, False, True, False, True, False, True, False, True]
print(df_bool)
#     col_1  col_2  col_3
# 0       0      0   True
# 1       1      1  False
# 2       2      4   True
# 3       3      9  False
# 4       4     16   True
# 5       5     25  False
# 6       6     36   True
# 7       7     49  False
# 8       8     64   True
# 9       9     81  False
# 10     10    100   True

print(df_bool.dtypes)
# col_1    int64
# col_2    int64
# col_3     bool
# dtype: object

source: pandas_quantile.py

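Since bool is a subclass of int and is treated as numeric, a bool column is processed whether numeric_only is True or False. However, as of pandas 2.1.4, quantile() raises an error if the DataFrame contains a bool column.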

BUG: Broken bool supports in pandas' quantile by NumPy's percentile behaviour change · Issue #41792 · pandas-dev/pandas

# print(df_bool.quantile())
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

# print(df_bool.quantile(numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

source: pandas_quantile.py

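As a workaround, exclude the bool column with select_dtypes() or convert it to int with astype().

Related article: Select or exclude columns of specific dtypes from a pandas.DataFrame with select_dtypes
Related article: List of pandas dtypes and conversion (casting) with astype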

print(df_bool.select_dtypes(exclude=bool))
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

print(df_bool.select_dtypes(exclude=bool).quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(df_bool.astype({'col_3': int}))
#     col_1  col_2  col_3
# 0       0      0      1
# 1       1      1      0
# 2       2      4      1
# 3       3      9      0
# 4       4     16      1
# 5       5     25      0
# 6       6     36      1
# 7       7     49      0
# 8       8     64      1
# 9       9     81      0
# 10     10    100      1

print(df_bool.astype({'col_3': int}).quantile())
# col_1     5.0
# col_2    25.0
# col_3     1.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py
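When the bool column names are not known in advance, the two methods can be combined to convert every bool column at once. This is a sketch of one possible approach; the DataFrame and its column names (num, flag) are made up for the illustration.

```python
import pandas as pd

# Hypothetical example data: 'flag' is True for even numbers.
df_flags = pd.DataFrame({
    'num': range(11),
    'flag': [i % 2 == 0 for i in range(11)],
})

# Find all bool columns, then cast only those columns to int.
bool_cols = df_flags.select_dtypes(include='bool').columns
converted = df_flags.astype({col: int for col in bool_cols})

print(converted.quantile().tolist())
# [5.0, 1.0]
```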
