Data Aggregation and Group Operations

This article takes a close look at the GroupBy mechanism in the pandas library, including how to iterate over groups, select columns, group with dicts and Series, and group by index levels. It then covers data aggregation, such as applying functions column by column and returning aggregated results without a row index. It also walks through a range of practical applications: suppressing group keys, quantile analysis, filling missing values, random sampling, weighted averages, and group-wise linear regression. Finally, it explains how to build and use pivot tables and cross-tabulations, a powerful pair of data-analysis tools.


The GroupBy Mechanism

import numpy as np
import pandas as pd
df = pd.DataFrame({'key1':['a','a','d','d','a'],
                  'key2':['one','two','one','two','one'],
                  'data1': np.random.randn(5),
                  'data2': np.random.randn(5)})
df
      data1     data2 key1 key2
0  0.398171  0.618838    a  one
1  1.406440  0.007411    a  two
2  0.842236  0.090966    d  one
3 -0.377231  0.431523    d  two
4 -0.525386 -1.980548    a  one
# compute the mean of the data1 column grouped by the key1 labels
grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f3f19fbc438>
grouped.mean()
key1
a    0.426408
d    0.232502
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1  key2
a     one    -0.063608
      two     1.406440
d     one     0.842236
      two    -0.377231
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1
a    -0.063608  1.406440
d     0.842236 -0.377231
states = np.array(['Ohio','California','California','Ohio','Ohio'])
years = np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California  2005    1.406440
            2006    0.842236
Ohio        2005    0.010470
            2006   -0.525386
Name: data1, dtype: float64
df.groupby('key1').mean()
         data1     data2
key1
a     0.426408 -0.451433
d     0.232502  0.261245
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2
a    one  -0.063608 -0.680855
     two   1.406440  0.007411
d    one   0.842236  0.090966
     two  -0.377231  0.431523
df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
d     one     1
      two     1
dtype: int64
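Note that missing values in a group key are excluded from the result by default. A minimal sketch of the alternative behavior, assuming pandas 1.1 or later (which added the dropna keyword to groupby):
df2 = df.copy()
df2.loc[4, 'key1'] = np.nan               # make one key missing
df2.groupby('key1', dropna=False).size()  # dropna=False keeps the NaN group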
Iterating over Groups
for name,group in df.groupby('key1'):
    print(name)
    print(group)
a
      data1     data2 key1 key2
0  0.398171  0.618838    a  one
1  1.406440  0.007411    a  two
4 -0.525386 -1.980548    a  one
d
      data1     data2 key1 key2
2  0.842236  0.090966    d  one
3 -0.377231  0.431523    d  two
for name,group in df.groupby(['key1','key2']):
    print(name)
    print(group)
('a', 'one')
      data1     data2 key1 key2
0  0.398171  0.618838    a  one
4 -0.525386 -1.980548    a  one
('a', 'two')
     data1     data2 key1 key2
1  1.40644  0.007411    a  two
('d', 'one')
      data1     data2 key1 key2
2  0.842236  0.090966    d  one
('d', 'two')
      data1     data2 key1 key2
3 -0.377231  0.431523    d  two
pieces = dict(list(df.groupby('key1')))
pieces['d']
      data1     data2 key1 key2
2  0.842236  0.090966    d  one
3 -0.377231  0.431523    d  two
df.dtypes
data1    float64
data2    float64
key1      object
key2      object
dtype: object
grouped = df.groupby(df.dtypes, axis = 1)  # axis=1 groups the columns rather than the rows
for name,group in grouped:
    print(name)
    print(group)
float64
      data1     data2
0  0.398171  0.618838
1  1.406440  0.007411
2  0.842236  0.090966
3 -0.377231  0.431523
4 -0.525386 -1.980548
object
  key1 key2
0    a  one
1    a  two
2    d  one
3    d  two
4    a  one
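Grouping along axis=1 is deprecated in recent pandas releases. If the goal is simply to split the columns by dtype, select_dtypes gives the same partition; a short sketch:
df.select_dtypes(include='float64')  # the data1/data2 block
df.select_dtypes(include='object')   # the key1/key2 block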
Selecting a Column or a Subset of Columns
df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.680855
     two   0.007411
d    one   0.090966
     two   0.431523
s_grouped = df.groupby(['key1','key2'])['data2']
s_grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f3f1974de10>
s_grouped.mean()
key1  key2
a     one    -0.680855
      two     0.007411
d     one     0.090966
      two     0.431523
Name: data2, dtype: float64
Grouping with Dicts and Series
people = pd.DataFrame(np.random.randn(5,5),
                     columns = ['a','b','c','d','e'],
                     index = ['Joe','Steve','Wes','Jim','Travis'])
people
               a         b         c         d         e
Joe    -1.650753 -1.182232 -0.534644  0.344981 -0.747273
Steve   1.481961  1.266301 -0.758866  0.931459 -0.512757
Wes    -1.755591 -0.003535  0.910192 -0.187150 -0.603618
Jim     1.520320 -1.055722 -1.221894  0.741607  1.282918
Travis -0.271283  0.343674 -0.210378 -0.503580 -0.816606
people.iloc[2:3,[1,2]] = np.nan
people
               a         b         c         d         e
Joe    -1.650753 -1.182232 -0.534644  0.344981 -0.747273
Steve   1.481961  1.266301 -0.758866  0.931459 -0.512757
Wes    -1.755591       NaN       NaN -0.187150 -0.603618
Jim     1.520320 -1.055722 -1.221894  0.741607  1.282918
Travis -0.271283  0.343674 -0.210378 -0.503580 -0.816606
mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
by_columns = people.groupby(mapping, axis = 1)
by_columns.sum()
            blue       red
Joe    -0.189663 -3.580258
Steve   0.172593  2.235505
Wes    -0.187150 -2.359209
Jim    -0.480287  1.747516
Travis -0.713958 -0.744215
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series, axis = 1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
Grouping with Functions
people.groupby(len).sum() # group by the length of the index labels (the names)
          a         b         c         d         e
3 -1.886025 -2.237954 -1.756538  0.899438 -0.067973
5  1.481961  1.266301 -0.758866  0.931459 -0.512757
6 -0.271283  0.343674 -0.210378 -0.503580 -0.816606
key_list = ['one','one','one','two','two']
people.groupby([len, key_list]).min()
              a         b         c         d         e
3 one -1.755591 -1.182232 -0.534644 -0.187150 -0.747273
  two  1.520320 -1.055722 -1.221894  0.741607  1.282918
5 one  1.481961  1.266301 -0.758866  0.931459 -0.512757
6 two -0.271283  0.343674 -0.210378 -0.503580 -0.816606
Grouping by Index Levels
columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],
                                    [1,3,5,1,3]],
                                    names = ['cty','tenor'])
hier_df = pd.DataFrame(np.random.randn(4,5), columns=columns)
hier_df
cty          US                            JP
tenor         1         3         5         1         3
0     -0.548593 -0.488964 -0.480414  0.929712 -0.096871
1      1.550861  0.652521  0.231158  0.717516 -2.594271
2      0.014603 -0.308289  0.161634  0.398446  0.437358
3      0.191229 -1.140538 -0.713786  0.549353  0.838565
hier_df.groupby(level = 'cty', axis = 1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
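The same pattern works for any level; for example, a sketch grouping on the inner tenor level instead (output omitted):
hier_df.groupby(level='tenor', axis=1).count()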

Data Aggregation

df
      data1     data2 key1 key2
0  0.398171  0.618838    a  one
1  1.406440  0.007411    a  two
2  0.842236  0.090966    d  one
3 -0.377231  0.431523    d  two
4 -0.525386 -1.980548    a  one
grouped = df.groupby('key1')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3f19767198>
grouped['data1'].quantile(0.9) # quantile computes sample quantiles of a Series or of DataFrame columns
key1
a    1.204786
d    0.720289
Name: data1, dtype: float64
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
         data1     data2
key1
a     1.931826  2.599386
d     1.219467  0.340557
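As a variation, named aggregation (available since pandas 0.25) lets you choose the input column, the function, and the output column name in a single call; a sketch using the same peak_to_peak function:
grouped.agg(data1_range=('data1', peak_to_peak),  # custom function
            data2_mean=('data2', 'mean'))         # built-in, by name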
grouped.describe()
     data1                                                                        data2
     count      mean       std       min       25%       50%       75%       max count      mean       std       min       25%       50%       75%       max
key1
a      3.0  0.426408  0.966223 -0.525386 -0.063608  0.398171  0.902305  1.406440   3.0 -0.451433  1.359082 -1.980548 -0.986568  0.007411  0.313125  0.618838
d      2.0  0.232502  0.862294 -0.377231 -0.072364  0.232502  0.537369  0.842236   2.0  0.261245  0.240810  0.090966  0.176106  0.261245  0.346384  0.431523
Column-Wise and Multiple Function Application
tips = pd.read_csv('examples/tips.csv')
# add the tip as a fraction of the total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
5       25.29  4.71     No  Sun  Dinner     4  0.186240
grouped = tips.groupby(['day','smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
grouped_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
grouped_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
functions = ['count','mean','max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)  # note the list; tuple indexing of a GroupBy is deprecated
result
            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
result['tip_pct']
             count      mean       max
day  smoker
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durchschnitt', 'mean'),('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
                  tip_pct              total_bill
             Durchschnitt Abweichung Durchschnitt  Abweichung
day  smoker
Fri  No          0.151650   0.000791    18.420000   25.596333
     Yes         0.174783   0.002631    16.813333   82.562438
Sat  No          0.158048   0.001581    19.661778   79.908965
     Yes         0.147906   0.003767    21.276667  101.387535
Sun  No          0.160113   0.001793    20.506667   66.099980
     Yes         0.187250   0.023757    24.120000  109.046044
Thur No          0.160298   0.001503    17.113111   59.625081
     Yes         0.163863   0.001551    19.190588   69.808518
grouped.agg({'tip': np.max, 'size':'sum'})
               tip  size
day  smoker
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
            'size' : 'sum'})
            size   tip_pct
             sum       min       max      mean       std
day  smoker
Fri  No        9  0.120385  0.187735  0.151650  0.028123
     Yes      31  0.103555  0.263480  0.174783  0.051293
Sat  No      115  0.056797  0.291990  0.158048  0.039767
     Yes     104  0.035638  0.325733  0.147906  0.061375
Sun  No      167  0.059447  0.252672  0.160113  0.042347
     Yes      49  0.065660  0.710345  0.187250  0.154134
Thur No      112  0.072961  0.266312  0.160298  0.038774
     Yes      40  0.090014  0.241255  0.163863  0.039389
Returning Aggregated Data Without Row Indexes
tips.groupby(['day','smoker'], as_index = False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
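The same result can be obtained by aggregating with the default hierarchical index and then calling reset_index, although as_index=False avoids building the unneeded index in the first place. A sketch (numeric_only=True is required on newer pandas, which no longer drops non-numeric columns silently):
tips.groupby(['day', 'smoker']).mean(numeric_only=True).reset_index()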

Apply: General Split-Apply-Combine

def top(df, n=5, column = 'tip_pct'):
    return df.sort_values(by = column)[-n:]
top(tips, n = 6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
tips.groupby(['smoker','day']).apply(top,n=1, column = 'total_bill')
                  total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  94         22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212        48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156        48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142        41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95         40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170        50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182        45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197        43.11   5.00    Yes  Thur   Lunch     4  0.115982
result = tips.groupby('smoker')['tip_pct'].describe()
result
        count      mean       std       min       25%       50%       75%       max
smoker
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
result.unstack()
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64
Suppressing the Group Keys
tips.groupby('smoker', group_keys = False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
Quantile and Bucket Analysis
frame = pd.DataFrame({'data1':np.random.randn(1000),
                     'data2':np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]
0     (0.134, 1.778]
1    (-1.509, 0.134]
2    (-1.509, 0.134]
3    (-1.509, 0.134]
4    (-1.509, 0.134]
5    (-1.509, 0.134]
6    (-1.509, 0.134]
7     (0.134, 1.778]
8    (-1.509, 0.134]
9    (-1.509, 0.134]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.16, -1.509] < (-1.509, 0.134] < (0.134, 1.778] < (1.778, 3.422]]
def get_stats(group):
    return {'min':group.min(),'max':group.max(), 'count':group.count(), 'mean':group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
                 count       max      mean       min
data1
(-3.16, -1.509]   56.0  2.734228  0.053983 -2.932092
(-1.509, 0.134]  492.0  2.719045 -0.027057 -2.955542
(0.134, 1.778]   409.0  3.140240  0.027807 -2.777564
(1.778, 3.422]    43.0  1.889577 -0.137910 -1.716137
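The same bucket statistics can be computed more idiomatically with agg and a list of function names, which skips the hand-built dict and the unstack step (column order may differ); a sketch:
grouped.agg(['min', 'max', 'count', 'mean'])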
grouping = pd.qcut(frame.data1, 10, labels = False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
       count       max      mean       min
data1
0      100.0  2.734228  0.090541 -2.932092
1      100.0  2.496448  0.066131 -2.085507
2      100.0  2.719045 -0.172793 -2.885545
3      100.0  2.550963 -0.040588 -2.152490
4      100.0  2.660218 -0.090275 -2.739500
5      100.0  2.599636 -0.062196 -2.955542
6      100.0  2.739102 -0.090731 -2.090401
7      100.0  3.140240  0.151551 -2.273333
8      100.0  3.080427  0.212375 -1.990595
9      100.0  1.889577 -0.112474 -2.777564
Filling Missing Values with Group-Specific Values
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s
0         NaN
1    1.010049
2         NaN
3   -1.347714
4         NaN
5   -0.419845
dtype: float64
s.fillna(s.mean())
0   -0.252503
1    1.010049
2   -0.252503
3   -1.347714
4   -0.252503
5   -0.419845
dtype: float64
states = ['Ohio','New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East']*4 + ['West']*4
data = pd.Series(np.random.randn(8),index = states)
data
Ohio         -2.481099
New York      0.259243
Vermont      -0.327158
Florida      -1.354184
Oregon        1.049335
Nevada       -0.173168
California   -0.202508
Idaho         0.405706
dtype: float64
data[['Vermont','Nevada','Idaho']] = np.nan
data
Ohio         -2.481099
New York      0.259243
Vermont            NaN
Florida      -1.354184
Oregon        1.049335
Nevada             NaN
California   -0.202508
Idaho              NaN
dtype: float64
data.groupby(group_key).mean()
East   -1.192014
West    0.423414
dtype: float64
fill_mean = lambda g : g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio         -2.481099
New York      0.259243
Vermont      -1.192014
Florida      -1.354184
Oregon        1.049335
Nevada        0.423414
California   -0.202508
Idaho         0.423414
dtype: float64
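The fill value can also be predefined per group rather than computed from the data, since apply attaches the group key to each piece as its name attribute. A sketch with an illustrative fill_values mapping:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])  # g.name is the group key
data.groupby(group_key).apply(fill_func)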
Random Sampling and Permutation
# Hearts, Spades, Clubs, Diamonds
suits = ['H','S','C','D']
card_val = (list(range(1,11))+ [10]*3) * 4
base_names = ['A'] + list(range(2,11)) + ['J','K','Q']
cards = []
for suit in ['H','S','C','D']:
    cards.extend(str(num)+suit for num in base_names)
deck = pd.Series(card_val,index = cards)
deck[:13]
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)
9S     9
QC    10
7H     7
5H     5
6S     6
dtype: int64
get_suit = lambda card : card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n = 2)
C  AC     1
   4C     4
D  AD     1
   QD    10
H  QH    10
   8H     8
S  8S     8
   3S     3
dtype: int64
deck.groupby(get_suit, group_keys = False).apply(draw, n= 2)
10C    10
3C      3
9D      9
6D      6
9H      9
4H      4
7S      7
QS     10
dtype: int64
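Because draw relies on Series.sample, each run returns different cards. For reproducible draws, a seed can be threaded through via random_state; a sketch with a hypothetical draw_seeded helper:
def draw_seeded(deck, n=5, seed=0):
    return deck.sample(n, random_state=seed)  # fixed seed, repeatable draw
deck.groupby(get_suit, group_keys=False).apply(draw_seeded, n=2)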
Group Weighted Average and Correlation
df = pd.DataFrame({'category': ['a','a','a','a','b','b','b','b'],
                  'data': np.random.randn(8),
                  'weights':np.random.rand(8)})
df
  category      data   weights
0        a -0.890148  0.524090
1        a  0.267026  0.055750
2        a  0.790869  0.465993
3        a  1.667837  0.085079
4        b  2.038612  0.772350
5        b  0.515091  0.681560
6        b -0.334257  0.869390
7        b -0.353495  0.517881
grouped = df.groupby('category')
get_wavg = lambda g : np.average(g['data'], weights = g['weights'])
grouped.apply(get_wavg)
category
a    0.051999
b    0.511026
dtype: float64
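As a sanity check, the weighted average of one group is just sum(weights * data) / sum(weights); a sketch for category 'a':
sub = df[df['category'] == 'a']
(sub['data'] * sub['weights']).sum() / sub['weights'].sum()  # should match 0.051999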
close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates = True, index_col = 0)
close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
AAPLMSFTXOMSPX
2011-10-11400.2927.0076.271195.54
2011-10-12402.1926.9677.161207.25
2011-10-13408.4327.1876.371203.66
2011-10-14422.0027.2778.111224.58
spx_corr = lambda x : x.corrwith(x['SPX'])
rets = close_px.pct_change().dropna()
get_year = lambda x : x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
Group-Wise Linear Regression
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so the intercept column is not written into the original frame
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()  # ordinary least squares (OLS) regression
    return result.params
by_year.apply(regress,'AAPL', ['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
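The intercept column can also be added with statsmodels' own helper instead of being assigned by hand; a sketch with an illustrative regress2 (the intercept is then labeled 'const' rather than 'intercept'):
def regress2(data, yvar, xvars):
    X = sm.add_constant(data[xvars])  # adds a 'const' column of ones
    return sm.OLS(data[yvar], X).fit().params
by_year.apply(regress2, 'AAPL', ['SPX'])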

Pivot Tables and Cross-Tabulation

tips.pivot_table(index = ['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
tips.pivot_table(['tip_pct', 'size'], index = ['time','day'], columns = 'smoker')
                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863
tips.pivot_table(['tip_pct', 'size'], index = ['time', 'day'], columns = 'smoker', margins = True)
                 size                      tip_pct
smoker             No       Yes       All       No       Yes       All
time   day
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803
tips.pivot_table('tip_pct', index = ['time', 'smoker'], columns = 'day', aggfunc= len, margins = True)
day             Fri   Sat   Sun  Thur    All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106.0
       Yes      9.0  42.0  19.0   NaN   70.0
Lunch  No       1.0   NaN   NaN  44.0   45.0
       Yes      6.0   NaN   NaN  17.0   23.0
All            19.0  87.0  76.0  62.0  244.0
tips.pivot_table('tip_pct', index = ['time','size', 'smoker'], columns = 'day', 
                 aggfunc= 'mean', fill_value = 0)
day                        Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No        0.000000  0.137931  0.000000  0.000000
            Yes       0.000000  0.325733  0.000000  0.000000
       2    No        0.139622  0.162705  0.168859  0.159744
            Yes       0.171297  0.148668  0.207893  0.000000
       3    No        0.000000  0.154661  0.152663  0.000000
            Yes       0.000000  0.144995  0.152660  0.000000
       4    No        0.000000  0.150096  0.148143  0.000000
            Yes       0.117750  0.124515  0.193370  0.000000
       5    No        0.000000  0.000000  0.206928  0.000000
            Yes       0.000000  0.106572  0.065660  0.000000
       6    No        0.000000  0.000000  0.103799  0.000000
Lunch  1    No        0.000000  0.000000  0.000000  0.181728
            Yes       0.223776  0.000000  0.000000  0.000000
       2    No        0.000000  0.000000  0.000000  0.166005
            Yes       0.181969  0.000000  0.000000  0.158843
       3    No        0.187735  0.000000  0.000000  0.084246
            Yes       0.000000  0.000000  0.000000  0.204952
       4    No        0.000000  0.000000  0.000000  0.138919
            Yes       0.000000  0.000000  0.000000  0.155410
       5    No        0.000000  0.000000  0.000000  0.121389
       6    No        0.000000  0.000000  0.000000  0.173706
Cross-Tabulation: Crosstab
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
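crosstab can also aggregate a column of values rather than just counting, mirroring pivot_table; a sketch computing the mean tip percentage per cell:
pd.crosstab([tips.time, tips.day], tips.smoker,
            values=tips.tip_pct, aggfunc='mean')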