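Group operations follow the split -> apply -> combine process:
Split: the key or criterion used to divide the rows into groups
Apply: the computation run on each group
Combine: merging the per-group results into a single result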
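To make the three steps concrete, here is a minimal sketch that performs them by hand and then with groupby. The frame and its column names (`key`, `value`) are made up for illustration and are not taken from the examples in this article.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the three steps.
df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b', 'a'],
    'value': [3, 7, 9, 12, 15],
})

# Split: partition the rows by the distinct values of 'key'.
pieces = {k: df.loc[df['key'] == k, 'value'] for k in df['key'].unique()}

# Apply: run the same computation on every piece.
results = {k: piece.sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series.
manual = pd.Series(results).sort_index()

# The same three steps, done by pandas in a single expression.
builtin = df.groupby('key')['value'].sum()

print(manual)
print(builtin)
```

Both printouts should show the same per-group sums; the groupby version simply hides the bookkeeping.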
1. The grouping function: groupby
groupby(by=None) implements the split step.
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108
4    a  one -1.533791 -0.176332

# Iterating over a groupby object yields (group name, sub-DataFrame) pairs
for name, group in df.groupby('key1'):
    print(name)
    print(group)
out:
a
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
4    a  one -1.533791 -0.176332
b
  key1 key2     data1     data2
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108

# Note: the aggregated values shown below come from a different run;
# np.random.randn generates new data each time, so they do not match the frame above.

# by is a single column
# numeric_only=True skips the string column key2; pandas 2.0+ raises a TypeError without it
print(df.groupby('key1').mean(numeric_only=True))
out:
         data1     data2
key1
a     1.533847  0.463332
b    -0.224867 -1.825610

# by is a list of columns
print(df.groupby(['key1', 'key2']).mean())
out:
              data1     data2
key1 key2
a    one   2.362819  0.559232
     two  -0.124097  0.271531
b    one  -0.548848 -1.423937
     two   0.099114 -2.227282

2. Using agg to process groupby results
agg(func)
agg implements the apply and combine steps.
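As a rough illustration of that claim, the sketch below applies a function to each group by hand, combines the results, and compares that with a single agg call. It uses a cut-down version of the example frame (only `key1` and `data1`); the variable names `by_hand` and `with_agg` are chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [3, 7, 9, 12, 15],
})

# Split: groupby only records how the rows are partitioned.
grouped = df.groupby('key1')['data1']

# Apply the function to each group by hand, then combine into one Series.
by_hand = pd.Series({name: group.max() for name, group in grouped})

# agg performs the apply and combine steps in one call.
with_agg = grouped.agg('max')

print(by_hand)
print(with_agg)
```

The two results should hold the same value for each group; agg additionally labels the index with the grouping column name.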
- func is a built-in aggregation function (such as max or min)
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one -0.407514 -0.010082
1    a  two -0.774303  0.251207
2    b  one -1.189536 -2.061739
3    b  two -0.411025 -0.289213
4    a  one  1.688148 -0.434298

# Pass the aggregation by name; recent pandas versions warn when given the Python builtin max
print(df.groupby('key1').agg('max'))
out:
     key2     data1     data2
key1
a     two  1.688148  0.251207
b     two -0.411025 -0.289213

# agg is optional here: the max() aggregation method can be called directly
print(df.groupby('key1').max())
out:
     key2     data1     data2
key1
a     two  1.688148  0.251207
b     two -0.411025 -0.289213
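- func is a user-defined function
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7

# Select the numeric columns first: subtraction is undefined for the strings in key2,
# and pandas 2.0+ no longer drops such columns silently
print(df.groupby('key1')[['data1', 'data2']].agg(lambda x: x.max() - x.min()))
out:
      data1  data2
key1
a        12      6
b         3      9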
- func is a list of functions
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7

# Apply several aggregation functions at once
# A (name, function) tuple sets the name of the resulting column
# The numeric columns are selected first because mean and std do not apply to the strings in key2
print(df.groupby('key1')[['data1', 'data2']].agg(
    ['mean', 'std', 'sum', ('range', lambda x: x.max() - x.min())]))
out:
          data1                            data2
           mean       std sum range        mean       std sum range
key1
a      8.333333  6.110101  25    12    4.666667  3.214550  14     6
b     10.500000  2.121320  21     3   18.500000  6.363961  37     9
- func is a dict whose keys are column names and whose values are functions
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7

# Apply a different set of aggregation functions to each column
dict_map = {'data1': ['mean', ('range', lambda x: x.max() - x.min())],
            'data2': 'sum'}
print(df.groupby('key1').agg(dict_map))
out:
          data1       data2
           mean range   sum
key1
a      8.333333    12    14
b     10.500000     3    37
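The result of a dict-style agg has two-level column labels (column name, function name), which affects how you read values back. The sketch below, which reuses the integer data from the example above, shows one way to select a column by its tuple label and one way to flatten the labels; the underscore-joined names are a convention chosen for this sketch, not something pandas requires.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

# Every value is a list here, so each result column gets a two-level label.
dict_map = {'data1': ['mean', ('range', lambda x: x.max() - x.min())],
            'data2': ['sum']}
result = df.groupby('key1').agg(dict_map)

# Select one result column by its (column, function) label.
data1_range = result[('data1', 'range')]
print(data1_range)

# Optionally flatten the two levels into single strings such as 'data1_mean'.
flat = result.copy()
flat.columns = ['_'.join(label) for label in flat.columns]
print(flat.columns.tolist())
```

Flat names are often easier to work with when the result is exported or merged with other frames.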