pandas Data Reshaping: Grouping and Aggregation

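A grouped computation runs in three stages: split -> apply -> combine

Split: the data is divided into groups according to one or more keys.

Apply: the same calculation is run independently on each group.

Combine: the per-group results are merged into a single result.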
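The three stages can be written out by hand to see what groupby automates. This is only a sketch on small made-up data (the frame and the names pieces, sums and manual are assumptions for illustration, not part of the original article):

```python
import pandas as pd

# Hypothetical example data, chosen so the sums are easy to check by hand
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [3, 7, 9, 12, 15]})

# split: one sub-frame per distinct key value
pieces = {key: df[df['key1'] == key] for key in df['key1'].unique()}

# apply: run the same calculation on every piece
sums = {key: piece['data1'].sum() for key, piece in pieces.items()}

# combine: collect the per-group results into one Series
manual = pd.Series(sums, name='data1').rename_axis('key1')
print(manual)

# groupby carries out the same three stages in a single expression
grouped = df.groupby('key1')['data1'].sum()
print(grouped)
```

Both printed Series should hold the same per-key totals; the groupby version is shorter and avoids the explicit loops.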

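1. The grouping function: groupby

        groupby(by=None): groupby performs the split stage. The by argument names the column (or list of columns) whose values define the groups.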

import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':np.random.randn(5),
                 'data2':np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108
4    a  one -1.533791 -0.176332

# iterating over a groupby object yields (group name, sub-DataFrame) pairs
for name,group in df.groupby('key1'):
    print(name)
    print(group)

out:
a
  key1 key2     data1     data2
0    a  one  0.445011 -0.346976
1    a  two  0.129147 -1.260998
4    a  one -1.533791 -0.176332

b
  key1 key2     data1     data2
2    b  one  0.806248 -0.125555
3    b  two  0.981721 -1.633108
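Besides looping, a groupby object can be inspected directly. The sketch below uses fixed integers instead of random numbers (an assumption made so the results are reproducible) and relies on the size(), groups and get_group() members of the groupby object:

```python
import pandas as pd

# Fixed example data instead of np.random.randn, so results are repeatable
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15]})

grouped = df.groupby('key1')

# number of rows in each group
sizes = grouped.size()
print(sizes)

# row labels that belong to each group
row_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(row_labels)

# extract one group as an ordinary DataFrame
group_a = grouped.get_group('a')
print(group_a)
```

Nothing is computed on the data columns here; these calls only expose the result of the split stage.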

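# by names a single column
# numeric_only=True keeps the string column key2 out of the mean;
# recent pandas versions raise an error without it
print(df.groupby('key1').mean(numeric_only=True))
out (from a different run of the random data, so the values do not match the df printed above):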
         data1     data2
key1                    
a     1.533847  0.463332
b    -0.224867 -1.825610

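# by names several columns; the result is indexed by both keys
print(df.groupby(['key1','key2']).mean())
out (again from a different run of the random data):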
            data1     data2
key1 key2                    
a    one   2.362819  0.559232
     two  -0.124097  0.271531
b    one  -0.548848 -1.423937
     two   0.099114 -2.227282
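Grouping by several columns gives a result whose row index has one level per key. A short sketch, on fixed made-up integers rather than the random data above, of how such a result can be read and reshaped:

```python
import pandas as pd

# Fixed example data (an assumption for reproducibility)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

means = df.groupby(['key1', 'key2'])[['data1', 'data2']].mean()

# a row is addressed by a (key1, key2) tuple
a_one = means.loc[('a', 'one'), 'data1']   # mean of 3 and 15
print(a_one)

# turn both index levels back into ordinary columns
flat = means.reset_index()
print(flat)

# move the inner level (key2) into the columns
wide = means['data1'].unstack('key2')
print(wide)
```

reset_index is convenient when the result is passed on to code that expects plain columns, while unstack gives a compact key1-by-key2 table.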

2. Processing groupby results with agg

agg(func)

agg performs the apply and combine stages.

  • func as a built-in aggregation (such as max or min)
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':np.random.randn(5),
                 'data2':np.random.randn(5)})
print(df)
out:
  key1 key2     data1     data2
0    a  one -0.407514 -0.010082
1    a  two -0.774303  0.251207
2    b  one -1.189536 -2.061739
3    b  two -0.411025 -0.289213
4    a  one  1.688148 -0.434298

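# the function is given by name; recent pandas versions warn when the Python builtin max is passed
print(df.groupby('key1').agg('max'))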
out:
   key2     data1     data2
key1                         
a     two  1.688148  0.251207
b     two -0.411025 -0.289213

# agg is optional here: the max() aggregation can be called directly on the groupby object
print(df.groupby('key1').max())
out:
   key2     data1     data2
key1                         
a     two  1.688148  0.251207
b     two -0.411025 -0.289213
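The equivalence of the two spellings can be checked on fixed data. The frame below is made up for the purpose (an assumption, since the article's frame is random):

```python
import pandas as pd

# Fixed example data so the result can be verified by eye
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

grouped = df.groupby('key1')

via_agg = grouped.agg('max')     # aggregation named as a string
via_method = grouped.max()       # shortcut method on the groupby object

print(via_agg)
print(via_agg.equals(via_method))
```

The string form is mostly useful when the aggregation name is chosen at run time or when several functions are combined in one call.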
  • func as a user-defined function
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':[3,7,9,12,15],
                 'data2':[1,6,14,23,7]})
print(df)
out:

  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7

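# select the numeric columns first: the subtraction is not defined for the strings in key2
print(df.groupby('key1')[['data1','data2']].agg(lambda x:x.max()-x.min()))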
out:
      data1  data2
key1              
a        12      6
b         3      9
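A named function works the same way as the lambda and is easier to reuse. In this sketch value_range is a hypothetical helper name, not something defined by pandas or by the original article:

```python
import pandas as pd

def value_range(s):
    """Spread of one column within one group: largest value minus smallest."""
    return s.max() - s.min()

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

# agg calls value_range once per group and per selected column
spread = df.groupby('key1')[['data1', 'data2']].agg(value_range)
print(spread)
```

The function receives a Series holding one group's values for one column and should return a single value.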
  • func as a list of functions
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':[3,7,9,12,15],
                 'data2':[1,6,14,23,7]})
print(df)
out:

  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7
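# apply several aggregation functions at once
# a (name, function) tuple sets the name of the resulting column
# select the numeric columns first: mean and std are not defined for the strings in key2
print(df.groupby('key1')[['data1','data2']].agg(['mean','std','sum',('range',lambda s:s.max()-s.min())]))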

out:
       data1                          data2                    
           mean       std sum range       mean       std sum range
key1                                                              
a      8.333333  6.110101  25    12   4.666667  3.214550  14     6
b     10.500000  2.121320  21     3  18.500000  6.363961  37     9
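A list of functions produces two-level column labels: the source column on top and the function name below. The following sketch shows one way to read and flatten them; the joined names such as data1_mean are an illustrative convention, not something pandas generates:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

stats = df.groupby('key1')[['data1', 'data2']].agg(['mean', 'sum'])

# all statistics computed for one source column
print(stats['data1'])

# a single statistic, addressed by a (column, function) tuple
data2_sum = stats[('data2', 'sum')]
print(data2_sum)

# flatten the two-level labels into plain strings
flat = stats.copy()
flat.columns = ['_'.join(col) for col in stats.columns]
print(flat)
```

Flat labels are often easier to handle when the table is exported or merged with other data.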
  • func as a dict whose keys are column names and whose values are functions
import pandas as pd
import numpy as np
df=pd.DataFrame({'key1':['a','a','b','b','a'],
                 'key2':['one','two','one','two','one'],
                 'data1':[3,7,9,12,15],
                 'data2':[1,6,14,23,7]})
print(df)
out:
  key1 key2  data1  data2
0    a  one      3      1
1    a  two      7      6
2    b  one      9     14
3    b  two     12     23
4    a  one     15      7

# apply a different set of aggregation functions to each column
dict_map={'data1':['mean',('range',lambda s:s.max()-s.min())],'data2':'sum'}
print(df.groupby('key1').agg(dict_map))
out:
         data1       data2
           mean range   sum
key1                       
a      8.333333    12    14
b     10.500000     3    37
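An alternative to the dict form is pandas' keyword-based named aggregation, where each keyword is the output column name and its value is a (source column, function) pair. The original article does not cover it, so treat this as a supplementary sketch; the output names are arbitrary choices:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [3, 7, 9, 12, 15],
                   'data2': [1, 6, 14, 23, 7]})

# keyword = output column name, value = (source column, aggregation)
summary = df.groupby('key1').agg(
    data1_mean=('data1', 'mean'),
    data1_range=('data1', lambda s: s.max() - s.min()),
    data2_sum=('data2', 'sum'),
)
print(summary)
```

The result has single-level column labels, so no flattening step is needed afterwards.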

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/4830255.html
