方法1:
合并
crop列
>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],... x['Crop2'],... x['Crop3']]),axis=1)>>> df1.head() County Crop1 Crop2 Crop3 Total_pop combined_temp0 Harney grain melons apples 2000 [grain, melons, apples]1 Baker melons grain apples 1500 [melons, grain, apples]2 Wheeler melons grain apples 3000 [melons, grain, apples]3 Hood River apples melons grain 1500 [apples, melons, grain]4 Wasco pears carrots raddish 2000 [pears, carrots, raddish]
使它成为一个排序的元组
>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)>>> df1.head() County Crop1 Crop2 ... Total_pop combined_temp sorted0 Harney grain melons ... 2000 [grain, melons, apples] (apples, grain, melons)1 Baker melons grain ... 1500 [melons, grain, apples] (apples, grain, melons)2 Wheeler melons grain ... 3000 [melons, grain, apples] (apples, grain, melons)3 Hood River apples melons ... 1500 [apples, melons, grain] (apples, grain, melons)4 Wasco pears carrots ... 2000 [pears, carrots, raddish] (carrots, pears, raddish)
然后按 *** 作进行常规分组
>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()>>> df1_groupedsorted Total_pop0 (apples, grain, melons) 80001 (carrots, pears, raddish) 9200
方法2: 基于ws-apprentice的答案的简短
df = df1.copy() grouping_cols = ['Crop1', 'Crop2', 'Crop3'] df[grouping_cols] = pd.Dataframe(df.loc[:, grouping_cols] .apply(set, axis=1) .apply(sorted).values .tolist(), columns=grouping_cols) >>> df.head()County Crop1 Crop2 Crop3 Total_pop 0 Harney apples grain melons 2000 1 Baker apples grain melons 1500 2 Wheeler apples grain melons 3000 3 Hood River apples grain melons 1500 4 Wasco carrots pears raddish 2000
现在按组分组
>>> df.groupby(grouping_cols).Total_pop.sum() Crop1 Crop2 Crop3 apples grain melons 8000 carrots pears raddish 9200 Name: Total_pop, dtype: int64
但我个人更喜欢使用numpy这个答案
由于您的数据似乎可以保证每个国家/地区拥有3种独特的农作物(“我正在按县级列出前三名的农作物一览表。”),因此可以对值进行排序并重新分配。
import numpy as npcols = ['Crop1', 'Crop2', 'Crop3']df1[cols] = np.sort(df1[cols].values, axis=1) County Crop1 Crop2 Crop3 Total_pop0 Harney apples grain melons 20001 Baker apples grain melons 15002 Wheeler apples grain melons 30003 Hood River apples grain melons 15004 Wasco carrots pears raddish 20005 Morrow carrots pears raddish 25006 Union carrots pears raddish 27007 Lake carrots pears raddish 2000
然后总结一下:
df1.groupby(cols).sum()# Total_pop#Crop1 Crop2 Crop3 #apples grain melons 8000#carrots pears raddish 9200
好处是您避免使用Series.apply或.apply(axis=1)。对于large Dataframes,性能差异非常明显:
df1 = pd.concat([df1]*10000, ignore_index=True)cols = ['Crop1', 'Crop2', 'Crop3']%timeit df1[cols] = np.sort(df1[cols].values, axis=1)#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)to_sum = ['Crop1', 'Crop2', 'Crop3']%timeit df1[to_sum] = pd.Dataframe(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)