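In Spark's DataFrame API, cube and rollup are the counterparts of the GROUPING SETS clause in Spark SQL.
Both are mainly used to group by multiple columns. Since Spark already has groupBy, what are these two functions for, and how do they differ from each other?
The book Spark: The Definitive Guide explains them like this: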
cube:
Rather than treating elements hierarchically, a cube does the same thing across all dimensions.
rollup:
A rollup is a multidimensional aggregation that performs a variety of group-by style calculations.
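On their own, these definitions are not very illuminating.
In practice, cube aggregates over every possible combination of the key columns: every subset of the keys, including the empty set, which yields the grand total.
rollup aggregates over a hierarchy of the keys, drilling down from left to right: first no keys, then the first key, then the first two keys, and so on.
groupBy itself aggregates over exactly one combination: all of the given columns together.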
An example makes this clearer. Suppose we group by the three columns year, month and day, and count the rows in each group.
cube is then equivalent to running all of the following queries:
SELECT COUNT(*) FROM table GROUP BY year, month, day
SELECT COUNT(*) FROM table GROUP BY year, month
SELECT COUNT(*) FROM table GROUP BY year, day
SELECT COUNT(*) FROM table GROUP BY year
SELECT COUNT(*) FROM table GROUP BY month, day
SELECT COUNT(*) FROM table GROUP BY month
SELECT COUNT(*) FROM table GROUP BY day
SELECT COUNT(*) FROM table
The corresponding rollup is equivalent to running these queries:
SELECT COUNT(*) FROM table GROUP BY year, month, day
SELECT COUNT(*) FROM table GROUP BY year, month
SELECT COUNT(*) FROM table GROUP BY year
SELECT COUNT(*) FROM table
groupBy runs only the single query SELECT COUNT(*) FROM table GROUP BY year, month, day.
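The sets of grouping columns for the year, month, day case can also be enumerated programmatically. The following is a minimal sketch in plain Python, not Spark code: the helper names cube_grouping_sets and rollup_grouping_sets are made up for illustration, and the only assumption is that cube corresponds to all subsets of the columns while rollup corresponds to all left-to-right prefixes.

```python
from itertools import combinations


def cube_grouping_sets(cols):
    """Every subset of cols (order preserved), from all columns down to none."""
    return [combo
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]


def rollup_grouping_sets(cols):
    """Every left-to-right prefix of cols, from all columns down to none."""
    return [tuple(cols[:size]) for size in range(len(cols), -1, -1)]


cols = ["year", "month", "day"]
cube_sets = cube_grouping_sets(cols)
rollup_sets = rollup_grouping_sets(cols)
groupby_sets = [tuple(cols)]  # plain groupBy: exactly one grouping set

for label, sets in [("cube", cube_sets),
                    ("rollup", rollup_sets),
                    ("groupBy", groupby_sets)]:
    print(label, len(sets), sets)
```

For n columns this gives 2^n grouping sets for cube and n + 1 for rollup, and every rollup grouping set also appears among the cube grouping sets.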
To go one step further, here is a PySpark example.
The data looks like this:
+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+

The cube operation:
from pyspark.sql.functions import sum  # Spark's column aggregate; shadows Python's built-in sum

df.cube('category', 'name').agg(sum('how_many')).show()

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+
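As the output shows, cube performed four kinds of grouping:
1. the sum of how_many for each category alone (name is null)
2. the sum of how_many for each name alone (category is null)
3. the sum of how_many for each combination of category and name
4. the grand total of how_many over all rows (both columns are null)

The rollup operation: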
df.rollup('category', 'name').agg(sum('how_many')).show()

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+
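rollup performed three kinds of grouping:
1. the sum of how_many for each combination of category and name
2. the sum of how_many for each category alone (name is null)
3. the grand total of how_many over all rows (both columns are null)
Unlike cube, rollup produces no rows grouped by name alone, because name is not a left-to-right prefix of (category, name).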
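The two result tables can be cross-checked without a Spark session. The sketch below is plain Python rather than PySpark: the aggregate helper and the CUBE and ROLLUP constants are hypothetical names introduced here, None stands in for Spark's null marker, and the rows are copied from the sample data above.

```python
from collections import defaultdict

# (category, name, how_many), copied from the sample data
rows = [
    ("insurance", "Janusz", 0), ("savings account", "Grażyna", 1),
    ("credit card", "Sebastian", 0), ("mortgage", "Janusz", 2),
    ("term deposit", "Janusz", 4), ("insurance", "Grażyna", 2),
    ("savings account", "Janusz", 5), ("credit card", "Sebastian", 2),
    ("mortgage", "Sebastian", 4), ("term deposit", "Janusz", 9),
    ("insurance", "Grażyna", 3), ("savings account", "Grażyna", 1),
    ("savings account", "Sebastian", 0), ("savings account", "Sebastian", 2),
    ("credit card", "Sebastian", 1),
]


def aggregate(rows, grouping_sets):
    """Sum how_many per grouping set; a column left out of a set becomes None."""
    totals = defaultdict(int)
    for category, name, how_many in rows:
        for keep_category, keep_name in grouping_sets:
            key = (category if keep_category else None,
                   name if keep_name else None)
            totals[key] += how_many
    return dict(totals)


# Each grouping set is (keep category?, keep name?)
CUBE = [(True, True), (True, False), (False, True), (False, False)]
ROLLUP = [(True, True), (True, False), (False, False)]

cube_result = aggregate(rows, CUBE)
rollup_result = aggregate(rows, ROLLUP)

print(len(cube_result), "cube rows;", len(rollup_result), "rollup rows")
print("rows only in cube:", sorted(set(cube_result) - set(rollup_result), key=str))
```

The keys that appear only in the cube result are the three per-name totals, which is exactly the difference between the two output tables.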
Reference:
1. Source of the main example in this post
2. What is the difference between cube, rollup and groupBy operators? (Stack Overflow)