The difference between Spark rollup and cube


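In Spark's DataFrame API, these two functions take the place of the GROUPING SETS clause in Spark SQL.
Both aggregate over several columns at once. Spark already has groupBy, so what are these two functions for, and how do they differ from each other?
Spark: The Definitive Guide explains them as follows: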
cube:
Rather than treating elements hierarchically, a cube
does the same thing across all dimensions

rollup:
A rollup is a multidimensional aggregation that performs a variety of group-by style calculations
On their own, these descriptions do not make the difference very clear.
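In practice, cube aggregates over every possible combination of the keys: each subset of the grouping columns, including the empty set (the grand total).
rollup aggregates over hierarchical combinations of the keys, following the column order from left to right: all the columns first, then each shorter prefix, down to the grand total.

groupBy itself aggregates over just one combination: all of the listed columns together.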
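
The three behaviours described above can be sketched in plain Python, without Spark, by listing which column combinations each operator aggregates over. This is only an illustration of the idea; the function names cube_sets, rollup_sets, and group_by_sets are made up for this sketch and are not part of the Spark API.

```python
from itertools import combinations


def cube_sets(cols):
    """Every subset of cols, from all columns down to the empty set (grand total)."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]


def rollup_sets(cols):
    """Left-to-right prefixes of cols, from all columns down to the empty set."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]


def group_by_sets(cols):
    """groupBy uses exactly one grouping: all listed columns together."""
    return [tuple(cols)]


cols = ["year", "month", "day"]
for label, fn in [("cube", cube_sets), ("rollup", rollup_sets), ("groupBy", group_by_sets)]:
    print(label, fn(cols))
```

For n grouping columns, cube produces 2^n groupings and rollup produces n + 1, and every rollup grouping also appears in the cube.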

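An example makes this clearer. Suppose we group by the three columns year, month, and day, and count the rows in each group.
cube is then equivalent to running all of the following queries: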

GROUP BY             Equivalent query
year, month, day     SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month, null    SELECT COUNT(*) FROM table GROUP BY year, month
year, null, day      SELECT COUNT(*) FROM table GROUP BY year, day
year, null, null     SELECT COUNT(*) FROM table GROUP BY year
null, month, day     SELECT COUNT(*) FROM table GROUP BY month, day
null, month, null    SELECT COUNT(*) FROM table GROUP BY month
null, null, day      SELECT COUNT(*) FROM table GROUP BY day
null, null, null     SELECT COUNT(*) FROM table

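rollup, by comparison, is equivalent to the following queries:

GROUP BY             Equivalent query
year, month, day     SELECT COUNT(*) FROM table GROUP BY year, month, day
year, month, null    SELECT COUNT(*) FROM table GROUP BY year, month
year, null, null     SELECT COUNT(*) FROM table GROUP BY year
null, null, null     SELECT COUNT(*) FROM table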

groupBy runs only the single query SELECT COUNT(*) FROM table GROUP BY year, month, day.

Going further, here is a PySpark example.
The data looks like this:

+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+
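
The article does not show how df is created. A minimal sketch of one way to build it, assuming an existing SparkSession is passed in as spark; the names rows and build_df are hypothetical and chosen only for this illustration.

```python
# Rows copied from the table above: (category, name, how_many).
rows = [
    ("insurance", "Janusz", 0),
    ("savings account", "Grażyna", 1),
    ("credit card", "Sebastian", 0),
    ("mortgage", "Janusz", 2),
    ("term deposit", "Janusz", 4),
    ("insurance", "Grażyna", 2),
    ("savings account", "Janusz", 5),
    ("credit card", "Sebastian", 2),
    ("mortgage", "Sebastian", 4),
    ("term deposit", "Janusz", 9),
    ("insurance", "Grażyna", 3),
    ("savings account", "Grażyna", 1),
    ("savings account", "Sebastian", 0),
    ("savings account", "Sebastian", 2),
    ("credit card", "Sebastian", 1),
]


def build_df(spark):
    """Build the example DataFrame; `spark` is assumed to be an existing SparkSession."""
    return spark.createDataFrame(rows, ["category", "name", "how_many"])
```

With a session in hand, df = build_df(spark) followed by df.show() should print a table like the one above, although Spark does not guarantee the row order.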
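Cube operation
from pyspark.sql.functions import sum  # Spark's aggregate sum, which shadows Python's built-in sum
df.cube('category', 'name').agg(sum('how_many'))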

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+

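As the output shows, cube performed four kinds of grouping:

1. the total for each category alone (name is null)
2. the total for each name alone (category is null)
3. the total for each combination of category and name
4. the grand total over all categories and names (both columns are null)

Rollup operation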

df.rollup('category', 'name').agg(sum('how_many'))

+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+

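rollup performed three kinds of grouping:

1. the total for each combination of category and name
2. the total for each category alone (name is null)
3. the grand total over all categories (both columns are null)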
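
The row counts of the two outputs can be checked without Spark. The following is a rough plain-Python emulation, not how Spark computes the result; the helpers aggregate and emulate are hypothetical names, and None stands in for the null that Spark prints in subtotal rows.

```python
from collections import defaultdict

# Same rows as the example table: (category, name, how_many).
rows = [
    ("insurance", "Janusz", 0), ("savings account", "Grażyna", 1),
    ("credit card", "Sebastian", 0), ("mortgage", "Janusz", 2),
    ("term deposit", "Janusz", 4), ("insurance", "Grażyna", 2),
    ("savings account", "Janusz", 5), ("credit card", "Sebastian", 2),
    ("mortgage", "Sebastian", 4), ("term deposit", "Janusz", 9),
    ("insurance", "Grażyna", 3), ("savings account", "Grażyna", 1),
    ("savings account", "Sebastian", 0), ("savings account", "Sebastian", 2),
    ("credit card", "Sebastian", 1),
]


def aggregate(rows, keep):
    """Sum how_many per group; key columns not listed in `keep` are replaced by None."""
    totals = defaultdict(int)
    for category, name, how_many in rows:
        key = (category if "category" in keep else None,
               name if "name" in keep else None)
        totals[key] += how_many
    return dict(totals)


def emulate(rows, grouping_sets):
    """Merge the results of one aggregation per grouping set."""
    result = {}
    for keep in grouping_sets:
        result.update(aggregate(rows, keep))
    return result


cube = emulate(rows, [("category", "name"), ("category",), ("name",), ()])
rollup = emulate(rows, [("category", "name"), ("category",), ()])

print(len(cube), len(rollup))        # 18 15
print(sorted(set(cube) - set(rollup), key=lambda k: k[1]))  # the name-only subtotals
```

The only rows cube has that rollup lacks are the three per-name subtotals, because name alone is not a left-to-right prefix of (category, name).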

Reference:
1. Source of the main example in this article
2. What is the difference between cube, rollup and groupBy operators? (Stack Overflow)

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5705924.html
