hive开窗函数官方描述:地址
窗口函数又叫OLAP函数/分析函数,窗口函数兼具分组和排序功能。
引入:在 sql 中有一类函数叫做聚合函数,例如 sum()、avg()、max()等等,这类函数可以将多行数据按照规则聚合为一行,一般来讲聚合后的行数是要少于聚合前的行数的。但是有时我们想要既显示聚合前的数据,又要显示聚合后的数据,所以就需要使用窗口。
窗口最重要的就是分组和排序功能,对应的关键字是 partition by 和order by。
窗口函数总结:这个数据窗口的大小可能会随着行的改变而改变
Current row: 当前行
n preceding:往前n行数据
n following:往后n行数据
unbounded preceding:表示从前面的起点
unbounded following:表示从后面的起点over(partition by name)
over(order by orderdate)
over(rows between ** and **)
over(rows between unbounding preceding and unbounded following )
over(partition by name order by orderdate
rows between n preceding and current)
[结果有重复,排名相等会同名,但是会跳号,就是说有排名相同,并且结果行等于窗口行]
[例子: 1 2 2 4 5 ]
[结果无重复,按照窗口定行号,获取窗口所有行]
[例子: 1 2 3 4 5 ]
[结果有重复,可以同名次,导致结果行小于窗口行]
[例子: 1 2 2 3 4 ]
注意:n必须为int类型
- 如何判断什么时候需要使用窗口函数?
在计算时,函数运算的范围需要动态变化需要跨行取一些字段
窗口函数和group by 的区别:
假设 *** 作的数据集有 N行,group by 后分了M个组
使用窗口函数计算,计算后的结果依然有N行;
使用group by 分组,之后再使用聚合函数运算,计算的结果有M行
也就是说使用group by 会导致结果行比原来行数要少,而使用窗口函数不会减少行数
窗口(函数计算的范围) + 函数
语法:
函数() over( partition by 分组 order by 排序 [rows between(定义函数运算的范围)] )
若不指定partition by,默认就是全部数据看作成一个组[窗口] 计算
如果不指定order by,默认不排序,将分组内所有值计算
rows between也叫做window子句, 如果不指定默认就是指从起点到当前行
从window子句语法: (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING) (ROWS | RANGE) BETWEEN 上界 AND 下界 参数介绍 rows between ... and ... : 控制窗口函数的范围 [物理上的行号决定] range between ... and ... : 控制计算列值的大小 [逻辑上的值决定,跟order by指定列有关] preceding:向前 following:往后 current row:当前行 unbounded:起点 unbounded preceding:表示从前面的起点 unbounded following:表示到后面的终点
哪些函数可以使用窗口:
①标准的聚合函数:SUM,MIN,MAX,AVG,COUNT
②特定的函数:
LEADLAGFIRST_VALUELAST_VALUE
③排名函数RANKROW_NUMBERDENSE_RANKCUME_DISTPERCENT_RANKNTILE
注意事项:
- 排名函数有: rank, row_number, dense_rank, cume_dist, percent_rank,ntile.
Lead and Lag 函数.这些函数支持写Over(), 但是不支持在Over()中定义window子句当Over()中指定了Order by但是没有写window子句,此时窗口默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.当Over()中 既没有写Order by 也没有写window子句,此时窗口默认为ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.CUME_DIST和 PERCENT_RANK是序列分析函数,不支持window子句,但是这两个函数不常用特别注意使用first_value 或 last_value 或 row_number,若是不指定order by,可能会导致排序混乱,会出现错误的结果,其他函数还有待考擦
数据准备
-- 创建表 create table if not exists test_over ( id string, datetime string, num int ); -- 插入数据 insert into test_over values ('id1', '2021-01-10', 1), ('id1', '2021-01-11', 5), ('id1', '2021-01-12', 7), ('id1', '2021-01-13', 3), ('id1', '2021-01-14', 2), ('id1', '2021-01-15', 4), ('id1', '2021-01-16', 4), ('id2', '2021-01-10', 2), ('id2', '2021-01-11', 3), ('id2', '2021-01-12', 5), ('id2', '2021-01-13', 6), ('id2', '2021-01-14', 3), ('id2', '2021-01-15', 9), ('id2', '2021-01-16', 7); -- 查看 select * from test_over;1.over 和 聚合函数的使用
-- 1.over 和 聚合函数的使用 select id, datetime, num, sum(num) over (partition by id order by datetime) as sum_num, avg(num) over (partition by id order by datetime) as avg_num, count(*) over (partition by id order by datetime) as count_n, max(num) over (partition by id order by datetime) as max_num, min(num) over (partition by id order by datetime) as min_num from test_over; select id, datetime, num, sum(num) over (partition by id order by datetime rows between 3 preceding and current row ) as sum_num, avg(num) over (partition by id order by datetime rows between 3 preceding and current row ) as avg_num, count(*) over (partition by id order by datetime rows between 3 preceding and current row ) as count_n, max(num) over (partition by id order by datetime rows between 3 preceding and current row ) as max_num, min(num) over (partition by id order by datetime rows between 3 preceding and current row ) as min_num from test_over; select id, datetime, num, sum(num) over (partition by id order by num range between 1 preceding and 5 following) as sum_num, avg(num) over (partition by id order by num range between 1 preceding and 5 following) as avg_num, count(*) over (partition by id order by num range between 1 preceding and 5 following) as count_n, max(num) over (partition by id order by num range between 1 preceding and 5 following) as max_num, min(num) over (partition by id order by num range between 1 preceding and 5 following) as min_num from test_over order by id; -- 对比 rows between 和 range between select id, datetime, num, sum(num) over (partition by id order by num rows between 2 preceding and 2 following) as rows_sum, count(*) over (partition by id order by num rows between 2 preceding and 2 following) as rows_cnt, sum(num) over (partition by id order by num range between 2 preceding and 2 following) as range_sum, count(*) over (partition by id order by num range between 2 preceding and 2 following) as range_cnt, sum(num) over (partition by id order by datetime range between 2 preceding and 2 following) as range_date, count(*) over (partition by id order by datetime range between 2 preceding and 2 following) as range_date from test_over order by id; -- 结论: rows between 根据行号确定窗口范围,而range between根据order by指定的列值的范围确定窗口范围2.lag 和 lead
-- 2.lag 和 lead select id,datetime,num, -- 按照id分组日期排序,获取前一行的num值,如果没有则为-1 lag(num, 1, -1) over (partition by id order by datetime), -- 按照id分组日期排序,获取前两行的datetime值,如果没有则为 '1970-01-01 00:00:00' lag(datetime, 2, '1970-01-01 00:00:00') over (partition by id order by datetime), from test_over; select id,datetime,num, -- 按照id分组日期排序,获取窗口下一行的num值,如果没有则为-1 lead(num, 1, -1) over (partition by id order by datetime), -- 按照id分组日期排序,获取窗口内后两行的datetime值,如果没有则为 '9999-99-99 00:00:00' lead(datetime, 2, '9999-99-99 00:00:00') over (partition by id order by datetime) from test_over; -- 错误的 -- select -- id,datetime,num, -- lead(datetime, 2, '9999-99-99 00:00:00') over (partition by id order by datetime rows between UNBOUNDED preceding and current row ) -- from test_over;3.first_value, last_value
-- 3.first_value, last_value select id,datetime,num, -- 按照id分组日期排序,获取窗口内的第一个值 first_value(datetime) over (partition by id order by datetime), -- 按照id分组日期排序并且窗口是往前2行往后2行,获取窗口内的第一个值 first_value(datetime) over (partition by id order by datetime rows between 2 preceding and 2 following), -- 按照id分组日期排序,获取窗口内的最后一个值 last_value(datetime) over (partition by id order by datetime), -- 按照id分组日期排序并且窗口是往前2行往后2行,获取窗口内的最后一个值 last_value(datetime) over (partition by id order by datetime rows between 2 preceding and 2 following) from test_over; -- 特别注意: first_value, last_value的 over中不指定order by会导致排序混乱,出现错误的结果 select id,datetime,num, -- 按照id分组日期排序,获取窗口内的第一个值 first_value(datetime) over (partition by id), -- 按照id分组日期排序,获取窗口内的最后一个值 last_value(datetime) over (partition by id) from test_over;
first_value, last_value结果图
-- 4.row_number,rank,dense_rank select id,datetime,num, row_number() over (partition by id order by num desc ) as row_number, rank() over (partition by id order by num desc) as rank, dense_rank() over (partition by id order by num desc) as dense_rank from test_over;
row_number,rank,dense_rank结果图
180. 连续出现的数字 [针对row_number计算失效问题]
5.ntile-- 5.ntile select id,datetime,num, ntile(2) over (partition by id order by datetime) as n1, ntile(3) over (partition by id order by datetime) as n2, ntile(4) over (partition by id order by datetime) as n3, ntile(4) over (order by datetime) as nopartition from test_over order by id, datetime;
ntile运行结果图
-- 6.cume_dist, percent_rank -- cume_dist 函数: 当前行值 <= 当前值的行数 / 分组内总行数 select id, datetime, num, cume_dist() over (order by num) as c1, cume_dist() over (partition by id order by num) as c2 from test_over; -- persent_rank 函数: 当前行的值 = (当前行rank -1) / (分组内总行数 - 1) select id, datetime, num, percent_rank() over (order by num) as p_nopartition, rank() over (order by num) as rank_nopartition, percent_rank() over (partition by id order by num) as perrank, rank() over (partition by id order by num) as rank from test_over;
cume_dist结果图
percent_rank结果图
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)