《Postgresql雕虫小技,分组top性能提升44倍》要点:
本文介绍了Postgresql雕虫小技,分组top性能提升44倍,希望对您有用。如果有疑问,可以联系我们。
业务配景
按分组取出top值,是非经常见的业务需求.
好比提取每位歌手的下载量top 10的曲目、提取每个城市纳税前10的人或企业.
传统办法
传统的办法是使用窗口查询,Postgresql是支持窗口查询的.
例子
测试表和测试数据,生成10000个分组,1000万条记录.
postgres=# create table tbl(c1 int,c2 int,c3 int);
CREATE table
postgres=# create index IDx1 on tbl(c1,c2);
CREATE INDEX
postgres=# insert into tbl select mod(trunc(random()*10000)::int,10000),trunc(random()*10000000) from generate_serIEs(1,10000000);
INSERT 0 10000000
使用窗口查询的执行计划
postgres=# explain select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
query PLAN
----------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..770563.03 rows=3333326 wIDth=20)
Filter: (t.rn <= 10)
-> WindowAgg (cost=0.43..645563.31 rows=9999977 wIDth=12)
-> Index Scan using IDx1 on tbl (cost=0.43..470563.72 rows=9999977 wIDth=12)
(4 rows)
使用窗口查询的成果举例
postgres=# select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
rn | c1 | c2 | c3
----+------+--------+----
1 | 0 | 1657 |
2 | 0 | 3351 |
3 | 0 | 6347 |
4 | 0 | 12688 |
5 | 0 | 16991 |
6 | 0 | 19584 |
7 | 0 | 24694 |
8 | 0 | 36646 |
9 | 0 | 40882 |
10 | 0 | 41599 |
1 | 1 | 14465 |
2 | 1 | 29032 |
3 | 1 | 39969 |
4 | 1 | 41094 |
5 | 1 | 69481 |
6 | 1 | 70919 |
7 | 1 | 75575 |
8 | 1 | 81102 |
9 | 1 | 87496 |
10 | 1 | 90603 |
......
使用窗口查询的效率,20.1秒
postgres=# explain (analyze,verbose,costs,timing,buffers) select * from (select row_number() over(partition by c1 order by c2) as rn,* from tbl) t where t.rn<=10;
query PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.43..770563.03 rows=3333326 wIDth=20) (actual time=0.040..20813.469 rows=100000 loops=1)
Output: t.rn,t.c1,t.c2,t.c3
Filter: (t.rn <= 10)
Rows Removed by Filter: 9900000
Buffers: shared hit=10035535
-> WindowAgg (cost=0.43..645563.31 rows=9999977 wIDth=12) (actual time=0.035..18268.027 rows=10000000 loops=1)
Output: row_number() OVER (?),tbl.c1,tbl.c2,tbl.c3
Buffers: shared hit=10035535
-> Index Scan using IDx1 on public.tbl (cost=0.43..470563.72 rows=9999977 wIDth=12) (actual time=0.026..11913.677 rows=10000000 loops=1)
Output: tbl.c1,tbl.c3
Buffers: shared hit=10035535
Planning time: 0.110 ms
Execution time: 20833.747 ms
(13 rows)
虫篆之技
如何优化?
可以参考我之前写的,使用递归查询,优化count distinct的办法.
https://yq.aliyun.com/articles/39689
(点击“阅读原文”,方可点击该连接进入哟!)
本文同样必要用到递归查询,获得分组ID:
postgres=# with recursive t1 as (
postgres(# (select min(c1) c1 from tbl )
postgres(# union all
postgres(# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
postgres(# )
postgres-# select * from t1;
写成SRF函数,如下:
postgres=# create or replace function f() returns setof tbl as $$
postgres$# declare
postgres$# v int;
postgres$# begin
postgres$# for v in with recursive t1 as (
postgres$# (select min(c1) c1 from tbl )
postgres$# union all
postgres$# (select (select min(tbl.c1) c1 from tbl where tbl.c1>t.c1) c1 from t1 t where t.c1 is not null)
postgres$# )
postgres$# select * from t1
postgres$# LOOP
postgres$# return query select * from tbl where c1=v order by c2 limit 10;
postgres$# END LOOP;
postgres$# return;
postgres$#
postgres$# end;
postgres$# $$ language plpgsql strict;
CREATE FUNCTION
优化后的查询成果例子
postgres=# select * from f();
c1 | c2 | c3
------+--------+----
0 | 1657 |
0 | 3351 |
0 | 6347 |
0 | 12688 |
0 | 16991 |
0 | 19584 |
0 | 24694 |
0 | 36646 |
0 | 40882 |
0 | 41599 |
1 | 14465 |
1 | 29032 |
1 | 39969 |
1 | 41094 |
1 | 69481 |
1 | 70919 |
1 | 75575 |
1 | 81102 |
1 | 87496 |
1 | 90603 |
......
优化后,只必要464毫秒返回10000个分组的top 10.
postgres=# explain (analyze,buffers) select * from f();
query PLAN
---------------------------------------------------------------------------------------------------------------------
Function Scan on public.f (cost=0.25..10.25 rows=1000 wIDth=12) (actual time=419.218..444.810 rows=100000 loops=1)
Output: c1,c2,c3
Function Call: f()
Buffers: shared hit=170407,temp read=221 written=220
Planning time: 0.037 ms
Execution time: 464.257 ms
(6 rows)
小 结
1)传统的办法使用窗口查询,输出多个每个分组的top 10,需要扫描所有的记录.效率较低.
2)由于分组不是非常多,只有10000个,所以可以选择使用递归的办法,用上索引取top 10,速度非常快.
3)目前Postgresql的递归语法不支持递归的启动表写在subquery里面,也不支持启动表在递归查询中使用order by,所以不能直接使用递归得出结果,目前必要套一层函数.
更多深度技术内容,请存眷云栖社区微信公众号:yunqiinsight.
《Postgresql雕虫小技,分组top性能提升44倍》是否对您有启发,欢迎查看更多与《Postgresql雕虫小技,分组top性能提升44倍》相关教程,学精学透。内存溢出PHP学院为您提供精彩教程。
总结以上是内存溢出为你收集整理的PostgreSQL雕虫小技,分组TOP性能提升44倍全部内容,希望文章能够帮你解决PostgreSQL雕虫小技,分组TOP性能提升44倍所遇到的程序开发问题。
如果觉得内存溢出网站内容还不错,欢迎将内存溢出网站推荐给程序员好友。
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)