postgresql – 在Postgres上多个字段上的松散索引扫描？_sql

概述我在Postgres 9.2(数百万行)中有几个大表,我需要根据两个字段’source'(varchar)和’id'(int)的组合生成一个唯一的代码.我可以通过生成row_numbers来执行此 *** 作： SELECT source,id FROM tablename GROUP BY source,id 但结果可能需要一段时间才能完成.有人建议,如果字段被索引,并且索引值的比例很小(这是我的情况) 我在Postgres 9.2(数百万行)中有几个大表,我需要根据两个字段’source'(varchar)和’ID'(int)的组合生成一个唯一的代码.我可以通过生成row_numbers来执行此 *** 作：

SELECT source,ID FROM tablename GROUP BY source,ID

但结果可能需要一段时间才能完成.有人建议,如果字段被索引,并且索引值的比例很小(这是我的情况),那么松散的索引扫描可能是更好的选择：http://wiki.postgresql.org/wiki/Loose_indexscan

WITH RECURSIVE     t AS (SELECT min(col) AS col FROM tablename           UNION ALL           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NulL)SELECT col FROM t WHERE col IS NOT NulLUNION ALLSELECT NulL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NulL);

该示例在单个字段上运行.尝试返回多个字段会产生错误：子查询必须只返回一列.一种可能性是尝试检索整个行 – 例如SELECT ROW(min(source),min(ID)…,但是我不确定WHERE语句的语法需要看起来像处理单个行元素.

问题是：可以修改基于递归的代码以使用多个列,如果是,如何？我致力于使用Postgres,但看起来MysqL已经为多个列实现了松散的索引扫描：http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html

按照建议,我附上了EXPLAIN ANALYZE结果.

对于我的情况 – 我使用GROUP BY为2列选择不同的值,它是以下内容：

HashAggregate  (cost=1645408.44..1654099.65 rows=869121 wIDth=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 wIDth=34) (actual time=4413.311..25450.840 rows=22025768 loops=1) Total runtime: 36127.789 ms(3 rows)

我不知道如何进行2列索引扫描(这是问题),但为了进行比较,在一列上使用GROUP BY,我得到：

HashAggregate  (cost=1590346.70..1590347.69 rows=99 wIDth=8) (actual time=32310.706..32310.722 rows=100 loops=1)   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 wIDth=8) (actual time=4764.609..26941.832 rows=22025768 loops=1) Total runtime: 32350.899 ms(3 rows)

但是对于一列上的松散索引扫描,我得到：

Result  (cost=181.28..198.07 rows=101 wIDth=8) (actual time=0.069..1.935 rows=100 loops=1)   CTE t     ->  Recursive Union  (cost=1.74..181.28 rows=101 wIDth=8) (actual time=0.062..1.855 rows=101 loops=1)           ->  Result  (cost=1.74..1.75 rows=1 wIDth=0) (actual time=0.061..0.061 rows=1 loops=1)                 InitPlan 1 (returns )                   ->  limit  (cost=0.00..1.74 rows=1 wIDth=8) (actual time=0.057..0.057 rows=1 loops=1)                         ->  Index Only Scan using tablename_ID on tablename  (cost=0.00..38379014.12 rows=22024696 wIDth=8) (actual time=0.055..0.055 rows=1 loops=1)                               Index Cond: (ID IS NOT NulL)                               Heap Fetches: 0           ->  Worktable Scan on t  (cost=0.00..17.75 rows=10 wIDth=8) (actual time=0.017..0.017 rows=1 loops=101)                 Filter: (ID IS NOT NulL)                 Rows Removed by Filter: 0                 SubPlan 3                   ->  Result  (cost=1.75..1.76 rows=1 wIDth=0) (actual time=0.016..0.016 rows=1 loops=100)                         InitPlan 2 (returns )                           ->  limit  (cost=0.00..1.75 rows=1 wIDth=8) (actual time=0.016..0.016 rows=1 loops=100)                                 ->  Index Only Scan using tablename_ID on tablename  (cost=0.00..12811462.41 rows=7341565 wIDth=8) (actual time=0.015..0.015 rows=1 loops=100)                                       Index Cond: ((ID IS NOT NulL) AND (ID > t.ID))                                       Heap Fetches: 0   ->  Append  (cost=0.00..16.79 rows=101 wIDth=8) (actual time=0.067..1.918 rows=100 loops=1)         ->  CTE Scan on t  (cost=0.00..2.02 rows=100 wIDth=8) (actual time=0.067..1.899 rows=100 loops=1)               Filter: (ID IS NOT NulL)               Rows Removed by Filter: 1         ->  Result  (cost=13.75..13.76 rows=1 wIDth=0) (actual time=0.002..0.002 rows=0 loops=1)               One-Time Filter:                InitPlan 5 (returns )                 ->  Index Only Scan using tablename_ID on tablename  (cost=0.00..13.75 rows=1 wIDth=0) (actual time=0.002..0.002 rows=0 loops=1)                       Index Cond: (ID IS NulL)                       Heap Fetches: 0 Total runtime: 2.040 ms

完整的表定义如下所示：

CREATE table tablename(  source character(25),ID bigint NOT NulL,time_ timestamp without time zone,height numeric,lon numeric,lat numeric,distance numeric,status character(3),geom geometry(PointZ,4326),relID bigint)WITH (  OIDS=FALSE);CREATE INDEX tablename_height  ON public.tablename  USING btree  (height);CREATE INDEX tablename_geom  ON public.tablename  USING gist  (geom);CREATE INDEX tablename_ID  ON public.tablename  USING btree  (ID);CREATE INDEX tablename_lat  ON public.tablename  USING btree  (lat);CREATE INDEX tablename_lon  ON public.tablename  USING btree  (lon);CREATE INDEX tablename_relID  ON public.tablename  USING btree  (relID);CREATE INDEX tablename_sID  ON public.tablename  USING btree  (source ColLATE pg_catalog."default",ID);CREATE INDEX tablename_source  ON public.tablename  USING btree  (source ColLATE pg_catalog."default");CREATE INDEX tablename_time  ON public.tablename  USING btree  (time_);

答案选择：

我花了一些时间来比较所提供的方法.有时这样我希望不止一个答案可以被接受,但在这种情况下,我正在给@jjanes打勾.这样做的原因是他的解决方案与最初提出的问题更紧密地匹配,并且我能够获得关于所需WHERE语句的形式的一些见解.最后,HashAggregate实际上是最快的方法(对我来说),但这是由于我的数据的性质,而不是算法的任何问题.我已经为下面的不同方法添加了EXPLAIN ANALYZE,并且将为jjanes和joop提供1.

HashAggregate：

HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 wIDth=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)   ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 wIDth=34) (actual time=0.054..14639.931 rows=22024982 loops=1) Total runtime: 24787.292 ms

松散索引扫描修改

CTE Scan on t  (cost=13.84..15.86 rows=100 wIDth=112) (actual time=0.916..250311.164 rows=1233080 loops=1)   Filter: (source IS NOT NulL)   Rows Removed by Filter: 1   CTE t     ->  Recursive Union  (cost=0.00..13.84 rows=101 wIDth=112) (actual time=0.911..249295.872 rows=1233081 loops=1)           ->  limit  (cost=0.00..0.04 rows=1 wIDth=34) (actual time=0.910..0.911 rows=1 loops=1)                 ->  Index Only Scan using tablename_sID on tablename  (cost=0.00..965442.32 rows=22024248 wIDth=34) (actual time=0.908..0.908 rows=1 loops=1)                       Heap Fetches: 0           ->  Worktable Scan on t  (cost=0.00..1.18 rows=10 wIDth=112) (actual time=0.201..0.201 rows=1 loops=1233081)                 Filter: (source IS NOT NulL)                 Rows Removed by Filter: 0                 SubPlan 1                   ->  limit  (cost=0.00..0.05 rows=1 wIDth=34) (actual time=0.100..0.100 rows=1 loops=1233080)                         ->  Index Only Scan using tablename_sID on tablename  (cost=0.00..340173.38 rows=7341416 wIDth=34) (actual time=0.100..0.100 rows=1 loops=1233080)                               Index Cond: (ROW(source,ID) > ROW(t.source,t.ID))                               Heap Fetches: 0                 SubPlan 2                   ->  limit  (cost=0.00..0.05 rows=1 wIDth=34) (actual time=0.099..0.099 rows=1 loops=1233080)                         ->  Index Only Scan using tablename_sID on tablename  (cost=0.00..340173.38 rows=7341416 wIDth=34) (actual time=0.098..0.098 rows=1 loops=1233080)                               Index Cond: (ROW(source,t.ID))                               Heap Fetches: 0 Total runtime: 250491.559 ms

合并反加入

Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 wIDth=42) (actual time=48.710..541624.677 rows=1233080 loops=1)   Merge Cond: ((src.source = nx.source) AND (src.ID = nx.ID))   Join Filter: (nx.time_ > src.time_)   Rows Removed by Join Filter: 363464177   ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 wIDth=42) (actual time=48.566..5064.551 rows=22024982 loops=1)         Heap Fetches: 0   ->  Materialize  (cost=0.00..1115255.89 rows=22024248 wIDth=42) (actual time=0.011..40551.997 rows=363464177 loops=1)         ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 wIDth=42) (actual time=0.008..8258.890 rows=22024982 loops=1)               Heap Fetches: 0 Total runtime: 541750.026 ms

相当可怕,但这似乎工作：

WITH RECURSIVE     t AS (  select a,b from (select a,b from foo order by a,b limit 1) asdf union all   select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),(select b from foo where (a,b limit 1)      from t where t.a is not null)  select * from t where t.a is not null;

我真的不明白为什么需要“不是空”,因为空值首先出现在哪里？

总结

以上是内存溢出为你收集整理的postgresql – 在Postgres上多个字段上的松散索引扫描？全部内容，希望文章能够帮你解决postgresql – 在Postgres上多个字段上的松散索引扫描？所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错，欢迎将内存溢出网站推荐给程序员好友。

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/sjk/1164482.html

postgresql – 在Postgres上多个字段上的松散索引扫描？

发表评论

评论列表（0条）