使用时间戳OVERLAPS和“PARTITION BY”加速PostgreSQL查询

使用时间戳OVERLAPS和“PARTITION BY”加速PostgreSQL查询,第1张

概述我在PostgreSQL 9.0中有一个相当大的表(500K – 1M行),它包含通用的“时间片”信息,也就是说,它确定另一个表中的行(“特征”)何时有效.定义看起来像这样(略微简化): CREATE TABLE feature_timeslice( timeslice_id int NOT NULL, feature_id int NOT NULL, valid_time_begi 我在Postgresql 9.0中有一个相当大的表(500K – 1M行),它包含通用的“时间片”信息,也就是说,它确定另一个表中的行(“特征”)何时有效.定义看起来像这样(略微简化):

CREATE table feature_timeslice(  timeslice_ID int NOT NulL,feature_ID int NOT NulL,valID_time_begin timestamp NOT NulL,valID_time_end timestamp,sequence_number smallint,-- Some other columns  CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_ID)  -- Some other constraints)CREATE INDEX ix_feature_timeslice_feature_IDON feature_timeslice USING btree (feature_ID);

然后,在timeslice_ID上​​将许多其他特定功能表连接到它:

CREATE table specific_feature_timeslice(  timeslice_ID int NOT NulL,-- Other columns  CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_ID),CONSTRAINT fk_specific_feature_timeslice_feature_timeslice FOREIGN KEY (timeslice_ID) REFERENCES feature_timeslice (timeslice_ID))

可能存在具有重叠有效时段(开始/结束时间)的多个时间片,但具有最高sequence_number的时间片优先(再次,略微简化,但足够接近).我想有效地找到每个feature_ID的当前有效行,所以我定义了一个视图,如下所示:

CREATE VIEW feature_timeslice_ID_NowAS    SELECT timeslice_ID    FROM    (        SELECT timeslice_ID,rank() OVER        (            PARTITION BY feature_ID            ORDER BY sequence_number DESC,timeslice_ID DESC        )        FROM feature_timeslice        WHERE (current_timestamp AT TIME ZONE 'UTC','0'::interval) OVERLAPS (valID_time_begin,COALESCE(valID_time_end,'infinity'::timestamp))    ) subq     WHERE subq.rank = 1

它通常像这样查询

SELECT *FROM specific_feature_timeslice sfJOIN feature_timeslice_ID_Now n USING (timeslice_ID)WHERE sf.name = 'SOMETHING'

这可行,但它仍然有点太慢 – 需要1-2秒,即使可能只返回1-5行,因为specific_feature_timeslice标准通常会将其缩小很多. (加入多个功能视图的更复杂的查询会非常快速地变慢.)我无法弄清楚如何让Postgresql更有效地执行此 *** 作.查询计划如下所示:

Join Filter: ((r.timeslice_ID)::integer = (subq.timeslice_ID)::integer)  ->  Subquery Scan on subq  (cost=32034.36..37876.98 rows=835 wIDth=4) (actual time=2086.125..5243.467 rows=250918 loops=1)        Filter: (subq.rank = 1)        ->  WindowAgg  (cost=32034.36..35790.33 rows=166932 wIDth=10) (actual time=2086.110..4066.351 rows=250918 loops=1)              ->  Sort  (cost=32034.36..32451.69 rows=166932 wIDth=10) (actual time=2086.065..2654.971 rows=250918 loops=1)                    Sort Key: feature_timeslice.feature_ID,feature_timeslice.sequence_number,feature_timeslice.timeslice_ID                    Sort Method:  quicksort  Memory: 13898kB                    ->  Seq Scan on feature_timeslice  (cost=0.00..17553.93 rows=166932 wIDth=10) (actual time=287.270..1225.595 rows=250918 loops=1)                          Filter: overlaps(timezone('UTC'::text,Now()),(timezone('UTC'::text,Now()) + '00:00:00'::interval),(valID_time_begin)::timestamp without time zone,COALESCE((valID_time_end)::timestamp without time zone,'infinity'::timestamp without time zone))  ->  Materialize  (cost=0.00..1093.85 rows=2 wIDth=139) (actual time=0.002..0.007 rows=2 loops=250918)        ->  Seq Scan on specific_feature_timeslice sf  (cost=0.00..1093.84 rows=2 wIDth=139) (actual time=1.958..7.674 rows=2 loops=1)              Filter: ((name)::text = 'SOMETHING'::text)Total runtime: 10319.875 ms

实际上,我想在任何给定时间内进行此查询,而不仅仅是当前时间.我有一个为此定义的函数,它将时间作为参数,但查询“现在”是最常见的情况,所以即使我只能加快速度,这将是一个很大的改进.

==编辑==

好的,我已经尝试按照两个答案的建议对表进行规范化 – 也就是说,我将valID_time_begin和valID_time_end移动到一个单独的表time_period中.我还用WHERE NOT EXISTS([更好的候选时间片])替换了窗口函数.在这个过程中,我也升级到了Postgresql 9.1.所有这些查询现在都快了两倍.查询计划与wildplasser的答案看起来相同.这很好,但不如我希望的那么好 – 从一个功能表中选择仍需要一秒钟.

理想情况下,我想利用功能WHERE条件的选择性,正如Erwin Brandstetter所说.如果我手工制作一个查询来做到这一点,我得到的时间是15-30毫秒.现在,它更像是它!手工制作的查询看起来像这样:

WITH filtered_feature AS(    SELECT *    FROM specific_feature_timeslice sf    JOIN feature_timeslice ft USING (timeslice_ID)    WHERE sf.name = 'SOMETHING')SELECT *FROM filtered_feature ffJOIN(    SELECT timeslice_ID    FROM filtered_feature candIDate    JOIN time_period candIDate_time ON candIDate.valID_time_period_ID = candIDate_time.ID    WHERE ('2011-09-26','0'::interval) OVERLAPS (candIDate_time.valID_time_begin,COALESCE(candIDate_time.valID_time_end,'infinity'::timestamp))        AND NOT EXISTS        (            SELECT *            FROM filtered_feature better            JOIN time_period better_time ON better.valID_time_period_ID = better_time.ID            WHERE ('2011-09-26','0'::interval) OVERLAPS (better_time.valID_time_begin,COALESCE(better_time.valID_time_end,'infinity'::timestamp))                AND better.feature_ID = candIDate.feature_ID AND better.timeslice_ID != candIDate.timeslice_ID                AND better.sequence_number > candIDate.sequence_number        )) AS ft ON ff.timeslice_ID = ft.timeslice_ID

不幸的是,这在普通查询中使用太大而且太复杂,它们可能会连接许多其他表.我需要一些方法将这个逻辑封装在一个函数(任意时间)或至少一个视图(当前时间),但我仍然无法弄清楚如何执行此 *** 作,同时仍然让查询规划器首先过滤特定的功能.如果我只能将行集传递给函数 – 但据我所知,Postgresql不允许这样做.有任何想法吗?

==结论==

我最终使用Postgresql继承来解决这个问题(请参阅我的回答),但如果不是因为Erwin Brandstetter的答案,我就不会想出这个想法,所以赏金归于他. wildplasser的答案也非常有用,因为它允许我消除不必要的窗口功能,从而进一步加快它的速度.非常感谢你们俩!

解决方法 我最终使用Postgresql继承来解决这个问题,因此每个specific_feature_timeslice表都继承自feature_timeslice(而不是像之前那样引用它).这允许“特征的选择性可以首先生效” – 查询计划首先将其缩小到我想要的几行.因此,架构现在看起来像这样:

CREATE table feature_timeslice(  timeslice_ID int NOT NulL,-- Some other columns  CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_ID)  -- Some other constraints)CREATE table specific_feature_timeslice(  -- Feature-specific columns only,eg.  name character varying(100),CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_ID))inheritS (feature_timeslice);CREATE INDEX ix_specific_feature_timeslice_feature_IDON specific_feature_timeslice (feature_ID);

每个这样的派生表都有自己的函数来选择指定时间的行电流:

CREATE FUNCTION specific_feature_asof(effective_time timestamp)RETURNS SetoF specific_feature_timesliceAS $BODY$    SELECT candIDate.*    FROM specific_feature_timeslice candIDate    WHERE (,'0'::interval) OVERLAPS (candIDate.valID_time_begin,COALESCE(candIDate.valID_time_end,'infinity'::timestamp))        AND NOT EXISTS        (            SELECT *            FROM specific_feature_timeslice better            WHERE (,'0'::interval) OVERLAPS (better.valID_time_begin,COALESCE(better.valID_time_end,'infinity'::timestamp))                AND better.feature_ID = candIDate.feature_ID AND better.timeslice_ID != candIDate.timeslice_ID AND better.sequence_number > candIDate.sequence_number        )$BODY$LANGUAGE sql Stable;

我自动生成这些函数 – 当然 – 除了表名之外它们是相同的.然后典型的查询变为:

SELECT *FROM specific_feature_asof('2011-09-30')WHERE name = 'SOMETHING'

并且查询计划如下所示:

nested Loop Anti Join  (cost=0.00..412.84 rows=3 wIDth=177) (actual time=0.044..7.038 rows=10 loops=1)  Join Filter: (((better.timeslice_ID)::integer <> (candIDate.timeslice_ID)::integer) AND ((better.sequence_number)::smallint > (candIDate.sequence_number)::smallint))  ->  Seq Scan on specific_feature_timeslice candIDate  (cost=0.00..379.66 rows=3 wIDth=177) (actual time=0.018..6.688 rows=10 loops=1)        Filter: (((name)::text = 'SOMETHING'::text) AND overlaps(('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone,(('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone + '00:00:00'::interval),'infinity'::timestamp without time zone)))  ->  Index Scan using ix_specific_feature_timeslice_feature_ID on specific_feature_timeslice better  (cost=0.00..8.28 rows=1 wIDth=14) (actual time=0.008..0.011 rows=1 loops=10)        Index Cond: ((feature_ID)::integer = (candIDate.feature_ID)::integer)        Filter: overlaps(('2011-09-30 00:00:00'::timestamp without time zone)::timestamp without time zone,'infinity'::timestamp without time zone))Total runtime: 7.150 ms

性能差异非常显着:像上面的查询一样简单的选择需要30-60毫秒.加入两个这样的函数需要300-400毫秒,这比我预期的要多一点,但仍然可以接受.

通过这些更改,我认为不再需要规范化feature_timeslice,即.将有效的开始/结束时间提取到一个单独的表中,所以我没有这样做.

总结

以上是内存溢出为你收集整理的使用时间戳OVERLAPS和“PARTITION BY”加速PostgreSQL查询全部内容,希望文章能够帮你解决使用时间戳OVERLAPS和“PARTITION BY”加速PostgreSQL查询所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错,欢迎将内存溢出网站推荐给程序员好友。

欢迎分享,转载请注明来源:内存溢出

原文地址: http://outofmemory.cn/sjk/1159124.html

(0)
打赏 微信扫一扫 微信扫一扫 支付宝扫一扫 支付宝扫一扫
上一篇 2022-06-01
下一篇 2022-06-01

发表评论

登录后才能评论

评论列表(0条)

保存