PostgreSQL Real-Time Efficient Search: Full-Text Search, Fuzzy Queries, Regex Queries, Similarity Queries, and Ad Hoc Queries


Tags

PostgreSQL, search engine, GIN, ranking, highlight, full-text search, fuzzy query, regex query, similarity query, ad hoc query

Background

String search is a very common business requirement. It includes:

1. Prefix + fuzzy query. (A b-tree index can be used.)

    select * from tbl where col like 'ab%';
    or
    select * from tbl where col ~ '^ab';

2. Suffix + fuzzy query. (A b-tree index on the expression reverse(col) can be used.)

    select * from tbl where col like '%ab';
    or
    select * from tbl where col ~ 'ab$';

To use the expression index, write the query as:

    select * from tbl where reverse(col) like 'ba%';
    or
    select * from tbl where reverse(col) ~ '^ba';

3. Fuzzy query on both sides (infix match). (pg_trgm with a GIN index can be used.)

https://www.postgresql.org/docs/10/static/pgtrgm.html

    select * from tbl where col like '%ab%';
    or
    select * from tbl where col ~ 'ab';

4. Full-text search. (The full-text search types with a GIN or RUM index can be used.)

    select * from tbl
    where tsvector_col @@ 'postgres & china | digoal:A'
    order by ts_rank(tsvector_col, 'postgres & china | digoal:A')
    limit xx;

The detailed syntax is covered later.

5. Regex query. (pg_trgm with a GIN index can be used.)

    select * from tbl where col ~ '^a[0-9]{1,5}\ +digoal$';

6. Similarity query. (pg_trgm with a GIN index can be used.)

    select * from tbl order by similarity(col,'postgre') desc limit 10;

7. Ad hoc query: arbitrary combinations of columns. (Achievable with a bloom index, a multi-index bitmap scan, a GIN-index bitmap scan, and similar techniques.)

    select * from tbl where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);
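The index types named in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the author's setup: the table definition and the index names are hypothetical, and only the tbl, col, and tsvector_col names come from the examples above. The text_pattern_ops operator class is an addition here; it lets LIKE prefix matches use a b-tree when the database collation is not "C".

```sql
-- Hypothetical schema; only tbl, col and tsvector_col appear in the article.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE tbl (
    id           serial PRIMARY KEY,
    col          text,
    tsvector_col tsvector
);

-- Case 1, prefix match: b-tree with text_pattern_ops so that
-- LIKE 'ab%' can use the index under a non-C collation.
CREATE INDEX idx_tbl_prefix ON tbl (col text_pattern_ops);

-- Case 2, suffix match: b-tree on reverse(col); the query must be
-- written against reverse(col), e.g. reverse(col) LIKE 'ba%'.
CREATE INDEX idx_tbl_suffix ON tbl (reverse(col) text_pattern_ops);

-- Cases 3, 5 and 6, infix / regex / similarity: trigram GIN index.
CREATE INDEX idx_tbl_trgm ON tbl USING gin (col gin_trgm_ops);

-- Case 4, full-text search: GIN index on the tsvector column.
CREATE INDEX idx_tbl_fts ON tbl USING gin (tsvector_col);

-- A similarity query that the trigram GIN index can help with:
-- the % operator filters by the similarity threshold first,
-- then the surviving rows are sorted.
SELECT *
FROM tbl
WHERE col % 'postgre'
ORDER BY similarity(col, 'postgre') DESC
LIMIT 10;
```

Note that a bare ORDER BY similarity(...) without the % filter has no indexable condition, which is why the sketch adds the WHERE clause.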

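Generally speaking, databases cannot accelerate the cases from 3 onward, but PostgreSQL is powerful enough to accelerate all of these queries well. (This means that queries and writes do not conflict with each other, and that index builds happen in real time.)

Users do not need to synchronize data to a search engine before querying it. Moreover, a search engine only covers full-text search; it cannot handle the regex, similarity, and infix fuzzy matching requirements.

Using PostgreSQL can greatly simplify the user's architecture and reduce development cost, while keeping query results fully real-time.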

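I. Full-Text Search

The core features of full-text search are:

dictionaries, tokenization syntax, search syntax, ranking algorithms, efficiency, highlighting of matched terms, and so on.

PostgreSQL implements all of them and supports extensions, for example custom dictionaries and custom ranking algorithms.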
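As one illustration of the highlighting feature mentioned above, the built-in ts_headline function marks matched terms in a document. The sample sentence and the choice of the 'english' configuration are assumptions made for this sketch.

```sql
-- Highlight the terms of the query inside the document text.
-- StartSel / StopSel are the delimiters wrapped around each match.
SELECT ts_headline(
         'english',
         'PostgreSQL supports full text search with ranking and highlighting.',
         to_tsquery('english', 'search & ranking'),
         'StartSel = <, StopSel = >'
       ) AS headline;
```

ts_headline works on the original text rather than on a stored tsvector, so it is usually applied only to the rows that survive the search and the LIMIT.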

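Four document structures are supported (title, author, abstract, body). They can be specified when generating a tsvector, and a single tsvector may contain several document structures.

In the ranking algorithm, the document structure is used to compute weights. For example, a term matched in the title can be given a larger weight.

A normalization mask is supported, mainly to compensate for very long texts and to tune the ranking output.

Setting different weights for the different document structures also tunes the ranking output.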
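The weighting and normalization described above can be sketched with the built-in setweight and ts_rank functions. The sample texts, the 'english' configuration, and the specific weight values are assumptions chosen for illustration.

```sql
-- Build one tsvector from two parts with different structure labels,
-- then rank it with explicit per-label weights and a normalization mask.
SELECT ts_rank(
         '{0.1, 0.2, 0.4, 1.0}',   -- weights in the order D, C, B, A
         setweight(to_tsvector('english', 'PostgreSQL full text search'), 'A') ||
         setweight(to_tsvector('english', 'GIN indexes speed up text search in PostgreSQL'), 'D'),
         to_tsquery('english', 'postgresql & search'),
         1                         -- mask bit 1: divide by 1 + log(document length)
       ) AS rank;
```

Raising the A weight relative to D favors title matches; the normalization argument is a bit mask, so several adjustments can be combined with the | operator (for example 2|4).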

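Dictionaries

PostgreSQL does not ship with Chinese word segmentation by default, but fortunately it can be added on top of the text search framework, for example with open-source Chinese segmentation extensions such as zhparser and jieba.

https://github.com/jaiminpan/pg_jieba

https://github.com/jaiminpan/pg_scws

Chinese segmentation can even be implemented through pljava, plpython, and the like. This effectively brings the segmentation capabilities of the corresponding programming ecosystem into PostgreSQL through its procedural languages, which is pretty cool.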

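"Several Parameters You Must Know When Using zhparser Chinese Word Segmentation on Alibaba Cloud PostgreSQL"

"How to Speed Up Loading of jieba Word Segmentation in PostgreSQL"

"PostgreSQL Greenplum jieba Word Segmentation (by plpython)"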

Tokenization Overview

1. The parser converts a string into tokens. (Custom parsers can be defined.)

The token types of the default parser can be listed with ts_token_type('default').

Example

    SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');

      alias   |  description  |            token
    ----------+---------------+------------------------------
     protocol | Protocol head | http://
     url      | URL           | example.com/stuff/index.html
     host     | Host          | example.com
     url_path | URL path      | /stuff/index.html

Syntax for creating a text search parser: see CREATE TEXT SEARCH PARSER in the PostgreSQL documentation.

2. Together with a text search configuration and dictionaries, tokens are converted into lexemes.

For example, suppose a synonym dictionary has been created from this file:

    postgres        pgsql
    postgresql      pgsql
    postgre pgsql
    gogle   googl
    indices index*

This dictionary is then used to convert tokens into lexemes. What comes out of the conversion is a lexeme. (What a tsvector stores is also the lexeme, not the original token.)

    mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
    mydb=# SELECT ts_lexize('syn','indices');
     ts_lexize
    -----------
     {index}
    (1 row)

    mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
    mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
    mydb=# SELECT to_tsvector('tst','indices');
     to_tsvector
    -------------
     'index':1
    (1 row)

    mydb=# SELECT to_tsquery('tst','indices');
     to_tsquery
    ------------
     'index':*
    (1 row)

    mydb=# SELECT 'indexes are very useful'::tsvector;
                tsvector
    ---------------------------------
     'are' 'indexes' 'useful' 'very'
    (1 row)

    mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst','indices');
     ?column?
    ----------
     t
    (1 row)

Syntax for creating a text search dictionary:

https://www.postgresql.org/docs/10/static/sql-createtsdictionary.html

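3. Store the lexemes as a tsvector.

The text search configuration determines what gets stored.

During conversion, each token produced by the parser is matched, in order, against the dictionaries listed in the configuration, and the corresponding lexeme from the matching dictionary is stored.

    ALTER TEXT SEARCH CONFIGURATION tsconfig_name
        ADD MAPPING FOR token_type_1 WITH dict_1, dict_2, dict_3;

When this configuration is used to convert text into a tsvector, a token of type token_type_1 is first matched against dict_1. If it matches, the corresponding lexeme from dict_1 is stored; if not, the search continues with dict_2, and so on.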
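The dictionary fall-through described above can be observed with a small sketch. The names syn_demo and chain_demo are hypothetical; the sketch assumes the synonym_sample file shown earlier is installed (it ships with PostgreSQL as a sample) and that the built-in english configuration and english_stem dictionary are available.

```sql
-- A synonym dictionary backed by the sample synonym file.
CREATE TEXT SEARCH DICTIONARY syn_demo (
    template = synonym,
    synonyms = 'synonym_sample'
);

-- Start from the english configuration, then route asciiword tokens
-- through the synonym dictionary first and the stemmer second.
-- ALTER MAPPING is used because the copied configuration already
-- has a mapping for asciiword.
CREATE TEXT SEARCH CONFIGURATION chain_demo (copy = english);
ALTER TEXT SEARCH CONFIGURATION chain_demo
    ALTER MAPPING FOR asciiword WITH syn_demo, english_stem;

-- ts_debug reports, per token, which dictionary recognized it.
SELECT token, dictionary, lexemes
FROM ts_debug('chain_demo', 'postgres indexes');

SELECT to_tsvector('chain_demo', 'postgres indexes');
```

With the sample file, "postgres" should be resolved by syn_demo, while "indexes" is not listed there and should fall through to english_stem. Ordering matters: a dictionary that recognizes every token, such as a stemmer, needs to come last in the chain.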

Syntax for creating a text search configuration:

https://www.postgresql.org/docs/10/static/sql-createtsconfig.html

Syntax for creating a text search template:

https://www.postgresql.org/docs/10/static/sql-createtstemplate.html

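4. Control parameters

Parsers usually expose control parameters, for example whether to emit single characters or two-character combinations. zhparser is one parser that has such parameters.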

5. Document structure

Title, author, abstract, body.

These are represented by the labels A, B, C, and D.
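A structure label can also be used in the query itself, as in the 'digoal:A' term seen in the background section. The sketch below uses made-up sample texts and the built-in 'simple' configuration to show a label restricting where a term may match.

```sql
-- Label three parts of one document: A = title, B = author, D = body.
WITH doc AS (
    SELECT setweight(to_tsvector('simple', 'postgres search'), 'A') ||
           setweight(to_tsvector('simple', 'digoal'), 'B') ||
           setweight(to_tsvector('simple', 'china postgres user group'), 'D') AS v
)
SELECT v,
       v @@ to_tsquery('simple', 'postgres:A') AS postgres_in_title,
       v @@ to_tsquery('simple', 'digoal:A')   AS digoal_in_title
FROM doc;
```

Here "postgres" occurs in the part labeled A, so the first predicate is expected to be true, while "digoal" only occurs under label B, so the second is expected to be false.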


Source: http://outofmemory.cn/sjk/1171742.html
