The Analyzer is the component in ES dedicated to text analysis (tokenization). It is made up of three parts, which can also be combined explicitly, as shown in the sketch after this list:
- Character Filters: pre-process the raw text, for example stripping HTML
- Tokenizer: split the text into terms according to a set of rules
- Token Filter: post-process the emitted terms, for example removing stop words
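Before going through the built-in analyzers, here is a minimal sketch of exercising all three stages at once through the _analyze API; the sample HTML text is made up for illustration:
GET /_analyze { "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase"], "text": "<p>Some HTML Text</p>" }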
Standard Analyzer
This is the default analyzer. It splits text on word boundaries, converts letters to lowercase, and has stop-word removal disabled by default.
It is used as follows:
GET /_analyze { "analyzer": "standard", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The result is shown below; note that all uppercase letters have been converted to lowercase:
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "", "position" : 10 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "SimpleAnalyzer", "position" : 11 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "", "position" : 12 } ] }
SimpleAnalyzer
It splits on anything that is not a letter and discards the non-letter characters; letters are lowercased as well.
For example:
GET /_analyze { "analyzer": "simple", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is shown below; besides the lowercasing, all non-letter tokens have been dropped (the number 2):
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 10 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 11 } ] }WhitespaceAnalyzer
WhitespaceAnalyzer
It splits terms on whitespace only. For example:
GET /_analyze { "analyzer": "whitespace", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; note that It`s, Let`s and so on are kept intact:
{ "tokens" : [ { "token" : "It`s", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 2 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 3 }, { "token" : "commander.", "start_offset" : 16, "end_offset" : 26, "type" : "word", "position" : 4 }, { "token" : "Let`s", "start_offset" : 27, "end_offset" : 32, "type" : "word", "position" : 5 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 6 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 7 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 8 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "word", "position" : 9 }, { "token" : "times!", "start_offset" : 45, "end_offset" : 51, "type" : "word", "position" : 10 } ] }StopAnalyzer
StopAnalyzer
Compared with the SimpleAnalyzer it adds a stop token filter, which removes words such as the, a and is. For example:
GET /_analyze { "analyzer": "stop", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; many of the stop words are gone:
{ "tokens" : [ { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 11 } ] }Keyword Analyzer
Keyword Analyzer
It does not tokenize at all; the whole input is emitted as a single term. For example:
GET /_analyze { "analyzer": "keyword", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is as follows:
{ "tokens" : [ { "token" : "It`s a good day commander. Let`s do it for 2 times!", "start_offset" : 0, "end_offset" : 51, "type" : "word", "position" : 0 } ] }Pattern Analyzer
Pattern Analyzer
It tokenizes with a regular expression, \W+ by default, i.e. it splits on any non-word character. For example:
GET /_analyze { "analyzer": "pattern", "text": "It`s a good day commander. Let`s do it for 2 times!" }
Here the output is effectively the same as with standard:
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 10 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "word", "position" : 11 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 12 } ] }LanguageAnalyzer
LanguageAnalyzer
ES can also analyze text per language:
GET /_analyze { "analyzer": "english", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; stop words are filtered out here as well, and the remaining terms are stemmed (day becomes dai, commander becomes command, times becomes time):
{ "tokens" : [ { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "", "position" : 3 }, { "token" : "dai", "start_offset" : 12, "end_offset" : 15, "type" : "", "position" : 4 }, { "token" : "command", "start_offset" : 16, "end_offset" : 25, "type" : "", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "", "position" : 8 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "ICU-Analyzer", "position" : 11 }, { "token" : "time", "start_offset" : 45, "end_offset" : 50, "type" : "", "position" : 12 } ] }
ICU-Analyzer
This analyzer handles Chinese tokenization and has to be installed first:
[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu
-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%
-> Installed analysis-icu
Then restart ES and run a test:
GET /_analyze { "analyzer": "icu_analyzer", "text": "这个进球真是漂亮!" }
The output is below; apparently "进球" was not kept as a single word:
{ "tokens" : [ { "token" : "这个", "start_offset" : 0, "end_offset" : 2, "type" : "配置自定义Analyzer", "position" : 0 }, { "token" : "进", "start_offset" : 2, "end_offset" : 3, "type" : " ", "position" : 1 }, { "token" : "球", "start_offset" : 3, "end_offset" : 4, "type" : " ", "position" : 2 }, { "token" : "真是", "start_offset" : 4, "end_offset" : 6, "type" : " ", "position" : 3 }, { "token" : "漂亮", "start_offset" : 6, "end_offset" : 8, "type" : " ", "position" : 4 } ] }
Configuring a custom Analyzer
A custom Analyzer is built by combining a CharacterFilter, a Tokenizer and a TokenFilter.
The built-in CharacterFilters are HTML strip, Mapping and Pattern replace, used respectively for stripping HTML tags, string replacement and regex-based replacement.
The built-in Tokenizers include whitespace, standard, uax_url_email, pattern, keyword and path_hierarchy; you can also write a plugin in Java to implement your own tokenizer.
The built-in TokenFilters include lowercase, stop and synonym.
An example combining a tokenizer and a character filter:
POST _analyze { "tokenizer": "keyword", "char_filter": ["html_strip"], "text": "aaa" }
Again a tokenizer plus character filter combination, but this time a mapping is added to the character filter. For example:
POST _analyze { "tokenizer": "standard", "char_filter": [ { "type": "mapping", "mappings": ["- => _"] }], "text": "1-2, d-4" }正则
Regex
A regex example follows; $1 refers to the content of the corresponding capture group, here www.baidu.com:
POST _analyze { "tokenizer": "standard", "char_filter": [{ "type": "pattern_replace", "pattern": "http://(.*)", "replacement": "" }], "text": "http://www.baidu.com" }路径层次分词器
Path hierarchy tokenizer
The path hierarchy tokenizer is shown below; it treats the input /home/szc/a/b/c/e as a path and emits one token per directory level:
POST _analyze { "tokenizer": "path_hierarchy", "text": "/home/szc/a/b/c/e" }
The output is as follows:
{ "tokens" : [ { "token" : "/home", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 }, { "token" : "/home/szc", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b", "start_offset" : 0, "end_offset" : 13, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b/c", "start_offset" : 0, "end_offset" : 15, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b/c/e", "start_offset" : 0, "end_offset" : 17, "type" : "word", "position" : 0 } ] }filter组合
Filter combination
Token filters can be combined as below, applying lowercasing and stop-word removal at the same time:
POST _analyze { "tokenizer": "whitespace", "filter": ["lowercase", "stop"], "text": "The boys in China are playing soccer!" }综合使用
Putting it all together
A composite custom Analyzer named my_analyzer is shown below; its char_filter, tokenizer and filter are all custom-defined:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": { "type": "pattern", "pattern": "[ .,!?]" }
      },
      "char_filter": {
        "emoticons": { "type": "mapping", "mappings": [":) => happy", ":( => sad"] }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  }
}
To use it, run _analyze against the index in which the analyzer was defined:
POST my_index/_analyze { "analyzer": "my_analyzer", "text": "I`m a :) guy, and you ?" }
The output is below: the emoticon was first replaced, the text was then split by the specified regex, and finally the stop words were removed:
{ "tokens" : [ { "token" : "i`m", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "happy", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "guy", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "you", "start_offset" : 18, "end_offset" : 21, "type" : "word", "position" : 5 } ] }结语
Conclusion
That wraps up the analyzers in Elasticsearch 7.