Elasticsearch 7 Study Notes: Text Analysis with Analyzers


Definition

An Analyzer is the Elasticsearch component dedicated to text analysis (tokenization). It consists of three parts:

  1. Character Filters: preprocess the raw text, e.g. stripping HTML tags
  2. Tokenizer: splits the text into tokens according to a rule
  3. Token Filters: post-process the emitted tokens, e.g. lowercasing them or removing stop words
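The three stages can be sketched as a small pipeline. This is only a rough illustration of the flow, not Elasticsearch's actual implementation; the tag-stripping regex and the tiny stop-word set are made up for the example:

```python
import re

def analyze(text):
    # 1. Character filter: clean up the raw text, e.g. strip HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Tokenizer: split the text into tokens by some rule
    tokens = text.split()
    # 3. Token filters: post-process each token, e.g. lowercase and drop stop words
    stopwords = {"a", "an", "the", "is", "in"}
    return [t.lower() for t in tokens if t.lower() not in stopwords]

print(analyze("<b>The</b> ball is in the net"))  # ['ball', 'net']
```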
Analyzer types

StandardAnalyzer


This is the default analyzer. It splits text on word boundaries, converts letters to lowercase, and has stop-word removal disabled by default.
Usage:

GET /_analyze
{
  "analyzer": "standard",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

The result is as follows; note that all uppercase letters have been converted to lowercase:

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "",
      "position" : 12
    }
  ]
}
SimpleAnalyzer

Splits on non-letter characters and discards them; letters are lowercased just as before.
Example:

GET /_analyze
{
  "analyzer": "simple",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

The output is as follows; besides the lowercasing, every non-letter character has been dropped (including the digit 2):

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
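SimpleAnalyzer's rule as described above (lowercase, keep only runs of letters) can be approximated in a couple of lines of Python. This is a sketch of the behavior, not the real Lucene implementation:

```python
import re

def simple_analyze(text):
    # Lowercase, then keep only maximal runs of letters; everything else
    # (punctuation, digits) acts as a separator and is discarded
    return re.findall(r"[a-z]+", text.lower())

print(simple_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['it', 's', 'a', 'good', 'day', 'commander', 'let', 's', 'do', 'it', 'for', 'times']
```

Running it on the sample sentence reproduces the token list above, including the missing "2".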
WhitespaceAnalyzer

Splits tokens on whitespace. Example:

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

The output is as follows; note that It`s, Let`s, etc. are kept verbatim:

{
  "tokens" : [
    {
      "token" : "It`s",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "commander.",
      "start_offset" : 16,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Let`s",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "times!",
      "start_offset" : 45,
      "end_offset" : 51,
      "type" : "word",
      "position" : 10
    }
  ]
}
StopAnalyzer

Compared with SimpleAnalyzer, this adds a stop token filter, which removes stop words such as the, a, and is. Example:

GET /_analyze
{
  "analyzer": "stop",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

The output is as follows; the stop words are gone:

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
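The stop analyzer is SimpleAnalyzer plus a stop-word filter, which can be sketched like this. The stop-word set below approximates Lucene's default English set; treat it as an assumption rather than the exact list:

```python
import re

# Approximation of Lucene's default English stop-word set
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
             "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
             "such", "that", "the", "their", "then", "there", "these", "they",
             "this", "to", "was", "will", "with"}

def stop_analyze(text):
    # Same splitting rule as SimpleAnalyzer, then drop the stop words
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

print(stop_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['s', 'good', 'day', 'commander', 'let', 's', 'do', 'times']
```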
Keyword Analyzer

Performs no tokenization; the whole input is emitted as a single term. Example:

GET /_analyze
{
  "analyzer": "keyword",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

Output:

{
  "tokens" : [
    {
      "token" : "It`s a good day commander. Let`s do it for 2 times!",
      "start_offset" : 0,
      "end_offset" : 51,
      "type" : "word",
      "position" : 0
    }
  ]
}
Pattern Analyzer

Tokenizes with a regular expression; the default pattern is \W+, i.e. it splits on runs of non-word characters. Example:

GET /_analyze
{
  "analyzer": "pattern",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}

Here the resulting tokens are the same as with the standard analyzer:

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 12
    }
  ]
}
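Python's `re.split` can mimic the default \W+ behavior; the sketch below drops the empty strings that splitting leaves at the edges, which is an implementation detail of the approximation, not of Elasticsearch:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the pattern (default \W+, runs of non-word characters),
    # lowercase, and discard the empty strings re.split can produce
    return [t for t in re.split(pattern, text.lower()) if t]

print(pattern_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['it', 's', 'a', 'good', 'day', 'commander', 'let', 's', 'do', 'it', 'for', '2', 'times']
```

Unlike SimpleAnalyzer, digits are word characters under \W+, so "2" survives.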
LanguageAnalyzer

Elasticsearch can also analyze text with language-specific analyzers:

GET /_analyze
{
  "analyzer": "english",
    "text": "It`s a good day commander. Let`s do it for 2 times!"
}

The output is as follows; stop words are filtered out here too, and the remaining tokens are stemmed (day → dai, commander → command, times → time):

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "dai",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "",
      "position" : 4
    },
    {
      "token" : "command",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "",
      "position" : 11
    },
    {
      "token" : "time",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "",
      "position" : 12
    }
  ]
}
ICU-Analyzer

This analyzer provides better tokenization for Asian languages, including Chinese. It is distributed as a plugin and must be installed first:

[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu

-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%   
-> Installed analysis-icu

Then restart Elasticsearch and run a test:

GET /_analyze
{
  "analyzer": "icu_analyzer",
    "text": "这个进球真是漂亮!"
}

The output is as follows; apparently "进球" (goal) was not kept together as a single word:

{
  "tokens" : [
    {
      "token" : "这个",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "进",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "球",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "真是",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "漂亮",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "",
      "position" : 4
    }
  ]
}
Configuring a custom Analyzer

A custom Analyzer is built by combining character filters, a tokenizer, and token filters.
The built-in character filters are html_strip, mapping, and pattern_replace, which strip HTML tags, replace strings, and perform regex-based replacement, respectively.
The built-in tokenizers include whitespace, standard, uax_url_email, pattern, keyword, and path_hierarchy; you can also implement your own tokenizer as a Java plugin.
The built-in token filters include lowercase, stop, and synonym.

tokenizer+character_filter

A combination of a tokenizer and a character filter looks like this:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "aaa"
}

The same tokenizer + character-filter combination, but with a mapping rule added to the char_filter:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }],
  "text": "1-2, d-4"
}
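The mapping filter's effect can be mimicked with a plain string replacement applied before tokenizing. This is a sketch: the real standard tokenizer follows Unicode word-boundary rules, approximated here with \W+:

```python
import re

def analyze_with_mapping(text):
    # Character filter: apply the mapping "- => _" to the raw text
    text = text.replace("-", "_")
    # Approximate the standard tokenizer: split on non-word characters
    # (underscore is a word character, so "1_2" survives as one token)
    return [t for t in re.split(r"\W+", text.lower()) if t]

print(analyze_with_mapping("1-2, d-4"))  # ['1_2', 'd_4']
```

Without the mapping step, the "-" would act as a separator and split each pair apart.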

Pattern replace

A pattern_replace example follows; $1 refers to the text captured by the corresponding pair of parentheses, here www.baidu.com:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "http://(.*)",
    "replacement": ""
  }],
  "text": "http://www.baidu.com"
}
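The same replacement can be reproduced with Python's re.sub; note that $1 in Elasticsearch's replacement syntax corresponds to \1 in Python:

```python
import re

def pattern_replace(text):
    # Equivalent of the char_filter above: \1 (Python's spelling of $1)
    # is the text captured by the first pair of parentheses
    return re.sub(r"http://(.*)", r"\1", text)

print(pattern_replace("http://www.baidu.com"))  # www.baidu.com
```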
Path hierarchy tokenizer

The path hierarchy tokenizer treats the input /home/szc/a/b/c/e as a path and emits one token per level of the directory hierarchy:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/home/szc/a/b/c/e"
}

The output is as follows:

{
  "tokens" : [
    {
      "token" : "/home",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c/e",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
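Generating this token list is just prefix accumulation, as the sketch below shows. It is a simplified illustration that assumes an absolute path with a leading delimiter, not the tokenizer's full option set:

```python
def path_hierarchy(path, delimiter="/"):
    # Emit one token per level: /home, /home/szc, /home/szc/a, ...
    parts = [p for p in path.split(delimiter) if p]
    tokens, current = [], ""
    for part in parts:
        current += delimiter + part
        tokens.append(current)
    return tokens

print(path_hierarchy("/home/szc/a/b/c/e"))
# ['/home', '/home/szc', '/home/szc/a', '/home/szc/a/b', '/home/szc/a/b/c', '/home/szc/a/b/c/e']
```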
Combining token filters

Token filters can be chained; here the tokens are lowercased and stop words are removed:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The boys in China are playing soccer!"
}
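The request above should yield the tokens boys, china, playing, soccer!; the whitespace tokenizer keeps the trailing "!" because it only splits on whitespace. The chain can be sketched like this (the stop-word set is a small assumed subset, not Lucene's exact list):

```python
# Assumed subset of the default English stop-word list
STOPWORDS = {"the", "in", "are", "a", "an", "and", "is", "it", "of", "to"}

def whitespace_lower_stop(text):
    # whitespace tokenizer, then the lowercase and stop filters, in order
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

print(whitespace_lower_stop("The boys in China are playing soccer!"))
# ['boys', 'china', 'playing', 'soccer!']
```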
Putting it all together

A composite custom Analyzer named my_analyzer is shown below; its char_filter, tokenizer, and filter are all custom definitions:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter":[
            "lowercase",
            "english_stop"
            ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => happy",
            ":( => sad"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

To use it, run _analyze against the index in which the analyzer is defined:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I`m a :) guy, and you ?"
}

The output is as follows; the emoticon was replaced first, the text was then split on the specified pattern, and finally the stop words were removed:

{
  "tokens" : [
    {
      "token" : "i`m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "happy",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "guy",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 5
    }
  ]
}
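The three stages of my_analyzer can be replayed in Python to confirm the output above. This is a simulation of the configuration, not Elasticsearch itself, and the stop-word set only approximates what "_english_" refers to:

```python
import re

# Approximation of the default English stop set referenced by "_english_"
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
             "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
             "such", "that", "the", "their", "then", "there", "these", "they",
             "this", "to", "was", "will", "with"}

def my_analyzer(text):
    # char_filter "emoticons": the mappings :) => happy and :( => sad
    text = text.replace(":)", "happy").replace(":(", "sad")
    # tokenizer "punctuation": split on the pattern [ .,!?]
    tokens = [t for t in re.split(r"[ .,!?]", text) if t]
    # token filters: lowercase, then english_stop
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

print(my_analyzer("I`m a :) guy, and you ?"))  # ['i`m', 'happy', 'guy', 'you']
```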
Conclusion

That covers analyzers in Elasticsearch 7.

Source: 内存溢出, http://outofmemory.cn/zaji/5665327.html
