The Analyzer is the component in ES dedicated to text analysis (tokenization). It is made up of three parts, which can also be combined explicitly, as shown in the sketch after this list:
- Character Filters: pre-process the raw text, for example stripping HTML
- Tokenizer: split the text into terms according to a set of rules
- Token Filter: post-process the emitted terms, for example removing stop words
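Before going through the built-in analyzers, here is a minimal sketch of exercising all three stages at once through the _analyze API; the sample HTML text is made up for illustration:
GET /_analyze { "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase"], "text": "<p>Some HTML Text</p>" }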
Standard Analyzer
This is the default analyzer. It splits text on word boundaries, converts letters to lowercase, and has stop-word removal disabled by default.
It is used as follows:
GET /_analyze { "analyzer": "standard", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The result is shown below; note that all uppercase letters have been converted to lowercase:
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "", "position" : 10 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "SimpleAnalyzer", "position" : 11 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "", "position" : 12 } ] }
SimpleAnalyzer
It splits on anything that is not a letter and discards the non-letter characters; letters are lowercased as well.
For example:
GET /_analyze { "analyzer": "simple", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is shown below; besides the lowercasing, all non-letter tokens have been dropped (the number 2):
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 10 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 11 } ] }WhitespaceAnalyzer
WhitespaceAnalyzer
It splits terms on whitespace only. For example:
GET /_analyze { "analyzer": "whitespace", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; note that It`s, Let`s and so on are kept intact:
{ "tokens" : [ { "token" : "It`s", "start_offset" : 0, "end_offset" : 4, "type" : "word", "position" : 0 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 2 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 3 }, { "token" : "commander.", "start_offset" : 16, "end_offset" : 26, "type" : "word", "position" : 4 }, { "token" : "Let`s", "start_offset" : 27, "end_offset" : 32, "type" : "word", "position" : 5 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 6 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 7 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 8 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "word", "position" : 9 }, { "token" : "times!", "start_offset" : 45, "end_offset" : 51, "type" : "word", "position" : 10 } ] }StopAnalyzer
StopAnalyzer
Compared with the SimpleAnalyzer it adds a stop token filter, which removes words such as the, a and is. For example:
GET /_analyze { "analyzer": "stop", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; many of the stop words are gone:
{ "tokens" : [ { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 11 } ] }Keyword Analyzer
Keyword Analyzer
It does not tokenize at all; the whole input is emitted as a single term. For example:
GET /_analyze { "analyzer": "keyword", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is as follows:
{ "tokens" : [ { "token" : "It`s a good day commander. Let`s do it for 2 times!", "start_offset" : 0, "end_offset" : 51, "type" : "word", "position" : 0 } ] }Pattern Analyzer
Pattern Analyzer
It tokenizes with a regular expression, \W+ by default, i.e. it splits on any non-word character. For example:
GET /_analyze { "analyzer": "pattern", "text": "It`s a good day commander. Let`s do it for 2 times!" }
Here the output is effectively the same as with standard:
{ "tokens" : [ { "token" : "it", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 }, { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "word", "position" : 1 }, { "token" : "a", "start_offset" : 5, "end_offset" : 6, "type" : "word", "position" : 2 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "word", "position" : 3 }, { "token" : "day", "start_offset" : 12, "end_offset" : 15, "type" : "word", "position" : 4 }, { "token" : "commander", "start_offset" : 16, "end_offset" : 25, "type" : "word", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "word", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "word", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "word", "position" : 8 }, { "token" : "it", "start_offset" : 36, "end_offset" : 38, "type" : "word", "position" : 9 }, { "token" : "for", "start_offset" : 39, "end_offset" : 42, "type" : "word", "position" : 10 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "word", "position" : 11 }, { "token" : "times", "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 12 } ] }LanguageAnalyzer
LanguageAnalyzer
ES can also analyze text per language:
GET /_analyze { "analyzer": "english", "text": "It`s a good day commander. Let`s do it for 2 times!" }
The output is below; stop words are filtered out here as well, and the remaining terms are stemmed (day becomes dai, commander becomes command, times becomes time):
{ "tokens" : [ { "token" : "s", "start_offset" : 3, "end_offset" : 4, "type" : "", "position" : 1 }, { "token" : "good", "start_offset" : 7, "end_offset" : 11, "type" : "", "position" : 3 }, { "token" : "dai", "start_offset" : 12, "end_offset" : 15, "type" : "", "position" : 4 }, { "token" : "command", "start_offset" : 16, "end_offset" : 25, "type" : "", "position" : 5 }, { "token" : "let", "start_offset" : 27, "end_offset" : 30, "type" : "", "position" : 6 }, { "token" : "s", "start_offset" : 31, "end_offset" : 32, "type" : "", "position" : 7 }, { "token" : "do", "start_offset" : 33, "end_offset" : 35, "type" : "", "position" : 8 }, { "token" : "2", "start_offset" : 43, "end_offset" : 44, "type" : "ICU-Analyzer", "position" : 11 }, { "token" : "time", "start_offset" : 45, "end_offset" : 50, "type" : "", "position" : 12 } ] }
ICU-Analyzer
This analyzer handles Chinese tokenization and has to be installed first:
[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu
-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%
-> Installed analysis-icu
Then restart ES and run a test:
GET /_analyze { "analyzer": "icu_analyzer", "text": "这个进球真是漂亮!" }
The output is below; apparently "进球" was not kept as a single word:
{ "tokens" : [ { "token" : "这个", "start_offset" : 0, "end_offset" : 2, "type" : "配置自定义Analyzer", "position" : 0 }, { "token" : "进", "start_offset" : 2, "end_offset" : 3, "type" : " ", "position" : 1 }, { "token" : "球", "start_offset" : 3, "end_offset" : 4, "type" : " ", "position" : 2 }, { "token" : "真是", "start_offset" : 4, "end_offset" : 6, "type" : " ", "position" : 3 }, { "token" : "漂亮", "start_offset" : 6, "end_offset" : 8, "type" : " ", "position" : 4 } ] }
Configuring a custom Analyzer
A custom Analyzer is built by combining a CharacterFilter, a Tokenizer and a TokenFilter.
The built-in CharacterFilters are HTML strip, Mapping and Pattern replace, used respectively for stripping HTML tags, string replacement and regex-based replacement.
The built-in Tokenizers include whitespace, standard, uax_url_email, pattern, keyword and path_hierarchy; you can also write a plugin in Java to implement your own tokenizer.
The built-in TokenFilters include lowercase, stop and synonym.
An example combining a tokenizer and a character filter:
POST _analyze { "tokenizer": "keyword", "char_filter": ["html_strip"], "text": "aaa" }
Again a tokenizer plus character filter combination, but this time a mapping is added to the character filter. For example:
POST _analyze { "tokenizer": "standard", "char_filter": [ { "type": "mapping", "mappings": ["- => _"] }], "text": "1-2, d-4" }正则
Regex
A regex example follows; $1 refers to the content of the corresponding capture group, here www.baidu.com:
POST _analyze { "tokenizer": "standard", "char_filter": [{ "type": "pattern_replace", "pattern": "http://(.*)", "replacement": "" }], "text": "http://www.baidu.com" }路径层次分词器
Path hierarchy tokenizer
The path hierarchy tokenizer is shown below; it treats the input /home/szc/a/b/c/e as a path and emits one token per directory level:
POST _analyze { "tokenizer": "path_hierarchy", "text": "/home/szc/a/b/c/e" }
The output is as follows:
{ "tokens" : [ { "token" : "/home", "start_offset" : 0, "end_offset" : 5, "type" : "word", "position" : 0 }, { "token" : "/home/szc", "start_offset" : 0, "end_offset" : 9, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a", "start_offset" : 0, "end_offset" : 11, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b", "start_offset" : 0, "end_offset" : 13, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b/c", "start_offset" : 0, "end_offset" : 15, "type" : "word", "position" : 0 }, { "token" : "/home/szc/a/b/c/e", "start_offset" : 0, "end_offset" : 17, "type" : "word", "position" : 0 } ] }filter组合
Filter combination
Token filters can be combined as below, applying lowercasing and stop-word removal at the same time:
POST _analyze { "tokenizer": "whitespace", "filter": ["lowercase", "stop"], "text": "The boys in China are playing soccer!" }综合使用
Putting it all together
A composite custom Analyzer named my_analyzer is shown below; its char_filter, tokenizer and filter are all custom-defined:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": { "type": "pattern", "pattern": "[ .,!?]" }
      },
      "char_filter": {
        "emoticons": { "type": "mapping", "mappings": [":) => happy", ":( => sad"] }
      },
      "filter": {
        "english_stop": { "type": "stop", "stopwords": "_english_" }
      }
    }
  }
}
To use it, run _analyze against the index in which the analyzer was defined:
POST my_index/_analyze { "analyzer": "my_analyzer", "text": "I`m a :) guy, and you ?" }
The output is below: the emoticon was first replaced, the text was then split by the specified regex, and finally the stop words were removed:
{ "tokens" : [ { "token" : "i`m", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 0 }, { "token" : "happy", "start_offset" : 6, "end_offset" : 8, "type" : "word", "position" : 2 }, { "token" : "guy", "start_offset" : 9, "end_offset" : 12, "type" : "word", "position" : 3 }, { "token" : "you", "start_offset" : 18, "end_offset" : 21, "type" : "word", "position" : 5 } ] }结语
Conclusion
That wraps up the analyzers in Elasticsearch 7.