ES Analyzers

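Analyzers:

Elasticsearch (ES) tokenizes each document when it builds the inverted index.
At search time, it tokenizes the user's input as well.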
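
The two steps above can be sketched in a few lines of Python. This is a toy illustration, not how Elasticsearch is implemented: `tokenize`, `build_inverted_index`, and `search` are hypothetical names, and the whitespace tokenizer is only a stand-in for a real analyzer. The point it shows is that the same analysis is applied to documents at index time and to the query at search time.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical stand-in analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Tokenize the query the same way and return ids matching any query term."""
    hits = set()
    for term in tokenize(query):
        hits.update(index.get(term, []))
    return sorted(hits)


docs = {
    1: "Learning Java",
    2: "Learning Elasticsearch",
    3: "Java and Elasticsearch",
}
index = build_inverted_index(docs)
print(search(index, "java"))  # [1, 3]
```

A whitespace tokenizer like this one cannot find word boundaries in Chinese text, which has no spaces between words; that is why a dictionary-based analyzer is needed.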

However, the default tokenization rules do not handle Chinese well.

Test this in Kibana's Dev Tools:

POST /_analyze
{
"analyzer": "standard",
"text": "张三老铁学习java！"
}

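POST is the HTTP request method.
/_analyze is the analyze API endpoint, which returns the tokens produced for the given text.
analyzer is the analyzer to use, here the default standard analyzer.
text is the content to tokenize.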
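
Outside Kibana, the same request can be assembled with Python's standard library. The sketch below is only an illustration under stated assumptions: it assumes Elasticsearch is reachable at `http://localhost:9200` without authentication, and `build_analyze_request` and `analyze` are hypothetical helper names, not part of any client library.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node without security enabled.
ES_URL = "http://localhost:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST /_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        return [item["token"] for item in json.load(response)["tokens"]]


request = build_analyze_request("standard", "张三老铁学习java！")
print(request.get_method(), request.full_url)  # POST http://localhost:9200/_analyze
```

With a running node, `analyze("standard", "张三老铁学习java！")` would return the list of tokens that the Dev Tools response shows in its `tokens` array.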

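The Chinese text is split into individual characters, so the standard analyzer is not suitable for Chinese; use the IK analyzer instead.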

Installing the IK analyzer
Online installation (not recommended, it is slow)

# Enter the container
docker exec -it elasticsearch /bin/bash

# Download and install online (the plugin version must match the Elasticsearch version)
./bin/elasticsearch-plugin  install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.12.1/elasticsearch-analysis-ik-7.12.1.zip

# Exit the container
exit
# Restart the container
docker restart elasticsearch

Offline installation (recommended)

# Find where the Elasticsearch plugins directory is mounted
docker volume inspect es-plugins

The output shows that the plugins directory is mounted at

/var/lib/docker/volumes/es-plugins/_data

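Upload the offline package to this directory as an unzipped folder (Elasticsearch loads plugins from extracted directories, not from .zip archives).
Then restart the container (named es here; substitute your own container name):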

docker restart es
# Follow the logs to confirm the plugin was loaded
docker logs -f es

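Go back to Kibana and test again.
The IK analyzer has two modes:
ik_smart: coarsest segmentation; produces fewer tokens and uses less memory
ik_max_word: finest-grained segmentation; produces more tokens and uses more memory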

ik_smart

POST /_analyze
{
"analyzer": "ik_smart",
"text": "张三老铁学习java！"
}

{
  "tokens" : [
    {
      "token" : "张",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "三老",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "铁",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "学习",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "java",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 4
    }
  ]
}

ik_max_word

POST /_analyze
{
"analyzer": "ik_max_word",
"text": "张三老铁学习java！"
}

{
  "tokens" : [
    {
      "token" : "张三",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "三老",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "三",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "TYPE_CNUM",
      "position" : 2
    },
    {
      "token" : "老",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "铁",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "学习",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "java",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 6
    }
  ]
}

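How the analyzer works
The analyzer relies on a dictionary: while tokenizing, it matches the text against the dictionary, and each match is emitted as a token.

Words that are not in the dictionary are not recognized as tokens. To have them recognized, you need to customize the analyzer, that is, extend its dictionary.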
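
To illustrate dictionary matching, here is a minimal sketch of greedy forward maximum matching in Python. It is a simplified stand-in, not IK's actual algorithm (the ik_smart output above, for example, is not what this sketch would produce), and `forward_max_match` is a hypothetical name. The optional `stopwords` argument mirrors the stop-word list configured in the next section.

```python
def forward_max_match(text, dictionary, stopwords=frozenset()):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there;
    if none matches, fall back to a single character. Stop words are dropped.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                if piece not in stopwords:
                    tokens.append(piece)
                i += size
                break
    return tokens


text = "张三的老铁学习"
base = {"张三", "学习"}

# Without "老铁" in the dictionary it falls apart into single characters.
print(forward_max_match(text, base))
# ['张三', '的', '老', '铁', '学习']

# Extending the dictionary and adding a stop word changes the result.
print(forward_max_match(text, base | {"老铁"}, stopwords={"的"}))
# ['张三', '老铁', '学习']
```

The second call shows the effect that extending the dictionary is meant to achieve: the new word is kept whole, and the stop word no longer appears in the output.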

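Extension words and stop words
Edit the IKAnalyzer.cfg.xml file in the config directory under the IK analyzer's directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Extension dictionary: words to add -->
	<entry key="ext_dict">ext.dic</entry>
	<!-- Extension stop-word dictionary: words to drop -->
	<entry key="ext_stopwords">stopword.dic</entry>
</properties>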

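With the extension and stop-word files specified, make sure ext.dic and stopword.dic exist in the same directory, creating them if necessary.
ext.dic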

老铁

stopword.dic
This file already exists, so there is no need to create it; just append to it.

的
了
哦
啊
嗯
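
Appending by hand works fine; if the lists are maintained by a script, a small helper can avoid duplicate entries. The sketch below is an assumption-laden illustration: `append_words` is a hypothetical helper, it rewrites the file with the new words at the end, and it reads and writes UTF-8, the encoding the IK dictionary files are expected to use. The demo runs against a temporary file rather than a real plugin directory.

```python
import tempfile
from pathlib import Path


def append_words(dic_path, words):
    """Add words missing from a dictionary file (one word per line).

    Returns the list of words that were actually added.
    """
    path = Path(dic_path)
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    seen = {line.strip() for line in existing}
    added = []
    for word in words:
        if word not in seen:
            added.append(word)
            seen.add(word)
    if added:
        # Rewrite the file so that it always ends with a newline.
        path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added


# Demo on a temporary copy, not on the real stopword.dic.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "stopword.dic"
    demo.write_text("的\n了\n", encoding="utf-8")
    first = append_words(demo, ["了", "哦", "啊"])
    second = append_words(demo, ["哦"])
    content = demo.read_text(encoding="utf-8")

print(first, second)  # ['哦', '啊'] []
```

Words already present are skipped, so running the helper repeatedly leaves the file unchanged.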

After updating the files, restart ES:

docker restart es

Test again:

POST /_analyze
{
"analyzer": "ik_smart",
"text": "张三的老铁学习java！"
}

The response now contains the token "老铁", and the stop word "的" has been removed:

{
  "tokens" : [
    {
      "token" : "张三",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "老铁",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "学习",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "java",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "ENGLISH",
      "position" : 3
    }
  ]
}

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5679495.html