The IK Analyzer for Elasticsearch (Notes)

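Once the IK analyzer plugin is installed, you can test how it tokenizes text.
The plugin provides two analyzers: ik_smart and ik_max_word.

ik_smart (the coarsest-grained split):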

GET _analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_smart"
}

This splits "中华人民共和国国歌" into "中华人民共和国" and "国歌".

ik_max_word (the finest-grained split; it emits every dictionary word it can find in the text):

GET _analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_max_word"
}

Result:

{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}
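
The start_offset and end_offset values index into the original text, which is easy to verify. Here is a minimal Python sketch, assuming a response of the shape shown above. Only four of the ten tokens are copied for brevity, and `check_offsets` is a name made up for this example.

```python
import json

# Abbreviated copy of the ik_max_word response shown above (four of the ten tokens).
RESPONSE = json.loads("""
{"tokens": [
  {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
  {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
  {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8},
  {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9}
]}
""")


def check_offsets(text, response):
    """Return (token, slice) pairs, where slice is text[start_offset:end_offset]."""
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]])
        for t in response["tokens"]
    ]


pairs = check_offsets("中华人民共和国国歌", RESPONSE)
for token, piece in pairs:
    # Each token should equal the slice of the input that its offsets describe.
    print(token, piece, token == piece)
```

Overlapping offsets (for example 中华 at 0-2 and 华人 at 1-3) are what make ik_max_word output larger than ik_smart output for the same text.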

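Specify the analyzer when creating the mapping.
By default, the analyzer used at search time is the same one used at index time (query_string can specify a different analyzer at query time).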

PUT /users
{
  "mappings": {
    "properties": {
      "tittle":{
        "type": "text",
        "analyzer": "ik_max_word"
        # By default the search analyzer is the same as the index analyzer (query_string can override it at query time)
        #"search_analyzer": "ik_max_word"
      },
      "age":{
        "type": "short"
      },
      "time":{
        "type": "date"
      },
      "content":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "address":{
        "type": "keyword"
      }
    }
  }
}
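
As an illustration of overriding the analyzer at query time, the sketch below builds a query_string search body in Python. It only constructs the JSON that would be sent to GET /users/_search. The helper name `query_string_body` and the choice of ik_smart as the query-time analyzer are assumptions made for this example, not something the original setup prescribes.

```python
import json


def query_string_body(text, field, analyzer):
    """Build a query_string search body that analyzes `text` with `analyzer`."""
    return {
        "query": {
            "query_string": {
                "query": text,
                "default_field": field,
                # Overrides the analyzer that the mapping would otherwise apply.
                "analyzer": analyzer,
            }
        }
    }


# "tittle" is the field name exactly as spelled in the mapping above.
body = query_string_body("功夫电影", "tittle", "ik_smart")
print(json.dumps(body, ensure_ascii=False, indent=2))
```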

Insert two documents:

POST /users/_bulk
{"index":{}}
{"tittle":"功夫电影","age":18,"time":"2008-04-18","content":"电影中有一个穿着拖鞋出场的人物，他叫火云邪神","address":"香港"}
{"index":{}}
{"tittle":"惊天大秘密","age":20,"time":"2021-11-26","content":"其实就是发生了一起车祸，很多吃瓜群众 永远滴神","address":"成都"}
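
Outside Kibana, the bulk body has to be assembled as newline-delimited JSON: an action line followed by a source line for each document, with a trailing newline at the end. The following is a rough Python sketch of that serialization. `bulk_body` is a hypothetical helper, and the documents are abbreviated versions of the two above.

```python
import json


def bulk_body(docs):
    """Serialize docs as NDJSON: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"tittle": "功夫电影", "age": 18, "address": "香港"},
    {"tittle": "惊天大秘密", "age": 20, "address": "成都"},
]
payload = bulk_body(docs)
print(payload, end="")
```

When sending such a payload over HTTP, it should be encoded as UTF-8 and sent with the Content-Type application/x-ndjson.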

Search content for 电影 (a term query does not analyze the search text):

GET /users/_search
{
  "query":{
    "term": {
      "content": {
        "value": "电影"
      }
    }
  }
}
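
The difference between term and an analyzed query such as match can be pictured with a toy model of the inverted index. This is only a sketch. The token set below is a hypothetical stand-in for what the analyzer stored for one content value, and the real IK output may differ.

```python
# Hypothetical tokens stored for one `content` value; real IK output may differ.
INDEXED_TOKENS = {"电影", "中", "有", "一个", "穿着", "拖鞋", "出场", "人物"}


def term_query(value, indexed_tokens):
    """term: the value is looked up as-is, as one exact token."""
    return value in indexed_tokens


def match_query(query_tokens, indexed_tokens):
    """match: the search text is analyzed first; any resulting token may hit."""
    return any(token in indexed_tokens for token in query_tokens)


# An exact token exists in the index, so term finds it.
print(term_query("电影", INDEXED_TOKENS))
# No single stored token equals the whole string, so term finds nothing.
print(term_query("电影人物", INDEXED_TOKENS))
# Assume the analyzer splits the search text into 电影 and 人物 before lookup.
print(match_query(["电影", "人物"], INDEXED_TOKENS))
```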

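What the dictionaries are for: by default, IK tokenizes Chinese text by looking words up in its bundled dictionaries.

For example, the .dic files in IK's config directory are all dictionary files.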
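
To make the role of the dictionary concrete, here is a toy dictionary-driven segmenter in Python. It is not IK's actual algorithm, which also handles ambiguity, single characters, numbers and more. It only shows how the set of known words determines the tokens. The function names and the small dictionary are made up for this sketch.

```python
def all_words(text, dictionary):
    """Fine-grained: every dictionary word found at every start position."""
    found = []
    for start in range(len(text)):
        # Try longer candidates first, mirroring the order of the output shown earlier.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def longest_match(text, dictionary):
    """Coarse-grained: greedy forward longest match, single characters as a fallback."""
    tokens, i = [], 0
    while i < len(text):
        end = next(
            (j for j in range(len(text), i, -1) if text[i:j] in dictionary),
            i + 1,  # no dictionary word starts here: emit one character
        )
        tokens.append(text[i:end])
        i = end
    return tokens


DICTIONARY = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
              "人民", "共和国", "共和", "国歌"}
text = "中华人民共和国国歌"
print(longest_match(text, DICTIONARY))
print(all_words(text, DICTIONARY))

# Adding a word to the dictionary changes how text containing it is split.
print(longest_match("火云邪神", DICTIONARY))
print(longest_match("火云邪神", DICTIONARY | {"火云邪神"}))
```

With the unmodified dictionary the name falls apart into single characters. Once the word is added, it comes out as one token, which is what a term query needs in order to match.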

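IK supports custom extension dictionaries and stopword dictionaries (as well as remote versions of both).
They are configured in the IKAnalyzer.cfg.xml file in the config directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Custom extension dictionaries -->
	<entry key="ext_dict"></entry>
	<!-- Custom extension stopword dictionaries -->
	<entry key="ext_stopwords"></entry>
	<!-- Remote extension dictionary -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Remote extension stopword dictionary -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>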

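For example, with the two documents indexed above, a term query on content for "火云邪神" finds nothing, because that word is not in the bundled dictionaries and is therefore not indexed as a single token.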

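Custom local dictionary:
1. Create a dictionary file.
2. Register it in IKAnalyzer.cfg.xml (and restart Elasticsearch so that the change is loaded).
3. Delete the previously indexed data and index it again.

Create a dictionary file named my.dic, with one word per line: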

火云邪神

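Register it in IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Custom extension dictionaries -->
	<entry key="ext_dict">my.dic</entry>
</properties>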

Now a search for 火云邪神 finds the document:

GET /users/_search
{
  "query":{
    "term": {
      "content": {
        "value": "火云邪神"
      }
    }
  }
}

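Stopword dictionary: words listed in it are discarded during tokenization.

As a result, a term query (which does not analyze the search text) for such a word finds nothing.

Remote dictionary

Create a new Spring Boot application and add a dic.txt file to it: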

吃瓜群众

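Configure it in IKAnalyzer.cfg.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Custom extension dictionaries -->
	<entry key="ext_dict">my.dic</entry>
	<!-- Custom extension stopword dictionaries -->
	<entry key="ext_stopwords"></entry>
	<!-- Remote extension dictionary -->
	<entry key="remote_ext_dict">http://192.168.56.1:8080/config/dic.txt</entry>
</properties>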
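
Any HTTP server can play the role of the Spring Boot application. According to the IK plugin's documentation, the endpoint should return the words one per line in UTF-8, together with Last-Modified and ETag response headers, and a change in either header causes the plugin to fetch the dictionary again. The Python sketch below is a stand-in written under those assumptions. The word list, the timestamp, and the names `dictionary_response`, `DictionaryHandler` and `serve` are invented for this example. Answering HEAD as well as GET is a precaution, on the assumption that the plugin may check for changes with a HEAD request.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["吃瓜群众"]          # dictionary entries, one per line in the response
LAST_CHANGED = 1637884800.0   # hypothetical time of the last edit (Unix seconds)


def dictionary_response(words, last_changed):
    """Return (headers, body) for a remote dictionary: UTF-8 text, one word per line."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": formatdate(last_changed, usegmt=True),
        # The ETag is derived from the content, so it changes whenever the words change.
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }
    return headers, body


class DictionaryHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        if self.path != "/config/dic.txt":
            self.send_error(404)
            return
        headers, body = dictionary_response(WORDS, LAST_CHANGED)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):
        # Headers only: enough for a client to detect a change.
        self._send(include_body=False)

    def do_GET(self):
        # Headers plus the word list itself.
        self._send(include_body=True)


def serve(port=8080):
    """Run the dictionary server until interrupted."""
    HTTPServer(("0.0.0.0", port), DictionaryHandler).serve_forever()
```

Calling `serve()` exposes the file at /config/dic.txt on port 8080, matching the path and port used in the configuration above. To publish new words, update `WORDS` and `LAST_CHANGED` together.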

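With this in place, Elasticsearch loads the remote dictionary, and when the contents of dic.txt change it detects the change and fetches the new words.
However, documents that are already indexed still seem to need reindexing each time, since they were tokenized with the old dictionary.
Workaround:

Send a POST request to ip:port/<index>/_update_by_query?conflicts=proceed, where ip:port is the Elasticsearch HTTP address (port 9200 by default):
localhost:9200/users/_update_by_query?conflicts=proceed
Multiple indices: localhost:9200/users,xxx,xxx/_update_by_query?conflicts=proceed
Wildcard *: localhost:9200/user*/_update_by_query?conflicts=proceed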
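
The same calls can be scripted. Below is a small Python sketch that builds the URLs and prepares the POST request without sending it. It assumes Elasticsearch is reachable at http://localhost:9200 (its default HTTP port). The helper names and the second index name `orders` are hypothetical.

```python
import urllib.request


def update_by_query_url(base_url, indices):
    """Build the _update_by_query URL for one or more index names or wildcard patterns."""
    target = ",".join(indices)
    return f"{base_url.rstrip('/')}/{target}/_update_by_query?conflicts=proceed"


def build_request(base_url, indices):
    """Prepare (but do not send) the POST request with an empty JSON body."""
    return urllib.request.Request(
        update_by_query_url(base_url, indices),
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# http://localhost:9200 is an assumed Elasticsearch address.
for indices in (["users"], ["users", "orders"], ["user*"]):
    print(update_by_query_url("http://localhost:9200", indices))

# To actually send it:
# urllib.request.urlopen(build_request("http://localhost:9200", ["users"]))
```

With no query in the body, every document in the target indices is rewritten from its stored source, so the text fields are analyzed again with the current dictionary.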

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5605350.html
