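Background: The two analyzers provided by the IK plugin do not recognize some newer words and sometimes cannot meet real business needs. In such cases we can define a custom dictionary to get the behavior we want.
Goal: Customize the Chinese analyzer so that it supports additional vocabulary.
Table of contents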
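- I. Current search behavior
- 1. Search keyword
- 2. Search results
- 3. Analysis of the results
- 4. IK tokenization in ES
- 5. IK tokenization result and analysis
- II. Customizing the analyzer
- 2.1. Add a new dictionary file
- 2.2. Dictionary configuration
- 2.3. Restart ES 7
- 2.4. Check the tokenization again
- 2.5. Search again
- 2.6. Rebuild the index
- 2.7. Query once more
- 2.8. Analysis of the results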
I. Current search behavior
1. Search keyword
# Search for hotels related to 凯悦 (Hyatt)
GET /shop/_search
{
  "query": {
    "match": {"name": "凯悦"}
  }
}
2. Search results
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 5, "relation" : "eq" },
    "max_score" : 3.3362136,
    "hits" : [
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "9", "_score" : 3.3362136,
        "_source" : {
          "price_per_man" : 176, "remark_score" : 2.2, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.916Z",
          "tags" : "落地大窗", "location" : "31.306172,121.525843", "seller_remark_score" : 3.0,
          "id" : 9, "name" : "凯悦酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "10", "_score" : 2.836244,
        "_source" : {
          "price_per_man" : 182, "remark_score" : 0.5, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.918Z",
          "tags" : "自助餐", "location" : "31.196742,121.322846", "seller_remark_score" : 3.0,
          "id" : 10, "name" : "凯悦嘉轩酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "11", "_score" : 2.836244,
        "_source" : {
          "price_per_man" : 74, "remark_score" : 1.0, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.920Z",
          "tags" : "自助餐", "location" : "31.156899,121.238362", "seller_remark_score" : 3.0,
          "id" : 11, "name" : "新虹桥凯悦酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "12", "_score" : 2.638537,
        "_source" : {
          "price_per_man" : 71, "remark_score" : 2.0, "category_name" : "美食2",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.923Z",
          "tags" : "有包厢", "location" : "30.679819,121.651921", "seller_remark_score" : 3.0,
          "id" : 12, "name" : "凯悦咖啡(新建西路店)", "seller_id" : 17, "category_id" : 1
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "4", "_score" : 1.3119392,
        "_source" : {
          "price_per_man" : 152, "remark_score" : 2.0, "category_name" : "美食2",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.907Z",
          "tags" : "落地大窗 有WIFI", "location" : "31.306419,121.524878", "seller_remark_score" : 2.0,
          "id" : 4, "name" : "花悦庭果木烤鸭", "seller_id" : 2, "category_id" : 1
        }
      }
    ]
  }
}
3. Analysis of the results
One of the results above does not belong. Its name does not contain the keyword "凯悦", yet the search still returns it, which is not the expected behavior.
{
  "_index" : "shop", "_type" : "_doc", "_id" : "4", "_score" : 1.3119392,
  "_source" : {
    "price_per_man" : 152, "remark_score" : 2.0, "category_name" : "美食2",
    "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.907Z",
    "tags" : "落地大窗 有WIFI", "location" : "31.306419,121.524878", "seller_remark_score" : 2.0,
    "id" : 4, "name" : "花悦庭果木烤鸭", "seller_id" : 2, "category_id" : 1
  }
}
4. IK tokenization in ES
# Check how 凯悦 is tokenized
GET /shop/_analyze
{
  "analyzer": "ik_smart",
  "text": "凯悦"
}
5. IK tokenization result and analysis
{
  "tokens" : [
    { "token" : "凯", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "悦", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 }
  ]
}
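As the output shows, the ik_smart analyzer splits the text into "凯" and "悦" and does not treat "凯悦" as a single token. The main reason is that the dictionary shipped with the installed IK plugin does not contain "凯悦". Since a match query combines its terms with OR by default, any document whose name contains "悦" alone, such as 花悦庭果木烤鸭, also matches.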
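The false positive can be reproduced without a cluster. The sketch below is a toy model, not IK or Elasticsearch code: it assumes a tokenizer that emits one token per Chinese character (real IK would also emit dictionary words such as 酒店, which does not change the outcome here) and an OR-style match over a small inverted index. The names `single_char_tokens`, `build_index` and `match_or` are made up for illustration.

```python
from collections import defaultdict


def single_char_tokens(text):
    """Toy tokenizer: one token per CJK character, everything else dropped."""
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def build_index(docs, tokenize):
    """Build a token -> set of document ids inverted index."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in tokenize(name):
            index[token].add(doc_id)
    return index


def match_or(index, query, tokenize):
    """Mimic a match query with OR semantics: union of the postings of every query token."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits


# Shop names taken from the search response above.
docs = {
    9: "凯悦酒店",
    10: "凯悦嘉轩酒店",
    11: "新虹桥凯悦酒店",
    12: "凯悦咖啡(新建西路店)",
    4: "花悦庭果木烤鸭",
}

index = build_index(docs, single_char_tokens)
hits = match_or(index, "凯悦", single_char_tokens)
print(sorted(hits))  # [4, 9, 10, 11, 12]
```

Document 4 is returned only because it shares the single character "悦" with the query, which is the same effect seen in the real response.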
II. Customizing the analyzer
2.1. Add a new dictionary file
cd /app/elasticsearch-7.15.2/config/analysis-ik/
vim new_word.dic
Add the custom word (one word per line, file saved as UTF-8):
凯悦
2.2. Dictionary configuration
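Configure IK to load the custom dictionary:
vim IKAnalyzer.cfg.xml
Contents:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">new_word.dic</entry>
</properties>
2.3. Restart ES 7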
ps -ef | grep elasticsearch
kill -9 <ES process ID>
cd /app/elasticsearch-7.15.2/
bin/elasticsearch -d
2.4. Check the tokenization again
# Check how 凯悦 is tokenized
GET /shop/_analyze
{
  "analyzer": "ik_smart",
  "text": "凯悦"
}
GET /shop/_analyze
{
  "analyzer": "ik_max_word",
  "text": "凯悦"
}
2.5. Search again
GET /shop/_search
{
  "query": {
    "match": {"name": "凯悦"}
  }
}
The search now returns no hits at all, even though the documents are still in the index.
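2.6. Rebuild the index
The index was built at a time when the IK dictionary did not yet contain "凯悦". The record 凯悦酒店 was therefore indexed under the single characters "凯" and "悦", because the extension dictionary had not been loaded when the index was created.
The problem now is that the index still stores "凯" and "悦" as separate terms, while the query text 凯悦 is analyzed at search time with the current analyzer, which produces the single token "凯悦".
The query looks up the two-character term, but the inverted index only contains the single-character terms, so the search comes back empty.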
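The mismatch between index-time and search-time tokens can be illustrated with a toy tokenizer. This is a minimal sketch under stated assumptions: `tokenize` is a hypothetical greedy longest-match segmenter standing in for IK (whose real algorithm is more elaborate), and the two dictionaries are tiny stand-ins for the IK dictionary before and after adding 凯悦.

```python
def tokenize(text, dictionary):
    """Greedy longest-match segmentation that falls back to single characters."""
    tokens = []
    max_len = max((len(word) for word in dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always succeeds.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


old_dict = {"酒店"}            # dictionary when the index was built
new_dict = old_dict | {"凯悦"}  # dictionary after adding the custom word

# Terms stored in the index were produced with the old dictionary.
indexed_terms = set(tokenize("凯悦酒店", old_dict))
# The query is analyzed after the restart, with the new dictionary.
query_terms = tokenize("凯悦", new_dict)

print(query_terms)                                   # ['凯悦']
print(any(t in indexed_terms for t in query_terms))  # False: no hit

# Re-indexing the same document with the new dictionary fixes the lookup.
reindexed_terms = set(tokenize("凯悦酒店", new_dict))
print(any(t in reindexed_terms for t in query_terms))  # True
```

The document text never changed. Only the terms derived from it did, which is why the stored documents have to be analyzed again.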
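Solutions:
Option 1 (first-time setup): delete the whole index, then run a full sync to rebuild it.
Option 2 (recommended): reindex only the documents whose indexed name contains both "凯" and "悦", and leave all other documents untouched.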
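For Option 2, the selecting query needs one term clause per character of the new word, with all clauses required. The sketch below builds such a request body for an arbitrary word. The helper name `rebuild_query_for` is hypothetical, and the approach assumes the word was previously indexed as single characters. For longer words whose parts were already in the dictionary, the old tokens may differ and the clauses would need adjusting.

```python
import json


def rebuild_query_for(word, field="name"):
    """Build an _update_by_query body selecting docs indexed under every character of `word`."""
    return {
        "query": {
            "bool": {
                # One required term clause per character of the new word.
                "must": [{"term": {field: ch}} for ch in word]
            }
        }
    }


body = rebuild_query_for("凯悦")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted as the body of a POST /shop/_update_by_query request.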
# Reindex the documents affected by the new word 凯悦
POST /shop/_update_by_query
{
  "query": {
    "bool": {
      "must": [
        {"term": {"name": "凯"}},
        {"term": {"name": "悦"}}
      ]
    }
  }
}
2.7. Query once more
GET /shop/_search
{
  "query": {
    "match": {"name": "凯悦"}
  }
}
2.8. Analysis of the results
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 4, "relation" : "eq" },
    "max_score" : 2.0709352,
    "hits" : [
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "9", "_score" : 2.0709352,
        "_source" : {
          "price_per_man" : 176, "remark_score" : 2.2, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.916Z",
          "tags" : "落地大窗", "location" : "31.306172,121.525843", "seller_remark_score" : 3.0,
          "id" : 9, "name" : "凯悦酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "10", "_score" : 1.7177677,
        "_source" : {
          "price_per_man" : 182, "remark_score" : 0.5, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.918Z",
          "tags" : "自助餐", "location" : "31.196742,121.322846", "seller_remark_score" : 3.0,
          "id" : 10, "name" : "凯悦嘉轩酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "11", "_score" : 1.7177677,
        "_source" : {
          "price_per_man" : 74, "remark_score" : 1.0, "category_name" : "酒店",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.920Z",
          "tags" : "自助餐", "location" : "31.156899,121.238362", "seller_remark_score" : 3.0,
          "id" : 11, "name" : "新虹桥凯悦酒店", "seller_id" : 17, "category_id" : 2
        }
      },
      {
        "_index" : "shop", "_type" : "_doc", "_id" : "12", "_score" : 1.5828056,
        "_source" : {
          "price_per_man" : 71, "remark_score" : 2.0, "category_name" : "美食2",
          "@version" : "1", "seller_disabled_flag" : 0, "@timestamp" : "2021-11-21T04:10:03.923Z",
          "tags" : "有包厢", "location" : "30.679819,121.651921", "seller_remark_score" : 3.0,
          "id" : 12, "name" : "凯悦咖啡(新建西路店)", "seller_id" : 17, "category_id" : 1
        }
      }
    ]
  }
}
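As the results above show, the search now behaves as expected: it returns the four documents whose names contain 凯悦, and 花悦庭果木烤鸭 no longer appears.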