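1. Download and install the plugin release that matches your ES version
a. Get the release from GitHub - NLPchina/elasticsearch-analysis-ansj
b. Unzip elasticsearch-analysis-ansj-7.7.1-release.zip into the plugins directory
c. Copy ansj.cfg.xml into the ES config directory
d. Create a library directory at the same level as the ES config directory to hold the tokenizer data, and put the dictionaries there:
custom dictionary (default.dic), stop-word dictionary (stop.dic), ambiguity dictionary (ambiguity.dic), synonym dictionary (synonyms.dic)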
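Steps b to d can be scripted. The following is only a minimal sketch: the function name install_ansj, the plugin subdirectory name elasticsearch-analysis-ansj, and the idea of copying the dictionaries from a single source folder are assumptions made for illustration, not something the plugin prescribes.

```python
import shutil
import zipfile
from pathlib import Path

# Dictionary file names listed in step d.
DICTIONARIES = ("default.dic", "stop.dic", "ambiguity.dic", "synonyms.dic")


def install_ansj(es_home, release_zip, cfg_file, dic_dir):
    """Lay out the plugin files as in steps b-d and return the library directory."""
    es_home = Path(es_home)

    # b. Unzip the release under plugins/ (the subdirectory name is an assumption).
    plugin_dir = es_home / "plugins" / "elasticsearch-analysis-ansj"
    plugin_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(release_zip) as zf:
        zf.extractall(plugin_dir)

    # c. Copy ansj.cfg.xml into config/.
    config_dir = es_home / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(cfg_file, config_dir / "ansj.cfg.xml")

    # d. Create library/ next to config/ and copy whichever dictionaries exist.
    library_dir = es_home / "library"
    library_dir.mkdir(exist_ok=True)
    for name in DICTIONARIES:
        src = Path(dic_dir) / name
        if src.exists():
            shutil.copy(src, library_dir / name)
    return library_dir
```

Called as install_ansj("/path/to/es", "elasticsearch-analysis-ansj-7.7.1-release.zip", "ansj.cfg.xml", "dics"), it leaves missing dictionary files out rather than failing, so check the library directory afterwards.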
2. Restart Elasticsearch
III. Tokenization modes
1. Tokenization mode: base_ansj
Basic tokenization
POST _analyze
{ "text": ["美国阿拉斯加州发生8.0级地震"], "analyzer": "index_ansj" }
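The same _analyze request can also be sent from a script. This is a rough sketch using only the Python standard library; the helper names build_analyze_request, token_texts and analyze are made up for illustration, and the base URL assumes a local node on port 9200 without authentication.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer="index_ansj"):
    """Build a POST request for the _analyze endpoint."""
    body = json.dumps({"text": [text], "analyzer": analyzer}, ensure_ascii=False)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract just the token strings from an _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(base_url, text, analyzer="index_ansj"):
    """Send the request to a running node (assumed reachable) and return the tokens."""
    req = build_analyze_request(base_url, text, analyzer)
    with urllib.request.urlopen(req) as resp:
        return token_texts(json.load(resp))
```

For example, analyze("http://127.0.0.1:9200", "美国阿拉斯加州发生8.0级地震") would return the token strings of the response for that sentence.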
Result
{
  "tokens" : [
    { "token" : "美国", "start_offset" : 0, "end_offset" : 2, "type" : "ns", "position" : 0 },
    { "token" : "美", "start_offset" : 0, "end_offset" : 1, "type" : "b", "position" : 1 },
    { "token" : "国", "start_offset" : 1, "end_offset" : 2, "type" : "n", "position" : 2 },
    { "token" : "阿拉斯加州", "start_offset" : 2, "end_offset" : 7, "type" : "nsf", "position" : 3 },
    { "token" : "阿拉斯加", "start_offset" : 2, "end_offset" : 6, "type" : "nsf", "position" : 4 },
    { "token" : "阿拉斯", "start_offset" : 2, "end_offset" : 5, "type" : "nsf", "position" : 5 },
    { "token" : "阿拉", "start_offset" : 2, "end_offset" : 4, "type" : "r", "position" : 6 },
    { "token" : "阿", "start_offset" : 2, "end_offset" : 3, "type" : "b", "position" : 7 },
    { "token" : "拉斯", "start_offset" : 3, "end_offset" : 5, "type" : "nrf", "position" : 8 },
    { "token" : "拉", "start_offset" : 3, "end_offset" : 4, "type" : "v", "position" : 9 },
    { "token" : "斯", "start_offset" : 4, "end_offset" : 5, "type" : "b", "position" : 10 },
    { "token" : "加州", "start_offset" : 5, "end_offset" : 7, "type" : "ns", "position" : 11 },
    { "token" : "加", "start_offset" : 5, "end_offset" : 6, "type" : "v", "position" : 12 },
    { "token" : "州", "start_offset" : 6, "end_offset" : 7, "type" : "n", "position" : 13 },
    { "token" : "发生", "start_offset" : 7, "end_offset" : 9, "type" : "v", "position" : 14 },
    { "token" : "发", "start_offset" : 7, "end_offset" : 8, "type" : "v", "position" : 15 },
    { "token" : "生", "start_offset" : 8, "end_offset" : 9, "type" : "v", "position" : 16 },
    { "token" : "8.0级", "start_offset" : 9, "end_offset" : 13, "type" : "mq", "position" : 17 },
    { "token" : "0", "start_offset" : 11, "end_offset" : 12, "type" : "w", "position" : 18 },
    { "token" : "级", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 19 },
    { "token" : "地震", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 20 },
    { "token" : "地", "start_offset" : 13, "end_offset" : 14, "type" : "ude2", "position" : 21 },
    { "token" : "震", "start_offset" : 14, "end_offset" : 15, "type" : "vi", "position" : 22 }
  ]
}
IV. APIs exposed by ansj
http://127.0.0.1:9200/_ansj/flush/dic/single?key=dic
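/_cat/ansj: run tokenization
Example: /_cat/ansj?text=中国&type=index_ansj&dic=dic&stop=stop&ambiguity=ambiguity&synonyms=synonyms
text and type are required: text is the sentence to tokenize, and type is the tokenization mode, such as index_ansj in the example.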
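A URL of this form can be assembled programmatically so that the text is percent-encoded correctly. The sketch below is illustrative only: build_cat_ansj_url is a hypothetical helper, and the optional parameters are limited to the four that appear in the example above.

```python
from urllib.parse import urlencode


def build_cat_ansj_url(base_url, text, type_,
                       dic=None, stop=None, ambiguity=None, synonyms=None):
    """Build a /_cat/ansj URL; text and type are required, the rest optional."""
    if not text or not type_:
        raise ValueError("text and type are required")

    params = {"text": text, "type": type_}
    optional = {"dic": dic, "stop": stop,
                "ambiguity": ambiguity, "synonyms": synonyms}
    # Only include the optional parameters that were actually given.
    params.update({k: v for k, v in optional.items() if v is not None})

    # urlencode percent-encodes non-ASCII text as UTF-8.
    return base_url.rstrip("/") + "/_cat/ansj?" + urlencode(params)
```

Omitting text or type raises a ValueError before any request is made, which mirrors the rule that both are mandatory.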