Elasticsearch from Beginner to Mastery: Installing and Using the HanLP Analyzer

I. Versions and compatibility

Plugin version    Branch version
7.x               7.x
6.x               6.x

II. Installation steps

1. Download and install the plugin release that matches your Elasticsearch version

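a. Download the release package that matches your Elasticsearch version. The latest release package can also be downloaded from Baidu Netdisk (extraction code: i0o7).

b. Run the following command to install the plugin, where PATH is the absolute path of the plugin package: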

./bin/elasticsearch-plugin install file://${PATH}
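The install command expects a file:// URL built from an absolute path, which is easy to get wrong by hand. The following is a minimal Python sketch that builds the URL and the command arguments. The function names and the paths in the usage comment are hypothetical and not part of the plugin.

```python
from pathlib import Path
from typing import List


def plugin_file_url(package_path: str) -> str:
    """Return a file:// URL for a local plugin package.

    The path is made absolute first, because Path.as_uri()
    rejects relative paths.
    """
    return Path(package_path).expanduser().absolute().as_uri()


def install_command(es_home: str, package_path: str) -> List[str]:
    """Build the argument list for `elasticsearch-plugin install`."""
    plugin_tool = Path(es_home) / "bin" / "elasticsearch-plugin"
    return [str(plugin_tool), "install", plugin_file_url(package_path)]


# Usage (hypothetical paths), for example with subprocess.run:
#   subprocess.run(install_command("/opt/elasticsearch", "/tmp/hanlp-plugin.zip"), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the package path contains spaces.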

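2. Install the data package

The release package contains only the default segmentation data from the HanLP source. To download the full data package, see the HanLP Release page.

Data package directory: ES_HOME/plugins/analysis-hanlp

Note: some custom dictionary files in the original data package have Chinese file names. The hanlp.properties shipped with the plugin refers to them by English names, so rename the files to match.

3. Restart Elasticsearch

Note: ES_HOME in the instructions above is your own Elasticsearch installation path and must be an absolute path.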

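4. Hot reloading

This version adds hot reloading of dictionaries. To update a dictionary:

a. Add the custom dictionary file to the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory

b. Edit hanlp.properties and add the new dictionary to CustomDictionaryPath

c. Wait one minute; the dictionary is loaded automatically

Note: these changes must be made on every node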
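Step b can be scripted when several nodes need the same change. The sketch below assumes that CustomDictionaryPath in hanlp.properties is a semicolon-separated list of dictionary paths on a single line; the function name and the dictionary file name my_words.txt are hypothetical.

```python
def add_custom_dictionary(properties_text: str, dictionary_path: str) -> str:
    """Append dictionary_path to the CustomDictionaryPath entry.

    Assumes the entry is one line of the form
    CustomDictionaryPath=path1; path2; ...
    The call is idempotent: an existing entry is not added twice.
    """
    lines = properties_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and lines that are not key=value pairs.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        if key.strip() != "CustomDictionaryPath":
            continue
        entries = [e.strip() for e in value.split(";") if e.strip()]
        if dictionary_path not in entries:
            entries.append(dictionary_path)
        lines[i] = "CustomDictionaryPath=" + "; ".join(entries)
        return "\n".join(lines) + "\n"
    raise ValueError("CustomDictionaryPath not found in properties text")


# Usage (hypothetical file name):
#   text = open("hanlp.properties", encoding="utf-8").read()
#   new_text = add_custom_dictionary(text, "data/dictionary/custom/my_words.txt")
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back on each node.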

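Available analyzers

hanlp             HanLP default segmentation
hanlp_standard    Standard segmentation
hanlp_index       Index segmentation
hanlp_nlp         NLP segmentation
hanlp_crf         CRF segmentation
hanlp_n_short     N-shortest-path segmentation
hanlp_dijkstra    Shortest-path segmentation
hanlp_speed       Fast dictionary segmentation

5. Example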

POST _analyze
{
  "text": ["美国阿拉斯加州发生8.0级地震"],
  "analyzer": "hanlp_index"
}

Result:

{
  "tokens" : [
    {
      "token" : "美国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nsf",
      "position" : 0
    },
    {
      "token" : "阿拉斯加州",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "nsf",
      "position" : 1
    },
    {
      "token" : "阿拉斯加",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "nsf",
      "position" : 2
    },
    {
      "token" : "阿拉斯",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "nsf",
      "position" : 3
    },
    {
      "token" : "阿拉",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "r",
      "position" : 4
    },
    {
      "token" : "拉斯",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "nrf",
      "position" : 5
    },
    {
      "token" : "加州",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "ns",
      "position" : 6
    },
    {
      "token" : "发生",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "v",
      "position" : 7
    },
    {
      "token" : "8.0",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "m",
      "position" : 8
    },
    {
      "token" : "级",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "q",
      "position" : 9
    },
    {
      "token" : "地震",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "n",
      "position" : 10
    }
  ]
}
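The same _analyze request can be sent from a script. The following sketch uses only the Python standard library. It assumes Elasticsearch is reachable at http://localhost:9200 without authentication; the function names are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer):
    """Build a POST request for the _analyze API (the request is not sent here)."""
    body = json.dumps(
        {"text": [text], "analyzer": analyzer}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_spans(response_body):
    """Extract (token, start_offset, end_offset) tuples from an _analyze response."""
    payload = json.loads(response_body)
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in payload["tokens"]]


# Usage (assumes a local node on the default port):
#   request = build_analyze_request("http://localhost:9200", "美国阿拉斯加州发生8.0级地震", "hanlp_index")
#   with urllib.request.urlopen(request) as response:
#       print(token_spans(response.read()))
```

With hanlp_index, the spans overlap (for example 阿拉斯加州 and 加州), as in the result above, because index segmentation also emits the shorter words contained in a long word.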

III. Remote dictionary configuration

The configuration file is ES_HOME/config/analysis-hanlp/hanlp-remote.xml


    HanLP Analyzer 扩展配置

    
    words_location

    
    stop_words_location

1. Remote extension dictionary

Here words_location is either a URL, or a URL followed by a space and a part-of-speech tag, for example:

1. http://localhost:8080/mydic

2. http://localhost:8080/mydic nt

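The first example configures the URL only. Each line of the dictionary is one word, in the format [word] [POS A] [frequency of A] [POS B] [frequency of B] ... If a line gives no part of speech, the dictionary's default part of speech, n, is used.

The second example configures the dictionary URL together with a default part of speech, nt, for that dictionary. Lines in the dictionary follow the same format, [word] [POS A] [frequency of A] [POS B] [frequency of B] ... If a line gives no part of speech, the default nt is used.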
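To make the line format concrete, here is a small parser sketch in Python. It is an illustration, not the plugin's own parsing code. The frequency of 1 used when a line carries only a word is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple


def parse_dictionary_line(line: str, default_pos: str = "n") -> Tuple[str, List[Tuple[str, int]]]:
    """Parse '[word] [posA] [freqA] [posB] [freqB] ...' into (word, [(pos, freq), ...]).

    When the line contains only a word, the dictionary's default part of
    speech is used; the frequency of 1 in that case is an assumption.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty dictionary line")
    word, rest = parts[0], parts[1:]
    if not rest:
        return word, [(default_pos, 1)]
    if len(rest) % 2 != 0:
        raise ValueError("part-of-speech tags and frequencies must come in pairs")
    pairs = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    return word, pairs


# Usage: a dictionary configured as "http://localhost:8080/mydic nt"
# would be parsed with default_pos="nt".
```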

2. Remote extension stop-word dictionary

Here stop_words_location is a URL, for example:

1. http://localhost:8080/mystopdic

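The example configures the URL only. Each line of the dictionary is one word; no part of speech or frequency is needed. Use \n as the line separator.

Note: every dictionary URL must meet the following conditions for hot reloading to work:

The HTTP response must include two headers, Last-Modified and ETag. Both are strings; whenever either one changes, the plugin fetches the dictionary again and updates its word list.

Multiple dictionary paths can be configured, separated by semicolons (;)

Each URL is polled once per minute

Dictionaries must be encoded in UTF-8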
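The conditions above can be met with a very small HTTP endpoint. The following sketch uses Python's standard http.server to serve a dictionary with both Last-Modified and ETag headers. Deriving the ETag from a SHA-256 hash of the content is a design choice made here, not a requirement of the plugin, and all names are hypothetical.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer


def dictionary_headers(content: bytes, modified_at: float) -> dict:
    """Build response headers; the ETag changes whenever the content changes."""
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(content)),
        "Last-Modified": formatdate(modified_at, usegmt=True),
        "ETag": '"' + hashlib.sha256(content).hexdigest() + '"',
    }


def make_handler(read_dictionary):
    """read_dictionary() must return (utf8_bytes, modification_time_in_seconds)."""

    class DictionaryHandler(BaseHTTPRequestHandler):
        def _send_headers(self) -> bytes:
            content, modified_at = read_dictionary()
            self.send_response(200)
            for name, value in dictionary_headers(content, modified_at).items():
                self.send_header(name, value)
            self.end_headers()
            return content

        def do_HEAD(self):
            # Headers only, so a client can check for changes cheaply.
            self._send_headers()

        def do_GET(self):
            self.wfile.write(self._send_headers())

    return DictionaryHandler


# Usage (hypothetical dictionary content, served for any path on port 8080):
#   import time
#   words = "地震局 nt 10\n".encode("utf-8")
#   started = time.time()
#   HTTPServer(("localhost", 8080), make_handler(lambda: (words, started))).serve_forever()
```

Because the ETag is computed from the content, editing the dictionary is enough to signal a change on the next poll.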

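3. Custom analyzer configuration

On top of HanLP's segmentation modes and options, the plugin exposes its own set of settings. The following options can be used to define a custom analyzer:

Config                                Description
enable_custom_config                  Whether to enable custom configuration
enable_index_mode                     Whether to use index segmentation
enable_number_quantifier_recognize    Whether to recognize numbers and quantifiers
enable_custom_dictionary              Whether to load the user dictionary
enable_translated_name_recognize      Whether to recognize transliterated person names
enable_japanese_name_recognize        Whether to recognize Japanese person names
enable_organization_recognize         Whether to recognize organization names
enable_place_recognize                Whether to recognize place names
enable_name_recognize                 Whether to recognize Chinese person names
enable_traditional_chinese_mode       Whether to enable Traditional Chinese mode
enable_stop_dictionary                Whether to enable the stop-word dictionary
enable_part_of_speech_tagging         Whether to enable part-of-speech tagging
enable_remote_dict                    Whether to enable remote dictionaries
enable_normalization                  Whether to perform character normalization
enable_offset                         Whether to compute offsets

Note: to use any of the options above in a custom analyzer, enable_custom_config must be set to true.

For example: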

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_stop_dictionary": true,
          "enable_custom_config": true
        }
      }
    }
  }
}

POST test/_analyze
{
  "text": "美国,|=阿拉斯加州发生8.0级地震",
  "analyzer": "my_hanlp_analyzer"
}

Result:

{
  "tokens" : [
    {
      "token" : "美国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "nsf",
      "position" : 0
    },
    {
      "token" : ",|=",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "w",
      "position" : 1
    },
    {
      "token" : "阿拉斯加州",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "nsf",
      "position" : 2
    },
    {
      "token" : "发生",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "v",
      "position" : 3
    },
    {
      "token" : "8.0",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "m",
      "position" : 4
    },
    {
      "token" : "级",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "q",
      "position" : 5
    },
    {
      "token" : "地震",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "n",
      "position" : 6
    }
  ]
}
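When several indices need slightly different analyzers, the settings body from the custom tokenizer example can be generated instead of written by hand. The helper below is a hypothetical Python sketch. It checks option names against the list of settings documented in this article and always sets enable_custom_config to true, since the other options are ignored without it.

```python
HANLP_OPTIONS = {
    "enable_custom_config", "enable_index_mode",
    "enable_number_quantifier_recognize", "enable_custom_dictionary",
    "enable_translated_name_recognize", "enable_japanese_name_recognize",
    "enable_organization_recognize", "enable_place_recognize",
    "enable_name_recognize", "enable_traditional_chinese_mode",
    "enable_stop_dictionary", "enable_part_of_speech_tagging",
    "enable_remote_dict", "enable_normalization", "enable_offset",
}


def hanlp_index_settings(analyzer_name: str, tokenizer_name: str, **options: bool) -> dict:
    """Build an index-creation body with a custom hanlp tokenizer."""
    unknown = set(options) - HANLP_OPTIONS
    if unknown:
        raise ValueError("unknown HanLP options: " + ", ".join(sorted(unknown)))
    tokenizer = {"type": "hanlp"}
    tokenizer.update(options)
    # Forced on, even if the caller passed False: the options need it to apply.
    tokenizer["enable_custom_config"] = True
    return {
        "settings": {
            "analysis": {
                "analyzer": {analyzer_name: {"tokenizer": tokenizer_name}},
                "tokenizer": {tokenizer_name: tokenizer},
            }
        }
    }


# Usage: the body of the "PUT test" request shown earlier.
#   body = hanlp_index_settings("my_hanlp_analyzer", "my_hanlp", enable_stop_dictionary=True)
```

Rejecting unknown option names catches typos before the request reaches the cluster.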

IV. Troubleshooting

1. "java.io.FilePermission" "data/dictionary/CoreNatureDictionary.tr.txt" "read" error

Edit the plugin-security.policy file in the plugin directory

Add:

// HanLP data directories
permission java.io.FilePermission "<<ALL FILES>>", "read,write,delete";

Restart the Elasticsearch cluster

Note: the plugin has to be installed on every node.


Original article: https://outofmemory.cn/zaji/5718641.html
