Extracting (multi-word) keywords from text with Elasticsearch


There is really only one way to do this: index your data as whole keywords, and analyze the text you search with using shingles:

See this reproduction:

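First, we create two custom analyzers, one based on the `keyword` tokenizer and one that adds shingles. Note that the mapping uses Elasticsearch 1.x syntax (`index_analyzer`, the `string` field type and a mapping type); later versions replaced `index_analyzer` with `analyzer` and `string` with `text`: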

    PUT test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer_keyword": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [ "asciifolding", "lowercase" ]
            },
            "my_analyzer_shingle": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "asciifolding", "lowercase", "shingle" ]
            }
          }
        }
      },
      "mappings": {
        "your_type": {
          "properties": {
            "keyword": {
              "type": "string",
              "index_analyzer": "my_analyzer_keyword",
              "search_analyzer": "my_analyzer_shingle"
            }
          }
        }
      }
    }

Now let's create some sample documents from the data you provided:

    POST /test/your_type/1
    { "id": 1, "keyword": "thousand eyes" }

    POST /test/your_type/2
    { "id": 2, "keyword": "facebook" }

    POST /test/your_type/3
    { "id": 3, "keyword": "superdoc" }

    POST /test/your_type/4
    { "id": 4, "keyword": "quora" }

    POST /test/your_type/5
    { "id": 5, "keyword": "your story" }

    POST /test/your_type/6
    { "id": 6, "keyword": "Surgery" }

    POST /test/your_type/7
    { "id": 7, "keyword": "lending club" }

    POST /test/your_type/8
    { "id": 8, "keyword": "ad roll" }

    POST /test/your_type/9
    { "id": 9, "keyword": "the honest company" }

    POST /test/your_type/10
    { "id": 10, "keyword": "Draft kings" }

Finally, the query that runs the search:

    POST /test/your_type/_search
    {
      "query": {
        "match": {
          "keyword": "I saw the news of lending club on facebook, your story and quora"
        }
      }
    }

Here is the result:

    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0.009332742,
        "hits": [
          {
            "_index": "test",
            "_type": "your_type",
            "_id": "2",
            "_score": 0.009332742,
            "_source": { "id": 2, "keyword": "facebook" }
          },
          {
            "_index": "test",
            "_type": "your_type",
            "_id": "7",
            "_score": 0.009332742,
            "_source": { "id": 7, "keyword": "lending club" }
          },
          {
            "_index": "test",
            "_type": "your_type",
            "_id": "4",
            "_score": 0.009207102,
            "_source": { "id": 4, "keyword": "quora" }
          },
          {
            "_index": "test",
            "_type": "your_type",
            "_id": "5",
            "_score": 0.0014755741,
            "_source": { "id": 5, "keyword": "your story" }
          }
        ]
      }
    }
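Once the response is parsed, the extracted keywords are simply the `keyword` values of the hits. The following is a minimal sketch, assuming the response body has already been decoded into a Python dict; the helper name `matched_keywords` is hypothetical, and the `response` value is a trimmed copy of the result above.

```python
def matched_keywords(response):
    """Return the `keyword` field of every hit, in the order Elasticsearch ranked them."""
    return [hit["_source"]["keyword"] for hit in response["hits"]["hits"]]


# Trimmed copy of the search response shown above (only the fields used here).
response = {
    "hits": {
        "total": 4,
        "hits": [
            {"_id": "2", "_score": 0.009332742, "_source": {"id": 2, "keyword": "facebook"}},
            {"_id": "7", "_score": 0.009332742, "_source": {"id": 7, "keyword": "lending club"}},
            {"_id": "4", "_score": 0.009207102, "_source": {"id": 4, "keyword": "quora"}},
            {"_id": "5", "_score": 0.0014755741, "_source": {"id": 5, "keyword": "your story"}},
        ],
    }
}

print(matched_keywords(response))
# ['facebook', 'lending club', 'quora', 'your story']
```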

So what does it do behind the scenes?

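At index time, each document is indexed as a whole keyword: the `keyword` tokenizer emits the entire string as a single token. I also added the `asciifolding` filter, which normalizes letters (for example, `é` becomes `e`), and the `lowercase` filter, which makes the search case-insensitive. So, for example, `Draft kings` is indexed as `draft kings`.

At search time, the search analyzer applies the same filters, except that its tokenizer emits individual word tokens and the `shingle` filter then builds shingles (combinations of adjacent tokens) on top of them. Those shingles are what match the keywords indexed in the first step.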
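The two analysis paths can be imitated outside Elasticsearch to see why the match works. This is only a rough sketch under stated assumptions: `fold`, `keyword_analyze`, `shingle_analyze` and `extract_keywords` are hypothetical helper names, Unicode decomposition stands in for the real `asciifolding` filter, and a `\w+` regular expression stands in for the `standard` tokenizer. It assumes the `shingle` filter's default settings, which emit single words plus two-word shingles.

```python
import re
import unicodedata


def fold(text):
    """Approximate asciifolding + lowercase: strip combining accents, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def keyword_analyze(text):
    """Index-time path: the whole string becomes one token."""
    return [fold(text)]


def shingle_analyze(text, max_shingle_size=2):
    """Search-time path: word tokens plus shingles of up to max_shingle_size words."""
    words = re.findall(r"\w+", fold(text))
    tokens = []
    for size in range(1, max_shingle_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


# The sample keywords indexed above.
KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]


def extract_keywords(text, keywords=KEYWORDS):
    """Return the keywords whose single index token equals one of the query tokens."""
    query_tokens = set(shingle_analyze(text))
    return [kw for kw in keywords if keyword_analyze(kw)[0] in query_tokens]


text = "I saw the news of lending club on facebook, your story and quora"
print(extract_keywords(text))
# ['facebook', 'quora', 'your story', 'lending club']
```

The output is in the order of the keyword list rather than by relevance score. With two-word shingles, a three-word keyword such as `the honest company` can never equal a query token, so matching it would presumably require raising `max_shingle_size` on the `shingle` filter.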

Feel free to share; if you repost, please credit the source: 内存溢出

Original address: http://outofmemory.cn/zaji/4908745.html
