Extracting keywords (multi-word) from text using Elasticsearch


There is really only one way to do this: index your data as whole keywords, and analyze the incoming search text with shingles.

Here is a reproduction:

First, we create two custom analyzers, one based on the keyword tokenizer and one that adds a shingle filter:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["asciifolding", "lowercase"]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase", "shingle"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "keyword": {
          "type": "string",
          "index_analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}
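To make the index-side analyzer more concrete, here is a rough Python approximation of what `my_analyzer_keyword` does to a value. It is only a sketch: the function name `keyword_analyze` is made up for illustration, and Unicode decomposition is used as a stand-in for the real `asciifolding` filter, which handles more characters than this does.

```python
import unicodedata


def keyword_analyze(text):
    """Approximate keyword tokenizer + asciifolding + lowercase (illustrative only)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The keyword tokenizer emits the whole input as a single token.
    return [folded.lower()]


print(keyword_analyze("Draft kings"))  # ['draft kings']
print(keyword_analyze("Café Élan"))    # ['cafe elan']
```

Each value yields exactly one token, so a multi-word keyword can only be matched by a search token that contains the same words in the same order.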

Now let's index some sample documents using the data you provided:

POST /test/your_type/1
{ "id": 1, "keyword": "thousand eyes" }

POST /test/your_type/2
{ "id": 2, "keyword": "facebook" }

POST /test/your_type/3
{ "id": 3, "keyword": "superdoc" }

POST /test/your_type/4
{ "id": 4, "keyword": "quora" }

POST /test/your_type/5
{ "id": 5, "keyword": "your story" }

POST /test/your_type/6
{ "id": 6, "keyword": "Surgery" }

POST /test/your_type/7
{ "id": 7, "keyword": "lending club" }

POST /test/your_type/8
{ "id": 8, "keyword": "ad roll" }

POST /test/your_type/9
{ "id": 9, "keyword": "the honest company" }

POST /test/your_type/10
{ "id": 10, "keyword": "Draft kings" }

Finally, the query that runs the search:

POST /test/your_type/_search
{
  "query": {
    "match": {
      "keyword": "I saw the news of lending club on facebook, your story and quora"
    }
  }
}

Here is the result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.009332742,
    "hits": [
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "2",
        "_score": 0.009332742,
        "_source": {
          "id": 2,
          "keyword": "facebook"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "7",
        "_score": 0.009332742,
        "_source": {
          "id": 7,
          "keyword": "lending club"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "4",
        "_score": 0.009207102,
        "_source": {
          "id": 4,
          "keyword": "quora"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "5",
        "_score": 0.0014755741,
        "_source": {
          "id": 5,
          "keyword": "your story"
        }
      }
    ]
  }
}

So what is it doing behind the scenes?

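  1. It indexes each document as a whole keyword: the keyword tokenizer emits the entire string as a single token. I also added the asciifolding filter so that letters are normalized (e.g. é becomes e) and the lowercase filter so that search is case-insensitive. For example, Draft kings is indexed as draft kings.
  2. The search analyzer applies the same filters, but its tokenizer emits individual word tokens, and the shingle filter then builds shingles (combinations of adjacent tokens) on top of them. Those shingles are what match the keywords indexed in step 1.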
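The two steps above can be sketched in plain Python. This is an approximation under stated assumptions, not Elasticsearch's actual implementation: `shingle_analyze` is a hypothetical helper, a simple regex stands in for the standard tokenizer, ASCII folding is left out for brevity, and the shingle filter is assumed to run with what I understand to be its defaults (two-word shingles, with single words also emitted).

```python
import re


def shingle_analyze(text, max_shingle_size=2):
    """Approximate standard tokenizer + lowercase + shingle filter (illustrative only)."""
    # Crude stand-in for the standard tokenizer: runs of word characters.
    words = re.findall(r"\w+", text.lower())
    tokens = set(words)  # single words (unigram output assumed to be enabled)
    for size in range(2, max_shingle_size + 1):
        for start in range(len(words) - size + 1):
            tokens.add(" ".join(words[start:start + size]))
    return tokens


# Index side: each keyword is stored as one lowercased token.
indexed = {kw.lower() for kw in [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]}

text = "I saw the news of lending club on facebook, your story and quora"
print(sorted(indexed & shingle_analyze(text)))
# ['facebook', 'lending club', 'quora', 'your story']

# Three-word keywords need three-word shingles.
print("the honest company" in shingle_analyze("I like the honest company"))     # False
print("the honest company" in shingle_analyze("I like the honest company", 3))  # True
```

The last two lines suggest that, if the defaults are as assumed, a three-word keyword such as "the honest company" would only match once the shingle filter is configured with a larger `max_shingle_size`.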



Original URL: http://outofmemory.cn/zaji/4908745.html
