ElasticSearch 7 Study Notes: Term-Level and Full-Text Search


Contents

Term-level search
  Searching
  Tokenization
Full-text queries
Relevance score (_score)
FunctionScoreQuery
  Test data
  FieldValueFactor

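Term-level search

Term-level search corresponds to the term query. A term query does not analyze its input: the value is looked up in the index exactly as written. A text field, however, is analyzed at index time (the default standard analyzer tokenizes and lowercases the value), so a search value that contains uppercase letters or spans several tokens will not match. To match the original value exactly, query the field's keyword sub-field instead (for example name.keyword, which dynamic mapping creates for string fields).

Searching

The following example finds nothing: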

PUT term_test/_doc/1
{
  "name": "Szc",
  "hometown": "China-Henan-Anyang"
}

GET term_test/_search
{
  "query": {
    "term": {
      "name": {
        "value": "Szc"
      }
    }
  }
}

The output is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Appending .keyword to name makes the query match:

GET term_test/_search
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "Szc"
      }
    }
  }
}

The output is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "Szc",
          "hometown" : "China-Henan-Anyang"
        }
      }
    ]
  }
}
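
Why the first query misses and the second one hits can be sketched with a toy inverted index. This is only an illustration: standard_like_analyze is a hypothetical, rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), not Elasticsearch's actual implementation.

```python
import re


def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer (an assumption, not the
    real thing): split on non-alphanumeric characters, then lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def keyword_analyze(text):
    """A keyword field stores the whole value as one unmodified term."""
    return [text]


doc = {"name": "Szc", "hometown": "China-Henan-Anyang"}

# Terms that end up in the toy inverted index for each field of document 1.
index = {
    "name": set(standard_like_analyze(doc["name"])),
    "name.keyword": set(keyword_analyze(doc["name"])),
    "hometown": set(standard_like_analyze(doc["hometown"])),
    "hometown.keyword": set(keyword_analyze(doc["hometown"])),
}


def term_query(field, value):
    """A term query looks the value up as-is, without analyzing it."""
    return value in index[field]


print(term_query("name", "Szc"))          # False: the index holds "szc"
print(term_query("name", "szc"))          # True
print(term_query("name.keyword", "Szc"))  # True: the keyword term is untouched
print(sorted(index["hometown"]))          # ['anyang', 'china', 'henan']
```

The same lookup against hometown with the full string China-Henan-Anyang fails for the same reason: the text field only holds the three lowercase tokens.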

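Tokenization

The same applies to tokenization: the standard analyzer splits China-Henan-Anyang into the tokens china, henan and anyang, so a term query for the whole string only matches against hometown.keyword: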

GET term_test/_search
{
  "query": {
    "term": {
      "hometown.keyword": {
        "value": "China-Henan-Anyang"
      }
    }
  }
}

This returns the expected result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "Szc",
          "hometown" : "China-Henan-Anyang"
        }
      }
    ]
  }
}

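Full-text queries

A full-text query analyzes the search input, runs a term-level lookup for each resulting token, and merges the results. Examples are match and match_phrase; for details, see the sections on logical operators and match_phrase in the companion note ElasticSearch7学习之搜索API.

Relevance score (_score)

The relevance algorithm is now BM25. Compared with classic TF-IDF, where the score keeps growing with term frequency, the BM25 score levels off toward a fixed value as TF grows without bound.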
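
The saturation can be seen by comparing the two term-frequency factors numerically. The sketch below assumes the BM25 tf formula and the defaults k1 = 1.2 and b = 0.75 as reported by explain later in this post, and sqrt(freq) for the classic TF-IDF factor (the form used by Lucene's ClassicSimilarity); the function names are only illustrative.

```python
K1 = 1.2   # term saturation parameter (default)
B = 0.75   # length normalization parameter (default)


def bm25_tf(freq, dl=1.0, avgdl=1.0, k1=K1, b=B):
    """BM25 term-frequency factor: bounded above by 1."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def classic_tf(freq):
    """Classic TF-IDF term-frequency factor: grows without bound."""
    return freq ** 0.5


for freq in (1, 10, 100, 1000, 10000):
    print(freq, round(bm25_tf(freq), 4), round(classic_tf(freq), 4))
```

As freq grows, the BM25 factor approaches 1 while the classic factor keeps increasing, which is why repeating a term many times stops paying off under BM25.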

Add the explain parameter to see how the score is computed:

GET term_test/_search
{
  "explain": true,
  "query": {
    "term": {
      "hometown.keyword": {
        "value": "China-Henan-Anyang"
      }
    }
  }
}

The output is shown below; the _explanation field clearly explains each parameter and its value:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_shard" : "[term_test][0]",
        "_node" : "JhcR-XkxT2uY3UubTpfZAQ",
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "Szc",
          "hometown" : "China-Henan-Anyang"
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(hometown.keyword:China-Henan-Anyang in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                {
                  "value" : 2.2,
                  "description" : "boost",
                  "details" : [ ]
                },
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1,
                      "description" : "n, number of documents containing term",
                      "details" : [ ]
                    },
                    {
                      "value" : 1,
                      "description" : "N, total number of documents with field",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "freq, occurrences of term within document",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "k1, term saturation parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "b, length normalization parameter",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "dl, length of field",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "avgdl, average length of field",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
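
The numbers in _explanation can be checked by hand. The following sketch plugs the values reported above (boost, n, N, freq, k1, b, dl, avgdl) into the two formulas quoted in the description fields; the function names are only illustrative.

```python
import math


def bm25_idf(n, N):
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


boost = 2.2
idf = bm25_idf(n=1, N=1)                                    # about 0.2876821
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=1.0, avgdl=1.0)   # about 0.4545454
score = boost * idf * tf                                    # about 0.2876821

print(idf, tf, score)
```

Here boost * tf happens to equal 1 (2.2 * 1/2.2), so the final score coincides with the idf value.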

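The boosting query can be used to tune scoring. It has three parts: positive (the query that documents must match), negative (documents that also match this query are demoted), and negative_boost (a factor between 0 and 1 by which the scores of demoted documents are multiplied). A query object may hold only one top-level clause, so the term clause goes inside positive. FIELD and VALUE are placeholders to be filled in:

GET term_test/_search
{
  "explain": true,
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "hometown.keyword": {
            "value": "China-Henan-Anyang"
          }
        }
      },
      "negative": {
        "term": {
          "FIELD": {
            "value": "VALUE"
          }
        }
      },
      "negative_boost": 0.2
    }
  }
}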

FunctionScoreQuery

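Purpose: after the query finishes, rescore every matching document and sort by the newly computed score.

Several scoring functions are provided by default:

Weight: gives each document a simple weight that is not normalized.
FieldValueFactor: uses the value of a field as a factor in the score.
RandomScore: gives each user a different random scoring result.
Decay functions: take a field value as the reference; the closer a document's value is to a given origin, the higher its score.
ScriptScore: a custom script.

Test data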

PUT /blogs/_doc/1
{
  "title": "blog 1",
  "content": "content1",
  "votes": 10000000
}


PUT /blogs/_doc/2
{
  "title": "blog 2",
  "content": "content2",
  "votes": 0
}


PUT /blogs/_doc/3
{
  "title": "blog 3",
  "content": "content3",
  "votes": 1000
}

FieldValueFactor

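FieldValueFactor is used as follows. With the default settings, the new score is simply the original score multiplied by the value of the specified field: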

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes"
      }
    }
  }
}

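The modifier and factor parameters flatten the curve. This also shows that the original score is not combined with the raw field value directly: the field value is first transformed (multiplied by factor, then passed through modifier), and the result is then combined with the original score: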

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        "factor": 0.1
      }
    }
  }
}
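
To get a feel for how much log1p flattens the votes values, the transformation can be computed by hand. The sketch below assumes the documented behaviour of field_value_factor: the field value is multiplied by factor, and log1p then adds 1 and takes the common (base-10) logarithm. The helper name is only illustrative, and only the two modifiers used in this post are covered.

```python
import math


def field_value_factor(value, factor=1.0, modifier="none"):
    """Function value: the field value is scaled by factor, then modified."""
    scaled = factor * value
    if modifier == "none":
        return scaled
    if modifier == "log1p":
        return math.log10(1 + scaled)
    raise ValueError(f"modifier not covered by this sketch: {modifier}")


votes = {"blog 1": 10000000, "blog 2": 0, "blog 3": 1000}

for title, v in votes.items():
    raw = field_value_factor(v)
    smoothed = field_value_factor(v, factor=0.1, modifier="log1p")
    print(title, raw, round(smoothed, 4))
```

Without a modifier, blog 1 would be multiplied by ten million and blog 2 by zero; with log1p and factor 0.1 the multipliers shrink to roughly 6, 0 and 2.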

boost_mode changes how the original score and the transformed field value are combined (the default is multiply), and max_boost caps the transformed value:

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        "factor": 0.1
      },
      "boost_mode": "max",
      "max_boost": 5
    }
  }
}
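
How boost_mode and max_boost interact can be sketched as a small function. This is a simplified model under two assumptions: the function value is capped by max_boost first, and the capped value is then combined with the query score according to boost_mode. The query score of 0.13 is a made-up number used purely for illustration.

```python
def combine(query_score, function_score, boost_mode="multiply",
            max_boost=float("inf")):
    """Cap the function score at max_boost, then combine it with the
    query score according to boost_mode."""
    capped = min(function_score, max_boost)
    if boost_mode == "multiply":
        return query_score * capped
    if boost_mode == "replace":
        return capped
    if boost_mode == "sum":
        return query_score + capped
    if boost_mode == "avg":
        return (query_score + capped) / 2
    if boost_mode == "max":
        return max(query_score, capped)
    if boost_mode == "min":
        return min(query_score, capped)
    raise ValueError(f"unknown boost_mode: {boost_mode}")


query_score = 0.13  # hypothetical multi_match score, for illustration only

# Approximate log1p function values for votes = 10000000, 0 and 1000.
for function_score in (6.0, 0.0, 2.0043):
    print(combine(query_score, function_score, boost_mode="max", max_boost=5))
```

With boost_mode max and max_boost 5, the first document is capped at 5, the second falls back to its query score, and the third keeps its function value.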

The consistent random function is used as follows; just supply a seed:

POST blogs/_search
{
  "query": {
    "function_score": {
      "random_score": {
        "seed": 314159265361
      }
    }
  }
}

With a different seed, the resulting order may differ; with the same seed, it stays the same.
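
The idea behind seed-based consistency can be illustrated with a toy scorer. This is not how Elasticsearch computes random_score internally; toy_random_score is a hypothetical helper that simply derives a number in [0, 1) from the seed and the document id, so the same pair always yields the same score.

```python
import hashlib


def toy_random_score(seed, doc_id):
    """Deterministic pseudo-random score in [0, 1) for a (seed, doc_id) pair."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def ranking(seed, doc_ids):
    """Document ids sorted by their toy score, highest first."""
    return sorted(doc_ids, key=lambda d: toy_random_score(seed, d), reverse=True)


doc_ids = ["1", "2", "3"]
print(ranking(314159265361, doc_ids))  # stable across repeated calls
print(ranking(42, doc_ids))            # may or may not differ from the line above
```

Using one seed per user gives each user a stable ordering of their own, which matches the RandomScore description earlier in this post.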

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5705328.html
