Term-based search
Contents: Term-based search (Retrieval, Tokenization) · Full-text queries · Relevance score _score · FunctionScoreQuery (Test data, FieldValueFactor)
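Term-based search
Term-based search uses the term query. A term query does not analyze the search value; it looks up that exact term in the inverted index. A text field, however, is analyzed when it is indexed: with the default standard analyzer the value is tokenized and lowercased, so the original string usually no longer exists as a single term. To match the original value exactly, query the field's keyword sub-field (created by the default dynamic mapping) instead.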
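To make the mismatch concrete, here is a minimal Python sketch of what ends up in the index for a text field and for its keyword sub-field. It is not how Elasticsearch is implemented: `analyze_text`, `index_document` and `term_matches` are hypothetical helpers, and the analyzer is assumed to simply split on non-alphanumeric characters and lowercase, which only approximates the standard analyzer. The sample values are the ones from the example in this article.

```python
import re


def analyze_text(value):
    """Rough stand-in for the standard analyzer (assumption):
    split on non-alphanumeric characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", value)]


def index_document(doc):
    """Return the set of terms stored for each field and its .keyword sub-field."""
    indexed = {}
    for field, value in doc.items():
        indexed[field] = set(analyze_text(value))  # analyzed text field
        indexed[field + ".keyword"] = {value}      # untouched original value
    return indexed


def term_matches(indexed, field, value):
    """A term query compares the search value to the stored terms as-is."""
    return value in indexed.get(field, set())


doc = {"name": "Szc", "hometown": "China-Henan-Anyang"}
indexed = index_document(doc)

print(term_matches(indexed, "name", "Szc"))          # text field stores "szc"
print(term_matches(indexed, "name.keyword", "Szc"))  # keyword keeps "Szc"
print(sorted(indexed["hometown"]))                   # tokens of the text field
```

Under these assumptions the term lookup on `name` fails because only the lowercased token is stored, while the lookup on `name.keyword` succeeds.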
Retrieval
The following example finds nothing:

PUT term_test/_doc/1
{
  "name": "Szc",
  "hometown": "China-Henan-Anyang"
}

GET term_test/_search
{
  "query": {
    "term": {
      "name": {
        "value": "Szc"
      }
    }
  }
}
The output is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 0, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  }
}
Appending .keyword to name makes the query work:

GET term_test/_search
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "Szc"
      }
    }
  }
}
The output is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : { "name" : "Szc", "hometown" : "China-Henan-Anyang" }
      }
    ]
  }
}

Tokenization
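Tokenization causes the same problem. The hometown value is split into several lowercase terms at index time, so the full value has to be matched against hometown.keyword:

GET term_test/_search
{
  "query": {
    "term": {
      "hometown.keyword": {
        "value": "China-Henan-Anyang"
      }
    }
  }
}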
This returns the expected result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : { "name" : "Szc", "hometown" : "China-Henan-Anyang" }
      }
    ]
  }
}

Full-text queries
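A full-text query analyzes the query string into tokens, runs a term query for each token, and then merges the results. Examples are match and match_phrase; for details, see the sections on logical operators and match_phrase in the article "ElasticSearch7学习之搜索API" (Learning ElasticSearch 7: the Search API).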
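The analyze, look up, and merge steps can be sketched in Python over a toy inverted index. This is an illustration only: `analyze`, `build_inverted_index`, `term_lookup` and `match` are hypothetical helpers, the analyzer is a simplified stand-in (split on non-alphanumerics, lowercase), and document "2" is made-up sample data. Scoring is left out, and so are token positions, which match_phrase would additionally need.

```python
import re
from collections import defaultdict


def analyze(text):
    """Simplified analyzer (assumption): split on non-alphanumerics, lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs, field):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    """Exact lookup of a single term, like a term query without scoring."""
    return set(index.get(term, set()))


def match(index, query_text, operator="or"):
    """Analyze the query, look up every token, then merge the hit sets."""
    per_token = [term_lookup(index, token) for token in analyze(query_text)]
    if not per_token:
        return set()
    if operator == "and":
        return set.intersection(*per_token)
    return set.union(*per_token)


docs = {
    "1": {"hometown": "China-Henan-Anyang"},
    "2": {"hometown": "China-Hebei-Handan"},  # hypothetical second document
}
index = build_inverted_index(docs, "hometown")

print(sorted(match(index, "Henan Anyang")))
print(sorted(match(index, "China Anyang")))
print(sorted(match(index, "China Anyang", operator="and")))
```

With the default "or" merge, a document only needs to contain one of the tokens; with "and" it must contain all of them.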
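Relevance score _score
Relevance is now scored with BM25. Compared with classic TF-IDF, the BM25 score approaches a fixed upper limit as the term frequency (TF) grows without bound, instead of continuing to increase.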
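The saturation effect can be checked numerically. The sketch below compares the BM25 term-frequency component, in the form freq / (freq + k1 * (1 - b + b * dl / avgdl)), with the square-root term frequency used by Lucene's classic TF-IDF similarity. The function names are hypothetical, k1 = 1.2 and b = 0.75 are the usual defaults, and the field length is assumed equal to the average length to keep the comparison simple.

```python
import math


def bm25_tf(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    """BM25 term-frequency component; bounded above by 1."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def classic_tf(freq):
    """Term-frequency component of classic TF-IDF (square root of freq)."""
    return math.sqrt(freq)


# Print both components for growing term frequencies.
for freq in (1, 10, 100, 10000):
    print(freq, round(bm25_tf(freq), 4), round(classic_tf(freq), 4))
```

The BM25 column creeps toward 1 and never reaches it, while the classic column keeps growing, which is why a term repeated many times cannot dominate a BM25 score.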
Adding the explain parameter shows how the score was calculated:

GET term_test/_search
{
  "explain": true,
  "query": {
    "term": {
      "hometown.keyword": {
        "value": "China-Henan-Anyang"
      }
    }
  }
}
The output is shown below; the _explanation field describes each parameter and its value:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_shard" : "[term_test][0]",
        "_node" : "JhcR-XkxT2uY3UubTpfZAQ",
        "_index" : "term_test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : { "name" : "Szc", "hometown" : "China-Henan-Anyang" },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(hometown.keyword:China-Henan-Anyang in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(freq=1.0), computed as boost * idf * tf from:",
              "details" : [
                { "value" : 2.2, "description" : "boost", "details" : [ ] },
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                  "details" : [
                    { "value" : 1, "description" : "n, number of documents containing term", "details" : [ ] },
                    { "value" : 1, "description" : "N, total number of documents with field", "details" : [ ] }
                  ]
                },
                {
                  "value" : 0.45454544,
                  "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                  "details" : [
                    { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                    { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                    { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                    { "value" : 1.0, "description" : "dl, length of field", "details" : [ ] },
                    { "value" : 1.0, "description" : "avgdl, average length of field", "details" : [ ] }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
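The numbers in the explanation can be reproduced by hand. The following sketch plugs the values reported above into the two formulas quoted in the description strings; `bm25_score` is a hypothetical helper name, and `log` is taken to be the natural logarithm, an assumption that is consistent with the reported idf.

```python
import math


def bm25_score(boost, n, N, freq, k1, b, dl, avgdl):
    """Recompute score = boost * idf * tf from the explain output's inputs."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))       # natural logarithm
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf, idf, tf


# Values copied from the _explanation details above.
score, idf, tf = bm25_score(
    boost=2.2, n=1, N=1, freq=1.0, k1=1.2, b=0.75, dl=1.0, avgdl=1.0
)
print(round(idf, 7), round(tf, 7), round(score, 7))
```

The recomputed idf, tf and score should agree with the reported 0.2876821, 0.45454544 and 0.2876821 to within floating-point rounding. With a single document of average length, boost * tf is 2.2 / 2.2 = 1, so the score equals the idf.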
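We can use a boosting query to adjust the score calculation. It takes a positive query, a negative query, and a negative boost factor (negative_boost): documents that match positive are returned, and those that also match negative have their score multiplied by negative_boost. A query object can hold only one top-level clause, so the term clause goes inside positive; FIELD and VALUE are placeholders.

GET term_test/_search
{
  "explain": true,
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "hometown.keyword": {
            "value": "China-Henan-Anyang"
          }
        }
      },
      "negative": {
        "term": {
          "FIELD": {
            "value": "VALUE"
          }
        }
      },
      "negative_boost": 0.2
    }
  }
}

FunctionScoreQuery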
Purpose: after the query has run, every matching document can be re-scored, and the results are sorted by the new score.
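Several built-in scoring functions are available:
- Weight: assigns each document a simple weight that is not normalized.
- FieldValueFactor: uses the value of a field as a factor in the score.
- RandomScore: gives each user a different random scoring result.
- Decay functions: take a field value as the reference; the closer a document's value is to a given point, the higher its score.
- ScriptScore: a custom script.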
Test data

PUT /blogs/_doc/1
{
  "title": "blog 1",
  "content": "content1",
  "votes": 10000000
}

PUT /blogs/_doc/2
{
  "title": "blog 2",
  "content": "content2",
  "votes": 0
}

PUT /blogs/_doc/3
{
  "title": "blog 3",
  "content": "content3",
  "votes": 1000
}

FieldValueFactor
FieldValueFactor is used as follows. The new score is simply the original score multiplied by the value of the specified field:

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes"
      }
    }
  }
}
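modifier and factor can be used to smooth the curve. This also shows that the original score is not combined with the raw field value: the field value is first transformed (boosted), here as log10(1 + factor × votes), and the result is then combined with the original score.

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        "factor": 0.1
      }
    }
  }
}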
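To see what the modifier does to the three test documents, the sketch below applies the transformation modifier(factor × votes) with and without log1p. It assumes that log1p means the common (base 10) logarithm of one plus the value, and it uses a made-up query score of 1.0 for every document, since the real scores depend on the index. `field_value_factor` and `rescored` are hypothetical helper names.

```python
import math

# Only the two modifiers needed here; Elasticsearch offers more.
MODIFIERS = {
    "none": lambda x: x,
    "log1p": lambda x: math.log10(x + 1),
}


def field_value_factor(value, factor=1.0, modifier="none"):
    """Transform the field value: modifier(factor * value)."""
    return MODIFIERS[modifier](factor * value)


def rescored(query_score, value, **options):
    """Default combination: multiply the query score by the function value."""
    return query_score * field_value_factor(value, **options)


votes_by_blog = {"blog 1": 10000000, "blog 2": 0, "blog 3": 1000}
query_score = 1.0  # hypothetical; real scores come from the multi_match query

for title, votes in votes_by_blog.items():
    raw = rescored(query_score, votes)
    smooth = rescored(query_score, votes, factor=0.1, modifier="log1p")
    print(title, raw, round(smooth, 4))
```

Without a modifier the vote counts swamp the text relevance entirely; with log1p the gap between 1000 and 10000000 votes shrinks to a small multiple. In both cases a document with zero votes gets a function value of 0, which zeroes its score under multiplication.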
boost_mode changes how the original score and the boosted field value are combined; the default is multiplication. The max_boost field sets an upper limit on the boosted field value.

POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "blog",
          "fields": ["title", "content"]
        }
      },
      "field_value_factor": {
        "field": "votes",
        "modifier": "log1p",
        "factor": 0.1
      },
      "boost_mode": "max",
      "max_boost": 5
    }
  }
}
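The interplay of max_boost and boost_mode can be sketched as two steps: cap the function value, then combine it with the query score. In the code below `function_score` is a hypothetical helper, the function value is assumed to be log10(1 + factor × votes) as in the request above, and the query score of 0.13 is an invented number used only for illustration.

```python
import math

# How the query score (q) and the capped function value (f) are combined.
BOOST_MODES = {
    "multiply": lambda q, f: q * f,
    "replace": lambda q, f: f,
    "sum": lambda q, f: q + f,
    "avg": lambda q, f: (q + f) / 2,
    "max": max,
    "min": min,
}


def function_score(query_score, votes, factor=0.1,
                   boost_mode="multiply", max_boost=float("inf")):
    """Cap the log1p function value at max_boost, then apply boost_mode."""
    function_value = math.log10(1 + factor * votes)
    function_value = min(function_value, max_boost)
    return BOOST_MODES[boost_mode](query_score, function_value)


query_score = 0.13  # hypothetical query score
for votes in (10000000, 0, 1000):
    final = function_score(query_score, votes, boost_mode="max", max_boost=5)
    print(votes, round(final, 4))
```

With these settings the document with 10000000 votes is capped at 5, and the document with zero votes keeps its query score because max picks the larger of the two values instead of multiplying by zero.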
The consistent random function is used as follows; supplying a seed is enough:

POST blogs/_search
{
  "query": {
    "function_score": {
      "random_score": {
        "seed": 314159265361
      }
    }
  }
}
Different seeds can produce different orderings.
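The idea behind a seeded random score can be illustrated with a hash: each document's score is derived deterministically from the seed and the document, so the same seed always yields the same ordering. The sketch below is not Elasticsearch's actual algorithm; `random_score` and `rank` are hypothetical helpers, and SHA-256 over the seed and document id is an arbitrary choice made for the illustration.

```python
import hashlib


def random_score(seed, doc_id):
    """Deterministic pseudo-random score in [0, 1) from seed and document id."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def rank(doc_ids, seed):
    """Order documents by their seeded score, highest first."""
    return sorted(doc_ids, key=lambda doc_id: random_score(seed, doc_id),
                  reverse=True)


doc_ids = ["1", "2", "3"]
first = rank(doc_ids, seed=314159265361)
second = rank(doc_ids, seed=314159265361)
print(first, first == second)           # same seed, same order
print(rank(doc_ids, seed="user-42"))    # hypothetical per-user seed
```

Using something stable per user as the seed, such as a user id, gives each user an ordering that stays consistent across requests.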