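Often we want to search on several conditions at once. When looking up a film on Douban, for example, we might restrict the search by genre, year, and Douban rating. This is the multi-condition, multi-field matching scenario. Elasticsearch (ES) handles it with the bool query, which combines multiple query clauses into a single result and can be nested. It supports four types of clause: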
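must: the clause must match; it contributes to the score.
should: the clause may match; it contributes to the score when it does.
must_not: the clause must not match; it does not affect the score.
filter: the clause must match; it does not affect the score.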
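The example request from the official documentation, reproduced further below, uses all four clause types. The must and should clauses are scored for relevance, and their scores are added into the final score. The minimum_should_match parameter sets how many of the should clauses a document has to match before it counts as a hit. Suppose we are looking for a film and put four conditions in should: a rating above 9, the art-house genre, a 2021 release, and Leslie Cheung in the cast. Without minimum_should_match, a film that satisfies just one of the four can be returned (the default is 1 when the bool query has only should clauses, and 0 once a must or filter clause is present). With minimum_should_match=3, a film has to satisfy at least three of the four to be returned.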
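The counting rule behind minimum_should_match can be sketched in a few lines of Python. This is only an illustration of the idea, not how ES implements it: the helper `matches_should`, the lambda conditions, and the sample film record are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Movie = Dict[str, Any]
Clause = Callable[[Movie], bool]


def matches_should(movie: Movie, should: List[Clause], minimum_should_match: int) -> bool:
    """Return True when at least `minimum_should_match` should clauses match."""
    matched = sum(1 for clause in should if clause(movie))
    return matched >= minimum_should_match


# The four conditions from the film example (hypothetical field names).
should_clauses: List[Clause] = [
    lambda m: m["rating"] > 9,
    lambda m: m["genre"] == "art-house",
    lambda m: m["year"] == 2021,
    lambda m: "Leslie Cheung" in m["cast"],
]

# A made-up film that satisfies three of the four conditions (year differs).
movie: Movie = {
    "rating": 9.2,
    "genre": "art-house",
    "year": 1993,
    "cast": ["Leslie Cheung"],
}

print(matches_should(movie, should_clauses, 1))
print(matches_should(movie, should_clauses, 3))
print(matches_should(movie, should_clauses, 4))
```

The sample film passes with a threshold of 1 or 3 and fails with a threshold of 4, which is the behaviour described in the paragraph above.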
POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user.id" : "kimchy" }
      },
      "filter": {
        "term" : { "tags" : "production" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      # an array holding two term queries; if no must clause is specified, at least one of the should term queries has to match
      "should" : [
        { "term" : { "tags" : "env1" } },
        { "term" : { "tags" : "deployed" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

boosting (Boosting query)
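Suppose we have an index holding the three documents that appear in the results below. The first two describe electronic products from Apple Inc., and the third is an encyclopedia entry about the fruit. The following query matches both the Apple devices and the fruit.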
POST /baike/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {"title": "Apple"}
      }
    }
  }
}

"hits" : {
  "total" : { "value" : 3, "relation" : "eq" },
  "max_score" : 0.1546153,
  "hits" : [
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.1546153,
      "_source" : { "title" : "Apple Pad" }
    },
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "2",
      "_score" : 0.1546153,
      "_source" : { "title" : "Apple Mac" }
    },
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "3",
      "_score" : 0.1546153,
      "_source" : { "title" : "Apple Pie and Apple Fruit" }
    }
  ]
}
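But perhaps our users care about the electronics, not the fruit. How do we match more precisely? One option is to modify the previous query and use must_not to exclude documents whose title contains pie or fruit, so that only Apple Pad and Apple Mac are returned. That is rather absolute, though. Many people really are looking for an Apple phone, but there will always be someone who wants to know which apples taste good. Is there a middle ground?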
POST /baike/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {"title": "Apple"}
      },
      "must_not": {
        "match": {"title": "Pie"}
      }
    }
  }
}
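ES provides the Boosting query for this (boosting is the present participle of boost, meaning to raise or push up; I read it here as adjusting the _score value). It still matches both the Apple devices that users care about most and the edible apple, and it lets us make sure the devices appear at the top of the search results. The request itself follows the parameter notes.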
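A brief look at its parameters:
positive: literally "positive"; specifies the documents we care about most and want scored high and shown first.
negative: literally "negative"; specifies the documents we care less about but still want to be matched.
negative_boost: a float between 0.0 and 1.0 by which the score of documents matching the negative query is multiplied, which lowers their score.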
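The effect of negative_boost on ranking can be sketched as follows. This is a simplified model under stated assumptions, not ES code: the starting scores are the three equal _score values from the earlier result, the helper `boosting_score` is hypothetical, and the negative match is approximated with a lowercase word check on the title.

```python
def boosting_score(positive_score: float, matches_negative: bool, negative_boost: float) -> float:
    """Keep the positive score, scaled by negative_boost if the negative query also matches."""
    return positive_score * negative_boost if matches_negative else positive_score


# _score values returned by the plain match on "Apple" earlier in the article.
positive_scores = {
    "Apple Pad": 0.1546153,
    "Apple Mac": 0.1546153,
    "Apple Pie and Apple Fruit": 0.1546153,
}

negative_term = "fruit"   # the term used in the negative query
negative_boost = 0.5

rescored = {
    title: boosting_score(score, negative_term in title.lower().split(), negative_boost)
    for title, score in positive_scores.items()
}

# Highest score first; Python's sort is stable, so ties keep their original order.
ranked = sorted(rescored, key=rescored.get, reverse=True)

for title in ranked:
    print(title, rescored[title])
```

The two device documents keep their original score, while the fruit document is halved and drops to the end of the list without being excluded.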
POST /baike/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "title": "Apple" }
      },
      "negative": {
        "match": { "title": "fruit" }
      },
      # together with the negative query above, this lowers the score of the document containing fruit so that it is shown last
      "negative_boost": 0.5
    }
  }
}

constant_score (Constant score query)
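As we know, filter queries are not scored, and ES automatically caches some filter queries to improve performance. Sometimes we do need a specific score to be returned, and constant_score can do that: it wraps a filter query and uses the boost parameter to return a constant score for every matching document (hence the name constant).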
POST /baike/_search
{
  "query": {
    "constant_score": {
      "filter": {"term": {"title.keyword": "Quick brown rabbits"}},
      "boost": 1.2
    }
  }
}

"hits" : {
  "total" : { "value" : 1, "relation" : "eq" },
  "max_score" : 1.2,
  "hits" : [
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 1.2, # with the query above, the returned score equals the boost we specified
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    }
  ]
}

dis_max (Disjunction max query)
We covered the bool query above; let's revisit it here. First, two test documents that I took from the Chinese ES site:
POST baike/_doc/4
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}

POST baike/_doc/5
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
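Suppose we want to search title or body for content related to brown fox. Looking at the data, the document with ID 5 should be the more relevant one, because its body contains the complete phrase brown fox. We would naturally like it to get the higher score and appear nearer the top. Let's try a bool query and see whether it behaves as expected. Here is the result: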
POST /baike/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ]
    }
  }
}

# Result
"hits" : {
  "total" : { "value" : 2, "relation" : "eq" },
  "max_score" : 1.5974034,
  "hits" : [
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 1.5974034,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    },
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      }
    }
  ]
}
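In practice, the document with ID 5 did not get the higher score. Why? To answer that, note that just as MySQL can show the execution plan of an SQL statement, ES can show how a DSL query is executed, including how the score is computed, through the explain keyword. Let's take a look: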
POST /baike/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ]
    }
  },
  "explain": true
}

# Result
"hits" : {
  "total" : { "value" : 2, "relation" : "eq" },
  "max_score" : 1.5974034,
  "hits" : [
    {
      "_shard" : "[baike][0]",
      "_node" : "aPt8G7vHTzOJU_L2FdLBpA",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 1.5974034,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      },
      "_explanation" : {
        "value" : 1.5974034, # roughly the sum of the two "value" entries under "details" below (1.3862942 + 0.21110919)
        "description" : "sum of:", # !!! summed
        "details" : [
          {
            "value" : 1.3862942,
            "description" : "weight(title:brown in 0) [PerFieldSimilarity], result of:", # title contains the keyword brown, scored once
            "details" : [
              {
                "value" : 1.3862942,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [] # scoring details omitted because they are too long
              }
            ]
          },
          {
            "value" : 0.21110919,
            "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:", # body contains the keyword brown, scored once
            "details" : [
              {
                "value" : 0.21110919,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          }
        ]
      }
    },
    {
      "_shard" : "[baike][0]",
      "_node" : "aPt8G7vHTzOJU_L2FdLBpA",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      },
      "_explanation" : {
        "value" : 0.77041256,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.160443,
            "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:", # body contains the keyword brown, scored once
            "details" : [
              {
                "value" : 0.160443,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          },
          {
            "value" : 0.60996956,
            "description" : "weight(body:fox in 0) [PerFieldSimilarity], result of:", # body contains the keyword fox, scored once
            "details" : [
              {
                "value" : 0.60996956,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : []
              }
            ]
          }
        ]
      }
    }
  ]
}
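The explain output shows that the bool query scores the two subqueries in should separately and then adds them up into a total. Document 4 contains the keyword brown in both title and body, so it collects a score from each field. Document 5 has none of the query terms in its title, so only the body subquery contributes; its total comes out lower and it ends up ranked second.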
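The addition can be checked by hand with the per-term values from the explain output above. The snippet below is only a worked sum; the helper name `bool_should_score` is made up, and the totals can differ from the reported _score in the last digit because ES reports single-precision values.

```python
# Per-term scores copied from the explain output of the bool query.
doc4_terms = {"title:brown": 1.3862942, "body:brown": 0.21110919}
doc5_terms = {"body:brown": 0.160443, "body:fox": 0.60996956}


def bool_should_score(term_scores: dict) -> float:
    """A bool/should query adds up the scores of everything that matched."""
    return sum(term_scores.values())


doc4_total = bool_should_score(doc4_terms)
doc5_total = bool_should_score(doc5_terms)

print("doc 4:", doc4_total)   # compare with the reported _score 1.5974034
print("doc 5:", doc5_total)   # compare with the reported _score 0.77041256
print("doc 4 ranks first:", doc4_total > doc5_total)
```

Document 4 wins purely because two fields each contribute a score for brown, even though neither field contains fox.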
Clearly this is not the outcome we wanted. Is there a way around it? ES offers a solution called dis_max. Running the search again with a dis_max query over the same two match subqueries, the document with ID 5 now clearly gets the higher score:

"hits" : {
  "total" : { "value" : 2, "relation" : "eq" },
  "max_score" : 0.77041256,
  "hits" : [
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      }
    },
    {
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 0.6931471,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    }
  ]
}
Looking at explain again, this time ES does not simply add the scores of the two subqueries. It takes the larger of the two as the final score.

"hits" : {
  "total" : { "value" : 2, "relation" : "eq" },
  "max_score" : 0.77041256,
  "hits" : [
    {
      "_shard" : "[baike][0]",
      "_node" : "tc1MvVwdRcO-2A5L6j_l0Q",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "5",
      "_score" : 0.77041256,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      },
      "_explanation" : {
        "value" : 0.77041256,
        "description" : "max of:", # !!! maximum taken
        "details" : [
          {
            "value" : 0.77041256,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.160443,
                "description" : "weight(body:brown in 1) [PerFieldSimilarity], result of:",
                "details" : []
              },
              {
                "value" : 0.60996956,
                "description" : "weight(body:fox in 1) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          }
        ]
      }
    },
    {
      "_shard" : "[baike][0]",
      "_node" : "tc1MvVwdRcO-2A5L6j_l0Q",
      "_index" : "baike",
      "_type" : "_doc",
      "_id" : "4",
      "_score" : 0.6931471,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      },
      "_explanation" : {
        "value" : 0.6931471,
        "description" : "max of:",
        "details" : [
          {
            "value" : 0.6931471,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.6931471,
                "description" : "weight(title:brown in 0) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          },
          {
            "value" : 0.21110919,
            "description" : "sum of:",
            "details" : [
              {
                "value" : 0.21110919,
                "description" : "weight(body:brown in 0) [PerFieldSimilarity], result of:",
                "details" : []
              }
            ]
          }
        ]
      }
    }
  ]
}
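Sometimes, though, taking only the highest score and ignoring every other query clause is not reasonable either. Outstanding individuals are always rare, and the strength of ordinary people should not be underestimated, so they deserve consideration too. For this ES provides the tie_breaker parameter, a float between 0.0 and 1.0 that defaults to 0.0. When it is set, scoring takes the highest subquery score, multiplies the score of each other matching subquery by tie_breaker, and adds those products to the highest score. In other words, the best match and the remaining subqueries each get a weight: the contribution of the few standouts counts, and so does that of the masses. After all this analysis, why the name dis_max? dis is short for Disjunction, a logical OR, so a document matches when any one of the subqueries matches; max means maximum. The query splits the combined search into several subqueries and takes the highest-scoring one as the final score. It is an effective way to pick the best match.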
POST /baike/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "Brown fox"}},
        {"match": {"body": "Brown fox"}}
      ],
      "tie_breaker": 0.7
    }
  }
}
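To get a feel for what tie_breaker does, the scoring rule can be sketched in Python using the per-subquery scores from the dis_max explain output earlier. This is a model of the rule rather than ES itself: the helper `dis_max_score` is hypothetical, and a live cluster may return slightly different numbers.

```python
from typing import List, Optional


def dis_max_score(subquery_scores: List[Optional[float]], tie_breaker: float = 0.0) -> float:
    """Best subquery score plus tie_breaker times the other matching subquery scores."""
    matching = [s for s in subquery_scores if s is not None]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# [title subquery, body subquery]; None means the subquery did not match.
doc4_scores = [0.6931471, 0.21110919]
doc5_scores = [None, 0.160443 + 0.60996956]

for tie_breaker in (0.0, 0.7):
    print(
        "tie_breaker", tie_breaker,
        "doc 4:", dis_max_score(doc4_scores, tie_breaker),
        "doc 5:", dis_max_score(doc5_scores, tie_breaker),
    )
```

With tie_breaker at 0.0 document 5 leads, as in the result shown earlier. With 0.7, document 4 gains 0.7 × 0.21110919 from its second matching field and moves back ahead of document 5 under these numbers, so a value this high gives up much of the best-field behaviour; a smaller value keeps document 5 on top while still rewarding multi-field matches.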