ElasticStack learning (X): In-depth QueryFiltering ElasticSearch search, the multi / single string of multi-field queries

A composite query

  1, in ElasticSearch, there are two different Filter Query and Context. Query Context conducted correlation count points, Filter Context points need not be calculated, while using the Cache, better performance.

  2, bool Query: A Boolean query is a combination of one or more query clauses, a total of four types of clause, where two kinds of influence count points, 2 points does not affect the count.

   Boolean queries can also involve correlation count points, because the more matches clause, the higher the correlation count points. Boolean query for each query clauses calculated branch count is incorporated into the total points in the correlation calculation.

   Subquery may appear in any order, and can nest multiple subqueries.

   

   Boolean query count points of the process:

    1) Should the query statement in the query;

    2) Rating query results are summed;

    3) multiplied by the total number of matches statement;

    4) divided by the total number of statements;

  3, complex query applications

    1)must_not与filter

  

    2)should

  

    3) bool nested

  

    4) For the same level of competition should fields having the same weight, bool if nested query can be changed to affect the calculation points.

   

    5) control field boost weight to influence the results returned by the query.

  

  

    6) meet the requirements of the relevant documents high front or exclude irrelevant documents conditions, Boosting Query improved Precision, but also enhance the Recall.

  

  

 Second, multi-field single query string

  1) background example shows:

  

  As can be seen from the example, the title and body of the two fields match "brown fox", since there is only the second brown fox in the body, and are matched in the first title, in the body. So, the first comprehensive count points points higher than the second count bars.

  2)Disjunction Max Query

  For example, the title and body compete with each other in the query brown fox, you should search out the second piece of information. Therefore, for the search strategy should not be a simple sum of the scores, but should find the best match of the word score.

  Disjunction Max Query可以按最匹配字段评分进行返回。

   

   从上图可以看出,通过Disjunction Max Query进行查询,获取了最合适的匹配结果。

  3)Tie Breaker

  

  通过对quice pets进行查询,会发现两条文档的评分是一样的,这是因为quick pets做为查询Term存在,在title或body中存在,两者的评分是一样的。为了获取最佳匹配,可以使用Tie Breaker。如下图所示:

  

  可以看出,文档2排在了文档1的前面,原因是文档2的title和body,分别存有quick或pets,而文档1中只有title存在有一个quick。因此文档2的评分比文档1的高。

  Tie Breaker的作用:Tie Breaker是一个介于0-1之间的浮点数,1代表使用最佳匹配;0代表所有语句同等重要。

    获得最佳匹配语句的评分;

    将其他匹配语句的评分与Tie Breaker相乘;

    对以上评分求和并规范化;

  4)Multi Match

  当输入单个字符串进行查询时,通常会遇到三种情形:

    a)最佳字段

     当字段之间相互竞争,又相互关联时,评分来自于最匹配字段,比如上述的title和body字段。如当搜索“brown fox”时,此时该词组比两个独立的单词更有意义,因此文档在相同字段中包含的词最多越好,评分也来自于最匹配字段。

    

    b)多数字段

    为了对相关度进行微调,常用的技术就是将相同的数据索引到不同的字段,以匹配更多的文档。具体操作就是:

      在主字段抽取词干,加入同义词、变音词、口语词,以匹配更多的字段;

      相同的文本,加入子字段,以提供更加精确的匹配;

      其他字段作为匹配文档提高相关度的信号,匹配字段最多越好;

    

    通过上图可以发现,文档2更符合quicking brown的搜索条件,但是它排在了第2位,原因就是两个字段采用了英文分词器,而查询的实际上是quick、brown两个Term,在两个文档中的title或body字段中,存在于这两个词。又因为第一个文档的相应字段总词数比第二个文档的相应字段总词数少,所以文档1评分会较高。

    

    对上图分析发现,用广度匹配字段body,以包括尽可能多的文档,提升了召回率,同时对body字段增加了子字段,将std作为信号将相关度更高的文档置于结果顶部。

    每个字段对于最终评分的贡献可以通过自定义值boost来控制。如下图所示:

    未设置boost值的情况

    

    将body设置boost的情况

    

    c)混合字段

    对于某些实体,例如人名、地址、图书信息。需要在多个字段中确定信息,单个字段只能作为整体的一部分。希望在列出的字段中找到尽可能多的词。

       比如对于街道+门牌号存在于指定搜索字段的文档记录,如果想用most_fileds进行搜索是无法直接进行搜索的,对于operator=and也不能使用,因为它不适用于跨字段场景中。

    而之前用于字段中的copy_to,虽然可以解决此类问题,但是需要额外的存储空间,因此也不是最优的解决方式。

    

    从上图中可以看出,cross_fields可以配合operator=and进行跨字段的查询匹配,同时与copy_to相比,它还可以在搜索时为单个字段提升权重,如下图所示:

     

    

 

  大家可关注我的公众号 

    

  知识学习来源:阮一鸣:《Elasticsearch核心技术与实战》 

Guess you like

Origin www.cnblogs.com/supersnowyao/p/11206672.html