Hands-on learning about Multi Match Query in Elasticsearch

In Elasticsearch full-text search, the Multi Match Query is used frequently because it can match a query against multiple fields. Elasticsearch supports 5 types of Multi Match; let's look at the differences between them in depth.

5 types of Multi Match Query

Quoted directly from the official documentation:

best_fields: (default) Finds documents which match any field, but uses the _score from the best field.
most_fields: Finds documents which match any field and combines the _score from each field.
cross_fields: Treats fields with the same analyzer as though they were one big field. Looks for each word in any field.
phrase: Runs a match_phrase query on each field and combines the _score from each field.
phrase_prefix: Runs a match_phrase_prefix query on each field and combines the _score from each field.

Here we consider only the first three; the last two can be studied separately and are not covered.

Create test indexes and preset test data

Create gino_product index

PUT /gino_product
{
  "mappings": {
    "product": {
      "properties": {
        "productName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [
            "bigSearchField"
          ]
        },
        "brandName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [
            "bigSearchField"
          ],
          "fields": {
            "brandName_pinyin": {
              "type": "string",
              "analyzer": "pinyin_analyzer",
              "search_analyzer": "standard"
            },
            "brandName_keyword": {
              "type": "string",
              "analyzer": "keyword",
              "search_analyzer": "standard"
            }
          }
        },
        "sortName": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [
            "bigSearchField"
          ],
          "fields": {
            "sortName_pinyin": {
              "type": "string",
              "analyzer": "pinyin_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "productKeyword": {
          "type": "string",
          "analyzer": "fulltext_analyzer",
          "copy_to": [
            "bigSearchField"
          ]
        },
        "bigSearchField": {
          "type": "string",
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "tokenizer": {
        "simple_pinyin": {
          "type": "pinyin",
          "first_letter": "none"
        }
      },
      "analyzer": {
        "fulltext_analyzer": {
          "type": "ik",
          "use_smart": true
        },
        "pinyin_analyzer": {
          "type": "custom",
          "tokenizer": "simple_pinyin",
          "filter": [
            "word_delimiter",
            "lowercase"
          ]
        }
      }
    }
  }
}

Insert some test data

POST /gino_product/product/1
{
  "productName": "耐克女生运动轻跑鞋",
  "brandName": "耐克",
  "sortName": "鞋子",
  "productKeyword": "耐克，潮流，运动，轻跑鞋"
}

POST /gino_product/product/2
{
  "productName": "耐克女生休闲运动服",
  "brandName": "耐克",
  "sortName": "上衣",
  "productKeyword": "耐克，休闲，运动"
}

POST /gino_product/product/3
{
  "productName": "阿迪达斯女生冬季运动板鞋",
  "brandName": "阿迪达斯",
  "sortName": "鞋子",
  "productKeyword": "阿迪达斯，冬季，运动，板鞋"
}

POST /gino_product/product/4
{
  "productName": "阿迪达斯女生冬季运动夹克外套",
  "brandName": "阿迪达斯",
  "sortName": "上衣",
  "productKeyword": "阿迪达斯，冬季，运动，夹克，外套"
}

Test data overview

Search for [Sports] alone

POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "运动",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}

All three types return all 4 products, and the result order is the same for each type.
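The `<multi-match-type>` placeholder in the request above has to be replaced with a concrete type before the request is sent. A minimal Python sketch of generating one request body per type follows. The helper name `build_multi_match` is hypothetical, and only the standard library is used, so actually sending the body to a cluster is left out.

```python
import json

# Field list and boosts copied from the search request above.
FIELDS = [
    "brandName^100",
    "brandName.brandName_pinyin^100",
    "brandName.brandName_keyword^100",
    "sortName^80",
    "sortName.sortName_pinyin^80",
    "productName^60",
    "productKeyword^20",
]


def build_multi_match(query, mm_type, operator="AND"):
    """Return a _search request body for one multi_match type (hypothetical helper)."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fields": FIELDS,
                "type": mm_type,
                "operator": operator,
            }
        }
    }


if __name__ == "__main__":
    for mm_type in ("best_fields", "most_fields", "cross_fields"):
        body = build_multi_match("运动", mm_type)
        # ensure_ascii=False keeps the Chinese query text readable.
        print(json.dumps(body, ensure_ascii=False, indent=2))
```

Each printed body can be pasted into the console as the payload of POST /gino_product/_search.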

Search for [Sports Tops]

POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}

This time only cross_fields returns results; best_fields and most_fields return nothing. Why?

Use the validate API to compare the difference

POST /gino_product/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": <multi-match-type>,
      "operator": "AND"
    }
  }
}

best_fields: all input tokens must match within a single field.

Each field is matched using the analyzer and search_analyzer defined for it in the mapping.

(+brandName:运动 +brandName:上衣)^100.0 
| (+brandName.brandName_pinyin:运 +brandName.brandName_pinyin:动 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0 
| (+brandName.brandName_keyword:运 +brandName.brandName_keyword:动 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
| (+sortName:运动 +sortName:上衣)^80.0 
| (+sortName.sortName_pinyin:运 +sortName.sortName_pinyin:动 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0 
| (+productName:运动 +productName:上衣)^60.0 
| (+productKeyword:运动 +productKeyword:上衣)^20.0
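The rewritten query shows why nothing is returned: with the AND operator, one single field has to contain every token. This can be sketched in Python as below. The token sets for product 2 are written by hand as an assumption (the real tokens depend on the ik and pinyin analyzers), and `field_centric_match` is a hypothetical helper, not an Elasticsearch API.

```python
# Hand-written token sets for product 2; the real tokens depend on the
# analyzers, so treat these as assumptions for illustration only.
PRODUCT_2 = {
    "brandName": {"耐克"},
    "sortName": {"上衣"},
    "productKeyword": {"耐克", "休闲", "运动"},
}


def field_centric_match(doc, tokens):
    """AND in best_fields/most_fields: some single field must contain every token."""
    return any(
        all(token in field_tokens for token in tokens)
        for field_tokens in doc.values()
    )


print(field_centric_match(PRODUCT_2, ["运动"]))          # True
print(field_centric_match(PRODUCT_2, ["运动", "上衣"]))  # False
```

The second call fails because 运动 and 上衣 live in different fields, which is exactly the situation in the test data.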

most_fields: all input tokens must match within a single field.

The difference from best_fields lies in the relevance score: best_fields takes the highest score among the matching fields (max), while most_fields adds up the scores of all matching fields (sum).

(
    (+brandName:运动 +brandName:上衣)^100.0 
    (+brandName.brandName_pinyin:运 +brandName.brandName_pinyin:动 +brandName.brandName_pinyin:上 +brandName.brandName_pinyin:衣)^100.0 
    (+brandName.brandName_keyword:运 +brandName.brandName_keyword:动 +brandName.brandName_keyword:上 +brandName.brandName_keyword:衣)^100.0
    (+sortName:运动 +sortName:上衣)^80.0 
    (+sortName.sortName_pinyin:运 +sortName.sortName_pinyin:动 +sortName.sortName_pinyin:上 +sortName.sortName_pinyin:衣)^80.0 
    (+productName:运动 +productName:上衣)^60.0 
    (+productKeyword:运动 +productKeyword:上衣)^20.0
)
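The max versus sum difference can be sketched in a few lines of Python. The per-field scores are invented for illustration, the function names are hypothetical, and normalisation factors such as coord are deliberately ignored. The `tie_breaker` parameter mirrors the multi_match option of the same name, which defaults to 0.0 for best_fields.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """dis_max-style combination: best score plus tie_breaker times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


def most_fields_score(field_scores):
    """Sum of the scores of every matching field (simplified, no coord)."""
    return sum(field_scores.values())


# Hypothetical per-field scores for one document.
scores = {"brandName": 4.0, "productName": 2.0, "productKeyword": 1.0}

print(best_fields_score(scores))  # 4.0
print(most_fields_score(scores))  # 7.0
```

With a tie_breaker of 1.0 the two functions return the same value, which is one way to see most_fields as the far end of the same scale.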

cross_fields: every input token must match in at least one field of the same field group.

Elasticsearch first rewrites a cross_fields query by grouping the fields according to their search_analyzer. In our example, the three fields [brandName.brandName_pinyin, brandName.brandName_keyword, sortName.sortName_pinyin] use standard as their search_analyzer, while the remaining fields use fulltext_analyzer, so the fields end up in two groups.

(
  (
    +(brandName.brandName_pinyin:运^100.0 | sortName.sortName_pinyin:运^80.0 | brandName.brandName_keyword:运^100.0) 
    +(brandName.brandName_pinyin:动^100.0 | sortName.sortName_pinyin:动^80.0 | brandName.brandName_keyword:动^100.0) 
    +(brandName.brandName_pinyin:上^100.0 | sortName.sortName_pinyin:上^80.0 | brandName.brandName_keyword:上^100.0) 
    +(brandName.brandName_pinyin:衣^100.0 | sortName.sortName_pinyin:衣^80.0 | brandName.brandName_keyword:衣^100.0)
  ) 
  (
    +(productKeyword:运动^20.0 | brandName:运动^100.0 | sortName:运动^80.0 | productName:运动^60.0) 
    +(productKeyword:上衣^20.0 | brandName:上衣^100.0 | sortName:上衣^80.0 | productName:上衣^60.0)
  )
)
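The grouping and the term-centric matching inside each group can be sketched in Python as follows. The analyzer table is read off the mapping above; the token sets for product 2 are hand-written assumptions, and both function names are hypothetical.

```python
from collections import defaultdict

# Search-time analyzer of each queried field, read off the mapping above
# (a field without search_analyzer falls back to its analyzer).
SEARCH_ANALYZER = {
    "brandName": "fulltext_analyzer",
    "brandName.brandName_pinyin": "standard",
    "brandName.brandName_keyword": "standard",
    "sortName": "fulltext_analyzer",
    "sortName.sortName_pinyin": "standard",
    "productName": "fulltext_analyzer",
    "productKeyword": "fulltext_analyzer",
}

# Hand-written token sets for product 2 (assumed analyzer output).
PRODUCT_2 = {
    "brandName": {"耐克"},
    "sortName": {"上衣"},
    "productKeyword": {"耐克", "休闲", "运动"},
}


def group_fields(search_analyzer_by_field):
    """Put fields that share a search analyzer into the same group."""
    groups = defaultdict(list)
    for field, analyzer in search_analyzer_by_field.items():
        groups[analyzer].append(field)
    return dict(groups)


def term_centric_match(doc, fields, tokens):
    """AND in cross_fields: each token must occur in at least one field of the group."""
    return all(
        any(token in doc.get(field, set()) for field in fields)
        for token in tokens
    )


groups = group_fields(SEARCH_ANALYZER)
print(term_centric_match(PRODUCT_2, groups["fulltext_analyzer"], ["运动", "上衣"]))  # True
print(term_centric_match(PRODUCT_2, groups["standard"], ["运", "动", "上", "衣"]))   # False
```

Only the fulltext_analyzer group matches, which agrees with the observation that cross_fields finds the product through sortName and productKeyword together.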

Keep exploring and thinking

How to make best_fields and most_fields also match products?

The most common way is to search a combined field, either the _all field or a copy_to field such as bigSearchField in our mapping.

How to improve search results for cross_fields?

Because cross_fields groups fields by search_analyzer, mixed input such as [sports shangyi] (a Chinese word plus pinyin) cannot match any product, since the two tokens would have to match in different groups. The number of groups should therefore be kept as small as possible: either use the same search_analyzer for all fields, or specify an analyzer in the query itself to override the search_analyzer defined in the mapping.
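As a sketch of the second option, the multi_match query accepts an `analyzer` parameter, and the official documentation notes that specifying it forces all fields into the same group. The Python snippet below only builds the request body; choosing fulltext_analyzer as the forced analyzer is an assumption made for illustration, and whether its tokens actually match every field should still be checked with the validate API.

```python
import json

# Same fields and boosts as in the requests above.
FIELDS = [
    "brandName^100",
    "brandName.brandName_pinyin^100",
    "brandName.brandName_keyword^100",
    "sortName^80",
    "sortName.sortName_pinyin^80",
    "productName^60",
    "productKeyword^20",
]

body = {
    "query": {
        "multi_match": {
            "query": "运动 上衣",
            "fields": FIELDS,
            "type": "cross_fields",
            "operator": "AND",
            # Assumption: force one analyzer so that all fields share a group.
            "analyzer": "fulltext_analyzer",
        }
    }
}

print(json.dumps(body, ensure_ascii=False, indent=2))
```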

How about changing the operator to OR?

In the examples above the operator is set to AND, which means every query token must match. What happens when it is set to OR, and in which scenarios should OR be used?

Take special care with OR: a product is returned as soon as a single token matches. For example, the [Sports Tops] search above would also return shoes, so search precision drops considerably.

OR is still useful for some searches. When we search for [Nike Adidas tops] with the AND operator, no product can match regardless of the multi-match type (think about why). In that case we can set the operator to OR and set minimum_should_match to 60%, so that tops from both Nike and Adidas are returned; this works as a smart search downgrade. Note that Elasticsearch rounds the number computed from a percentage down, so how strict 60% is depends on how many tokens the query produces.

POST /gino_product/_search
{
  "query": {
    "multi_match": {
      "query": "耐克 阿迪达斯 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "OR",
      "minimum_should_match": "60%"
    }
  }
}
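How many tokens a percentage actually requires is worth checking by hand. The Python sketch below applies the documented rounding-down rule to a positive percentage and then counts matching tokens per product. It assumes the query analyzes to the three tokens 耐克, 阿迪达斯 and 上衣, uses hand-written token sets for only two fields, and ignores the second (standard) field group; all names are hypothetical.

```python
def required_matches(optional_clauses, percent):
    """Clauses required by a positive percentage, rounded down, at least 1."""
    return max(1, optional_clauses * percent // 100)


# Hand-written token sets (assumed analyzer output), limited to two fields.
PRODUCTS = {
    1: {"brandName": {"耐克"}, "sortName": {"鞋子"}},
    2: {"brandName": {"耐克"}, "sortName": {"上衣"}},
    3: {"brandName": {"阿迪达斯"}, "sortName": {"鞋子"}},
    4: {"brandName": {"阿迪达斯"}, "sortName": {"上衣"}},
}


def matching_ids(tokens, percent):
    """Ids of products in which enough tokens match in any field (term-centric OR)."""
    need = required_matches(len(tokens), percent)
    hits = []
    for product_id, doc in PRODUCTS.items():
        found = sum(
            any(token in field_tokens for field_tokens in doc.values())
            for token in tokens
        )
        if found >= need:
            hits.append(product_id)
    return hits


tokens = ["耐克", "阿迪达斯", "上衣"]
print(required_matches(len(tokens), 60))  # 1
print(matching_ids(tokens, 60))           # [1, 2, 3, 4]
print(matching_ids(tokens, 67))           # [2, 4]
```

Under these assumptions 60% of three tokens rounds down to a single required token, so shoes are returned as well; a value that rounds to two tokens restricts the result to tops. The explain output is the way to confirm what a real cluster does.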

Relevance scoring again

In the article "Elasticsearch Relevance Scoring Mechanism Learning" we discussed how best_fields and cross_fields are scored, but the examples there all used the same search_analyzer. How is the cross_fields score calculated when the fields are split into groups?

Let's still use the above example and add the explain parameter to see.

POST /gino_product/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "运动 上衣",
      "fields": [
        "brandName^100",
        "brandName.brandName_pinyin^100",
        "brandName.brandName_keyword^100",
        "sortName^80",
        "sortName.sortName_pinyin^80",
        "productName^60",
        "productKeyword^20"
      ],
      "type": "cross_fields",
      "operator": "AND"
    }
  }
}

Detailed ES response message: cross_fields_scoring.json

From the grouping information returned by the validate API and the scoring details returned by explain, the cross_fields scoring formula can be summarized as:

score(q, d) = coord(q, d) * ∑(∑(max(score(t, f))))

coord(q, d): the group-matching factor; in the example above only one of the two groups matches, so coord is 0.5;
score(t, f): the relevance score between one query token and one specific field (computed with TF-IDF);
max: for each query token, take the highest score across all fields of the group;
Intra-group summation: within a group, sum the per-token maximum over all query tokens;
Summation between groups: finally, sum the scores of all groups.
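The formula can be traced with a small Python sketch. The per-field scores below are invented for illustration (the real values come from the explain output), a group counts as matched only when every token has at least one field score, and `cross_fields_score` is a hypothetical helper.

```python
def cross_fields_score(groups):
    """groups: list of {token: {field: score}} dicts, one per analyzer group."""
    # With the AND operator a group matches only if every token scored somewhere.
    matched = [group for group in groups if group and all(group.values())]
    if not matched:
        return 0.0
    coord = len(matched) / len(groups)
    total = sum(
        max(field_scores.values())      # max over the fields for one token
        for group in matched            # summation between groups
        for field_scores in group.values()  # summation inside a group
    )
    return coord * total


# Hypothetical scores: the standard group matches nothing,
# the fulltext_analyzer group matches both tokens.
standard_group = {"运": {}, "动": {}, "上": {}, "衣": {}}
fulltext_group = {
    "运动": {"productName": 1.5, "productKeyword": 0.5},
    "上衣": {"sortName": 2.0},
}

print(cross_fields_score([standard_group, fulltext_group]))  # 1.75
```

Here coord is 0.5 because one of two groups matches, and the matched group contributes 1.5 + 2.0, giving 0.5 * 3.5 = 1.75.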

Summary

best_fields works well when the query is a single token. For example, when searching for [Nike], a product whose brand is Nike scores higher than one that merely has Nike among its product keywords. When multiple tokens must match across fields, however, the only option is to introduce a combined field such as bigSearchField, and then the per-field weights no longer have any effect;
most_fields is similar to best_fields; its advantage is that it rewards documents that match in as many fields as possible, which makes the relevance score more reasonable;
The biggest advantage of cross_fields is that it can match across fields while making full use of each field's weight. Note, however, that matching is grouped by search_analyzer, and tokens cannot be matched across fields that belong to different groups.

References

Elasticsearch Reference > Multi Match Query