36. Elasticsearch in practice | What to do when match_phrase finds nothing (Elasticsearch advanced retrieval)

1. Posing the problem

If a phrase exists in some document in Elasticsearch, there must be some query that can find it.
Example:

title = "公路局正在治理解放大道路面积水问题" (The Highway Bureau is dealing with water accumulating on the surface of Jiefang Avenue.)

If we search for the keyword 道路 ("road"), can this document be found? 
In practical applications, the requirements may be: 
1) Searching for keywords such as 理解, 解放, 道路, or 放大 should find this document. 
2) Single-character splits such as 治 ("govern") and 水 ("water") are too noisy and should not match. 
3) Words that are not in the dictionary must still be findable. 
4) Any word that appears in the original title or content must be retrievable. 
5) Retrieval should be fast; the performance problems of wildcard fuzzy matching rule it out.

2. Problem analysis

The commonly used standard analyzer can meet requirements 1), 3), 4), and 5). 
So what exactly is the standard analyzer? 
The standard analyzer is the default analyzer: if no analyzer is specified, it is used. It provides grammar-based tokenization and works well for most languages. 
For Chinese strings, it splits the text character by character.
The result of the standard tokenizer is as follows:

GET /ik_index/_analyze?analyzer=standard
{
  "text": "公路局正在治理解放大道路面积水问题"
}
公,路,局,正,在,治,理,解,放,大,道,路,面,积,水,问,题
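This character-by-character behavior for Chinese can be approximated in a few lines (a minimal Python sketch, assuming plain CJK input with no punctuation; the real standard analyzer also lowercases and handles mixed scripts):

```python
def standard_like_tokens(text: str) -> list[str]:
    # For Chinese text, the standard analyzer emits one token per character.
    return [ch for ch in text if not ch.isspace()]

print(",".join(standard_like_tokens("公路局正在治理解放大道路面积水问题")))
# 公,路,局,正,在,治,理,解,放,大,道,路,面,积,水,问,题
```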

However, this produces a lot of noise in the results. 
Requirement 2) rules out plain match queries over the standard-analyzed field. 
Requirement 5) rules out wildcard fuzzy retrieval. 
Requirements 3) and 4) mean that new, out-of-dictionary words (for example, newly coined terms or names) must also be searchable. 
For requirement 1), match_phrase looks like the more reliable choice.

3. A first attempt

Let's first try the fine-grained ik_max_word analyzer together with match_phrase.

Step 1: Define the index and mapping

PUT ik_index
{
  "mappings": {
    "ik_type": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "ik_my_max": {
              "type": "text",
              "analyzer": "ik_max_word"
            },
            "ik_my_smart": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

Here, to compare the two analyzers, the title field is indexed with both ik_smart and ik_max_word. 
This is not needed in real development: keeping both analyzed sub-fields makes the index considerably larger at indexing time, which hurts disk usage and retrieval performance.

Step 2: Insert the document

POST ik_index/ik_type/3
{
  "title":"公路局正在治理解放大道路面积水问题"
}

Step 3: Execute the search

POST ik_index/ik_type/_search
{
  "profile": "true",
  "query": {
    "match_phrase": {
      "title.ik_my_max": "道路"
    }
  }
}

The search result is as follows: no hits are returned.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Why does match_phrase return nothing, even though the field was indexed with the fine-grained ik_max_word analyzer? 
Analysis: 
The fine-grained ik_max_word segmentation result is:

GET /ik_index/_analyze?analyzer=ik_max_word
{
  "text": "公路局正在治理解放大道路面积水问题"
}
公路局 ,公路 ,路局 ,路 ,局正 ,正在 ,正 ,治理 ,治 ,理解 ,
理 ,解放 ,解 ,放大 ,大道 ,大 ,道路 ,道 ,路面 ,路 ,
面积 ,面 ,积水 ,积 ,水 ,问题
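Counting list indices from 0 shows where the phrase terms land (a quick Python check; in this example ik_max_word's position numbers coincide with the indices in the token list above):

```python
tokens = ["公路局", "公路", "路局", "路", "局正", "正在", "正", "治理", "治",
          "理解", "理", "解放", "解", "放大", "大道", "大", "道路", "道",
          "路面", "路", "面积", "面", "积水", "积", "水", "问题"]

# every position at which each term of the query phrase appears
for term in ["道路", "道", "路"]:
    positions = [i for i, t in enumerate(tokens) if t == term]
    print(term, positions)
# 道路 [16]
# 道 [17]
# 路 [3, 19]
```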

Besides the tokens themselves, the _analyze API can also return each token's position.

When the document is indexed, the terms 道路, 道, and 路 are assigned positions 16, 17, and 19. (Note that position 18 in between is taken by 路面:)

{
  "token": "路面",
  "start_offset": 11,
  "end_offset": 13,
  "type": "CN_WORD",
  "position": 18
}

At query time, 道路 is analyzed into: 道路 (position 0), 道 (position 1), 路 (position 2).

During a match_phrase query, a document must satisfy both of the following conditions to be retrieved: 
1) all terms of the analyzed query appear in the field; 
2) the terms appear in the field in the same relative order, with no other terms in between. 
Because position information is stored in the inverted index, the position-sensitive match_phrase query can use it to find documents that contain all the query terms, in the same order, at consecutive positions.
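These two conditions can be sketched as a toy position-aware phrase matcher (a simplified model of what Lucene does, assuming slop = 0 and one position per token; names like phrase_match are illustrative):

```python
from collections import defaultdict

def build_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[term].append(pos)
    return positions

def phrase_match(doc_positions, query_terms):
    """True iff the query terms occur in the document at consecutive,
    in-order positions (slop = 0)."""
    if not query_terms:
        return False
    for start in doc_positions.get(query_terms[0], []):
        if all(start + i in doc_positions.get(term, [])
               for i, term in enumerate(query_terms[1:], start=1)):
            return True
    return False

# Document 3: 道路/道/路 sit at positions 16, 17, 19 -- 路面 breaks the run.
doc3 = build_positions(["公路局", "公路", "路局", "路", "局正", "正在", "正",
                        "治理", "治", "理解", "理", "解放", "解", "放大", "大道",
                        "大", "道路", "道", "路面", "路", "面积", "面", "积水",
                        "积", "水", "问题"])
# Document 4: 道路/道/路 sit at consecutive positions 15, 16, 17.
doc4 = build_positions(["党员干部", "党员", "干部", "坚持走", "坚持", "坚", "持",
                        "走马", "马克思主义", "马克思", "马克", "马", "克", "思",
                        "主义", "道路", "道", "路", "重要性", "重要", "要性", "性"])

query = ["道路", "道", "路"]   # ik_max_word analysis of "道路" at query time
print(phrase_match(doc3, query))  # False
print(phrase_match(doc4, query))  # True
```

The first document fails because 路面 occupies position 18 between 道 (17) and 路 (19); the second matches because the three terms are consecutive.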

To verify this explanation, add another title containing 道路 and search again:

POST ik_index/ik_type/4
{
  "title":"党员干部坚持走马克思主义道路的重要性"
}

Note: this time, searching for 道路 does match.

  "hits": {
  "total": 1,
  "max_score": 1.9684901,
  "hits": [
  {
  "_index": "ik_index",
  "_type": "ik_type",
  "_id": "4",
  "_score": 1.9684901,
  "_source": {
  "title": "党员干部坚持走马克思主义道路的重要性"
  }
  }
  ]
  },

The fine-grained ik_max_word segmentation result of this title is:

党员干部, 党员, 干部, 坚持走, 坚持, 坚, 持, 走马, 马克思主义, 马克思,
马克, 马, 克, 思, 主义, 道路, 道, 路, 重要性, 重要,
要性, 性

When this document is indexed, 道路, 道, and 路 are assigned the consecutive positions 15, 16, and 17, 
which is consistent with the order of the query terms. 
A more detailed analysis: http://t.cn/R8pzw9e

4. When match_phrase finds nothing, is there another solution?

Yes. match_phrase_prefix is similar to match_phrase, but it additionally supports prefix matching on the last term. 
match_phrase_prefix behaves essentially like match_phrase, except that the last term of the analyzed query text is matched as a prefix. The max_expansions parameter controls how many terms the last prefix may be rewritten into, i.e. the number of prefix expansions; the default is 50 (the value recommended in the official documentation). 
The more prefixes are expanded, the more documents can be found; 
if max_expansions is too small, matching documents may be missed.
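The rewrite step can be sketched as follows (a hypothetical illustration: expand_last_term and the toy vocabulary are made up, and Lucene actually enumerates prefixes from the on-disk term dictionary):

```python
def expand_last_term(query_terms, vocabulary, max_expansions=50):
    """Rewrite the last query term into every vocabulary term that starts
    with it, capped at max_expansions (the match_phrase_prefix idea)."""
    prefix = query_terms[-1]
    candidates = sorted(t for t in vocabulary if t.startswith(prefix))
    return [query_terms[:-1] + [term] for term in candidates[:max_expansions]]

vocab = {"道路", "道", "路", "路面", "大道", "公路", "公路局"}
for phrase in expand_last_term(["道路", "道", "路"], vocab):
    print(phrase)
# ['道路', '道', '路']
# ['道路', '道', '路面']
```

Note that one of the expansions, 道路/道/路面, is exactly the term sequence sitting at consecutive positions 16, 17, 18 in the title 公路局正在治理解放大道路面积水问题, which explains why match_phrase_prefix can find the document that plain match_phrase missed.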

POST ik_index/ik_type/_search
{
  "profile": "true",
  "query": {
    "match_phrase_prefix": {
      "title.ik_my_max": {
        "query": "道路",
        "max_expansions": 50
      }
    }
  }
}

As verified, the keywords 理解, 解放, 道路, and 放大 can all find this document.

5. Application scenarios

When we build our own search applications, we often search on the title or content fields. 
A match query brings a lot of noise; 
match_phrase misses some documents, such as the 道路 case analyzed above; 
wildcard can find them, but has performance problems. 
In such cases, consider match_phrase_prefix.

6. Summary

In real development, choose the analyzer according to the application scenario. 
If you choose ik, ik_max_word is recommended, because its segmentation result is a superset of ik_smart's. 
When querying: if you want as many results as possible, consider match; 
if you want to match the analyzed terms as precisely as possible, consider match_phrase; 
if you are worried about missing phrase matches, consider match_phrase_prefix.
