Elasticsearch search types (query types) explained

When querying ES, you can specify one of the following search types:
QUERY_THEN_FETCH: returns exactly the number of results requested, but the query is slower;
QUERY_AND_FETCH: may return more results than requested;
DFS_QUERY_THEN_FETCH: more precise than QUERY_THEN_FETCH, and the slowest of all;
DFS_QUERY_AND_FETCH: more precise than QUERY_AND_FETCH.
So what is the difference between these 4 search types?

Background on distributed search:
ES was built to be distributed, but distribution has its own drawbacks. Suppose you search for a term and the data is spread over 5 shards, possibly on 5 different hosts. Full-text search results are inherently ranked by how well they match, so how can ES produce a correct final ordering when the data lives on 5 shards? It does this in roughly two steps.
Step 1. The ES client sends the search request for the term to all 5 shards at the same time. This is called scatter.
Step 2. Each of the 5 shards independently runs the search against its own data and returns all results that match. This is called gather.
The client then re-sorts and ranks the returned results and finally returns them to the user. In other words, an ES search is a scatter/gather process (very similar to MapReduce).
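
The two steps above can be sketched in a few lines of Python. This is only a toy model, not how ES is implemented: the in-memory `SHARDS` list, the scores, and the function names are all made up for illustration.

```python
import heapq

# Hypothetical in-memory "shards": each maps a document id to a relevance
# score that the shard has already computed for the search term.
SHARDS = [
    {"a1": 2.1, "a2": 0.4},
    {"b1": 1.7, "b2": 3.0},
    {"c1": 0.9},
]

def search_shard(shard, size):
    """Scatter target: return the shard's best `size` hits as (score, doc_id)."""
    hits = ((score, doc_id) for doc_id, score in shard.items())
    return heapq.nlargest(size, hits)

def scatter_gather(shards, size):
    """Query every shard, then merge and re-rank the partial results."""
    partial = [hit for shard in shards for hit in search_shard(shard, size)]
    return [doc_id for _score, doc_id in heapq.nlargest(size, partial)]

top = scatter_gather(SHARDS, size=2)
print(top)
```

Each shard only knows its own best hits, so the final ordering exists only after the merge.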

However, there are two problems.
First, the problem of quantity. Suppose a user searches for "Shuanghuanglian" and asks for the 10 best matches. Data related to Shuanghuanglian may be stored on any of the 5 shards, so ES sends the query to all 5 shards and asks each one to return its 10 best matching records. ES can therefore receive up to 10*5 = 50 records. If it sorts them globally and keeps only the top 10, the user gets what was asked for; if the shard results are passed on as they are, the user receives more results than requested.
Second, the problem of ranking. In the search above, each shard computes its scores from its own data only. The term frequency and other statistics that feed the score are local to the shard, yet ES builds the overall ranking from these per-shard scores, which can make the ranking inaccurate. To control the ordering more precisely, ES should first collect the ranking-related statistics (term frequency, etc.) from all 5 shards, combine them, and then query each shard using the global statistics.
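
To make the ranking problem concrete, here is a small numeric sketch. The shard statistics are invented, and `idf` is a deliberately simplified inverse document frequency, not the scoring formula ES actually uses.

```python
import math

# Hypothetical per-shard statistics for one search term:
# (documents containing the term, total documents on the shard).
shard_stats = [(1, 100), (50, 100)]

def idf(doc_freq, doc_count):
    """Simplified inverse document frequency; real scoring formulas differ."""
    return math.log(doc_count / doc_freq)

# Each shard scores with its own statistics, so the same term gets a
# different weight depending on where the document happens to live.
local_idfs = [idf(df, n) for df, n in shard_stats]

# Summing the statistics first gives one weight shared by all shards.
global_df = sum(df for df, _ in shard_stats)
global_n = sum(n for _, n in shard_stats)
global_idf = idf(global_df, global_n)

print(local_idfs, global_idf)
```

The term looks rare on the first shard and common on the second, so two equally relevant documents would receive very different scores unless the global weight is used.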

ES does not seem to have a perfect solution to either problem, so it leaves the choice to the user: you specify the search type when searching.
  • 1. query and fetch
  •    The query is sent to all shards of the index, and each shard returns the documents together with the computed ranking information. This is the fastest search type, because unlike the types below it queries each shard only once. However, the total number of results returned by the shards may be up to n times the size requested by the user, where n is the number of shards.
  • 2. query then fetch (the default search type)
  •    This is the search type used if you do not specify one. It has roughly two steps. In the first step, the request is sent to all shards, and each shard returns only the information needed for sorting and ranking (not the documents themselves). The results are then re-sorted and ranked by the scores from each shard, and the top size documents are selected. In the second step, those documents are fetched from the relevant shards. The number of documents returned this way equals the size requested by the user.
  • 3. DFS query and fetch
  •    This type adds an initial scatter step to the first type. This extra step is said to allow more precise control over scoring and ranking.
  • 4. DFS query then fetch
  •    This type adds an initial scatter step to the second type.
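
The difference between the first two types can be sketched as follows. This is a simplified simulation under made-up assumptions (five in-memory shards of 20 documents each, scores assigned arbitrarily), not the real ES code path.

```python
def make_shard(prefix, n):
    """Hypothetical shard: doc_id -> (score, document body)."""
    return {f"{prefix}{i}": (float(i), {"text": f"doc {prefix}{i}"})
            for i in range(n)}

SHARDS = [make_shard(p, 20) for p in "abcde"]

def top_ids(shard, size):
    """Ids of the shard's best `size` documents, highest score first."""
    return sorted(shard, key=lambda d: shard[d][0], reverse=True)[:size]

def query_and_fetch(shards, size):
    """One round trip: each shard returns scores and full documents together."""
    hits = []
    for shard in shards:
        hits += [(shard[d][0], d, shard[d][1]) for d in top_ids(shard, size)]
    return hits  # up to len(shards) * size documents

def query_then_fetch(shards, size):
    """Two round trips: collect (score, id) only, rank, then fetch the winners."""
    # Phase 1 (query): lightweight sort information from every shard.
    candidates = []
    for idx, shard in enumerate(shards):
        candidates += [(shard[d][0], d, idx) for d in top_ids(shard, size)]
    winners = sorted(candidates, key=lambda c: c[0], reverse=True)[:size]
    # Phase 2 (fetch): ask only the owning shards for the document bodies.
    return [(score, d, shards[idx][d][1]) for score, d, idx in winners]

print(len(query_and_fetch(SHARDS, 10)))   # 5 shards * 10 hits each
print(len(query_then_fetch(SHARDS, 10)))  # exactly the requested size
```

The second function moves less data in its first phase because document bodies travel only for the final winners, at the cost of an extra round trip.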


What does DFS stand for? What is the initial scatter step?
According to the official ES website, the initial scatter step collects the term frequency and document frequency from each shard before the real query is run. Then, during the term search, each shard searches and ranks using the global term frequency and document frequency. Clearly DFS_QUERY_THEN_FETCH is the least efficient type, because one search may require three round trips to the shards. With DFS, however, search accuracy should be the highest.
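
A rough sketch of what the initial scatter step might collect is shown below. The per-shard counts are invented, and the `ROUND_TRIPS` table simply restates the step counts described in this article rather than anything measured.

```python
from collections import Counter

# Hypothetical per-shard document frequencies:
# term -> number of documents on that shard containing the term.
shard_doc_freqs = [
    Counter({"elasticsearch": 3, "shard": 10}),
    Counter({"elasticsearch": 40, "shard": 12}),
    Counter({"shard": 7}),
]
shard_doc_counts = [100, 100, 50]

def collect_global_stats(doc_freqs, doc_counts):
    """Initial scatter: merge every shard's statistics into one global view."""
    global_df = Counter()
    for df in doc_freqs:
        global_df.update(df)  # Counter.update adds the counts together
    return global_df, sum(doc_counts)

global_df, global_docs = collect_global_stats(shard_doc_freqs, shard_doc_counts)
print(global_df, global_docs)

# Round trips per search type as described in the text: the DFS variants
# add one statistics-gathering trip in front of the non-DFS variant.
ROUND_TRIPS = {
    "query_and_fetch": 1,
    "query_then_fetch": 2,
    "dfs_query_and_fetch": 2,
    "dfs_query_then_fetch": 3,
}
```
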
As for what DFS stands for, I found no relevant information. The D may stand for Distributed, the F for Frequency, and the S for Scatter, so the whole name may be short for a distributed scatter of term frequency and document frequency.
To sum up, in terms of performance QUERY_AND_FETCH is the fastest and DFS_QUERY_THEN_FETCH is the slowest. In terms of search accuracy, the DFS types are more accurate than the non-DFS types.
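
As a sketch of how a search type is chosen in practice, the `_search` endpoint takes a `search_type` query-string parameter. The host and the index name `articles` below are assumptions for illustration. Also note that, as far as I know, recent ES releases accept only `query_then_fetch` and `dfs_query_then_fetch` for this parameter.

```python
from urllib.parse import urlencode

# Assumptions for illustration: a local node and an index named "articles".
BASE_URL = "http://localhost:9200"
INDEX = "articles"

def search_url(search_type):
    """Build a _search URL that selects the search type via the query string."""
    query = urlencode({"search_type": search_type})
    return f"{BASE_URL}/{INDEX}/_search?{query}"

url = search_url("dfs_query_then_fetch")
print(url)
```

The query body itself is unchanged; only the URL parameter decides which of the strategies above is used.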

There is also a good article on aggregation queries: https://elasticsearch.cn/article/102
