Elasticsearch's three pagination methods and their trade-offs (one article to help you choose)

1. Frequently asked questions about Elasticsearch paged queries

  • Question 1: Apart from increasing max_result_window, is there any other way to retrieve all values of a certain field on an index (about 1 million documents) in one request?

  • Question 2: For ES pagination where the front end shows 20 items at a time and clicking "next page" fetches the next 20, how should the query be written?

  • Question 3: What are the essential differences between from + size, scroll, and search_after, and when should each be used?

2. Three paging query methods supported by Elasticsearch

  • From + Size query
  • Search After query
  • Scroll query

The sections below cover how the three methods relate to and differ from one another, their pros and cons, and the scenarios each is suited for.

2.1 From + size paging query

2.1.1 From + size paging query definition and practical cases

The basic query is as follows:

GET kibana_sample_data_flights/_search

By default, this returns the first 10 matching documents, where:

  • from: defaults to 0 when unspecified (note: not 1); it is the offset of the first result on the current page.
  • size: defaults to 10 when unspecified; it is the number of results returned per page.

Specify conditional query and sorting as follows:

GET kibana_sample_data_flights/_search
{
  "from": 0,
  "size":20,
  "query": {
    "match": {
      "DestWeather": "Sunny"
    }
  },
  "sort": [
    {
      "FlightTimeHour": {
        "order": "desc"
      }
    }
  ]
}

A total of 20 documents are returned.

Together, the from and size parameters define the window of results displayed on the page.
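The from/size window arithmetic can be sketched in a few lines of Python (a toy helper, not part of any Elasticsearch client; the function name is made up for illustration):

```python
def page_window(page, page_size):
    """Map a 1-based page number to the (from, size) pair that
    from + size pagination expects. Page 1 starts at from=0."""
    return (page - 1) * page_size, page_size

# Page 1 of 20-per-page results starts at offset 0, not 1.
assert page_window(1, 20) == (0, 20)
# "Next page" in the UI just advances from by size.
assert page_window(2, 20) == (20, 20)
```

A front end showing 20 items per page would pass the returned pair as the from and size fields of the query body.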

2.1.2 From + size query advantages and disadvantages and applicable scenarios

From + size query advantages

  • Support random page turning.

From + size query disadvantages

  • Constrained by the max_result_window setting, so you cannot page indefinitely.

  • Suffers from the deep-pagination problem: the deeper you page, the slower the query.

From + size query applicable scenarios

First: very well suited to small data sets, or to large data sets where only the top N (N <= 10000) results are needed.

Second: business scenarios like mainstream web search engines (Google, Bing, Baidu, 360, Sogou, etc.) that support jumping to an arbitrary page.

2.1.3 From + size is not recommended for deep pagination

Elasticsearch limits the maximum result window to avoid the poor performance caused by recalling very large result sets.

Elasticsearch's max_result_window defaults to 10000. That means: with 10 results per page, you can page through at most 1000 pages.

In fact, mainstream search engines do not let you page that far either. For example, search Baidu for "Shanghai": by around page 76 you can go no further, and a prompt appears, as shown in the screenshot below:

Both of the following paginated queries exceed the window:

GET kibana_sample_data_flights/_search
{
  "from": 0,
  "size":10001
}

GET kibana_sample_data_flights/_search
{
  "from": 10001,
  "size":10
}

The error returned is as follows (response truncated here):

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],

Why? The request exceeded the maximum window: index.max_result_window defaults to 10000.

The error message itself suggests two solutions:

  • Solution 1: for recalling large data sets, use the scroll API. It is explained in detail later.

  • Solution 2: increase the default value of index.max_result_window.
PUT kibana_sample_data_flights/_settings
{
    "index.max_result_window":50000
}

Official advice: avoid using from and size to page too deeply or to request too many results at once.

The core reason why it is not recommended to use from + size for deep paging query:

  • Search requests usually span multiple shards, and each shard must load into memory not only the hits for the requested page but also the hits for all preceding pages.

  • For deep pages or large result sets, these operations significantly increase memory and CPU usage, which can degrade performance or even bring down nodes.

What does that mean?

GET kibana_sample_data_flights/_search
{
  "from": 10001,
  "size": 10
}

Are just 10 documents loaded into memory? No!

In total, 10011 documents are loaded into memory; after sorting and merging behind the scenes, only the 10 documents we actually wanted are returned.

That also means the further back you page (i.e., the deeper the pagination), the more data has to be loaded, the more CPU and memory is consumed, and the slower the response.
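The shard-level cost described above can be sketched with a toy simulation (pure Python, no cluster; the function name, shard contents, and shard count are invented for illustration):

```python
def from_size_fetch(shards, from_, size):
    """Toy model of a coordinating node serving a from+size request.

    Every shard must hand back its top (from_ + size) hits; the
    coordinator merges and re-sorts all of them, then discards
    everything before from_. Returns (page, candidates_considered).
    """
    per_shard = from_ + size
    candidates = []
    for shard in shards:                  # shard = hit scores sorted descending
        candidates.extend(shard[:per_shard])
    merged = sorted(candidates, reverse=True)
    return merged[from_:from_ + size], len(candidates)

# 3 shards x 20000 hits each, pre-sorted descending (toy data).
shards = [sorted(range(i, 60000, 3), reverse=True) for i in range(3)]
page, considered = from_size_fetch(shards, 10001, 10)
assert len(page) == 10
# 30033 hits were shipped and re-sorted just to produce 10 results.
assert considered == 3 * (10001 + 10)
```

The deeper the page, the larger per_shard grows, which is exactly why resource usage climbs with from.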

2.2 search_after query

2.2.1 search_after query definition and practical cases

The essence of a search_after query: use the sort values from the last document of the previous page to retrieve the matching next page.

Precondition: search_after requires that subsequent requests see the same sorted result sequence as the first query. That is, even if new data is written while you page through the results, those writes must not affect the original result set.

How is that achieved?

A point in time (PIT) can be created to preserve the state of the index as of a specific moment for the duration of the search.

Point in time (PIT) is a feature introduced in Elasticsearch 7.10.

The essence of PIT: a lightweight view that preserves the state of the indexed data.

The following example illustrates what the PIT view means in practice.

# Create a PIT
POST kibana_sample_data_logs/_pit?keep_alive=1m

# Get the document count: 14074
POST kibana_sample_data_logs/_count

# Add one new document
POST kibana_sample_data_logs/_doc/14075
{
  "test":"just testing"
}

# The total count is now 14075
POST kibana_sample_data_logs/_count


# Query through the PIT: the count is still 14074, which shows the query
# runs against the view from the earlier point in time.
POST /_search
{
  "track_total_hits": true, 
  "query": {
    "match_all": {}
  }, 
   "pit": {
    "id": "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEN3RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA"
  }
}

With a PIT, all subsequent search_after queries run against the PIT view, which effectively guarantees a consistent result set.
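The snapshot behaviour shown in the example above can be mimicked with a toy class (pure Python; the class name and index are invented for illustration, and a real PIT keeps segment references alive rather than copying anything):

```python
class ToyPIT:
    """Toy stand-in for a point in time: freeze the current state of an
    'index' so that later writes are invisible through the view."""
    def __init__(self, index):
        self._snapshot = list(index)   # freeze the current doc list

    def count(self):
        return len(self._snapshot)

index = [{"_id": i} for i in range(14074)]
pit = ToyPIT(index)
index.append({"_id": 14074, "test": "just testing"})  # write after the PIT

assert len(index) == 14075   # the live index sees the new document
assert pit.count() == 14074  # the PIT view still shows the old count
```

This mirrors the _count example above: the live index reports 14075 while the PIT view keeps reporting 14074.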

search_after paging can be summarized in the following steps.

Step 1: Create a PIT view. This precondition cannot be omitted.

# Step 1: create a PIT
POST kibana_sample_data_logs/_pit?keep_alive=5m

The returned results are as follows:

{
  "id" : "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA"
}

keep_alive=5m is a parameter similar to scroll's: it means the view is retained for 5 minutes. A request that uses this PIT after 5 minutes fails with:

  "type" : "search_context_missing_exception",
  "reason" : "No search context found for id [91600]"

Step 2: Build the base query. This is where the sort used for paging is set.

# Step 2: create the base query
GET /_search
{
  "size":10,
  "query": {
    "match" : {
      "host" : "elastic"
    }
  },
  "pit": {
     "id":  "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA", 
     "keep_alive": "1m"
  },
  "sort": [ 
    {"response.keyword": "asc"}
  ]
}

  • With a PIT set, there is no need to specify an index in the search.

  • id is the PIT id returned in step 1.

  • sort specifies which field the results are ordered by.

Each returned document carries two sort values at the end, as follows:

 "sort" : [
          "200",
          4
        ]
  • "200" is the value of the sort field we specified: ascending order on {"response.keyword": "asc"}.

And what does the 4 represent?

  • 4 is the implicit sort value, based on the ascending order of _shard_doc.

The official documentation calls this implicit field the tiebreaker; it is equivalent to _shard_doc.

The essential purpose of the tiebreaker: a value unique to each document, guaranteeing that paging neither loses documents nor returns duplicates (whether on the same page or across pages).
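The role of the (sort value, tiebreaker) pair can be shown with a toy keyset-pagination sketch (pure Python; the documents and function name are invented for illustration):

```python
def search_after_page(docs, size, after=None):
    """Toy search_after: docs are pre-sorted by (sort_value, tiebreaker),
    and a page contains only documents strictly greater than `after`,
    the sort tuple of the previous page's last hit."""
    if after is not None:
        docs = [d for d in docs if d > after]
    return docs[:size]

# 25 docs all sharing the sort value "200"; the tiebreaker (modeled here
# as _shard_doc values 0..24) keeps their order unambiguous.
docs = [("200", i) for i in range(25)]
page1 = search_after_page(docs, 10)
page2 = search_after_page(docs, 10, after=page1[-1])
page3 = search_after_page(docs, 10, after=page2[-1])

# No document is lost and none is repeated across pages.
assert page1 + page2 + page3 == docs
```

Without the tiebreaker, the 25 identical sort values would make "everything after the last hit" ambiguous, which is exactly the duplicated-or-missing-page problem it exists to prevent.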

Step 3: Fetch the subsequent pages.

# Step 3: turn the page
GET /_search
{
  "size": 10,
  "query": {
    "match" : {
      "host" : "elastic"
    }
  },
  "pit": {
     "id":  "48myAwEXa2liYW5hX3NhbXBsZV9kYXRhX2xvZ3MWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAWdG1TOWFMTF9UdTZHdVZDYmhoWUljZwAAAAAAAAEg5RZGOFJCMGVrZVNndTk3U1I0SG81V3R3AAEWM2hGWXpxLXFSSGlfSmZIaXJWN0dxUQAA", 
     "keep_alive": "1m"
  },
  "sort": [
    {"response.keyword": "asc"}
  ],
  "search_after": [                                
    "200",
    4
  ]
}

Each subsequent page request must pass, via search_after, the sort values of the last document on the previous page.

As shown in the following code:

  "search_after": [                                
    "200",
    4
  ]

Clearly, search_after only supports paging forward to the next page; you cannot jump to an arbitrary page.

2.2.2 The advantages and disadvantages of search_after query and applicable scenarios

Advantages of search_after

  • It is not strictly limited by max_result_window; you can keep paging forward without limit.

ps: "not strictly" means: a single request still cannot exceed max_result_window, but the cumulative result set across pages can.

Disadvantages of search_after

  • It only supports sequential forward paging, not random page jumps.

search_after applicable scenarios

  • Similar to Toutiao's mobile search: https://m.toutiao.com/search

It does not support random page jumps, which makes it a good fit for mobile application scenarios.

2.3 Scroll traversal query

2.3.1 Scroll traversal query definition and practical cases

Unlike from + size and search_after, which return one page of data at a time, the scroll API retrieves large numbers of results (even all results) from a single search request, much as a cursor does in a traditional database.

If from + size and search_after are regarded as near-real-time request styles, scroll is clearly non-real-time: when data volumes are large, the response time can be much longer.

The scroll core execution steps are as follows:

Step 1: Specify the retrieval statement and set the scroll context retention time at the same time.

In fact, scroll provides by default the same view/snapshot semantics that search_after gets from PIT.

The results returned from a scroll request reflect the state of the index at the time the initial search request was made, as if a snapshot were taken at that moment. Subsequent changes to documents (writes, updates, or deletions) only affect later search requests.

POST kibana_sample_data_logs/_search?scroll=3m
{
  "size": 100,
  "query": {
    "match": {
      "host": "elastic"
    }
  }
}

Step 2: Keep paging forward, fetching data until no more results are returned.

POST _search/scroll                                   
{
  "scroll" : "3m",
  "scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFkY4UkIwZWtlU2d1OTdTUjRIbzVXdHcAAAAAAAGmkBZ0bVM5YUxMX1R1Nkd1VkNiaGhZSWNn" 
}

The scroll_id value comes from the response to the request in step 1.
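The cursor behaviour of scroll can be sketched as a toy generator (pure Python; the scroll_id is modeled as a plain offset for illustration):

```python
def scroll_batches(docs, batch_size):
    """Toy scroll: yield fixed-size batches until the result set is
    exhausted. The internal offset plays the role of the scroll_id
    that each follow-up request must send back."""
    pos = 0
    while pos < len(docs):
        yield docs[pos:pos + batch_size]
        pos += batch_size

# Traverse 250 documents 100 at a time; iteration stops once empty.
sizes = [len(batch) for batch in scroll_batches(list(range(250)), 100)]
assert sizes == [100, 100, 50]
```

As in the real API, the caller never computes an offset itself; it just keeps asking for "the next batch" until nothing comes back.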

2.3.2 Advantages and disadvantages of Scroll traversal query and applicable scenarios

Advantages of scroll queries

  • Full traversal of the result set is supported.

ps: the size of a single batch still cannot exceed max_result_window.

Disadvantages of scroll queries

  • Responses are not real time.

  • Sufficient heap memory space is required to retain the context.

Scroll query applicable scenarios

  • Traversing the full result set, or very large result sets, rather than paged display.

  • The official documentation emphasizes that scroll is no longer recommended for deep pagination: to page through more than the top 10,000 results, use PIT + search_after instead.

3. Summary

  • from + size: scenarios needing random jumps to different pages (like mainstream search engines), with paged display within the top 10,000 results.

  • search_after: scenarios that only page forward, and paging beyond the top 10,000 results.

  • scroll: scenarios that must traverse the full data set.

  • max_result_window: not recommended to raise it too far; doing so treats the symptom, not the root cause.

  • PIT: essentially a view.


Origin blog.csdn.net/gongzi_9/article/details/124681107