How Elasticsearch performs deep paging


Original address: https://www.cnblogs.com/wangzhen3798/p/10070977.html

Business background

In traditional business systems, a common way of displaying information is the "paging list". As the data volume grows, the "deep paging" problem appears: for example, a user keeps turning pages until page 50,000, or an export job appends list data to Excel page by page until everything is written out. The common symptom of deep paging is that as the page number grows, ES or a relational database responds more and more slowly, and may even run out of memory (OOM). What is the underlying principle, and how should deep paging be done in ES?

Technical Principle

  • The essence of paging. Paging takes a slice out of a large data set. For example, with 10,000 records and 10 per page, page 2 is records 11 to 20. How does ES or the database know which records form page 2 and which form page 3? It doesn't, so correct paging must specify an ordering: there must be an ORDER BY or sort clause.

  • Single-machine database system paging. A single-machine database can page with a "sort forward, then sort backward" scheme: first sort the data set ascending by the order field and keep the first offset+limit rows, then reverse-sort that prefix and take the first limit rows (see the sketch after this list).

  • Distributed database system paging

Compared with a single-machine system, each node in a distributed database first retrieves its own offset+limit rows and sends them to the coordinating node. The coordinator re-sorts the (offset+limit)*N rows (N = number of nodes), picks the final limit rows, and returns them to the application. In deep paging, offset+limit is therefore very large and the amount of data to sort explodes; for a database that pages in memory, this can easily exceed the process's memory limit and cause OOM.
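As a rough sketch of both mechanisms (plain Python with made-up helper names; not from the original article):

import heapq

# Single-machine trick: "sort forward, then sort backward".
def paginate(rows, offset, limit):
    head = sorted(rows)[:offset + limit]        # ascending prefix of offset+limit rows
    tail = sorted(head, reverse=True)[:limit]   # the last `limit` rows of that prefix
    return list(reversed(tail))                 # restore ascending order

# Distributed case: each node has already shipped its own sorted
# offset+limit candidates; the coordinator re-sorts all (offset+limit)*N
# rows in memory, which is the part that blows up when the offset is deep.
def coordinate(node_results, offset, limit):
    merged = heapq.merge(*node_results)         # node_results are sorted lists
    return list(merged)[offset:offset + limit]

print(paginate(range(100), offset=10, limit=10))  # page 2 -> rows 10..19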

Pagination

There are three ways to implement paging in ES: from+size, scroll, and search_after.

Method 1: from+size

The standard ES paging method is from+size: from corresponds to PostgreSQL's OFFSET, and size to LIMIT. With 10 documents per page, the query for page 11 looks like this:

POST rzfx-sqlinfo/sqlinfo/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "architect.keyword": {
              "value": "郭锋"
            }
          }
        },
        {
          "range": {
            "NRunTime": {
              "lte": 100
            }
          }
        }
      ]
    }
  },
  "size": 10,
  "from": 100
}
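For reference, the same page-11 request through the official Python client (a sketch in elasticsearch-py 7.x style; the cluster address is an assumption), where from is computed as (page - 1) * size:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster address

page, size = 11, 10
resp = es.search(
    index="rzfx-sqlinfo",
    body={
        "query": {"bool": {"must": [
            {"term": {"architect.keyword": {"value": "郭锋"}}},
            {"range": {"NRunTime": {"lte": 100}}},
        ]}},
        "size": size,
        "from": (page - 1) * size,  # 10 per page, page 11 -> from = 100
    },
)
hits = resp["hits"]["hits"]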

To keep paging from consuming too much heap memory and causing OOM, ES uses the index.max_result_window parameter to cap from + size at 10,000 by default. With 10 documents per page, you can therefore turn to page 1,000 at most. All index settings can be inspected with the following statement:
GET /rzfx-sqlinfo/_settings?flat_settings=true&include_defaults=true
For documents with a relatively simple structure and small size, index.max_result_window can be raised to allow somewhat deeper paging (at the cost of the heap headroom the limit protects). Adjust it as follows:

PUT rzfx-sqlinfo/_settings
{
  "index.max_result_window": 100000
}

Method 2: scroll

The scroll API provides deep paging over the whole result set. The first request returns a scroll_id, which is then used to fetch the following batches of data in sequence.

Case study

The initial search request specifies the scroll parameter in the query string. It tells Elasticsearch how long to keep the "search context" alive, for example ?scroll=5m.

GET /db10/_search?scroll=5m
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_doc": {
        "order": "desc"
      }
    }
  ],
  "size": 2
}
Result

The result returned by the above request contains a _scroll_id; we pass this value to the scroll API to retrieve the next batch of results.

"_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAzVUWazVqSWpwSm1UTUc5U1Y4OGN5SWN6QQ==",
  "took" : 0,
Execute the next-page query:
GET _search/scroll
{
  "scroll": "5m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAy_kWazVqSWpwSm1UTUc5U1Y4OGN5SWN6QQ=="
}
Deleting the scroll

The search context is deleted automatically when the scroll timeout expires. Keeping scrolls open has a cost, however, so when a scroll is no longer needed it should be cleared explicitly with the clear-scroll API:

DELETE _search/scroll
{
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAABZOkWazVqSWpwSm1UTUc5U1Y4OGN5SWN6QQ=="
}
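Tying the three steps together: a typical scroll loop, sketched with the elasticsearch-py client against the db10 index from the example (the cluster address, batch size, and per-document handling are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster address

# 1. Open the scroll: first batch plus a 5-minute search context.
resp = es.search(
    index="db10",
    scroll="5m",
    body={"query": {"match_all": {}}, "sort": ["_doc"], "size": 1000},
)
scroll_id = resp["_scroll_id"]

# 2. Keep pulling batches until one comes back empty.
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])  # stand-in for real per-document processing
    resp = es.scroll(scroll_id=scroll_id, scroll="5m")
    scroll_id = resp["_scroll_id"]

# 3. Clear the scroll explicitly instead of waiting for the timeout.
es.clear_scroll(scroll_id=scroll_id)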

Performance

Case DSL

scroll=5m sets the scroll context lifetime:

GET /filebeat-7.4.0-2019.10.17-000001/_search?scroll=5m
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_doc": {
        "order": "desc"
      }
    }
  ],
  "size": 10
}
Result

The response contains the _scroll_id. Observe took: the first batch costs 15,349 ms.

"_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAFnssWeUdmZzFOcHBUeFdzVTVwMTVPVTZNZw==",
  "took" : 15349,
  "timed_out" : false,
  "_shards" : {
    
    
    "total" : 1,
Pass the _scroll_id from the previous step:
GET /_search/scroll
{
  "scroll": "5m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAFm2IWeUdmZzFOcHBUeFdzVTVwMTVPVTZNZw=="
}
Result

Fetching subsequent batches through _scroll_id costs about the same each time. Observe took: this batch costs 2,742 ms.

"_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAFm2IWeUdmZzFOcHBUeFdzVTVwMTVPVTZNZw==",
  "took" : 2742,
  "timed_out" : false,
  "_shards" : {
    
    
    "total" : 1,

Method 3: search_after

search_after is a paging mode provided since version 5.0. The first search must specify a sort whose values are guaranteed to be unique; the sort value of the last record of each result page is then used as the starting condition of the next query.

Case study

GET /db10/_search?pretty=true
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}

Result (excerpt)

},
"sort" : [
  22
]

Retrieve the next page

"Search_after": Retrieve the next page based on the sort result value of the previous page to achieve dynamic paging

GET /db10/_search?pretty=true
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  "search_after": [ 22 ],
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}
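The same pattern as a loop, sketched with elasticsearch-py against the db10 index (batch size and handling are assumptions; real code usually adds a tie-breaker such as _id to the sort to guarantee uniqueness):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster address

body = {
    "size": 100,
    "query": {"match_all": {}},
    "sort": [{"age": {"order": "desc"}}],  # sort values must be unique
}
last_sort = None
while True:
    if last_sort is not None:
        body["search_after"] = last_sort   # resume after the previous page's last hit
    hits = es.search(index="db10", body=body)["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"])  # stand-in for real per-document processing
    last_sort = hits[-1]["sort"]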

Performance

Case DSL
GET /filebeat-7.4.0-2019.10.17-000001/_search?pretty
{
  "size": 20,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}
Result

Observe "took": 4825

"took" : 4825,
  "timed_out" : false,
  "_shards" : {
    
    
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
        "sort" : [
          1571522951167
        ]
    }
Fetch the next page

"Search_after": [1571522951167] Take the last sort value of the previous page

GET /filebeat-7.4.0-2019.10.17-000001/_search?pretty
{
  "size": 20,
  "query": {
    "match_all": {}
  },
  "search_after": [ 1571522951167 ],
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}
Result

Observe "took": 4318,

{
  "took" : 4318,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,

Summary

  • from+size

    • Paging with from+size is limited by the max_result_window setting, which defaults to 10,000 documents; modifying this parameter is not recommended

    • This default paging method suits small data volumes and should be avoided for large ones

    • Performance tests show that as paging goes deeper, execution time and heap usage keep growing; under concurrency, deep from+size queries can easily trigger OutOfMemory problems in the cluster

  • scroll

    • Scroll (cursor) paging suits large data volumes. It can only move forward incrementally, with no backward paging or page skipping, so it fits scenarios such as incremental extraction, data migration, and index rebuilding.

    • The performance case shows that successive scroll batches cost roughly the same; there is none of the growth with paging depth seen with from+size, and no OOM problem.

    • Scroll queries a historical snapshot: changes to documents (index updates or deletions) only affect later search requests, so scroll is not suitable for real-time query scenarios.

  • search_after

    • This paging method avoids the memory cost that scroll pays for keeping its search context open

    • search_after can pull large amounts of data in parallel

    • search_after locates each page by a unique sort value, keeps the amount of data processed per request within a bounded range, and avoids the overhead of deep offsets, making it well suited to deep paging scenarios
