How to use deep paging in Elasticsearch

The previous article mentioned that with the default from+size paging method of es, from + size cannot exceed 10,000 documents, and that the deeper the page, the worse the performance.

This is because es has to sort the whole result set to compute the relevance ranking. Suppose we have an index with 5 shards and we want to read the 10 documents ranked 1001 to 1010 (from=1000, size=10). Internally, each shard collects its own top 1010 documents and returns them to the coordinating node. You may ask why 1010 documents and not 10. The reason is that the documents ranked 1001 to 1010 on one shard may be less relevant than documents ranked in the top 10 on another shard, so every shard must return all of its candidates. The coordinating node then re-sorts the 5 × 1010 = 5050 candidates globally and finally picks out the 10 that were requested.

This sorting is expensive, and the amount of data to sort grows in proportion to (from + size) × number of shards, so the deeper the page, the worse the performance. Sorting a large amount of data also takes up JVM heap memory and can easily lead to an OOM, which is why es by default does not allow reading past 10,000 documents (the index.max_result_window setting).
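The cost of from+size paging can be illustrated with a small in-memory simulation. This is only a sketch with toy data: the random scores, and the names shard_top and coordinate, are assumptions made for the illustration and are not Elasticsearch APIs.

```python
import heapq
import random


def shard_top(scores, n):
    """Each shard returns its own top-n scores, highest first."""
    return heapq.nlargest(n, scores)


def coordinate(shards, from_, size):
    """Merge the per-shard candidates and cut out the requested page.

    Returns (page, number_of_candidates_merged).
    """
    n = from_ + size
    candidates = [s for shard in shards for s in shard_top(shard, n)]
    candidates.sort(reverse=True)
    return candidates[from_:from_ + size], len(candidates)


random.seed(0)
# 5 shards with 2000 documents each, random relevance scores (toy data).
shards = [[random.random() for _ in range(2000)] for _ in range(5)]

page, merged = coordinate(shards, from_=1000, size=10)
print(len(page), merged)  # 10 documents returned, 5050 candidates merged
```

Only 10 documents are returned, but 5050 candidates had to be collected and sorted to find them.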

So the question is: what should we do when we really need deep paging? There are two ways to read deeply paged data in es:

(1) The scroll method, for reading deeply paged data offline

(2) The searchAfter method (the search_after request parameter, available since 5.x), which can be used in real-time and high-concurrency scenarios

The scroll method was covered in the previous article. After the initial query request it keeps a search context over a snapshot of the index, and the data is then read batch by batch, which is efficient. Since 5.x, the export can also be parallelized with sliced scroll.
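As a rough sketch of sliced scroll, the snippet below only builds the initial request bodies, one per slice, without sending them. The helper name sliced_scroll_bodies and the default size are assumptions for illustration; each body would be sent to the index's _search endpoint with a scroll keep-alive such as scroll=1m.

```python
import json


def sliced_scroll_bodies(query, max_slices, size=1000):
    """Build one initial scroll request body per slice.

    Each slice can then be consumed by an independent worker.
    """
    if max_slices < 2:
        # Slicing only makes sense with at least two slices.
        raise ValueError("max_slices must be at least 2")
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "size": size,
            "query": query,
        }
        for slice_id in range(max_slices)
    ]


bodies = sliced_scroll_bodies({"match": {"title": "elasticsearch"}}, max_slices=2)
for body in bodies:
    print(json.dumps(body))
```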

Its disadvantage is that keeping a search context alive consumes a lot of resources, and changes made after the snapshot is created, such as delete and update operations, are not visible, so it is not suitable for real-time and high-concurrency scenarios.

The searchAfter method avoids these shortcomings of scroll. Instead of keeping a context on the server, it uses the sort values of the last document of the previous page as a live cursor, so it can be used for real-time requests and high-concurrency scenarios.

Its disadvantage is that it cannot jump to an arbitrary page; it can only move forward page by page, and the sort must include at least one field whose value is unique for every document.

Another difference from scroll is that the order in which searchAfter reads data is affected by index updates and deletions, whereas scroll is not, because scroll reads from an immutable snapshot.
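This difference can be shown with a toy in-memory model. It is only a sketch: the documents are plain integers sorted ascending, and search_after_page is a hypothetical helper, not part of any client library.

```python
def search_after_page(docs, after, size):
    """Return the next `size` values greater than `after` from the live data."""
    return sorted(d for d in docs if after is None or d > after)[:size]


live = [1, 3, 5, 7]
snapshot = sorted(live)  # scroll: a frozen view taken at the first request

first = search_after_page(live, None, 2)

# Between the two requests one document is indexed and one is deleted.
live.append(4)
live.remove(7)

second = search_after_page(live, first[-1], 2)  # reads the live data
scroll_second = snapshot[2:4]                   # reads the frozen snapshot

print(first)          # [1, 3]
print(second)         # [4, 5]
print(scroll_second)  # [5, 7]
```

The cursor-based second page picks up the new document 4 and no longer sees the deleted document 7, while the snapshot-based page is unchanged.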

Let's see how to use searchAfter:

We first query a page of data:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}

Note that the query above sorts on two fields. The first is a business field whose values may not be unique, but the second, the id field, must be unique for every document. This is the only way to guarantee a stable page-turning order for searchAfter.

In addition, when searchAfter is used, the from parameter must be set to 0 (or left out); otherwise the request fails.

After the first request returns, we take the sort values (here the date and id) of the last document in the response and pass them as search_after in the next request, and so on until all the data has been processed.

The query body of the second request is as follows:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}
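The page-turning loop can be sketched as follows. This is a minimal sketch under stated assumptions: iterate_pages is a hypothetical helper, and fake_search is an in-memory stand-in for a real client call that only mimics the shape of a search response (hits.hits, each hit carrying its sort values).

```python
import copy


def iterate_pages(search, body):
    """Yield every hit, following the cursor page by page.

    `search` is any callable that takes a request body (dict) and returns
    a dict shaped like an Elasticsearch search response.
    """
    body = copy.deepcopy(body)  # do not modify the caller's request
    body.pop("from", None)      # from must be 0 or unset with search_after
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit become the cursor for the next page.
        body["search_after"] = hits[-1]["sort"]


def fake_search(body):
    """In-memory stand-in for a real search call (toy data, ascending ids)."""
    docs = [{"_id": str(i), "sort": [i]} for i in range(1, 26)]
    after = body.get("search_after")
    if after is not None:
        docs = [d for d in docs if d["sort"] > after]
    return {"hits": {"hits": docs[: body["size"]]}}


request = {"size": 10, "sort": [{"id": "asc"}]}
all_hits = list(iterate_pages(fake_search, request))
print(len(all_hits))  # 25
```

The loop stops when a page comes back empty, so it issues one extra request after the last non-empty page.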

Summary:

This article introduced how to do deep paging in es and compared the advantages, disadvantages and differences of scroll and searchAfter. With this knowledge, we can choose the right method for each scenario.
