The feasibility of Elasticsearch full paging

Preface

Deep paging in distributed systems

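To understand why deep paging is problematic, imagine searching an index with 5 primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the coordinating node, which then sorts all 50 results to select the overall top 10.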

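Now suppose we request page 1,001, that is, results 10,001 to 10,010. Everything works the same way, except that each shard has to produce its top 10,010 results. The coordinating node then sorts all 50,050 results and discards 50,040 of them.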
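The arithmetic above can be checked with a small in-memory simulation. This is only a sketch of the scatter-gather idea, not Elasticsearch's actual implementation: the shards are plain Python lists of random scores, and the function name `scatter_gather_page` is made up for illustration.

```python
import heapq
import random


def scatter_gather_page(shards, from_, size):
    """Simulate one from+size search over several shards.

    Each shard returns its own top (from_ + size) scores; the coordinator
    merges them, keeps the requested window, and discards the rest.
    """
    k = from_ + size
    # Per-shard phase: every shard sorts locally and returns its top k.
    per_shard = [sorted(shard, reverse=True)[:k] for shard in shards]
    collected = sum(len(top) for top in per_shard)
    # Coordinator phase: merge the already-sorted lists, then slice the page.
    merged = list(heapq.merge(*per_shard, reverse=True))
    page = merged[from_:from_ + size]
    return page, collected, collected - len(page)


random.seed(0)
# 5 hypothetical shards with 12,000 scored documents each.
shards = [[random.random() for _ in range(12_000)] for _ in range(5)]

first_page, collected, discarded = scatter_gather_page(shards, 0, 10)
print(collected, discarded)   # 50 40

deep_page, collected, discarded = scatter_gather_page(shards, 10_000, 10)
print(collected, discarded)   # 50050 50040
```

Both calls return only 10 hits, but the second one makes the coordinator handle about a thousand times more entries to do so.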

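As you can see, in a distributed system the cost of sorting results grows with paging depth: every shard must produce from + size results, and the coordinating node must sort (number of shards) × (from + size) of them. This is why web search engines do not return more than 1,000 results for any query.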

Pagination

  • from+size
    The most common paging method. from is the offset of the first hit to return, and size is the number of hits to return. from defaults to 0 and size defaults to 10; from + size cannot exceed the index.max_result_window setting, which defaults to 10,000.
  • scroll
    Creates a cursor, identified by a scroll_id, for a particular query. Subsequent requests only need to fetch data from the cursor; when the hits field of the returned result set is empty, the traversal is finished. Creating a scroll_id can be thought of as taking a temporary snapshot of the index: later additions, deletions, and updates do not affect the results read from that snapshot.
  • search_after
    Provides a live cursor, which avoids the storage and time cost of keeping a snapshot. The sort values of the last hit on the previous page are used as the starting condition for the next page. The sort must include a globally unique field, such as a unique business identifier or Elasticsearch's own _id field.
    It suits deep paging with sorting. Because each page depends on the last hit of the previous page, jumping to an arbitrary page is not possible. The data returned is always current, so the position of a document may change while paging is in progress.
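As a rough illustration of the three methods, the sketch below builds the request bodies as Python dicts. No Elasticsearch client is used, so nothing is sent. The index name `orders` and the fields `created_at` and `order_id` are assumptions invented for this example.

```python
import json

INDEX = "orders"  # hypothetical index name


def from_size_body(page, size=10, max_result_window=10_000):
    """Body for POST /orders/_search using from+size (page is 1-based)."""
    offset = (page - 1) * size
    if offset + size > max_result_window:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"from": offset, "size": size, "query": {"match_all": {}}}


def scroll_start(size=1000, keep_alive="1m"):
    """First scroll request as (path, body); the response carries a _scroll_id."""
    path = f"/{INDEX}/_search?scroll={keep_alive}"
    return path, {"size": size, "query": {"match_all": {}}}


def scroll_next(scroll_id, keep_alive="1m"):
    """Follow-up scroll request; repeat until the returned hits are empty."""
    return "/_search/scroll", {"scroll": keep_alive, "scroll_id": scroll_id}


def search_after_body(last_sort_values=None, size=10):
    """Body for search_after; order_id is assumed unique and breaks ties."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"created_at": "asc"}, {"order_id": "asc"}],
    }
    if last_sort_values is not None:
        # Sort values copied from the "sort" array of the previous page's last hit.
        body["search_after"] = list(last_sort_values)
    return body


print(json.dumps(from_size_body(page=3)))
print(scroll_start())
print(json.dumps(search_after_body([1600000000000, "A-42"])))
```

Note that only `from_size_body` accepts a page number. The other two can only continue from wherever the previous response ended.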

Pros and cons analysis

| Pagination method | Page limit? | Page jumping? | Performance | Resource consumption | Real-time? | Common scenarios |
| --- | --- | --- | --- | --- | --- | --- |
| from+size | Yes, limited by index.max_result_window (adjustable) | Yes | The deeper the page, the lower the performance | The deeper the page, the higher the resource consumption | Real-time | Meets common search scenarios; the number of pages needs to be limited |
| scroll | No | No | High | The first query is expensive because the query results are cached; subsequent queries cost almost nothing | Not real-time | Background batch export tasks, data migration tasks |
| search_after | No | No | Medium | Moderate; each page is retrieved starting from the last hit of the previous query | Real-time | Batch tasks with real-time requirements |

The final proposal

First, consider: what business requirement actually calls for searching and viewing more than 10,000 documents in Elasticsearch? Apply as many filter conditions as possible so that only the data that is really needed is returned.

  • The business side limits the number of pages itself. Use as many query conditions as possible to keep the result set under 10,000 documents, and page through it flexibly with from+size.
  • Trade flexibility for performance: the business side restricts the paging style so that only the next page can be requested and pages cannot be skipped. In that case search_after can be used.
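The second option, next-page-only navigation, can be sketched without a cluster. The example below imitates search_after semantics over an in-memory list. The `(created_at, order_id)` tuples and the function names are hypothetical, and a real index would not re-sort the data on every call.

```python
from typing import List, Optional, Tuple

# (created_at, order_id): the sort key; order_id is assumed to be unique.
Doc = Tuple[int, str]


def next_page(docs: List[Doc], size: int, after: Optional[Doc] = None):
    """Return (hits, cursor) for the page that follows `after`."""
    ordered = sorted(docs)
    if after is not None:
        # Keep only documents that sort strictly after the previous last hit.
        ordered = [d for d in ordered if d > after]
    hits = ordered[:size]
    cursor = hits[-1] if hits else None
    return hits, cursor


def walk_all(docs: List[Doc], size: int) -> List[List[Doc]]:
    """Page forward until an empty page signals the end."""
    pages, cursor = [], None
    while True:
        hits, new_cursor = next_page(docs, size, cursor)
        if not hits:
            return pages
        pages.append(hits)
        cursor = new_cursor


docs = [(3, "c"), (1, "a"), (2, "b"), (2, "a"), (5, "e")]
print(walk_all(docs, size=2))
# [[(1, 'a'), (2, 'a')], [(2, 'b'), (3, 'c')], [(5, 'e')]]
```

Two documents share created_at = 2, and the unique order_id decides which one ends the first page. Without such a tiebreaker, a document could be skipped or repeated at a page boundary.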

Origin blog.csdn.net/yml_try/article/details/108648211