With tens of millions of records, how does Stack Overflow achieve fast paging?

Reprinted from: Tens of millions of pieces of data, how does Stack Overflow achieve fast paging

Stack Overflow's pagination uses page numbers rather than offsets, and page numbers translate into queries based on LIMIT and OFFSET. Suppose you want to paginate 10 million records: with such queries, jumping to the last page would be very slow. Yet Stack Overflow still manages to page quickly.

So how does Stack Overflow achieve fast paging? Does it cache popular queries and implement pagination in application code? Or is some database black magic involved?

In fact, the whole paging process is very complicated. But instead of writing a post with many pages, I will try to tell you how it works in a simple way.

Assumptions

Paging basically revolves around pageNumber * pageSize. That is, to get the current page out of n sorted records, skip pageNumber * pageSize records and take the next pageSize records. In our case the offset is actually (pageNumber - 1) * pageSize, because page numbers start at 1, so page 1 begins at index 0.
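The skip-and-take arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names page_bounds and get_page are made up for this example and do not come from Stack Overflow's code.

```python
def page_bounds(page_number, page_size):
    """Return (skip, take) for a 1-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be positive")
    # Page 1 starts at index 0, so the offset uses (page_number - 1).
    skip = (page_number - 1) * page_size
    return skip, page_size


def get_page(sorted_items, page_number, page_size):
    """Slice one page out of an already sorted list."""
    skip, take = page_bounds(page_number, page_size)
    # Slicing past the end simply yields a shorter (or empty) page.
    return sorted_items[skip:skip + take]
```

For example, page 3 with a page size of 50 skips the first 100 records and takes the next 50.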

When sorting, we do not need to sort the entire collection. It is enough to sort the first pageNumber * pageSize records: that yields the current page in sorted order, while the rest of the collection may remain only partially sorted. In other words, rather than fully sorting the collection and then returning the top n results, we sort only the top n and return those.
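A minimal sketch of this partial-sort idea, using Python's heapq.nsmallest as a stand-in. The text does not say which partial-sort algorithm Tag Engine actually uses, so the heap-based selection and the name page_by_partial_sort are assumptions made for illustration.

```python
import heapq


def page_by_partial_sort(items, key, page_number, page_size):
    """Return one page without fully sorting `items`."""
    # Only the first page_number * page_size records need to be in order.
    needed = page_number * page_size
    # nsmallest returns the `needed` smallest items, already sorted ascending.
    top = heapq.nsmallest(needed, items, key=key)
    # Drop the earlier pages; what remains is the requested page.
    return top[(page_number - 1) * page_size:]
```

The saving is largest for early pages, where needed is small compared with the size of the collection.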

Also note that the most resource-intensive queries are always those for pages in the middle. Getting the last n pages is as cheap as getting the first n pages: just sort in reverse. For example, getting page 1 when sorting by date in descending order costs the same as getting the last page when sorting by date in ascending order. Many sorting engines (databases, search engines, etc.) use this optimization, and so do we.

For this discussion, treat "question" and "post" as the same thing, since I will use the two terms interchangeably throughout the text.
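The reverse-sort trick might look roughly like this. It is a sketch under assumptions: the function name is hypothetical, heapq stands in for whatever partial sort the real engine uses, and the keys are assumed to be distinct (with ties, the two directions could order equal items differently).

```python
import heapq


def page_from_nearest_end(items, key, page_number, page_size):
    """Return one ascending page, sorting from whichever end is closer."""
    total = len(items)
    skip = (page_number - 1) * page_size
    take = max(0, min(page_size, total - skip))
    if take == 0:
        return []
    from_front = skip + take   # records to order when starting at the front
    from_back = total - skip   # records to order when starting at the back
    if from_front <= from_back:
        top = heapq.nsmallest(from_front, items, key=key)
        return top[skip:]
    # Closer to the end: select the largest items instead (descending order),
    # keep the last `take` of them, and flip them back to ascending order.
    tail = heapq.nlargest(from_back, items, key=key)
    return tail[-take:][::-1]
```

With this approach the amount of ordering work peaks for pages in the middle, which matches the observation that those pages are the expensive ones.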

Step 1: Tag Engine

We have a self-developed .NET application called Tag Engine that contains post IDs and metadata. We think of it as an inverted index to look up post IDs by data such as creation date, tags, scores, etc.

Tag Engine is mainly responsible for doing some set operations based on certain constraints. For example, it performs intersection and union operations on a series of post ID sets in order to get the final result, and it can also sort in memory based on metadata.

We use pageNumber and pageSize and some constraints (such as Site ID, because Tag Engine handles all site queries) to query Tag Engine. It performs in-memory set operations (such as union and intersection), then sorts the results, returning a subset of relevant post IDs.
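To make the idea concrete, here is a toy inverted index in Python. It is not the real Tag Engine (which is a .NET application whose internals are not described here); the class ToyTagIndex, its fields, and its query parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostMeta:
    site_id: int
    score: int
    created: int  # e.g. a Unix timestamp


class ToyTagIndex:
    """Hypothetical in-memory index: tag -> set of post IDs, plus metadata."""

    def __init__(self):
        self.posts_by_tag = {}
        self.meta = {}

    def add(self, post_id, tags, meta):
        self.meta[post_id] = meta
        for tag in tags:
            self.posts_by_tag.setdefault(tag, set()).add(post_id)

    def query(self, site_id, all_of=(), any_of=(), sort_by="created",
              page_number=1, page_size=50):
        # Constraint: only posts that belong to the requested site.
        candidates = {pid for pid, m in self.meta.items() if m.site_id == site_id}
        # Intersection: the post must carry every tag in all_of.
        for tag in all_of:
            candidates &= self.posts_by_tag.get(tag, set())
        # Union: the post must carry at least one tag in any_of.
        if any_of:
            candidates &= set().union(
                *(self.posts_by_tag.get(tag, set()) for tag in any_of))
        # Sort in memory by the chosen metadata field, highest first.
        ordered = sorted(
            candidates,
            key=lambda pid: (getattr(self.meta[pid], sort_by), pid),
            reverse=True)
        skip = (page_number - 1) * page_size
        return ordered[skip:skip + page_size]
```

The query returns only post IDs; the post bodies themselves never live in the index.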

Tag Engine also caches query results (the whole result collections, not just the requested pages). Using a cache key generated by hashing the query (page number, page size, ordering, etc.), it can quickly pick the requested page out of the matching cached result set. This greatly improves query performance.
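A rough sketch of such a cache follows. Exactly which fields go into the real cache key is not spelled out here, so this example makes an assumption: it hashes only the parameters that define the result set and slices pages out of the cached list. The class name ResultSetCache and the query dictionary layout are hypothetical.

```python
import hashlib
import json


class ResultSetCache:
    """Caches whole ordered ID lists and serves individual pages from them."""

    def __init__(self, compute):
        self._compute = compute  # function(query dict) -> ordered list of post IDs
        self._store = {}

    @staticmethod
    def cache_key(query):
        # Canonical JSON, so the same query always hashes to the same key
        # regardless of the order in which its fields were supplied.
        canonical = json.dumps(query, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_page(self, query, page_number, page_size):
        key = self.cache_key(query)
        if key not in self._store:
            # Cache miss: run the expensive set operations and sort once.
            self._store[key] = self._compute(query)
        ids = self._store[key]
        skip = (page_number - 1) * page_size
        return ids[skip:skip + page_size]
```

Because the cached value is the whole ordered collection, a request for page 2 of the same query is served without recomputing anything.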

Step 2: Database

Tag Engine does not contain the actual post data, only IDs and metadata. Therefore, we query the database with the result set of post IDs. The query looks like this:

Select p.*, pm.ViewCount, u.Id, u.ProfileImageUrl, ...
From Posts p
Join PostMetadata pm On p.Id = pm.PostId
Left Join Users u On p.LastActivityUserId = u.Id
Where p.Id In @Ids

Here @Ids is the list of post IDs returned by Tag Engine. This query returns the actual data, but we are not done yet.
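As an illustration of fetching rows by a list of IDs, here is a small Python sketch against an in-memory SQLite database. The one-table schema is a simplified stand-in, not Stack Overflow's, and fetch_posts is a hypothetical helper. With the sqlite3 driver, the ID list has to be expanded into one placeholder per ID.

```python
import sqlite3


def fetch_posts(conn, ids):
    """Fetch the rows for the given post IDs with a parameterized IN clause."""
    if not ids:
        return []
    # One "?" placeholder per ID; the values themselves are bound, not inlined.
    placeholders = ", ".join("?" for _ in ids)
    sql = f"SELECT Id, Title, Score FROM Posts WHERE Id IN ({placeholders})"
    return conn.execute(sql, list(ids)).fetchall()


# Simplified, hypothetical schema used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Title TEXT, Score INTEGER)")
conn.executemany("INSERT INTO Posts VALUES (?, ?, ?)",
                 [(1, "a", 3), (2, "b", 9), (3, "c", 5)])

rows = fetch_posts(conn, [3, 1])
```

Note that a query with IN and no ORDER BY gives no guarantee about row order, so the rows do not necessarily come back in the order of the ID list.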

Step 3: Semi-redundant in-memory sorting

As mentioned above, Tag Engine may return cached data. By its nature, cached data is not guaranteed to be accurate, since it may be a snapshot of a past state. The database, in contrast, always has the latest data.

To fix this, we sort the resulting pages again in memory.

One more detail: this final in-memory sort is basically a call to List.Sort with a comparison function passed in. The comparison function depends on the page the user is viewing: for "Newest" it compares creation dates, for "Votes" it compares scores, and so on.

If we didn't do the last step, the posts might appear out of order on the page because their ordering in Tag Engine reflects the past state, not the current state of the database.
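In Python terms, this final re-sort could be sketched as below. The original is a List.Sort call in .NET; the Post class, the tab names used as dictionary keys, and the resort_page helper are assumptions made for this example.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Post:
    id: int
    score: int
    creation_date: int  # e.g. a Unix timestamp


# One sort key per tab; the tab the user views selects which one applies.
SORT_KEYS = {
    "newest": attrgetter("creation_date"),
    "votes": attrgetter("score"),
}


def resort_page(fresh_rows, tab):
    """Re-sort the rows fetched from the database using their current values."""
    return sorted(fresh_rows, key=SORT_KEYS[tab], reverse=True)
```

Since only a single page of rows is involved, this extra sort is cheap compared with the work done in Tag Engine.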

Finally, we display the list of questions!

Original link: https://meta.stackoverflow.com/questions/322164/how-does-stack-overflow-do-pagination

