How do Internet companies implement paging with MySQL's LIMIT?

When browsing websites, we often run into scenarios that require paginated queries.

It's easy to think that this can be implemented with MySQL.

Suppose our table's SQL looks like this.


You don't need to scrutinize the table-creation details. All you need to know is that id is the primary key and that there is a non-primary-key (secondary) index on user_name; the rest doesn't matter.
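A minimal sketch of what such a table might look like (the extra column and the types here are assumptions for illustration; only the id primary key and the user_name index matter):

CREATE TABLE page (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  user_name VARCHAR(64) NOT NULL DEFAULT '' COMMENT 'has a secondary index',
  extra VARCHAR(255) NOT NULL DEFAULT '' COMMENT 'assumed payload column',
  PRIMARY KEY (id),
  KEY idx_user_name (user_name)
) ENGINE=InnoDB;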

To implement pagination, it's easy to come up with the following SQL statement.

select * from page order by id limit offset, size;

For example, suppose each page shows 10 rows.

(Figure: the user table's original data)

The first page would be the following SQL statement.

select * from page order by id limit 0, 10;

And the hundredth page would be

select * from page order by id limit 990, 10;

So here comes the question.

Both queries fetch 10 rows, so are the first page and the hundredth page equally fast to query? And why?

How the two forms of limit are executed

The two queries above correspond to the two forms limit offset, size and limit size.

In fact, limit size is equivalent to limit 0, size, i.e. take size rows starting from offset 0.

In other words, the only difference between the two forms is whether the offset is 0.
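For example, the following two statements return the same 10 rows:

select * from page order by id limit 10;

select * from page order by id limit 0, 10;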

Let's first look at how a limit query is executed internally.

(Figure: MySQL architecture)

MySQL is internally divided into a server layer and a storage engine layer. In most cases the storage engine is InnoDB.

The server layer contains many modules, among which the executor is the component that talks to the storage engine.

The executor fetches data row by row through the interface provided by the storage engine. When a row fully meets the requirements (for example, it satisfies the other where conditions), it is placed in the result set, which is finally returned to the client that called MySQL (your application written in Go, Java, and so on).

We can first run the following SQL with explain.

explain select * from page order by id limit 0, 10;

As you can see, the key column in the explain output shows PRIMARY, meaning the query uses the primary key index.

(Figure: explain output for the paging query with offset=0)

The primary key index is essentially a B+ tree, a data structure stored inside InnoDB.

Recall that a B+ tree looks like this.

(Figure: B+ tree structure)

In this tree structure, what we care about is the bottom layer of nodes, the leaf nodes. What a leaf node stores depends on whether the index is a primary key index or a non-primary key index.

  • If it is a primary key index, its leaf nodes store the complete row data.

  • If it is a non-primary key index, its leaf nodes store the primary key value. To get the full row data you then have to go to the primary key index and fetch it again; this extra lookup is called going back to the table.

For example, when executing

select * from page where user_name = "小白10";

the row whose user_name is "小白10" is first found through the non-primary key index, and the leaf node gives us the primary key of that row, which is 10.

We then go back to the primary key index with that value and finally locate the row whose primary key is 10.
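Conceptually, that single query behaves like the two lookups below (this is only an illustration of the back-to-table step, not what MySQL literally executes):

-- step 1: the user_name secondary index yields the primary key (10)
select id from page where user_name = "小白10";

-- step 2: go back to the table via the primary key index to get the full row
select * from page where id = 10;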

(Figure: going back to the table)

Whether it is a primary key index or a non-primary key index, the data in the leaf nodes is ordered. In the primary key index, for example, the data is sorted by the primary key id from small to large.

Limit execution process based on primary key index

So back to the question at the beginning of the article.

When we drop the explain and execute this SQL directly:

select * from page order by id limit 0, 10;

The select here is followed by an asterisk *, which means all the columns of each row are requested.

The server layer calls the InnoDB interface, fetches the first 10 complete rows from the primary key index in InnoDB, returns them to the server layer one by one, puts them in the server layer's result set, and returns that to the client.

But when we make the offset outrageous, for example executing

select * from page order by id limit 6000000, 10;

the server layer calls the InnoDB interface. Because offset=6000000 this time, the complete rows from 0 to (6000000 + 10) are fetched from the primary key index in InnoDB and returned to the server layer one by one, where they are discarded according to the offset value; only the last size rows, i.e. 10 rows, end up in the server layer's result set and are returned to the client.

As you can see, when the offset is not 0, the server layer fetches a lot of useless data from the engine layer, and fetching that useless data costs time.

So now we know the answer to the question at the start: in MySQL, limit 1000,10 is slower than limit 10. The reason is that limit 1000,10 fetches 1000+10 rows and throws away the first 1000, which takes extra time.

Is there any way to optimize this case?

As noted, when the offset is not 0 the server layer fetches a lot of useless data from the engine layer, and because the select is followed by *, every complete row has to be copied. Copying complete rows and copying just one or two columns take very different amounts of time, which makes an already slow operation even slower.

Since the rows before the offset are thrown away in the end, copying all their columns is pointless, so we can rewrite the SQL as follows.

select * from page  where id >=(select id from page  order by id limit 6000000, 1) order by id limit 10;

In the SQL above, the subquery select id from page order by id limit 6000000, 1 is executed first. This still reads 6000000+1 entries from the primary key index inside InnoDB, and the server layer then discards the first 6,000,000 of them, keeping only the id of the last one.

The difference is that, on the way back to the server layer, only the id column of each row is copied rather than all of its columns. When the data volume is large, the time this saves is quite noticeable.

With that id in hand, and assuming it happens to be exactly 6000000, the SQL becomes

select * from page  where id >=(6000000) order by id limit 10;

This way, InnoDB walks the primary key index once more, quickly locates the row with id=6000000 through the B+ tree in lg(n) time, and then scans 10 rows forward from there.

This does improve performance. In my own tests it was roughly twice as fast, the kind of optimization that takes a query from about 3s down to 1.5s.

Still...

Honestly, it's a drop in the bucket. It only scratches at the problem instead of curing it, a better-than-nothing optimization made out of helplessness.

Limit execution process based on non-primary key index

That covers the execution process on the primary key index. Now let's look at the limit execution process based on a non-primary key index.

For example, the following SQL statement:

select * from page order by user_name  limit 0, 10;

The server layer calls the InnoDB interface. After obtaining, from the non-primary key index in InnoDB, the primary key id corresponding to the first row, it goes back to the primary key index to find the corresponding complete row, returns it to the server layer, and the server layer puts it in the result set and returns it to the client.

When offset > 0 and the offset value is small, the logic is similar; the difference is that the first offset rows are fetched and then discarded.

In other words, a limit on a non-primary key index costs more back-to-table lookups than a limit on the primary key index.

But when the offset becomes very large, say 6 million, and we run explain at this point, things change.
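The explain statement would look something like this (reconstructed from the example query above):

explain select * from page order by user_name limit 6000000, 10;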

(Figure: explain shows a full table scan when the offset on a non-primary key index is too large)

You can see that the type column shows ALL, i.e. a full table scan.

This is because, before the executor runs the SQL statement, the optimizer in the server layer decides which execution plan is cheaper.

Obviously, after seeing that the non-primary key index plan would need 6 million back-to-table lookups, the optimizer shook its head and walked away from it.

So when the limit offset is too large, a query on a non-primary key index can easily degenerate into a full table scan, a true performance killer.

This situation can also be optimized to some extent, for example:

select * from page t1, (select id from page order by user_name limit 6000000, 100) t2  WHERE t1.id = t2.id;

With select id from page order by user_name limit 6000000, 100, we first go through the user_name non-primary key index at the InnoDB layer and take only the primary key id. Because only the id is needed, there is no back-to-table lookup, so this part is a bit faster. Back at the server layer, the first 6 million entries are again discarded and only the last 100 ids are kept. These 100 ids are then matched against the id of table t1, this time via the primary key index, and the matching 100 rows are returned. This bypasses the 6 million back-to-table lookups of the previous plan.

Of course, as in the earlier case, this still doesn't solve the problem of fetching 6 million entries for nothing and then throwing them away, so it is also a rather resigned kind of optimization.

And when the offset becomes very large, say on the order of millions or tens of millions, the problem suddenly turns serious.

There is a special term for this: deep paging.

Deep paging problem

Deep paging is a nasty problem, and the nasty part is that there is actually no real solution to it.

Whether you use MySQL or es, you can only "mitigate" the severity of the problem through various means.

When we run into this problem, we should step back and think about it.

Why does our code have a deep paging problem in the first place?

What is the original requirement behind it? Based on that, we can often sidestep the problem.

If you want to fetch all the data in a table

Some requirements are like this: we have a database table and we want to pull all of its data out and sync it to a heterogeneous store such as es or hive. If at that point we directly execute

select * from page;

then the moment this SQL runs, even the dog shakes its head at it.

Because the data volume is large, MySQL can't hand over all the data in one go, and the query can easily time out and fail with an error.

So many MySQL beginners fetch the data in batches with limit offset, size pagination. That works fine at first, but slowly, one day, the table grows extremely large and the deep paging problem described above shows up.

Fortunately, this scenario is the easiest one to solve.

We can sort all the data by the id primary key and fetch it in batches, using the largest id of the current batch as the filter condition for the next batch.

You can see the idea in the pseudo code below.

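A minimal sketch of the idea in SQL (the batch size of 100 and the starting id are assumptions for illustration):

-- first batch: start from id 0
select * from page where id > 0 order by id limit 100;

-- suppose the largest id in that batch is 100;
-- the next batch starts right after it
select * from page where id > 100 order by id limit 100;

-- repeat, always using the largest id of the previous batch as the new
-- starting point, until a batch returns fewer than 100 rows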

With this approach, each batch uses the primary key index to locate the starting id and then scans the next 100 rows from there, so no matter how far into the data you are, query performance stays stable.

(Figure: fetching the user table in batches)

If pagination is displayed to users

If the original requirement behind the deep pagination is just that the product manager wants a listing page, such as a product listing page, then we really ought to have it out with the product manager.

What kind of use case needs to page past the 100,000th row? That is clearly an unreasonable requirement.

Can the requirement be changed into something closer to how users actually behave?

Take, for example, the page-turning feature we see when searching with Google.

In general, Google search results stay within about 20 pages, and as a user I rarely even turn to page 10.

We can use that as a reference.

If we need search or filtered listing pages, don't use MySQL; use es instead, and also cap the number of displayed results, say at 10,000, so the paging never gets too deep.

If for various reasons MySQL must be used, then likewise cap the number of returned results, for example to within 1k.

That way it can just barely support all kinds of page turns and page jumps (such as suddenly jumping to page 6 and then to page 106).

But it would be even better if the product could be designed so that page jumps aren't supported at all, for example only supporting the previous page and the next page.

(Figure: previous page / next page style of pagination)

Then we can use the start_id approach mentioned above to fetch in batches, with each batch of data using start_id as its starting position. The biggest advantage of this solution is that no matter how many pages are turned, the query speed stays stable.
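As a rough sketch, assuming 10 rows per page and that the current page happens to hold the rows with ids 991 through 1000:

-- next page: the 10 rows right after the last id on the current page
select * from page where id > 1000 order by id limit 10;

-- previous page: the 10 rows right before the first id on the current page,
-- fetched in reverse order and then re-sorted by the application
select * from page where id < 991 order by id desc limit 10;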

Does that sound limiting?

Not at all. Just dress the feature up a bit.

It becomes something like Douyin, where you can only swipe up or down; the more professional name for it is a waterfall flow (infinite scroll).

Doesn't feel so limiting now, does it?

Summary

  • limit offset, size is slower than limit size, and the larger the offset value, the slower the SQL executes.

  • When the offset is too large, it causes the deep paging problem. At present neither MySQL nor es has a good way to solve it; it can only be worked around by limiting the number of queryable results or by fetching the data in batches.

  • When you run into a deep paging problem, think about the original requirement behind it. Most of the time, a deep paging scenario shouldn't exist in the first place; when necessary, push back on the product manager.

  • If the data volume is very small, say on the order of 1k rows, and is unlikely to grow dramatically over the long term, just using limit offset, size works perfectly well.


Finally

If you have better ideas about deep paging, you're welcome to share them in the comments.

Enough said; let's go choke on water together in the ocean of knowledge.


- END -
