Paging optimization for a table with tens of millions of rows — the result is out before the CPU has even reacted

Preface

Continuing from the previous article, "Shocked! You, on a 5.5k salary, can also optimize a ten-million-row MySQL table".
This article corrects and confirms a few points from that piece, and then explains the paging-mechanism optimization itself.

■ Task

Optimize and rework slow SQL, rebuild the paging mechanism for large data tables, and improve query speed over big data.

Clearing up some doubts first

Everything below refers to InnoDB B+tree indexes.

■ 1. When does the query optimizer choose not to use the range index?

My previous claim:

When not using the index is faster, the query optimizer chooses not to use it; a time-range condition is a typical example. It is often said that the index is skipped when selectivity is below 17%, but I never verified that. I have seen the optimizer skip the index when the time range was large, yet forcing the index still made the query much faster, so my original statement was not very accurate.

Correction and answer:
The MySQL query optimizer estimates what it believes is the optimal execution plan based on its query cost model. This part is complicated and full of tricky heuristics, but one thing is certain: the optimizer picks the cheapest plan among the plans it cost-estimates (it may not compare every possible plan, so the result is not always truly optimal), not the fastest one, and it does not simply refuse a range index because the range is large.
Look at the following example:
[Figure: EXPLAIN output for the same SQL under different time ranges, showing the t_d_c index with different key_len values]
The t_d_c index in the figure above is not what it appears to be; I call this "index cross-contamination", which is discussed later. You can tell the rows apart by key_len: the row showing 204 is the real t_d_c index, whose last column is the time-range column. Without adding any index hints to the SQL, simply adjusting the time range makes the range index flip between not effective -> effective -> not effective -> effective.
Therefore, many widely circulated claims are wrong.

"The index is skipped when selectivity is below 17%" — wrong. This is a conclusion someone drew from one specific SQL statement. The post gets copied endlessly by crawlers, seen by more people, and spreads further, and the error compounds.

"If the range grows beyond a certain size, the range index is not used" — also wrong, though this version is more cautious since it names no specific value.
I once watched a live mock interview on Douyin for a senior Java position. Question: will a range index take effect? The candidate answered instantly: it may not; if the range exceeds a certain proportion of the table — perhaps 40% — the index is skipped. The interviewer was quite satisfied. Uh... I can now tell you that this claim is wrong. The size of the range is certainly one factor in the optimizer's choice, but it is by no means the only one.

So: the MySQL query optimizer estimates what it considers the optimal plan from its cost model. Whether a range index gets used is a complicated decision; set things up properly and the index will be used even over a 100-year time range.

■ 2. The key shown by EXPLAIN may not be the real index

I mentioned this situation in the previous article: the key that EXPLAIN reports may not be the index actually used — so what is it? I used to describe it as multiple indexes muddling each other. Multiple indexes muddling each other... wait, doesn't that sound like index merge?

Index merge:
1. Index merge combines the results of range scans on several indexes into one.
2. During an index merge, the scanned results are combined by union, intersection, or intersection-then-union.
3. The merged indexes must all belong to one table; indexes on multiple tables cannot be merged.
If index merge is used, the type column of the EXPLAIN output shows index_merge, and the key column lists all the indexes used.

Uh... I grabbed a random introduction, and it reads very official. Put bluntly: when indexes are built unreasonably, MySQL may combine several single-column indexes. That is index merge, and the execution plan shows it as index_merge.

But index merge applies to single-column indexes. You would never guess that MySQL can also play games with composite indexes, and the execution plan will not tell you what it is doing. That is maddening. In the figure in the previous point you can see this index cross-contamination: the key column shows the same index every time.

You can identify it through the key_len column: the lengths differ. Then, by appending ignore index(index_a, index_b, ...) after the table name in the SQL, you can find out exactly which index has been holding you back. Once found, use force index(index_name) to force the real index. You may discover that when cross-contamination happens, the index actually used is not one fixed index each time, which is why I call it cross-contamination rather than "the index becomes some other index".

Solutions:
There is a passage in "High Performance MySQL, Third Edition" that I do not agree with:
[Figure: excerpt from the book]
If the optimizer goes the wrong way and our query hangs as a result, that is simply unacceptable and must be fixed. You cannot shy away from optimizing out of fear of unknown trouble.

① Delete useless indexes.

② Or use ignore index, force index, or use index in the SQL statement, and select which indexes are candidates through MyBatis tags. This approach is quite flexible, does not affect existing indexes, and targets only specific SQL under specific conditions.
(The difference between force and use: force index means that as long as an index field matches the query field, the index will definitely be used; use index is only a suggestion, and if the query optimizer thinks a full table scan is faster, it will still scan the table.)

■ 3. Emphasizing the leftmost-prefix principle once more

This was an interesting incident. A blogger, certified as a "blog expert", wrote a MySQL index introduction framed as interviewer questions. It was well received, with nearly 1,000 bookmarks. The challenger in the screenshots below was a brand-new blogger.

[Figures: screenshots of the comment exchange between the blogger and the newcomer]
If you do not understand the leftmost-prefix principle, it is almost safe to assume you have no real tuning experience with composite indexes. But a certified blog expert gets a credibility bonus: read a few theory books, recycle some clichés, and readers who do not understand indexes will never doubt the claims. The influence this brings is terrifying — the bookmarks alone were close to 1,000, and such a wrong view spreading is unthinkable. Because my understanding of the leftmost-prefix principle differed from the blogger's, I paid close attention to this wrong claim, but I had no idea how to counter its spread, and no reach to do so (that is the real problem), haha.

A rebuttal from a blog expert could easily make a newcomer doubt himself and abandon his own correct point of view. Fortunately I backed up what the newcomer said, the newcomer refuted the blogger again with concrete examples, and the blogger quickly corrected the mistake, avoiding misleading even more people.

Computing moves fast, and many things have only existed for a few years. Anyone can be someone's teacher, and sometimes what you need most is to trust yourself.

So, what is the leftmost-prefix principle?
Quoting the previous article:

The leftmost-prefix principle refers to the left-to-right column order of the index definition (sex, age, time), not the order of conditions in the SQL WHERE clause. People online often say "the index did not take effect, it did a full table scan". Take the example above: if the time condition cannot use the index, that does not mean the index is not used at all — the (sex, age) prefix is still used. "The index did not take effect" is an ambiguous phrase; you should state precisely whether it was a full table scan or a full index scan.
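The rule above can be sketched as a toy model (illustrative only, equality conditions only; real MySQL also handles range conditions, which end the usable prefix after the range column):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the leftmost-prefix rule: walk the index columns
// from left to right and stop at the first column with no condition.
public class LeftmostPrefix {
    public static List<String> usablePrefix(List<String> indexCols, Set<String> whereCols) {
        List<String> prefix = new ArrayList<>();
        for (String col : indexCols) {
            if (!whereCols.contains(col)) break; // gap in the prefix: stop here
            prefix.add(col);
        }
        return prefix;
    }
}
```

With an index on (sex, age, time), a WHERE on sex and time uses only the (sex) prefix, while a WHERE on all three columns — in any order — uses the whole index: only which columns appear matters, not their order in the WHERE clause.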

So what exactly does the leftmost prefix look like? See the next point.

■ 4. My understanding of indexes

Online introductions to InnoDB indexes always show a B+ tree diagram, which many people find hard to grasp. I have stripped away the tree structure to visualize how an index progressively narrows the search range. With the picture below, you can basically understand why an index does or does not get used.

The example picture is a visualization of the composite index (company, position, gender, time) on my access-control-records table of 5.82 million rows. The index order from top to bottom in the figure corresponds to the leftmost-prefix order from left to right. The number on each sliced cuboid indicates the size of that index segment; each layer is sorted and subdivided again within the previous one.
[Figure: layered visualization of the composite index (company, position, gender, time)]
Next, an example with two SQL queries:
sql2 queries the 2019 access-control records of all female managers of a medium-sized company;
sql1 queries male employees of a large company from 2019-11-10 to 2019-11-17, limit 10000, 10.
[Figure: the two queries located within the index visualization]
And the colored version:
[Figure: colored version of the same visualization]

■ 5. Try to avoid using IN

In the previous article I said that IN does use the index: although EXPLAIN shows IN as type range, it is not a true range query, so when IN is used on a leading field of a composite index, the following fields can still use the index. I also gave an example; quoting the previous article:

③ Use IN to cleverly save an index. Suppose you have a gender search condition: male is 1, female is 2, or no gender filter at all. Some people might create two indexes, (…, sex ,…) and (…,…), to handle both the with-gender and without-gender cases. Instead, build only the (…, sex ,…) index and adjust the SQL: when not filtering by gender, enumerate both values, i.e.
select … from a where … and sex in (1, 2) and …
This way, one index fewer needs to be created.
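The trick can be sketched as a tiny condition builder (a hypothetical helper of my own, not the article's code; string concatenation is safe here only because sex is a typed integer, never raw user input):

```java
public class SexInTrick {
    // Build the sex predicate so the composite index prefix (..., sex, ...)
    // is always usable: a concrete value when filtering, IN (1, 2) for "all".
    public static String sexPredicate(Integer sex) {
        return (sex == null) ? "sex IN (1, 2)" : "sex = " + sex;
    }
}
```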

This is also the solution given in "High Performance MySQL, Third Edition", which is now seven or eight years old.
Now I must tell you that this solution has its limits: if the SQL sorts by a field that comes after the IN field in the composite index, the query may simply never return.

Example: sql3 queries the male managers and male employees of a medium-to-large company.
[Figure: sql3 located within the index visualization]
As shown in the picture — this illustrates how IN uses the index. The index works fine when no sorting is required, and you will find that limit's first 900 records are the male managers of the medium-sized company.
In other words, a full query ultimately merges 4 independent, internally-ordered contiguous blocks. Each block is ordered by itself, but once you add a sort, MySQL must filesort those 4 blocks to re-establish a global order. (Each block is internally ordered; combining multiple blocks naturally requires re-sorting.) The more IN values you use, the more Cartesian-product combinations there are to sort, and performance naturally plummets.

So: if the fields after the IN are sorted on, avoid IN on the leading fields as much as possible. Only queries that do not care about order can use this trick.
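The block argument can be made concrete with a toy model (illustrative, not MySQL internals): each IN value contributes a block that is internally sorted by time, but their concatenation is not globally sorted — which is exactly why a filesort becomes necessary:

```java
import java.util.ArrayList;
import java.util.List;

public class InFilesortDemo {
    // Concatenate per-value blocks in index order, the way an IN over a
    // prefix column lays out its matching ranges one after another.
    public static List<Integer> concatBlocks(List<List<Integer>> blocks) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> b : blocks) out.addAll(b);
        return out;
    }

    public static boolean isSorted(List<Integer> xs) {
        for (int i = 1; i < xs.size(); i++)
            if (xs.get(i - 1) > xs.get(i)) return false;
        return true;
    }
}
```

Each block passes isSorted on its own, yet the concatenation fails it, so an ORDER BY over the combined result forces a re-sort.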

On to the optimization

What follows is this 5.5k-salary code monkey's paging optimization for a MySQL (InnoDB) table approaching tens of millions of rows.

■ The effect

I was impressed by Baidu Tieba: the Li Yi bar has 15 million topics, yet any page number loads in under a second — worthy of a big company. Later I found it only really locates the first 201 pages; everything beyond that is fake pagination... Well, no ready-made solution there, so I did my own research.

Below is a demonstration of the paging effect with a count of over a million rows:
[Figure: paging demo on a table with a count above one million]
With the indexes working optimally, the large-table paging rework behaves as follows. Once the count cache is warm, queries for the first and the last tens of thousands of pages of a large data set are fast; within the first 10,000 rows there is sometimes nothing left but network transfer time. Performance degrades toward the middle, but middle pages have little viewing value anyway — the head and the tail are what matter — so you can restrict users from jumping too deep to keep query performance predictable.

■ Query process outline

This flow chart confuses even me, so I will explain it in words afterwards. Still, drawing it helped me sort out my ideas and exposed a previously undiscovered bug: while the asynchronous count cache was still Loading, the data query finished, the count finished too, and I deleted the cache. I added a restriction: the cache may only be deleted by the request that created it.
[Figure: query process flow chart]

■ Detailed query process

# Preparation stage

① Recall the dynamic append-next-page method mentioned earlier

Example: the mobile client asks for page 3 with 5 rows per page. Traditionally we first run the query without limit to get the count — perhaps 1,000,000 — and then fetch the rows with limit 10, 5. The count takes several seconds while the limit query returns instantly. How do we optimize this?
I call it the dynamic append-next-page method:
Query limit 10, 6 into resultList; return count = (3 - 1) * 5 + resultList.size(). If resultList.size() is 6, remove the last row from the returned resultList.
If resultList.size() is 1~5, this query hit the last page, and the count computed this way matches the traditional count. If resultList.size() is 6, there is a next page, and the mobile client can detect that too.

Simply put: fetch one extra row to learn whether a next page exists, then post-process the result so it looks exactly like a normal query, while still knowing whether there is a next page.
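The bookkeeping above can be sketched as a small helper (my own illustrative class, not the article's code; it assumes pageNo is 1-based and resultList came from a `limit (pageNo-1)*pageSize, pageSize+1` query):

```java
import java.util.List;

public class AppendNextPage {
    public final List<?> rows;    // rows to return (at most pageSize)
    public final boolean hasNext; // whether a next page exists
    public final long count;      // count as computed by the trick

    // resultList came from `LIMIT (pageNo-1)*pageSize, pageSize+1`
    public AppendNextPage(List<?> resultList, int pageNo, int pageSize) {
        this.hasNext = resultList.size() > pageSize;
        // drop the probe row if we actually fetched pageSize + 1 rows
        this.rows = hasNext ? resultList.subList(0, pageSize) : resultList;
        this.count = (long) (pageNo - 1) * pageSize + resultList.size();
    }
}
```

When the probe row is present, the count over-reports by design: it only has to convince the paging plugin that at least one more page exists; when the probe row is absent, it equals the true count.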

② Sacrificing count real-time accuracy is unavoidable

The trade-off behind the CAP principle is universal: in many cases you must give something up to gain something. Daily queries generally demand immediacy, but their data volume is small, so their count is not cached. The larger the count, the longer it is cached — the real-time accuracy of count is sacrificed for query performance. Producing a real-time count on a large MySQL table in under a second is impossible; even MyISAM cannot answer a conditional count that fast.

③ Reverse search method

Take limit 100000, 10: as mentioned before, MySQL must scan and then discard the first 100,000 rows. This has nothing to do with whether an index is used — the scan will not disappear just because you built an index; it is an unavoidable process. But we can flip it around: if you know the count is 100,000 and you want the last 10 rows, aren't they simply the first 10 rows in reverse order?
As follows:

	public class ReversePage {
		private long count;        // cached total row count
		private int firstResult;   // original offset
		private int pageSize;      // original page size
		private int firstResultReverse;
		private int pageSizeReverse;

		// Map an offset near the end of the ordering to an equivalent
		// offset from the front of the reversed ordering.
		public void reverse() {
			this.firstResultReverse = (int) count - this.firstResult - this.pageSize;
			if (this.firstResultReverse < 0) {
				// the last page is short: shrink the page size and clamp the offset
				this.pageSizeReverse = this.firstResultReverse + this.pageSize;
				this.firstResultReverse = 0;
			} else {
				this.pageSizeReverse = this.pageSize;
			}
		}
	}

Example 1:
The total is 9995; order by DESC limit 9980, 10 — the result is the second-to-last page, 10 rows. Available parameters: count=9995; firstResult=9980; pageSize=10.
Substituting into the method above:

	firstResultReverse = 5;   // >= 0, so the else branch applies
	pageSizeReverse = 10;

Rewritten SQL: order by ASC limit 5, 10, then reverse the 10 rows back into DESC order.

Example 2:
The total is 9995; order by DESC limit 9990, 10 — the result is the last page, 5 rows. Available parameters: count=9995; firstResult=9990; pageSize=10.
Substituting into the method above:

	firstResultReverse = -5;  // < 0, so clamp
	pageSizeReverse = 5;
	firstResultReverse = 0;

Rewritten SQL: order by ASC limit 0, 5, then reverse the 5 rows back into DESC order.

This way, the last pages are as fast as the first pages, though the middle still gets slower. This approach conflicts a bit with the append-next-page method: previously, repeatedly querying the last page via append-next-page would keep refreshing the count cache, but with the reverse method the last page is definitively the last page — yet count never gets updated — so an extra mechanism is needed to refresh the count cache and ensure count does not stay stale forever.

1. Query by day

I classify daily queries as low-volume queries: a single user generates at most tens of thousands of rows per day, small enough that multiple threads can query the count and the actual rows simultaneously and merge the results.

2. Unconditional query

I classify unconditional queries as mobile-style queries, which are the easiest: no count is needed, and the dynamic append-next-page method solves them trivially. Even better, use the last timestamp as a cursor and pass it into the next query. This beats dynamic append-next-page: as long as the index takes effect, there is no performance problem no matter how deep you page. The only thing left to handle is disambiguating rows that share the same timestamp.
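The timestamp-cursor idea can be sketched with an in-memory model (illustrative only; in SQL it corresponds to something like `where create_time < ? or (create_time = ? and id < ?) order by create_time desc, id desc limit n`, using the row id as the tie-breaker for identical timestamps):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class CursorPage {
    public static class Row {
        public final long time, id;
        public Row(long time, long id) { this.time = time; this.id = id; }
    }

    // Next page strictly "before" the cursor: time DESC, id DESC as tie-breaker.
    public static List<Row> nextPage(List<Row> table, long cursorTime, long cursorId, int n) {
        return table.stream()
                .filter(r -> r.time < cursorTime || (r.time == cursorTime && r.id < cursorId))
                .sorted(Comparator.<Row>comparingLong(r -> r.time).reversed()
                        .thenComparing(Comparator.<Row>comparingLong(r -> r.id).reversed()))
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Because the cursor is (time, id) rather than time alone, two rows with the same timestamp never get skipped or duplicated across page boundaries.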

3. Query other conditions

Queries with other conditions — the ones that range from nearly stuck to outright hanging — are the focus of this article; most of the machinery lives here.
① Caching
What is the biggest problem of paging a table with a large count? Without question, the total count — needing it both immediately and fast is effectively impossible (under a single fixed condition the count could be maintained separately, but here the query conditions are arbitrary and complex). You could of course switch to a pager that only exposes a few nearby pages, but what I describe here is traditional pagination that displays the count.
First, normalize the query parameters (drop the parameters that do not identify the result set, sort the remainder, and generate a unique value for the condition):

	// Build an ascending (sorted) parameter map so that equal conditions
	// always produce the same cache key
	public static Map<String, String> getSortParmMap(ServletRequest request) {
		Enumeration<?> pNames = request.getParameterNames();
		Map<String, String> params = new TreeMap<>();  // sorted by key
		while (pNames.hasMoreElements()) {
			String pName = (String) pNames.nextElement();
			if ("pageSize".equals(pName) || "pageNo".equals(pName) || "token".equals(pName)
					|| "sign".equals(pName) || "appSecret".equals(pName)) {
				continue;  // paging and auth parameters do not affect the result set
			}
			String pValue = request.getParameter(pName).replace(":", "-");
			if (StringUtils.isNotBlank(pValue)) {
				params.put(pName, pValue);
			}
		}
		return params;
	}

Then generate the redis key (tip: use colons — redis GUI management tools group keys by colon, which makes them easy to browse). Don't worry about long keys degrading performance; keys of this length are trivial for redis.

	Map<String, String> params = LargePage.getSortParmMap(request);
	String paramStr = "face:" + rcd.getCurrentUser().getTenantId() + ":" + params;

② If the main thread's data query has finished but the count is not cached yet: the data always matters more than the count, so do not wait — guarantee data first and fall back to the dynamic append-next-page method.
For example, my data query takes 200 ms and the rows are back. The main thread checks the cache again, sees the count is still Loading, and simply returns the rows, telling the front end that a next page exists.
[Figure: front end showing a loading spinner in place of the count]
As shown, when the response signals append-next-page mode, the front end does not display the count it carries; it shows a loading state instead, indicating the count is still being queried, and the placeholder count is used only to let the paging plugin render page numbers.

③ If the rows come back and a second cache lookup finds the count already cached, use it. And if that count cache was written by your own current query, you can clear it: the rows are out and the count is out too, meaning this count query is fast and caching it is unnecessary. You can even mark the condition as not-worth-caching for a period of time, so later requests with the same condition always hit the database, preserving real-time accuracy for small counts.
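Steps ② and ③ together amount to a count cache with a Loading sentinel. Here is a minimal sketch of that mechanism, with a ConcurrentHashMap standing in for redis — all names (CountCache, getOrLoad, LOADING) are my own illustrative choices, not the article's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.LongSupplier;

public class CountCache {
    public static final long LOADING = -1L;  // sentinel: count query in flight
    private final Map<String, Long> cache = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);  // background count queries must not block shutdown
        return t;
    });

    // Returns the cached count, or LOADING while an async count is in flight.
    public long getOrLoad(String key, LongSupplier slowCount) {
        Long cached = cache.get(key);
        if (cached != null) return cached;
        // mark as loading only if nobody else has started the query yet
        if (cache.putIfAbsent(key, LOADING) == null) {
            pool.submit(() -> { cache.put(key, slowCount.getAsLong()); });
        }
        return LOADING;
    }
}
```

The data query thread calls getOrLoad, gets LOADING on a cold key, and immediately responds in append-next-page mode; once the background count lands in the cache, later requests display the real count.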

# Summary of the effect

Daily queries are checked normally.
The other types query the count asynchronously; when the count arrives first, it is assigned directly, keeping it real-time.
A count below 1,000 (configurable) is not cached.
While the count is still being queried, the front end shows a spinner for it, but you always know whether there is a next page.

Count below 50,000 (configurable): later pages use the conventional limit and are slightly slower, but next-page detection still works.
Count above 50,000 (configurable): later pages are actually served by reading the first pages in reverse order — head and tail are fast, the middle is slow.

My count-caching policy is: the larger the count, the longer it is cached. Together with a count refresh mechanism, this completes the large-table paging mechanism.
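The policy "the larger the count, the longer the cache" could be implemented, for instance, like this. The 1,000 no-cache threshold comes from this article; the linear growth and the one-day cap are my own assumed knobs:

```java
public class CountTtl {
    // No caching below the small-count threshold; above it, TTL grows with count.
    public static long ttlSeconds(long count) {
        if (count < 1000) return 0;        // small counts: always query live
        long ttl = 60 + count / 1000;      // grow roughly linearly with count
        return Math.min(ttl, 24 * 3600);   // cap at one day
    }
}
```

The returned value would be passed as the expiry when writing the count into redis; any monotonically growing, capped curve serves the same purpose.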

Finally

The road is long: MySQL optimization and paging optimization still have a long way to go, and I am still on the way.


Origin blog.csdn.net/qq_24054301/article/details/106444854