MySQL in Practice, Under the Hood --- How to Correctly Display Random Messages

Table of contents

foreword

in-memory temporary table

disk temporary table

random ordering method


  • foreword

  • This article looks at another sorting requirement in MySQL, as a way to deepen your understanding of MySQL's sorting logic
  • The task: randomly select three words from a word list
  • The table creation statement and initial data commands for this table are as follows:

  • To make the discussion quantitative, 10,000 rows are inserted into this table
  • The next step is to randomly select 3 words: how can this be done, what problems arise, and how can it be improved?
  • in-memory temporary table

  • The first approach that comes to mind is order by rand()

  • The meaning of this statement is straightforward: sort the rows randomly and take the first 3
  • Although the statement is simple to write, its execution is fairly complicated
  • First, use the explain command to see how this statement is executed

  • In the Extra field, Using temporary indicates that a temporary table is needed
  • Using filesort indicates that a sort operation is needed
  • So this Extra value means that a temporary table is required, and that the sort is performed on that temporary table
  • Before going on, it is worth reviewing full-field sorting and rowid sorting from the previous article
  • Full field sorting:

  • rowid sorting:

  • That raises a question: which algorithm does MySQL choose when sorting an in-memory temporary table?
  • Recall a conclusion from the previous article:
  • For InnoDB tables, full-field sorting reduces disk access and is therefore preferred
  • Note the emphasis on "InnoDB tables". For a memory table, going back to the table simply means reading the row from memory by its position, so it causes no extra disk access at all
  • Without that concern, the optimizer prefers the rows used for sorting to be as small as possible, so MySQL chooses rowid sorting here
  • With this selection logic in mind, let's look at the execution flow of the statement
  • Along the way, we will use this example to work out the number of rows the statement scans
  • The execution flow of this statement is as follows:
    • 1- Create a temporary table. It uses the MEMORY engine and has two fields: the first is of type double and is referred to below as field R; the second is of type varchar(64) and is referred to as field W. The table has no index
    • 2- From the words table, read all the word values in primary key order
    • For each word value, call rand() to generate a random decimal greater than or equal to 0 and less than 1, and store the random decimal and the word in fields R and W of the temporary table. At this point, the number of scanned rows is 10,000
    • 3- The temporary table now holds 10,000 rows, and the next step is to sort this non-indexed in-memory temporary table by field R
    • 4- Initialize sort_buffer. It has two fields, one of type double and one of integer type
    • 5- Read the R value and the position information row by row from the in-memory temporary table (why it is "position information" is explained later), and store them in the two fields of sort_buffer
    • This step requires a full scan of the in-memory temporary table, so the number of scanned rows increases by 10,000 to 20,000
    • 6- Sort sort_buffer by the value of R. This step involves no table access, so it does not increase the number of scanned rows
    • 7- After sorting, take the position information of the first three results, fetch the corresponding word values from the in-memory temporary table one by one, and return them to the client
    • This step accesses three rows of the table, so the total number of scanned rows becomes 20,003
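  • The row counting in the seven steps above can be sketched in Python. This is only a model under stated assumptions, not MySQL's implementation: the function name order_by_rand_limit, the list standing in for the temporary table, and the list index standing in for the position information are all hypothetical choices made for illustration.

```python
import random

def order_by_rand_limit(words, limit=3, rng=random.random):
    """Model the rowid-sort flow of ORDER BY RAND() LIMIT n on a memory temporary table."""
    rows_examined = 0

    # Steps 1-2: scan the source table and fill the temporary table with (R, W).
    tmp_table = []
    for word in words:
        tmp_table.append((rng(), word))
        rows_examined += 1

    # Steps 4-5: scan the temporary table, copying (R, position) into sort_buffer.
    sort_buffer = []
    for pos, (r, _word) in enumerate(tmp_table):
        sort_buffer.append((r, pos))
        rows_examined += 1

    # Step 6: sort inside sort_buffer; no table access, so nothing is counted.
    sort_buffer.sort()

    # Step 7: go back to the temporary table by position for the first rows.
    result = []
    for _r, pos in sort_buffer[:limit]:
        result.append(tmp_table[pos][1])
        rows_examined += 1

    return result, rows_examined

words = [f"word{i}" for i in range(10000)]
picked, rows_examined = order_by_rand_limit(words)
print(picked, rows_examined)
```

  • With 10,000 input rows and limit 3, the counter ends at 10,000 + 10,000 + 3 = 20,003, matching the analysis.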
  • Next, use the slow query log (slowlog) to verify whether the number of scanned rows obtained by the analysis is correct

  • Here, Rows_examined: 20003 means that 20,003 rows were scanned while executing this statement, which confirms the analysis
  • This is a useful habit when learning a concept: first work out the number of scanned rows from the principles, then check your result against the slow query log
  • Now let's look at the complete flow chart of the sort

  • The pos in the figure is the position information. What does "position information" mean here?
  • In the previous article, the sort on the InnoDB table explicitly used the ID field
  • To answer this, we need to go back to a basic concept: how does a MySQL table locate "a row of data"?
  • The earlier article on indexes raised a question: if the primary key of an InnoDB table is dropped, is there then no primary key, and no way to go back to the table?
  • Actually, no
  • If you create a table without a primary key, or drop the primary key of a table, InnoDB generates a 6-byte rowid to serve as the primary key
  • This is where the name rowid in the sorting mode comes from
  • What it really represents is the information each engine uses to uniquely identify a data row
  • For an InnoDB table with a primary key, this rowid is the primary key ID
  • For an InnoDB table without a primary key, the rowid is generated by the system
  • The MEMORY engine is not an index-organized table; in this example, it can be thought of as an array
  • So for the memory table, this rowid is really the array subscript
  • To summarize briefly:
  • order by rand() uses an in-memory temporary table, and the sort on the in-memory temporary table uses the rowid sorting method
  • disk temporary table

  • So, are all temporary tables memory tables?
  • Actually, no
  • The tmp_table_size setting limits the size of an in-memory temporary table; its default value is 16M
  • If the temporary table grows beyond tmp_table_size, the in-memory temporary table is converted into a disk temporary table
  • The engine used by the disk temporary table is InnoDB by default, which is controlled by the parameter internal_tmp_disk_storage_engine
  • When a disk temporary table is used, the process corresponds to sorting an InnoDB table without an explicit index
  • To reproduce this process, set tmp_table_size to 1024, set sort_buffer_size to 32768, and set max_length_for_sort_data to 16

  • OPTIMIZER_TRACE partial results

  • Then take a look at the results of this OPTIMIZER_TRACE
  • Because max_length_for_sort_data is set to 16, which is less than the defined length of the word field, sort_mode shows rowid sorting, as expected
  • The rows that take part in the sort consist of the random value field R and the rowid field
  • But on reflection, something does not add up
  • The random value in field R takes 8 bytes and the rowid takes 6 bytes. With 10,000 rows in total, that comes to 14 * 10,000 = 140,000 bytes, which exceeds the 32,768 bytes set by sort_buffer_size
  • Yet number_of_tmp_files is 0. Shouldn't temporary files be needed?
  • The sort for this SQL statement does not use temporary files. It uses a sorting algorithm introduced in MySQL 5.6: the priority queue sorting algorithm
  • Next, let's see why the priority queue sorting algorithm is used here instead of merge sort, the algorithm that relies on temporary files
  • The current SQL statement only needs the three rowids with the smallest R values
  • If merge sort is used, the first 3 values are obtained in the end, but by the time the algorithm finishes, all 10,000 rows have been sorted
  • In other words, the remaining 9,997 rows are in order as well
  • But the query does not need those rows to be in order
  • So a lot of that computation is wasted
  • The priority queue algorithm can obtain exactly the three smallest values without sorting the rest
  • The execution process is as follows:

    • 1- For the 10,000 (R, rowid) to be sorted, first take the first three rows and construct a heap
    • 2- Take the next row (R', rowid') and compare it with the largest R in the current heap. If R' is smaller than R, remove this (R, rowid) from the heap and replace it with (R', rowid')
    • 3- Repeat step 2 until the 10000th (R', rowid') completes the comparison
  • The figure above simulates 6 (R, rowid) rows and shows how priority queue sorting finds the rows with the three smallest R values
  • Throughout the sort, the maximum value is always kept at the top of the heap so that the current maximum can be read as quickly as possible, so this is a max heap
  • In the OPTIMIZER_TRACE result above, chosen=true in the filesort_priority_queue_optimization section means that the priority queue sorting algorithm is used. This process does not need temporary files, so the corresponding number_of_tmp_files is 0
  • When the process ends, the heap holds the three rows with the smallest R values among the 10,000 rows
  • Then their rowids are taken out one by one and used to fetch the word field from the temporary table; this step is the same as in the rowid sorting process of the previous article
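  • The heap procedure described above can be sketched with Python's heapq module. This is an illustrative model only, not the server's filesort code: the function name smallest_k_rows is hypothetical, and because heapq provides a min-heap, the sketch stores -R so that the largest R sits at the top.

```python
import heapq
import random

def smallest_k_rows(rows, k=3):
    """Return the k (R, rowid) pairs with the smallest R, using a bounded max-heap."""
    heap = []  # holds (-R, rowid); heap[0] is the entry with the largest R
    for r, rowid in rows:
        if len(heap) < k:
            # Step 1: the first k rows build the heap.
            heapq.heappush(heap, (-r, rowid))
        elif r < -heap[0][0]:
            # Step 2: the new row beats the current maximum, so replace the top.
            heapq.heapreplace(heap, (-r, rowid))
    # Only k entries are ever kept; order them by R for the final result.
    return sorted((-neg_r, rowid) for neg_r, rowid in heap)

rng = random.Random(42)
rows = [(rng.random(), rowid) for rowid in range(10000)]
top3 = smallest_k_rows(rows, 3)
print(top3)
```

  • The heap never holds more than k entries, which is why no temporary file is needed even though the full set of rows would not fit in a small sort buffer.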
  • Look at the SQL query statement in the previous article:

  • You may ask: limit is used here too, so why was the priority queue sorting algorithm not used?
  • The reason is that this SQL statement uses limit 1000
  • With the priority queue algorithm, the heap to maintain would hold 1000 (name,rowid) rows, which exceeds the configured sort_buffer_size, so only merge sort can be used
  • In short, whichever type of temporary table is used, order by rand() makes the computation complicated and scans a large number of rows, so the sort consumes a lot of resources
  • Back to the question at the beginning of the article: how do we select rows randomly in the right way?
  • random ordering method

  • Let's simplify the problem first. If only one word value is to be selected at random, what can we do?
  • The idea is this:
  • 1- Get the maximum value M and the minimum value N of the primary key id of this table
  • 2- Use a random function to generate a number between the minimum and the maximum: X = (M-N)*rand() + N
  • 3- Take the first row whose id is not less than X
  • For now, we call this random algorithm 1
  • Sequence of executed statements:

  • This method is very efficient: max(id) and min(id) can be obtained without scanning the index, and the select in step 3 can also use the index to locate the row quickly, so it can be considered to scan only 3 rows
  • But this algorithm does not strictly meet the randomness requirement of the problem: the ids may have gaps, so different rows are selected with different probabilities, and the result is not truly random
  • For example, with 4 ids, namely 1, 2, 4, and 5, this method selects the row with id=4 with twice the probability of the other rows
  • What if the ids of these four rows are 1, 2, 40000, and 40001?
  • In that case, the algorithm can basically be regarded as buggy
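  • The skew caused by gaps can be checked with a short Python sketch. It rests on an assumption that is not stated in the steps above: X is drawn as a whole number between the minimum and maximum id (here with random.randint), which is the reading under which id=4 is exactly twice as likely as the others. The function names and the sorted list standing in for the primary key index are hypothetical.

```python
import bisect
import random
from collections import Counter

def random_algorithm_1(ids, rng=random):
    """Pick one id: draw an integer X in [min, max], then take the first id >= X."""
    lo, hi = ids[0], ids[-1]      # ids must be sorted, like a primary key index
    x = rng.randint(lo, hi)       # integer draw: an assumption of this sketch
    return ids[bisect.bisect_left(ids, x)]

def selection_probabilities(ids):
    """Exact probability of each id, found by enumerating every possible X."""
    lo, hi = ids[0], ids[-1]
    hits = Counter(ids[bisect.bisect_left(ids, x)] for x in range(lo, hi + 1))
    total = hi - lo + 1
    return {i: hits[i] / total for i in ids}

print(random_algorithm_1([1, 2, 4, 5]))
print(selection_probabilities([1, 2, 4, 5]))
print(selection_probabilities([1, 2, 40000, 40001]))
```

  • For the ids 1, 2, 4, 5 the probabilities are 0.2, 0.2, 0.4, 0.2: both X=3 and X=4 land on id=4. For 1, 2, 40000, 40001 almost every draw lands on id=40000.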
  • Therefore, in order to obtain strictly random results, the following process can be used
    • 1- Get the number of rows in the entire table and record it as C
    • 2- Obtain Y = floor(C*rand()); the function of the floor function here is to take the integer part
    • 3- Then use limit Y,1 to get a row
  • We call this random algorithm 2
  • The following piece of code is the sequence of execution statements of the above process:

  • Because the argument of limit cannot be a variable directly, the code above uses the prepare+execute method
  • You can also build the SQL statement by string concatenation in the application, which is simpler
  • Random algorithm 2 solves the obvious problem of uneven probability in algorithm 1
  • MySQL handles limit Y, 1 by reading rows one by one in order, discarding the first Y, and returning the next record as the result, so this step scans Y+1 rows
  • Adding the C rows scanned in step 1, a total of C+Y+1 rows are scanned, so the execution cost is higher than that of random algorithm 1
  • Of course, compared with using order by rand() directly, random algorithm 2 still costs much less
  • Now, what if we want to select 3 word values at random, following the idea of random algorithm 2?
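  • The C+Y+1 cost can be illustrated with a small Python model of random algorithm 2. It is a sketch under assumptions, not the SQL statement sequence: the table is a plain list, counting rows stands in for select count(*), and the function name random_algorithm_2 is hypothetical.

```python
import math
import random

def random_algorithm_2(table, rng=random.random):
    """Pick one row uniformly: count the rows, then emulate LIMIT Y, 1."""
    rows_scanned = 0

    # Step 1: counting the rows walks every row once (C rows).
    c = 0
    for _ in table:
        c += 1
        rows_scanned += 1

    # Step 2: Y = floor(C * rand()), an offset between 0 and C-1.
    y = math.floor(c * rng())

    # Step 3: LIMIT Y, 1 reads rows in order, discards the first Y, keeps the next.
    picked = None
    for offset, row in enumerate(table):
        rows_scanned += 1
        if offset == y:
            picked = row
            break

    return picked, y, rows_scanned

table = [1, 2, 40000, 40001]  # gaps in the ids no longer matter
print(random_algorithm_2(table))
```

  • Because the offset Y counts rows rather than id values, every row is equally likely regardless of gaps, at the price of scanning C+Y+1 rows.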
  • You can do this:
    • 1- Obtain the number of rows in the entire table, denoted as C
    • 2- Obtain Y1, Y2, Y3 according to the same random method
    • 3- Execute three limit Y, 1 statements, one for each of Y1, Y2, and Y3, to get three rows of data
  • We call this random algorithm 3
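  • As a rough illustration of the three steps, the following Python sketch models random algorithm 3 on a list. It is not the SQL statement sequence: the function name random_algorithm_3 is hypothetical, len() stands in for the row count, and each slice stands in for one limit Y, 1 statement. Since the three offsets are drawn independently, this sketch can return the same row more than once.

```python
import math
import random
from itertools import islice

def random_algorithm_3(table, k=3, rng=random.random):
    """Pick k rows by drawing k independent offsets and emulating LIMIT Y, 1 for each."""
    c = len(table)        # step 1: row count C; counting costs C scanned rows
    rows_scanned = c
    picked = []
    for _ in range(k):
        y = math.floor(c * rng())             # step 2: Y = floor(C * rand())
        row = next(islice(table, y, y + 1))   # step 3: skip Y rows, take the next
        rows_scanned += y + 1
        picked.append(row)
    return picked, rows_scanned

words = [f"word{i}" for i in range(10000)]
print(random_algorithm_3(words))
```

  • In this model the total cost is C + (Y1+1) + (Y2+1) + (Y3+1) scanned rows.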
  • The following piece of code is the sequence of execution statements of the above process:


Origin blog.csdn.net/weixin_59624686/article/details/131351446