The right way to do random sorting in MySQL

Consider the following table structure:
CREATE TABLE `words` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;

The table holds 10,000 rows, and the task is to select 3 words from it at random.

The easiest way

mysql> select word from words order by rand() limit 3;

Although this SQL statement is very simple, the execution process is a bit complicated.

In the EXPLAIN output for this statement, the Extra field shows Using temporary and Using filesort, which means a temporary table is needed and a sort must be performed on that temporary table. For InnoDB tables, full-field sorting reduces disk access, so it is preferred.

For a memory table, however, going back to the table only means reading the row from memory by its position, which causes no extra disk access. In this case MySQL chooses rowid sorting.

The statement executes as follows:

1. Create a temporary table that uses the memory engine. It has two fields: the first is of type double and is called field R, and the second is of type varchar(64) and is called field W. The table has no index.
2. Read all word values from the words table in primary key order. For each word value, call rand() to generate a random decimal greater than 0 and less than 1, and store the random decimal and the word in fields R and W of the temporary table. At this point the number of scanned rows is 10,000.
3. The temporary table now holds 10,000 rows. The next task is to sort this unindexed memory temporary table by field R.
4. Initialize sort_buffer. It has two fields, one of type double and one of integer type.
5. Read the R value and the location information from the memory temporary table row by row and store them in the two fields of sort_buffer. This requires a full scan of the temporary table, so the number of scanned rows grows by 10,000 to 20,000.
6. Sort sort_buffer by the value of R. This step touches no table, so it does not increase the number of scanned rows.
7. After sorting, take the location information of the first three results, fetch the corresponding word values from the memory temporary table, and return them to the client. This accesses three rows of the table, so the total number of scanned rows becomes 20,003.

Note on the "location information" in step 5: the MEMORY engine is not an index-organized table. In this example you can think of it as an array, so the rowid is simply the array subscript.
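
The flow above can be imitated with a small in-memory sketch. This is only a model of the described steps, not MySQL's implementation: the function name order_by_rand_limit, the list standing in for the temporary table, and the scanned counter are all assumptions made for illustration.

```python
import random

def order_by_rand_limit(words, limit, rng):
    """Model the described flow; return (chosen words, rows scanned)."""
    scanned = 0

    # Scan the source table and fill the temporary table with (R, W).
    tmp_table = []
    for word in words:
        tmp_table.append((rng.random(), word))
        scanned += 1

    # Full scan of the temporary table: copy (R, rowid) into the sort buffer.
    # The rowid is modelled as the list index.
    sort_buffer = []
    for rowid, (r, _w) in enumerate(tmp_table):
        sort_buffer.append((r, rowid))
        scanned += 1

    # Sort by R in memory; no table rows are touched here.
    sort_buffer.sort()

    # Fetch the words for the first `limit` rowids from the temporary table.
    result = []
    for _r, rowid in sort_buffer[:limit]:
        result.append(tmp_table[rowid][1])
        scanned += 1
    return result, scanned

words = [f"word{i}" for i in range(10000)]
picked, rows_scanned = order_by_rand_limit(words, 3, random.Random(42))
print(picked, rows_scanned)
```

For 10,000 words and limit 3 the counter ends at 10000 + 10000 + 3, matching the figure derived in the text.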

Use the slow log to verify:

# Query_time: 0.900376  Lock_time: 0.000347 Rows_sent: 3 Rows_examined: 20003
SET timestamp=1541402277;
select word from words order by rand() limit 3;

order by rand() uses a memory temporary table, and the rowid sort method is used when sorting the memory temporary table.

The tmp_table_size setting limits the size of an in-memory temporary table; the default is 16M. If the temporary table grows beyond tmp_table_size, the in-memory temporary table is converted to a disk temporary table. Disk temporary tables use the InnoDB engine by default, which is controlled by the parameter internal_tmp_disk_storage_engine.

With a disk temporary table, the example above becomes the sorting process of an InnoDB table that has no explicit index. The following statements reproduce that case:

set tmp_table_size=1024;
set sort_buffer_size=32768;
set max_length_for_sort_data=16;
/* Enable optimizer_trace, for the current session only */
SET optimizer_trace='enabled=on';
/* Run the statement */
select word from words order by rand() limit 3;
/* View the OPTIMIZER_TRACE output */
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G

The sort_mode in the trace shows rowid sorting, and each row taking part in the sort consists of the random value R and the rowid.

Each R value takes 8 bytes and each rowid takes 6 bytes, so with 10,000 rows the data to sort totals 140,000 bytes, which exceeds the 32,768 bytes set by sort_buffer_size. Yet number_of_tmp_files is 0. The reason is that this statement is sorted with an algorithm introduced in MySQL 5.6: the priority queue sorting algorithm. The OPTIMIZER_TRACE output confirms this, since chosen=true appears under filesort_priority_queue_optimization.

The current statement only needs the 3 rowids with the smallest R values. Merge sort would also produce the first 3 values in the end, but it would sort all 10,000 rows to get there, which is unnecessary work.

The priority queue algorithm finds exactly the three smallest values. It works as follows:

1. Of the 10,000 (R, rowid) pairs to be sorted, take the first three and build a heap from them (a max-heap, so the largest R sits at the top).
2. Take the next pair (R', rowid') and compare it with the largest R in the current heap. If R' is less than R, remove that (R, rowid) from the heap and put (R', rowid') in its place.
3. Repeat step 2 until the 10,000th (R', rowid') has been compared.
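
The heap procedure can be sketched in a few lines. This is a minimal illustration, not MySQL's filesort code: the function name smallest_k_rowids is hypothetical, and because Python's heapq module is a min-heap, the sketch stores -R to emulate a heap whose top is the largest R.

```python
import heapq
import random

def smallest_k_rowids(rows, k):
    """rows: iterable of (R, rowid). Return the k pairs with the smallest R, ascending."""
    heap = []  # holds (-R, rowid); heap[0] is therefore the pair with the largest R
    for r, rowid in rows:
        if len(heap) < k:
            # Build the initial heap from the first k pairs.
            heapq.heappush(heap, (-r, rowid))
        elif r < -heap[0][0]:
            # R' is smaller than the largest R in the heap: swap it in.
            heapq.heapreplace(heap, (-r, rowid))
    return sorted((-neg_r, rowid) for neg_r, rowid in heap)

rng = random.Random(7)
rows = [(rng.random(), rowid) for rowid in range(10000)]
top3 = smallest_k_rowids(rows, 3)
print(top3)
```

Only 3 pairs are held at any time, which is why the sort fits in a small sort_buffer even though the full data set does not.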

The SQL statement in the previous article also had a limit clause (limit 1000). With the priority queue algorithm, the heap there would have to hold 1000 rows of (name, rowid), which exceeds the sort_buffer_size I set, so only the merge sort algorithm could be used.

In short, whichever type of temporary table is used, order by rand() makes the computation complicated and scans a large number of rows, so the sort consumes a lot of resources.

The right way to select rows at random

Simplify the problem first and suppose only one word value needs to be selected at random:

1. Get the maximum value M and the minimum value N of the table's primary key id.
2. Use a random function to generate a number between the minimum and the maximum: X = (M-N)*rand() + N.
3. Take the first row whose id is not less than X.

Call this random algorithm 1 for now. The sequence of statements is:

mysql> select max(id),min(id) into @M,@N from t ;
set @X= floor((@M-@N+1)*rand() + @N);
select * from t where id >= @X limit 1;

This method is very efficient: neither max(id) nor min(id) needs to scan the index, and the select in the third step can use the index to locate the row quickly, so the whole thing can be regarded as scanning only 3 rows. Strictly speaking, though, this algorithm does not meet the randomness requirement of the problem. The ids may contain gaps, so different rows are chosen with different probabilities, and the result is not truly random.
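
A quick simulation makes the skew visible. The id list [1, 2, 4, 5] is a made-up example with one gap, and random_algorithm_1 is a hypothetical stand-in for the three statements above, using a binary search in place of the index lookup.

```python
import bisect
import math
import random
from collections import Counter

def random_algorithm_1(sorted_ids, rng):
    n, m = sorted_ids[0], sorted_ids[-1]  # min(id), max(id)
    x = math.floor((m - n + 1) * rng.random() + n)
    # First id >= x, as the lookup `where id >= @X limit 1` would find it.
    return sorted_ids[bisect.bisect_left(sorted_ids, x)]

ids = [1, 2, 4, 5]  # id 3 is missing
rng = random.Random(1)
counts = Counter(random_algorithm_1(ids, rng) for _ in range(100000))
print(sorted(counts.items()))
```

X takes the values 1 to 5 with equal probability, and both X=3 and X=4 land on id 4, so that row is picked about twice as often as each of the others.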

In order to get strictly random results, you can use the following process:

1. Get the number of rows in the entire table and record it as C.
2. Compute Y = floor(C * rand()). The floor function takes the integer part.
3. Use limit Y,1 to get one row.

This is random algorithm 2, which fixes the obviously uneven probabilities of algorithm 1. MySQL handles limit Y,1 by reading rows one by one in order, discarding the first Y, and returning the next record, so this step scans Y+1 rows. Adding the C rows scanned in the first step, the total is C+Y+1 rows, so the execution cost is higher than that of random algorithm 1.

For this table of 10,000 rows, C=10000. If Y happens to be large, the number of scanned rows approaches 20,000, which is close to the number scanned by order by rand(). Even so, random algorithm 2 is much cheaper to execute than order by rand(): it fetches rows with limit in primary key order, and the primary key index is already ordered, so the sorting step is skipped entirely.
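
The row accounting for random algorithm 2 can be checked with a small model. This is a sketch under assumptions: the list of (id, word) tuples stands in for the table in primary key order, and the function name and the scanned counter are illustrative only.

```python
import math
import random

def random_algorithm_2(rows, rng):
    """Return (chosen row, Y, rows scanned) for a table given in primary key order."""
    scanned = 0

    # select count(*): scans all C rows.
    c = 0
    for _ in rows:
        c += 1
        scanned += 1

    y = math.floor(c * rng.random())

    # limit Y,1: read rows in order, discard the first Y, return the next one.
    chosen = None
    for offset, row in enumerate(rows):
        scanned += 1
        if offset == y:
            chosen = row
            break
    return chosen, y, scanned

# ids with gaps: every third id is missing
table = [(i, f"word{i}") for i in range(1, 15001) if i % 3 != 0]
chosen, y, scanned = random_algorithm_2(table, random.Random(5))
print(chosen, y, scanned)
```

Because Y is an offset rather than an id, gaps in the ids do not affect which row is chosen, and the counter comes out as C+Y+1.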

Following the idea of random algorithm 2, selecting 3 word values at random works like this:

1. Get the number of rows in the entire table and record it as C.
2. Compute Y1, Y2, and Y3 with the same random method.
3. Execute three limit Y,1 statements to get three rows.

The total number of rows scanned by this algorithm is C+(Y1+1)+(Y2+1)+(Y3+1). It can be optimized further to reduce the number of scanned rows:

1. After generating Y1, Y2, and Y3, compute Ymax and Ymin.
2. Run select id from t limit Ymin, (Ymax - Ymin + 1);
3. From the resulting id set, work out the three ids that correspond to Y1, Y2, and Y3.
4. Finally, run select * from t where id in (id1, id2, id3);

The number of rows scanned this way should be roughly C+Ymax+3.
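
The optimized selection can be sketched as follows. The names pick_three and id_window are hypothetical, the list of (id, word) tuples again stands in for the table in primary key order, and the three offsets are drawn independently as in the text, so they may occasionally coincide.

```python
import math
import random

def pick_three(rows, rng):
    """Return (offsets, chosen rows) using one range read plus three id lookups."""
    c = len(rows)  # stands in for select count(*)
    ys = [math.floor(c * rng.random()) for _ in range(3)]
    y_min, y_max = min(ys), max(ys)

    # select id from t limit Ymin, (Ymax - Ymin + 1)
    id_window = [row_id for row_id, _word in rows[y_min:y_max + 1]]

    # Map each offset to its id inside the window.
    ids = [id_window[y - y_min] for y in ys]

    # select * from t where id in (id1, id2, id3): primary key lookups
    by_id = dict(rows)
    return ys, [(row_id, by_id[row_id]) for row_id in ids]

table = [(i, f"word{i}") for i in range(1, 15001) if i % 3 != 0]
ys, picked_rows = pick_three(table, random.Random(3))
print(ys, picked_rows)
```

The single range read replaces three separate limit scans, and each chosen row is the same one that limit Y,1 would have returned for its offset.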

Content source: Lin Xiaobin "45 Lectures on MySQL Actual Combat"