Study notes on "MySQL in Practice: 45 Lectures": How to correctly display random messages? (Lecture 17)

First, the introduction

In the previous article, after I explained the several execution modes of the order by statement, I remembered a performance problem that a friend who builds an English-learning App once ran into. In today's article I will start from this performance problem and talk with you about another kind of sorting requirement in MySQL, hoping to deepen your understanding of MySQL's sorting logic.

The home page of this English-learning App has a feature that displays random words: there is a word list for each user level, and every time a user visits the home page, three random words are scrolled across it. They found that as the word list grew, the logic for choosing words became slower and slower, to the point of affecting how quickly the home page opened.

Now, if you were asked to design this SQL statement, how would you write it?

For ease of understanding, I have simplified the example: drop the logic in which each user level has its own word list, and simply pick three words at random from a single word list. The table creation statement and the commands for the initial data are as follows:

mysql> CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=0;
  while i<10000 do
    insert into words(word) values(concat(char(97+(i div 1000)), char(97+(i % 1000 div 100)), char(97+(i % 100 div 10)), char(97+(i % 10))));
    set i=i+1;
  end while;
end;;
delimiter ;

call idata();

To make the explanation easier to quantify, I inserted 10,000 rows into this table. Next, let's look at what methods can be used to randomly select three words, what problems they have, and how to improve them.
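To see what the idata() procedure actually puts in the table, here is a minimal Python sketch that mirrors its concat(char(...)) expression. The function name make_word is only illustrative and is not part of the original example.

```python
def make_word(i: int) -> str:
    """Mirror the concat(char(97 + ...)) expression of idata() for loop counter i."""
    # Thousands, hundreds, tens and units digits of i, each mapped to 'a'..'j'.
    digits = (i // 1000, i % 1000 // 100, i % 100 // 10, i % 10)
    return "".join(chr(97 + d) for d in digits)


words = [make_word(i) for i in range(10000)]
print(words[0], words[1234], words[9999])  # aaaa bcde jjjj
print(len(set(words)))                     # 10000, every word is distinct
```

So the table holds 10,000 distinct four-letter words, from aaaa to jjjj.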

Second, the in-memory temporary table

First, you will think of using order by rand() to implement this logic.

mysql> select word from words order by rand() limit 3;

1, Which algorithm is chosen when sorting an in-memory temporary table?

The meaning of this statement is very straightforward: sort randomly and take the first three rows. Although the SQL statement is simple to write, its execution process is somewhat complicated.

Let's first use the explain command to see how this statement is executed.

Figure 1: Using the explain command to view how the statement is executed

1, What does Using temporary mean?

In the Extra field, Using temporary indicates that a temporary table is needed, and Using filesort indicates that a sort operation is needed.

So this Extra value means that a temporary table is needed and that the sort is performed on the temporary table.

Here, you can first review the content on full-field sorting and rowid sorting from the previous article. I have copied the two flowcharts from that article here for your review.

 

 

Figure 2: Full-field sorting

Figure 3: rowid sorting

Then let me ask you a question: which algorithm do you think will be chosen for sorting an in-memory temporary table? Recall a conclusion from the previous article: for InnoDB tables, full-field sorting reduces disk access, so it is preferred.

I emphasized "InnoDB tables", so you have surely realized that for a memory table, going back to the table simply means accessing memory directly by the position of the data row, which causes no extra disk access at all. With that concern out of the way, the optimizer's priority is to keep the rows used for sorting as small as possible, so MySQL chooses rowid sorting in this case.

2, The complete process of random sorting

Now that we understand the logic behind this algorithm choice, let's look at how the statement is executed. Along the way, we will use today's example to try to analyze the number of rows the statement scans.

The execution flow of this statement is as follows:

1. Create a temporary table. This temporary table uses the memory engine and has two fields: the first is of type double, referred to as field R for convenience in the description below, and the second is of type varchar(64), referred to as field W. This table has no index.

2. From the words table, fetch all word values in primary key order. For each word value, call the rand() function to generate a random decimal in the range [0, 1), and store this random decimal and the word in the R and W fields of the temporary table respectively. At this point, the number of scanned rows is 10000.

3. The temporary table now has 10000 rows. Next, they have to be sorted by field R on this in-memory temporary table, which has no index.

4. Initialize sort_buffer. The sort_buffer has two fields: one of type double and the other an integer.
5. Fetch the R value and the position information from the in-memory temporary table row by row (I will explain later why it is "position information" here) and store them in the two fields of sort_buffer. This requires a full scan of the in-memory temporary table, so the number of scanned rows increases by 10000 to 20000.
6. Sort in sort_buffer by the value of R. Note that this step involves no table operation, so it does not increase the number of scanned rows.
7. After the sort completes, take the position information of the first three results, fetch the corresponding word values from the in-memory temporary table one by one, and return them to the client. This step accesses three rows of the table, so the total number of scanned rows becomes 20003.

1, A recommended learning method

Next, we use the slow query log (slow log) to verify whether the number of scanned rows from our analysis is correct.

# Query_time: 0.900376  Lock_time: 0.000347 Rows_sent: 3 Rows_examined: 20003
SET timestamp=1541402277;
select word from words order by rand() limit 3;

Here, Rows_examined: 20003 means that this statement scanned 20003 rows during execution, which confirms the conclusion of our analysis.
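The row count analyzed above can also be reproduced with a small simulation. The following Python sketch is only a model of the seven steps, not MySQL's implementation: the temporary table is a plain list, the position information is the list index, and the names order_by_rand_limit and rows_examined are illustrative.

```python
import random


def order_by_rand_limit(words, limit, rng):
    """Model the in-memory temporary table plus rowid sort, counting examined rows."""
    rows_examined = 0

    # Step 2: scan the base table and store (R, W) in the temporary table.
    tmp_table = []
    for w in words:
        tmp_table.append((rng.random(), w))
        rows_examined += 1

    # Step 5: full scan of the temporary table, copying (R, pos) into sort_buffer.
    sort_buffer = []
    for pos, (r, _w) in enumerate(tmp_table):
        sort_buffer.append((r, pos))
        rows_examined += 1

    # Step 6: sort by R inside sort_buffer; no table access, so no rows examined.
    sort_buffer.sort()

    # Step 7: use the position information to fetch the word of the first rows.
    result = []
    for _r, pos in sort_buffer[:limit]:
        result.append(tmp_table[pos][1])
        rows_examined += 1
    return result, rows_examined


words = [f"w{i}" for i in range(10000)]
picked, examined = order_by_rand_limit(words, 3, random.Random(17))
print(picked)
print(examined)  # 20003
```

The count is 10000 + 10000 + 3 regardless of which words happen to be picked, which matches the Rows_examined value in the slow log.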

As an aside, you can do this often when learning concepts: first work out the number of scanned rows by analyzing the principle, then check the slow query log to verify your conclusion. I do this often myself. It is a very enjoyable process: I am happy when the analysis is right, and just as happy when it is wrong but I figure out why.

3, Flowchart of the complete random sorting process

Now, let me draw the flowchart of the complete sorting process.

Figure 4: Flowchart of the complete random sorting process

The pos in the figure is the position information. You may be wondering what "position information" means here. In the previous article, when sorting an InnoDB table, we clearly still used the ID field.

4, How a MySQL table locates "a row of data"

At this point, we have to go back to a basic concept: what method a MySQL table uses to locate "a row of data".

In articles 4 and 5, which introduced indexes, several readers asked: if the primary key of an InnoDB table is dropped, does that mean there is no primary key and therefore no way to go back to the table?

Actually, no. If you create a table without a primary key, or drop the primary key of a table, InnoDB generates a 6-byte rowid of its own to serve as the primary key.

This is where the name rowid in the sort mode comes from. What it actually represents is the information each engine uses to uniquely identify a data row.

  1. For an InnoDB table with a primary key, the rowid is the primary key ID;
  2. For an InnoDB table without a primary key, the rowid is generated by the system;
  3. The MEMORY engine does not use index-organized tables. In this example, you can think of the table as an array, so the rowid is simply the array index.

To summarize briefly: order by rand() uses an in-memory temporary table, and the in-memory temporary table is sorted with the rowid sorting method.

Third, the disk temporary table

1, So, are all temporary tables memory tables?

Actually, no. The tmp_table_size setting limits the size of an in-memory temporary table; its default value is 16M. If the temporary table grows beyond tmp_table_size, the in-memory temporary table is converted into a disk temporary table.
The engine used by disk temporary tables is InnoDB by default, and it is controlled by the parameter internal_tmp_disk_storage_engine.

When a disk temporary table is used, the process corresponds to sorting an InnoDB table that has no explicit index. To reproduce this process, I set tmp_table_size to 1024, sort_buffer_size to 32768, and max_length_for_sort_data to 16.

set tmp_table_size=1024;
set sort_buffer_size=32768;
set max_length_for_sort_data=16;
/* open optimizer_trace, effective only for this thread */
SET optimizer_trace='enabled=on';

/* execute the statement */
select word from words order by rand() limit 3;

/* view the OPTIMIZER_TRACE output */
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G

 

 

Figure 5: Part of the OPTIMIZER_TRACE output

Then, let's look at the result of this OPTIMIZER_TRACE.

Because max_length_for_sort_data is set to 16, which is smaller than the defined length of the word field, sort_mode shows rowid sorting. This is as expected: the rows taking part in the sort consist of the random value field R and the rowid field.

At this point you may have done some mental arithmetic and found that something is off. The random value stored in field R takes 8 bytes and the rowid takes 6 bytes (as for why it is 6 bytes, I leave that for you to think about after class). With 10,000 rows in total, that works out to 140,000 bytes, which exceeds the 32768 bytes defined by sort_buffer_size. Yet the value of number_of_tmp_files is actually 0. Is no temporary file needed?
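The mental arithmetic can be written out as a few lines of Python. This is only the back-of-the-envelope estimate from the text: it counts the 8-byte R value and the 6-byte rowid per row and ignores any per-row overhead MySQL may add.

```python
R_BYTES = 8            # random value R, a double
ROWID_BYTES = 6        # rowid of the disk temporary table
ROWS = 10000
SORT_BUFFER_SIZE = 32768

needed = (R_BYTES + ROWID_BYTES) * ROWS
print(needed)                     # 140000
print(needed > SORT_BUFFER_SIZE)  # True
```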

2, The priority queue sorting algorithm

The sort for this SQL statement indeed uses no temporary files. It uses a sorting algorithm introduced in MySQL 5.6: the priority queue sorting algorithm. Next, let's see why the algorithm that uses temporary files, namely merge sort, was not used, and the priority queue sorting algorithm was used instead.

In fact, our current SQL statement only needs the three rowids with the smallest R values. With merge sort, the first three values can eventually be obtained, but by the time the algorithm finishes, all 10,000 rows have been sorted.

In other words, the remaining 9,997 rows are also in order, even though our query does not need them to be. A moment's thought makes it clear that this wastes a great deal of computation.

The priority queue algorithm, by contrast, obtains exactly the three smallest values. The execution flow is as follows:

1. Of the 10,000 (R, rowid) rows to be sorted, take the first three and build them into a heap (readers with only a vague memory of data structures can picture it as an array of three elements);

2. Take the next row (R', rowid') and compare it with the largest R currently in the heap. If R' is less than R, remove that (R, rowid) from the heap and replace it with (R', rowid');

3. Repeat step 2 until the 10,000th (R', rowid') has been compared.

Here, I have drawn a simple schematic diagram of the priority queue sorting process.

3, Schematic diagram of the priority queue sorting process

Figure 6: Example of the priority queue sorting algorithm

Figure 6 simulates six (R, rowid) rows and the process of finding the three rows with the smallest R values through priority queue sorting. Throughout the sort, so that the current maximum of the heap can be obtained as quickly as possible, the maximum value is always kept at the top of the heap; this is therefore a max-heap.
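The same idea can be sketched in a few lines of Python. This is an illustration of the technique, not MySQL's code: Python's heapq module is a min-heap, so the sketch stores -R to obtain max-heap behaviour, and the function name smallest_k is made up for this example.

```python
import heapq
import random


def smallest_k(rows, k):
    """Keep the k rows with the smallest R using a max-heap of size k."""
    heap = []  # holds (-R, rowid); the largest R is always at heap[0]
    for r, rowid in rows:
        if len(heap) < k:
            # Step 1: the first k rows build the heap.
            heapq.heappush(heap, (-r, rowid))
        elif r < -heap[0][0]:
            # Step 2: R' is smaller than the largest R in the heap, so swap it in.
            heapq.heapreplace(heap, (-r, rowid))
    # Only these k survivors are put in order at the end.
    return sorted((-neg_r, rowid) for neg_r, rowid in heap)


rng = random.Random(6)
rows = [(rng.random(), rowid) for rowid in range(10000)]
top3 = smallest_k(rows, 3)
print(top3 == sorted(rows)[:3])  # True
```

The heap never holds more than three entries, which is why the memory needed does not depend on the 10,000 input rows.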

In the OPTIMIZER_TRACE result in Figure 5, chosen=true in the filesort_priority_queue_optimization section means that the priority queue sorting algorithm was used. This process needs no temporary files, so the corresponding number_of_tmp_files is 0.

When this process ends, the heap we built holds the three rows with the smallest R values among the 10,000 rows. Their rowids are then taken out one by one and used to fetch the word field from the temporary table, which is the same as the rowid sorting process in the previous article.

Let's look again at the SQL query from the previous article:

select city,name,age from t where city='Hangzhou' order by name limit 1000;

You may ask: limit is used here as well, so why was the priority queue sorting algorithm not used? The reason is that this SQL statement has limit 1000. With the priority queue algorithm, the heap to be maintained would hold 1000 rows of (name, rowid), which exceeds the sort_buffer_size I set, so only the merge sort algorithm can be used.

In short, whichever type of temporary table is used, writing order by rand() makes the computation very complex and requires a large number of scanned rows, so the sorting process consumes a lot of resources.

Let's go back to the question at the beginning of the article: how do we do random sorting correctly?

Fourth, random sorting methods

1, Random algorithm 1

Let's first simplify the problem: if we only need to randomly select one word value, how can it be done? The idea is as follows:

  • 1. Get the maximum value M and the minimum value N of the table's primary key id;
  • 2. Use a random function to generate a number between the minimum and the maximum: X = (M-N)*rand() + N;
  • 3. Take the row with the first ID that is not less than X.

Let's call this algorithm random algorithm 1 for now. Here are the statements to execute, in order:

mysql> select max(id),min(id) into @M,@N from t ;
set @X= floor((@M-@N+1)*rand() + @N);
select * from t where id >= @X limit 1;

This method is very efficient, because neither max(id) nor min(id) requires scanning the index, and the select in the third step can also use the index to locate the row quickly, so it can be considered to scan only three rows. In fact, however, the algorithm itself does not strictly meet the randomness requirement of the problem: there may be gaps in the IDs, so different rows are selected with different probabilities, and the result is not truly random.

For example, if you have four ids, 1, 2, 4 and 5, then with the method above the probability of getting the row with id=4 is twice the probability of getting any other row.

What if the ids of these four rows are 1, 2, 40000 and 40001? The algorithm can then basically be treated as a bug.
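The skew can be computed exactly. The sketch below assumes that @X, as computed by floor((@M-@N+1)*rand() + @N), is uniform over the integers from N to M, and that "where id >= @X limit 1" returns the smallest such id. The function name algorithm1_distribution is only illustrative.

```python
from fractions import Fraction


def algorithm1_distribution(ids):
    """Exact probability that random algorithm 1 returns each id."""
    ids = sorted(ids)
    span = ids[-1] - ids[0] + 1            # number of possible integer values of X
    probs = {ids[0]: Fraction(1, span)}    # only X = min(id) selects the first row
    for prev, cur in zip(ids, ids[1:]):
        # Every X with prev < X <= cur lands on cur, so a gap before cur adds weight.
        probs[cur] = Fraction(cur - prev, span)
    return probs


for row_id, p in algorithm1_distribution([1, 2, 4, 5]).items():
    print(row_id, p)   # id=4 gets 2/5, the other ids get 1/5 each

for row_id, p in algorithm1_distribution([1, 2, 40000, 40001]).items():
    print(row_id, p)   # id=40000 gets 39998/40001
```

With ids 1, 2, 40000 and 40001, the row with id=40000 is returned almost every time.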

2, Random algorithm 2

Therefore, to get strictly random results, you can use the following procedure:

  • 1. Get the number of rows in the entire table and denote it by C.
  • 2. Compute Y = floor(C * rand()). The floor function here takes the integer part.
  • 3. Then use limit Y,1 to fetch one row.

Let's call this algorithm random algorithm 2. The following code is the sequence of statements for the procedure above.

mysql> select count(*) into @C from t;
set @Y = floor(@C * rand());
set @sql = concat("select * from t limit ", @Y, ",1");
prepare stmt from @sql;
execute stmt;
DEALLOCATE prepare stmt;

Because the parameter after limit cannot be a variable directly, I used the prepare + execute approach in the code above. You can also assemble the SQL statement in the application, which is simpler.

Random algorithm 2 solves the obvious problem of uneven probability in algorithm 1.

MySQL handles limit Y,1 by reading rows out one by one in order, discarding the first Y, and returning the next record as the result, so this step scans Y+1 rows. Adding the C rows scanned in the first step, a total of C+Y+1 rows are scanned, so the execution cost is higher than that of random algorithm 1.
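A small Python model makes both properties visible: the pick is uniform even when the ids have gaps, and the number of scanned rows is C+Y+1. This is a sketch under simple assumptions: the table is a list of ids in primary key order, rng.randrange(c) plays the role of floor(C * rand()), and the name random_algorithm_2 is illustrative.

```python
import random
from collections import Counter


def random_algorithm_2(rows, rng):
    """Pick one row by offset; return the row and the modeled number of scanned rows."""
    c = len(rows)                # select count(*): scans C rows
    y = rng.randrange(c)         # stands in for floor(C * rand()), so 0 <= y < C
    picked = rows[y]             # limit Y,1: read Y+1 rows, keep only the last one
    rows_scanned = c + (y + 1)
    return picked, rows_scanned


ids = [1, 2, 40000, 40001]       # the gaps in the ids no longer matter
rng = random.Random(2)
counts = Counter(random_algorithm_2(ids, rng)[0] for _ in range(40000))
print(counts)                    # each id is picked roughly 10000 times
```

In this model the scanned-row count always lies between C+1 and 2C, depending on the Y that is drawn.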

Of course, compared with using order by rand() directly, the execution cost of random algorithm 2 is still much smaller.

You might ask: if this table has 10,000 rows, then C=10000, and if a relatively large Y is drawn, the number of scanned rows is close to 20,000, about the same as for order by rand(). So why is the cost of random algorithm 2 said to be much smaller? I'll leave that question for you to think about after class.

3, Random algorithm 3

Now let's see: following the idea of random algorithm 2, what if we want to randomly select three word values? You can do this:

  • 1. Get the number of rows in the entire table and denote it by C;
  • 2. Use the same random method to obtain Y1, Y2 and Y3;
  • 3. Then execute three limit Y,1 statements to get three rows of data.

Let's call this algorithm random algorithm 3. The following code is the sequence of statements for the procedure above.

mysql> select count(*) into @C from t;
set @Y1 = floor(@C * rand());
set @Y2 = floor(@C * rand());
set @Y3 = floor(@C * rand());
select * from t limit @Y1,1; -- in application code, read the values of Y1, Y2 and Y3 and assemble the SQL before executing
select * from t limit @Y2,1;
select * from t limit @Y3,1;
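Since limit does not accept a user variable directly, the three statements have to be assembled in application code, as the comment says. A minimal Python sketch of that step might look like the following; it assumes the application has already read the row count C with select count(*), and the helper name build_random_row_queries is hypothetical. Sending the statements to MySQL is left out.

```python
import random


def build_random_row_queries(row_count, n=3, rng=random):
    """Return n 'limit Y,1' statements with each Y drawn uniformly from 0..row_count-1."""
    # randrange stands in for floor(C * rand()); the offsets are plain integers,
    # so formatting them into the SQL text is safe.
    offsets = [rng.randrange(row_count) for _ in range(n)]
    return [f"select * from t limit {y},1" for y in offsets]


for sql in build_random_row_queries(10000, rng=random.Random(3)):
    print(sql)
```

As in the statement sequence above, the three offsets are drawn independently, so two of them can occasionally coincide.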

Fifth, summary

In today's article, I used the requirement of random sorting to introduce how MySQL sorts temporary tables.

If you use order by rand() directly, the statement needs Using temporary and Using filesort, and the execution cost of the query tends to be relatively high. So try to avoid writing it this way when you design.

In today's examples, we did not solve the problem only inside the database; we also had the application code assemble SQL statements. In practice, the more standard approach is to put business logic in the business code as far as possible and let the database do only the job of "reading and writing data". Methods of this kind are therefore quite widely used.

Finally, I will leave you with a question to think about.

The total number of rows scanned by random algorithm 3 above is C + (Y1+1) + (Y2+1) + (Y3+1). In fact, it can be optimized further to reduce the number of scanned rows.

My question is: if you were the developer for this requirement, what would you do to reduce the number of scanned rows? Describe your approach and state how many rows it needs to scan.

You can write your design and conclusions in the comments section, and I will discuss this question with you at the end of the next article. Thank you for listening, and you are welcome to share this article with more friends to read together.

Sixth, the previous question

The question I left you at the end of the previous article was: does the SQL statement select * from t where city in ("Hangzhou", "Suzhou") order by name limit 100; need sorting? Is there a way to avoid the sort?

There is a joint index on (city, name), and within a single city, name is increasing. However, this SQL statement does not look up a single city value; it looks up the two cities "Hangzhou" and "Suzhou" at the same time, so across all rows that satisfy the condition, name is not increasing. In other words, this SQL statement needs sorting. So how can the sort be avoided?

Here, we make use of the properties of the (city, name) joint index and split this statement into two. The execution flow is as follows:

  • 1. Execute select * from t where city = "Hangzhou" order by name limit 100; this statement needs no sorting. The client stores the result in an in-memory array A of length 100.
  • 2. Execute select * from t where city = "Suzhou" order by name limit 100; in the same way, and store the result in an in-memory array B.
  • 3. A and B are now two sorted arrays, so you can apply the idea of merge sort to obtain the 100 rows with the smallest name values, which is the result we need.
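The client-side merge in step 3 can be sketched in Python as follows. The rows are modeled as dictionaries with a name key and the sample data is made up; fetching A and B from MySQL is left out. heapq.merge lazily merges inputs that are already sorted, so no full sort is performed.

```python
import heapq
from itertools import islice


def merge_top(a, b, limit):
    """Merge two row lists already sorted by name and keep the first `limit` rows."""
    merged = heapq.merge(a, b, key=lambda row: row["name"])
    return list(islice(merged, limit))


# Tiny stand-ins for the two result sets, each already ordered by name.
a = [{"city": "Hangzhou", "name": n} for n in ["Ann", "Cai", "Eve"]]
b = [{"city": "Suzhou", "name": n} for n in ["Bob", "Dan", "Fay"]]
print([row["name"] for row in merge_top(a, b, 4)])  # ['Ann', 'Bob', 'Cai', 'Dan']
```

For the statement in the question, the call would be merge_top(A, B, 100).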

If "limit 100" in this SQL statement is changed to "limit 10000,100", the approach is much the same: rewrite the two statements above as

select * from t where city="Hangzhou" order by name limit 10100;

and

select * from t where city="Suzhou" order by name limit 10100;

The amount of data is larger now. You can read the two result sets row by row over two connections at the same time and apply the merge sort algorithm to them; the rows ranked 10001 to 10100 by name are the desired result.

Of course, this approach has an obvious cost: a large amount of data is returned from the database to the client.

So, if a single row is large, you can consider rewriting the two SQL statements as follows:

select id,name from t where city="Hangzhou" order by name limit 10100;

and

select id,name from t where city="Suzhou" order by name limit 10100;

Then merge-sort the results to obtain the name and id values ranked 10001 to 10100 by name, and take those 100 ids back to the database to look up the full records.
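The paging variant can be sketched the same way. Here each result row is modeled as a (name, id) tuple, the sample data is made up, and the function name page_ids is illustrative; the final lookup of full records by id is not shown.

```python
import heapq
from itertools import islice


def page_ids(a, b, offset, limit):
    """a, b: (name, id) rows sorted by name; return the ids ranked offset+1 .. offset+limit."""
    merged = heapq.merge(a, b, key=lambda row: row[0])   # lazy merge by name
    return [row_id for _name, row_id in islice(merged, offset, offset + limit)]


# Stand-ins for the two id,name result sets: even ids in one city, odd ids in the other.
a = [(f"name{2 * i:05d}", 2 * i) for i in range(50)]
b = [(f"name{2 * i + 1:05d}", 2 * i + 1) for i in range(50)]
print(page_ids(a, b, 10, 5))  # [10, 11, 12, 13, 14]
```

For "limit 10000,100", the call would be page_ids(A, B, 10000, 100), and the returned ids are then used to query the full rows.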

With all of the methods above, you need to weigh performance requirements against development complexity.

Origin www.cnblogs.com/luoahong/p/11640924.html