Why not use join? "Deadly Kick MySQL Series Sixteen"

Hello everyone, I'm Kaka. Don't expect overnight success; advance a little every day.

In day-to-day development, joins are used very frequently. Many SQL optimization posts even advise rewriting subqueries as joins to improve performance, yet the DBAs at some companies forbid joins altogether. So what exactly is the problem with using them?


1. What is Nested-Loop Join

In MySQL, joins are executed with the Nested-Loop Join algorithm (in Chinese, 嵌套循环连接), which is implemented by three variants:

  • Index Nested-Loop Join, abbreviated NLJ
  • Block Nested-Loop Join, abbreviated BNL
  • Simple Nested-Loop Join, abbreviated SNLJ

Roughly, these are the index nested-loop join, the buffered-block nested-loop join, and the brute-force nested-loop join. The order above is also the priority in which MySQL chooses among them.

From its name, Simple Nested-Loop Join sounds like the simplest and fastest, but in practice MySQL does not use it; it optimizes such joins into Block Nested-Loop Join instead. Let's explore the details through a series of questions.

Still not sure what a nested loop join actually means? It's simpler than it sounds; a small case makes it clear.

Suppose there is an article table article and a comment table article_detail, and the requirement is to query all comments for the articles on the current homepage. The SQL would look like this:

select * from article a left join article_detail b on a.id = b.article_id;

Described in code, the execution of this SQL is roughly a pair of nested loops. The bubble sort below, written with a slice and a double loop, is a handy analogy for how a join works: the outer for plays the role of the driving table, the inner for the driven table.

func bubbleSort(arr []int) {
    // outer loop: plays the role of the driving table
    for j := 0; j < len(arr)-1; j++ {
        // inner loop: plays the role of the driven table
        for i := 0; i < len(arr)-1-j; i++ {
            if arr[i] > arr[i+1] {
                arr[i], arr[i+1] = arr[i+1], arr[i] // swap adjacent elements
            }
        }
    }
}
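For comparison, here is what a nested loop join itself looks like as a double loop. This is an illustrative sketch only (the Article and Comment struct types are hypothetical, not from the original post), mirroring "article LEFT JOIN article_detail ON a.id = b.article_id" for the matching rows:

```go
package main

import "fmt"

// Article and Comment are hypothetical row types used only for illustration.
type Article struct{ ID int }
type Comment struct{ ArticleID, UserID int }

// nestedLoopJoin walks the driving table in the outer loop and the driven
// table in the inner loop, emitting (article id, user id) pairs for matches.
func nestedLoopJoin(articles []Article, comments []Comment) [][2]int {
	var rows [][2]int
	for _, a := range articles { // driving table
		for _, c := range comments { // driven table
			if a.ID == c.ArticleID {
				rows = append(rows, [2]int{a.ID, c.UserID})
			}
		}
	}
	return rows
}

func main() {
	articles := []Article{{1}, {2}}
	comments := []Comment{{1, 10}, {2, 20}, {2, 21}}
	fmt.Println(nestedLoopJoin(articles, comments)) // [[1 10] [2 20] [2 21]]
}
```

Note the cost structure: the inner loop runs once per driving-table row, which is exactly why the choice of algorithm for the inner lookup matters so much.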

Now you know what Nested-Loop Join is and the three algorithms that implement it. Next, let's walk through each algorithm and see why some teams say "don't use join".

2. Index Nested-Loop Join

To keep the optimizer from rewriting the SQL on its own, the queries below use STRAIGHT_JOIN.

Why STRAIGHT_JOIN? During development, have you ever seen a table that was clearly meant to be the driving table inexplicably become the driven table? In MySQL, when a join condition is specified, the table with the smaller number of rows satisfying the condition becomes the driving table; when no condition is specified, the table with fewer scanned rows does. The optimizer always arranges the execution order so that the small table drives the large one. STRAIGHT_JOIN forces the left table to be the driving table, which keeps the examples deterministic.

Index Nested-Loop Join is an index-based algorithm: the index lives on the driven table, and each row from the driving table is matched directly against that index instead of being compared with every record of the driven table. This reduces the number of matches against the driven table and thereby improves join performance.

Prerequisites for use

The prerequisite for index nested-loop join is that the join column of the driven table carries an index.

Next, let's analyze the exact execution flow of index nested-loop join with a concrete case. The SQL below creates all the tables and data; you can copy and run it directly.

CREATE TABLE `article` (
    `id` INT(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
    `author_id` INT(11) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=INNODB CHARSET=utf8mb4 COLLATE utf8mb4_general_ci COMMENT='article table';

CREATE PROCEDURE idata()
BEGIN
    DECLARE i INT;
    SET i = 1;
    WHILE (i <= 1000) DO
        INSERT INTO article VALUES (i, i);
        SET i = i + 1;
    END WHILE;
END;

CALL idata();

CREATE TABLE `article_comment` (
    `id` INT(11) NOT NULL AUTO_INCREMENT COMMENT 'ID',
    `article_id` INT(11) NOT NULL COMMENT 'article ID',
    `user_id` INT(11) NOT NULL COMMENT 'user ID',
    PRIMARY KEY (`id`),
    INDEX `idx_article_id` (`article_id`)
) ENGINE=INNODB CHARSET=utf8mb4 COLLATE utf8mb4_german2_ci COMMENT='user comment table';

DROP PROCEDURE idata;

CREATE PROCEDURE idata()
BEGIN
    DECLARE i INT;
    SET i = 1;
    WHILE (i <= 1000) DO
        INSERT INTO article_comment VALUES (i, i, i);
        SET i = i + 1;
    END WHILE;
END;

CALL idata();

At this point, both the article and article_comment tables contain 1000 rows of data.

The requirement is to view all the comment information for the articles; the SQL is as follows:

SELECT * FROM article STRAIGHT_JOIN article_comment ON article.id = article_comment.article_id;

Now, let's look at the explain result of this statement.

[explain screenshot: article_comment is accessed via the idx_article_id index]

As the explain output shows, the driven table article_comment uses the index on its article_id column, so the statement executes as follows:

  • Read a row of data R from the article table
  • Take the id field from R and use it to look up rows in article_comment
  • Take the matching rows from article_comment and join each with R to form result rows
  • Repeat the first three steps until the scan of the article table completes

In this process, let's simply tally the number of scanned rows:

  • The article table requires a full table scan: 1000 rows
  • For each row R, the id from the article table is used for a tree search on article_comment's index; the rows here correspond one to one, so each lookup scans exactly one row, for another 1000 rows in total
  • This execution flow therefore scans 2000 rows in total
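The tally above can be simulated in Go. This is a rough model, not MySQL internals: a map stands in for the idx_article_id index, so each driving-table row costs one indexed probe instead of a scan of the whole driven table.

```go
package main

import "fmt"

// indexNestedLoopScans models the scan count of Index Nested-Loop Join for
// n rows in the driving table, with exactly one matching row per id in the
// driven table: n driver rows scanned, plus one indexed row per probe.
func indexNestedLoopScans(n int) int {
	// Build the driven table's "index": article_id -> user_id (1:1 here).
	index := make(map[int]int, n)
	for i := 1; i <= n; i++ {
		index[i] = i
	}
	scanned := 0
	for id := 1; id <= n; id++ { // full scan of the driving table: n rows
		scanned++
		if _, ok := index[id]; ok { // index lookup hits exactly one row
			scanned++
		}
	}
	return scanned
}

func main() {
	fmt.Println(indexNestedLoopScans(1000)) // 2000 rows scanned in total
}
```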

Implementing it in application code instead

  • Scan the full article table: 1000 rows
  • Loop over those 1000 rows
  • For each row, run a separate query using the article's id as the condition

The scanned row count is also 2000, but leaving performance aside, this version talks to MySQL 1001 times: one query for the article list plus 1000 point queries.

Conclusion

Clearly, using join directly is the better choice here.

3. Simple Nested-Loop Join

Simple nested-loop join is what happens when the join uses no index: a brute-force nested loop. With 1000 rows in both article and article_comment, the number of scanned rows is 1000 × 1000 = 1,000,000. You can imagine the query efficiency.

Execute the SQL as follows

SELECT * FROM article STRAIGHT_JOIN article_comment ON article.author_id = article_comment.user_id;

In this process:

  • The driving table article is scanned in full: 1000 rows
  • For each row read from the driving table, article_comment must be scanned in full, since no index can be used
  • So every row of the driving table costs a full table scan of the driven table
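Under the same 1000-row tables, that cost can be sketched like this (again a rough model, not MySQL internals): every driving-table row triggers a full scan of the driven table.

```go
package main

import "fmt"

// simpleNestedLoopScans models Simple Nested-Loop Join without an index:
// each of the n driving rows forces a full m-row scan of the driven table,
// so the driven table contributes n*m scanned rows.
func simpleNestedLoopScans(n, m int) int {
	scanned := 0
	for i := 0; i < n; i++ { // one pass over the driving table
		for j := 0; j < m; j++ { // full scan of the driven table, every time
			scanned++
		}
	}
	return scanned
}

func main() {
	fmt.Println(simpleNestedLoopScans(1000, 1000)) // 1000000
}
```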

And these are two very small tables. Production tables often hold tens of millions of rows; if MySQL really used this algorithm, it would hardly enjoy its current popularity.

Of course, MySQL does not use this algorithm; it uses block nested-loop join instead, and the "read in blocks" idea appears in many places in MySQL.

An aside

For example, indexes are stored on disk, and whenever an index is used to retrieve data, that data is read from disk into memory in blocks, not one row at a time.

If the operating system is asked to read 1KB from disk, it will actually read 4KB, because one OS page is 4KB; in the InnoDB storage engine, the default page size is 16KB.

MySQL reads data in blocks because of the principle of locality: data and programs tend to cluster into groups. Once a row has been accessed, there is a good chance that the same row, or rows adjacent to it, will be accessed again soon.

4. Block Nested-Loop Join

The analysis above shows that the simple nested query approach is a non-starter, so MySQL processes the join in blocks instead.

The execution flow then becomes:

  • The rows read from the driving table article are stored in join_buffer; since this is an unconditional select, the whole article table is put into memory
  • Each row of article_comment is then compared, one by one, against the rows held in join_buffer

Correspondingly, the explain result of this SQL is as follows

[explain screenshot: Extra shows "Using join buffer (Block Nested Loop)"]

To reproduce Block Nested Loop, Kaka installed three versions of MySQL: 8.0, 5.5, and 5.7. The latter two use Block Nested Loop, but this changed in MySQL 8.0.

[explain screenshot from MySQL 8.0: the plan shows a hash join instead]

Hash join will be covered in the next issue. During this query, both article and article_comment get a full table scan, so the number of scanned rows is 2000.

The article rows read into join_buffer are stored as an unordered array, so each of the 1000 rows in the article_comment table requires 1000 judgments, for 1000 × 1000 = 1,000,000 comparisons in total.
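A rough Go model of that flow (join_buffer represented as a plain slice; illustrative only, assuming the whole driving table fits in the buffer): the driver rows are loaded into memory once, the driven table is scanned a single time, and every driven row is compared against the whole buffer.

```go
package main

import "fmt"

// blockNestedLoopCost models Block Nested-Loop Join for n driver rows and
// m driven rows, assuming the whole driving table fits in join_buffer.
// It returns (rows scanned from disk, in-memory comparisons).
func blockNestedLoopCost(n, m int) (scanned, compares int) {
	// Step 1: read the driving table into join_buffer (n rows scanned).
	joinBuffer := make([]int, 0, n)
	for i := 1; i <= n; i++ {
		joinBuffer = append(joinBuffer, i)
		scanned++
	}
	// Step 2: scan the driven table once; compare each row to the buffer.
	for j := 1; j <= m; j++ {
		scanned++
		for range joinBuffer {
			compares++ // memory comparison, far cheaper than a row scan
		}
	}
	return scanned, compares
}

func main() {
	s, c := blockNestedLoopCost(1000, 1000)
	fmt.Println(s, c) // 2000 1000000
}
```

The comparison count matches the simple algorithm, but the scan count drops from a million to 2000, which is the whole point of the buffer.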

You may notice that the number of comparisons in block nested-loop join is the same as in the simple nested query, but Block Nested Loop does its comparisons inside the join_buffer memory area, so it is much faster than the Simple variant.

5. Summary

In this issue, let's wrap up with three questions to help you digest the whole article.

The first question: can we use join?

After the three demonstrations, you should now know: when the join column is an indexed column of the driven table, there is no problem at all. In other words, when the index nested-loop join algorithm applies, you can use join freely.

However, when a block nested-loop join is used, the number of comparisons is the product of the two tables' row counts. The cost is enormous and eats a lot of system resources, so joins that fall back to this algorithm are not recommended.

Therefore, when using join, make sure as far as possible that the join column is an indexed column of the driven table. If that condition cannot be met, consider whether the table design is reasonable.

The second question: If using join, choose a large table or a small table as the driving table?

Good habits are formed slowly, so remember this rule of thumb first: no matter what, let the small table drive the large one.

With the Index Nested-Loop Join algorithm, the small table should be chosen as the driving table.

With Block Nested-Loop Join, when join_buffer is large enough, it makes no difference whether a large or a small table drives; but when join_buffer has not been manually set to a larger value, the small table should still be the driving table.

It also helps to know that the default value of join_buffer_size is 256KB in MySQL 8.0.
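You can check the current value and raise it per session, for example:

```sql
-- Check the current value (262144 bytes = 256KB by default)
SHOW VARIABLES LIKE 'join_buffer_size';

-- Raise it for the current session only, e.g. to 4MB
SET SESSION join_buffer_size = 4 * 1024 * 1024;
```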

The third question: what kind of table is a small table?

The "small table" here is not simply the table with a tiny amount of data; don't make that mistake. In real-world SQL, the vast majority of queries apply filter conditions.

Which table counts as small is decided by how many rows each table contributes under the same conditions: the one with the smaller filtered row count is the small table.

Recommended reading

Deadly Kick MySQL Series: Table of Contents

Open the door of order by and take a look at "Deadly Kick MySQL Series Twelve"

Heavy blockade, so that you can't get a single piece of data "Deadly Kick MySQL Series Thirteen"

Trouble, the production environment performed a DDL operation "Deadly Kick MySQL Series Fourteen"

Talk about the locking rules of MySQL "Deadly Kick MySQL Series Fifteen"

Persistence in learning, in writing, and in sharing are the beliefs Kaka has held throughout this career. I hope these articles bring you a little help on the vast Internet. I'm Kaka; see you next issue.


Origin blog.csdn.net/fangkang7/article/details/123142694