MySQL - Efficiency comparison and optimization of Left Join and Inner Join

While writing code recently, I ran into a problem that required joining several tables. The initial SQL looked something like this:

select * 
from a 
left join b on a.id = b.aid 
left join c on c.bid = b.id 
left join d on d.cid = c.id

Chaining several left joins like this always felt questionable to me, and when I later switched to inner join, the query turned out to be faster than with left join.

1. Why is left join slower than inner join

(1) About the amount of logical operations

Everyone knows the concept of left join: every row of the left table is returned, and null is returned for the right-table columns when no right-table row satisfies the join condition. Comparing only the amount of logical work, inner join has to return just the intersection of the two tables, whereas left join additionally returns the left-table rows that have no match.

(2) MySQL Nested-Loop Join Algorithm

Nested-Loop Join Algorithms

This is MySQL's default join algorithm. It resembles three nested loops in program code:

Table   Join Type
t1      range
t2      ref
t3      ALL

for each row in t1 matching range {
  for each row in t2 matching reference key {
    for each row in t3 {
      if row satisfies join conditions, send to client
    }
  }
}

From an algorithmic point of view, and according to the MySQL documentation, for an inner join MySQL automatically picks the smaller table as the driving table, which reduces the number of loop iterations. With a left join, the left table is the driving table by default, so its size is under our control. If we choose poorly and the left table is the larger one, the number of loop iterations naturally increases and efficiency drops.

The pseudocode is simple. It assumes three tables, t1, t2 and t3, whose access types in the explain plan are range, ref and ALL respectively; ALL means that t3 is read with a full table scan. The driving table is t1 in the pseudocode. For an inner join, MySQL automatically picks the table with the smallest result set as the driving table, and from the algorithm's point of view that is indeed the cheapest choice. This is also why join optimization focuses on reducing the result set of the driving table: under this algorithm, a driving table with a smaller result set does reduce the number of loop iterations.
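The effect of the driving table's size can be sketched in a few lines of Python. This is only an illustration, not MySQL's implementation: the function name `indexed_nested_loop_join`, the in-memory hash index (standing in for an index on the inner table's join column) and the table sizes are all assumptions made for the example.

```python
from collections import defaultdict


def indexed_nested_loop_join(driving, inner, key):
    """Join two lists of dict rows on `key`; return (joined_rows, index_lookups)."""
    # Hash index on the inner table's join column. It stands in for an index
    # that a real database would already maintain on that column.
    index = defaultdict(list)
    for row in inner:
        index[row[key]].append(row)

    joined, lookups = [], 0
    for outer_row in driving:            # one pass over the driving table
        lookups += 1                     # one index lookup per driving row
        for inner_row in index.get(outer_row[key], []):
            joined.append((outer_row, inner_row))
    return joined, lookups


# Hypothetical data: a 40,000-row table and a 10,000-row table sharing "id".
big = [{"id": i, "src": "big"} for i in range(40_000)]
small = [{"id": i, "src": "small"} for i in range(0, 40_000, 4)]

rows_small_first, lookups_small_first = indexed_nested_loop_join(small, big, "id")
rows_big_first, lookups_big_first = indexed_nested_loop_join(big, small, "id")

# Both orders find the same number of matches, but the number of lookups
# equals the number of rows in the driving table.
print(len(rows_small_first), lookups_small_first)
print(len(rows_big_first), lookups_big_first)
```

With the small table driving, the loop performs one lookup per small-table row; with the large table driving, it performs four times as many lookups for the same result.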

select c.* 
from hotel_info_original c
left join hotel_info_collection h on c.hotel_type = h.hotel_type 
and c.hotel_id =h.hotel_id
where h.hotel_id is null

This SQL finds the records in table c that are not in table h. I used left join for it because of its defining feature (every row of the left table is returned, with null for the right-table columns when nothing matches), but the query was unexpectedly slow. The query plan shows why.

In the plan, rows is the estimated number of rows that must be examined at a step for each row produced by the previous step. The number of rows this SQL has to scan is therefore 35773*8134, which is a very large number. Tables c and h hold 40,000+ and 10,000+ records respectively, so the cost is almost that of the Cartesian product of the two tables (select * from c,h).

Nested Loop Join uses the result set of the driving table as the base data of the loop. It takes each row of that result set as a filter condition, queries the next table row by row, and then merges the results. If a third table takes part in the join, the joined result of the first two tables becomes the base data of the loop, and the third table is queried in the same way. In other words, MySQL implements joins with the most easily understood algorithm. The choice of the driving table is therefore very important: a small driving table can significantly reduce the number of rows scanned.

So why is join generally much more efficient than left join?

Normally, the two tables in a join are one large and one small. With join, MySQL picks the small table as the driving table when there are no other filter conditions. Left join, however, is typically used to join a large table to a small one, and the semantics of left join force MySQL to use the large (left) table as the driving table, so efficiency is much worse. If I change the SQL above to

select c.* 
from hotel_info_original c
join hotel_info_collection h on c.hotel_type = h.hotel_type 
and c.hotel_id = h.hotel_id

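In the query plan for this statement, MySQL clearly chose the small table as the driving table and, combined with the index on (hotel_id, hotel_type), the query time immediately dropped by several orders of magnitude. Note that this join returns the rows of c that do have a match in h rather than the missing ones, so it illustrates the difference between the plans and is not a drop-in replacement for the original query.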

If the where clause contains a condition on a right-table column that rejects null values (that is, anything other than is null), the left join statement is equivalent to a join statement and can be rewritten as one directly.
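The equivalence can be checked on a toy data set. The sketch below uses Python's built-in `sqlite3` module rather than MySQL, purely because it runs anywhere; the tables c and h are cut down to a few rows, and the `city` column and its values are hypothetical additions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table c (hotel_id integer, hotel_type integer);
    create table h (hotel_id integer, hotel_type integer, city text);
    insert into c values (1, 1), (2, 1), (3, 2);
    insert into h values (1, 1, 'Beijing'), (3, 2, null);
""")

on_clause = "on c.hotel_type = h.hotel_type and c.hotel_id = h.hotel_id"

# "is null" case: rows of c with no partner in h. This must stay a left join.
missing = conn.execute(
    f"select c.hotel_id from c left join h {on_clause} "
    "where h.hotel_id is null order by c.hotel_id"
).fetchall()

# A where condition that rejects null on the right table discards every
# null-extended row, so left join and join return the same rows.
left_rows = conn.execute(
    f"select c.hotel_id from c left join h {on_clause} "
    "where h.city = 'Beijing' order by c.hotel_id"
).fetchall()
inner_rows = conn.execute(
    f"select c.hotel_id from c join h {on_clause} "
    "where h.city = 'Beijing' order by c.hotel_id"
).fetchall()

print(missing)                  # hotels that exist only in c
print(left_rows == inner_rows)  # the two join forms agree
```

The comparison `h.city = 'Beijing'` can never be true for a null-extended row, which is why the extra rows a left join produces never survive the where clause.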

Block Nested-Loop Join Algorithm

MySQL also provides the Block Nested-Loop join algorithm, which is derived from the algorithm above and works in basically the same way. The pseudocode is as follows:

for each row in t1 matching range {
  for each row in t2 matching reference key {
    store used columns from t1, t2 in join buffer
    if buffer is full {
      for each row in t3 {
        for each t1, t2 combination in join buffer {
          if row satisfies join conditions,
          send to client
        }
      }
      empty buffer
    }
  }
}

if buffer is not empty {
  for each row in t3 {
    for each t1, t2 combination in join buffer {
      if row satisfies join conditions,
      send to client
    }
  }
}

This algorithm caches rows from the outer loops in the join buffer. Each row read in the inner loop is then compared against all the rows in the buffer, which reduces the number of times the inner table has to be read and so improves efficiency. The official documentation gives an example that I found hard to follow: if 10 rows are cached in the buffer and the buffer is passed to the inner loop, every row read in the inner loop is compared with all 10 rows in the buffer. The original text reads:

For example, if 10 rows are read into a buffer and the buffer is passed to the next inner loop, each row read in the inner loop can be compared against all 10 rows in the buffer
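The documentation's 10-row example becomes easier to follow when the pseudocode above is turned into a small simulation. This is a sketch under simplifying assumptions: the buffer size is counted in rows instead of bytes, the t1 and t2 combinations are reduced to a plain list of integers, and the names `block_nested_loop_join` and `same` are made up for the example.

```python
def block_nested_loop_join(outer_rows, inner_rows, buffer_rows, match):
    """Return (joined_rows, inner_scans) using a join buffer of `buffer_rows` rows."""
    joined, inner_scans, buffer = [], 0, []

    def flush():
        nonlocal inner_scans
        inner_scans += 1                  # one full pass over the inner table
        for inner in inner_rows:
            for outer in buffer:          # compare against every buffered row
                if match(outer, inner):
                    joined.append((outer, inner))
        buffer.clear()                    # empty buffer

    for outer in outer_rows:
        buffer.append(outer)              # store the row in the join buffer
        if len(buffer) == buffer_rows:    # buffer is full
            flush()
    if buffer:                            # leftover rows after the loop
        flush()
    return joined, inner_scans


outer = list(range(100))                  # stands in for the t1, t2 combinations
inner = list(range(0, 100, 5))            # stands in for t3
same = lambda a, b: a == b

_, scans_unbuffered = block_nested_loop_join(outer, inner, 1, same)
result, scans_buffered = block_nested_loop_join(outer, inner, 10, same)
print(scans_unbuffered, scans_buffered, len(result))
```

With a buffer of one row the inner table is read once per outer row; with a buffer of 10 rows it is read once per batch of 10, while the joined result stays the same.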

If S is the size of each stored t1, t2 combination in the join buffer, and C is the number of such combinations, then the number of times table t3 is scanned is:

(S * C)/join_buffer_size + 1

According to this formula, the larger join_buffer_size is, the fewer scans are needed. Performance is best once join_buffer_size is large enough to hold all the previous row combinations; increasing it beyond that point has no further effect.
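Plugging numbers into the formula shows where the benefit stops. In the sketch below, the function name `t3_scans`, the use of integer division, and the sample values for S and C are assumptions chosen for illustration, not figures from a real workload.

```python
def t3_scans(s, c, join_buffer_size):
    """Number of t3 scans: (S * C) / join_buffer_size + 1, using integer division."""
    return (s * c) // join_buffer_size + 1


# Hypothetical workload: 50,000 combinations of 100 bytes each (about 5 MB).
s, c = 100, 50_000
for size in (256 * 1024, 1024 * 1024, 8 * 1024 * 1024, 16 * 1024 * 1024):
    print(size, t3_scans(s, c, size))
```

Once the buffer is larger than the roughly 5 MB of combinations, the result stays at one scan, so going from 8 MB to 16 MB changes nothing.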

Judged on both of these aspects, inner join clearly beats left join. Real business logic often does need left join, however, so the choice should follow the actual requirements. In my case the business did not really need left join, so I switched to inner join to connect the tables.

2. Optimization of left join

Based on our comparison above, we can basically summarize some simple optimization solutions. 

1. With left join, make the small table the driving table (this is basically the consensus).

2. If the left table is relatively large and the business requires it to be the driving table, use where conditions to filter it down to a smaller result set. The principle is the same as in the first point.

3. Index the join columns. MySQL's nested-loop algorithm looks up rows in the inner table by the join columns, so these columns need an index.

4. If the SQL sorts its result, add an index on the sort column; otherwise the sort cannot use an index and may require a full table scan.

5. If the where clause contains a condition on a right-table column that rejects null values (that is, anything other than is null), the left join statement is equivalent to a join statement and can be rewritten as one directly.

6. According to the documentation, MySQL can use indexes more efficiently on columns declared with the same type and size. The join columns of the tables should therefore use the same data type, character set and collation (the rules that determine how characters are compared).

7. The right-table columns used in the join condition must be indexed (primary key, unique index, prefix index, and so on), and the access type shown by explain should preferably be range or better (ref, eq_ref, const, system).



Origin blog.csdn.net/MinggeQingchun/article/details/129846517