What should you watch out for in MySQL multi-table joint queries?

Today, let's talk about multi-table joint queries in MySQL. Should a small table drive a large table, or should a large table drive a small table?

1. in VS exists

Before the formal analysis, let's look at the two keywords in and exists.

Suppose I have two tables, an employee table and a department table. Each employee belongs to a department, so the employee table stores the department id, and that column is indexed. The department table has columns such as id and name, where id is the primary key and name has a unique index.

For the experiment I will use the tables from the vhr project directly rather than providing a separate database script; you can check the vhr project (github.com/lenve/vhr) to get…

Suppose I now want to query all employees in the technical department. I have the following two query methods:

The first query method is to use the in keyword to query:

select * from employee e where e.departmentId in(select d.id from department d where d.name='技术部') limit 10;

This SQL is easy to understand. When it runs, the inner subquery is executed first (that is, the department table is queried first), and then the outer query is executed. We can look at its execution plan:

As you can see, MySQL queries the department table first, using an index if one exists and scanning the whole table otherwise. It then queries the employee table, again through an index, so the overall efficiency is fairly high.

The second is to use the exists keyword to query:

select * from employee e where exists(select 1 from department d where d.id=e.departmentId and d.name='技术部') limit 10;

The query result of this SQL is the same as the in keyword above, but the query process is different. Let's take a look at the execution plan of this SQL:

As you can see, MySQL first performs a full table scan on the employee table, then uses each employee's departmentId to look up the department table. In this SQL, the subquery counts as true if it returns any row and false if it returns none. If it is true, the employee record is kept; if it is false, the employee record is discarded. The select list of the subquery therefore doesn't matter: SELECT * can be changed to SELECT 1 or anything else. According to the MySQL documentation, the SELECT list of such a subquery is ignored during execution, so it makes no difference which one you write.
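To make the two evaluation orders concrete, here is a minimal Python sketch. It is not how MySQL is implemented; the sample rows and the function names query_with_in and query_with_exists are made up for illustration. It only mimics the order described above: in resolves the department ids first, while exists walks every employee and probes the department table.

```python
# Toy rows: the column names follow the article, the data itself is invented.
departments = [{"id": 1, "name": "技术部"}, {"id": 2, "name": "市场部"}]
employees = [
    {"id": 10, "departmentId": 1},
    {"id": 11, "departmentId": 2},
    {"id": 12, "departmentId": 1},
]


def query_with_in(employees, departments, dept_name):
    # Subquery first: collect the ids of the matching departments.
    dept_ids = {d["id"] for d in departments if d["name"] == dept_name}
    # Then filter the employee table against that set.
    return [e for e in employees if e["departmentId"] in dept_ids]


def query_with_exists(employees, departments, dept_name):
    # Outer table first: every employee row probes the department table.
    kept = []
    for e in employees:
        # any() only answers "is there at least one matching row?", which is
        # why the subquery's SELECT list (SELECT * or SELECT 1) is irrelevant.
        if any(d["id"] == e["departmentId"] and d["name"] == dept_name
               for d in departments):
            kept.append(e)
    return kept


# Both styles keep the same employees; only the evaluation order differs.
print(query_with_in(employees, departments, "技术部"))
print(query_with_exists(employees, departments, "技术部"))
```

Keep in mind that the real optimizer may rewrite either query, so the execution plan is always the final word.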

Comparing the number of scanned rows in the two query plans, we can roughly see the difference. Using in is slightly more efficient.

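If you query with the in keyword, the department table is read first and then the employee table. Generally speaking, the department table holds less data than the employee table, so this is a small table driving a large table, and the efficiency is relatively high.

If you query with the exists keyword, the employee table is read first and then the department table. Generally speaking, the department table holds less data than the employee table, so this is a large table driving a small table, and the efficiency is relatively low.

In short, a small table driving a large table is efficient, and a large table driving a small table is less efficient. So if the department table held more data than the employee table, then of the two SQL statements above, the one using the exists keyword would be the more efficient.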

2. Why should a small table drive a large table?

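In MySQL, this kind of multi-table joint query works as follows: taking the rows of the driving table as the base, MySQL matches them against the rows of the driven table in a way similar to the nested loops we write in Java code.

Take the tables from section 1 as an example. Suppose the employee table E is the large table with 10000 records, and the department table D is the small table with 100 records.

Suppose D drives E. The execution flow is then roughly as follows: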

for each of the 100 departments {
    match against the 10000 employees (B+ tree lookup)
}

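The 100 driving rows are each read once, and every one of them triggers a B+ tree lookup in E, so the total number of lookups is roughly 100 + 100 × log(10000).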

Suppose E drives D. The execution flow is then roughly as follows:

for each of the 10000 employees {
    match against the 100 departments (B+ tree lookup)
}

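The 10000 driving rows are each read once, and every one of them triggers a B+ tree lookup in D, so the total number of lookups is roughly 10000 + 10000 × log(100).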

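Comparing these two figures shows that a small table driving a large table is more efficient. The core reason is that the driven table is usually searched through an index, and an index search is much faster and needs far fewer lookups than a scan.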
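The gap can be checked with a rough Python count. This is only a sketch under loud assumptions: a binary search over a sorted list stands in for the B+ tree lookup, employees are spread evenly over the departments, and the helper names indexed_lookup and join_cost are hypothetical. It counts one unit for reading each driving row plus the search steps needed to locate the first match in the driven table, and ignores the cost of reading the matched rows.

```python
def indexed_lookup(sorted_keys, key):
    """Binary search standing in for a B+ tree lookup.

    Returns (found, steps), where steps is the number of comparisons made.
    """
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == key
    return found, steps


def join_cost(driving_keys, driven_sorted_keys):
    """Read every driving row once and do one indexed lookup per row."""
    cost = 0
    for key in driving_keys:
        cost += 1  # reading the driving row itself
        _, steps = indexed_lookup(driven_sorted_keys, key)
        cost += steps  # steps spent inside the "index"
    return cost


# 100 departments, 10000 employees spread evenly (an assumption).
department_ids = list(range(100))
employee_dept_ids = sorted(i % 100 for i in range(10000))

cost_d_drives_e = join_cost(department_ids, employee_dept_ids)
cost_e_drives_d = join_cost(employee_dept_ids, department_ids)
print("D drives E:", cost_d_drives_e)
print("E drives D:", cost_e_drives_d)
```

The absolute numbers mean little, but the driving table's row count multiplies everything else, which is why the smaller driving table wins.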

3. What if there is no index?

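The conclusion we reached in section 2 rests on one premise: the columns that join the driving table and the driven table are indexed. In our tables, E stores a departmentId field that corresponds to the id field in D, and id is the primary key index of D. If id were not the primary key but just an ordinary column, wouldn't D need a full table scan as well? In that case it would make little difference whether E drives D or D drives E.

When the driven table has no usable index, MySQL uses an algorithm called Block Nested-Loop Join (BNL for short). Its steps are as follows: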

  1. Read the data of the E table into the thread memory join_buffer.
  2. Scan the D table, take out each row in the D table, and compare it with the data in the join_buffer. If the join condition is satisfied, it will be returned as part of the result set.
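The two steps above can be sketched in Python as follows. This is a toy model rather than MySQL's actual code: the function name block_nested_loop_join is hypothetical, the buffer capacity is counted in rows instead of bytes, and the sample rows are invented. It also shows what happens when the buffered table does not fit in one go: it is loaded in blocks, and the other table is scanned once per block.

```python
def block_nested_loop_join(outer_rows, inner_rows, buffer_capacity, match):
    """Join two lists without any index.

    outer_rows is loaded into the join buffer block by block; inner_rows is
    scanned in full once for every block.
    Returns (joined_pairs, number_of_inner_scans).
    """
    joined = []
    inner_scans = 0
    for start in range(0, len(outer_rows), buffer_capacity):
        # Step 1: read (part of) the outer table into the join buffer.
        join_buffer = outer_rows[start:start + buffer_capacity]
        # Step 2: scan the inner table and compare each row with the buffer.
        inner_scans += 1
        for inner in inner_rows:
            for outer in join_buffer:
                if match(outer, inner):
                    joined.append((outer, inner))
    return joined, inner_scans


def same_department(employee, department):
    # employee = (employee id, departmentId), department = (id, name)
    return employee[1] == department[0]


emp_rows = [(i, i % 3) for i in range(10)]
dept_rows = [(0, "A"), (1, "B"), (2, "C")]

# Buffer large enough: one block, so the inner table is scanned once.
all_at_once, scans_big = block_nested_loop_join(
    emp_rows, dept_rows, 100, same_department)
# Buffer holds only 4 rows: three blocks, so three scans of the inner table.
in_blocks, scans_small = block_nested_loop_join(
    emp_rows, dept_rows, 4, same_department)
print(scans_big, scans_small)
```

Both calls produce the same joined rows; only the number of passes over the scanned table changes.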

Let's try it. If I drop the index on the departmentId field in the E table and also drop the primary key index on the id field in the D table, the execution plan of the following SQL looks like this:

As you can see, both the E table and the D table now get a full table scan. Note also that the comparisons all happen in memory, so the execution efficiency is still acceptable.

But since the data is read into memory, will it all fit? What happens if it doesn't? Look at the query plan above: for the E table, the Extra column also shows Using join buffer (Block Nested Loop). Block means exactly that. If the data can't fit into memory in one go, it is read in blocks: one part is read into memory and compared, and then the next part is read in.

We can check the size of join_buffer with the following command:

show variables like 'join_buffer_size';

The value is reported in bytes: 262144 / 1024 = 256 KB

The default size is 256 KB.

Now I change this to a larger value and look at the new execution plan, as follows:

As you can see, the Using join buffer (Block Nested Loop) hint no longer appears.

In conclusion:

  • If the join_buffer is large enough to read all the data into memory at once, then it doesn't matter whether the big table drives the small table or the small table drives the big table.
  • If the join_buffer size is limited, it is recommended that a small table drives a large table, so that even if you want to read in blocks, the number of reads is less.
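A quick back-of-the-envelope check of these two points, again as a hedged sketch: the function bnl_cost is hypothetical, the buffer is measured in rows rather than bytes, and the 100 and 10000 row counts are the ones used earlier in the article.

```python
import math


def bnl_cost(driving_rows, driven_rows, buffer_rows):
    """Rough row-read count for a block nested-loop join.

    The driving table is read once into the buffer, block by block, and the
    driven table is scanned in full once per block.
    Returns (blocks, rows_scanned).
    """
    blocks = math.ceil(driving_rows / buffer_rows)
    rows_scanned = driving_rows + blocks * driven_rows
    return blocks, rows_scanned


# Limited buffer (50 rows): the small driving table needs far fewer blocks.
print(bnl_cost(100, 10000, 50))    # small table drives large table
print(bnl_cost(10000, 100, 50))    # large table drives small table

# Buffer big enough for either table: the driving order no longer matters.
print(bnl_cost(100, 10000, 10000))
print(bnl_cost(10000, 100, 10000))
```

With the limited buffer, the small driving table is split into 2 blocks instead of 200 and fewer rows are read overall; with a buffer that holds everything, both orders read the same number of rows.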

But to be honest, this kind of multi-table joint query without index is relatively inefficient and should be avoided as much as possible.

To sum up, in the joint query of multiple tables, it is recommended that small tables drive large tables.


Origin juejin.im/post/7083863347304611877