Why is SELECT * not recommended?

"Don't use SELECT *" has almost become a golden rule of working with MySQL. Even the "Alibaba Java Development Manual" explicitly forbids using * as the field list of a query, which gives the rule an authoritative endorsement.

Alibaba Java Development Manual

However, I still use SELECT * directly during development, for two reasons:

  1. It is simple, so development is fast, and if fields are frequently added or modified later, the SQL statement does not need to change;
  2. I think premature optimization is a bad habit. Unless you know at the outset which fields you will actually need and can build appropriate indexes for them, I prefer to optimize SQL when I run into trouble, provided of course that the trouble is not fatal.

But we should still understand why using SELECT * directly is discouraged. This article gives four reasons.

1. Unnecessary disk I/O

We know that MySQL ultimately stores user records on disk, so a query is essentially a disk I/O operation (provided that the records being queried are not already cached in memory).

The more fields a query returns, the more content has to be read, which increases disk I/O overhead. The effect is especially noticeable when some fields are of type TEXT, MEDIUMTEXT or BLOB.

Does using SELECT * make MySQL use more memory?

In theory, no. The Server layer does not hold the complete result set in memory and pass it to the client all at once. Instead, each row obtained from the storage engine is written into a memory area called net_buffer, whose size is controlled by the system variable net_buffer_length (16 KB by default). When net_buffer is full, its contents are written to the socket send buffer of the local network stack and sent to the client. Once the send succeeds (the client has finished reading), net_buffer is emptied and the server goes on to read and write the next row.

In other words, by default the memory occupied by the result set is at most net_buffer_length, and it does not grow just because more fields are selected. (The buffer can be enlarged dynamically, up to max_allowed_packet, when a single row does not fit.)
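The buffering described above can be illustrated with a toy model. The sketch below is not MySQL code: `stream_rows` and its parameters are hypothetical names, and it simply assumes that rows are appended to a fixed-size buffer that is flushed whenever the next row would not fit. Under those assumptions, the peak memory held in the buffer stays bounded while the number of flushes grows with row width.

```python
def stream_rows(row_sizes, buffer_size=16 * 1024):
    """Toy model of a fixed-size send buffer (not MySQL source code).

    row_sizes: size in bytes of each row in the result set.
    Returns (number_of_flushes, peak_bytes_held_in_buffer).
    Assumes every row fits in the buffer on its own.
    """
    used = 0
    flushes = 0
    peak = 0
    for size in row_sizes:
        if used + size > buffer_size:
            # Buffer is full: hand its contents to the socket and empty it.
            flushes += 1
            used = 0
        used += size
        peak = max(peak, used)
    if used:
        flushes += 1  # send whatever is left after the last row
    return flushes, peak


# 1,000 rows: narrow rows (64 bytes) versus wide rows (4 KB, e.g. with a TEXT column).
narrow = stream_rows([64] * 1000)
wide = stream_rows([4096] * 1000)
print(narrow, wide)
```

In this model both result sets peak at the same buffer size, but the wide rows need far more flushes, which is the link to the network cost discussed in the next section.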

2. Increased network latency

Continuing from the previous point: each write to the socket send buffer sends only a small amount of data to the client, but if someone uses * and thereby also fetches fields of type TEXT, MEDIUMTEXT or BLOB, the total amount of data becomes large, which directly increases the number of network transmissions.

This overhead is very noticeable when MySQL and the application are not on the same machine. Even when the MySQL server and the client are on the same machine, communication still takes extra time if the protocol is TCP.

3. Covering indexes cannot be used

To illustrate this, we need to create a table:

CREATE TABLE `user_innodb` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `gender` tinyint(1) DEFAULT NULL,
  `phone` varchar(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_NAME_PHONE` (`name`,`phone`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

We created a table named user_innodb whose storage engine is InnoDB, set id as the primary key, created a composite index on name and phone, and finally loaded more than 5 million rows of random data into the table.

InnoDB automatically creates a B+ tree for the primary key id, called the primary key index (also known as the clustered index). The most important feature of this B+ tree is that its leaf nodes contain the complete user records, which looks like this.

primary key index

If we execute this statement:

SELECT * FROM user_innodb WHERE name = '蝉沐风';

Use EXPLAIN to view the execution plan of the statement:

EXPLAIN shows that this statement uses the IDX_NAME_PHONE index, which is a secondary index whose leaf nodes store only the indexed columns and the primary key.

InnoDB finds the entries whose name is 蝉沐风 in the leaf nodes of the secondary index according to the search condition. However, the secondary index records only the name, phone and primary key id fields, and SELECT * also asks for gender, so InnoDB has to take each primary key id and find the complete record in the primary key index. This extra step is called a table lookup (going back to the table).

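Think about it: if the leaf nodes of the secondary index already held all the data we want, would the table lookup still be necessary? No, and that is exactly what a covering index is.

For example, suppose we only want the name, phone and primary key fields.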

SELECT id, name, phone FROM user_innodb WHERE name = '蝉沐风';

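Use EXPLAIN to view the execution plan of the statement:

The Extra column shows Using index, which means that the select list and the search condition contain only columns that belong to a single index. In other words, a covering index is used, so the table lookup can be skipped entirely, which greatly improves query efficiency.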
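The difference between the two queries can be sketched with a toy model in which the two B+ trees are plain Python structures. This is only an illustration under stated assumptions: the sample rows, the `query` function and the linear scan over the secondary index are all hypothetical stand-ins, not how InnoDB is implemented.

```python
# Clustered index: primary key -> complete row (hypothetical sample data).
clustered = {
    1: {"id": 1, "name": "alice", "gender": 0, "phone": "13800000001"},
    2: {"id": 2, "name": "bob", "gender": 1, "phone": "13800000002"},
    3: {"id": 3, "name": "alice", "gender": 1, "phone": "13800000003"},
}
# Secondary index on (name, phone): entries carry only name, phone and id.
secondary = sorted((r["name"], r["phone"], r["id"]) for r in clustered.values())


def query(name, columns):
    """Return (rows, table_lookups) for WHERE name = ? with the given select list."""
    # The index covers the query only if every requested column is stored in it.
    covered = set(columns) <= {"id", "name", "phone"}
    rows, lookups = [], 0
    for n, phone, pk in secondary:  # a real B+ tree would seek, not scan
        if n != name:
            continue
        if covered:
            entry = {"id": pk, "name": n, "phone": phone}
        else:
            entry = clustered[pk]  # table lookup in the clustered index
            lookups += 1
        rows.append({c: entry[c] for c in columns})
    return rows, lookups


print(query("alice", ["id", "name", "phone"])[1])            # covered select list
print(query("alice", ["id", "name", "gender", "phone"])[1])  # equivalent of SELECT *
```

With the covered select list the model never touches the clustered index; with all four columns it performs one lookup per matching row.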

4. May slow down JOIN queries

We create two tables, t1 and t2, and join them to illustrate the next problem. We insert 100 rows into t1 and 1000 rows into t2.

CREATE TABLE `t1` (
  `id` int NOT NULL,
  `m` int DEFAULT NULL,
  `n` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
  `id` int NOT NULL,
  `m` int DEFAULT NULL,
  `n` int DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

If we execute the following statement:

SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.m = t2.m;

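Here I use STRAIGHT_JOIN to force t1 to be the driving table and t2 to be the driven table.

In a join query, the driving table is accessed only once, while the driven table is accessed many times. The exact number of accesses depends on how many rows of the driving table satisfy the query conditions. Since the driving and driven tables are now fixed, let's look at what a two-table join essentially does:

  1. With t1 as the driving table, query t1 using the filter conditions that apply to it. Because there are no filter conditions here, this fetches every row of t1;
  2. For each row in the result set from the previous step, go to the driven table and look for matching rows according to the join condition.

In pseudocode, the whole process looks like this: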

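// t1Res is the result set of the driving table t1 after filtering
for (t1Row : t1Res) {
  // t2 is the complete driven table
  for (t2Row : t2) {
    if (join condition is satisfied && t2's filter condition is satisfied) {
      send the combined row to the client
    }
  }
}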

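This method is the simplest, but it also performs the worst. It is called a nested-loop join (Nested-Loop Join, NLJ). How can we speed up the join?

One way is to create an index, ideally on the column of the driven table (t2) that appears in the join condition. After all, the driven table is queried many times, and each access to it is essentially just a single-table query (since the t1 result set is fixed, the condition used for each lookup in t2 is fixed too).

Since an index is now involved, to avoid repeating the mistake of being unable to use a covering index, we should again avoid SELECT * where possible: put only the columns actually used in the select list and build an appropriate index for them.

But if we do not use an index, does MySQL really perform the join as a plain nested loop? Of course not. A plain nested loop is simply too slow!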

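Before MySQL 8.0, MySQL provided the block nested-loop join (Block Nested-Loop Join, BNL) method, and MySQL 8.0 introduced hash join (available since 8.0.18). Both were designed to solve the same problem: reducing the number of times the driven table is accessed as much as possible.

Both methods use a fixed-size memory area called the join buffer, which holds a number of rows from the driving table's result set (the two methods differ only in how those rows are stored). When rows of the driven table are loaded into memory, each one is matched against many driving-table rows in the join buffer at once. Because the matching happens entirely in memory, this greatly reduces the cost of repeatedly loading the driven table from disk. The figure below shows how the join buffer is used: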

Schematic diagram of join buffer
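The batching idea can be sketched in a few lines of Python. This is a simplified model of the process as described here, not MySQL's implementation: `batched_hash_join` and `rows_per_batch` are hypothetical names, and `rows_per_batch` merely stands in for however many driving-table rows fit in the join buffer.

```python
from collections import defaultdict


def batched_hash_join(driving, driven, key, rows_per_batch):
    """Toy join-buffer model: returns (matches, scans_of_driven_table)."""
    matches, scans = [], 0
    for start in range(0, len(driving), rows_per_batch):
        # Fill the "join buffer" with one batch, hashed on the join column.
        buffer = defaultdict(list)
        for row in driving[start:start + rows_per_batch]:
            buffer[row[key]].append(row)
        # One pass over the driven table serves every row in the buffer.
        scans += 1
        for d_row in driven:
            for row in buffer.get(d_row[key], []):
                matches.append((row, d_row))
    return matches, scans


# Sample data sized like the example tables: 100 rows in t1, 1000 rows in t2.
t1 = [{"id": i, "m": i % 10} for i in range(100)]
t2 = [{"id": i, "m": i % 50} for i in range(1000)]
print(batched_hash_join(t1, t2, "m", rows_per_batch=100)[1])  # all of t1 fits
print(batched_hash_join(t1, t2, "m", rows_per_batch=25)[1])   # t1 split into batches
```

When all of t1 fits in one batch the driven table is scanned once; with 25 rows per batch it is scanned four times, while the join result is the same either way.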

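Looking at the execution plan of the join query above, we can see that hash join is indeed used (provided that no index has been created on the join column of t2; otherwise the index is used and the join buffer is not).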

In the best case, the join buffer is large enough to hold all the rows in the driving table's result set, so the driven table needs to be accessed only once to complete the join. The size is configured with the system variable join_buffer_size, which defaults to 256 KB. If the result set still does not fit, the driving table's rows are put into the join buffer in batches: after the in-memory comparison for one batch finishes, the join buffer is emptied and the next batch is loaded, until the join is complete.

Here is the key point: not all columns of a driving-table row are put into the join buffer, only the columns in the select list and the columns in the filter conditions. This is one more reminder not to use * as the select list. List only the columns we care about, so that the join buffer can hold more rows, which reduces the number of batches and therefore the number of accesses to the driven table.
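A rough back-of-the-envelope calculation shows how much row width matters. The sketch below assumes the batch model described above and ignores any per-row overhead in the buffer; `driven_table_scans` and the byte sizes are hypothetical figures chosen only for illustration.

```python
import math

JOIN_BUFFER_SIZE = 256 * 1024  # default join_buffer_size in bytes


def driven_table_scans(driving_rows, bytes_per_buffered_row):
    """Estimate passes over the driven table (ignores per-row overhead)."""
    rows_per_batch = max(1, JOIN_BUFFER_SIZE // bytes_per_buffered_row)
    return math.ceil(driving_rows / rows_per_batch)


# Hypothetical figures: 1,000,000 driving rows, 16 bytes per row when only the
# needed columns are buffered versus 1 KB per row when SELECT * adds wide columns.
print(driven_table_scans(1_000_000, 16))
print(driven_table_scans(1_000_000, 1024))
```

Under these assumptions the narrow rows need 62 passes over the driven table, while the wide rows need 3907.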

Origin juejin.im/post/7079417143019700255