Are you using count in MySQL correctly? A performance comparison at a glance

This is the second article in the "MySQL Inductive Learning" column, and also the second one about MySQL queries.

Past review:

MySQL Fun Guide: Exploring Server Layer Components and Permission Verification Practice

In MySQL, count() is a powerful aggregate function, but did you know that it is implemented differently in different storage engines? This article also compares the performance of the different count usages and identifies the most efficient one. Finally, it contrasts two ways of maintaining a row count: in a caching system and in the database.

First, take a look at this mind map for an overview of the article.

How count(*) is implemented

Different MySQL engines implement count(*) differently. The discussion here covers count(*) without a filter condition.

  • The MyISAM engine stores the total number of rows of a table on disk, so count(*) returns this number directly, which is very efficient. With a WHERE condition, a MyISAM table cannot answer this quickly.
  • The InnoDB engine has more work to do: when it executes count(*), it has to read the rows from the engine one by one and accumulate the count.

Why doesn't InnoDB store the row count like MyISAM?

This is because, under multi-version concurrency control (MVCC), "how many rows should be returned" for an InnoDB table has no single fixed answer, even for several queries issued at the same moment.

As the following case shows, three sessions that query the total number of rows in table t at the same moment get different results.

img

This is related to InnoDB's transaction design. REPEATABLE READ is its default isolation level, and it is implemented through multi-version concurrency control (MVCC). For every row, InnoDB must decide whether that row version is visible to the session. So for a count(*) request, InnoDB has to read the rows one by one and check each in turn; only the visible rows count toward the total number of rows in the table "as seen by this query".
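The row-by-row visibility check can be pictured with a small toy model. This is only a sketch under simplifying assumptions: `RowVersion`, `count_visible`, and the "set of visible transaction ids" are hypothetical names invented for illustration, and InnoDB's real read-view logic is considerably more involved.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class RowVersion:
    created_by: int                    # id of the transaction that inserted the row
    deleted_by: Optional[int] = None   # id of the transaction that deleted it, if any


def count_visible(rows: Iterable[RowVersion], visible_trx: Set[int]) -> int:
    """Count rows whose insert is visible and whose delete (if any) is not."""
    total = 0
    for row in rows:
        inserted = row.created_by in visible_trx
        deleted = row.deleted_by is not None and row.deleted_by in visible_trx
        if inserted and not deleted:
            total += 1
    return total


# Hypothetical scenario (not taken from the article's figure):
# trx 1 loaded three rows and committed long ago.
# trx 2 (session B) inserted one row and is still open.
# trx 3 (session C) inserted one row and committed after session A's snapshot.
rows = [RowVersion(1), RowVersion(1), RowVersion(1), RowVersion(2), RowVersion(3)]

session_a = count_visible(rows, {1})        # snapshot taken before trx 2 and trx 3
session_b = count_visible(rows, {1, 2, 3})  # sees its own insert plus committed trx 3
session_c = count_visible(rows, {1, 3})     # sees committed work only

print(session_a, session_b, session_c)      # prints: 3 5 4
```

Because the answer depends on which transactions each session can see, no single stored number would be correct for all three sessions at once.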

Although count(*) in InnoDB requires reading row by row, it is still optimized internally. InnoDB tables are index-organized: the leaf nodes of the primary key index tree hold the row data, while the leaf nodes of a secondary index tree hold only primary key values. A secondary index tree is therefore much smaller than the primary key index tree. For an operation such as count(*), traversing any index tree gives logically the same result, so the MySQL optimizer picks the smallest tree to traverse. Minimizing the amount of data scanned, while keeping the logic correct, is one of the general principles of database system design.

Besides count(*), the show table status command also displays how many rows a table currently has. Note, however, that this number is estimated by sampling, and the official documentation says the error may reach 40% to 50%. The row count shown by show table status therefore cannot be used directly.

Summary

  • MyISAM's count(*) is very fast, but MyISAM does not support transactions;
  • show table status returns quickly, but its row count is not accurate;
  • count(*) on an InnoDB table traverses the whole table; the result is accurate, but it can cause performance problems.

Different uses of count

How do count(*), count(primary key id), count(field), and count(1) differ in performance?

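count() is an aggregate function. It examines the result set row by row: if the argument of count is not NULL, the running total is incremented by 1; otherwise it is left unchanged. At the end it returns the running total.

So count(*), count(primary key id), and count(1) all return the total number of rows in the result set that satisfy the condition, while count(field) returns the number of those rows in which "field" is not NULL.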
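The difference in what each form counts can be checked with a short script. As a self-contained sketch it uses Python's built-in sqlite3 module rather than MySQL, so it says nothing about InnoDB performance; it relies only on the NULL-skipping rule of count(expr), which is standard SQL behavior. The table `t` and its columns are made up for illustration.

```python
import sqlite3


def count_variants():
    """Return (count(*), count(id), count(1), count(name)) for a tiny table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    # Two of the four rows have a NULL name.
    conn.executemany(
        "INSERT INTO t (id, name) VALUES (?, ?)",
        [(1, "a"), (2, None), (3, "c"), (4, None)],
    )
    row = conn.execute(
        "SELECT count(*), count(id), count(1), count(name) FROM t"
    ).fetchone()
    conn.close()
    return row


print(count_variants())  # prints: (4, 4, 4, 2)
```

The first three forms agree on the total number of rows; only count(name) differs, because it skips the rows whose name is NULL.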

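For count(primary key id), the InnoDB engine traverses the whole table, takes the id value out of every row, and returns it to the server layer. The server layer receives the id, determines that it cannot be NULL, and accumulates row by row.

count(primary key id) does not have to go through the primary key index, because secondary index trees are much smaller than the primary key index tree. If the table has several secondary index trees, the optimizer decides which one to use.

For count(1), the InnoDB engine traverses the whole table but does not fetch any values. For each row returned, the server layer puts in the number "1", determines that it cannot be NULL, and accumulates row by row.

Comparing just these two usages, count(1) runs faster than count(primary key id), because returning the id from the engine involves parsing the data row and copying the field value.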

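For count(field):

  1. If the field is defined as not null, it is read from each record row by row; since it cannot be null, every row is counted;
  2. If the field is defined as nullable, the executor knows the value may be null, so it has to fetch the value and check it, and counts the row only when the value is not null.

count(field) has to read the field's value. Unless the field is covered by a secondary index, that value can only come from the clustered index tree, so this form is the least efficient.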

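count(*) is the exception. It does not fetch all the fields; it is specially optimized so that no value is fetched. count(*) is certainly not null, so rows are simply accumulated.

The primary key ID is certainly non-null, so why doesn't the optimizer optimize count(primary key id) the way it optimizes count(*)? The answer is that there is no need to duplicate the optimization: just use count(*).

Based on the analysis above, ordered by efficiency: count(field) < count(primary key id) < count(1) ≈ count(*). So I suggest using count(*) whenever possible.

Some articles say that count(*) performs poorly, which is imprecise wording: the other counting forms are no faster. It is counting itself that is slow, not count(*) in particular. Slow counting can be mitigated with a cache, such as a Redis cache or a local cache, but a cache cannot guarantee full real-time consistency.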

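Keeping the count in a caching system

For a database that is updated very frequently, the first thing that may come to mind is a caching system.

You can use a Redis service to store the table's total row count: every time a row is inserted into the table the Redis counter is incremented by 1, and every time a row is deleted it is decremented by 1. With this approach both reads and updates are fast. But think about it: what is the problem with this approach?

That's right: the caching system may lose updates.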

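Redis data cannot stay in memory forever, so you will find somewhere to persist this value periodically. Even so, updates can still be lost. Suppose a row has just been inserted into the table and the value in Redis has been incremented by 1, and then Redis restarts abnormally. After the restart you read the value back from wherever the Redis data was persisted, and the increment that had just been applied is lost.

Of course, there is a fix. For example, after Redis restarts abnormally, run count(*) once against the database to get the true row count and write it back to Redis. Abnormal restarts are rare, so the cost of that one full table scan is acceptable.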
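The lost update and the recount-based recovery can be walked through with a small simulation. This is a sketch only: `FakeCounterCache` is a hypothetical in-memory stand-in, not the Redis API, and a plain Python list plays the role of the MySQL table.

```python
class FakeCounterCache:
    """In-memory stand-in for a cached counter with periodic persistence."""

    def __init__(self) -> None:
        self.value = 0       # live value held in memory
        self.persisted = 0   # last value written to durable storage

    def incr(self) -> None:
        self.value += 1

    def persist(self) -> None:
        self.persisted = self.value

    def crash_and_restart(self) -> None:
        # Anything not yet persisted is gone after the restart.
        self.value = self.persisted


def simulate_lost_update():
    table = []                   # stand-in for the MySQL table
    cache = FakeCounterCache()

    for name in ("r1", "r2", "r3"):
        table.append(name)
        cache.incr()
    cache.persist()              # periodic snapshot: counter is 3

    table.append("r4")           # a new row is inserted ...
    cache.incr()                 # ... and counted, but not yet persisted
    cache.crash_and_restart()
    after_crash = cache.value    # the last increment is lost

    cache.value = len(table)     # recovery: recount, like count(*) on the table
    return after_crash, len(table), cache.value


print(simulate_lost_update())    # prints: (3, 4, 4)
```

After the crash the counter reads 3 while the table holds 4 rows; the one-off recount brings the two back in line.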

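In fact, lost updates are not the only problem with keeping the count in a caching system. Even when Redis is working normally, the value is still logically imprecise.

Redis and MySQL are two independent data sources, so we have to deal with data inconsistency under concurrency. The usual practice is to update the database first and then delete the cache.

Consider the following two sequence diagrams: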

img

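Session A performs the insert at T2, and session B reads the cached count at T3. The count that session B reads at that moment is inconsistent with the count it would read after session A's transaction has finished.

Would things improve if session A swapped the order of the counter update and the insert?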

img

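The answer is still no. Although session B can read the latest count at T3, it cannot yet see row R, which is still waiting to be inserted.

Because Redis and MySQL are systems built on different storage and there is no distributed transaction spanning them, the count cannot be guaranteed to be precise.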
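Both orderings can be replayed in a few lines. The sketch below assumes a single reader (session B) that looks at the counter and the table between session A's two steps; `session_b_view` and the in-memory `state` dictionary are hypothetical stand-ins for Redis and MySQL.

```python
def session_b_view(first_step: str):
    """Return the (count, number of rows) that session B observes at T3."""
    state = {"rows": ["r1", "r2"], "count": 2}   # consistent starting point

    def insert_row() -> None:
        state["rows"].append("R")                # write to the "table"

    def incr_count() -> None:
        state["count"] += 1                      # write to the "cache"

    if first_step == "insert":
        steps = [insert_row, incr_count]
    elif first_step == "incr":
        steps = [incr_count, insert_row]
    else:
        raise ValueError("first_step must be 'insert' or 'incr'")

    steps[0]()                                   # T2: session A's first step
    seen = (state["count"], len(state["rows"]))  # T3: session B reads both
    steps[1]()                                   # T4: session A's second step
    return seen


print(session_b_view("insert"))  # prints: (2, 3)
print(session_b_view("incr"))    # prints: (3, 2)
```

Whichever step session A performs first, session B sees a counter and a table that disagree, because the two writes are not atomic.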

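Keeping the count in the database

As analyzed above, keeping the count in a caching system can lose data and gives an imprecise count. What if we put the count directly in the database, in a separate counter table C?

First, this solves the problem of losing the count in a crash: InnoDB supports crash recovery without data loss.

A transaction then solves the problem shown in sequence diagram 2, as follows: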

img

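Thanks to MySQL's transaction mechanism and MVCC, what session B does at T3 is not affected by session A: session A does not commit its transaction until T4, so the changes it made at T2 are not visible to session B.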
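The effect can be reproduced in miniature with two database connections. As an assumption made for self-containment, the sketch uses Python's built-in sqlite3 module with a temporary file instead of MySQL, so it does not exercise InnoDB's MVCC. It only illustrates the two properties the argument relies on: uncommitted changes are invisible to other sessions, and the row and the counter become visible together at commit. The tables `t` and `c` are hypothetical.

```python
import os
import sqlite3
import tempfile


def demo_counter_table():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    # isolation_level=None: transactions are controlled with explicit BEGIN/COMMIT.
    a = sqlite3.connect(path, isolation_level=None)   # session A
    b = sqlite3.connect(path, isolation_level=None)   # session B
    a.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
    a.execute("CREATE TABLE c (n INTEGER NOT NULL)")  # counter table C
    a.execute("INSERT INTO c VALUES (0)")

    def read(conn):
        rows = conn.execute("SELECT count(*) FROM t").fetchall()[0][0]
        n = conn.execute("SELECT n FROM c").fetchall()[0][0]
        return rows, n

    a.execute("BEGIN")
    a.execute("INSERT INTO t (v) VALUES ('R')")       # insert row R
    a.execute("UPDATE c SET n = n + 1")               # bump the counter
    before_commit = read(b)                           # session B reads mid-transaction
    a.execute("COMMIT")
    after_commit = read(b)                            # session B reads after the commit

    a.close()
    b.close()
    return before_commit, after_commit


print(demo_counter_table())  # prints: ((0, 0), (1, 1))
```

In both reads the counter matches the number of rows: session B sees either none of session A's changes or all of them.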

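Summary

count(*) is implemented differently in different storage engines. We also looked at the problems of using a caching system to store the count.

Redis and MySQL are different storage systems with no distributed transaction between them, so they cannot provide a precisely consistent view; this is why a count stored in Redis cannot be guaranteed to match the data in the MySQL table. Storing the count in MySQL, by contrast, solves the consistent-view problem.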

Origin juejin.im/post/7257922419319603256