Use count(*) to count total rows in MySQL, nothing fancy "Deadly Kick MySQL Series 10"

A common question: for counting total rows, which is most efficient, count(*), count(primary key ID), count(field), or count(1)?

The conclusion first: skip the fancy alternatives and use count(*).

Many readers will ask why. This article answers that.

Articles in this series

Five minutes, let you understand how MySQL chooses the index "Deadly Kick MySQL Series 6"

Strings can be indexed like this, you know? "Deadly Kick MySQL Series 7"

Unreproducible "slow" SQL "Deadly Kick MySQL Series 8"

What? Still using delete to delete data "Deadly Kick MySQL Series 9"

1. How different storage engines handle count(*)

The first thing to know is that MySQL produces the result of count(*) differently depending on the storage engine.

MyISAM stores the total row count of each table on disk, so count(*) returns that stored value directly, which is very efficient. Note that a count with a where condition cannot use the stored value and is not that fast.

InnoDB has to read the data row by row to execute count(*), accumulate the total, and then return it.

Question: Why doesn't InnoDB store the table total like MyISAM does?

The answer goes back to the earlier MVCC article: because of multi-version concurrency control (MVCC), InnoDB cannot store a single row total for the table.

Each transaction gets its own consistent view, so the total that each transaction should see can differ as well.

If this is unclear, reread the MVCC article. It is the same idea as different transactions seeing different data.

Practical case

Suppose three users run transactions concurrently. Each of them can end up seeing a different total.

Each user decides which rows are visible and which are not according to the data stored in that user's read view.

Read view

When a query is executed, a consistent view, the read view, is generated. It consists of an array of the IDs of all transactions that are uncommitted at the time of the query, plus the largest transaction ID created so far.

The smallest transaction ID in the array is called min_id, and the largest transaction ID created so far is called max_id. Query results are checked against the read view to produce the snapshot result.

The check takes the trx_id of the current row version and compares it with the read view according to the following rules.

If trx_id < min_id, the version was generated by a transaction that had already committed, so the data is visible

If trx_id > max_id, the version was generated by a transaction started in the future, so it is definitely invisible

If min_id <= trx_id <= max_id

  • If the trx_id of the row is in the array, the version was generated by a transaction that has not yet committed, so it is invisible, except to the transaction that wrote it
  • If the trx_id of the row is not in the array, the version was generated by a committed transaction, so it is visible
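The visibility rules above can be sketched in a few lines of Python. This is a simplified model and not InnoDB's actual code: the names ReadView, is_visible, and count_visible are hypothetical, each row is assumed to carry a single version tagged with the trx_id that wrote it, and the walk back through older versions that a real engine performs is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadView:
    """Simplified read view, following the description in the text."""
    active_ids: frozenset          # IDs of transactions uncommitted when the view was created
    max_id: int                    # largest transaction ID created so far
    own_id: Optional[int] = None   # ID of the transaction that owns this view, if any

    @property
    def min_id(self) -> int:
        # Smallest uncommitted transaction ID. If nothing is active,
        # every version up to max_id was written by a committed transaction.
        return min(self.active_ids) if self.active_ids else self.max_id + 1


def is_visible(trx_id: int, view: ReadView) -> bool:
    """Return True if a row version written by trx_id is visible to the view."""
    if trx_id == view.own_id:
        return True                       # a transaction sees its own changes
    if trx_id < view.min_id:
        return True                       # committed before the view was created
    if trx_id > view.max_id:
        return False                      # started after the view was created
    return trx_id not in view.active_ids  # in range: visible only if committed


def count_visible(row_trx_ids, view: ReadView) -> int:
    """Count the rows visible to the view, one check per row."""
    return sum(1 for trx_id in row_trx_ids if is_visible(trx_id, view))


# Four rows, written by transactions 5, 8, 10, and 12 (made-up IDs).
rows = [5, 8, 10, 12]
view_a = ReadView(active_ids=frozenset({8, 10}), max_id=10, own_id=10)
view_b = ReadView(active_ids=frozenset({8}), max_id=12)
print(count_visible(rows, view_a))  # 2: the rows from 5 and from its own transaction 10
print(count_visible(rows, view_b))  # 3: the rows from 5, 10, and 12
```

The two views return different totals for the same four rows, which is why a single stored row count cannot be correct for every transaction.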

2. How has MySQL optimized count(*)?

Let's first look at two index structures: the primary key index and the ordinary index.

[Figure: primary key index]

[Figure: ordinary index]

As you can see, the leaf nodes of the primary key index store the entire row, while the leaf nodes of an ordinary index store only the primary key value.

As a result, an ordinary index is much smaller than the primary key index.

For an operation such as count(*), traversing either index tree gives logically the same result.

So the optimizer picks the smallest tree to traverse. Minimizing the amount of data scanned, while keeping the logic correct, is one of the general rules of database system design.

Question: A row count is already stored, so why not use it?

You probably know where this data comes from: it is the output of show table status \G.

So why doesn't the InnoDB storage engine use the Rows value directly?

Do you remember the sixth article, Five minutes, let you understand how MySQL chooses the index "Deadly Kick MySQL Series 6"?

There is no need to go back and reread it. Just compare Rows with the total that the count query actually returns.

You will find that the two numbers do not match, so the Rows value cannot be used.

Specific reason

The Rows value, like the index cardinality (Cardinality), is obtained by sampling.

Sampling rule

First, N data pages are selected and the number of distinct values on each page is counted. These counts are averaged, and the average is multiplied by the total number of data pages in the index to get the index cardinality.
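The sampling rule can be sketched as follows. This is an illustration of the arithmetic only, not MySQL's implementation: estimate_cardinality and its parameters are hypothetical names, and a "page" is modelled as a plain list of the index values stored on it.

```python
import random


def estimate_cardinality(pages, sample_size, rng=None):
    """Estimate index cardinality from a sample of pages.

    pages: list of pages, each a list of the index values on that page.
    sample_size: N, the number of pages to sample.
    rng: optional random.Random instance, so a run can be made repeatable.
    """
    if not pages:
        return 0
    rng = rng or random.Random()
    sampled = rng.sample(pages, min(sample_size, len(pages)))
    # Average number of distinct values per sampled page...
    avg_distinct = sum(len(set(page)) for page in sampled) / len(sampled)
    # ...scaled up by the total number of pages in the index.
    return round(avg_distinct * len(pages))


# Two pages that hold the same two values: the true cardinality is 2,
# but the estimate is 2 distinct values per page * 2 pages = 4.
print(estimate_cardinality([[1, 1, 2], [1, 2, 2]], sample_size=2))  # 4
```

The last call shows why the result is only an estimate: values repeated across pages are counted once per page, so the figure can drift from the real number.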

The index cardinality is not static: it changes as rows are inserted, deleted, and updated. Re-sampling is triggered when the number of changed rows exceeds 1/M of the total. M depends on the MySQL parameter innodb_stats_persistent: 10 when it is on, 16 when it is off.

In MySQL 8.0 the default is on, which means the sampling statistics are re-triggered once more than 1/10 of the table's rows have changed.
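As a quick check of the 1/M rule, here is a minimal sketch. The function name needs_resample is hypothetical, and it only encodes the threshold stated in the text, not how MySQL tracks changed rows internally.

```python
def needs_resample(changed_rows: int, total_rows: int, persistent: bool = True) -> bool:
    """Return True once the changed rows exceed 1/M of the table.

    M is 10 when innodb_stats_persistent is on and 16 when it is off,
    as stated in the text.
    """
    m = 10 if persistent else 16
    # Same as changed_rows > total_rows / m, without floating-point division.
    return changed_rows * m > total_rows


# A table of 1,000,000 rows with persistent statistics on:
print(needs_resample(100_000, 1_000_000))  # False: exactly 1/10, not more
print(needs_resample(100_001, 1_000_000))  # True: just over 1/10
```

With the setting off, the threshold drops to 1/16, so the same table would be re-sampled after a little more than 62,500 changed rows.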

3. How the different forms of count work

All of the following conclusions apply to MySQL's InnoDB storage engine.

count(primary key ID)

The InnoDB engine traverses the whole table, reads the ID of each row, and returns it to the server layer. The server layer sees that the ID cannot be null and accumulates row by row.

count(1)

It also traverses the whole table but reads no values. For each row returned, the server layer puts in the number 1, sees that it cannot be null, and accumulates row by row.

count(field)

There are two cases, depending on whether the field is defined as not null or as nullable

  • Defined as not null: the field is read from each record row by row, and since it cannot be null, it is accumulated directly
  • Defined as nullable: the value may be null, so it has to be taken out and checked, and it is accumulated only if it is not null

count(*)

This one is special. MySQL does not fetch every column for *; it has a dedicated optimization for this form. count(*) can never be null, so it is accumulated row by row.

Conclusion

In order of efficiency: count(field) < count(primary key ID) < count(1) ≈ count(*). So just use count(*), and don't be fancy.
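The four forms also differ in what they count, which is easy to check. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and a made-up users table. The null handling of count is standard SQL, so the totals match what MySQL would return, but nothing here measures or reflects InnoDB performance.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, nickname TEXT)")
conn.executemany(
    "INSERT INTO users (id, nickname) VALUES (?, ?)",
    [(1, "kaka"), (2, None), (3, "mysql"), (4, None)],
)


def count(expr: str) -> int:
    """Run SELECT COUNT(expr) FROM users and return the single value."""
    return conn.execute(f"SELECT COUNT({expr}) FROM users").fetchone()[0]


counts = {expr: count(expr) for expr in ("*", "1", "id", "nickname")}
print(counts)  # {'*': 4, '1': 4, 'id': 4, 'nickname': 2}
```

count(*), count(1), and count(id) all return 4, while count(nickname) returns 2 because the two null nicknames are skipped. On a nullable field, count(field) answers a different question than the total number of rows.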

4. Summary

This article boils down to one sentence: to count total rows, just use count(*), and don't be fancy.

Persisting in learning, in writing, and in sharing is the belief Kaka has held since starting out in this career. I hope this article brings you a little help somewhere on the vast Internet. I am Kaka, see you in the next issue.

Origin blog.csdn.net/fangkang7/article/details/121186545