Understanding MySQL in depth, starting from a simple count(*)

Starting from the problem

When developing a system, you often need to count the number of rows in a table, such as the total number of change records in a trading system. At first glance, a single select count(*) from t statement seems to solve the problem.
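
For concreteness, the later examples assume a hypothetical InnoDB table t; the table name and columns are made up for illustration only.

```sql
-- Hypothetical table used in the examples that follow
CREATE TABLE t (
  id   BIGINT NOT NULL AUTO_INCREMENT,
  city VARCHAR(16) DEFAULT NULL,
  op   VARCHAR(64) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- The straightforward way to count its rows
SELECT COUNT(*) FROM t;
```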

However, as the number of records grows, you will find that this statement executes more and more slowly. You may then wonder: why doesn't MySQL simply keep the total count somewhere and read it back whenever it is needed?

Let's look at how the count(*) statement is implemented and why MySQL implements it this way. Then we will discuss what you can do in the business design when the application both changes data frequently and needs to count table rows.

How count(*) is implemented

In different MySQL engines, count(*) has different implementations.

  • The MyISAM engine stores the total row count of each table on disk, so a count(*) with no filter condition simply returns that stored number, which is very efficient. Once a where condition is added, however, a MyISAM table can no longer answer that quickly (a quick sketch follows this list).
  • The InnoDB engine has a harder time: when it executes count(*), it has to read rows from the engine one by one and accumulate the count.
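
A quick sketch of the difference, assuming the hypothetical table t above plus a MyISAM copy of it named t_myisam (the copy is purely illustrative):

```sql
-- On a MyISAM table, an unfiltered count returns the stored total directly
SELECT COUNT(*) FROM t_myisam;

-- With a filter, MyISAM can no longer use the stored total and has to scan
SELECT COUNT(*) FROM t_myisam WHERE city = 'Hangzhou';

-- On an InnoDB table, even the unfiltered count walks an index row by row
SELECT COUNT(*) FROM t;
```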

In the previous article, we analyzed why InnoDB should be used: it is superior to MyISAM in transaction support, concurrency, and data safety. Your tables are most likely using the InnoDB engine, and that is why counting the total rows of a table becomes slower and slower as the number of records grows.

Why doesn't InnoDB store the count like MyISAM does?

This is because, with multi-version concurrency control (MVCC), even queries running at the same moment may each get a different answer to "how many rows should this InnoDB table return".

This is related to InnoDB's transaction design. Repeatable read is its default isolation level, and it is implemented through multi-version concurrency control, namely MVCC. Every row has to be checked for whether it is visible to the current session, so for a count(*) request InnoDB has to read the data row by row, check visibility, and count only the visible rows to get the total number of rows "as seen by this query".

Despite this limitation, MySQL still optimizes the count(*) operation: the optimizer picks the smallest index tree to traverse (usually a secondary index, whose leaves hold only the indexed column and the primary key, so its tree is much smaller than the clustered index). Reducing the amount of scanned data as much as possible, while keeping the logic correct, is one of the general rules of database system design.
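
One hedged way to observe this on your own table: add a small secondary index and check which index the plan uses for the count. The index name idx_city is illustrative, and the exact plan depends on your MySQL version and data.

```sql
-- Hypothetical secondary index; its tree is smaller than the clustered index
ALTER TABLE t ADD INDEX idx_city (city);

-- The optimizer is free to traverse the smallest tree for count(*);
-- the key column of the plan often shows the secondary index rather than PRIMARY
EXPLAIN SELECT COUNT(*) FROM t;
```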

If you have used the show table status command, you will have noticed that its output contains a TABLE_ROWS field showing how many rows the table currently has, and that the command returns very quickly. Can TABLE_ROWS be used instead of count(*)?

As mentioned in an earlier article, index statistics are estimated by sampling, and TABLE_ROWS comes from those same sampled statistics, so it is also quite inaccurate. How inaccurate? The official documentation says the error can reach 40% to 50%. Therefore, the row count shown by the show table status command cannot be used directly.
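
You can compare the estimate with the real count yourself; the table name t is the hypothetical one from above.

```sql
-- Estimated row count, taken from sampled statistics
SHOW TABLE STATUS LIKE 't';

-- The same estimate, exposed as TABLE_ROWS in information_schema
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 't';

-- The exact count, obtained by actually scanning
SELECT COUNT(*) FROM t;
```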

To summarize:

  • count(*) on a MyISAM table is fast, but MyISAM does not support transactions;
  • The show table status command returns quickly, but its row count is not accurate;
  • count(*) on an InnoDB table traverses the whole table; the result is accurate, but it can cause performance problems.

Other counting methods

  • Keep the count in a cache system

You can use a Redis service to store the total number of rows of this table: every time a row is inserted, increment the Redis count by 1; every time a row is deleted, decrement it by 1. Both reading and updating the count are then fast.

Although Redis can persist data, the count can still be lost after an abnormal restart. That problem is solvable: after Redis restarts abnormally, run count(*) in the database once to get the real row count and write the value back to Redis. Abnormal restarts are rare, so the cost of an occasional full table scan is acceptable.

But storing the count in a cache system has more problems than just lost updates. Even when Redis is working normally, the value is still logically imprecise, and this is an inherent shortcoming of the caching approach.

For example:

Suppose a page needs to display the total number of operation records together with the 100 most recent operations. Its logic is to read the count from Redis first and then fetch the records from the data table.

By "imprecise" we mean one of two situations: either the 100 rows returned include the newly inserted record but the Redis count has not yet been incremented, or the 100 rows do not include the new record but the Redis count has already been incremented. Both situations are logically inconsistent.

  

No matter how you order the two operations (writing the data table and updating the Redis count), there is always a window during which another session can observe an inconsistent pair of values.

In a concurrent system we cannot precisely control when different threads run, so because of interleavings like the one sketched below, the count value is logically imprecise even when Redis is working normally.
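
A sketch of one such interleaving, with the Redis side shown only as comments (the key name ops_total and the exact ordering are illustrative assumptions):

```sql
-- Session A inserts a new record first, intending to INCR the Redis key afterwards
INSERT INTO t (city, op) VALUES ('Hangzhou', 'new operation');
-- (Redis: INCR ops_total has NOT run yet)

-- Session B renders the page at this moment:
SELECT * FROM t ORDER BY id DESC LIMIT 100;  -- already sees the new row
-- (Redis: GET ops_total still returns the old total)

-- Swapping the order in Session A just produces the mirror-image inconsistency.
```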

  • Keep the count in the database

What if we instead keep this count in a separate count table inside the database itself?

First of all, this solves the problem of losing the count on a crash: InnoDB supports crash recovery without losing data.

Next, can it also solve the problem of imprecise counting? Yes: put the "add 1 to the count table" operation and the insertion of the data row in the same transaction. Because a transaction guarantees that its intermediate state is not visible to other transactions, the order in which you update the count and insert the new record does not affect the logical result. Note, however, that updating the count table involves row-lock contention; inserting the data row first and updating the count last keeps the lock held for the shortest time, which minimizes lock waits between transactions and improves concurrency.
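
A minimal sketch of this approach, assuming a one-row-per-table counter table rows_count (the table and column names are made up):

```sql
-- Hypothetical counter table, one row per counted table
CREATE TABLE rows_count (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  cnt        BIGINT NOT NULL
) ENGINE=InnoDB;
INSERT INTO rows_count (table_name, cnt) VALUES ('t', 0);

-- Writer: insert the data row first, take the counter's row lock last
BEGIN;
INSERT INTO t (city, op) VALUES ('Hangzhou', 'new operation');
UPDATE rows_count SET cnt = cnt + 1 WHERE table_name = 't';
COMMIT;

-- Reader: doing both reads in one transaction yields a consistent pair
BEGIN;
SELECT cnt FROM rows_count WHERE table_name = 't';
SELECT * FROM t ORDER BY id DESC LIMIT 100;
COMMIT;
```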

Different usages of count()

count(*), count(primary key id), count(field), count(1)

The following discussion is based on the InnoDB engine:

First, be clear about the semantics of count(). count() is an aggregate function: it examines the result set row by row, and if the argument of the count function is not NULL for that row, the accumulator is incremented by 1; otherwise it is not. The accumulated value is returned at the end.

count(*), count(primary key id) and count(1) all return the total number of rows in the result set that satisfy the condition, while count(field) returns the number of rows that satisfy the condition and whose "field" value is not NULL.
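
A small illustration of the NULL semantics, using the nullable city column of the hypothetical table t and assuming the table starts empty:

```sql
INSERT INTO t (city, op) VALUES ('Hangzhou', 'a'), (NULL, 'b'), ('Beijing', 'c');

SELECT COUNT(*)    FROM t;   -- 3: counts every row
SELECT COUNT(1)    FROM t;   -- 3: same result as count(*)
SELECT COUNT(id)   FROM t;   -- 3: the primary key is never NULL
SELECT COUNT(city) FROM t;   -- 2: the row whose city is NULL is skipped
```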

To analyze the performance differences, you can remember these principles:

  • The engine gives the server layer whatever it asks for;
  • InnoDB gives only the necessary values;
  • The current optimizer only optimizes the semantics of count(*) into "take the total row count"; it does not perform other "obviously possible" optimizations.

Let's analyze how each of these count() variants is executed:

  1. For count(primary key id), the InnoDB engine traverses the whole table, takes the id value of every row, and returns it to the server layer. The server layer, having received the id, judges that it cannot be NULL and accumulates row by row.
  2. For count(1), the InnoDB engine traverses the whole table but reads no values. For each returned row, the server layer puts in the number "1", judges that it cannot be NULL, and accumulates row by row.
  3. For count(field): if the "field" is defined as not null, the field is read from each record row by row, judged to be non-NULL, and accumulated; if the "field" definition allows null, then at execution time it may be NULL, so the value has to be read out and checked, and only non-NULL values are accumulated. This follows the first principle above: the engine returns exactly the fields the server layer asks for.
  4. count(*) is the exception: it does not read out all the fields; it is specially optimized to read no values at all. count(*) is certainly not NULL, so the server layer simply accumulates row by row.

Therefore, count(1) and count(*) both execute faster than count(primary key id), because returning the id from the engine involves parsing the data row and copying the field value.

The conclusion, in ascending order of efficiency: count(field) < count(primary key id) < count(1) ≈ count(*). So use count(*) whenever you can.

 

Content source: Lin Xiaobin, "MySQL实战45讲" (45 Lectures on Practical MySQL)


Origin blog.csdn.net/qq_24436765/article/details/112601467