MySQL in Action (45 Lectures) study notes: count(*) is so slow, what can I do about it? (Lecture 14)

I. Introduction

When developing a system, you often need to count the rows in a table, for example the total number of change records in a trading system. At that point you might think: isn't a single select count(*) from t statement enough to solve this?

However, you will find that as the number of records in the system grows, this statement runs slower and slower. You might then wonder: why is MySQL so clumsy? Couldn't it just keep a running total and return it directly every time you ask?

So today let's talk about how the count(*) statement is actually implemented, and why MySQL implements it this way. Then I'll discuss how to design the business logic when an application needs both frequent inserts/deletes and frequent row-count queries.

II. How count(*) is implemented

First, be clear that count(*) is implemented differently in different MySQL storage engines.

1. count(*) on a MyISAM table is fast, but MyISAM does not support transactions

  1. The MyISAM engine stores the total row count of a table on disk, so when count(*) is executed it returns that number directly, which is very efficient;
  2. The InnoDB engine has more trouble: when it executes count(*), it has to read the data out of the engine row by row and accumulate the count.

Note that what we discuss in this article is count(*) without a filter condition; if a where clause is added, a MyISAM table cannot return the answer that quickly either.

In an earlier article we analyzed why you should use InnoDB: whether for transaction support, concurrency, or data safety, InnoDB is better than MyISAM. I would guess your tables use the InnoDB engine. That is why counting the total rows of a table becomes slower and slower as the number of records grows.

2. show table status returns quickly, but the number is not accurate

If you have used the show table status command, you will have noticed that its output includes a TABLE_ROWS field showing the current number of rows in the table, and the command executes very quickly. So can TABLE_ROWS replace count(*)?

You may remember that in the 10th article, "Why does MySQL sometimes choose the wrong index?", I mentioned that index cardinality is estimated by statistical sampling. In fact, TABLE_ROWS comes from that same sampling estimate, so it is not accurate. How inaccurate? The official documentation says the error can reach 40% to 50%. Therefore, the row count shown by show table status cannot be used directly.

3. A direct count(*) on an InnoDB table traverses the whole table; the result is accurate, but it can cause performance problems

So why doesn't InnoDB simply keep a running total the way MyISAM does?

This is because, even for queries issued at the same moment, the "number of rows that should be returned" by an InnoDB table is not a fixed value, due to multi-version concurrency control (MVCC). Let me explain with a concrete count(*) example.

Assume table t currently has 10,000 records, and we design three concurrent user sessions.

We assume the statements are executed in chronological order from top to bottom, and that statements on the same row are executed at the same moment.

  1. Session A starts a transaction and queries the row count of the table once;
  2. Session B starts a transaction, inserts one row, and then queries the row count of the table;
  3. Session C runs a single statement (no explicit transaction) that inserts one row, and then queries the row count of the table.

Figure 1: Execution flow of sessions A, B, and C

You can see that at the final moment the three sessions A, B, and C all query the row count of table t at the same time, yet they get different results.
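If you want to reproduce the figure by hand, the interleaving looks roughly like the sketch below. It assumes a hypothetical two-column table t(id, c) that starts with exactly 10,000 rows; under repeatable read, each session counts only the rows visible in its own consistent view, which is why the three results differ.

    -- t1: session A
    begin;
    select count(*) from t;            -- returns 10000

    -- t2: session C (autocommit, so the insert commits immediately)
    insert into t values (10001, 'c');

    -- t3: session B
    begin;
    insert into t values (10002, 'b'); -- not committed yet

    -- t4: all three sessions run select count(*) from t at the same moment
    -- session A sees 10000 (its view was created at t1, before both inserts)
    -- session B sees 10002 (C's committed row plus its own uncommitted insert)
    -- session C sees 10001 (its own committed row, but not B's uncommitted one)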

This has to do with InnoDB's transaction design: its default isolation level is repeatable read, which is implemented in the code through multi-version concurrency control, that is, MVCC. Every row has to be checked for visibility in the current session, so for a count(*) request InnoDB has no choice but to read the data row by row and judge each row in turn; only the visible rows can be counted toward "the total number of rows in the table as seen by this query".

Note: If your memory of MVCC is fuzzy, you can review the relevant content in article 3, "Transaction isolation: why can't I see the changes you made?", and article 8, "Are transactions isolated or not?".

Of course, MySQL, which looks rather simple-minded here, still does some optimization when executing a count(*) operation.

You know that InnoDB is an index-organized table: the leaf nodes of the primary key index tree hold the data, while the leaf nodes of a secondary index tree hold only primary key values. So a secondary index tree is much smaller than the primary key index tree. For an operation like count(*), traversing either index tree gives the same logical result, so the MySQL optimizer will pick the smallest tree to traverse. Minimizing the amount of data scanned while keeping the logic correct is one of the universal rules of database system design.

Here we summarize briefly:

  1. count(*) on a MyISAM table is fast, but MyISAM does not support transactions;
  2. show table status returns quickly, but the number is not accurate;
  3. A direct count(*) on an InnoDB table traverses the whole table; the result is accurate, but it can cause performance problems.

So, back to the question at the beginning of the article: if a page of your trading system frequently displays the total number of operation records, what should you do? The answer is: you have to keep the count yourself.

Next, let's discuss which counting methods are available, and the pros and cons of each.

The basic idea behind all of these methods is the same: you have to find a place of your own to store the row count of the operations table.

III. Storing the count in a caching system

For a database with frequent updates, the first thing you might think of is to use a caching system to help.

You can use a Redis service to hold the total row count of the table: every time a row is inserted, the Redis count is incremented by 1; every time a row is deleted, the Redis count is decremented by 1. Reads and updates are both very fast this way, but can you think of the problem with this approach?

Yes, the cache system update may be lost.

Redis data cannot stay in memory forever, so you will find a place to persist this value periodically. But even so, updates can still be lost. Imagine that a row has just been inserted into the data table and the count stored in Redis has just been incremented by 1, and then Redis restarts abnormally. After the restart, the value is read back from wherever Redis persisted its data, and that just-performed increment is lost.

Of course, this part is solvable. For example, after Redis restarts abnormally, run a single count(*) against the database to get the real row count, and then write that value back into Redis. After all, abnormal restarts don't happen often, so paying for a full table scan at that point is acceptable.

But in fact, storing the count in a caching system does not only have the lost-update problem. Even with Redis running normally, this value is still logically imprecise.

Imagine a page that needs to display the total number of operation records and, at the same time, the 100 most recent operation records. The logic behind that page has to fetch the count from Redis first and then fetch the record data from the data table. The imprecision can take one of two forms:

1. The 100 rows that are found contain a newly inserted record, but the Redis count has not been incremented yet;
2. The 100 rows that are found do not contain the newly inserted record, but the Redis count has already been incremented.

Either way, the result is logically inconsistent. Let's look at the timing diagram.

Figure 2: Execution timing of sessions A and B

In Figure 2, session A is the insert logic: it inserts row R into the data table and then increments the Redis count; session B is the query that runs when the page needs to be displayed.

With the sequence in Figure 2, when session B queries at time T3, it will see the newly inserted record R, but the Redis count has not been incremented yet. This is the data inconsistency we were talking about.

You might say that this happens because the insert logic writes the data table first and updates the Redis count afterwards, while the read path reads Redis first and the data table afterwards, so the two orders are reversed. If we keep the orders the same, would the problem go away? Let's swap the order of session A's two updates and see what happens.

Figure 3: Execution timing of sessions A and B after swapping the order

You will find that this time it is the other way round: when session B queries at time T3, the Redis count has already been incremented by 1, but the newly inserted row R cannot be found yet, which is also a data inconsistency.

In a concurrent system we cannot precisely control the execution timing of different threads, so operation sequences like those in the figures will occur. That is why we say that even with Redis running normally, this count value is still logically imprecise.

IV. Storing the count in the database

From the analysis above, storing the count in a caching system suffers both from losing the count and from the count being imprecise. So what happens if we instead keep this count directly in the database, in a separate counting table C?

First, this solves the problem of the count being lost in a crash: InnoDB supports crash recovery without losing data.

Note: For InnoDB crash recovery, you can revisit the relevant content in article 2, "The logging system: how is a SQL update statement executed?".

Next, let's see whether it also solves the problem of the count being imprecise.

You might say: isn't this just the same? It is nothing more than replacing the Redis operations in Figure 3 with operations on counting table C. As long as the execution sequence of Figure 3 can occur, this problem has no solution, right?

Actually, this problem does have a solution.

The problems we are trying to solve in this article all stem from the fact that InnoDB supports transactions, which is exactly why an InnoDB table cannot simply keep the count(*) result stored and return it directly when queried.

Turning its own spear against its own shield: we now use this very feature, transactions, to solve the problem.

Figure 4: Execution timing of sessions A and B

Let's look at the result now. Although session B's read operation is still executed at T3, session A's update transaction has not been committed yet, so the increment of the count is not visible to session B.

Therefore, the count value that session B reads and the "most recent 100 records" that it sees are logically consistent with each other.
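Written out as SQL, the scheme looks roughly like this. It is only a sketch, assuming a hypothetical counting table counter(table_name varchar(64) primary key, cnt bigint) and an operations table t(id int primary key, c varchar(32)):

    -- Session A: the insert logic; the two writes are atomic because they share one transaction
    begin;
    insert into t(id, c) values (10001, 'op');                    -- insert the operation record
    update counter set cnt = cnt + 1 where table_name = 't';      -- bump the stored count
    commit;

    -- Session B: the page query; under repeatable read both reads use the same consistent view,
    -- so the count and the "latest 100 records" stay logically consistent with each other
    begin;
    select cnt from counter where table_name = 't';
    select * from t order by id desc limit 100;
    commit;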

V. Different count() usages

In the comments sections of earlier articles, some readers asked: in a query like select count(?) from t, what are the performance differences between count(*), count(primary key id), count(field), and count(1)? Since today's topic is the performance of count(*), I'll take this opportunity to explain the performance differences between these usages in detail.

Note that the following discussion is based on the InnoDB engine.

First you need to be clear about the semantics of count(). count() is an aggregate function: for each row of the returned result set, if the argument of the count function is not NULL, the accumulated value is increased by 1; otherwise it is not. Finally the accumulated value is returned.

Therefore count(*), count(primary key id), and count(1) all return the number of rows in the result set that satisfy the conditions, while count(field) returns the number of rows satisfying the conditions in which the column "field" is not NULL.
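A tiny experiment makes the semantic difference visible; this uses a hypothetical table with one nullable column:

    create table t_demo (id int primary key, f varchar(10) null);
    insert into t_demo values (1, 'a'), (2, null), (3, 'b');

    select count(*), count(id), count(1), count(f) from t_demo;
    -- count(*) = 3, count(id) = 3, count(1) = 3   (every row is counted)
    -- count(f) = 2                                (the row where f is NULL is skipped)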

As for the performance differences, you can remember a few principles:

  • 1. Whatever the server layer asks for, the engine gives it exactly that;
  • 2. InnoDB hands over only the values that are necessary;
  • 3. The current optimizer only optimizes the semantics of count(*) into "fetch the row count"; it does not perform other "obvious" optimizations.

What do these mean? Let's go through the usages one by one.

For count(primary key id), the InnoDB engine traverses the entire table, takes out the id value of every row, and returns it to the server layer. The server layer receives the id, judges that it cannot possibly be NULL, and accumulates row by row.

For count(1), the InnoDB engine also traverses the entire table, but fetches no values. For every row returned, the server layer puts in the number "1", judges that it cannot possibly be NULL, and accumulates row by row.

Comparing just these two usages, you can see that count(1) executes faster than count(primary key id), because returning the id involves parsing the data row and copying the field value out of the engine.

For count(field):

  • 1. If "field" is defined as NOT NULL, the field is read out of every record row by row, judged to be non-NULL, and the rows are accumulated;
  • 2. If "field" is defined as nullable, then at execution time the value might be NULL, so the value has to be taken out and checked as well; only the non-NULL values are accumulated.

This is exactly the first principle above: whatever fields the server layer asks for, InnoDB returns exactly those fields.

count(*) is the exception

But count(*) is the exception: it does not take out all the fields of the row; instead it is specially optimized and fetches no values at all. count(*) is definitely not NULL, so the rows are simply accumulated.

Seeing this, you might say: couldn't the optimizer figure this out by itself? The primary key id certainly cannot be NULL, so why not handle count(primary key id) the same way as count(*)? That optimization seems simple enough.

Of course, MySQL could specifically optimize this statement too; it is not impossible. But there are too many cases that would each need dedicated optimization like this, and MySQL has already optimized count(*), so you can simply use that usage.

So the conclusion is: ordered by efficiency, count(field) < count(primary key id) < count(1) ≈ count(*). Therefore I suggest you use count(*) whenever possible.

VI. Summary

Today I talked with you about the ways MySQL obtains the row count of a table. We noted that count(*) is implemented differently in different engines, and we also analyzed the problems with keeping the count value in a caching system.

In fact, the reason a count kept in Redis and the data in a MySQL table cannot be guaranteed to be accurate and consistent is that they are two different storage systems that do not support distributed transactions, so an accurate and consistent view of both cannot be obtained. Putting the count value in MySQL as well solves the consistent-view problem.

The InnoDB engine supports transactions. Making good use of the atomicity and isolation of transactions can simplify the logic in business development. This is one of the reasons the InnoDB engine is so popular.

Finally, it is time for today's question.

In the scheme we just discussed, we used a transaction to keep the count accurate. Since a transaction guarantees that its intermediate results cannot be read by other transactions, swapping the order of updating the count value and inserting the new record does not affect the logical result. But from the point of view of the performance of a concurrent system, which do you think should come first inside this transaction: inserting the operation record, or updating the counting table?

You can write your thoughts and opinions in the comments section, and I will give my reference answer at the end of the next article. Thank you for listening, and you are welcome to share this article with more friends to read together.

VII. Answer to the previous question

The question I left you last time was: when does alter table t engine = InnoDB make the table occupy more space instead of less?

In the comments section of that article, readers mentioned one situation: the table itself already has no holes, for example because a table-rebuild operation has just been performed.

During the DDL, if DML statements happen to be executing concurrently, some new holes may be introduced during that period.

@Flying mentioned a deeper mechanism that we did not cover in the article: when rebuilding a table, InnoDB does not fill each page completely, but leaves 1/16 of every page free for subsequent updates. In other words, a table is not at its "most" compact right after a rebuild.

If the process goes like this:

1. Rebuild table t once;
2. Insert some data, and the inserted data uses up part of the reserved space;
3. In this situation, rebuild table t again; the phenomenon described in the question may then appear (see the sketch below).
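Expressed as statements, the process is roughly the following sketch; whether the file actually grows depends on the data involved:

    alter table t engine = InnoDB;   -- first rebuild: each page keeps 1/16 reserved free space
    -- ... insert some rows; they may be placed into the reserved space of existing pages ...
    alter table t engine = InnoDB;   -- second rebuild: the data is repacked with 1/16 reserved
                                     -- again, which can need more pages, so the file may grow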

VIII. Selected comments

1. Ken

From the point of view of the performance of a concurrent system, the operation record should be inserted first, and the counting table updated afterwards.

This comes from the knowledge in "Row locks and their merits: how to reduce the impact of row locks on performance?": because updating the counting table involves row-lock contention, inserting first and updating afterwards minimizes lock waits between transactions and improves concurrency.

Author's reply:

Several readers got this right, and you were the first to explain the reason.

2. Another reader

Question 1: What do you think of this MySQL + Redis counting scheme:
1. Open a transaction (in the application code);
2. Insert the data into MySQL;
3. Atomically update the Redis count;
4. If the Redis update succeeds, commit the transaction; if it fails, roll the transaction back.

Question 2: What is the relationship between transactions in .NET or Java application code and MySQL transactions? How are they connected?

Author's reply:

1. Good question, but it still doesn't solve the consistency problem we have been discussing. What if session B's query logic runs between steps 3 and 4?
2. My guess is that they execute begin when the transaction starts and commit at the end, but I haven't looked into all of the frameworks, so I'm not certain.


Source: www.cnblogs.com/luoahong/p/11628535.html