[MySQL] count() query performance review

1. Background

The database is MySQL 8, and the storage engine is InnoDB.

Normally, a paging interface queries the database twice: once to fetch the page of data, and once to get the total row count; the two results are then combined and returned.

The SQL that fetches the page of data looks like this:

select id, name from user limit 1, 20;

It has no performance issues.

The other SQL uses count(*) to get the total number of rows, for example:

select count(*) from user;

This one often performs poorly.

Why does this happen?

2. Why does count(*) have poor performance?

In MySQL, count(*) counts the total number of rows in a table.

Its performance depends directly on the storage engine; not every storage engine performs poorly on count(*).

The two most commonly used storage engines in MySQL are InnoDB and MyISAM.

In MyISAM, the total row count is saved on disk. A count(*) without a where condition simply returns that stored value with no extra calculation, so it executes very fast.

InnoDB is different. Because it supports transactions and uses MVCC (multi-version concurrency control), different transactions at the same point in time may see different row counts for the same query SQL.

So when count(*) runs on InnoDB, the rows have to be read from the storage engine one by one and accumulated, which makes execution slow.

This is fine while the table is small, but once the table holds a large amount of data, count(*) on InnoDB performs very poorly.

3. How to optimize count(*) performance?

Given that count(*) has this performance problem, how can we optimize it?

You can start from the following aspects.

3.1. Add Redis cache

For a simple count(*), such as the total number of views or visitors, you can cache the result directly in Redis instead of counting in real time.

When a user opens the page for the first time, set the count in Redis to 1; on every later visit, increment it by 1 (for example with Redis's INCR command) and the updated value stays in Redis.

Then, wherever the total needs to be displayed, read the count value from Redis and return it.

In this scenario there is no need to run count(*) against the database in real time, so performance improves dramatically.

However, in high concurrency situations, there may be data inconsistencies between the cache and the database.

For business scenarios such as total views or total visitors, though, accuracy requirements are low and such inconsistencies are tolerable.
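The counter pattern described above can be sketched as follows. This is a minimal, hypothetical illustration: the class and method names are invented for this sketch, and an in-memory AtomicLong stands in for Redis's INCR command (in production the increment would be a call against a real Redis instance, e.g. via RedisTemplate or Jedis).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: an in-memory stand-in for the Redis counter.
// In production, incrementAndGet() would be replaced by a Redis INCR call.
public class PageViewCounter {
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Called every time a user opens the page: count = count + 1.
    // INCR-like semantics: a missing key starts from 0.
    public long onPageView(String pageKey) {
        return counters.computeIfAbsent(pageKey, k -> new AtomicLong())
                       .incrementAndGet();
    }

    // Called where the total needs to be displayed; no count(*) query required.
    public long getCount(String pageKey) {
        AtomicLong c = counters.get(pageKey);
        return c == null ? 0 : c.get();
    }
}
```

The key point is that reads never touch the database table at all; they only read the counter.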

3.2. Add level 2 cache

In some business scenarios very little new data is added; most operations are statistical queries with many query conditions. Here, real-time count(*) statistics will certainly perform badly.

For example, a page might count brands filtered by one or more conditions such as id, name, status, time, source, and so on.

Because users can combine these conditions freely, a single composite index cannot cover every combination; sometimes the index will not be used at all, so indexes can only be added for the combinations users query most often.

That is, some combinations of conditions can use an index and others cannot. How do we optimize the scenarios that cannot?

Answer: use a second-level cache.

The second-level cache is simply an in-process memory cache.

You can implement it with Caffeine or Guava.

Caffeine is already integrated with Spring Boot and is very convenient to use.

Just annotate the query methods that need the second-level cache with @Cacheable:

@Cacheable(value = "brand", keyGenerator = "cacheKeyGenerator")
public BrandModel getBrand(Condition condition) {
    return getBrandByCondition(condition);
}

Then customize a cacheKeyGenerator to build the cache key:

import java.lang.reflect.Method;
import org.springframework.cache.interceptor.KeyGenerator;
import org.springframework.util.StringUtils;

public class CacheKeyGenerator implements KeyGenerator {
    private static final String UNDERLINE = "_";

    @Override
    public Object generate(Object target, Method method, Object... params) {
        return target.getClass().getSimpleName() + UNDERLINE
                + method.getName() + UNDERLINE
                + StringUtils.arrayToDelimitedString(params, ",");
    }
}

The key is composed of the various query conditions.

This way, after the brand data is queried with a given combination of conditions, the result is cached in memory with an expiration time of, say, 5 minutes.

If the same conditions are queried again within those 5 minutes, the data is returned directly from the second-level cache.

This can greatly improve the query efficiency of count(*).

However, with a second-level cache each server holds its own copy, so different servers may return different data. Whether that is acceptable depends on the business scenario; it does not fit every case.
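Conceptually, the second-level cache is just an in-process map with a time-to-live. The following minimal sketch (with hypothetical names, standing in for what Caffeine or Guava would do for you) shows the mechanism: a fresh cached value is returned directly, otherwise the slow count query runs and its result is cached.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of a second-level cache: an in-process map with a TTL.
// In practice you would use Caffeine/Guava rather than rolling your own.
public class SecondLevelCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long expiresAt;
        Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public SecondLevelCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Return the cached value if still fresh; otherwise run the (slow)
    // count query via the loader and cache its result for ttlMillis.
    public V get(K key, Supplier<V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = cache.get(key);
        if (e != null && e.expiresAt > now) {
            return e.value;
        }
        V value = loader.get();
        cache.put(key, new Entry<>(value, now + ttlMillis));
        return value;
    }
}
```

The cache key would be built from the combination of query conditions, exactly as the CacheKeyGenerator above does.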

3.3. Multi-threaded execution

Have you ever had a requirement like this: count how many valid orders there are and how many invalid ones?

This usually requires two SQLs. The one counting valid orders looks like this:

select count(*) from `order` where status = 1;

The SQL for counting invalid orders is as follows:

select count(*) from `order` where status = 0;

But executing these two SQLs serially inside one interface is inefficient.

One option is to merge them into a single SQL:

select count(*), status from `order`
group by status;

Grouping on status with the group by keyword produces just two records: one with the number of valid orders and one with the number of invalid orders.

But there is a catch: the status field has only two values, 1 and 0. Its selectivity is so low that an index on it is useless, so the query scans the whole table, which is not efficient.

Are there any other solutions?

Answer: Use multi-threading.

You can use CompletableFuture with two threads to run the valid-order count SQL and the invalid-order count SQL asynchronously, then combine the results. This improves the performance of the query interface.
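A minimal sketch of this pattern, with hypothetical names: the two Supplier arguments stand in for the real DAO calls that would execute the two count SQLs against the database.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: run the two count queries on separate threads and
// combine the results. Each Supplier stands in for a DAO call such as
// "select count(*) from `order` where status = ?".
public class OrderCountService {
    private final Supplier<Long> validCounter;
    private final Supplier<Long> invalidCounter;

    public OrderCountService(Supplier<Long> validCounter, Supplier<Long> invalidCounter) {
        this.validCounter = validCounter;
        this.invalidCounter = invalidCounter;
    }

    // Returns {validCount, invalidCount}.
    public long[] countBoth() {
        CompletableFuture<Long> valid = CompletableFuture.supplyAsync(validCounter);
        CompletableFuture<Long> invalid = CompletableFuture.supplyAsync(invalidCounter);
        // join() waits for both; total time ≈ the slower query, not the sum.
        return new long[] { valid.join(), invalid.join() };
    }
}
```

With two queries of similar cost, this roughly halves the interface's wall-clock time compared to running them serially.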

3.4. Reduce join tables

In most cases, count(*) is used to count the total quantity in real time.

However, even when the table itself is small, joining too many tables can hurt count(*) efficiency.

For example, when querying product information, you may need to filter by product name, unit, brand, category, and other fields.

You might write a SQL like this:

select count(*)
from product p
inner join unit u on p.unit_id = u.id
inner join brand b on p.brand_id = b.id
inner join category c on p.category_id = c.id
where p.name = '后端码匠' and u.id=123 and b.id = 124 and c.id=125;

Here the product table joins three other tables: unit, brand, and category.

But all of these conditions can actually be evaluated against the product table alone; the extra joins are unnecessary.

The SQL can be changed to this:

select count(*)
from product
where name = '后端码匠' and unit_id = 123 and brand_id = 124 and category_id = 125;

Now count(*) touches only the product table. Removing the redundant joins greatly improves query efficiency.

3.5. Change to ClickHouse

Sometimes, though, there really are many joined tables and the joins cannot be removed. What then?

For example, in the product query above, suppose the conditions are product name, unit name, brand name, and category name.

In that case the product table alone cannot answer the query; we must join the unit, brand, and category tables. How do we optimize now?

Answer: You can save data to ClickHouse.

ClickHouse is a column-oriented database. It does not support transactions, but its query performance is very high: it claims to return results over billions of rows in seconds.

To avoid embedding this logic in business code, you can use Canal to listen to the MySQL binlog: whenever a row is added to the product table, query the related unit, brand, and category data, build a denormalized result row, and save it to ClickHouse.

Queries then go to ClickHouse, and count(*) efficiency can improve many times over.

A special reminder: with ClickHouse, avoid very frequent single-row inserts; insert data in batches whenever possible.

In fact, when there are many query conditions ClickHouse is not an ideal fit; you can switch to Elasticsearch instead, but it suffers from the same deep-pagination problem as MySQL.

4. Performance comparison of various usages of count

Having discussed count(*), we should mention the other members of the count family: count(1), count(id), count(ordinary index column), and count(unindexed column).

So what's the difference?

  • count(*): reads all rows without extracting any column values, adding 1 per row.
  • count(1): reads all rows, treating each row as the fixed value 1, also adding 1 per row.
  • count(id): id is the primary key; it must extract the id field from every row. Since id can never be NULL, it adds 1 per row.
  • count(ordinary index column): extracts the index column from every row, checks whether it is NULL, and adds 1 if it is not.
  • count(unindexed column): forces a full table scan, extracts the unindexed column from every row, checks whether it is NULL, and adds 1 if it is not.

From this, the final count performance from high to low is:

count(*) ≈ count(1) > count(id) > count (ordinary index column) > count (unindexed column)

So count(*) is actually the fastest; don't confuse it with select *.


Origin blog.csdn.net/weixin_43874301/article/details/131512332