Talk about 15 tips for sql optimization

foreword

SQL optimization is a hot topic that everyone is concerned about. Whether you are in an interview or at work, you are likely to encounter it.

If there is a performance problem on an online interface you are responsible for one day, you need to optimize it. Then your first thought is probably to optimize the sql statement, because its transformation cost is much smaller than the code.

So, how to optimize the sql statement?

This article shares some tips for sql optimization from 15 aspects, hoping to help you.

1 Avoid using select *

Many times, when we write SQL statements, for convenience, we like to use them directly to select *find out the data of all columns in the table at one time.

Counterexample:

select * from user where id=1;

In actual business scenarios, maybe we only really need to use one or two of these columns. I checked a lot of data, but no, it wasted database resources, such as memory or cpu.

In addition, during the process of data transmission through network IO, the data transmission time will also be increased.

There is also a most important problem: select *it will not go 覆盖索引, there will be a large number of 回表operations, resulting in low performance of query sql.

So, how to optimize it?

Normal example:

select name,age from user where id=1;

When querying the SQL statement, only the columns that need to be used are checked, and the redundant columns do not need to be checked out at all.

2 Use union all instead of union

We all know that unionafter using the keyword in the SQL statement, we can obtain the data after reordering.

And if you use union allkeywords, you can get all data, including duplicate data.

Counterexample:

(select * from user where id=1) 
union 
(select * from user where id=2);

The process of reordering requires traversal, sorting and comparison, which is more time-consuming and consumes more CPU resources.

So if you can use union all, try not to use union.

Normal example:

(select * from user where id=1) 
union all
(select * from user where id=2);

Unless there are some special scenarios, such as after union all, duplicate data appears in the result set, and duplicate data is not allowed in business scenarios, then union can be used.

3 Small tables drive large tables

A small table drives a large table, that is, a data set of a small table drives a data set of a large table.

If there are two tables, order and user, the order table has 10,000 pieces of data, and the user table has 100 pieces of data.

At this time, if you want to check the list of orders placed by all valid users.

This can be achieved using inkeywords:

select * from order
where user_id in (select id from user where status=1)

This can also be achieved using existskeywords:

select * from order
where exists (select 1 from user where order.user_id = user.id and status=1)

In the business scenario mentioned above, it is more appropriate to use the in keyword to achieve business requirements.

why?

Because if the in keyword is included in the sql statement, it will execute in first 子查询语句, and then execute the statement outside in. If the amount of data in in is small, the query speed is faster as a condition.

And if the sql statement contains the exists keyword, it executes the statement to the left of exists first (ie the main query statement). Then use it as a condition to match the statement on the right. If it matches, you can query the data. If there is no match, the data is filtered out.

In this requirement, the order table has 10,000 pieces of data, and the user table has 100 pieces of data. The order table is a large table, and the user table is a small table. If the order table is on the left, it is better to use the in keyword.

in conclusion:

in Applicable to the large table on the left and the small table on the right.
exists Applicable to the small table on the left and the large table on the right.

Whether you use the in or exists keyword, the core idea is to use a small table to drive a large table.

4 Batch operations

What if you have a batch of data that needs to be inserted after business processing?

Counterexample:

for(Order order: list){
   orderMapper.insert(order):
}

Insert data one by one in a loop.

insert into order(id,code,user_id) 
values(123,'001',100);

This operation requires multiple requests to the database to complete the insertion of this batch of data.

But as we all know, in our code, every time we request the database remotely, it will consume a certain amount of performance. And if our code needs to request the database multiple times to complete this business function, it will inevitably consume more performance.

So how to optimize it?

Normal example:

orderMapper.insertBatch(list):

Provides a method to insert data in batches.

insert into order(id,code,user_id) 
values(123,'001',100),(124,'002',100),(125,'003',101);

In this way, you only need to request the database remotely once, and the SQL performance will be improved. The more data, the greater the improvement.

However, it should be noted that it is not recommended to operate too much data in batches at one time. If there is too much data, the database response will be very slow. Batch operations need to grasp a degree, and it is recommended that each batch of data be controlled within 500 as much as possible. If the data is more than 500, it will be processed in multiple batches.

5 multi-use limit

Sometimes, we need to query the first item in some data, for example: query the first order placed by a user, and want to see the time of his first order.

Counterexample:

select id, create_date 
 from order 
where user_id=123 
order by create_date asc;

Query orders according to the user id, sort by order time, first find out all the order data of the user, and get an order set. Then in the code, get the data of the first element, that is, the data of the first order, to get the time of the first order.

List<Order> list = orderMapper.getOrderList();
Order order = list.get(0);

Although this approach has no problem in function, it is very inefficient. It needs to query all the data first, which is a bit of a waste of resources.

So, how to optimize it?

Normal example:

select id, create_date 
 from order 
where user_id=123 
order by create_date asc 
limit 1;

Use limit 1, only return the data with the smallest order time of the user.

In addition, when deleting or modifying data, in order to prevent misoperation, resulting in deletion or modification of irrelevant data, limit can also be added at the end of the SQL statement.

E.g:

update order set status=0,edit_time=now(3) 
where id>=100 and id<200 limit 100;

In this way, even if the wrong operation, such as the id is wrong, it will not affect too much data.

6 in is too much median

For batch query interfaces, we usually use inkeywords to filter out data. For example: I want to query user information in batches through some specified ids.

The sql statement is as follows:

select id,name from category
where id in (1,2,3...100000000);

If we do not impose any restrictions, the query statement may query a lot of data at one time, which may easily cause the interface to time out.

What should we do then?

select id,name from category
where id in (1,2,3...100)
limit 500;

You can limit the data with limit in sql.

However, we are more about adding restrictions to the business code. The pseudo code is as follows:

public List<Category> getCategory(List<Long> ids) {
   if(CollectionUtils.isEmpty(ids)) {
      return null;
   }
   if(ids.size() > 500) {
      throw new BusinessException("一次最多允许查询500条记录")
   }
   return mapper.getCategoryList(ids);
}

Another solution is: if there are more than 500 records in ids, you can use multiple threads to query the data in batches. Only 500 records are checked in each batch, and finally the queried data are aggregated and returned.

However, this is only a temporary solution and is not suitable for scenes with too many ids. Because there are too many ids, even if the data can be quickly detected, if the amount of data returned is too large, the network transmission is very performance-intensive, and the interface performance is not much better.

7 Incremental query

Sometimes, we need to query data through a remote interface and then synchronize to another database.

Counterexample:

select * from user;

If you get all the data directly, then sync it. Although this is very convenient, it brings a very big problem, that is, if there is a lot of data, the query performance will be very poor.

What should we do then?

Normal example:

select * from user 
where id>#{lastId} and create_time >= #{lastCreateTime} 
limit 100;

In ascending order of id and time, only one batch of data is synchronized each time, and this batch of data has only 100 records. After each synchronization is completed, save the largest id and time of the 100 pieces of data for use when synchronizing the next batch of data.

This incremental query method can improve the efficiency of a single query.

8 Efficient paging

Sometimes, when querying data on the list page, in order to avoid returning too much data at one time and affecting the performance of the interface, we generally paginate the query interface.

limitKeywords commonly used for paging in mysql :

select id,name,age 
from user limit 10,20;

If the amount of data in the table is small, using the limit keyword for paging is no problem. But if there is a lot of data in the table, there will be performance problems with it.

For example, the pagination parameter now becomes:

select id,name,age 
from user limit 1000000,20;

MySQL will find 1,000,020 pieces of data, then discard the first 1,000,000 pieces of data, and only check the last 20 pieces of data, which is a waste of resources.

So, how to paginate this massive data?

Optimize sql:

select id,name,age 
from user where id > 1000000 limit 20;

First find the largest id of the last paging, and then use the index on the id to query. However, in this scheme, the id is required to be continuous and ordered.

You can also use betweenoptimized pagination.

select id,name,age 
from user where id between 1000000 and 1000020;

It should be noted that between should be paginated on the unique index, otherwise there will be an inconsistent size of each page.

9 Replacing subqueries with join queries

In mysql, if you need to query data from more than two tables, there are generally two implementation methods: 子查询 and . 连接查询

An example of a subquery is as follows:

select * from order
where user_id in (select id from user where status=1)

Subquery statements can be inimplemented by keywords, and the conditions of one query statement fall in the query results of another select statement. The program runs on the nested innermost statement first, and then runs the outer statement.

The advantage of a subquery statement is that it is simple and structured, if the number of tables involved is small.

But the disadvantage is that when mysql executes sub-queries, temporary tables need to be created. After the query is completed, these temporary tables need to be deleted, which has some additional performance consumption.

At this time, it can be changed to a connection query. Specific examples are as follows:

select o.* from order o
inner join user u on o.user_id = u.id
where u.status=1

10 Join tables should not be too many

According to the provisions of the Alibaba developer manual, the number of join tables should not exceed 3one.

Counterexample:

select a.name,b.name.c.name,d.name
from a 
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id
inner join d on d.c_id = c.id
inner join e on e.d_id = d.id
inner join f on f.e_id = e.id
inner join g on g.f_id = f.id

If there are too many joins, MySQL will be very complicated when selecting indexes, and it is easy to choose the wrong index.

And if there is no hit, the nested loop join is to read a row of data from the two tables for pairwise comparison, and the complexity is n^2.

So we should try to control the number of join tables.

Normal example:

select a.name,b.name.c.name,a.d_name 
from a 
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id

If you need to query the data in other tables in the implementation of the business scenario, you can 冗余专门的字段save the data to be queried in the a, b, and c tables, for example: redundant d_name field in table a.

However, I have seen some ERP systems before. The concurrency is not large, but the business is relatively complex. It needs to join more than a dozen tables to query the data.

Therefore, the number of join tables should be determined according to the actual situation of the system. It cannot be generalized. The less the better.

11 Pay attention when joining

joinWe generally use keywords when it comes to joint query of multiple tables .

The most commonly used joins are left joins and inner joins.

left join: Find the intersection of the two tables plus the remaining data in the left table.
inner join: Find the data of the intersection of two tables.

An example of using inner join is as follows:

select o.id,o.code,u.name 
from order o 
inner join user u on o.user_id = u.id
where u.status=1;

If two tables are related using inner join, mysql will automatically select the small table in the two tables to drive the large table, so there will not be too much problem in performance.

An example of using left join is as follows:

select o.id,o.code,u.name 
from order o 
left join user u on o.user_id = u.id
where u.status=1;

If two tables are associated using left join, MySQL will use the left join keyword to drive the table on the right by default. If there is a lot of data in the left table, there will be performance problems.

It should be noted that when using left join to query, use a small table on the left and a large table on the right. If you can use inner join, use left join as little as possible.

12 Controls the number of indexes

As we all know, indexes can significantly improve the performance of query SQL, but the number of indexes is not the better.

Because when new data is added to the table, an index needs to be created for it at the same time, and the index requires additional storage space and a certain performance consumption.

Alibaba's developer manual stipulates that the number of indexes in a single table should be controlled 5within 1, and the number of fields in a single index should not exceed 51.

The structure of the B+ tree used by mysql to save the index, the B+ tree index needs to be updated during insert, update and delete operations. If there are too many indexes, it will consume a lot of extra performance.

So, the question is, what if there are too many indexes in the table, more than 5?

This question needs to be viewed dialectically. If your system has a low concurrency and the amount of data in the table is not much, in fact, more than 5 can be used, as long as it does not exceed too much.

But for some high-concurrency systems, be sure to abide by the limit of no more than 5 indexes on a single table.

So, how can a high concurrency system optimize the number of indexes?

If you can build a joint index, don't build a single index, you can delete a useless single index.

Migrating some query functions to other types of databases, such as Elastic Seach, HBase, etc., only needs to build a few key indexes in the business table.

13 Choose a reasonable field type

charRepresents a fixed string type. The storage space of the field of this type is fixed, which will waste storage space.

alter table order 
add column code char(20) NOT NULL;

varcharIndicates a variable-length string type. The field storage space of this type will be adjusted according to the length of the actual data, and storage space will not be wasted.

alter table order 
add column code varchar(20) NOT NULL;

If it is a field with a fixed length, such as the user's mobile phone number, it is generally 11 bits, and can be defined as a char type with a length of 11 bytes.

But if it is an enterprise name field, if it is defined as a char type, there is a problem.

If the length is defined too long, for example, it is defined as 200 bytes, and the actual enterprise length is only 50 bytes, 150 bytes of storage space will be wasted.

If the length is defined too short, for example, it is defined as 50 bytes, but the actual enterprise name has 100 bytes, it will not be stored, and an exception will be thrown.

Therefore, it is recommended to change the enterprise name to varchar type. The storage space of variable-length fields is small, which can save storage space, and for queries, the search efficiency in a relatively small field is obviously higher.

When we choose field types, we should follow these principles:

If you can use numeric types, you don't need strings, because characters tend to be slower to process than numbers.
Use small types as much as possible, such as: use bit to store boolean values, tinyint to store enumeration values, etc.
A fixed-length string field, of type char.
A variable-length string field, of type varchar.
Decimal is used for the amount field to avoid the problem of loss of precision.

There are many more principles, which are not listed here.

14 Improve the efficiency of group by

We have many business scenarios that need to use group bykeywords, and its main functions are deduplication and grouping.

Usually it is havingused together with it to indicate that the data is grouped and then filtered according to certain conditions.

Counterexample:

select user_id,user_name from order
group by user_id
having user_id <= 200;

This writing method has poor performance. It first groups all orders according to user id, and then filters users whose user id is greater than or equal to 200.

Grouping is a relatively time-consuming operation, why don't we narrow down the data before grouping?

Normal example:

select user_id,user_name from order
where user_id <= 200
group by user_id

Use the where condition to filter out the redundant data before grouping, so that the efficiency will be higher when grouping.

In fact, this is an idea, not limited to the optimization of group by. Before our SQL statements do some time-consuming operations, we should reduce the data range as much as possible, which can improve the overall performance of SQL.