[MySQL articles] A summary of MySQL performance optimization

foreword

When it comes to MySQL performance tuning, most of the time what we want is to make our queries faster. A query goes through many stages, and each stage consumes time. If we want to reduce the time a query takes, we must work on each stage.

If you are not clear about the SQL statement execution process in MySQL, you can read this article: [MySQL] Detailed explanation of the principle of the Select statement

configuration optimization

The first stage is the client connecting to the server. What kind of performance problems may occur at this stage? The server may not have enough available connections, so the application cannot obtain one, for example: MySQL error 1040: Too many connections. The shortage of connections can be addressed from two sides.

from the server

We can increase the number of available connections on the server side.

If multiple applications or many requests access the database at the same time and the number of connections is not enough, we can:

  1. Modify the configuration parameter max_connections to increase the number of available connections:
show variables like 'max_connections'; -- increase the maximum number of connections when multiple applications connect
  2. Or release inactive connections in a timely manner. The default timeout for both interactive and non-interactive clients is 28800 seconds (8 hours); we can reduce this value.
show global variables like 'wait_timeout'; -- release inactive connections promptly, but do not release connections still in use by the connection pool
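
A rough sketch of checking and adjusting these parameters (the values below are illustrative, not recommendations):

show variables like 'max_connections'; -- current connection limit
show status like 'Threads_connected'; -- connections currently open
set global max_connections = 500; -- raise the limit at runtime; persist it in my.cnf as well
set global wait_timeout = 600; -- shorten the idle timeout from the default 28800 seconds
set global interactive_timeout = 600; -- interactive clients use this value instead of wait_timeout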

from the client

On the client side, we can reduce the number of connections requested from the server by introducing a connection pool, so that connections are reused.

This can be done at the ORM level (MyBatis comes with a connection pool), or with a dedicated connection pool tool (Alibaba's Druid, Hikari which is the default connection pool in Spring Boot 2.x, or the older DBCP and C3P0).

In addition to setting a reasonable number of connections on the server side and a reasonable connection pool size on the client side, what other means do we have to reduce the number of connections between the client and the database server? Let's talk about optimization from the perspective of architecture.

Architecture optimization

cache

When the concurrency of the application system is very high, the absence of a cache causes two problems: on the one hand, it puts a lot of pressure on the database; on the other hand, at the application level, the speed of accessing data also suffers. We can solve this with a third-party caching service, such as Redis.

Running an independent cache service is an optimization at the architectural level.

In order to reduce the reading and writing pressure of a single database server, what other optimization measures can we do at the architectural level?

master-slave replication

If a single database service cannot meet the access requirements, then we can do a database cluster solution.

A cluster will inevitably face a problem, that is, the problem of data consistency between different nodes. If multiple database nodes are read and written at the same time, how to keep the data of all nodes consistent?

At this time, we need to use replication technology. The node whose data is copied is called the master, and the node that copies the data is called the slave.

How is master-slave replication implemented? As we said before, update statements are recorded in the binlog, which is a logical log. The slave server obtains the binlog file from the master server, parses the SQL statements inside, and executes them on the slave to keep the master and slave data consistent.

If you don't understand binlog, you can read this article: [MYSQL] One article to understand redo log and binlog in mysql

Three threads are mainly involved: the binlog dump thread, the I/O thread and the SQL thread.

  • binlog dump thread : runs on the master server and is responsible for sending the data changes recorded in the binary log (binlog) to the slave.
  • I/O thread : responsible for reading the binary log from the master server and writing it to the relay log (Relay log) of the slave server.
  • SQL thread : responsible for reading the relay log, parsing out the data changes that the master server has performed and replaying (Replay) on the slave server.

The figure below shows the three threads involved in master-slave replication.

(figure: the three threads involved in master-slave replication)
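
As a minimal sketch of how a slave is pointed at its master (the hostname, credentials and binlog coordinates below are placeholders; MySQL 8.0 also offers the newer CHANGE REPLICATION SOURCE TO syntax):

CHANGE MASTER TO
  MASTER_HOST = '192.168.1.10',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS = 154;
START SLAVE;
-- check that Slave_IO_Running and Slave_SQL_Running are both Yes:
SHOW SLAVE STATUS\G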

read-write separation

After implementing the master-slave replication scheme, we only write data to the master node, and the read requests can be distributed to the slave nodes. We call this scheme read-write separation.

The reasons why read-write separation can improve performance are:

  • The master and slave servers are responsible for their respective reads and writes, which greatly alleviates lock contention;
  • The slave server can use MyISAM to improve query performance and save system overhead;
  • Increase redundancy and improve availability.

Read-write separation can reduce the access pressure of the database server to a certain extent, but special attention needs to be paid to the consistency of master-slave data.

After master-slave replication is in place, if a single master node or a single table stores too much data, for example a table with hundreds of millions of rows, the query performance of that single table will still decline. We then need to split the data of a single database node or a single table further, which is what sub-database and sub-table means.

Sub-database and sub-table

Let's take a mall system as an example to explain how the database evolves step by step.

Single Application Single Database

At this stage, the mall system includes the home page (Portal) module, the user module, the order module, the inventory module, and so on. All modules share one database, which usually contains many tables. Because the number of users is not large, such an architecture is fully adequate in the early days.

Multiple Application Single Database

As this system is continuously iterated, the amount of code grows, the architecture becomes more and more bloated, and the pressure on the system gradually increases, so splitting the system becomes imperative. To keep the business running smoothly, the architecture refactoring is carried out in several stages.

In the first stage, the monolithic mall system is split into sub-services by functional module, such as the Portal service, user service, order service, inventory service, and so on.

At this stage, multiple services still share one database. The purpose is to keep the underlying database access logic untouched and minimize the impact of the split.

Multiple applications and multiple databases

As business promotion intensifies, the database finally becomes the bottleneck. At this point it is basically unfeasible for multiple services to share one database, so we need to separate the tables related to each service into its own database. This is what sub-database means.

The amount of concurrency supported by a single database is limited. Splitting it into multiple databases can eliminate competition between services and improve service performance.

At this stage, multiple small databases are split out of one large database, and each service has its own database. This is the database-splitting step that becomes necessary when the system develops to a certain scale.

The same is true for the microservice architecture. If you only split the application without splitting the database, you cannot solve the fundamental problem, and the entire system will easily reach a bottleneck.

sub-table

If the system is in a stage of rapid growth, take the mall system as an example: the order volume in a single day may reach hundreds of thousands, so the order table grows very fast, and query efficiency drops significantly once the table reaches a certain size.

Therefore, when the data in a single table grows too fast, a common rule of thumb in the industry is that once a table exceeds about 5 million rows, splitting the table should be considered. Of course, 5 million is just an empirical value; you should decide based on the actual situation.

Taking horizontal splitting as an example, a table is split into multiple sub-tables, and these sub-tables live in the same database. For example, a user table can be split into a user_1 table and a user_2 table, as sketched below.
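
A hypothetical sketch of such a split (the table structure and the id-modulo-2 sharding rule are assumptions made purely for illustration):

CREATE TABLE user_1 (
  id BIGINT UNSIGNED PRIMARY KEY,
  name VARCHAR(64),
  created_at DATETIME
);
CREATE TABLE user_2 LIKE user_1;
-- the application or middleware routes by the sharding key:
-- rows with id % 2 = 1 go to user_1, rows with id % 2 = 0 go to user_2
INSERT INTO user_1 (id, name, created_at) VALUES (1001, 'alice', NOW());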

Splitting a table into several sub-tables within one database can solve the single-table query performance problem to a certain extent, but it then runs into another problem: the storage bottleneck of a single database.

Therefore, a more common practice in the industry is to spread the sub-tables across multiple databases. For example, the user table can be split into two sub-tables that live in two different databases.

Table division is mainly to reduce the size of a single table and solve performance problems caused by the amount of data in a single table.

Complexity

Sub-database and sub-table indeed solve many problems, but it also brings a lot of complexity to the system.

Cross-database association query

Before the split, tables live in a single database and we can easily use a join to query across them. After sub-database and sub-table, however, two tables may no longer be in the same database, so how do we join them?

There are several options to solve it:

  1. Field redundancy: Put the fields that need to be associated into the main table to avoid join operations;
  2. Data abstraction: aggregate data through ETL, etc., to generate new tables;
  3. Global tables: For example, some basic tables can be placed in each database;
  4. Application layer assembly: query the basic data separately and assemble it in the application.

distributed transaction

Transactions within a single database can be handled with local transactions, but transactions that span multiple databases can only be handled with distributed transactions.

Commonly used solutions include: solutions based on reliable messages (MQ), two-phase transaction commit, flexible transactions, etc.

Distributed ID

With a single database and a single table, using MySQL's auto-increment id as the primary key works fine, but after sub-database and sub-table it no longer does: ids generated on different shards will be duplicated.

Commonly used distributed ID solutions are:

  • Use a globally unique ID (GUID);
  • Specify an ID range for each shard;
  • Distributed ID generators (such as Twitter's Snowflake algorithm).
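
Besides the schemes above, one MySQL-native trick (a sketch only; the values are illustrative) is to give each database instance a different auto-increment offset so that generated ids never collide:

-- on database instance 1:
set global auto_increment_increment = 2;
set global auto_increment_offset = 1; -- generates 1, 3, 5, ...
-- on database instance 2:
set global auto_increment_increment = 2;
set global auto_increment_offset = 2; -- generates 2, 4, 6, ...
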
multiple data sources

After sub-database and sub-table, the application may need to obtain data from multiple databases or multiple sub-tables. The general solutions are client-side adaptation and proxy-layer adaptation.
Commonly used middleware in the industry includes:

  1. ShardingSphere (formerly Sharding-JDBC)
  2. Mycat

summary

If there is a database problem, don't rush to divide the database and divide the table, first see if it can be solved by using conventional means.

Sub-database and sub-table brings huge complexity to the system, so it is not recommended unless absolutely necessary. As a system architect, you can design the system to be flexible and scalable, but don't over-design in advance.

Query performance optimization

Analyze with Explain

Explain is used to analyze the SELECT query statement, and developers can optimize the query statement by analyzing the Explain result.

The more important fields are:

  • select_type : Query type, including simple query, joint query, subquery, etc.
  • key : The index to use.
  • rows : The number of rows scanned.
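
A minimal sketch of reading these fields (the table, column and index here are assumptions for illustration only):

explain select id, name from user where name = 'alice';
-- if key is NULL and rows is close to the total row count, the statement is doing a
-- full table scan, and adding an index on the name column is worth considering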

Optimize data access

1. Reduce the amount of requested data

  • Only return necessary columns: it is best not to use SELECT * statements.
  • Return only necessary rows: Use the LIMIT statement to limit the data returned.
  • Cache repeatedly queried data: Using cache can avoid querying in the database, especially when the data to be queried is frequently queried repeatedly, the query performance improvement brought by caching will be very obvious.

2. Reduce the number of rows scanned by the server

The most efficient way is to use indexes to cover queries.

Refactor query method

1. Segment large queries

If a large query is executed at one time, it may lock a lot of data at one time, occupy the entire transaction log, exhaust system resources, and block many small but important queries.

2. Decompose large join query

The advantages of decomposing a large join query into a single-table query for each table and then performing an association in the application are:

  • Make caching more efficient. For join queries, if one of the tables changes, the entire query cache becomes unusable. After decomposing multiple queries, even if one of the tables changes, the query cache for other tables can still be used.
  • Decomposed into multiple single-table queries, the cached results of these single-table queries are more likely to be used by other queries, thereby reducing the query of redundant records.
  • Reduce lock contention;
  • Connecting at the application layer makes it easier to split the database, making it easier to achieve high performance and scalability.
  • The efficiency of the query itself may also be improved. For example, in the following example, using IN() instead of join query can make MySQL query according to the order of ID, which may be more efficient than random join.
SELECT * FROM tag
JOIN tag_post ON tag_post.tag_id=tag.id
JOIN post ON tag_post.post_id=post.id
WHERE tag.tag='mysql';

SELECT * FROM tag WHERE tag='mysql';
SELECT * FROM tag_post WHERE tag_id=1234;
SELECT * FROM post WHERE post.id IN (123,456,567,9098,8904);

index optimization

1. Separate columns

When making a query, the indexed column cannot be part of an expression, nor can it be a parameter of a function, otherwise the index cannot be used.

For example, the following query cannot use an index on the actor_id column:

SELECT actor_id FROM sakila.actor WHERE actor_id + 1 = 5;
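
Rewriting the expression so the indexed column stands alone allows the index to be used:

SELECT actor_id FROM sakila.actor WHERE actor_id = 4;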

2. Multi-column index

When you need to use multiple columns as conditions for query, using multi-column indexes is better than using multiple single-column indexes. For example, in the following statement, it is better to set actor_id and film_id as multi-column indexes.

SELECT film_id, actor_id FROM sakila.film_actor
WHERE actor_id = 1 AND film_id = 1;
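
A sketch of creating such a multi-column index (the index name is an assumption; in the sakila sample database film_actor already has a composite primary key on these two columns, so this is purely illustrative):

ALTER TABLE sakila.film_actor ADD INDEX idx_actor_film (actor_id, film_id);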

3. The order of the index columns

Put the most selective index columns first.

Index selectivity refers to the ratio of unique index values to the total number of records. The maximum value is 1, in which case each record has a unique index value. The higher the selectivity, the higher the discrimination of each record and the higher the query efficiency.

For example, in the results shown below, customer_id is more selective than staff_id, so it is better to put the customer_id column in front of the multi-column index.

SELECT COUNT(DISTINCT staff_id)/COUNT(*) AS staff_id_selectivity,
COUNT(DISTINCT customer_id)/COUNT(*) AS customer_id_selectivity,
COUNT(*)
FROM payment;
   staff_id_selectivity: 0.0001
customer_id_selectivity: 0.0373
               COUNT(*): 16049

4. Prefix index

For BLOB and TEXT columns, MySQL requires a prefix index, and for long VARCHAR columns a prefix index is usually advisable; a prefix index only indexes the first part of the characters.

The selection of the prefix length needs to be determined according to the index selectivity.
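
A sketch of choosing the prefix length (the table and column names are assumptions):

SELECT COUNT(DISTINCT LEFT(email, 5)) / COUNT(*) AS sel_5,
       COUNT(DISTINCT LEFT(email, 10)) / COUNT(*) AS sel_10,
       COUNT(DISTINCT email) / COUNT(*) AS full_selectivity
FROM user;
-- pick the shortest prefix whose selectivity is close to that of the full column, then:
ALTER TABLE user ADD INDEX idx_user_email (email(10));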

5. Covering index

The index contains the values of all the fields that need to be queried.

Has the following advantages:

  • The index is usually much smaller than the size of the data row, and only reading the index can greatly reduce the amount of data access.
  • Some storage engines (such as MyISAM) only cache indexes in memory, and the data relies on the operating system to cache. Therefore, accessing only the index can avoid using system calls (which are usually time-consuming).
  • For the InnoDB engine, there is no need to access the primary index if the secondary index can cover the query.
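
A minimal sketch (the table, columns and index name are assumptions): when the index contains every column the query touches, EXPLAIN shows "Using index" in the Extra column.

ALTER TABLE user ADD INDEX idx_name_age (name, age);
EXPLAIN SELECT name, age FROM user WHERE name = 'alice';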

If you don't know much about MySQL indexes, you can read these two articles:

[MYSQL articles] Understand the principle of mysql index in one article

[MYSQL articles] How are indexes implemented in different storage engines of mysql?

storage engine

Choice of storage engine

Choose different storage engines for different business tables. For example: use MyISAM for tables dominated by queries and inserts, use Memory for temporary data, and use InnoDB for ordinary tables with frequent concurrent updates.
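
A sketch of specifying the engine per table (the table definitions are illustrative only):

CREATE TABLE report_log (id BIGINT PRIMARY KEY, msg TEXT) ENGINE = MyISAM; -- query/insert heavy
CREATE TABLE tmp_result (id BIGINT PRIMARY KEY, val INT) ENGINE = MEMORY; -- temporary data
CREATE TABLE orders (id BIGINT PRIMARY KEY, status TINYINT) ENGINE = InnoDB; -- frequent concurrent updates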

field definition

Principle: Use the smallest data type that can store data correctly. Select the appropriate field type for each column.

integer type

TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT use 8, 16, 24, 32, and 64 bits of storage space respectively. Generally, the smaller the column, the better. The number in INT(11) only specifies the display width used by interactive tools; it has no meaning for storage or calculation.

character type

For variable-length values, VARCHAR saves space, but a VARCHAR field needs 1 or 2 extra bytes to record the length. For fixed-length values, use CHAR rather than VARCHAR.
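
A small sketch (the column names are assumptions): a value that always has the same length, such as a 32-character MD5 hash, fits CHAR, while variable-length values fit VARCHAR.

CREATE TABLE file_meta (
  id BIGINT UNSIGNED PRIMARY KEY,
  md5 CHAR(32) NOT NULL,
  file_name VARCHAR(255) NOT NULL
) ENGINE = InnoDB;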

Do not use foreign keys, triggers, views

They reduce readability and affect database performance. Calculation should be handed over to the application, and the database should concentrate on storage; data integrity should also be checked in the application.

large file storage

Do not use the database to store pictures (such as base64 encoding) or large files;

Put files on a NAS; the database only needs to store the URI (a relative path), and the NAS server address is configured in the application.

Table split or field redundancy

Split out uncommonly used fields to avoid too many columns and too much data.

For example, a business system needs to record all sent and received messages. The messages are in XML format and are stored as BLOB or TEXT only for tracing and duplicate checking, so a separate table can be created just to store the messages.

Summarize

If, during an interview, you encounter the question "from which dimensions would you optimize a database", how would you answer it?

  • SQL and indexes
  • Storage Engine and Table Structure
  • database schema
  • MySQL configuration
  • Hardware and Operating System

In addition to optimizing code, SQL statements, table definitions, schemas, and configurations, optimization at the business level cannot be ignored. To give a few examples:

  1. On Double Eleven in a certain year, why was there a bonus activity for topping up Yu'e Bao or the account balance, such as recharge 300 and get 50?

Because paying with the balance or Yu'e Bao is recorded in the local or internal database, while paying with a bank card requires calling the bank's interface, and operating the internal database is definitely faster.

  2. On Double Eleven last year, why was it forbidden in the early hours of the morning to query bills other than today's?

This is a downgrade measure to ensure the current core business.

  3. In recent years, why is the Double Eleven price already available a week before Double Eleven?

Pre-sale diversion.

At the application level, there are also many other ways to reduce the pressure on the database as much as possible, such as rate limiting, or introducing MQ to shave traffic peaks, and so on.

Why is it that, with the same MySQL, some companies can withstand tens of millions of concurrent requests while others cannot handle a few hundred? The key lies in how it is used. So when using the database feels slow, it does not necessarily mean the database itself is slow; sometimes the upper layers need to be optimized.

Origin: blog.csdn.net/jiang_wang01/article/details/131343977