Experience Sharing: Database Performance Optimization in Engineering Applications

Abstract: This article shares our understanding of database optimization from several angles: the SQL execution process, execution plans, index data structures, how indexes speed up queries, clustered indexes, the leftmost-prefix principle, and auto-increment primary key indexes.

This article is shared from the Huawei Cloud Community post "Summary of Database Performance Optimization Experience in Engineering Applications", author: Ye Gong.

1. Introduction

Most of the algorithm products we deliver at this stage rely on a database. It stores user permission management data, dataset information, asynchronous inference results, personalized configuration, and so on.

In OCR scenarios, the amount of data is usually large (hundreds of thousands of images in one dataset). The database is often deployed on the customer's shared database instance, which runs many other workloads at the same time, and sometimes it even has to share a server with the algorithm image. Back-end development therefore needs to pay special attention to database performance bottlenecks.

This article shares our understanding of database optimization from several angles: the SQL execution process, execution plans, index data structures, how indexes speed up queries, clustered indexes, the leftmost-prefix principle, and auto-increment primary key indexes.

2. How to obtain the complete SQL statement in the ORM scenario

1. In the online environment, slow SQL can be intercepted through the connection pool, which then sends an alarm notification.

2. During the test phase, if the complete SQL cannot be obtained because prepared statements or an ORM framework are in use, it can be captured through the database's general query log:

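-- Enable the general query log, which records every statement the server receives
SET GLOBAL general_log = ON;
-- Show the path of the file the log is written to
SHOW VARIABLES WHERE Variable_name = 'general_log_file';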
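The first point, intercepting slow SQL at the connection layer, can be sketched roughly as follows. This is only an illustration under stated assumptions: `TimedConnection`, `threshold_seconds` and `on_slow` are hypothetical names, Python's built-in sqlite3 stands in for the production database, and a real connection pool would hook in at its own extension point.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("slow_sql")


class TimedConnection:
    """Hypothetical wrapper a connection pool could hand out instead of a raw connection."""

    def __init__(self, conn, threshold_seconds, on_slow=None):
        self._conn = conn
        self._threshold = threshold_seconds
        # on_slow stands in for the alarm notification; the default only logs a warning.
        self._on_slow = on_slow or (
            lambda sql, params, elapsed: logger.warning(
                "slow SQL (%.3fs): %s params=%r", elapsed, sql, params
            )
        )

    def execute(self, sql, params=()):
        start = time.perf_counter()
        rows = self._conn.execute(sql, params).fetchall()  # fetch time is measured too
        elapsed = time.perf_counter() - start
        if elapsed >= self._threshold:
            self._on_slow(sql, params, elapsed)
        return rows


slow_statements = []
conn = TimedConnection(
    sqlite3.connect(":memory:"),
    threshold_seconds=0.0,  # 0 so that every statement is reported in this demo
    on_slow=lambda sql, params, elapsed: slow_statements.append((sql, params)),
)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO t (name) VALUES (?)", ("a",))
rows = conn.execute("SELECT name FROM t WHERE id = ?", (1,))
print(rows, len(slow_statements))
```

In practice the threshold would be set to something like the team's slow-query budget, and the callback would push to whatever alerting channel is in use.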

2.1 SQL execution process

Analyzer: parses the SQL to determine which tables are used and which conditions apply (knowing what to do).

Optimizer: estimates the cost of the candidate execution plans and selects the one with the lowest cost. The cost is only the optimizer's estimate and may not be correct (deciding how to do it fastest).

Executor: calls the storage engine interface and returns the data. Storage engines are pluggable, similar to polymorphism in programming, and you can choose the storage engine when creating a table.

2.2 Execution plan

Add the EXPLAIN keyword before a SQL statement to get its execution plan. From the plan you can judge whether the statement is executed as expected.

EXPLAIN
SELECT
    db_dataset.uuid AS db_dataset_uuid,
    db_dataset.NAME AS db_dataset_name,
    db_dataset.updated_at AS db_dataset_updated_at,
    db_dataset.created_at AS db_dataset_created_at,
    db_dataset.volume_dir AS db_dataset_volume_dir,
    db_dataset.max_data_count AS db_dataset_max_data_count,
    db_dataset.description AS db_dataset_description
FROM
    db_dataset
    LEFT OUTER JOIN db_manifest
        ON db_manifest.dataset_id = db_dataset.id
        AND db_manifest.dataset_version = 'annotation_v0'
    LEFT OUTER JOIN db_ai_data
        ON db_manifest.id = db_ai_data.manifest_id
        AND db_ai_data.deleted = '0'
WHERE
    db_dataset.deleted = 0
GROUP BY
    db_dataset.id;
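To see how a plan changes once an index exists, the following sketch compares the plan of one query before and after creating an index. It uses SQLite's `EXPLAIN QUERY PLAN` through Python's built-in sqlite3 module as a stand-in, so the output format differs from MySQL's EXPLAIN. The reduced `db_manifest` schema and the index name `idx_manifest_dataset` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: only the columns referenced by the join above.
conn.execute(
    "CREATE TABLE db_manifest ("
    "id INTEGER PRIMARY KEY, dataset_id INTEGER, dataset_version TEXT)"
)


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM db_manifest WHERE dataset_id = 42"
before = plan(query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_manifest_dataset ON db_manifest (dataset_id)")
after = plan(query)   # the new index is used to search for matching rows

print(before)
print(after)
```

The same before/after comparison applies to MySQL: run EXPLAIN, add the candidate index, and run EXPLAIN again to confirm the plan actually changed.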

Explanation of the columns in the execution plan output:

Detailed explanation of select_type:

Detailed explanation of type:

The type column shows which access type the query uses. It is a very important indicator in SQL optimization. From best to worst performance, the values are:
system > const > eq_ref > ref > ref_or_null > index_merge > unique_subquery >
index_subquery > range > index > ALL

system: The table has only one row (a system table). The amount of data is tiny, disk I/O is often unnecessary, and the lookup is very fast.

const: The query matches a primary key or unique index against a constant value, so at most one row matches. This access type is extremely efficient and very fast.

eq_ref: In a join, exactly one row is read from this table for each row combination from the preceding tables, using a primary key or a unique NOT NULL index.

ref: Unlike eq_ref, ref uses a non-unique index, so several matching rows may be found.

ref_or_null: Similar to ref, except that MySQL additionally searches for rows containing NULL values.

index_merge: The index merge optimization is used, meaning one query combines two or more indexes on the same table.

EXPLAIN SELECT * FROM user_robot_relate WHERE id > 1 AND user_id = 2;

unique_subquery: Replaces eq_ref for IN subqueries of the following form, where the subquery selects a primary key or unique index column and therefore returns unique values.

value IN (SELECT primary_key FROM single_table WHERE some_expr)

index_subquery: Like unique_subquery, but for non-unique indexes, so the subquery can return duplicate values.

value IN (SELECT key_column FROM single_table WHERE some_expr)

range: Rows are selected through an index, and only rows within the given range are retrieved. In other words, the query retrieves a range of values on an indexed column. WHERE conditions using BETWEEN ... AND, <, >, <=, IN and similar operators produce the range type, but only when the column in the condition is indexed.

EXPLAIN SELECT * FROM user_robot_relate WHERE id BETWEEN 2 AND 3;

index: Both index and ALL read the whole table. The difference is that index traverses the entire index tree, while ALL reads the table's data rows. The index is usually smaller than the row data, so index is usually faster.

ALL: The entire table is scanned to find matching rows. This has the worst performance.

Extra: Information that does not fit in the other columns. Much of the additional detail in EXPLAIN is shown in the Extra column.

Using index: The SELECT uses a covering index, that is, every column the query needs is contained in the index. Queries served by a covering index are very fast, which makes this an ideal state in SQL optimization.

Using where: The rows returned by the storage engine are filtered again by the WHERE condition at the server layer. When it appears together with type ALL or index, it usually means no usable index was found. Note that not every query with a WHERE clause shows Using where.

Using temporary: The query needs a temporary table to hold intermediate results. This typically appears in sorting or grouping queries.

Using filesort: The sort cannot be completed through an index, for example because the ORDER BY column has no suitable index. Such SQL usually needs to be optimized.

Using join buffer: When tables are joined and the join condition cannot use an index, a join buffer is needed to store intermediate results.
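Two of these Extra values have close analogues in SQLite, which makes them easy to reproduce locally. In the sketch below, SQLite reports `USING COVERING INDEX` where MySQL would report Using index, and `USE TEMP B-TREE FOR ORDER BY` where MySQL would report Using filesort. The table name comes from the examples above; the columns `robot_id` and `note` and the index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_robot_relate ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, robot_id INTEGER, note TEXT)"
)
conn.execute("CREATE INDEX idx_user_robot ON user_robot_relate (user_id, robot_id)")


def plan(sql):
    # Join the description column of every plan row into one string.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Every selected column is in the index: no lookup into the table rows is needed.
covered = plan("SELECT user_id, robot_id FROM user_robot_relate WHERE user_id = 2")
# note is not in the index: the index finds the rows, then the table is read.
not_covered = plan("SELECT note FROM user_robot_relate WHERE user_id = 2")
# Within user_id = 2 the index is already ordered by robot_id: no extra sort.
sorted_by_index = plan(
    "SELECT user_id, robot_id FROM user_robot_relate WHERE user_id = 2 ORDER BY robot_id"
)
# note has no index: the rows must be sorted in a temporary structure.
sorted_without_index = plan("SELECT * FROM user_robot_relate ORDER BY note")

for text in (covered, not_covered, sorted_by_index, sorted_without_index):
    print(text)
```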

2.3 Index

An index is a sorted data structure that helps MySQL retrieve data efficiently.

Index data structure:

Binary tree

Red-black tree

Hash table

B-Tree

Why a binary tree is generally not used: sorted data makes it degenerate into a linked list, so its depth is uncontrollable, as shown in the figure below.
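The degeneration can be reproduced in a few lines. The sketch below inserts the same keys into a plain, unbalanced binary search tree twice, once in sorted order and once shuffled, and compares the resulting depth. The helper `bst_depth` and the key count are assumptions made for the demonstration.

```python
import random


def bst_depth(keys):
    """Insert keys into an unbalanced binary search tree and return its depth."""
    root = None  # each node is a list: [key, left_child, right_child]
    depth = 0
    for key in keys:
        if root is None:
            root = [key, None, None]
            depth = 1
            continue
        node, level = root, 1
        while True:
            side = 1 if key < node[0] else 2  # 1 = left child, 2 = right child
            level += 1
            if node[side] is None:
                node[side] = [key, None, None]
                break
            node = node[side]
        depth = max(depth, level)
    return depth


n = 1000
sorted_depth = bst_depth(range(n))  # every key goes to the right: a linked list

shuffled = list(range(n))
random.Random(0).shuffle(shuffled)
shuffled_depth = bst_depth(shuffled)  # random order stays far shallower

print(sorted_depth, shuffled_depth)
```

With sorted input the depth equals the number of keys, which is exactly the case of an auto-increment id being inserted in order.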

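Why a red-black tree usually cannot be used: although rebalancing compresses the depth, each node still has at most two children, so the depth keeps growing with the data volume, and a lookup over massive data has to visit many nodes.

Hash table: supports only equality lookups (= and IN), not RANGE queries. The hash algorithm maps each key to a bucket that has nothing to do with the key's order, and different keys can land in the same bucket, for example hash(aaaa) = 2, hash(bbbb) = 2, hash(cccc) = 4.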
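The difference between a hash index and an ordered index can be illustrated with Python's dict (a hash table) and a sorted list searched with `bisect`. The keys and row ids below are made up for the example.

```python
import bisect

rows = {"aaaa": 1, "bbbb": 2, "cccc": 3, "dddd": 4}  # hypothetical key -> row id

# Hash index: equality and IN lookups go straight to the entry.
hash_index = dict(rows)
equal_hit = hash_index["bbbb"]
in_hits = [hash_index[k] for k in ("aaaa", "cccc") if k in hash_index]

# A range condition gets no help from hashing: every key has to be examined.
range_via_hash = sorted(v for k, v in hash_index.items() if "bbbb" <= k <= "cccc")

# Ordered index: keep the keys sorted, find both ends by binary search,
# then take the slice between them.
sorted_keys = sorted(rows)
lo = bisect.bisect_left(sorted_keys, "bbbb")
hi = bisect.bisect_right(sorted_keys, "cccc")
range_via_sorted = [rows[k] for k in sorted_keys[lo:hi]]

print(equal_hit, in_hits, range_via_hash, range_via_sorted)
```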

B+ tree: mainstream index structure

Lookup process:

1. Read all the elements of the root node. Because they are ordered, binary search can efficiently find the interval that contains the target.

2. Follow that interval's file address to the second-level node and read all of its elements.

3. Locate the target element in the leaf node.

2.4 The principle of index query speed-up

Take a B+ tree index as an example.

Suppose you want to find data item 29:

1. Start at block 1. Block 1 is loaded into memory, which costs one I/O.

2. A binary search in memory shows that 29 lies between 17 and 35, so the P2 pointer is chosen. Block 3 is loaded into memory, which costs a second I/O.

3. In the same way, the P2 pointer in block 3 leads to block 8. Block 8 is loaded into memory, which costs the third and last I/O.

4. Searching the data in block 8 finds data item 29.

Without an index, the worst case is that every data block of the table must be loaded into memory and scanned, which means a large amount of I/O and a traversal of the entire table.
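The walk above can be traced with a toy tree. In the sketch below, only the root keys 17 and 35, the target 29, and the path through blocks 1, 3 and 8 come from the text. All other keys and block numbers are invented to make the tree complete, and each block read is counted as one I/O.

```python
import bisect

# block id -> (keys, children); leaf blocks have children set to None.
BLOCKS = {
    1: ([17, 35], [2, 3, 4]),    # root: P1 -> block 2, P2 -> block 3, P3 -> block 4
    2: ([8], [5, 6]),
    3: ([26, 30], [7, 8, 9]),    # P2 -> block 8
    4: ([65, 87], [10, 11, 12]),
    5: ([3, 5], None),
    6: ([8, 12], None),
    7: ([17, 20], None),
    8: ([26, 28, 29], None),
    9: ([30, 33], None),
    10: ([35, 60], None),
    11: ([65, 79], None),
    12: ([87, 99], None),
}


def find(target):
    """Walk from the root block to a leaf; return (found, list of blocks read)."""
    reads = []
    block_id = 1
    while True:
        keys, children = BLOCKS[block_id]
        reads.append(block_id)  # loading a block models one disk I/O
        if children is None:    # leaf block: search inside it
            i = bisect.bisect_left(keys, target)
            return (i < len(keys) and keys[i] == target), reads
        # Internal block: binary search picks the pointer for the target's interval.
        block_id = children[bisect.bisect_right(keys, target)]


found, reads = find(29)
leaf_blocks = [b for b, (_, children) in BLOCKS.items() if children is None]
print(found, reads, len(leaf_blocks))
```

The lookup touches three blocks, while a scan without the index would in the worst case read all eight leaf blocks of this toy table.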

2.5 Clustered index

Clustered indexes are especially suitable for columns that need RANGE lookups, because their leaf nodes store the data rows in order. During a query, the leaf nodes at both ends of the range are located from the WHERE condition, and then the whole stretch of the linked list between them is read.
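A range scan over linked leaf pages might look like the following sketch. The names `Leaf`, `build_leaves` and `range_scan` and the page size are assumptions, and a binary search over each leaf's first key stands in for descending the upper levels of the tree.

```python
import bisect


class Leaf:
    def __init__(self, rows):
        self.rows = rows   # list of (primary_key, row_data), sorted by key
        self.next = None   # pointer to the next leaf page


def build_leaves(rows, page_size):
    """Cut sorted rows into pages and link each page to its right neighbour."""
    leaves = [Leaf(rows[i:i + page_size]) for i in range(0, len(rows), page_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_scan(leaves, low, high):
    """Return all rows with low <= key <= high."""
    first_keys = [leaf.rows[0][0] for leaf in leaves]
    # Locate the leaf that can contain the lower bound.
    start = max(bisect.bisect_right(first_keys, low) - 1, 0)
    leaf, result = leaves[start], []
    while leaf is not None:
        for key, data in leaf.rows:
            if key > high:   # passed the upper bound: stop following the list
                return result
            if key >= low:
                result.append((key, data))
        leaf = leaf.next
    return result


rows = [(k, f"row-{k}") for k in range(1, 21)]
leaves = build_leaves(rows, page_size=4)
picked = range_scan(leaves, 6, 11)
print([key for key, _ in picked])
```

Only the pages that overlap the range are visited, and the row data is available directly in the leaf, with no second lookup.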

2.6 Leftmost-prefix optimization principle

In engineering applications, some core tables often need to be queried in several different ways. If a separate index is built for every query pattern, insert and update performance on the table suffers.

The columns of a composite index are sorted in order when the index is built. For example, if table A has a composite index (a, b, c), the queries where a = xxx and where a = xxx and b = xxx both hit the index. You can use this property to cover many query patterns with a small number of composite indexes chosen according to business needs.

Suppose table A has the following three high-frequency queries:

select xx from A where a = xxx;
select xx from A where b = xxx;
select xx from A where a = xxx and b = xxx;

The simplest approach is to create a separate index for each query, a, b and (a, b), but this is redundant. According to the leftmost-prefix principle, the composite index (a, b) already serves the queries on a alone, so the most reasonable set of indexes is b and (a, b).
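Why a prefix works and a non-prefix column does not can be shown by modelling a composite index as a sorted list of tuples. This is a simplified sketch: the sample rows are invented, integer column values are assumed, and `prefix_lookup` is a hypothetical helper.

```python
import bisect

# Hypothetical rows of table A as (a, b, c). The composite index (a, b, c)
# is modelled as these tuples kept in sorted order.
rows = [(a, b, c) for a in range(1, 4) for b in range(1, 4) for c in range(1, 3)]
index_abc = sorted(rows)


def prefix_lookup(index, prefix):
    """Binary-search a sorted tuple index for entries that start with prefix.

    Assumes integer column values, so the upper bound is the prefix with its
    last value incremented by one.
    """
    lo = bisect.bisect_left(index, prefix)
    hi = bisect.bisect_left(index, prefix[:-1] + (prefix[-1] + 1,))
    return index[lo:hi]


by_a = prefix_lookup(index_abc, (2,))      # where a = 2
by_ab = prefix_lookup(index_abc, (2, 3))   # where a = 2 and b = 3

# where b = 2 alone: the matches are scattered through the index, so binary
# search cannot narrow them down and every entry has to be checked.
positions_b = [i for i, row in enumerate(index_abc) if row[1] == 2]

print(len(by_a), by_ab, positions_b)
```

Entries that share a prefix are contiguous in the sorted order, while entries with the same b are spread across every value of a, which is why a separate index on b is still needed.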

2.7 Auto-increment primary key index

1. All data in InnoDB is stored in a B+ tree. If a table has no primary key, MySQL uses the first unique index whose columns are all NOT NULL as the clustered index. If there is none, it adds a hidden row ID column by default.

2. An index lookup involves a large number of key comparisons. A UUID key has to be compared character by character, which is inefficient, and it also takes much more space than an integer, so it occupies more SSD space and raises storage cost.

3. A B+ tree is an ordered tree. With an auto-increment key, new rows are always appended at the end, which performs well. With a non-incrementing key, inserts can land in the middle of the tree and trigger page splits and rebalancing, which costs extra performance.
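The third point can be illustrated with a deliberately simplified model of leaf pages. It is not InnoDB's actual algorithm: `insert_all`, the page capacity of 4 and the split-in-half rule are assumptions chosen to make the effect visible. The sketch counts how many rows are moved by page splits when keys arrive in increasing order versus random order.

```python
import bisect
import random


def insert_all(keys, page_capacity=4):
    """Insert keys into sorted leaf pages; return (page_count, rows_moved_by_splits)."""
    pages, moved = [], 0  # pages are kept in key order; each page is a sorted list
    for key in keys:
        if not pages:
            pages.append([key])
            continue
        first_keys = [page[0] for page in pages]
        i = max(bisect.bisect_right(first_keys, key) - 1, 0)  # page that should hold key
        page = pages[i]
        if len(page) < page_capacity:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the last full page: open a new page, move nothing.
            pages.append([key])
        else:
            # The target page is full: split it and move the upper half to a new page.
            mid = page_capacity // 2
            upper = page[mid:]
            del page[mid:]
            moved += len(upper)
            pages.insert(i + 1, upper)
            bisect.insort(upper if key >= upper[0] else page, key)
    return len(pages), moved


n = 1000
seq_pages, seq_moved = insert_all(range(1, n + 1))  # auto-increment style keys

random_keys = list(range(1, n + 1))
random.Random(0).shuffle(random_keys)                # UUID-like arrival order
rand_pages, rand_moved = insert_all(random_keys)

print(seq_pages, seq_moved, rand_pages, rand_moved)
```

In this model the increasing keys fill every page completely and never move a row, while the random keys cause splits that move rows and leave pages partly empty.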

3. Conventional database optimization sequence

1. Check the SQL and its execution plan. Does it hit an index? Does it join many large tables? Is every selected column really needed? ...

2. Add indexes

3. Partition tables

4. Split tables

5. Change the table structure: reduce joins in queries and add redundant columns

6. Add servers; on elastic hosts, add CPUs and memory and switch to SSDs...

 


Origin my.oschina.net/u/4526289/blog/9105988