DTCC 2020 | Alibaba Cloud's Liang Gaozhong: Workload-Based Global Automatic Optimization Practice in DAS

Abstract: The 11th China Database Technology Conference (DTCC 2020) was held in Beijing. Under the theme "Architecture Innovation, Efficient and Controllable", the conference focused on data architecture, AI and big data, database practice in traditional enterprises, and domestic open source databases. At the performance optimization and SQL audit session on December 23, Liang Gaozhong, a senior technical expert on the Alibaba database technology team, was invited to present DAS's Workload-based global automatic optimization practice.
Automatic SQL optimization is one of the key autonomy scenarios of Alibaba Cloud Database Autonomous Service (DAS). The service supports automatic optimization of slow SQL across all of Alibaba Group, and more than 49 million slow SQL statements have been automatically optimized so far. The talk shares the experience and lessons Alibaba gained while building this capability, from two angles: building Workload-based global optimization, and putting the intelligent automatic optimization closed loop into practice.

About the speaker:

Liang Gaozhong is a senior technical expert on the Alibaba database technology team. He joined Alibaba Group in 2017 and is currently responsible for research and development of Alibaba Cloud Database Autonomous Service (DAS). Before joining Alibaba he worked at IBM, Huawei, and other companies. He has more than 12 years of experience in database products and database optimization, and has led development teams for a database optimization expert system and a cross-source, cross-data-center federated database, among others.

The following content is compiled from the talk video and slides.

The talk covers three topics:
1. SQL optimization scenario
2. Core diagnosis capability construction
3. Automatic optimization closed loop

1. SQL optimization scenarios

1. SQL optimization challenge

Database diagnosis and optimization is one of the key technologies for improving database performance and stability, and SQL optimization is a vital part of it. About 80% of database performance problems can currently be solved through SQL optimization. SQL optimization still faces many challenges, however. First, it requires expert knowledge and experience across many areas of the database field. Second, it is time-consuming and labor-intensive. At the scale of a business like Alibaba's, continuous SQL optimization is especially challenging. The figure below shows a slow SQL trend for a database, drawn from real business data over time.
T1 is the point at which the database instance is found to be performing abnormally because of slow SQL, and T2 is the point at which optimization is finished and the instance is back to normal. The earlier T1 occurs, the less time it took to detect the anomaly. The interval T2 - T1 is the time spent handling it. If handling takes too long, it seriously affects the business and greatly increases the risk of an outage.
1.png

2. Three scenarios for SQL optimization

Offering SQL optimization to users involves three main scenarios. The first is an assisted diagnosis tool for a single SQL statement. The user supplies one SQL statement, and the tool returns optimization suggestions (rewrites, optimal index suggestions, and so on) based on that statement and related environment information, with the goal of maximizing query speed. The second is a workload-based global assisted diagnosis tool. It takes the Workload as the unit of optimization and weighs the Workload characteristics that affect overall performance, with the goal of maximizing overall performance while minimizing space consumption. Both of these scenarios support the user's own decisions about SQL diagnosis and optimization. The third scenario is automatic SQL optimization: a complete automated pipeline that covers problem SQL identification, suggestion generation, automatic evaluation, follow-up tracking, and benefit calculation.

2. Construction of core diagnostic capabilities

Supporting SQL optimization requires building core diagnostic capabilities. The core diagnostic capability is the ability to give accurate suggestions for a problem SQL statement. The following problems typically arise along the way.

1. Single-SQL optimization diagnosis

In essence, SQL optimization means creating the conditions for improvement, such as rewriting SQL or creating indexes, so that the database optimizer can choose an optimal or near-optimal execution plan. At the center of the figure below is the SQL optimization engine, with the external scenarios built on it on either side: on the left, the automatic SQL optimization closed loop, and on the right, SQL optimization suggestions offered to users. Building single-SQL diagnosis raises four major problems.
First, which recommendation algorithm should be used: a rule-based approach or a cost-model approach? And what can be done for databases that lack WHAT-IF capability in the kernel?
Second, how can a test case library large enough to verify the core capability be built? Sufficient coverage matters because an accurate test case library is often crucial when building core diagnostic capabilities.
Third, how can diagnosis be provided as a service at scale? Alibaba needs to serve SQL diagnosis and optimization for hundreds of thousands of database instances on the cloud. That requires splitting complex computation into services that scale horizontally, maximizing parallelism, controlling concurrent access to resources in a distributed environment, scheduling and isolating work of different priorities, and buffering peaks.
Fourth, how can SQL diagnosis capability keep improving over time?
2.png

Single-SQL optimization diagnosis: choosing the recommendation algorithm, and the challenges

The first kind of recommendation algorithm is rule-based: it optimizes according to rules written in advance. The second kind is based on cost evaluation. The left side of the figure below shows the architecture of a traditional commercial optimal-index recommendation engine. Incoming SQL is analyzed to generate candidate indexes. The cost of each candidate is then obtained through the database server's WHAT-IF capability, the candidates are evaluated on the costs the WHAT-IF interface returns, and a final index merge selects the best. This is the cost-based optimal index recommendation process in traditional databases. For database engines such as MySQL, however, the process runs into several challenges:
Challenge 1: MySQL has no WHAT-IF capability;
Challenge 2: MySQL does not provide complete statistics.

The architecture therefore needs to be adapted by adding a built-in optimizer between the SQL optimization engine and the database server, and letting the built-in optimizer provide the WHAT-IF function. This architecture brings further challenges:

Challenge 3: how to minimize the gap between the two optimizers;
Challenge 4: how to reduce the differences between the statistics used by the built-in optimizer and the statistics in MySQL.

3.png
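The cost-based flow described above (generate candidate indexes, ask a WHAT-IF interface for the cost of each, keep the cheapest) can be sketched as follows. This is only an illustration under stated assumptions: the talk does not describe the actual candidate-generation rules, so the "equality columns first, then one range column" heuristic is a common convention chosen here, and `candidate_indexes`, `best_index`, and the `what_if_cost` callable are hypothetical names standing in for the built-in optimizer's interface.

```python
from itertools import permutations
from typing import Callable, List, Optional, Sequence, Tuple

Index = Tuple[str, ...]  # an index is modeled as an ordered tuple of column names


def candidate_indexes(eq_cols: Sequence[str], range_cols: Sequence[str]) -> List[Index]:
    """Assumed heuristic: equality columns first (any order), then at most one range column."""
    cands: List[Index] = []
    for eq_order in permutations(eq_cols):
        cands.append(tuple(eq_order))
        for r in range_cols:
            cands.append(tuple(eq_order) + (r,))
    # Drop duplicates (keeping first-seen order) and the empty index.
    return [c for c in dict.fromkeys(cands) if c]


def best_index(
    cands: Sequence[Index],
    what_if_cost: Callable[[Index], float],  # hypothetical WHAT-IF interface
    baseline_cost: float,                    # plan cost without any new index
) -> Tuple[Optional[Index], float]:
    """Return the candidate with the lowest estimated cost, or None if nothing beats the baseline."""
    best, best_cost = None, baseline_cost
    for idx in cands:
        cost = what_if_cost(idx)
        if cost < best_cost:
            best, best_cost = idx, cost
    return best, best_cost
```

For a predicate such as `a = ? AND b = ? AND c > ?`, `candidate_indexes(["a", "b"], ["c"])` yields `(a, b)`, `(a, b, c)`, `(b, a)`, and `(b, a, c)`. A real engine would also merge candidates across statements, which this sketch leaves out.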

Single-SQL optimization diagnosis: recommendation algorithm based on cost evaluation

First, the built-in optimizer evaluates cost on physical plans and chooses among them. It differs from the optimizer in a traditional database in that it also takes candidate indexes and SQL rewrites into account. Second, because the optimizer computes cost from statistics, an adaptive sampling algorithm is used to collect them: it decides how much data to sample so that the result stays within a specified error range. Note also that data sampling must not put too much pressure on the target database instance.

4.png
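To make the adaptive sampling idea concrete, here is a minimal sketch. The talk does not give the actual algorithm, so everything below is an assumption for illustration: the sample is doubled until the 95% confidence half-width of a predicate's estimated selectivity falls within `max_error`, and `max_rows` caps the total so that sampling does not overload the target instance. The names `adaptive_selectivity`, `fetch_rows`, and `sample_uniform` are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def adaptive_selectivity(
    fetch_rows: Callable[[int], List[int]],  # hypothetical: returns n sampled column values
    predicate: Callable[[int], bool],
    max_error: float = 0.02,                 # target half-width of the 95% interval
    start: int = 200,
    max_rows: int = 50_000,                  # cap that limits load on the source instance
) -> Tuple[float, int]:
    """Grow the sample until the selectivity estimate is within max_error (or the cap is hit)."""
    z = 1.96  # 95% normal quantile
    seen = hits = 0
    batch = start
    while True:
        rows = fetch_rows(batch)
        if not rows:  # source exhausted: return what we have
            return (hits / seen if seen else 0.0), seen
        seen += len(rows)
        hits += sum(1 for r in rows if predicate(r))
        # Adjusted proportion keeps the interval from collapsing when hits is 0 or seen.
        p_adj = (hits + 2) / (seen + 4)
        half_width = z * math.sqrt(p_adj * (1 - p_adj) / (seen + 4))
        if half_width <= max_error or seen >= max_rows:
            return hits / seen, seen
        batch = min(seen, max_rows - seen)  # double the total sample, respecting the cap


def sample_uniform(n: int) -> List[int]:
    """Stand-in for a sampling query: n values drawn uniformly from 0..99."""
    return [random.randrange(100) for _ in range(n)]
```

Calling `adaptive_selectivity(sample_uniform, lambda v: v < 10)` estimates a selectivity near 0.1 and reports how many rows were needed. A tighter `max_error` costs more sampled rows, which is the trade-off against pressure on the instance.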

Single-SQL optimization diagnosis: a test set with sufficient coverage, overall approach

Verifying the SQL optimization engine requires a test set with sufficient coverage. Choosing one raises three questions. First, which test cases should the test set include? Second, how many test cases are enough to be considered comprehensive? Third, what is the current capability level of the SQL optimization engine? The choice is hard because so many factors affect SQL optimization, and mapping each of those features to test cases is a large undertaking. In addition, designing even a single test case requires professional knowledge, and each test case carries a large amount of information.

The test case coverage analysis report is produced by the process shown on the right side of the figure below. First, the factors that affect SQL optimization are analyzed and decomposed into a multi-dimensional set of test case features. A formal description of those features then yields a formal feature library. Next, drawing on Alibaba's rich business scenarios, all online SQL and all slow SQL are collected, and the formal features are used to extract online test cases into a test case library. Finally, the test case execution system and test case analysis tools evaluate coverage and generate the analysis report. Overall, the process first formalizes the multi-dimensional features, then uses online resources as a bridge to the engine test set, and in doing so provides a yardstick for finding and filling gaps in that test set.

5.png

Single-SQL optimization diagnosis: a test set with sufficient coverage, test case characterization

The following figure shows the structure of test case characterization. It starts by listing the factors that affect index selection. SQL is then divided into two scenarios, Single Table and Multi Table, and SQL statements are broken down further by the influencing factors. Finally, three levels complete the mapping from feature set to capability level.

6.png
The three levels are L1, L2, and L3. L1 covers full permutation of the core labels (the predicate part and the aggregation and sorting part), ensures that non-core labels are covered, and includes coarse-grained permutations and combinations of predicates, aggregation, and sorting. L2 adds support for LIMIT, NOT predicates, aggregation, functions, OR predicates, INNER JOIN of two tables, UNION over one or two tables, SUBQUERY, implicit conversion, and so on. L3 covers INNER JOIN of three to five tables, UNION, SUBQUERY, LEFT/RIGHT JOIN, NATURAL JOIN, and so on.

7.png
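As a rough illustration of mapping statements to capability levels and measuring coverage, the sketch below tags SQL text with a level using keyword patterns. It is deliberately naive and is not the formal feature description used in the talk: a real characterization would work on a parsed statement, and the patterns here cover only part of the L2 and L3 lists above (for example, functions and implicit conversion are ignored). The names `capability_level` and `coverage_by_level` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed keyword patterns; a production tool would inspect a parse tree instead.
L3_PATTERNS = [r"\bLEFT\s+JOIN\b", r"\bRIGHT\s+JOIN\b", r"\bNATURAL\s+JOIN\b"]
L2_PATTERNS = [r"\bLIMIT\b", r"\bNOT\b", r"\bOR\b", r"\bUNION\b", r"\bJOIN\b", r"\(\s*SELECT\b"]


def capability_level(sql: str) -> str:
    """Return 'L1', 'L2' or 'L3' for a statement, based on keyword features only."""
    text = sql.upper()
    joins = len(re.findall(r"\bJOIN\b", text))
    # Two or more JOIN keywords means at least three tables.
    if joins >= 2 or any(re.search(p, text) for p in L3_PATTERNS):
        return "L3"
    if any(re.search(p, text) for p in L2_PATTERNS):
        return "L2"
    return "L1"


def coverage_by_level(statements: Iterable[str]) -> Counter:
    """Count how many test statements fall into each capability level."""
    return Counter(capability_level(s) for s in statements)
```

Running `coverage_by_level` over a test case library gives a first view of which levels are thinly covered. Because the matching is textual, keywords inside string literals or comments would be misclassified.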

Single-SQL optimization diagnosis: large-scale diagnosis capability, data-driven

Supporting diagnosis at large business scale takes considerable engineering. Computing services are split so that they can scale horizontally. The system must also keep parallel sampling efficient, control concurrent access to resources, schedule and isolate work by priority, and buffer business peaks. Only then can online SQL optimization serve large-scale business scenarios.
8.png
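Two of the requirements above, priority scheduling and control of concurrent access to a database instance, can be illustrated with a small in-memory scheduler. This is a sketch under assumptions, not the DAS design: tasks are ordered by priority in a heap, and a per-instance limit keeps any one instance from receiving too many concurrent diagnosis tasks. `Task`, `DiagnosisScheduler`, and the limit of 2 are hypothetical.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(order=True)
class Task:
    priority: int                        # lower value = more urgent
    seq: int                             # tie-breaker that preserves submission order
    instance: str = field(compare=False)
    sql_id: str = field(compare=False)


class DiagnosisScheduler:
    def __init__(self, per_instance_limit: int = 2) -> None:
        self.limit = per_instance_limit
        self.queue: List[Task] = []                      # buffered tasks (absorbs peaks)
        self.running: Dict[str, int] = defaultdict(int)  # in-flight tasks per instance
        self._seq = 0

    def submit(self, priority: int, instance: str, sql_id: str) -> None:
        heapq.heappush(self.queue, Task(priority, self._seq, instance, sql_id))
        self._seq += 1

    def next_batch(self, workers: int) -> List[Task]:
        """Pick up to `workers` tasks by priority, skipping instances already at their limit."""
        picked: List[Task] = []
        deferred: List[Task] = []
        while self.queue and len(picked) < workers:
            task = heapq.heappop(self.queue)
            if self.running[task.instance] < self.limit:
                self.running[task.instance] += 1
                picked.append(task)
            else:
                deferred.append(task)
        for task in deferred:  # put throttled tasks back for a later round
            heapq.heappush(self.queue, task)
        return picked

    def done(self, task: Task) -> None:
        self.running[task.instance] -= 1
```

With three urgent tasks queued for one instance and one low-priority task for another, `next_batch(3)` returns two tasks for the first instance plus the task for the second, and holds the third back until `done` is called. In a distributed deployment the queue and counters would live in shared storage rather than in process memory.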

2. Global optimization based on Workload

The previous section covered optimization of a single SQL statement. To support the business as a whole, optimization also has to be done globally. Global optimization takes the Workload as the unit of optimization and weighs the Workload characteristics that affect overall performance, with the goal of maximizing overall performance while minimizing space consumption. As shown on the left side of the figure below, the Workload is extracted from the full set of SQL. Given a storage constraint S and a cost constraint C, the SQL global optimization engine outputs the indexes to create, the indexes to modify, the indexes to drop, and SQL rewrite suggestions.
9.png
The table on the left of the figure below lists a series of simple SQL statements with their Workload features, including INSERT statements, SELECT statements, and the number of executions in each time period. Optimizing each statement on its own yields four recommended optimizations for SQL2-SQL6. Optimizing the Workload globally yields only two. Compared with single-SQL optimization, Workload global optimization reduces overall RT by 14.45% and saves 50% of index space.
10.png
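The difference between per-statement and workload-level recommendation can be sketched with a simple greedy selection under a storage budget. This is an assumed simplification, not the engine described in the talk: each candidate's benefit is its per-execution saving weighted by execution counts, minus a maintenance penalty for writes, and candidates are taken in order of net benefit per MB until the storage constraint S is reached. Benefits are treated as independent, so overlapping indexes are not discounted. `Candidate`, `select_indexes`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    size_mb: float               # must be positive
    saved_ms: Dict[str, float]   # per-execution saving for each SQL id the index helps
    write_penalty_ms: float      # extra cost per write to the indexed table


def select_indexes(
    candidates: List[Candidate],
    executions: Dict[str, int],  # SQL id -> number of executions in the workload
    writes: int,                 # number of write statements hitting the table
    storage_budget_mb: float,    # storage constraint S
) -> List[str]:
    """Greedy pick by net benefit per MB; candidates with no net benefit are dropped."""

    def net_benefit(c: Candidate) -> float:
        gain = sum(executions.get(q, 0) * ms for q, ms in c.saved_ms.items())
        return gain - writes * c.write_penalty_ms

    ranked = sorted(
        (c for c in candidates if net_benefit(c) > 0),
        key=lambda c: net_benefit(c) / c.size_mb,
        reverse=True,
    )
    chosen: List[str] = []
    used = 0.0
    for c in ranked:
        if used + c.size_mb <= storage_budget_mb:
            chosen.append(c.name)
            used += c.size_mb
    return chosen
```

In this model a wider index that serves two frequent queries can beat two narrow ones, and an index that only helps a rarely executed query on a write-heavy table is rejected. That mirrors the reason a workload-level view can recommend fewer indexes than statement-by-statement optimization.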

3. SQL automatic optimization closed loop

1. SQL automatic optimization closed loop: results in practice

The SQL automatic optimization closed loop runs from identifying problem SQL, through automatically generating and evaluating Workload-based global optimization suggestions and putting the optimization online, to quantitative tracking and evaluation. It turns manual, passive optimization into intelligence-driven, proactive optimization. The left side of the figure below shows the key nodes of the loop. First, continuous 24-hour tracking with metric anomaly detection and workload anomaly detection finds the abnormal points. The SQL optimization engine then produces optimization suggestions. If the user adopts the automatic optimization suggestion, it goes online through a gray release. If not, it must first pass intelligent stress-test verification before the gray release. Optimization effect tracking follows.
Alibaba has implemented a fully automated SQL optimization closed loop, which keeps database instances running in their best-optimized state. To date, 49 million slow SQL statements have been automatically optimized inside Alibaba, slow SQL across the whole network has dropped by 92%, and the slow SQL recommendation rate across the whole network has reached 75%. On the cloud, the closed loop supports autonomy for more than 300,000 service instances, and the monthly growth rate of instances across the whole network has reached 90%. The closed loop will continue to be improved in scale, accuracy, safety, comprehensiveness, and linkage so that it can serve more users.
11.png

2. SQL automatic optimization closed loop: generating an optimization benefit report from stress testing

The left side of the figure below shows an optimization benefit report based on stress testing. The suggestions generated by the SQL optimization engine are stress-tested against the user's real load data. Once the stress test completes, a comprehensive evaluation of the suggestions under the real scenario is generated and the optimization benefit is analyzed.
12.png
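A benefit report of this kind boils down to comparing measurements taken before and after the suggestions are applied. The sketch below assumes the stress test yields per-statement response times in milliseconds for both runs; the report format and the name `benefit_report` are hypothetical, and the real report in the figure likely contains more dimensions.

```python
from statistics import mean
from typing import Dict, List


def benefit_report(
    before: Dict[str, List[float]],  # SQL id -> response times (ms) without the suggestion
    after: Dict[str, List[float]],   # SQL id -> response times (ms) with the suggestion applied
) -> Dict[str, Dict[str, float]]:
    """Per-statement average RT before and after, plus the relative improvement in percent."""
    report: Dict[str, Dict[str, float]] = {}
    for sql_id, base in before.items():
        tuned = after.get(sql_id)
        if not base or not tuned:
            continue  # statement missing from one of the two runs
        b, a = mean(base), mean(tuned)
        if b <= 0:
            continue  # cannot express an improvement relative to a zero baseline
        report[sql_id] = {
            "before_ms": b,
            "after_ms": a,
            "improvement_pct": round((b - a) / b * 100, 2),
        }
    return report
```

A negative `improvement_pct` flags a statement that regressed under the suggested change, which is exactly the case stress testing is meant to catch before a gray release.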

3. SQL automatic optimization closed loop: demonstration and review

Automatic SQL optimization is only one of the many scenarios offered to users. How can it be combined with the other scenarios? What effect does the combination have, and what problems can be solved together?
The following figure plots database performance over time and shows what SQL automatic optimization does along the way. The yellow line is the number of active sessions, the dark blue line is CPU utilization, and the light blue line is IOPS utilization. The first stage is the orange-yellow part. The database became abnormal at 21:06 on September 3, 2020. The anomaly was detected within 1 minute and located within 2 minutes, and SQL throttling was initiated automatically. Once throttling took effect, the number of active sessions (yellow) returned to its original level, CPU utilization (dark blue) dropped, and the business returned to normal. The second stage, the green part, is SQL automatic optimization. Diagnosis of the abnormal SQL was initiated at 21:17 on September 3, 2020. The index change was then put online and completed, 24-hour tracking was carried out, and the throttling was lifted. An Autoscaling suggestion was then raised to upgrade the database specification based on the change in load.
13.png

Original link: https://developer.aliyun.com/article/781036?

Origin blog.csdn.net/alitech2017/article/details/112520080