Slow SQL governance: practice and implementation results | JD Logistics technical team

1. Governance background

Performance problems in a database system degrade application performance and user experience. Slow queries can cause slow application responses, request backlogs, increased system load, and even system crashes or unavailability. Slow SQL governance is the ongoing work of finding and optimizing slow-executing SQL queries in the database system.

However, governance used to follow a reactive rhythm: manpower was invested in emergency fixes during the preparation period for major promotions, and slow SQL received little attention day to day. Even when the R&D team wanted to start governance, screening SQL instance by instance was cumbersome, trends were unclear, and there was no systematic, data-driven governance solution.

Therefore, to ensure system stability and prevent incidents caused by latent slow SQL, a dedicated project was launched to make slow SQL governance a routine part of preparation. The rest of this article describes how the project was practiced and implemented.

2. Stage planning

1.0 stage

Goal: [Form a routine governance mechanism and focus on how efficiently slow SQL is resolved]

Change the slow SQL governance habit: instead of handling slow SQL only during preparation for major promotions, follow up daily on the slow SQL generated each day, and review governance effectiveness every two weeks.

Indicators to watch:

Overdue rate = number of overdue work orders (tasks created in the current quarter) / total number of work orders (tasks created in the current quarter). Note: a work order that is not completed within 14 days is considered overdue, judged by the time of first completion. If it is not completed before the deadline, it is overdue. If it is completed before the deadline but is reopened and then completed after the deadline, it is not counted as overdue; it is counted as reopened. If it stays pending for more than 14 days, it is counted as overdue.

Reopening rate = number of work order reopenings (tasks created in the current quarter; a task reopened 5 times counts as 5) / total number of work orders (tasks created in the current quarter)
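The two indicators above can be sketched in a few lines of Python. This is only an illustration of the definitions: the `WorkOrder` fields and function names are hypothetical, and the sketch assumes the deadline is exactly 14 days after creation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple

OVERDUE_DAYS = 14  # from the note above


@dataclass
class WorkOrder:
    # Hypothetical record shape; the real work order system is not described.
    created: date
    first_completed: Optional[date] = None  # None while still pending
    reopen_count: int = 0


def is_overdue(order: WorkOrder, today: date) -> bool:
    """Overdue is judged by the time of FIRST completion only."""
    deadline = order.created + timedelta(days=OVERDUE_DAYS)
    if order.first_completed is None:
        # Still pending: overdue once the deadline has passed.
        return today > deadline
    # Later reopenings do not change the overdue verdict.
    return order.first_completed > deadline


def quarter_rates(orders: List[WorkOrder], today: date) -> Tuple[float, float]:
    """Return (overdue_rate, reopening_rate) for orders created in one quarter."""
    total = len(orders)
    if total == 0:
        return 0.0, 0.0
    overdue = sum(1 for o in orders if is_overdue(o, today))
    reopenings = sum(o.reopen_count for o in orders)
    return overdue / total, reopenings / total
```

Because every reopening is counted, the reopening rate can exceed 1 when a few work orders are reopened many times.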

2.0 stage

Goal: [Eliminate the historical debt of slow SQL and clear all slow SQL above 0.9s in phases]

After the R&D team worked through slow SQL in an orderly, step-by-step way during phase 1.0, part of the slow SQL was resolved early on, but historical debt that affects system stability remained. Phase 2.0 requires this debt to be cleared in biweekly phases.

Indicators to watch:

Number of P0 work orders pushed = total number of tasks pushed in the current week with execution time greater than 0.9s. Note: priority levels are defined as follows:

P0: execution time is greater than 0.9 seconds and the occurrence count reaches the threshold of 10

P1: execution time is greater than 0.9 seconds, but the occurrence count does not reach the threshold of 10

P2: execution time is less than 0.9 seconds, no index is used, and the occurrence count reaches the threshold of 10

P3: execution time is less than 0.9 seconds, no index is used, but the occurrence count does not reach the threshold of 10

P0 work order backlog = total number of tasks pushed in the current week with execution time greater than 0.9s whose status is not resolved

Resolution rate (for tasks greater than 0.9s) = number of tasks pushed in the current week with status resolved / total number of tasks pushed in the current week
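As a rough illustration, the phase 2.0 level rules can be written as a small classifier. The function and parameter names are hypothetical. The sketch reads "reaches the threshold 10 times" as an occurrence count of at least 10, which is an assumption, and it treats a query that runs in exactly 0.9 seconds as not exceeding the threshold, since the rules leave that case open.

```python
from typing import Optional

SLOW_SECONDS = 0.9     # execution-time threshold from the level definitions
COUNT_THRESHOLD = 10   # assumed to mean "seen at least 10 times"


def priority_phase2(exec_seconds: float, uses_index: bool,
                    occurrences: int) -> Optional[str]:
    """Map one aggregated slow SQL to P0..P3, or None if it is not risky."""
    frequent = occurrences >= COUNT_THRESHOLD
    if exec_seconds > SLOW_SECONDS:
        # Slow regardless of index usage.
        return "P0" if frequent else "P1"
    if not uses_index:
        # Fast enough today, but risky because the plan uses no index.
        return "P2" if frequent else "P3"
    return None  # fast and indexed: no work order
```

For example, `priority_phase2(1.4, True, 25)` yields P0, while the same statement seen only three times would be P1.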

3.0 stage

Goal: [Improve system performance indicators and reduce slow SQL threshold in a stepwise manner]

Once the major risks above 0.9s have been cleared in phases, slow SQL work orders are progressively refined: the threshold that defines slow SQL is lowered step by step, and newly generated slow SQL is cleared every two weeks.

Indicators to watch:

Number of P0 work orders pushed = total number of tasks pushed in the current week with execution time greater than 0.9s. Note: priority levels are defined as follows:

P0: execution time is greater than 0.9 seconds

P1: execution time is less than or equal to 0.9 seconds and greater than 0.7 seconds

P2: execution time is less than or equal to 0.7 seconds and greater than 0.5 seconds

P3: execution time is less than or equal to 0.5 seconds and no index is used

Number of P1 work orders pushed = total number of tasks pushed in the current week with execution time less than or equal to 0.9s and greater than 0.7s

Number of P2 work orders pushed = total number of tasks pushed in the current week with execution time less than or equal to 0.7s and greater than 0.5s

Backlog = total number of tasks pushed in the current week whose status is not resolved

Resolution rate = number of tasks pushed in the current week with status resolved / total number of tasks pushed in the current week
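The phase 3.0 ladder and its weekly indicators can be sketched as follows. The task dictionaries, the `"resolved"` status value, and the function names are hypothetical placeholders for whatever the work order system actually stores.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def priority_phase3(exec_seconds: float, uses_index: bool) -> Optional[str]:
    """Stepwise thresholds: 0.9s, 0.7s, 0.5s, then unindexed fast queries."""
    if exec_seconds > 0.9:
        return "P0"
    if exec_seconds > 0.7:
        return "P1"
    if exec_seconds > 0.5:
        return "P2"
    if not uses_index:
        return "P3"
    return None  # at most 0.5s and indexed: not a slow SQL task


def weekly_metrics(tasks: List[Dict[str, str]]) -> Tuple[Counter, int, float]:
    """tasks: work orders pushed in the current week.

    Returns (pushed count per level, backlog, resolution rate).
    """
    pushed = Counter(t["priority"] for t in tasks)
    backlog = sum(1 for t in tasks if t["status"] != "resolved")
    rate = (len(tasks) - backlog) / len(tasks) if tasks else 0.0
    return pushed, backlog, rate
```

Checking the thresholds from highest to lowest keeps each level's "less than or equal to" upper bound implicit, so the boundaries cannot overlap.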

4.0 stage

Goal: [Prevent slow SQL in advance and implement database operation specifications]

The expected outcome is that historical debt is fully resolved and system performance indicators improve. Database operation specifications are put in place to prevent new slow SQL, newly added slow SQL continues to be monitored, and the number of new additions is driven to zero.

Indicators to watch:

Number of new work orders = total number of tasks pushed in the current week whose fingerprint ID did not exist before

Backlog = total number of tasks pushed in the current week whose status is not resolved

Resolution rate = number of tasks pushed in the current week with status resolved / total number of tasks pushed in the current week
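Counting new work orders comes down to a set-membership check on fingerprint IDs. A minimal sketch, assuming each task carries a hypothetical `fingerprint_id` field and that a fingerprint appearing several times in the same week is counted once:

```python
from typing import Dict, Iterable, List


def new_work_orders(pushed_this_week: Iterable[Dict[str, str]],
                    known_fingerprints: Iterable[str]) -> List[Dict[str, str]]:
    """Return the tasks whose fingerprint ID has not been seen before."""
    seen = set(known_fingerprints)
    fresh = []
    for task in pushed_this_week:
        fp = task["fingerprint_id"]
        if fp not in seen:
            fresh.append(task)
            seen.add(fp)  # later duplicates in the same week are not new
    return fresh
```

The length of the returned list is the "number of new work orders" indicator for the week.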

3. Implementation plan

①Data preparation

Threshold definition

Scoped to the business of the second-level department, SQL is collected every day for the previous day (T-1). A statement is flagged if its execution time is >0.9 seconds, or if it is <0.9 seconds but its execution plan does not use an index. After excluding bi_cx and wlcx (master and slave are not distinguished), slow SQLs with the same fingerprint are all identified as existing risky slow SQL.
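The article does not say how fingerprints are computed. A common approach, similar in spirit to what query digest tools do, is to normalize away literal values and hash the result, so that statements differing only in parameters share one fingerprint. The sketch below is an assumption about that step, not the team's actual implementation; it is regex-based and does not parse SQL.

```python
import hashlib
import re


def fingerprint(sql: str) -> str:
    """Return a short hash that is stable across literal values."""
    s = sql.strip().rstrip(";").lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)              # string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)              # standalone numbers
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", s)   # IN (?, ?, ...) lists
    s = re.sub(r"\s+", " ", s)                            # collapse whitespace
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]


def is_risky(exec_seconds: float, uses_index: bool) -> bool:
    """Threshold rule from the text: over 0.9s, or under 0.9s without an index."""
    return exec_seconds > 0.9 or not uses_index
```

With this normalization, `WHERE id = 42` and `WHERE id = 7` on the same table produce the same fingerprint, so they are tracked as one work order instead of two.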

Define priority levels

At each governance stage, slow SQL is prioritized from P0 to P3. The R&D team is pushed to resolve it from high to low priority and is assessed against a different resolution deadline for each level. Auxiliary diagnostic information is also provided, including the database IP, domain name, database name, execution time, and execution plan that triggered the slow SQL governance task.

Sorting and filtering

Slow SQL is categorized by instance information, database name, owning system, owning product line, query time, aggregated fingerprint, and so on, which makes it easy to group slow SQL problems that share the same source.
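Grouping by these dimensions is essentially a group-by over the collected records. A minimal sketch, in which the record keys (`instance`, `database`, `system`, `fingerprint`) are hypothetical field names:

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_slow_sql(
    records: List[Dict[str, str]],
    keys: Sequence[str] = ("instance", "database", "system", "fingerprint"),
) -> List[Tuple[tuple, List[Dict[str, str]]]]:
    """Bucket records by the chosen dimensions, largest bucket first."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec)
    # Sorting by size surfaces the sources that produce the most slow SQL.
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Passing a shorter `keys` tuple, for example only `("system",)`, gives a coarser view per owning system.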

②Work order advancement

Work order flow

Work orders are divided by business line, with a designated contact person for each line. Slow SQL work orders are assigned to these contacts, who distribute them by system to team members to be resolved one by one.

Solutions

The team draws on solution ideas provided by DBAs and others, and also summarizes the solutions it has already applied internally, so that slow SQL can be resolved quickly.

③Trend analysis

Chart making

Based on the indicator data of each stage, a slow SQL resolution trend chart is built so that the team can clearly see the slow SQL details under each instance, with filtering on multiple dimensions. It also supports viewing resolution trends, backlog size, and other figures over time.

Open review

In the form of special weekly meetings, the R&D team's processing rhythm and progress are synchronized to ensure continuous advancement.

④Process tracking

Phase 1.0 focuses on solving effectiveness

Phase 2.0 focuses on the governance of >0.9s and cleans up historical debts

P0 level SQL resolution follow-up:

Clearing existing historical debt:

Trends in slow SQL volume by month

4. Implementation results

【System guarantee】

After this step-by-step governance, the team had resolved a total of 831 slow SQLs before the 529 release freeze, completing the phased clearing of historical debt during this year's 618 preparations. Index additions and code optimizations for slow SQL are now completed as part of daily work, so that preparation for major promotions can focus on remaining problems and on analyzing unused indexes, ensuring nothing is missed.

It also directly guarantees the stability of the system. In the past six months, there have been no online problems caused by slow SQL.

【Project takeaways】

As slow SQL governance became part of the R&D team's daily work through this project, the team first adopted database development specifications (JD Group Database Development Specifications-V1.0-public notice) to avoid generating new slow SQL. The team also documented how to analyze SQL, locate problems quickly, and resolve them efficiently, and published its own governance solutions.

5. Summary

Through each stage of the project, the team met its governance goals and built up reusable solutions. Slow SQL governance will continue to be pushed forward to ensure system stability.

Author: JD Logistics Liu Hongyan
Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please credit the source when reprinting.

Origin blog.csdn.net/jdcdev_/article/details/133021071