How do we ensure data stability?

Didi's customer service business is operations-heavy, and the core of operations is indicator data. Some of these indicators are OKR indicators tied to strategic goals; others are settlement indicators used to settle accounts with partners. Data stability is therefore crucial to the operation of the entire customer service business.

Interpreting the goals of data fault governance

Real-time indicators include inbound call volume, queue volume, pickup rate, reach rate, and so on. Lagging indicators include resolution rate, clearance rate, escalation rate, satisfaction, service quality, and so on.

Over the past two years, in order to ensure business continuity, we have invested heavily in stability construction. The work is divided into three stages:

Stage one: fault-centered stability construction. A set of engineering capabilities, process mechanisms, and methodologies was systematically put in place around system faults before, during, and after they occur. Centered on reducing both occurrence and impact, this greatly reduced the number of faults and the total fault duration.

Stage two: business-centered stability construction. Centered on business characteristics and starting from the actual situation of the business, we set up a horizontal, cross-organizational task force to solve stability problems at the junction of business and technology, so that technology delivers a globally optimal guarantee of business continuity.

Stage three: normalized capability building. As stability work deepened, the organization asked more and more of the stability team; the scope expanded from purely technical stability work to cover security compliance, cost reduction, efficiency improvement, and related areas. To avoid campaign-style bursts of effort and keep stability work low-cost and sustainable, we focus on improving automation tools to raise efficiency, building a sustainable operating mechanism, and ultimately shaping a team culture of stability work.

We completed the first stage, system stability construction, last year, which provides the guarantee for real-time indicators. Data stability construction belongs to the second stage. Its content is closely tied to business characteristics, and it mainly addresses stability in the production and use of the business's lagging indicators. System stability is the foundation of data stability: only when system stability is solid is there a basis for data stability.

To build data stability well, we first did the following things.

Formulate data fault grading standards and classify the data

We have 1,000+ indicators, and with limited resources it is impossible to cover everything. Establishing data fault grading standards in effect answers which data we want to protect and to what extent. It clearly defines what kind of impact counts as a fault, which fault level corresponds to which kind of impact, and which indicators have higher stability requirements.

After this step, we clarified the types of indicators that data classification needs to protect: OKR indicators, settlement indicators, and other indicators. Among them, OKR indicators cover the dimensions of work efficiency, service quality, safety, and risk.

Breaking down the goal

Once we have the data indicator grading standard, we need to consider how to guarantee data stability.

Stability construction requires joint effort by three parties (R&D, data warehouse, and data), who together serve the business. They share responsibility for faults in agreed proportions, and the stability goal is broken down across the three parties so that everyone pulls in the same direction.

Clarify the approach

Once the goal is broken down, the approach and rhythm need to be clarified, including the specific principles, plans, and actions for reaching the goal. The concrete plan covers before, during, and after a fault, with the goal of reducing the number and severity of faults. The corresponding methodology centers on: the goal system, people's awareness, people's skills, and system tooling.


Many engineers understand system stability fairly well, but data stability construction differs from it in several principles. Let me briefly introduce a few.

What matters most in a data fault?

The answer is, first of all, time. The impact is determined by two things: the total amount of affected data and the duration of the data repair.

Unlike system availability faults, what matters most for a data fault is the time needed for retrospective data repair: how quickly the affected data can be recovered and backfilled, and whether recovery completes before the data is used.

This requires guarantees before, during, and after the fault:

  • Guarantee beforehand: before operating on data or changing database tables, R&D knows whether the change will affect related ODS tables and key indicators; with a proper assessment, possible damage to data indicators can be avoided.

  • Guarantee during the event: build out alerting at a fine-grained dimension around key database tables and fields. When a problem occurs it can be quickly discovered, located, and handled. The earlier it is found, the smaller the amount of affected data and the faster the repair.

  • Guarantee afterwards: have repair tools at hand, or accumulate reusable redundant data and repair scripts to help with data backfilling. This greatly improves the efficiency of backfilling faulty data, further reducing the impact of data faults and lowering the fault level.

Therefore, based on the guarantee principles above, the early work focuses on beforehand (a three-party collaboration SOP and an automated special checklist), during the event (monitoring and alert coverage of core database tables and fields), and afterwards (data redundancy plans and rapid repair templates): in short, focus on prevention and improve backfill efficiency.
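As a concrete illustration of the beforehand checklist, the sketch below (hypothetical table names and a hand-maintained mapping, not Didi's actual tooling) looks up which ODS tables and key indicators depend on a table about to be changed, so the change owner knows who to involve before a DDL operation.

```python
# A minimal pre-change checklist sketch, assuming a hand-maintained
# "core table -> ODS tables -> key indicators / owners" mapping document.
# All names here are hypothetical examples.

LINEAGE = {
    "cs_ticket": {  # core business table
        "ods_tables": ["ods_cs_ticket_di"],
        "key_indicators": ["resolution_rate", "escalation_rate"],
        "owners": ["warehouse_owner_a", "data_owner_b"],
    },
}

def pre_change_check(table: str) -> None:
    """Print the impact scope of a planned change on `table`."""
    info = LINEAGE.get(table)
    if info is None:
        print(f"{table}: no key indicators depend on this table, proceed with normal review")
        return
    print(f"{table} feeds ODS tables {info['ods_tables']} "
          f"and key indicators {info['key_indicators']}.")
    print(f"Notify and get sign-off from: {', '.join(info['owners'])} before the DDL.")

pre_change_check("cs_ticket")
```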

Data fault grading criteria

This part is currently the most involved, and most of the questions colleagues raise arise here. Many people know that fault grading standards are important, yet it is genuinely difficult to produce a standard that is simple, understandable, and executable.

The first step is to classify the data, so that different levels of data receive different levels of resource protection.

Indicators fall into three categories: OKR indicators, settlement indicators, and ordinary indicators. The three categories have different sensitivities to data errors. If an OKR indicator sits right at a threshold, an error in it has a relatively large impact and can affect whether the OKR is achieved. Settlement indicators mainly involve settlement with suppliers.

The main factor affecting the fault level is time: the time from discovering the problem to repairing it.

Calculating the number of days affected by an incident

Because the data warehouse partitions data by time, the impact of a data fault can be measured in affected partition-days. For example, if only a small time window within one day's data is affected, the affected scope is small; the fewer days affected, the less there is to backfill and the more repair efficiency is preserved.
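As a toy illustration, affected partition-days can be counted and mapped to a fault level roughly as follows (the thresholds are made-up examples, not the actual grading standard):

```python
from datetime import date

def affected_days(partitions: set[date]) -> int:
    """Number of distinct daily partitions touched by a data fault."""
    return len(partitions)

def fault_level(days: int) -> str:
    # Illustrative thresholds only; a real standard is set per business line.
    if days >= 7:
        return "P1"
    if days >= 3:
        return "P2"
    if days >= 1:
        return "P3"
    return "no fault"

bad_partitions = {date(2023, 8, 1), date(2023, 8, 2)}
print(fault_level(affected_days(bad_partitions)))  # -> "P3"
```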

Relationship between indicators and ODS tables

This part is where R&D students are most involved, and it is the most important lever for lowering the level of data faults.

The whole indicator system contains 1k+ indicators. Through data lineage analysis we can still identify the database tables and fields that need focused protection and, based on indicator lineage, produce a mapping document from core database tables to ODS tables.

R&D focuses on ensuring the stability and reliability of the data production link upstream of the ODS tables. Data warehouse students focus on the stability and reliability of the data usage link and on further shortening the time needed to backfill ODS data.

Taking stock of the current state of data faults

The above covers some of the principles behind data stability construction. For the concrete work, we still need to return to the status quo and look at specific problems. Together with the data warehouse and data teams, we took stock of historical faults, and the following problems stand out:

  • R&D lacks attention to and understanding of indicators and ODS tables

  • Indicators and ODS tables lack owners

  • R&D, data warehouse, and data parties lack an effective collaboration mechanism

  • Alerting on key indicators is not fine-grained enough and coverage is insufficient

  • The fault handling process lacks SOP guidance, leading to repair errors and rework

  • Lack of readily available tools for fast data backfilling

R&D lacks attention to and understanding of indicators and ODS tables

In the past, R&D students paid attention to data mainly at the database level, not to how the data is used or which indicators are produced after it lands. The data side, in turn, pays more attention to the output of indicators and lacks understanding of the upstream links that generate them, so many indicator generation rules end up being formulated on one side's word alone.

Combining the three roles of R&D, data warehouse, and data product, the data production process can be broken down as follows:


  • R&D side: mainly responsible for the reliability of the data production link from core database tables to the corresponding ODS tables, ensuring the accuracy of the data and owning it;

  • Data warehouse side: mainly responsible for backfilling when ODS data is abnormal, ensuring the reliability of indicator production and owning it;

  • Data side: mainly responsible for whether indicators are generated accurately, and owning it.

With limited manpower, the early stage concentrated resources on doubly protecting key indicators such as OKR, settlement, and risk indicators. On top of the existing system stability monitoring and alerting, we enriched monitoring and alerting along data indicator dimensions.

Indicators and ODS tables lack owners

Once it is confirmed that R&D owns the core database tables and the ODS tables, specific people can be assigned to claim each ODS table as its owner. When the data production or usage link finds a problem with an indicator, the corresponding owner can be located quickly through ODS lineage and follow up.

In the past, after a data fault occurred it could take two months to find the corresponding R&D owner. Now that responsibility is assigned to individuals, the R&D owner can be found the same day the problem is discovered and a repair plan produced. Assigning responsibility to people looks simple, but the benefit is huge.

For a system owner, engineering work must consider not only system stability design but also the stability design of the corresponding core database tables and fields.

To further improve monitoring coverage of key indicators and reduce the attention cost for system owners, the platform provides a set of automated tools for collecting and monitoring core ODS indicators, enabling early detection and early intervention.

R&D, data warehouse, and data parties lack an effective collaboration mechanism 

R&D, the data warehouse, and data belong to different teams, and collaboration used to be loose, which to some extent lengthened the time needed to handle data faults. After the joint construction, the collaboration mechanism of the three parties was clarified: the key owners from all three parties sit in the same chat group, and key information is recorded in the group announcement. Changes on the R&D side or to data indicators are automatically announced and coordinated through the internal IM robot.

Alerting on key indicators is not fine-grained enough

The monitoring alerts R&D configured previously were built only around system availability indicators; there was no monitoring dimension covering the accuracy of database table and field data.

The previous monitoring alerts on the data warehouse and data sides were either inaccurate or too late (T+N, where N could be 30 days). The later a problem is discovered, the higher the cost of backfilling. Monitoring, alerting, and inspection on the data warehouse and data sides need to be improved, aiming for T+1, so that problems are found as early as possible and handled quickly.

The system owner needs to monitor key database tables and fields and trigger an alert when data is wrong, inaccurate, or lost.

Alert rules that can be added include, but are not limited to:

  • Field non-null checks

  • Field type changes

  • DDL changes that add or drop fields

  • Year-on-year changes in key table indicators

  • Format checks on time-typed fields (timestamp or yyyy-MM-dd)

  • Cross-system RPC data consistency comparison

  • Historical data reconciliation and year-on-year comparison

  • Changes in field content patterns, detected via regex matching

Through the BCP plus low-code platform, the system owner enters the database table and field rules; the platform monitors based on binlog and triggers an alert when an exception appears.
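A minimal sketch of what such a binlog-driven rule check might look like (the event format and rules here are illustrative assumptions, not the actual BCP platform API):

```python
import re

# Illustrative rule set for one table; table and field names are hypothetical.
RULES = {
    "cs_ticket": {
        "close_time": {"not_null": True, "pattern": r"\d{4}-\d{2}-\d{2}"},
        "status":     {"not_null": True},
    },
}

def check_binlog_event(event: dict) -> list[str]:
    """Return a list of alert messages for one row-change event.

    `event` is assumed to look like:
    {"table": "cs_ticket", "row": {"close_time": "2023-08-01", "status": "open"}}
    """
    alerts = []
    table_rules = RULES.get(event["table"], {})
    for field, rule in table_rules.items():
        value = event["row"].get(field)
        if rule.get("not_null") and value in (None, ""):
            alerts.append(f"{event['table']}.{field} is null")
        pattern = rule.get("pattern")
        if pattern and value is not None and not re.match(pattern, str(value)):
            alerts.append(f"{event['table']}.{field} has unexpected format: {value!r}")
    return alerts

print(check_binlog_event(
    {"table": "cs_ticket", "row": {"close_time": "2023/08/01", "status": None}}
))
```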


Key indicators are reconciled so that problems are discovered in time and data is quickly corrected through a TCC-like compensation interface. The principle is that false positives are better than missed alarms. The same requirements apply to the data warehouse and data-side systems.
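A rough sketch of such a reconciliation job, under assumed interfaces (`fetch_from_source`, `fetch_from_warehouse`, and `trigger_compensation` are hypothetical stand-ins for whatever the real systems expose):

```python
def reconcile_indicator(name: str, dt: str,
                        fetch_from_source, fetch_from_warehouse,
                        trigger_compensation, tolerance: float = 0.001) -> bool:
    """Compare one day's indicator between the business source and the warehouse.

    If they differ beyond `tolerance`, raise an alert and invoke a TCC-style
    compensation call to correct the warehouse value. Returns True if consistent.
    """
    src = fetch_from_source(name, dt)
    dwh = fetch_from_warehouse(name, dt)
    if abs(src - dwh) <= tolerance * max(abs(src), 1.0):
        return True
    print(f"[ALERT] {name}@{dt}: source={src}, warehouse={dwh}, triggering compensation")
    trigger_compensation(name, dt, expected=src)   # compensating correction
    return False
```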

The fault handling process lacks SOP guidance, leading to repair errors and rework

Sometimes a problem with one ODS field affects multiple indicators, some of which are OKR indicators and some ordinary indicators.

Which indicator to repair first can affect the final fault rating, and failing to gather accurate information before the repair has caused secondary damage in the past.

We established an SOP for fault-handling collaboration, combined with the internal IM robot, covering operational guidance during and after the event and reducing repair errors and rework.

Whenever the indicator generation link changes (a DDL-type operation on the R&D side, an indicator level upgrade on the data side), the relevant stakeholders need to be notified.
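As an illustration, such a change notice might be pushed to the group robot roughly like this (the webhook URL and payload format are placeholders; the actual internal IM robot API will differ):

```python
import json
import urllib.request

# Placeholder webhook; the real internal IM robot endpoint and payload differ.
WEBHOOK_URL = "https://im.example.com/robot/send?key=PLACEHOLDER"

def notify_change(change_type: str, table: str, detail: str, mentions: list[str]) -> None:
    """Post a change notice (e.g. a planned DDL) to the collaboration group."""
    text = (f"[Data change notice] type={change_type}, table={table}\n"
            f"detail: {detail}\n"
            f"please assess impact: {' '.join('@' + m for m in mentions)}")
    payload = json.dumps({"msgtype": "text", "text": {"content": text}}).encode()
    req = urllib.request.Request(WEBHOOK_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # fire-and-forget; add retries in real use

# notify_change("DDL", "cs_ticket", "drop column legacy_flag", ["warehouse_owner_a"])
```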


Sort out standard fault emergency criteria and procedures, and clarify the responsibilities and relevant stakeholders at each stage.

Lack of readily available tools for fast data backfilling

We built tools for guaranteeing data reliability and for improving efficiency in the related links, using tooling improvements to replace reliance on manual effort and to make up for the missing capabilities of existing backfill tools.

Hardening the data collection link: this includes reworking the original public.log file-based data reporting scheme and strengthening the data bus (logbook) to ensure data is not lost, not duplicated, and can be replayed.

Data replay to improve repair efficiency: build data replay capability for core links, so that when data loss is found on a link the abnormal data can be quickly corrected by replaying it, improving repair efficiency.
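A simplified sketch of replay-based correction, assuming a replayable bus that can re-read messages for a time window (`read_bus_messages` and `upsert_record` are hypothetical callables):

```python
from datetime import datetime

def replay_window(start: datetime, end: datetime,
                  read_bus_messages, upsert_record) -> int:
    """Re-consume bus messages for [start, end) and idempotently re-apply them.

    `read_bus_messages(start, end)` yields dicts with a unique record key;
    `upsert_record(msg)` writes by primary key, so replaying is safe to repeat.
    """
    replayed = 0
    for msg in read_bus_messages(start, end):
        upsert_record(msg)      # idempotent write keyed on the record's primary key
        replayed += 1
    print(f"replayed {replayed} messages for window {start} ~ {end}")
    return replayed
```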

Reusable repair scripts to prevent faulty repairs: repair templates and scripts are accumulated for high-frequency scenarios, so that during a repair the template's parameters can be swapped and the related repair script reused, reducing the risk of a wrong repair and improving repair efficiency.
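For example, a deposited backfill template might be parameterized roughly like this (table and column names are illustrative; the point is that only reviewed parameters change between incidents, not the repair logic):

```python
from string import Template

# A reviewed, deposited backfill template: only the partition range and the
# source table vary between incidents; the correction logic stays fixed.
BACKFILL_SQL = Template("""
INSERT OVERWRITE TABLE $target_table PARTITION (dt)
SELECT order_id, status, close_time, dt
FROM $source_table
WHERE dt BETWEEN '$start_dt' AND '$end_dt'
""")

def render_backfill(target_table: str, source_table: str,
                    start_dt: str, end_dt: str) -> str:
    """Render the backfill SQL for one incident; the result is reviewed before running."""
    return BACKFILL_SQL.substitute(target_table=target_table,
                                   source_table=source_table,
                                   start_dt=start_dt, end_dt=end_dt)

print(render_backfill("ods_cs_ticket_di", "ods_cs_ticket_di_backup",
                      "2023-08-01", "2023-08-02"))
```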

The above processes, norms, and standards are made more efficient and accurate with the help of automated tooling.

Construction plan and path

Principles for before, during, and after

In summary, for data faults the main actions before, during, and after the event are as follows:

  • Beforehand: clarify which indicators are key indicators, which ODS tables those key indicators depend on, who owns the relevant ODS tables, and the collaboration SOP for the whole link and across links, turning uncertainty into deterministic consensus;

  • During the event: build fine-grained monitoring and alert coverage. Many data indicators are produced as T+N, and there is inertia in alert configuration: alerts only fire when the data is about to be used, by which time the fault's impact has already landed. The current principle is "prefer false positives to missed alarms": the system closest to the ODS tables must be monitored well, in detail and in practice, with regular inspections rather than trusting to luck;

  • Afterwards: for data faults, the most important thing is fast data repair. There have been many cases of "stop the loss in 5 minutes, repair the data for two weeks", so having tools at hand matters greatly. From this perspective, work proceeds at three levels: "inventory" the existing data production and processing tools and their gaps; "optimize" existing tools so they are more useful and easier to use; and "build" the missing tools to reduce manual intervention, shorten the data recovery link, and automate for efficiency.

Continuous operation

Establish two-way notification of data and indicator changes:

  • Notify R&D when data indicators change

  • Notify the data side when R&D makes data changes, such as DDL operations

  • Notify by email when new OKR indicators are created

Maintain an indicator-to-ODS mapping document organized around the core indicators. When key database table fields covered by the document are to be modified (for example, data modification or taking a field offline), announce the change in the group before the operation and @ the relevant data students so they are aware.

If no notification is given, the follow-up review will examine where the operating process broke down, and the internal IM robot serves as the tooling-level backstop.

Establish a reward-and-punishment mechanism: penalize violations of the rules and regulations, and reward individuals who follow the standard process in good cases. Rewards and punishments plus periodic reviews shape the culture of data stability construction.

Construction results

With the joint efforts of engineering, data warehouse, data, and business teams, and through timely problem discovery, complete fault handling and grading standards, a fault recovery mechanism, automation tools, and sustained operations, both the number of data faults and the time spent handling them dropped significantly this year.

  • The number of failures decreased by 42% compared to the previous year

  • The timeliness of fault repair increased by 134% compared with the previous year

Beyond the obvious gains visible in the indicators, there are also some less visible benefits.

The complete mechanism clarified the responsibility boundaries of the three parties, standardized fault judgment, transparent communication, and responsibility apportionment, improved communication efficiency, and improved the well-being of the data warehouse students.

It also clarified the handling process for a typical fault, covering the key links and key owners, and laid a good foundation for further culture building and automation.

Summary

R&D students' earlier worldview of fault handling was built on system stability guarantees: everyone is used to stopping the loss quickly so that a small fault does not become a big one. But that worldview does not work as well for handling data faults.

Data faults have their particularities. Compared with system stability, the simplest difference is that the stop-loss action after a data fault occurs is only the beginning of the whole recovery process: stop the loss in 5 minutes, repair the data for two weeks.

The second part of the data stability worldview is redundancy: whether it is redundant data kept for post-incident repair or the reuse of repair scripts after a fault, both are forms of redundancy.

Data stability contingency plans should be fast and accurate and should reduce the chance of secondary damage, which requires the plan scripts to be accurate, reliable, and regularly reviewed.

The retrospective data repair after a fault is where the real bulk of the work lies. The workload at this stage is heavy and tedious, a lot of manual grind that makes everyone miserable. Standardized processes plus automation make life better.

In line with the principle of "sweat more in peacetime, bleed less in wartime", there are four requirements:

  • Invest more up front: if a requirement or technical change is identified as potentially affecting key data indicators, find the relevant stakeholders for a full assessment before the operation; do not treat it as too much trouble, and do not trust to luck.

  • Keep a global perspective: sometimes when data students define the generation rule for an indicator, their view is very narrow. They see a table with a matching value and use it, not knowing that the value is unreliable. Using it provides no guarantee and introduces unnecessary technical debt that hinders the continued iteration of the business systems. A better way is to sit down with the R&D students and discuss, from a global perspective, where this indicator is best produced.

  • Execute in place: stability requirements such as monitoring and alerting must actually be implemented, solidified into the system's own links in code. Reviews found that many faults could have been avoided if execution had been in place. For example, a field must have a value, must conform to a certain data type, and must fall within a certain range; these checks can be implemented very simply in code (see the sketch after this list). In many cases, faults recur because of insufficient execution.

  • The pessimist is right: everyone has heard that "the optimist succeeds, and the pessimist is right". Stability is about holding the bottom line. In program design it is not a bad thing to be a bit of a "pessimist" and think through more cases; defensive programming that still achieves the goal is a good thing.
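The defensive checks mentioned under "execute in place" can be as simple as the sketch below (field names and the allowed value range are illustrative assumptions):

```python
ALLOWED_STATUS = {"open", "resolved", "escalated"}  # illustrative value range

def validate_ticket_row(row: dict) -> None:
    """Defensive checks before writing a row that feeds key indicators."""
    if not row.get("order_id"):
        raise ValueError("order_id must not be empty")
    if not isinstance(row.get("duration_sec"), int) or row["duration_sec"] < 0:
        raise ValueError("duration_sec must be a non-negative integer")
    if row.get("status") not in ALLOWED_STATUS:
        raise ValueError(f"status {row.get('status')!r} outside allowed range")

# Passes silently for a well-formed row; raises before bad data reaches the table.
validate_ticket_row({"order_id": "A1", "duration_sec": 120, "status": "resolved"})
```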
