How can “newcomers” to data R&D get up to speed quickly?

Author: Xiao Di (Mo Xu)

picture

1. Preface

This quarter our main effort was Security Month, aimed at building and consolidating our stability foundation, and I organized the team to review the stability of our core business links. Throughout the review, a voice kept asking: Have you found all the problems? Have you added all the monitoring? Does your technical improvement plan cover everything? (The voice came mainly from my left ear, because my leader sits on my left, hahaha.) So we kept thinking about how to build stability systematically: horizontally, with a methodology that guides the work and captures what we learn; vertically, by tracking the process and results of each business line. That thinking produced the picture below.

picture

This picture has four parts. First, determine the goal, which is the prerequisite for everything else. Second, the methodology, which captures the theory behind stability work and supports the later actions. Third, action routing, which corresponds to the methodology; I hope it explains the construction path clearly in a single picture. Fourth, get results: track the progress of each stage to make sure the final result is reached. The methodology part answers the three questions above, focusing on how to carry out the review.

2. Determine goals

The first principle of a stability review is to assume that problems can occur anywhere. The direct consequence, however, is that during the review everything looks like a problem, and yet nothing is definitely a problem. So the first thing to do is determine the goals. Only with goals can you have focus and priorities.

The overall goals are roughly the same for every business: "no failures above Px", "0 capital loss", and "1-5-10 failure recovery" (detect within 1 minute, locate within 5, recover within 10). However, different businesses and application responsibilities emphasize different goals. For example, a 2C business focuses more on service stability, a 2B business more on data consistency, and a business that involves the flow of funds more on capital loss.

Therefore, before reviewing and troubleshooting, determine the goals of the business application you are responsible for. You can go through all the fault scenarios with colleagues from the GOC, which makes the priorities of the review clearer. Once we have the key goals, we can tell which links are core and prioritize the problems we find, which helps us add monitoring and make technical improvements faster and more effectively later.

Take our own business as an example. Our team, Content Assets, is responsible for managing and mining the value of Youku's tens of billions of content assets. Its three main businesses are CRP for content copyright management, CCC for content rights management, and an open platform for content introduction. CRP manages the flow of copyright information and funds, so its key goal is 0 capital loss. CCC data determines whether downstream clients have the right to play content, and content that cannot be played may lead to customer complaints, so its key goal is to avoid customer complaints caused by inaccurate data. Users of the open platform are partners, which makes it semi-2C, so its key goal is service availability.

CRP's first focus is zero capital loss, so its core links are the ones where capital loss can occur, and any issue that may cause capital loss is high priority. The 2B business focuses on data accuracy, so issues that may make data inaccurate are high priority. Other issues, such as usability problems or service unavailability that affects neither of these two goals, can be ranked low.
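The goal-driven prioritization described above can be sketched in a few lines. This is only an illustration: the goal tags, the `Issue` fields, and the `priority` function are hypothetical names, not part of our actual tooling.

```python
from dataclasses import dataclass

# Hypothetical goal tags per business, taken from the key goals named in the text.
KEY_GOALS = {
    "CRP": {"capital_loss"},
    "CCC": {"data_accuracy"},
    "open_platform": {"service_availability"},
}


@dataclass
class Issue:
    business: str
    description: str
    risks: set  # goal tags this issue could violate


def priority(issue):
    """Return "high" if the issue threatens a key goal of its business, else "low"."""
    key_goals = KEY_GOALS.get(issue.business, set())
    return "high" if issue.risks & key_goals else "low"
```

For example, a CRP issue tagged `{"capital_loss"}` comes out as high, while a CRP issue tagged only `{"usability"}` comes out as low.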

3. Methodology

3.1 Have you found all the problems?

This is the first question a stability review faces. We must admit there is no silver bullet that kills every monster and eliminates every "black swan". But there should be a methodology for finding all the hidden risks within the scope of what we know.

When the team first started the review, I tried to solve this problem by writing a long checklist based on my experience, hoping it would help us find every problem while going through the code. This approach has two flaws. First, "if nothing has priority, everything is low priority": if you assume every line of code may have a problem, you do not know where to start, and the workload is huge. Second, there are "only results, no process": when you actually go through the list of problems found, there is no way to judge whether it is complete.

So this approach is neither reliable nor efficient. We need a methodology that guides the process and leads to complete results.

Process routing

For the whole process review and routing we need three diagrams: the core link diagram, the process sequence diagram, and the problem routing diagram.

picture



Core link diagram

The principle of the review is to assume that every line of code may have a problem, but we cannot review everything at once. So which code should be reviewed with high priority? We derive that from the goals we set at the start: the processes where key problems can occur are the scope of the high-priority review. Take CRP as an example. Its first focus is 0 capital loss, so the payment process, where capital loss can occur, is the module we need to review with high priority. The actual approval and payment flow needs a high-priority review, while generating the payment list does not cause capital loss and can be given lower priority.

From this derivation we identify the core links and turn them into the business's core link diagram. The key outputs in the diagram are: 1. the N core links; 2. the call entry point of each link; 3. the middleware and query dependencies used along the link; 4. the write dependencies.

picture

Process sequence diagram

With the core link diagram we have the priority scope. We then actually go through the code and produce a process sequence diagram for each core link. The sequence diagrams must guarantee two things: 1. N core links correspond to at least N process sequence diagrams, so that none is missed; 2. each process sequence diagram highlights the RPC calls and the operations on key entities, with none left out.
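If the core link diagram is also kept as data, the "N links, at least N sequence diagrams" rule can be checked mechanically. The sketch below is an assumption about how one might record it; the `CoreLink` fields mirror the four key outputs listed earlier, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CoreLink:
    name: str
    entry_points: list                                      # call entry points of the link
    query_deps: list = field(default_factory=list)          # middleware and query dependencies
    write_deps: list = field(default_factory=list)          # write dependencies
    sequence_diagrams: list = field(default_factory=list)   # diagrams drawn so far


def links_missing_diagrams(links):
    """Return the names of core links that have no process sequence diagram yet."""
    return [link.name for link in links if not link.sequence_diagrams]
```

An empty result means every core link has at least one sequence diagram; it says nothing about whether each diagram covers all RPC calls, which still needs a manual review.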

picture

Problem routing diagram

After the first two steps we have found the key nodes in the core links. Next, we route each key node to the problems it is prone to, based on its type, and investigate those in depth. At call entry points, pay attention to traffic, core parameter validation, and idempotency. At write dependencies, pay attention to unavailability, idempotency, data consistency, and so on. Within the process itself, pay attention to transactions, concurrency logic, amount calculations that could cause capital loss, and so on.
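The routing from node type to problem types is essentially a lookup table. A minimal sketch, assuming three node types named after the categories above (the type keys and the `checklist` helper are hypothetical):

```python
# Node type -> problem types to investigate, following the categories in the text.
PROBLEM_ROUTES = {
    "call_entry": ["traffic", "core parameter validation", "idempotency"],
    "write_dependency": ["unavailability", "idempotency", "data consistency"],
    "in_process": ["transactions", "concurrency logic", "capital-loss amount calculation"],
}


def checklist(key_nodes):
    """Map each (node_name, node_type) pair to the problems to check for that node."""
    return {name: list(PROBLEM_ROUTES[node_type]) for name, node_type in key_nodes}
```

An unknown node type raises `KeyError`, which is deliberate here: a key node that does not fit any known type should be discussed rather than silently skipped.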



picture



With these three diagrams, we can analyze the problems in each line of code and map them to the problem types of the key nodes in the core process. And because the core links and key nodes were derived step by step, the derivation helps ensure that the review has no omissions. During a group review, the three diagrams also give a basis for judging whether the review is "complete".

3.2 Have you added all the monitoring?

Once we can be confident that the review is complete, the second question is: have we added all the monitoring? Monitoring is the main way to discover problems and is essential in daily operations and maintenance; the earlier a problem is discovered, the sooner the bleeding can be stopped. Common monitoring methods are data reconciliation and log monitoring. Data reconciliation is typically used to monitor data consistency in information and capital flows, and log monitoring to detect abnormal system processes. So how do we make sure our monitoring is comprehensive and effective?

Data dependency routing

Data reconciliation discovers data inconsistencies in the flow of information and funds. However, a business document has anywhere from dozens to hundreds of fields. Which fields should be monitored, and how should the trigger points for reconciliation be set? Again we use three diagrams for the derivation: the scenario use case diagram, the data model dependency diagram, and the state machine list.

picture

Scenario use case diagram

First, we work from the goal, find every scenario where a key problem can occur, and draw the scenario use cases as a mind map. The point is to cover all problem scenarios, along with the business entities and fields they involve.

Taking our own business as an example, paying for copyright assets is where funds leave the capital chain for content acquired by Youku, so it is prone to capital loss. There are three main capital loss scenarios: an incorrect actual payment amount, an incorrectly calculated payment amount, and duplicate payment.

Payment amount calculation is incorrect:

1. When submitting the CRP payment order, make sure that the payment amount does not exceed the contract guarantee amount.

2、......

The actual payment amount is incorrect: .....

picture

Working through the scenarios above gives us the second diagram.

Data model dependency graph

Continuing the example, the dependencies and key fields of all the business entities involved can be derived from the scenario descriptions in the figure above. These are the core fields that need data reconciliation. With the core fields identified, the second question is which events should trigger reconciliation. In most cases we only add monitoring in the forward direction of the business process, for example using the submission of a CRP payment document as the event that triggers reconciliation of the document amount against the contract amount.

This covers most problem scenarios but can miss some low-frequency yet important ones. For example, a CRP payment order has been submitted and has triggered forward reconciliation, but while the funds are in flight the contract amount is reduced to less than the payment amount. Without timely notification and intervention, a capital loss will occur.

picture

Therefore, when adding monitoring, in addition to forward data reconciliation that follows the business flow, also consider reverse data reconciliation, and build the monitoring in priority order according to the severity of the possible problems.
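The difference between the two directions is only which event triggers the same check. The sketch below illustrates this with the payment-versus-contract example; the entity fields, status values, and handler names are assumptions made for illustration, not our real data model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Contract:
    contract_id: str
    guarantee_amount: Decimal


@dataclass
class PaymentOrder:
    order_id: str
    contract_id: str
    amount: Decimal
    status: str  # hypothetical values such as "SUBMITTED" or "PAID"


def check(order, contract):
    """Return an alert message if the payment exceeds the contract amount, else None."""
    if order.amount > contract.guarantee_amount:
        return (f"order {order.order_id}: amount {order.amount} exceeds "
                f"contract {contract.contract_id} amount {contract.guarantee_amount}")
    return None


def on_payment_submitted(order, contracts):
    """Forward reconciliation: triggered when a payment order is submitted."""
    alert = check(order, contracts[order.contract_id])
    return [alert] if alert else []


def on_contract_changed(contract, orders):
    """Reverse reconciliation: triggered when the contract changes.

    Re-checks every order on this contract that has not been paid yet.
    """
    alerts = []
    for order in orders:
        if order.contract_id == contract.contract_id and order.status != "PAID":
            alert = check(order, contract)
            if alert:
                alerts.append(alert)
    return alerts
```

With a contract of 100 and a submitted order of 80, the forward check passes. If the contract is later reduced to 50, only the reverse handler catches the order that is still in flight.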

State machine list

Because the payment document and the payment voucher (bill) depend on each other's status transitions, and each document's own status transitions also have checkpoint rules, in practice we made a state machine list to help data reconciliation land accurately and without omissions. This diagram is optional, but if your business handles funds and also has complex dependencies between document status transitions, we recommend drawing it to help add monitoring.
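A state machine list translates naturally into a transition table that a reconciliation job can check against. The statuses and rules below are invented for illustration; the real list would come from the diagram.

```python
# Hypothetical statuses and rules; the real ones come from the state machine list.
PAYMENT_DOC_TRANSITIONS = {
    "DRAFT": {"SUBMITTED"},
    "SUBMITTED": {"APPROVED", "REJECTED"},
    "APPROVED": {"PAID"},
    "REJECTED": set(),
    "PAID": set(),
}

# Cross-document rule: voucher status -> payment document statuses it may coexist with.
VOUCHER_REQUIRES_DOC = {
    "CREATED": {"APPROVED", "PAID"},
    "SETTLED": {"PAID"},
}


def illegal_transitions(history, table=PAYMENT_DOC_TRANSITIONS):
    """Return every (from, to) step in a status history that the table does not allow."""
    return [(a, b) for a, b in zip(history, history[1:]) if b not in table.get(a, set())]


def cross_document_ok(doc_status, voucher_status):
    """Check that the voucher status is consistent with the payment document status."""
    return doc_status in VOUCHER_REQUIRES_DOC.get(voucher_status, set())
```

A reconciliation job could run `illegal_transitions` over each document's status history and `cross_document_ok` over each document and voucher pair, and alert on any violation.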

picture

System-level unified monitoring

In addition to data reconciliation, another important means is log monitoring. A common situation with log monitoring is that even after monitoring is added, the problem is still hard to locate. So we need a method that not only detects that a problem has occurred but also locates where it occurred, what it is, and, in a multi-machine application, even which machine has it. Here we borrowed a senior expert's methodology and configured unified monitoring that goes from the whole to the details and from perception to diagnosis, which addresses these problems well.
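One way to make a log line answer "where, what, and which machine" is to emit structured records with those fields. The sketch below uses Python's standard `logging` module; the field names `link`, `node`, and `error_code` are assumptions for illustration, and it does not represent the unified monitoring configuration shown in the picture.

```python
import json
import logging
import socket


class DiagnosticFormatter(logging.Formatter):
    """Format each record as JSON carrying the fields needed to locate a problem."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "host": socket.gethostname(),                            # which machine
            "link": getattr(record, "link", "unknown"),              # which core link
            "node": getattr(record, "node", "unknown"),              # where in the link
            "error_code": getattr(record, "error_code", "unknown"),  # what went wrong
            "message": record.getMessage(),
        })


def build_logger(name="stability"):
    """Return a logger that writes diagnostic JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(DiagnosticFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage: `build_logger().error("payment amount mismatch", extra={"link": "crp_payment", "node": "submit", "error_code": "AMOUNT_MISMATCH"})`. A monitoring rule can then alert on `error_code` and group by `host` to see whether a single machine is affected.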

picture

3.3 Does your technical improvement plan cover everything?

The problems in the code have been identified, and some require technical improvements. Have you considered all the technical solutions for the different problems? With this question in mind, I compiled an action routing diagram that charts the entire path of stability work (review -> add monitoring -> technical improvement -> contingency plan), hoping to give team members systematic guidance as they build stability.

picture

Common solutions fall into several categories: data consistency solutions, idempotency solutions, capital loss prevention solutions, and slow SQL optimization. Slow SQL optimization is a basic skill and is not described in detail here.
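As one illustration of the idempotency category, a common pattern is to have the caller supply an idempotency key and to return the stored result when the same key arrives again. This is a minimal in-memory sketch with hypothetical names; it is not thread-safe, and a production version would typically rely on a unique constraint in a database rather than a dictionary.

```python
class PaymentService:
    """Toy payment service that executes each idempotency key at most once."""

    def __init__(self):
        self._results = {}   # idempotency key -> result of the first call
        self.transfers = []  # stand-in for the real money movement

    def pay(self, idempotency_key, order_id, amount):
        # A repeated key returns the first result and moves no money again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.transfers.append((order_id, amount))
        result = {"order_id": order_id, "amount": amount, "status": "PAID"}
        self._results[idempotency_key] = result
        return result
```

A retried request with the same key gets the original result back, so a timeout followed by a retry cannot produce a duplicate payment.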

4. Get results

Needless to say, this part tracks the process results for each stage of action routing and the achievement of the final result. The form of presentation is not important; what matters is following up until the final result is achieved. So far, our team has identified 13 high-priority problems on the core links, and all the technical improvement plans are being reviewed and scheduled for resolution; together with our testing colleagues we have added 42 new monitoring items, 19 of which came out of the review; and we have had no P-level failures and 0 capital loss.

picture

5. Finally

System stability is the foundation for delivering all business value, and its importance is self-evident. Stability work is an ongoing task. In the most recent fiscal year we took over many new businesses, so we carried out an intensive review. The work is large and complex and requires full cooperation among the testing, product, and development teams. I hope the methodologies summarized along the way will be helpful to everyone.


Origin blog.csdn.net/AlibabaTech1024/article/details/133077631