Chaos Drill Practice (2) - Payment Add Link Drill | JD Cloud Technical Team

1. Background

In the current microservice architecture, services depend heavily on one another and their call relationships are complex. A business scenario can rarely be implemented by a single system; most common scenarios involve multiple upstream and downstream systems. To keep the overall link stable, coupling between systems must be minimized so that a single point of failure does not bring down the entire link.

2. Goals

Use chaos drills to verify how the overall link behaves when some of its systems fail, evaluate its ability to keep operating normally, and identify and fix unknown hidden risks in advance. This helps the entire link better withstand out-of-control conditions in the production environment and improves the stability of the overall scenario.

3. Drill link

To run chaos drills against real business scenarios, first sort out the services involved and their call relationships. This usually means drawing a system interaction diagram based on the actual business scenario, then organizing the link diagram by chaining the links together, tracking data, confirming with upstream and downstream teams, and other means.

4. Exercise plan

Before a chaos drill, assess its feasibility: the service deployment environments in which a drill can be run, the maturity of the drill tooling, the blast radius of the drill scenario, and so on. Then decide on the drill scenarios and carry them out.
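One way to estimate the blast radius of a drill scenario is to walk the call graph backwards from the faulty service. The sketch below is only an illustration under stated assumptions: the service names and the `CALLS` graph are hypothetical and do not describe the real link.

```python
from collections import deque

# Hypothetical call graph (caller -> services it calls); not the real link.
CALLS = {
    "gateway": ["engine"],
    "engine": ["ab-split", "cms"],
    "cms": ["content-send"],
    "ab-split": [],
    "content-send": [],
}


def blast_radius(calls, faulty):
    """Return the faulty service plus every direct or transitive caller of it."""
    # Invert the graph so we can look up "who calls this service?".
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected = {faulty}
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected
```

A fault injected into a leaf dependency can reach every caller above it, which is why the blast radius is worth estimating before choosing a drill environment.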

5. Content loading practice

5.1 Link sorting

For the content loading link drill, the loading link is sorted out from the content loading system interactions: gloe engine execution -> AB splitting -> CMS resource acquisition -> Eagle Eye content sending

5.2 Interface sorting

According to the call relationship of the call link, the specific interface is sorted out:

5.3 Develop an exercise plan

Drill time: 2023.03.28 14:00-22:00

Drill attackers: Sun Xying, Chen Xran; Drill defenders: Zhang Xlei, Fu Xjun, Liu X, Han X

Drill scenarios are designed per link interface, generally choosing the faults most likely to occur given each system's characteristics. For example, if an application is computation-heavy and CPU-intensive, the fault design includes CPU full load; if an application has strict response-time requirements, it generally includes method delay faults.
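To make the method delay fault concrete, here is a minimal sketch of delay injection written as a decorator. It is an illustration only and not how the drill platform injects faults; `inject_delay`, `send_message`, and the `enabled` switch are hypothetical names.

```python
import functools
import time


def inject_delay(delay_ms, enabled=lambda: True):
    """Wrap a function so each call sleeps for delay_ms first while the fault is enabled."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                time.sleep(delay_ms / 1000.0)  # the injected fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical business method with a 30 ms injected delay.
@inject_delay(30)
def send_message(user_id):
    return f"sent:{user_id}"
```

Because the wrapped method still returns its normal result, the drill exercises callers' timeout and latency handling rather than their error handling.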

The link failure scenario design is as follows:

For specific drill scene design, please refer to: Chaos combat drill (1)

5.4 Exercise Execution

Chaos attack and defense drills are currently run on the Tianquan automated operation and maintenance platform: open the tool market, go to the drill category, select a fault scheme, and click "Execute Now".

For example, for the Java process full load scenario, set the full load rate to 100%, set the number of fully loaded cores to the number of CPU cores of the service instance under drill, and set the drill duration to how long the full load should run. Then select the application to drill, specify the IP, and execute the drill plan.
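The parameters described above can be captured as a small validated record before a drill is submitted. This is a sketch under assumptions: the field names are hypothetical and are not the Tianquan platform's actual parameter names or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuLoadDrill:
    """Hypothetical description of a CPU full load drill; not a platform API."""
    app: str
    target_ip: str
    load_percent: int
    cores: int
    duration_seconds: int

    def __post_init__(self):
        # Reject obviously invalid settings before anything is executed.
        if not 0 < self.load_percent <= 100:
            raise ValueError("load_percent must be in (0, 100]")
        if self.cores < 1:
            raise ValueError("cores must be at least 1")
        if self.duration_seconds <= 0:
            raise ValueError("duration_seconds must be positive")


def full_load_drill(app, target_ip, instance_cores, duration_seconds):
    """Build a 100% load drill that covers every CPU core of the target instance."""
    return CpuLoadDrill(app, target_ip, 100, instance_cores, duration_seconds)
```

For instance, `full_load_drill("offload-service", "10.0.0.1", 4, 300)` would describe loading all four cores of one instance for five minutes; the application name and IP here are placeholders.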

Drill example: configure the fault parameters according to the drill scenario. The following figure shows the parameter settings for the Precise Reach system scenario, in which the message reach method's delay is increased by 30 ms:

Checking the drill result: the following figure shows the fault execution result for the offload service scenario, in which the Java process is fully loaded and the CPU of the specified offload process is saturated:

5.5 Exercise monitoring

Use monitoring tools to collect server performance data in real time during the chaos drill, such as system-level CPU and memory usage, and observe method-level indicators such as response time and success rate. The first purpose is to verify that the system state matches expectations while the chaos scenario runs, and to record any problems that occur together with the context in which they occurred. The second is safety: if monitoring reveals a risky problem, intervene manually and terminate the drill immediately.
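The termination rule described above can be expressed as a simple guard over the collected metrics. The sketch below assumes a nearest-rank definition of TP999 and hypothetical thresholds; the actual monitoring platform may compute and alert on these indicators differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Round before ceil to avoid floating-point noise pushing the rank up by one.
    rank = math.ceil(round(pct * len(ordered) / 100.0, 9))
    return ordered[max(rank, 1) - 1]


def should_abort(latencies_ms, successes, total, max_tp999_ms, min_success_rate):
    """Return True if the drill should be terminated based on TP999 or success rate."""
    tp999 = percentile(latencies_ms, 99.9)
    success_rate = successes / total
    return tp999 > max_tp999_ms or success_rate < min_success_rate
```

In practice the thresholds would come from the service's own objectives; the point is that the abort condition is decided before the drill starts rather than during it.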

Scenario 1: Precise Reach System - Message Reach Method Delay Increased by 30ms

Monitoring of method execution success rate and TP999 during the drill:

Scenario 2: Offload service - Java process fully loaded, with the CPU of the specified offload process at full load

The monitoring platform monitors the CPU usage of the system in real time:

5.6 Exercise Feedback

Check whether monitoring is adequate when a system fault occurs and whether R&D personnel can quickly locate and resolve the problem from the system alarms. Also check the team's response speed and collaboration efficiency.

Email accident alert:

Accident Recovery Alert:

5.7 Environmental Restoration

A scenario drill stops automatically once the drill duration has elapsed, or it can be cancelled and terminated manually.

After the drill is complete, it is recommended to restart the container to ensure that the service returns to the normal state.
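The two ways a drill can end, timeout or manual cancellation, can be modelled with a single wait on an event. This is a minimal sketch with hypothetical names (`run_drill`, `start_fault`, `stop_fault`), not the platform's implementation.

```python
import threading


def run_drill(start_fault, stop_fault, duration_seconds, cancel_event):
    """Start a fault, then stop it when the duration elapses or cancel_event is set."""
    start_fault()
    try:
        # wait() returns True if the event was set, False if the timeout expired.
        cancelled = cancel_event.wait(timeout=duration_seconds)
    finally:
        # Always remove the fault, even if waiting is interrupted.
        stop_fault()
    return "cancelled" if cancelled else "completed"
```

Putting the cleanup in `finally` reflects the same concern as restarting the container afterwards: the service should not be left in a faulty state whichever way the drill ends.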

5.8 Exercise review

After the drill, a review is required. The review report should present how the system behaved under each fault, the impact on the overall business scenario, and the problems found during the drill.

6. Summary

Link drills proactively inject faults ahead of time, uncover strong and weak dependencies between systems, and test the link, reducing the probability of faults in the production environment.
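The idea of discovering strong and weak dependencies can be sketched as follows: fail one dependency at a time and see whether the main flow still succeeds. Everything here is hypothetical, including the `load_content` flow and which dependency turns out to be strong or weak; it makes no claim about the real systems.

```python
def classify_dependencies(flow, dependencies):
    """Fail each dependency in turn; 'strong' if the flow breaks, 'weak' if it degrades gracefully."""
    result = {}
    for name in dependencies:
        try:
            flow(failed={name})
            result[name] = "weak"
        except Exception:
            result[name] = "strong"
    return result


def load_content(failed=frozenset()):
    """Hypothetical flow: needs CMS resources, but falls back to a default group without AB splitting."""
    if "cms" in failed:
        raise RuntimeError("CMS resources unavailable")
    group = "default" if "ab-split" in failed else "B"
    return {"group": group, "content": "banner"}
```

A dependency classified as strong is a candidate for a fallback or degradation strategy, which is one of the repairs a drill review can recommend.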

"Be prepared for danger in times of peace, and be prepared when you think about it. Be prepared and avoid danger."

Author: JD Technology Sun Minying

Content source: JD Cloud developer community


Origin my.oschina.net/u/4090830/blog/8900153