Supporting 618: Y's journey in chaos engineering practice | JD Cloud technical team

1. Foreword

1. What is chaos engineering?

The concept of Chaos Engineering was proposed by Netflix in 2010. It is a method for ensuring system stability: abnormal states are deliberately introduced into the system, and optimization strategies are chosen based on how the system behaves under those pressures.

Chaos engineering is the discipline of conducting experiments on distributed systems, with the goal of building confidence in a complex system's ability to withstand unexpected events in a production environment.

2. Why practice chaos engineering?

Chaos engineering simulates the imperfect conditions of the real world by intentionally introducing faults, anomalies, or uncertainty. Its core idea is to verify and improve the robustness of a system step by step, so that the system stays stable and reliable when it meets complex real-world conditions. Its purpose is to expose potential weaknesses, improve the resilience of application systems, reduce the impact of failures, and give users a better experience.

3. Principles of chaos engineering

Chaos engineering follows these main principles:

1. Hypothesis-driven: State the key hypotheses about the system's behavior and performance. They can be based on system requirements, design decisions, or the operating environment. Chaos experiments should aim to verify or disprove these hypotheses.
2. Experimentation: Simulate the imperfect conditions of the real world by intentionally injecting faults, anomalies, or uncertainty. Experiments should be controllable and repeatable, so that system responses can be tested and observed within safe limits.
3. Minimizing blast radius: When running chaos experiments, minimize the negative impact on the production environment and on users. Limit the scope and impact of each experiment, and protect critical business functions with appropriate risk management.
4. Monitoring and measurement: The system must be closely monitored, with alerting in place, throughout the experiment. Collect experimental data with monitoring tools and metrics to assess system stability and resilience.
5. Analyzing and learning: Review the experimental results and draw lessons from them. Determine the root cause of each problem and develop an improvement plan accordingly.
6. Continuous improvement: Chaos engineering is a process of continuous improvement. The resilience, stability, and recoverability of the system improve through repeated experimentation, analysis, and correction.
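
The principles above can be read as a loop: state a hypothesis, inject a fault, watch a metric, and stop as soon as the hypothesis fails. The following is a minimal Python sketch of that loop, not JD's tooling. The names `Experiment`, `run_experiment`, and `FakeService`, the 0.95 threshold, and the success-rate samples are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (illustrative only)."""
    hypothesis: str                    # what we expect to remain true
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # stops the fault
    steady_state: Callable[[], float]  # reads the metric being watched
    threshold: float                   # metric must stay at or above this
    max_checks: int = 5                # bounds how long the fault stays active


def run_experiment(exp: Experiment) -> Dict[str, object]:
    """Run one experiment; abort and roll back once the metric drops too low."""
    observations: List[float] = []
    if exp.steady_state() < exp.threshold:
        # The system is already unhealthy, so do not add a fault on top.
        return {"status": "skipped", "observations": observations}
    exp.inject()
    try:
        for _ in range(exp.max_checks):
            value = exp.steady_state()
            observations.append(value)
            if value < exp.threshold:
                return {"status": "hypothesis_disproved", "observations": observations}
        return {"status": "hypothesis_held", "observations": observations}
    finally:
        exp.rollback()  # always remove the fault, even on early exit


class FakeService:
    """Stand-in for a real system; the success rates are invented sample data."""

    def __init__(self) -> None:
        self.fault_active = False
        self._degraded = [0.99, 0.97, 0.90, 0.80]
        self._index = 0

    def inject(self) -> None:
        self.fault_active = True

    def rollback(self) -> None:
        self.fault_active = False

    def success_rate(self) -> float:
        if not self.fault_active:
            return 0.999
        value = self._degraded[min(self._index, len(self._degraded) - 1)]
        self._index += 1
        return value


service = FakeService()
result = run_experiment(Experiment(
    hypothesis="success rate stays >= 0.95 while one node is unreachable",
    inject=service.inject,
    rollback=service.rollback,
    steady_state=service.success_rate,
    threshold=0.95,
))
print(result["status"], result["observations"])
```

The `finally` block and the `max_checks` bound are one simple way to keep the blast radius small: the fault is removed whether the hypothesis holds or not, and it can never stay active longer than a fixed number of checks.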

2. The development of chaos engineering at Y

Over the past three years, JD Chaos Engineering, as one of the three lines of defense, has played a very important role in preparing for major promotions. Y's chaos practice has been upgraded continuously along two dimensions, application coverage and scenario coverage, which gave the work a clear direction for improvement, and Y has achieved a series of breakthroughs and results in the group's chaos competitions.
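
The two dimensions named above, application coverage and scenario coverage, can be tracked from simple drill records. The sketch below shows one possible way to compute them. The article does not define either metric, so the definitions here (share of applications with at least one drill, and share of required scenarios drilled per application), as well as all application and scenario names, are assumptions for illustration.

```python
from typing import Dict, Set


def application_coverage(all_apps: Set[str], drilled: Dict[str, Set[str]]) -> float:
    """Share of applications that have had at least one drill (assumed definition)."""
    if not all_apps:
        return 0.0
    covered = {app for app in all_apps if drilled.get(app)}
    return len(covered) / len(all_apps)


def scenario_coverage(
    all_apps: Set[str],
    required: Set[str],
    drilled: Dict[str, Set[str]],
) -> Dict[str, float]:
    """Per application, share of required scenarios already drilled (assumed definition)."""
    if not required:
        return {app: 0.0 for app in all_apps}
    return {
        app: len(drilled.get(app, set()) & required) / len(required)
        for app in all_apps
    }


# Hypothetical sample data.
apps = {"order", "cart", "search", "coupon"}
required_scenarios = {"network_loss", "cpu_full", "memory_full"}
drill_records = {
    "order": {"network_loss", "cpu_full", "memory_full"},
    "cart": {"network_loss"},
    "search": set(),
}

print(application_coverage(apps, drill_records))
print(scenario_coverage(apps, required_scenarios, drill_records))
```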



1. Exploration stage (2021)

Back at 618 in 2021, Y's main goal was to run exploratory pilots. Chaos testing mainly covered non-level-0/1 applications, the drills focused on simple scenarios such as network disconnection, and both the attacking and the defending side of each drill were handled by R&D.

2. Development stage (2022)

With the iterative upgrade of JD Chaos Engineering in 2022, the available drill scenarios and the system's usability improved significantly. Y focused on comprehensive scenario coverage, expanding from basic resource failures to external dependency failures and then to advanced scenarios, in order to keep improving system stability. At the same time, level 0/1 core systems were gradually brought into scope, and a chaos drill operation manual, drill specifications, and related documents were built up. In each drill, testing acted as the attacking side and R&D as the defending side, with a clear division of responsibilities.

 
For 618 in 2022, the testing team took over the chaos drills and carried out the following work before, during, and after each drill:
1. Define drill objectives: State the objectives and expected outcomes of the drill. This includes the scope of application systems to be drilled, the scenarios to execute, the monitoring configuration of each application, how failure scenarios will be observed, and the problem handling mechanism, all with the aim of improving the health of the application system.
2. Identify key components and scenarios: Identify the key components and dependencies in the system, and the typical scenarios that may affect its stability and performance, such as network failures, resource exhaustion, and high concurrency.
3. Develop a drill plan: Draw up a detailed plan covering the time, scope, and duration of the drill and the roles and responsibilities of the participants. Make sure all participants understand the plan and the expected results.
4. Set up monitoring: Before the drill, check the MDC, UMP, middleware, and other configurations in Taishan to understand the application's monitoring information, and use that information to simulate system failures in a more targeted way.
5. Execute drill scenarios: Run the chaos drills according to the plan, for example by simulating network, memory, CPU, or middleware failures. Observe the system's alarms and the responses of R&D and operations, and record key indicators and events.
6. Review results: After the drill, collect and analyze the data and observations gathered during the experiment. Evaluate the system's stability, recoverability, and ability to handle abnormal conditions. Identify existing problems and formulate a systematic improvement plan.
7. Improve and optimize: Based on the drill results and analysis, formulate an improvement plan and take the corresponding measures, such as fixing bugs, improving fault tolerance, or optimizing resource utilization. Document the lessons learned so that they can be used in future drills and operations.
8. Drill regularly and improve continuously: Run chaos drills as a routine practice to keep the system stable and resilient over time.
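
Steps 1 to 4 above amount to a checklist that should be complete before any fault is injected. As a rough sketch, that checklist could be encoded as a plan object plus a readiness check. The `DrillPlan` fields, the `attacker` and `defender` role names, and the sample values are hypothetical and are not taken from JD's drill specifications.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DrillPlan:
    """Hypothetical drill plan covering objectives, scope, roles, and monitoring."""
    objective: str
    applications: List[str]
    scenarios: List[str]
    duration_minutes: int
    roles: Dict[str, str]            # role name -> owning team
    monitoring_checked: bool = False


def readiness_problems(plan: DrillPlan) -> List[str]:
    """Return every reason the plan is not ready; an empty list means ready."""
    problems: List[str] = []
    if not plan.objective.strip():
        problems.append("objective is missing")
    if not plan.applications:
        problems.append("no applications in scope")
    if not plan.scenarios:
        problems.append("no scenarios selected")
    if plan.duration_minutes <= 0:
        problems.append("duration must be positive")
    for role in ("attacker", "defender"):
        if role not in plan.roles:
            problems.append(f"no owner for role '{role}'")
    if not plan.monitoring_checked:
        problems.append("monitoring configuration not verified")
    return problems


plan = DrillPlan(
    objective="Verify alarms fire when the order service loses network",
    applications=["order-service"],
    scenarios=["network_loss"],
    duration_minutes=10,
    roles={"attacker": "testing"},
)
for problem in readiness_problems(plan):
    print("not ready:", problem)
```

Returning the full list of problems, rather than stopping at the first one, lets the drill owner fix everything in a single pass before the drill is scheduled.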

3. Growth stage (2023)

Building on the lessons learned in 2022, Y focused on raising application coverage for 618 in 2023 and ultimately reached 99.68%, ranking first in Retail. The drill strategy prioritized completing the 9 major scenarios recommended by the system, as required by the group, while also selecting additional scenarios in a targeted way and improving system monitoring. In the end, level 0/1 applications reached a health score above 95 and all high-risk items were cleared. During the promotion, every system met its performance targets and no production incidents occurred. These results are inseparable from the team's strict adherence to the following principles at every stage and the high standard applied to each drill:

1. Goal-driven: Make sure each drill has clear goals and expected outcomes so that its effectiveness and value can be assessed.
2. Progressive iteration: Gradually increase the complexity and difficulty of the drill scenarios, so that the team can adapt to change and improve the robustness of the system step by step.
3. Continuous learning: Regularly review drill results and feedback, record every experiment, problem, and challenge, classify and analyze them by lesson learned, and make adjustments and improvements based on the review.
4. Passing on experience: Based on the lessons learned and on successful experiments, write a best practice guide covering drill planning, scenario selection, execution planning, monitoring, and R&D problem handling mechanisms, to help the team run chaos drills better.
5. Cross-team collaboration: Chaos drills require close collaboration among development, operations, testing, and other teams, as well as repeated communication with the team that builds the chaos engineering tooling, to jointly improve the stability and robustness of the application system.

3. The difference between chaos engineering and traditional testing

Chaos engineering is an experimental method that helps us gain new knowledge about a system, and it usually opens up a broader space of understanding for complex systems. It is fundamentally different from existing methods that test known properties, such as functional testing and integration testing.

Traditional testing supplies a specific condition and expects the system to produce a specific, binary result. It only checks the possible values of known system properties.

The mindset of chaos engineering is to go looking for faults actively; it is exploratory. For example, a degradation plan may have been prepared as planned, yet shutting down a node can still trigger an upstream service failure that cascades into an avalanche. A problem like this cannot be found by scripted fault injection or advance planning alone.
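
The contrast can be illustrated with a toy model. In the sketch below, a traditional test confirms that a degradation fallback works for one known failure, while an exploratory run that removes a node exposes retry amplification that the test never looks at. The function names, capacities, load figures, and retry policy are all invented for illustration and do not describe any real JD system.

```python
from typing import Callable, List, Tuple


def get_stock(call_inventory: Callable[[], int], cached: int = 0) -> int:
    """Degradation plan: fall back to a cached value if the call fails."""
    try:
        return call_inventory()
    except ConnectionError:
        return cached


def failing_call() -> int:
    """Simulates a known failure: the inventory node is unreachable."""
    raise ConnectionError("inventory node is down")


def simulate_retry_load(
    nodes: int,
    per_node_capacity: int = 100,
    base_load: int = 250,
    retries_per_failure: int = 2,
    rounds: int = 5,
) -> List[Tuple[int, int]]:
    """Toy model: every failed call is retried in the next round.

    Returns one (offered_load, failed_calls) pair per round.
    """
    capacity = nodes * per_node_capacity
    load = base_load
    history: List[Tuple[int, int]] = []
    for _ in range(rounds):
        failed = max(0, load - capacity)
        history.append((load, failed))
        load = base_load + failed * retries_per_failure
    return history


# Traditional test: a known condition with a binary pass/fail result.
assert get_stock(failing_call, cached=7) == 7

# Exploratory experiment: remove one node and observe what actually happens.
healthy = simulate_retry_load(nodes=3)
degraded = simulate_retry_load(nodes=2)
print("3 nodes:", healthy)
print("2 nodes:", degraded)
```

With three nodes the offered load stays flat in this model; with two nodes each round's failures feed more retries into the next, so the load keeps growing. The fallback test passes in both cases, which is why it says nothing about the avalanche.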

4. Closing remarks

Chaos engineering is a sophisticated technique for improving the resilience of a technical architecture. It aims to nip failures in the bud, that is, to identify them before they cause disruption. By actively creating faults, it tests how the system behaves under various stresses so that problems can be identified and fixed before they have serious consequences.

As new features keep launching and dependencies keep changing, a system can develop a series of unknown failure modes. The most important thing in practicing chaos engineering is therefore to make it sustainable: running chaos experiments continuously is what keeps delivering their value. Y is already on its way!



Authors: Li Jinping, Ma Chunrong (JD Retail)

Source: JD Cloud Developer Community



Origin my.oschina.net/u/4090830/blog/10092147