Chaos Engineering: The Netflix Way to System Stability

From the moment software practitioners write their first line of code, they are locked in an unending fight against software errors.

In recent years, as system architectures have gradually evolved toward microservices, development efficiency and system scalability have improved greatly. At the same time, system complexity has increased: traditional testing methods can no longer cover every possible behavior or give a comprehensive understanding of the system, so the effectiveness of testing has dropped sharply. Teams try to take preventive measures through various kinds of testing, SRE, DevOps, canary releases, blue-green deployments, contingency plans, failure drills, and other methods. But as the number of services grows, the uncertainty introduced by the dependencies between them grows exponentially. In such a network of service calls, an anomaly in any one link, or even a change that looks normal, can ripple out to other services in something like a butterfly effect.

The surge in the inherent complexity of software systems, and developers' underestimation and neglect of risk as they introduce that complexity, are the two major challenges facing system availability.

To meet these two challenges, Netflix chose an unusual approach.

In 2008 Netflix began migrating its services from the data center to the cloud, and soon afterward it began testing system resilience in the production environment. Only some time later did this practice become known as chaos engineering. The earliest and best-known tool is Chaos Monkey, "notorious" for randomly shutting down service nodes in production. It later evolved into Chaos Kong, which expanded the benefits gained at small scale to a very large scale. This expansion of scale was made possible by a tool called Fault Injection Testing (FIT).

Netflix subsequently established the principles of chaos engineering to standardize the practice as a discipline, and launched a chaos engineering automation platform so that experiments can run automatically around the clock (7 × 24) on its microservice architecture.

Starting with Chaos Monkey, Netflix brought a new way of thinking to dealing with uncertainty: be proactive. The set of practices derived from this proactive mindset is chaos engineering, which aims to fundamentally change how developers think about software defects and failures.

Previously, we hoped that a series of tests and verification measures would do everything possible to ensure that the system ran in production free of defects and failures. Chaos engineering holds that this idea is neither realistic nor consistent with how systems naturally evolve. It advocates that we first accept that the system will have defects and that failures will occur from time to time. It then asks us to identify potential risk points through a series of experiments and keep hardening the system accordingly, which pushes developers to build defenses into their software as they write it.

The word "chaos" calls to mind randomness and disorder. But that does not mean chaos engineering is carried out randomly or haphazardly, nor that a chaos engineer's job is to cause confusion. Letting the system benefit from every failure and then keep evolving is the core idea of chaos engineering.

In practice, chaos engineering advocates running a series of experiments to verify how the system really behaves under various failure scenarios. Running many experiments frequently both keeps improving the antifragility of the system itself and gives developers more confidence in it. Because a large number of experiments run automatically every day, developers must ask themselves while coding, "How will my code survive these chaos experiment scenarios?" Quality gradually improves, forming a virtuous circle.

Chaos engineering is well suited to exposing unknown weaknesses in production systems. But if you already know for certain that a chaos experiment will cause a serious failure, running it is pointless. Fix that problem first, then return to chaos engineering. After running the experiments, you will either keep discovering more unknown weaknesses or gain more confidence in the system's real level of resilience.

In addition, you need a supporting monitoring system to determine the current state of the system. If you cannot observe the system's behavior, you cannot draw valid conclusions from an experiment.
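To make the shape of such an experiment concrete, here is a minimal Python sketch. It is a toy simulation, not Netflix's tooling: `ToyCluster`, the retry count, the request volume, and the 0.95 threshold are all hypothetical names and values chosen for illustration. It measures a steady-state metric (request success rate), injects a fault by terminating one randomly chosen instance in the spirit of Chaos Monkey, then measures again to check whether the steady-state hypothesis still holds.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for a service with four instances."""
    healthy: list = field(default_factory=lambda: [True, True, True, True])

    def handle_request(self, rng, retries=2):
        """Pick a random instance; on a dead one, retry with a new random pick."""
        for _ in range(retries + 1):
            if self.healthy[rng.randrange(len(self.healthy))]:
                return True
        return False

    def kill_random_instance(self, rng):
        """Fault injection: terminate one randomly chosen healthy instance."""
        alive = [i for i, ok in enumerate(self.healthy) if ok]
        victim = rng.choice(alive)
        self.healthy[victim] = False
        return victim


def success_rate(cluster, rng, requests=2000):
    """The 'monitoring' step: fraction of simulated requests that succeed."""
    ok = sum(cluster.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(seed=0, threshold=0.95):
    """Measure steady state, inject a fault, measure again, compare."""
    rng = random.Random(seed)
    cluster = ToyCluster()
    baseline = success_rate(cluster, rng)
    victim = cluster.kill_random_instance(rng)
    after = success_rate(cluster, rng)
    return {
        "baseline": baseline,
        "victim": victim,
        "after": after,
        "hypothesis_holds": after >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment())
```

Without the `success_rate` measurement before and after the fault, the experiment would have nothing to compare, which is the point the paragraph above makes about observability. In a real system the metric would come from the monitoring stack rather than a simulated loop.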

You can think of chaos engineering as an empirical answer to the question "How far are our systems from the edge of chaos?" Or, from another angle: "If we inject chaos into the system, what will happen?"

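The best way to reduce problems is to make them happen regularly: by repeatedly going through failures and finding solutions, you continuously improve the system's fault tolerance and resilience. As an emerging discipline, chaos engineering is still in the process of defining itself and being defined. If you are interested in chaos engineering and want to understand and practice it, a highly recommended starting point is the book 《混沌工程:Netflix系统稳定性之道》 (Chaos Engineering: The Netflix Way to System Stability).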


Origin blog.csdn.net/broadview2006/article/details/98588201