[Stability Day 9] Chaos engineering in practice: don't destroy bugs, you might as well make friends with them

This article is based on a talk shared by Sun Jun of Youzan.

With the end of Moore's Law, single-machine computing performance has reached its limit, yet our software systems keep growing in both size and complexity, so they are invariably moving toward distributed architectures. In recent years, the arrival of cloud services and containers has made it easier to split distributed systems into microservices.

Whatever distributed technology is used, our reliability requirements stay the same: a distributed system must be highly available, and even when a single node or a whole cluster fails, we want the system to be fault tolerant, either recovering on its own or degrading gracefully.

We put a lot of effort into sound architecture, high-quality code, and thorough testing, yet many distributed systems still fall short of high availability and resilience. To uncover possible weaknesses, many large software companies are adopting chaos engineering, such as Google and Netflix abroad and JD.com in China. Examples of system weaknesses include:

  • An external system failure that causes cascading failures inside the system; for example, we have had internal services fail because of a Qiniu outage

  • Inappropriate degradation when a service is unavailable

  • An inappropriate timeout mechanism that leads to endless retries of failed requests
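
The last weakness above, retries that never stop, is usually avoided by capping the number of attempts and backing off between them. The following is a minimal sketch of that idea; the function name `call_with_retry` and its parameters are hypothetical and do not come from the original talk.

```python
import time


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` attempts were made.

    Hypothetical helper: the retry count is bounded, and the wait between
    attempts doubles each time (exponential backoff).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt will follow.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: surface the last error instead of retrying forever.
    raise last_error
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a production version would probably also limit which exception types are retried.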

Definition of chaos engineering:

Chaos engineering means injecting faults into a distributed system in a controlled way and observing how its behavior changes, in order to uncover weaknesses, fix them in a targeted manner, improve reliability, and build confidence in the system's ability to withstand out-of-control conditions. Chaos engineering is therefore not a new concept: the disaster recovery drills commonly run against applications are also a form of chaos engineering.

General steps for implementing chaos engineering

  • Pick metrics that describe the system under normal operation as the "steady state" baseline

  • Hypothesize that both the experimental group and the control group will keep this "steady state"

  • Inject events into the experimental group, such as a server crash, a disk failure, or a network disconnection

  • Compare the "steady state" of the experimental and control groups to try to disprove the hypothesis from step 2

If the two groups' steady states are still consistent after the experiment, the system can be considered resilient to this fault, which builds more confidence in it. Conversely, if the steady states diverge, we have found a weakness, which we can then fix to improve reliability.
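
The steps above can be sketched as a small experiment loop. This is only an illustration under stated assumptions: `measure`, `inject`, and `rollback` are hypothetical callbacks supplied by the caller, the steady state is reduced to a single numeric metric, and "consistent" is taken to mean a relative difference within `tolerance`.

```python
from statistics import mean


def steady_state_holds(control, experiment, tolerance=0.05):
    """Return True if the experimental group's mean metric stays within
    `tolerance` (relative difference) of the control group's mean."""
    baseline = mean(control)
    if baseline == 0:
        return mean(experiment) == 0
    return abs(mean(experiment) - baseline) / abs(baseline) <= tolerance


def run_experiment(measure, inject, rollback, samples=5, tolerance=0.05):
    """Inject a fault, sample both groups, and always roll back.

    `measure(group)` returns the current metric for "control" or
    "experiment"; `inject()` applies the fault to the experimental group
    only; `rollback()` removes it. All three are assumed callbacks.
    """
    inject()
    try:
        control = [measure("control") for _ in range(samples)]
        experiment = [measure("experiment") for _ in range(samples)]
    finally:
        rollback()  # restore the system even if measuring fails
    return steady_state_holds(control, experiment, tolerance)
```

A result of `False` means the hypothesis was disproved and a weakness was found; `True` only says the system tolerated this particular fault under this particular metric.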

Ideal principles of chaos engineering:

1) Build a hypothesis around the system's "steady state"

Take order placement in e-commerce as an example. The system may include a product service, a trade service, and a payment service. The hypothesis does not focus on the state of each individual "screw" service but on the externally visible state of the whole ordering system under normal operation, such as order volume, transaction amount, throughput, latency, and error rate. These metrics usually have dashboard monitoring, and outside of promotional campaigns their curves do not swing sharply, so their trends are predictable. One caveat: some problems barely affect the dashboard metrics (a cache miss or a failed CDN node, for example), so we still need to monitor per-node metrics (such as CPU and IO) to detect them. Cache invalidation, for instance, can increase pressure on the MySQL cluster and show up as higher CPU and IO load.

2) Inject events that are likely to happen in the real world

Anything that may affect the stability of the system can serve as an event. Common examples:

  • Failures: hardware failures such as server downtime or network disconnection, and software failures such as Qiniu or another external service being unavailable

  • Non-failure events: such as a traffic surge

We can also analyze the types and frequencies of events that have already caused system failures, prioritize accordingly, and implement those events first, so that the same failures do not hit the system again.
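
As a rough illustration of prioritizing by past frequency, the sketch below ranks event types by how often they appear in an incident history. The record layout (a dict with a `"type"` key) and the function name are assumptions made for the example.

```python
from collections import Counter


def prioritize_events(incidents):
    """Rank event types by how often they caused an incident,
    most frequent first. Each incident is assumed to be a dict
    with a "type" key (hypothetical record format)."""
    counts = Counter(incident["type"] for incident in incidents)
    return [event_type for event_type, _ in counts.most_common()]


history = [
    {"type": "external_service_down"},
    {"type": "traffic_surge"},
    {"type": "external_service_down"},
    {"type": "disk_full"},
    {"type": "traffic_surge"},
    {"type": "external_service_down"},
]
ranked = prioritize_events(history)
```

A fuller version would likely weight each incident by severity or business impact rather than counting occurrences alone.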

3) Run in production

Following principle 1, usually only production metrics are predictable, such as daily new user registrations and daily order volume. Moreover, since test and production environments can never be exactly the same, chaos engineering is usually recommended to run in production so that it truly reflects the reliability of the system.

4) Continuous integration

Internet software is updated every day, so chaos engineering only has practical value if it runs continuously, as part of continuous integration.

5) Minimize the scope

Following principle 3, chaos engineering may make online features unavailable or even cause financial loss. So, while still serving the goal of finding weaknesses, we need to minimize the scope of a failure and be able to recover quickly when a serious problem occurs; in other words, the failure must be controllable. For this reason, A/B testing is sometimes introduced to minimize the impact.
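
One simple way to limit the scope is to inject into only a small fraction of hosts and to stop as soon as a guard metric crosses a threshold. The sketch below assumes a flat list of host names and an error-rate guard; both helper names and the default thresholds are hypothetical.

```python
import math
import random


def pick_targets(hosts, fraction=0.05, rng=None):
    """Choose a small random subset of hosts for fault injection.

    At least one host is picked when the list is not empty, so that the
    experiment still runs on very small clusters.
    """
    if not hosts:
        return []
    rng = rng or random.Random()
    count = max(1, math.ceil(len(hosts) * fraction))
    return rng.sample(hosts, min(count, len(hosts)))


def should_abort(error_rate, max_error_rate=0.02):
    """Return True when the observed error rate exceeds the allowed
    maximum, meaning the experiment should be rolled back immediately."""
    return error_rate > max_error_rate
```

Keeping the untouched hosts as the control group also fits the experimental and control group comparison described earlier.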

The above describes how chaos engineering works under ideal circumstances. In reality, it has to be rolled out in phases that match the maturity of the existing software:

Phase 1: distributed systems with average resilience

Take JD.com as an example. Before the Double 11 promotion they run failure drills: the team is split into two groups, one creating faults and the other detecting and resolving them, in order to test the team's fault detection, response, handling, and recovery capabilities. The goal is that small faults need no human intervention and large faults can be handled quickly with human intervention. Running chaos engineering intensively for two months before the big promotion improves the team's tolerance of large-scale failures.

Take Youzan as an example. Because we have only just started, we initially run chaos engineering only in the test environment in order to control risk. There are no dashboard metrics to refer to there, which means there is no corresponding "steady state" baseline. That does not make it impossible, though. Dashboard metrics can be seen as macro indicators of the system. From the micro perspective, we can pick out a number of core interfaces that directly affect the dashboard metrics (such as registration and order placement), run system integration tests against those interfaces after a chaos scenario has been applied, and evaluate the reliability of the system from the test results. In this way weaknesses can be found in a test environment as well.

Furthermore, chaos engineering can be viewed as general fault testing with a layer of uncertainty and automation on top (uncertain timing, uncertain fault type). If we set that layer aside, manually inject one or more specific faults into a target machine, and pair each with a corresponding recovery mechanism, the same tooling can be used for general fault testing.

Phase 2: distributed systems with mature resilience

Take Netflix as an example. They essentially follow the steps and ideal principles above and run chaos engineering continuously and automatically, so their systems are highly reliable and resilient.

Chaos engineering as implemented at Youzan:

Because chaos engineering mainly injects specific events that cause system failures, which makes it "evil", we named our tool "Megatron" (the villain boss in Transformers). Since we are still in phase 1, fault injection is manually controlled. The fault types implemented so far are:

  • High CPU load

  • High disk load: frequent disk reads and writes

  • Low disk space

  • Graceful shutdown: stop the application smoothly using its stop script

  • Abrupt stop: kill the process directly, which may leave data inconsistent

  • Network corruption: randomly alter some packet data so that the content is incorrect

  • Network delay: delay packets by a time within a specified range

  • Network packet loss: impose a TCP packet loss rate that stops short of total failure

  • Network black hole: ignore packets from a given IP

  • External service unreachable: point the external service's domain name at the local loopback address, or drop packets in the OUTPUT chain that are headed for the external service's port
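
The article does not say how the network faults in this list are implemented. On Linux, faults of this kind are commonly produced with `tc` (the netem queueing discipline) and `iptables`, so the sketch below assumes those tools and only builds the command lines without running them. The function names, the device name, and the sample values are illustrative; actually running the commands requires root privileges.

```python
def delay_cmd(dev, delay_ms, jitter_ms):
    """netem: delay packets by delay_ms, varying by +/- jitter_ms."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "delay", f"{delay_ms}ms", f"{jitter_ms}ms"]


def loss_cmd(dev, percent):
    """netem: drop the given percentage of packets."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "loss", f"{percent}%"]


def corrupt_cmd(dev, percent):
    """netem: introduce bit errors into the given percentage of packets."""
    return ["tc", "qdisc", "add", "dev", dev, "root", "netem",
            "corrupt", f"{percent}%"]


def clear_netem_cmd(dev):
    """Remove the root qdisc, undoing any netem fault on the device."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


def blackhole_cmd(ip):
    """iptables: silently drop all incoming packets from one IP."""
    return ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"]


def block_external_cmd(port):
    """iptables: drop outgoing TCP packets to an external service port."""
    return ["iptables", "-A", "OUTPUT", "-p", "tcp",
            "--dport", str(port), "-j", "DROP"]


def undo_iptables_cmd(cmd):
    """Turn an append (-A) rule into the matching delete (-D) rule."""
    return ["iptables", "-D"] + cmd[2:]
```

Each command list could be handed to `subprocess.run(cmd, check=True)`. Pairing every injection with its undo command matches the earlier point that manual fault injection should come with a corresponding recovery step.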

Reference
PRINCIPLES OF CHAOS ENGINEERING (http://principlesofchaos.org/)

Origin blog.csdn.net/Ture010Love/article/details/104374754