Introduction to Chaos Engineering and Implementation of Chaosblade

In a distributed system architecture, the call chains and access relationships between service components grow increasingly complex, and it is difficult to assess the impact of a single component failure on the system as a whole. Incomplete monitoring and alerting make problems even harder to detect and locate. At the same time, business and technology iterate quickly, so continuously ensuring system stability and high availability is a major challenge. This is why chaos engineering matters: within a controlled scope or environment, fault injection is used to continuously improve system stability, high availability, and business continuity.


1. Introduction to Chaos Engineering

1.1 Introduction to Chaos Engineering

Chaos engineering is the practice of running hypothesis-driven, controlled experiments on a distributed system, observing system behavior and uncovering weaknesses, in order to build confidence in the system's ability to withstand turbulent and unexpected conditions as it scales. Its goal is to increase the system's resilience to uncertain events.

Chaos engineering is an experimental discipline applied to distributed systems; it aims to improve fault tolerance and build confidence that the system can withstand unpredictable problems in the production environment.

Chaos engineering originated with Netflix engineers during the migration of their systems to AWS. To ensure that EC2 instance failures would not affect the business, they created Chaos Monkey, which randomly terminates EC2 instances running in the production environment, so engineers can quickly learn whether the services they build are robust and resilient enough to tolerate unplanned failures. Chaos engineering began to take off from there. In China, Internet companies such as Alibaba have explored and practiced chaos engineering for more than ten years and have open-sourced chaos engineering platforms such as ChaosBlade. [1]


After reviewing the history of chaos engineering, let's look at what it can do on a distributed infrastructure and what value it brings. Viewed from two perspectives, the roles involved in building a system and the application scenarios, chaos engineering can verify system stability and reliability and improve the predictability of system behavior under failure. [2]

  • Value of chaos engineering for different roles in system development
    • For system and application architects, it can verify the fault tolerance of the system architecture
    • For operations and development, it can improve the efficiency of fault response, enabling timely alerting, precise fault localization, and rapid recovery
    • For testers, it enables testing from the perspective of the system architecture and fills the gaps in testing for unknown scenarios
    • For product and design, it verifies how the product behaves under failure, improving the customer experience
  • Application Scenarios of Chaos Engineering
    • Disaster recovery capability testing: through fault injection, verify the impact of individual component failures on the entire system, as well as the effectiveness of emergency measures such as rate limiting, degradation, circuit breaking, switchover, and failover;
    • Microservice strong/weak dependency governance: while injecting and removing faults in a called service, observe the metrics of the calling service to determine the dependency relationships and degree of coupling between core and non-core services, and further optimize dependencies that do not meet expectations;
    • Verifying the rationality of system configuration: by simulating the availability of system resources, check whether service configuration, replica configuration, and resource limits are reasonable;
    • Monitoring and alerting: through fault injection, check whether monitoring metrics are accurate, whether monitoring dimensions are complete, whether alert thresholds are reasonable, and whether alerts are delivered in a timely manner;
    • Emergency drills: through drills in realistic scenarios such as red-blue exercises, verify whether the emergency response capabilities, contingency plans, and emergency procedures for the relevant issues are complete
1.2 Five Principles of Chaos Engineering

When developing a chaos engineering experiment, keeping the following principles in mind will aid in its design.

  • Build a hypothesis around steady-state behavior: first, define metrics that directly reflect how the business is running, such as transaction TPS and changes in response time; second, when a fault is triggered, be able to state the expected response of the system and of these metrics.
  • Vary real-world events for verification: it is only meaningful to introduce events that actually exist and occur frequently in the real world, such as service call delays and disk failures, rather than events imagined out of thin air.
  • Run experiments in production: try to test in the production environment or one as close to it as possible, since the diversity of the production environment cannot be matched by any other environment. However, if the production system lacks disaster recovery capability for a given failure scenario, the chaos experiment should not be run there, to avoid losses.
  • Automate experiments to run continuously: continuously automated fault experiments reduce the recurrence rate of faults and detect them in advance, verifying business continuity to the greatest extent.
  • Minimize the blast radius: while running chaos experiments, the impact on production business must be kept as small as possible; start with a small scope and expand it gradually, for example by isolating the environment properly and executing during off-peak business hours.

The purpose of chaos engineering is to verify the stability and availability of the production system and to ensure business continuity. When designing chaos engineering experiments, following the five basic principles above improves the value of the experiments.

1.3 Chaos Engineering Maturity Model

The Chaos Engineering Maturity Model (CEMM) is a framework for evaluating and improving chaos engineering capabilities. It aims to help enterprises establish reliable chaos engineering practices and improve the stability and reliability of their software systems. CEMM defines five maturity levels, each with different capability requirements and goals, from the initial level (Level 1) to the optimization level (Level 5), progressively raising the maturity of an organization's chaos engineering capability. [3]

  • Initial level (Level 1): chaos experiments have been attempted, but there is no unified chaos engineering strategy or specification; the organization is still in the exploratory stage.
  • Certified level (Level 2): the exploration of chaos experiments has been completed, standardized management of chaos experiments has started, and the organization has begun building its own chaos engineering platform.
  • Defined level (Level 3): standardized processes and a management system for chaos experiments have been established, quantitative analysis of chaos experiments has begun, and the organization's own chaos engineering platform continues to be built out.
  • Managed level (Level 4): a complete chaos engineering platform has been established, chaos experiments can be comprehensively quantified and monitored, and automated management and optimization of chaos experiments has started.
  • Optimization level (Level 5): the highest level of chaos engineering capability; chaos experiment management is fully automated, and chaos experiments can be comprehensively optimized and improved to ensure the stability and reliability of the software system.


CEMM can help enterprises evaluate their own level of chaos engineering capabilities, and guide them to gradually improve the maturity of chaos engineering capabilities, so as to achieve the stability and reliability of software systems.

1.4 Chaos Engineering Practice Process


In chaos engineering practice, faults are randomly injected into application systems and servers in a production-like environment to detect whether the system can still provide services when some components or functions fail. By simulating fault scenarios from the real production environment, fault-injection capabilities covering the application, system, and container layers are established, along with the operational capabilities of emergency coordination, fault localization, and fault recovery. The chaos engineering practice process consists of the following four stages (a command-level sketch follows the list):

  • Preparation stage: preparations before the fault drill, such as preparing the environment and traffic and configuring monitoring metrics, to ensure the system meets the specified requirements before fault injection; this corresponds to the steady-state hypothesis of chaos engineering.
  • Execution stage: run the experiment within the defined scope while watching the monitoring metrics; during execution, pay attention to keeping the impact of the fault drill controllable and the fault scenarios configurable.
  • Verification stage: check whether the experiment results meet expectations, whether the monitoring metrics are adequate, and what the impact on the system and the business is; if the results meet expectations, consider expanding the scope of the experiment to verify system stability more fully.
  • Recovery stage: undo the fault injection and restore the system and business to the state before the drill.
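
The sketch below maps the four stages onto the chaosblade CLI introduced in Section 2. It is a minimal illustration rather than a prescribed procedure; the --timeout flag and the status command follow the chaosblade documentation and should be verified with --help for your version.

## Preparation: record the steady state (baseline CPU usage, business TPS, etc.)
# top -b -n 1 | head -5

## Execution: inject a fault with a limited scope and an automatic stop time
# ./blade create cpu load --cpu-percent 60 --timeout 300

## Verification: check the experiment record and confirm that monitoring and
## alerting behave as expected (replace <experiment-uid> with the returned UID)
# ./blade status <experiment-uid>

## Recovery: destroy the experiment and confirm metrics return to the baseline
# ./blade destroy <experiment-uid>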

In actual practice, each enterprise will build a chaos engineering platform based on its own operation and maintenance scenarios to simulate real fault scenarios and system monitoring indicators. For example, China Electronics Financial Trust combined the characteristics of the banking system to create a chaos case library suitable for the banking system, and conducted red-blue confrontation drills through experimental management. [4]


In this chaos engineering practice, the high-availability cases of concern to the financial industry are packaged into a chaos case library. It includes high-availability cases such as stopping applications, stopping services, disabling network interfaces, server downtime, and process hangs, as well as cases abstracted from production incidents and emergency plans, such as storage full, storage damage, and transaction-consistency failures.

1.5 Chaos Engineering Fault Scenarios
1.5.1 Failure Scenarios of Chaos Engineering

For fault-injection scenarios, the industry typically groups fault portraits by the IaaS, PaaS, and SaaS layers. Common failure scenarios include disk failures at the infrastructure layer, network anomalies, exhausted database connection pools, abnormal CPU consumption, exhausted business thread pools, and process hangs. Fault designs are simulated against these known failure scenarios to verify the stability of real business systems. [5]


1.5.2 Observability Metrics

The design of observability metrics is one of the keys to the success of a chaos engineering experiment. Good system observability gives the experiment strong data support and provides a solid basis for interpreting the results, tracking down problems, and ultimately resolving them. Common observability metrics include the following:

  • System observability: mainly includes Metrics, Logging, and Tracing.
  • Business metrics: usually directly related to business value and user experience; they are among the most important observations in a chaos engineering experiment.
  • Application health metrics: reflect the health of the application system, including errors and exceptions, performance bottlenecks, security vulnerabilities, changes in TPS and response time, and fluctuations in success rate.
  • Other system metrics: reflect the operating state of the infrastructure and systems, including CPU usage, memory consumption, and network latency.

These metrics provide strong data support for chaos engineering experiments, helping the team interpret experimental results, track down problems, and ultimately resolve them. When observing the metrics, statistical methods such as Bayesian detection, exponential smoothing, and PCA can also be used to correlate multiple metrics and determine the precise impact on the system.

2. Open Source Chaos Engineering Platform Practice

2.1 ChaosBlade Platform

ChaosBlade is the open-source edition of Alibaba's internal MonkeyKing project, built on Alibaba's more than ten years of fault testing and drill experience. The GitHub address is https://github.com/chaosblade-io/chaosblade. The experiment scenarios currently supported by ChaosBlade cover basic infrastructure resources, Java applications, C++ applications, Docker containers, and cloud-native platforms.


2.1.1 Getting Started with ChaosBlade

1) Download the ChaosBlade release package

## Download path
https://github.com/chaosblade-io/chaosblade/releases
## Extract the installation package
# tar -xzvf chaosblade-1.7.2-linux-amd64.tar.gz
[root@tango-DB01 chaosblade-1.7.2]# pwd
/usr/local/chaosblade/chaosblade-1.7.2
[root@tango-DB01 chaosblade-1.7.2]# ls -l
total 39324
drwxr-xr-x 2 root root      111 May 19 11:36 bin
-rwxr-xr-x 1 root root 40265968 May 19 11:32 blade
drwxr-xr-x 4 root root       34 May 19 11:44 lib
drwxr-xr-x 2 root root      316 May 19 11:44 yaml
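
After extraction, a quick sanity check confirms that the blade CLI runs and shows the supported scenarios and flags. This step is optional, and the exact output depends on the version, so it is omitted here.

## Verify the CLI and inspect the available experiment scenarios
# ./blade version
# ./blade create --help
# ./blade create cpu load --help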

2) Use ChaosBlade to simulate failure scenarios

## 1. Simulate a full CPU load
## Run ./blade create cpu load; code=200 in the returned JSON means it succeeded
{"code":200,"success":true,"result":"9eb3770e8fa40287"}

## Use top to check the effect; CPU usage is now close to 99%
Tasks: 163 total,   2 running, 161 sleeping,   0 stopped,   0 zombie
%Cpu0  : 98.9 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  1867024 total,   600356 free,   639152 used,   627516 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  1040792 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                
  2003 root      20   0  710452   9632   3008 R 98.7  0.5   2:16.89 chaos_os

## Destroy the experiment
# ./blade destroy 9eb3770e8fa40287
{"code":200,"success":true,"result":{"target":"cpu","action":"fullload","ActionProcessHang":false}}
## Check CPU usage again; it has returned to normal

## 2. Specify a load percentage
# ./blade create cpu load --cpu-percent 60
{"code":200,"success":true,"result":"847419202a931560"}

## Check CPU usage
top - 19:30:23 up 26 min,  2 users,  load average: 1.24, 1.31, 0.81
Tasks: 163 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
%Cpu0  : 59.4 us,  0.3 sy,  0.0 ni, 40.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1867024 total,   602488 free,   636560 used,   627976 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.  1043252 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                
  2111 root      20   0  710452   7352   2880 S 59.5  0.4   0:24.80 chaos_os
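
Running experiments can be listed and cleaned up in the same way as in the first example. The sketch below also shows one more basic-resource scenario; the disk fill flags follow the chaosblade documentation and should be confirmed with --help before use.

## List the experiments that have been created
# ./blade status --type create

## Destroy the percentage-load experiment by its UID
# ./blade destroy 847419202a931560

## Another basic-resource scenario: fill /home with roughly 1 GB of data and
## automatically destroy the experiment after 10 minutes
# ./blade create disk fill --path /home --size 1024 --timeout 600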

ChaosBlade provides the basic capabilities for chaos experiments, on top of which a chaos engineering platform for drills can be packaged. The project is organized into the following components: [6]

  • chaosblade: the chaos experiment management tool, including commands for creating, destroying, and querying experiments, as well as preparing and revoking the experiment environment. It is the execution tool for chaos experiments and can be driven via the CLI or HTTP. It provides complete commands, experiment scenarios, and scenario parameter descriptions, and is simple and clear to operate.
  • chaosblade-spec-go: the Golang definition of the chaos experiment model; scenarios that are easy to implement in Golang are built on this specification.
  • chaosblade-exec-os: implementation of basic resource experiment scenarios.
  • chaosblade-exec-docker: implementation of Docker container experiment scenarios, standardized by calling the Docker API.
  • chaosblade-exec-cri: implementation of container experiment scenarios, standardized by calling the CRI.
  • chaosblade-operator: implementation of experiment scenarios on the Kubernetes platform. Chaos experiments are defined via standard Kubernetes CRDs, so experiments can be created, updated, and deleted with ordinary Kubernetes resource operations, including kubectl and client-go, and can also be executed with the chaosblade CLI mentioned above.
  • chaosblade-exec-jvm: implementation of Java application experiment scenarios, using Java Agent technology to attach dynamically, with no code changes and zero access cost; it supports unloading and completely reclaims all resources created by the agent (see the prepare/revoke sketch after this list).
  • chaosblade-exec-cplus: implementation of C++ application experiment scenarios, using GDB to inject experiments at the method and code-line level.
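
As an illustration of the prepare/revoke lifecycle used by the JVM executor, the hedged sketch below attaches the Java agent to a running Java process and injects a method-level delay. The process, class, and method names are placeholders, and the flags should be checked against blade create jvm delay --help.

## Attach the Java agent to a running Java process (placeholder process keyword);
## the command returns a JSON result containing a prepare UID
# ./blade prepare jvm --process com.example.DemoApplication

## Inject a 3-second delay into a specific method (placeholder class and method)
# ./blade create jvm delay --time 3000 --classname com.example.DemoService --methodname hello --process com.example.DemoApplication

## Destroy the experiment, then detach the agent and reclaim its resources
# ./blade destroy <experiment-uid>
# ./blade revoke <prepare-uid>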
2.1.2 ChaosBlade Chaos Experiment Model

Before running a ChaosBlade chaos experiment, several questions need to be answered: what is the goal of the experiment, what is its scope, what are the concrete steps, and under what conditions does it take effect. Based on these questions, the Target, Scope, Matcher, and Action model is abstracted.


  • Target: the experiment target, i.e. the component in which the experiment takes place, such as a container or an application framework (Dubbo, Redis, Zookeeper), etc.
  • Scope: the scope of the experiment, i.e. the specific machines or clusters on which the experiment is triggered.
  • Matcher: the experiment rule matcher. Based on the configured Target, it defines the matching rules for the experiment, and multiple matchers can be configured. Each Target may have its own specific matching conditions; for example, in the RPC domain, HSF and Dubbo can be matched by the services exposed by the provider or the services invoked by the consumer, while in the caching domain, Redis can be matched by set or get operations.
  • Action: the concrete scenario the experiment simulates; different targets have different actions. For a disk, for example, the drill scenarios include filling the disk, high disk I/O read/write load, and disk hardware failure; for an application, scenarios such as delays, exceptions, returning a specified value (error code, large object, etc.), parameter tampering, and repeated invocation can be abstracted.

Based on the above model, simulation experiments for ChaosBlade failure scenarios can be carried out.
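
In the blade CLI this model maps directly onto the command structure: the first argument is the Target, the second is the Action, and the flags act as Matchers. The sketch below injects a network delay; the interface and port values are placeholders, and the flag names follow the chaosblade documentation (confirm with --help).

## blade create <target> <action> --<matcher flags>
## Target = network, Action = delay, Matchers = interface and remote port
# ./blade create network delay --time 3000 --interface eth0 --remote-port 8080

## Remove the injected delay once verification is complete
# ./blade destroy <experiment-uid>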


References:

  • [1] Alibaba Chaos Engineering Practice, Alibaba Cloud Yunqi
  • [2] https://cloud.tencent.com/developer/article/1828940
  • [3] Netflix Chaos Engineering Maturity Model
  • [4] CLP Financial Chaos Engineering Practice Platform
  • [5] How to use chaos engineering to deal with unknown faults, cloud native foundation
  • [6] https://github.com/chaosblade-io/chaosblade
