Source| Alibaba Cloud Native Official Account
Preface
A distributed system architecture has many service components, intricate dependencies between services, and long request links, so it is difficult to evaluate the impact of a single fault on the whole system. If basic facilities such as monitoring, alerting, and logging are incomplete, responding to faults and locating them is also difficult, which makes building a highly available distributed system a major challenge. Chaos engineering emerged in response. By injecting faults into a system within a controlled scope or environment, observing its behavior, and uncovering its defects, a team builds the capability and confidence to withstand the chaos caused by unexpected conditions in a distributed system, and continuously improves the system's stability and availability.
The chaos engineering process is: draw up a chaos experiment plan, define steady-state indicators, state a hypothesis about the system's fault-tolerant behavior, run the chaos experiment, and check the steady-state indicators. The whole process therefore needs two things: a reliable, easy-to-use chaos experiment tool with rich scenarios for injecting faults, and complete distributed tracing and system monitoring tools to trigger alerting and emergency response plans, locate faults quickly, and observe the system's metrics throughout the experiment. This article introduces the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking, and uses a microservice case to share high availability practice with the two.
Tool introduction
1. ChaosBlade
ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments and provides rich fault scenarios to help distributed systems improve their fault tolerance and recoverability. It can inject faults at the infrastructure level and helps guarantee business continuity while an enterprise migrates to the cloud or to cloud native systems. It is simple to operate, non-intrusive, and highly extensible. By injecting faults within a controlled scope or environment, ChaosBlade helps continuously improve system stability and availability.
ChaosBlade is not only easy to use, but also supports a wealth of experimental scenarios, including:
- Basic resources: CPU, memory, network, disk, process, and similar scenarios;
- Java applications: databases, caches, messaging, the JVM itself, microservices, and so on; you can also target any class method to inject complex scenarios;
- C++ applications: inject a delay, or tamper with variables and return values, at any method or line of code;
- Docker containers: kill a container, or run CPU, memory, network, disk, and process scenarios inside a container;
- Cloud native platforms: CPU, memory, network, disk, and process scenarios on Kubernetes nodes; Pod network scenarios and Pod-level scenarios such as killing a Pod; and container scenarios like the Docker container scenarios above.
ChaosBlade packages scenario implementations into separate projects by domain, which standardizes how scenarios within a domain are implemented and makes it easy to extend scenarios horizontally and vertically. Because every project follows the chaos experiment model, all scenarios can be invoked uniformly through the chaosblade CLI.
2. SkyWalking
SkyWalking is an open source APM system that provides monitoring, tracing, and diagnosis for distributed systems in cloud native architectures. Its core features are:
- Metrics analysis for services, service instances, and endpoints
- Root cause analysis
- Service topology analysis
- Dependency analysis for services, service instances, and endpoints
- Detection of slow services and endpoints
- Performance optimization
- Distributed tracing and context propagation
- Database access metrics, including detection of slow database statements (SQL statements included)
- Alerting
Tool installation and use
ChaosBlade is simple to install and use. Every scenario is invoked uniformly through the chaosblade CLI: download the tar package for your platform, unpack it, and use the blade executable to run chaos experiments. Download: https://github.com/chaosblade-io/chaosblade/releases
1. ChaosBlade installation
Our environment for this walkthrough is linux-amd64. Download the chaosblade-linux-amd64.tar.gz package (0.9.0 is used here) and install it as follows:
## Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
## Unpack
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Set the environment variable (absolute path, so blade is found from any directory)
export PATH="$PATH:$(pwd)/chaosblade-0.9.0"
## Test
blade -h
2. ChaosBlade usage
Once ChaosBlade is installed, the blade executable is all you need to create chaos experiments in every currently supported scenario. Start with blade -h to see the available commands, then append -h to each sub-command as you drill down to see complete examples and detailed parameter descriptions. For example:
1) How to use blade
Run blade -h to see which commands are supported:
An easy to use and powerful chaos engineering experiment toolkit
Usage:
blade [command]
Available Commands:
create Create a chaos engineering experiment
destroy Destroy a chaos experiment
...
2) Create an experiment scenario
For example, to create a CPU load scenario, run blade create cpu fullload -h to see the scenario's parameters, then choose the ones you need:
Create chaos engineering experiments with CPU load
Usage:
blade create cpu fullload
Aliases:
fullload, fl, load
Examples:
# Create a CPU full load experiment
blade create cpu load
#Specifies two random kernel's full load
blade create cpu load --cpu-percent 60 --cpu-count 2
...
Flags:
--blade-release string Blade release package,use this flag when the channel is ssh
--channel string Select the channel for execution, and you can now select SSH
--climb-time string durations(s) to climb
--cpu-count string Cpu count
--cpu-list string CPUs in which to allow burning (0-3 or 1,3)
--cpu-percent string percent of burn CPU (0-100)
...
3) Recover an experiment
ChaosBlade supports three ways to recover an experiment:
- After an experiment is created successfully, ChaosBlade returns a UID. Run blade destroy uid to recover it.
- If you cannot find the UID, run blade destroy target action, for example blade destroy cpu fullload.
- Pass --timeout 10 when creating the experiment, and the scenario is recovered automatically after ten seconds. Expressions are also supported, such as --timeout 30m for thirty minutes.
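The UID-based recovery path is easy to script. The sketch below is one possible approach, not part of ChaosBlade itself: it assumes that blade create prints a single JSON line of the form {"code":200,"success":true,"result":"<uid>"}, and the helper name extract_uid is hypothetical.

```shell
#!/bin/sh
# Sketch: pull the experiment UID out of the JSON printed by `blade create`.
# Assumption: the output looks like {"code":200,"success":true,"result":"<uid>"}.
# extract_uid is a hypothetical helper, not a ChaosBlade command.
extract_uid() {
    # Print only the value of the "result" field; print nothing if it is absent.
    printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Possible usage (requires blade on PATH, so it is left commented out):
#   out=$(blade create cpu fullload --timeout 60)
#   uid=$(extract_uid "$out")
#   if [ -n "$uid" ]; then blade destroy "$uid"; fi
```

Combining the UID with --timeout gives two independent recovery paths: the script destroys the experiment explicitly, and the timeout recovers it if the script dies first.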
3. SkyWalking installation & usage
For SkyWalking installation and usage documentation, see: https://github.com/apache/skywalking/tree/v8.1.0/docs
With the tools deployed, we now work through a case: we proactively inject faults, observe system behavior, locate problems, and uncover system defects, in order to build a highly available microservice system.
Application fault tolerance case
We deploy a microservice application in a test environment for the experiments and use ab (ApacheBench) to simulate system requests. The application consists of services such as the frontend, shopping cart, recommendations, products, and orders, and uses components including Spring Boot, Nacos, MySQL, Redis, Lettuce, and Dubbo. ChaosBlade supports most of these components. We use ChaosBlade to inject faults and verify the application's fault tolerance, and SkyWalking to monitor the application and locate problems.
1. Case environment
- Linux amd64, CentOS 7.x distribution
- JDK1.8
- chaosblade-0.9.0, download link:https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
- skywalking-apm-8.1.0, download link:https://www.apache.org/dyn/closer.cgi/skywalking/8.1.0/apache-skywalking-apm-8.1.0.tar.gz
2. Application topology
The overall architecture of the application is as follows. The frontend has strong dependencies on the shopping cart (cart) and product (product) services, which it calls through Dubbo.
3. Chaos experiment steps
- Develop a chaos experiment plan
- Define system steady state indicators
- Make assumptions about system fault-tolerant behavior
- Perform chaos experiment
- Check steady state indicators
- Record and restore chaos experiments
- Fix problems found
- Automated continuous verification
Below we use ChaosBlade to carry out chaos experiments following these steps.
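The core of this procedure (check steady state, inject, check again, restore) can be sketched as a small shell driver. This is only an illustration under stated assumptions: run_experiment and its three callback arguments are hypothetical names, and each callback is a shell function supplied by the caller, for example one that wraps a blade create command.

```shell
#!/bin/sh
# Sketch of an experiment driver. Usage: run_experiment CHECK_FN INJECT_FN RESTORE_FN
# All three arguments are names of caller-supplied shell functions (hypothetical).
run_experiment() {
    check_fn=$1; inject_fn=$2; restore_fn=$3

    # The system must be in steady state before any fault is injected.
    if ! "$check_fn"; then
        echo "aborted: steady state not met before injection"
        return 2
    fi

    # Inject the fault; stop if the injection itself fails.
    if ! "$inject_fn"; then
        echo "aborted: fault injection failed"
        return 3
    fi

    # Re-check the steady-state indicators while the fault is active.
    if "$check_fn"; then
        verdict="hypothesis holds"; exp_rc=0
    else
        verdict="hypothesis violated"; exp_rc=1
    fi

    # Always restore the system, whatever the verdict was.
    "$restore_fn"
    echo "$verdict"
    return "$exp_rc"
}
```

Recording the result, fixing the problems found, and automating continuous verification would sit around this function, for instance by running it from a scheduled job.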
4. Case One
1) Scene
Chaos experiment plan: calls to a downstream service are frequently delayed. Use ab to simulate normal access to the shopping cart interface, with a concurrency of 2 and 10,000 requests in total.
ab -n 10000 -c 2 http://127.0.0.1:8083/cart
2) Monitoring indicators
Define the system's steady-state indicators. Select the /cart endpoint in the SkyWalking console; the steady-state indicators are as follows:
- The average response time (RT) is about 15ms.
- The P99 indicator is within 20ms.
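One way to turn indicators like these into an automated check is to parse the summary that ab prints. Note that this is a substitution: the article reads the indicators from the SkyWalking console, whereas ab reports client-side latency, so the two will not match exactly. The function name check_steady_state and the thresholds passed to it are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: check_steady_state MAX_MEAN_MS MAX_P99_MS, reading `ab` output on stdin.
# Exit status: 0 within thresholds, 1 threshold exceeded, 2 metrics not found.
check_steady_state() {
    awk -v max_mean="$1" -v max_p99="$2" '
        # The first "Time per request" line is the per-request mean in ms.
        /^Time per request:/ && mean == "" { mean = $4 }
        # Row of the percentile table for the 99th percentile.
        $1 == "99%" { p99 = $2 }
        END {
            if (mean == "" || p99 == "") { print "no metrics found"; exit 2 }
            printf "mean=%sms p99=%sms\n", mean, p99
            exit !((mean + 0 <= max_mean + 0) && (p99 + 0 <= max_p99 + 0))
        }'
}

# Possible usage (needs the demo application running, so it is commented out):
#   ab -n 10000 -c 2 http://127.0.0.1:8083/cart | check_steady_state 20 20
```

The thresholds leave a little headroom above the observed 15ms mean; how much tolerance is appropriate depends on the system and is a judgment call.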
3) Expectation assumption
- A call timeout is configured, so client requests are not blocked for a long time.
- A service circuit breaker / degradation strategy is configured.
4) Chaos experiment
The previous section covered how to install and use ChaosBlade. In this case we use ChaosBlade to inject a delay fault (30 seconds) into calls to the downstream Dubbo cart service. Run blade create dubbo delay -h to see how to use the Dubbo call delay command:
Dubbo interface to do delay experiments, support provider and consumer
Usage:
blade create dubbo delay
Examples:
# Invoke com.alibaba.demo.HelloService.hello() service, do delay 3 seconds experiment
blade create dubbo delay --time 3000 --service com.alibaba.demo.HelloService --methodname hello --consumer
Flags:
--appname string The consumer or provider application name
--consumer To tag consumer role experiment.
--effect-count string The count of chaos experiment in effect
--effect-percent string The percent of chaos experiment in effect
--group string The service group
-h, --help help for delay
--methodname string The method name
--offset string delay offset for the time
--override only for java now, uninstall java agent
--pid string The process id
--process string Application process name
--provider To tag provider experiment
--service string The service interface
--time string delay time (required)
--timeout string set timeout for experiment in seconds
--version string the service version
Global Flags:
-d, --debug Set client to DEBUG mode
--uid string Set Uid for the experiment, adapt to docker
Following the example and parameter descriptions above, we inject the delay fault (30 seconds) on the client side of the upstream service. SkyWalking makes it easy to find the Dubbo service information on the link: query the traces for the /cart endpoint and locate the Dubbo service on the link, as shown below:
- Find the link
- Get the protocol details
Click the span to view the detailed information for the Dubbo service. From the Dubbo service URL you can read off the parameters ChaosBlade needs to inject the delay into the upstream service. The final parameters are:
- --time 30000: delay of 30s
- --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: the service interface
- --methodname viewCart: the service method
- --process frontend: the Java process
- --consumer: inject on the Dubbo service client (consumer) side
Issue commands to inject faults:
blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer
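Because a 30-second delay on a strongly depended-on call is disruptive, it can help to add the --timeout flag listed in the help output above, so that the experiment recovers by itself even if nobody destroys it. The sketch below only prints the command instead of running it; the function name dubbo_delay_cmd and the 120-second timeout in the usage comment are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: dubbo_delay_cmd SERVICE METHOD PROCESS DELAY_MS TIMEOUT_S
# Prints (does not run) a consumer-side Dubbo delay experiment that
# recovers automatically after TIMEOUT_S seconds.
# dubbo_delay_cmd is a hypothetical helper, not part of ChaosBlade.
dubbo_delay_cmd() {
    printf 'blade create dubbo delay --time %s --service %s --methodname %s --process %s --consumer --timeout %s\n' \
        "$4" "$1" "$2" "$3" "$5"
}

# Possible usage: review the printed command, then run it by hand.
#   dubbo_delay_cmd com.alibabacloud.hipstershop.cartserviceapi.service.CartService \
#       viewCart frontend 30000 120
```

Printing first and executing second is a cheap way to review exactly what will be injected before it touches a shared environment.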
5) Monitoring indicators
Check the system indicators in SkyWalking after injecting the fault:
- The average response time (RT) is around 2000ms, and the P99 indicator is around 2000ms.
- Calls to the /cart interface fail, and the com.alibabacloud.hipstershop.cartserviceapi.service.CartService service is abnormal.
- A timeout exception occurs, with a timeout of 2000ms.
Conclusion: the upstream service has a call timeout configured but no circuit breaker strategy, so the result does not fully meet expectations.
6) Fix the problem
Configure a service circuit breaker / degradation strategy.
5. Case Two
1) Scene
While the application is running, the Dubbo service provider cannot reach the registry: we inject a network fault on the machine that drops 100% of the packets exchanged with the registry.
2) Monitoring indicators
Define the system's steady-state indicators. Select the service endpoint in the SkyWalking console; the steady-state indicators are as follows:
- com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal
3) Expectation assumption
Neither the upstream service nor the downstream service is affected.
4) Chaos experiment
Inject a 100% packet loss fault on the registry port. We use Nacos as the Dubbo registry; its default port is 8848, and the network interface is eth0. The command parameters are as follows:
- --interface eth0: the network interface
- --percent 100: 100% packet loss rate
- --local-port 8848: local port 8848
Issue commands to inject faults:
blade create network loss --interface eth0 --percent 100 --local-port 8848
5) Monitoring indicators
After injecting the fault, select the service endpoint in the SkyWalking console; the indicators are as follows:
- com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal
Conclusion: the service depends only weakly on the registry and keeps a local cache, which matches the hypothesis.
If the application were deployed in a Kubernetes cluster, we could additionally verify the registry's horizontal scaling capability. ChaosBlade also supports Kubernetes cluster scenarios.
6. A quick exercise
In the cases above, we verified whether the service has timeout and circuit breaker strategies configured, and whether Dubbo depends only weakly on the registry thanks to the service's local cache. Are you eager to try this on your own system? ChaosBlade offers a wealth of experiment scenarios that cover basic resources and the application dimension, and it is also a powerful tool for cloud native platforms. ChaosBlade is easy to use and provides detailed parameters for keeping the blast radius of a fault as small as possible, which should make it easy for anyone to get started.
Reading only goes so far, so here is one more small case to practice on. Applications routinely work with relational databases, and when application traffic grows rapidly the database often becomes the bottleneck, with many slow SQL statements. Without slow SQL alerting it is hard to find and optimize the offending statements, so slow SQL alerting is very important. To verify that an application has this capability, ChaosBlade can inject a MySQL slow SQL fault. Run blade create mysql delay -h to see how to use the MySQL call delay command:
Mysql delay experiment
Usage:
blade create mysql delay
Examples:
# Do a delay 2s experiment for mysql client connection port=3306 INSERT statement
blade create mysql delay --time 2000 --sqltype select --port 3306
Flags:
--database string The database name which used
--effect-count string The count of chaos experiment in effect
--effect-percent string The percent of chaos experiment in effect
-h, --help help for
--host string The database host
--offset string delay offset for the time
--override only for java now, uninstall java agent
--pid string The process id
--port string The database port which used
--process string Application process name
--sqltype string The sql type, for example, select, update and so on.
--table string The first table name in sql.
--time string delay time (required)
--timeout string set timeout for experiment in seconds
Global Flags:
-d, --debug Set client to DEBUG mode
--uid string Set Uid for the experiment, adapt to docker
As you can see, ChaosBlade provides a complete example and supports finer-grained parameters such as the SQL type and table name. The command below delays select operations on connections to port 3306 by 10s. When traffic arrives, does your application raise an alert?
blade create mysql delay --time 10000 --sqltype select --port 3306
Explanation of command parameters:
- --time 10000: delay of 10s
- --sqltype select: applies only to SQL statements of the select type
- --port 3306: applies only to connections on port 3306
Summary
In this article, we introduced how chaos engineering applies to a real, complex distributed architecture, and combined ChaosBlade and SkyWalking to run chaos experiments on an actual application, so that the system can be analyzed and optimized based on how it fails and its stability and availability can be continuously improved. ChaosBlade not only supports basic resources and the application dimension, but is also a powerful tool for cloud native platforms. You are welcome to try it.
ChaosBlade project address: https://github.com/chaosblade-io/chaosblade . Everyone is welcome to join and build it together!
About the author
Ye Fei: GitHub @tiny-x, open source community enthusiast and ChaosBlade committer who helps build the ChaosBlade chaos engineering ecosystem.