ChaosBlade x SkyWalking microservice high availability practice

Source | Alibaba Cloud Native official account

Preface

A distributed system architecture has many service components with intricate dependencies between them, so it is difficult to evaluate the impact of a single fault on the entire system, and request links are long. If basic services such as monitoring, alerting, and logging are incomplete, responding to faults and locating them becomes difficult, which makes building a highly available distributed system a great challenge. Chaos engineering emerged in response. By injecting faults into the system within a controlled scope or environment, observing system behavior, and discovering system defects, teams build the capability and confidence to cope with the chaos caused by unexpected conditions in a distributed system, and continuously improve its stability and high availability.

Implementing chaos engineering means formulating a chaos experiment plan, defining steady-state indicators, making assumptions about the system's fault-tolerant behavior, and then running the chaos experiment and checking the steady-state indicators. The whole process therefore needs a reliable, easy-to-use chaos experiment tool with rich scenarios to inject faults, together with complete distributed tracing and system monitoring tools that can trigger the emergency response and alerting plan, locate faults quickly, and observe the system's data indicators throughout the process. In this article we introduce the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking, and use a microservice case to share a high-availability practice that combines the two.

Tool introduction

1. ChaosBlade

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments and provides rich fault scenarios to help distributed systems improve fault tolerance and recoverability. It can inject faults at the underlying layers and helps guarantee business continuity while an enterprise migrates to the cloud or to cloud native systems. It is simple to operate, non-intrusive, and highly extensible. By injecting faults within a controlled scope or environment, ChaosBlade helps continuously improve system stability and high availability.

ChaosBlade is not only easy to use, but also supports a wealth of experimental scenarios, including:

  • Basic resources: scenarios such as CPU, memory, network, disk, and process;
  • Java applications: scenarios such as databases, caches, messaging, the JVM itself, and microservices; you can also target any class method to inject a variety of complex scenarios;
  • C++ applications: scenarios such as injecting a delay at any method or line of code, or tampering with variables and return values;
  • Docker containers: scenarios such as killing a container, and CPU, memory, network, disk, and process faults inside a container;
  • Cloud native platforms: for example CPU, memory, network, disk, and process scenarios on Kubernetes nodes; Pod network scenarios and scenarios on the Pod itself, such as killing a Pod; and container scenarios like the Docker container scenarios above.

ChaosBlade encapsulates scenario implementations into separate projects by domain, which both standardizes how scenarios are implemented within a domain and makes it easy to extend scenarios horizontally and vertically. Because every project follows the chaos experiment model, all scenarios are invoked uniformly through the chaosblade CLI.

2. SkyWalking

SkyWalking is an open source APM system that provides monitoring, tracing, and diagnosis for distributed systems in cloud native architectures. Its core features are as follows:

  • Metrics analysis for services, service instances, and endpoints
  • Root cause analysis
  • Service topology analysis
  • Dependency analysis for services, service instances, and endpoints
  • Detection of slow services and endpoints
  • Performance optimization
  • Distributed tracing and context propagation
  • Database access metrics, including detection of slow database access statements (SQL statements included)
  • Alerting

Tool installation and use

ChaosBlade is simple to install and use. All scenarios are invoked uniformly through the chaosblade CLI: download the tar package for your platform, extract it, and run chaos experiments with the blade executable. Download: https://github.com/chaosblade-io/chaosblade/releases

1. ChaosBlade installation

Our environment for this exercise is linux-amd64. Download the chaosblade linux-amd64 tar.gz package (0.9.0 was the latest version at the time of writing) and install it as follows:

## Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
## Extract
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
## Add blade to PATH (absolute path, so it works from any directory)
export PATH=$PATH:$(pwd)/chaosblade-0.9.0
## Test
blade -h

2. ChaosBlade usage

Once ChaosBlade is installed, the blade executable is all you need to create chaos experiments in every currently supported scenario. Start with blade -h to see the usage, then choose a subcommand and add -h at each level to see complete examples and detailed parameter descriptions. For example:

1) How to use blade

Run blade -h to see the supported commands:

An easy to use and powerful chaos engineering experiment toolkit

Usage:
  blade [command]

Available Commands:
  create      Create a chaos engineering experiment
  destroy     Destroy a chaos experiment
...

2) Create an experimental scene

For example, to create a CPU load scenario, run blade create cpu fullload -h to see the scenario's parameters, then choose the appropriate ones and run the command:

Create chaos engineering experiments with CPU load

Usage:
  blade create cpu fullload

Aliases:
  fullload, fl, load

Examples:

# Create a CPU full load experiment
blade create cpu load

#Specifies two random kernel's full load
blade create cpu load --cpu-percent 60 --cpu-count 2
...

Flags:
      --blade-release string     Blade release package,use this flag when the channel is ssh
      --channel string           Select the channel for execution, and you can now select SSH
      --climb-time string        durations(s) to climb
      --cpu-count string         Cpu count
      --cpu-list string          CPUs in which to allow burning (0-3 or 1,3)
      --cpu-percent string       percent of burn CPU (0-100)
...

3) Recover an experiment

ChaosBlade supports three ways to recover an experiment:

  • After an experiment is created successfully, ChaosBlade returns a UID; run blade destroy UID to recover it.
  • If you cannot find the UID, run blade destroy TARGET ACTION, for example blade destroy cpu fullload.
  • Pass --timeout when creating the experiment, for example --timeout 10, and the scenario is recovered automatically after ten seconds. Time expressions are also supported, such as --timeout 30m for thirty minutes.
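
The first and third recovery options combine naturally in a script: create the experiment with a timeout as a safety net, keep the UID, and destroy it explicitly when done. The sketch below is only an illustration. It assumes that blade create prints a JSON result of the form {"code":200,"success":true,"result":"UID"}, and the helper name extract_uid is hypothetical, not part of ChaosBlade.

```shell
# Hypothetical helper: pull the experiment UID out of the JSON that
# `blade create` is assumed to print, for example
#   {"code":200,"success":true,"result":"7c1f7afc281482c8"}
extract_uid() {
  # Print the value of the "result" field, or nothing if it is absent.
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Usage sketch (not executed here, requires blade on PATH):
#   out=$(blade create cpu fullload --timeout 60)
#   uid=$(extract_uid "$out")
#   [ -n "$uid" ] && blade destroy "$uid"
```

If the output format differs in your ChaosBlade version, a JSON-aware tool is a more robust choice than sed.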

3. SkyWalking installation & usage

For SkyWalking installation and usage documentation, see: https://github.com/apache/skywalking/tree/v8.1.0/docs

With the tools deployed, we use a case study to inject faults, actively observe system behavior, locate problems, and find system defects, so as to build a highly available microservice system.

Application fault tolerance case

We deploy a microservice application in the daily (test) environment for the experiments and use ab (ApacheBench) to simulate requests to the system. The application consists of frontend, shopping cart, recommendation, product, and order services, and uses components including Spring Boot, Nacos, MySQL, Redis, Lettuce, and Dubbo. ChaosBlade supports most of these components. We use ChaosBlade to inject faults and verify the application's fault tolerance, and SkyWalking to monitor the application and locate problems.

1. Case environment

2. Application topology

The overall architecture of the application is as follows. The frontend calls the shopping cart service (cart) and the product service (product) through Dubbo and depends on them strongly.

1.png

3. Chaos experiment steps

  • Develop a chaos experiment plan
  • Define system steady state indicators
  • Make assumptions about system fault-tolerant behavior
  • Perform chaos experiment
  • Check steady state indicators
  • Record and restore chaos experiments
  • Fix problems found
  • Automated continuous verification
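
The core of these steps (check the steady state, inject, check again, always recover) can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not a ChaosBlade feature: run_experiment is a hypothetical name, and the three arguments are commands that you supply.

```shell
# Hypothetical driver: run one chaos experiment end to end.
#   $1 = inject command, $2 = steady-state check command, $3 = recover command
run_experiment() {
  # The steady state must hold before the fault, or the result means nothing.
  if ! eval "$2"; then
    echo "aborted: system not in steady state before injection"
    return 2
  fi
  eval "$1" || { echo "aborted: injection failed"; return 3; }
  # Check the steady state while the fault is active.
  if eval "$2"; then
    verdict="hypothesis holds"
  else
    verdict="hypothesis violated"
  fi
  # Always recover, whatever the verdict.
  eval "$3"
  echo "$verdict"
}

# Usage sketch (check_steady_state.sh is a hypothetical script of your own):
#   run_experiment "blade create cpu fullload --timeout 60" \
#                  "./check_steady_state.sh" \
#                  "blade destroy cpu fullload"
```

Passing the commands as strings keeps the sketch short; in a real pipeline you would also record the verdict and the experiment parameters for later review.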

Below we use ChaosBlade to run chaos experiments following these steps.

4. Case One

1) Scene

Chaos experiment plan: calls to a downstream service are frequently delayed. We use ab to simulate normal access to the shopping cart endpoint, with a concurrency of 2 and 10,000 requests in total.

ab -n 10000 -c 2 http://127.0.0.1:8083/cart
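
The ab report itself also contains the mean response time and the 99th percentile, which is handy for comparing against what SkyWalking shows. The helpers below are a sketch with hypothetical names (ab_mean_ms, ab_p99_ms); they assume the usual ab report layout with a "Time per request: ... [ms] (mean)" line and a percentile table.

```shell
# Hypothetical helpers: read an ab report on stdin and print one number.
ab_mean_ms() {
  # The first "Time per request" line is the mean per request, in ms.
  awk '/^Time per request:/ && !seen { print $4; seen = 1 }'
}

ab_p99_ms() {
  # The percentile table has lines such as "  99%     20".
  awk '$1 == "99%" { print $2 }'
}

# Usage sketch (not executed here):
#   ab -n 10000 -c 2 http://127.0.0.1:8083/cart > ab_report.txt
#   ab_mean_ms < ab_report.txt
#   ab_p99_ms  < ab_report.txt
```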

2) Monitoring indicators

Define the system's steady-state indicators. Selecting the /cart endpoint in the SkyWalking console shows the following steady state:

  • The average response time (RT) is about 15ms.
  • The P99 indicator is within 20ms.

2.png
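
To check the steady state automatically rather than by eye, the two indicators can be compared against thresholds. This is a sketch only: steady_state_ok is a hypothetical name, and the default limits (mean within 20 ms as headroom over the observed 15 ms, P99 within 20 ms) are assumptions to adjust for your own system.

```shell
# Hypothetical check: succeed when mean RT and P99 (both in ms) are within limits.
#   $1 = mean RT, $2 = P99, $3 = max mean (default 20), $4 = max P99 (default 20)
steady_state_ok() {
  awk -v mean="$1" -v p99="$2" -v max_mean="${3:-20}" -v max_p99="${4:-20}" \
    'BEGIN { exit !(mean + 0 <= max_mean + 0 && p99 + 0 <= max_p99 + 0) }'
}

# Usage sketch:
#   steady_state_ok 15.1 20 && echo "steady"
#   steady_state_ok 2000 2000 || echo "steady state violated"
```

awk is used for the comparison because plain shell arithmetic does not handle fractional milliseconds.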

3) Hypothesis

  • A call timeout is configured, so client requests are not blocked for a long time.
  • A service circuit breaker or degradation strategy is configured.

4) Chaos experiment

The previous section covered installing ChaosBlade and its basic usage. In this case we use ChaosBlade to inject a delay fault (30 seconds) into calls to the downstream Dubbo cart service. Run blade create dubbo delay -h to see how to use the Dubbo call delay command:

Dubbo interface to do delay experiments, support provider and consumer

Usage:
  blade create dubbo delay

Examples:
# Invoke com.alibaba.demo.HelloService.hello() service, do delay 3 seconds experiment
blade create dubbo delay --time 3000 --service com.alibaba.demo.HelloService --methodname hello --consumer

Flags:
      --appname string          The consumer or provider application name
      --consumer                To tag consumer role experiment.
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
      --group string            The service group
  -h, --help                    help for delay
      --methodname string       The method name
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --process string          Application process name
      --provider                To tag provider experiment
      --service string          The service interface
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds
      --version string          the service version

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

Following the example and parameter descriptions, we inject the delay fault (30 seconds) on the upstream service, that is, on the Dubbo client. SkyWalking makes it easy to find the Dubbo service information on the trace. First query the traces for the /cart endpoint and find the Dubbo service on the trace, as shown below:

  • Find the trace

3.png

  • Get the protocol details

4.png

Click the span to view the detailed span information of the Dubbo service. From the Dubbo service URL you can obtain the parameters needed to inject the delay on the upstream service with ChaosBlade. Our final parameters are:

  • --time 30000: delay of 30 s
  • --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: service interface
  • --methodname viewCart: service method
  • --process frontend: Java process name
  • --consumer: inject on the Dubbo client (consumer) side

Run the command to inject the fault:

blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer
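
Deriving --service and --methodname from the span URL by hand is error-prone, so it can be scripted. The sketch below assumes the URL copied from the span has the shape dubbo://HOST:PORT/INTERFACE.METHOD(PARAM_TYPES); that shape, the sample address, and the helper name dubbo_delay_cmd are assumptions, so check them against what your SkyWalking version actually displays. The function only prints the command, it does not run it.

```shell
# Hypothetical helper: build the blade command from a Dubbo URL taken from a span.
#   $1 = URL, assumed to look like
#        dubbo://10.0.0.1:20880/com.example.DemoService.sayHello(String)
#   $2 = delay in milliseconds, $3 = consumer process name
dubbo_delay_cmd() {
  path="${1#*://*/}"       # drop scheme, host and port
  path="${path%%"("*}"     # drop the parameter list, if any
  service="${path%.*}"     # everything before the last dot
  method="${path##*.}"     # everything after the last dot
  echo "blade create dubbo delay --time $2 --service $service --methodname $method --process $3 --consumer"
}

# Usage sketch:
#   dubbo_delay_cmd "$span_url" 30000 frontend
```

Printing the command first gives you a chance to review it before injecting a fault.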

5) Monitoring indicators

After injecting the fault, check the system indicators in SkyWalking:

  • The average response time (RT) is around 2000ms, and the P99 indicator is around 2000ms

5.png

  • Calls to the /cart endpoint fail, and the com.alibabacloud.hipstershop.cartserviceapi.service.CartService service reports errors.

6.png

  • A timeout exception occurs, and the timeout period is 2000ms

7.png

Conclusion: the upstream service has a call timeout configured but no circuit breaker strategy, so the result does not meet the expected hypothesis.

8.jpg

6) Fix the problem

Configure a service circuit breaker or degradation strategy.

5. Case Two

1) Scene

While the application is running, the Dubbo service provider cannot reach the registry. We simulate this by injecting 100% network packet loss on the registry machine.

2) Monitoring indicators

Define the system's steady-state indicators. Selecting the service endpoint in the SkyWalking console shows the following steady state:

  • com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal

9.png

3) Hypothesis

Neither the upstream service's business nor the downstream service is affected.

4) Chaos experiment

Inject a 100% packet loss fault on the registry port. We use Nacos as the Dubbo registry; its default port is 8848 and the network interface is eth0. The command parameters are:

  • --interface eth0: network interface
  • --percent 100: 100% packet loss rate
  • --local-port 8848: local port 8848

Run the command to inject the fault:

blade create network loss --interface eth0 --percent 100 --local-port 8848

5) Monitoring indicators

After injecting the fault, select the service endpoint on the SkyWalking console, and the steady state indicators are as follows:

  • com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal

10.png

Conclusion: the service depends only weakly on the registry and keeps its own local cache, so the result meets the expected hypothesis.

11.jpg

If the application were deployed in a Kubernetes cluster, we could additionally verify the registry's horizontal scaling capability. ChaosBlade also supports Kubernetes cluster scenarios.

6. Try it yourself

In the cases above, we verified whether the service has timeout and circuit breaker strategies configured, and whether Dubbo depends only weakly on the registry with the service keeping a local cache. Are you eager to try this on your own system? ChaosBlade provides a wealth of experiment scenarios that cover basic resources and applications, and it is also a powerful tool for cloud native platforms. It is easy to use and offers detailed parameters for keeping the blast radius of a fault as small as possible, so getting started should be easy.

Knowledge from paper alone is shallow, so here is an extra small case to practice on. Applications frequently interact with relational databases, and when application traffic grows rapidly the bottleneck often appears on the database side in the form of many slow SQL statements. Without slow SQL alerting it is hard to find the offending SQL and optimize it, so slow SQL alerting is very important. How can you verify that your application has this capability? ChaosBlade supports injecting MySQL slow SQL faults. Run blade create mysql delay -h to see how to use the MySQL call delay command:

Mysql delay experiment

Usage:
  blade create mysql delay

Examples:
# Do a delay 2s experiment for mysql client connection port=3306 INSERT statement
blade create mysql delay --time 2000 --sqltype select --port 3306

Flags:
      --database string         The database name which used
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
  -h, --help                    help for 
      --host string             The database host
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --port string             The database port which used
      --process string          Application process name
      --sqltype string          The sql type, for example, select, update and so on.
      --table string            The first table name in sql.
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

As you can see, ChaosBlade provides a complete example and supports finer-grained parameters such as SQL type and table name. Try delaying select operations on connections to port 3306 by 10 s. When traffic arrives, does your application raise an alert?

blade create mysql delay --time 10000 --sqltype select --port 3306

Explanation of the command parameters:

  • --time 10000: delay of 10 s
  • --sqltype select: only select statements are affected
  • --port 3306: only connections to port 3306 are affected

Summary

In this article we introduced how chaos engineering applies to a real, complex distributed architecture, and combined ChaosBlade and SkyWalking to run chaos experiments on a real application, so that the system can be analyzed and optimized according to how it fails and its stability and high availability continuously improved. ChaosBlade not only supports basic resource and application dimensions but is also a powerful tool for cloud native platforms. You are welcome to try it.

ChaosBlade project: https://github.com/chaosblade-io/chaosblade. Everyone is welcome to join and build it together!

About the author

Ye Fei: GitHub @tiny-x, open source community enthusiast and ChaosBlade committer, involved in building the ChaosBlade chaos engineering ecosystem.

Origin blog.51cto.com/13778063/2561851