Getting Started with the RPC Framework Kitex: A Guide to Performance Testing

On September 8, 2021, ByteDance officially announced the open sourcing of CloudWeGo. CloudWeGo is a collection of ByteDance's internal microservice middleware, featuring high performance, strong extensibility, and stability. It focuses on solving problems of microservice communication and governance, and on meeting the demands of different businesses in different scenarios. CloudWeGo first open sourced four projects: Kitex, Netpoll, Thriftgo, and netpoll-http2, with the RPC framework Kitex and the network library Netpoll as the core.

A few days ago, the ByteDance service framework team officially open sourced CloudWeGo, and Kitex, the Golang microservice RPC framework deeply deployed in Douyin and Toutiao, is among the released projects.

This article shares the scenarios and technical issues that developers need to understand when stress testing Kitex. These suggestions help users tune Kitex against realistic RPC scenarios, so that it better fits business needs and delivers its best performance. Users can also refer to the official stress test project kitex-benchmark for more details.

Characteristics of Microservice Scenarios

Kitex was born from the practice of ByteDance's large-scale microservice architecture, so the scenario it targets is naturally microservices. Therefore, the characteristics of microservices are introduced first, so that developers can better understand the design thinking behind Kitex.

  • RPC communication model

    Communication between microservices is usually based on a PingPong (request-response) model, so in addition to conventional throughput metrics, the average latency of each RPC is also a point developers need to consider.

  • Complex call chain

    An RPC call often requires the cooperation of multiple microservices, and downstream services have dependencies of their own, so the entire call chain becomes a complex mesh structure.

    In such a complex invocation graph, the latency fluctuation of one intermediate node can propagate to the entire chain and cause an overall timeout. When there are enough nodes on the chain, even if each node's fluctuation probability is very low, the accumulated timeout probability over the whole chain is amplified. Therefore the latency fluctuation of a single service, measured by the P99 latency indicator, is also a key metric with a significant impact on online services.

  • Packet size

    Although the size of a communication packet depends on the actual business scenario, ByteDance's internal statistics show that most online requests are small packets (<2KB). So while large-packet scenarios are taken into account, we focus on optimizing performance in small-packet scenarios.
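The amplification effect described above can be made concrete with a quick calculation: if each of n serial nodes independently misses its deadline with probability p, the whole chain misses with probability 1 - (1 - p)^n. A minimal Go sketch (illustrative numbers only, not Kitex measurements):

```go
package main

import "fmt"

// chainTimeoutProb returns the probability that a chain of n serial nodes
// misses its deadline, assuming each node independently fluctuates past
// its deadline with probability p: 1 - (1-p)^n.
func chainTimeoutProb(p float64, n int) float64 {
	ok := 1.0
	for i := 0; i < n; i++ {
		ok *= 1 - p
	}
	return 1 - ok
}

func main() {
	// Even a 1% per-node fluctuation grows to roughly 26% over 30 nodes.
	for _, n := range []int{1, 10, 30} {
		fmt.Printf("p=1%% n=%d -> chain timeout prob %.1f%%\n",
			n, chainTimeoutProb(0.01, n)*100)
	}
}
```

This is why the P99 latency of every single service matters: the tail of each node compounds along the chain.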

Stress testing for microservice scenarios

Determine the stress test target

Measuring the performance of an RPC framework requires thinking from two perspectives: the Client perspective and the Server perspective. In a large-scale business architecture, the upstream client does not necessarily use the same framework as the service under test, and the same is true for the downstream services it calls. If a service mesh is involved, the situation is even more complicated.

Some stress testing projects run the Client and Server processes together and report the result as the performance of the whole framework, which may not match how the framework actually behaves online.

If you want to stress test the Server, you should give the Client as many resources as possible to push the Server to its limit, and vice versa. If both the Client and Server use only 4-core CPUs, developers cannot tell which perspective the final numbers reflect, let alone use them as a practical reference for online services.

Align connection models

There are three main connection models for conventional RPC:

  • Short connection: a new connection is created for each request and closed as soon as the response returns
  • Long connection pool: a single connection handles only one request-response at a time; connections are kept alive and reused through a pool
  • Connection multiplexing: a single connection can carry multiple in-flight requests and responses asynchronously at the same time

No connection model is absolutely better or worse; it depends on the actual usage scenario. Although connection multiplexing generally gives the best performance, it requires the protocol to carry packet sequence numbers, and some services built on older frameworks may not support multiplexed calls.
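To illustrate why multiplexing needs protocol-level sequence numbers, here is a toy Go sketch of the client side of connection multiplexing: each request is tagged with a Seq, and a demultiplexer goroutine routes each (possibly out-of-order) response back to its waiter. This is an illustrative model over in-memory channels, not Kitex's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// frame is a minimal stand-in for a wire frame: multiplexing requires the
// protocol to carry a sequence number so a response can be matched to its
// request even when responses return out of order.
type frame struct {
	Seq  uint32
	Body string
}

// muxConn sketches a client sharing one "connection" (a channel pair here)
// across many concurrent callers.
type muxConn struct {
	mu      sync.Mutex
	nextSeq uint32
	pending map[uint32]chan string // waiters keyed by sequence number
	out     chan<- frame
}

func newMuxConn(out chan<- frame, in <-chan frame) *muxConn {
	c := &muxConn{pending: make(map[uint32]chan string), out: out}
	go func() { // demultiplexer: route each response to its waiter by Seq
		for f := range in {
			c.mu.Lock()
			ch := c.pending[f.Seq]
			delete(c.pending, f.Seq)
			c.mu.Unlock()
			ch <- f.Body
		}
	}()
	return c
}

// Call registers a waiter, sends the tagged request, and blocks for the reply.
func (c *muxConn) Call(req string) string {
	c.mu.Lock()
	c.nextSeq++
	seq := c.nextSeq
	ch := make(chan string, 1)
	c.pending[seq] = ch
	c.mu.Unlock()
	c.out <- frame{Seq: seq, Body: req}
	return <-ch
}

func main() {
	toServer := make(chan frame, 16)
	toClient := make(chan frame, 16)
	go func() { // echo "server" that may answer out of order
		for f := range toServer {
			go func(f frame) { toClient <- frame{Seq: f.Seq, Body: "echo:" + f.Body} }(f)
		}
	}()
	c := newMuxConn(toServer, toClient)
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ { // concurrent requests all share one connection
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Println(c.Call(fmt.Sprintf("req-%d", i)))
		}(i)
	}
	wg.Wait()
}
```

A protocol without the Seq field cannot do this, which is why short connections or a long connection pool remain the compatible fallback.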

To ensure maximum compatibility, Kitex initially used short connections by default on the client side, while other mainstream open source frameworks used connection multiplexing by default. As a result, some users saw relatively large bias in performance data when stress testing with the default configurations.

Later, to match the common usage scenarios of open source users, Kitex made long connection pooling the default in v0.0.2.

Align serialization

For an RPC framework, setting service governance aside, the computational overhead is concentrated mainly in serialization and deserialization.

Kitex uses the official Protobuf library for Protobuf serialization, while its Thrift serialization has been specially optimized for performance; this work is introduced in the official website blog.

Most current open source frameworks give priority to supporting Protobuf, and the Protobuf used in some frameworks is actually the gogo/protobuf fork, which carries many performance optimizations. However, because gogo/protobuf is currently at risk of losing maintenance, we decided for maintainability's sake to use only the official Protobuf library; we also plan to optimize Protobuf performance in the future.
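When aligning serialization across frameworks, the per-operation marshal/unmarshal cost is what must be compared under identical codecs. The sketch below uses encoding/json purely as a stand-in codec (Kitex's real hot paths are its optimized Thrift and the official Protobuf library, which need generated code) to show how such a micro-measurement can be taken:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EchoRequest is a stand-in payload roughly matching the "small packet
// (<2KB)" case discussed above.
type EchoRequest struct {
	ID      int64  `json:"id"`
	Message string `json:"message"`
}

// roundTrip serializes and deserializes one request, the part of an RPC
// where most of the framework's computational overhead lives.
func roundTrip(req EchoRequest) (EchoRequest, error) {
	b, err := json.Marshal(req)
	if err != nil {
		return EchoRequest{}, err
	}
	var out EchoRequest
	err = json.Unmarshal(b, &out)
	return out, err
}

func main() {
	req := EchoRequest{ID: 1, Message: "hello kitex"}
	start := time.Now()
	const n = 10000
	for i := 0; i < n; i++ {
		if _, err := roundTrip(req); err != nil {
			panic(err)
		}
	}
	// For cross-framework comparisons to be fair, this per-op cost must be
	// measured with the same codec generation and library version.
	fmt.Printf("avg serialize+deserialize: %v/op\n", time.Since(start)/n)
}
```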

Use exclusive CPU

Although online applications usually share CPUs with other processes, in a stress test both the Client and Server processes are extremely busy. If they also share CPUs, there will be a large amount of context switching, making the data a poor reference and prone to large run-to-run fluctuations.

So we suggest isolating the Client and Server processes on different CPUs or on separate dedicated machines. If you want to further reduce interference from other processes, you can launch the stress test processes with nice -n -20 to raise their scheduling priority.

In addition, if conditions permit, using real physical machines makes the test results more rigorous and reproducible than virtual machines on a cloud platform.

Performance Data Reference

On the premise of meeting the above requirements, we compared multiple frameworks using Protobuf for stress testing; the code is in the kitex-benchmark repository. With the goal of saturating the Server, Kitex has the lowest P99 latency of all frameworks in connection pool mode, and in multiplexing mode Kitex shows even more pronounced advantages across the indicators.

Configuration:

  • Client 16 CPUs, Server 4 CPUs
  • 1KB request size, Echo scenario

Reference data:

  • KITEX: connection pool mode (default mode)
  • KITEX-MUX: multiplexing mode
  • Other frameworks use multiplexing mode

[Benchmark result charts]

Epilogue

Among the current mainstream Golang open source RPC frameworks, each has its own design goals: some focus on generality, some on scenarios with light business logic such as Redis, some on throughput performance, and some more on P99 latency.

In the daily iteration of ByteDance's businesses, a feature often improves one indicator while degrading another. Therefore, from the beginning of its design, Kitex has leaned toward solving the full range of problems in large-scale microservice scenarios.

Since the release of Kitex, we have received a large amount of self-test data from users. We thank the community for its attention and support, and we welcome developers to use the testing guidelines in this article to choose the right tool for their actual scenarios. For more questions, please file an issue on GitHub.

