Kitex: High-Performance Optimization Practice Under a Microservice Architecture

Foreword

In 2019, ByteDance's service framework team began developing the RPC framework Kitex and a series of related foundational libraries, both to address the functional and performance pain points encountered under our large-scale microservice architecture and to absorb the lessons accumulated from the older in-house framework. Kitex was officially open sourced on GitHub in 2021.

From 2019 to 2023, the scale of our internal microservices expanded enormously, and Kitex was repeatedly optimized and battle-tested along the way. This article shares the performance optimization practices we accumulated during this period and systematically summarizes our optimization work over the past few years.

The past and present of Kitex

Why do we need an RPC framework

Although RPC frameworks have a long history, their wide adoption as a core component at scale is inseparable from the popularity of the microservice architecture. So it is worth reviewing this history and exploring why an RPC framework is needed.

Monolithic Architecture Era

The main features of system services during this period are:

  • Separate different business logic by function

  • Performance pressure is concentrated mainly on the database, so the database layer evolved from manual sharding (splitting databases and tables by hand) to truly automatic distributed architectures

Common business codes are as follows:

func BuySomething(userId int, itemId int) {
    user := GetUser(userId)
    item := GetItem(itemId)
    // ... assemble the order from user and item ...
}

func GetUser(userId int) User {
    return db.users.GetUser(userId)
}

func GetItem(itemId int) Item {
    return db.items.GetItem(itemId)
}

This coding pattern is straightforward, and with good design patterns in place it is easy to refactor and to write unit tests for. Many IT systems still use this model today. However, with the rapid development of Internet business, some very large projects ran into ceilings:

  1. Computing power ceiling: the computing power available to one request <= the total computing power of a single server / the number of requests processed simultaneously

  2. R&D efficiency ceiling: repository size, team headcount, and code complexity grow non-linearly, making the system increasingly hard to maintain and increasingly hard to release.

The era of microservice architecture

To solve these problems of the monolithic architecture, we arrived at the era of the microservice architecture. Typical microservice code looks like this:


func BuySomething(userId int, itemId int) {
    user := client.GetUser(userId) // RPC call
    item := client.GetItem(itemId) // RPC call
    // ... assemble the order from user and item ...
}

The point of RPC (Remote Procedure Call) is to let business code call remote services as if they were local methods, with minimal awareness on the business side, so that in the evolution from a monolithic to a microservice architecture, changes to business coding habits are kept to a minimum.

The direction of performance optimization

Without RPC, the only call overhead in the code below is that of a plain function call, which is nanosecond-level (ignoring inlining optimizations):


func client(request Request) (response Response) {
    response = server(request) // plain function call
    return
}

func server(request Request) (response Response) {
    response.Message = request.Message
    return
}

After replacing it with an RPC call, the call overhead jumps straight to the millisecond level:

func client(request Request) (response Response) {
    response = rpcClient.RPCCall(request) // RPC call: a network round trip
    return
}

func server(request Request) (response Response) {
    response.Message = request.Message
    return
}

That is a latency difference of roughly six orders of magnitude (10^6), which shows both that RPC is expensive and that there is a great deal of room for optimization.

The complete flow of an RPC call is shown below; in the following sections we present the performance optimizations we made at each stage of it:

Why build our own RPC framework

Before diving into the performance practices, one thing needs explaining: why did we choose to build a new RPC framework when so many already existed? Mainly for the following reasons:

  • Internally, services communicate mainly over the Thrift protocol, but most mainstream Go frameworks do not support Thrift, and extending them to multiple protocols is not easy.

  • The company has extreme performance requirements and needs deep optimization across the whole link (examples are given later).

  • The company's internal microservices are huge and complex, requiring a highly extensible framework that supports deep customization.

What is Kitex

Development path

Kitex was officially started in 2019, released internally in 2020, and open sourced in 2021. As of February 2023, more than 60,000 internal microservices were using it.

CloudWeGo Family

While developing the Kitex framework itself, we also open sourced many high-performance components that are not coupled to Kitex, forming the larger CloudWeGo ecosystem:

Kitex vs other frameworks

Kitex supports both the Thrift and gRPC protocols. Since few frameworks in the Go ecosystem support Thrift, we use the gRPC protocol here for a horizontal comparison with the grpc-go framework:

gRPC Unary comparison:

gRPC Streaming comparison:

Kitex framework performance optimization practice

Many of Kitex's performance optimization ideas are not tied to the Go language, but for convenience we mainly use Go in the examples below.

Next, we will introduce Kitex's performance optimization practices one by one along the complete flowchart of the previous RPC call.

Codec optimization

Common codec problems

Take Protobuf as an example:

  1. Computational overhead:

    1. Extra type information must be obtained via reflection at runtime

    2. Many functions are called and many small objects are created

  2. GC overhead: memory is not easy to reuse

Generated code optimization: FastThrift & FastPB

For the two protocols Kitex supports, Thrift and Protobuf, we implemented encoding and decoding through generated code. Because generated code can bake in as much runtime information as possible ahead of time, it provides the following benefits:

  1. Precompute the size and reuse memory

When serializing, we can call Size() at very low cost and use the result to allocate a fixed-size block of memory in advance.


type User struct {
   Id   int32
   Name string
}

func (x *User) Size() (n int) {
   n += x.sizeField1()
   n += x.sizeField2()
   return n
}

// Framework process
size := user.Size()
data := Malloc(size)
Encode(user, data) // encode the user object directly into the allocated memory, saving one copy
Send(data)
Free(data) // the allocated memory is reused by the next Malloc

  2. Minimize function calls and intermediate object creation

Although each function call or small object allocation is cheap on its own, on the hot path of encoding and decoding, optimizing this low-cost, high-frequency code still brings great benefits, especially since Go is a garbage-collected language.

As shown below, because the underlying fastWriteField functions are inlined at compile time, the generated FastWrite function is essentially a sequence of writes into a fixed block of memory (FastRead is similar):


func (x *User) FastWrite(buf []byte) (offset int) {
   offset += x.fastWriteField1(buf[offset:])
   offset += x.fastWriteField2(buf[offset:])
   return offset
}
// inline
func (x *User) fastWriteField1(buf []byte) (offset int) {
   offset += fastpb.WriteInt32(buf[offset:], 1, x.Id)
   return offset
}
// inline
func (x *User) fastWriteField2(buf []byte) (offset int) {
   offset += fastpb.WriteString(buf[offset:], 2, x.Name)
   return offset
}

Optimization effect

Codec overhead was reduced from the previous 3.58% to 0.98%:

Replacing generated code with JIT: Frugal (Thrift)

After the hard-coded generated approach achieved good results, we also received some feedback, such as:

  1. The size of the generated code grows linearly with the number of fields

  2. The generated code depends on the version of the code-generation tool each user has installed, so in multi-person collaboration regenerated code can easily overwrite someone else's

Therefore we naturally asked: can the code we used to generate ahead of time be generated automatically at runtime instead? The question almost answers itself: introduce JIT (just-in-time compilation) technology to replace code generation.

Advantages:

  • Parameters are passed in registers and inlining is deeper, improving function call efficiency

  • Core computation functions use fully optimized assembly code
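
To illustrate, here is a minimal sketch of using Frugal directly, based on the public cloudwego/frugal API; the User struct and its frugal tags are illustrative assumptions following the tag format documented in the frugal README:

import "github.com/cloudwego/frugal"

type User struct {
    Id   int32  `frugal:"1,default,i32"`
    Name string `frugal:"2,default,string"`
}

func encodeUser(u *User) ([]byte, error) {
    // Frugal JIT-compiles a codec from the struct tags at runtime,
    // so no FastWrite/FastRead code needs to be generated in advance.
    buf := make([]byte, frugal.EncodedSize(u))
    if _, err := frugal.EncodeObject(buf, nil, u); err != nil {
        return nil, err
    }
    return buf, nil
}

func decodeUser(buf []byte) (*User, error) {
    u := new(User)
    if _, err := frugal.DecodeObject(buf, u); err != nil {
        return nil, err
    }
    return u, nil
}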

Optimization effect

Codec overhead was reduced from the previous 3.58% to 0.78%:

Frugal vs. Apache Thrift codec performance comparison:

Network library optimization

Shortcomings of Go's native net library in RPC scenarios

  1. One goroutine per connection: when there are many upstream and downstream instances, performance drops sharply once the number of goroutines passes a certain level, which especially hurts services with large numbers of instances.

  2. There is no way to proactively sense that a connection has been closed by the peer

  3. When a struct is serialized without copying (NoCopy), the result is often a two-dimensional byte slice ([][]byte) over non-contiguous memory, and Go's Write([]byte) interface cannot read or write non-contiguous memory without an extra copy:


name := "Steve Jobs" // 0xc000000020
req := &Request{Id: int32(1), Name: name}

// ===> Encode to [][]byte
[
 [4 bytes],
 [10 bytes], // no copy encoding, 0xc000000020
]

// ===> Copy to []byte
buf := [4 bytes + 10 bytes] // new address

// ===> Write([]byte)
net.Conn.Write(buf)

  4. It is strongly bound to the Go runtime, which makes it hard to modify in order to support new experimental features.

Netpoll optimization practice

Main optimization points:

  1. Goroutine optimization: the number of goroutines is decoupled from the number of connections, and goroutines are reused as much as possible (see the server sketch after this list)

  2. A middle-layer buffer that supports zero-copy reads/writes and memory reuse, minimizing GC overhead during encoding and decoding

  3. Deep customization for high-concurrency, small-packet RPC scenarios: goroutine scheduling optimization, TCP parameter tuning, etc.

  4. Deep customization for our internal environment, including modifying the Go runtime to raise scheduling priority, kernel support for batched system calls, etc.
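
As a concrete reference, here is a minimal echo server sketch on Netpoll, based on the public cloudwego/netpoll API (the listen address is an example; error handling is abbreviated):

package main

import (
    "context"

    "github.com/cloudwego/netpoll"
)

func main() {
    // One event loop serves many connections with a small, reused set of
    // goroutines instead of one goroutine per connection.
    listener, _ := netpoll.CreateListener("tcp", ":8888")
    eventLoop, _ := netpoll.NewEventLoop(handle)
    _ = eventLoop.Serve(listener)
}

// handle is invoked only when a connection has readable data.
func handle(ctx context.Context, conn netpoll.Connection) error {
    reader, writer := conn.Reader(), conn.Writer()
    defer reader.Release() // hand buffer nodes back for reuse
    data, err := reader.Next(reader.Len()) // zero-copy read of all readable bytes
    if err != nil {
        return err
    }
    _, _ = writer.WriteBinary(data) // echo the payload back
    return writer.Flush()
}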

Communication layer optimization

Same-machine communication optimization: Communication efficiency issues under Service Mesh

After introducing Service Mesh, the business process mainly communicates with a sidecar process on the same machine, which adds an extra hop of latency.

Traditional Service Mesh solutions generally forward traffic to the sidecar process via iptables hijacking; as one might expect, the performance loss across all those layers is considerable. Kitex made many performance optimization attempts at the communication layer and eventually produced a systematic solution.

Same-machine communication optimization: UDS replaces TCP

Performance comparison between UDS and TCP:


======== IPC Benchmark - TCP ========
      Type    Conns     Size        Avg        P50        P99
    Client       10     4096      127μs       76μs      232μs
  Client-R       10     4096        2μs        1μs        1μs
  Client-W       10     4096        9μs        4μs        7μs
    Server       10     4096       24μs       13μs       18μs
  Server-R       10     4096        1μs        1μs        1μs
  Server-W       10     4096        7μs        4μs        7μs
======== IPC Benchmark - UDS ========
      Type    Conns     Size        Avg        P50        P99
    Client       10     4096      118μs       75μs      205μs
  Client-R       10     4096        3μs        2μs        3μs
  Client-W       10     4096        4μs        1μs        2μs
    Server       10     4096       24μs       11μs       16μs
  Server-R       10     4096        4μs        2μs        3μs
  Server-W       10     4096        3μs        1μs        2μs

From this benchmark we can draw two conclusions:

  1. UDS beats TCP on every metric

  2. But the improvement is not very large
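
At the code level, switching the same-machine transport from loopback TCP to UDS is a one-line change at the dial site. A minimal sketch using only the standard library (the address and socket path are examples):

import "net"

func dialSidecar(useUDS bool) (net.Conn, error) {
    if useUDS {
        // A Unix domain socket skips the TCP/IP stack (checksums,
        // congestion control, etc.) for same-machine communication.
        return net.Dial("unix", "/var/run/sidecar.sock")
    }
    return net.Dial("tcp", "127.0.0.1:8888")
}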

Same-machine communication optimization: ShmIPC replaces UDS

To squeeze inter-process communication performance further, we developed a shared-memory-based transport. The hard part of shared-memory communication is synchronizing the communication state between processes, so we designed our own protocol, keeping UDS as the event-notification channel (the IO queue) and using shared memory as the data-transmission channel (the buffer):

For more technical details of shmipc, see our earlier article: ByteDance Open Sources Shmipc: High-Performance IPC Based on Shared Memory.

Performance Testing:

Cross-machine to same-machine communication: the merged deployment solution

We have pushed same-machine communication to the extreme, but that only covers the data-plane traffic between the service process and the Service Mesh proxy; the service on the other end is most likely not deployed on the same machine. So how do we optimize cross-machine communication?

A "trick" idea is to turn cross-machine problems into same-machine problems .

Accomplishing this in large-scale microservice communication requires the cooperation of multiple layers of the architecture, so we developed a merged-deployment solution:

  1. Container scheduling layer transformation: the container scheduler performs affinity scheduling based on the merged-deployment relationships and instances of upstream and downstream services, placing upstream and downstream instances on the same physical machine wherever possible.

  2. Traffic scheduling layer transformation: the service control plane identifies which downstream instances are local to an upstream container and, taking global load balancing into account, computes dynamic weights for each upstream instance's downstream candidates, so that more traffic can be exchanged over same-machine communication.

  3. Framework transformation: extended and customized to support the merged-deployment communication path, sending each request to a same-machine instance or to the Mesh proxy according to the traffic scheduling layer's computed result.

Microservice Online Tuning Practice

Beyond the optimizations we made at the framework layer, a large share of online performance bottlenecks actually comes from the business logic itself. Here too we have accumulated some practical experience.

Automated GC tuning

Problems with Go's native GC strategy

Go was not designed specifically for microservice scenarios, so its GC strategy naturally does not focus on latency-sensitive workloads; RPC services, however, usually have firm requirements on P99 latency.

Let's take a look at the basic principles of Go GC first:

GOGC principle:

GOGC is a percentage value, 100 by default, that determines the heap size at which the next GC is triggered: NextGC = HeapSize + HeapSize * (GOGC / 100), where HeapSize is the live heap remaining after the last GC. With the default value, GC triggers at twice the post-GC heap size.

So if a service's live memory footprint is 100 MB, GC is triggered every time the heap grows to 200 MB, even if the service's container has 4 GB of memory available.
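
For reference, this knob is exposed directly by the standard library; a trivial sketch (100 is the default, shown here only to make the mechanism explicit):

import "runtime/debug"

func init() {
    // With GOGC=100, the next GC triggers at roughly twice the live heap
    // remaining after the previous GC: NextGC = HeapSize * (1 + 100/100).
    debug.SetGCPercent(100)
}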

Shortcomings:

  1. In microservice scenarios, service memory utilization is generally very low, yet GC still runs aggressively

  2. In RPC scenarios, many objects are highly reusable, and frequently collecting these reusable objects lowers their reuse rate

Core goal: while staying safe, reduce GC frequency and improve the resource reuse rate of microservices

gctuner: an automatically tuned GC strategy

Users control how aggressive they want GC to be by setting a threshold, for example memory_limit * 0.7. While memory usage stays below this value, GCPercent is raised as much as possible.

  • While memory usage has not reached the threshold, the GOGC parameter is set higher; once it exceeds the threshold, GOGC is set lower.

  • In any case, GOGC is clamped to a minimum of 50 and a maximum of 500.

Advantages:

  • Delays GC when memory utilization is low

  • Reverts to the native GC strategy when memory utilization is high

Points to note:

  • If memory is not exclusive to the current process, enough must be reserved for the other processes

  • Not suitable for services prone to extreme memory spikes

gctuner is currently open source on github: https://github.com/bytedance/gopkg/tree/develop/util/gctuner .
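
A minimal usage sketch, assuming a fixed 4 GiB memory limit; in a real service the limit would be read from the cgroup files or the environment:

import "github.com/bytedance/gopkg/util/gctuner"

func init() {
    var limit uint64 = 4 * 1024 * 1024 * 1024 // assumed container memory limit
    // Keep GC lazy until the heap approaches 70% of the limit; beyond
    // that, gctuner lowers GOGC back toward the native strategy.
    gctuner.Tuning(limit / 10 * 7)
}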

Concurrency Tuning

How many CPUs can I use? - the container's lie


apiVersion: v1
kind: Pod
spec:
  containers:
  - resources:
      limits:
        cpu: "4"

The rise of microservices has gone hand in hand with the rise of containerization. Today most microservices in the industry, and even databases, run in container environments. Here we only discuss mainstream cgroup-based containers.

A common development pattern is that an engineer requests a 4-core-CPU container on the container platform, then naturally assumes the program can use at most 4 CPUs at the same time, and tunes the program's parameters under that assumption.

After going online, entering the container and running top shows that every metric indeed matches a 4-core machine:

Even cat /proc/cpuinfo shows exactly 4 CPUs, no more, no less:


processor        : 0
// ...
processor        : 1
// ...
processor        : 2
// ...
processor        : 3
// ...

In fact, all of this is just a pleasant illusion the container constructs for you. The illusion is kept this realistic to free you from extra mental burden while programming, and, incidentally, to let traditional Linux debugging tools work normally inside containers.

In reality, cgroup-based container technology only limits CPU time, not the number of CPUs. If you log into the host machine and check which CPU each thread of the process is running on, you will find the total is likely to exceed the container's CPU setting:

A container that requests 4 CPU units can run for the equivalent of 4 CPUs' worth of time per scheduling period (typically 100 ms). This neither restricts it to 4 physical CPUs nor guarantees that 4 CPUs are available to the program simultaneously. Once the time budget is used up, all processes in the container are suspended until the period ends; in other words, the program may get throttled.
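
To see the real limit, one can read the CFS quota from the cgroup files. A hedged sketch, assuming the conventional cgroup v1 mount paths:

import (
    "os"
    "strconv"
    "strings"
)

// cgroupCPULimit returns quota/period, i.e. how many CPUs' worth of time
// the container may consume per scheduling period (0 means unlimited).
func cgroupCPULimit() (float64, error) {
    quota, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    if err != nil {
        return 0, err
    }
    period, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    if err != nil {
        return 0, err
    }
    if quota <= 0 { // -1 means no quota is set
        return 0, nil
    }
    return float64(quota) / float64(period), nil
}

func readInt(path string) (int64, error) {
    b, err := os.ReadFile(path)
    if err != nil {
        return 0, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}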

Is more upstream parallelism always faster? - the relationship between concurrency and timeout

Once we know that the physical parallelism ceiling actually allowed for our program is quite high, we can use this knowledge to raise or lower the number of worker threads (GOMAXPROCS) or the request concurrency inside the program.

For example, in the following scenario the business sends requests to the same downstream with a concurrency of 4, each request takes the downstream 50 ms to process, and the upstream sets a 100 ms timeout. That sounds reasonable, but if the downstream happens to have only 2 CPUs available for these requests at that moment, and some GC or other work lands in between, the third RPC request will time out.

If we instead set the concurrency to 2, the probability of a timeout drops greatly.

Of course, this does not mean that lower concurrency is always better: if the downstream has plenty of spare computing power, only higher concurrency can fully unlock its processing capacity.

Avoid crowding out - reserve computing power for other processes

If there are other processes in the container, you must consider reserving resources for them. This is especially important when the Service Mesh data plane is deployed as a sidecar in the same container: if the upstream process exhausts the period's time slice, the downstream sidecar is very likely to be throttled when its turn comes, so the service's overall latency still degrades.

How to adjust service concurrency

  • Adjust the number of worker threads: in Go, this is exposed via GOMAXPROCS.

  • Tune the request concurrency in code: the business must weigh the latency gained from higher concurrency against the stability lost at peak load, and keep experimenting to find an appropriate value; a sketch follows this list.

  • Use a batch interface: if the business scenario allows it, replacing the interface with a batch interface is even better.
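
As referenced above, a minimal sketch of capping request concurrency with a buffered-channel semaphore; Request and the call parameter are illustrative placeholders for the real request type and RPC client call:

import (
    "context"
    "sync"
)

type Request struct{ Id int32 } // illustrative placeholder

func fanOut(ctx context.Context, reqs []Request, call func(context.Context, Request)) {
    sem := make(chan struct{}, 2) // concurrency limit chosen per the trade-off above
    var wg sync.WaitGroup
    for _, req := range reqs {
        wg.Add(1)
        sem <- struct{}{} // acquire: blocks while 2 requests are in flight
        go func(r Request) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            call(ctx, r) // e.g. the downstream RPC call
        }(req)
    }
    wg.Wait()
}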

Future outlook

The Last Bastion: Kernel

Currently, the last remaining optimization gap: the kernel.

In online services we often find that even after optimizing RPC down to same-machine communication, RPC overhead still frequently accounts for more than 20% of total overhead in IO-intensive services. We have already pushed inter-process communication to a very extreme point; optimizing further means completely breaking through the constraints of today's Linux inter-process communication.

We have achieved some preliminary results in this area and will share them in future articles, so stay tuned.

Rethinking the TCP protocol

In data center communication scenarios, TCP has several defects:

  • Intranet network quality is excellent and packet loss is extremely rare, so many of TCP's design provisions are wasted

  • In large-scale point-to-point communication, long-lived TCP connections easily degrade into effectively short-lived ones

  • The application layer works in units of "messages", while a TCP data stream has no message boundaries

These defects, in turn, got us thinking: should there be a dedicated data center protocol for RPC communication?

Continue to develop existing components

For existing components, we will continue to invest in further improving their performance and broadening their usage scenarios:

Codec (Frugal):

  • Support ARM architecture

  • Optimize the SSA backend

  • Accelerate with SIMD

Network library Netpoll:

  • Refactor the interfaces to allow seamless interoperation with existing Go ecosystem libraries

  • SMC-R (RDMA) support

Merged deployment:

  • Extend from same-machine granularity to same-rack granularity

Project address

GitHub:https://github.com/cloudwego

Official website: www.cloudwego.io
