ByteDance's Practice on the Go Network Library

This article is part of the "ByteDance Infrastructure Practice" series.

The "ByteDance Infrastructure Practice" series consists of in-depth technical articles created by the technical teams and experts of ByteDance's infrastructure department, sharing the team's practical experience and lessons learned during the development and evolution of its infrastructure, in the hope of exchanging ideas and growing together with engineers everywhere.

As an important part of the R&D system, the RPC framework carries almost all service traffic. This article briefly introduces the design and practice of ByteDance's self-developed network library, netpoll, as well as the problems we actually encountered and how we solved them, in the hope of providing some reference for readers.

Foreword

As an important part of the R&D system, the RPC framework carries almost all service traffic. As the Go language is used more and more widely inside the company, the business requirements on the framework keep rising, but Go's native net library cannot provide sufficient performance and control: it cannot perceive connection status, goroutine utilization is low when the number of connections is large, and the number of goroutines cannot be controlled. To gain full control over the network layer, and to explore ahead of business needs and ultimately enable them, the framework team launched netpoll, a new self-developed network library based on epoll, and built Kitex, ByteDance's new-generation internal Golang framework, on top of it.

Since the principle of epoll has been described in many other articles, this article only briefly introduces the design of netpoll; then we sort out some of our practices based on netpoll; finally, we share a problem we encountered and the thinking behind our solution. Students who are interested in the Go language and frameworks are welcome to join us!

New Network Library Design

Reactor - Event Monitoring and Dispatching Core

The core of netpoll is the Reactor, an event monitor and scheduler. Its main function is to use epoll to monitor the file descriptors (fds) of connections and, through a callback mechanism, to trigger the read, write, and close events on those connections.

[Figure: Reactor event monitoring and scheduling]
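To make the callback-driven Reactor described above concrete, here is a minimal epoll event loop in Go (Linux only, using golang.org/x/sys/unix). The pollDesc type and its callbacks are illustrative assumptions for this sketch and are not netpoll's actual API:

```go
package reactor

import "golang.org/x/sys/unix"

// pollDesc is a hypothetical per-connection descriptor: an fd plus the
// callbacks triggered for its read, write, and close events.
type pollDesc struct {
	fd      int
	onRead  func()
	onWrite func()
	onClose func()
}

func reactorLoop(descs map[int]*pollDesc) error {
	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		return err
	}
	defer unix.Close(epfd)

	// Register every connection fd for read, write, and peer-close events.
	for fd := range descs {
		ev := unix.EpollEvent{
			Events: unix.EPOLLIN | unix.EPOLLOUT | unix.EPOLLRDHUP,
			Fd:     int32(fd),
		}
		if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev); err != nil {
			return err
		}
	}

	events := make([]unix.EpollEvent, 128)
	for {
		n, err := unix.EpollWait(epfd, events, -1)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			return err
		}
		// Dispatch each triggered event to the registered callback.
		for i := 0; i < n; i++ {
			d := descs[int(events[i].Fd)]
			if d == nil {
				continue
			}
			switch {
			case events[i].Events&(unix.EPOLLRDHUP|unix.EPOLLHUP) != 0:
				d.onClose()
			case events[i].Events&unix.EPOLLIN != 0:
				d.onRead()
			case events[i].Events&unix.EPOLLOUT != 0:
				d.onWrite()
			}
		}
	}
}
```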

Server - Master-Slave Reactor Implementation

netpoll composes Reactors in a 1:N master-slave pattern.

  1. The MainReactor mainly manages Listeners, which are responsible for listening on ports and establishing new connections;
  2. The SubReactors are responsible for managing Connections: each monitors all of the connections assigned to it and submits every triggered event to the goroutine pool for processing;
  3. netpoll introduces active memory management at the I/O task layer and provides a NoCopy calling interface to the upper layer, thereby supporting NoCopy RPC;
  4. A goroutine pool processes I/O tasks centrally, reducing the number of goroutines and the scheduling overhead (a simplified sketch of such a pool follows the figure below).

[Figure: master-slave Reactor model]
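Below is a simplified sketch of the goroutine-pool idea from item 4: a fixed set of worker goroutines drains a task queue, so the total number of goroutines stays bounded no matter how many I/O events fire. The names and sizes are illustrative; this is not netpoll's actual pool implementation:

```go
package iopool

type ioTask func()

type taskPool struct {
	tasks chan ioTask
}

// newTaskPool starts a fixed number of workers; queue bounds the number of
// pending tasks.
func newTaskPool(workers, queue int) *taskPool {
	p := &taskPool{tasks: make(chan ioTask, queue)}
	for i := 0; i < workers; i++ {
		go func() {
			// Each worker reuses one goroutine for many I/O events, which is
			// what keeps the goroutine count and scheduling overhead low.
			for t := range p.tasks {
				t()
			}
		}()
	}
	return p
}

// Submit is what a SubReactor would call after EpollWait returns, instead of
// spawning a new goroutine per event.
func (p *taskPool) Submit(t ioTask) {
	p.tasks <- t
}
```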

Client - Shared Reactor Capabilities

The client side shares the SubReactors with the server side, and netpoll also implements a dialer to provide the ability to create connections. The client-side usage is similar to net.Conn, with netpoll providing the underlying support for the write -> wait read callback pattern.

[Figure: client side sharing the SubReactor]
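As an illustration of the write -> wait read callback pattern just described, the sketch below shows how a synchronous client call can be layered on top of an asynchronous read callback. The conn interface and its OnRead hook are hypothetical stand-ins for the underlying support, not netpoll's real API:

```go
package client

// conn is a hypothetical connection abstraction: Write sends a request, and
// OnRead registers a callback that the SubReactor invokes when response data
// for this connection becomes readable.
type conn interface {
	Write(p []byte) (int, error)
	OnRead(func(resp []byte))
}

// call writes a request and blocks until the read callback delivers the
// response, giving the upper layer a synchronous calling experience on top of
// asynchronous event callbacks.
func call(c conn, req []byte) ([]byte, error) {
	done := make(chan []byte, 1)
	c.OnRead(func(resp []byte) { done <- resp })
	if _, err := c.Write(req); err != nil {
		return nil, err
	}
	return <-done, nil
}
```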

Nocopy Buffer

Why do we need a Nocopy Buffer?

In the Reactor and I/O task design described above, the epoll triggering mode affects the design of both I/O and the buffer. Generally speaking, there are two modes:

  • With level triggering (LT), the I/O must be completed synchronously after the event fires, and the buffer can then be handed directly to the upper-layer code.
  • With edge triggering (ET), you can choose to handle only the event notification (as the go net library does), leaving the upper-layer code to complete the I/O and manage the buffer.
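For concreteness, the two modes differ only in how the fd is registered with epoll; the real difference lies in who completes the I/O afterwards. A small illustrative snippet (Linux only, using golang.org/x/sys/unix):

```go
package epollmode

import "golang.org/x/sys/unix"

// registerLT uses level triggering (the epoll default): the event keeps
// firing as long as unread data remains in the socket buffer, so the poller
// itself can complete the I/O synchronously and hand buffers upward.
func registerLT(epfd, fd int) error {
	ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}

// registerET uses edge triggering: the event fires only on a readiness
// change, so the upper layer must drain the fd and manage the buffer itself
// (the approach taken by the go net runtime poller).
func registerET(epfd, fd int) error {
	ev := unix.EpollEvent{Events: unix.EPOLLIN | unix.EPOLLET, Fd: int32(fd)}
	return unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}
```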

Both approaches have their own advantages and disadvantages. netpoll adopts the former: level triggering is timelier and more fault tolerant, and active I/O allows memory usage to be centralized and managed, providing nocopy operations and reducing GC. In fact, some popular open-source network libraries, such as easygo, evio, and gnet, take the same approach.

However, using LT also brings another problem: the underlying active I/O and the upper-layer code operate on the buffer concurrently, which introduces extra concurrency overhead. For example, the I/O side writes data into the buffer while the upper-layer code reads from it, and vice versa, producing concurrent reads and writes. To guarantee data correctness without introducing lock contention, existing open-source network libraries usually either process the buffer synchronously (easygo, evio), which is unsuitable for business processing, or copy the buffer before handing it to the upper-layer code (gnet), which incurs copy overhead.

On the other hand, common buffer libraries such as bytes, bufio, and ringbuffer all have their own problems: growing requires copying the original array, the buffer can only expand but never shrink, and memory usage is high. We therefore wanted to introduce a new form of buffer that solves both problems at once.

Nocopy Buffer Design and Benefits

The Nocopy Buffer is implemented as a linked list of arrays. As shown in the figure below, we abstract []byte arrays as blocks and splice blocks together into the Nocopy Buffer in the form of a linked list. Reference counting, a nocopy API, and an object pool are also introduced.

[Figure: Nocopy Buffer composed of linked blocks]

Compared with common bytes, bufio, ringbuffer, etc., Nocopy Buffer has the following advantages:

  1. Lock-free parallel reads and writes, supporting nocopy streaming reads and writes
    • Reads and writes operate on the head and tail pointers respectively, without interfering with each other.
  2. Efficient expansion and shrinking
    • To expand, a new block is simply appended after the tail pointer, without copying the original arrays.
    • To shrink, the head pointer directly releases the block nodes that have already been read. Each block has an independent reference count; when a released block is no longer referenced, the block node is actively reclaimed.
  3. Flexible slicing and splicing of buffers (a property of the linked list)
    • Supports arbitrary read slices (nocopy); the upper-layer code can process slices of the data stream in parallel without copying and without worrying about their life cycle, which is handled by reference-counting GC.
    • Supports arbitrary splicing (nocopy); the write buffer supports splicing blocks onto the tail pointer without copying, ensuring that data is written only once.
  4. Nocopy Buffer pooling to reduce GC
    • Each []byte array is treated as a block node, and an object pool maintains free blocks for reuse, reducing memory usage and GC. On top of the Nocopy Buffer we implemented Nocopy Thrift, which achieves zero memory allocation and zero copy during encoding and decoding. A simplified sketch of the block/linked-list structure follows this list.
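The following is a highly simplified sketch of the block/linked-list idea described above: blocks are pooled, reference counted, and appended after the tail pointer so that growth never copies existing data. The field names, block size, and API are illustrative assumptions, not netpoll's actual buffer implementation:

```go
package linkbuffer

import (
	"sync"
	"sync/atomic"
)

const blockSize = 4096 // illustrative default block capacity

// block is one node of the linked buffer: a []byte payload plus a reference
// count and a pointer to the next node.
type block struct {
	buf  []byte
	refs int32
	next *block
}

var blockPool = sync.Pool{
	New: func() interface{} { return &block{buf: make([]byte, 0, blockSize)} },
}

type linkBuffer struct {
	head *block // read side: advances and releases blocks it has consumed
	tail *block // write side: appends new blocks, never copies old data
}

// Malloc grows the buffer by appending a fresh block after the tail pointer,
// so expansion never copies data that has already been written.
func (b *linkBuffer) Malloc(n int) []byte {
	blk := blockPool.Get().(*block)
	blk.refs = 1
	if n > cap(blk.buf) {
		blk.buf = make([]byte, n) // oversized request: dedicated array
	} else {
		blk.buf = blk.buf[:n]
	}
	if b.tail != nil {
		b.tail.next = blk
	} else {
		b.head = blk
	}
	b.tail = blk
	return blk.buf
}

// release drops one reference; when no reader or slice references the block
// any more, it is reset and returned to the pool. Releasing consumed blocks
// at the head pointer is how the buffer "shrinks" without copying.
func (blk *block) release() {
	if atomic.AddInt32(&blk.refs, -1) == 0 {
		blk.next = nil
		blk.buf = blk.buf[:0]
		blockPool.Put(blk)
	}
}
```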

Connection Multiplexing

RPC calls usually use short connections or long-connection pools, where each call is bound to one connection at a time. When the upstream and downstream scale is large, the number of connections in the network grows at a rate of M x N, bringing enormous scheduling pressure and computational overhead and making service governance difficult. We therefore wanted to introduce a way of "processing calls in parallel on a single persistent connection" to reduce the number of connections in the network; this scheme is called "connection multiplexing".

There are already some open-source connection multiplexing solutions in the industry, but constrained by their code layers, these solutions all need to copy buffers to split and merge data packets, so their actual performance is unsatisfactory. The Nocopy Buffer described above, thanks to its flexible slicing and splicing, supports splitting and merging packets without copies, making a high-performance connection multiplexing scheme possible.

The connection multiplexing design based on netpoll is shown in the figure below. We abstract the Nocopy Buffer (and its slices) as virtual connections, so that the upper-layer code keeps the same calling experience as with net.Conn. At the same time, in the underlying code, data on the real connection is flexibly dispatched to virtual connections through protocol-based packet splitting, or virtual-connection data is merged and sent out through protocol encoding.

[Figure: connection multiplexing based on netpoll]

The connection multiplexing scheme contains the following core elements:

  1. Virtual connection

    • Essentially a Nocopy Buffer; its purpose is to replace the real connection and avoid memory copies.
    • The upper layer's business logic and encoding/decoding are completed on the virtual connection, so upper-layer logic can execute asynchronously, independently, and in parallel.
  2. Shared map

    • Shard locks are introduced to reduce lock contention.
    • The caller marks each request with a sequence id and stores the callback corresponding to that id under the shard lock.
    • When response data is received, the corresponding callback is looked up by sequence id and executed (a sketch of such a sharded callback map follows this list).
  3. Protocol packet splitting and encoding

    • Identifying complete request and response packets is the key to the feasibility of connection multiplexing, so a protocol must be introduced.
    • We use the thrift header protocol: packet completeness is judged from the message header, and the correspondence between request and response is marked by the sequence id.
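To illustrate the shared-map element, here is a sketch of a sharded callback table keyed by sequence id, as mentioned in item 2. The shard count and names are illustrative assumptions rather than the framework's real implementation:

```go
package mux

import "sync"

const shardCount = 32 // illustrative shard count

type callback func(response []byte)

type shard struct {
	mu    sync.Mutex
	calls map[uint32]callback
}

type callMap struct {
	shards [shardCount]shard
}

func newCallMap() *callMap {
	m := &callMap{}
	for i := range m.shards {
		m.shards[i].calls = make(map[uint32]callback)
	}
	return m
}

func (m *callMap) shardOf(seqID uint32) *shard {
	return &m.shards[seqID%shardCount]
}

// Register stores the callback before the request is written onto the shared
// connection; only one shard is locked, which reduces contention.
func (m *callMap) Register(seqID uint32, cb callback) {
	s := m.shardOf(seqID)
	s.mu.Lock()
	s.calls[seqID] = cb
	s.mu.Unlock()
}

// Dispatch is called when a complete response frame has been decoded from
// the real connection (e.g. by parsing the thrift header and its sequence id).
func (m *callMap) Dispatch(seqID uint32, resp []byte) {
	s := m.shardOf(seqID)
	s.mu.Lock()
	cb, ok := s.calls[seqID]
	delete(s.calls, seqID)
	s.mu.Unlock()
	if ok {
		cb(resp)
	}
}
```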

ZeroCopy

The ZeroCopy discussed here refers to the ZeroCopy capability provided by Linux; the previous chapter covered zero copy at the business layer. As we all know, when the sendmsg system call is used to send a packet, a copy of the data is still made, and the cost of this copy is very noticeable in large-packet scenarios. Taking 100M packets as an example, perf shows the following:

[Figure: perf output showing copy overhead for ordinary tcp packets]

And this is only the copy overhead for ordinary tcp packets. In our scenario, most services are connected to Service Mesh, so a single packet actually involves three copies in total: business process to kernel, kernel to sidecar, and sidecar back to kernel. This makes the CPU usage caused by copying especially noticeable for businesses that send large packets, as shown in the figure below:

[Figure: CPU usage caused by copying under Service Mesh]

To solve this problem, we chose to use the ZeroCopy API provided by Linux (send is supported since 4.14; receive since 5.4). But this introduces an additional engineering problem: the ZeroCopy send API is not compatible with the original calling convention, and the two do not coexist well. Briefly, ZeroCopy send works as follows: after the business process calls sendmsg, sendmsg records the addresses of the iovec and returns immediately; at this point the business process must not release that memory, and has to wait until the kernel signals, via epoll, that a certain iovec segment has been sent successfully before releasing it. Since we did not want to change how the business side uses the framework, and we need to provide the upper layer with a synchronous send/receive interface, it is difficult to offer both ZeroCopy and non-ZeroCopy abstractions on top of the existing API; and because ZeroCopy loses performance in small-packet scenarios, it cannot be made the default option.
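For reference, the following sketch shows the standard Linux MSG_ZEROCOPY flow described above (kernel 4.14+ for TCP send), using golang.org/x/sys/unix. It deliberately omits error-queue batching and epoll-driven completion handling, and it is not the synchronous interface provided by the ByteDance kernel team that is described next:

```go
package zerocopy

import "golang.org/x/sys/unix"

func zeroCopySend(fd int, data []byte) error {
	// Opt the socket in to zero-copy transmission.
	if err := unix.SetsockoptInt(fd, unix.SOL_SOCKET, unix.SO_ZEROCOPY, 1); err != nil {
		return err
	}

	// Send with MSG_ZEROCOPY: the kernel pins `data` instead of copying it,
	// so the buffer must not be reused or freed yet.
	if err := unix.Sendmsg(fd, data, nil, nil, unix.MSG_ZEROCOPY); err != nil {
		return err
	}

	// The completion notification arrives later on the socket error queue;
	// only after it has been read may `data` be recycled.
	oob := make([]byte, 128)
	for {
		_, _, _, _, err := unix.Recvmsg(fd, nil, oob, unix.MSG_ERRQUEUE)
		if err == unix.EAGAIN {
			continue // in real code this would be driven by epoll, not polled
		}
		return err
	}
}
```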

Therefore, the ByteDance framework team cooperated with the ByteDance kernel team, and the kernel team provided a synchronous interface: when sendmsg is called, the kernel monitors and intercepts the completion callback that it would originally have delivered to the business process, and lets sendmsg return only after that callback has completed. This allows us to adopt ZeroCopy send easily, without changing the original model. In addition, the ByteDance kernel team implemented ZeroCopy over unix domain sockets, so that communication between the business process and the Mesh sidecar can also be zero copy.

After enabling ZeroCopy send, perf shows that the kernel no longer spends time on copying:

[Figure: perf output after enabling ZeroCopy send]

Judging from the CPU usage, ZeroCopy saves half of the CPU in large-packet scenarios compared with non-ZeroCopy.

Latency Problems Caused by Go Scheduling

In practice, we found that although our newly written netpoll outperformed the Go native net library in average latency, its p99 and max latencies were generally slightly higher, and the spikes were more obvious, as shown in the figure below (Go 1.13; blue is netpoll + multiplexing, green is netpoll + long connections, yellow is the net library + long connections):

[Figure: latency comparison of netpoll and the net library]

We tried many optimizations with little success. In the end, we figured out that the latency was caused not by the overhead of netpoll itself but by go's scheduling. For example:

  1. Since the SubReactor in netpoll is itself a goroutine, it is affected by scheduling and cannot be guaranteed to run immediately after EpollWait returns, so this stage incurs delay;
  2. At the same time, because the SubReactors that handle I/O events and the MainReactor that handles connection listening are also goroutines, it is actually hard to guarantee that these Reactors execute in parallel on multiple cores; in the most extreme case, these Reactors may all hang under the same P and end up executing serially, failing to take full advantage of multiple cores;
  3. Since I/O events are processed serially within a SubReactor after EpollWait returns, the last event may suffer from a long-tail problem;
  4. In the connection multiplexing scenario, since each connection is bound to one SubReactor, latency depends entirely on the scheduling of that SubReactor, making the spikes more obvious. Because the Go runtime contains special optimizations for the net library, the net library does not suffer from the situations above; in addition, the net library uses a goroutine-per-connection model, which ensures that requests can execute in parallel without affecting one another.

For the above problems, we currently have two ideas:

  1. Modify the Go runtime source code, register a callback in the runtime, call EpollWait on every scheduling round, and pass the obtained fds to the callback for execution;
  2. Cooperate with the ByteDance kernel team to support batch reads/writes across multiple connections at once, solving the serial-execution problem. In addition, in our tests Go 1.14 makes the latency slightly lower and more stable, but the peak QPS it can reach is lower. We hope these ideas can provide some reference for students in the industry who have also run into this problem.

Postscript

We hope the sharing above is helpful to the community. At the same time, we are accelerating the construction of netpoll and of Kitex, the new framework built on netpoll. All interested students are welcome to join us and build the Go language ecosystem together!
