Learning Java & Netty performance tuning through an HTTP/2 protocol case study: tools, techniques and methodologies

Authors: Liang Beining (Apache Dubbo Contributor), Chen Youwei (Apache Dubbo PMC)

Summary

The Dubbo3 Triple protocol is designed with reference to gRPC, gRPC-Web, Dubbo2 and other protocols. It absorbs the strengths of each of them, is fully compatible with gRPC, supports Streaming communication, and works seamlessly over HTTP/1 and in browsers.

When you use the Triple protocol in the Dubbo framework, you can directly use the Dubbo client, the gRPC client, curl, a browser, etc. to access the services you publish, without any additional components or configuration.

In addition to ease of use, Dubbo3 Triple has put a lot of work into performance tuning. This article focuses on an in-depth explanation of the high-performance secrets behind the Triple protocol, covering some valuable performance tuning tools, techniques and code implementations. In the next article, we will expand on specific usage scenarios of the Triple protocol from the ease-of-use perspective.

Why optimize the performance of the Triple protocol?

Since 2021, Dubbo3 has gradually begun to replace the widely used HSF framework inside Alibaba as the next-generation service framework. So far, most of Alibaba's core applications, represented by e-commerce businesses such as Taobao and Tmall, have been successfully upgraded to Dubbo3. As the key framework supporting Alibaba's trillion-level Double 11 service calls over the past two years, the performance of the Triple communication protocol directly affects the operating efficiency of the entire system.

Pre-knowledge

1. Introduction to Triple protocol

The Triple protocol is designed with reference to the gRPC and gRPC-Web protocols. It absorbs the characteristics and advantages of both and integrates them into a protocol that is fully compatible with gRPC and supports Streaming communication. Triple also supports both HTTP/1 and HTTP/2.

The design goals of the Triple protocol are as follows:

  • Triple is designed as an HTTP-based protocol that is friendly to humans, development and debugging, especially for unary type RPC requests.
  • It is fully compatible with the HTTP/2-based gRPC protocol, so the implementation of the Dubbo Triple protocol can be 100% interoperable with the gRPC system.

When you use the Triple protocol in the Dubbo framework, you can directly use the Dubbo client, the gRPC client, curl, a browser, etc. to access the services you publish.

The following is an example of using the curl client to access a Triple protocol service on the Dubbo server:

curl \
  --header "Content-Type: application/json" \
  --data '{"sentence": "Hello Dubbo."}' \
  https://host:port/org.apache.dubbo.sample.GreetService/sayHello

In terms of implementation, Dubbo Triple supports Protocol Buffers but is not bound to it. For example, Dubbo Java supports defining Triple services with a plain Java interface, which is easier for developers who care about ease of use in a specific language. In addition, Dubbo currently provides Java, Go and Rust implementations, and implementations in languages such as Node.js are in progress. We plan to connect mobile, browser and back-end microservice systems through multi-language support and the Triple protocol.

The core components in the implementation of Triple are as follows:

TripleInvoker is one of the core components of the Triple protocol, used by the client to invoke a Triple protocol server. Its core method is doInvoke, which initiates different kinds of requests according to the call type, such as UNARY and BiStream. For example, UNARY under SYNC is invoked synchronously: one request corresponds to one response. BiStream is bidirectional communication: the client can keep sending requests while the server keeps pushing messages, and the two sides interact through callbacks on the StreamObserver component.

TripleClientStream is another core component of the Triple protocol, corresponding to the Stream concept in HTTP/2. Every time a new request is initiated, a new TripleClientStream is created, along with a new corresponding HTTP/2 Stream. The core methods provided by TripleClientStream are sendHeader, used to send the Header Frame, and sendMessage, used to send the Data Frame.

WriteQueue is a buffer queue for writing messages in the Triple protocol. Its core logic is to add operation commands (QueueCommand) to an internally maintained queue and submit the corresponding tasks to Netty's EventLoop thread, where they are executed single-threaded and in order.

QueueCommand is an abstract class for submitting tasks to WriteQueue. Different Commands correspond to different execution logics.

TripleServerStream is the server-side Stream abstraction in the Triple protocol, corresponding to the Stream concept in HTTP/2. Every time the client initiates a request on a new Stream, the server creates a corresponding TripleServerStream to process the request information sent by the client.
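As background for the StreamObserver callback interaction mentioned above, the following is a rough sketch of what the observer contract and its use in a BiStream call look like. It is written in the generic gRPC style; the interface shape and the usage code are illustrative, not Dubbo's actual API.

```java
// Generic observer contract in the gRPC/Dubbo style: each message of a stream is
// delivered through onNext, and the stream ends with either onError or onCompleted.
public interface StreamObserver<T> {
    void onNext(T value);        // one message of the stream
    void onError(Throwable t);   // the stream terminated with an error
    void onCompleted();          // the stream completed normally
}

// Hypothetical client-side BiStream usage: the client pushes requests through the
// request observer while the server pushes responses into the response observer.
class BiStreamUsageSketch {
    void run(StreamObserver<String> requestObserver) {
        requestObserver.onNext("hello");
        requestObserver.onNext("dubbo");
        requestObserver.onCompleted();
    }

    StreamObserver<String> responseObserver() {
        return new StreamObserver<String>() {
            @Override public void onNext(String value) { System.out.println("server pushed: " + value); }
            @Override public void onError(Throwable t) { t.printStackTrace(); }
            @Override public void onCompleted() { System.out.println("stream completed"); }
        };
    }
}
```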

2. HTTP/2

HTTP/2 is the new generation of the HTTP protocol and the successor to HTTP/1.1. Compared with HTTP/1.1, the biggest improvements of HTTP/2 are lower resource consumption and higher performance. In HTTP/1.1, a browser can only have one request in flight at a time on a given TCP connection. If the browser needs to load multiple resources, it has to open multiple TCP connections. This causes problems: establishing and tearing down TCP connections adds network latency, and the browser sending many requests at the same time can cause network congestion.

In contrast, HTTP/2 allows the browser to send multiple requests concurrently over one TCP connection. Multiple requests correspond to multiple Streams, which are independent of each other and flow in parallel. Within each stream, a request is split into multiple frames, and these frames flow serially within the same stream, strictly preserving frame order. The client can therefore send multiple requests in parallel and the server can send multiple responses in parallel, which reduces the number of network connections, lowers network latency and improves performance.

HTTP/2 also supports server push, which means that the server can preload resources before the browser requests them. For example, if the server knows that the browser is going to request a particular resource, the server can push that resource to the browser before the browser requests it. This helps performance because the browser doesn't have to wait for resources to be requested and responded to.

HTTP/2 also supports header compression, which means that repeated information in HTTP headers can be compressed. This helps reduce network bandwidth usage.

3. Netty

Netty is a high-performance asynchronous event-driven network framework, mainly used for rapid development of maintainable high-performance protocol servers and clients. Its main features are ease of use, strong flexibility, high performance, and good scalability. Netty uses NIO as the basis, which can easily realize asynchronous and non-blocking network programming, and supports TCP, UDP, HTTP, SMTP, WebSocket, SSL and other protocols. The core components of Netty include Channel, EventLoop, ChannelHandler and ChannelPipeline.

Channel is a bidirectional conduit for transferring data and is used to handle network I/O operations. Netty's Channel wraps the underlying Java NIO channel and adds functionality on top of it, such as asynchronous close, binding multiple local addresses, binding multiple event handlers, and so on.

EventLoop is one of the core components of Netty, which is responsible for handling all I/O events and tasks. An EventLoop can manage multiple Channels, and each Channel has a corresponding EventLoop. EventLoop uses a single-threaded model to process events, avoiding competition between threads and the use of locks, thereby improving performance.

ChannelHandler is a processor connected to ChannelPipeline, which can handle inbound and outbound data, such as encoding, decoding, encryption, decryption, and so on. A Channel can have multiple ChannelHandlers, and the ChannelPipeline will call them in order of addition to process data.

ChannelPipeline is another core component of Netty, which is a set of sequentially connected ChannelHandlers for processing inbound and outbound data. Each Channel has its own exclusive ChannelPipeline. When data enters or leaves a Channel, it will pass through all ChannelHandlers, and they will complete the processing logic.
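To show how these components fit together, here is a minimal, generic Netty client bootstrap (a standalone illustration, unrelated to Dubbo's actual code): an EventLoopGroup drives the Channel, and the ChannelPipeline is assembled from ChannelHandlers that are invoked in the order they were added.

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.string.StringDecoder;
import io.netty.handler.codec.string.StringEncoder;

public class NettyClientExample {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup(); // EventLoops that drive all I/O
        try {
            Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // Handlers are invoked in the order they are added to the pipeline.
                        ch.pipeline()
                          .addLast(new StringDecoder())   // inbound: bytes -> String
                          .addLast(new StringEncoder())   // outbound: String -> bytes
                          .addLast(new SimpleChannelInboundHandler<String>() {
                              @Override
                              protected void channelRead0(ChannelHandlerContext ctx, String msg) {
                                  System.out.println("received: " + msg);
                              }
                          });
                    }
                });
            Channel channel = bootstrap.connect("127.0.0.1", 8080).sync().channel();
            channel.writeAndFlush("hello");
            channel.closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
```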

Tool preparation

In order to tune the code, we need some tools to find the performance bottleneck of the Triple protocol, such as blocking and hotspot methods. The tools used in this tuning mainly include VisualVM and JFR.

VisualVM

VisualVM is a graphical tool that can monitor the performance and memory usage of local and remote Java virtual machines. It is an open source project that can be used to identify and solve performance problems in Java applications.

VisualVM can display the health of the Java Virtual Machine, including CPU usage, thread count, memory usage, garbage collection, and more. It can also display the CPU usage and stack trace of each thread to help identify bottlenecks.

VisualVM can also analyze heap dump files to identify memory leaks and other memory usage issues. It can show the size, references and type of objects, as well as the relationships between objects.

VisualVM can also monitor the performance of the application at runtime, including the number of method calls, time consumption, exceptions, etc. It can also generate snapshots of CPU and memory usage for further analysis and optimization.

JFR

The full name of JFR is Java Flight Recorder, which is a performance analysis tool provided by JDK. JFR is a lightweight, low-overhead event recorder that can be used to record various events, including thread lifecycle, garbage collection, class loading, lock competition, and more. Data from JFR can be used to analyze application performance bottlenecks and identify issues such as memory leaks. Compared with other performance analysis tools, JFR is characterized by its very low overhead, and the recording can be turned on all the time without affecting the performance of the application itself.

The use of JFR is very simple. You only need to add the startup parameters -XX:+UnlockCommercialFeatures -XX:+FlightRecorder when starting the JVM to enable the recording function of JFR. When the JVM is running, JFR automatically records various events and saves them to a file. After recording, we can use the tool JDK Mission Control to analyze the data. For example, we can view CPU usage, memory usage, number of threads, lock competition, and more. JFR also provides some advanced features, such as event filtering, custom events, event stack traces, and more.

In this performance tuning, we focus on events in Java that can significantly affect performance: Monitor Blocked, Monitor Wait, Thread Park, Thread Sleep.

  • The Monitor Blocked event is triggered by a synchronized block, indicating that a thread has entered a synchronized code block
  • The Monitor Wait event is triggered by Object.wait, indicating that some code has called that method
  • The Thread Park event is triggered by LockSupport.park, indicating that a thread has been suspended
  • The Thread Sleep event is triggered by Thread.sleep(), indicating that the method is called manually somewhere in the code

Tuning ideas

1. Non-blocking

One of the keys to high performance is that the code must be non-blocking. Calling methods such as sleep or await blocks the thread and directly hurts the performance of the program, so they should be avoided as much as possible: prefer non-blocking APIs over blocking ones.

2. Asynchronous

Asynchrony is another key tuning idea. In the code we can use asynchronous programming, such as the CompletableFuture introduced in Java 8. The advantage is that it avoids blocking threads, thereby improving the performance of the program.
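As a small illustration of this idea (a generic sketch, not Dubbo code), the slow call below runs on a pool and the caller registers callbacks instead of blocking on the result:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncExample {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // The slow call runs on the pool; the caller registers callbacks instead of blocking.
        CompletableFuture
            .supplyAsync(AsyncExample::slowRemoteCall, pool)
            .thenApply(String::toUpperCase)                 // transform the result asynchronously
            .whenComplete((result, error) -> {
                if (error != null) {
                    error.printStackTrace();
                } else {
                    System.out.println("response: " + result);
                }
                pool.shutdown();
            });

        System.out.println("request submitted, caller thread is free");
    }

    private static String slowRemoteCall() {
        try {
            Thread.sleep(100); // simulate network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "hello dubbo";
    }
}
```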

3. Partition

In the tuning process, divide and conquer is also a very important idea. For example, a large task can be decomposed into several small tasks, and then multi-threaded parallelism can be used to process these tasks. The advantage of doing this is that it can improve the parallelism of the program, so as to fully utilize the performance of the multi-core CPU and achieve the purpose of optimizing performance.
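A generic sketch of the idea (illustrative only, not Dubbo code): split a large array into chunks and sum them in parallel on a fixed thread pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class PartitionExample {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        int parallelism = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        int chunk = data.length / parallelism;

        // Decompose the big task into one summing task per chunk.
        List<Callable<Long>> tasks = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            int from = p * chunk;
            int to = (p == parallelism - 1) ? data.length : from + chunk;
            tasks.add(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            });
        }

        long total = 0;
        for (Future<Long> f : pool.invokeAll(tasks)) {  // run chunks in parallel
            total += f.get();
        }
        pool.shutdown();
        System.out.println("sum = " + total);
    }
}
```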

4. Batch

In tuning ideas, batching is also a very important idea. For example, multiple small requests can be combined into one large request, and then sent to the server at one time, which can reduce the number of network requests, thereby reducing network delay and improving performance. In addition, batch processing can also be used when processing a large amount of data, such as reading a batch of data into the memory at one time and then processing it, which can reduce the number of IO operations and improve program performance.

The cornerstone of high performance: non-blocking

Unreasonable syncUninterruptibly

By inspecting the code directly, we found a call to syncUninterruptibly, a method that obviously blocks the current thread. Debugging makes it easy to confirm that this code is executed on the user thread. The source is as follows.

private WriteQueue createWriteQueue(Channel parent) {
  final Http2StreamChannelBootstrap bootstrap = new Http2StreamChannelBootstrap(parent);
  final Future<Http2StreamChannel> future = bootstrap.open().syncUninterruptibly();
  if (!future.isSuccess()) {
    throw new IllegalStateException("Create remote stream failed. channel:" + parent);
  }
  final Http2StreamChannel channel = future.getNow();
  channel.pipeline()
    .addLast(new TripleCommandOutBoundHandler())
    .addLast(new TripleHttp2ClientResponseHandler(createTransportListener()));
  channel.closeFuture()
    .addListener(f -> transportException(f.cause()));
  return new WriteQueue(channel);
}

The code logic here is as follows:

  • Construct Http2StreamChannelBootstrap through TCP Channel
  • Get Future by calling the open method of Http2StreamChannelBootstrap
  • Wait for the Http2StreamChannel construction to complete by calling the syncUninterruptibly blocking method
  • Get the Http2StreamChannel and then construct its corresponding ChannelPipeline

In the pre-knowledge section we mentioned that most tasks in Netty are executed single-threaded on the EventLoop thread. Likewise, when the user thread calls open, the task of creating the HTTP/2 Stream Channel is submitted to the EventLoop, and the user thread is blocked until the task completes once syncUninterruptibly is called.

The submitted task is only placed into a task queue rather than executed immediately, because at that moment the EventLoop may still be busy with socket read/write or other tasks. After submission, it is therefore likely that other tasks occupy the EventLoop, delaying the execution of the Http2StreamChannel creation task and lengthening the time the user thread stays blocked.

Looking at the overall flow of a request, the user thread is blocked once before the Stream Channel is created, and then blocked again waiting for the response after the request is actually sent. A single UNARY request thus contains two obvious blocking operations, which greatly restricts the performance of the Triple protocol, so we can boldly hypothesize that the first blocking is unnecessary. To confirm this inference, we sampled with VisualVM and analyzed, among the hotspots, the time spent creating the Stream Channel. Below are the sampling results on the Triple consumer side.

[Figure: VisualVM CPU hotspot sampling of the Triple consumer side]

From the figure we can see that HttpStreamChannelBootstrap$1.run, the method that creates the StreamChannel, accounts for a large proportion of the EventLoop's total time. Expanding it shows that the time is almost entirely spent in notifyAll, that is, in waking up the user thread.

Optimization

So far we know that one of the performance barriers is creating the StreamChannel, so the optimization is to create the StreamChannel asynchronously and eliminate the call to syncUninterruptibly. The modified code is shown below: the task of creating a StreamChannel is abstracted into a CreateStreamQueueCommand and submitted to the WriteQueue. Subsequent sendHeader and sendMessage requests are also submitted to the WriteQueue, which easily guarantees that the tasks that send the request execute only after the Stream has been created.

private TripleStreamChannelFuture initHttp2StreamChannel(Channel parent) {
    TripleStreamChannelFuture streamChannelFuture = new TripleStreamChannelFuture(parent);
    Http2StreamChannelBootstrap bootstrap = new Http2StreamChannelBootstrap(parent);
    bootstrap.handler(new ChannelInboundHandlerAdapter() {
            @Override
            public void handlerAdded(ChannelHandlerContext ctx) throws Exception {
                Channel channel = ctx.channel();
                channel.pipeline().addLast(new TripleCommandOutBoundHandler());
                channel.pipeline().addLast(new TripleHttp2ClientResponseHandler(createTransportListener()));
                channel.closeFuture().addListener(f -> transportException(f.cause()));
            }
        });
    CreateStreamQueueCommand cmd = CreateStreamQueueCommand.create(bootstrap, streamChannelFuture);
    this.writeQueue.enqueue(cmd);
    return streamChannelFuture;
}

The core logic of CreateStreamQueueCommand is as follows; by ensuring that it runs on the EventLoop, the unreasonable blocking call is eliminated.

public class CreateStreamQueueCommand extends QueuedCommand {
    // ... fields and constructor omitted ...
    @Override
    public void run(Channel channel) {
        // This logic is guaranteed to run on the EventLoop, so the result of open() can be read directly without blocking
        Future<Http2StreamChannel> future = bootstrap.open();
        if (future.isSuccess()) {
            streamChannelFuture.complete(future.getNow());
        } else {
            streamChannelFuture.completeExceptionally(future.cause());
        }
    }
}

Improperly synchronized lock contention

At this point, simply reading the source code no longer reveals obvious performance bottlenecks. Next, we need to use VisualVM to find them.

After opening the tool, we select the process to profile (here the Triple consumer process), choose the Sampler tab, and click CPU to start sampling CPU hotspot methods. Below are the CPU hotspot sampling results, with the call stack of the most expensive EventLoop thread expanded.

[Figure: VisualVM CPU hotspot sampling with the EventLoop thread call stack expanded]

After expanding layer by layer, we find a method with a surprisingly large share of the time: ensureWriteOpen. Doesn't this method name suggest it merely checks whether the socket is writable? Why does it account for so much time? Puzzled, we opened the isConnected method of sun.nio.ch.SocketChannelImpl in JDK 8; the code is as follows.

public boolean isConnected() {
  synchronized (stateLock) {
    return (state == ST_CONNECTED);
  }
}

As can be seen, this method contains no logic, yet it carries the synchronized keyword, so we can conclude that there is heavy lock contention on the EventLoop thread! The next step is to find out which other code is contending for the same lock. Our approach is simple and crude: locate it with a conditional DEBUG breakpoint. As shown in the figure below, we set a conditional breakpoint inside isConnected, with the condition that the current thread is not an EventLoop thread.

[Figure: conditional breakpoint set in SocketChannelImpl.isConnected]

After setting the breakpoint, we start the application and initiate a request. The call stack clearly shows TripleInvoker.isAvailable invoking sun.nio.ch.SocketChannelImpl.isConnected, which is what produces the time-consuming lock contention on the EventLoop thread.

[Figure: call stack showing TripleInvoker.isAvailable reaching SocketChannelImpl.isConnected]

Optimization

With the analysis above, the next change is clear: modify the logic of isAvailable to maintain a boolean value that indicates availability, eliminating the lock contention and improving the performance of the Triple protocol.
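A minimal sketch of the idea (simplified and hypothetical, not the actual Dubbo change): maintain a volatile flag that is flipped from the channel lifecycle callbacks, so the availability check becomes a plain field read with no lock involved.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical handler: flips a volatile flag on channelActive/channelInactive so
// availability checks no longer touch SocketChannelImpl.isConnected (and its lock).
public class ConnectivityStateHandler extends ChannelInboundHandlerAdapter {

    private volatile boolean available;

    public boolean isAvailable() {
        return available;   // plain volatile read, no synchronized block
    }

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        available = true;
        super.channelActive(ctx);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        available = false;
        super.channelInactive(ctx);
    }
}
```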

Non-negligible overhead: thread context switching

We continue to examine the snapshots sampled by VisualVM to see how time is spent across the threads overall, as shown in the figure below:

[Figure: VisualVM thread sampling snapshot during the benchmark]

From the figure we can extract the following information:

  • The most time-consuming thread is NettyClientWorker-2-1
  • During the pressure test, there are a large number of non-consumer threads, namely tri-protocol-214783647-thread-xxx
  • The overall time consumption of consumer threads is high and the number of threads is large
  • The time consumption of user threads is very low

Expanding any one of the consumer threads also shows that it mainly performs deserialization and delivers the deserialized results (DeadlineFuture.received), as shown in the following figure:

[Figure: consumer thread call stack, dominated by deserialization and DeadlineFuture.received]

From the information above, no bottleneck is apparent yet. Next, we try to monitor the process with JFR (Java Flight Recorder). The figure below shows the JFR analysis.

[Figure: JFR recording overview]

1. Monitor Blocked event

We can first check JFR's brief analysis and click Java Blocking to view possible blocking points. This event indicates that a thread has entered a synchronized code block; the result is shown in the figure below.

[Figure: JFR Java Blocking view]

You can see one class whose total blocking time amounts to 39 seconds. Clicking on it shows, in the Thread column, that all the blocked threads are the benchmark threads that send requests. Looking further down at the method stack displayed in the Flame View, it is clear they are simply waiting for the response result, so this blocking is necessary and the blocking point can be ignored.

Then click Event Browser in the left menu to view the event logs collected by JFR, and filter the event types whose names start with java. We first look at the Java Monitor Blocked event; the result is shown in the figure below.

[Figure: Java Monitor Blocked events]

It can be seen that the blocked threads are all the threads that initiated the request from the benchmark, and the blocked point is only waiting for the response, so this event can be ruled out.

2. Monitor Wait event

Continue to look at the Java Monitor Wait event. Monitor Wait indicates that some code calls the Object.wait method, and the result is shown in the figure below.

[Figure: Java Monitor Wait events]

From the figure above we can see that the benchmark request threads are all blocked, the average wait is about 87 ms, the blocked object is the same DefaultPromise, and the method where the blocking occurs is Connection.isAvailable. Looking at the source of this method (below), the time spent here is only the cost of establishing the connection for the first time, so it does not have much impact on overall performance. The Java Monitor Wait event can therefore also be ruled out.

public boolean isAvailable() {
  if (isClosed()) {
    return false;
  }
  Channel channel = getChannel();
  if (channel != null && channel.isActive()) {
    return true;
  }
  if (init.compareAndSet(false, true)) {
    connect();
  }

  this.createConnectingPromise();
  // the ~87 ms wait comes from here
  this.connectingPromise.awaitUninterruptibly(this.connectTimeout, TimeUnit.MILLISECONDS);
  // destroy connectingPromise after used
  synchronized (this) {
    this.connectingPromise = null;
  }

  channel = getChannel();
  return channel != null && channel.isActive();
}

3. Thread Sleep event

Next, let's look at the Java Thread Sleep event, which indicates that Thread.sleep is called manually somewhere in the code, to check whether any worker thread is being blocked. As can be seen from the figure below, neither the consumer threads nor the benchmark request threads are blocked; the thread that actively calls sleep is used for the request-timeout scenario and has no impact on overall performance. Java Thread Sleep events can also be ruled out.

[Figure: Java Thread Sleep events]

4. Thread Park event

Finally, let's look at the Java Thread Park event. The park event indicates that the thread is suspended. The figure below is a list of park events.

[Figure: Java Thread Park events]

It can be seen that there are 1877 park events, most of them on threads of the consumer thread pool. The method stacks in the flame graph show that these threads are all waiting for tasks and spend a long time without fetching any. This tells us one thing: most of the threads in the consumer thread pool are not executing tasks, and the utilization of the consumer thread pool is very low.

To increase thread pool utilization, we could reduce the number of threads in the consumer pool, but in Dubbo the consumer thread pool cannot simply be shrunk. Instead, in the UNARY scenario we wrap the consumer thread pool in a SerializingExecutor, which executes submitted tasks serially and thus reduces the effective pool size in disguise (a minimal sketch of such an executor follows this paragraph). The results after the change are shown below.
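Here is a minimal sketch of what such a serializing executor looks like (simplified; Dubbo's actual SerializingExecutor has more bookkeeping): tasks are queued and drained one at a time on the underlying pool, so at most one task submitted through this wrapper runs at any moment.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

public class SerializingExecutor implements Executor {

    private final Executor delegate;                       // the real consumer thread pool
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean();

    public SerializingExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        tasks.offer(task);
        // Only one drain loop is active at a time, so queued tasks run serially.
        if (running.compareAndSet(false, true)) {
            delegate.execute(this::drain);
        }
    }

    private void drain() {
        try {
            Runnable task;
            while ((task = tasks.poll()) != null) {
                task.run();
            }
        } finally {
            running.set(false);
            // If a task slipped in after the loop exited, schedule another drain.
            if (!tasks.isEmpty() && running.compareAndSet(false, true)) {
                delegate.execute(this::drain);
            }
        }
    }
}
```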

[Figures: thread sampling and JFR results after introducing SerializingExecutor]

From the results above, the number of busy consumer threads has dropped sharply, thread utilization has improved greatly, Java Thread Park events have also been greatly reduced, and performance has increased by about 13%.

This shows that thread context switching has a large impact on program performance. But it raises another question: is it reasonable to concentrate most of the logic on a small number of consumer threads through the SerializingExecutor? With this question in mind, we expand the call stack of one of the consumer threads for analysis and see the word deserialize (as shown in the figure below).

[Figure: consumer thread call stack showing deserialization]

Clearly, although performance improved, we are now concentrating the deserialization of the response bodies of different requests on a small number of consumer threads, so deserialization is executed "serially"; when large messages are deserialized, the time consumption increases significantly.

So can we find a way to redistribute the deserialization logic across multiple threads for parallel processing? With this question in mind, we first sort out the current thread interaction model, as shown in the figure below.

[Figure: current thread interaction model]

Based on the thread interaction diagram above and the "one request, one response" nature of UNARY SYNC, we can boldly infer that the ConsumerThread is not necessary at all! We can hand all non-I/O tasks directly to the user thread, which makes effective use of multi-threading for parallel processing and also greatly reduces unnecessary thread context switches. The ideal thread interaction model is shown in the figure below.

[Figure: optimized thread interaction model without the consumer thread pool]

5. Optimization scheme

After sorting out the thread interaction model, the change is straightforward. According to the TripleClientStream source code, whenever a response is received, the I/O thread submits the task to the callback executor bound to the TripleClientStream; by default this is the consumer thread pool, so we only need to replace it with a ThreadlessExecutor. The change is as follows:

[Figure: change replacing the default callback executor with ThreadlessExecutor]
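The key property of a ThreadlessExecutor is that it owns no threads: tasks submitted by the I/O thread are queued, and the user thread that is already waiting for the UNARY response drains and runs them itself. The following is a minimal sketch of the idea (simplified and hypothetical, not Dubbo's actual implementation):

```java
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch: the waiting user thread calls waitAndDrain(), so callbacks such
// as response deserialization run on the user thread itself instead of on a consumer
// thread pool, avoiding an extra context switch.
public class ThreadlessExecutorSketch implements Executor {

    private final LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    @Override
    public void execute(Runnable task) {
        queue.offer(task);          // called by the Netty I/O thread
    }

    /** Called by the user thread that is blocked waiting for the UNARY response. */
    public void waitAndDrain() throws InterruptedException {
        Runnable task = queue.take();   // block until the I/O thread submits the callback
        while (task != null) {
            task.run();
            task = queue.poll();        // run any remaining tasks without blocking
        }
    }
}
```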

A great tool for reducing I/O: batching

We introduced earlier that the triple protocol is implemented on top of HTTP/2 and is fully compatible with gRPC, so gRPC is a good reference. We compared triple with gRPC in the same environment, differing only in the protocol, and the result is that there is a clear performance gap between triple and gRPC. Where does the difference come from? With this question in mind, we continued to stress test the two and captured packets of both with tcpdump; the results are as follows.

triple

[Figure: tcpdump capture of triple]

gRPC

[Figure: tcpdump capture of gRPC]

From the results above we can see that the packet captures of gRPC and triple differ greatly. In gRPC, data from many different streams is sent at a single point in time, whereas triple shows a very regular "one request, one response" pattern. We can therefore boldly guess that gRPC's implementation batches its writes, sending a group of packets as a whole and greatly reducing the number of I/O operations. To verify this conjecture, we need to dig into the gRPC source code. The batching turns out to be implemented in gRPC's WriteQueue, and its core fragment is as follows:

private void flush() {
  PerfMark.startTask("WriteQueue.periodicFlush");
  try {
    QueuedCommand cmd;
    int i = 0;
    boolean flushedOnce = false;
    while ((cmd = queue.poll()) != null) {
      cmd.run(channel);
      if (++i == DEQUE_CHUNK_SIZE) {
        i = 0;
        // Flush each chunk so we are releasing buffers periodically. In theory this loop
        // might never end as new events are continuously added to the queue, if we never
        // flushed in that case we would be guaranteed to OOM.
        PerfMark.startTask("WriteQueue.flush0");
        try {
          channel.flush();
        } finally {
          PerfMark.stopTask("WriteQueue.flush0");
        }
        flushedOnce = true;
      }
    }
    // Must flush at least once, even if there were no writes.
    if (i != 0 || !flushedOnce) {
      PerfMark.startTask("WriteQueue.flush1");
      try {
        channel.flush();
      } finally {
        PerfMark.stopTask("WriteQueue.flush1");
      }
    }
  } finally {
    PerfMark.stopTask("WriteQueue.periodicFlush");
    // Mark the write as done, if the queue is non-empty after marking trigger a new write.
    scheduled.set(false);
    if (!queue.isEmpty()) {
      scheduleFlush();
    }
  }
}

As can be seen, gRPC's approach is to abstract each packet into a QueuedCommand. When the user thread initiates a request, it does not write it out directly; it submits the command to the WriteQueue and schedules the EventLoop to execute the flush task. The EventLoop takes the commands out of the queue and runs them; when the amount of written data reaches DEQUE_CHUNK_SIZE (default 128), channel.flush is called once to flush the buffered content to the peer. After all the commands in the queue have been consumed, a final flush is performed as needed to avoid losing messages. This is gRPC's batch write logic.

Similarly, we checked the source code of the triple module and found a class also named WriteQueue, whose purpose is likewise to batch writes and reduce the number of I/O operations. However, judging from the tcpdump results, this class does not behave as expected: messages are still sent out one by one in order, without batching.

We can set a breakpoint in triple's WriteQueue constructor to see why it does not batch writes as expected. As shown below.

[Figure: breakpoint in the triple WriteQueue constructor, showing it is instantiated per TripleClientStream]

We can see that the WriteQueue is instantiated in the TripleClientStream constructor, and TripleClientStream corresponds to a Stream in HTTP/2. Every new request constructs a new Stream, which means every Stream uses a different WriteQueue instance, and commands submitted by different Streams are not batched together. As a result, the requests initiated by different Streams are flushed directly and individually, producing a high number of I/O operations and seriously hurting the performance of the triple protocol.

Once the cause is clear, the optimization is straightforward: share the WriteQueue at the connection level instead of letting each Stream under a connection hold its own instance. When the WriteQueue is a connection-level singleton, its internal ConcurrentLinkedQueue can be fully used as a buffer, so a single flush can push the data of multiple different Streams to the peer, greatly improving the performance of the triple protocol.
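The following is a minimal sketch of the connection-level idea (class and method names are illustrative, not the actual Dubbo code). It mirrors the gRPC logic shown above: commands from all the streams on one connection land in the same ConcurrentLinkedQueue, and a single task scheduled on the connection's EventLoop writes them out and flushes in chunks.

```java
import io.netty.channel.Channel;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConnectionWriteQueue {

    private static final int CHUNK_SIZE = 128;

    private final Channel channel;   // the HTTP/2 connection channel, not a stream channel
    private final ConcurrentLinkedQueue<QueuedCommand> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean();

    public ConnectionWriteQueue(Channel channel) {
        this.channel = channel;
    }

    /** Called by any stream, on any thread. */
    public void enqueue(QueuedCommand cmd) {
        queue.offer(cmd);
        // Schedule at most one flush task on the EventLoop at a time.
        if (scheduled.compareAndSet(false, true)) {
            channel.eventLoop().execute(this::flush);
        }
    }

    private void flush() {
        try {
            QueuedCommand cmd;
            int written = 0;
            while ((cmd = queue.poll()) != null) {
                cmd.run(channel);                 // write header/data frames of any stream
                if (++written == CHUNK_SIZE) {
                    channel.flush();              // flush periodically to release buffers
                    written = 0;
                }
            }
            if (written > 0) {
                channel.flush();                  // one final flush for the remaining writes
            }
        } finally {
            scheduled.set(false);
            if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
                channel.eventLoop().execute(this::flush);
            }
        }
    }

    public interface QueuedCommand {
        void run(Channel channel);
    }
}
```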

Tuning results

Finally, let's look at the results of triple's optimization. Performance in the small-packet scenario improved significantly, with a maximum improvement of 45%! Unfortunately, the improvement for larger packets is limited; the large-packet scenario remains one of the future optimization goals of the triple protocol.

[Figure: benchmark results before and after the optimization]

Summary

In addition to this performance deep dive, in the next article we will introduce the design and use cases of Triple's usability and interoperability, focusing mainly on the following two points, so stay tuned.

  • Using the Triple protocol in the Dubbo framework, you can directly use Dubbo clients, gRPC clients, curl, browsers, etc. to access your published services without any additional components and configurations.
  • Dubbo currently provides Java, Go and Rust implementations, and implementations in languages such as Node.js are in progress. We plan to connect mobile, browser, and back-end microservice systems through multi-language support and the Triple protocol.