[Translation] Causal Profiling and Optimizing Go's HTTP/2 Server


Introduction

If you've been following this blog, you should have already read my post introducing the causal profiling paper. This profiling approach aims to establish a causal link between speeding up a given piece of code and the program's end-to-end performance. I have implemented this technique for Go, and I think it's time to exercise it for a while on real software: the HTTP/2 implementation in the Go standard library.

HTTP/2

HTTP/2 is the now-familiar protocol created because we'd had enough of HTTP/1. A single connection can be reused to send and receive many requests, reducing the overhead of establishing connections. Go's implementation allocates one goroutine per request, plus several goroutines per connection to handle asynchronous communication, and these goroutines coordinate with one another to decide who may write to the connection and when.

This design is ideal for causal profiling. If something is quietly blocking a request, causal profiling will find it easily, while traditional profiling might not.

Experimental setup

To make measurement easy, I built a synthetic benchmark consisting of an HTTP/2 server and a client. The server answers every request with the headers and body recorded from a request for the Google home page, and the client sends the request headers Firefox uses when fetching the root path. The client allows at most 10 concurrent requests. That number was picked arbitrarily, but it should be enough to keep the CPU saturated.

To perform causal profiling, the program has to be instrumented. We set up a Progress marker, which records the wall time that elapses between two lines of code. The HTTP/2 server calls the runHandler function, which runs the HTTP handler in a goroutine. We place the start mark before that goroutine is created, so the measurement captures scheduling delay as well as the HTTP handling itself. The end mark is placed after the handler has written all of its data to the channel.

To get a baseline, let's profile the server's CPU usage the traditional way; the results are shown below:

Well, this is the kind of huge, hard-to-optimize call graph we're used to getting from large applications. The big red box is a system call, a part we cannot optimize.

The following listing is somewhat more relevant, but still not of much help:

(pprof) top
Showing nodes accounting for 40.32s, 49.44% of 81.55s total
Dropped 453 nodes (cum <= 0.41s)
Showing top 10 nodes out of 186
      flat  flat%   sum%        cum   cum%
    18.09s 22.18% 22.18%     18.84s 23.10%  syscall.Syscall
     4.69s  5.75% 27.93%      4.69s  5.75%  crypto/aes.gcmAesEnc
     3.88s  4.76% 32.69%      3.88s  4.76%  runtime.futex
     3.49s  4.28% 36.97%      3.49s  4.28%  runtime.epollwait
     2.10s  2.58% 39.55%      6.28s  7.70%  runtime.selectgo
     2.02s  2.48% 42.02%      2.02s  2.48%  runtime.memmove
     1.84s  2.26% 44.28%      2.13s  2.61%  runtime.step
     1.69s  2.07% 46.35%      3.97s  4.87%  runtime.pcvalue
     1.26s  1.55% 47.90%      1.39s  1.70%  runtime.lock
     1.26s  1.55% 49.44%      1.26s  1.55%  runtime.usleep

The profile is dominated by cryptography and runtime calls. Let's set the cryptography aside, since it is already well optimized.

Causal profiling to the rescue

Before looking at the results, let's review how causal profiling works. When causal profiling is enabled, the program runs a series of experiments. Each experiment selects one call site to virtually speed up and runs the program for a while. Whenever that call executes (as detected by the profiler's sampling), every other running thread is slowed down, which simulates speeding up the selected call.

This seems counterintuitive, but since we know how much slowdown was injected between the Progress markers, we can subtract it and obtain the time the marked section would take if the selected call site really were faster. I suggest reading my other articles on causal profiling, or the original paper, for a deeper look at how this works.

Ultimately, a causal profile is a series of these virtual speedups of a call site, observing how the running time between the Progress markers changes. For the HTTP/2 server, the result for one call site looks like this:

0x4401ec /home/daniel/go/src/runtime/select.go:73
  0%    2550294ns
 20%    2605900ns    +2.18%    0.122%
 35%    2532253ns    -0.707%    0.368%
 40%    2673712ns    +4.84%    0.419%
 75%    2722614ns    +6.76%    0.886%
 95%    2685311ns    +5.29%    0.74%

In this example we are looking at an unlock call in the runtime's select code. The first column is the amount of virtual speedup applied to this call, the second is the resulting time between markers, and the third is the change relative to the baseline. The results show there is little potential gain from speeding this call up; in fact, the more we speed up the select code, the slower the program gets.

The fourth column looks a bit odd. It is the proportion of samples that caught this call site executing during a request, and it should scale with the amount of speedup applied. Under a traditional profiling mindset, it roughly represents the performance gain we would expect the speedup to bring.

Now let's look at a more interesting call site:

0x4478aa /home/daniel/go/src/runtime/stack.go:881
  0%    2650250ns
  5%    2659303ns    +0.342%    0.84%
 15%    2526251ns    -4.68%    1.97%
 45%    2434132ns    -8.15%    6.65%
 50%    2587378ns    -2.37%    8.12%
 55%    2405998ns    -9.22%    8.31%
 70%    2394923ns    -9.63%    10.1%
 85%    2501800ns    -5.6%    11.7%

This call site is in the stack-growth code, and the numbers above suggest that speeding it up could pay off nicely. The fourth column shows the program spends a large proportion of its time in this code. Let's turn back to traditional profiling, focused on the stack code, to see what is going on:

(pprof) top -cum newstack
Active filters:
   focus=newstack
Showing nodes accounting for 1.44s, 1.77% of 81.55s total
Dropped 36 nodes (cum <= 0.41s)
Showing top 10 nodes out of 65
      flat  flat%   sum%        cum   cum%
     0.10s  0.12%  0.12%      8.47s 10.39%  runtime.newstack
     0.09s  0.11%  0.23%      8.25s 10.12%  runtime.copystack
     0.80s  0.98%  1.21%      7.17s  8.79%  runtime.gentraceback
         0     0%  1.21%      6.38s  7.82%  net/http.(*http2serverConn).writeFrameAsync
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*Conn).Write
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*Conn).writeRecordLocked
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*halfConn).encrypt
     0.45s  0.55%  1.77%      4.23s  5.19%  runtime.adjustframe
         0     0%  1.77%      3.90s  4.78%  bufio.(*Writer).Write
         0     0%  1.77%      3.90s  4.78%  net/http.(*http2Framer).WriteData

The data shows that newstack is called from writeFrameAsync. Whenever the HTTP/2 server sends a data frame to the client, it spawns a goroutine that calls this method. Only one writeFrameAsync may be running at any time; if a handler tries to send another frame, it blocks until the previous writeFrameAsync returns.

Since the writeFrameAsync call passes through several logical layers, it inevitably consumes a large amount of stack.

How I improved the HTTP/2 server's performance by 28.2%

Stack growth slows the program down, so we need to find a way to avoid it. A fresh goroutine is created for every writeFrameAsync call, so we pay the cost of stack growth for every data frame we write.

Conversely, if we reuse the goroutine, the stack only needs to grow once, and every subsequent call reuses the already-grown stack. I made this change to the server, and the causal-profiling baseline dropped from 2.650ms to 1.901ms, a performance improvement of 28.2%.

Note that HTTP/2 servers don't normally run flat out against a local client. I would guess that with the server on the other side of the Internet the gains would be much smaller, since the CPU time consumed by stack growth is tiny compared with network latency.

Conclusion

Causal profiling is still immature, but I think this small example clearly shows its potential. You can see the project's branch with the causal-profiling instrumentation added. You are also welcome to suggest other benchmarks to me, and we'll see what conclusions we can draw.

Note: I'm looking for a job. If you need someone who understands the internals of the Go language and is familiar with distributed architectures, please take a look at my résumé or send an email to [email protected].


If you find a translation error or another area for improvement, you are welcome to submit a PR to the Nuggets Translation Project with a fix and earn the corresponding bonus points. The permalink at the beginning of the article is the link to this article's Markdown file on GitHub.




Origin juejin.im/post/5dc3e5faf265da4d144e86c2