- Original article: Using Causal Profiling to Optimize the Go HTTP/2 Server
- Original author: Morsing
- Translated by: the Gold Miner Translation Project
- Permalink: github.com/xitu/gold-m...
- Translator: JackEggie
- Proofreader: foxxnuaa
Using Causal Profiling to Optimize the Go HTTP/2 Server
Introduction
If you've been following this blog, you may have read my introduction to causal profiling. This profiling method aims to establish a link between CPU cycles spent and actual performance improvements. I've implemented causal profiling for Go, and I think it's time to try it out on some real software: the HTTP/2 implementation in the Go standard library.
HTTP/2
HTTP/2 is a new protocol that addresses many of the shortcomings of the familiar HTTP/1. It can multiplex many requests and responses over a single connection, reducing the overhead of connection establishment. The Go implementation allocates one goroutine per request, plus several goroutines per connection to handle asynchronous communication, and these goroutines must coordinate among themselves to decide who gets to write to the connection and when.
This design makes an ideal candidate for causal profiling. If something is silently blocking a request, causal profiling will find it easily, where traditional profiling might not.
Experimental setup
To make measurement easier, I built a synthetic benchmark consisting of an HTTP/2 server and a client. The server serves the headers and body of the Google home page and logs each request. The client requests the root path, sending the same request headers Firefox would. The client caps itself at 10 concurrent requests; that number was chosen arbitrarily, but it should be enough to keep the CPU saturated.
Causal profiling requires the program to be instrumented. We set up a Progress marker, which records the time elapsed between two lines of code being executed. In the HTTP/2 server, the start marker is placed just before the call to runHandler, the function that runs the HTTP handler in its own goroutine. Setting the marker before the goroutine is created lets us measure scheduling delays and the overhead of setting up the handler. The end marker is placed after the handler has written all of its data to the channel.
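The causal-profiling branch has its own marker API, which isn't shown in this post. As a rough, stdlib-only sketch of the span the two markers cover (the function names here are illustrative, not the real API):

```go
package main

import (
	"fmt"
	"time"
)

// measureDispatch illustrates the span a Progress marker covers: from
// just before the handler goroutine is spawned until the handler has
// written all of its data. It uses plain timestamps; the real
// causal-profiling branch uses its own marker machinery instead.
func measureDispatch(handler func(chan<- []byte)) time.Duration {
	start := time.Now() // "start" marker: before the goroutine is created
	out := make(chan []byte)
	go handler(out) // runHandler spawns the handler like this
	for range out {
		// Drain everything the handler writes to the channel.
	}
	return time.Since(start) // "end" marker: all data has been written
}

func main() {
	d := measureDispatch(func(out chan<- []byte) {
		out <- []byte("response body")
		close(out)
	})
	fmt.Println("span covered by the markers:", d)
}
```

Because the start marker sits before the goroutine is created, any scheduling delay between spawn and first execution of the handler is included in the measured span.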
To establish a baseline, let's first profile the server the traditional way, with a CPU profile. The results are shown below:
Well, it's the kind of huge, hard-to-optimize call graph you get from any sizable application. The big red box is a system call, which is not something we can optimize away. The following data is more relevant, but still doesn't help us much:
(pprof) top
Showing nodes accounting for 40.32s, 49.44% of 81.55s total
Dropped 453 nodes (cum <= 0.41s)
Showing top 10 nodes out of 186
flat flat% sum% cum cum%
18.09s 22.18% 22.18% 18.84s 23.10% syscall.Syscall
4.69s 5.75% 27.93% 4.69s 5.75% crypto/aes.gcmAesEnc
3.88s 4.76% 32.69% 3.88s 4.76% runtime.futex
3.49s 4.28% 36.97% 3.49s 4.28% runtime.epollwait
2.10s 2.58% 39.55% 6.28s 7.70% runtime.selectgo
2.02s 2.48% 42.02% 2.02s 2.48% runtime.memmove
1.84s 2.26% 44.28% 2.13s 2.61% runtime.step
1.69s 2.07% 46.35% 3.97s 4.87% runtime.pcvalue
1.26s 1.55% 47.90% 1.39s 1.70% runtime.lock
1.26s 1.55% 49.44% 1.26s 1.55% runtime.usleep
The profile appears to be dominated by cryptography and runtime calls. Let's set the cryptography aside; it is already well optimized.
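For reference, a baseline profile like the one above can be captured with the standard runtime/pprof package. This is a minimal sketch (the benchmark's actual harness isn't shown in the post; `busyWork` here is a stand-in workload):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// captureProfile runs a workload under the CPU profiler and returns
// the resulting pprof-format profile bytes.
func captureProfile(workload func()) []byte {
	var buf bytes.Buffer
	// Start sampling CPU usage; samples accumulate until StopCPUProfile.
	if err := pprof.StartCPUProfile(&buf); err != nil {
		panic(err)
	}
	workload()
	pprof.StopCPUProfile()
	return buf.Bytes()
}

// busyWork burns some CPU so the profiler has something to sample.
func busyWork() {
	x := 0
	for i := 0; i < 100_000_000; i++ {
		x += i
	}
	_ = x
}

func main() {
	p := captureProfile(busyWork)
	// Write p to a file and inspect it with `go tool pprof`,
	// which produces output like the tables in this post.
	fmt.Println("profile size in bytes:", len(p))
}
```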
Causal profiling to the rescue
Before we get to the profile results, let's review how causal profiling works. When causal profiling is enabled, the program runs a series of experiments. Each experiment picks one call site to virtually speed up, then detects (through the profiling machinery) whenever that call executes and slows down every other executing thread by the amount of the speed-up.

This seems counterintuitive, but since we know how much slower the section between the Progress markers ran, we can subtract that delay from the measurement and obtain the time that would have been spent had the call site actually been sped up. I suggest reading my earlier post on causal profiling, or the original paper, for a deeper look at how this works.
In the end, a causal profile is a set of virtually sped-up call sites, together with how the time between the Progress markers changed as a result. For the HTTP/2 server, one call site's results look like this:
0x4401ec /home/daniel/go/src/runtime/select.go:73
0% 2550294ns
20% 2605900ns +2.18% 0.122%
35% 2532253ns -0.707% 0.368%
40% 2673712ns +4.84% 0.419%
75% 2722614ns +6.76% 0.886%
95% 2685311ns +5.29% 0.74%
In this example, we're looking at an unlock call inside the runtime's select code. The call site was sped up by increasing amounts (first column), and the other columns show the resulting time between the markers and its difference from the baseline. The results show that there is little performance to gain by speeding this call up; in fact, as we sped up the select code, the program got slower.
The fourth column looks a bit odd. It is the percentage of samples in which the profiler detected the program inside this call site, which should scale with the amount of speed-up applied. In a traditional profile, this number would roughly correspond to the performance gain you'd expect from optimizing the call.
Now for a more interesting call site:
0x4478aa /home/daniel/go/src/runtime/stack.go:881
0% 2650250ns
5% 2659303ns +0.342% 0.84%
15% 2526251ns -4.68% 1.97%
45% 2434132ns -8.15% 6.65%
50% 2587378ns -2.37% 8.12%
55% 2405998ns -9.22% 8.31%
70% 2394923ns -9.63% 10.1%
85% 2501800ns -5.6% 11.7%
This call site is in the stack-growth code, and the numbers above suggest that speeding it up could yield real gains. The fourth column shows that the program spends a large proportion of its time growing stacks. Guided by this, let's take the traditional profile and focus it on the stack-growth code (newstack):
(pprof) top -cum newstack
Active filters:
focus=newstack
Showing nodes accounting for 1.44s, 1.77% of 81.55s total
Dropped 36 nodes (cum <= 0.41s)
Showing top 10 nodes out of 65
flat flat% sum% cum cum%
0.10s 0.12% 0.12% 8.47s 10.39% runtime.newstack
0.09s 0.11% 0.23% 8.25s 10.12% runtime.copystack
0.80s 0.98% 1.21% 7.17s 8.79% runtime.gentraceback
0 0% 1.21% 6.38s 7.82% net/http.(*http2serverConn).writeFrameAsync
0 0% 1.21% 4.32s 5.30% crypto/tls.(*Conn).Write
0 0% 1.21% 4.32s 5.30% crypto/tls.(*Conn).writeRecordLocked
0 0% 1.21% 4.32s 5.30% crypto/tls.(*halfConn).encrypt
0.45s 0.55% 1.77% 4.23s 5.19% runtime.adjustframe
0 0% 1.77% 3.90s 4.78% bufio.(*Writer).Write
0 0% 1.77% 3.90s 4.78% net/http.(*http2Framer).WriteData
The profile shows that the newstack calls come from writeFrameAsync. The HTTP/2 server spawns a goroutine running writeFrameAsync every time it sends a data frame to the client. Only one writeFrameAsync call may be in flight at any given time; if a handler tries to send more frames, they are blocked until the previous writeFrameAsync returns.

Because writeFrameAsync goes through several logical layers of the server, it inevitably needs a large call stack.
How I improved the HTTP/2 server's performance by 28.2%
Stack growth was slowing the program down, so we need a way to avoid it. Every writeFrameAsync call runs on a freshly created goroutine, so we pay for stack growth on every data frame we write.

If we instead reuse a goroutine, the stack only has to grow once, and every subsequent call reuses the already-grown stack. I made this change to the server, and the baseline measured by the causal profiler dropped from 2.650ms to 1.901ms, a 28.2% performance improvement.
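The fix can be sketched as replacing a per-frame `go writeFrame(...)` with a single long-lived writer goroutine fed over a channel. This is a simplified illustration of the goroutine-reuse idea, not the actual net/http patch:

```go
package main

import "fmt"

type frame struct {
	data []byte
	done chan struct{} // closed once the frame has been written
}

// frameWriter owns one long-lived goroutine. Its stack grows once,
// on the first deep write, and is reused for every later frame.
type frameWriter struct {
	frames chan frame
}

func newFrameWriter() *frameWriter {
	w := &frameWriter{frames: make(chan frame)}
	go func() {
		for f := range w.frames {
			// In the real server this would be the multi-layer
			// framer/bufio/TLS write path.
			_ = f.data
			close(f.done)
		}
	}()
	return w
}

// writeFrameAsync queues a frame and returns a channel that is closed
// when it has been written, mirroring the one-frame-in-flight rule.
func (w *frameWriter) writeFrameAsync(data []byte) <-chan struct{} {
	f := frame{data: data, done: make(chan struct{})}
	w.frames <- f
	return f.done
}

func main() {
	w := newFrameWriter()
	for i := 0; i < 3; i++ {
		<-w.writeFrameAsync([]byte("frame"))
		fmt.Println("frame", i, "written")
	}
}
```

Because the writer goroutine never exits, the stack-growth cost is paid at most once per connection rather than once per frame.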
It's worth noting that HTTP/2 servers don't usually run flat out against a local client. I suspect that if you put this server on the Internet, the benefit would be much smaller, since the CPU time consumed by stack growth is tiny compared with network latency.
Conclusion
Causal profiling is still immature, but I think this small example clearly shows its potential. You can check out the project branch with the causal-profiling instrumentation added. If you have other benchmarks you'd like to suggest, send them my way and let's see what we find.
Note: I'm looking for a job. If you need someone who understands the internals of the Go runtime and is familiar with distributed architectures, take a look at my resume or send an email to [email protected].
Related articles
- An update on causal profiling
- Causal profiling for Go
- Error handling in Go
- The Go netpoller
- The Go scheduler