Some stress-test performance analysis of Netty-based layer-4 and layer-7 proxies

In this article we mainly want to test and study the following points:

  • The performance difference between the simplest layer-4 forwarding proxy and a layer-7 HTTP proxy, both written with Netty
  • The performance difference among three proxy threading models (the three models are explained below)
  • The performance difference between pooled and unpooled ByteBuf allocators

The test code used in this article:
https://github.com/JosephZhu1983/proxytest

In the code we implemented two sets of proxies:
image_1demdig32ia6184m64sppm8vp90.png-55.9kB

Machine configuration used for the tests (Alibaba Cloud ECS):
image_1dembkev02d2sll1ijc18fl4r48j.png-91.9kB
A total of three machines:

  • The server machine has Nginx installed, acting as the backend
  • The client machine has wrk installed, acting as the load-testing client
  • The proxy machine has our test code (the proxy) installed

Nginx backend

Nginx serves its default test page (with some content deleted to reduce network bandwidth usage):
image_1dembfnk81i9m19tkvli148c13h86.png-122.8kB
Stress-testing Nginx directly gives about 266,000 QPS:
image_1delvmebjcpe39n1hdni41hss13.png-55.2kB

About layer 4 and layer 7

For the layer-4 proxy, we only use Netty to forward ByteBuf.
For the layer-7 proxy, there is more overhead, mainly decoding/encoding and aggregating the HTTP requests. On the server side:

image_1demdm2m82vg1i6b4ng1uitjcp9d.png-136.8kB

On the client side:
image_1demdoius2ekjds1kbr5a1vld9q.png-63.2kB

From this we can already guess that the layer-4 proxy, since it skips the HTTP encoding and decoding work, will certainly perform much better than the layer-7 proxy; how much better, we will see in the test results.
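To make the difference concrete, here is a minimal sketch of how the two server pipelines might be set up. This is not the repository's exact code: the LoggingHandler is a stand-in for the real relay handlers, and maxContentLength=2000 is taken from the test config shown later.

    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.handler.codec.http.HttpObjectAggregator;
    import io.netty.handler.codec.http.HttpServerCodec;
    import io.netty.handler.logging.LoggingHandler;

    public final class PipelineSketch {

        // Layer 4: no HTTP handlers at all; the relay handler receives raw ByteBuf
        // and simply writes it to the backend connection.
        static ChannelInitializer<SocketChannel> layer4() {
            return new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    ch.pipeline().addLast(new LoggingHandler()); // stand-in for the ByteBuf relay handler
                }
            };
        }

        // Layer 7: the request is decoded and aggregated into a FullHttpRequest before
        // being forwarded, and the response is encoded again on the way back; this
        // codec + aggregation work is the extra cost compared with layer 4.
        static ChannelInitializer<SocketChannel> layer7() {
            return new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    ch.pipeline()
                      .addLast(new HttpServerCodec())           // HTTP request decoder / response encoder
                      .addLast(new HttpObjectAggregator(2000))  // aggregate, maxContentLength=2000 as in the config
                      .addLast(new LoggingHandler());           // stand-in for the HTTP relay handler
                }
            };
        }
    }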

About the threading model

As a proxy, we need to act as a server to accept requests from the upstream side, then act as a client to forward those requests downstream and return the downstream response back to the upstream. Both our server and our client need worker IO threads to handle requests, and there are three approaches:

  • A: The client Bootstrap and the server ServerBootstrap use separate NioEventLoopGroup thread pools, referred to as IndividualGroup
  • B: The client and the server share one thread pool, referred to as ReuseServerGroup
  • C: The client directly reuses the server's EventLoop thread, referred to as ReuseServerThread

Taking the layer-7 proxy code as an example:
image_1demdqavbn5i19ff1g1hrp2gbsan.png-98.4kB
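The screenshot above shows the actual code; the following is a minimal sketch, with assumed names (model, serverWorkerGroup, ctx), of how the backend-facing client Bootstrap gets its threads under the three models:

    // Sketch only: serverWorkerGroup is the worker group passed to the ServerBootstrap;
    // ctx is the ChannelHandlerContext of the inbound (upstream) channel.
    Bootstrap backendBootstrap(String model,
                               EventLoopGroup serverWorkerGroup,
                               ChannelHandlerContext ctx) {
        Bootstrap b = new Bootstrap().channel(NioSocketChannel.class);
        switch (model) {
            case "IndividualGroup":
                // A: a completely separate thread pool just for the client side
                return b.group(new NioEventLoopGroup());
            case "ReuseServerGroup":
                // B: share the server's worker thread pool
                return b.group(serverWorkerGroup);
            default: // "ReuseServerThread"
                // C: run on the very EventLoop that already handles the inbound channel,
                //    so both directions stay on one thread and context switches are avoided
                return b.group(ctx.channel().eventLoop());
        }
    }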

In the next tests we will try these three threading models. The naive guess here is that approach A performs best, because the independent thread pools do not interfere with each other. Let's look at the results.

Layer-4 proxy + ReuseServerThread threading model

Layer4ProxyServer Started with config: ServerConfig(type=Layer4ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerThread, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1delvsom6v03e5pngacv714901g.png-54kB

Layer-4 proxy + IndividualGroup threading model

Layer4ProxyServer Started with config: ServerConfig(type=Layer4ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=IndividualGroup, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1dem04l2alqs1l4u1ripg9a1fcu1t.png-54.8kB

Layer-4 proxy + ReuseServerGroup threading model

Layer4ProxyServer Started with config: ServerConfig(type=Layer4ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerGroup, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1dem0br3r1rr3qmj1mk519nn111v2a.png-55.2kB

At this point the result is already clear: ReuseServerThread performs best, followed by ReuseServerGroup, and IndividualGroup performs worst, which is not what we guessed.

Layer-4 system monitoring charts

From the network bandwidth chart we can see that ReuseServerThread, tested first, reached the highest bandwidth (the three peaks later in the chart represent the three test runs respectively):
image_1dem0chjrimkn5va5810dk1vk62n.png-52.8kB
From the CPU chart we can see that ReuseServerThread, the best performer, used the least CPU (again, the three peaks later in the chart represent the three test runs):
image_1dem0ekoq1l59ju1vvn1lp575u34.png-32.5kB

Layer-7 proxy + ReuseServerThread threading model

Layer7ProxyServer Started with config: ServerConfig(type=Layer7ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerThread, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1dem0mduhkdc11hc2ue12rd433h.png-55kB

Layer-7 proxy + IndividualGroup threading model

Layer7ProxyServer Started with config: ServerConfig(type=Layer7ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=IndividualGroup, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1dem0tgtv13ev3h9sl51appi083u.png-55.2kB

Layer-7 proxy + ReuseServerGroup threading model

Layer7ProxyServer Started with config: ServerConfig(type=Layer7ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerGroup, receiveBuffer=10240, sendBuffer=10240, allocatorType=Unpooled, maxContentLength=2000)
image_1dem14prr1e7kr0gi1ggiqu7l4b.png-55kB

The conclusion is the same: ReuseServerThread performs best, followed by ReuseServerGroup, with IndividualGroup worst. I think the reasoning is as follows:

  • Reusing the IO thread means fewer context switches, so it performs best; I later verified this with pidstat, but forgot to take a screenshot at the time
  • Reusing the thread pool gives the client a chance to land on the server's threads, avoiding some context switches, so performance is in the middle
  • Independent thread pools cause heavy context switching (observed to be about 4x that of reusing the IO thread), so performance is worst

Layer-7 system monitoring charts

Below are the network bandwidth and CPU monitoring charts respectively:
image_1dem1fh7m1f0cl8s1d1ic7563765.png-39.3kB
image_1dem1e3g01asrq8r9u16ce5e94r.png-60.1kB
We can see that the layer-7 proxy clearly consumes more resources, while its bandwidth is a bit lower than the layer-4 proxy's (and its QPS is much lower).
The outbound traffic is slightly higher than the inbound traffic, probably because of the extra request headers added in the code:
image_1demf0bhrikp1rh0r5i1q3c1iltc1.png-150.8kB
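For illustration only (the names below and the header are assumptions, not the repository's actual code), adding even one header to the forwarded request is enough to make the outbound byte count slightly larger than the inbound one:

    // Hypothetical example: tag the aggregated request before forwarding it to the backend.
    fullHttpRequest.headers().set("X-Forwarded-For", clientAddress); // extra bytes only on the outbound side
    backendChannel.writeAndFlush(fullHttpRequest.retain());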

Trying HttpObjectAggregator with a larger maxContentLength

Layer7ProxyServer Started with config: ServerConfig(type=Layer7ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerThread, receiveBuffer=10240, sendBuffer=10240, allocatorType=Pooled, maxContentLength=100000000)
image_1dem1qe4v1ddd1c2311pjej81bf16v.png-54.9kB

Trying PooledByteBufAllocator

Layer7ProxyServer Started with config: ServerConfig(type=Layer7ProxyServer, serverIp=172.26.5.213, serverPort=8888, backendIp=172.26.5.214, backendPort=80, backendThreadModel=ReuseServerThread, receiveBuffer=10240, sendBuffer=10240, allocatorType=Pooled, maxContentLength=2000)
image_1dem1ifds1hoi1lkka691vekmlt6i.png-54.8kB

We can see that Netty 4.1 has already made PooledByteBufAllocator the default allocator:
image_1demg35il1ambhdb1o3m42c1j9ce.png-43.9kB
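For reference, this is roughly how the allocator could be switched explicitly on a bootstrap (a sketch with assumed names usePooled and serverBootstrap; the repository presumably drives this through the allocatorType config value instead):

    // Sketch: force a specific allocator on the server bootstrap.
    // Netty 4.1 already defaults to PooledByteBufAllocator, so this mainly matters
    // when forcing the unpooled allocator for comparison.
    ByteBufAllocator alloc = usePooled ? PooledByteBufAllocator.DEFAULT
                                       : UnpooledByteBufAllocator.DEFAULT;
    serverBootstrap.option(ChannelOption.ALLOCATOR, alloc)
                   .childOption(ChannelOption.ALLOCATOR, alloc);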

Summary

Here is a summary table; the performance-loss percentages all use the first row (stress-testing Nginx directly) as the baseline:
image_1demepcbume4eoacntrb11mh2b4.png-39.1kB

The conclusions are:

  • Nginx is impressive; the machine configuration here is actually not that good, and on a reasonably well-configured physical server a single Nginx instance should have no problem reaching a million QPS
  • Netty is also impressive; it is after all a Java server, yet layer-4 forwarding loses only about 3% of QPS
  • Whether at layer 4 or layer 7, the thread-reuse approach clearly performs best and uses the least CPU
  • Because of context switching, developers writing network proxies with Netty should reuse IO threads
  • Layer 7 costs far more than layer 4; even Netty cannot avoid this, since it is inherent to the HTTP protocol
  • PooledByteBufAllocator gives a modest performance improvement over UnpooledByteBufAllocator (approximately 3%)
  • Setting a larger maxContentLength on HttpObjectAggregator slightly hurts performance

The reason I wrote this article and did this analysis is that we have recently been doing performance optimization and stress testing on our self-developed gateway, https://github.com/spring-avengers/tesla.
I found that some other open-source proxies based on Netty do not reuse IO threads across connections; their authors may not have realized this. I looked at the Zuul code, and it does reuse them.


Original post: www.cnblogs.com/lovecindywang/p/11115802.html