Reconstruction and Optimization of the Shuffle Mechanism in Flink, the Unified Batch-Stream Computing Engine

Source | zh.ververica.com (the new Chinese-language Flink site)    Author | Wang Zhijiang

I. Overview


This article focuses on the shuffle process shown in the dashed box of the figure below: everything from the upstream operator outputting data to the downstream operator consuming it. It can be divided into three sub-modules:


  • Upstream write: records produced by the operator are serialized into output buffer data structures, which are inserted into sub partition queues;

  • Network transmission: the downstream task may be scheduled to a different container, so upstream data has to be transmitted to the downstream over the network, which involves data copies and encoding/decoding;

  • Downstream read: buffers received from the network are deserialized into records for the op to process.

[Figure: scope of the shuffle process (dashed box), from upstream operator output to downstream operator consumption]

Once a job is scheduled and running, aside from the business logic inside the operators, essentially all of the runtime engine's overhead lies in the shuffle process, which involves complex operations such as data serialization, encoding and decoding, memory copies, and network transmission. It is therefore fair to say that shuffle performance determines the overall performance of the runtime engine.

Flink uses a unified shuffle architecture for both batch and streaming jobs. From a performance perspective, we designed and implemented a unified network flow control mechanism and optimized serialization and memory copies. From a batch-job usability perspective, we implemented an external shuffle service and refactored shuffle into a pluggable shuffle manager mechanism, a full upgrade in functionality, performance, and extensibility. The following three sections cover these areas in detail.

II. The New Flow Control Mechanism


Flink's original network transport mechanism was a blind-push model: the upstream pushed data at will and the downstream passively received it:

  • A container usually runs multiple concurrent task threads executing op business logic; different task threads multiplex the same TCP channel for data transport, which reduces the number of network connections between processes in large-scale scenarios;

  • Flink defines a buffer data structure for the input and output sides; each op's input and output maintains its own bounded local buffer pool, which lets upstream and downstream run smoothly in parallel pipelined mode;

  • On the upstream side, data output by the op is serialized into flink buffers; the netty thread takes flink buffers from the partition queue and copies them into netty buffers, after which the flink buffers are recycled to the local buffer pool for the op to reuse; a netty buffer is in turn recycled once it has been written into the socket send buffer;

  • On the downstream side, the netty thread reads data from the socket buffer and copies it into netty buffers; after decoding, it requests flink buffers from the local buffer pool, copies the data in, and inserts the flink buffers into the input channel queues; the input processor deserializes them into records for the op to consume, and the buffers are then recycled into the local buffer pool to keep receiving network data;

  • As long as the local buffer pools on the input and output ends have enough capacity to absorb the gap between upstream production and downstream consumption, this model does not hurt performance (a minimal sketch of the buffer request/recycle cycle follows the figure below).


[Figure: original push-based network transport path]
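To make the buffer life cycle above concrete, here is a minimal Java sketch of a bounded pool with the request/recycle semantics just described. `SimpleBufferPool` is a hypothetical stand-in for illustration, not Flink's actual `LocalBufferPool`:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical, simplified stand-in for a local buffer pool: a bounded pool of
// fixed-size buffers that the producing thread blocks on when it is exhausted.
public class SimpleBufferPool {
    private final BlockingQueue<ByteBuffer> pool;

    public SimpleBufferPool(int numBuffers, int bufferSize) {
        pool = new ArrayBlockingQueue<>(numBuffers);
        for (int i = 0; i < numBuffers; i++) {
            pool.add(ByteBuffer.allocate(bufferSize));
        }
    }

    // The op thread blocks here when the pool is exhausted; this is exactly
    // the point where backpressure takes effect on the producer.
    public ByteBuffer requestBuffer() throws InterruptedException {
        return pool.take();
    }

    // Called after the netty thread has copied the buffer onward, returning
    // the buffer to the pool so the op can reuse it.
    public void recycle(ByteBuffer buffer) {
        buffer.clear();
        pool.offer(buffer);
    }
}
```

The op serializes records into a buffer obtained from `requestBuffer()` and inserts it into the sub partition queue; the buffer flows back through `recycle()` once the netty thread has copied it onward.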


2.1 How Backpressure Arises and Its Impact


When real jobs run, we often see backpressure across the whole link, with the downstream inqueue and upstream outqueue buffer queues all filled up, especially in load-imbalanced and data catch-up scenarios.


  • As shown above, when the downstream input side exhausts its local buffer pool, the netty thread can no longer request flink buffers to copy the received data into. To avoid spilling data to disk and to protect memory resources, it is forced to temporarily disable reads on the channel. But because the TCP channel is shared by multiple ops, once it is closed, all the other ops that were working normally can no longer receive upstream data either;

  • TCP's own flow control then takes over: the advertised window in the downstream client's ACKs gradually shrinks to 0, so the upstream server stops sending network data and the socket send buffer gradually fills up;

  • Because the upstream netty buffers can no longer be written into the socket send buffer, the netty buffer watermark keeps rising; once it reaches the threshold, the netty thread stops taking flink buffers from the partition queues, so flink buffers are not recycled in time and the local buffer pool is eventually exhausted;

  • Unable to obtain flink buffers, the upstream op blocks and stops producing data, and backpressure propagates layer by layer all the way to the source nodes of the topology.

Backpressure is hard to avoid, but the existing flow control mechanism amplified its impact:

  • Because processes share multiplexed TCP connections, a bottleneck in one task thread prevents all task threads on that link from receiving data, hurting overall tps;

  • Once the data channel is temporarily closed, checkpoint barriers cannot travel over the network either, so checkpoints fail to complete for long stretches; if a failover then occurs, a large amount of historical data has to be replayed;

  • Besides exhausting the flink buffers on the input and output ends, backpressure also ties up netty's internal buffer resources and the temporary buffers received before the channel was closed, which in large-scale scenarios easily becomes an OOM instability factor.


2.2 Credit-Based Flow Control

The analysis above shows that the information asymmetry between upstream and downstream made the upstream push blindly, driven only by its own output, and forced the data channel to be closed whenever the downstream could not accept new data. What is needed is a finer-grained flow control mechanism at a higher layer, one that lets all logical links multiplexing the same physical channel transfer data without affecting one another.

We borrowed the credit idea and let the downstream continuously feed back its receive capacity, so the upstream can selectively send data to downstreams that are able to take it: the previous blind upstream push model becomes a downstream credit-based pull model.

  • As shown in the figure below, the upstream defines a backlog: the number of buffers already queued in a sub partition waiting to be sent, effectively the producer's inventory. This information rides as payload on the existing data protocol to the downstream, so its overhead is negligible;

  • The downstream defines a credit: the number of free buffers available on each input channel. Every input channel exclusively owns a bounded number of exclusive buffers, while all input channels share one local buffer pool from which they request floating buffers. This distinction between buffer types guarantees each input a baseline of resources, preventing deadlock from resource contention, while still allowing sensible, backlog-driven use of the global floating resources;

  • The downstream should feed credit back incrementally and as promptly as possible, so the upstream is not delayed waiting for credit. The downstream also tries to request a bit more credit than the backlog each time, so newly produced upstream data need not wait for credit feedback. The newly defined credit feedback protocol carries very little data; compared with normal data traffic, and as long as network bandwidth is not the bottleneck, its footprint is negligible (a minimal sketch of this credit/backlog bookkeeping follows the figure below).


[Figure: credit-based flow control]
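The following minimal Java sketch compresses both sides into one hypothetical class for brevity (in Flink this logic is spread across the input channels and the netty handlers):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of credit-based flow control. The upstream only sends
// when the downstream has announced credit, and every data message piggybacks
// the current backlog so the downstream can request floating buffers early.
public class CreditBasedChannel {
    private final Queue<byte[]> subPartitionQueue = new ArrayDeque<>(); // producer inventory
    private int credit; // free downstream buffers announced to the upstream

    // Producer side: enqueue newly serialized data and try to send.
    public synchronized void enqueue(byte[] buffer) {
        subPartitionQueue.add(buffer);
        trySend();
    }

    // Receiver side: each buffer recycled on the downstream frees capacity,
    // which is fed back as additional credit.
    public synchronized void addCredit(int newCredit) {
        credit += newCredit;
        trySend();
    }

    private void trySend() {
        while (credit > 0 && !subPartitionQueue.isEmpty()) {
            byte[] buffer = subPartitionQueue.poll();
            credit--; // one credit is consumed per buffer sent
            int backlog = subPartitionQueue.size();
            sendToNetwork(buffer, backlog); // backlog travels as payload
        }
    }

    private void sendToNetwork(byte[] buffer, int backlog) {
        // stand-in for the netty write path
    }
}
```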

2.3 Results in Production

Under the new flow control mechanism, when one link is backpressured, the other links sharing the same physical channel keep transferring data normally. We validated this with a typical Double 11 dashboard business job: overall throughput improved by 20% (figure below). For this kind of keyBy-style all-to-all pattern, the size of the improvement depends on how data is distributed once backpressure occurs. For one-to-one jobs, our experiments show the improvement under backpressure can exceed 100%.

[Figure: throughput of a Double 11 dashboard job under the new flow control]

The new flow control mechanism guarantees that everything the upstream sends is something the downstream can actually receive, so data no longer piles up in the network layer: nothing lingers in the netty buffers or socket buffers, and the number of in-flight buffers is smaller overall than before, which helps checkpoint barrier alignment. Moreover, since every input channel now has exclusive buffers and resource deadlock cannot occur, the downstream receiver can preferentially read from particular channels (sketched after the figure below), ensuring barriers align as fast as possible and the checkpoint flow is triggered sooner. As the figure below shows, barrier alignment became several times faster than before, which is vital for the stability of production jobs.

[Figure: checkpoint barrier alignment time, before vs. after]
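As an illustration of the channel-prioritization idea (hypothetical types, not Flink's actual input gate API), a reader can drain channels that have not yet delivered the barrier first:

```java
import java.util.List;

// Hypothetical sketch: poll channels whose checkpoint barrier has not arrived
// yet before the already-aligned ones, so alignment completes sooner.
public class BarrierBiasedReader {

    interface InputChannel {
        boolean hasReceivedBarrier(long checkpointId);
        boolean hasData();
    }

    public InputChannel selectNext(List<InputChannel> channels, long pendingCheckpointId) {
        for (InputChannel ch : channels) {
            if (!ch.hasReceivedBarrier(pendingCheckpointId) && ch.hasData()) {
                return ch; // drain un-aligned channels first
            }
        }
        return null; // nothing useful to read; wait for the remaining barriers
    }
}
```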

Beyond that, the new flow control mechanism enables optimizations in many other scenarios. Take the non-keyBy rebalance pattern: the upstream produces data to the downstreams in round-robin order. This seemingly balanced approach often causes load imbalance and triggers backpressure in practice, because different records have different processing costs and different downstream tasks run on machines with different loads. With the backlog concept, the upstream no longer produces data in plain round-robin order but consults the backlog sizes of the partitions: a larger backlog means higher inventory pressure, reflecting insufficient downstream processing capacity, so data is preferentially produced into partitions with smaller backlogs (see the sketch below). This optimization brings very large gains in many business scenarios. The new flow control mechanism has been contributed back to the community in release 1.5; see [1].
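A minimal sketch of such a selector (hypothetical class; an actual channel selector would have a richer interface):

```java
// Hypothetical sketch of backlog-aware output selection: instead of pure
// round-robin, pick the sub partition with the smallest backlog, steering
// data away from slow downstream tasks. Assumes at least one sub partition.
public class BacklogAwareSelector {
    public int selectSubPartition(int[] backlogs) {
        int target = 0;
        for (int i = 1; i < backlogs.length; i++) {
            if (backlogs[i] < backlogs[target]) {
                target = i; // smaller backlog means a less loaded downstream
            }
        }
        return target;
    }
}
```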

III. Serialization and Memory Copy Optimizations


As listed at the beginning, data serialization and memory copies dominate the shuffle process. When the op's business logic is light, this overhead takes the largest share of the total and is often the bottleneck of the entire runtime. The two optimizations are described below.


3.1 Broadcast Serialization Optimization

In broadcast mode the upstream transmits the same data to all concurrent downstream task nodes. This mode has plenty of use cases; for example, the build-side data of a hash join is distributed by broadcast.

Flink creates a separate serializer for each sub partition, and each serializer internally maintains two temporary ByteBuffers: one for the serialized length of the record and one for the serialized data. A record produced by the op is first serialized into the two temporary ByteBuffers, then a flink buffer requested from the local buffer pool receives a copy of the length and data, and finally the buffer is inserted into the sub partition queue. This implementation has two main problems:

  • Suppose there are n sub partitions for n concurrent downstreams. In broadcast mode the same data goes through n serializations and then n data copies; with many sub partitions this cost is substantial;

  • The number of serializers grows linearly with the number of sub partitions, and each serializer maintains two temporary arrays internally. Especially when records are large, the temporary data array can balloon; with many sub partitions this memory overhead is non-negligible and easily leads to OOM.

[Figure: serialize once, copy once]

To address these problems we optimized in two directions, as shown above:

  • Keep a single serializer serving all sub partitions, which removes most of the serializers' internal temporary memory overhead; the serializer itself is stateless;


  • In the broadcast scenario, data is serialized only once, and the serialized result is copied into only one flink buffer; that buffer is inserted into every sub partition queue, and a reference count controls when it is recycled.

This reduces the upstream production cost to 1/n of what it was, greatly improving overall broadcast performance (a minimal sketch of the reference-counting scheme follows). This work is being contributed back to the community.
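A minimal Java sketch of the scheme (hypothetical classes, not Flink's actual buffer implementation):

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: one serialized buffer shared by all sub partition
// queues, recycled only when the last reference is released.
public class RefCountedBuffer {
    private final byte[] data;
    private final AtomicInteger refCount;
    private final Runnable recycler;

    public RefCountedBuffer(byte[] data, int references, Runnable recycler) {
        this.data = data;
        this.refCount = new AtomicInteger(references);
        this.recycler = recycler;
    }

    public byte[] data() {
        return data;
    }

    // Each sub partition releases its reference after sending the data;
    // the last release returns the buffer to the pool.
    public void release() {
        if (refCount.decrementAndGet() == 0) {
            recycler.run();
        }
    }
}

// Broadcast write path: serialize once, copy once, enqueue the same instance
// into every sub partition queue.
class BroadcastWriter {
    void broadcast(byte[] serializedRecord, List<Queue<RefCountedBuffer>> subPartitions) {
        RefCountedBuffer shared = new RefCountedBuffer(
                serializedRecord, subPartitions.size(), () -> { /* return buffer to pool */ });
        for (Queue<RefCountedBuffer> queue : subPartitions) {
            queue.add(shared); // no per-partition serialization or copy
        }
    }
}
```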

3.2 Zero-Copy Network Memory


As mentioned in the flow control discussion, a flink buffer undergoes two data copies on each network end of the shuffle path:


  • Upstream: after a flink buffer is inserted into the partition queue, its contents are first copied into a netty ByteBuffer and then into the socket send buffer;


  • Downstream: data is first copied from the socket read buffer into a netty ByteBuffer and then into a flink buffer.


Because netty manages its own ByteBuffer pool, the process's direct memory usage cannot be estimated accurately. With very many socket channels, a poorly chosen maxDirectMemory setting easily causes OOM and failover, so we decided to have netty use flink buffers directly and bypass netty's internal ByteBuffers altogether:


  • Flink's buffer data structure was switched from heap bytes to an off-heap direct memory implementation that extends netty's internal ByteBuffer type;


  • The upstream netty thread takes buffers from the partition queue and writes them straight into the socket send buffer; the downstream netty thread requests buffers directly from the local buffer pool to receive data from the socket read buffer. The intermediate netty buffer copy is gone on both sides.

With these optimizations the process's direct memory usage dropped sharply: the default configuration went from 320m down to 80m, and both overall tps and stability improved (a plain-JDK illustration of why direct buffers save a copy follows the figure below).

[Figure: direct memory usage before and after the optimization]
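As a plain-JDK illustration of why off-heap buffers remove a copy (this is not Flink or netty code, and localhost:9999 is an assumed listening endpoint): a direct ByteBuffer can be handed to the socket as-is, whereas a heap buffer would first be copied into a temporary direct buffer internally.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DirectWriteSketch {
    public static void main(String[] args) throws IOException {
        // Off-heap memory, the same kind of buffer netty can wrap directly.
        ByteBuffer direct = ByteBuffer.allocateDirect(32 * 1024);
        direct.put("payload".getBytes());
        direct.flip();

        try (SocketChannel channel = SocketChannel.open(
                new InetSocketAddress("localhost", 9999))) {
            // The kernel reads straight from the buffer's native memory;
            // no intermediate heap-to-direct copy is needed.
            channel.write(direct);
        }
    }
}
```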


IV. Shuffle Architecture Reconstruction


The optimizations above apply to both streaming and batch jobs, and for streaming jobs the current shuffle system is clearly strong, but for batch jobs it still has significant limitations:

  • A streaming job's upstream and downstream run in parallel in pipelined mode, whereas a batch job usually runs stage by stage: the downstream starts pulling data only after the upstream has finished, and the upstream persists its output to local files. Because the upstream container process also hosts the shuffle service, its resources cannot be reclaimed until the data has been fully transferred downstream, even after the upstream op has finished; if those resources cannot be used to schedule downstream nodes, they are wasted;


  • Flink batch jobs support only one file output format: each sub partition produces its own file. When there are very many sub partitions, each with very little data, the number of file handles becomes uncontrollable and the disk I/O pattern is read/write unfriendly, so performance suffers.

To address these two problems we reworked shuffle in two ways: we implemented an external shuffle service that decouples the shuffle service from the container process running the op, and we defined a pluggable shuffle manager interface that keeps Flink's existing implementation while extending it with new file storage formats.

4.1 External Shuffle Service

The external shuffle service can run in any container outside the Flink framework, for example inside the NodeManager process in YARN mode. One shuffle service deployed per machine then serves the data transfer of all jobs on that server, so local disk reads can be coordinated globally, more sensibly and more efficiently.

We extracted the network-layer components from Flink's built-in internal shuffle service, chiefly the result partition manager and the transport layer, and packaged them into the external shuffle service. The flow control mechanism and network memory copy optimizations described above benefit the external shuffle service just the same.

  • The upstream result partition talks to the remote external shuffle service through the built-in shuffle service, registering its shuffle metadata with the result partition manager;


  • The downstream input gate likewise talks to the remote external shuffle service through the built-in shuffle service to request partition data; using the shuffle metadata registered by the upstream, the result partition manager can parse the file format correctly and sends data downstream under the credit-based flow control model.


[Figure: external shuffle service architecture]

For batch jobs running on the external shuffle service, container resources can be reclaimed as soon as the upstream finishes, making resource utilization far more sensible, and the external shuffle service can regulate reads according to disk type and load to get the most out of the hardware (a hypothetical sketch of the messages involved follows).
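A hypothetical sketch of the two messages implied by the registration and request flow above (all names and fields here are invented for illustration; the actual protocol is internal):

```java
// Hypothetical registration/request messages between the built-in shuffle
// service and the external shuffle service.
public class ShuffleMessages {

    // Sent from the upstream side when a result partition is produced, so the
    // external result partition manager can later interpret the output files.
    static class RegisterPartition {
        final String partitionId; // which result partition
        final String filePath;    // where the data lives on local disk
        final String fileFormat;  // how the shuffle service should parse it

        RegisterPartition(String partitionId, String filePath, String fileFormat) {
            this.partitionId = partitionId;
            this.filePath = filePath;
            this.fileFormat = fileFormat;
        }
    }

    // Sent by the downstream input gate; the external service then streams
    // buffers back under the same credit-based flow control.
    static class RequestSubPartition {
        final String partitionId;
        final int subPartitionIndex;

        RequestSubPartition(String partitionId, int subPartitionIndex) {
            this.partitionId = partitionId;
            this.subPartitionIndex = subPartitionIndex;
        }
    }
}
```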


4.2 Pluggable Shuffle Manager

To overcome the single file storage format of Flink batch jobs, we defined a shuffle manager interface that supports extensible upstream write and downstream read modes. A job topology can set a different shuffle manager implementation on each edge, defining how data is shuffled between that edge's upstream and downstream. The shuffle manager has three functional interfaces (a hedged Java sketch follows the figure below):

  • getResultPartitionWriter defines how the upstream writes data, i.e. the storage format of the output files; the result partition also decides for itself whether to register with the shuffle service, so the service understands the output files well enough to transfer the data;


  • getResultPartitionLocation defines the address of the upstream output; when scheduling the downstream, the job master carries this information in the downstream's descriptor, so the downstream can request the upstream's output data at that address;


  • getInputGateReader defines how the downstream reads the upstream's data.


[Figure: pluggable shuffle manager interface]
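A hedged Java sketch of the interface as described above (the parameter and return types are simplified placeholders, not Flink's actual signatures):

```java
// Sketch of the pluggable shuffle manager, following the three functions
// described above. All types here are illustrative placeholders.
public interface ShuffleManager {

    // How the upstream writes its output, i.e. the storage format of the
    // produced files; the writer decides whether to register with the
    // shuffle service so the service can interpret the files.
    ResultPartitionWriter getResultPartitionWriter(PartitionDescriptor descriptor);

    // Where the upstream output lives; the job master ships this location
    // in the downstream descriptors during scheduling.
    ResultPartitionLocation getResultPartitionLocation(PartitionDescriptor descriptor);

    // How the downstream reads the upstream output.
    InputGateReader getInputGateReader(InputGateDescriptor descriptor);
}

// Placeholder types so the sketch is self-contained.
interface ResultPartitionWriter {}
interface ResultPartitionLocation {}
interface InputGateReader {}
class PartitionDescriptor {}
class InputGateDescriptor {}
```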

Based on this interface, we implemented a new sort-merge output format for the upstream: all sub partition data is written into one file and eventually merged into a bounded number of files, with an index file marking where each sub partition's data can be read. In some scenarios this format performs better than the original one-file-per-sub-partition format, and it has become the default mode in our online deployment (a hypothetical sketch of such an index follows). This reconstruction work is also being contributed back to the community.
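A hypothetical sketch of what such an index could look like (the layout here is invented for illustration, not the actual on-disk format):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical index writer: all sub partitions live in one merged data file,
// and the index records where each sub partition's byte range starts and ends.
public class SortMergeIndexWriter {
    public static void writeIndex(String indexPath, long[] offsets, long[] lengths)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(indexPath))) {
            for (int subPartition = 0; subPartition < offsets.length; subPartition++) {
                out.writeInt(subPartition);           // sub partition id
                out.writeLong(offsets[subPartition]); // start offset in the data file
                out.writeLong(lengths[subPartition]); // bytes for this sub partition
            }
        }
    }
}
```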


V. Prospects

Future work on Flink shuffle will, on the streaming side, keep pursuing ultimate performance, producing the best results with fewer resources; on the batch side it will fully leverage the groundwork accumulated for streaming, exploit the hardware to the fullest, and unify performance and architecture.



