Summary of Flink flow control and backpressure mechanism

I recently reviewed my understanding of the Flink technology stack and found a fairly large blind spot around Flink's network stack, flow control, and backpressure mechanisms. Although I have dealt with job backpressure many times, I never fully understood the implementation behind it. So I wrote this summary, standing on the shoulders of giants, to work out thoroughly how Flink does flow control and handles backpressure.

Data flow in Flink's network transmission

On the sending side, data is first written into the TaskManager's internal network buffers and then transmitted with Netty: the data to be sent is staged in Netty's ChannelOutboundBuffer before going out through the Socket send buffer. The receiving side mirrors this path in reverse and likewise passes through three layers of buffering: the Socket receive buffer → Netty's ChannelInboundBuffer → the TaskManager network buffers. Implementing flow control means intervening somewhere along this pipeline.
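
To make these three layers of buffering concrete, here is a purely illustrative Java sketch (not Flink's or Netty's real classes) that models each layer on the send path as a small bounded queue; a full layer forces the layer above it to stall, and that stalling is exactly the lever that the flow control mechanisms described below pull on.

```java
import java.util.ArrayDeque;

/** Purely illustrative: each buffering layer on the send path modelled as a bounded queue. */
public class SendPathSketch {

    static class Layer {
        final String name;
        final int capacity;
        final ArrayDeque<byte[]> buffers = new ArrayDeque<>();

        Layer(String name, int capacity) {
            this.name = name;
            this.capacity = capacity;
        }

        /** Returns false when the layer is full, i.e. the layer above it has to stall. */
        boolean offer(byte[] data) {
            if (buffers.size() >= capacity) {
                return false;
            }
            buffers.add(data);
            return true;
        }
    }

    public static void main(String[] args) {
        // Sender-side path; the receiver side mirrors it in reverse order.
        Layer[] sendPath = {
                new Layer("TaskManager network buffer", 4),
                new Layer("Netty ChannelOutboundBuffer", 4),
                new Layer("Socket send buffer", 4)
        };

        byte[] record = {1, 2, 3};
        for (Layer layer : sendPath) {
            if (!layer.offer(record)) {
                System.out.println(layer.name + " is full -> the layer above it must stall");
                return;
            }
            System.out.println("record staged in " + layer.name);
            record = layer.buffers.poll();   // hand the record on to the next layer
        }
        System.out.println("record handed off to the network");
    }
}
```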

Flink's backpressure propagation

Backpressure is a streaming system's dynamic feedback mechanism for processing capacity, and the feedback flows from downstream to upstream. Consider the logical data flow between Flink TaskManagers.

Once backpressure arises because downstream processing capacity is insufficient, the backpressure signal propagates in two stages: first from the input of the downstream TaskManager (InputGate) back to the output of the TaskManager directly upstream (ResultPartition), and then from that output back to the input within the upstream TaskManager itself. What we mainly need to examine is the cross-TaskManager propagation, because its link is relatively long (refer to the data flow described in the previous section) and is therefore more likely to become a bottleneck.

Let's first look at the flow control and backpressure mechanism in older Flink versions.

Before Flink 1.5: TCP-based flow control and backpressure

Before version 1.5, Flink did not implement a flow control mechanism of its own; it relied directly on the sliding window mechanism of the TCP protocol at the transport layer (a staple of university computer networking courses). Let's review, with an example, how the TCP sliding window achieves flow control.

  1. Initially, the Sender sends 3 packets per unit time and its send window starts at size 3; the Receiver consumes 1 packet per unit time and its receive window starts at size 5 (the same size as its receive buffer).


  2. The Receiver consumes one packet and slides its receive window forward by one slot, then informs the Sender with ACK = 4 (meaning it can start sending from packet 4) and Window = 3 (meaning the receive window currently has 3 free slots).


  3. After receiving the ACK, the Sender sends packets 4 to 6, and the Receiver places them in its buffer as they arrive.
  4. The Receiver consumes another packet, slides its receive window forward, and replies with ACK = 7 (start sending from packet 7) and Window = 1 (the receive window now has only 1 free slot). On receiving this ACK, the Sender sees that the Receiver can accept only 1 more packet, so it shrinks its send window to 1 and sends only packet 7, achieving the rate-limiting effect.
    Following this process further, the Sender will eventually be unable to send at all (the Receiver reports Window = 0) and will not resume until the Receiver has consumed data from its buffer. In the meantime, the Sender periodically sends zero-window probe segments so that the Receiver can report its available capacity in time. The window bookkeeping is reproduced in the sketch right after this list.
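
The window arithmetic above can be reproduced with a few lines of Java. The following toy simulation (just the protocol's bookkeeping, not a real TCP stack) uses the same numbers as the example: a 5-slot receive buffer, a sender that wants to send 3 packets per round, and a receiver that consumes 1 per round; to show the Window = 0 case, the receiver is assumed to stall in the last two rounds.

```java
/** Toy bookkeeping of TCP sliding-window flow control (not a real TCP stack). */
public class SlidingWindowDemo {
    public static void main(String[] args) {
        final int capacity = 5;        // receive buffer size = maximum window
        final int senderRate = 3;      // packets the sender would like to send per round
        // Packets the receiver manages to consume each round; it stalls in rounds 4 and 5.
        final int[] consumedPerRound = {1, 1, 1, 0, 0};

        int buffered = 0;              // packets currently sitting in the receive buffer
        int window = capacity;         // window advertised to the sender

        for (int round = 0; round < consumedPerRound.length; round++) {
            int sent = Math.min(senderRate, window);                   // sender respects the window
            buffered += sent;
            buffered -= Math.min(buffered, consumedPerRound[round]);   // receiver drains what it can
            window = capacity - buffered;                              // ACK advertises the free space

            System.out.printf("round %d: sent=%d, buffered=%d, advertised window=%d%n",
                    round + 1, sent, buffered, window);
        }
        // Once window == 0 the sender stops and periodically sends zero-window probes
        // until the receiver frees buffer space and advertises a non-zero window again.
    }
}
```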

Next, let's walk through the backpressure process with an example.

  • Assume the ratio of the Sender's sending rate to the Receiver's receiving rate is 2:1. At first, data can be sent and received normally.


  • After a while, the InputChannel's own buffers on the Receiver side are exhausted, so it requests new buffers from the local buffer pool (LocalBufferPool).


  • After a further while, the LocalBufferPool's available quota is also used up, so it requests buffers from the network buffer pool (NetworkBufferPool).


  • As data keeps accumulating, the NetworkBufferPool's quota is exhausted as well. At this point there is nowhere left to put incoming data, so Netty's autoRead is switched off and no more data is read from the Socket receive buffer (a simplified sketch of this buffer-request escalation follows the walkthrough).


  • Once the Socket receive buffer also fills up, the Receiver advertises Window = 0 (see the sliding window discussion above), and the Sender-side Socket stops sending data.


  • The backlog in the Sender-side Socket send buffer then prevents Netty from writing any more data out.


  • The data to be sent piles up in the Sender's ChannelOutboundBuffer. Once its size exceeds Netty's high watermark, the Channel is marked unwritable, and the ResultSubPartition stops writing data to Netty.


  • Once the Sender-side ResultSubPartition's buffers are full, it keeps requesting new buffers from the LocalBufferPool and NetworkBufferPool, just like the Receiver-side InputChannel, until those are exhausted too and the RecordWriter can no longer write data.

In this way, backpressure has been propagated to the upstream TaskManager.
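
The buffer-request escalation on the Receiver side (the InputChannel's own buffers → LocalBufferPool → NetworkBufferPool → stop reading) can be sketched as below. This is a deliberately simplified model with hypothetical class names, not Flink's real buffer-pool implementation; it only shows the order in which the pools are drained and the point where Netty's autoRead would be switched off.

```java
import java.util.Optional;

/** Simplified, hypothetical model of the Receiver-side buffer escalation (not Flink's real classes). */
public class ReceiverEscalationSketch {

    static class SimpleBufferPool {
        private final String name;
        private int available;

        SimpleBufferPool(String name, int available) {
            this.name = name;
            this.available = available;
        }

        /** Hands out one buffer if this pool still has quota left. */
        Optional<String> request() {
            if (available == 0) {
                return Optional.empty();
            }
            available--;
            return Optional.of(name + "-buffer");
        }
    }

    public static void main(String[] args) {
        SimpleBufferPool inputChannelBuffers = new SimpleBufferPool("InputChannel", 2);
        SimpleBufferPool localBufferPool = new SimpleBufferPool("LocalBufferPool", 2);
        SimpleBufferPool networkBufferPool = new SimpleBufferPool("NetworkBufferPool", 2);

        boolean autoRead = true;   // stands in for Netty's channel.config().setAutoRead(...)

        // Keep receiving records until every layer is exhausted.
        for (int record = 1; autoRead; record++) {
            Optional<String> buffer = inputChannelBuffers.request();        // 1. channel's own buffers
            if (!buffer.isPresent()) buffer = localBufferPool.request();    // 2. then the local pool
            if (!buffer.isPresent()) buffer = networkBufferPool.request();  // 3. then the global pool

            if (buffer.isPresent()) {
                System.out.println("record " + record + " stored in " + buffer.get());
            } else {
                // 4. Nowhere left to put data: stop reading from the Socket, which is what
                //    setAutoRead(false) achieves in the real stack; the Socket buffer fills next.
                autoRead = false;
                System.out.println("all pools exhausted -> autoRead off");
            }
        }
    }
}
```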

After Flink 1.5: Credit-based flow control and backpressure

The TCP-based flow control and backpressure approach has two major drawbacks:

As long as any one task running on a TaskManager triggers backpressure, the Socket between that TaskManager and its upstream TaskManager can no longer transmit data. This affects all the other, perfectly healthy tasks as well as the flow of Checkpoint Barriers, and may even cause the whole job to avalanche;

The backpressure propagation link is too long: all the network buffers along the way have to be exhausted before the backpressure actually takes effect, so the latency is relatively high.

To solve these two problems, Flink 1.5+ introduced credit-based flow control and backpressure. In essence, it lifts TCP's flow control from the transport layer up to the application layer, that is, to the ResultPartition and InputGate level, thereby avoiding congestion at the transport layer. Specifically:

  • The ResultSubPartition on the Sender side counts how much data has accumulated (measured in number of buffers) and announces it to the Receiver-side InputChannel as the backlog size;

  • The InputChannel on the Receiver side calculates how many buffers it still has available for incoming data (also measured in number of buffers) and announces it to the Sender-side ResultSubPartition as credit.

In other words, the Sender and Receiver achieve precise flow control by telling each other about their own capacity (note that the backlog size and credit announcements still travel over the transport layer; they are not exchanged through some separate direct channel). Next, let's illustrate the credit-based flow control and backpressure process with an example.
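
As a rough mental model of this credit bookkeeping, here is a minimal Java sketch. The class and method names (Sender, Receiver, addCredit, and so on) are hypothetical and not Flink's internal API; the sketch only captures the two announcements described above: the sender may only send while it has credit and attaches its remaining backlog to each batch, while the receiver turns freed buffers into new credit announcements.

```java
/** Illustrative credit-based flow-control bookkeeping (hypothetical names, not Flink's internal API). */
public class CreditSketch {

    /** Sender side: a ResultSubPartition-like queue plus the credit granted by the receiver. */
    static class Sender {
        int backlog = 0;   // buffers queued in the sub-partition, waiting to be sent
        int credit = 0;    // buffers the receiver has promised it can still accept

        void produce(int buffers) { backlog += buffers; }

        /** Sends at most 'credit' buffers; the batch carries the remaining backlog size. */
        int[] send() {
            int sent = Math.min(backlog, credit);
            backlog -= sent;
            credit -= sent;
            return new int[]{sent, backlog};   // {data buffers, backlog announcement}
        }

        void addCredit(int announced) { credit += announced; }
    }

    /** Receiver side: an InputChannel-like view over a bounded set of buffers. */
    static class Receiver {
        final int capacity;
        int occupied = 0;  // buffers currently holding unconsumed data

        Receiver(int capacity) { this.capacity = capacity; }

        /** Stores the batch; in the real stack the backlog announcement also drives
         *  floating-buffer requests from the LocalBufferPool/NetworkBufferPool. */
        void receive(int sent, int backlogAnnouncement) { occupied += sent; }

        /** Frees buffers as the downstream operator consumes data and announces them as new credit. */
        int consumeAndAnnounceCredit(int consumed) {
            int freed = Math.min(consumed, occupied);
            occupied -= freed;
            return freed;   // incremental credit announcement sent back to the sender
        }
    }
}
```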

  • It is again the scenario where the ratio of the Sender's sending rate to the Receiver's receiving rate is 2:1. The Sender-side ResultSubPartition has accumulated 2 buffers of data, so the batch to be sent is shipped to the Receiver together with backlog size = 2. After receiving the batch and the backlog size, the Receiver estimates whether the InputChannel has enough buffers for the next batch; if not, it requests more from the LocalBufferPool/NetworkBufferPool, and then announces credit = 3 to the upstream ResultSubPartition, meaning "I can accept 3 more buffers of data".


As data keeps backing up on the Receiver side, its network buffers are eventually exhausted, so it feeds back credit = 0 to the upstream (the equivalent of Window = 0 in the TCP sliding window), and the Sender-side ResultSubPartition stops handing data to Netty. Following the process described in the previous section, the Sender-side network buffers are then exhausted even faster, the RecordWriter can no longer write data, and backpressure takes effect.
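
Continuing the hypothetical sketch above (same package), a short driver for the 2:1 scenario shows the effect: after a few rounds the announced credit throttles the sender down to the receiver's consumption rate, the sender-side backlog keeps growing, and the receiver's buffers are never overrun, which is precisely what distinguishes this scheme from the TCP-based one.

```java
// Driver for the 2:1 scenario, continuing the hypothetical CreditSketch classes above.
public class CreditDemo {
    public static void main(String[] args) {
        CreditSketch.Sender sender = new CreditSketch.Sender();
        CreditSketch.Receiver receiver = new CreditSketch.Receiver(4);

        sender.addCredit(receiver.capacity);       // initial credit = receiver's free buffers

        for (int round = 1; round <= 6; round++) {
            sender.produce(2);                     // the sender produces 2 buffers per round...
            int[] batch = sender.send();           // ...but may only send as many as it has credit for
            receiver.receive(batch[0], batch[1]);
            sender.addCredit(receiver.consumeAndAnnounceCredit(1));   // downstream consumes 1 per round

            System.out.printf("round %d: sent=%d backlog=%d credit=%d occupied=%d/%d%n",
                    round, batch[0], sender.backlog, sender.credit,
                    receiver.occupied, receiver.capacity);
        }
        // 'occupied' never exceeds 'capacity' because credit never outruns the receiver's free buffers.
        // If the receiver stopped freeing buffers entirely, the announced credit would stay at 0
        // (the analogue of Window = 0) and the ResultSubPartition would stop handing data to Netty.
    }
}
```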


As the above shows, the backpressure signal no longer has to propagate by exhausting every buffer layer along the transport path between TaskManagers, which greatly reduces backpressure latency. It also no longer blocks the entire Socket link because a single task is backpressured, and it can throttle traffic fairly precisely at task granularity, so it is both lightweight and efficient.


Source: blog.csdn.net/qq_43081842/article/details/112404882