Flink source code analysis: how a task receives and processes BufferOrEvent data

Foreword:

The Flink version used in this article is 1.11.0.

1. Data types in Flink's data stream

Flink divides the data in a data stream into two types: Buffer and Event. The two are wrapped in a class called BufferOrEvent, as shown in the following code:

public class BufferOrEvent {

	private final Buffer buffer;

	private final AbstractEvent event;

    ...
}
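
As a rough illustration of the idea (not Flink's actual class), a holder of this kind keeps exactly one of its two fields non-null and offers predicates for telling the two cases apart. The class name BufferOrEventSketch, the factory methods, and the use of byte[] and String as stand-ins for Buffer and AbstractEvent are all assumptions made for this sketch.

```java
import java.util.Objects;

/** Hypothetical, simplified stand-in for a holder that carries either a buffer or an event. */
public class BufferOrEventSketch {
    private final byte[] buffer; // stand-in for Flink's Buffer (assumption)
    private final String event;  // stand-in for Flink's AbstractEvent (assumption)

    private BufferOrEventSketch(byte[] buffer, String event) {
        this.buffer = buffer;
        this.event = event;
    }

    /** Wraps payload data; the event field stays null. */
    public static BufferOrEventSketch ofBuffer(byte[] buffer) {
        return new BufferOrEventSketch(Objects.requireNonNull(buffer), null);
    }

    /** Wraps a control event; the buffer field stays null. */
    public static BufferOrEventSketch ofEvent(String event) {
        return new BufferOrEventSketch(null, Objects.requireNonNull(event));
    }

    public boolean isBuffer() { return buffer != null; }

    public boolean isEvent() { return event != null; }

    public byte[] getBuffer() { return buffer; }

    public String getEvent() { return event; }

    public static void main(String[] args) {
        BufferOrEventSketch data = ofBuffer(new byte[] {1, 2, 3});
        BufferOrEventSketch barrier = ofEvent("CheckpointBarrier");
        System.out.println(data.isBuffer() + " " + barrier.isEvent());
    }
}
```

Callers then branch on isBuffer() or isEvent() before touching the payload, which is the pattern the functions in the following sections rely on.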

So which data counts as Buffer and which counts as Event? Let's look at the Javadoc of the AbstractEvent abstract class:

/**
 * This type of event can be used to exchange notification messages between
 * different {@link TaskExecutor} objects at runtime using the communication
 * channels.
 *
 * Author's note: in other words, while a task is running, events are used to
 *       pass and exchange messages between different TaskExecutors over the channels.
 */
public abstract class AbstractEvent implements IOReadableWritable {}

In day-to-day development, the most common Event types are CheckpointBarrier and CancelCheckpointMarker.

The most common Buffer contents are Watermarks and user records (custom data).

Let's take a look at how each task receives data and processes it.

2. Task data reception

We use this class as the entry point for the analysis: org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput

The emitNext function:

@Override
	public InputStatus emitNext(DataOutput<T> output) throws Exception {
		// Author's note: loop until a CheckpointBarrier, a watermark, or an actual record is obtained
		while (true) {
			/* Author's note: currentRecordDeserializer is assigned in processBufferOrEvent,
			 *                which is only called when Buffer data is available.
			 */
			if (currentRecordDeserializer != null) {
				DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);
				if (result.isBufferConsumed()) {
					currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
					currentRecordDeserializer = null;
				}

				if (result.isFullRecord()) {
					// Author's note: process the Buffer data
					processElement(deserializationDelegate.getInstance(), output);
					return InputStatus.MORE_AVAILABLE;
				}
			}
                
			/* Author's note: pollNext fetches the data of each channel. Event data
			 *                (CheckpointBarrier) is handled inside pollNext.
			 */
			Optional<BufferOrEvent> bufferOrEvent = checkpointedInputGate.pollNext();
			if (bufferOrEvent.isPresent()) {
				
               
				/* Author's note: if this is a CheckpointBarrier event, return right away,
				 *                because the barrier has already been handled inside pollNext.
				 */
				if (bufferOrEvent.get().isEvent() && bufferOrEvent.get().getEvent() instanceof CheckpointBarrier) {
					return InputStatus.MORE_AVAILABLE;
				}
				/* Author's note: otherwise call processBufferOrEvent; for a buffer, this hands
				 *                the data to currentRecordDeserializer.
				 */
				processBufferOrEvent(bufferOrEvent.get());
			} else {
				if (checkpointedInputGate.isFinished()) {
					checkState(checkpointedInputGate.getAvailableFuture().isDone(), "Finished BarrierHandler should be available");
					return InputStatus.END_OF_INPUT;
				}
				return InputStatus.NOTHING_AVAILABLE;
			}
		}
	}

The processBufferOrEvent function:

This function hands the received buffer to currentRecordDeserializer; the while (true) loop in emitNext above then pulls records out of it and processes them.

private void processBufferOrEvent(BufferOrEvent bufferOrEvent) throws IOException {
		if (bufferOrEvent.isBuffer()) {
			lastChannel = channelIndexes.get(bufferOrEvent.getChannelInfo());
			checkState(lastChannel != StreamTaskInput.UNSPECIFIED);
			currentRecordDeserializer = recordDeserializers[lastChannel];
			checkState(currentRecordDeserializer != null,
				"currentRecordDeserializer has already been released");
            
			// Author's note: hand the buffer to currentRecordDeserializer
			currentRecordDeserializer.setNextBuffer(bufferOrEvent.getBuffer());
		}
		else {
			// Event received
			final AbstractEvent event = bufferOrEvent.getEvent();
			// TODO: with checkpointedInputGate.isFinished() we might not need to support any events on this level.
			if (event.getClass() != EndOfPartitionEvent.class) {
				throw new IOException("Unexpected event: " + event);
			}

			// release the record deserializer immediately,
			// which is very valuable in case of bounded stream
			releaseDeserializer(channelIndexes.get(bufferOrEvent.getChannelInfo()));
		}
	}
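
The interplay between emitNext and processBufferOrEvent can be sketched with toy types. In this sketch a buffer is modeled as a list of strings, the deserializer as an iterator over the current buffer, and the input gate as a simple queue. The names EmitLoopSketch, Status, and enqueue are hypothetical; the real classes additionally deal with records that span several buffers, multiple channels, and events.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical toy model of the "drain the deserializer first, then poll the gate" loop. */
public class EmitLoopSketch {
    public enum Status { MORE_AVAILABLE, NOTHING_AVAILABLE }

    private final Deque<List<String>> gate = new ArrayDeque<>(); // toy input gate
    private Iterator<String> currentDeserializer;                // null while no buffer is being read

    public void enqueue(List<String> buffer) {
        gate.addLast(buffer);
    }

    /** Emits at most one record per call, mirroring the control flow of emitNext. */
    public Status emitNext(Consumer<String> output) {
        while (true) {
            if (currentDeserializer != null) {
                if (currentDeserializer.hasNext()) {
                    output.accept(currentDeserializer.next());
                    return Status.MORE_AVAILABLE;
                }
                currentDeserializer = null; // buffer fully consumed
            }
            List<String> next = gate.pollFirst();
            if (next == null) {
                return Status.NOTHING_AVAILABLE;
            }
            currentDeserializer = next.iterator(); // plays the role of setNextBuffer
        }
    }

    public static void main(String[] args) {
        EmitLoopSketch input = new EmitLoopSketch();
        input.enqueue(Arrays.asList("a", "b"));
        input.enqueue(Arrays.asList("c"));
        List<String> seen = new ArrayList<>();
        while (input.emitNext(seen::add) == Status.MORE_AVAILABLE) {
            // keep pulling until the gate is drained
        }
        System.out.println(seen);
    }
}
```

Returning after every single record, rather than draining a whole buffer in one call, gives the caller a chance to do other work between records.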

3. The process of checkpointBarrier

As mentioned in section 2, CheckpointBarriers are processed inside checkpointedInputGate.pollNext(). Let's take a look at this pollNext function:

@Override 
public Optional<BufferOrEvent> pollNext() throws Exception { 
   while (true) { 
      // Author's note: fetch data from the input gate
      Optional<BufferOrEvent> next = inputGate.pollNext(); 

      if (!next.isPresent()) { 
         return handleEmptyBuffer(); 
      } 

      BufferOrEvent bufferOrEvent = next.get(); 
      checkState(!barrierHandler.isBlocked(bufferOrEvent.getChannelInfo())); 
      // Author's note: buffer data (watermarks and user records) is not processed here; it is returned directly and handled by the caller.
      if (bufferOrEvent.isBuffer()) { 
         return next; 
      }  
      // Author's note: a CheckpointBarrier is extracted and processed here; see the processBarrier method for the details.
      else if (bufferOrEvent.getEvent().getClass() == CheckpointBarrier.class) {
         CheckpointBarrier checkpointBarrier = (CheckpointBarrier) bufferOrEvent.getEvent();
         // Author's note: there are two kinds of barrierHandler: CheckpointBarrierAligner and CheckpointBarrierTracker
         // CheckpointBarrierAligner corresponds to exactly-once semantics

         barrierHandler.processBarrier(checkpointBarrier, bufferOrEvent.getChannelInfo());
         return next;
      }
      else if (bufferOrEvent.getEvent().getClass() == CancelCheckpointMarker.class) {
         barrierHandler.processCancellationBarrier((CancelCheckpointMarker) bufferOrEvent.getEvent());
      }
      else {
         if (bufferOrEvent.getEvent().getClass() == EndOfPartitionEvent.class) {
            barrierHandler.processEndOfPartition();
         }
         return next;
      }
   }
}

Depending on the delivery semantics configured in Flink, a different barrierHandler is used. The following explains exactly-once and at-least-once in turn.

Exactly-once semantics: the barrierHandler for exactly-once is CheckpointBarrierAligner; as the name suggests, it aligns barriers. To guarantee exactly-once, a channel whose checkpoint barrier arrives first is blocked, and the blocks are released only after the barriers of all channels have arrived, at which point the checkpoint is triggered. The code is as follows:

@Override
	public void processBarrier(CheckpointBarrier receivedBarrier, InputChannelInfo channelInfo) throws Exception {
		final long barrierId = receivedBarrier.getId();

		// fast path for single channel cases
		// Author's note: if there is only one upstream channel, the barrier can be handled directly
		if (totalNumberOfInputChannels == 1) {
			resumeConsumption(channelInfo);
			if (barrierId > currentCheckpointId) {
				// new checkpoint
				currentCheckpointId = barrierId;
				notifyCheckpoint(receivedBarrier, latestAlignmentDurationNanos);
			}
			return;
		}

		// -- general code path for multiple input channels --

		// Author's note: check whether an alignment is already in progress
		if (isCheckpointPending()) {
			// this is only true if some alignment is already in progress and was not canceled
			// Author's note: if the barrier id equals the current checkpoint id, call onBarrier to block this channel
			if (barrierId == currentCheckpointId) {
				// regular case
				onBarrier(channelInfo);
			}
			
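			/*
			 * Author's note: if barrierId is greater than the current checkpoint id, the next barrier
			 * has arrived before the current checkpoint completed.
			 *   1. Normally this should not happen, because a channel stays blocked until the
			 *      pending checkpoint has been handled.
			 *   2. If it does happen, the pending checkpoint is abandoned and alignment restarts
			 *      from the new barrier id. Steps: notify the abort of the pending checkpoint -->
			 *      release all blocked channels and resume consumption --> begin alignment for
			 *      the new barrier id.
			 */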
			else if (barrierId > currentCheckpointId) {
				// we did not complete the current checkpoint, another started before
				LOG.warn("{}: Received checkpoint barrier for checkpoint {} before completing current checkpoint {}. " +
						"Skipping current checkpoint.",
					taskName,
					barrierId,
					currentCheckpointId);

				// let the task know we are not completing this
				// Author's note: notify that the pending checkpoint is aborted
				notifyAbort(currentCheckpointId,
					new CheckpointException(
						"Barrier id: " + barrierId,
						CheckpointFailureReason.CHECKPOINT_DECLINED_SUBSUMED));

				// abort the current checkpoint
				// Author's note: release all blocked channels and resume consuming data
				releaseBlocksAndResetBarriers();

				// begin a new checkpoint
				// Author's note: begin alignment for the new barrier id
				beginNewAlignment(barrierId, channelInfo, receivedBarrier.getTimestamp());
			}
			else {
				// ignore trailing barrier from an earlier checkpoint (obsolete now)
				// Author's note: abnormal case: barrierId is smaller than the current checkpoint id, so simply resume consumption on this channel
				resumeConsumption(channelInfo);
			}
		}
		// Author's note: if this is the first barrier of a new checkpoint, begin alignment for the new barrier id
		else if (barrierId > currentCheckpointId) {
			// first barrier of a new checkpoint
			beginNewAlignment(barrierId, channelInfo, receivedBarrier.getTimestamp());
		}
		// Author's note: abnormal case: barrierId is not greater than the current checkpoint id, so simply resume consumption on this channel
		else {
			// either the current checkpoint was canceled (numBarriers == 0) or
			// this barrier is from an old subsumed checkpoint
			resumeConsumption(channelInfo);
		}

		// check if we have all barriers - since canceled checkpoints always have zero barriers
		// this can only happen on a non canceled checkpoint
		// Author's note: once the barriers of all channels have been received, release the blocked channels so they resume consumption, and trigger the checkpoint for this barrier id
		if (numBarriersReceived + numClosedChannels == totalNumberOfInputChannels) {
			// actually trigger checkpoint
			if (LOG.isDebugEnabled()) {
				LOG.debug("{}: Received all barriers, triggering checkpoint {} at {}.",
					taskName,
					receivedBarrier.getId(),
					receivedBarrier.getTimestamp());
			}
			// Author's note: release the blocked channels and resume consuming data
			releaseBlocksAndResetBarriers();
			// Author's note: all barriers for this barrier id have arrived; trigger the checkpoint
			notifyCheckpoint(receivedBarrier, latestAlignmentDurationNanos);
		}
	}
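
The alignment idea can be condensed into a small, self-contained model. This is only a sketch under simplifying assumptions: channels are plain integers, closed channels, cancellation markers, and timing are ignored, and the class name AlignerSketch and its methods are hypothetical rather than Flink API.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical toy model of barrier alignment for exactly-once processing. */
public class AlignerSketch {
    private final int totalChannels;
    private final Set<Integer> blockedChannels = new HashSet<>();
    private long currentCheckpointId = -1L;

    public AlignerSketch(int totalChannels) {
        this.totalChannels = totalChannels;
    }

    /** Returns true when this barrier completes the alignment, i.e. the checkpoint would be triggered. */
    public boolean processBarrier(long barrierId, int channel) {
        if (barrierId > currentCheckpointId) {
            // First barrier of a newer checkpoint: abandon any unfinished alignment and start over.
            blockedChannels.clear();
            currentCheckpointId = barrierId;
        } else if (barrierId < currentCheckpointId || blockedChannels.isEmpty()) {
            // Barrier of an older checkpoint, or of one that already completed: ignore it.
            return false;
        }
        blockedChannels.add(channel); // block this channel until the others catch up
        if (blockedChannels.size() == totalChannels) {
            blockedChannels.clear(); // all barriers seen: release every block
            return true;
        }
        return false;
    }

    public boolean isBlocked(int channel) {
        return blockedChannels.contains(channel);
    }

    public static void main(String[] args) {
        AlignerSketch aligner = new AlignerSketch(2);
        System.out.println(aligner.processBarrier(1L, 0)); // channel 0 is now blocked
        System.out.println(aligner.isBlocked(0));
        System.out.println(aligner.processBarrier(1L, 1)); // alignment completes here
    }
}
```

While a channel is blocked, the records behind its barrier are held back, which is what keeps post-barrier data out of the snapshot.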

At-least-once semantics: the corresponding handler class is org.apache.flink.streaming.runtime.io.CheckpointBarrierTracker. Unlike exactly-once, at-least-once processing does not block channels. The code is as follows:

public void processBarrier(CheckpointBarrier receivedBarrier, InputChannelInfo channelInfo) throws Exception {
		final long barrierId = receivedBarrier.getId();

		// fast path for single channel trackers
		// Author's note: if there is only one upstream channel, trigger the checkpoint snapshot directly
		if (totalNumberOfInputChannels == 1) {
			notifyCheckpoint(receivedBarrier, 0);
			return;
		}

		// general path for multiple input channels
		if (LOG.isDebugEnabled()) {
			LOG.debug("Received barrier for checkpoint {} from channel {}", barrierId, channelInfo);
		}

		// find the checkpoint barrier in the queue of pending barriers
		CheckpointBarrierCount barrierCount = null;
		int pos = 0;

		// Author's note: look up the entry in pendingCheckpoints whose id equals the current barrierId
		for (CheckpointBarrierCount next : pendingCheckpoints) {
			if (next.checkpointId == barrierId) {
				barrierCount = next;
				break;
			}
			pos++;
		}

		if (barrierCount != null) {
			// add one to the count to that barrier and check for completion
			// Author's note: increment the number of barriers received for this barrierId
			int numBarriersNew = barrierCount.incrementBarrierCount();
			
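			// Author's note: once the barriers of all upstream channels have arrived for this barrierId, two things happen:
			//        1. Entries up to and including this barrierId are no longer tracked and are removed from pendingCheckpoints (pollFirst).
			//        2. The checkpoint for this barrierId is triggered via notifyCheckpoint.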
			if (numBarriersNew == totalNumberOfInputChannels) {
				// checkpoint can be triggered (or is aborted and all barriers have been seen)
				// first, remove this checkpoint and all all prior pending
				// checkpoints (which are now subsumed)
				for (int i = 0; i <= pos; i++) {
					pendingCheckpoints.pollFirst();
				}

				// notify the listener
				if (!barrierCount.isAborted()) {
					if (LOG.isDebugEnabled()) {
						LOG.debug("Received all barriers for checkpoint {}", barrierId);
					}

					notifyCheckpoint(receivedBarrier, 0);
				}
			}
		}
		// Author's note: if this barrierId is seen for the first time, add it to pendingCheckpoints
		else {
			// first barrier for that checkpoint ID
			// add it only if it is newer than the latest checkpoint.
			// if it is not newer than the latest checkpoint ID, then there cannot be a
			// successful checkpoint for that ID anyways
			if (barrierId > latestPendingCheckpointID) {
				markCheckpointStart(receivedBarrier.getTimestamp());
				latestPendingCheckpointID = barrierId;
				pendingCheckpoints.addLast(new CheckpointBarrierCount(barrierId));

				// make sure we do not track too many checkpoints
				// Author's note: to keep pendingCheckpoints from growing too large, drop the oldest barrierId once the threshold is exceeded
				if (pendingCheckpoints.size() > MAX_CHECKPOINTS_TO_TRACK) {
					pendingCheckpoints.pollFirst();
				}
			}
		}
	}
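
The counting logic of the tracker can likewise be reduced to a short model. This is a sketch under assumptions: aborted checkpoints and the MAX_CHECKPOINTS_TO_TRACK limit are left out, and the names TrackerSketch, Count, and pendingCount are hypothetical.

```java
import java.util.ArrayDeque;

/** Hypothetical toy model of barrier counting for at-least-once processing (no channel blocking). */
public class TrackerSketch {
    private static final class Count {
        final long checkpointId;
        int barriers = 1; // the barrier that created the entry is already counted

        Count(long checkpointId) {
            this.checkpointId = checkpointId;
        }
    }

    private final int totalChannels;
    private final ArrayDeque<Count> pending = new ArrayDeque<>();
    private long latestPendingId = -1L;

    public TrackerSketch(int totalChannels) {
        this.totalChannels = totalChannels;
    }

    /** Returns true when the barrier completes its checkpoint, i.e. the snapshot would be triggered. */
    public boolean processBarrier(long barrierId) {
        if (totalChannels == 1) {
            return true; // single channel: every barrier triggers immediately
        }
        Count match = null;
        int pos = 0;
        for (Count c : pending) {
            if (c.checkpointId == barrierId) {
                match = c;
                break;
            }
            pos++;
        }
        if (match != null) {
            if (++match.barriers == totalChannels) {
                // Remove this checkpoint and every older pending one (they are subsumed).
                for (int i = 0; i <= pos; i++) {
                    pending.pollFirst();
                }
                return true;
            }
        } else if (barrierId > latestPendingId) {
            latestPendingId = barrierId;
            pending.addLast(new Count(barrierId));
        }
        return false;
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        TrackerSketch tracker = new TrackerSketch(2);
        System.out.println(tracker.processBarrier(7L)); // first barrier of checkpoint 7
        System.out.println(tracker.processBarrier(7L)); // second barrier completes it
    }
}
```

Because nothing is blocked, records that arrive behind an early barrier are processed before the snapshot is taken and can be replayed after a restore, which is why this mode only guarantees at-least-once.
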
That concludes the processing of checkpoint barriers.

4. The process of watermark

As the code in section 2 shows, Buffer data is processed in the processElement(deserializationDelegate.getInstance(), output) function. Let's take a look at it:

private void processElement(StreamElement recordOrMark, DataOutput<T> output) throws Exception {
		// Author's note: handling of user records (custom data)
		if (recordOrMark.isRecord()){
			output.emitRecord(recordOrMark.asRecord());
		// Author's note: handling of Watermarks
		} else if (recordOrMark.isWatermark()) {
			statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), lastChannel);
		} else if (recordOrMark.isLatencyMarker()) {
			output.emitLatencyMarker(recordOrMark.asLatencyMarker());
		} else if (recordOrMark.isStreamStatus()) {
			statusWatermarkValve.inputStreamStatus(recordOrMark.asStreamStatus(), lastChannel);
		} else {
			throw new UnsupportedOperationException("Unknown type of StreamElement");
		}
	}
Watermark processing: let's step into statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), lastChannel):
public void inputWatermark(Watermark watermark, int channelIndex) throws Exception {
		// ignore the input watermark if its input channel, or all input channels are idle (i.e. overall the valve is idle).
		if (lastOutputStreamStatus.isActive() && channelStatuses[channelIndex].streamStatus.isActive()) {
			long watermarkMillis = watermark.getTimestamp();

			// if the input watermark's value is less than the last received watermark for its input channel, ignore it also.
			// Author's note: if the incoming watermark is greater than the watermark stored for this channel, replace the stored one
			if (watermarkMillis > channelStatuses[channelIndex].watermark) {
				channelStatuses[channelIndex].watermark = watermarkMillis;

				// previously unaligned input channels are now aligned if its watermark has caught up
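				// Author's note: lastOutputWatermark holds the minimum watermark that was last emitted across the aligned channels.
				//         If the incoming watermark is greater than or equal to lastOutputWatermark, this channel has caught up, so its isWatermarkAligned flag is set to true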
				if (!channelStatuses[channelIndex].isWatermarkAligned && watermarkMillis >= lastOutputWatermark) {
					channelStatuses[channelIndex].isWatermarkAligned = true;
				}

				// now, attempt to find a new min watermark across all aligned channels
				// Author's note: find the minimum watermark across the channels and process it
				findAndOutputNewMinWatermarkAcrossAlignedChannels();
			}
		}
	}

Analysis of the findAndOutputNewMinWatermarkAcrossAlignedChannels method:

private void findAndOutputNewMinWatermarkAcrossAlignedChannels() throws Exception {
		long newMinWatermark = Long.MAX_VALUE;
		boolean hasAlignedChannels = false;

		// determine new overall watermark by considering only watermark-aligned channels across all channels
		// Author's note: iterate over all channels and find the minimum watermark among the aligned ones
		for (InputChannelStatus channelStatus : channelStatuses) {
			if (channelStatus.isWatermarkAligned) {
				hasAlignedChannels = true;
				newMinWatermark = Math.min(channelStatus.watermark, newMinWatermark);
			}
		}

		// we acknowledge and output the new overall watermark if it really is aggregated
		// from some remaining aligned channel, and is also larger than the last output watermark
		// Author's note: if at least one channel is aligned and the minimum watermark is greater than lastOutputWatermark,
		//        update lastOutputWatermark and emit the new watermark.
		if (hasAlignedChannels && newMinWatermark > lastOutputWatermark) {
			lastOutputWatermark = newMinWatermark;
			output.emitWatermark(new Watermark(lastOutputWatermark));
		}
	}

Explanation: how should watermark alignment be understood?

Under normal circumstances, watermarks are generated periodically (for example, once every 10 seconds). When the current operator (typically a window operator) has received the watermarks generated in the same period from all upstream channels, watermark alignment is complete. This is analogous to completing a barrier alignment once CheckpointBarriers with the same barrierId have arrived from all channels.
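
The "minimum across channels" rule can be tried out with a small model. This sketch deliberately ignores idle channels and the isWatermarkAligned flag, so every channel always takes part in the minimum; the class name WatermarkValveSketch and the OptionalLong return value are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.OptionalLong;

/** Hypothetical toy model: the output watermark is the minimum of the per-channel watermarks. */
public class WatermarkValveSketch {
    private final long[] channelWatermarks;
    private long lastOutputWatermark = Long.MIN_VALUE;

    /** numChannels is assumed to be at least 1. */
    public WatermarkValveSketch(int numChannels) {
        channelWatermarks = new long[numChannels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Returns the newly emitted watermark, or empty if the overall minimum did not advance. */
    public OptionalLong inputWatermark(long watermark, int channel) {
        if (watermark <= channelWatermarks[channel]) {
            return OptionalLong.empty(); // not newer than what this channel already reported
        }
        channelWatermarks[channel] = watermark;
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        if (min > lastOutputWatermark) {
            lastOutputWatermark = min;
            return OptionalLong.of(min);
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        WatermarkValveSketch valve = new WatermarkValveSketch(2);
        System.out.println(valve.inputWatermark(10L, 0)); // channel 1 has not reported yet
        System.out.println(valve.inputWatermark(5L, 1));  // the minimum advances
    }
}
```

A fast channel therefore never pushes event time forward on its own; the slowest channel determines the watermark the operator sees.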

5. Custom data processing

Custom data is handled by the call output.emitRecord(recordOrMark.asRecord()) in the processElement function, repeated below. This call eventually passes the record to the user-defined function for computation.

private void processElement(StreamElement recordOrMark, DataOutput<T> output) throws Exception {
		// Author's note: handling of user records (custom data)
		if (recordOrMark.isRecord()){
			output.emitRecord(recordOrMark.asRecord());
		// Author's note: handling of Watermarks
		} else if (recordOrMark.isWatermark()) {
			statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), lastChannel);
		} else if (recordOrMark.isLatencyMarker()) {
			output.emitLatencyMarker(recordOrMark.asLatencyMarker());
		} else if (recordOrMark.isStreamStatus()) {
			statusWatermarkValve.inputStreamStatus(recordOrMark.asStreamStatus(), lastChannel);
		} else {
			throw new UnsupportedOperationException("Unknown type of StreamElement");
		}
	}

That is the whole path data takes after a task receives it.

Origin blog.csdn.net/chenzhiang1/article/details/126667231