Single point recovery and Regional Checkpoint optimization practice at ByteDance


Author|Liao Jiayi

Abstract: This article introduces two major features that ByteDance has built on Flink over the past period of time: a single point recovery feature at the network layer, and Regional Checkpoint at the checkpoint layer. The content includes:

  1. Single point recovery mechanism
  2. Regional Checkpoint
  3. Other optimizations in Checkpoint
  4. Challenges & future planning

A replay of the original talk is available at: https://www.bilibili.com/video/BV13a4y1H7XY?p=2

1. Single point recovery mechanism

In ByteDance's real-time recommendation scenario, we use Flink to join user features with user behaviors in real time, and the joined samples are used as input to the real-time model. The latency and stability of this joining service directly affect the recommendation quality that online products deliver to users. The joining service is essentially a dual-stream Join implemented in Flink, and a failure of any single Task or node in the job triggers a failover of the entire job, which hurts the real-time recommendation quality of the corresponding business.

Before introducing single point recovery, let's review Flink's Failover strategy.

  • Individual-Failover:

Only the failed tasks are restarted. This is suitable for situations where there are no connections between tasks, so its applicable scenarios are limited.

  • Region-Failover:

This strategy divides all tasks in the job into several regions. When a task fails, it tries to find the smallest set of regions that must be restarted for failure recovery. Compared with the global-restart recovery strategy, this strategy needs to restart far fewer tasks in some scenarios.

If the Region-Failover strategy is used here, the job's fully connected topology makes it a single large region, so restarting that region is equivalent to restarting the entire job. Could we then use Flink's Individual-Task-Failover strategy instead of the Region-Failover strategy? Unfortunately, the Individual-Task-Failover strategy is completely inapplicable under this topology. We therefore need to design and develop a new failover strategy for scenarios with the following characteristics:

  • Multi-stream Join topology
  • Large traffic (30M QPS), high parallelism (16K*16K)
  • Allow a small amount of data to be lost in a short time
  • High requirements for continuous data output

Before describing the technical solution, let's take a look at Flink's existing data transmission mechanism.

[Figure: Flink's existing data transmission mechanism]

Looking from left to right (SubTaskA):

  1. When data flows in, it is first received by the RecordWriter
  2. The RecordWriter shuffles the data according to information such as the key and selects the corresponding channel
  3. The data is loaded into a buffer and placed in the buffer queue of that channel
  4. The buffer is sent downstream through the Netty server
  5. The downstream Netty client receives the data
  6. Based on the partition information in the buffer, it is forwarded to the corresponding downstream channel
  7. The InputProcessor takes the data out of the buffer and executes the operator logic
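
To make steps 2 and 3 concrete, here is a minimal Java sketch of key-based channel selection and per-channel buffering. The class and method names are invented for illustration; this is not Flink's actual RecordWriter implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simplified stand-in for a record writer: pick a channel by key hash and
// append the serialized record to that channel's buffer queue.
public class SimpleRecordWriter {
    private final List<Queue<byte[]>> channelQueues = new ArrayList<>();

    public SimpleRecordWriter(int numChannels) {
        for (int i = 0; i < numChannels; i++) {
            channelQueues.add(new ArrayDeque<>());
        }
    }

    // Step 2: shuffle by key to select the target channel.
    private int selectChannel(String key) {
        return Math.abs(key.hashCode() % channelQueues.size());
    }

    // Step 3: load the serialized record into the buffer queue of that channel.
    public void emit(String key, byte[] serializedRecord) {
        int channel = selectChannel(key);
        channelQueues.get(channel).add(serializedRecord);
        // In the real system, buffers in this queue are later drained by the
        // Netty server and sent to the downstream task (steps 4-7).
    }
}
```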

According to the ideas put forward above, we have to solve the following problems:

  • How to make upstream tasks aware of downstream failure
  • After the downstream task fails, how to make the upstream task send data to the normal task
  • After the upstream task fails, how to make the downstream task continue to consume the data in the buffer
  • How to deal with incomplete data in upstream and downstream
  • How to establish a new connection

Let's address these problems one by one.

■  How to make upstream Task aware of downstream failure

[Figure: how the upstream Task perceives a downstream Task failure]

The downstream SubTask actively sends the failure information upstream; alternatively, if its TaskManager is shut down, the upstream Netty Server can also perceive the failure. In the figure, X marks an unavailable SubPartition.

First, SubPartition1 and its corresponding View (the structure the Netty Server uses to fetch SubPartition data) are marked as unavailable.

Later, when the RecordWriter receives new data that needs to be sent to SubPartition1, an availability check is performed: if the SubPartition is available, the data is sent normally; if it is unavailable, the data is discarded.
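
A minimal sketch of this availability check, using a hypothetical sub-partition wrapper rather than Flink's real classes: data destined for an unavailable SubPartition is simply dropped instead of failing the whole job.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sub-partition with an availability flag, illustrating the
// "check before write, drop if unavailable" behavior described above.
public class GuardedSubPartition {
    // Buffers waiting to be drained by the Netty server.
    private final Queue<byte[]> buffers = new ArrayDeque<>();
    private volatile boolean available = true;

    // Called when the downstream task (or its TaskManager) is reported as failed.
    public void markUnavailable() {
        available = false;
    }

    // Called after the downstream task reconnects and the view is re-created.
    public void markAvailable() {
        available = true;
    }

    // The writer checks availability before adding a buffer; an unavailable
    // sub-partition silently discards the record (a small, accepted data loss).
    public boolean add(byte[] buffer) {
        if (!available) {
            return false; // discarded
        }
        buffers.add(buffer);
        return true;
    }
}
```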

■  Upstream Task receives a new connection from downstream Task

[Figure: the upstream Task receiving a new connection from the downstream Task]

After the downstream SubTask is rescheduled and started, it sends a Partition Request to the upstream. On receiving the Partition Request, the upstream Netty Server re-creates the corresponding View for the downstream SubTask, and the upstream RecordWriter can then write data normally again.

■  Downstream tasks sense the failure of upstream tasks

[Figure: how downstream Tasks perceive an upstream Task failure]

Similarly, the downstream Netty Client can perceive that an upstream SubTask has failed. It then finds the corresponding channel and inserts an "unavailable event" at the end of its queue (an exclamation mark denotes the event in the figure). Since our goal is to lose as little data as possible, the buffers already in the channel continue to be consumed normally by the InputProcessor until the unavailable event is read; only then is the channel marked unavailable and its buffer queue cleaned up.
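
The "consume until the unavailable event, then clean up" behavior can be pictured with a sentinel object appended to a per-channel queue. The sketch below is illustrative only and does not reflect Flink's actual InputChannel code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative input channel: normal buffers are consumed as usual; a sentinel
// "unavailable event" appended at the tail marks where consumption must stop.
public class SentinelChannel {
    private static final byte[] UNAVAILABLE_EVENT = new byte[0]; // sentinel marker

    private final Queue<byte[]> buffers = new ArrayDeque<>();
    private boolean available = true;

    // Called when the Netty client learns that the upstream sub-task failed.
    public void onUpstreamFailure() {
        buffers.add(UNAVAILABLE_EVENT); // insert the event at the end of the queue
    }

    // Called by the input processor; returns null when no more data can be read.
    public byte[] poll() {
        if (!available) {
            return null;
        }
        byte[] next = buffers.poll();
        if (next == UNAVAILABLE_EVENT) {
            available = false;   // mark the channel unavailable
            buffers.clear();     // clean up the remaining buffer queue
            return null;
        }
        return next; // buffered data before the event is still consumed normally
    }

    public boolean isAvailable() {
        return available;
    }
}
```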

■  There is incomplete data in the buffer

The first thing to establish is where incomplete data lives: inside the InputProcessor. The InputProcessor maintains a small buffer queue for each channel. When a received buffer contains only part of a record, it is spliced together with the next buffer into a complete record before being sent to the operator.
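
Below is a minimal sketch of splicing incomplete records across buffers, assuming for illustration that every record is framed with a 4-byte length prefix; the real deserializer in Flink is more involved.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Per-channel reassembly of records that may be split across network buffers.
// Assumes each record is framed as [4-byte length][payload] for illustration.
public class RecordReassembler {
    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();

    // Feed one network buffer; return every record completed by it.
    public List<byte[]> onBuffer(byte[] buffer) {
        pending.write(buffer, 0, buffer.length);
        List<byte[]> completed = new ArrayList<>();

        ByteBuffer view = ByteBuffer.wrap(pending.toByteArray());
        while (view.remaining() >= 4) {
            int length = view.getInt(view.position());
            if (view.remaining() < 4 + length) {
                break; // incomplete record: wait for the next buffer
            }
            view.getInt(); // skip the length prefix
            byte[] record = new byte[length];
            view.get(record);
            completed.add(record);
        }

        // Keep only the unconsumed tail as the new "incomplete data".
        pending.reset();
        pending.write(view.array(), view.position(), view.remaining());
        return completed;
    }
}
```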

■  Downstream Task and Upstream Task Reconnect

[Figure: reconnection between the downstream Task and the upstream Task]

When the failed upstream task is rescheduled, it notifies the downstream by calling a TaskManager API. After receiving the notification, the downstream Shuffle Environment checks the state of the corresponding channel. If the channel is unavailable, a new channel is created directly and the old one is released. If it is still available, the channel's buffers have not been fully consumed yet, and the replacement is performed only after they have been consumed.
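
A sketch of this reconnection decision, with hypothetical types standing in for the shuffle environment and input channels: an unavailable channel is replaced immediately, while an available one is drained first.

```java
// Illustrative handling of the "upstream task was rescheduled" notification.
// All types and method names here are hypothetical; only the decision logic
// mirrors the behavior described above.
public class ChannelReplacer {

    // Minimal view of an input channel for this sketch.
    public interface InputChannelView {
        boolean isAvailable();          // false once its unavailable event was consumed
        boolean hasUnconsumedBuffers();
    }

    // Stand-in for the shuffle environment's channel management.
    public interface ChannelFactory {
        void releaseChannel(InputChannelView channel);
        void createNewChannel();        // connect to the rescheduled upstream task
    }

    public void onUpstreamRescheduled(InputChannelView oldChannel, ChannelFactory factory)
            throws InterruptedException {
        if (!oldChannel.isAvailable()) {
            // The old channel is already unavailable: release it and build a new one.
            factory.releaseChannel(oldChannel);
            factory.createNewChannel();
            return;
        }
        // The channel is still available, i.e. it still holds unconsumed buffers:
        // wait for them to be consumed before performing the replacement.
        while (oldChannel.hasUnconsumedBuffers()) {
            Thread.sleep(10); // simplified polling; real code would be event-driven
        }
        factory.releaseChannel(oldChannel);
        factory.createNewChannel();
    }
}
```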

Business benefits

[Figure: comparison test results]

The figure above shows a comparison test based on a job with a parallelism of 4000. The business logic joins a user impression stream with a user behavior stream, and the entire job has 12,000 Tasks.

"Single point recovery (with reserved resources)" in the figure is a feature built by the scheduling team: some extra resources are requested up front when the job asks for resources, so that when a failover occurs, the time spent requesting resources from YARN is eliminated.

In the end, the job's output dropped by only about one thousandth, and recovery took about 5 seconds. Because the entire recovery process is so short, it is essentially imperceptible to downstream consumers.

2. Regional Checkpoint

A classic data integration scenario is importing and exporting data, for example importing from Kafka into Hive. Such jobs have the following characteristics:

  • There is no All-to-All connection in the topology
  • Strong reliance on Checkpoint to deliver output under Exactly-Once semantics
  • The checkpoint interval is long, and a high success rate is required

In this scenario, the data does not go through any shuffle.

What are the problems encountered in the data integration scenario?

  • A single Task's Checkpoint failure blocks the global Checkpoint output
  • Network jitter, write timeouts/failures, and storage environment jitter have an outsized impact on the job
  • For jobs with parallelism above 2000, the success rate drops significantly, below business expectations

Recall that the job topology is divided into multiple regions under the region-failover strategy. Can Checkpoint take a similar approach and be managed at region granularity? The answer is yes.

In that case there is no need to wait for all tasks to complete their checkpoints before performing partition archiving operations (such as HDFS file renames); as soon as a region completes, the region-level checkpoint archiving can be carried out.

Before introducing the solution, let's briefly review Flink's existing checkpoint mechanism, which most readers are probably familiar with.

[Figure: Flink's existing checkpoint mechanism]

The figure above shows an example topology with a Kafka source and a Hive sink, each with a parallelism of 4.

First, the checkpoint coordinator sends a triggerCheckpoint request to each source Task. After a Task receives the request, it triggers the operators inside the Task to take snapshots; in the example there are 8 operator states.

[Figure: Flink's existing checkpoint mechanism, continued]

After each operator completes its snapshot, the Task sends an ACK message to the checkpoint coordinator to indicate that it has completed the checkpoint.

When the coordinator has received successful ACK messages from all Tasks, the checkpoint can be considered successful. Finally, the finalize operation is triggered to persist the corresponding metadata, and all Tasks are notified that the checkpoint is complete.
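
The trigger / ACK / finalize flow can be summarized with a highly simplified pending-checkpoint sketch; this is not the real CheckpointCoordinator code.

```java
import java.util.HashSet;
import java.util.Set;

// Highly simplified pending checkpoint: the coordinator counts ACKs from all
// tasks and finalizes the checkpoint once every task has acknowledged.
public class SimplePendingCheckpoint {
    private final long checkpointId;
    private final Set<String> notYetAcknowledged;

    public SimplePendingCheckpoint(long checkpointId, Set<String> allTaskIds) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(allTaskIds);
    }

    // Called when a task reports that its snapshot succeeded.
    public synchronized boolean acknowledge(String taskId) {
        notYetAcknowledged.remove(taskId);
        return notYetAcknowledged.isEmpty(); // true -> checkpoint can be finalized
    }

    // Called once all ACKs have arrived: persist metadata and notify tasks.
    public synchronized void finalizeCheckpoint() {
        if (!notYetAcknowledged.isEmpty()) {
            throw new IllegalStateException("checkpoint " + checkpointId + " not complete");
        }
        // 1. write the checkpoint metadata (e.g. to HDFS)
        // 2. notify all tasks that this checkpoint is complete
    }
}
```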

What problems will we encounter when we use the Region method to manage checkpoints?

  • How to divide Checkpoint Region

Tasks that are connected to each other are grouped into one region; in this example there are obviously four regions.

  • How to deal with the Checkpoint result of the failed Region

Assume the first checkpoint completes normally: the state of each operator is successfully written to the checkpoint1 directory on HDFS, and the 8 operator states are logically mapped into 4 checkpoint regions. Note that this is only a logical mapping; the physical files are neither moved nor modified.

[Figure: logical mapping of checkpoint regions to a previous successful checkpoint]

Suppose region-4 (Kafka-4, Hive-4) fails during the second checkpoint. There are then no Kafka-4-state or Hive-4-state files in the checkpoint2 (job/chk_2) directory, so checkpoint2 is incomplete. To make it complete, we find the successful state files for region-4 in the most recent previous successful checkpoint and logically map them into the current checkpoint. With every region's state files in place, the current checkpoint can be considered complete.

However, if most or all regions fail and simply reference the previous checkpoint, the current checkpoint is effectively identical to the previous one, which is meaningless.

We therefore make the maximum region failure ratio configurable. With 50%, for example, at most two of the four regions in the example may fail.
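
A hedged sketch of this completion decision: each failed region falls back to the previous successful checkpoint via logical mapping, and the checkpoint is rejected when the configured failure ratio is exceeded. All types and method names here are illustrative assumptions, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative decision logic for a regional checkpoint: each region either
// contributes fresh state or is logically mapped to the previous successful
// checkpoint, subject to a maximum failure ratio (e.g. 0.5).
public class RegionalCheckpointResolver {

    public static Map<String, String> resolve(
            Map<String, String> currentRegionState,  // region -> state path in this checkpoint (null if failed)
            Map<String, String> lastSuccessfulState, // region -> state path in the last successful checkpoint
            double maxFailureRatio) {

        Map<String, String> resolved = new HashMap<>();
        int failed = 0;

        for (Map.Entry<String, String> e : currentRegionState.entrySet()) {
            String region = e.getKey();
            String statePath = e.getValue();
            if (statePath != null) {
                resolved.put(region, statePath);       // region succeeded in this checkpoint
            } else {
                failed++;
                String previous = lastSuccessfulState.get(region);
                if (previous == null) {
                    return null;                       // no fallback available: checkpoint fails
                }
                resolved.put(region, previous);        // logical mapping, files untouched
            }
        }

        // Reject the checkpoint if too many regions had to fall back.
        if ((double) failed / currentRegionState.size() > maxFailureRatio) {
            return null;
        }
        return resolved;
    }
}
```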

  • How to avoid storing too much Checkpoint historical data on the file system

If a certain region keeps failing (for example because of dirty data or a bug in the code), the mechanism described above would force all historical checkpoint files to be retained, which is clearly unreasonable.

We also make the maximum number of consecutive failures per region configurable. A value of 2, for example, means that a region may reference region results from at most the two previous checkpoints.
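
The consecutive-failure limit can be tracked with a simple per-region counter, sketched below purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Tracks how many checkpoints in a row each region has failed; once the limit
// is exceeded, the region may no longer reference an older checkpoint's result.
public class RegionFailureTracker {
    private final int maxConsecutiveFailures; // e.g. 2
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    public RegionFailureTracker(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Returns true if the region is still allowed to fall back to a previous checkpoint.
    public boolean onRegionFailed(String region) {
        int count = consecutiveFailures.merge(region, 1, Integer::sum);
        return count <= maxConsecutiveFailures;
    }

    public void onRegionSucceeded(String region) {
        consecutiveFailures.remove(region); // reset the failure streak
    }
}
```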

Difficulties in the engineering implementation

  • How to handle Task failures and checkpoint timeouts
  • How to handle subtasks in the same region whose snapshots have already succeeded
  • How to stay compatible with the checkpoint coordinator

Let's first look at how Flink currently handles these situations.

[Figure: how the existing checkpoint coordinator handles failures and timeouts]

When a task fails, the JobMaster's FailoverStrategy is notified first, and the FailoverStrategy then notifies the checkpoint coordinator to cancel the checkpoint.

What about checkpoint timeouts? When the coordinator triggers a checkpoint, a checkpoint canceller is started as well. The canceller contains a timer: if the preset time elapses before the coordinator completes the checkpoint, the checkpoint has timed out and the coordinator is notified to cancel it.
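
The canceller's timer can be pictured as a scheduled task that cancels the checkpoint if it is still pending when the timeout fires. This is a minimal sketch, not the actual canceller implementation.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal checkpoint canceller: when a checkpoint is triggered, a timer is armed;
// if the checkpoint has not completed when the timer fires, it is cancelled.
public class SimpleCheckpointCanceller {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // Arm the timer; the returned Runnable disarms it when the checkpoint completes in time.
    public Runnable arm(long checkpointId, long timeoutMillis, Runnable cancelAction) {
        AtomicBoolean completed = new AtomicBoolean(false);
        ScheduledFuture<?> future = timer.schedule(() -> {
            if (!completed.get()) {
                // Timeout exceeded before completion: notify the coordinator to cancel.
                cancelAction.run();
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        return () -> {
            completed.set(true);
            future.cancel(false);
        };
    }

    public void shutdown() {
        timer.shutdownNow();
    }
}
```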

Whether caused by a Task failure or a timeout, the cancellation ultimately acts on the pending checkpoint, and that pending checkpoint is discarded.

Before making the corresponding changes, we sorted out the checkpoint-related messages and the responses the checkpoint coordinator makes to them.

[Figure: checkpoint-related messages and the coordinator's responses]

Global checkpoint is Flink's existing mechanism.

[Figure: the CheckpointHandle interface and its two implementations]

To stay compatible with the checkpoint coordinator, we added a CheckpointHandle interface along with two implementations, GlobalCheckpointHandle and RegionalCheckpointHandle, which implement the global-checkpoint and regional-checkpoint operations respectively by filtering messages.

A brief note on the regional checkpoint: when the handler receives a failure message, it marks the corresponding region as failed and tries to logically map that region from the previous successful checkpoint. Similarly, the notifyComplete messages sent by the coordinator are first filtered by the handler, which drops the messages destined for failed tasks.
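
The interface and implementation names below come from the design described above; the method signatures and message handling are simplified assumptions meant only to show the message-filtering idea, not the real API.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the CheckpointHandle idea: the coordinator stays unchanged and
// delegates message handling to a handle, which may filter messages.
public interface CheckpointHandle {

    // Decide whether a task's failure message fails the whole checkpoint.
    boolean onTaskCheckpointFailure(String taskId);

    // Filter the set of tasks that should receive the notify-complete message.
    Collection<String> filterNotifyCompleteTargets(Collection<String> allTasks);
}

// Global behavior: any failure fails the checkpoint, all tasks get notified.
class GlobalCheckpointHandle implements CheckpointHandle {
    @Override
    public boolean onTaskCheckpointFailure(String taskId) {
        return true; // the whole checkpoint fails
    }

    @Override
    public Collection<String> filterNotifyCompleteTargets(Collection<String> allTasks) {
        return allTasks;
    }
}

// Regional behavior: mark only the task's region as failed (it will be mapped
// to the previous successful checkpoint) and skip notifications to failed tasks.
class RegionalCheckpointHandle implements CheckpointHandle {
    private final Set<String> failedTasks = new HashSet<>();

    @Override
    public boolean onTaskCheckpointFailure(String taskId) {
        failedTasks.add(taskId); // the region containing this task falls back
        return false;            // the checkpoint as a whole can still succeed
    }

    @Override
    public Collection<String> filterNotifyCompleteTargets(Collection<String> allTasks) {
        return allTasks.stream()
                .filter(t -> !failedTasks.contains(t))
                .collect(Collectors.toList());
    }
}
```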

Business benefits

The test assumes a parallelism of 5000 and a 99.99% success rate for a single Task snapshot. The overall success rate with the global checkpoint is then 60.65% (0.9999^5000 ≈ 0.6065), while with the regional checkpoint it can still be kept at 99.99%.

3. Other optimizations on Checkpoint


■  Parallel restore of operator state

Union state is a special kind of state: on restore, the state of all Tasks of the job must be collected and their union restored to every single Task. If the job parallelism is very large, say 10,000, then restoring the union state of each Task requires reading at least 10,000 files. If the state in these 10,000 files is restored serially, you can imagine how long the restore takes.

Although the data structure backing OperatorState cannot be operated on in parallel, reading the files can be parallelized. In the restore process of the OperatorStateBackend, we parallelize the reading of the HDFS files and wait until all state files have been parsed and loaded into memory; only then do we switch back to single-threaded processing. This reduces a state restore that used to take tens of minutes down to a few minutes.
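
A sketch of the "read files in parallel, merge single-threaded" idea in plain Java; the helper types are assumptions and this is not the actual OperatorStateBackend change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Parallelize only the I/O-heavy part (reading and parsing state files);
// merging the parsed results into the in-memory state stays single-threaded.
public class ParallelUnionStateRestore {

    // Stand-in for reading and parsing one state file (e.g. from HDFS).
    public interface StateFileReader<T> {
        T readAndParse(String filePath) throws Exception;
    }

    public static <T> List<T> restore(List<String> stateFiles,
                                      StateFileReader<T> reader,
                                      int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            // Phase 1: read and parse all state files in parallel.
            List<Future<T>> futures = new ArrayList<>();
            for (String file : stateFiles) {
                futures.add(pool.submit(() -> reader.readAndParse(file)));
            }

            // Phase 2: collect results on a single thread; the caller then
            // unions them into the operator state sequentially.
            List<T> parsed = new ArrayList<>(futures.size());
            for (Future<T> f : futures) {
                parsed.add(f.get());
            }
            return parsed;
        } finally {
            pool.shutdown();
        }
    }
}
```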

■  Enhance the CheckpointScheduler and support triggering Checkpoints on the hour

Flink's checkpoint interval and timeout cannot be modified after the job is submitted, so they can only be set from experience when the job first goes live. However, we often find during peak traffic that the interval, timeout, and other parameters are set unreasonably. The usual remedy is to change the parameters and restart the job, which has a considerable impact on the business; this approach is obviously not ideal.

We therefore refactored the checkpoint trigger mechanism inside the CheckpointCoordinator and abstracted the existing trigger process, so that customized trigger mechanisms can be built quickly on top of the abstraction. For example, to form Hive partitions faster in data import scenarios, we implemented a trigger mechanism that fires on the hour, so that downstream consumers can see the data as soon as possible.
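
As an illustration of such a customized trigger, here is a minimal on-the-hour scheduler; the class and its API are hypothetical and do not reflect the refactored CheckpointCoordinator interface.

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Triggers a checkpoint exactly on every whole hour, so that Hive partitions
// for the past hour can be committed as soon as possible.
public class HourlyCheckpointTrigger {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(Runnable triggerCheckpoint) {
        scheduleNext(triggerCheckpoint);
    }

    private void scheduleNext(Runnable triggerCheckpoint) {
        ZonedDateTime now = ZonedDateTime.now();
        ZonedDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        long delayMillis = Duration.between(now, nextHour).toMillis();

        scheduler.schedule(() -> {
            triggerCheckpoint.run();         // ask the coordinator to trigger a checkpoint
            scheduleNext(triggerCheckpoint); // re-arm for the following hour
        }, delayMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```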

There are many other optimization points that we will not list one by one.

4. Challenges & Future Planning

image.png

At present, job state inside ByteDance can reach roughly 200 TB, and for such high-traffic, large-state jobs the RocksDB StateBackend cannot be used directly. So going forward we will continue to work on the performance and stability of state and checkpointing, for example strengthening the existing StateBackend, solving the checkpoint speed problems under data skew and backpressure, and enhancing debugging capabilities.

Event recommendation:

For only 99 yuan, you can try Realtime Compute for Apache Flink, Alibaba Cloud's enterprise-level product based on Apache Flink. Click the link below to learn more about the event: https://www.aliyun.com/product/bigdata/sc?utm_content=g_1000250506
