Flink's checkpoint and savepoint mechanism

Checkpoint

Overview

In Flink, checkpoints are a key mechanism for achieving state consistency and failure recovery. Checkpointing ensures that the state of a job can be reliably restored in the event of a failure.

Checkpoints have the following characteristics:

State consistency: a checkpoint persists the job's state data to durable storage to guarantee consistency. By taking a snapshot of the job state at a specific point in time, the state of the entire job is captured.

Fault tolerance: when a job fails, Flink uses the most recent checkpoint to restore the job's state. During recovery, Flink starts from the most recently completed checkpoint.

Coordination mechanism: to guarantee consistency, Flink injects checkpoint barriers into the streams when a checkpoint is triggered; each task snapshots its state as the barrier passes and then continues processing, so the job as a whole keeps running (see the checkpoint algorithm below).

Checkpoint modes: Flink provides two checkpoint modes:
	Exactly-once: guarantees that each record affects the state exactly once, and that the state can be restored precisely to the checkpointed state
	At-least-once: allows a record to be processed more than once after a failure, but still restores the state from the checkpoint

High availability: Flink supports storing checkpoint data on a distributed file system (such as HDFS) for high availability and fault tolerance (a minimal configuration sketch follows this list)
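As a preview of the configuration section later in this article, here is a minimal sketch (the HDFS path and interval are assumptions, not recommendations) that enables exactly-once checkpointing and stores checkpoint data on a distributed file system:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpoint every 5 s with exactly-once guarantees
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
// keep checkpoint data on HDFS for high availability (path is illustrative)
env.getCheckpointConfig().setCheckpointStorage("hdfs://node01:9000/flink/checkpoints");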

When checkpoints are saved

In Flink, the saving of checkpoints is controlled by configuring the checkpoint interval and trigger conditions.

Checkpoint Interval:

A time interval can be configured; after it elapses, Flink automatically triggers a new checkpoint.

The checkpoint interval can be tuned to the application's requirements, typically based on processing latency and throughput.

A shorter interval means more frequent checkpoints, which may increase system overhead and latency; a longer interval reduces overhead but increases the amount of work that is lost and must be reprocessed after a failure.

Triggering Condition:

Exactly-once processing: checkpoint triggering can be tied to "exactly-once" semantics, meaning a new checkpoint is triggered only after the previous one has completed successfully, preserving data consistency and integrity.

Data size based: a data-size threshold can be set so that a checkpoint is triggered after a certain number of records has been received. The threshold can be tuned to balance overhead against consistency.

Time based: a time threshold can be set so that a checkpoint is forced if none has been triggered within the given interval. This ensures state is saved periodically even when no data is flowing. (A configuration sketch for the time-based settings follows this list.)
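A minimal sketch of the time-based settings above, using standard CheckpointConfig calls (the concrete values are illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// trigger a checkpoint roughly every 60 s
env.enableCheckpointing(60000);
// wait at least 30 s after one checkpoint completes before starting the next
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30000);
// discard a checkpoint that has not completed within 10 minutes
env.getCheckpointConfig().setCheckpointTimeout(600000);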

Save and restore

The checkpoint saving process in Flink is roughly divided into the following steps:

1. Trigger the checkpoint: Flink automatically triggers a new checkpoint based on the configured conditions (such as the time interval), or a checkpoint can be triggered explicitly.

2. Quiesce state updates: when the checkpoint is triggered, barriers ensure that no further state updates are mixed into the snapshot at each task; the job as a whole keeps running rather than being stopped.

3. State snapshot: each task then snapshots the application's state, including keyed state and operator state, and the snapshot data is written to persistent storage. (See the sketch after this list for the operator-level hooks involved.)

4. Store the checkpoint data: Flink persists the state snapshot to the configured checkpoint storage backend. The concrete storage depends on the chosen backend, e.g. a distributed file system, an object store, or a remote file system.

5. Resume processing: once the state has been persisted, the tasks continue processing input data.
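To make step 3 concrete, the sketch below (class name and buffering logic are illustrative, not from the original text) shows the operator-level hooks of Flink's CheckpointedFunction interface: snapshotState() runs when a checkpoint is taken at the operator, and initializeState() runs both on first start and on recovery.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

// Illustrative sink that buffers elements; the buffer is registered as operator
// state, so it is included in every checkpoint and handed back on restore.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private final List<String> buffer = new ArrayList<>();
    private transient ListState<String> checkpointedState;

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value); // normal processing between checkpoints
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // called during step 3: copy the current buffer into managed operator state
        checkpointedState.clear();
        checkpointedState.addAll(buffer);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("buffered-elements", Types.STRING);
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored()) {
            // executed on recovery: reload the buffer from the checkpoint
            for (String element : checkpointedState.get()) {
                buffer.add(element);
            }
        }
    }
}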

In the event of a failure, Flink can use the saved checkpoint data to recover the application. The process is as follows:

1. Failure detection: while monitoring job status, Flink detects a failure, for example a crashed compute node.

2. Select the latest checkpoint: once the failure is detected, Flink selects the most recently completed checkpoint.

3. Restore state from the checkpoint: Flink reads the state data from the checkpoint's storage location and loads it into the application, restoring its state.

4. Resume processing: once the state is loaded, Flink restarts the application's tasks and continues processing input data from the checkpointed state. (A restart-strategy sketch follows this list.)
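Recovery is automatic once checkpointing and a restart strategy are configured. A minimal sketch (retry count and delay are illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000);
// restart up to 3 times with a 10 s delay between attempts; after each restart,
// Flink reloads state from the most recently completed checkpoint
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));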

Checkpoint algorithm

In Flink, distributed snapshots based on the Chandy-Lamport algorithm are used to save state backups to checkpoints without pausing the overall stream processing.

Flink's checkpoint algorithm is built on this snapshot mechanism. It is known as asynchronous barrier snapshotting and is used to achieve checkpoint consistency.

Through the asynchronous barrier algorithm, Flink can achieve consistent preservation of application state in a distributed environment, ensuring that the state can be correctly restored when a failure occurs and data processing can continue. The design of this algorithm takes into account the characteristics of concurrency and asynchronous processing in a distributed environment to ensure an efficient and reliable checkpoint mechanism.

The basic process of the checkpoint algorithm:

Trigger the checkpoint: based on the configured trigger conditions, Flink triggers a new checkpoint

Asynchronous barrier injection: barriers are injected into the streams and flow between all operators. A barrier is a special record that propagates through the stream; when it reaches an operator, that input is blocked until barriers have been received from all parallel input partitions of the stream

Wait for all barriers: once barriers are injected, every operator waits for the barriers on all of its inputs. This guarantees that all records up to the barrier are consistent and effectively prevents any further state updates from leaking into the snapshot

State snapshot: after all barriers have arrived, Flink performs the actual state snapshot, writing each operator's state data to persistent storage as part of the checkpoint

Barrier forwarding: once the state snapshot completes, the barrier is forwarded to downstream operators, signalling that they may continue processing records

Recovery: when a failure occurs, Flink restores the application from the saved checkpoint data by loading the state data back into the application (a toy model of barrier alignment follows this list)
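The aligned variant of this process can be illustrated with a small, self-contained toy model in plain Java (this is an illustration only, not Flink internals): each input channel is blocked as soon as its barrier arrives, and the state snapshot is taken only once barriers have arrived on all channels.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class BarrierAlignmentDemo {

    static final String BARRIER = "|BARRIER|";

    public static void main(String[] args) {
        // two input channels; each carries one barrier for the same checkpoint
        List<Queue<String>> channels = Arrays.asList(
                new ArrayDeque<>(List.of("a1", "a2", BARRIER, "a3")),
                new ArrayDeque<>(List.of("b1", BARRIER, "b2", "b3")));
        boolean[] blocked = new boolean[channels.size()];
        long state = 0; // trivial operator state: count of processed records

        while (channels.stream().anyMatch(q -> !q.isEmpty())) {
            for (int i = 0; i < channels.size(); i++) {
                if (blocked[i] || channels.get(i).isEmpty()) continue;
                String record = channels.get(i).poll();
                if (BARRIER.equals(record)) {
                    blocked[i] = true; // block this channel until alignment completes
                } else {
                    state++; // "process" the record
                    System.out.println("processed " + record);
                }
            }
            boolean aligned = true;
            for (boolean b : blocked) aligned &= b;
            if (aligned) {
                // all barriers received: snapshot, then unblock and forward the barrier
                System.out.println("all barriers received -> snapshot state = " + state);
                Arrays.fill(blocked, false);
            }
        }
    }
}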

Checkpoint dividing lines (barriers)

The checkpoint dividing line (barrier) is a special record used in Flink to ensure the consistency of checkpoints. In Flink's asynchronous barrier snapshotting, barrier injection and dividing lines are two views of the same concept.

When a new checkpoint is triggered, Flink injects barriers into the streams between all operators. A barrier is a special record that propagates through the stream; upon reaching an operator it blocks that input until the barrier has been received on all operators across all parallel partitions of the stream.

The arrival of the barrier serves two important functions:

Consistency dividing line: the arrival of a barrier means that every record before that position belongs to the previous checkpoint, while records after it belong to the new checkpoint. Barriers thus split the stream into consistent segments, ensuring that a new checkpoint begins after the barrier once the previous one completes.

Input blocking: the arrival of a barrier blocks that input until the barrier has been received on all parallel input partitions. This guarantees that all inputs have reached a consistent checkpoint position before the operator goes on processing records.

By injecting barriers and handling their arrival, Flink guarantees checkpoint consistency and the integrity of state snapshots. Barriers serve as checkpoint boundaries, dividing the stream into consistent segments and ensuring that all operators process new records only after the checkpoint completes.

Exactly-once with barrier alignment

The distributed snapshot algorithm achieves its exactly-once guarantee by combining barrier alignment with state snapshotting.

The process is as follows:

Barrier alignment: when a checkpoint starts, Flink injects barriers into all parallel partitions of the stream. The arrival of a barrier tells an operator that the input carrying it has reached a consistent checkpoint position.

State snapshot: after barrier alignment, Flink uses the state snapshot mechanism to save each operator's state data to persistent storage, forming a consistent checkpoint.

Checkpoint acknowledgement: after the state snapshot completes, Flink waits for every task participating in the distributed snapshot to acknowledge that it has completed the checkpoint. This ensures all tasks have saved their state successfully.

Coordinator notification: once all tasks have acknowledged, the checkpoint coordinator sends a notification that the checkpoint is complete.

Recovery: when a failure occurs, Flink restores the application from the saved checkpoint data. It loads the state from the checkpoint, and barrier alignment guarantees that records beyond the checkpoint position were not included in it.

This algorithm guarantees that the application state can be correctly restored when a failure occurs while avoiding duplicate processing and data loss, thereby achieving the exactly-once guarantee.

A task backs up its local state only after it has received barriers with the same checkpoint ID from all upstream inputs. During barrier alignment, records arriving behind an already-received barrier are blocked and buffered; they never overtake the barrier.
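In code, this behavior corresponds to the exactly-once checkpointing mode; a one-line sketch (equivalent to the enableCheckpointing variant shown in the configuration section below):

env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);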

At-least-once with barrier alignment

The at-least-once semantics of the distributed snapshot algorithm is likewise implemented through the checkpoint algorithm based on asynchronous barriers.

The process is as follows:

Trigger the checkpoint: based on the configured trigger conditions, Flink triggers a new checkpoint.

Asynchronous barrier injection: barriers are injected and propagate between all operators. In at-least-once mode an operator still waits until it has received the barrier from all parallel input partitions before snapshotting, but it does not block the inputs in the meantime.

Checkpoint acknowledgement: after the barriers are injected, Flink waits for every task participating in the checkpoint to confirm that it has handled the barrier. Once all confirmations arrive, the next step can proceed.

State snapshot: after the acknowledgements, Flink performs the actual state snapshot, writing each operator's state data to persistent storage as part of the checkpoint.

Barrier forwarding: once the state snapshot completes, the barrier is forwarded to downstream operators, signalling that they may continue processing records.

Although some duplicate processing may occur after a failure, this algorithm guarantees at-least-once semantics: every record is processed at least once. It is a fault-tolerance mechanism that ensures the application state can be correctly restored after a failure and that no record is lost.

A task backs up its local state only after it has received barriers with the same checkpoint ID from all upstream inputs. During barrier alignment, records behind a barrier that arrived early continue to be processed without blocking, which is why duplicates can occur on recovery.
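This non-blocking behavior corresponds to the at-least-once checkpointing mode; a one-line sketch:

env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);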

Exactly-once without barrier alignment (unaligned checkpoints)

Unaligned checkpoints achieve exactly-once semantics without barrier alignment, relying on asynchronous processing and checkpoint saving instead.

The process is as follows:

Trigger the checkpoint: based on the configured trigger conditions, Flink triggers a new checkpoint.

Unaligned snapshotting: unlike the barrier-aligned algorithm, the unaligned algorithm does not block inputs to wait for alignment. In each operator, Flink takes the state snapshot asynchronously and saves the state data to persistent storage, forming a consistent checkpoint.

Checkpoint acknowledgement: once an operator completes its state snapshot, Flink waits for every task participating in the checkpoint to confirm that it has finished saving its state. This ensures all tasks have saved their state successfully.

Coordinator notification: when all tasks have confirmed, the checkpoint coordinator sends a notification that the checkpoint is complete.

Recovery: when a failure occurs, Flink restores the application from the saved checkpoint data. It loads the checkpointed state from persistent storage, and the asynchronous bookkeeping ensures that records beyond the failure point are not processed twice.

The algorithm utilizes asynchronous processing and checkpoint saving to achieve state consistency, and uses a recovery mechanism to handle failure situations, ensuring that the application's state can be restored correctly.

A task starts its backup as soon as it receives the first barrier; together with the buffered in-flight data, this still guarantees exactly-once execution (a configuration sketch follows the list below).

For a barrier that has already arrived: the local state is backed up, and records behind it continue to be processed and emitted

For barriers that have not yet arrived: records ahead of them continue to be processed and emitted, and are additionally written into the backup as in-flight data

When the last barrier reaches the task, the task's backup is complete
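A minimal configuration sketch for unaligned checkpoints (note the prerequisites spelled out in the configuration section below: exactly-once mode and at most one concurrent checkpoint):

env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableUnalignedCheckpoints();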

Checkpoint configuration

Checkpoints exist for fault recovery, but saving them must not take so long that data-processing performance degrades noticeably. To balance fault tolerance against processing performance, checkpoints can be configured in various ways in code.

Enable checkpoints

Flink programs have checkpointing disabled by default. To enable automatic snapshots for a Flink application, explicitly call the enableCheckpointing() method on the execution environment.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpoint interval in milliseconds: how often checkpoints are saved periodically
// (the deprecated no-argument variant defaulted to a 500 ms interval)
// trigger a checkpoint save every 1 second
env.enableCheckpointing(1000);
// enable checkpointing: barrier-aligned by default, 5 s period, exactly-once
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
// 5 s period, at-least-once
env.enableCheckpointing(5000, CheckpointingMode.AT_LEAST_ONCE);

The time between checkpoints is a trade-off between processing performance and speed of failure recovery.

To minimize the impact on performance, increase the interval.

To catch up with real-time processing quickly after a failure restart, decrease the interval.

Specify storage location

The specific persistent storage location of the checkpoint depends on the checkpoint storage settings. By default, checkpoints are stored in the JobManager's heap memory. For the persistence of large states, Flink also provides an interface for saving in other storage locations.

Flink mainly provides two persistent storage locations: the JobManager's heap memory and a file system. Configure it by calling setCheckpointStorage() on the checkpoint configuration and passing in a CheckpointStorage implementation.

1. Storing checkpoints in the JobManager's heap memory

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointStorage(new JobManagerCheckpointStorage());

2. Storing checkpoints in a file system. Production applications generally configure CheckpointStorage as a highly available distributed file system.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointStorage(new FileSystemCheckpointStorage("hdfs://node01:9000/flink/checkpoints"));

Other configurations

Checkpoints also have many configurable options, which can be set by getting the checkpoint configuration.

CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // timeout, 10 minutes by default: how long a checkpoint may take to complete;
        // a checkpoint that exceeds the timeout is discarded. Takes a long value in milliseconds.
        checkpointConfig.setCheckpointTimeout(60000);

        // maximum number of concurrent checkpoints: how many checkpoints may be in progress at once.
        // Because tasks progress at different speeds, a later task may still be saving one checkpoint
        // while an earlier task has already begun saving the next; this parameter caps that overlap.
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // minimum pause: how long the checkpoint coordinator must wait after one checkpoint completes
        // before it may trigger the next one. Setting this implies an effective concurrency of 1.
        // Put plainly: the gap between the end of one checkpoint round and the start of the next.
        checkpointConfig.setMinPauseBetweenCheckpoints(1000);

        // enable externalized checkpoint persistence; by default the data is not cleaned up
        // automatically when the job fails.
        // The ExternalizedCheckpointCleanup argument controls how external checkpoints are cleaned up
        // when the job is cancelled, i.e. whether the data is kept in the external system:
        // DELETE_ON_CANCELLATION: delete external checkpoints when the job is cancelled,
        //                         but keep them if the job exits due to a failure
        // RETAIN_ON_CANCELLATION: keep external checkpoints even when the job is cancelled
        checkpointConfig.setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // tolerable consecutive checkpoint failures: once this many checkpoints fail in a row,
        // the job fails and exits. The default is 0.
        checkpointConfig.setTolerableCheckpointFailureNumber(10);

        // enable unaligned checkpoints: skips the barrier alignment step, which can greatly
        // reduce checkpointing time under backpressure.
        // Requires exactly-once checkpoint mode and at most 1 concurrent checkpoint.
        checkpointConfig.enableUnalignedCheckpoints();

        // aligned checkpoint timeout: only effective when unaligned checkpoints are enabled.
        // The default of 0 means unaligned checkpoints are used from the start; if set above 0,
        // checkpoints start out aligned and switch to unaligned automatically once alignment
        // takes longer than this timeout.
        checkpointConfig.setAlignedCheckpointTimeout(Duration.ofSeconds(1));

Generic incremental checkpoints

Flink’s incremental checkpointing is an optimization technique used to reduce the storage and processing costs required for checkpointing. Traditional full checkpointing requires saving the state of the entire application at each checkpoint. Incremental checkpoints, on the other hand, save only the portion of the state that has changed since the previous checkpoint, reducing storage and processing burdens.

Generic incremental checkpointing (the changelog state backend) was introduced in Flink 1.15. It implements incremental checkpoints by combining full checkpoints with incremental changelog checkpoints.

The specific incremental checkpoint process is as follows:

Full checkpoint: the first checkpoint that is triggered is a full checkpoint, saving the entire application state.

Incremental checkpoint: as the application state changes after the previous checkpoint, Flink performs incremental checkpoints that save only the changed parts of the state. The changes are compared against the last full checkpoint, and only the differences are kept.

Checkpoint result: the final checkpoint consists of the state from the last full checkpoint plus the changes recorded by the incremental checkpoints.

For crash recovery:

Flink can use these checkpoint results for recovery operations. The application's state is restored by first loading the state from the last full checkpoint into memory, and then applying the changes from the incremental checkpoint in sequence.

Advantages and Disadvantages:

Incremental checkpoints can reduce the amount of data stored and transmitted, thereby reducing the overhead of checkpoint generation and recovery. However, since incremental checkpoints require calculation and processing of state changes, some additional computational overhead will be introduced.

Note: before Flink 1.15, only RocksDB supported incremental snapshots. It is enabled as follows:

EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true);

Method 1: specify it in the configuration file

# enable incremental (changelog) checkpoints
state.backend.changelog.enabled: true
state.backend.changelog.storage: filesystem

# where changelog data is stored
dstl.dfs.base-path: hdfs://node01:9000/changelog
execution.checkpointing.max-concurrent-checkpoints: 1
execution.savepoint-restore-mode: CLAIM

Method 2: set it in code
This requires adding the following dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-changelog</artifactId>
    <version>${flink.version}</version>
    <scope>runtime</scope>
</dependency>

Enable the changelog state backend:

env.enableChangelogStateBackend(true);

Savepoint

Overview

In Flink, savepoints are a mechanism for long-term storage of application state. Savepoints can be used to perform restarts and upgrades when an application encounters a failure or requires changes to application code.

Savepoints are a special form of checkpoint that can be used to migrate job state between different Flink versions. They are cross-version state snapshots that allow a job to be moved from one Flink version to another without losing state. Savepoints are typically used for job upgrades, experiments, and version rollbacks.

The difference between savepoints and checkpoints:

The biggest difference between savepoints and checkpoints is the timing of triggering:

Checkpoints are managed automatically by Flink: they are created periodically and read automatically for recovery after a failure.

Savepoints are never created automatically; the user must trigger the save operation explicitly.

Purpose of savepoints:

Savepoints can be used as a powerful operation and maintenance tool. You can create a savepoint when needed, then stop the application, make some processing adjustments, and then restart from the savepoint.

Application scenarios:

Updating the application

Upgrading a job

Adjusting parallelism

Experiments and testing

Version rollback

Pausing the application

Troubleshooting and analysis

Operator ID:

Operator IDs can be specified directly in code. For operators without an explicit ID, Flink generates one automatically; after the application is modified and restarted, those generated IDs may change and become incompatible with the previously saved state.

Manually specify an ID for each operator in the program:

env.addSource(new StatefulSource()).uid("source-id")   // ID for the source operator
        .map(new StatefulMapper()).uid("mapper-id")    // ID for the mapper
        .print();                                      // sink gets an auto-generated ID

Using savepoints

1. Create a savepoint

Create a savepoint for a running job from the command line:

jobId: the ID of the job to snapshot

targetDirectory: optional; the path where the savepoint is stored

Examples are as follows:

bin/flink savepoint :jobId [:targetDirectory]

To create a savepoint directly while stopping a running job:

bin/flink stop --savepointPath [:targetDirectory] :jobId

2. Savepoint path

1. The default savepoint path can be changed in the configuration file flink-conf.yaml:

state.savepoints.dir: hdfs://node01:9000/flink/savepoints

2. For an individual job, the default can be set on the execution environment in program code:

env.setDefaultSavepointDir("hdfs://node01:9000/flink/savepoints");

3. Restart the application from a savepoint

-s parameter: the path of the savepoint

runArgs: other startup arguments

bin/flink run -s :savepointPath [:runArgs]

Switch state backend

When restoring state from a savepoint, the state backend can be switched. It is recommended not to hard-code the state backend in the program, but to configure it through the configuration file or the -D parameter.

1. Submit the Flink job

bin/flink run-application -d -t yarn-application -Dstate.backend=hashmap -c cn.ybzy.demo.SavepointDemo FlinkDemo.jar

2. Trigger a savepoint when stopping the Flink job

stop stops the job gracefully and triggers a savepoint (in older Flink versions this required the source to implement the StoppableFunction interface):

bin/flink stop -p <savepointPath> <jobId> -yid <applicationId>

3. Restore the job from the savepoint and switch the state backend at the same time

bin/flink run-application -d -t yarn-application -s hdfs://node01:9000/flink/savepoint-26c8e0-9e0b6cd976a4 -Dstate.backend=rocksdb -c cn.ybzy.demo.SavepointDemo FlinkDemo.jar   

4. Resume the job from a saved checkpoint

bin/flink run-application -d -t yarn-application -Dstate.backend=rocksdb -s hdfs://node01:9000/flink/4d8435f6be2a1d9e0b6cd976a24f6c8e/chk-175 -c cn.ybzy.demo.SavepointDemo ./FlinkDemo.jar

Operations in the SQL client

Submit a job

Submit an INSERT job; the job can optionally be given a name.

INSERT INTO tb_test select  * from datagen;

View job list

SHOW JOBS;

Trigger a savepoint

Set checkpoint and savepoint paths

SET state.checkpoints.dir='hdfs://node01:9000/ck';
SET state.savepoints.dir='hdfs://node01:9000/sp';

Stop the job and trigger a savepoint

STOP JOB '4d8435f6be2a1d9e0b6cd976a24f6c8e' WITH SAVEPOINT;

Recovery

Set the recovery path and then submit the SQL; the job will be restored from the savepoint.

SET execution.savepoint.path='hdfs://node01:9000/sp/savepoint-26c8e0-9e0b6cd976a4';

-- allow skipping savepoint state that cannot be restored
set 'execution.savepoint.ignore-unclaimed-state' = 'true';

Use the RESET command to reset the execution.savepoint.path configuration; otherwise the setting affects all DML statements executed afterwards.

RESET execution.savepoint.path;
