Flink: using checkpoints and savepoints for snapshot recovery

Using checkpoints (automatic, managed by Flink itself)

Prepare the test code

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

public class Demo1 {

    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(1000);
        // Exactly-once guarantee
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // File-system path where checkpoints are stored
        env.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));

        // Read the required parameters by name. This style is preferable because it
        // fixes the parameter names entered on the Flink web UI; reading raw args[]
        // positionally can cause format problems, and ParameterTool is convenient.
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String ip = parameterTool.get("hostname");
        Integer port = Integer.valueOf(parameterTool.get("port"));
        // Create the first DataStream
        DataStreamSource<String> dataStream = env.socketTextStream(ip, port);

        SingleOutputStreamOperator<Tuple2<String, Integer>> dataStream2 = dataStream
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    Arrays.stream(line.split(" ")).forEach(word -> out.collect(Tuple2.of(word, 1)));
                })
                // When a lambda is used, returns() must declare the produced type
                .returns(Types.TUPLE(Types.STRING, Types.INT));
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = dataStream2.keyBy(0)
                .sum(1);

        sum.print();
        env.execute("LambdaStreamWordCount");
    }
}
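For reference, `ParameterTool.fromArgs` turns `--key value` pairs from the command line into a lookup map. A minimal plain-Java sketch of that parsing style (a conceptual illustration only, not Flink's actual implementation, which also handles `-key` prefixes and value-less flags):

```java
import java.util.HashMap;
import java.util.Map;

public class ArgsSketch {
    // Collect "--key value" pairs into a map, the way
    // ParameterTool.fromArgs conceptually does.
    static Map<String, String> fromArgs(String[] args) {
        Map<String, String> params = new HashMap<>();
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("--")) {
                params.put(args[i].substring(2), args[i + 1]);
            }
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> p =
                fromArgs(new String[]{"--hostname", "192.168.135.237", "--port", "8888"});
        System.out.println(p.get("hostname") + ":" + p.get("port"));
    }
}
```

This is why the restore commands later in this post pass `--hostname` and `--port` explicitly.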

Test steps

Build the jar package:

mvn clean package

Deploy it to Flink for execution. On the target machine, open port 8888 first (you can use the nc command):

nc -lk 8888

After submitting, you can see that the program runs normally.
You can view the checkpoint details, including the path where it is stored on disk.
Checking that path on the corresponding machine, you can see that the chk-* directory name keeps changing, because the checkpoint interval set in the code is 1 s.
How do you preserve the checkpoint data?
Note: do not cancel the task from the web page; instead stop the cluster, so that the data in the checkpoint directory is retained.
Execute the command:

bin/stop-cluster.sh 

After stopping, you can see that the corresponding chk-* directory no longer changes and its data is retained.
Recovery (the Flink cluster was shut down above; remember to start it again here):
Execute the command below. I added the -n parameter (its meaning is explained in the flag list later); without it an error is reported, part of which reads:

If you want to allow to skip this, 
you can set the --allowNonRestoredState option on the CLI.

This means that some operator state in the snapshot cannot be restored and must be explicitly skipped.

bin/flink run -n -s /tmp/flink/checkpoints/294069f9b37727c70c4b8c07436f87bf/chk-547/_metadata -c com.pro.flink.sink.Demo1 /opt/modules/flink-1.0-SNAPSHOT.jar --hostname 192.168.135.237 --port 8888
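Conceptually, `-n` (`--allowNonRestoredState`) lets the restore proceed when the snapshot contains state for an operator that no longer exists in the new job graph. A hypothetical plain-Java sketch of that matching decision (an illustration of the flag's semantics, not Flink's internals):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RestoreSketch {
    // Map saved operator state onto operators present in the new job graph.
    // With allowNonRestoredState=false, leftover state is an error;
    // with true, it is silently skipped (this mirrors the -n flag).
    static List<String> restore(Set<String> jobOperators,
                                Set<String> savedState,
                                boolean allowNonRestoredState) {
        List<String> restored = new ArrayList<>();
        for (String id : savedState) {
            if (jobOperators.contains(id)) {
                restored.add(id);
            } else if (!allowNonRestoredState) {
                throw new IllegalStateException("state for " + id + " cannot be mapped");
            }
        }
        return restored;
    }

    public static void main(String[] args) {
        Set<String> ops = new HashSet<>(Arrays.asList("source", "flatMap", "sum"));
        Set<String> saved = new HashSet<>(Arrays.asList("source", "flatMap", "sum", "removedSink"));
        System.out.println(restore(ops, saved, true)); // "removedSink" state is skipped
    }
}
```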

Common parameters of flink run

Here are the common flags of flink run:

-c,--class <classname>             Entry point of the Flink application
-C,--classpath <url>               A URL reachable from all nodes; can be used to load utility classes shared by multiple applications
-d,--detached                      Run in detached mode: after submitting the job the CLI exits
-n,--allowNonRestoredState         Allow skipping savepoint state that cannot be restored, e.g. after removing some operators from the code
-p,--parallelism <parallelism>     Execution parallelism
-s,--fromSavepoint <savepointPath> Restore the job from a savepoint
-sae,--shutdownOnAttachedExit      Submit in attached mode and shut the cluster down when the client exits

You can see that the task starts normally, and on the web UI you can check where it was restored from, shown as the Latest Restore entry.

Configuring flink checkpoints via the configuration file (no code changes required)

# Specifies the backend used to store checkpoint state; default: none
state.backend: filesystem
 
# Specifies whether the backend uses asynchronous snapshots (default: true).
# Backends that do not support async (or only support async) snapshots
# may ignore this option.
state.backend.async

# Default: 1024. Size threshold below which state is stored in the root
# checkpoint metadata file instead of in separate files.
state.backend.fs.memory-threshold

# Default: none. Directory where checkpoint data files and metadata are stored.
# The directory must be visible to all participating TaskManagers and JobManagers.
state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
 
# Default target directory for savepoints, optional.
# Default: none. Specifies the default directory for savepoints.
state.savepoints.dir: hdfs://namenode-host:port/flink-savepoints
 
# Default: false. Whether to use incremental checkpoints; backends that
# do not support incremental checkpoints ignore this option.
state.backend.incremental: false

# Default: 1. Number of completed checkpoints to retain.
state.checkpoints.num-retained
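The `state.backend.fs.memory-threshold` behavior described above can be sketched as a simple size check (a conceptual illustration assuming the default of 1024 bytes, not Flink's actual code):

```java
public class ThresholdSketch {
    // Default value of state.backend.fs.memory-threshold, in bytes.
    static final int MEMORY_THRESHOLD_BYTES = 1024;

    // State smaller than the threshold is stored inline in the root
    // checkpoint metadata file; larger state goes to a separate file
    // on the configured backend.
    static String placementFor(int stateSizeBytes) {
        return stateSizeBytes < MEMORY_THRESHOLD_BYTES ? "metadata-file" : "separate-file";
    }

    public static void main(String[] args) {
        System.out.println(placementFor(100));   // small state: inline in metadata
        System.out.println(placementFor(4096));  // large state: separate file
    }
}
```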

Code-level configuration of flink checkpoints

// Create the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set the checkpoint interval and configure exactly-once state consistency
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
// The mode can also be configured this way
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Minimum pause between checkpoints, in milliseconds
// (i.e. within this window only one checkpoint may run)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000);
// Maximum number of concurrent checkpoints (useful when producing a
// checkpoint takes a long time and you want more throughput)
env.getCheckpointConfig().setMaxConcurrentCheckpoints(3);
// Checkpoint timeout, to avoid checkpoints that take too long; default 10 minutes
env.getCheckpointConfig().setCheckpointTimeout(5 * 1000);
// Maximum tolerated number of checkpoint failures, default 0; beyond this
// the checkpointing is considered failed
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
// Checkpoint interval, in milliseconds
env.getCheckpointConfig().setCheckpointInterval(1000);
// File path where checkpoints are stored
env.setStateBackend(new FsStateBackend("file:///tmp/flink/checkpoints"));
// By default checkpoints exist only for failure recovery, so when a Flink job
// fails or is cancelled by hand, its checkpoints are cleaned up.
// Enabling externalized checkpoints keeps the checkpoint after the job stops.
// Two cleanup modes are supported:
//   DELETE_ON_CANCELLATION: retain the checkpoint only when the job fails;
//                           delete it when the job is explicitly cancelled.
//   RETAIN_ON_CANCELLATION: retain the checkpoint both when the job fails
//                           and when it is explicitly cancelled.
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

Using savepoints (manual, triggered by the user)

Points to note about savepoints:

When a savepoint is triggered, a new savepoint directory is created in which the data and metadata are stored. You can configure a default target directory or specify a custom one with the trigger command. If neither is configured nor specified, triggering the savepoint fails.

Manual savepoint steps

  1. View the job ID on the web page and copy it.
  2. On the terminal, run the command in the format: bin/flink savepoint [jobId] [snapshot target directory]

bin/flink savepoint 93e589184e0e78d57077178438807889 /tmp/flink/checkpoints/

On success, a directory named savepoint-93e589-957052f01c84 is generated.
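The directory name follows a convention: "savepoint-" plus the first six characters of the job ID plus a random suffix (here, 93e589 comes from job 93e589184e0e78d57077178438807889). A small sketch that derives the short prefix, to illustrate the naming:

```java
public class SavepointNameSketch {
    // Savepoint directories are named "savepoint-<first 6 chars of jobId>-<random suffix>".
    static String shortJobId(String jobId) {
        return jobId.substring(0, 6);
    }

    public static void main(String[] args) {
        String jobId = "93e589184e0e78d57077178438807889";
        // Prints the directory-name pattern for this job (suffix is random).
        System.out.println("savepoint-" + shortJobId(jobId) + "-<random>");
    }
}
```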

  3. Execute the recovery command:

bin/flink run -n -s /tmp/flink/checkpoints/savepoint-93e589-957052f01c84/_metadata -c com.pro.flink.sink.Demo1 /opt/modules/flink-1.0-SNAPSHOT.jar --hostname 192.168.135.237 --port 8888

  4. The job starts and runs normally.

Delete savepoint

Note: this deletes the directory generated by the savepoint command (the one prefixed with savepoint-):

bin/flink savepoint -d /tmp/flink/checkpoints/savepoint-93e589-957052f01c84/

The savepoint directory is removed.
If you instead try to delete a snapshot directory generated by Flink's checkpoint mechanism this way, an error is reported.


Origin blog.csdn.net/Zong_0915/article/details/107857519