4. Stream Computing Fault Tolerance

1. Why do you need fault tolerance?

2. State

3. State Backend types

4. Restart strategy

When a task fails, Flink restarts the failed task and any other affected tasks so that the job returns to a normal running state. The restart strategy can be configured in two ways:

1) In the configuration file flink-conf.yaml

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 5 s

2) In code

env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
    5, // number of restart attempts
    org.apache.flink.api.common.time.Time.of(5, TimeUnit.SECONDS))); // delay between restart attempts
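To make the meaning of "attempts" and "delay" concrete, here is a minimal plain-Java sketch of the fixed-delay idea. It is not Flink code: the class `FixedDelayRetrySketch` and its `run` method are hypothetical names used only for illustration, and the sketch assumes that the task runs once and may then be restarted up to the configured number of times.

```java
import java.util.concurrent.Callable;

// Plain-Java illustration of the fixed-delay idea; not Flink code.
final class FixedDelayRetrySketch {

    // Runs the task once, then restarts it up to maxRestarts times,
    // sleeping delayMillis before each restart.
    static <T> T run(Callable<T> task, int maxRestarts, long delayMillis) {
        int restarts = 0;
        while (true) {
            try {
                return task.call();
            } catch (Exception failure) {
                if (restarts >= maxRestarts) {
                    // Restart budget exhausted: give up, as a job would fail for good.
                    throw new IllegalStateException(
                            "giving up after " + restarts + " restarts", failure);
                }
                restarts++;
                sleep(delayMillis);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to restart", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical task: fails on the first two calls, succeeds on the third.
        String result = run(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated failure " + calls[0]);
            }
            return "ok after " + calls[0] + " calls";
        }, 5, 10L);
        System.out.println(result);
    }
}
```

With 5 restarts allowed, a task that keeps failing is executed six times in total before the sketch gives up.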

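Supplement: the available restart strategies

// 1. No restart: the job exits on failure; this is the default when checkpointing is not enabled
env.setRestartStrategy(RestartStrategies.noRestart());


// 2. Fixed-delay restart: a fixed number of restart attempts
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
    5, // number of restart attempts
    org.apache.flink.api.common.time.Time.of(5, TimeUnit.SECONDS))); // delay between restart attempts


// 3. Failure-rate restart (at most 5 restarts within 5 minutes; if the job fails again it exits, otherwise the count starts over)
env.setRestartStrategy(RestartStrategies.failureRateRestart(
    5, // maximum number of failures within the interval
    org.apache.flink.api.common.time.Time.of(5, TimeUnit.MINUTES), // interval over which failures are counted
    org.apache.flink.api.common.time.Time.of(5, TimeUnit.SECONDS))); // delay between restarts

// 4. Checkpoint-based behavior: once checkpointing is enabled and no restart strategy is set explicitly,
//    Flink restarts the job automatically instead of using noRestart
env.enableCheckpointing(5000);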
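The failure-rate strategy is the least obvious of these, so here is a minimal plain-Java sketch of the counting rule described in the comment above. It is not Flink's implementation: `FailureRateSketch` and `recordFailureAndCheck` are hypothetical names, and the exact boundary handling (a restart is allowed while at most `maxFailuresPerInterval` failures fall inside the trailing interval) is an assumption taken from that comment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Plain-Java illustration of a failure-rate check; not Flink's implementation.
final class FailureRateSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    // Records a failure at time nowMillis and reports whether a restart is still allowed.
    boolean recordFailureAndCheck(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Forget failures that have fallen out of the measuring interval.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() >= intervalMillis) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        FailureRateSketch rate = new FailureRateSketch(5, 5 * 60_000L);
        // Six failures, one per second: the sixth exceeds the limit of five.
        for (int i = 1; i <= 6; i++) {
            System.out.println("failure " + i + " -> restart allowed: "
                    + rate.recordFailureAndCheck(i * 1_000L));
        }
    }
}
```

Because old failures drop out of the window, a job that fails only occasionally can keep restarting indefinitely, which is the main difference from the fixed-delay strategy.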

5. Recovery Mechanism

 

1. By default (checkpointing disabled), the state is lost on failure and has to be recomputed from scratch

2. Enable checkpointing so that the state is restored after a failure and the job continues running

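env.enableCheckpointing(5000); // checkpoint every 5000 ms; by default the state is kept in memory
// To retain the checkpoint contents, specify a storage backend; here they are saved to files
env.setStateBackend(new FsStateBackend("file:flink/cpdir", false)); // save the checkpoint results to files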
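The difference between cases 1 and 2 can be illustrated with a small plain-Java sketch. It is not Flink code and greatly simplifies what a checkpoint is: the hypothetical `CheckpointRecoverySketch` sums an array, periodically snapshots the pair (input position, running sum), and simulates one failure. Without snapshots it recomputes everything; with snapshots it rolls back to the last one and replays only the records after it.

```java
// Plain-Java illustration of checkpoint-based recovery; not Flink code.
final class CheckpointRecoverySketch {

    // A snapshot pairs the running state with the input position it corresponds to.
    static final class Snapshot {
        final int nextIndex;
        final long sum;

        Snapshot(int nextIndex, long sum) {
            this.nextIndex = nextIndex;
            this.sum = sum;
        }
    }

    // Sums the input, snapshotting every checkpointEvery records (0 disables it),
    // and simulates a single failure just before the record at index failAt.
    // Returns {final sum, number of records processed including replays}.
    static long[] run(int[] input, int checkpointEvery, int failAt) {
        Snapshot last = new Snapshot(0, 0L); // nothing processed yet
        int index = 0;
        long sum = 0L;
        long processed = 0L;
        boolean failed = false;
        while (index < input.length) {
            if (!failed && index == failAt) {
                failed = true;
                // Recovery: roll back to the last snapshot (the start if there is none).
                index = last.nextIndex;
                sum = last.sum;
                continue;
            }
            sum += input[index++];
            processed++;
            if (checkpointEvery > 0 && index % checkpointEvery == 0) {
                last = new Snapshot(index, sum);
            }
        }
        return new long[] {sum, processed};
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        long[] withoutCp = run(input, 0, 7);
        long[] withCp = run(input, 3, 7);
        System.out.println("no checkpoints: sum=" + withoutCp[0] + ", processed=" + withoutCp[1]);
        System.out.println("checkpoint every 3: sum=" + withCp[0] + ", processed=" + withCp[1]);
    }
}
```

Both runs end with the same sum, but the run with snapshots reprocesses far fewer records after the failure.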

3. Retain the checkpoint files and upgrade the job from a checkpoint

1) Modify flink-conf.yaml

state.backend: filesystem
# Configure the checkpoint and savepoint directories
state.checkpoints.dir: file:///tmp/chkdir
state.savepoints.dir: file:///tmp/spdir

# Configure the restart strategy used on failure
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 2 s
# Number of checkpoints to retain
state.checkpoints.num-retained: 2
# Enable local recovery for this state backend (how tasks are restored)
state.backend.local-recovery: true

2) Submit the job -> stop the job (it exits abnormally or is cancelled manually) -> restore the job

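# Submit the job
bin/flink run -m localhost:4000 -c <Java class name> <path to jar>

# Stop the Flink job
bin/flink cancel <job ID>

# Restore the job from the retained checkpoint
bin/flink run -m localhost:4000 -s <checkpoint directory, e.g. file:///tmp/chkdir/<job ID>/chk-364> -c <Java class name> <path to jar>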
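The checkpoint passed to `-s` is one of the `chk-N` directories under the job's checkpoint directory, and with several retained it is easy to pick the wrong one. The following helper is a sketch, not part of Flink: `LatestCheckpointSketch` is a hypothetical name, and it assumes the layout shown above, `<state.checkpoints.dir>/<job ID>/chk-N`, choosing the directory with the largest N.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Helper sketch for locating the newest chk-N directory; not part of Flink.
final class LatestCheckpointSketch {

    // Returns the chk-N subdirectory with the largest N, if any exists.
    static Optional<Path> latest(Path jobCheckpointDir) {
        try (Stream<Path> children = Files.list(jobCheckpointDir)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().matches("chk-\\d+"))
                    .max(Comparator.comparingLong(LatestCheckpointSketch::checkpointNumber));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Extracts N from a directory named chk-N.
    private static long checkpointNumber(Path chkDir) {
        return Long.parseLong(chkDir.getFileName().toString().substring("chk-".length()));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: LatestCheckpointSketch <checkpoint dir of one job>");
            return;
        }
        // Assumed usage: pass <state.checkpoints.dir>/<job ID> as the only argument.
        Path jobDir = Paths.get(args[0]);
        System.out.println(latest(jobDir)
                .map(p -> p.toUri().toString())
                .orElse("no checkpoint found"));
    }
}
```

The comparison is numeric on purpose: sorting the names as strings would rank `chk-9` above `chk-364`.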

4. Trigger a savepoint and upgrade the job from the savepoint


bin/flink savepoint <Flink job ID> # create a savepoint
# Stop the previous job, then start the new one from the savepoint

bin/flink run -m localhost:4000 -s <savepoint directory, e.g. file:///flink/savepoint/xxxx> -c <class name> <path to jar>

6. Computing topology changes and upgrades

Origin blog.csdn.net/lzzyok/article/details/120686491