Brief introduction
- Apache Flink provides a fault-tolerance mechanism that consistently recovers the state of streaming applications.
- This mechanism ensures that even when failures occur, the program's state is restored after recovery to the consistent state it had at an earlier checkpoint.
- Flink supports both at-least-once and exactly-once semantics.
- Flink achieves fault tolerance by periodically taking checkpoints. The mechanism continuously draws lightweight snapshots of the data stream and operator state, so checkpointing does not have much impact on performance.
- The state of a streaming application is stored in a configurable place (e.g., the master node or HDFS).
- If the program fails (due to a machine, network, or software failure), Flink stops the distributed streaming dataflow.
- The system then restarts the operators and resets them to the most recent successful checkpoint.
- Important points:
- Checkpointing is disabled by default.
- For fault tolerance to work, the data source must be able to rewind the stream to a specified earlier point.
- Apache Kafka is an example of such a source: Flink's Kafka connector can reset the offsets of a Kafka topic so that the data is read again.
- Because Flink implements checkpoints with distributed snapshots, the terms "checkpoint" and "snapshot" are used interchangeably below.
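Checkpoint
- The core of Flink's fault-tolerance mechanism is drawing consistent snapshots of the distributed data stream and of the operator state.
- These snapshots serve as consistent checkpoints that the system can fall back to in case of a failure.
- Flink's distributed snapshots are inspired by the Chandy-Lamport algorithm.
- Barriers: stream barriers are injected into the data stream at the sources and flow downstream together with the records. Each barrier separates the records that belong to the current snapshot from those that belong to the next one.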
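To make the barrier idea concrete, here is a minimal, single-threaded sketch of barrier alignment in an operator with several input channels. This is what Flink does in EXACTLY_ONCE mode; AT_LEAST_ONCE skips the alignment. The class names StreamElement and AligningSumOperator are hypothetical and are not Flink APIs. The sketch assumes at most one checkpoint in flight, as with setMaxConcurrentCheckpoints(1). It only records the snapshot rather than forwarding the barrier downstream.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stream element: either a data record or a checkpoint barrier on one input channel.
final class StreamElement {
    final int channel;
    final boolean isBarrier;
    final long checkpointId; // used only by barriers
    final int value;         // used only by records

    private StreamElement(int channel, boolean isBarrier, long checkpointId, int value) {
        this.channel = channel;
        this.isBarrier = isBarrier;
        this.checkpointId = checkpointId;
        this.value = value;
    }

    static StreamElement record(int channel, int value) {
        return new StreamElement(channel, false, -1L, value);
    }

    static StreamElement barrier(int channel, long checkpointId) {
        return new StreamElement(channel, true, checkpointId, 0);
    }
}

// Hypothetical summing operator that aligns barriers from all inputs before snapshotting.
final class AligningSumOperator {
    private final boolean[] blocked;           // channels that already delivered the pending barrier
    private final List<Deque<Integer>> held;   // records held back on blocked channels
    private long pendingCheckpoint = -1L;
    private int barriersReceived = 0;
    private long sum = 0L;                     // the operator state
    final List<Long> snapshots = new ArrayList<>();

    AligningSumOperator(int numChannels) {
        blocked = new boolean[numChannels];
        held = new ArrayList<>();
        for (int i = 0; i < numChannels; i++) {
            held.add(new ArrayDeque<>());
        }
    }

    void onElement(StreamElement e) {
        if (!e.isBarrier) {
            if (blocked[e.channel]) {
                held.get(e.channel).add(e.value); // arrived after the barrier: belongs to the next checkpoint
            } else {
                sum += e.value;
            }
            return;
        }
        // Simplification: only one checkpoint may be in flight at a time.
        if (blocked[e.channel] || (pendingCheckpoint != -1L && pendingCheckpoint != e.checkpointId)) {
            throw new IllegalStateException("this sketch supports one in-flight checkpoint only");
        }
        pendingCheckpoint = e.checkpointId;
        blocked[e.channel] = true;
        if (++barriersReceived == blocked.length) {
            snapshots.add(sum); // all inputs aligned: snapshot the state
            pendingCheckpoint = -1L;
            barriersReceived = 0;
            for (int c = 0; c < blocked.length; c++) {
                blocked[c] = false;
                Deque<Integer> q = held.get(c);
                while (!q.isEmpty()) {
                    sum += q.poll(); // now process the held-back records
                }
            }
        }
    }

    long sum() {
        return sum;
    }
}
```

After channel 0 delivers barrier 1, its next record (10) is held back. The snapshot therefore contains exactly the records that arrived before the barrier on every channel: 1 + 2 + 3 = 6.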
Recovery
- Flink's recovery mechanism is straightforward. When the system fails, Flink selects the latest completed checkpoint k and redeploys the entire distributed dataflow. It then gives each operator the state that was snapshotted as part of checkpoint k.
- The sources are set to start reading the stream from position Sk.
- For example, with Apache Kafka, the system tells the consumer to start fetching from offset Sk.
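The recovery steps above can be simulated in a few lines: restore the operator state of checkpoint k and rewind the source to Sk. This is a simplified single-threaded sketch in which a hypothetical in-memory log stands in for a replayable source such as a Kafka partition. ReplayableSource, CompletedCheckpoint and RecoveryDemo are illustrative names, not Flink classes.

```java
import java.util.List;

// Hypothetical replayable source: a fixed log addressed by offset, like one Kafka partition.
final class ReplayableSource {
    private final List<Integer> log;
    private int offset = 0;

    ReplayableSource(List<Integer> log) { this.log = log; }
    boolean hasNext() { return offset < log.size(); }
    int next() { return log.get(offset++); }
    int offset() { return offset; }
    void seek(int position) { offset = position; } // rewind to a checkpointed position
}

// A completed checkpoint k: the source position S_k and the operator state at that point.
final class CompletedCheckpoint {
    final int sourceOffset;
    final long operatorState;

    CompletedCheckpoint(int sourceOffset, long operatorState) {
        this.sourceOffset = sourceOffset;
        this.operatorState = operatorState;
    }
}

final class RecoveryDemo {
    // Running-sum job that checkpoints every `checkpointEvery` records and fails once,
    // right after processing record number `failAfter` (0 = never fails).
    static long run(List<Integer> log, int checkpointEvery, int failAfter) {
        ReplayableSource source = new ReplayableSource(log);
        CompletedCheckpoint latest = new CompletedCheckpoint(0, 0L); // empty state at offset 0
        long sum = 0L;
        int processed = 0;
        boolean failed = false;
        while (source.hasNext()) {
            sum += source.next();
            processed++;
            if (!failed && processed == failAfter) {
                failed = true;
                // Failure: drop in-memory state, restore checkpoint k, rewind the source to S_k.
                sum = latest.operatorState;
                source.seek(latest.sourceOffset);
                continue;
            }
            if (source.offset() % checkpointEvery == 0) {
                latest = new CompletedCheckpoint(source.offset(), sum);
            }
        }
        return sum;
    }
}
```

run(log, 3, 5) checkpoints at offset 3 and fails after the fifth record. It then restores the sum of 6 and re-reads records 4 and 5, so the final sum equals the sum of the log, as if no failure had happened. A source that could not seek back would lose records 4 and 5.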
Prerequisites
- In general, Flink's checkpoint mechanism requires:
- A persistent data source that can replay records for a certain amount of time
- For example, message queues (Apache Kafka, RabbitMQ) or file systems (HDFS, Amazon S3, GFS, NFS, Ceph, ...).
- Persistent storage for state
- Typically a distributed file system (HDFS, Amazon S3, GFS, ...).
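Enabling and configuring checkpoints
By default, checkpointing is disabled in Flink.
Enable it by calling env.enableCheckpointing(n), where n is the checkpoint interval in milliseconds.
Other checkpoint-related parameters are set on the CheckpointConfig returned by env.getCheckpointConfig(). These include the checkpointing mode, the minimum pause between checkpoints, the timeout, the maximum number of concurrent checkpoints, and externalized checkpoints. The complete CheckPointTest example later in this article sets each of them.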
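A simplified timing model illustrates how the checkpoint interval and the minimum pause interact. The model makes three assumptions: at most one concurrent checkpoint, a first checkpoint that starts at t = 0, and every checkpoint completing (no timeouts). It is not Flink's actual CheckpointCoordinator logic, and CheckpointScheduleModel is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointScheduleModel {
    // Simplified model: with at most one checkpoint running, the next checkpoint starts at the later of
    //   (a) previous start + interval, and
    //   (b) previous completion + minimum pause.
    static List<Long> startTimes(long intervalMs, long minPauseMs, long[] durationsMs) {
        List<Long> starts = new ArrayList<>();
        long start = 0L; // assumption: the first checkpoint starts at t = 0
        for (long duration : durationsMs) {
            starts.add(start);
            long completion = start + duration;
            start = Math.max(start + intervalMs, completion + minPauseMs);
        }
        return starts;
    }
}
```

Take a 5000 ms interval, a 500 ms minimum pause, and durations of 1000, 6000 and 1000 ms. The checkpoints then start at 0, 5000 and 11500 ms. The second checkpoint ends at 11000 ms, so the third has to wait for the pause instead of starting at 10000 ms.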
State Backends
The following kinds of computation need to keep state:
- Window operations
- Functions that use key/value (KV) state
- Functions that implement CheckpointedFunction
When checkpointing is enabled, this state is persisted in checkpoints to guard against data loss and to allow consistent recovery.
How the state is represented internally, and how and where it is persisted on checkpoints, depends on the chosen State Backend.
Flink ships with three state backends:
- MemoryStateBackend
- FsStateBackend
- RocksDBStateBackend
If nothing else is configured, the system uses MemoryStateBackend.
MemoryStateBackend
MemoryStateBackend keeps data internally as objects on the Java heap. For example, key/value state and window operators hold hash tables that store the values.
On checkpoints, this backend snapshots the state and sends the snapshot to the JobManager (JM) as part of the checkpoint. The JM also stores it on its heap.
MemoryStateBackend can take snapshots asynchronously. Asynchronous snapshots are officially recommended because they avoid blocking, and they are now the default.
Important points:
- Asynchronous snapshots: operators keep processing incoming data while the snapshot is being taken. This is the default.
- Synchronous snapshots: operators stop processing incoming data while the snapshot is being taken, which increases processing latency.
To disable asynchronous snapshots, pass false to the constructor:
new MemoryStateBackend(MAX_MEM_STATE_SIZE, false);
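Plain Java can sketch the difference between synchronous and asynchronous snapshots. In the asynchronous case, the operator pauses only to take a point-in-time copy of its state, then writes the copy on another thread while it keeps processing. This illustration simplifies things: Flink's heap backends use copy-on-write state tables rather than a full copy, and AsyncSnapshotOperator is a hypothetical class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

final class AsyncSnapshotOperator {
    private final Map<String, Integer> counts = new HashMap<>(); // operator state: word -> count

    void process(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Asynchronous: the synchronous part is only the copy; "persisting" it runs on another thread
    // while the caller continues to process records.
    CompletableFuture<String> snapshotAsync() {
        Map<String, Integer> copy = new TreeMap<>(counts); // point-in-time copy, sorted for stable output
        return CompletableFuture.supplyAsync(copy::toString);
    }

    // Synchronous: processing is blocked until the snapshot has been produced.
    String snapshotSync() {
        return new TreeMap<>(counts).toString();
    }
}
```

Processing a record after snapshotAsync() has returned changes the live state (a becomes 3) but not the snapshot, which still reads {a=2, b=1}.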
Limitations of this backend:
- The size of each individual state is limited to 5 MB by default; this value can be increased in the constructor.
- Regardless of the configured maximum state size, the state cannot be larger than the Akka frame size.
- The aggregate state must fit into the JobManager's memory.
Suitable for:
- Local development and debugging
- Jobs that hold little state
FsStateBackend
FsStateBackend is configured with a file system URL, for example:
- hdfs://namenode:40010/flink/checkpoints
- file:///data/flink/checkpoints
FsStateBackend holds in-flight data in the TaskManager's memory.
On checkpoints, it writes state snapshots into files in the configured file system.
A small amount of metadata is kept in the JobManager's memory.
By default, FsStateBackend takes snapshots asynchronously, so that writing the state checkpoint does not block the processing pipeline.
This can be disabled by passing false as the corresponding boolean flag in the constructor:
new FsStateBackend(path, false);
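The division of labor described above is sketched below against the local file system. State lives in TaskManager memory, snapshot data goes into files, and the JobManager keeps only a small pointer. The chk-<id> directory mirrors the naming in the restore command later in this article. The state.properties file name and the Properties format, however, are assumptions made for the example; Flink's real snapshot format is different.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class FsSnapshotSketch {
    // Writes the TaskManager-side state into <checkpointDir>/chk-<id>/state.properties (hypothetical layout)
    // and returns the file path: the only thing a "JobManager" would need to keep in memory.
    static Path writeSnapshot(Path checkpointDir, long checkpointId, Map<String, String> state) throws IOException {
        Path chkDir = Files.createDirectories(checkpointDir.resolve("chk-" + checkpointId));
        Path file = chkDir.resolve("state.properties");
        Properties props = new Properties();
        props.putAll(state);
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "hypothetical snapshot format");
        }
        return file;
    }

    // Reads a snapshot back, as a restore after a failure would.
    static Map<String, String> readSnapshot(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        Map<String, String> state = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            state.put(key, props.getProperty(key));
        }
        return state;
    }
}
```

Only the returned Path would have to stay in the JobManager's memory; the state itself stays in the file.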
Suitable for:
- Jobs with large state, long windows, or large key/value states
- High-availability (HA) setups
RocksDBStateBackend
RocksDBStateBackend is also configured with a file system URL, for example:
- hdfs://namenode:40010/flink/checkpoints
- file:///data/flink/checkpoints
With this backend, key/value state is managed by a RocksDB database; this is the main difference from the memory and file system backends.
RocksDBStateBackend stores in-flight data in a RocksDB database located in the TaskManager's data directory.
Note: RocksDB is a high-performance embedded key-value store. Data is first buffered in memory and, under certain conditions, flushed to files on disk.
On checkpoints, the RocksDB database is snapshotted and written to the configured file system (usually HDFS).
Meanwhile, Flink stores a small amount of metadata in the JobManager's memory or, in high-availability mode, in ZooKeeper.
RocksDBStateBackend performs asynchronous snapshots by default.
Characteristics and suitable scenarios:
- In Flink 1.9, RocksDBStateBackend is the only state backend that supports incremental checkpoints.
- Note: an incremental checkpoint stores only the changes since the previous checkpoint instead of a full copy of the state.
- The amount of state RocksDBStateBackend can hold is limited only by the available disk space.
- Compared with FsStateBackend, which keeps state in memory, this allows very large state.
- However, it also lowers the maximum achievable throughput: every state read and write goes through serialization and RocksDB instead of accessing objects on the heap directly.
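A key-level diff can show the idea of an incremental checkpoint. Only what changed since the previous checkpoint is persisted, and a restore replays the changes on top of a full base snapshot. This is a simplified model: Flink's RocksDB incremental checkpoints actually upload only the immutable SST files created since the last checkpoint, not per-key changes. Delta and IncrementalCheckpoints are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Changes between two checkpoints: keys added or modified, plus keys removed.
final class Delta {
    final Map<String, String> upserts = new HashMap<>();
    final Set<String> deletes = new HashSet<>();
}

final class IncrementalCheckpoints {
    // Computes what an incremental checkpoint would persist instead of the full `current` state.
    static Delta diff(Map<String, String> previous, Map<String, String> current) {
        Delta d = new Delta();
        for (Map.Entry<String, String> e : current.entrySet()) {
            if (!Objects.equals(previous.get(e.getKey()), e.getValue())) {
                d.upserts.put(e.getKey(), e.getValue());
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) {
                d.deletes.add(key);
            }
        }
        return d;
    }

    // Restore = full base snapshot + every delta applied in checkpoint order.
    static Map<String, String> restore(Map<String, String> base, Iterable<Delta> deltas) {
        Map<String, String> state = new HashMap<>(base);
        for (Delta d : deltas) {
            state.putAll(d.upserts);
            state.keySet().removeAll(d.deletes);
        }
        return state;
    }
}
```

The trade-off mirrors the real feature. Each checkpoint is cheaper to write, but a restore has to combine the base with every later increment.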
Code:
// Set only one backend: each setStateBackend() call replaces the previous one.

// Store state in memory (the default backend) with a maximum state size of 10 MB and synchronous snapshots.
env.setStateBackend(new MemoryStateBackend(10 * 1024 * 1024, false));

// Store state with FsStateBackend, using synchronous snapshots.
env.setStateBackend(new FsStateBackend("hdfs://namenode....", false));

try {
    // Store state with RocksDBStateBackend, using incremental checkpoints.
    env.setStateBackend(new RocksDBStateBackend("hdfs://namenode....", true));
} catch (IOException e) {
    e.printStackTrace();
}
Using checkpoints
With env.enableCheckpointing(5000), the running program takes a checkpoint every 5000 ms.
The checkpoints can be stored in HDFS and retained (see the externalized-checkpoint setting in the example below). In that case, if the program fails, you can restart it and specify which checkpoint to restore from:
flink-1.9.1/bin/flink run -s hdfs://ronnie01:8020/data/flink-checkpoint/xxxxxxxxxxxxxxx/chk-xxx/_metadata -c com.ronnie.flink.stream.test.CheckPointTest flink-test.jar
Here xxxxxxxxxxxxxxx is the job ID (a hash) and chk-xxx is the directory of the checkpoint to restore.
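When restoring manually, you first have to pick a chk-<n> directory. This sketch assumes the checkpoint directory is on a local file system (for example file:///data/flink/checkpoints) and uses the <job-id>/chk-<n>/ layout shown above, where each completed checkpoint contains a _metadata file. Under those assumptions, a small helper can select the most recent completed checkpoint: the highest n whose directory contains _metadata. LatestCheckpoint is a hypothetical helper, not part of Flink. For HDFS, the same logic would have to use the Hadoop file system API instead of java.nio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

final class LatestCheckpoint {
    // Returns the chk-<n> directory with the largest n that contains a _metadata file,
    // i.e. the most recent completed checkpoint under one job's checkpoint directory.
    static Optional<Path> find(Path jobCheckpointDir) throws IOException {
        try (Stream<Path> children = Files.list(jobCheckpointDir)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().matches("chk-\\d+"))
                    .filter(p -> Files.exists(p.resolve("_metadata")))
                    .max(Comparator.comparingLong((Path p) -> checkpointNumber(p)));
        }
    }

    private static long checkpointNumber(Path chkDir) {
        return Long.parseLong(chkDir.getFileName().toString().substring("chk-".length()));
    }
}
```

A directory without _metadata (for example, a checkpoint still in progress) is skipped, and directories that do not match chk-<n>, such as shared, are ignored by the name filter. The resulting path plus /_metadata is what the -s option expects.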
Code:
package com.ronnie.flink.stream.test;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.io.IOException;

public class CheckPointTest {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing and set the interval between two checkpoints (in ms).
        env.enableCheckpointing(5000);
        env.setParallelism(1);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // Fall back to the restart strategy configured for the cluster.
        env.setRestartStrategy(RestartStrategies.fallBackRestart());

        // Checkpointing mode: EXACTLY_ONCE is the usual choice;
        // AT_LEAST_ONCE is generally used only where very low latency is required.
        checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Minimum pause between checkpoints: makes sure the job makes some progress between checkpoints.
        // With a value of 500, the next checkpoint starts no sooner than 500 ms after the previous one
        // completed, regardless of checkpoint duration and interval. This means the effective
        // checkpoint interval is never smaller than this parameter.
        checkpointConfig.setMinPauseBetweenCheckpoints(500);

        // Timeout: a checkpoint that takes longer than this is aborted.
        checkpointConfig.setCheckpointTimeout(60000);

        // Maximum number of checkpoints in progress at the same time. With 1, the system
        // does not trigger another checkpoint while one is still running.
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // Externalized checkpoints are persisted externally and are NOT cleaned up automatically
        // when the job fails; their state must be cleaned up manually.
        // DELETE_ON_CANCELLATION: the externalized state is deleted when the job is canceled,
        //                         but kept if the job ends in the FAILED state.
        // RETAIN_ON_CANCELLATION: the externalized state is kept when the job is canceled.
        checkpointConfig.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Each setStateBackend() call replaces the previous one, so only the last
        // (RocksDBStateBackend) takes effect. Keep exactly one in a real job.

        // In-memory state with a maximum state size of 10 MB and synchronous snapshots.
        env.setStateBackend(new MemoryStateBackend(10 * 1024 * 1024, false));
        // FsStateBackend with synchronous snapshots.
        env.setStateBackend(new FsStateBackend("hdfs://ronnie01:8020/data/flink-checkpoint", false));
        try {
            // RocksDBStateBackend with incremental checkpoints enabled.
            env.setStateBackend(new RocksDBStateBackend("hdfs://ronnie01:8020/data/flink-checkpoint", true));
        } catch (IOException e) {
            e.printStackTrace();
        }

        // Word count over a socket stream, so that there is keyed state to checkpoint.
        DataStreamSource<String> dataStreamSource = env.socketTextStream("ronnie01", 9999);

        SingleOutputStreamOperator<Tuple2<String, Integer>> pairStream = dataStreamSource.flatMap(
                new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (String word : value.split(" ")) {
                            out.collect(new Tuple2<String, Integer>(word, 1));
                        }
                    }
                });

        KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = pairStream.keyBy(0);
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = keyedStream.sum(1);
        sum.print();

        try {
            // Transformations are lazy; execute() must be called explicitly to run the job.
            env.execute("checkpoint-test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Savepoint
Flink's Savepoints differ from Checkpoints in much the same way that backups differ from recovery logs in traditional database systems.
The primary purpose of checkpoints is to provide a recovery mechanism in case of unexpected job failures.
A checkpoint's lifecycle is managed by Flink: checkpoints are created, owned, and released by Flink without user interaction.
As a recovery method that is triggered periodically, checkpoints have two main design goals:
- Being as lightweight to create as possible
- Restoring as quickly as possible
Savepoints, in contrast, are created, owned, and deleted by the user.
They are meant for planned, manual backup and restore.
Typical cases are upgrading the Flink version, changing the stream processing logic, or changing the parallelism.
In such cases you usually stop the job. That requires saving its state so that it can be restored when the job is redeployed.
Conceptually, savepoints can be more expensive to produce and restore. They focus more on portability and on supporting the job changes mentioned above.
Usage:
Command:
flink savepoint jobID target_directory
Save the state of the running job (xxxxxxxx stands for its job ID, a hash) to the specified directory:
bin/flink savepoint xxxxxxxx hdfs://ronnie01:8020/data/flink/savepoint
Restart the job and restore its state from the savepoint:
flink-1.9.1/bin/flink run -s hdfs://ronnie01:8020/data/flink/savepoint/savepoint-xxxxx-xxxxxxxxx -c com.ronnie.flink.stream.test.CheckPointTest flink-test.jar