Flink State Backend

Table of Contents

1. Available State Backends

1.1 MemoryStateBackend

1.2 FsStateBackend

1.3 RocksDBStateBackend

2. Setting the State Backend

Setting the State Backend per job

Setting the default (global) State Backend

3. Advanced RocksDB State Backend

Incremental snapshots

Java RocksDB example


When the CheckPoint mechanism is enabled, state is persisted along with each CheckPoint to prevent data loss and to guarantee consistency during recovery. How state is stored internally, and how and where it is persisted during a CheckPoint, all depend on the selected State Backend.

1. Available State Backends

Flink ships with the following state backends out of the box:

  • MemoryStateBackend
  • FsStateBackend
  • RocksDBStateBackend

If not set, MemoryStateBackend is used by default.

1.1 MemoryStateBackend

Inside MemoryStateBackend, data is stored on the heap as Java objects. Key/value state and window operators hold hash tables that store the state values and triggers.

During a CheckPoint, the State Backend snapshots the state and sends the snapshot to the JobManager (master) as part of the CheckPoint acknowledgement message; the JobManager stores the snapshot in its heap memory as well.

MemoryStateBackend can be configured with asynchronous snapshots, and using them is strongly recommended to avoid blocking the data stream. Note that asynchronous snapshots are enabled by default. To disable them (useful only for debugging), set the corresponding Boolean constructor parameter to false when instantiating MemoryStateBackend, for example:

new MemoryStateBackend(MAX_MEM_STATE_SIZE, false);

Limitations of MemoryStateBackend:

  • By default, each individual state is limited to 5 MB. This limit can be raised via the MemoryStateBackend constructor (see the sketch after this list).
  • Regardless of the configured maximum state size (MAX_MEM_STATE_SIZE), state can never be larger than the Akka frame size (see the configuration parameters).
  • The aggregated state must fit into the memory of the JobManager.
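A minimal sketch of raising the limit, assuming a 10 MB per-state cap is acceptable for the job:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// first argument: maximum size per individual state; second: keep asynchronous snapshots enabled
env.setStateBackend(new MemoryStateBackend(10 * 1024 * 1024, true));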

MemoryStateBackend is suitable for:

  • Local development and debugging.
  • Jobs with very little state, for example jobs composed of functions that process only one record at a time (Map, FlatMap, Filter, etc.). The Kafka consumer also needs only very little state.

It is also recommended to set managed memory to 0 to ensure that the maximum amount of memory is allocated to the user code on the JVM.
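If you follow this recommendation, the corresponding flink-conf.yaml entry looks roughly like this (the key below assumes the unified memory configuration of Flink 1.10 and later):

# heap-based State Backends do not use managed memory, so hand it back to user code
taskmanager.memory.managed.size: 0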


1.2 FsStateBackend

FsStateBackend requires a file system URL (scheme, authority, path), for example: "hdfs://namenode:40010/flink/checkpoints" or "file:///data/flink/checkpoints".

FsStateBackend keeps the in-flight state data in the memory of the TaskManager. During a CheckPoint, it writes state snapshots to the configured file system directory. A small amount of metadata is stored in the memory of the JobManager (in high-availability mode, it is written to the CheckPoint metadata file).

FsStateBackend uses asynchronous snapshots by default so that writing the state during a CheckPoint does not block data processing. To disable them, set the corresponding Boolean constructor parameter to false when instantiating FsStateBackend, for example:

new FsStateBackend(path, false);

FsStateBackend is suitable for:

  • Jobs with large state, long windows, or large key/value state.
  • All high-availability setups.

As with MemoryStateBackend, it is recommended to set managed memory to 0 to ensure that the maximum amount of memory is allocated to the user code on the JVM (see the configuration sketch in section 1.1).


1.3 RocksDBStateBackend

RocksDBStateBackend requires a file system URL (scheme, authority, path), for example: "hdfs://namenode:40010/flink/checkpoints" or "file:///data/flink/checkpoints".

RocksDBStateBackend stores the in-flight state data in a RocksDB database, which by default is kept in the TaskManager data directory. During a CheckPoint, the entire RocksDB database is checkpointed to the configured file system directory. A small amount of metadata is stored in the memory of the JobManager (in high-availability mode, it is written to the CheckPoint metadata file).

RocksDBStateBackend only supports asynchronous snapshots.

Limitations of RocksDBStateBackend:

  • Because RocksDB's JNI API is built on byte[], each key and each value is limited to 2^31 bytes. Important: states that accumulate data through RocksDB merge operations (for example ListState) can silently grow beyond 2^31 bytes and will then fail on the next retrieval. This is a current limitation of the RocksDB JNI bridge.

RocksDBStateBackend is suitable for:

  • Jobs with very large state, very long windows, or very large key/value state.
  • All high-availability setups.

Note that the amount of state you can keep is limited only by the available disk space. This lets RocksDBStateBackend hold far larger state than FsStateBackend, which keeps state in memory. It also means, however, that the maximum throughput of the application drops: every read and write must be serialized and deserialized, which is much more expensive than working with a heap-based state backend.

Please also refer to the recommendations on RocksDBStateBackend in the Task Executor memory configuration documentation.

RocksDBStateBackend is currently the only State Backend that supports incremental CheckPoints (see the Incremental snapshots section below).

Some RocksDB native metrics are available but disabled by default; refer to the documentation on RocksDB native metrics for details.

The total memory used by the RocksDB instance(s) of a slot can also be bounded; refer to the documentation for details.
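As a sketch, the relevant flink-conf.yaml entries could look like this (the keys below assume Flink 1.10 or later):

# let Flink size RocksDB within the slot's managed memory budget (the default)
state.backend.rocksdb.memory.managed: true
# or pin each slot's RocksDB memory to a fixed size instead
state.backend.rocksdb.memory.fixed-per-slot: 128mb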


2. Setting the State Backend

If nothing is specified explicitly, jobmanager (MemoryStateBackend) is used as the default state backend. A different default State Backend for all jobs can be set in flink-conf.yaml; the per-job state backend configuration overrides this default, as shown below.

Setting the State Backend per job

The State Backend of an individual job can be set on the StreamExecutionEnvironment as follows:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"));

If you want to use RocksDBStateBackend in the IDE, or need to configure RocksDBStateBackend programmatically in the job, you must add the following dependency to your Flink project:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
    <version>1.11.0</version>
    <scope>provided</scope>
</dependency>
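With the dependency on the classpath, a minimal sketch of the programmatic setup (reusing the HDFS checkpoint URL from above) is:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// RocksDBStateBackend(String) declares IOException, so call it from a method that throws or catches it
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints"));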

Setting the default (global) State Backend

The default State Backend can be set in flink-conf.yaml via the state.backend key.

Possible values are jobmanager (MemoryStateBackend), filesystem (FsStateBackend), rocksdb (RocksDBStateBackend), or the fully qualified class name of a class implementing the state backend factory StateBackendFactory, for example org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory for RocksDBStateBackend.

The state.checkpoints.dir option specifies the directory to which all State Backends write CheckPoint data and metadata files. Detailed information about the CheckPoint directory structure can be found in the documentation.

An example configuration file section looks as follows:

# State Backend used to store operator state snapshots
state.backend: filesystem

# Directory for storing snapshots
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints

3. Advanced RocksDB State Backend

This section describes RocksDBStateBackend in more detail.

Incremental snapshots

RocksDBStateBackend supports incremental snapshots. Instead of producing a full backup containing all data, an incremental snapshot contains only the records modified since the previous snapshot completed, which can significantly reduce the time it takes to complete a snapshot.

An incremental snapshot builds on (usually several) previous snapshots. Because RocksDB internally compacts sst files, Flink's incremental snapshots also rebase periodically, so the incremental chain does not grow forever; files contained in old snapshots gradually expire and are cleaned up automatically.

Compared with recovery from a full snapshot, recovery from incremental snapshots may take longer when network bandwidth is the bottleneck, because the sst files of different incremental snapshots may contain overlapping data, which increases the amount of data to download. When CPU or I/O is the bottleneck, however, recovery from incremental snapshots is faster, because it can load the sst files directly instead of rebuilding the local RocksDB tables from Flink's unified snapshot format.

Although we recommend incremental snapshots when the amount of state data is large, they are not the default snapshot mechanism. You need to enable the feature explicitly through one of the following (a code sketch follows this list):

  • In flink-conf.yaml, set state.backend.incremental: true, or
  • configure it in code (overriding the default configuration): RocksDBStateBackend backend = new RocksDBStateBackend(filebackend, true);
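For example, a minimal end-to-end sketch (assuming an HDFS checkpoint directory; the second constructor argument enables incremental snapshots):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// the constructor declares IOException, so call it from a method that throws or catches it
RocksDBStateBackend backend = new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints", true);
env.setStateBackend(backend);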

Note that once incremental snapshots are enabled, the Checkpointed Data Size shown in the web UI only represents the amount of data uploaded incrementally, not the full size of a single snapshot.

Java RocksDB example

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class C06_StateBackendForRocksDB {
    public static void main(String[] args) throws Exception {

        // State backend data should live in a distributed file system for easier management and maintenance
        System.setProperty("HADOOP_USER_NAME", "root");
        System.setProperty("hadoop.home.dir", "/opt/cloudera/parcels/CDH-5.16.1-1.cdh5.16.1.p0.3/bin/");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The restart strategy only takes effect when checkpointing is enabled; it is off by default
        env.enableCheckpointing(5000); // CheckPoint interval in milliseconds; the default is -1 (disabled)

        // The default restart strategy is fixed-delay with unlimited restarts
        //env.getConfig().setRestartStrategy(RestartStrategies.fallBackRestart());
        // Restart at most 3 times, with a 1000 ms delay between attempts
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 1000));

        // Configure the state backend (local file system here). Default: state is kept in
        // TaskManager memory and CheckPoints in JobManager memory
        //StateBackend stateBackend = new FsStateBackend("file:///lei_test_project/idea_workspace/FlinkTutorial/check_point_dir");
        StateBackend stateBackend = new RocksDBStateBackend("file:///lei_test_project/idea_workspace/FlinkTutorial/check_point_dir");
        env.setStateBackend(stateBackend);
        // In production, store the State Backend data in a distributed file system
        //env.setStateBackend(new FsStateBackend("hdfs://node-01:8020/user/root/sqoop/flink_state_backend"));

        // Retain CheckPoint data when the job fails or is cancelled; by default it is deleted on cancellation
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        DataStreamSource<String> lines = env.socketTextStream("node-01", 7777);
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = lines.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String word) throws Exception {
                if (word.startsWith("null")) {
                    throw new RuntimeException("Input is null, throwing an exception");
                }
                return Tuple2.of(word, 1);
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Integer>> summed = wordAndOne.keyBy(0).sum(1);

        summed.print();

        env.execute("C06_StateBackendDemo");
    }
}
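To try the example, first start a socket source on node-01, for instance with nc -lk 7777, and type words into it; a line starting with "null" triggers the exception, which exercises the restart strategy and state recovery from the CheckPoint.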
