The relationship between Flink state management and the fault tolerance mechanism (Checkpoint & Savepoint)

1. What is state

Example of stateless computation: take an addition operator. The first time I input 2+3, I get 5, and I get the same 5 no matter how many more times I input 2+3. Conclusion: the same input always produces the same result, regardless of how many times it is submitted.

Example of stateful computation: counting visits. The Nginx access log contains one line per request, so we can count visits from it. The first time the url /api/a is visited, the returned result is count = 1, but on the second visit the result becomes 2. How does Flink know that an input such as hello world has been processed once before? This is where state comes into play. Here it is keyed state, which stores the data counted so far. Calling the keyBy interface creates a keyed stream partitioned by key, which is the prerequisite for using keyed state. Conclusion: the same input produces different results depending on how many times it has been seen. This is stateful computation.
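
The difference between the two cases can be sketched in plain Java, without Flink. This is only an illustration: the class name StatelessVsStateful and the HashMap standing in for keyed state are assumptions made for this sketch, not Flink APIs.

```java
import java.util.HashMap;
import java.util.Map;

final class StatelessVsStateful {

    // Stateless: the result depends only on the current input.
    static int add(int a, int b) {
        return a + b;
    }

    // Stateful: one counter per url that survives between calls.
    // The map plays the role that keyed state plays in Flink.
    private final Map<String, Integer> countPerUrl = new HashMap<>();

    int visit(String url) {
        // merge() stores 1 for a new url, otherwise adds 1 to the stored count,
        // and returns the updated count.
        return countPerUrl.merge(url, 1, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // 5
        System.out.println(add(2, 3)); // 5 again: same input, same result

        StatelessVsStateful counter = new StatelessVsStateful();
        System.out.println(counter.visit("/api/a")); // 1
        System.out.println(counter.visit("/api/a")); // 2: same input, different result
    }
}
```

The counter is keyed by url, so a visit to a different url starts again from 1. That is the same partitioning idea that keyBy provides.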

In which scenarios is this kind of state data used heavily? A few examples:
[1] Deduplication: for example, we only want to know which departments these 100 colleagues belong to.
[2] Window computation: data that has entered a window that has not fired yet. For example, if we compute statistics once a minute over the window from minute 1 to minute 2, the data that arrives at minute 1.5 is state for us, because the result produced at minute 2 depends on the data from minute 1.5.
[3] Machine learning/deep learning: trained models and parameters. Anyone who has studied machine learning will recognize this. For example, the first time I input hello, the model gives me feedback, and the next time it learns further based on that feedback. The result of the previous step is therefore stateful input for the next one.
[4] Access to historical data: when today's data has to be compared with yesterday's, yesterday's data is state for today.

Why do we need to manage state? Isn't plain memory good enough? Stream processing has its own standards, and not every computation can be called stream processing. First, a streaming job runs 24/7 and must be highly reliable; if state is simply kept in memory, capacity will eventually run out. Second, data must be neither lost nor duplicated, that is, computed exactly once; state held in memory has to be backed up and restored, and with plain memory a small part of the data is always lost along the way. Finally, data is produced in real time and must be processed without delay; when memory runs short and the job has to scale out horizontally, processing is delayed.

Ideal state management meets all of these requirements, and Flink has already implemented it for us.

2. Types of state

Managed State & Raw State

	Managed State	Raw State
State management	Managed by the Flink Runtime: automatic storage, automatic recovery, optimized memory management	Managed by the user (Flink does not know the data structure stored in the State); you have to serialize it yourself
State data structures	Known data structures: value, list, map…	Byte array: byte[]
Recommended usage scenarios	Can be used in most cases	Can be used when implementing a custom Operator (when Managed State is not enough)

Managed State is divided into Keyed State and Operator State
[1] Keyed State: can only be used in operators on the KeyedStream produced by keyBy. Each key corresponds to one State; one Operator instance handles multiple keys and accesses the State of each of them. The same key is always processed by the same instance. If the pipeline contains no keyBy operation, there is no KeyedStream, and Keyed State can only be applied on a KeyedStream.

Parallelism changes: State migrates between instances together with its key. For example, instance A used to process both KeyA and KeyB. After I scale out by adding instance B, instance A only needs to process KeyA, and KeyB is handed over to instance B. State is split by key, so it can be understood as distributed.
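
A rough sketch of this migration in plain Java follows. It assumes a simplified assignment rule (hash of the key modulo the parallelism); Flink itself first maps keys to key groups and then assigns key groups to instances, so the class and method names here are purely illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeyedStateRescale {

    // Simplified assignment rule (an assumption of this sketch):
    // the key's hash modulo the number of parallel instances.
    static int instanceFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Re-partitions per-key state for a new parallelism.
    // Every state entry moves to whichever instance now owns its key.
    static List<Map<String, Integer>> redistribute(Map<String, Integer> stateByKey, int parallelism) {
        List<Map<String, Integer>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new HashMap<>());
        }
        for (Map.Entry<String, Integer> entry : stateByKey.entrySet()) {
            int owner = instanceFor(entry.getKey(), parallelism);
            instances.get(owner).put(entry.getKey(), entry.getValue());
        }
        return instances;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        state.put("KeyA", 3);
        state.put("KeyB", 7);
        System.out.println(redistribute(state, 1)); // one instance holds both keys
        System.out.println(redistribute(state, 2)); // each key sits on the instance that owns it
    }
}
```

Whatever the parallelism, each key's state exists exactly once, on the instance that processes that key.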

Access method: through RuntimeContext, which means the Operator must be a Rich Function; otherwise the RuntimeContext cannot be obtained.

Supported data structures: ValueState, ListState, ReducingState, AggregatingState, MapState

[2] Operator State: can be used in all operators and is often used on sources, for example FlinkKafkaConsumer. One Operator instance corresponds to one State, so the many keys handled by one Operator instance all share that instance's State.

Parallelism changes: Operator State has no key, so it has to be redistributed when the parallelism changes. There are two built-in schemes: even split, and union, where every instance gets the full list.
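
The two schemes can be sketched in plain Java as follows. The round-robin dealing in evenSplit is an assumption of this sketch; Flink only promises that the list items are divided evenly, not this exact order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OperatorStateRedistribution {

    // Even split: the list items of all old instances are pooled
    // and dealt out round-robin to the new instances.
    static <T> List<List<T>> evenSplit(List<List<T>> oldInstances, int newParallelism) {
        List<List<T>> result = emptyInstances(newParallelism);
        int next = 0;
        for (List<T> old : oldInstances) {
            for (T item : old) {
                result.get(next % newParallelism).add(item);
                next++;
            }
        }
        return result;
    }

    // Union: every new instance receives the full pooled list
    // and decides for itself which items it keeps.
    static <T> List<List<T>> union(List<List<T>> oldInstances, int newParallelism) {
        List<T> all = new ArrayList<>();
        for (List<T> old : oldInstances) {
            all.addAll(old);
        }
        List<List<T>> result = emptyInstances(newParallelism);
        for (List<T> instance : result) {
            instance.addAll(all);
        }
        return result;
    }

    private static <T> List<List<T>> emptyInstances(int n) {
        List<List<T>> instances = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            instances.add(new ArrayList<>());
        }
        return instances;
    }

    public static void main(String[] args) {
        // Two old instances, each holding two list items, rescaled to three instances.
        List<List<Integer>> old = Arrays.asList(Arrays.asList(0, 1), Arrays.asList(2, 3));
        System.out.println(evenSplit(old, 3));
        System.out.println(union(old, 3));
    }
}
```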

Access method: implement the CheckpointedFunction or ListCheckpointed interface.

Supported data structures: ListState

3. Keyed State usage examples

What is keyed state: keyed state has two characteristics:
[1] It can only be used in functions and operations on a KeyedStream, for example Keyed UDF and window state.
[2] Keyed state is partitioned, and each key belongs to exactly one keyed state.

To understand the concept of partitioning, we need to look at the semantics of keyBy. In the original figure there are three parallel instances on the left and three on the right. After the words come in on the left, keyBy distributes them accordingly. For example, for hello word, the word hello always goes to the same task on the lower right, selected by a hash operation.

What is operator state
[1] Also known as non-keyed state. Each operator state is bound to exactly one operator instance.
[2] A common operator state is source state, for example recording the current offset of a source.

Take a word count program that uses operator state: its source is created with fromElements, which calls the FromElementsFunction class, and that class uses operator state of the list state type.

The Keyed State types are related as follows: they are all subclasses of State, and they differ in their access methods and data structures.

	State data type	Access interface	Remarks
ValueState	Single value	update(T) to modify / T value() to read	For example, WordCount uses the word as the key, and the state is a single value. This single value can also be a string, an object, and so on. There are only these two ways to access it.
MapState	Map	put(UK key, UV value), putAll(Map<UK,UV> map), remove(UK key), boolean contains(UK key), UV get(UK key), Iterable<Map.Entry<UK,UV>> entries(), Iterator<Map.Entry<UK,UV>> iterator(), Iterable<UK> keys(), Iterable<UV> values()	Can operate on the value of a specific key
ListState	List	add(T) / addAll(List<T>), update(List<T>), Iterable<T> get()	
ReducingState	Single value	add(T) / T get()	Shares a parent interface with ListState. Here add folds the element directly into the reduced result. For example, to compute statistics over 1 minute, ListState first appends every element to the list and computes over all of them after 1 minute, while ReducingState updates the result with each incoming element. The advantage is that it saves memory.
AggregatingState	Single value	add(IN) / OUT get()	Shares a parent interface with ListState. The difference from Reducing is that Reducing's input and output types are the same, while Aggregating's can differ. For example, to calculate an average, Reducing can only return a value of the input type, while Aggregating can work with the sum and the count.
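
The contrast between the last two rows can be sketched in plain Java. The classes below only imitate the add/get shape of the two state types; they are assumptions for illustration and not Flink's ReducingState or AggregatingState.

```java
import java.util.function.BinaryOperator;

final class ReducingVsAggregating {

    // Reducing-style state: one stored value, input and output share the type T.
    static final class Reducing<T> {
        private final BinaryOperator<T> reduce;
        private T value; // null until the first add

        Reducing(BinaryOperator<T> reduce) {
            this.reduce = reduce;
        }

        void add(T in) {
            // Each element is folded into the result immediately; no list is kept.
            value = (value == null) ? in : reduce.apply(value, in);
        }

        T get() {
            return value;
        }
    }

    // Aggregating-style state for an average: the input is a long,
    // the stored accumulator is (sum, count), and the output is a double.
    static final class Average {
        private long sum;
        private long count;

        void add(long in) {
            sum += in;
            count++;
        }

        double get() {
            return count == 0 ? 0.0 : (double) sum / count;
        }
    }

    public static void main(String[] args) {
        Reducing<Long> total = new Reducing<>(Long::sum);
        Average average = new Average();
        for (long v : new long[] {2L, 4L, 9L}) {
            total.add(v);
            average.add(v);
        }
        System.out.println(total.get());   // 15
        System.out.println(average.get()); // 5.0
    }
}
```

In both cases only a fixed-size result is stored per key, which is where the memory saving over a list comes from.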

A ValueState example

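import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Event, Alert and State are the job's own types and `source` is its
// SourceFunction<Event>; they are not defined in this excerpt.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Get the data stream
DataStream<Event> events = env.addSource(source);

DataStream<Alert> alerts = events
        // Build a KeyedStream keyed by sourceAddress
        .keyBy(Event::sourceAddress)
        // StateMachineMapper is the state machine
        .flatMap(new StateMachineMapper());


// How the state machine is written: it extends RichFlatMapFunction
@SuppressWarnings("serial")
static class StateMachineMapper extends RichFlatMapFunction<Event, Alert> {

    // Keyed state: one current State per sourceAddress
    private ValueState<State> currentState;

    @Override
    public void open(Configuration conf) {
        // Obtain a ValueState handle from the RuntimeContext
        currentState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("state", State.class));
    }

    // Called once for every incoming record
    @Override
    public void flatMap(Event evt, Collector<Alert> out) throws Exception {
        // Read the value stored for the current key
        State state = currentState.value();
        if (state == null) {
            state = State.Initial; // `state` is only a local variable
        }

        // Apply the event to the current state to get the next state
        State nextState = state.transition(evt.type());

        // Check whether the transition is legal
        if (nextState == State.InvalidTransition) {
            // Emit an alert
            out.collect(new Alert(evt.sourceAddress(), state, evt.type()));
        }
        // No further transitions are possible, for example a cancelled order
        else if (nextState.isTerminal()) {
            // Clear the entry from state
            currentState.clear();
        }
        else {
            // Update the state
            currentState.update(nextState);
        }
    }
}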

4. The relationship between Checkpoint and state

A Checkpoint is a global operation: it is triggered at the source and is complete only when all downstream nodes have finished it. The job shown in the original screenshot gives an intuitive feel for this: a total of 569K Checkpoints were triggered, all of them completed successfully, and none failed.

**State is the main data that a Checkpoint persists as a backup.** In the detailed statistics for the same job, the state is only 9 KB in size.

5. How to save and restore state

Checkpoint takes distributed snapshots periodically to back up the state of the program. When a failure occurs, all Tasks of the job are rolled back to the state of the last successful Checkpoint, and processing continues from that saved point.

Necessary condition: the data source supports replay (if it cannot resend, the lost messages are really lost).

Consistency semantics: exactly once (unlike the single-threaded case, in a multi-threaded job some operators may already have processed a record while others have not, which needs attention) and at least once.
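
Why replay matters can be seen in a small plain-Java simulation. It is a sketch under simplifying assumptions: a single operator that sums integers, a checkpoint every `interval` records, and at most one simulated crash. None of the names below are Flink APIs.

```java
import java.util.Arrays;
import java.util.List;

final class CheckpointReplay {

    // A snapshot couples the source position with the operator state at that position.
    static final class Snapshot {
        final int offset;
        final long sum;

        Snapshot(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    // Sums the records, snapshotting every `interval` records, and simulates one crash
    // right after the record at index `failAt` was processed (use -1 for no crash).
    static long run(List<Integer> records, int interval, int failAt) {
        Snapshot last = new Snapshot(0, 0L);
        int offset = 0;
        long sum = 0L;
        boolean failed = false;
        while (offset < records.size()) {
            sum += records.get(offset);
            offset++;
            if (offset % interval == 0) {
                last = new Snapshot(offset, sum); // checkpoint: state plus source offset
            }
            if (!failed && offset - 1 == failAt) {
                failed = true;
                // Recovery: roll the state back and rewind the replayable source.
                offset = last.offset;
                sum = last.sum;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(run(records, 2, -1)); // 15 without a failure
        System.out.println(run(records, 2, 2));  // 15 again: record 3 is replayed, not double counted
    }
}
```

Because the state and the offset are restored together, the replayed record affects the result exactly once. If the source could not rewind to the saved offset, everything read after the last snapshot would be lost.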

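import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Get the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// State data
// Trigger a checkpoint every 1 s. The more frequent the checkpoints, the less data has to be caught up after recovery, but the higher the IO cost.
env.enableCheckpointing(1000);
// EXACTLY_ONCE means checkpoint barriers are aligned, so messages are neither duplicated nor lost.
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Wait at least 500 ms between two checkpoints. For example, if the first checkpoint took 700 ms, the next one would normally start 300 ms later. Since 300 ms < 500 ms, the start is pushed back by 200 ms so that checkpoints are not too frequent and do not affect the business logic.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// Checkpoint timeout: a checkpoint that has not finished within 1 minute fails.
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Maximum number of checkpoints in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// Whether to keep the checkpoint when the job is cancelled, for example to change the parallelism and re-split tasks. DELETE_ON_CANCELLATION deletes it, so a savepoint is needed to keep the data on external storage.
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);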
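
The interplay between the 1000 ms interval and the 500 ms minimum pause can be worked through with a small helper. This is a simplified model of the rule described in the comments, not Flink's scheduler; the class and method names are made up for this sketch.

```java
final class CheckpointSchedule {

    // Earliest start of the next checkpoint, all values in milliseconds.
    // The next checkpoint must respect both the trigger interval (measured from
    // the previous start) and the minimum pause (measured from the previous end).
    static long nextStart(long prevStart, long prevDuration, long interval, long minPause) {
        long byInterval = prevStart + interval;
        long byMinPause = prevStart + prevDuration + minPause;
        return Math.max(byInterval, byMinPause);
    }

    public static void main(String[] args) {
        // Previous checkpoint started at 0 and took 700 ms:
        // the interval alone allows 1000, the minimum pause requires 700 + 500 = 1200.
        System.out.println(nextStart(0, 700, 1000, 500)); // 1200, i.e. delayed by 200 ms

        // A fast checkpoint (100 ms) is not delayed: max(1000, 600) = 1000.
        System.out.println(nextStart(0, 100, 1000, 500)); // 1000
    }
}
```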

Checkpoint vs Savepoint

	Checkpoint	Savepoint
Triggering and management	Automatically triggered and managed by Flink	Triggered and managed manually by the user
Main purpose	Quick recovery when a Task fails, for example from timeout exceptions caused by network jitter	Planned backups, so that a job can be stopped and resumed later, for example to modify code or adjust parallelism
Features	Lightweight, recovers automatically from failures, cleared by default after the job stops	Durable, stored in a standard format, allows code or configuration changes; recovery from a savepoint is triggered manually

Available state backends:
[1] MemoryStateBackend: Constructor:

MemoryStateBackend(int maxStateSize, boolean asynchronousSnapshots)

Storage: State: TaskManager memory. Checkpoint: JobManager memory.
Capacity limit: maxStateSize for a single State defaults to 5M. maxStateSize <= akka.framesize, which defaults to 10M. The total size cannot exceed the JobManager's memory.
Recommended usage scenarios: local testing; almost stateless jobs such as ETL; situations where the JobManager is unlikely to fail or a failure has little impact. Not recommended for production.

[2] FsStateBackend: Constructor:

FsStateBackend(URI checkpointDataUri, boolean asynchronousSnapshots)

Storage: State: TaskManager memory. Checkpoint: external file system (local or HDFS).
Capacity limit: the total State on a single TaskManager cannot exceed its memory. The total size cannot exceed the configured file system capacity (which is cleaned up regularly).
Recommended usage scenarios: jobs with ordinary state, such as minute-level window aggregation and joins; jobs that need HA enabled. Can be used in production.

[3] RocksDBStateBackend: Constructor:

RocksDBStateBackend(URI checkpointDataUri, boolean enableIncrementalCheckpointing)

Storage: State: a KV database on the TaskManager (in practice memory + disk). Checkpoint: external file system (local or HDFS).
Capacity limit: the total State on a single TaskManager cannot exceed its memory + disk, and a single key is limited to 2G. The total size cannot exceed the configured file system capacity.
Recommended usage scenarios: jobs with very large state, such as day-level window aggregation; jobs that need HA enabled; jobs that do not demand very high state read/write performance. Can be used in production.