A Detailed Explanation of Flink State, Checkpoint, and Savepoint

One of Flink's features: stateful computing

What is stateful computation? During execution, the program keeps the intermediate results it produces and makes them available to subsequent operators.
As shown in the figure, each Task passes its result on to the following Task; this is stateful computation.

Flink State division

State is roughly divided into two categories:

  1. Keyed State: tied to a key, and scoped to the Function or Operator that processes that key.
    Examples: ValueState, ListState, ReducingState, AggregatingState, MapState.
  2. Operator State: bound to a parallel operator instance. When the parallelism changes (splitting or merging state), the state data is automatically redistributed.
    Examples: ListState, BroadcastState.
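To make the keyed case concrete, here is a minimal plain-Java sketch (not the Flink API — the class and method names are invented for illustration) of how a per-key ValueState behaves: each key reads and updates only its own slot.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of keyed state: one state value per key, just as a
// ValueState scoped to the current key behaves in a keyed stream.
public class KeyedCounter {
    private final Map<String, Long> state = new HashMap<>(); // key -> count

    // Processing an element for `key` touches only that key's state.
    public long process(String key) {
        long next = state.getOrDefault(key, 0L) + 1;
        state.put(key, next);
        return next;
    }
}
```

Note that state for key "a" never interferes with state for key "b", which is the essential property of Keyed State.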

Note:
These State objects are only interfaces for interacting with state — for behaviors such as update, delete, and clear. The actual storage happens in one of 3 ways:

  1. MemoryStateBackend
  2. FsStateBackend
  3. RocksDBStateBackend

The main differences between them:
MemoryStateBackend and FsStateBackend keep working state on the Java heap.
The third, RocksDBStateBackend, stores state in RocksDB (a hybrid memory/disk store); each State lives in its own Column Family and uses (key + keyGroup + namespace) as the key.
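As a rough illustration of that composite key idea (the exact byte layout RocksDBStateBackend uses internally is not reproduced here — the helper class and field ordering below are assumptions for the sketch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: builds a (key + keyGroup + namespace) composite key,
// similar in spirit to how RocksDBStateBackend keys its entries.
public class CompositeKey {
    public static byte[] build(String key, int keyGroup, String namespace) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] ns = namespace.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(k.length + 4 + ns.length);
        buf.put(k);           // the user key
        buf.putInt(keyGroup); // the key group this key hashes into
        buf.put(ns);          // the namespace (e.g. a window)
        return buf.array();
    }
}
```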

Flink state management

There are two types of Flink state management:

  1. Managed State
  2. Raw State

1. Managed State: managed by the Flink runtime itself. The state data is stored in internal structures (hash tables or RocksDB objects) and is persisted in Checkpoints for failure recovery.
2. Raw State: the operator manages its own data structures. When a Checkpoint is triggered, the operator serializes its state to bytes and stores them in the Checkpoint; on recovery, the operator deserializes the bytes itself.

What the two have in common: both rely on Checkpoints.
Differences:

  1. Managed State is handled by the Flink runtime; the data is stored as hash tables or RocksDB objects.
  2. Raw State: the data is serialized to bytes by the operator itself.
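The Raw State round trip can be sketched in plain Java (this is not Flink's internal format — the class name and encoding are illustrative): the operator encodes its own structure to bytes for the checkpoint, and decodes those bytes on recovery.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative Raw State round trip: the operator itself owns the encoding.
public class RawStateRoundTrip {
    // Snapshot: encode the operator's data structure as bytes.
    public static byte[] snapshot(List<Long> state) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(state.size());
            for (long v : state) out.writeLong(v);
            out.flush();
            return bos.toByteArray(); // what gets written into the checkpoint
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Recovery: decode the bytes back into the operator's structure.
    public static List<Long> restore(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int n = in.readInt();
            List<Long> state = new ArrayList<>(n);
            for (int i = 0; i < n; i++) state.add(in.readLong());
            return state;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```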

Checkpoint

What is a Checkpoint? A checkpoint is a mechanism for failure recovery. Spark has checkpoints too; like Spark, Flink uses a Checkpoint to store a snapshot of the state at a point in time, so that a task can later be restored to that state.

The core of the Checkpoint implementation is the barrier. At regular intervals, Flink injects barriers into the data stream; the barrier is what assigns the records of a given period to a given checkpoint.
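For reference, this is roughly how interval-based checkpointing is enabled with Flink's DataStream API (a configuration fragment, not a standalone runnable program — it needs the Apache Flink streaming dependency on the classpath):

```java
// Configuration sketch; requires the Apache Flink streaming dependency.
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint (i.e. inject barriers) every 10 seconds.
        env.enableCheckpointing(10_000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}
```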

barrier

Barriers appear both in a single stream and in parallel streams (where each parallel channel carries its own copy of the barrier).

Characteristics of barrier

  1. The barrier flows downstream as part of the data stream.
  2. A barrier carries almost no payload; it is lightweight.
  3. Barriers are injected at strict intervals and never overtake the records around them, so there is no disorder.
  4. Each barrier has its own ID, so it can be uniquely identified.

barrier alignment mechanism

In fact, the alignment mechanism can be understood as what delivers the EXACTLY ONCE guarantee.
I didn't know what exactly-once meant at first; looking it up, it means: every record is reflected in the result exactly once. So how do barriers guarantee that?
Steps:

  1. According to the configured interval, Flink triggers a Checkpoint and injects barriers into all DataSources at the same time (there may be more than one source).
  2. The barrier becomes part of the data flow and travels downstream with the records (entering the DataStream operators).
  3. A downstream Operator may have multiple input channels. As soon as it receives the barrier on one of them, it stops processing new records from that channel. (New data may keep arriving on that channel for a while; those records are buffered — call the buffer buf for now.) The operator waits until it has received the barriers injected at the same point in time from all of its inputs.
  4. Once all barriers for that point in time have arrived, the operator's state becomes part of the snapshot, and Flink transmits it as checkpoint data. At the same time, the buf records accumulated in step 3 are emitted downstream as normal output (Outgoing Records).
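The alignment steps above can be simulated in a few lines of plain Java (an illustrative sketch, not Flink code: two input channels, records as strings, a string sentinel standing in for the barrier, and each channel assumed to contain exactly one barrier):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative barrier alignment for an operator with two input channels.
public class BarrierAlignment {
    static final String BARRIER = "BARRIER"; // stand-in for a checkpoint barrier

    public static List<String> align(List<String> ch0, List<String> ch1) {
        List<Deque<String>> channels = new ArrayList<>();
        channels.add(new ArrayDeque<>(ch0));
        channels.add(new ArrayDeque<>(ch1));
        boolean[] aligned = new boolean[2];
        List<String> processed = new ArrayList<>();
        int barriersSeen = 0;

        // Phase 1: process records until the barrier has arrived on every
        // channel. A channel whose barrier has arrived is blocked: its later
        // records stay queued (the "buf" from step 3) for now.
        while (barriersSeen < 2) {
            for (int i = 0; i < 2; i++) {
                if (aligned[i] || channels.get(i).isEmpty()) continue;
                String rec = channels.get(i).poll();
                if (BARRIER.equals(rec)) { aligned[i] = true; barriersSeen++; }
                else processed.add(rec); // channel not yet aligned: process normally
            }
        }
        // All barriers received: the operator would snapshot its state here.
        // Phase 2: unblock and emit the buffered records downstream (step 4).
        for (Deque<String> ch : channels)
            while (!ch.isEmpty()) processed.add(ch.poll());
        return processed;
    }
}
```

Note how records arriving behind an already-received barrier ("a2", "b3") are held back until alignment finishes; that is exactly what keeps pre-checkpoint and post-checkpoint data separated.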

Checkpoint in detail

The core algorithm behind Flink's Checkpoint is Chandy-Lamport, a distributed snapshot algorithm.
The general flow is as follows:

  1. The Checkpoint Coordinator initiates the Checkpoint.
  2. The sources broadcast the barrier downstream and persist their own state.
  3. After completing its backup, a task notifies the Coordinator of its State Handle. (At this point the checkpoint may be only partially complete: one task of the Flink job has finished, while other tasks still need to complete their part of the Checkpoint.)
  4. Step 3 repeats downstream until the sink node (usually the last task), after collecting the barriers from all of its upstream inputs, performs a local snapshot. Depending on the configuration, it writes the data to the corresponding place — for example, first to local RocksDB and then to third-party persistent storage.
  5. After the sink node completes its Checkpoint, it returns its State Handle to the Coordinator.
  6. When the Coordinator has gathered the State Handles of all tasks, it considers the Checkpoint globally complete, and backs up a Checkpoint meta file to the persistent storage location.
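The coordinator side of this flow can be sketched as follows (illustrative plain Java, not Flink's actual CheckpointCoordinator class): a checkpoint counts as globally complete only when every task has acknowledged with its state handle.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the coordinator side of Chandy-Lamport style
// checkpointing: globally complete only once every task has reported.
public class CheckpointCoordinator {
    private final int numTasks;
    private final Map<String, String> handles = new HashMap<>(); // task -> state handle

    public CheckpointCoordinator(int numTasks) { this.numTasks = numTasks; }

    // Called when a task finishes its local snapshot. Returns true once all
    // tasks have acknowledged — at that point the coordinator would write
    // the checkpoint meta file to persistent storage.
    public boolean acknowledge(String task, String stateHandle) {
        handles.put(task, stateHandle);
        return handles.size() == numTasks;
    }
}
```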

Comparison of Checkpoint and Savepoint

First, what is a Savepoint?
A Savepoint is a snapshot of the job generated at a user-chosen point in time.
A generated snapshot consists of:

  1. A directory: it contains many binary files (generally large) that hold the stream state.
  2. A metadata file: it records which data files belong to the savepoint.

Their main differences: a Checkpoint is triggered automatically by Flink and exists for failure recovery, with its lifecycle managed by Flink itself; a Savepoint is triggered manually by the user for planned operations such as upgrades or rescaling, and its lifecycle is managed by the user.


Origin blog.csdn.net/Zong_0915/article/details/107892504