Flink: State management and fault tolerance mechanism

Table of Contents

State in Flink

What is state in Flink?

Keyed State and Operator State

Raw state and managed state

How to use Managed Keyed State

State time-to-live (TTL)

How to use Managed Operator State

Fault tolerance mechanism

What is a checkpoint

Checkpoint algorithm

How to use checkpoint

Enable checkpoint



State in Flink

What is state in Flink?

State in Flink, simply put, is a variable stored locally by a stateful function or operator while it processes data. This variable can hold data of a custom structure, record results produced during computation, or store other data. A stateful operation reads and updates the state as it processes each record, as illustrated below:

[Figure: a stateful operator reading and updating its local state while processing records]

Based on state, Flink can perform more refined operations, for example:

  • All elements can be saved in the state, so that the application can look them up later.
  • The results of aggregation operations can be saved in the state, such as the result computed by a reduce operator or a window aggregation function.
  • When training a machine learning model on a data stream, the state can hold the current version of the model parameters.
  • Historical data can be saved in the state, making it more efficient to manage.

Flink state lives in memory first and is periodically written to checkpoints (here, checkpoints are saved to the local file system) to prevent data loss on unexpected failures. You can also use savepoints to manually persist state to a durable file system such as HDFS or S3.
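For example, a minimal sketch of pointing checkpoint storage at a durable file system (assuming Flink 1.10's FsStateBackend; the HDFS path is a placeholder):

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// keep checkpoints in a durable file system rather than the default location;
// the path below is a placeholder
env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"))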

Keyed State and Operator State

First of all, there are two types of state in Flink: Keyed State and Operator State.

Keyed State: keyed state is always tied to a key, so it can only be used in functions and operators on a KeyedStream. You can think of it this way: a KeyedStream operator or function partitions the stream by key, each key forms a partition, and each partition stores its own keyed state.

Keyed state is further organized into so-called Key Groups. A Key Group is the unit by which Flink redistributes keyed state; there are exactly as many Key Groups as the defined maximum parallelism.

Operator State: Operator State is non-keyed state. Each parallel task of a non-keyed operator or function is bound to one Operator State. The Kafka connector is a good example: each parallel instance of the Kafka consumer maintains a map of topic partitions and offsets as its state. When the parallelism changes, Operator State supports redistributing the state across the parallel instances.

Raw state and managed state

Keyed state and Operator State can each exist in two forms: managed and raw.

Managed State: managed state is controlled by the Flink runtime and stored in runtime data structures such as hash tables or RocksDB, for example ValueState and ListState. Flink encodes managed state and writes it into checkpoints.

Raw State: raw state is state kept in the operator's own data structures. When a checkpoint is taken, it is written into the checkpoint as a byte sequence, so Flink knows nothing about its structure and sees only the raw bytes. In most cases managed state is used.

All Flink functions can use managed state, but to use raw state you must implement the corresponding interface in the function. The official recommendation is to prefer managed state: with managed state, Flink can automatically redistribute state when the parallelism changes, and memory management is more complete.

Note: if your managed state requires custom serialization, see: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/custom_serialization.html

How to use Managed Keyed State

As mentioned above, Flink state can hold all elements, aggregation results, historical data, and so on, and Flink provides corresponding interfaces for these uses. Note that keyed state, as the name implies, can only be used after stream.keyBy(...); otherwise an error is thrown.

The following are the state types provided by Flink:

ValueState<T>: holds a single value that can be updated and retrieved. The scope is the key of the input element, i.e. each key stores one state of type <T>. Use update(T) to update the state and value() to read it.

ListState<T>: holds a list of elements that you can append to and iterate over. Use add(T) or addAll(List<T>) to add data, get() to obtain an Iterable over the elements, and update(List<T>) to overwrite the whole list.

ReducingState<T>: holds a single value that is the pre-aggregated result of all elements added so far. The interface is similar to ListState, except that add(T) calls the user-supplied ReduceFunction to combine the new element with the previous aggregate and stores the new result.

AggregatingState<IN, OUT>: holds a single value that is the pre-aggregated result of all elements added so far. The difference from ReducingState is that the input and output types may differ; the two type parameters define them. add(IN) internally calls the user-supplied AggregateFunction.

FoldingState<T, ACC>: holds a single value that is the pre-aggregated result of all elements added so far. Similar to ReducingState, except that add(T) internally calls a FoldFunction rather than a ReduceFunction, and the FoldFunction can be given an initial value of type ACC. This interface is deprecated.

MapState<UK, UV>: holds a map of key-value pairs. Use put(UK, UV) or putAll(Map<UK, UV>) to add entries; use get(UK) to retrieve a value; use entries(), keys() and values() to iterate; use isEmpty() to check whether the map contains any data.

All state types also have a clear() method that removes all state for the key of the current input element.
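As a quick illustration of the interfaces above, the following minimal sketch (class and state names are invented for the example) uses a MapState to count occurrences of each event type per key:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// input: (key, eventType); output: (key, eventType, count so far)
class EventTypeCount extends RichFlatMapFunction[(String, String), (String, String, Long)] {

  private var counts: MapState[String, Long] = _

  override def open(parameters: Configuration): Unit = {
    counts = getRuntimeContext.getMapState(
      new MapStateDescriptor[String, Long](
        "event-type-counts", createTypeInformation[String], createTypeInformation[Long]))
  }

  override def flatMap(input: (String, String), out: Collector[(String, String, Long)]): Unit = {
    val (key, eventType) = input
    // read the previous count for this event type, defaulting to 0
    val previous = if (counts.contains(eventType)) counts.get(eventType) else 0L
    counts.put(eventType, previous + 1)
    out.collect((key, eventType, previous + 1))
  }
}

It would be used after keyBy, e.g. stream.keyBy(_._1).flatMap(new EventTypeCount()).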

Important note 1: the state objects above are only interfaces to the state. The state is not necessarily stored inside these objects; it may also live on local disk or elsewhere.

Important note 2: the value you get from the state depends on the key of the current input element, so the same function invocation can return different values for different keys.

To obtain state, you must create a StateDescriptor that describes the state's name and data type, and possibly a user function such as a ReduceFunction. The state is then obtained by calling the corresponding get method of RuntimeContext, which means state can only be accessed from within a rich function.

The corresponding ways to obtain different states are as follows:

  • ValueState<T> getState(ValueStateDescriptor<T>)
  • ReducingState<T> getReducingState(ReducingStateDescriptor<T>)
  • ListState<T> getListState(ListStateDescriptor<T>)
  • AggregatingState<IN, OUT> getAggregatingState(AggregatingStateDescriptor<IN, ACC, OUT>)
  • FoldingState<T, ACC> getFoldingState(FoldingStateDescriptor<T, ACC>)
  • MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV>)

Taking FlatMapFunction as an example, the code for using state is as follows:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

class CountWindowAverage extends RichFlatMapFunction[(Long, Long), (Long, Long)] {

  private var sum: ValueState[(Long, Long)] = _

  override def flatMap(input: (Long, Long), out: Collector[(Long, Long)]): Unit = {

    // read the current value of the state
    val tmpCurrentSum = sum.value

    // if the state is non-null use it; otherwise initialize currentSum to (0L, 0L)
    val currentSum = if (tmpCurrentSum != null) {
      tmpCurrentSum
    } else {
      (0L, 0L)
    }

    // compute the new sum: (element count, sum of values)
    val newSum = (currentSum._1 + 1, currentSum._2 + input._2)

    // update the state
    sum.update(newSum)

    // once the element count reaches 2, emit the average and clear the state
    if (newSum._1 >= 2) {
      out.collect((input._1, newSum._2 / newSum._1))
      sum.clear()
    }
  }

  override def open(parameters: Configuration): Unit = {
    // initialize the state in open() so it is not accessed too early and produces wrong data;
    // declaring the field as lazy outside open() has the same effect
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[(Long, Long)]("average", createTypeInformation[(Long, Long)])
    )
  }
}


object ExampleCountWindowAverage extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  env.fromCollection(List(
    (1L, 3L),
    (1L, 5L),
    (1L, 7L),
    (1L, 4L),
    (1L, 2L)
  )).keyBy(_._1)
    .flatMap(new CountWindowAverage())
    .print()
  // the printed output will be (1,4) and (1,5)

  env.execute("ExampleManagedState")
}

In this example, the first element of the input tuple is the key (all keys in the example are 1), and the function stores the element count and the running sum of values in the state. When the element count reaches 2, it emits the average value and clears the state.

Note that if the first elements of the tuples in the list differed (i.e. the keys differed), a separate state would be kept for each distinct key.

State time-to-live (TTL)

Any type of keyed state can be assigned a time-to-live (TTL). If a TTL is configured and a state value has expired, that value is cleared. The TTL is tracked independently per entry, which means that if the state for one key expires, only that key's state is affected; the state of other keys is untouched.

The expiration rule is: last access timestamp + TTL <= current time means expired. The source code that performs this check is shown below:

public class TtlUtils {
	static <V> boolean expired(@Nullable TtlValue<V> ttlValue, long ttl, TtlTimeProvider timeProvider) {
		return expired(ttlValue, ttl, timeProvider.currentTimestamp());
	}

	static <V> boolean expired(@Nullable TtlValue<V> ttlValue, long ttl, long currentTimestamp) {
		return ttlValue != null && expired(ttlValue.getLastAccessTimestamp(), ttl, currentTimestamp);
	}

	static boolean expired(long ts, long ttl, TtlTimeProvider timeProvider) {
		return expired(ts, ttl, timeProvider.currentTimestamp());
	}
    // if last timestamp + TTL <= current time, the value is considered expired
	public static boolean expired(long ts, long ttl, long currentTimestamp) {
		return getExpirationTimestamp(ts, ttl) <= currentTimestamp;
	}

	private static long getExpirationTimestamp(long ts, long ttl) {
		long ttlWithoutOverflow = ts > 0 ? Math.min(Long.MAX_VALUE - ts, ttl) : ttl;
		return ts + ttlWithoutOverflow;
	}

	static <V> TtlValue<V> wrapWithTs(V value, long ts) {
		return new TtlValue<>(value, ts);
	}
}

Configure TTL

To configure TTL, first create a StateTtlConfig object holding the TTL settings, then call the enableTimeToLive method of the state descriptor to turn TTL on, and finally obtain the state from the RuntimeContext through the descriptor as usual. For example:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();
    
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);

Among these settings, the setUpdateType method sets when the TTL timer is refreshed; there are two refresh mechanisms:

  • StateTtlConfig.UpdateType.OnCreateAndWrite - refresh only on creation and write access.
  • StateTtlConfig.UpdateType.OnReadAndWrite - refresh on read access as well as on creation and write.

The setStateVisibility method sets how state that has expired but has not yet been cleaned up is handled. There are also two mechanisms:

  • StateTtlConfig.StateVisibility.NeverReturnExpired - expired data is never visible, even if it has not yet been cleared.
  • StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp - expired data remains visible until it is cleared.
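Putting the pieces together, here is a minimal Scala sketch of the full cycle (the function and state names are invented for the example); the TTL-enabled descriptor is used like any other descriptor when obtaining state in open():

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.configuration.Configuration

class LastValueWithTtl extends RichMapFunction[String, String] {

  @transient private var lastValue: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    val ttlConfig = StateTtlConfig
      .newBuilder(Time.seconds(1))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .build()

    val descriptor = new ValueStateDescriptor[String]("last value", classOf[String])
    descriptor.enableTimeToLive(ttlConfig)
    lastValue = getRuntimeContext.getState(descriptor)
  }

  override def map(value: String): String = {
    // with NeverReturnExpired, value() returns null once the entry has expired
    val previous = lastValue.value()
    lastValue.update(value)
    if (previous == null) value else s"$previous -> $value"
  }
}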

Cleaning up expired state

By default, expired values are removed when they are read, and garbage collection also runs periodically in the background. Background cleanup can be disabled as follows:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .disableCleanupInBackground()
    .build();

You can also have expired state dropped when a full state snapshot is taken, which reduces the snapshot size. In this mode the local state is not cleaned up, but state restored from the snapshot will not contain expired data. Note: this option is not applicable to incremental checkpointing with RocksDB. It is configured as follows:

import org.apache.flink.api.common.state.StateTtlConfig
import org.apache.flink.api.common.time.Time

val ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .cleanupFullSnapshot
    .build

Expired state can also be cleaned up incrementally, triggered when state is accessed or records are processed. With this strategy, the state backend keeps a lazy global iterator over all state entries; each time a cleanup is triggered, the iterator advances, traversing entries and removing the expired ones. The configuration looks like this:

import org.apache.flink.api.common.state.StateTtlConfig
import org.apache.flink.api.common.time.Time

val ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .cleanupIncrementally(10, true)
    .build

This method takes two parameters: the first is the number of state entries checked on each state access (every state access triggers an incremental cleanup); the second is whether each processed record should additionally trigger a cleanup. The defaults are 5 entries per check, with no cleanup triggered per processed record.

State shortcuts in the Scala DataStream API

In addition to the interfaces above, the Scala API provides shortcuts for stateful map() and flatMap() functions on a KeyedStream of tuples that use a single ValueState, via mapWithState and flatMapWithState. For example:

val stream: DataStream[(String, Int)] = ...

val counts: DataStream[(String, Int)] = stream
  .keyBy(_._1)
  .mapWithState((in: (String, Int), count: Option[Int]) =>
    count match {
      case Some(c) => ( (in._1, c), Some(c + in._2) )
      case None => ( (in._1, 0), Some(in._2) )
    })

How to use Managed Operator State

Because keyed state covers most production use cases and operator state is rarely used, only a simple example is shown here.

The following example implements a stateful SinkFunction that uses CheckpointedFunction to buffer elements before sending them out; when a snapshot is taken, the buffered elements are written into the operator state, and on restore they are read back into the buffer.

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import scala.collection.mutable.ListBuffer

class BufferingSink(threshold: Int = 0)
  extends SinkFunction[(String, Int)]
    with CheckpointedFunction {

  @transient
  private var checkpointedState: ListState[(String, Int)] = _

  private val bufferedElements = ListBuffer[(String, Int)]()

  override def invoke(value: (String, Int), context: Context): Unit = {
    bufferedElements += value
    if (bufferedElements.size == threshold) {
      for (element <- bufferedElements) {
        // send it to the sink
      }
      bufferedElements.clear()
    }
  }

  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    checkpointedState.clear()
    for (element <- bufferedElements) {
      checkpointedState.add(element)
    }
  }

  override def initializeState(context: FunctionInitializationContext): Unit = {
    val descriptor = new ListStateDescriptor[(String, Int)](
      "buffered-elements",
      TypeInformation.of(new TypeHint[(String, Int)]() {})
    )

    checkpointedState = context.getOperatorStateStore.getListState(descriptor)

    if(context.isRestored) {
      for(element <- checkpointedState.get()) {
        bufferedElements += element
      }
    }
  }

}
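A usage sketch (the source data and threshold are invented for illustration; assumes import org.apache.flink.streaming.api.scala._ is in scope for the implicit type information):

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.fromCollection(List(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
  .addSink(new BufferingSink(threshold = 2)) // flush once 2 elements are buffered

env.execute("BufferingSinkExample")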

Fault tolerance mechanism

Flink's fault tolerance is based on checkpoints (consistent snapshots of the state). Simply put, Flink saves its state to checkpoints during processing; when a task is stopped by a failure, the data can be restored from the checkpoint and the task can continue, which achieves fault tolerance.

What is a checkpoint

Checkpoints are the core of Flink's failure recovery mechanism and are what make exactly-once processing guarantees possible. A checkpoint is essentially a snapshot of the state of a stateful stream at a certain point in time. That point in time should be one at which all tasks have processed exactly the same input data, i.e. when the last operator in the Flink program has finished processing that record; state for data that has not yet been fully processed is not saved.

When a failure stops the application, the first step is to restart it, and the second is to restore its state from the checkpoint. The state is reset to what it was at the last checkpoint, after which the application continues to run normally.
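How the application restarts is itself configurable; for example, a fixed-delay restart strategy (one option among several, not covered above) can be set like this:

import java.util.concurrent.TimeUnit
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time

// retry the job up to 3 times, waiting 10 seconds between attempts;
// each restart restores state from the latest completed checkpoint
env.setRestartStrategy(
  RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)))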

Flink's checkpoints differ from Spark Streaming's. Spark Streaming is batch processing, so its checkpointing is relatively simple: the underlying abstraction is the RDD, so it just saves RDDs. The drawback is that after a failure an entire batch of data may have to be recomputed, which for large amounts of data costs a lot of time. Flink, by contrast, is stream processing, and its checkpoints track progress at the level of individual records (you can configure a snapshot for every record processed, or one per time interval), so its checkpointing is more complex, but the impact of a failure on the whole application is smaller.

Checkpointing uses a watermark-like mechanism called the checkpoint barrier, which is used to align checkpoints: an operator takes a snapshot when it receives a barrier. A checkpoint barrier has three attributes: an ID, a timestamp, and checkpoint options. Every stateful operator takes a snapshot when it encounters a barrier, and the checkpoint is complete only after the last operator has saved its snapshot.

Checkpoint algorithm

Flink's checkpointing is based on the Chandy-Lamport distributed snapshot algorithm. For details, see this blog: https://www.cnblogs.com/yuanyifei1/p/10360465.html

How to use checkpoint

StreamExecutionEnvironment holds a CheckpointConfig object. When you call env.enableCheckpointing(1000) on the environment object, it actually delegates to the setter methods of that CheckpointConfig object. For example, enabling checkpointing through the env object simply calls CheckpointConfig's setCheckpointInterval method:

// the enableCheckpointing method of StreamExecutionEnvironment
public StreamExecutionEnvironment enableCheckpointing(long interval) {
	checkpointCfg.setCheckpointInterval(interval);
	return this;
}

The setCheckpointInterval method in CheckpointConfig:

// CheckpointConfig's setCheckpointInterval method
public void setCheckpointInterval(long checkpointInterval) {
	if (checkpointInterval <= 0) {
		throw new IllegalArgumentException("Checkpoint interval must be larger than zero");
	}
	this.checkpointInterval = checkpointInterval;
}

Enable checkpoint

By default, checkpointing is disabled; to enable it, call the enableCheckpointing(n) method of the StreamExecutionEnvironment object. The parameter n means a checkpoint barrier is emitted every n milliseconds.

The following is an example of configuring checkpoints:

// create the environment object
val env = StreamExecutionEnvironment.getExecutionEnvironment()

// take a snapshot every 1000 ms
env.enableCheckpointing(1000)

// the line above is all that is needed to enable checkpointing; the settings below are optional:

// set exactly-once mode (exactly-once is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)

// ensure a minimum pause of 500 ms between checkpoints. Suppose a checkpoint is taken every 10 s
// and one run takes 9 s; the next one would normally start only 1 s after it finishes.
// This setting enforces a minimum gap between consecutive checkpoints.
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)

// set the checkpoint timeout: a checkpoint taking longer than one minute is discarded
env.getCheckpointConfig.setCheckpointTimeout(60000)

// whether to fail the task when checkpointing fails. With false, a failure during
// checkpointing does not stop the job; only that checkpoint is discarded.
env.getCheckpointConfig.setFailTasksOnCheckpointingErrors(false)

// set how many checkpoints may be in progress concurrently
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
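Another commonly used option, not shown above, is to retain checkpoints when the job is cancelled so they can be used for manual recovery; a sketch assuming the Flink 1.10 API:

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup

// keep the latest checkpoint after cancellation instead of deleting it
env.getCheckpointConfig.enableExternalizedCheckpoints(
  ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)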
The state- and checkpoint-related configuration options are listed below (key, default, type, description):

state.backend (default: null, type: String)
The state backend used to store and checkpoint state. Determines where checkpoints are stored (usually a remote store such as a file system or RocksDB).

state.backend.async (default: true, type: Boolean)
Whether the state backend should use asynchronous snapshots. Some state backends may not support asynchronous snapshots, or may only support asynchronous snapshots, in which case this option is ignored.

state.backend.fs.memory-threshold (default: 1024, type: Int)
The minimum size of state data files. State smaller than this value is stored in memory; state exceeding it is written to files on disk.

state.backend.fs.write-buffer-size (default: 4096, type: Int)
The default size of the write buffer. The actual write buffer size is the maximum of this option and "state.backend.fs.memory-threshold".

state.backend.incremental (default: false, type: Boolean)
Whether to use incremental checkpoints if possible. An incremental checkpoint stores only the difference from the previous checkpoint rather than the complete state. Some state backends do not support incremental checkpoints and ignore this option.

state.backend.local-recovery (default: false, type: Boolean)
Whether to restore state from local storage. Local recovery is disabled by default. In the current version (1.10), local recovery only covers keyed state backends; MemoryStateBackend does not support it, so ignore this option for that backend.

state.checkpoints.dir (default: null, type: String)
The default directory for checkpoint data files and metadata. It must be a path accessible to all TaskManagers and JobManagers.

state.checkpoints.num-retained (default: 1, type: Int)
The maximum number of completed checkpoints to retain.

state.savepoints.dir (default: null, type: String)
The default savepoint directory, used when writing savepoints to a file system (MemoryStateBackend, FsStateBackend, RocksDBStateBackend).

taskmanager.state.local.root-dirs (default: null, type: String)
The root directories for storing file-based state for local recovery. In the current version (1.10), local recovery only covers keyed state backends; MemoryStateBackend does not support it, so ignore this option for that backend.
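These options are typically set in flink-conf.yaml; a minimal sketch (the HDFS paths are placeholders):

state.backend: filesystem
state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
state.savepoints.dir: hdfs://namenode:9000/flink/savepoints
state.backend.incremental: false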
