Flink from entry to proficiency series (7)

9. State programming

9.1. State in Flink

In stream processing, data arrives and is processed continuously. When a task computes its output, it may rely only on the current record, or it may also need some other data. All the data that a task maintains and uses to compute its output is called the state of the task.

9.1.1. Stateful operators

In Flink, operator tasks can be divided into stateless and stateful.
A stateless operator task looks at each event independently and derives its output directly from the current input record.

For example, a stateless operator can split a string and emit it as a tuple, or apply a simple calculation such as adding 1 to a field that holds a quantity. The basic transformation operators map, filter, and flatMap are stateless: their results do not depend on any other data.

A stateful operator task needs some other data in addition to the current record to produce its result. This "other data" is the state; most often it is the data that arrived earlier, or a result computed from it. For example, a sum operator must keep the sum of all previous records, and that sum is its state. A window operator keeps all the data that has arrived for the window, which is also state. Aggregation operators and window operators are therefore stateful operators.
The general processing flow of a stateful operator is as follows.

  • The operator task receives data from upstream;
  • It reads the current state;
  • It performs the calculation defined by the business logic and updates the state;
  • It emits the result to the downstream task.
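
The four steps above can be sketched in plain Java. This is only an illustrative model, not Flink API: the class name RunningSumTask and its methods are made up for this example, and the "state" is simply a field that holds the running sum.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a stateful operator task (hypothetical, not Flink API).
class RunningSumTask {
    // The "state": the sum of all inputs seen so far.
    private long sum = 0L;

    // Handles one input record and returns the record sent downstream.
    long process(long input) {
        long current = sum;             // step 2: read the current state
        long updated = current + input; // step 3: apply the business logic
        sum = updated;                  //         and write the state back
        return updated;                 // step 4: emit the result
    }

    // Feeds a sequence of inputs through one task and collects the outputs.
    static List<Long> run(long... inputs) {
        RunningSumTask task = new RunningSumTask();
        List<Long> out = new ArrayList<>();
        for (long in : inputs) {        // step 1: receive data from upstream
            out.add(task.process(in));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(1, 2, 3, 4)); // prints [1, 3, 6, 10]
    }
}
```

Each output depends on everything that arrived before it, which is exactly what a stateless map or filter cannot express.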

9.1.2. State classification

9.1.2.1. Managed State and Raw State

Flink has two kinds of state: Managed State and Raw State.

  • Managed state is managed by Flink in a unified way. Storage and access, fault recovery, and redistribution on rescaling are all implemented by Flink; we only need to call the interfaces;
  • Raw state is user-defined. It amounts to allocating a piece of memory that we must manage ourselves, including state serialization and fault recovery.

Specifically, managed state is handled by Flink's runtime. Once the fault tolerance mechanism is configured, the state is persisted automatically and restored after a failure. When the application is rescaled, the state is automatically reorganized and redistributed to all subtask instances. For the state content itself, Flink provides several structures, such as ValueState, ListState, MapState, and AggregatingState, each of which supports a variety of data types. The built-in state of operators such as aggregations and windows is managed state; state that we define ourselves through the context in a rich function class (RichFunction) is also managed state.

In contrast, raw state must be fully customized. Flink performs no automatic operations on it and does not know its data type; it simply stores it as a raw byte array. We therefore consider raw state only for special requirements that managed state cannot meet, and it is generally not recommended.

9.1.2.2. Operator State and Keyed State

In Flink, an operator is split into multiple parallel subtasks according to its parallelism, and different subtasks occupy different task slots. Because slots are isolated from each other in terms of computing resources, the state that Flink manages cannot be shared between parallel subtasks; each state is valid only for the current subtask instance.

Managed state is divided into two categories: Operator State and Keyed State.

9.1.2.2.1. Operator State

The scope of the state is limited to the current operator task instance, that is, it is valid only for the current parallel subtask. In other words, a parallel subtask occupies one "partition", and all the data it processes accesses the same state; the state is shared within that subtask.
Operator state can be used on any operator. In use it behaves much like a local variable, because the scope of a local variable is also the current task instance. To use it, we additionally need to implement the CheckpointedFunction interface.

9.1.2.2.2. Keyed State

The state is maintained and accessed per key, as defined on the input stream, so it can only be defined on a keyed stream (KeyedStream), that is, after keyBy.
Keyed state is widely used. The aggregation operators must be used after keyBy because their aggregation results are stored as Keyed State.

In addition, Keyed State can be customized through a rich function class (Rich Function), so any operator that accepts a rich function can use Keyed State. Even for basic, stateless transformation operators such as map and filter, we can "add" Keyed State through a rich function class, or implement the CheckpointedFunction interface to define Operator State. From this perspective, every operator in Flink can be stateful.

Both Keyed State and Operator State are maintained on the local instance: each parallel subtask maintains its own state, and state is not shared between the subtasks of an operator.

9.2. Keyed State

9.2.1. Basic concepts and features

Keyed State, as the name implies, is state that a task maintains and accesses per key. Its defining characteristic is that the key is the scope that isolates one state from another.

After partitioning by key (keyBy), all data with the same key is sent to the same parallel subtask. If the task defines a state, Flink maintains a separate state instance for each key value in the current parallel subtask. The task then maintains and processes the state of the matching key for every record assigned to it.

Because a parallel subtask may process data of multiple keys, Flink organizes Keyed State in a special way. Under the hood, Keyed State resembles a distributed map: all states are stored as key-value pairs, indexed by key. When a record arrives, the task automatically restricts state access to the key of that record and reads the corresponding state value from the map. All data with the same key therefore accesses the same state, and the states of different keys are isolated from each other.
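
To make the "map scoped to the current key" idea concrete, here is a minimal plain-Java model of what one parallel subtask does. It is a sketch under simplifying assumptions, not Flink's state backend: the class KeyedValueStateModel and the method setCurrentKey are hypothetical names, and a HashMap stands in for the real storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of keyed value state inside one parallel subtask
// (hypothetical class, not Flink's state backend).
class KeyedValueStateModel<K, V> {
    private final Map<K, V> table = new HashMap<>(); // one entry per key
    private K currentKey;                            // set by the "runtime" before each record

    void setCurrentKey(K key) { currentKey = key; }

    V value() { return table.get(currentKey); }      // null if this key has no state yet

    void update(V v) { table.put(currentKey, v); }

    void clear() { table.remove(currentKey); }

    // Counts records per key, the way a keyed operator with a value state of type Long would.
    static Map<String, Long> countPerKey(String... keys) {
        KeyedValueStateModel<String, Long> state = new KeyedValueStateModel<>();
        Map<String, Long> latest = new TreeMap<>();  // sorted only to make printing stable
        for (String key : keys) {
            state.setCurrentKey(key);                // scope all state access to this key
            Long count = state.value();
            long next = (count == null) ? 1L : count + 1;
            state.update(next);
            latest.put(key, next);
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(countPerKey("alice", "bob", "alice")); // prints {alice=2, bob=1}
    }
}
```

The user code only ever calls value() and update(); which entry of the map is touched is decided by the key of the record being processed.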

Binding state to the key amounts to a one-to-one correspondence between the state and the logical partition of the stream: data of other keys never accesses the current state, and data of the current key accesses only this state and is never sent to other partitions. This guarantees that all state operations are local and that the partitioning of the data stream and of the state stay consistent.

In addition, when the parallelism of the application changes, the state must be reorganized accordingly. The Keyed State of different keys is grouped into so-called key groups, and each key group is assigned to exactly one parallel subtask. The key group is the unit in which Flink redistributes Keyed State, and the number of key groups equals the configured maximum parallelism. When the operator parallelism changes, the key groups are redistributed evenly over the new subtasks so that the load stays roughly balanced at runtime.
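
The following sketch shows how a key could be mapped to a key group and a key group to a subtask. It is a simplified model: Flink applies a murmur hash to the key's hash code before taking the modulus, whereas Math.floorMod on hashCode() is only a stand-in here, and the class and method names are hypothetical.

```java
// Simplified model of key-group assignment (hypothetical class).
// Assumption: floorMod on hashCode() replaces the murmur hash Flink uses.
class KeyGroupModel {
    // Maps a key to one of maxParallelism key groups.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Maps a key group to its owning subtask; each subtask gets a contiguous range of groups.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int group = keyGroupFor("user-42", maxParallelism);
        // The key group of a key never changes; only its owning subtask does.
        for (int parallelism : new int[] {2, 4}) {
            System.out.println("parallelism=" + parallelism
                    + " keyGroup=" + group
                    + " subtask=" + subtaskFor(group, maxParallelism, parallelism));
        }
    }
}
```

Because a key always lands in the same key group, rescaling only has to move whole key groups between subtasks; it never has to rehash individual keys.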

Note that Keyed State can only be used on a KeyedStream. For a DataStream that has not been partitioned with keyBy, a transformation operator cannot access Keyed State through the runtime context even if it implements the corresponding rich function class.

9.2.2. Supported structure types

9.2.2.1. Value state (ValueState)

As the name suggests, the state stores a single "value". ValueState itself is an interface, defined in the source code as follows:

public interface ValueState<T> extends State {
    T value() throws IOException;
    void update(T value) throws IOException;
}

Here T is a type parameter, so the state can hold any concrete data type. To save a long integer as state, for example, the type is ValueState<Long>.

We can read and write the value state in the code to access and update the state.

  • T value(): Get the value of the current state;
  • update(T value): Update the state, and the parameter value passed in is the state value to be overwritten.

In practice, so that the runtime context knows which state is meant, we also need to create a "state descriptor" (StateDescriptor) that provides basic information about the state. For example, the constructor of the ValueState descriptor is defined in the source code as follows:

public ValueStateDescriptor(String name, Class<T> typeClass) {
    super(name, typeClass, null);
}

We pass in the name and type of the state, just as we would when declaring a variable. With this descriptor, the runtime can obtain a handle to the state.

9.2.2.2. List state (ListState)

The data to be saved is organized as a list. The ListState<T> interface also has a type parameter T, which is the type of the elements in the list. ListState provides a set of methods for operating on the state, and their usage closely resembles that of an ordinary List.

  • Iterable<T> get(): Get the current list state as an Iterable<T>;
  • update(List<T> values): Pass in a list of values that overwrites the state;
  • add(T value): Add one element to the state list;
  • addAll(List<T> values): Add multiple elements to the list, passed in as a list.

Similarly, the state descriptor of ListState is called ListStateDescriptor, and it is used in the same way as ValueStateDescriptor.

9.2.2.3. Map state (MapState)

A set of key-value pairs is saved as a single state, which can be thought of as a collection of key-value mappings. The corresponding MapState<UK, UV> interface has two type parameters, UK and UV, for the types of the stored keys and values. MapState likewise provides methods for operating on the map, much like a Map.

  • UV get(UK key): Pass in a key and look up the corresponding value;
  • put(UK key, UV value): Pass in a key-value pair and update the value for that key;
  • putAll(Map<UK, UV> map): Add all key-value pairs of the given map to the map state;
  • remove(UK key): Delete the key-value pair for the specified key;
  • boolean contains(UK key): Check whether the specified key exists;
  • Iterable<Map.Entry<UK, UV>> entries(): Get all key-value pairs in the map state;
  • Iterable<UK> keys(): Get all keys in the map state as an Iterable;
  • Iterable<UV> values(): Get all values in the map state as an Iterable;
  • boolean isEmpty(): Check whether the map is empty.

9.2.2.4. Reducing state (ReducingState)

Similar to value state, except that all added data is reduced and only the reduced value is saved as the state. The ReducingState interface is called in the same way as ListState, but it stores a single aggregate value: calling .add() does not append an element to a list; it reduces the new data with the previous state and updates the state with the result.

The reduction logic is defined by passing a reduce function (ReduceFunction) to the reducing state descriptor (ReducingStateDescriptor).

public ReducingStateDescriptor(String name, ReduceFunction<T> reduceFunction, Class<T> typeClass) {...}

The descriptor takes three parameters: the second is the ReduceFunction that defines the reduction logic, and the other two are the name and type of the state.

9.2.2.5. Aggregating state (AggregatingState)

Aggregating state is also a single value that holds the aggregation result of all added data. Unlike ReducingState, its aggregation logic is defined by a more general aggregate function (AggregateFunction) passed to the descriptor. An AggregateFunction represents the state with an accumulator, so the type of the aggregated state can differ completely from the type of the added data, which makes it more flexible.

The AggregatingState interface is called in the same way as ReducingState: calling .add() aggregates the new element with the specified AggregateFunction and updates the state.
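
The difference between the two can be illustrated with a small plain-Java model. None of these classes are Flink classes; ReducingModel, AverageModel, and StateSemanticsDemo are hypothetical names used only to show that a reduce keeps the input type, while an aggregate may use a different accumulator and output type.

```java
import java.util.function.BinaryOperator;

// Reduce semantics: state type == input type (hypothetical model, not a Flink class).
class ReducingModel<T> {
    private final BinaryOperator<T> reduce;
    private T value; // null until the first add()

    ReducingModel(BinaryOperator<T> reduce) { this.reduce = reduce; }

    // Combines the new element with the stored value instead of appending it.
    void add(T in) { value = (value == null) ? in : reduce.apply(value, in); }

    T get() { return value; }
}

// Aggregate semantics: input is long, accumulator is (sum, count), output is Double.
class AverageModel {
    private long sum;
    private long count;

    void add(long in) { sum += in; count++; }

    Double get() { return count == 0 ? null : (double) sum / count; }
}

class StateSemanticsDemo {
    public static void main(String[] args) {
        ReducingModel<Long> max = new ReducingModel<>(Long::max);
        AverageModel avg = new AverageModel();
        for (long v : new long[] {3, 9, 6}) {
            max.add(v);
            avg.add(v);
        }
        System.out.println(max.get()); // prints 9
        System.out.println(avg.get()); // prints 6.0
    }
}
```

An average cannot be expressed as a reduce over the raw values alone, because the running result has to remember the count as well as the sum; that is the case the accumulator of an AggregateFunction is meant for.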

9.2.2.3. Code implementation

In Flink, state is always associated with a specific operator, and the operator must "register" a state before using it. Registration tells Flink about the state defined in the current context, so that the runtime knows which states the operator has.

State is registered mainly through a "state descriptor" (StateDescriptor). The most important information in a state descriptor is the name and the type of the state.

A user-defined function (UDF) may also need to be passed to the state descriptor to describe the processing logic, such as the ReduceFunction and AggregateFunction mentioned above. Taking ValueState as an example, we can define the descriptor as follows:

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>(
        "my state", // state name
        Types.LONG  // state type
);

Here we define the descriptor of a ValueState of type Long named "my state".
The complete sequence in code is: define the state descriptor, call .getRuntimeContext() to obtain the runtime context, then call the matching RuntimeContext method with the descriptor to obtain the state.

Because state access requires the runtime context, which is only available in a rich function class (Rich Function), custom Keyed State can only be used in rich functions. The low-level process functions (Process Function) extend the AbstractRichFunction abstract class, so they can use it as well.

In a rich function, after calling .getRuntimeContext() to obtain the runtime context, the RuntimeContext offers the following methods for obtaining state:

ValueState<T> getState(ValueStateDescriptor<T>)
MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV>)
ListState<T> getListState(ListStateDescriptor<T>)
ReducingState<T> getReducingState(ReducingStateDescriptor<T>)
AggregatingState<IN, OUT> getAggregatingState(AggregatingStateDescriptor<IN, ACC, OUT>)

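For states of different structure types, pass the corresponding descriptor to the corresponding method. Once the state objects are obtained, their methods can be called to read and write the state. In addition, every state type has a .clear() method that clears the current state.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public static class MyFlatMapFunction extends RichFlatMapFunction<Long, String> {
    // Declare the state
    private transient ValueState<Long> state;

    @Override
    public void open(Configuration config) {
        // Obtain the state in the open() lifecycle method
        ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>(
                "my state", // state name
                Types.LONG  // state type
        );
        state = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Long input, Collector<String> out) throws Exception {
        // Read the state; it is null until the first update for the current key
        Long currentState = state.value();
        if (currentState == null) {
            currentState = 0L;
        }
        currentState += 1; // increment the state value by 1
        // Write the state back
        state.update(currentState);
        if (currentState >= 100) {
            out.collect("state: " + currentState);
            state.clear(); // clear the state
        }
    }
}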

9.2.2.3.1. Value state (ValueState)

Here we partition the stream by user id and count the page views (pv) of each user separately. Since we do not want to send a result downstream every time the pv increases by one, we register a timer that emits the pv count at intervals, so that downstream operators are not put under too much pressure.

The implementation defines a value state variable that stores the timestamp of the timer. When the timer fires and the result is sent downstream, this state variable is cleared, so that the next record finds no timer and registers a new one. After the timer is registered, its timestamp is saved in the state variable again.

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Event and ClickSource are user-defined classes that are not shown in this section.

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                    .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                        @Override
                        public long extractTimestamp(Event element, long recordTimestamp) {
                            return element.timestamp;
                        }
                    })
            );
    stream.print("input");
    // Count the pv of each user and emit the result periodically (every 10 s)
    stream.keyBy(data -> data.user).process(new PeriodicPvResult())
            .print();
    env.execute();
}

// Registers a timer and emits the pv periodically
public static class PeriodicPvResult extends KeyedProcessFunction<String, Event, String> {
    // Two states: the current pv count and the timestamp of the registered timer
    ValueState<Long> countState;
    ValueState<Long> timerTsState;

    @Override
    public void open(Configuration parameters) throws Exception {
        countState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("count", Long.class));
        timerTsState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("timerTs", Long.class));
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<String> out)
            throws Exception {
        // Update the count
        Long count = countState.value();
        if (count == null) {
            countState.update(1L);
        } else {
            countState.update(count + 1);
        }
        // Register a timer only if none is pending for this key
        if (timerTsState.value() == null) {
            ctx.timerService().registerEventTimeTimer(value.timestamp + 10 * 1000L);
            timerTsState.update(value.timestamp + 10 * 1000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + " pv: " + countState.value());
        // Clear the timer state so that the next record registers a new timer
        timerTsState.clear();
    }
}

9.2.2.3.2. List state (ListState)

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    SingleOutputStreamOperator<Tuple3<String, String, Long>> stream1 = env
            .fromElements(
                    Tuple3.of("a", "stream-1", 1000L),
                    Tuple3.of("b", "stream-1", 2000L)
            )
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                        @Override
                        public long extractTimestamp(Tuple3<String, String, Long> t, long l) {
                            return t.f2;
                        }
                    })
            );
    SingleOutputStreamOperator<Tuple3<String, String, Long>> stream2 = env
            .fromElements(
                    Tuple3.of("a", "stream-2", 3000L),
                    Tuple3.of("b", "stream-2", 4000L)
            )
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forMonotonousTimestamps()
                    .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                        @Override
                        public long extractTimestamp(Tuple3<String, String, Long> t, long l) {
                            return t.f2;
                        }
                    })
            );
    // Join the two streams by key: buffer every element in list state
    // and pair it with everything already buffered from the other stream
    stream1.keyBy(r -> r.f0)
            .connect(stream2.keyBy(r -> r.f0))
            .process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, String>() {
                private ListState<Tuple3<String, String, Long>> stream1ListState;
                private ListState<Tuple3<String, String, Long>> stream2ListState;

                @Override
                public void open(Configuration parameters) throws Exception {
                    super.open(parameters);
                    // The type information must list all three tuple fields
                    stream1ListState = getRuntimeContext().getListState(
                            new ListStateDescriptor<Tuple3<String, String, Long>>("stream1-list", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG))
                    );
                    stream2ListState = getRuntimeContext().getListState(
                            new ListStateDescriptor<Tuple3<String, String, Long>>("stream2-list", Types.TUPLE(Types.STRING, Types.STRING, Types.LONG))
                    );
                }

                @Override
                public void processElement1(Tuple3<String, String, Long> left, Context context, Collector<String> collector) throws Exception {
                    stream1ListState.add(left);
                    for (Tuple3<String, String, Long> right : stream2ListState.get()) {
                        collector.collect(left + " => " + right);
                    }
                }

                @Override
                public void processElement2(Tuple3<String, String, Long> right, Context context, Collector<String> collector) throws Exception {
                    stream2ListState.add(right);
                    for (Tuple3<String, String, Long> left : stream1ListState.get()) {
                        collector.collect(left + " => " + right);
                    }
                }
            }).print();
    env.execute();
}

The output is:
(a,stream-1,1000) => (a,stream-2,3000)
(b,stream-1,2000) => (b,stream-2,4000)

9.2.2.3.3. Map state (MapState)

import java.sql.Timestamp;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Event and ClickSource are user-defined classes that are not shown in this section.

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                    .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                        @Override
                        public long extractTimestamp(Event element, long recordTimestamp) {
                            return element.timestamp;
                        }
                    })
            );

    // Count the pv of each url in every 10 s window
    stream.keyBy(data -> data.url)
            .process(new FakeWindowResult(10000L))
            .print();

    env.execute();
}

public static class FakeWindowResult extends KeyedProcessFunction<String, Event, String> {
    // Window length in milliseconds
    private Long windowSize;

    public FakeWindowResult(Long windowSize) {
        this.windowSize = windowSize;
    }

    // Map state that stores the pv per window (window start -> count)
    MapState<Long, Long> windowPvMapState;

    @Override
    public void open(Configuration parameters) throws Exception {
        windowPvMapState = getRuntimeContext().getMapState(new MapStateDescriptor<Long, Long>("window-pv", Long.class, Long.class));
    }

    @Override
    public void processElement(Event value, Context ctx, Collector<String> out) throws Exception {
        // For each record, determine its window from the timestamp
        Long windowStart = value.timestamp / windowSize * windowSize;
        Long windowEnd = windowStart + windowSize;

        // Register a timer at end - 1; when it fires, the window result is emitted
        ctx.timerService().registerEventTimeTimer(windowEnd - 1);

        // Update the pv count in the state
        if (windowPvMapState.contains(windowStart)) {
            Long pv = windowPvMapState.get(windowStart);
            windowPvMapState.put(windowStart, pv + 1);
        } else {
            windowPvMapState.put(windowStart, 1L);
        }
    }

    // When the timer fires, emit the pv count of the window
    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        Long windowEnd = timestamp + 1;
        Long windowStart = windowEnd - windowSize;
        Long pv = windowPvMapState.get(windowStart);
        out.collect("url: " + ctx.getCurrentKey()
                + " pv: " + pv
                + " window: " + new Timestamp(windowStart) + " ~ " + new Timestamp(windowEnd));

        // Simulate window cleanup by removing the window's key from the map
        windowPvMapState.remove(windowStart);
    }
}
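
The window arithmetic used in processElement and onTimer above relies on integer division to align a timestamp to the start of its window. The sketch below isolates that calculation; the class WindowAlignmentModel and its method names are hypothetical, and it assumes non-negative timestamps in milliseconds.

```java
// Isolated model of the window arithmetic (hypothetical helper, not Flink API).
// Assumes non-negative timestamps and a positive window size, both in milliseconds.
class WindowAlignmentModel {
    // Integer division drops the remainder, so the result is a multiple of windowSize.
    static long windowStart(long timestamp, long windowSize) {
        return timestamp / windowSize * windowSize;
    }

    // The timer fires at the last millisecond that still belongs to the window.
    static long timerFor(long timestamp, long windowSize) {
        return windowStart(timestamp, windowSize) + windowSize - 1;
    }

    // Recovers the window start from the timer timestamp, as onTimer does.
    static long windowStartFromTimer(long timerTimestamp, long windowSize) {
        return timerTimestamp + 1 - windowSize;
    }

    public static void main(String[] args) {
        long size = 10_000L;
        System.out.println(windowStart(12_345L, size));          // prints 10000
        System.out.println(timerFor(12_345L, size));             // prints 19999
        System.out.println(windowStartFromTimer(19_999L, size)); // prints 10000
    }
}
```

Every record of the same window registers the same timer timestamp, so the repeated registerEventTimeTimer calls for one key and window collapse into a single timer.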

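9.2.2.3.4. Aggregating state (AggregatingState)

import java.sql.Timestamp;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

// Event and ClickSource are user-defined classes that are not shown in this section.

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
            .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                    .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                        @Override
                        public long extractTimestamp(Event element, long recordTimestamp) {
                            return element.timestamp;
                        }
                    })
            );

    // Count the clicks of each user and emit a result every 5 clicks
    stream.keyBy(data -> data.user)
            .flatMap(new AvgTsResult())
            .print();

    env.execute();
}

public static class AvgTsResult extends RichFlatMapFunction<Event, String> {
    // Aggregating state used to compute the average timestamp
    AggregatingState<Event, Long> avgTsAggState;

    // Value state that stores the current user's click count
    ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) throws Exception {
        avgTsAggState = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<Event, Tuple2<Long, Long>, Long>(
                "avg-ts",
                new AggregateFunction<Event, Tuple2<Long, Long>, Long>() {
                    @Override
                    public Tuple2<Long, Long> createAccumulator() {
                        // Accumulator: (sum of timestamps, number of events)
                        return Tuple2.of(0L, 0L);
                    }

                    @Override
                    public Tuple2<Long, Long> add(Event value, Tuple2<Long, Long> accumulator) {
                        return Tuple2.of(accumulator.f0 + value.timestamp, accumulator.f1 + 1);
                    }

                    @Override
                    public Long getResult(Tuple2<Long, Long> accumulator) {
                        return accumulator.f0 / accumulator.f1;
                    }

                    @Override
                    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                        // Not needed by this example, but implemented instead of returning null
                        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
                    }
                },
                Types.TUPLE(Types.LONG, Types.LONG)
        ));

        countState = getRuntimeContext().getState(new ValueStateDescriptor<Long>("count", Long.class));
    }

    @Override
    public void flatMap(Event value, Collector<String> out) throws Exception {
        Long count = countState.value();
        if (count == null) {
            count = 1L;
        } else {
            count++;
        }

        countState.update(count);
        avgTsAggState.add(value);

        // After 5 clicks, emit the result and clear the count state
        if (count == 5) {
            out.collect(value.user + " average timestamp: " + new Timestamp(avgTsAggState.get()));
            countState.clear();
        }
    }
}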

9.2.2.4. State time to live (TTL)

When a state is created, its expiration time is set to the current time plus the TTL. If the state is later accessed or modified, the expiration time can be refreshed. When a cleanup condition is triggered (for example, when the state is accessed, or when expired state is scanned periodically), Flink checks whether the state has expired and, if so, clears it.

To configure state TTL, create a StateTtlConfig object and then call the .enableTimeToLive() method of the state descriptor to enable the TTL feature.

StateTtlConfig ttlConfig = StateTtlConfig
 .newBuilder(Time.seconds(10))
 .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
 .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
 .build();
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("my state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);

Several configuration items are used here:


  • .newBuilder()
    is the builder method of the state TTL configuration and must be called. It returns a Builder, on which you then call .build() to obtain the StateTtlConfig. The method takes a Time as its parameter, which is the time-to-live of the state.
  • .setUpdateType()
    sets the update type, which specifies when the expiration time of the state is refreshed. OnCreateAndWrite here means that the expiration time is refreshed only when the state is created or changed (a write operation). Another type, OnReadAndWrite, refreshes the expiration time on both reads and writes: any access marks the state as active and extends its lifetime. The default is OnCreateAndWrite.
  • .setStateVisibility()
    sets the visibility of the state. Because cleanup does not happen in real time, a state may still exist after it has expired, so the question is whether it can still be read if it is accessed at that point. NeverReturnExpired, set here, is the default behavior: an expired value is never returned, that is, once the state expires it is treated as cleared and the application can no longer read it. This matters when handling session or private data. The alternative is ReturnExpiredIfNotCleanedUp, which returns the value if the expired state still exists.
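The interplay of the update type and the visibility setting can be illustrated with a small plain-Java model. This is only a sketch of the rules described above, not Flink's implementation: the class TtlValueSketch, its methods, and the explicit "now" parameter are hypothetical names introduced for illustration.

```java
import java.util.Optional;

/** Plain-Java model of the TTL rules described above; not Flink code. */
final class TtlValueSketch {
    enum UpdateType { ON_CREATE_AND_WRITE, ON_READ_AND_WRITE }
    enum Visibility { NEVER_RETURN_EXPIRED, RETURN_EXPIRED_IF_NOT_CLEANED_UP }

    private final long ttlMillis;
    private final UpdateType updateType;
    private final Visibility visibility;
    private String value;    // null means no state is stored
    private long expiresAt;  // expiration time = last refresh + TTL

    TtlValueSketch(long ttlMillis, UpdateType updateType, Visibility visibility) {
        this.ttlMillis = ttlMillis;
        this.updateType = updateType;
        this.visibility = visibility;
    }

    /** Creating or writing the state always refreshes the expiration time. */
    void update(String newValue, long now) {
        value = newValue;
        expiresAt = now + ttlMillis;
    }

    /** Reading refreshes the expiration time only under ON_READ_AND_WRITE. */
    Optional<String> read(long now) {
        if (value == null) {
            return Optional.empty();
        }
        boolean expired = now >= expiresAt;
        if (expired && visibility == Visibility.NEVER_RETURN_EXPIRED) {
            return Optional.empty();
        }
        if (!expired && updateType == UpdateType.ON_READ_AND_WRITE) {
            expiresAt = now + ttlMillis;
        }
        return Optional.of(value);
    }

    /** Models a cleanup pass that physically removes expired state. */
    void cleanUp(long now) {
        if (value != null && now >= expiresAt) {
            value = null;
        }
    }

    public static void main(String[] args) {
        TtlValueSketch state = new TtlValueSketch(
                10_000, UpdateType.ON_CREATE_AND_WRITE, Visibility.NEVER_RETURN_EXPIRED);
        state.update("a", 0);
        System.out.println(state.read(5_000));   // Optional[a]
        System.out.println(state.read(10_000));  // Optional.empty
    }
}
```

With ReturnExpiredIfNotCleanedUp, the second read would still return the value until cleanUp() has run, which is the "expired but not yet cleared" window the visibility setting is about.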

9.3. Operator State

9.3.1. Basic concepts and features

Operator State is the state defined on an operator parallel instance, and its scope of action is limited to the current operator task. The operator state has nothing to do with the key of the data, so as long as data with different keys are distributed to the same parallel subtask, they will access the same Operator State.

9.3.2, State types

The operator state also supports different structure types, mainly three types: ListState, UnionListState and BroadcastState.

9.3.2.1, list state (ListState)

Like ListState in Keyed State, this state is represented as a list of items. The difference is that operator state is not partitioned by key, so each parallel subtask keeps only one list: the collection of all state items on that subtask.

The state items in the list are the finest granularity at which state can be reassigned, and they are completely independent of each other. When the parallelism of the operator is rescaled, all elements in the operator's list state are collected, which is equivalent to merging the lists of all partitions into one "big list", and are then distributed evenly to all parallel tasks. The specific method of this even distribution is "round-robin": similar to the rebalance data transmission method introduced earlier, the state items are "dealt" out one by one. This method is also called "even-split redistribution".
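The "merge, then deal out one by one" idea can be sketched in plain Java as follows. This is an illustrative model of the description above, not Flink's internal code; the class and method names are hypothetical, and the exact assignment Flink produces may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of even-split (round-robin) redistribution. */
final class EvenSplitSketch {
    /** Merges the per-subtask lists and deals the items out one by one. */
    static <T> List<List<T>> redistribute(List<List<T>> oldSubtasks, int newParallelism) {
        // Step 1: collect all state items into one "big list".
        List<T> merged = new ArrayList<>();
        for (List<T> items : oldSubtasks) {
            merged.addAll(items);
        }
        // Step 2: deal the items to the new subtasks in round-robin order.
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < merged.size(); i++) {
            result.get(i % newParallelism).add(merged.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two old subtasks, rescaled to a parallelism of 3.
        List<List<Integer>> old = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4, 5));
        System.out.println(redistribute(old, 3)); // [[1, 4], [2, 5], [3]]
    }
}
```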

9.3.2.2, Union List State (UnionListState)

Like ListState, the union list state also represents the state as a list. The difference from the regular list state lies in how the state is distributed when the operator parallelism is rescaled.

The focus of UnionListState is on "union". When the parallelism is adjusted, the regular list state deals out the state items round-robin, whereas the union list state broadcasts the complete list of state items to every subtask. Each parallel subtask after rescaling therefore receives the complete "big list" and decides for itself which state items to use and which to discard. This distribution is also called "union redistribution". If the list contains many state items, union redistribution is generally not recommended for resource and efficiency reasons.
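For contrast with even-split redistribution, here is a minimal plain-Java sketch of union redistribution. It only models the description above; the class name, the method names, and in particular the selection rule in keepOwn are hypothetical, since in practice each operator decides for itself which items to keep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of union redistribution. */
final class UnionRedistributionSketch {
    /** Every new subtask receives the complete union of all old lists. */
    static <T> List<List<T>> redistribute(List<List<T>> oldSubtasks, int newParallelism) {
        List<T> union = new ArrayList<>();
        for (List<T> items : oldSubtasks) {
            union.addAll(items);
        }
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>(union)); // one full copy per subtask
        }
        return result;
    }

    /** Hypothetical selection rule: a subtask keeps every n-th item of the full list. */
    static <T> List<T> keepOwn(List<T> fullList, int subtaskIndex, int parallelism) {
        List<T> kept = new ArrayList<>();
        for (int i = subtaskIndex; i < fullList.size(); i += parallelism) {
            kept.add(fullList.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> old = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        List<List<String>> restored = redistribute(old, 2);
        System.out.println(restored);                       // [[a, b, c], [a, b, c]]
        System.out.println(keepOwn(restored.get(1), 1, 2)); // [b]
    }
}
```

The full copy per subtask is what makes this approach expensive when the list is long.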

9.3.2.3, broadcast state (BroadcastState)

Sometimes we want the operator parallel subtasks to maintain the same "global" state for unified configuration and rule setting. At this time, all data in all partitions will access the same state, as if the state is "broadcast" to all partitions. This special operator state is called broadcast state (BroadcastState).

Because the instance of the broadcast state is the same on every parallel subtask, rescaling is relatively simple: to scale out, it is enough to copy the state to each new parallel task; to scale in, the redundant parallel subtasks can simply be removed together with their state, because the state is replicated and nothing is lost. At the bottom layer, the broadcast state is stored as key-value pairs, similar to a map structure, and it must be created from a "Broadcast Stream".

9.4, Broadcast State (Broadcast State)

There is a special type of operator state, the Broadcast State. In concept and in principle it is simple: once the state is broadcast, the state of all parallel subtasks is the same, and when the degree of parallelism is adjusted it only needs to be copied. In application, however, the broadcast state is quite different from other operator states.

9.4.1. Basic Usage

One of the most common applications is "dynamic configuration" or "dynamic rules". When we process streaming data, we sometimes base it on some configurations or rules.

Since configuration or rule data is globally available, we need to broadcast it to all parallel subtasks. The subtask needs to save it as an operator state to ensure that the processing results are consistent after the fault recovery. The state at this time is a typical broadcast state. The broadcast state is different from the list structure of other operator states. The bottom layer is described in the form of key-value pairs (key-value), so it is actually a map state (MapState).

In the code, you can directly call the .broadcast() method of DataStream and pass in a "map state descriptor" (MapStateDescriptor) that specifies the name and type of the state; this returns a "broadcast stream" (BroadcastStream). You then connect the data stream to be processed with this broadcast stream by calling .connect(), which yields a "broadcast connected stream" (BroadcastConnectedStream). Note that broadcast state can only be used in broadcast connected streams.

MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(...);
BroadcastStream<Rule> ruleBroadcastStream = ruleStream.broadcast(ruleStateDescriptor);
DataStream<String> output = stream.connect(ruleBroadcastStream).process(new BroadcastProcessFunction<>() {
    ...
});

Here we define a "rule stream" ruleStream whose data represents the rules for processing the data stream; the data type of a rule is Rule. We first define a MapStateDescriptor to describe the broadcast state, pass it to ruleStream.broadcast() to get the broadcast stream, and then connect stream with the broadcast stream. The key type in the state descriptor is String, which is the name of the key used to distinguish different state values.

9.4.2, code examples

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read the stream of user action events
        DataStreamSource<Action> actionStream = env.fromElements(
                new Action("Alice", "login"),
                new Action("Alice", "pay"),
                new Action("Bob", "login"),
                new Action("Bob", "buy")
        );

        // Define the pattern stream, which represents the patterns to detect
        DataStreamSource<Pattern> patternStream = env
                .fromElements(
                        new Pattern("login", "pay"),
                        new Pattern("login", "buy")
                );

        // Define the broadcast state descriptor and create the broadcast stream
        MapStateDescriptor<Void, Pattern> bcStateDescriptor = new MapStateDescriptor<>(
                "patterns", Types.VOID, Types.POJO(Pattern.class));
        BroadcastStream<Pattern> bcPatterns = patternStream.broadcast(bcStateDescriptor);

        // Connect the event stream with the broadcast stream and process them
        DataStream<Tuple2<String, Pattern>> matches = actionStream
                .keyBy(data -> data.userId)
                .connect(bcPatterns)
                .process(new PatternEvaluator());

        matches.print();

        env.execute();
    }

    public static class PatternEvaluator
            extends KeyedBroadcastProcessFunction<String, Action, Pattern, Tuple2<String, Pattern>> {

        // Value state holding the previous action of the current user
        ValueState<String> prevActionState;

        @Override
        public void open(Configuration conf) {
            prevActionState = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("lastAction", Types.STRING));
        }

        @Override
        public void processBroadcastElement(
                Pattern pattern,
                Context ctx,
                Collector<Tuple2<String, Pattern>> out) throws Exception {
            BroadcastState<Void, Pattern> bcState = ctx.getBroadcastState(
                    new MapStateDescriptor<>("patterns", Types.VOID, Types.POJO(Pattern.class)));

            // Update the broadcast state to the current pattern
            bcState.put(null, pattern);
        }

        @Override
        public void processElement(Action action, ReadOnlyContext ctx,
                                   Collector<Tuple2<String, Pattern>> out) throws Exception {
            Pattern pattern = ctx.getBroadcastState(
                    new MapStateDescriptor<>("patterns", Types.VOID, Types.POJO(Pattern.class))).get(null);

            String prevAction = prevActionState.value();
            if (pattern != null && prevAction != null) {
                // If the previous and current actions match the pattern, emit a match
                if (pattern.action1.equals(prevAction) && pattern.action2.equals(action.action)) {
                    out.collect(new Tuple2<>(ctx.getCurrentKey(), pattern));
                }
            }
            // Update the state
            prevActionState.update(action.action);
        }
    }

    // POJO class for a user action event
    public static class Action {
        public String userId;
        public String action;

        public Action() {
        }

        public Action(String userId, String action) {
            this.userId = userId;
            this.action = action;
        }

        @Override
        public String toString() {
            return "Action{" +
                    "userId=" + userId +
                    ", action='" + action + '\'' +
                    '}';
        }
    }

    // POJO class for a behavior pattern, made up of two consecutive actions
    public static class Pattern {
        public String action1;
        public String action2;

        public Pattern() {
        }

        public Pattern(String action1, String action2) {
            this.action1 = action1;
            this.action2 = action2;
        }

        @Override
        public String toString() {
            return "Pattern{" +
                    "action1='" + action1 + '\'' +
                    ", action2='" + action2 + '\'' +
                    '}';
        }
    }

Here we define the behavior pattern to detect as the POJO class Pattern, which contains two consecutive actions. Since only one Pattern is kept in the broadcast state and the key of the map state does not matter, we can simply declare the key type as Void and use null as the key. During processing, we save the Pattern data from the broadcast stream as the broadcast state; when an Action arrives, we read the current broadcast state to determine the pattern, and we keep the previous action in a ValueState. That value belongs to the current user, so Keyed State is used. If the previous action equals action1 of the Pattern and the current action equals action2, a pair of actions matching the pattern has been found, and the detection result is output.

9.5, state persistence and state backend

In Flink's state management mechanism, a very important function is persisting the state so that the application can be restarted and restored after a failure. Flink persists the state by saving all of the current distributed state as a "snapshot", written as a checkpoint or a savepoint to an external storage system. The storage medium is generally a distributed file system.

9.5.1, Checkpoint (Checkpoint)

A checkpoint in a stateful streaming application is actually a snapshot (a copy) of the state of all tasks at a certain point in time.

Checkpointing is disabled by default and needs to be enabled manually in code. Checkpointing can be enabled by directly calling the .enableCheckpointing() method of the execution environment.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Trigger a checkpoint every 1000 ms
env.enableCheckpointing(1000);

In addition to checkpoints, Flink also provides "savepoints". A savepoint is the same as a checkpoint in principle and in form: it is also a persisted snapshot of the state. The difference is that a savepoint is a user-defined snapshot, so Flink does not create it automatically; it has to be triggered manually by the user. This is useful for stopping and restarting applications in a planned way.

9.5.2. State Backends

Saving a checkpoint depends on coordination between the JobManager and the TaskManagers, as well as on an external storage system. When the application saves a checkpoint, the JobManager first sends all TaskManagers a command to trigger the checkpoint. Each TaskManager, on receiving it, takes a snapshot of all the state of its current tasks and persists it to the remote storage medium; once that is done, it returns a confirmation message to the JobManager. This process is distributed. When the JobManager has received confirmations from all TaskManagers, it confirms that the current checkpoint has been saved successfully, as shown in the figure below. Coordinating all of this work calls for a dedicated component.
insert image description here
In Flink, state storage, access, and maintenance are determined by a pluggable component called the state backend. The state backend is mainly responsible for two things: one is local state management, and the other is writing checkpoints to remote persistent storage.

9.5.2.1, Classification of state backends

The state backend is an "out of the box" component that can be configured independently without changing the application logic. Flink provides two different types of state backends, one is "hash table state backend" (HashMapStateBackend), and the other is "embedded RocksDB state backend" (EmbeddedRocksDBStateBackend). If not specifically configured, the system's default state backend is HashMapStateBackend.

  1. Hash table state backend (HashMapStateBackend)

This is the approach described earlier: the state is kept in memory. In terms of implementation, the hash table state backend treats the state as objects and keeps them on the JVM heap of the TaskManager. Ordinary state, as well as the data collected in windows and the triggers, is stored as key-value pairs, so the underlying structure is a hash table (HashMap), hence the name of this state backend. Checkpoints are generally stored in a persistent distributed file system, and the location can also be specified separately by configuring the "Checkpoint Storage".

HashMapStateBackend keeps all local state in memory, which gives the fastest reads and writes and the best computing performance, at the cost of memory usage. It is suited to jobs with large state, long windows, or large key-value state, and it works with all high-availability setups.

  2. Embedded RocksDB State Backend (EmbeddedRocksDBStateBackend)

RocksDB is an embedded key-value store that can persist data to the local hard disk. Once EmbeddedRocksDBStateBackend is configured, all in-flight state data is put into the RocksDB database, which by default is stored in the local data directory of the TaskManager. Unlike HashMapStateBackend, which keeps objects directly in heap memory, this backend stores the state mainly in RocksDB. Data is stored as serialized byte arrays, and every read and write requires serialization or deserialization, so state access is slower than with the heap-based backend.

In addition, because of the serialization, keys are compared byte by byte instead of by calling the .hashCode() and .equals() methods. Checkpoints are likewise written to the remote persistent file system. EmbeddedRocksDBStateBackend always takes snapshots asynchronously, so saving a checkpoint does not block data processing; it also provides incremental checkpoints, which can greatly improve saving efficiency in many cases. Because it spills state data to disk and supports incremental checkpoints, it is a good choice for applications with very large state, very long windows, or large key-value state. It also works with all high-availability setups.

9.5.2.2. How to choose the right state backend

The biggest difference between the HashMap and RocksDB state backends lies in where the local state is stored: the former uses memory, the latter RocksDB. In practice, choosing a state backend mainly means trading off processing performance against application scalability according to business requirements.

HashMapStateBackend is memory computing, and the reading and writing speed is very fast; however, the size of the state will be limited by the available memory of the cluster. If the state of the application continues to grow over time, memory resources will be exhausted.

RocksDB, in contrast, stores state on disk, so it can grow with the available disk space, and it is the only state backend that supports incremental checkpoints, which makes it well suited to very large state. However, every state read and write must be serialized or deserialized, and data may have to be read directly from disk, which lowers performance. Average read and write performance is about an order of magnitude slower than that of HashMapStateBackend.

In practice, the choice is a trade-off. The ideal would be fast processing together with the ability to handle massive state without memory limits, but that requires a very large amount of memory; otherwise we must accept either somewhat slower processing or a somewhat smaller processing scale.

9.5.2.3, configuration of state backend

When not configured, the default state backend used by the application is specified in the cluster configuration file flink-conf.yaml, and the key name of the configuration is state.backend. This default configuration is valid for all jobs running on the cluster, we can change the default state backend by changing the configuration value. In addition, we can also configure the state backend separately for the current job in the code, this configuration will override the default value of the cluster configuration file.

9.5.2.3.1. Configure the default state backend

In flink-conf.yaml, the default state backend can be configured using state.backend.
The possible value of the configuration item is hashmap, which is configured as HashMapStateBackend; it can also be rocksdb, which is configured as EmbeddedRocksDBStateBackend. Alternatively, it can be the fully qualified class name of a class that implements the state backend factory StateBackendFactory. The following is an example of configuring HashMapStateBackend:

# Default state backend
state.backend: hashmap
# File path where checkpoints are stored
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints

The state.checkpoints.dir configuration item here defines the directory where the state backend will write checkpoints and metadata.

9.5.2.3.2, Configure the state backend separately for each job (Per-job)

The state backend of an individual job can be set directly in the code through the job's execution environment. The code is as follows:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new HashMapStateBackend());

The above code sets HashMapStateBackend. If you want to set EmbeddedRocksDBStateBackend, you can use the following configuration method:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new EmbeddedRocksDBStateBackend());

It should be noted that if you want to use EmbeddedRocksDBStateBackend in the IDE, you need to add a dependency to the Flink project:

<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-statebackend-rocksdb_${scala.binary.version}</artifactId>
 <version>1.13.0</version>
</dependency>

Since RocksDB is included by default in the Flink distribution, as long as there is no RocksDB-related content in our code, there is no need to introduce this dependency. Even if we set state.backend to rocksdb in the flink-conf.yaml configuration file, it can run normally and use RocksDB as the state backend.

10. Fault tolerance mechanism

10.1, Checkpoint (Checkpoint)

Checkpoints are at the core of Flink's fault tolerance mechanism. The so-called "check" here is actually for the result of failure recovery: the result of continuing processing after failure recovery should be exactly the same as before the failure occurred, and we need to "check" the correctness of the result. Therefore, checkpoints are sometimes called "consistency checkpoints".

10.1.1. Checkpoint saving

10.1.1.1. Periodic trigger save

If a checkpoint were saved every time a piece of data is processed, then whenever a large amount of data arrived at once, frequent checkpointing would consume a lot of resources and slow down data processing. A better approach is to save an archive at regular intervals, so that normal data processing is not affected and the delay after recovery stays small; after all, failures do not happen all the time.

In Flink, the saving of checkpoints is triggered periodically, and the interval can be set. As an "archive" of the application state, a checkpoint is therefore a "snapshot" of the state of all tasks at the same point in time. Specifically, each time the checkpoint save operation is triggered, the current state of every task is copied, assembled according to a certain logical structure, and persisted, which forms a checkpoint.

10.1.1.2. Time point of saving

The state is saved at the moment when all tasks have just finished processing the same input record. First, this avoids storing any information beyond the state itself and makes checkpoint saving more efficient. Second, a record is then either fully processed by all tasks, with the resulting state saved, or not processed at all, with none of its effects saved; this is equivalent to constructing a "transaction". If a failure occurs, we restore the previously saved state, and all data that was being processed at the time of the failure must be reprocessed; so we only need the Source task to resubmit its offset to the data source and request a replay of the data.

10.1.1.3. The specific process of saving

The key to saving checkpoints is to wait until all tasks have processed the "same data". Let's use a specific example to describe the saving process in detail.
Take the first program we implemented, the word-frequency counter WordCount. For convenience, we read already separated words directly from the data source; for example, the input here is:

“hello”“world”“hello”“flink”“hello”“world”“hello”“flink”……

The corresponding code can be simplified to:

SingleOutputStreamOperator<Tuple2<String, Long>> wordCountStream = env.addSource(...)
 .map(word -> Tuple2.of(word, 1L))
 .returns(Types.TUPLE(Types.STRING, Types.LONG))
 .keyBy(t -> t.f0)
 .sum(1);

The Source task reads data from an external data source and records the current offset, which is saved as Operator State. It then sends the data to the downstream Map task, which converts each word into a (word, count) pair with an initial count of 1, for example ("hello", 1) and ("world", 1); this is a stateless operator task. The stream is then partitioned with the word as the key, and the .sum() method is called to add up the counts; the Sum operator saves the current sum as Keyed State. The final result is the frequency (word, count) of each word so far, as shown in the figure below.

insert image description here
Saving a checkpoint means saving a snapshot of the state after all tasks have processed the same piece of data. For example, in the figure above, three pieces of data have been processed: "hello", "world" and "hello", so the offset of the Source operator is 3. After the Sum operator has processed the third piece of data "hello", there are 2 "hello" and 1 "world", so the corresponding state is "hello" -> 2, "world" -> 1 (the bottom layer of Keyed State is stored as key-value pairs). At this point, all tasks have processed the first three pieces of data, so we can save the current state as a checkpoint and write it to external storage. Where it is saved is determined by the "CheckpointStorage" configuration item of the state backend. There are two choices: the job manager's heap memory (JobManagerCheckpointStorage) and the file system (FileSystemCheckpointStorage). Typically, we write checkpoints to a persistent distributed file system.
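The content of such a checkpoint can be made concrete with a small plain-Java model of the word count. This is only a sketch of the idea, not Flink code; the class WordCountSnapshotSketch and its Snapshot type are hypothetical names. The snapshot holds exactly two things: the source offset (operator state) and the per-word counts (keyed state), both taken after the same record.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of what a WordCount checkpoint contains. */
final class WordCountSnapshotSketch {
    /** Snapshot of all task state, taken after the same input record. */
    static final class Snapshot {
        final int sourceOffset;         // state of the Source task
        final Map<String, Long> counts; // keyed state of the Sum task

        Snapshot(int sourceOffset, Map<String, Long> counts) {
            this.sourceOffset = sourceOffset;
            this.counts = new TreeMap<>(counts); // copy, so later updates do not leak in
        }
    }

    /** Processes the first n records and then snapshots the state. */
    static Snapshot snapshotAfter(List<String> input, int n) {
        Map<String, Long> counts = new TreeMap<>();
        int offset = 0;
        for (String word : input.subList(0, n)) {
            counts.merge(word, 1L, Long::sum); // the Sum task updates its keyed state
            offset++;                          // the Source task advances its offset
        }
        return new Snapshot(offset, counts);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hello", "world", "hello", "flink", "hello");
        Snapshot s = snapshotAfter(input, 3);
        System.out.println(s.sourceOffset); // 3
        System.out.println(s.counts);       // {hello=2, world=1}
    }
}
```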

10.1.2. Restoring state from a checkpoint

When running a stream processing program, Flink saves checkpoints periodically. When a failure occurs, the last successfully saved checkpoint has to be found in order to restore the state. For example, in the WordCount example, a checkpoint is saved after three pieces of data have been processed. The job then continues to run and processes a fourth piece of data, "flink", normally, but a failure occurs while it is processing the fifth piece of data, "hello", as shown in the figure below.
insert image description here
Here, the Source task has been processed, so the offset is 5; the Map task has also been processed. However, the Sum task failed during processing, and the state was not saved at this time.

The next step is to restore the state from the checkpoint. The specific steps are:

  1. Restart the application

After encountering a failure, the first step is of course to restart. After we restart the application, the status of all tasks will be cleared, as shown in the figure below.
insert image description here

  2. Read the checkpoint and reset the state

Find the last saved checkpoint, read the snapshot of each operator task state from it, and fill it into the corresponding state respectively.
In this way, the state of all tasks inside Flink is restored to the moment when the checkpoint was saved, that is, when the third piece of data had just been processed, as shown in the figure below. No data with the key "flink" had arrived by then, so its value is the initial value 0.
insert image description here

  3. Replay the data

One problem remains after restoring the state from the checkpoint: if we simply continued processing, the data that arrived between the checkpoint and the failure, that is, the 4th and 5th pieces of data ("flink" and "hello"), would effectively be thrown away, and the calculation results would be wrong. To avoid losing data, we must re-read the data that came after the checkpoint, which the Source task achieves by resubmitting its offset to the external data source, as shown in the figure below.
insert image description here
In this way, the state of the entire system has been completely rolled back to the moment when the checkpoint save is completed.

  4. Continue to process the data

Next, we can process the data normally. The 4th and 5th pieces of data are replayed first, and then the following data is read, as shown in the figure below.

insert image description here
When the fifth data is processed, the system state at the time of the failure has been caught up. Processing continues as if there had been no failure; we neither lost data nor recalculated data, which ensures the correctness of the calculation results. In a distributed system, this is known as achieving an "exactly-once" state consistency guarantee.
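The four recovery steps can be traced with a small plain-Java simulation of the word count. This is a sketch under simplifying assumptions, not Flink's recovery code: everything runs in a single thread, the failure is modeled as losing the in-memory state just before a given record is counted, and the names RecoverySketch, checkpointAt and failAt are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of checkpoint, failure, restore and replay. */
final class RecoverySketch {
    /**
     * Counts words, saving a checkpoint after checkpointAt records and failing
     * once just before the record at index failAt (use -1 for "no failure").
     */
    static Map<String, Long> run(List<String> input, int checkpointAt, int failAt) {
        Map<String, Long> counts = new TreeMap<>();
        Map<String, Long> checkpointCounts = new TreeMap<>();
        int checkpointOffset = 0;
        int offset = 0;
        boolean failed = false;

        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                failed = true;
                // Steps 1 and 2: restart, then reset the state from the checkpoint.
                counts = new TreeMap<>(checkpointCounts);
                // Step 3: rewind the source offset so the lost records are replayed.
                offset = checkpointOffset;
                continue; // Step 4: continue processing from the checkpoint.
            }
            counts.merge(input.get(offset), 1L, Long::sum);
            offset++;
            if (offset == checkpointAt) {
                // Snapshot both the keyed state and the source offset.
                checkpointCounts = new TreeMap<>(counts);
                checkpointOffset = offset;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hello", "world", "hello", "flink", "hello");
        // Checkpoint after 3 records, failure while handling the 5th record.
        System.out.println(run(input, 3, 4));  // {flink=1, hello=3, world=1}
        System.out.println(run(input, 3, -1)); // {flink=1, hello=3, world=1}
    }
}
```

Both runs produce the same counts, which is the point of the example: with the state reset and the offset rewound together, each record affects the final result exactly once.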

10.1.3. Checkpoint Algorithm

In Flink, a distributed snapshot based on the Chandy-Lamport algorithm is adopted.

10.1.3.1. Checkpoint boundary (Barrier)

To let every task "recognize" the data that triggers a checkpoint without pausing stream processing, Flink borrows from the design of the watermark and inserts a special data structure into the data stream, dedicated to marking the point in time at which the checkpoint is triggered.

After receiving the instruction to save a checkpoint, the Source task inserts this structure into the current data stream; every subsequent task starts saving a persistent snapshot of its state as soon as it encounters it. Since the data stream is processed in order, encountering this mark means that all preceding data has been processed and a checkpoint can be saved; the state changes caused by the data after it are not reflected in this checkpoint, but are left for the next one.

This special record separates the data on a stream by checkpoint, so it is called the "checkpoint barrier".

Like a watermark, the checkpoint barrier is a special record that the Source operator injects into the regular data stream. Its position is fixed: it cannot overtake other data, nor can later data overtake it. Each barrier carries a checkpoint ID, the unique identifier of the checkpoint to be saved, as shown in the figure below. In this way, the barrier logically divides a stream into two parts: the state changes caused by the data arriving before the barrier are included in the checkpoint that the barrier represents, while the state changes caused by the data after the barrier are included in later checkpoints.
insert image description here
The JobManager contains a "checkpoint coordinator", which is dedicated to coordinating checkpoint-related work. The checkpoint coordinator periodically sends the TaskManagers an instruction to save a checkpoint (carrying the checkpoint ID). The TaskManager has every Source task save its own offset (operator state) and insert a barrier carrying that ID into the current data stream; the barrier is then passed downstream like normal data, and the Source task can continue to read new data. Each operator task takes a snapshot of its current state as soon as the barrier reaches it; before the barrier arrives, it keeps processing the earlier data normally, completely unaffected. For example, in the figure above, when the Source task receives the instruction to save checkpoint 1, it has read three pieces of data, so it saves the offset 3 to external storage and then injects the barrier with ID 1 into the data stream. At the same moment, the Map task has just received the last piece of data "hello", while the Sum task is still processing the second piece of data (world, 1). The downstream tasks do not save their state immediately; they take their snapshot only when the barrier arrives, at which point the first three pieces of data are guaranteed to have been processed. Likewise, a downstream task taking its snapshot does not affect processing in the upstream tasks. The tasks save their snapshots in parallel, with no pausing or waiting.

10.1.3.2, distributed snapshot algorithm

By inserting barriers into the stream, we can explicitly indicate when a checkpoint save is triggered. On a single stream, data is processed sequentially, and the order remains unchanged; however, for distributed stream processing, it is not so easy to maintain the order of data all the time.

In terms of implementation, Flink uses a variant of the Chandy-Lamport algorithm, known as "asynchronous barrier snapshotting". The algorithm rests on two principles: when an upstream task sends a barrier to multiple parallel downstream tasks, the barrier must be broadcast; and when multiple upstream tasks send barriers to the same downstream task, the downstream task must perform "barrier alignment", that is, it must wait until the barriers from all parallel input partitions have arrived before it starts saving its state.

To extend the previous word count program, consider the scenario where the parallelism of all operators is 2, as shown in the figure below.
insert image description here
We have two parallel Source tasks, each reading one data stream (or one partition of a source). The data in each stream consists of individual words: "hello", "world", "hello" and "flink" appear alternately. At this moment, the Source task of the first stream (for convenience we call it "Source 1" below, and name other tasks the same way) has read 3 records, so its offset is 3, while the Source task of the second stream (Source 2) has read only one "hello", so its offset is 1. The first record "hello" of the first stream has been completely processed, so the state of the Sum task maps the key hello to the value 1, and the result (hello, 1) has been emitted. The second record "world" has been converted by the Map task and is being processed by the Sum task; the third record "hello" is still being processed by the Map task. The first record "hello" of the second stream has also been converted by Map and is being processed by the Sum task.

Next is the algorithm for checkpoint saving. The specific process is as follows:

  1. The JobManager sends the instruction that triggers the checkpoint; the Source tasks save their state and insert the barrier. The JobManager periodically sends a message with a new checkpoint ID to each TaskManager to start a checkpoint. After receiving the instruction, the TaskManager inserts a barrier in all Source tasks and saves their offsets to remote persistent storage, as shown in the figure below.
    insert image description here
    The state saved by the parallel Source task is 3 and 1, indicating that the current checkpoint No. 1 should contain: all state changes up to the third data in the first stream and up to the first data in the second stream. It can be found that when the Source task does this, it does not affect the processing of subsequent tasks. The Sum task has already processed (world, 1) from the first stream, and the corresponding state has also changed.

  2. After the state snapshot is saved, the barrier is passed downstream
    Once the state has been stored in persistent storage, a notification is returned to the Source task; the Source task confirms the completion of its checkpoint to the JobManager and then passes the barrier to the downstream task like ordinary data, as shown in the following figure.
    insert image description here
    Since there is a one-to-one (forward) transfer relationship between Source and Map (the operator chain is not considered here), the barrier can be passed directly to the corresponding Map task. After that, the Source task can continue to read new data. At the same time, Sum 1 has processed the (hello, 1) from the second stream and updated its state.

  3. Broadcast the barrier to multiple parallel subtasks downstream, and perform barrier alignment

The Map task has no state, so it passes the barrier directly downstream. Because of the keyBy partitioning, the barrier must be broadcast to the two parallel downstream Sum tasks, as shown in the figure below. Each Sum task may in turn receive barriers from both upstream parallel Map tasks, so it needs to perform barrier alignment.
insert image description here
At this point, Sum 2 has received barriers from both upstream Map tasks, indicating that the third record of the first stream and the first record of the second stream have been processed, so it can save its state. Sum 1 has received only the barrier from Map 2, so it must wait for alignment. While it waits, if Map 1, whose barrier has not yet arrived, sends a record (hello, 1), that record belongs to the current checkpoint, so the Sum task processes it normally and the state for hello is updated to 3. If Map 2, whose barrier has already arrived, sends more data, that data belongs to the next checkpoint, so it must not be processed immediately; it is cached and processed after the state has been saved.

  4. After the barriers are aligned, save the state to persistent storage

After the boundaries of each partition are aligned, you can take a snapshot of the current state and save it to persistent storage. After the storage is completed, the barrier is also passed downstream and the JobManager is notified that the storage is complete, as shown in the figure below.
insert image description here
In this process, each task saves its own state relatively independently and does not affect each other. We can see that when Sum saves the current state, the Source 1 task has read the fifth data of the first stream.

  5. Process cached data first, then continue processing normally

After the checkpoint is saved, the task can continue processing data normally. If any data was cached while waiting for barrier alignment, it is processed first, and then newly arriving data is processed in order. When the JobManager has been informed that all tasks saved their state successfully, it confirms that the current checkpoint is complete, and the job can recover from it if a failure occurs. Because barrier alignment requires data from the partitions whose barrier arrived first to be cached, it slows processing to some extent; under backpressure, downstream tasks accumulate a large amount of buffered data and a checkpoint may take a long time to complete. To cope with this scenario, Flink 1.11 and later provide unaligned checkpoints, which also save the unprocessed buffered data (in-flight data) into the checkpoint. With this method, a task does not need to wait for alignment when the first barrier arrives on a partition; it can start saving its state right away.
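The alignment steps above can be sketched for a single Sum task with two input channels. This is a simplified plain-Java model, not Flink's implementation: the class `BarrierAlignmentSketch` and its methods `onRecord` and `onBarrier` are hypothetical names, and saving to persistent storage is reduced to copying a map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy Sum task that aligns barriers arriving on several input channels. */
class BarrierAlignmentSketch {
    final int numChannels;
    final Map<String, Integer> counts = new HashMap<>();
    final Map<Long, Map<String, Integer>> snapshots = new HashMap<>();
    final Set<Integer> channelsWithBarrier = new HashSet<>();
    final Deque<String> buffered = new ArrayDeque<>();

    BarrierAlignmentSketch(int numChannels) {
        this.numChannels = numChannels;
    }

    void onRecord(int channel, String word) {
        if (channelsWithBarrier.contains(channel)) {
            // The barrier on this channel has already arrived, so the record
            // belongs to the next checkpoint: cache it instead of processing it.
            buffered.addLast(word);
        } else {
            counts.merge(word, 1, Integer::sum);
        }
    }

    void onBarrier(int channel, long checkpointId) {
        channelsWithBarrier.add(channel);
        if (channelsWithBarrier.size() == numChannels) {
            // All barriers are aligned: snapshot the state.
            snapshots.put(checkpointId, new HashMap<>(counts));
            channelsWithBarrier.clear();
            // Process the cached records first, then resume normal processing.
            while (!buffered.isEmpty()) {
                counts.merge(buffered.pollFirst(), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        BarrierAlignmentSketch sum1 = new BarrierAlignmentSketch(2);
        sum1.onRecord(0, "hello");   // from Map 1, state: hello=1
        sum1.onRecord(1, "hello");   // from Map 2, state: hello=2
        sum1.onBarrier(1, 1L);       // barrier from Map 2 arrives first
        sum1.onRecord(0, "hello");   // Map 1 is still before its barrier: hello=3
        sum1.onRecord(1, "hello");   // Map 2 is past its barrier: cached
        sum1.onBarrier(0, 1L);       // aligned: snapshot, then drain the cache
        System.out.println("snapshot: " + sum1.snapshots.get(1L)); // prints "snapshot: {hello=3}"
        System.out.println("state:    " + sum1.counts);            // prints "state:    {hello=4}"
    }
}
```

The cached record is only reflected in the live state, not in the snapshot, which is what keeps the snapshot consistent with the offsets saved by the Source tasks.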

10.1.4, checkpoint configuration

The role of checkpoints is fault recovery. We should not let checkpointing take so much time that data processing performance drops significantly. To balance fault tolerance and processing performance, checkpoints can be configured in various ways in the code.

  1. enable checkpoint

By default, Flink programs have checkpointing disabled. If you want to enable the function of automatically saving snapshots for Flink applications, you need to explicitly call the .enableCheckpointing() method of the execution environment in the code:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Start a checkpoint every 1 second
env.enableCheckpointing(1000);

Here you need to pass in a long integer number of milliseconds, indicating the interval for periodically saving checkpoints. The checkpoint interval is a trade-off between processing performance and failure recovery speed. If we want to have less impact on performance, we can increase the interval time; and if we want to quickly catch up with real-time data processing after a fault restarts, we need to set the interval time smaller.

  2. Checkpoint storage (CheckpointStorage)
    The actual persistent storage location of checkpoints depends on the "checkpoint storage" setting. By default, checkpoints are stored in the JobManager's heap memory. For the persistent storage of large state, Flink provides an interface for saving to other locations: CheckpointStorage.
    Specifically, it is configured by calling .setCheckpointStorage() on the checkpoint configuration and passing in a CheckpointStorage implementation.

Flink mainly provides two kinds of CheckpointStorage: the heap memory of the job manager (JobManagerCheckpointStorage) and the file system (FileSystemCheckpointStorage).

// Store checkpoints in the JobManager's heap memory
env.getCheckpointConfig().setCheckpointStorage(new JobManagerCheckpointStorage());
// Store checkpoints in a file system
env.getCheckpointConfig().setCheckpointStorage(new FileSystemCheckpointStorage("hdfs://namenode:40010/flink/checkpoints"));

For actual production applications, we generally configure CheckpointStorage as a highly available distributed file system (HDFS, S3, etc.).

  3. Other advanced configuration
    Checkpoints have many other configurable options, which are set through the checkpoint configuration (CheckpointConfig).
CheckpointConfig checkpointConfig = env.getCheckpointConfig();

Here we make a simple enumeration:

  • Checkpointing mode (CheckpointingMode)
    Sets the consistency guarantee of checkpoints, with two options: "exactly-once" and "at-least-once". The default is exactly-once. For most low-latency stream processing programs at-least-once is sufficient, and processing is more efficient.
  • Timeout (checkpointTimeout)
    Specifies the timeout for saving a checkpoint; a checkpoint that has not completed within this time is discarded. The parameter is a long number of milliseconds.
  • Minimum pause (minPauseBetweenCheckpoints)
    Specifies how long the checkpoint coordinator must wait after the previous checkpoint completes before it may start the next one. Even if the time of the periodic trigger has been reached, the next checkpoint cannot start until enough time has passed since the previous checkpoint completed. This leaves enough room for normal data processing. When this parameter is specified, the value of maxConcurrentCheckpoints is forced to 1.
  • Maximum number of concurrent checkpoints (maxConcurrentCheckpoints)
    Specifies the maximum number of checkpoints that may be in progress at the same time. Because tasks progress at different speeds, a downstream task may still be saving the previous checkpoint while an upstream task has already started saving the next one. This parameter limits how many checkpoints can overlap in this way. If minPauseBetweenCheckpoints has been set, maxConcurrentCheckpoints has no effect.
  • Externalized checkpoints (enableExternalizedCheckpoints)
    Enables external persistence of checkpoints. By default, externalized checkpoints are not cleaned up automatically when the job fails; to free the space, you must delete them manually. The ExternalizedCheckpointCleanup parameter specifies what happens to the external checkpoint when the job is canceled.
    • DELETE_ON_CANCELLATION: The external checkpoint is deleted automatically when the job is canceled, but it is retained if the job exits because of a failure.
    • RETAIN_ON_CANCELLATION: The external checkpoint is also retained when the job is canceled.
  • Fail on checkpointing errors (failOnCheckpointingErrors)
    Specifies whether the task should fail and exit when an exception occurs during checkpointing. The default is true; if set to false, the task discards the checkpoint and continues running.
  • Unaligned checkpoints (enableUnalignedCheckpoints)
    Skips the barrier alignment step of checkpointing. When enabled, it can greatly reduce checkpoint duration under backpressure. This setting requires the checkpointing mode (CheckpointingMode) to be exactly-once and the maximum number of concurrent checkpoints to be 1.

The specific settings in the code are as follows:

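// Imports, assuming the package layout of Flink 1.13/1.14
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing with an interval of 1 second
env.enableCheckpointing(1000);
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
// Use exactly-once mode
checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Minimum pause between checkpoints: 500 milliseconds
checkpointConfig.setMinPauseBetweenCheckpoints(500);
// Timeout: 1 minute
checkpointConfig.setCheckpointTimeout(60000);
// Only one checkpoint may be in progress at a time
checkpointConfig.setMaxConcurrentCheckpoints(1);
// Externalize checkpoints and retain them after the job is canceled
checkpointConfig.enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Enable unaligned checkpoints
checkpointConfig.enableUnalignedCheckpoints();
// Set the checkpoint storage; a String with the file system path can be passed directly
checkpointConfig.setCheckpointStorage("hdfs://my/checkpoint/dir");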
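How the interval, the minimum pause, and the timeout interact can be made concrete with a little arithmetic. The sketch below is a simplified model of the rules as described in this section, not Flink's actual scheduler code; the class `CheckpointScheduleSketch` and its methods are hypothetical names, and the values are the ones configured above (interval 1000 ms, minimum pause 500 ms, timeout 60000 ms).

```java
/** Simplified timing model for periodic checkpoints (all times in milliseconds). */
class CheckpointScheduleSketch {

    /**
     * Earliest start time of the next checkpoint: the periodic trigger must be due,
     * and the minimum pause since the previous completion must have elapsed.
     */
    static long nextTriggerTime(long lastTriggerMs, long lastCompletionMs,
                                long intervalMs, long minPauseMs) {
        long periodicTrigger = lastTriggerMs + intervalMs;
        long pauseElapsed = lastCompletionMs + minPauseMs;
        return Math.max(periodicTrigger, pauseElapsed);
    }

    /** A checkpoint that has been running longer than the timeout is discarded. */
    static boolean isExpired(long triggerMs, long nowMs, long timeoutMs) {
        return nowMs - triggerMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Slow checkpoint: started at 0, finished at 800. The pause rule wins.
        System.out.println(nextTriggerTime(0, 800, 1000, 500)); // prints 1300
        // Fast checkpoint: started at 0, finished at 200. The periodic trigger wins.
        System.out.println(nextTriggerTime(0, 200, 1000, 500)); // prints 1000
        // A checkpoint still running 60001 ms after it started has timed out.
        System.out.println(isExpired(0, 60001, 60000));         // prints true
    }
}
```

In this model a slow checkpoint pushes the next one back, which is the "room for normal processing" that the minimum pause is meant to provide.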

10.2. State Consistency

10.2.1. Concepts and levels of consistency

The concept of consistency in Flink is used mainly when describing failure recovery, so it is close to the notion of consistency in transactions. So what exactly is consistency?

Simply put, consistency is the correctness of the results. In Flink, multiple nodes process different tasks in parallel. For the results to be correct, no data may be missed and no data may be processed twice. In stream computing, data arrives and is processed one record at a time, so the results are correct during normal processing; but when a failure occurs and the state must be rolled back and restored, additional guarantees are needed.

Generally speaking, there are three levels of state consistency:

  • At most once (AT-MOST-ONCE)
    When a task fails, the easiest way is to restart it directly and do nothing else; neither restore the lost state nor replay the lost data. Each piece of data will be processed once under normal circumstances, and will be lost when a failure occurs, so it is "processed at most once".

  • At least once (AT-LEAST-ONCE)
    In practical applications, we generally want to lose no data at all. This level of consistency is called "at-least-once": no data is lost and every record is processed, but there is no guarantee that it is processed only once, so some data may be processed repeatedly.
    In some scenarios, repeated processing does not affect the correctness of the results; such operations are "idempotent". For example, when counting the UV (unique visitors) of an e-commerce website, the access records of each user must be deduplicated anyway, so processing the same record several times does not change the final result. In this case at-least-once semantics are perfectly fine. If duplicates do affect the result, as with PV (page views) or the word count used earlier, at-least-once semantics may produce incorrect results. To guarantee at-least-once state consistency, we need to be able to replay data after a failure. The most common approach is to use a persistent event log that writes all events to persistent storage. Then only an offset needs to be recorded: when the task fails and restarts, the offset is reset and the data after the checkpoint is replayed. Kafka is a typical implementation of this architecture.

  • Exactly once (EXACTLY-ONCE)
    The strictest consistency guarantee is "exactly-once". It is also the most difficult state consistency semantics to implement. Exactly-once means that no data is lost and that each record is processed only once, never repeatedly. In other words, each record is reflected in the state and in the output results exactly one time. Exactly-once truly guarantees that the result is correct: after failure recovery, it is as if the failure had never occurred.

Obviously, to achieve exactly-once, the at-least-once requirement of no data loss must be met first, so a data replay mechanism is needed here as well. In addition, a special design is required to ensure that each record is processed only once. Flink uses a lightweight snapshot mechanism, the checkpoint, to guarantee exactly-once semantics.
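The difference between the three levels can be made concrete with a toy counter that fails partway through a stream of 10 records. This is only a sketch under stated assumptions, not Flink code: the names `ConsistencyLevelsSketch`, `Checkpoint`, and `runWithFailure` are hypothetical, and the at-least-once case assumes a snapshot whose state already reflects two records beyond the saved source offset (for example, because it was taken without barrier alignment).

```java
/** Toy counter job: how many records end up counted after one failure and restart. */
class ConsistencyLevelsSketch {

    /** What one checkpoint stores: the operator state and the source offset. */
    static final class Checkpoint {
        final int count;   // counter state saved in the snapshot
        final int offset;  // number of records the source had read

        Checkpoint(int count, int offset) {
            this.count = count;
            this.offset = offset;
        }
    }

    /**
     * Counts `total` records, failing after `failAt` of them were processed.
     * `restored` is the checkpoint used for recovery (null means no recovery);
     * `replay` says whether the source is reset to the checkpointed offset.
     */
    static int runWithFailure(int total, int failAt, Checkpoint restored, boolean replay) {
        int count = 0;
        for (int i = 0; i < failAt; i++) {
            count++; // normal processing before the failure
        }
        // Failure: the in-memory state is lost; fall back to the checkpoint, if any.
        count = (restored != null) ? restored.count : 0;
        int position = (restored != null && replay) ? restored.offset : failAt;
        for (int i = position; i < total; i++) {
            count++; // processing after the restart
        }
        return count;
    }

    public static void main(String[] args) {
        // At-most-once: no state restored, nothing replayed, so records are lost.
        System.out.println(runWithFailure(10, 7, null, false));                 // prints 3
        // At-least-once: replay from offset 5, but the state already counted 7.
        System.out.println(runWithFailure(10, 7, new Checkpoint(7, 5), true));  // prints 12
        // Exactly-once: state and offset come from the same consistent snapshot.
        System.out.println(runWithFailure(10, 7, new Checkpoint(5, 5), true));  // prints 10
    }
}
```

Only the last case reports the true total of 10: the state and the offset describe the same point in the stream, so replayed records are counted once.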

10.2.2. End-to-end state consistency

We already know that checkpoints ensure the consistency of Flink's internal state, up to exactly-once. Does that mean that as long as checkpointing is enabled, recovering from a failure will never produce incorrect results?

It is not that simple. In practical applications, what matters is that the data finally consumed is correct from the user's point of view. Users and external applications do not read data directly from Flink's internal state; the processing results usually have to be written to external storage. We therefore have to consider not only the processing and transformation of data inside Flink, but also reading from external data sources and writing to external persistence systems. The entire processing chain must be correct from beginning to end. A complete stream processing application thus consists of three parts: the data source, the stream processor, and the external storage system.

The consistency of this complete application is called "end-to-end state consistency", and it depends on the weakest link among the three components. Generally speaking, whether at-least-once can be achieved depends mainly on the data source's ability to replay data; to achieve exactly-once, the stream processor, the data source, and the external storage system must all provide corresponding guarantee mechanisms.

10.3. End-to-end exactly-once

In practical applications, the most difficult and desirable consistent semantics is undoubtedly end-to-end "exactly-once". We know that for Flink internally, the checkpoint mechanism can ensure that data will not be lost after fault recovery (on the premise that it can be replayed), and it will only be processed once, so the exactly-once consistency semantics can already be achieved.

It should be noted that when we say checkpoints ensure that data is processed only once after failure recovery, we do not mean that a record that was already processed cannot be processed again; we mean that the state changes and the output results reflect exactly one processing of that record. Since a checkpoint saves the state snapshot taken after all tasks have finished processing a given record, the state changes caused by the data that is later replayed are not included in it, so in the final result each record is counted only once. Therefore, the key to end-to-end consistency lies in the input data source and the external storage at the output.

10.3.1. Guarantee of input terminals

The input mainly refers to the external data source read by Flink. Some data sources neither buffer nor persist their data, so the data is gone once it has been consumed. A socket text stream is an example: the socket server is not responsible for storing data, so each record can be consumed only once after it is sent, a "one-shot deal". With such a data source, even if we restore the previous state from a checkpoint after a failure, the data that arrived between the checkpoint and the failure can no longer be resent, so it is lost. Only at-most-once consistency semantics can be guaranteed, which is equivalent to no guarantee.

In order not to lose data after failure recovery, the external data source must have the ability to replay data. A common practice is to persist the data and reset the read location of the data. One of the most classic applications is Kafka. In Flink's Source task, the offset of the data read is saved as a state, so that it can be read from the checkpoint during fault recovery, reset the offset of the data source, and retrieve the data.

The data source can replay data, or reset the read data offset, and Flink's Source operator saves the offset as a state into the checkpoint, which can ensure that the data will not be lost. This is the basic requirement to achieve at-least-once consistency semantics, and of course it is also the basic requirement to realize end-to-end exactly-once.

10.3.2. Output Guarantee

With Flink's checkpoint mechanism and a replayable external data source, we can already achieve at-least-once. But achieving exactly-once raises a bigger difficulty: data may be written to the external system more than once.

After a checkpoint is saved, incoming data continues to be processed record by record: the task state is updated, and the results are written to the external system by the Sink task, but these state changes have not yet been saved in the next checkpoint. If a failure occurs at this moment, the data after the checkpoint is replayed and computed a second time. For Flink's internal state the repeated computation has no effect, because the state was rolled back and each change is ultimately applied only once. The external system, however, cannot be rolled back: the results have already been written, and writing them again stores the same data twice.

So at this time, we only guarantee the end-to-end at-least-once semantics. In order to achieve end-to-end exactly-once, we also need to have additional requirements for external storage systems and Sink connectors. There are two writing methods that can guarantee exactly-once consistency:

  • Idempotent writes
  • Transactional writes

We need the external storage system to support these two writing methods, and Flink also provides some Sink connector interfaces for them.

  1. Idempotent writes
    An "idempotent" operation is one that can be repeated many times but changes the result only once; later repetitions have no further effect. In data processing, the most typical example is insertion into a HashMap: if the same key-value pair is inserted again, the repeated insertion changes nothing.

In other words, this approach does not actually prevent data from being recomputed and rewritten; it only makes the repeated write harmless, because the result does not change. Its main limitation is that the external storage system must support such idempotent writes, for example key-value storage in Redis, or update operations that match a query condition in a relational database such as MySQL.

It should be noted that idempotent writes may show transient inconsistencies while recovering from a failure. The data between the completion of the checkpoint and the failure has already been written once, and these writes cannot be undone by the rollback. An external application reading the written data may therefore see strange behavior: for a short time the result suddenly "jumps back" to an earlier value and then "replays" a stretch of earlier data. Once the replay has passed the point of failure, the final result is consistent again.
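The HashMap example can be turned into a small experiment that also reproduces the "jump back" effect. This is a sketch in plain Java with hypothetical names (`IdempotentWriteSketch`, `appendSink`, `upsertSink`); it assumes the job emits running word counts, that the last checkpoint covered the first two results, and that the failure happens after all four have been written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Compares an append-only sink with an idempotent (upsert) sink under replay. */
class IdempotentWriteSketch {
    // Running word counts produced by the job, in output order.
    static final String[] KEYS = {"hello", "world", "hello", "hello"};
    static final int[] VALUES = {1, 1, 2, 3};

    final List<String> appendSink = new ArrayList<>();        // every write adds a row
    final Map<String, Integer> upsertSink = new HashMap<>();  // put() overwrites by key
    final List<Integer> helloSeenByReader = new ArrayList<>(); // what an external reader observes

    /** Writes results [from, to) to both sinks and records the visible value of "hello". */
    void write(int from, int to) {
        for (int i = from; i < to; i++) {
            appendSink.add(KEYS[i] + "=" + VALUES[i]);
            upsertSink.put(KEYS[i], VALUES[i]);
            helloSeenByReader.add(upsertSink.get("hello"));
        }
    }

    static IdempotentWriteSketch runWithReplay() {
        IdempotentWriteSketch sinks = new IdempotentWriteSketch();
        sinks.write(0, 4); // all four results are written before the failure
        sinks.write(2, 4); // rollback to the checkpoint: results 2 and 3 are written again
        return sinks;
    }

    public static void main(String[] args) {
        IdempotentWriteSketch sinks = runWithReplay();
        System.out.println(sinks.appendSink.size());   // prints 6: two rows are duplicated
        System.out.println(sinks.upsertSink);          // final counts: hello=3, world=1
        System.out.println(sinks.helloSeenByReader);   // prints [1, 1, 2, 3, 2, 3]
    }
}
```

The upsert sink ends with the correct counts, but the reader's view of hello briefly drops from 3 back to 2 during the replay, which is the transient inconsistency described above.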

  2. Transactional writes
    If idempotent writes restrict the application scenarios too much, transactional writes are a more general way to ensure consistency. The biggest problem at the output end is that "spilled water is hard to recover": data written to the external system is difficult to withdraw. So how can data that has already been written be taken back? With transactions. A transaction is a sequence of operations in an application that must all complete successfully; otherwise all changes made by those operations are undone.

A transaction has four basic properties: atomicity (Atomicity), consistency (Consistency), isolation (Isolation) and durability (Durability), the famous ACID. When the results of Flink stream processing are written to an external system, if a transaction can be constructed so that the write is committed and rolled back together with the checkpoint, the problem of repeated writes is solved naturally. So the basic idea of transactional writing is to write data to the external system inside a transaction that is bound to a checkpoint. When the Sink task encounters a barrier, it starts a transaction as it begins to save its state, and all subsequent data is written within this transaction; when the current checkpoint has been saved, the transaction is committed and the written data becomes truly usable. If a failure occurs in between, the state rolls back to the previous checkpoint; the current transaction has not been closed normally (because the current checkpoint was not saved), so it is rolled back too, and the data written to the external system is revoked.

Specifically, there are two implementations: write-ahead log (WAL) and two-phase commit (2PC).

  • Write-ahead log (WAL)
    As we have seen, committing a transaction requires the external storage system to support transactions; otherwise a write cannot truly be rolled back. Can transactional writes be implemented for storage systems that do not support transactions?
    The write-ahead log (WAL) is a very simple way to do so. The specific steps are:
    • First save the result data as log state.
    • When a checkpoint is saved, this result data is persisted with it.
    • When notified that the checkpoint is complete, write all results to the external system at once.

This method amounts to running a batch job each time a checkpoint completes, and writing everything at once brings some performance problems. Its advantage is simplicity: because the data is cached in the state backend in advance, in theory any external storage system can be written in batches this way. In Flink, the DataStream API provides the template class GenericWriteAheadSink to implement this transactional writing method.
Note that the batch write of the write-ahead log may fail, so after performing the write the sink must wait for a confirmation that it succeeded. Only after all the data has been written successfully is the corresponding checkpoint confirmed again internally, which marks the actual completion of the checkpoint. This confirmation must itself be stored persistently: when recovering from a failure, only a checkpoint with a matching confirmation guarantees that its batch has been written, so only then can the job resume from that checkpoint's position. However, this "reconfirmation" approach has a flaw. If the checkpoint was saved and the batch was written to the external system, but a failure occurs while saving the confirmation, Flink concludes that the write did not succeed. On recovery it does not use this checkpoint but rolls back to the previous one, which causes this batch of data to be written again.
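The write-ahead-log steps, including the flaw just described, can be sketched in plain Java. This is not the GenericWriteAheadSink API: the class `WriteAheadLogSinkSketch` and its methods are hypothetical names chosen to mirror the steps in the text, the external system is a plain list, and the `crashBeforeConfirm` flag simulates a failure between the batch write and the persisted confirmation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy write-ahead-log sink for an external system without transactions. */
class WriteAheadLogSinkSketch {
    final List<String> external = new ArrayList<>();               // the external system
    final List<String> pending = new ArrayList<>();                // results logged as state
    final Map<Long, List<String>> checkpointed = new HashMap<>();  // batches persisted with checkpoints
    final Set<Long> confirmed = new HashSet<>();                   // persisted confirmations

    /** Step 1: save each result as log state instead of writing it out. */
    void invoke(String result) {
        pending.add(result);
    }

    /** Step 2: when a checkpoint is saved, the logged results are persisted with it. */
    void snapshotState(long checkpointId) {
        checkpointed.put(checkpointId, new ArrayList<>(pending));
        pending.clear();
    }

    /** Step 3: once the checkpoint is complete, write the batch, then confirm it. */
    void notifyCheckpointComplete(long checkpointId, boolean crashBeforeConfirm) {
        if (confirmed.contains(checkpointId)) {
            return; // this batch is known to be written already
        }
        external.addAll(checkpointed.get(checkpointId));
        if (crashBeforeConfirm) {
            return; // simulated failure: the confirmation is never stored
        }
        confirmed.add(checkpointId);
    }

    public static void main(String[] args) {
        WriteAheadLogSinkSketch sink = new WriteAheadLogSinkSketch();
        sink.invoke("a");
        sink.invoke("b");
        sink.snapshotState(1L);
        sink.notifyCheckpointComplete(1L, true);   // batch written, confirmation lost
        sink.notifyCheckpointComplete(1L, false);  // after recovery the batch is written again
        System.out.println(sink.external);         // prints [a, b, a, b]
    }
}
```

When the confirmation is stored successfully, a repeated notification is a no-op; the duplicate only appears in the narrow window between the write and the confirmation, which is why WAL alone falls short of a strict exactly-once guarantee.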

  • Two-phase commit (2PC)
    The ways of implementing exactly-once mentioned so far are all somewhat flawed. Is there a better way? There is: two-phase commit (2PC).
    As the name suggests, the idea is to split the commit into two phases: first do a "pre-commit", and commit formally only after the checkpoint is complete. This method is truly transaction-based and requires the external system to provide transaction support.
    The specific implementation steps are:
    • The Sink task starts a transaction when the first record arrives, or when a checkpoint barrier is received.
    • All data received afterwards is written to the external system through this transaction. Since the transaction has not been committed, the data is not yet available even though it has been written to the external system; it is in a "pre-committed" state.
    • When the Sink task receives the notification from the JobManager that the checkpoint is complete, it formally commits the transaction, and the written results become truly available.

When a failure occurs in between, the currently uncommitted transaction is rolled back, so all the data written to the external system is withdrawn. Two-phase commit makes full use of Flink's existing checkpoint mechanism: the arrival of a barrier marks the start of a new transaction, and the checkpoint-complete message from the JobManager is the instruction to commit the transaction. Each result is still written in a streaming fashion, so the batch-write performance problem of the write-ahead log disappears; the final commit requires only one additional confirmation message. The 2PC protocol therefore achieves exactly-once in the true sense, and because the transactions piggyback on Flink's checkpoint mechanism, it adds only a little overhead to the system.
Flink provides the TwoPhaseCommitSinkFunction abstract class, which makes it convenient to implement a custom two-phase-commit SinkFunction and provides a true end-to-end exactly-once guarantee. However, although two-phase commit is elegant, it places high demands on the external system.

The requirements of 2PC for external systems are listed as follows:

  • The external system must provide transaction support, or the sink task must be able to simulate transactions on the external system.
  • Between checkpoints, it must be possible to open a transaction and accept data writes.
  • A transaction must remain in a "waiting to commit" state until the notification that the checkpoint is complete arrives. During failure recovery this may take some time. If the external system closes the transaction in the meantime (for example, because it times out), the uncommitted data is lost.
  • Sink tasks must be able to recover transactions after process failures.
  • Committing a transaction must be an idempotent operation. In other words, committing the same transaction again must have no further effect.

It can be seen that 2PC will also be subject to relatively large restrictions in practical applications. The specific selection in the project should ultimately be a trade-off consideration between consistency level and processing performance.
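The two-phase commit steps and the idempotent-commit requirement described above can be sketched in plain Java. This is an illustration under stated assumptions, not the TwoPhaseCommitSinkFunction API: `TwoPhaseCommitSketch`, `FakeStore`, and `Sink` are hypothetical names, the external system is an in-memory stand-in, and failure handling is reduced to aborting every transaction that has not been committed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoPhaseCommitSketch {

    /** Stand-in for an external system that supports transactions. */
    static final class FakeStore {
        final Map<Long, List<String>> open = new HashMap<>(); // written but not yet visible
        final List<String> visible = new ArrayList<>();       // committed data
        private long nextTxnId = 0;

        long begin() {
            long id = nextTxnId++;
            open.put(id, new ArrayList<>());
            return id;
        }

        void write(long txn, String value) { open.get(txn).add(value); }

        /** Idempotent: committing an already committed transaction does nothing. */
        void commit(long txn) {
            List<String> data = open.remove(txn);
            if (data != null) {
                visible.addAll(data);
            }
        }

        void abort(long txn) { open.remove(txn); }
    }

    /** Sink whose transactions are bound to checkpoints. */
    static final class Sink {
        final FakeStore store;
        final Map<Long, Long> pendingCommits = new HashMap<>(); // checkpoint ID -> transaction
        long currentTxn;

        Sink(FakeStore store) {
            this.store = store;
            this.currentTxn = store.begin();
        }

        /** Data is written inside the open transaction ("pre-committed"). */
        void invoke(String value) { store.write(currentTxn, value); }

        /** Barrier received: close the current transaction for commit, open the next. */
        void snapshotState(long checkpointId) {
            pendingCommits.put(checkpointId, currentTxn);
            currentTxn = store.begin();
        }

        /** Checkpoint complete: formally commit the matching transaction. */
        void notifyCheckpointComplete(long checkpointId) {
            Long txn = pendingCommits.remove(checkpointId);
            if (txn != null) {
                store.commit(txn);
            }
        }

        /** Failure before the next checkpoint completes: roll back uncommitted data. */
        void failAndRestore() {
            for (long txn : new ArrayList<>(store.open.keySet())) {
                store.abort(txn);
            }
            pendingCommits.clear();
            currentTxn = store.begin();
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        Sink sink = new Sink(store);
        sink.invoke("a");
        sink.invoke("b");
        sink.snapshotState(1L);
        sink.invoke("c");                    // belongs to the next checkpoint
        sink.notifyCheckpointComplete(1L);
        System.out.println(store.visible);   // prints [a, b]
        sink.failAndRestore();               // "c" is rolled back, then replayed
        sink.invoke("c");
        sink.snapshotState(2L);
        sink.notifyCheckpointComplete(2L);
        System.out.println(store.visible);   // prints [a, b, c]
    }
}
```

Unlike the write-ahead-log approach, each record reaches the external system as it arrives, and the replayed "c" appears only once because its first write was never committed.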

10.3.3 Exactly-once guarantee when connecting Flink and Kafka

In the application of stream processing, the best data source is of course the message queue with resettable offset; it can not only provide the function of data replay, but also store and process data in the form of stream by nature. Therefore, as a representative of message queues in big data tools, Kafka can be said to be a match made in heaven with Flink. In actual projects, applications that use Kafka as a data source and external system for writing are often seen.

10.3.3.1. Overall introduction

Since it is end-to-end exactly-once, we can still analyze it from the perspective of three components:

  1. Inside Flink
    Flink can guarantee the exactly-once semantics of state and processing results through the checkpoint mechanism.
  2. Input
    Kafka at the input data source can persist the data and reset the offset (offset). So we can save the currently read offset as the operator state in the Source task (FlinkKafkaConsumer) and write it into the checkpoint; when a failure occurs, read the recovery state from the checkpoint and send it to the connector FlinkKafkaConsumer By resubmitting the offset to Kafka, you can re-consume the data and ensure the consistency of the results.
  3. Output
    On the output side, the best way to guarantee exactly-once is of course two-phase commit (2PC). As Flink's natural partner, Kafka is expected to offer the strongest consistency guarantee. The Kafka connector officially provided by Flink includes FlinkKafkaProducer for writing to Kafka, which extends the abstract class TwoPhaseCommitSinkFunction:
public class FlinkKafkaProducer<IN> extends TwoPhaseCommitSinkFunction<IN,
        FlinkKafkaProducer.KafkaTransactionState,
        FlinkKafkaProducer.KafkaTransactionContext> {
    ...
}

That is to say, writing to Kafka is actually a two-phase commit: once a result has been computed, it is written to Kafka inside a transaction as a "pre-commit"; only after the checkpoint has been saved is the transaction formally committed. If a failure occurs in between, the transaction is rolled back and the pre-committed data is discarded; after the state is restored, only the operations that had been confirmed as committed remain in effect.
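The pre-commit, commit and rollback behaviour described above can be sketched as a small plain-Java simulation. This is not the FlinkKafkaProducer implementation: the class TwoPhaseCommitSketch and all of its methods are hypothetical, and the sketch only models the simple case where the failure happens before the checkpoint is confirmed.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation; it does not use Flink or Kafka classes.
class TwoPhaseCommitSketch {
    // Data that has been formally committed (visible to downstream readers).
    private final List<String> committed = new ArrayList<>();
    // Open transaction that receives incoming records.
    private List<String> currentTxn = new ArrayList<>();
    // Pre-committed transaction waiting for the checkpoint to complete.
    private List<String> preCommitted = null;

    void write(String record) {
        currentTxn.add(record);
    }

    // Phase 1: called when the checkpoint barrier reaches the sink.
    void preCommit() {
        preCommitted = currentTxn;
        currentTxn = new ArrayList<>();
    }

    // Phase 2: called once the checkpoint is confirmed complete.
    void commit() {
        if (preCommitted != null) {
            committed.addAll(preCommitted);
            preCommitted = null;
        }
    }

    // Called on failure: everything that was not committed is discarded.
    void abort() {
        preCommitted = null;
        currentTxn = new ArrayList<>();
    }

    List<String> committed() {
        return committed;
    }

    public static void main(String[] args) {
        TwoPhaseCommitSketch sink = new TwoPhaseCommitSketch();
        sink.write("r1");
        sink.write("r2");
        sink.preCommit();
        sink.commit();      // checkpoint completed
        sink.write("r3");
        sink.preCommit();
        sink.abort();       // failure before the checkpoint was confirmed
        System.out.println(sink.committed()); // prints [r1, r2]
    }
}
```

The record r3 was pre-committed but never confirmed, so it is rolled back and only r1 and r2 remain.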

10.3.3.2. Specific steps

For the convenience of illustration, let's consider a specific stream processing system, where Flink reads data from Kafka and writes the processing results to Kafka, as shown in the figure below.
insert image description here
This is a complete data pipeline built with Flink and Kafka. The Source task reads data from Kafka, the data goes through a series of processing steps (such as window calculation), and the Sink task then writes the results to Kafka. The two-phase commit between Flink and Kafka depends on the checkpoint mechanism. The JobManager coordinates the TaskManagers to take state snapshots, and the storage location of the checkpoint is configured and managed by the State Backend. In general, checkpoints are stored on a distributed file system.
The process of achieving end-to-end exactly-once can be broken down as follows:

  1. Initiate checkpoint saving

Initiating checkpoint saving marks the start of the "pre-commit" phase of the two-phase commit protocol. At this point, of course, no data has actually been pre-committed yet.

insert image description here
As shown in the above figure, the JobManager notifies each TaskManager to start checkpoint saving, and the Source task will inject the checkpoint boundary (barrier) into the data stream. This barrier can divide the data in the data stream into the set entering the current checkpoint and the set entering the next checkpoint.
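How a barrier divides the stream can be sketched in plain Java. The marker string and the class BarrierSketch are hypothetical simplifications: a real Flink barrier is a special stream element that carries a checkpoint ID, not a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BarrierSketch {
    // Hypothetical marker standing in for a checkpoint barrier.
    static final String BARRIER = "|barrier|";

    // Groups records by the checkpoint they belong to: records seen before
    // a barrier belong to the current checkpoint, records after it to the next.
    static Map<Integer, List<String>> splitByBarrier(List<String> stream, int firstCheckpointId) {
        Map<Integer, List<String>> result = new TreeMap<>();
        int checkpointId = firstCheckpointId;
        for (String element : stream) {
            if (BARRIER.equals(element)) {
                checkpointId++; // everything after the barrier enters the next checkpoint
            } else {
                result.computeIfAbsent(checkpointId, id -> new ArrayList<>()).add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stream = Arrays.asList("a", "b", BARRIER, "c", "d");
        System.out.println(splitByBarrier(stream, 1)); // prints {1=[a, b], 2=[c, d]}
    }
}
```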

  2. Operator tasks take a state snapshot

The barrier is passed downstream from operator to operator. When an operator receives the barrier, it takes a snapshot of its current state and saves it to the state backend.

insert image description here

As shown in the figure above, after the Source task injects the barrier into the data stream, it also writes the offset of the data it has read so far into the checkpoint as state and stores it in the state backend. It then passes the barrier downstream and can continue reading data. Next, the barrier reaches the internal Window operator, which likewise takes a snapshot of its own state and writes it to remote persistent storage.
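The role of the offset as Source state can be sketched as follows. This is a plain-Java simulation under simplifying assumptions: the list stands in for a single Kafka partition, and the names OffsetSourceSketch, snapshotState and restoreState are hypothetical rather than the FlinkKafkaConsumer API.

```java
import java.util.Arrays;
import java.util.List;

class OffsetSourceSketch {
    // Stands in for one Kafka partition whose data is persisted and replayable.
    private final List<String> log;
    // Next position to read; this is the operator state of the source.
    private int offset = 0;

    OffsetSourceSketch(List<String> log) {
        this.log = log;
    }

    // Reads the next record, or returns null when the log is exhausted.
    String poll() {
        if (offset >= log.size()) {
            return null;
        }
        String record = log.get(offset);
        offset++;
        return record;
    }

    // Called when the barrier is injected: the offset goes into the checkpoint.
    int snapshotState() {
        return offset;
    }

    // Called during failure recovery: reading resumes from the saved offset.
    void restoreState(int checkpointedOffset) {
        offset = checkpointedOffset;
    }

    public static void main(String[] args) {
        OffsetSourceSketch source = new OffsetSourceSketch(Arrays.asList("m0", "m1", "m2", "m3"));
        source.poll();                             // reads m0
        source.poll();                             // reads m1
        int checkpoint = source.snapshotState();   // offset 2 is saved
        source.poll();                             // reads m2 after the checkpoint
        source.restoreState(checkpoint);           // failure: roll back to the checkpoint
        System.out.println(source.poll());         // prints m2
    }
}
```

After recovery the record m2 is read again, which is why the downstream state and the sink must also roll back to the same checkpoint for the results to stay consistent.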

  3. The Sink task starts a transaction and pre-commits

insert image description here
As shown in the figure above, the barrier is finally passed to the Sink task, which starts a new transaction at this point. All data arriving afterwards is written to Kafka by the Sink task through this transaction. The barrier is thus the dividing line between checkpoints and also between transactions. Because the previous checkpoint may not have completed yet, the previous transaction may not have been committed either; the arrival of the barrier nevertheless starts a new transaction, and the previous one, although still uncommitted, no longer receives new data.
In Kafka, the data written at this stage is marked as "uncommitted". This process is the so-called pre-commit.

  4. The checkpoint is saved and the transaction is committed

When the snapshots of all operators are complete, the checkpoint as a whole has been saved, and the JobManager sends a confirmation notification to all tasks that the current checkpoint was saved successfully, as shown in the figure below. On receiving this notification, the Sink task formally commits the pending transaction, and the pre-committed data in Kafka becomes committed.

insert image description here

10.3.3.3. Required configuration

In a specific application, some additional configuration is required to achieve true end-to-end exactly-once:

  • Checkpointing must be enabled;
  • Pass in the parameter Semantic.EXACTLY_ONCE in the constructor of FlinkKafkaProducer;
  • Configure the isolation level of the consumers that read the results from Kafka. The Kafka referred to here is the external system being written to. Data in the pre-commit phase has already been written but is marked as "uncommitted", and the default value of the consumer setting isolation.level in Kafka is read_uncommitted, which means uncommitted data can be read. External applications could then consume uncommitted data directly, and the transactional guarantee would be void. The isolation level should therefore be set to read_committed: when the consumer encounters an uncommitted message, it stops consuming from that partition and does not resume until the message is marked as committed. The cost is a noticeable delay before external applications can consume the data.
  • Transaction timeout configuration
    The transaction timeout transaction.timeout.ms configured in Flink's Kafka connector defaults to 1 hour, while the maximum transaction timeout transaction.max.timeout.ms configured in the Kafka cluster defaults to 15 minutes. So when a checkpoint takes a long time to save, Kafka may already consider the transaction timed out and discard the pre-committed data, while the Sink task believes it can keep waiting. If the checkpoint is then saved successfully and a later failure rolls the job back to that checkpoint, this part of the data is in fact lost. For these two timeouts, therefore, the former should be less than or equal to the latter.
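The last two configuration points can be sketched with plain java.util.Properties. The property keys transaction.timeout.ms and isolation.level are the ones named above; the class ExactlyOnceKafkaConfigSketch, its methods, and the assumption that the broker still uses the 15-minute default are illustrative only. Enabling checkpointing and passing Semantic.EXACTLY_ONCE to FlinkKafkaProducer require the Flink libraries and are only mentioned in a comment.

```java
import java.util.Properties;

class ExactlyOnceKafkaConfigSketch {
    // Assumption: the broker keeps the default transaction.max.timeout.ms (15 minutes).
    static final long BROKER_MAX_TXN_TIMEOUT_MS = 15 * 60 * 1000L;

    // Properties for the producer used by the Flink sink.
    static Properties producerProperties(long transactionTimeoutMs) {
        if (transactionTimeoutMs > BROKER_MAX_TXN_TIMEOUT_MS) {
            throw new IllegalArgumentException(
                    "transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms");
        }
        Properties props = new Properties();
        props.setProperty("transaction.timeout.ms", Long.toString(transactionTimeoutMs));
        return props;
    }

    // Properties for external consumers that read the results written by Flink.
    static Properties downstreamConsumerProperties() {
        Properties props = new Properties();
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        // In a real job, checkpointing must also be enabled and
        // Semantic.EXACTLY_ONCE passed to the FlinkKafkaProducer constructor.
        Properties producer = producerProperties(10 * 60 * 1000L);
        Properties consumer = downstreamConsumerProperties();
        System.out.println(producer.getProperty("transaction.timeout.ms")); // prints 600000
        System.out.println(consumer.getProperty("isolation.level"));        // prints read_committed
    }
}
```

An alternative to lowering the producer timeout is to raise transaction.max.timeout.ms on the Kafka cluster; either way the producer value must stay within the broker limit.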

Origin blog.csdn.net/prefect_start/article/details/129502353