Flink Learning 7: Flink State

1. Introduction to state

Streaming computations such as sum and max need to record historical, accumulated data and reuse it in subsequent calculations.

State is the set of variables a program uses to record that information in its processing logic.

In Flink, state does more than record values. If the program fails while running, it has to be restored, so the state must also be persisted; the restarted program can then continue from where it left off.
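The idea of "record, persist, restore" can be pictured without Flink at all. The following is a minimal plain-Java sketch, assuming a running sum as the state; the class `RunningSumSketch` and its `snapshot`/`restore` methods are hypothetical names for this illustration and are not part of the Flink API.

```java
/** Plain-Java model of a stateful sum: the running total is the "state". */
final class RunningSumSketch {
    private long sum; // the state: everything accumulated so far

    /** Processes one element and returns the updated running sum. */
    long process(long value) {
        sum += value;
        return sum;
    }

    /** Copies the state out so it could be persisted somewhere durable. */
    long snapshot() {
        return sum;
    }

    /** Rebuilds an operator from a previously persisted snapshot. */
    static RunningSumSketch restore(long snapshot) {
        RunningSumSketch restored = new RunningSumSketch();
        restored.sum = snapshot;
        return restored;
    }

    public static void main(String[] args) {
        RunningSumSketch op = new RunningSumSketch();
        op.process(1);
        op.process(2);
        long saved = op.snapshot();      // persisted after two elements
        op.process(100);                 // work done after the snapshot is lost on failure
        RunningSumSketch recovered = restore(saved);
        System.out.println(recovered.process(3)); // continues from 3, prints 6
    }
}
```

Without the snapshot, a restarted program would start again from zero; with it, only the work done after the last snapshot has to be repeated.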

1.1 Raw state

With raw state, we define our own variables to hold the data

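import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class _01_status_row {

    public static void main(String[] args) throws Exception {
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dataStreamSource = env.socketTextStream("192.168.141.141", 9000);
        DataStream<String> dataStream = dataStreamSource.map(new MapFunction<String, String>() {
            // A user-defined variable that holds the intermediate value (raw state).
            // Flink does not know about it, so it cannot be persisted or restored reliably.
            String oldString = "";

            // How can we let Flink manage this state variable, including persistence and recovery?
            @Override
            public String map(String value) throws Exception {
                oldString = oldString + value;
                return oldString;
            }
        });
        dataStream.print();
        env.execute();
    }
}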

1.2 Managed state (Flink state)

Flink provides a built-in mechanism for managing state data, called managed state. It takes care of state consistency and of accessing and storing state data.

1.3 Recovery

A Flink application runs as a job. A job consists of many tasks, and each task runs as one or more parallel instances called subtasks.

When a subtask fails, Flink automatically restores and restarts it for us.

If the whole job fails and has to be restored from previously saved Flink state, some parameters need to be specified explicitly.

2. State classification

Operator state:

  • Each subtask holds its own independent state data
  • The operator function can use operator state once it implements CheckpointedFunction
  • Operator state is generally used in source operators; keyed state is recommended in other scenarios

Keyed state:

  • Keyed state can only be used in operators applied to a KeyedStream
  • The operator binds independent state data to each key

Keyed state covers most usage scenarios.

3. Operator state

Each subtask holds its own independent state data; logically, however, operator state belongs to the operator as a whole and is shared by all of its subtasks.

How to understand this: during normal operation, each subtask reads and writes only its own state data. Once the job restarts and the parallelism of a stateful operator changes, the previous state data is distributed evenly among the new set of subtasks.
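The even redistribution can be sketched in plain Java. This is only an illustration, assuming a simple round-robin deal of all list elements to the new subtasks; the class and method names are hypothetical, and the exact assignment Flink produces internally may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of redistributing list-style operator state after a parallelism change. */
final class ListStateRedistributionSketch {

    /** Collects the list state of all old subtasks and deals it out round-robin to the new subtasks. */
    static List<List<String>> redistribute(List<List<String>> oldSubtaskStates, int newParallelism) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        int next = 0;
        for (List<String> subtaskState : oldSubtaskStates) {
            for (String element : subtaskState) {
                result.get(next % newParallelism).add(element);
                next++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two old subtasks holding 3 and 2 elements, rescaled to 3 subtasks
        List<List<String>> old = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        System.out.println(redistribute(old, 3)); // [[a, d], [b, e], [c]]
    }
}
```

The point of the sketch is that no element is tied to the subtask that wrote it: after rescaling, any new subtask may receive any part of the old state.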

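import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.OperatorStateStore;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class _02_operator_flink_status {

    public static void main(String[] args) throws Exception {
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

        //============= Configuration ===============
        // Checkpointing must be enabled
        env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
        // Path where checkpoints are persisted; HDFS or the local file system
        env.getCheckpointConfig().setCheckpointStorage("file:///D:/Resource/FrameMiddleware/FlinkNew/sinkout2/");
        // Task-level failover: several restart strategies are available
        // With noRestart, one failed task fails the whole job
        //env.setRestartStrategy(RestartStrategies.noRestart());
        // On task failure, restart at most 3 times, 1 second after each failure
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 1000));
        //============= Configuration ===============

        DataStream<String> dataStreamSource = env.socketTextStream("192.168.141.141", 9000);
        DataStream<String> dataStream = dataStreamSource.map(new StateMapFunction());
        dataStream.print();
        env.execute();
    }
}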


class StateMapFunction implements MapFunction<String, String>, CheckpointedFunction {

    ListState<String> listState;

    // Normal processing logic
    @Override
    public String map(String value) throws Exception {
        listState.add(value);
        Iterable<String> strings = listState.get();
        StringBuilder sb = new StringBuilder();
        for (String string : strings) {
            sb.append(string);
        }
        // Deliberately throw an exception (for a 5-character input) to trigger failover
        if (value.length() == 5) {
            int a = 1 / 0;
        }
        return sb.toString();
    }

    // Called before the state is persisted, each time a checkpoint is taken
    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        long checkpointId = context.getCheckpointId();
        System.out.println("Taking snapshot, checkpoint id: " + checkpointId);
    }

    // Called before the operator task starts, to initialize the user's state
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Get the operator state store from the context
        OperatorStateStore operatorStateStore = context.getOperatorStateStore();
        // Define a descriptor for the state storage structure (the state name means "saved strings")
        ListStateDescriptor<String> listStateDescriptor = new ListStateDescriptor<>("保存字符串", String.class);
        // Get the list container from the state store;
        // getListState also loads any previously stored state data
        listState = operatorStateStore.getListState(listStateDescriptor);
    }
}
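The fixed-delay restart strategy used in this example (at most 3 restarts, 1 second apart) behaves roughly like a bounded retry loop. The following plain-Java sketch only models the counting; `FixedDelayRestartSketch` is a hypothetical helper, not a Flink class, and unlike Flink it does not restore state from a checkpoint between attempts.

```java
import java.util.function.Supplier;

/** Plain-Java model of a fixed-delay restart strategy (hypothetical helper, not a Flink class). */
final class FixedDelayRestartSketch {
    int executions = 0; // how many times the job body has been started

    /** Runs the job; on failure restarts it up to maxRestarts times, waiting delayMs before each restart. */
    <T> T run(Supplier<T> job, int maxRestarts, long delayMs) {
        int restarts = 0;
        while (true) {
            executions++;
            try {
                return job.get();
            } catch (RuntimeException failure) {
                if (restarts >= maxRestarts) {
                    throw failure; // restart budget used up: the job fails for good
                }
                restarts++;
                pause(delayMs);
            }
        }
    }

    private static void pause(long delayMs) {
        try {
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to restart", e);
        }
    }

    public static void main(String[] args) {
        FixedDelayRestartSketch strategy = new FixedDelayRestartSketch();
        int[] failuresLeft = {2}; // the job body fails twice, then succeeds
        String result = strategy.run(() -> {
            if (failuresLeft[0]-- > 0) {
                throw new ArithmeticException("/ by zero");
            }
            return "finished";
        }, 3, 10);
        System.out.println(result + " after " + strategy.executions + " executions");
        // prints: finished after 3 executions
    }
}
```

With a budget of 3 restarts, a job body that always fails is started 4 times in total (the first run plus 3 restarts) before the failure is finally reported.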

3. Keyed state

3.1 Basic concepts

[Figure: state partitioning (state_partitioning.svg)]

Difference:

Operator state: each operator subtask has a single state storage space

Keyed state: each key has its own state storage space

[Figure: keyed state (state_keyed.png)]

3.2 Examples

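import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class _03_keyed_flink_status {

    public static void main(String[] args) throws Exception {
        // Get the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
        // Checkpointing must be enabled
        env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
        // Path where checkpoints are persisted; HDFS or the local file system
        env.getCheckpointConfig().setCheckpointStorage("file:///D:/Resource/FrameMiddleware/FlinkNew/sinkout4/");
        // Task-level failover
        // Alternative: with noRestart, one failed task fails the whole job
        //env.setRestartStrategy(RestartStrategies.noRestart());
        // On task failure, restart at most 3 times, 1 second after each failure
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 1000));

        DataStream<String> dataStreamSource = env.socketTextStream("192.168.141.141", 9000);
        DataStream<String> dataStream = dataStreamSource.keyBy(x -> x)
                .map(new KeyedStateMapFunction()).setParallelism(2);
        dataStream.print("===>").setParallelism(3);
        env.execute();
    }
}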

// Keyed state is obtained from the RuntimeContext of a rich function;
// CheckpointedFunction does not need to be implemented here
class KeyedStateMapFunction extends RichMapFunction<String, String> {

    ListState<String> listState;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        RuntimeContext runtimeContext = getRuntimeContext();
        // The state name means "saved strings"
        ListStateDescriptor<String> listStateDescriptor = new ListStateDescriptor<>("保存字符串", String.class);
        listState = runtimeContext.getListState(listStateDescriptor);
    }

    // Normal processing logic
    @Override
    public String map(String value) throws Exception {
        listState.add(value);
        Iterable<String> strings = listState.get();
        StringBuilder sb = new StringBuilder();
        for (String string : strings) {
            sb.append(string);
        }
        // Deliberately throw an exception (for a 5-character input) to trigger failover
        if (value.length() == 5) {
            int a = 1 / 0;
        }
        return sb.toString();
    }
}

//======
[root@localhost ~]# nc -lk 9000
a
a
a
b
b
b
c
c
c
c
d
d
d
Console output:
===>:2> a
===>:3> aa
===>:1> aaa
===>:1> b
===>:2> bb
===>:3> bbb
===>:1> c
===>:2> cc
===>:3> ccc
===>:1> cccc    ========> each key has its own ListState<String> listState
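The per-key behavior visible in this output can be reproduced without a cluster by modeling keyed state as a map from key to list. This is a minimal plain-Java sketch, not Flink code; the class `KeyedListStateSketch` is a hypothetical stand-in for the keyed ListState, and the key is passed explicitly because there is no keyBy here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model of keyed list state: one independent list per key. */
final class KeyedListStateSketch {
    private final Map<String, List<String>> statePerKey = new HashMap<>();

    /** Appends the value to the list of its key and returns the concatenation of that list. */
    String map(String key, String value) {
        List<String> listState = statePerKey.computeIfAbsent(key, k -> new ArrayList<>());
        listState.add(value);
        return String.join("", listState);
    }

    public static void main(String[] args) {
        KeyedListStateSketch operator = new KeyedListStateSketch();
        for (String input : new String[] {"a", "a", "a", "b", "b", "c"}) {
            // Like keyBy(x -> x): the element itself is the key
            System.out.println(operator.map(input, input));
        }
        // prints a, aa, aaa, b, bb, c (one per line)
    }
}
```

Values for key b never see what was stored for key a, which is why the output for each letter starts over from a single character.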

3.3 State API usage

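import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

class KeyedStateMapFunction_2 extends RichMapFunction<String, String> {

    ValueState<String> valueState;
    ListState<String> listState;
    MapState<String, String> mapState;
    ReducingState<Integer> reducingState;
    AggregatingState<Integer, Double> aggState;

    @Override
    public void open(Configuration parameters) throws Exception {
        RuntimeContext runtimeContext = getRuntimeContext();

        // Single-value state
        valueState = runtimeContext.getState(new ValueStateDescriptor<String>("string", String.class));
        // List state
        listState = runtimeContext.getListState(new ListStateDescriptor<>("list", String.class));
        // Map state
        mapState = runtimeContext.getMapState(new MapStateDescriptor<String, String>("map", String.class, String.class));
        // Reducing state: keeps a running sum
        reducingState = runtimeContext.getReducingState(new ReducingStateDescriptor<Integer>("reduce", new ReduceFunction<Integer>() {
            @Override
            public Integer reduce(Integer value1, Integer value2) throws Exception {
                return value1 + value2;
            }
        }, Integer.class));
        // Aggregating state: keeps (sum, count) and returns the average.
        // Assign to the field; declaring a local variable here would leave the field null.
        aggState = runtimeContext.getAggregatingState(new AggregatingStateDescriptor<>("aggState", new AggregateFunction<Integer, Tuple2<Integer, Integer>, Double>() {
            @Override
            public Tuple2<Integer, Integer> createAccumulator() {
                return Tuple2.of(0, 0);
            }

            // f0 is the running sum, f1 is the element count
            @Override
            public Tuple2<Integer, Integer> add(Integer value, Tuple2<Integer, Integer> accumulator) {
                return Tuple2.of(accumulator.f0 + value, accumulator.f1 + 1);
            }

            // Average = sum / count, using floating-point division
            @Override
            public Double getResult(Tuple2<Integer, Integer> accumulator) {
                return accumulator.f1 == 0 ? 0.0 : (double) accumulator.f0 / accumulator.f1;
            }

            // Combines two partial accumulators: sums and counts are added separately
            @Override
            public Tuple2<Integer, Integer> merge(Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {
                return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
            }
        }, TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {
        })));
    }

    // Normal processing logic (the input is assumed to be numeric text)
    @Override
    public String map(String value) throws Exception {
        // valueState
        valueState.update("new value"); // update the value
        String value1 = valueState.value(); // read the value

        // listState
        listState.add(value); // add one element
        listState.addAll(Arrays.asList("1", "2")); // add several elements
        listState.update(Arrays.asList("1", "2")); // replace the existing elements

        // mapState
        Iterable<String> keys = mapState.keys();
        boolean contains = mapState.contains("1");
        mapState.put("1", "2"); // add one entry
        Map<String, String> map = new HashMap<>();
        map.put("1", "2");
        mapState.putAll(map); // add entries in bulk

        // reducingState: running sum
        reducingState.add(Integer.valueOf(value));
        Integer integer = reducingState.get(); // read the sum
        // aggState: average
        aggState.add(Integer.valueOf(value));
        Double aDouble = aggState.get(); // read the average
        return value1;
    }
}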
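The averaging logic of the AggregateFunction can be checked in isolation. The sketch below is plain Java and assumes the same accumulator layout as the example (sum first, count second); it uses an `int[]` instead of Flink's Tuple2, and `AverageAccumulatorSketch` is a hypothetical name for this illustration.

```java
/** Plain-Java model of an averaging accumulator: index 0 is the sum, index 1 is the count. */
final class AverageAccumulatorSketch {

    static int[] createAccumulator() {
        return new int[] {0, 0};
    }

    /** Adds one value: the sum grows by the value, the count by one. */
    static int[] add(int value, int[] acc) {
        return new int[] {acc[0] + value, acc[1] + 1};
    }

    /** Combines two partial accumulators by adding sums and counts separately. */
    static int[] merge(int[] a, int[] b) {
        return new int[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Average = sum / count, computed in floating point; 0.0 for an empty accumulator. */
    static double getResult(int[] acc) {
        return acc[1] == 0 ? 0.0 : (double) acc[0] / acc[1];
    }

    public static void main(String[] args) {
        int[] acc = createAccumulator();
        for (int v : new int[] {1, 2, 3, 4}) {
            acc = add(v, acc);
        }
        System.out.println(getResult(acc)); // 2.5

        int[] other = add(10, createAccumulator());
        System.out.println(getResult(merge(acc, other))); // 4.0
    }
}
```

Two details matter here: the division must be sum over count (not the reverse), and it must be done in floating point, since integer division would turn 10 / 4 into 2.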

3.4 State TTL management

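        RuntimeContext runtimeContext = getRuntimeContext();
        // Single-value state
        ValueStateDescriptor<String> valueStateDescriptor = new ValueStateDescriptor<>("string", String.class);
        // TTL settings, listed for reference. Several calls below are alternatives for the same option;
        // when both are made, the later call overrides the earlier one.
        StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.milliseconds(5000)) // time to live
                .setTtl(Time.milliseconds(5000)) // time to live; same effect as the argument above
                .updateTtlOnCreateAndWrite() // restart the TTL when an entry is created or written
                .updateTtlOnReadAndWrite() // restart the TTL on reads as well as writes
                // The TTL is tracked per entry: each element of a ListState and each key-value pair of a MapState has its own TTL
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired) // never return expired data
                .setStateVisibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp) // expired data may still be returned if it has not been cleaned up
                .setTtlTimeCharacteristic(StateTtlConfig.TtlTimeCharacteristic.ProcessingTime) // the TTL uses processing-time semantics
                .useProcessingTime() // same effect as the line above
                .cleanupFullSnapshot() // drop expired state when a full snapshot (checkpoint) is taken
                .cleanupInRocksdbCompactFilter(1000) // RocksDB only: expired entries are removed during compaction
                .build();
        valueStateDescriptor.enableTimeToLive(ttlConfig);
        valueState = runtimeContext.getState(valueStateDescriptor);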
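How the update type and the visibility setting interact is easier to see in a small simulation. The following is a simplified plain-Java model, not Flink's implementation: `TtlValueSketch`, its flags, and the injected clock are all assumptions made for this illustration, and the choice to drop an expired value on its first read is part of that simplification.

```java
import java.util.function.LongSupplier;

/** Simplified plain-Java model of a single value with a time to live. */
final class TtlValueSketch {
    private final long ttlMillis;
    private final boolean refreshOnRead;  // true models OnReadAndWrite, false models OnCreateAndWrite
    private final boolean returnExpired;  // true models ReturnExpiredIfNotCleanedUp, false models NeverReturnExpired
    private final LongSupplier clock;     // injected clock so the behavior can be tested without sleeping
    private String value;
    private long lastAccess;

    TtlValueSketch(long ttlMillis, boolean refreshOnRead, boolean returnExpired, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.refreshOnRead = refreshOnRead;
        this.returnExpired = returnExpired;
        this.clock = clock;
    }

    /** Writing always restarts the TTL. */
    void update(String newValue) {
        value = newValue;
        lastAccess = clock.getAsLong();
    }

    /** Reads the value, applying the expiry and visibility rules of this sketch. */
    String value() {
        if (value == null) {
            return null;
        }
        boolean expired = clock.getAsLong() - lastAccess >= ttlMillis;
        if (expired) {
            String stale = value;
            value = null; // this sketch cleans up the expired value on first read
            return returnExpired ? stale : null;
        }
        if (refreshOnRead) {
            lastAccess = clock.getAsLong();
        }
        return value;
    }

    public static void main(String[] args) {
        long[] now = {0};
        TtlValueSketch state = new TtlValueSketch(5000, false, false, () -> now[0]);
        state.update("v");
        now[0] = 4000;
        System.out.println(state.value()); // v (not expired; the read does not refresh the TTL)
        now[0] = 6000;
        System.out.println(state.value()); // null (expired 5000 ms after the write)
    }
}
```

With `refreshOnRead` set to true, the read at 4000 ms would restart the TTL, so the read at 6000 ms would still return the value.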

4. State backend

4.1 Basic concepts

The state backend implements the storage management of state data: local reads and writes of state data, and the storage of remote snapshot data.

The state backend is pluggable and replaceable. It shields the upper layers from the underlying differences, so user code needs no changes when the state backend is switched.

4.2 Available state backends

  • HashMapStateBackend

    • State is held as Java objects on the JVM heap
    • State is not spilled to local disk, so its size is limited by the memory available to the TaskManager
    • Reads and writes are fast because no serialization is involved, but very large state does not fit
  • EmbeddedRocksDBStateBackend

    • State data is handed over to RocksDB for management and storage
    • Data is stored as serialized key-value bytes
    • Data in RocksDB lives in an in-memory cache and on local disk
    • RocksDB reads disk data quickly, so performance is not greatly affected when state grows beyond memory

    The snapshot (checkpoint) files generated by the two state backends have the same format, so the state backend can be changed when the job is restarted and the saved state stays compatible; switching the state backend on restart does not affect how the program runs.
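Why user code is unaffected by the choice of backend can be illustrated with a small interface. This plain-Java sketch assumes a hypothetical `KeyValueStore` interface with one implementation that keeps live objects (heap style) and one that keeps serialized bytes (RocksDB style); none of these names are Flink classes, and a real backend does far more.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of a pluggable state backend behind a common interface. */
final class PluggableBackendSketch {

    /** What user code sees; hypothetical interface for this sketch only. */
    interface KeyValueStore {
        void put(String key, String value);
        String get(String key);
    }

    /** Keeps values as live Java objects, in the spirit of a heap-based backend. */
    static final class HeapStore implements KeyValueStore {
        private final Map<String, String> objects = new HashMap<>();

        public void put(String key, String value) {
            objects.put(key, value);
        }

        public String get(String key) {
            return objects.get(key);
        }
    }

    /** Keeps values as serialized bytes, in the spirit of a RocksDB-style backend. */
    static final class SerializedStore implements KeyValueStore {
        private final Map<String, byte[]> bytes = new HashMap<>();

        public void put(String key, String value) {
            bytes.put(key, value.getBytes(StandardCharsets.UTF_8)); // serialize on every write
        }

        public String get(String key) {
            byte[] raw = bytes.get(key);
            return raw == null ? null : new String(raw, StandardCharsets.UTF_8); // deserialize on every read
        }
    }

    /** "User code": appends to the stored value; identical for both stores. */
    static String append(KeyValueStore store, String key, String value) {
        String old = store.get(key);
        String updated = (old == null ? "" : old) + value;
        store.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        for (KeyValueStore store : new KeyValueStore[] {new HeapStore(), new SerializedStore()}) {
            append(store, "k", "a");
            System.out.println(append(store, "k", "b")); // ab for both stores
        }
    }
}
```

The trade-off shown is the one described in the list: the heap variant avoids serialization on each access, while the byte-based variant pays that cost but does not need to keep live objects around.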

4.3 Setting the state backend

// HashMapStateBackend
env.setStateBackend(new HashMapStateBackend());

// EmbeddedRocksDBStateBackend
env.setStateBackend(new EmbeddedRocksDBStateBackend());

5. Broadcast state

Broadcast state is used when joining a stream with a broadcast stream, as covered in an earlier chapter:

Flink Learning 3: Flink streams & process function API ==> 1.7 broadcast

new BroadcastProcessFunction();  
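The idea behind broadcast state can be sketched in plain Java: every parallel instance receives each broadcast element and keeps its own copy, while main-stream elements only read that copy. The method names below mirror `processBroadcastElement` and `processElement` of BroadcastProcessFunction, but the class `BroadcastStateSketch`, the `Subtask` type, and the sample data are assumptions made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of broadcast state: each subtask keeps a full copy of the broadcast data. */
final class BroadcastStateSketch {

    /** One parallel instance with its own copy of the broadcast state. */
    static final class Subtask {
        private final Map<String, String> broadcastState = new HashMap<>();

        /** Called for every broadcast element; each subtask receives it and updates its copy. */
        void processBroadcastElement(String key, String info) {
            broadcastState.put(key, info);
        }

        /** Called for main-stream elements; the broadcast state is only read here. */
        String processElement(String key) {
            String info = broadcastState.get(key);
            return key + " -> " + (info == null ? "unknown" : info);
        }
    }

    public static void main(String[] args) {
        Subtask[] subtasks = {new Subtask(), new Subtask()};
        // A broadcast element goes to every subtask
        for (Subtask subtask : subtasks) {
            subtask.processBroadcastElement("u1", "Alice");
        }
        // Main-stream elements are partitioned, but any subtask can resolve the lookup
        System.out.println(subtasks[0].processElement("u1")); // u1 -> Alice
        System.out.println(subtasks[1].processElement("u2")); // u2 -> unknown
    }
}
```

Because every subtask sees the same broadcast elements, the lookup gives the same answer no matter which subtask a main-stream element is routed to.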

Origin blog.csdn.net/weixin_44244088/article/details/131317135