Flink / Scala - Detailed Explanation of DataStream Broadcast State Pattern Example

1. Introduction

The previous article, Flink / Scala - DataSet Application Broadcast Variables, covered the use of broadcast in the DataSet scenario. This article covers broadcast in the DataStream scenario. As with DataSet, a broadcast value is shared by all tasks. What Broadcast State adds for DataStream is that this shared value can be modified at runtime, so tasks can adapt their behavior as it changes.

2. Introduction to the Code Pattern

DataStream<Out> output = dataStream
                 .connect(BroadcastStream)
                 .process(

                     // Type parameters of KeyedBroadcastProcessFunction:
                     //   1. the key type of the keyed stream
                     //   2. the element type of the non-broadcast stream
                     //   3. the element type of the broadcast stream
                     //   4. the result type (Out)

                     new KeyedBroadcastProcessFunction<Ks, In1, In2, Out>() {
                         // pattern matching logic
                     }
                 );

A typical job has one DataStream that carries the data to be processed. If the processing logic has to change whenever some state value changes, a second stream can be introduced as a broadcast stream (BroadcastStream). Calling connect on the DataStream and passing in the BroadcastStream returns a BroadcastConnectedStream, which carries both the data stream and the state stream. A process function must then be implemented to handle the elements of both streams. Which process function to use depends on whether the DataStream is a keyed stream:

· keyed stream: KeyedBroadcastProcessFunction
· non-keyed stream: BroadcastProcessFunction

In the incoming BroadcastProcessFunction or KeyedBroadcastProcessFunction, we need to implement two methods. The processBroadcastElement() method is responsible for processing elements in broadcast streams, and processElement() is responsible for processing elements in non-broadcast streams. The two subtypes are defined as follows:

public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends BaseBroadcastProcessFunction {

    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;

    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
}

public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> extends BaseBroadcastProcessFunction {

    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;

    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;

    // Optional timer callback; the default implementation does nothing.
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {}
}

Note that processBroadcastElement() handles elements of the broadcast stream, while processElement() handles elements of the other stream. The two methods receive different context types as their second parameter: processElement() gets a ReadOnlyContext and processBroadcastElement() gets a Context. Both contexts provide the following:

· Access the broadcast state: ctx.getBroadcastState(MapStateDescriptor<K, V> stateDescriptor)
· Query the timestamp of the element: ctx.timestamp()
· Query the current watermark: ctx.currentWatermark()
· Query the current processing time: ctx.currentProcessingTime()
· Emit an element to a side output: ctx.output(OutputTag<X> outputTag, X value)

3. Application Examples

The description above is fairly formal. A simple example shows what a broadcast value and a BroadcastStream are good for. As mentioned above, the BroadcastStream acts as a state stream that controls the output of the DataStream. The example implements the following:

DataStream: every cycle emits a batch of 100 consecutive numbers starting at num, then advances num by 100 for the next cycle

BroadcastStream: a control state that arrives at irregular intervals and selects the output mode, either odd or even

Sink: depending on the current state, prints the odd or the even numbers of each batch

1. DataStream

A batch of 100 consecutive numbers starting at num is generated every 5 s, and each batch starts 100 higher than the previous one. The custom source extends RichSourceFunction and is attached to the job with addSource. For the full range of DataStream source options, see: Flink / Scala - One of DataSource DataStream to get the data summary.

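    // Data stream: emits one batch of numbers every 5 s
    case class InputData(num: Int)

    class SourceFromCollection extends RichSourceFunction[InputData] {
      // @volatile because cancel() is called from a different thread than run()
      @volatile private var isRunning = true
      var start = 0

      override def run(ctx: SourceFunction.SourceContext[InputData]): Unit = {
        while (isRunning) {
          // `until` excludes the upper bound, so each batch holds exactly 100 numbers
          (start until (start + 100)).foreach(num => {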
            ctx.collect(InputData(num))
          })
          start += 100
          TimeUnit.SECONDS.sleep(5)
        }
      }

      override def cancel(): Unit = {
        isRunning = false
      }
    }

    val keyedStream = env.addSource(new SourceFromCollection()).setParallelism(1).keyBy(_.num)

The stream above wraps each generated number in an InputData instance, and keyBy turns it into a keyed stream.
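A note on the batch boundaries: in Scala, `a to b` includes the upper bound while `a until b` excludes it. The following plain-Scala sketch (no Flink involved; the object and method names are illustrative, not part of the original example) compares the two, assuming the intent is 100 numbers per batch with no overlap between consecutive batches.

```scala
object BatchBounds {
  // Inclusive upper bound: start, start + 1, ..., start + 100 (101 values).
  def inclusiveBatch(start: Int): Range = start to (start + 100)

  // Exclusive upper bound: start, start + 1, ..., start + 99 (100 values).
  def exclusiveBatch(start: Int): Range = start until (start + 100)

  def main(args: Array[String]): Unit = {
    println(inclusiveBatch(0).size) // 101
    println(exclusiveBatch(0).size) // 100
    // With `to`, the last value of one batch is also the first value of the next.
    println(inclusiveBatch(0).last == inclusiveBatch(100).head) // true
    println(exclusiveBatch(0).last == exclusiveBatch(100).head) // false
  }
}
```

With an inclusive range and a step of 100 between batches, the boundary values 100, 200, and so on would be emitted twice.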

2. BroadcastStream

The broadcast stream is the state stream in this example. The state value is passed in through a file: a second custom source, again extending RichSourceFunction, reads the file once per second to check whether a new state line has arrived.

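    // Filter state: "odd" selects odd numbers, "even" selects even numbers
    case class FilterState(state: String)

    // Polls the file once per second and emits any newly appended state line
    class SourceFromFile extends RichSourceFunction[String] {
      // @volatile because cancel() is called from a different thread than run()
      @volatile private var isRunning = true

      override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
        val bufferedReader = new BufferedReader(new FileReader("./data.txt"))
        while (isRunning) {
          // readLine returns null at the end of the file; isBlank(null) is true
          val line = bufferedReader.readLine
          if (!StringUtils.isBlank(line)) {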
            ctx.collect(line)
          }
          TimeUnit.SECONDS.sleep(1)
        }
      }

      override def cancel(): Unit = {
        isRunning = false
      }
    }

    val ruleStateDescriptor = new MapStateDescriptor("RulesBroadcastState", classOf[String], classOf[FilterState])

    // Broadcast stream: broadcasts the rules and registers the broadcast state
    val ruleStream = env.addSource(new SourceFromFile).setParallelism(1).map(new RichMapFunction[String, FilterState]() {
      override def map(in: String): FilterState = {
        FilterState(in)
      }
    }).broadcast(ruleStateDescriptor)

The state descriptor declares the type of the broadcast state. Here it is a MapStateDescriptor that maps a String key to a FilterState value. Looking up the FilterState by its key decides which elements of the DataStream reach the sink.

3. Combine DataStream and BroadcastStream

Call DataStream.connect(BroadcastStream). Because the original DataStream is a keyed stream, use KeyedBroadcastProcessFunction, which takes four type parameters:

· KS - the type of the keyBy field; the stream is keyed by InputData.num, so it is Int
· IN1 - the element type of the data stream, here InputData
· IN2 - the element type of the broadcast stream, here FilterState
· OUT - the output type; the sink prints strings directly, so it is String

    keyedStream.connect(ruleStream).process(new KeyedBroadcastProcessFunction[Int, InputData, FilterState, String] {

      // Same descriptor as the one defined above
      val ruleStateDescriptor = new MapStateDescriptor("RulesBroadcastState", classOf[String], classOf[FilterState])

      override def processElement(inputData: InputData, context: KeyedBroadcastProcessFunction[Int, InputData, FilterState, String]#ReadOnlyContext, out: Collector[String]): Unit = {
        val filterStateClass = context.getBroadcastState(ruleStateDescriptor).get("broadcastStateKey")
        val filterState = if (filterStateClass == null) {
          "odd"
        } else {
          filterStateClass.state
        }
        // Odd mode
        if (filterState == "odd" && inputData.num % 2 != 0) {
          out.collect(inputData.num.toString)
        }
        // Even mode
        if (filterState == "even" && inputData.num % 2 == 0) {
          out.collect(inputData.num.toString)
        }
      }

      override def processBroadcastElement(filterState: FilterState, context: KeyedBroadcastProcessFunction[Int, InputData, FilterState, String]#Context, collector: Collector[String]): Unit = {
        // Store the latest rule in the broadcast state
        val broadCastValue = context.getBroadcastState(ruleStateDescriptor)
        broadCastValue.put("broadcastStateKey", filterState)
        println(s"Rule Changed: ${filterState.state}")
      }
    }).setParallelism(1).print()

A. processElement

This method is responsible for outputting data according to whether the FilterState is odd or even. When no state has been broadcast yet, it defaults to odd. The FilterState from the BroadcastStream is read through context.getBroadcastState(stateDescriptor). Note that the state descriptor used here must be consistent with the one initialized above.

B. processBroadcastElement

This method handles elements of the broadcast stream and writes them into the broadcast state. Every parallel task receives each broadcast element and updates its own copy of the state, so later processElement calls see the latest state value. The key used in put here must match the key used in get in processElement, otherwise the lookup returns null.
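The key-consistency point can be tried out without Flink. The sketch below uses a plain java.util.Map as a stand-in for the broadcast state (an assumption made for illustration; the object name and the helpers stateWith and currentMode are hypothetical). A lookup with a mismatched key returns null, and wrapping the result in Option makes the fallback to "odd" explicit.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

object BroadcastKeyLookup {
  final case class FilterState(state: String)

  // Stand-in for the broadcast state: a java.util.Map, whose get()
  // returns null for a missing key.
  def stateWith(key: String, mode: String): JMap[String, FilterState] = {
    val state = new JHashMap[String, FilterState]()
    state.put(key, FilterState(mode))
    state
  }

  // Option(...) turns the null of a missing key into None,
  // after which the default mode "odd" applies.
  def currentMode(state: JMap[String, FilterState], key: String): String =
    Option(state.get(key)).map(_.state).getOrElse("odd")

  def main(args: Array[String]): Unit = {
    val state = stateWith("broadcastStateKey", "even")
    println(currentMode(state, "broadcastStateKey")) // even
    println(currentMode(state, "broadcaststatekey")) // odd (key mismatch, silent fallback)
  }
}
```

Because a mismatched key silently falls back to the default, keeping the key in a single shared constant is one way to avoid this class of mistake.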

4. Test

For convenient local testing, the parallelism of both streams is set to 1.

With an empty state file, the default state is odd, so only odd numbers are printed.

Add a line containing even to the file and save it (Ctrl+S). The broadcast source, which polls every 1 s, picks up the new state and broadcasts it to every task, and the tasks switch to printing even numbers.

Add a line containing odd, and the output switches back to odd numbers.

That completes a basic example of a broadcast value controlling a DataStream. The state file ends up containing two lines of state data.
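The test sequence can also be replayed as a plain-Scala model, with no Flink involved. This is only a sketch under a simplifying assumption: the events of both streams arrive in one fixed, interleaved order, whereas in a real job the relative order of the two streams is not fixed. The names Event, Rule, Num, and replay are hypothetical.

```scala
object BroadcastReplay {
  sealed trait Event
  final case class Rule(state: String) extends Event // element of the broadcast stream
  final case class Num(num: Int) extends Event       // element of the data stream

  // Replays the events in arrival order and returns the numbers that would be printed.
  def replay(events: Seq[Event]): Vector[Int] = {
    val (_, printed) = events.foldLeft(("odd", Vector.empty[Int])) {
      // Plays the role of processBroadcastElement: replace the current mode.
      case ((_, acc), Rule(state)) => (state, acc)
      // Plays the role of processElement: filter by the current mode.
      case ((mode, acc), Num(n)) =>
        val keep = (mode == "odd" && n % 2 != 0) || (mode == "even" && n % 2 == 0)
        (mode, if (keep) acc :+ n else acc)
    }
    printed
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Num(1), Num(2), Rule("even"), Num(3), Num(4), Rule("odd"), Num(5), Num(6))
    println(replay(events)) // Vector(1, 4, 5)
  }
}
```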

5. Complete code

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import org.apache.commons.lang3.StringUtils

import java.io.BufferedReader
import java.io.FileReader
import java.util.concurrent.TimeUnit

object BroadCastStateDemo {


  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

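    // Data stream: emits one batch of numbers every 5 s
    case class InputData(num: Int)

    class SourceFromCollection extends RichSourceFunction[InputData] {
      // @volatile because cancel() is called from a different thread than run()
      @volatile private var isRunning = true
      var start = 0

      override def run(ctx: SourceFunction.SourceContext[InputData]): Unit = {
        while (isRunning) {
          // `until` excludes the upper bound, so each batch holds exactly 100 numbers
          (start until (start + 100)).foreach(num => {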
            ctx.collect(InputData(num))
          })
          start += 100
          TimeUnit.SECONDS.sleep(5)
        }
      }

      override def cancel(): Unit = {
        isRunning = false
      }
    }

    val keyedStream = env.addSource(new SourceFromCollection()).setParallelism(1).keyBy(_.num)

    // Polls the file once per second and emits any newly appended state line
    class SourceFromFile extends RichSourceFunction[String] {
      // @volatile because cancel() is called from a different thread than run()
      @volatile private var isRunning = true

      override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
        val bufferedReader = new BufferedReader(new FileReader("/Users/xudong11/flink/src/main/scala/com.weibo.ug.push.flink/DataStreamingDemo/data.txt"))
        while (isRunning) {
          // readLine returns null at the end of the file; isBlank(null) is true
          val line = bufferedReader.readLine
          if (!StringUtils.isBlank(line)) {
            ctx.collect(line)
          }
          TimeUnit.SECONDS.sleep(1)
        }
      }

      override def cancel(): Unit = {
        isRunning = false
      }
    }

    // Filter state: "odd" selects odd numbers, "even" selects even numbers
    case class FilterState(state: String)

    val ruleStateDescriptor = new MapStateDescriptor("RulesBroadcastState", classOf[String], classOf[FilterState])

    // Broadcast stream: broadcasts the rules and registers the broadcast state
    val ruleStream = env.addSource(new SourceFromFile).setParallelism(1).map(new RichMapFunction[String, FilterState]() {
      override def map(in: String): FilterState = {
        FilterState(in)
      }
    }).broadcast(ruleStateDescriptor)

    // Connect the two streams
    keyedStream.connect(ruleStream).process(new KeyedBroadcastProcessFunction[Int, InputData, FilterState, String] {

      // Same descriptor as the one defined above
      val ruleStateDescriptor = new MapStateDescriptor("RulesBroadcastState", classOf[String], classOf[FilterState])

      override def processElement(inputData: InputData, context: KeyedBroadcastProcessFunction[Int, InputData, FilterState, String]#ReadOnlyContext, out: Collector[String]): Unit = {
        val filterStateClass = context.getBroadcastState(ruleStateDescriptor).get("broadcastStateKey")
        val filterState = if (filterStateClass == null) {
          "odd"
        } else {
          filterStateClass.state
        }
        // Odd mode
        if (filterState == "odd" && inputData.num % 2 != 0) {
          out.collect(inputData.num.toString)
        }
        // Even mode
        if (filterState == "even" && inputData.num % 2 == 0) {
          out.collect(inputData.num.toString)
        }
      }

      override def processBroadcastElement(filterState: FilterState, context: KeyedBroadcastProcessFunction[Int, InputData, FilterState, String]#Context, collector: Collector[String]): Unit = {
        // Store the latest rule in the broadcast state
        val broadCastValue = context.getBroadcastState(ruleStateDescriptor)
        broadCastValue.put("broadcastStateKey", filterState)
        println(s"Rule Changed: ${filterState.state}")
      }
    }).setParallelism(1).print()

    env.execute()

  }


}

4. Summary

1. Implementation steps

A broadcast value is implemented by connecting a DataStream to a BroadcastStream with connect. Pay attention to overriding the two process methods (processElement and processBroadcastElement) and to defining the matching state descriptor.

2. Data consistency

Also pay attention to the ctx parameter of the two process methods. In processElement, ctx is read-only: for consistency, a task may read the latest state there but cannot modify it. In contrast, the context in processBroadcastElement allows the state to be modified. The logic there must be deterministic and identical on every task, because each task updates its own copy of the state. Modifying the state with a random number, for example, breaks this requirement. Otherwise the state will differ between tasks and their output will be inconsistent.
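The consistency requirement can be illustrated with a small plain-Scala model in which every parallel task keeps its own copy of the state and applies the same sequence of broadcast elements. This is a sketch, not Flink code. The update function taskDependent is a deterministic stand-in for "random" behavior, chosen so the result is reproducible, and all names are hypothetical.

```scala
object DeterministicBroadcastUpdate {
  // Each of the `parallelism` tasks starts from "odd" and folds the same
  // sequence of broadcast rules into its own copy of the state.
  def applyAll(parallelism: Int, rules: Seq[String])(update: (Int, String) => String): Vector[String] =
    Vector.tabulate(parallelism) { taskIndex =>
      rules.foldLeft("odd")((_, rule) => update(taskIndex, rule))
    }

  // Consistent: the new state depends only on the broadcast element.
  val deterministic: (Int, String) => String = (_, rule) => rule

  // Inconsistent: the new state also depends on which task applies the update.
  val taskDependent: (Int, String) => String =
    (taskIndex, rule) => if (taskIndex % 2 == 0) rule else "odd"

  def main(args: Array[String]): Unit = {
    println(applyAll(3, Seq("even"))(deterministic)) // Vector(even, even, even)
    println(applyAll(3, Seq("even"))(taskDependent)) // Vector(even, odd, even)
  }
}
```

In the second case the tasks disagree on the current mode, so the same input number would be emitted by some tasks and dropped by others.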

3. Checkpoint

All tasks checkpoint their broadcast state. Although the broadcast state is the same in every task, each task writes its own copy when a checkpoint is taken, rather than only one task doing so. This is a design decision that avoids all tasks reading from the same file when the job is restored, which would create a hotspot. The cost is write amplification in the checkpoint by a factor of p (= parallelism). Flink ensures that no data is duplicated or lost when restoring state or changing the parallelism. When the job is restored with the same or a lower parallelism, each task reads its own previously checkpointed state. When the parallelism is increased, each existing task reads its own state, and the additional tasks (p_new - p_old) read the checkpointed state of the previous tasks in a round-robin manner.
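The arithmetic behind these two points can be sketched in plain Scala. The mapping in restoreSource is an illustrative assumption (new task index modulo the old parallelism) that matches the round-robin description above. It is not taken from Flink's source code, and the actual assignment is internal to Flink.

```scala
object BroadcastCheckpointSketch {
  // Every task writes its own copy, so the checkpointed broadcast state
  // is `parallelism` times the size of a single copy.
  def checkpointedBytes(stateBytes: Long, parallelism: Int): Long =
    stateBytes * parallelism

  // Index of the old task whose checkpointed copy a restored task reads
  // (assumed mapping: tasks below pOld read their own copy, extra tasks wrap around).
  def restoreSource(newTaskIndex: Int, pOld: Int): Int =
    newTaskIndex % pOld

  def main(args: Array[String]): Unit = {
    println(checkpointedBytes(1024L, 4)) // 4096
    // Scaling up from 3 to 5 tasks: tasks 3 and 4 reuse the copies of tasks 0 and 1.
    println((0 until 5).map(restoreSource(_, 3)).mkString(",")) // 0,1,2,0,1
  }
}
```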

4. State Backend

Broadcast state is kept in memory at runtime, so enough memory must be provisioned for it. The same holds for all other operator state, which is why the RocksDB state backend is not used for it.