Flink Stream Processing: ProcessFunction Explained with Examples

0 ProcessFunction API

The transformation operators covered so far cannot access an event's timestamp or the current watermark, and in some scenarios this information is essential. Map-style operators such as MapFunction, for example, cannot see timestamps or the event time of the current event.
For this reason, the DataStream API provides a family of low-level transformation operators that can access timestamps and watermarks and register timers. They can also emit special events, such as timeout events. Process functions are used to build event-driven applications and to implement custom business logic that the window functions and transformation operators covered earlier cannot express. Flink SQL, for example, is implemented with process functions.
Flink provides eight process functions:
• ProcessFunction
• KeyedProcessFunction
• CoProcessFunction
• ProcessJoinFunction
• BroadcastProcessFunction
• KeyedBroadcastProcessFunction
• ProcessWindowFunction
• ProcessAllWindowFunction

1 KeyedProcessFunction

KeyedProcessFunction operates on a KeyedStream. It processes every element of the stream and emits zero, one, or more elements. All process functions inherit from the RichFunction interface, so they provide methods such as open(), close(), and getRuntimeContext(). KeyedProcessFunction[KEY, IN, OUT] additionally provides two methods:
• processElement(v: IN, ctx: Context, out: Collector[OUT]) is called for every element in the stream. Results are emitted through the Collector. The Context gives access to the element's timestamp, the element's key, and the TimerService. The Context can also emit results to other streams (side outputs).
• onTimer(timestamp: Long, ctx: OnTimerContext, out: Collector[OUT]) is a callback that is invoked when a previously registered timer fires. The timestamp parameter is the time for which the timer was set, and the Collector is used to emit results. OnTimerContext, like the Context of processElement, provides contextual information, such as the time domain of the firing timer (event time or processing time).

1.1 TimerService and Timers

The TimerService object held by Context and OnTimerContext has the following methods:
• currentProcessingTime(): Long returns the current processing time.
• currentWatermark(): Long returns the timestamp of the current watermark.
• registerProcessingTimeTimer(timestamp: Long): Unit registers a processing-time timer for the current key. The timer fires when the processing time reaches the given timestamp.
• registerEventTimeTimer(timestamp: Long): Unit registers an event-time timer for the current key. The timer fires and runs the callback when the watermark is greater than or equal to the timer's timestamp.
• deleteProcessingTimeTimer(timestamp: Long): Unit deletes a previously registered processing-time timer. If no timer with that timestamp exists, the call has no effect.
• deleteEventTimeTimer(timestamp: Long): Unit deletes a previously registered event-time timer. If no timer with that timestamp exists, the call has no effect.
When a timer fires, the onTimer() callback is executed. Note that timers can only be used on keyed streams.
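The event-time timer semantics listed above can be illustrated with a small toy model. This is not Flink code: ToyEventTimeTimers and advanceWatermark are hypothetical names invented for this sketch, and the model covers a single key only. It shows that registering the same timestamp twice yields one timer, that deleting an unknown timestamp does nothing, and that a timer fires once the watermark is greater than or equal to its timestamp.

```scala
import scala.collection.mutable

// Toy model of event-time timers for ONE key (hypothetical, not the Flink API).
final class ToyEventTimeTimers {
  // A sorted set: registering the same timestamp twice keeps a single timer.
  private val timers = mutable.TreeSet.empty[Long]
  // Before any watermark has arrived, the watermark is Long.MinValue.
  private var watermark: Long = Long.MinValue

  def currentWatermark: Long = watermark

  def registerEventTimeTimer(timestamp: Long): Unit = timers += timestamp

  // Deleting a timestamp that was never registered has no effect.
  def deleteEventTimeTimer(timestamp: Long): Unit = timers -= timestamp

  // Advance the watermark and return the timers that fire, in timestamp order.
  def advanceWatermark(newWatermark: Long): List[Long] = {
    watermark = math.max(watermark, newWatermark)
    val due = timers.iterator.takeWhile(_ <= watermark).toList
    timers --= due
    due
  }
}
```

In this model, a timer registered for 5000 does not fire at watermark 4999 but does fire at watermark 5000, which mirrors the "greater than or equal to" rule of registerEventTimeTimer.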

1.2 Case 1

Requirement: raise an alert if a sensor's temperature rises continuously for 10 seconds. Windows are a poor fit for this scenario, because neither tumbling nor sliding windows are guaranteed to catch the following situation. In the figure, windows 1 and 2 each contain a falling reading, so neither window raises an alert, yet the readings in the green box are all rising and should trigger one. A window-based approach therefore misses the alert. It might seem that a smaller slide for the sliding window would solve this, but however small the slide is, the situation cannot be ruled out completely.
[Figure: windows 1 and 2 each contain a falling reading; a green box spanning both windows contains only rising readings]
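The tumbling-window case can be reproduced with a small worked example in plain Scala, without Flink. The readings, the object name WindowVsRun, and its helper methods are made up for this sketch. Each 10-second tumbling window contains one falling reading, yet the temperature rises without interruption from second 2 to second 16.

```scala
object WindowVsRun {
  // (event time in seconds, temperature); illustrative data only.
  val readings: Vector[(Int, Double)] =
    Vector(0 -> 30.0, 2 -> 29.0) ++
      (3 to 16).map(t => t -> (29.0 + (t - 2))) ++
      Vector(17 -> 20.0, 19 -> 21.0)

  // True if every reading is higher than the one before it.
  def strictlyRising(temps: Seq[Double]): Boolean =
    temps.size > 1 && temps.zip(temps.tail).forall { case (a, b) => b > a }

  // For tumbling windows of `size` seconds: window index -> "all rising?"
  def risingWindows(size: Int): Map[Int, Boolean] =
    readings.groupBy(_._1 / size).map { case (w, rs) => w -> strictlyRising(rs.map(_._2)) }

  // Length in seconds of the longest uninterrupted rising run.
  def longestRisingRunSeconds: Int = {
    var best = 0
    var runStart = readings.head._1
    for (((_, prevTemp), (t, temp)) <- readings.zip(readings.tail)) {
      if (temp > prevTemp) best = math.max(best, t - runStart)
      else runStart = t // a non-rising reading starts a new run
    }
    best
  }
}
```

Here risingWindows(10) reports false for both windows, while longestRisingRunSeconds is 14, so a per-window check would stay silent even though the temperature rose for more than 10 seconds.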
A ProcessFunction can handle this case. The demo is as follows:

import com.chen.flink.part01.SensorReading
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object ProcessFunctionTest {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Set the parallelism to 1
    env.setParallelism(1)

    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)

    // Parse each comma-separated line into a SensorReading
    val mapStream: DataStream[SensorReading] = dataStream.map(
      data => {
        val strings = data.split(",")
        SensorReading(strings(0), strings(1).toLong, strings(2).toDouble)
      }
    )
    val warningStream = mapStream.keyBy("id").process(new TempIncreaseAlertFunction2(10000L))
    warningStream.print("warning test")
    env.execute()
  }

  class TempIncreaseAlertFunction2(interval: Long) extends KeyedProcessFunction[Tuple, SensorReading, String] {

    // State holding the previous temperature and the timestamp of the registered timer
    lazy val lastTempState: ValueState[Double] = getRuntimeContext.getState(new ValueStateDescriptor[Double]("lasttemp", classOf[Double]))
    lazy val timestampState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("timestamp", classOf[Long]))

    override def processElement(value: SensorReading, ctx: KeyedProcessFunction[Tuple, SensorReading, String]#Context, out: Collector[String]): Unit = {

      // Read the state values (unset state is read as 0.0 and 0)
      val lastTemp: Double = lastTempState.value()
      val timestamp: Long = timestampState.value()
      // Store the latest temperature
      lastTempState.update(value.temperature)
      // If the temperature rose and no timer is registered yet, register one
      if (value.temperature > lastTemp && timestamp == 0) {
        val ts = ctx.timerService().currentProcessingTime() + interval
        // Register the timer
        ctx.timerService().registerProcessingTimeTimer(ts)
        // Remember the timer's timestamp
        timestampState.update(ts)
      } else if (value.temperature < lastTemp) {
        // The temperature fell: delete the timer and clear the state
        ctx.timerService().deleteProcessingTimeTimer(timestamp)
        timestampState.clear()
      }
    }

    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
      // The timer fired: emit the alert
      out.collect("Temperature has risen for " + interval / 1000 + " consecutive seconds")
      out.collect("currentProcessingTime: " + ctx.timerService().currentProcessingTime())

      // Clear the state so that a new timer can be registered
      timestampState.clear()
    }
  }

}

Running the program shows that the detection interval is 10 seconds: if the temperature rises continuously for 10 seconds, an alert message is emitted.

1.3 Case 2

Case 2 is largely the same as Case 1. The difference is that Case 2 uses event time, whereas Case 1 uses processing time.

package com.chen.flink.part02
import com.chen.flink.part01.SensorReading
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

object ProcessFunctionDemo {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // Use event time as the time characteristic
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)
    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)
    // Parse each comma-separated line into a SensorReading
    val mapStream: DataStream[SensorReading] = dataStream.map(
      data => {
        val strings = data.split(",")
        SensorReading(strings(0), strings(1).toLong, strings(2).toDouble)
      }
    )
    // Assign timestamps and generate watermarks that lag by 2 seconds
    val waterStream: DataStream[SensorReading] = mapStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(2)) {
      override def extractTimestamp(element: SensorReading): Long = {
        element.timestamp
      }
    })
    // Emit an alert if the temperature rises continuously for 10 seconds
    val warningStream = waterStream.keyBy(0).process(new TempIncreaseAlertFunction(10000L))
    warningStream.print("warning")
    env.execute()

  }
  // Custom KeyedProcessFunction
  class TempIncreaseAlertFunction(interval: Long) extends KeyedProcessFunction[Tuple, SensorReading, String] {

    // The previous temperature is kept as state so it can be compared with the current one
    lazy val lastTempState: ValueState[Double] = getRuntimeContext.getState(new ValueStateDescriptor[Double]("lastTemp", classOf[Double]))
    // The timer's timestamp is kept as state so the timer can be deleted later
    lazy val curTimestampState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("cur-time", classOf[Long]))

    override def processElement(value: SensorReading, ctx: KeyedProcessFunction[Tuple, SensorReading, String]#Context, out: Collector[String]): Unit = {

      // Read the previous temperature and the timer timestamp from state
      val lastTemp: Double = lastTempState.value()
      val curTimestamp: Long = curTimestampState.value()

      // Store the latest temperature as the new previous temperature
      lastTempState.update(value.temperature)

      // If the temperature rose and no timer is registered yet, register one
      if (lastTemp < value.temperature && curTimestamp == 0) {
        val curtime: Long = ctx.timerService().currentWatermark() + interval

        ctx.timerService().registerEventTimeTimer(curtime)
        // Remember the timer's timestamp
        curTimestampState.update(curtime)

      } else if (lastTemp > value.temperature) {
        // The temperature fell: delete the timer
        ctx.timerService().deleteEventTimeTimer(curTimestamp)
        // Clear the state
        curTimestampState.clear()
      }
    }

    // Called when the timer fires
    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {

      out.collect("Temperature has risen for " + interval / 1000 + " consecutive seconds")
      out.collect("Current watermark: " + ctx.timerService().currentWatermark())
      curTimestampState.clear()
    }
  }
}

The alert is now based on the event time carried in the data: if the temperature rises continuously for 10 seconds of event time, an alert is emitted.
With the code above, however, an alert is printed as soon as the first record is entered, whatever that record is. This clearly does not meet the requirement of alerting only after 10 consecutive seconds of rising temperature. The reason is that the first call to currentWatermark() returns the default value Long.MIN_VALUE, which is negative. A timer registered at Long.MIN_VALUE + interval is already in the past, so it fires as soon as the watermark advances, and output appears almost immediately.
To verify this, print the computed timer timestamp just before the timer is registered and start the program. The printed value is Long.MIN_VALUE + interval < 0, so the timer fires right away, which is why output is produced when the first record arrives.
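The arithmetic behind this behaviour can be checked in plain Scala. The sketch below does not use Flink; InitialWatermarkDemo and fires are hypothetical names, and the interval of 10000 ms is taken from the demo above.

```scala
object InitialWatermarkDemo {
  val interval: Long = 10000L
  // Value returned by currentWatermark() before any watermark has arrived.
  val initialWatermark: Long = Long.MinValue
  // Timestamp the demo registers for the first record.
  val timerTimestamp: Long = initialWatermark + interval

  // An event-time timer fires once the watermark reaches its timestamp.
  def fires(watermark: Long, timer: Long): Boolean = watermark >= timer

  def main(args: Array[String]): Unit = {
    println(s"timer timestamp = $timerTimestamp") // a large negative number
    println(s"fires at watermark 0: ${fires(0L, timerTimestamp)}")
  }
}
```

Any realistic watermark, even a small one such as 0, is far greater than this timer timestamp, so the timer is due the first time the watermark moves.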
Solution:
Add a check: set the timer timestamp to currentWatermark + interval only when the current watermark is greater than 0; otherwise set it to interval.
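The original code for this fix was only available as a screenshot. A minimal sketch of the check described above, written as a standalone helper, might look as follows. TimerTimestampFix and timerTimestamp are hypothetical names, not part of the original program.

```scala
object TimerTimestampFix {
  // Timestamp to register for the event-time timer.
  // Falls back to `interval` while the watermark is still unset (or not positive).
  def timerTimestamp(currentWatermark: Long, interval: Long): Long =
    if (currentWatermark > 0) currentWatermark + interval else interval
}
```

Inside processElement, the value of ctx.timerService().currentWatermark() would be passed as currentWatermark, and the result used both for registerEventTimeTimer and for updating curTimestampState.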
Run the program again. Now no alert is produced when the first record is entered, and the alert time is determined by the event time of the input data. The timer fires once the watermark (the largest event time seen minus the 2-second delay) reaches 10 seconds, that is, once an event with a timestamp of at least 12 seconds arrives.
In the recorded run, the watermark at that point is 10300 milliseconds, and the next timer fires once the watermark reaches at least 22000 milliseconds. If the temperature drops in the meantime, the timer is deleted and no alert is triggered.
In the next example, the temperature starts rising at the record with event time 15000 milliseconds, so the timer is registered at watermark (13000) + interval (10000) = 23000 milliseconds. If the temperature keeps rising until the event time reaches 13000 + 10000 (detection interval) + 2000 (delay) = 25000 milliseconds, the alert is emitted again.
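The numbers in this walk-through can be recomputed with a few lines of plain Scala. The sketch assumes that the watermark equals the largest event time seen minus the 2-second delay, and that the watermark has already caught up with the latest record when the timer is registered. WatermarkWalkthrough and its method names are hypothetical.

```scala
object WatermarkWalkthrough {
  val maxOutOfOrderness: Long = 2000L // watermark delay from the demo
  val interval: Long = 10000L         // detection interval from the demo

  // Watermark after a record with the given (largest so far) event time.
  def watermarkFor(maxEventTime: Long): Long = maxEventTime - maxOutOfOrderness

  // Timer registered when the rise starts at this event time (with the fix applied).
  def timerFor(maxEventTime: Long): Long = {
    val wm = watermarkFor(maxEventTime)
    if (wm > 0) wm + interval else interval
  }

  // Smallest event time whose watermark reaches the timer.
  def eventTimeThatFires(timer: Long): Long = timer + maxOutOfOrderness
}
```

For a rise starting at event time 15000, watermarkFor gives 13000, timerFor gives 23000, and eventTimeThatFires(23000) gives 25000, matching the figures in the text.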

2 ProcessFunction side output stream (SideOutput)

Most operators in the DataStream API produce a single output, that is, one stream of a single data type. The split operator can divide a stream into several streams, but those streams all have the same data type. The side-output feature of process functions can produce multiple streams whose data types may differ. A side output is defined by an OutputTag[X] object, where X is the data type of the output stream. A process function can emit an event to one or more side outputs through the Context object.

2.1 Example

Send readings with a temperature below 32°F to a side output.

package com.chen.flink.part02

import com.chen.flink.part01.SensorReading
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * This example demonstrates the side outputs of ProcessFunction
 */
object ProcessFunctionDemo03 {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val dataStream: DataStream[String] = env.socketTextStream("192.168.199.101", 7777)
    // Parse each comma-separated line into a SensorReading
    val mapedStream: DataStream[SensorReading] = dataStream.map(
      data => {
        val strings = data.split(",")
        SensorReading(strings(0), strings(1).toLong, strings(2).toDouble)
      }
    )

    val processStream: DataStream[SensorReading] = mapedStream.process(new LowTempMonitorFunction)
    // Get the side output; the tag must match the one defined in the function
    val sideStream = processStream.getSideOutput(new OutputTag[String]("low-temp"))
    // Print the side output and the main stream separately
    sideStream.print("sideStream")
    processStream.print("allStream")
    env.execute()
  }
}

class LowTempMonitorFunction extends ProcessFunction[SensorReading, SensorReading] {

  // Tag that identifies the side output
  lazy val lowTempAlarmOutput: OutputTag[String] = new OutputTag[String]("low-temp")

  override def processElement(value: SensorReading, ctx: ProcessFunction[SensorReading, SensorReading]#Context, out: Collector[SensorReading]): Unit = {
    // If the temperature is below 32 degrees, emit a message to the side output
    if (value.temperature < 32) {
      ctx.output(lowTempAlarmOutput, value.id + " current temperature: " + value.temperature)
    }
    // Every reading goes to the main output
    out.collect(value)
  }

}

When the temperature is below 32 degrees, a message for that reading is also sent to the side output and printed by the side output stream, while every reading still appears in the main stream.

Origin blog.csdn.net/Keyuchen_01/article/details/118709311