Flink state management

The role of state

In Flink, state is the data held by a task/operator at a given point in time; it is produced as the program processes data. First, it is important to distinguish state from checkpoints: a checkpoint persists the state data, and by default checkpoints are stored in the JobManager's memory. A checkpoint of a Flink job represents a snapshot of the job's global state at a particular moment, which makes it easy to recover data when a task fails.

Storing state values (here, checkpoints are stored on HDFS)

env.setStateBackend(new FsStateBackend("hdfs:///user/flink/app_statistics/checkpoint"))

Checkpointing: persist the state data so that it can be restored when the job restarts

    // Enable checkpointing so that data can be recovered when the job fails and restarts;
    // the default mode is CheckpointingMode.EXACTLY_ONCE
    // flink-conf.yaml configures the default restart strategy: fixed-delay(4, 10s)
    env.enableCheckpointing(60000)
    // Do not let a checkpoint failure cause the task to fail
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // Configure where checkpoint state is stored
    env.setStateBackend(new FsStateBackend("hdfs:///user/flink/app_statistics/checkpoint"))
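
Beyond the basics above, checkpoint behaviour can be tuned further. A minimal sketch, assuming a Flink 1.x CheckpointConfig (the interval, pause, and timeout values here are illustrative):

    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Checkpoint every 60s with exactly-once guarantees
    env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE)
    // Leave at least 30s between the end of one checkpoint and the start of the next
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(30000)
    // Abort a checkpoint if it takes longer than 10 minutes
    env.getCheckpointConfig.setCheckpointTimeout(600000)
    // Retain the last checkpoint even when the job is cancelled,
    // so the job can later be resumed from it
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)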

State applications

  • State -> KeyedState (most common)

KeyedState is state defined on a KeyedStream. It is bound to a particular key: every key in the KeyedStream has its own corresponding state. Keyed State can only be used in Rich Functions applied to a KeyedStream.
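
The cases below use ValueState, but the keyed state API also provides ListState, MapState and ReducingState. A minimal sketch of how the common handles are declared inside a RichFunction; the class and state names (StatefulFn, lastValue, seenValues, countsByTag) are illustrative:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.api.common.state._
    import org.apache.flink.configuration.Configuration

    // Illustrative example: each handle stores an independent value per key
    class StatefulFn extends RichMapFunction[(String, Long), (String, Long)] {

      private var lastValue: ValueState[Long] = _          // one value per key
      private var seenValues: ListState[Long] = _          // a list per key
      private var countsByTag: MapState[String, Long] = _  // a map per key

      override def open(parameters: Configuration): Unit = {
        lastValue = getRuntimeContext.getState(
          new ValueStateDescriptor[Long]("lastValue", classOf[Long]))
        seenValues = getRuntimeContext.getListState(
          new ListStateDescriptor[Long]("seenValues", classOf[Long]))
        countsByTag = getRuntimeContext.getMapState(
          new MapStateDescriptor[String, Long]("countsByTag", classOf[String], classOf[Long]))
      }

      override def map(value: (String, Long)): (String, Long) = {
        lastValue.update(value._2)  // state access only works on a KeyedStream
        value
      }
    }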

Case I: using Keyed State to estimate Pi with a Monte Carlo simulation

(overriding the map method)
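
The estimator behind this case: points (x, y) are drawn uniformly from the unit square, and the fraction satisfying x * x + y * y <= 1 (the quarter circle) converges to its area, Pi / 4. Hence:

    Pi ≈ 4 × (points inside the quarter circle) / (total points)

This is exactly the running ratio that the map function below maintains in keyed state.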

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._

// A Monte Carlo sample point: pi is 1 if the point falls inside the quarter circle, 0 otherwise
case class MonteCarloPoint(x: Double, y: Double) {

  def pi = if (x * x + y * y <= 1) 1 else 0
}

object MonteCarko extends App {

  // Custom source that generates random coordinate points
  // (a minimal sketch: emits uniformly random points in the unit square)
  class MonteCarloSource extends RichSourceFunction[MonteCarloPoint] {

    @volatile private var running = true

    override def run(ctx: SourceFunction.SourceContext[MonteCarloPoint]): Unit = {
      val random = new scala.util.Random()
      while (running) {
        ctx.collect(MonteCarloPoint(random.nextDouble(), random.nextDouble()))
        Thread.sleep(10)
      }
    }

    override def cancel(): Unit = running = false
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment


  // State must be used inside a RichFunction
  val myMapFun = new RichMapFunction[(Long, MonteCarloPoint), (Long, Double)] {

    // Declare the state variable
    var countAndPi: ValueState[(Long, Long)] = _

    override def map(value: (Long, MonteCarloPoint)): (Long, Double) = {

      // Read the current state value via ValueState.value
      val tmpCurrentSum = countAndPi.value

      val currentSum = if (tmpCurrentSum != null) {
        tmpCurrentSum
      } else {
        (0L, 0L)
      }

      val allcount = currentSum._1 + 1
      val picount = currentSum._2 + value._2.pi

      // Compute the new state value
      val newState: (Long, Long) = (allcount, picount)

      // Update the state
      countAndPi.update(newState)

      // Emit the total sample count and the Pi value estimated so far
      (allcount, 4.0 * picount / allcount)

    }

    override def open(parameters: Configuration): Unit = {
      countAndPi = getRuntimeContext.getState(
        new ValueStateDescriptor[(Long, Long)]("MonteCarloPi", createTypeInformation[(Long, Long)])
      )
    }

  }
    
  // Add the data source
  val dataStream: DataStream[MonteCarloPoint] = env.addSource(new MonteCarloSource)

  // Convert to a KeyedStream
  val keyedStream = dataStream.map((1L, _)).keyBy(0)

  // Apply the RichFunction defined above and print the result
  keyedStream.map(myMapFun).print()

  env.execute("Monte Carko Test")

}

Case II: the example from the official website (overriding the flatMap method)

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector


class CountWindowAverage extends RichFlatMapFunction[(Long, Long), (Long, Long)] {

  private var sum: ValueState[(Long, Long)] = _

  override def flatMap(input: (Long, Long), out: Collector[(Long, Long)]): Unit = {

    // access the state value
    val tmpCurrentSum = sum.value

    // If it hasn't been used before, it will be null
    val currentSum = if (tmpCurrentSum != null) {
      tmpCurrentSum
    } else {
      (0L, 0L)
    }

    // update the count
    val newSum = (currentSum._1 + 1, currentSum._2 + input._2)

    // update the state
    sum.update(newSum)

    // if the count reaches 2, emit the average and clear the state
    if (newSum._1 >= 2) {
      out.collect((input._1, newSum._2 / newSum._1))
      sum.clear()
    }
  }

  override def open(parameters: Configuration): Unit = {
    sum = getRuntimeContext.getState(
      new ValueStateDescriptor[(Long, Long)]("average", createTypeInformation[(Long, Long)])
    )
  }
}



object ExampleCountWindowAverage extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  env.fromCollection(List(
    (1L, 3L),
    (1L, 5L),
    (1L, 7L),
    (1L, 4L),
    (1L, 2L)
  )).keyBy(_._1)
    .flatMap(new CountWindowAverage()).print()
  // the printed output will be (1,4) and (1,5)

  env.execute("ExampleManagedState")
}

Case III: computing the top 3 most popular items

ProcessFunction is a low-level API that Flink provides for advanced functionality. Its main feature is the timer (supporting either EventTime or ProcessingTime). In this case we use a timer to determine when the traffic data of all items for a given window has been received. Because watermark progress is global, every time processElement receives an ItemViewCount it registers a timer for windowEnd + 1 (the Flink framework automatically ignores duplicate registrations for the same timestamp). When the windowEnd + 1 timer fires, it means we have seen the watermark for windowEnd + 1, i.e. the counts of all items for the window ending at windowEnd have been received. In onTimer we then collect the view counts of all items, select the top N, format the ranking into a string, and emit it.

Here we also use ListState[ItemViewCount] to store every ItemViewCount received, which guarantees that no state is lost and data remains consistent in case of failure. ListState is a State API provided by Flink that resembles the Java List interface; it integrates with the checkpointing mechanism, which automatically guarantees exactly-once semantics.

import java.sql.Timestamp
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer


case class UserBehavior(userId: Long, itemId: Long, categoryId: Int,
                        behavior: String, timestamp: Long)
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

object UserBehaviorAnalysis {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    val value: DataStream[UserBehavior] = env.readTextFile("D:\\projects\\flinkStudy\\src\\userBehavior.csv").
      map(line => {
        val linearray = line.split(",")
        UserBehavior(linearray(0).toLong, linearray(1).toLong, linearray(2).toInt, linearray(3), linearray(4).toLong)
      })
    val watermarkDataStream: DataStream[UserBehavior] = value.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[UserBehavior]
    (Time.milliseconds(1000)) {
      override def extractTimestamp(element: UserBehavior): Long = {
        element.timestamp
      }
    })
    val itemIdWindowStream: DataStream[ItemViewCount] = watermarkDataStream.filter(_.behavior == "pv").
      keyBy("itemId").
      timeWindow(Time.minutes(60),Time.minutes(5))
      // Aggregate within each window
      .aggregate(new CountAgg(), new WindowResultFunction())

    itemIdWindowStream.keyBy("windowEnd").process(new TopNHotItems(3)).print()

    env.execute("Hot Items Job")
  }
}

// COUNT aggregate: the accumulator is a Long that is incremented once per input record
class CountAgg extends AggregateFunction[UserBehavior, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(userBehavior: UserBehavior, acc: Long): Long = acc + 1
  override def getResult(acc: Long): Long = acc
  override def merge(acc: Long, acc1: Long): Long = acc1 + acc
}

// Emits each window's aggregation result as an ItemViewCount
class WindowResultFunction extends WindowFunction[Long, ItemViewCount, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
    val itemId = key.asInstanceOf[Tuple1[Long]]._1
    val count = input.iterator.next()
    out.collect(ItemViewCount(itemId, window.getEnd, count))
  }
}


class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple, ItemViewCount, String] {
  private var itemState : ListState[ItemViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // Name the state variable and declare its type
    val itemsStateDesc = new ListStateDescriptor[ItemViewCount]("itemState-state", classOf[ItemViewCount])
    // Create the state handle
    itemState = getRuntimeContext.getListState(itemsStateDesc)
  }

  override def processElement(input: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
    // Save every record into the state
    itemState.add(input)
    // Register an EventTime timer for windowEnd + 1; when it fires, all item data for the windowEnd window has been collected,
    // i.e. the onTimer callback is triggered once the program sees a watermark of windowEnd + 1
    context.timerService.registerEventTimeTimer(input.windowEnd + 1)
  }

  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // Fetch the view counts of all items received
    val allItems: ListBuffer[ItemViewCount] = ListBuffer()
    import scala.collection.JavaConversions._
    for (item <- itemState.get) {
      allItems += item
    }
    // Clear the state early to free up space
    itemState.clear()
    // Sort by view count in descending order
    val sortedItems = allItems.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
    // Format the ranking into a String for easy printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")

    for (i <- sortedItems.indices) {
      val currentItem: ItemViewCount = sortedItems(i)
      // e.g.  No1:  itemId=12224  viewCount=2413
      result.append("No").append(i + 1).append(":")
        .append("  itemId=").append(currentItem.itemId)
        .append("  viewCount=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // Throttle the output rate to simulate a rolling real-time result
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}

 

Origin blog.csdn.net/xuehuagongzi000/article/details/104328893