DStream-05: How updateStateByKey Works, with Source Code

Demo

updateStateByKey accumulates the word-count results of every batch into a running total.

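import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketDstream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")
    // updateStateByKey requires a checkpoint directory; this path is only an example.
    ssc.checkpoint("/tmp/spark-checkpoint")
    val lines = ssc.socketTextStream("localhost", 9999)
    // seq: the values for one key in the current batch; total: that key's previous state.
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey[Int]((seq, total) => {
      total match {
        case Some(value) => Option(seq.sum + value) // add to the running total
        case None => Option(seq.sum)                // first time this key is seen
      }
    })
    // print() registers the output operation; print(_) only builds an unused function value.
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}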
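
To see what the update function does from one batch to the next, here is a minimal sketch in plain Scala, without Spark. The object name and the sample batches are made up for illustration; only the shape of the function, (Seq[Int], Option[Int]) => Option[Int], comes from the demo above.

```scala
object UpdateFuncSketch {
  // Same shape as the function passed to updateStateByKey[Int] in the demo:
  // `seq` holds this batch's values for one key, `total` is the previous state.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (seq, total) => Some(seq.sum + total.getOrElse(0))

  def main(args: Array[String]): Unit = {
    // Hypothetical batches for the single key "spark".
    val batch1 = Seq(1, 1)        // seen twice in the first batch
    val batch2 = Seq(1)           // seen once in the second batch
    val batch3 = Seq.empty[Int]   // not seen in the third batch

    val s1 = updateFunc(batch1, None) // first batch: no previous state
    val s2 = updateFunc(batch2, s1)
    val s3 = updateFunc(batch3, s2)   // no new data: the state is carried forward

    println(s"$s1 $s2 $s3") // prints "Some(2) Some(3) Some(3)"
  }
}
```

The third call shows that a key with no new data still goes through the update function with an empty Seq, which is why the running total survives quiet batches.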

Source code

Accumulating is actually fairly simple:
just add the result of the current batch to the result of the previous batch.
The entry point is updateStateByKey.

PairDStreamFunctions

def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
    val cleanedFunc = ssc.sc.clean(updateFunc)
    val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
      cleanedFunc(it)
    }
    new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
  }
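
The overload above takes an iterator-based function, while the demo passes a simple per-key function. The sketch below shows one way such a per-key function can be lifted into the iterator shape. It is plain Scala, and `lift` is a hypothetical helper name for illustration, not a Spark API.

```scala
object LiftUpdateFuncSketch {
  // Lift a per-key update function into the iterator-based shape taken by
  // the overload shown above. `lift` is a hypothetical helper, not a Spark method.
  def lift[K, V, S](
      updateFunc: (Seq[V], Option[S]) => Option[S]
  ): Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator =>
      iterator.flatMap { case (key, values, state) =>
        // A None result drops the key from the output.
        updateFunc(values, state).map(newState => (key, newState))
      }

  def main(args: Array[String]): Unit = {
    val sumFunc: (Seq[Int], Option[Int]) => Option[Int] =
      (seq, total) => Some(seq.sum + total.getOrElse(0))

    val lifted = lift[String, Int, Int](sumFunc)
    val input: Iterator[(String, Seq[Int], Option[Int])] = Iterator(
      ("a", Seq(1, 1), Some(3)), // key with previous state 3
      ("b", Seq(1), None)        // key seen for the first time
    )
    println(lifted(input).toList) // prints "List((a,4), (b,1))"
  }
}
```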

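As mentioned in the DStream-04 article on window functions, after each computation every DStream keeps the RDD it generated in memory so that later batches can use it, which makes accumulation straightforward. How do we get the previous RDD? Subtracting slideDuration from the current batch time gives the timestamp of the previous batch, and getOrCompute returns the RDD for that time.
By default slideDuration is the batch interval (batchInterval). For a window operation it is likewise the batch interval unless a different slide duration is specified.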
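
The time arithmetic can be sketched in plain Scala as follows. Everything here is a stand-in chosen for illustration: times are plain milliseconds, an "RDD" is just a Map, and `remember` and `previousState` are hypothetical names. Unlike the real getOrCompute, this sketch only looks a value up and never computes a missing one.

```scala
import scala.collection.mutable

object PreviousBatchSketch {
  // Hypothetical stand-ins: times are milliseconds, an "RDD" is a Map.
  val slideDurationMs = 1000L
  private val generated = mutable.Map.empty[Long, Map[String, Int]]

  // Store the state produced for a batch so that the next batch can find it.
  def remember(timeMs: Long, state: Map[String, Int]): Unit =
    generated(timeMs) = state

  // Look up the state of the batch immediately before `validTimeMs`.
  def previousState(validTimeMs: Long): Option[Map[String, Int]] =
    generated.get(validTimeMs - slideDurationMs)

  def main(args: Array[String]): Unit = {
    remember(1000L, Map("spark" -> 2))
    println(previousState(2000L)) // state of the batch at t = 1000 ms
    println(previousState(1000L)) // None: nothing exists before the first batch
  }
}
```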

StateDStream

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Time, Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD: Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  // Note: StateDStream must be checkpointed (mustCheckpoint = true),
  // so a checkpoint directory has to be set to save its data.
  super.persist(StorageLevel.MEMORY_ONLY_SER)
  override val mustCheckpoint = true

  // Merges the result of the previous batch's RDD with the data of the current batch.
  private [this] def computeUsingPreviousRDD(
      batchTime: Time,
      parentRDD: RDD[(K, V)],
      prevStateRDD: RDD[(K, S)]) = {
    // Define the function for the mapPartition operation on cogrouped RDD;
    // first map the cogrouped tuple to tuples of required type,
    // and then apply the update function
    val updateFuncLocal = updateFunc
    val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
      val i = iterator.map { t =>
        val itr = t._2._2.iterator
        val headOption = if (itr.hasNext) Some(itr.next()) else None
        (t._1, t._2._1.toSeq, headOption)
      }
      updateFuncLocal(batchTime, i)
    }
    // Combine the two RDDs by key with cogroup
    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    // Then apply the update function to the cogrouped result
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)
  }

  override def compute(validTime: Time): Option[RDD[(K, S)]] = {

    // Try to get the previous state RDD
    // Compute the previous batch time and fetch that batch's state RDD.
    getOrCompute(validTime - slideDuration) match {
    
      // Some: a previous state RDD exists. None: this is the first batch.
      case Some(prevStateRDD) =>    // If previous state RDD exists
        // Try to get the parent RDD
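        // Get the data that arrived in the current batch. This is a little subtle:
        // parent.getOrCompute(validTime) returns the RDD computed by the upstream
        // DStream for this batch; MappedDStream's compute method makes this clearer.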
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>    // If parent RDD exists, then compute as usual
            // Merge the data of the two RDDs.
            computeUsingPreviousRDD (validTime, parentRDD, prevStateRDD)
          case None =>     // If parent RDD does not exist
            // Re-apply the update function to the old state RDD
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, S)]) => {
              val i = iterator.map(t => (t._1, Seq.empty[V], Option(t._2)))
              updateFuncLocal(validTime, i)
            }
            val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
        }

      case None =>    // If previous session RDD does not exist (first input data)
        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>   // If parent RDD exists, then compute as usual
            initialRDD match {
              case None =>
                // Define the function for the mapPartition operation on grouped RDD;
                // first map the grouped tuple to tuples of required type,
                // and then apply the update function
                val updateFuncLocal = updateFunc
                val finalFunc = (iterator: Iterator[(K, Iterable[V])]) => {
                  updateFuncLocal (validTime,
                    iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))
                }

                val groupedRDD = parentRDD.groupByKey(partitioner)
                val sessionRDD = groupedRDD.mapPartitions(finalFunc, preservePartitioning)
                // logDebug("Generating state RDD for time " + validTime + " (first)")
                Some (sessionRDD)
              case Some (initialStateRDD) =>
                computeUsingPreviousRDD(validTime, parentRDD, initialStateRDD)
            }
          case None => // If parent RDD does not exist, then nothing to do!
            // logDebug("Not generating state RDD (no previous state, no parent)")
            None
        }
    }
  }
}
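
The merge performed by compute and computeUsingPreviousRDD can be imitated with ordinary collections. The sketch below is an analogue under simplifying assumptions, not Spark code: a Map stands in for the state RDD, a Seq of pairs for the parent RDD, and `mergeBatch` is a hypothetical name. An empty previous state plays the role of the first batch, and an empty batch plays the role of a missing parent RDD.

```scala
object StateMergeSketch {
  type UpdateFunc = (Seq[Int], Option[Int]) => Option[Int]

  // Plain-collection analogue of cogroup + mapPartitions: visit every key
  // that appears in either the new batch or the previous state.
  def mergeBatch(
      batch: Seq[(String, Int)],
      prevState: Map[String, Int],
      updateFunc: UpdateFunc
  ): Map[String, Int] = {
    // Group this batch's values by key, like the parent side of the cogroup.
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = grouped.keySet ++ prevState.keySet
    keys.flatMap { key =>
      val values = grouped.getOrElse(key, Seq.empty[Int])
      // prevState.get(key) mirrors headOption: None for a key never seen before.
      updateFunc(values, prevState.get(key)).map(state => key -> state)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val sum: UpdateFunc = (seq, total) => Some(seq.sum + total.getOrElse(0))
    // First batch: no previous state, so every key starts from None.
    val state1 = mergeBatch(Seq("a" -> 1, "b" -> 1, "a" -> 1), Map.empty, sum)
    // Second batch: "a" has no new data but its state is still carried over.
    val state2 = mergeBatch(Seq("b" -> 1), state1, sum)
    println(state1)
    println(state2)
  }
}
```

Because every key of the previous state is visited on every batch, the cost of each update grows with the total number of keys, not just with the keys present in the new batch.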
