Spark Streaming's updateStateByKey

updateStateByKey updates state by key: for each key it combines the values that arrive in the current batch with the previously saved state, which is carried in an Option object. Its typical use case is accumulating the result of each micro-batch during stream processing, so that the state is continuously updated.

I have prepared an example for everyone: it uses this method to count, per key, the frequency of each word from the start of the computation to the end. Keep in mind that with this method the data should be in key-value form; if you need richer state, the value can be a custom class (see the short sketch after the example).

package com.sparkstream

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamUpdateState {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamUpdateState")
    val ssc = new StreamingContext(conf, Seconds(5)) // a single duration argument sets the batch interval

    //ssc.sparkContext.setLogLevel("ERROR")
    ssc.checkpoint("mycheckpoint") // checkpoint directory, required by updateStateByKey

    val dataDS: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.182.147", 9999) // monitor a socket port

    val wordDS: DStream[String] = dataDS.flatMap(_.split(" "))

    val tupleDS: DStream[(String, Int)] = wordDS.map((_, 1))

    // Input:   hello hello kitty
    // Pairs:   (hello,1) (hello,1) (kitty,1)
    // Grouped: hello -> Seq(1,1), kitty -> Seq(1)
    // The Option[Int] (Some(n) or None) carries the running total across batches:
    // hello: seq = Seq(1,1),   option = None      => Option(2)
    // Next input: hello hello hello snoopy
    // hello: seq = Seq(1,1,1), option = Option(2) => Option(5)

    // Each batch, this function receives the key's new values together with its previously saved state
    val wordCounts: DStream[(String, Int)] = tupleDS.updateStateByKey((seq: Seq[Int], option: Option[Int]) => {
      var value = 0 // accumulator prepared for each computation
      value += option.getOrElse(0) // the running total from the previous batch, 0 if none
      // iterate over and accumulate the current batch's values
      for (elem <- seq) {
        value += elem
      }
      // hand the latest state back
      Option(value) // the accumulated result after this batch
    })

    wordCounts.print()

    // Both calls are required: start() launches the computation, awaitTermination() keeps it running
    ssc.start()
    ssc.awaitTermination()

  }

}
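To try the example, start a socket server on the monitored host before launching the job, for example with netcat: nc -lk 9999 (assuming netcat is available on 192.168.182.147). Then type lines of space-separated words; every 5 seconds the job prints each word's running count accumulated since the job started.

As noted above, the state type does not have to be Int, because updateStateByKey is generic in the state type. Below is a minimal sketch, using a hypothetical WordStats case class, that additionally tracks how many batches each word appeared in; it would slot into the same program in place of the wordCounts computation:

    case class WordStats(count: Int, batches: Int) // hypothetical custom state class

    val statsDS: DStream[(String, WordStats)] = tupleDS.updateStateByKey(
      (seq: Seq[Int], state: Option[WordStats]) => {
        val prev = state.getOrElse(WordStats(0, 0))
        // the update function is also invoked for keys with no new values in the batch,
        // so only count a batch when the key actually appeared in it
        val seen = if (seq.nonEmpty) 1 else 0
        Option(WordStats(prev.count + seq.sum, prev.batches + seen))
      })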

In fact, Spark has another method for saving and updating state, mapWithState, which was added later. To further save computing resources, mapWithState only computes for the keys that actually appear in the current batch, not for all keys. For example, if the stream has contained 1 2 3 4 5, each appearing twice, but only 1 2 3 appear in the next batch, then mapWithState will only update the state of 1 2 3.
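For comparison, here is a minimal sketch of the same word count written with mapWithState, under the same assumptions as the example above (local[2] master, socket source at 192.168.182.147:9999, checkpoint directory mycheckpoint); StateSpec.function and the State[Int] handle are the standard Spark Streaming API for this method.

package com.sparkstream

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StreamMapWithState {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamMapWithState")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("mycheckpoint") // mapWithState also requires a checkpoint directory

    val tupleDS = ssc.socketTextStream("192.168.182.147", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))

    // The mapping function is invoked once per key that appears in the current batch;
    // keys absent from the batch are left untouched, which is the resource saving described above
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum) // persist the new running total for this key
      (word, sum)       // the element emitted for this batch
    }

    val wordCounts = tupleDS.mapWithState(StateSpec.function(mappingFunc))

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()

  }

}

Note that mapWithState emits one record per key that appeared in the batch rather than the full state of every key; if you need a snapshot of all keys, the resulting stream exposes it through stateSnapshots().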

Origin blog.csdn.net/dudadudadd/article/details/114332339