Transformations on DStreams: using updateStateByKey for stateful accumulation

Previous post in this series, on using transform to implement blacklist filtering:
https://blog.csdn.net/qq_43688472/article/details/86616864
Stateless processing handles only the data in the current batch: a batch arrives, it is processed once, and nothing is remembered.
Stateful processing, by contrast, needs to "accumulate" the current batch's data with the data from earlier batches.

For example: counting how many times each value has appeared between two points in time today.
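As a quick contrast, here is what the stateless version of such a count looks like (a minimal sketch, reusing the lines socket stream from the full example below): each batch is counted independently and everything is forgotten afterwards.

// Stateless: reduceByKey runs within one batch, so every 10-second batch starts from zero.
val perBatchCounts = lines.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _)
perBatchCounts.print()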

There are two ways to get the stateful version:
1. Tag each record with a timestamp, store the records somewhere, and then do the accumulation over them.
2. Do the accumulation directly, which is what updateStateByKey offers.
Official docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
updateStateByKey(func)
Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
In other words, the previous state is accumulated into the updated state.
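To see concretely what "applying the given function on the previous state and the new values" means, here is a hand trace of the update function defined later in the example (the batch contents are made up for illustration):

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

// Suppose the key "a" arrives twice in batch 1 and once in batch 2:
val afterBatch1 = updateFunction(Seq(1, 1), None)      // Some(2): no previous state yet
val afterBatch2 = updateFunction(Seq(1), afterBatch1)  // Some(3): 1 new + 2 previous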

Working through it in IDEA

package g5.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyApp {

  def main(args: Array[String]): Unit = {

    // Setup
    val conf = new SparkConf().setMaster("local[2]").setAppName("UpdateStateByKeyApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // A checkpoint directory is required: this is a stateful operation, so the
    // previous state must be persisted somewhere before it can be accumulated.
    ssc.checkpoint("hdfs://hadoop001:8020/ss/logs")

    // Business logic: split comma-separated words, pair each with 1,
    // then fold the new pairs into the running state per key.
    val lines = ssc.socketTextStream("hadoop001", 9999)
    val results = lines.flatMap(_.split(",")).map((_, 1))
    val state = results.updateStateByKey(updateFunction)

    state.print()
    // Start the streaming application
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate

  }
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum          // sum of the new values in this batch
    val pre = runningCount.getOrElse(0)   // previous running count, 0 if no prior state
    Some(newCount + pre)                  // add the new values to the previous count
  }

}
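To try it, run nc -lk 9999 on hadoop001 (assuming netcat is installed there) and type comma-separated words; every 10 seconds the job prints the cumulative count per word, carrying the counts across batches instead of starting over.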

Problem:

You will find that this produces a lot of small files on HDFS, because the state is checkpointed on every 10-second batch.
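One common mitigation (a sketch; the 50-second value is illustrative) is to checkpoint the state DStream less often than once per batch via DStream.checkpoint(interval); the tuning advice in the Spark docs is roughly 5-10 times the batch interval:

// Checkpoint the state every 50 seconds instead of every 10-second batch,
// producing fewer, larger files on HDFS. Call this before ssc.start().
state.checkpoint(Seconds(50))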

Reposted from blog.csdn.net/qq_43688472/article/details/86617536