Transformations on DStreams: using transform to implement a blacklist / custom filtering
https://blog.csdn.net/qq_43688472/article/details/86616864
Stateless: only the current batch's data is processed; data comes in once and is processed once, with nothing carried over.
Stateful: the current batch's data needs to be "accumulated" with data from previous batches.
Example: how many times some value has appeared between one point in time today and another.
Two ways to implement this:
1. Attach a timestamp to each record, store the records somewhere external, and do the accumulation there.
2. Do it directly in Spark Streaming with a stateful operator.
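The difference between the two modes can be sketched in plain Scala, with no Spark involved. The two batches below are made-up data simulating two micro-batches of a word stream:

```scala
object StateDemo {
  def main(args: Array[String]): Unit = {
    // Two simulated micro-batches of words (hypothetical data)
    val batch1 = Seq("spark", "flink", "spark")
    val batch2 = Seq("spark", "kafka")

    // Stateless: each batch is counted on its own; the results are independent
    val perBatch = Seq(batch1, batch2).map(_.groupBy(identity).map { case (k, v) => (k, v.size) })
    println(perBatch)    // counts for batch1, then separately for batch2

    // Stateful: each batch's counts are merged into the running state
    val finalState = Seq(batch1, batch2).foldLeft(Map.empty[String, Int]) { (state, batch) =>
      batch.foldLeft(state)((s, w) => s.updated(w, s.getOrElse(w, 0) + 1))
    }
    println(finalState)  // "spark" -> 3, "flink" -> 1, "kafka" -> 1
  }
}
```

In the stateless case "spark" is reported as 2 and then 1; in the stateful case the second batch is added on top of the first, giving 3. `updateStateByKey` automates exactly this kind of merge.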
官网:http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
updateStateByKey(func)
Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
In short: the new batch's values are merged into each key's previously accumulated state.
Code in IDEA:
package g5.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyApp {

  def main(args: Array[String]): Unit = {
    // Setup
    val conf = new SparkConf().setMaster("local[2]").setAppName("UpdateStateByKeyApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    // A checkpoint directory is required here: this is a stateful operation, so the
    // old state must be persisted somewhere for the accumulation to work.
    ssc.checkpoint("hdfs://hadoop001:8020/ss/logs")

    // Business logic
    val lines = ssc.socketTextStream("hadoop001", 9999)
    val results = lines.flatMap(_.split(",")).map((_, 1))
    val state = results.updateStateByKey(updateFunction)
    state.print()

    // Start the streaming job
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }

  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // Add the new values to the previous running count to get the new count
    val newCount = newValues.sum
    val pre = runningCount.getOrElse(0)
    Some(newCount + pre)
  }
}
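The update function itself is pure Scala, so its merge semantics can be checked without a cluster. A minimal sketch, feeding it two hypothetical batches for one key the way Spark Streaming would:

```scala
object UpdateFunctionDemo {
  // Same signature Spark Streaming invokes per key on every batch
  def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val pre = runningCount.getOrElse(0)
    Some(newCount + pre)
  }

  def main(args: Array[String]): Unit = {
    // First batch: the key appears 3 times, and there is no previous state yet
    val afterBatch1 = updateFunction(Seq(1, 1, 1), None)
    println(afterBatch1) // Some(3)

    // Second batch: the key appears twice more; the previous state is the last result
    val afterBatch2 = updateFunction(Seq(1, 1), afterBatch1)
    println(afterBatch2) // Some(5)
  }
}
```

Note that `runningCount` is an `Option` because a key may be appearing for the first time; `getOrElse(0)` handles that case.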
Problem:
You will notice that this produces many small files on HDFS, because a checkpoint is written under the checkpoint directory for every batch.
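One common mitigation (a sketch, not from the original post) is to checkpoint the stateful DStream less often than once per batch. `DStream.checkpoint(interval)` sets a per-stream checkpoint interval; the Spark Streaming tuning guide suggests 5-10x the batch interval. This fragment slots into the program above:

```scala
// Sketch: with a 10s batch interval, checkpoint the state only every 100s
// (10x the batch interval), so far fewer small files land on HDFS.
val state = results.updateStateByKey(updateFunction)
state.checkpoint(Seconds(100))
state.print()
```

Fewer checkpoints means more batches to replay on recovery, so this trades fault-recovery time for fewer files.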