The three execution scenarios
- Stateless operations
- Stateful operations
  - updateStateByKey
  - window
Scenario 1 (stateless operations):
A stateless operation computes only the contents of the current time slice; for example, each batch processes only the RDD of data produced within a one-second time slice.
Scenario 2 (stateful operations: updateStateByKey):
A stateful operation keeps accumulating the current time slice's RDD with those of historical time slices, so the volume of data involved in the computation grows as time passes.
Scenario 3 (window):
A window operation slides over a fixed time span at a fixed interval. For example, with a one-second time slice, we might count the data Spark Streaming produced over the last 10 minutes, refreshing the result every 2 minutes.
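That 10-minute/2-minute example maps directly onto the window parameters of the DStream API. A minimal sketch, assuming a 1-second batch interval and a DStream of (word, 1) pairs named pairs (the full examples below use shorter durations so the output is easier to observe):

import org.apache.spark.streaming.Minutes

// Count words over the last 10 minutes, updating every 2 minutes; both
// durations must be multiples of the batch interval.
val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Minutes(2))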
WordCount examples for the three scenarios
- Stateless operations
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamWordCount {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the Spark configuration
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamWordCount")
    // 2. Initialize the StreamingContext with a 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // 3. Create a DStream from a socket; each record read in is one line of text
    val lineStreams = ssc.socketTextStream("node02", 9999)
    // Split each line into individual words
    val wordStreams = lineStreams.flatMap(_.split(" "))
    // Map each word to a (word, 1) tuple
    val wordAndOneStreams = wordStreams.map((_, 1))
    // Sum the counts of identical words within the current batch
    val wordAndCountStreams = wordAndOneStreams.reduceByKey(_ + _)
    // Print the result
    wordAndCountStreams.print()
    // Start the StreamingContext
    ssc.start()
    ssc.awaitTermination()
  }
}
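To try this locally, you can point socketTextStream at "localhost" instead of "node02" and feed the port with a tool such as netcat (nc -lk 9999); each 5-second batch then prints the counts for the words received during that interval alone, since no state is carried between batches.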
- Stateful operations: updateStateByKey
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // State update function: values holds this word's counts from the current
    // batch, state holds its accumulated count from all previous batches
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum // equivalent to values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    // updateStateByKey requires a checkpoint directory to persist the state
    ssc.checkpoint("hdfs://node02:9000/streamCheck")
    // Create a DStream that connects to hostname:port, e.g. node02:9999
    val lines = ssc.socketTextStream("node02", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    // Use updateStateByKey to maintain, per word, the running total since the
    // application started
    val stateDstream = pairs.updateStateByKey[Int](updateFunc)
    stateDstream.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
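To see what updateFunc computes per key, here is a minimal plain-Scala sketch (no Spark required); the batch values are made up for illustration:

val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))
assert(updateFunc(Seq(1, 1, 1), None) == Some(3)) // first batch: a word seen 3 times, no prior state
assert(updateFunc(Seq(1, 1), Some(3)) == Some(5)) // next batch: 2 new occurrences added to the state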
- Stateful operations: window
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Checkpoint directory (needed only by the incremental window variant
    // shown after this example)
    ssc.checkpoint(".")
    // Create a DStream that connects to hostname:port, e.g. hadoop102:9999
    val lines = ssc.socketTextStream("hadoop102", 9999)
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    // Reduce over a 12-second window sliding every 6 seconds; both durations
    // must be multiples of the 3-second batch interval
    val wordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(12), Seconds(6))
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
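A note on the window reduce: the form used above recomputes each window from scratch. reduceByKeyAndWindow also has an incremental form that takes an inverse reduce function, adding the batches that enter the window and subtracting the ones that leave it; this is the form that actually needs the checkpoint directory set above. A minimal sketch, reusing pairs from the example:

// Incremental window reduce: requires ssc.checkpoint(...) to be set
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // counts entering the window
  (a: Int, b: Int) => a - b, // counts leaving the window
  Seconds(12),               // window length
  Seconds(6)                 // slide interval
)
windowedCounts.print()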