1. Concept
Process the data from the last window-length's worth of time (window length) once every fixed interval (sliding interval).
[Reference: http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html]
2. Core parameters
(1) window length: the length of the window (3 batch intervals in the official guide's diagram)
(2) sliding interval: the interval at which the window slides (2 batch intervals in the official guide's diagram)
(3) Both parameters must be integer multiples of the Streaming batch interval, otherwise an error is thrown! (See the sketch after this list.)
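For example, with a batch interval of Seconds(2), a window length of Seconds(6) (3 batches) and a sliding interval of Seconds(4) (2 batches) are both valid, while Seconds(5) would fail at runtime. A minimal sketch using the generic window() transformation; the hostname, port, and object name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowIntervalDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowIntervalDemo")
    val ssc = new StreamingContext(conf, Seconds(2)) // batch interval: 2s
    val lines = ssc.socketTextStream("localhost", 9999)
    // Window length 6s and sliding interval 4s are both integer multiples of the 2s batch interval.
    // window(Seconds(5), Seconds(4)) would throw an exception, since 5s is not a multiple of 2s.
    val windowed = lines.window(Seconds(6), Seconds(4))
    windowed.print()
    ssc.start()
    ssc.awaitTermination()
  }
}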
3. Example (official)
Compute the last 30 seconds of data every 10 seconds:
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
[Note:]
Seconds(30), // window length: how far back in time to aggregate; must be an integer multiple of the parent DStream's batch interval
Seconds(10)  // sliding interval: how often the new windowed DStream produces a batch; must be an integer multiple of the parent DStream's batch interval
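The API also provides a more efficient variant of reduceByKeyAndWindow that takes an inverse reduce function: on each slide it adds the batches entering the window and subtracts the batches leaving it, instead of recomputing the whole window. This variant requires checkpointing to be enabled. A sketch, reusing the pairs DStream from the official example:

// Incremental window reduce: add entering data, subtract leaving data.
// Requires ssc.checkpoint(...) to have been set beforehand.
val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce: fold new values into the window
  (a: Int, b: Int) => a - b, // inverse reduce: remove values that slid out of the window
  Seconds(30),               // window length
  Seconds(10)                // sliding interval
)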
4. Example code
(1) Source
package _0809kafka

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Administrator on 2018/10/20.
 */
object WindowsReduceStream_simple_1020 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WindowsReduceStream_simple_1020")
    val sc = new SparkContext(sparkConf)
    // Batch interval: 2 seconds.
    val ssc = new StreamingContext(sc, Seconds(2))
    // Checkpoint directory for the long-running window operation.
    val checkpointPathDir = "file:///E:\\Tools\\WorkspaceforMyeclipse\\scalaProjectMaven\\streaming_08"
    ssc.checkpoint(checkpointPathDir)
    val dstream = ssc.socketTextStream("bigdata.ibeifeng.com", 9999)
    // Per-batch word counts.
    val batchResultDStream = dstream.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    val resultDStream: DStream[(String, Int)] = batchResultDStream.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,
      Seconds(6), // window length: how far back in time to aggregate; must be an integer multiple of the parent DStream's batch interval
      Seconds(2)  // sliding interval: how often the windowed DStream produces a batch; must be an integer multiple of the parent DStream's batch interval
    )
    resultDStream.print()
    ssc.start() // start the streaming computation
    ssc.awaitTermination()
  }
}
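The same windowed counts can also be produced, somewhat less efficiently, by windowing first and reducing afterwards; reduceByKeyAndWindow first reduces each batch and then combines the per-batch partial results, so it shuffles less data. A sketch, assuming the same dstream as in the source above:

// Equivalent result, but the raw window is materialized before reducing.
val alternative: DStream[(String, Int)] = dstream
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .window(Seconds(6), Seconds(2)) // window length 6s, sliding interval 2s
  .reduceByKey(_ + _)
alternative.print()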
(2) Test
-》Open port 9999:
nc -lk 9999
-》Run the program
-》Result:
-------------------------------------------
Time: 1540020870000 ms
-------------------------------------------
(hadoophadoop,15)
(hadoop,60)
(ccs,45)