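0) Abstract

This post works through the Spark Streaming 2.2.0 documentation (http://spark.apache.org/docs/2.2.0/streaming-programming-guide.html), implements a few of the operators in code, and ends with how to integrate Spark Streaming with Spark SQL.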

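1) DStream

The DStream (discretized stream) is the most basic abstraction in Spark Streaming; it represents a continuous stream of data.

A DStream is made up of a sequence of RDDs, one per batch, and the actual processing is handed to Spark Core.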

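2) Input DStreams and Receivers

Input DStreams: the streams of data received from a source.

Note: every input DStream except a file stream needs a receiver that receives the data and keeps it in memory until Spark processes it. When a Spark Streaming program runs locally with "local" or "local[1]", only one thread is started, and the receiver occupies it, so no thread is left to process the data. For input streams that use receivers, make sure that n (the number of threads) > the number of receivers. With sockets, Kafka, Flume, or similar sources, do not use "local" or "local[1]".

Receivers are long-running components in Spark Streaming: each one runs as a long-running task on an executor and is responsible for one input DStream. A receiver takes in the data sent by its source and hands it to Spark Core for processing.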
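The thread rule above can be sketched as follows. This is a minimal illustration, not code from the guide: the helper `localMaster`, the object name, and the host and port ("localhost", 9999) are assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverThreads {
  // Hypothetical helper: smallest local master string that still leaves
  // one thread for processing when `receivers` receivers are running.
  def localMaster(receivers: Int): String = s"local[${receivers + 1}]"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ReceiverThreads")
      .setMaster(localMaster(receivers = 1)) // evaluates to "local[2]"
    val ssc = new StreamingContext(conf, Seconds(5))

    // One socket stream means one receiver, which occupies one thread for good.
    val lines = ssc.socketTextStream("localhost", 9999) // assumed host and port
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With "local" or "local[1]" the same program would start, but the single thread would be taken by the receiver and no batch would ever be processed.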

3) Transformations on DStreams (operators)

map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup, transform, updateStateByKey, and the window operations
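Two of the less obvious operators in this list, updateStateByKey and a window operation, might be used as in the sketch below. The object name, the `updateCount` function, the checkpoint directory, and the host and port are assumptions for illustration; the window and slide durations are arbitrary multiples of the 5-second batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  // Adds the counts seen in the current batch to the running total of a key.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey requires a checkpoint directory (assumed path).
    ssc.checkpoint("checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999) // assumed host and port
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Running total per word across all batches seen so far.
    val totals = pairs.updateStateByKey[Int](updateCount _)
    totals.print()

    // Count per word over the last 30 seconds, recomputed every 10 seconds.
    val windowed =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The update function is kept as a plain method so that it can be checked without starting a streaming context.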

4) Output Operations on DStreams

Output operations push the results of a DStream out to external systems such as a file system or a database.
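As a rough sketch of two built-in output operations, print and saveAsTextFiles, consider the word count below. The object name, the `formatLine` helper, the output prefix "output/wordcounts", and the host and port are assumptions for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OutputOps {
  // Hypothetical helper: renders one result as a tab-separated line.
  def formatLine(word: String, count: Long): String = s"$word\t$count"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OutputOps").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val wordCounts = ssc.socketTextStream("localhost", 9999) // assumed host and port
      .flatMap(_.split(" "))
      .countByValue() // DStream[(String, Long)]

    // Prints the first elements of every batch on the driver.
    wordCounts.print()

    // Writes one directory per batch, named from the prefix, the batch time, and the suffix.
    wordCounts
      .map { case (word, count) => formatLine(word, count) }
      .saveAsTextFiles("output/wordcounts", "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Nothing in a streaming job runs until at least one output operation is registered, since DStream transformations are lazy in the same way RDD transformations are.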

6) DataFrame and SQL Operations (integrating Spark SQL)

How to integrate Spark Streaming with Spark SQL.

The code below is based on the official Spark example at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
 * @Author: SmallWild
 * @Date: 2019/10/26 17:29
 * @Desc:
 */
object SqlNetworkWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    // "localhost" is a stand-in for the author's own host constant (pro.ADDRESS)
    val host = "localhost"
    val lines = ssc.socketTextStream(host, 1884)
    val words = lines.flatMap(_.split(" "))
    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      // Get the singleton instance of SparkSession
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }

  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)

  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}


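8) Using foreachRDD

Purpose: foreachRDD is a powerful primitive that allows data to be sent out to external systems.

Correct usage: call foreachPartition inside foreachRDD to get the records of each partition, so that a connection is created once per partition on the executor rather than once per record or on the driver.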

// Without a connection pool: a new connection is created for every partition
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}
// With a connection pool
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
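The snippets above use ConnectionPool without defining it. One possible shape is sketched below; it is an assumption, not part of Spark or of the guide. The Connection class is a hypothetical stand-in for a real client (a socket, a JDBC connection, and so on), and the pool is deliberately unbounded to keep the example short.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical connection type; a real one would wrap a socket or a database client.
class Connection {
  def send(record: String): Unit = println(s"sent: $record")
  def close(): Unit = ()
}

// A Scala object is initialized lazily and once per JVM,
// so every executor ends up with its own pool.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[Connection]()

  // Reuses an idle connection when one exists, otherwise opens a new one.
  def getConnection(): Connection = {
    val pooled = idle.poll() // null when the queue is empty
    if (pooled != null) pooled else new Connection
  }

  // Puts the connection back so that a later partition can reuse it.
  def returnConnection(connection: Connection): Unit = {
    idle.offer(connection)
    ()
  }
}
```

Because the pool lives in a singleton object rather than in a variable captured by the closure, nothing has to be serialized from the driver to the executors. A production pool would also cap its size and close connections that stay idle for too long.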

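7) Conclusion

Spark Streaming processes data in micro-batches, so it cannot deliver millisecond-level stream processing. When millisecond latency is required, a record-at-a-time stream processing framework such as Storm is still needed.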

Understanding the Spark Streaming documentation