Understanding the Spark Streaming Documentation

0)Abstract

  This post walks through the Spark Streaming 2.2.0 programming guide (http://spark.apache.org/docs/2.2.0/streaming-programming-guide.html), implements a few of the operators in code, and ends with how to integrate Spark Streaming with Spark SQL.

1)DStream

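  The most basic abstraction in Spark Streaming is the discretized stream (DStream), which represents a continuous stream of data.

  DStreams: a DStream is made up of a series of RDDs, one per batch, and the processing of those RDDs is handed to Spark Core.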

2)Input DStreams and Receivers

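  Input DStreams: DStreams that represent the stream of data received from a source.

  Note: every input DStream except a file stream needs a Receiver, which receives the data and stores it in memory until Spark processes it. When a Spark Streaming program runs locally with "local" or "local[1]", only one thread is started, and the receiver occupies it, so no thread is left to process the data. For receiver-based input streams, therefore, make sure that n (the number of threads) > the number of receivers. If the source is a socket, Kafka, Flume, or similar, do not use "local[1]" or "local".

  A Receiver is a long-running component of Spark Streaming. It runs as a long-running task on an Executor, and each Receiver is responsible for one input DStream. The Receiver takes in the data sent by the source and hands it to Spark Core for processing.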
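
  As a minimal sketch of the thread-count rule above, the program below reads from a single socket receiver and therefore asks for two local threads: one for the receiver and one for processing. The host `localhost`, the port `9999`, the object name `ReceiverThreadsExample`, and the helper `localMaster` are assumptions chosen for illustration; they do not come from the original post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverThreadsExample {
  // Smallest local master that still leaves one thread free for processing
  // when `receivers` receiver-based input streams are running.
  def localMaster(receivers: Int): String = s"local[${receivers + 1}]"

  def main(args: Array[String]): Unit = {
    // One socket receiver, so the master becomes "local[2]".
    val conf = new SparkConf()
      .setAppName("ReceiverThreadsExample")
      .setMaster(localMaster(receivers = 1))
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source: a text server on localhost:9999 (for example `nc -lk 9999`).
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

  With "local[1]" the same program would start and receive data, but no batch would ever be processed, because the only thread is held by the receiver.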

3)Transformations on DStreams (operators)

  • map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup, transform
  • updateStateByKey
  • Window Operations
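
  The two stateful entries in the list above can be sketched as follows. This is an illustration rather than code from the original post: the socket source on `localhost:9999`, the checkpoint directory `/tmp/streaming-checkpoint`, and the names `StatefulWordCount` and `updateCount` are all assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  // Update function for updateStateByKey: add the counts seen for a key in
  // this batch to the running total kept from earlier batches.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // updateStateByKey needs a checkpoint directory (assumed path).
    ssc.checkpoint("/tmp/streaming-checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Running count per word since the application started.
    pairs.updateStateByKey[Int](updateCount _).print()

    // Window operation: counts over the last 30 seconds, recomputed every
    // 10 seconds. Both durations are multiples of the 5-second batch interval.
    pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

  Keeping the update function as a plain method makes it easy to check on its own, without starting a StreamingContext.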

4)Output Operations on DStreams (output operations)

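  Output operations push a DStream's results out to external systems such as a database or a file system. They also trigger the actual execution of the DStream transformations, much like actions on RDDs.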
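
  A short sketch of two built-in output operations, `print()` and `saveAsTextFiles`, is given here for illustration. The socket source on `localhost:9999`, the output prefix `/tmp/wordcounts/batch`, and the names `OutputOperationsExample` and `formatCount` are assumptions, not part of the original post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object OutputOperationsExample {
  // Render one (word, count) pair as a tab-separated line for the text output.
  def formatCount(word: String, count: Int): String = s"$word\t$count"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OutputOperationsExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // print(): show the first ten elements of every batch on the driver.
    counts.print()

    // saveAsTextFiles(prefix, suffix): one output directory per batch,
    // named "<prefix>-<batch time in ms>.<suffix>" (assumed prefix below).
    counts.map { case (word, n) => formatCount(word, n) }
      .saveAsTextFiles("/tmp/wordcounts/batch", "txt")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

  If neither output operation were registered, the transformations above would never run.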

6)DataFrame and SQL Operations (integrating Spark SQL)

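  How to integrate Spark SQL: inside foreachRDD, obtain a SparkSession, convert each batch's RDD to a DataFrame, register it as a temporary view, and query it with SQL.

  The code below is taken from the official Spark examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala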

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
 * @Author: SmallWild
 * @Date: 2019/10/26 17:29
 * @Desc: Word count on a socket text stream using DataFrames and SQL
 */
object SqlNetworkWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    // pro.ADDRESS is the author's own constant holding the socket host name;
    // it is not defined in this listing, so substitute your own host here.
    val lines = ssc.socketTextStream(pro.ADDRESS, 1884)
    val words = lines.flatMap(_.split(" "))
    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      // Get the singleton instance of SparkSession
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }

  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)

  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}

8)Using foreachRDD

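  Purpose: foreachRDD is a powerful primitive that allows data to be sent out to external systems.

  Correct usage: call foreachPartition inside foreachRDD to get the records of each partition, so that a connection is created once per partition on the worker rather than once per record: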

  

// Without a connection pool: a new connection for every partition
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

// With a connection pool: connections are reused across batches
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
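
  The `ConnectionPool` used above is a placeholder, not a Spark class. One possible shape for it is sketched below, assuming a hypothetical `Connection` class whose `send` only records what it was given; a real pool would wrap a database or message-queue client and would also cap its size and close idle connections.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a real client connection.
class Connection {
  val sent: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def send(record: String): Unit = sent += record
}

// A Scala object is initialized lazily, once per JVM, so each executor
// builds its own pool the first time a task on it touches ConnectionPool.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[Connection]()

  // Reuse an idle connection if one exists, otherwise create a new one.
  def getConnection(): Connection =
    Option(idle.poll()).getOrElse(new Connection)

  // Hand the connection back so that a later partition can reuse it.
  def returnConnection(connection: Connection): Unit = {
    idle.offer(connection)
    ()
  }
}
```

  Because the pool lives in an object rather than in a value captured by the closure, nothing connection-related has to be serialized and shipped from the driver to the workers.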
 

7)Conclusion

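  Spark Streaming cannot deliver millisecond-level stream processing, because it handles data in batches. When millisecond-level latency is required, a dedicated stream-processing framework such as Storm is still needed.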


Reposted from www.cnblogs.com/truekai/p/11729767.html