13. Spark Streaming

Spark Streaming

Spark Streaming is a framework for processing streaming data. The stream can come from many sources, such as Kafka, Flume, and so on. Its core abstraction is the DStream: a DStream is represented by a continuous series of RDDs, and any operation applied on a DStream translates to operations on the underlying RDDs.

Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
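A minimal sketch of that idea, assuming a hypothetical local socket source on port 9999: transform exposes the RDD behind each micro-batch, so a DStream operation is literally an RDD operation applied once per batch.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamAsRdds {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("dstreamAsRdds")
        val ssc = new StreamingContext(conf, Seconds(3))
        // Assumed source: a socket on localhost:9999
        val lines = ssc.socketTextStream("localhost", 9999)
        // transform hands over the RDD backing the current batch,
        // so this is an ordinary RDD transformation applied once per batch
        val words = lines.transform(rdd => rdd.flatMap(_.split(" ")))
        words.print()
        ssc.start()
        ssc.awaitTermination()
    }
}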

(figure: YARN scheduling)

Features

  • Easy to use: DStreams can be processed with the same high-level functions used on RDDs
  • Fault-tolerant: when a failure occurs, lost data can be recomputed just as with RDDs
  • Easy to integrate with the rest of the Spark ecosystem, for example handing the data to Spark SQL (see the sketch after this list)
  • It is really micro-batch processing: each run handles one batch of data, not true record-at-a-time streaming
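As a sketch of the Spark SQL integration mentioned above (hypothetical; it assumes a DStream[String] named words, such as the one built in the example below), each batch's RDD can be turned into a DataFrame and queried with SQL:

import org.apache.spark.sql.SparkSession

// Assumes an existing DStream[String] named `words`
words.foreachRDD { rdd =>
    // Reuse (or create) a SparkSession bound to the same SparkContext
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    // The batch's RDD becomes a DataFrame that Spark SQL can query
    val wordsDF = rdd.toDF("word")
    wordsDF.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
}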

Backpressure

Spark Streaming provides a feature called backpressure that eliminates the need to set a fixed rate limit: Spark Streaming automatically figures out the rate limits and dynamically adjusts them if the processing conditions change. Backpressure can be enabled by setting the configuration parameter spark.streaming.backpressure.enabled to true.
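A minimal sketch of enabling it through SparkConf (the initialRate setting is optional and, for receiver-based sources, only caps ingestion until the first feedback from the backpressure controller arrives):

val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("wordCount")
    .set("spark.streaming.backpressure.enabled", "true")      // let Spark Streaming adjust the receiving rate dynamically
    .set("spark.streaming.backpressure.initialRate", "1000")  // optional: bound the rate of the first batches
val ssc = new StreamingContext(conf, Seconds(3))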

WordCount Example

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object WordCount {
    def main(args: Array[String]): Unit = {
        // Create the SparkConf and a StreamingContext with a 3-second batch interval
        val conf = new SparkConf().setMaster("local[2]").setAppName("wordCount")
        val ssc = new StreamingContext(conf, Seconds(3))

        // Receive lines of text from a socket source
        val sourceStream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.6.100", 9999)
        // Split the lines into words and count each word within the batch
        val wordAndCount: DStream[(String, Int)] = sourceStream.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
        // Print up to 1000 elements of each batch
        wordAndCount.print(1000)
        // Start the streaming computation
        ssc.start()
        // Keep main from exiting until the computation is stopped
        ssc.awaitTermination()
    }
}
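To try the example, the text stream can be produced with netcat on the source host (for instance nc -lk 9999) and words typed into that session. Note that setMaster("local[2]") is deliberate: the socket receiver permanently occupies one thread, so at least two local threads are needed to leave one free for processing the batches.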


Reposted from blog.csdn.net/resilienter/article/details/104006602