Spring City Flowers Flying Everywhere: Xiao Bai Walks You Through SparkStreaming (Practical Application)

Written up front: The blogger is a sophomore majoring in big data application development in the software engineering department. The nickname comes from Alice in Wonderland and my own pet name. As a newcomer to the Internet industry, I write this blog partly to record my own learning journey and partly in the hope of helping other beginners who, like me, are just starting out. My level is limited, so mistakes in the blog are unavoidable; if you find any, please do let me know! Personal site: http://alices.ibilibili.xyz/ , blog homepage: https://alice.blog.csdn.net/
Although my current level may not match everyone else's, I still hope to keep doing better, because a single day's life is a microcosm of a lifetime. I hope to be the best version of myself in the best years of my life!

        Ever since the previous article, "Spring City Flowers Flying Everywhere: Xiao Bai Walks You Through SparkStreaming (Introduction to Principles)", I have been thinking about how to start the next one. After a few busy days, here it finally is.

        Writing all of this up is not easy, so please read it through and give it a like; make it a habit!


Chapter 3 Spark Streaming in Practice

3.1 WordCount

3.1.1 Requirements & Preparation

  • Diagram (figure omitted)

  • First of all, install the nc tool on the Linux server.
    nc is short for netcat. It was originally used to configure routers; we can use it to send data to a given port.
    yum install -y nc

  • Start a listener on the server and open port 9999, waiting for data to be sent to that port
    nc -lk 9999

  • Send data: type space-separated words (for example hadoop spark hadoop hive) into the nc session and press Enter

3.1.2 Code demo

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object Streaming01 {
  def main(args: Array[String]): Unit = {

    // 1. Create the SparkContext
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("wc")
    val sc: SparkContext = new SparkContext(conf)
    // Set the log level
    sc.setLogLevel("WARN")

    // 2. Create the StreamingContext and specify the duration of each batch
    val ssc: StreamingContext = new StreamingContext(sc,Seconds(5))

    // 3. Receive the data and process it
    val socketDatas: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)

    val WCS: DStream[(String, Int)] = socketDatas.flatMap(a=>a.split(" ")).map(a=>(a,1)).reduceByKey(_+_)

    // Print every element of each batch's RDD
    WCS.foreachRDD(rdd => rdd.foreach(println))

    // 4. Start the streaming computation
    ssc.start()

    // 5. Wait for termination
    ssc.awaitTermination()
  }
}

Run the program, then type a string of space-separated words into the command-line window, for example hadoop spark hadoop hive. You can then see output similar to the following in the IDEA console:
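For the example input above, each 5-second batch would print word-count tuples roughly like this (the order within a batch is not guaranteed):

(hadoop,2)
(spark,1)
(hive,1)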

This means that SparkStreaming has received the data sent to port 9999, performed a WordCount on it, and printed the result to the console.
        

3.2 updateStateByKey

3.2.1 Problem

In the case above there is a problem:
the word counts within each batch are correct, but the results are not accumulated across batches!

If you need accumulation, you have to use updateStateByKey(func) to maintain and update the state.

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object WordCount2 {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    //spark.master should be set as local[n], n > 1
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    //Set the log level
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc,Seconds(5))//5 means the data is split into one RDD every 5 seconds
    //Without a checkpoint the job fails with: requirement failed: ....Please set it by StreamingContext.checkpoint().
    //Note: below we use updateStateByKey to accumulate the current data with the historical data.
    //Where is that historical data kept? We need to set a checkpoint directory for it.
    ssc.checkpoint("./wc")   //in development/production this should point to HDFS
    //2. Listen on the socket and receive data
    //ReceiverInputDStream is all the received data organized as RDDs and wrapped in a DStream; operating on the DStream means operating on the underlying RDDs
    val dataDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
    //3. Process the data
    val wordDStream: DStream[String] = dataDStream.flatMap(_.split(" "))
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_,1))
    //val wordAndCount: DStream[(String, Int)] = wordAndOneDStream.reduceByKey(_+_)
    //==================== use updateStateByKey to accumulate current and historical data ====================
    val wordAndCount: DStream[(String, Int)] = wordAndOneDStream.updateStateByKey(updateFunc)
    wordAndCount.print()
    ssc.start()//start
    ssc.awaitTermination()//wait for a graceful stop

  }
  //currentValues: the values of the current batch, e.g. 1,1,1 (taking hadoop from the test data as an example)
  //historyValue: the accumulated history; the first time there is no value yet (0), the second time it is 3
  //The goal is to return current data + historical data as the new result (the history for the next batch)
  def updateFunc(currentValues: Seq[Int], historyValue: Option[Int]): Option[Int] = {
    // currentValues: current values; historyValue: accumulated history
    val result: Int = currentValues.sum + historyValue.getOrElse(0)
    Some(result)
  }
}
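To see what updateFunc does on its own, here is a minimal sanity check you could run in a Scala REPL (a hypothetical session, assuming the function above is in scope):

// First batch: three occurrences of a word, no history yet
updateFunc(Seq(1, 1, 1), None)     // returns Some(3)
// Next batch: three more occurrences, added to the accumulated history of 3
updateFunc(Seq(1, 1, 1), Some(3))  // returns Some(6)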

Demonstration:
Send data to port 9999 in several batches. You can see that each result accumulates on top of the previous one.

3.3 reduceByKeyAndWindow

3.3.1 Illustration

The computation process of the sliding-window transformation is illustrated in the figure below.
We can set the window length (how long the window covers) and the sliding interval (how often the computation is performed) in advance.

For example, set the window length to 24 hours and the sliding interval to 1 hour.
That means: compute the most recent 24 hours of data every hour.

(figure: sliding window illustration)
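As a rough sketch of that 24-hour example (not part of the demo below, and assuming the usual Spark Streaming imports plus a wordAndOneDStream of (word, 1) pairs like the one built in the examples in this post), the call might look like this:

// Sketch only: count words over the most recent 24 hours, recomputed every hour.
// Both durations are expressed with Minutes and must be multiples of the batch interval.
val dailyCounts: DStream[(String, Int)] =
  wordAndOneDStream.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b, // reduce function
    Minutes(24 * 60),          // window length: 24 hours
    Minutes(60))               // slide interval: 1 hour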

3.3.2 Code demo

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object WordCount3 {
  def main(args: Array[String]): Unit = {
    //1. Create the StreamingContext
    //spark.master should be set as local[n], n > 1
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val ssc = new StreamingContext(sc,Seconds(5))//5 means the data is split into one RDD every 5 seconds
    //2. Listen on the socket and receive data
    //ReceiverInputDStream is all the received data organized as RDDs and wrapped in a DStream; operating on the DStream means operating on the underlying RDDs
    val dataDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
    //3. Process the data
    val wordDStream: DStream[String] = dataDStream.flatMap(_.split(" "))
    val wordAndOneDStream: DStream[(String, Int)] = wordDStream.map((_,1))

    //4. Use the window function to do the WordCount
    //reduceFunc: (V, V) => V, the aggregation function
    //windowDuration: Duration, the window length/width
    //slideDuration: Duration, the sliding interval of the window
    //Note: windowDuration and slideDuration must be multiples of batchDuration
    //windowDuration = slideDuration: data is neither lost nor counted twice == used in development
    //windowDuration > slideDuration: data is counted repeatedly == used in development
    //windowDuration < slideDuration: data is lost
    //In this code:
    //windowDuration = 10
    //slideDuration = 5
    //so every 5 s the most recent 10 s of data are computed
    val wordAndCount: DStream[(String, Int)] = wordAndOneDStream.reduceByKeyAndWindow((a:Int,b:Int)=>a+b,Seconds(10),Seconds(5))

    wordAndCount.print()
    ssc.start()//start
    ssc.awaitTermination()//wait for a graceful stop
  }
}

Open the port with nc -lk 9999 and run the program. At first, enter no data; after waiting a few seconds, start typing strings, and you can observe that IDEA has begun doing a WordCount on the input data.

Over the next few seconds, increase the rate at which you type data.
You can see that the computed counts clearly grow, but when I stop entering data, the counts drop sharply until they fall back to where the program started.
Why is this?
In this case I set the window length windowDuration = 10 and the slide interval slideDuration = 5.
So the job computes the most recent 10 s of data every 5 s.

Therefore, no matter how you type, each computation only covers the most recent 10 s (with the batch interval set to 5 seconds, that is the range of two batches), which is why the results above look the way they do.

Note that windowDuration and slideDuration must be multiples of batchDuration.

windowDuration = slideDuration: data is neither lost nor counted twice (this case could also be achieved by simply increasing the Spark Streaming batch interval); used in development
windowDuration > slideDuration: data is counted repeatedly; used in development
windowDuration < slideDuration: data is lost; generally not used in development
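As a small sketch of the first case (a tumbling window, reusing wordAndOneDStream from the code above), setting the window length equal to the slide interval means every record is counted exactly once:

// Tumbling window: the window length equals the slide interval,
// so each record falls into exactly one window (no loss, no double counting).
val tumblingCounts: DStream[(String, Int)] =
  wordAndOneDStream.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(10))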


        Well, this article mainly covered the basic, practical usage of SparkStreaming. Friends who benefited from it, or who are interested in big data technology, please like, follow, and support (^U^)ノ~YO
