Spark Streaming (3): Window Operations and Join Operations

Window operations in Spark Streaming

As a stream-computing framework, Spark Streaming also provides window operations, which let you apply various transformation functions over a window of the stream. The mechanism is illustrated below.

[Figure: a sliding (hopping) window over a DStream]
To use a window, you must specify two parameters:

  • Window length: the length of the window in time (in the figure, the window length is 3 time units)
  • Sliding step: the interval at which the window slides forward (in the figure, the sliding step is 2 time units)

Note:

  • If you need a tumbling (rolling) window, simply set window length = sliding step (a short sketch follows these notes)
  • Session windows are not supported
  • Window computation is necessarily stateful
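As a minimal sketch of these two parameters (the object name, port, and durations are illustrative, assuming a local socket source and a 1-second batch interval), the window length and sliding step map directly onto the window operator, and making them equal turns the sliding window into a tumbling one:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowParamsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("window params sketch")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1s micro-batch

    val pairs = ssc.socketTextStream("localhost", 7777)
      .flatMap(_.split(","))
      .map((_, 1))

    // sliding window: length 3s, step 2s (the proportions shown in the figure)
    pairs.window(Seconds(3), Seconds(2)).print()

    // tumbling window: window length == sliding step
    pairs.window(Seconds(4), Seconds(4)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}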

Here we analyze the difference between the two versions of reduceByKeyAndWindow

  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): a more efficient reduceByKeyAndWindow operation

The reason:

Both methods reduce values by key over the window, and they produce exactly the same results; the difference is efficiency, and the second method is the more efficient one:

  • The first method: full recomputation (accumulates the results of all micro-batches that fall inside the window)
  • The second method: incremental computation (previous window's result + new data entering the current window - data that expired from the previous window = current window's result); see the sketch below
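A minimal sketch showing the two calls side by side (the checkpoint path, port, and durations are placeholders; note that the inverse-function version requires checkpointing to be enabled):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReduceByKeyAndWindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("reduceByKeyAndWindow sketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("file:///tmp/rbkw-checkpoint")   // placeholder path; required by the inverse-function version

    val pairs = ssc.socketTextStream("localhost", 7777)
      .flatMap(_.split(","))
      .map((_, 1))

    // version 1: recomputes the reduction over the whole window on every slide
    pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(5), Seconds(3)).print()

    // version 2: incremental -- previous result + data entering the window - data leaving the window
    pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // add values entering the window
      (a: Int, b: Int) => a - b,   // subtract values leaving the window
      Seconds(5), Seconds(3)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}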

Principle analysis:

DStream: 10 9 8 7 6 5 4 3 2 1
Micro-batches: b1: 1, b2: 2, b3: 3, ...

Window parameters: length 5, step 3

w1: 1 2 3 4 5
w2: 4 5 6 7 8
w3: 7 8 9 10 11
==========================================
The first method: reduceByKeyAndWindow with full recomputation

w1: 1 + 2 + 3 + 4 + 5 = 15 (5 operations)
w2: 4 + 5 + 6 + 7 + 8 = 30 (5 operations)
w3: ...

The second method: reduceByKeyAndWindow with incremental computation (previous window's result + new data of the current window - expired data of the previous window)

w1: 0 + 1 + 2 + 3 + 4 + 5 - 0 = 15 (6 operations)
w2: 15 + 6 + 7 + 8 - 1 - 2 - 3 = 30 (6 operations)
w3: 30 + 9 + 10 + 11 - 4 - 5 - 6 = 45 (6 operations)

Why is the second method more efficient?

The benefit shows up when the window length is large and the sliding step is small. For example:

Length: 100s, step: 1s

w1: 1 - 100
w2: 2 - 101
w3: 3 - 102
==========================================
Full recomputation:
w1: 1 + 2 + 3 + ... + 100 (100 operations)
w2: 2 + 3 + 4 + ... + 101 (100 operations)

Incremental computation:
w1: 1 + 2 + 3 + ... + 100 = 5050
w2: 5050 + 101 - 1 = result of the current window (2 operations)

Conclusion: when consecutive windows overlap heavily, the second (incremental) method is recommended; when the windows share little data, use the first.
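The arithmetic above can be reproduced in plain Scala, outside Spark, to make the cost difference concrete. This is only an illustrative sketch: the sequence, window length, and step mirror the example, and the operation counts are indicative.

object WindowCostSketch {
  def main(args: Array[String]): Unit = {
    val stream = (1 to 11).toList   // one value per micro-batch, as in the example above
    val length = 5                  // window length, in micro-batches
    val step = 3                    // sliding step, in micro-batches

    // window i covers indices [i * step, i * step + length)
    val starts = 0 to (stream.length - length) by step

    // method 1 -- full recomputation: sum every element of each window (about `length` operations per window)
    val full = starts.map(s => stream.slice(s, s + length).sum)

    // method 2 -- incremental: previous result + entering elements - expired elements (about 2 * step operations)
    val incremental = starts.tail.scanLeft(stream.take(length).sum) { (prev, s) =>
      val entering = stream.slice(s + length - step, s + length).sum
      val expired = stream.slice(s - step, s).sum
      prev + entering - expired
    }

    println(full.mkString(", "))          // 15, 30, 45
    println(incremental.mkString(", "))   // 15, 30, 45
  }
}

The complete Spark Streaming program below exercises the window transformation functions: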

package window.transformation

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Test the window transformation functions (6 of them)
 */
object TransformationOnWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount on window").setMaster("local[*]")
    // Note: the window size must be a multiple of the micro-batch interval
    // e.g. micro-batch 1s  -> window 5s
    // e.g. micro-batch 5s  -> window 10s, 15s, 20s, ...
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.sparkContext.setLogLevel("ERROR")

    // What if no checkpoint is set? (the program still runs, because the state is held in local memory)
    // Setting a checkpoint is recommended: it provides a remote replica of the local state
    ssc.checkpoint("hdfs://xxx:9000/checkpo")

    val lines = ssc.socketTextStream("localhost", 7777)

    /*
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // window() only defines the window; it does not process the data
      //.window(Seconds(10),Seconds(5))
      // the sliding step defaults to the batch interval (1s here)
      //.window(Seconds(10))

      // countByWindow counts the number of elements in the window
      //.countByWindow(Seconds(10),Seconds(5))
      .print()
     */

    /*
    lines
        .flatMap(_.split(","))
        .map(strNum => strNum.toInt)
        // reduceByWindow: reduce the elements of the window
        .reduceByWindow((v1:Int,v2:Int) => v1+v2,Seconds(10),Seconds(5))
        .print()
     */

    /*
    lines
      .flatMap(_.split(","))
      .map((_, 1))
      // first argument: incremental addition; second argument: subtraction of expired data
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(3))
      .print()
     */


    lines
      .flatMap(_.split(","))  // word
      // countByValueAndWindow counts occurrences of each word in the current window and returns (key, count)
      .countByValueAndWindow(Seconds(5), Seconds(3))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
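For a quick local test you can typically feed the socket with a tool such as netcat (nc -lk 7777) and type comma-separated words; each print() then shows the result for the current window.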

Question: why must window computation in Spark Streaming be stateful (i.e., why must a checkpoint be set)?

Because Spark Streaming splits the received data stream into batches (micro-batches), processes each micro-batch as an RDD with the Spark engine, and produces the final result stream. Underneath, a DStream is a sequence Seq[RDD], and each window contains several such micro-batches. The final result of a window is therefore computed from the results of the micro-batches it spans, which means that computing the next window still needs data from the current one.
In summary, window computation in Spark Streaming requires a checkpoint to be set and is necessarily stateful.

Join Operation

Window-based join of one DStream with another DStream
Note: to join a windowed DStream with another DStream, both must be pair (key-value) DStreams with the same key type

package join

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object DStreamAndDStreamJoinOnWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("streaming wordcount")

    val ssc = new StreamingContext(conf, Seconds(5))
    // effectively raise the log level (show only errors)
    ssc.sparkContext.setLogLevel("ERROR")

    // 2. build the source DStreams
    // build DStream objects from TCP sources to receive the socket data
    val w1 = ssc.socketTextStream("localhost",8888).map((_,1)).window(Seconds(10))
    val w2 = ssc.socketTextStream("localhost",7777).map((_,1)).window(Seconds(15))
    // event-time processing
    w1
      .join(w2)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
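If both sockets are fed (for example with netcat on ports 8888 and 7777), only the words currently present in both windows are joined, and each match is printed as a pair of the form (word, (1, 1)).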

Joining a DStream with an RDD (stream-to-batch join)

  • Here is a small case that shows what a stream-to-batch join is and when it is used (for example, filtering out sensitive words in an application and replacing them with **)
package join

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Sensitive-word filtering
 */
object DStreamAndRDDJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming wordcount")

    val ssc = new StreamingContext(conf, Seconds(5))
    // effectively raise the log level (show only errors)
    ssc.sparkContext.setLogLevel("ERROR")

    // the streaming data
    val messages = ssc.socketTextStream("localhost", 7777)

    // the batch RDD (blacklist of sensitive words)
    val words = ssc.sparkContext.makeRDD(List(("sb", 1), ("傻逼", 1)))

    messages
      .map((_, 1))
      .transform(rdd => {
        // join operation; a left outer join works best here
        rdd.leftOuterJoin(words)
      })
      .map(t2 => {
        var message = t2._1
        if(!t2._2._2.isEmpty){
          message = "**"
        }
        message
      })
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
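The transform operator is what makes this stream-to-batch join possible: it exposes each micro-batch as an ordinary RDD, which can then be joined with the static blacklist RDD. The left outer join keeps every incoming message, and only the messages whose word finds a match in the blacklist (a defined Option on the right-hand side of the join) are replaced with **.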

Origin blog.csdn.net/Mr_YXX/article/details/105035547