SparkStreaming_window_sparksql_redis

1.5 Window

Rolling (tumbling) window + sliding window

The window operation is Spark Streaming's windowing function. Spark Streaming supports sliding-window operations, which let us compute over the data that falls within a window. Each time the window slides, the RDDs that fall inside it are aggregated into a single computation, and the result becomes one RDD of the windowed DStream. For example, in the figure below, a window covers three seconds of data, so the three RDDs within those three seconds are aggregated and processed together; two seconds later, the window slides and the data of the most recent three seconds is processed again. Every sliding-window operation therefore requires two parameters, the window length and the sliding interval, and both values must be an integer multiple of the batch interval.

  1. The red rectangle is a window; the window holds the data that arrived within a period of time.

  2. Each tick here is one time unit. In the official example, the window size is 3 time units, and the window slides once every 2 time units.

Therefore, for window-based operations, two parameters need to be specified:

window length - The duration of the window (3 in the figure)

slide interval - The interval at which the window-based operation is performed (2 in the figure).

  1. The window size, personally, feels like a container for the data of a period of time.

  2. The sliding interval can be understood like a cron schedule: it determines how often the windowed computation fires. A minimal sketch of the generic window() transformation follows below.
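For intuition, here is a minimal sketch (not from the original case; it assumes a StreamingContext ssc with a 2-second batch interval and the same socket source as the case below) of the generic window() transformation, which regroups the recent data without aggregating it:

val lines = ssc.socketTextStream("qianfeng01", 6666)
// window() emits, every 4 seconds, an RDD covering the last 6 seconds of data
val windowed = lines.window(
  Seconds(6),  // window length: 3 batches
  Seconds(4))  // slide interval: 2 batches
windowed.count().print()  // how many records arrived in the last 6 seconds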

Case implementation

package com.qianfeng.sparkstreaming
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
  * Counts how many times each key has appeared within the current window.
  * Window operation: every M seconds, process the data generated in the last N seconds, where
  * M is the slide interval and
  * N is the window length.
  */
object Demo05_WCWithWindow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountWithWindow")
      .setMaster("local[*]")
    val batchInterval = 2
    val duration = Seconds(batchInterval)
    val ssc = new StreamingContext(conf, duration)
    val lines:DStream[String] = ssc.socketTextStream("qianfeng01", 6666)
    val pairs:DStream[(String, Int)] = lines.flatMap(_.split("\\s+")).map((_, 1))
    // window length 6s, slide interval 4s: both integer multiples of the 2s batch interval
    val ret: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(_ + _,
      windowDuration = Seconds(batchInterval * 3),
      slideDuration = Seconds(batchInterval * 2))
    ret.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
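Every 4 seconds this prints the word counts computed over the last 6 seconds of input. For long windows, reduceByKeyAndWindow also has an overload that takes an inverse-reduce function: on each slide it adds only the batches entering the window and "subtracts" the batches leaving it, instead of re-reducing the whole window. A minimal sketch (the checkpoint path is a placeholder; this variant requires checkpointing):

// Incremental variant: add entering batches, subtract leaving ones
ssc.checkpoint("/tmp/chk-window")  // placeholder path; state must be checkpointed
val incremental: DStream[(String, Int)] = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // fold batches that enter the window
  (a: Int, b: Int) => a - b,  // remove batches that leave the window
  Seconds(batchInterval * 3),
  Seconds(batchInterval * 2))
incremental.print()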

1.6 Integrating Spark SQL with Spark Streaming

One of Spark's greatest strengths is that its components can be combined: Spark Streaming integrates directly with Spark Core and Spark SQL. We have already seen, through operators such as transform and foreachRDD, how to apply Spark Core batch operations to the RDDs inside a DStream. Now let's look at how to use Spark SQL on the RDDs inside a DStream.
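The general skeleton looks like this (a sketch with hypothetical names, assuming an existing SparkSession spark and a DStream[String] dstream): inside foreachRDD, convert each RDD to a DataFrame, register a temp view, and run SQL over it.

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import spark.implicits._
    val df = rdd.toDF("word")  // hypothetical one-column schema
    df.createOrReplaceTempView("words")  // hypothetical view name
    spark.sql("select word, count(*) as cnt from words group by word").show()
  }
}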

Case: top-3 product ranking (the latest top 3)

Based on updateStateByKey, this counts cumulative product sales so far and ranks the top 3 brands within each category.

Code

package com.qianfeng.sparkstreaming
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
/**
 * Case study of Spark Streaming integrating Spark SQL: top-3 ranking of popular categories.
 * Input data format:
 * id brand category
 * 1 huawei watch
 * 2 huawei phone
 */
object Demo06_SQLWithStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StreamingIntegerationSQL")
      .setMaster("local[*]")
    val batchInterval = 2
    val duration = Seconds(batchInterval)
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, duration)
    ssc.checkpoint("/Users/liyadong/data/sparkdata/streamingdata/chk-1")
    val lines:DStream[String] = ssc.socketTextStream("qianfeng01", 6666)
    // sample input line (id brand category), e.g. 001 huawei phone
    val pairs:DStream[(String, Int)] = lines.map(line => {
      val fields = line.split("\\s+")
      if(fields == null || fields.length != 3) {
        ("", -1)
      } else {
        val brand = fields(1)
        val category = fields(2)
        (s"${category}_${brand}", 1)
      }
    }).filter(t => t._2 != -1)
    val usb:DStream[(String, Int)] = pairs.updateStateByKey(updateFunc)
    usb.foreachRDD((rdd, bTime) => {
      if(!rdd.isEmpty()) {//category_brand count
        import spark.implicits._
        val df = rdd.map{case (cb, count) => {
          val category = cb.substring(0, cb.indexOf("_"))
          val brand = cb.substring(cb.indexOf("_") + 1)
          (category, brand, count)
        }}.toDF("category", "brand", "sales")
        df.createOrReplaceTempView("tmp_category_brand_sales")
        val sql =
          """
            |select
            |  t.category,
            |  t.brand,
            |  t.sales,
            |  t.rank
            |from (
            |  select
            |    category,
            |    brand,
            |    sales,
            |    row_number() over(partition by category order by sales desc) rank
            |  from tmp_category_brand_sales
            |) t
            |where t.rank < 4
          """.stripMargin
        spark.sql(sql).show()
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
  // seq: this batch's new values for the key; option: the previously accumulated state
  def updateFunc(seq: Seq[Int], option: Option[Int]): Option[Int] = {
    Option(seq.sum + option.getOrElse(0))
  }
}
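Because updateStateByKey is stateful, the ssc.checkpoint(...) call is required: Spark periodically snapshots the per-key totals so they can be recovered after a failure. On each batch, updateFunc receives the key's new counts (seq) plus its previous total (option), so tmp_category_brand_sales always holds cumulative sales; row_number() then ranks brands within each category, and t.rank < 4 keeps the top 3. For example, after the inputs "1 huawei phone", "2 apple phone", "3 huawei phone", the phone category would rank huawei first with sales 2.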

1.7 Spark Streaming integrates Redis

// Write the real-time results to Redis
dStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    val jedis = new Jedis("192.168.10.101", 6379)  // better extracted to a common place, e.g. a connection pool
    jedis.auth("root")
    partition.foreach { case (w, c) =>
      jedis.set(w.toString, c.toString)  // if one key maps to multiple values, consider hset
    }
    jedis.close()
  })
})
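Opening a connection per partition per batch still has overhead. A common refinement (a sketch; the singleton object and credentials here are illustrative, not from the original post) is to share one JedisPool per executor JVM through a lazily initialized singleton:

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

// Hypothetical helper: one pool per executor JVM, created lazily on first use
object RedisPool {
  lazy val pool = new JedisPool(new JedisPoolConfig(), "192.168.10.101", 6379)
  def getResource: Jedis = pool.getResource
}

dStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    val jedis = RedisPool.getResource
    try {
      jedis.auth("root")
      partition.foreach { case (w, c) => jedis.set(w.toString, c.toString) }
    } finally {
      jedis.close()  // returns the connection to the pool
    }
  })
})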

