Spark Streaming Part 3

The updateStateByKey operator

Requirement: count how many times each word has appeared so far (the previous state must be kept across batches).

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = ...  // add the new values with the previous running count to get the new count
    Some(newCount)
}
  1. If a stateful operator is used, a checkpoint directory must be configured so that the old state can be combined with the new values.
  2. In production, it is recommended to put the checkpoint directory in a folder on HDFS (see the sketch below).
  3. The argument passed to updateStateByKey is the method we define ourselves; an implicit conversion is involved.
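
A minimal sketch of point 2, assuming a hypothetical HDFS address and directory:

// Hypothetical namenode address and path; adjust to your cluster
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints/stateful-wordcount")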

The official documentation explains it as follows:

The update function will be called for each word, with newValues having a sequence of 1’s (from the (word, 1) pairs) and the runningCount having the previous count.

Note that using updateStateByKey requires the checkpoint directory to be configured, which is discussed in detail in the checkpointing section.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StatefulWordCount")
      .setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val lines = ssc.socketTextStream("192.168.1.6", 1111)
    // In production, it is recommended to put the checkpoint directory in a folder on HDFS.
    // A stateful operator requires a checkpoint directory so the old state can be combined with the new values.
    ssc.checkpoint(".")
    val result = lines.flatMap(_.split(" ")).map((_, 1))
    // The argument passed in is the method defined below; an implicit conversion is involved.
    val state = result.updateStateByKey[Int](updateFunction _)

    state.print()
    ssc.start()
    ssc.awaitTermination()
  }

  /**
    * Update the previously accumulated state with the values that arrived in the current batch.
    * @param newValues the values for this key in the current batch
    * @param preValues the previously accumulated count for this key, if any
    * @return the new accumulated count
    */
  def updateFunction(newValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val pre = preValues.getOrElse(0)
    Some(newCount+pre)
  }
}
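
To test locally, one can feed the socket with a tool such as nc (nc -lk 1111 on 192.168.1.6) and type words; the accumulated counts are printed every 5 seconds.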

Writing the results to a MySQL database

First, delete the checkpoint directory produced by the previous program.

This uses the foreachRDD operator.

The official documentation explains it as follows:

The most generic output operator that applies a function, func, 
to each RDD generated from the stream. 
This function should push the data in each RDD to an external system,
such as saving the RDD to files, or writing it over the network to a database. 
Note that the function func is executed in the driver process running the streaming application, 
and will usually have RDD actions in it that will force the computation of the streaming RDDs.
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. 
However, it is important to understand how to use this primitive correctly and efficiently

Common mistakes when using the foreachRDD operator include the following:

Serialization exception: the connection is created on the driver but used on the workers, and connection objects generally cannot be serialized and shipped to the workers.

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

Excessive cost: a new connection is created and torn down for every single record.

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

Optimized version 1: create one connection per partition.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

Final optimized version: reuse connections from a connection pool.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
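
The docs snippet above assumes a ConnectionPool helper, which Spark itself does not provide. A minimal hand-rolled sketch is shown below (illustrative only; in practice a pooling library such as HikariCP or DBCP is preferable), reusing the JDBC URL from the MySQL example that follows:

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Illustrative, lazily filled pool: reuse returned connections, create new ones on demand
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  def getConnection(): Connection = {
    val conn = pool.poll()
    if (conn != null) conn
    else {
      Class.forName("com.mysql.jdbc.Driver")
      DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "root")
    }
  }

  def returnConnection(conn: Connection): Unit = pool.offer(conn)
}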

Writing the results to MySQL

Create the table in the database:

create table wordcount(
word varchar(50) default null,
wordcount int(10) default null
);

Creating the MySQL connection (note: this helper opens a plain connection on every call rather than a real pool; see the summary below):

  // Requires: import java.sql.DriverManager
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "root")
  }

Writing the data into MySQL:

    result.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createConnection()
        partitionOfRecords.foreach { record =>
          val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "', " + record._2 + ")"
          connection.createStatement().execute(sql)
        }
        connection.close()
      }
    }

Finally, verify the result, e.g. by querying the wordcount table in the MySQL client.

Summary:

The results are written to MySQL with this SQL statement:

insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"

Remaining problems:

1) Existing rows are never updated; every batch is written as new inserts.

Possible improvements:

a) Before inserting, check whether the word already exists: update it if it does, insert it otherwise (see the sketch after this list).

b) In real projects: use HBase/Redis instead.

2) A connection is created for every RDD partition; it is recommended to switch to a connection pool (see the ConnectionPool sketch above).
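
A minimal sketch of improvement a), assuming a unique index has been added on the word column so that MySQL's insert ... on duplicate key update can do the update-or-insert in a single statement (it also uses a PreparedStatement instead of string concatenation):

// Assumes: alter table wordcount add unique key uk_word (word);
result.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createConnection()
    val pstmt = connection.prepareStatement(
      "insert into wordcount(word, wordcount) values(?, ?) " +
      "on duplicate key update wordcount = wordcount + values(wordcount)")  // accumulate; use values(wordcount) alone to overwrite
    partitionOfRecords.foreach { record =>
      pstmt.setString(1, record._1)
      pstmt.setInt(2, record._2)
      pstmt.executeUpdate()
    }
    pstmt.close()
    connection.close()
  }
}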

Using window operations

window: periodically process the data that falls within a certain time range.

window length: the duration of the window.

sliding interval: the interval at which the window operation is performed.

Both of these parameters must be multiples of the batch interval.

How often do we compute over what range? For example, compute the word count of the last 10 minutes every 10 seconds ==> every sliding interval, aggregate the values of the previous window length.

Example from the official documentation:

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
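
A sketch matching the 10-minute/10-second example described above, assuming a batch interval of 10 seconds so that both durations are multiples of it:

import org.apache.spark.streaming.{Minutes, Seconds}

// Every 10 seconds, reduce over the data of the last 10 minutes
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function
  Minutes(10),                // window length
  Seconds(10))                // sliding interval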

Blacklist filtering:

  • Convert the access log into a DStream
  • Convert the blacklist into an RDD
  • Left-join the DStream with the RDD and keep only the log entries whose users are not on the blacklist

Access log ==> DStream
20180808,zs
20180808,ls
20180808,ww
   ==> (zs: 20180808,zs)(ls: 20180808,ls)(ww: 20180808,ww)

Blacklist ==> RDD
zs
ls
   ==> (zs: true)(ls: true)

Expected output ==> 20180808,ww

leftOuterJoin
(zs: [<20180808,zs>, <true>])   x  filtered out
(ls: [<20180808,ls>, <true>])   x  filtered out
(ww: [<20180808,ww>, <false>])  ==> keep, and take the first element of the tuple

The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Blacklist filtering
  */
object TransformApp {


  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    /**
      * Creating a StreamingContext requires two parameters: a SparkConf and the batch interval
      */
    val ssc = new StreamingContext(sparkConf, Seconds(5))


    /**
      * Build the blacklist
      */
    val blacks = List("zs", "ls")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    val lines = ssc.socketTextStream("localhost", 6789)
    val clicklog = lines.map(x => (x.split(",")(1), x)).transform(rdd => {
      rdd.leftOuterJoin(blacksRDD)
        .filter(x=> x._2._2.getOrElse(false) != true)
        .map(x=>x._2._1)
    })

    clicklog.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
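
To test, send lines such as 20180808,zs, 20180808,ls and 20180808,ww to port 6789 (for example with nc -lk 6789); only the entries whose user is not on the blacklist, i.e. 20180808,ww here, should be printed.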

Integrating Spark SQL with Spark Streaming

Add the Spark SQL dependency to pom.xml:

    <!--SparkSQL-->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <!--
      <scope>provided</scope>
      -->
    </dependency>

The SparkSession is used as a lazily instantiated singleton, so the same session is reused for every micro-batch instead of being recreated in each foreachRDD call:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
  * Word count with Spark Streaming integrated with Spark SQL
  */
object SqlNetworkWordCount {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }


    ssc.start()
    ssc.awaitTermination()
  }


  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)


  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient  private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}

Verify the result: feed words into port 6789, and every 5-second batch prints a word/total table computed with Spark SQL.

Reposted from blog.csdn.net/sinat_37513998/article/details/82909922