Spark Streaming's transform operation: a real-time blacklist filtering example

The transform operation, applied to a DStream, runs an arbitrary RDD-to-RDD function on every batch. This makes it possible to express operations the DStream API does not provide directly. For example, the DStream API offers no way to join each batch of a DStream with a fixed RDD: DStream.join() can only join another DStream, pairing up the batch RDDs of the two streams as each batch is computed. With transform, we can implement the batch-to-static-RDD join ourselves.
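Conceptually, transform lifts any RDD => RDD function onto every micro-batch of the stream. The following non-Spark sketch illustrates that idea; the MicroBatchStream type is invented here purely for illustration (a DStream is not implemented this way), with each batch stood in for by a plain Seq:

```scala
// Toy stand-in for a DStream: a sequence of "batches", each a plain Seq.
// transform applies an arbitrary batch-to-batch function to every batch,
// just as DStream.transform applies an arbitrary RDD-to-RDD function.
case class MicroBatchStream[A](batches: Seq[Seq[A]]) {
  def transform[B](f: Seq[A] => Seq[B]): MicroBatchStream[B] =
    MicroBatchStream(batches.map(f))
}

object TransformSketch {
  def main(args: Array[String]): Unit = {
    val stream = MicroBatchStream(Seq(Seq(1, 2, 3), Seq(4, 5)))
    // An arbitrary per-batch function: keep even numbers, then double them.
    val out = stream.transform(batch => batch.filter(_ % 2 == 0).map(_ * 2))
    println(out.batches) // List(List(4), List(8))
  }
}
```

Because the function receives the whole batch, it can do anything an RDD can, including joining against data that lives outside the stream, which is exactly what the blacklist example below relies on.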
Example:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    val config = new SparkConf().setAppName("TransformDemo").setMaster("local[2]")
    val ssc = new StreamingContext(config, Seconds(2))
    // Define the blacklist
    val blackList = Array(("tom", true), ("jim", true))
    // Turn it into an RDD so each batch can be joined against it
    val blackListRDD = ssc.sparkContext.parallelize(blackList)
    // Define a socket input stream; each line has the form "name clickDate"
    ssc.socketTextStream("hadoop01", 8888).map(line => {
      val fields = line.split(" ")
      val name = fields(0)
      val clickDate = fields(1)
      (name, clickDate)
    }).transform(rdd => {
      // Blacklist filtering: left-outer-join the incoming batch with the blacklist RDD
      // (tom,2017-03-02) leftOuterJoin (tom,true)  ===> (tom,(2017-03-02,Some(true)))
      rdd.leftOuterJoin(blackListRDD).filter(tuple => {
        // Keep only records whose blacklist lookup came back empty (None),
        // i.e. names that are NOT blacklisted, such as (jom,(2017-09-09,None))
        tuple._2._2.isEmpty
      })
    }).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
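The join-then-filter step can be checked without a cluster. This sketch mimics the semantics of RDD.leftOuterJoin on plain Scala collections (the helper leftOuterJoin and the sample names/dates are made up for illustration, not part of the Spark API):

```scala
object BlacklistFilterSketch {
  // Mimic RDD.leftOuterJoin on plain pairs: every left-side record survives,
  // paired with Some(rightValue) when its key exists on the right, else None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
    val rightMap = right.toMap
    left.map { case (k, v) => (k, (v, rightMap.get(k))) }
  }

  def main(args: Array[String]): Unit = {
    val clicks = Seq(("tom", "2017-03-02"), ("leo", "2017-03-02"))
    val blackList = Seq(("tom", true), ("jim", true))
    // Keep only records whose blacklist lookup is empty, as in the filter above
    val allowed = leftOuterJoin(clicks, blackList).filter(_._2._2.isEmpty)
    println(allowed) // List((leo,(2017-03-02,None)))
  }
}
```

Running this shows why the filter predicate works: blacklisted names carry Some(true) after the join and are dropped, while everyone else carries None and passes through.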
