Transformations on DStreams: using transform to implement blacklist / targeted filtering

Official docs: http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
transform(func)
Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
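
As a minimal sketch (the wordCounts stream and the sortBy logic here are purely illustrative, not part of the examples below), transform exposes the underlying RDD of each batch, so any RDD API can be used inside a DStream pipeline:

// hypothetical DStream[(String, Int)] of per-batch word counts;
// sort each batch by count using the plain RDD API
val sorted = wordCounts.transform(rdd => rdd.sortBy(_._2, ascending = false))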

RDD implementation

package g5.learning

import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

object BlackListApp {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
      .setMaster("local[2]").setAppName("BlackListApp")
    val sc = new SparkContext(sparkConf)

    // simulated input data
    val input = new ListBuffer[(String, String)]
    input.append(("doudou", "doudou info"))
    input.append(("huahua", "huahua info"))
    input.append(("zhang", "zhanginfo"))
    input.append(("xiaoxiao", "xiaoxiao info"))
    val inputRDD = sc.parallelize(input) // convert to an RDD

    // blacklist: keys flagged true should be filtered out
    val blackTuple = new ListBuffer[(String, Boolean)]
    blackTuple.append(("doudou", true))
    val blackRDD = sc.parallelize(blackTuple)

    // left outer join on the key, then keep only records that are not blacklisted
    inputRDD.leftOuterJoin(blackRDD).filter(x => {
      !x._2._2.getOrElse(false)
    }).map(_._2._1).collect().foreach(println) // prints huahua / zhang / xiaoxiao; doudou is filtered out

    sc.stop()
  }
}

transform implementation

package g5.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.ListBuffer

object TransformApp {

  def main(args: Array[String]): Unit = {

    // setup
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    // blacklist RDD; note there is no sc here, so use ssc.sparkContext to get the SparkContext
    val blackTuple = new ListBuffer[(String, Boolean)]
    blackTuple.append(("doudou", true))
    val blackRDD = ssc.sparkContext.parallelize(blackTuple)

    // business logic
    val lines = ssc.socketTextStream("hadoop001", 9999)
    lines.map(x => (x.split(",")(0), x)).transform(rdd => {
      rdd.leftOuterJoin(blackRDD).filter(x => {
        !x._2._2.getOrElse(false)
      }).map(_._2._1)
    }).print()

    // start the streaming application
    ssc.start()            // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}

Before starting the application, run nc -lk 9999 to open the socket the job reads from.
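
For example (the lines below are just illustrative input typed into the nc session; the part before the comma is the key matched against the blacklist):

doudou,doudou login
huahua,huahua login

After the 10-second batch interval, only huahua,huahua login should be printed, because doudou is on the blacklist.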

Key point:

The transform operator is what ties RDDs and streaming together: the join against a static blacklist RDD cannot be expressed with DStream operations alone.

Extension:

For a long-running job, put the filter configuration in a database table and have the job read it periodically; whatever needs to go live is simply added to that table.
Suppose there will eventually be 2,000 entries and you add one today, two tomorrow, and so on. You obviously do not have to wait until all 2,000 are in the table before going live; for that scenario the table alone is not enough, and you need a switch.
Likewise, once all 2,000 are live and the whole feature is proven, another 50 arriving tomorrow should not force you back into the database; later additions go through the switch instead.
What does the switch mean? It controls whether the configuration is read at all: 0 or 1.
--conf spark.filter.switch carries the switch
--conf spark.filter.domains carries the domains

The benefit is that each new release no longer requires configuring that table. This pattern is used very heavily in production.
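
A minimal sketch of that switch idea, assuming the spark.filter.switch and spark.filter.domains properties mentioned above are passed via --conf, and with loadBlackListFromDB() as a hypothetical stand-in for reading the configuration table:

package g5.learning

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformSwitchApp {

  // hypothetical helper: in a real job this would query your configuration table
  def loadBlackListFromDB(): Seq[(String, Boolean)] = Seq(("doudou", true))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformSwitchApp")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("hadoop001", 9999)

    lines.map(x => (x.split(",")(0), x)).transform(rdd => {
      val sparkConf = rdd.sparkContext.getConf
      // the switch: --conf spark.filter.switch=true/false decides whether to filter at all
      val filterEnabled = sparkConf.getBoolean("spark.filter.switch", defaultValue = false)
      if (filterEnabled) {
        // entries added after release can come in via --conf spark.filter.domains=a,b,c
        val extraDomains = sparkConf.get("spark.filter.domains", "")
          .split(",").filter(_.nonEmpty).map(d => (d, true))
        // transform's function runs on the driver for every batch, so the blacklist
        // is rebuilt here and picks up newly configured entries over time
        val blackRDD = rdd.sparkContext.parallelize(loadBlackListFromDB() ++ extraDomains)
        rdd.leftOuterJoin(blackRDD).filter(x => !x._2._2.getOrElse(false)).map(_._2._1)
      } else {
        rdd.map(_._2) // switch off: pass every line through unchanged
      }
    }).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With this in place, bringing a new entry online is a matter of resubmitting with a different --conf value instead of editing the table.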

Reposted from blog.csdn.net/qq_43688472/article/details/86616864