Two classic ways of using Spark Streaming:

Hello everyone!

These two classic usages involve only Spark Streaming itself, with no external connectors; the data source is a TCP port.

The first way: compute the data for each batch interval, without accumulating historical results

package SparkStream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by Administrator on 2017/10/10.
  * Purpose: demonstrate a word count with Spark Streaming
  */
object StreamWc {
  def main(args: Array[String]): Unit = {

    // Set the log level
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamWc").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    // Create a DStream that receives data from the TCP source
    val dstream = ssc.socketTextStream("192.168.17.108", 8888)
    val result = dstream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()

    // Start the streaming job
    ssc.start()
    // Wait for the job to terminate
    ssc.awaitTermination()

  }
}
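Both programs call LoggerLevels.setStreamingLogLevels() to cut down on console noise. That helper is not shown in the post itself; a minimal sketch of what it could look like (the WARN level is an assumed choice) is:

package SparkStream

import org.apache.log4j.{Level, Logger}

object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    // Raise the root log level so the batch output stays readable
    Logger.getRootLogger.setLevel(Level.WARN)
  }
}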


First, set up the TCP data source with netcat:

yum install nc

nc -lk 8888

Then run the Spark Streaming job, type some words into the nc session, and check whether Spark Streaming has processed them.
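For example (a hypothetical session), typing hello world hello into the nc terminal within one 5-second batch should make the job print something like the following (the timestamp will differ):

-------------------------------------------
Time: 1507600000000 ms
-------------------------------------------
(hello,2)
(world,1)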

The second way: compute the data for each batch interval, and accumulate it with the historical results

package SparkStream

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

/**
  * Created by Administrator on 2017/10/10.
  * Purpose: demonstrate a cumulative (stateful) word count with Spark Streaming
  */
object StreamWc_Sum {
  /**
    * String      : the word, e.g. "hello"
    * Seq[Int]    : the counts of the word in the current batch
    * Option[Int] : the historical (accumulated) result
    */
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    // iter.flatMap(it => Some(it._2.sum + it._3.getOrElse(0)).map(x => (it._1, x)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(m => (x, m)) }
  }

  def main(args: Array[String]): Unit = {
    // Set the log level
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamWc_Sum").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // A checkpoint must be set before doing stateful accumulation, and the
    // checkpoint directory should be on shared storage such as HDFS
    // (a local path is used here only for the demo)
    ssc.checkpoint("c://test//wc_sum")
    val lines = ssc.socketTextStream("192.168.17.108", 8888)
    // Word count without accumulation would be:
    // val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // updateStateByKey accumulates results across batches; it takes a custom
    // update function (updateFunc), a partitioner (the default hash partitioner
    // is used here), and a flag telling Spark to keep using that partitioner
    val result = lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    result.print()

    // Start the streaming job
    ssc.start()
    // Wait for the job to terminate
    ssc.awaitTermination()

  }
}


First, set up the TCP data source with netcat, just as before:

yum install nc

nc -lk 8888

Then run the Spark Streaming job, type some words into the nc session, and check whether they are processed; this time, pay particular attention to how the historical data is accumulated:
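For example (a hypothetical session), typing hello world during the first batch and hello again during a later batch should print roughly the following; note that updateStateByKey keeps emitting the accumulated state for every key, so (world,1) is repeated:

-------------------------------------------
Time: 1507600000000 ms
-------------------------------------------
(hello,1)
(world,1)

-------------------------------------------
Time: 1507600005000 ms
-------------------------------------------
(hello,2)
(world,1)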

In my opinion, the core points are:

1 The update function updateFunc (see the quick check below)

2 The Spark Streaming operator updateStateByKey, together with the checkpoint that must be set before it is used

These two parts are not easy to write from memory, so I suggest keeping this code and modifying it next time rather than writing it from scratch.
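To see what updateFunc does, it can be exercised by hand with made-up input, outside of any streaming context (this assumes the StreamWc_Sum object above is on the classpath):

// "hello" appeared twice in the current batch and 3 times historically,
// "world" appeared once and has no history yet
val merged = StreamWc_Sum.updateFunc(Iterator(
  ("hello", Seq(1, 1), Some(3)),
  ("world", Seq(1), None)
)).toList
println(merged)  // List((hello,5), (world,1))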

Origin blog.csdn.net/zhaoxiangchong/article/details/81668304