Hello everyone!
Here are the two classic ways of using Spark Streaming. Both examples involve only Spark Streaming itself, with no external connectors; the data source is a TCP port.
The first type: compute the data per batch interval, without accumulating historical results
package SparkStream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by Administrator on 2017/10/10.
 * Purpose: demonstrate word count with Spark Streaming
 */
object StreamWc {
  def main(args: Array[String]): Unit = {
    // Set the log level
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamWc").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // Create a DStream that receives data from the socket
    val dStream = ssc.socketTextStream("192.168.17.108", 8888)
    val result = dStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    result.print()
    // Start the streaming job
    ssc.start()
    // Wait for the job to terminate
    ssc.awaitTermination()
  }
}
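This example (and the second one below) calls a helper LoggerLevels.setStreamingLogLevels() whose source isn't shown in this post. A minimal sketch of such a helper, modeled on the StreamingExamples utility shipped with Spark; the class name and package here are assumptions based on the call above, and it assumes log4j 1.x on the classpath (which Spark bundled at the time):

package SparkStream

import org.apache.log4j.{Level, Logger}

object LoggerLevels {
  def setStreamingLogLevels(): Unit = {
    // Only override the level if the user has not configured log4j themselves
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // Raise the level to WARN so the batch output isn't drowned in INFO logs
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}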
First, open the TCP port and feed it data with netcat:
yum install nc
nc -lk 8888
Second, run the Spark Streaming job and check whether the data typed into nc gets processed.
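For example, typing hello world hello into the nc session should, within one 5-second batch, print output like the following (the timestamp is illustrative):

-------------------------------------------
Time: 1507600000000 ms
-------------------------------------------
(hello,2)
(world,1)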
The second type: compute the data per batch interval, accumulating historical results
package SparkStream

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

/**
 * Created by Administrator on 2017/10/10.
 * Purpose: demonstrate an accumulating word count with Spark Streaming
 */
object StreamWc_Sum {
  /**
   * String : the word, e.g. "hello"
   * Seq[Int] : the counts of that word in the current batch
   * Option[Int] : the accumulated historical result
   */
  val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    // iter.flatMap(it => Some(it._2.sum + it._3.getOrElse(0)).map(x => (it._1, x)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(m => (x, m)) }
  }

  def main(args: Array[String]): Unit = {
    // Set the log level
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamWc_Sum").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // A checkpoint directory must be set before accumulating state;
    // in production it should live on shared storage such as HDFS
    ssc.checkpoint("c://test//wc_sum")
    val lines = ssc.socketTextStream("192.168.17.108", 8888)
    // Word count without accumulation, for comparison:
    // val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // updateStateByKey accumulates results across batches; it takes a custom
    // update function (updateFunc above), a partitioner (the default hash
    // partitioner here), and a flag saying whether to remember the partitioner
    val result = lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
    result.print()
    // Start the streaming job
    ssc.start()
    // Wait for the job to terminate
    ssc.awaitTermination()
  }
}
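To see what updateFunc does in isolation, here is a small Spark-free sketch that feeds it one hypothetical batch: suppose "hello" appeared twice in the current batch with an accumulated history of 3, and "world" appeared once with no history:

object UpdateFuncDemo {
  def main(args: Array[String]): Unit = {
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
      iter.flatMap { case (word, counts, history) =>
        Some(counts.sum + history.getOrElse(0)).map(total => (word, total))
      }
    }
    // One hypothetical batch: per-word counts plus the accumulated history
    val batch = Iterator(("hello", Seq(1, 1), Some(3)), ("world", Seq(1), None))
    println(updateFunc(batch).toList) // prints List((hello,5), (world,1))
  }
}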
First, open the TCP port and feed it data with netcat:
yum install nc
nc -lk 8888
Second, run the Spark Streaming job, check whether the data typed into nc gets processed, and pay particular attention to how the historical results accumulate:
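For instance, typing hello in one batch and hello again in a later batch should print (hello,1) and then (hello,2), because the historical count is carried forward (timestamps illustrative):

-------------------------------------------
Time: 1507600000000 ms
-------------------------------------------
(hello,1)

-------------------------------------------
Time: 1507600005000 ms
-------------------------------------------
(hello,2)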
Personally, I think the core points are:
1 the update function updateFunc
2 the updateStateByKey operator, and the checkpoint that must be set up before it
Neither of these is easy to write from memory, so I suggest saving this code and modifying it next time instead of writing it from scratch.
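For reference, updateStateByKey also has a simpler per-key overload that avoids the Iterator-based signature and the explicit partitioner. A sketch of the same word count using it (assuming the same lines DStream as above; the checkpoint directory is still required):

val simpler = lines.flatMap(_.split(" ")).map((_, 1))
  .updateStateByKey[Int] { (newCounts: Seq[Int], state: Option[Int]) =>
    // Add this batch's counts to the running total for the key
    Some(newCounts.sum + state.getOrElse(0))
  }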