1. 试验目标
a . 熟悉spark Streaming操作流程(编程-》打包-》程序提交运行-》job运行监控)
b. 熟悉spark Streaming 运行,和使用场景有初步了解
c .熟悉Spark Streaming基本编程,对spark函数有初步使用 ,flatMap,map,reduceByKey..
2.前提
a . 软件安装:
nc(模拟数据实时输入工具),spark-2.2.0 , sbt-1.1.0, scala-2.11.8
3.编程
程序源码:
import org.apache.spark._ import org.apache.spark.streaming._ object StreamingWordCount { def main(args: Array[String]){ //config the sparksession //spark 设置Spark集群地址 val conf=new SparkConf().setMaster("spark://master:7077").setAppName("NetworkWordCount") // create the streaming context // 创建Streaming 上下文,跟SparkContext 类似,多一个参数,设置收集数据源的时间间隔(Dstream 按照时间批次收集5s,每5s生成一个RDD) val ssc= new StreamingContext(conf,Seconds(5)) //conf the socket to reciver the words //设置Spark Streaming 监控的socket,数据流通过该socket传到spark,生成小的RDD,之后传入spark 做处理 val lines=ssc.socketTextStream("10.0.1.118",9800) //split the string to the word //flatMap 后面通过图说明这个函数跟Map区别,通过flatMap 函数后每个RDD中内容到变成单词,再经过map 映射成元组key就是单词,values为1 //例如: 输入 I Love You , 输出:[(I,1),(Love,1),(You,1)] val words=lines.flatMap(_.split(" ")).map(word=>(word,1)) //reduce the words //对上面的数组进行计数,想听的key值value相加,例如:(You,1) (You,1)=>(You,2) val wordscount=words.reduceByKey((x,y)=>x+y) //output the result //Dstrea 输出,print() 会打印RDD前10个元数。 wordscount.print() //流启动 ssc.start() //等待流终止,可以用awaitTerminationOrTimeout(3600)设置超时时间 ssc.awaitTermination(); } }
4.打包
将上面程序保存为StreamingWordCount.scala,目录结构(我的整个项目是房子一个WordsCount目录下/workscript/WordsCount):
[root@master WordsCount]# pwd /workscript/WordsCount [root@master WordsCount]# find . . ./src ./src/main ./src/main/scala ./src/main/scala/StreamingWordCount.scala ./simple.sbt [root@master WordsCount]# ll drwxrwxr-x 3 hadoop hadoop 17 Feb 8 22:57 src
新建文件simple.sbt,内容如下:
[root@master WordsCount]# cat simple.sbt # 文件版本配置跟上面软件安装截图相匹配。 name := "StreamingWordCount" version := "1.0" scalaVersion := "2.11.8" libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.1" #添加依赖包,这例如果添加依赖错了,在提交job会找不到类。
说明如果有多个依赖:
libraryDependencies ++= Seq( groupID % artifactID % revision, groupID % otherID % otherRevision )
sbt 官网依赖配置:https://www.scala-sbt.org/0.13/docs/zh-cn/Library-Dependencies.html
Maven包查询:http://search.maven.org/#search
执行打包:
[root@master WordsCount]# ll #进入WordsCount目录 total 4 -rwxrwxr-x 1 hadoop hadoop 144 Feb 9 01:16 simple.sbt drwxrwxr-x 3 hadoop hadoop 17 Feb 8 22:57 src [root@master WordsCount]# sbt package #执行打包,这个过程有点慢。。。。。。。。需要去下载依赖包,所以要联网,联网,当然可以本地(要先下载仓库) [info] Updated file /workscript/WordsCount/project/build.properties: set sbt.version to 1.1.0 [info] Loading project definition from /workscript/WordsCount/project [info] Updating ProjectRef(uri("file:/workscript/WordsCount/project/"), "wordscount-build")... [info] Done updating. [info] Loading project definition from /workscript/WordsCount/project [info] Loading settings from simple.sbt ... [info] Set current project to StreamingWordCount (in build file:/workscript/WordsCount/) ..... [info] Compiling 1 Scala source to /workscript/WordsCount/target/scala-2.11/classes ... [info] Done compiling. [info] Packaging /workscript/WordsCount/target/scala-2.11/streamingwordcount_2.11-1.0.jar ... [info] Done packaging. [success] Total time: 475 s, completed Feb 9, 2018 1:29:25 AM #打好的Jar包在目录/workscript/WordsCount/target/scala-2.11/streamingwordcount_2.11-1.0.jar [root@master scala-2.11]# ll total 12 drwxr-xr-x 2 root root 4096 Feb 9 01:29 classes drwxr-xr-x 4 root root 45 Feb 9 01:25 resolution-cache -rw-r--r-- 1 root root 4768 Feb 9 01:29 streamingwordcount_2.11-1.0.jar #注意权限
5.0 提交Job
移动Jar 到指定目录(自定义,方便管理):
[root@master scala-2.11]# mv streamingwordcount_2.11-1.0.jar /home/hadoop/spark-2.2.0/example_jars/ #提交Job,job提交后通过nc 相socketTextStream("10.0.1.118",9800) 输入数据源。 [hadoop@master bin]$ ./spark-submit --class StreamingWordCount ~/spark-2.2.0/example_jars/streamingwordcount_2.11-1.0.jar 18/02/09 03:10:31 INFO spark.SparkContext: Running Spark version 2.2.0 18/02/09 03:10:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 18/02/09 03:10:55 INFO spark.SparkContext: Submitted application: NetworkWordCount 18/02/09 03:10:56 INFO spark.SecurityManager: Changing view acls to: hadoop 18/02/09 03:10:56 INFO spark.SecurityManager: Changing modify acls to: hadoop 18/02/09 03:10:56 INFO spark.SecurityManager: Changing view acls groups to: 18/02/09 03:10:56 INFO spark.SecurityManager: Changing modify acls groups to: 18/02/09 03:10:56 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set() ...... 18/02/09 03:14:01 INFO scheduler.DAGScheduler: Job 33 finished: print at StreamingWordCount.scala:21, took 0.253337 s ------------------------------------------- Time: 1518117240000 ms ------------------------------------------- (remote,1) (desktop,1) (connect,1) (application,1) (agents,,1) (YourKit,1) (to,1) (profiler,1) (To,1) (the,2) 18/02/09 03:14:01 INFO scheduler.JobScheduler: Finished job streaming job 1518117240000 ms.0 from job set of time 1518117240000 ms
数据源输入:
没有nc可以使用yum install nc 命令安装:
[hadoop@master ~]$ nc -lk 9800 To connect the YourKit desktop application to the remote profiler agents, #回车输入完毕
驱动节点显示,该显示如果在5s 内没有输入,则为空:
6.0 前端监控
网址:http://10.0.1.118:8080/
附录
Spark Streaming 编程指导:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Spark API :
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package
Dstream :