SparkStreaming integration with Flume
In real development, push mode can lose data: with push, Flume actively sends the data to the program, so if the program crashes, the data in flight is lost. For that reason push is not covered here. Instead we use poll mode, where Spark Streaming pulls the data from Flume itself, ensuring no data is lost.
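As a quick orientation, here is a minimal sketch of the two modes offered by the spark-streaming-flume module; the hostname and port are placeholders that match the flume-poll.conf used later in this article:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

// push mode: Flume's avro sink pushes events to a Spark receiver (data can be lost on failure)
def pushStream(ssc: StreamingContext): ReceiverInputDStream[SparkFlumeEvent] =
  FlumeUtils.createStream(ssc, "192.168.52.110", 8888)

// poll mode (used in this article): Spark pulls events from Flume's SparkSink
def pollStream(ssc: StreamingContext): ReceiverInputDStream[SparkFlumeEvent] =
  FlumeUtils.createPollingStream(ssc, "192.168.52.110", 8888)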
1. First, you need a working Flume installation
For example, you may already have one. [ If not, see: build a Flume cluster (TBD) ]
The Flume version used here is Apache Flume 1.6, the CDH-integrated build (apache-flume-1.6.0-cdh5.14.0).
Download it here.
(1) Put spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory
cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib
(PS: I uploaded the jar into my Flume installation directory with an FTP tool; I use FinalShell, which supports both SSH and FTP (if you need it, you can download it here).)
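For example, assuming the jar was first uploaded to /export/software (a placeholder path; use wherever you actually put it), the copy would look like this:

# /export/software is just an example upload location, adjust to your own path
cp /export/software/spark-streaming-flume-sink_2.11-2.0.2.jar /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/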
(2) Replace the Scala dependency under Flume's lib directory (to ensure the Scala versions match)
Here I take scala-library-2.11.8.jar from the Spark installation's jars directory and use it to replace scala-library-2.10.5.jar under Flume.
Delete scala-library-2.10.5.jar
rm -rf /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/scala-library-2.10.5.jar
Copy the scala-library-2.11.8.jar
cp /export/servers/spark-2.0.2/jars/scala-library-2.11.8.jar /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/
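Optionally, you can confirm that only the 2.11 jar now remains in Flume's lib directory, for example:

# optional check: only scala-library-2.11.8.jar should be listed
ls /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/ | grep scala-library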
(3) Write the flume-poll.conf file
Create a directory
mkdir /export/data/flume
Create the configuration file
vim /export/logs/flume-poll.conf
Write the following configuration; pay attention to the values that depend on your own machine (spoolDir, hostname, port) and change them accordingly (Flume runs its tasks based on this configuration file).
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.52.110
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000
Save and exit with :wq
Start Flume
flume-ng agent -n a1 -c /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf -f /export/logs/flume-poll.conf -Dflume.root.logger=INFO,console
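Once the agent is up, you can optionally check that the SparkSink is listening on port 8888, for example:

# optional: verify the SparkSink port is open
netstat -an | grep 8888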
Place files into the monitored directory /export/data/flume (the directory created earlier and referenced by spoolDir in the configuration).
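For example, a minimal test file (the file name and contents here are just an example):

# the file name and contents are just an example
echo "hadoop spark hive spark" > /export/data/flume/wordcount.txt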
Execution succeeded.
Once your Flume configuration is confirmed to work, start writing the code.
1. Import the dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
2. Write the code
package SparkStreaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

object SparkStreamingFlume {
  def main(args: Array[String]): Unit = {
    // Create the SparkContext
    val conf: SparkConf = new SparkConf().setAppName("DefinedFunctionAdds").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Remove redundant logs to improve readability
    sc.setLogLevel("WARN")
    // Create the StreamingContext
    val scc = new StreamingContext(sc, Seconds(5))
    // Set the checkpoint (backup) directory
    scc.checkpoint("./flume")
    // A receiver (task) pulls the data from Flume
    val num1: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(scc, "192.168.52.110", 8888)
    // Get the body of each Flume event
    val value: DStream[String] = num1.map(x => new String(x.event.getBody.array()))
    // Split the words and attach the value 1
    val result: DStream[(String, Int)] = value.flatMap(_.split(" ")).map((_, 1))
    // Accumulate the results across batches
    val result1: DStream[(String, Int)] = result.updateStateByKey(updateFunc)
    result1.print()
    // Start and block
    scc.start()
    scc.awaitTermination()
  }

  def updateFunc(currentValues: Seq[Int], historyValues: Option[Int]): Option[Int] = {
    val newValue: Int = currentValues.sum + historyValues.getOrElse(0)
    Some(newValue)
  }
}
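To illustrate what updateStateByKey does here: updateFunc adds the counts from the current batch to the accumulated history for each word. A quick sketch with hypothetical values:

// hypothetical values: word seen twice in the current batch, three times historically
updateFunc(Seq(1, 1), Some(3))   // => Some(5)
// word seen for the first time: no history yet
updateFunc(Seq(1), None)         // => Some(1)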
Run it
Add a new file to the monitored directory and check the result.
Finished successfully!