Spark Streaming integration with Flume

In real development, the push mode can lose data: Flume pushes the data to the program, so if the program fails, the data in flight is lost. For that reason push is not covered here. Instead we explain the poll mode, where Spark Streaming actively pulls data from Flume, which ensures the data is not lost.
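For reference, the two modes correspond to two different FlumeUtils calls. This is only a sketch of the API difference, assuming the StreamingContext named scc that is created in the code later in this article:

import org.apache.spark.streaming.flume.FlumeUtils

// Push mode: Flume pushes events to a receiver inside the Spark job; events in flight are lost if the job fails
val pushStream = FlumeUtils.createStream(scc, "192.168.52.110", 8888)

// Poll mode (used in this article): Spark Streaming pulls events from Flume's SparkSink,
// so unacknowledged events stay in Flume
val pollStream = FlumeUtils.createPollingStream(scc, "192.168.52.110", 8888)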

1. First, you need a working Flume installation

For example, you already have one. [ If not, see this first: build a Flume cluster (TBD) ]

The Flume version used here is Apache Flume 1.6, the CDH-integrated build (apache-flume-1.6.0-cdh5.14.0).

Download it here.

(1) Put spark-streaming-flume-sink_2.11-2.0.2.jar into Flume's lib directory

 

cd /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib

  (PS: this is my Flume installation directory. I uploaded the jar with an FTP tool; I use FinalShell, which supports both SSH and FTP. If you need it, click here to download.)

 

(2) Replace the Scala dependency in Flume's lib directory (to make sure the Scala versions match)

Here I replace scala-library-2.10.5.jar under Flume's lib with scala-library-2.11.8.jar from the Spark installation's jars directory.

 

Delete scala-library-2.10.5.jar

rm -rf /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/scala-library-2.10.5.jar 

Copy scala-library-2.11.8.jar

cp /export/servers/spark-2.0.2/jars/scala-library-2.11.8.jar /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/lib/
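To double-check which Scala version your Spark build actually ships with, you can print it from spark-shell; a quick sketch (Spark 2.0.2 is built against Scala 2.11.8):

println(scala.util.Properties.versionNumberString)   // expected to print 2.11.8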

 

(3) Write the flume-poll.conf file

Create the directory that Flume will monitor

mkdir /export/data/flume

Create the configuration file

vim /export/logs/flume-poll.conf

 

Write the following configuration. Pay attention to the highlighted values (the spoolDir path and the sink hostname and port) and change them to match your own machine (Flume runs its job entirely from this configuration).

a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 5000
#sinks
a1.sinks.k1.channel = c1
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.52.110
a1.sinks.k1.port = 8888
a1.sinks.k1.batchSize = 2000

Save and exit with :wq

Start Flume

flume-ng agent -n a1 -c /export/servers/apache-flume-1.6.0-cdh5.14.0-bin/conf -f /export/logs/flume-poll.conf -Dflume.root.logger=INFO,console

Place files into the monitored directory /export/data/flume (the spoolDir configured above).
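If you prefer to generate a test file from code rather than copying one in by hand, here is a minimal sketch (the file name test.txt and its contents are made up for illustration):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Drop a small test file into the spoolDir that Flume monitors
Files.write(Paths.get("/export/data/flume/test.txt"), "hadoop spark hadoop flume".getBytes(StandardCharsets.UTF_8))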

 

Execution succeeds:

 

Once you have confirmed that your Flume setup works, start writing the code.

1. Add the dependency

 

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
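If your project uses sbt instead of Maven, the equivalent dependency would be declared roughly like this (a sketch, matching the Scala 2.11 / Spark 2.0.2 versions above):

libraryDependencies += "org.apache.spark" %% "spark-streaming-flume" % "2.0.2"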

 

2. Write the code

package SparkStreaming

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

object SparkStreamingFlume {
  def main(args: Array[String]): Unit = {
    // Create the SparkContext
    val conf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Suppress noisy logs to make the output easier to read
    sc.setLogLevel("WARN")

    // Create the StreamingContext with 5-second batches
    val scc = new StreamingContext(sc, Seconds(5))

    // Set the checkpoint directory (required by updateStateByKey)
    scc.checkpoint("./flume")

    // Poll data from the Flume SparkSink
    val num1: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(scc, "192.168.52.110", 8888)
    // Extract the body of each Flume event as a string
    val value: DStream[String] = num1.map(x => new String(x.event.getBody.array()))
    // Split into words and pair each word with the count 1
    val result: DStream[(String, Int)] = value.flatMap(_.split(" ")).map((_, 1))

    // Accumulate the counts across batches
    val result1: DStream[(String, Int)] = result.updateStateByKey(updateFunc)

    result1.print()
    // Start the computation and block until it terminates
    scc.start()
    scc.awaitTermination()
  }


  def updateFunc(currentValues: Seq[Int], historyValues: Option[Int]): Option[Int] = {
    val newValue: Int = currentValues.sum + historyValues.getOrElse(0)
    Some(newValue)
  }

}
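To see what updateFunc does across batches, here is a tiny standalone check that calls the function defined above (the counts are made up for illustration):

// First batch: the word appears 3 times and there is no history yet
val afterBatch1 = SparkStreamingFlume.updateFunc(Seq(1, 1, 1), None)
println(afterBatch1)   // Some(3)

// Second batch: the word appears twice more; the history is the previous total
val afterBatch2 = SparkStreamingFlume.updateFunc(Seq(1, 1), afterBatch1)
println(afterBatch2)   // Some(5)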

Run

Add a new file to the monitored directory. Result:

Finished successfully!

 
