Spark Streaming consuming Flume data

Both Kafka and Flume can carry real-time data, and Spark Streaming, which bills itself as a real-time computation engine, can consume Flume data as well. There are two ways to wire them together: push and pull.

1, push mode: Spark acts as Flume's sink

The Flume agent configuration is as follows:

wang@wang-pc:~/txt/flume-ng$ cat spark-flume.conf 
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 3333

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000


#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4444

#wire them together: source--channel, sink--channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a, start the sink side (Spark)

If Flume is started directly at this point it fails with the error below, so the Spark program acting as the sink must be started first:

Start Flume:  flume-ng agent -n a1 -f spark-flume.conf 
Error: Caused by: java.io.IOException: Error connecting to /127.0.0.1:4444
	at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
	at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
	at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
	at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:165)
	... 16 more
Caused by: java.net.ConnectException: Connection refused: /127.0.0.1:4444

Sink side: Spark plays the role of Flume's sink, so start the Spark program:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

//Spark consuming Flume data: stream computation
object FlumeStream {
  def main(args: Array[String]): Unit = {
    //Spark configuration
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    //streaming configuration: 1-second batches
    val ssc = new StreamingContext(conf, Seconds(1))

    //receive data: start an Avro server that Flume's avro sink pushes events to
    val inputDstream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createStream(ssc, "127.0.0.1", 4444)
    val tupDstream = inputDstream.map(event=>{
      val event1 = event.event                //underlying AvroFlumeEvent
      val byteBuff = event1.getBody           //event body as a ByteBuffer
      val body = new String(byteBuff.array())
      (body,1)
    }).reduceByKey(_+_)                       //count occurrences within this batch

    tupDstream.print()

    // start the streaming context
    ssc.start()
    ssc.awaitTermination()
  }
}
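
This program depends on the spark-streaming-flume module that provides FlumeUtils. A minimal build.sbt sketch for it, assuming Spark 1.6.1 on Scala 2.10 (the versions matching the jar mentioned later in this post), could look like this:

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1",   //assumed version
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",   //assumed version
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"    //provides FlumeUtils and SparkFlumeEvent
)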

b, start the Flume agent:

Start Flume:  flume-ng agent -n a1 -f spark-flume.conf

19/02/26 11:41:09 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k1 started
19/02/26 11:41:09 INFO sink.AbstractRpcSink: Rpc sink k1: Building RpcClient with hostname: 127.0.0.1, port: 4444
19/02/26 11:41:09 INFO sink.AvroSink: Attempting to create Avro Rpc client.
19/02/26 11:41:09 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:3333]
19/02/26 11:41:09 INFO api.NettyAvroRpcClient: Using default maxIOWorkers
19/02/26 11:41:09 INFO sink.AbstractRpcSink: Rpc sink k1 started.

c, send data to Flume and check the Spark program's output

wang@wang-pc:~/txt/flume-ng$ nc 127.0.0.1 3333
a
OK
b
OK
c
OK
1234
OK

The Spark program's output:

-------------------------------------------
Time: 1551151646000 ms
-------------------------------------------
(a,1)
(b,1)
(1234,1)
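
Note that reduceByKey aggregates only within a single micro-batch and each print block covers one batch, which is presumably why c lands in a different block than a, b and 1234. If counts spanning several batches were wanted, a sliding window could be used instead. A minimal sketch, reusing ssc and inputDstream from the program above (the 10-second window and 2-second slide are arbitrary choices, not from the original post):

    //assumed extension: counts over the last 10 seconds, emitted every 2 seconds
    val windowedCounts = inputDstream.map(event => {
        val body = new String(event.event.getBody.array())
        (body, 1)
      }).reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(2))
    windowedCounts.print()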
  

2, pull mode: Spark pulls data over a TCP socket

a, start the Flume agent

Start Flume: flume-ng agent -n a1 -f spark-flume.conf

wang@wang-pc:~/txt/flume-ng$ cat spark-flume.conf 
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 3333

#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000


#1,  avro sink ==> Spark acts as the Avro/socket server (push mode, kept commented out)
#a1.sinks.k1.type = avro
#a1.sinks.k1.hostname = 127.0.0.1
#a1.sinks.k1.port = 4444

#2,  spark sink ==> Spark pulls data from the sink's own socket server (pull mode)
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4444

#wire them together: source--channel, sink--channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Starting Flume with this configuration fails as follows:

org.apache.flume.FlumeException: Unable to load sink type: org.apache.spark.streaming.flume.sink.SparkSink, class: org.apache.spark.streaming.flume.sink.SparkSink
	at org.apache.flume.sink.DefaultSinkFactory.getClass(DefaultSinkFactory.java:70)
	at org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:43)
	at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:450)
	at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:106)
	at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.flume.sink.SparkSink

Fix: the Spark Streaming + Flume integration guide at http://spark.apache.org/docs/1.6.1/streaming-flume-integration.html lists the required jar, spark-streaming-flume-sink_2.10-1.6.1.jar; copy it into Flume's lib directory.

b, start the Spark program

import java.net.InetSocketAddress
import java.nio.ByteBuffer

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

//Spark consuming Flume data: stream computation
object FlumeStream {
  def main(args: Array[String]): Unit = {
    //Spark configuration
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    //streaming configuration: 2-second batches
    val ssc = new StreamingContext(conf, Seconds(2))

    //receive Flume data: push mode (Flume pushes events ---> the Avro server hosted by Spark)
//    val inputDstream:ReceiverInputDStream[SparkFlumeEvent]
//      = FlumeUtils.createStream(ssc, "127.0.0.1", 4444)
//    val tupDstream = inputDstream.map(event=>{
//      val event1 = event.event
//      val body:ByteBuffer = event1.getBody
//      val mesg = new String(body.array())
//      (mesg,1)
//    }).reduceByKey(_+_)

    //receive Flume data: pull mode (Spark pulls events <--- the TCP socket served by Flume's SparkSink)
    val ncAddresses = Seq(new InetSocketAddress("127.0.0.1",4444))
    val inputDstream:ReceiverInputDStream[SparkFlumeEvent]= FlumeUtils.createPollingStream(
      ssc,
      ncAddresses,
      StorageLevel.MEMORY_ONLY
      )
    val tupDstream = inputDstream.map(event=>{
      val event1 = event.event
      val body:ByteBuffer = event1.getBody
      val mesg = new String(body.array())
      (mesg,1)
    }).reduceByKey(_+_)
    tupDstream.print()

    // start the streaming context
    ssc.start()
    ssc.awaitTermination()
  }
}
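
createPollingStream also accepts a list of addresses, so one receiver can pull from several Flume agents that each run a SparkSink. A small sketch under that assumption (the second agent on port 4445 is hypothetical, not part of this post's setup):

    val addresses = Seq(
      new InetSocketAddress("127.0.0.1", 4444),
      new InetSocketAddress("127.0.0.1", 4445)   //hypothetical second agent
    )
    val multiDstream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY)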

c, send data to Flume and check the Spark output

wang@wang-pc:~/txt/flume-ng$ nc 127.0.0.1 3333
a
OK
b
OK
c
OK
1234
OK

The Spark program's output:

-------------------------------------------
Time: 1551151646000 ms
-------------------------------------------
(a,1)
(b,1)
(1234,1)
  

Reposted from blog.csdn.net/eyeofeagle/article/details/87931705