Both Kafka and Flume can carry real-time data, and Spark Streaming, billed as real-time computation, can consume Flume data.
1, push mode: Spark acts as Flume's sink
The Flume agent is configured as follows:
wang@wang-pc:~/txt/flume-ng$ cat spark-flume.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 3333
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
#sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4444
#wire up: source--channel, sink--channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
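The Spark programs in this note need the spark-streaming-flume integration artifact on the classpath. A minimal build.sbt sketch (an assumption: version numbers follow the Spark 1.6.1 / Scala 2.10 jars mentioned elsewhere in this note; adjust to your own versions):

```scala
// build.sbt fragment (a sketch, not the note's original build file)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)
```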
a, start the sink side (Spark) first
If Flume is started before the Spark program, it fails as below (hence the sink-side Spark program must come up first).
Start Flume: flume-ng agent -n a1 -f spark-flume.conf
Error: Caused by: java.io.IOException: Error connecting to /127.0.0.1:4444
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:165)
... 16 more
Caused by: java.net.ConnectException: Connection refused: /127.0.0.1:4444
Sink side: Spark plays the role of the Flume sink; start the Spark program:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Spark consuming Flume data (push mode): stream computation
object FlumeStream {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    // streaming context with 1-second batches
    val ssc = new StreamingContext(conf, Seconds(1))
    // receive data: start an Avro listener for Flume's avro sink to push to
    val inputDstream: ReceiverInputDStream[SparkFlumeEvent] =
      FlumeUtils.createStream(ssc, "127.0.0.1", 4444)
    val tupDstream = inputDstream.map(event => {
      val body = event.event.getBody // java.nio.ByteBuffer
      (new String(body.array()), 1)
    }).reduceByKey(_ + _)
    tupDstream.print()
    // start the stream
    ssc.start()
    ssc.awaitTermination()
  }
}
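One caveat with the decode above: ByteBuffer.array() returns the entire backing array, which can include bytes outside the buffer's position..limit window. A safer decode helper (a sketch; decodeBody is a name introduced here, not part of the Flume or Spark API):

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical helper: decode only the readable window of the event body.
// duplicate() reads the bytes without disturbing the original buffer's position.
def decodeBody(body: ByteBuffer): String = {
  val bytes = new Array[Byte](body.remaining())
  body.duplicate().get(bytes)
  new String(bytes, StandardCharsets.UTF_8)
}
```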
b, start the Flume agent: flume-ng agent -n a1 -f spark-flume.conf
19/02/26 11:41:09 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: k1 started
19/02/26 11:41:09 INFO sink.AbstractRpcSink: Rpc sink k1: Building RpcClient with hostname: 127.0.0.1, port: 4444
19/02/26 11:41:09 INFO sink.AvroSink: Attempting to create Avro Rpc client.
19/02/26 11:41:09 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:3333]
19/02/26 11:41:09 INFO api.NettyAvroRpcClient: Using default maxIOWorkers
19/02/26 11:41:09 INFO sink.AbstractRpcSink: Rpc sink k1 started.
c, send data to Flume and check the Spark program's output
wang@wang-pc:~/txt/flume-ng$ nc 127.0.0.1 3333
a
OK
b
OK
c
OK
1234
OK
Spark program output (note that 'c' is absent here; it presumably landed in a different batch interval):
-------------------------------------------
Time: 1551151646000 ms
-------------------------------------------
(a,1)
(b,1)
(1234,1)
2, pull mode: Spark pulls data from Flume over a TCP socket
a, start the Flume agent
Start Flume: flume-ng agent -n a1 -f spark-flume.conf
wang@wang-pc:~/txt/flume-ng$ cat spark-flume.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 3333
#channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
#1, avro sink ==> Spark acts as the socket server (push mode)
#a1.sinks.k1.type = avro
#a1.sinks.k1.hostname = 127.0.0.1
#a1.sinks.k1.port = 4444
#2, spark sink ==> Spark pulls data from a standalone socket service inside Flume
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 127.0.0.1
a1.sinks.k1.port = 4444
#wire up: source--channel, sink--channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Starting it fails with:
org.apache.flume.FlumeException: Unable to load sink type: org.apache.spark.streaming.flume.sink.SparkSink, class: org.apache.spark.streaming.flume.sink.SparkSink
at org.apache.flume.sink.DefaultSinkFactory.getClass(DefaultSinkFactory.java:70)
at org.apache.flume.sink.DefaultSinkFactory.create(DefaultSinkFactory.java:43)
at org.apache.flume.node.AbstractConfigurationProvider.loadSinks(AbstractConfigurationProvider.java:450)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:106)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.flume.sink.SparkSink
Fix: the Spark integration guide (http://spark.apache.org/docs/1.6.1/streaming-flume-integration.html) names the required jar, spark-streaming-flume-sink_2.10-1.6.1.jar; drop it into Flume's lib directory and restart the agent.
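Concretely, the fix amounts to making the SparkSink class visible to Flume's classloader (a setup sketch; $FLUME_HOME is an assumption for your Flume install root, and per the same 1.6.1 guide the custom sink may also need scala-library and commons-lang3 jars reachable from Flume):

```shell
# Assumption: FLUME_HOME points at the Flume installation root.
# The jar must be in place before the agent is (re)started.
cp spark-streaming-flume-sink_2.10-1.6.1.jar "$FLUME_HOME/lib/"
```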
b, start the Spark program
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Spark consuming Flume data: stream computation
object FlumeStream {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val conf = new SparkConf().setMaster("local[*]").setAppName("test")
    // streaming context with 2-second batches
    val ssc = new StreamingContext(conf, Seconds(2))
    // push mode (Flume pushes events to a receiver in this program):
    // val inputDstream: ReceiverInputDStream[SparkFlumeEvent] =
    //   FlumeUtils.createStream(ssc, "127.0.0.1", 4444)
    // pull mode (this program polls the TCP socket served by Flume's SparkSink):
    val ncAddresses = Seq(new InetSocketAddress("127.0.0.1", 4444))
    val inputDstream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(
      ssc,
      ncAddresses,
      StorageLevel.MEMORY_ONLY
    )
    val tupDstream = inputDstream.map(event => {
      val body: ByteBuffer = event.event.getBody
      val mesg = new String(body.array())
      (mesg, 1)
    }).reduceByKey(_ + _)
    tupDstream.print()
    // start the stream
    ssc.start()
    ssc.awaitTermination()
  }
}
c, send data to Flume and check Spark's output
wang@wang-pc:~/txt/flume-ng$ nc 127.0.0.1 3333
a
OK
b
OK
c
OK
1234
OK
Spark program output:
-------------------------------------------
Time: 1551151646000 ms
-------------------------------------------
(a,1)
(b,1)
(1234,1)