This is for awareness only; connecting Flume directly to Spark Streaming is rarely done in practice.
Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html
The official docs describe the integration in detail and can be followed directly.
Approach 1: Flume-style Push-based Approach
The push-based approach
Flume is designed to push data between multiple Flume agents.
In this approach, Spark Streaming essentially sets up a receiver that acts as an Avro agent for Flume, so that Flume can push data to it.
Prerequisites
Choose one machine in your cluster.
When the Flume + Spark Streaming application is launched, one of the Spark workers must run on that machine.
Flume can then be configured to push data to a port on that machine.
Configuring Flume
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memoryChannel
agent.sinks.avroSink.hostname = <chosen machine's hostname>
agent.sinks.avroSink.port = <chosen port on the machine>
Since one machine has been chosen to run the receiver, the data has to reach that machine: Flume sinks it there as avro, to the chosen hostname and port.
Integrating with Spark Streaming
- Add the dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
Programming
The helper class FlumeUtils is required.
See FlumePushApp.scala for the full implementation.
Flume Agent configuration:
[$FLUME_HOME/conf/nc-memory-avro.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Why do the source and sink use different ports?
The netcat source listens for input on port 44444.
The data received there is then written via avro to port 44443 on the same host.
Start the Flume Agent
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-avro.conf \
-Dflume.root.logger=INFO,console
This produces an error:
Caused by: org.apache.flume.FlumeException: NettyAvroRpcClient { host: localhost, port: 44443 }: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:182)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:121)
at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:638)
at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:89)
at org.apache.flume.sink.AvroSink.initializeRpcClient(AvroSink.java:127)
at org.apache.flume.sink.AbstractRpcSink.createConnection(AbstractRpcSink.java:211)
at org.apache.flume.sink.AbstractRpcSink.verifyConnection(AbstractRpcSink.java:272)
at org.apache.flume.sink.AbstractRpcSink.process(AbstractRpcSink.java:349)
... 3 more
Cause analysis
Port 44443 cannot accept the data.
Starting Flume first fails; the Spark Streaming application must be started first.
Because Flume pushes data to Spark Streaming, the Spark Streaming receiver must be up and listening before the Flume agent starts.
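Before starting the Flume agent, it helps to confirm that the Spark Streaming receiver is actually listening on the sink's target port. A minimal shell check (44443 is the port from the config above; whether `ss` or `netstat` is available depends on the machine):

```shell
# Check whether the Spark Streaming avro receiver is already listening on
# the port that Flume's avro sink pushes to (44443 in this example).
PORT=44443
if (ss -tln 2>/dev/null || netstat -tln 2>/dev/null) | grep -q ":${PORT}"; then
  echo "receiver is listening on ${PORT}; safe to start the Flume agent"
else
  echo "nothing is listening on ${PORT}; start the Spark Streaming app first"
fi
```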
Submit with spark-submit
spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
/opt/lib/scala-train-1.0.jar
Another error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/flume/FlumeUtils$
at com.zhaotao.SparkStreaming.FlumePushApp$.main(FlumePushApp.scala:16)
at com.zhaotao.SparkStreaming.FlumePushApp.main(FlumePushApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.streaming.flume.FlumeUtils$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 11 more
Cause analysis
spark-streaming-flume_2.11 was added to pom.xml,
but the application was packaged as a thin jar, so that dependency was not bundled into it.
It therefore has to be supplied manually via --packages:
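The diagnosis can be confirmed by listing the contents of the application jar; a thin jar will not contain the FlumeUtils class. A sketch, assuming the jar path from the spark-submit command:

```shell
# A thin jar contains only the application's own classes, so FlumeUtils
# will be absent. /opt/lib/scala-train-1.0.jar is the jar submitted above.
if jar tf /opt/lib/scala-train-1.0.jar 2>/dev/null \
    | grep -q 'org/apache/spark/streaming/flume/FlumeUtils'; then
  echo "FlumeUtils is bundled (fat jar)"
else
  echo "FlumeUtils is missing; supply it via --packages or --jars"
fi
```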
spark-submit \
--class com.zhaotao.SparkStreaming.FlumePushApp \
--master local[2] \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0 \
/opt/lib/scala-train-1.0.jar
Start the Flume Agent again
Start telnet
$>telnet 192.168.26.131 44444
Input:
huhuhu
zhaotao
zhaotao
huhuhu
The console where Spark Streaming was started displays:
(zhaotao,2)
(huhuhu,2)
Summary
The flow of this example:
nc --> flume --> avro sink (ip+port) --> spark streaming
Steps
- Start Spark Streaming first
Note:
- If the application is built as a fat jar, pom.xml must mark the dependency with
<scope>provided</scope>
- If it is built as a thin jar, use the --packages parameter
However, --packages is not recommended at work: once it is used, dependencies must be downloaded from the public network (tolerable if the company runs a private repository, very painful if not). Note: use --packages with caution in production.
There is a second solution, using this case as an example:
First download spark-streaming-flume-assembly from the Maven repository to the local machine; then, when submitting with spark-submit, pass the corresponding jar with the --jars parameter.
This also brings a problem: if there are many jars, writing them out one by one is tedious. A shell script can iterate over the jars in a directory, join them into a single string, and pass that to --jars before submitting (pointing --jars at a directory does not work!).
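The jar-joining script just described can be sketched like this (the directory /opt/lib/ext is an assumption; point it at wherever the downloaded jars live):

```shell
# Collect every jar in a directory and join them with commas, which is the
# format --jars expects (a bare directory path will not work).
JAR_DIR="${1:-/opt/lib/ext}"
JARS=$(ls "$JAR_DIR"/*.jar 2>/dev/null | tr '\n' ',' | sed 's/,$//')
echo "spark-submit ... --jars $JARS /opt/lib/scala-train-1.0.jar"
```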
- Then start Flume
- telnet, send data, and the word count completes
Approach 2: Pull-based Approach using a Custom Sink
Official docs: http://spark.apache.org/docs/latest/streaming-flume-integration.html#approach-2-pull-based-approach-using-a-custom-sink
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink:
- Flume pushes data into the sink, where it stays buffered
- Spark Streaming uses a reliable Flume receiver and a transaction mechanism to pull the data from the sink
A transaction succeeds only after the data has been received and replicated by Spark Streaming.
This approach therefore guarantees stronger reliability and fault tolerance than the previous one.
If Flume integration is needed at all, this is the recommended approach.
Configuring Flume
- Add spark-streaming-flume-sink_2.11
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-flume-sink_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
- Add scala-library
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
- Add commons-lang3
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
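Because the custom SparkSink runs inside the Flume agent's JVM, these three jars also have to be on Flume's classpath, not only in the application's pom.xml. A sketch, assuming the jars were already downloaded to /opt/lib (the filenames and versions are assumptions; match your Spark and Scala versions):

```shell
# Copy the custom sink and its dependencies into Flume's lib directory so
# the agent can load org.apache.spark.streaming.flume.sink.SparkSink.
for j in spark-streaming-flume-sink_2.11-2.2.0.jar \
         scala-library-2.11.8.jar \
         commons-lang3-3.5.jar; do
  if [ -f "/opt/lib/$j" ]; then
    cp "/opt/lib/$j" "$FLUME_HOME/lib/"
  else
    echo "missing /opt/lib/$j; download it first"
  fi
done
```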
- Configuration file
agent.sinks = spark
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname =
agent.sinks.spark.port =
agent.sinks.spark.channel = memoryChannel
Configuring the Spark Streaming Application
Programming
See FlumePullApp.scala for the implementation.
Flume Agent configuration:
[$FLUME_HOME/conf/nc-memory-spark.conf]
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 192.168.26.131
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.26.131
a1.sinks.k1.port = 44443
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
- Start the Flume Agent
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/nc-memory-spark.conf \
-Dflume.root.logger=INFO,console
Start the Spark Streaming application
telnet
$>telnet 192.168.26.131 44444
Input:
huhuhu
zhaotao
huhuhu
zhaotao
zhao
Output:
-------------------------------------------
Time: 1518957240000 ms
-------------------------------------------
(zhao,1)
(zhaotao,2)
(huhuhu,2)
Code
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePushApp {
  def main(args: Array[String]): Unit = {
    // Master and app name are supplied on the spark-submit command line
    val conf = new SparkConf() //.setMaster("local[2]").setAppName("FlumePushApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Push mode: start an Avro receiver on 192.168.26.131:44443,
    // the host/port that Flume's avro sink pushes to
    val lines = FlumeUtils.createStream(ssc, "192.168.26.131", 44443)
    // The payload arrives as the byte body of an Avro event
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FlumePullApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FlumePullApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Pull mode: poll the custom SparkSink running inside the Flume agent
    // at 192.168.26.131:44443
    val lines = FlumeUtils.createPollingStream(ssc, "192.168.26.131", 44443)
    val words = lines.map(x => new String(x.event.getBody.array()).trim)
      .flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}