Using Spark Streaming with Flume

First, add the dependency in IDEA. The ${spark.version} placeholder refers to a Maven property defined in your pom (e.g. 1.6.1, matching the sink jars used later):
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.10</artifactId>
    <version>${spark.version}</version>
</dependency>

Install Flume on Linux: extract the Flume package, then go into the conf directory, copy flume-env.sh.template to flume-env.sh, and set JAVA_HOME to your Java installation directory.
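A minimal sketch of those steps, assuming an apache-flume-1.6.0-bin tarball, an install directory of /home/hadoop/apps, and a JDK under /usr/local/jdk1.8.0 (all three are illustrative and should be adjusted to your environment):

# extract the Flume package (version and paths are assumptions)
tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /home/hadoop/apps
cd /home/hadoop/apps/apache-flume-1.6.0-bin/conf
# create flume-env.sh from the template and point it at the JDK
cp flume-env.sh.template flume-env.sh
echo 'export JAVA_HOME=/usr/local/jdk1.8.0' >> flume-env.sh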
1. Flume pushes data to Spark Streaming (push mode)
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePushDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least two threads are required, one to receive data and one to process the received batches
    val config = new SparkConf().setAppName("FlumePushDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // the address of the node where this Spark program runs; Flume's avro sink pushes events to this host/port
    val flumeStream = FlumeUtils.createStream(ssc, "192.168.10.11", 8008)
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" ")).map((_, 1)).reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration file (flume-push.conf):
# Start command (run from the Flume installation directory):
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-push.conf  -Dflume.root.logger=INFO,console
# Flume actively pushes data to Spark
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# source
a1.sources.r1.type = exec
# tail a log file on the Linux host
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# avro sink: send events to the host/port where the Spark receiver is listening
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
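To try push mode end to end (a sketch that assumes the paths and addresses above): start the Spark application first so its Avro receiver is listening on 192.168.10.11:8008, then start the Flume agent, and finally append some test lines to the tailed file.

# 1. run FlumePushDemo (in IDEA, or packaged and started with spark-submit)
# 2. start the Flume agent from the Flume installation directory
bin/flume-ng agent -n a1 -c conf/ -f config/flume-push.conf -Dflume.root.logger=INFO,console
# 3. generate some test data for the exec source to tail
echo "hello spark hello flume" >> /home/hadoop/access.log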

2. Spark Streaming pulls data from Flume (pull mode). This is preferable to the push approach above, because Streaming can pull data at the rate it is able to process.
First, copy the following three jars into Flume's lib directory (see the sketch after this list):
spark-streaming-flume-sink_2.10-1.6.1.jar
scala-library-2.10.5.jar
commons-lang3-3.3.2.jar
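A minimal sketch of the copy, assuming the three jars are in the current directory and Flume is installed under /home/hadoop/apps/apache-flume-1.6.0-bin (both paths are assumptions):

# put the Spark sink and its dependencies on Flume's classpath
cp spark-streaming-flume-sink_2.10-1.6.1.jar \
   scala-library-2.10.5.jar \
   commons-lang3-3.3.2.jar \
   /home/hadoop/apps/apache-flume-1.6.0-bin/lib/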
Second, create the DStream with FlumeUtils.createPollingStream:
import java.net.InetSocketAddress

import org.apache.log4j.{Level, Logger}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullDemo {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)
    // local[2]: at least two threads are required, one to receive data and one to process the received batches
    val config = new SparkConf().setAppName("FlumePullDemo").setMaster("local[2]")
    val sc = new SparkContext(config)
    val ssc = new StreamingContext(sc, Seconds(2))
    // addresses of the Flume agents running the SparkSink; more than one address can be listed
    val addresses: Seq[InetSocketAddress] = Seq(new InetSocketAddress("192.168.10.11", 8008))
    val flumeStream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_ONLY)
    flumeStream.flatMap(x => new String(x.event.getBody.array()).split(" ")).map((_, 1)).reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Flume configuration file (flume-pull.conf):
# Start command:
# bin/flume-ng agent -n a1 -c conf/ -f config/flume-pull.conf  -Dflume.root.logger=INFO,console
# Spark actively pulls data from Flume
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1

# Describe the sink
# sink to the Spark-provided SparkSink component so that Spark Streaming can pull events from it
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.10.11
a1.sinks.k1.port = 8008
# print events to the console
a1.sinks.k2.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
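For pull mode the start-up order is reversed (again a sketch under the assumptions above): the Flume agent hosting the SparkSink should be up before the Spark application starts polling it.

# 1. start the Flume agent with the SparkSink first
bin/flume-ng agent -n a1 -c conf/ -f config/flume-pull.conf -Dflume.root.logger=INFO,console
# 2. run FlumePullDemo, which polls 192.168.10.11:8008 for batches of events
# 3. append test data to the tailed file
echo "hello spark hello flume" >> /home/hadoop/access.log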


Reposted from blog.csdn.net/zmc921/article/details/75097665