Spark Streaming integrates with Flume in two modes: push mode and pull mode.

Push mode: the Spark Streaming side starts a Receiver that runs an Avro socket server and accepts the data sent by Flume's avro sink. In this mode the Flume avro sink acts as the client.

Pull mode: a custom Spark sink for Flume acts as the Avro server. Flume delivers the collected data to this sink, which buffers it. A Spark Streaming Receiver then acts as an Avro client and pulls the data from the custom sink. Compared with push mode, pull mode is more reliable and does not lose data, for two reasons:

1. The Receiver in pull mode is a reliable Receiver: it receives the data, stores and replicates it, and only then sends an ack response to the Flume sink.

2. Flume's transactions ensure that no data is lost: each pull runs inside a transaction, and if the pull does not succeed (that is, the Flume sink does not receive the ack sent by the Receiver), the transaction fails and the data remains in Flume.
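
The two reasons above can be illustrated with a toy simulation. The following Python sketch is not Spark or Flume code: `BufferingSink` and `receive` are hypothetical names invented for illustration, and the "store and replicate" step is reduced to appending to a list. It only tries to show the ordering that matters, namely that a batch leaves the sink's buffer for good only after the ack arrives.

```python
from collections import deque


class BufferingSink:
    """Toy stand-in for the custom sink used in pull mode (hypothetical class)."""

    def __init__(self, events):
        self.buffer = deque(events)  # events waiting to be pulled
        self.in_flight = []          # events handed out but not yet acked

    def pull(self, max_events):
        """Start a 'transaction': hand out a batch but remember it until acked."""
        n = min(max_events, len(self.buffer))
        self.in_flight = [self.buffer.popleft() for _ in range(n)]
        return list(self.in_flight)

    def ack(self):
        """Commit: the receiver has stored the batch, so it can be dropped."""
        self.in_flight = []

    def rollback(self):
        """No ack arrived: put the batch back at the front, in original order."""
        self.buffer.extendleft(reversed(self.in_flight))
        self.in_flight = []


def receive(sink, store, max_events, fail=False):
    """Pull a batch, store it, then ack. With fail=True the ack never arrives."""
    batch = sink.pull(max_events)
    if fail:
        sink.rollback()  # simulates a lost ack or a receiver failure
        return False
    store.extend(batch)  # stands in for 'store and replicate'
    sink.ack()
    return True


sink = BufferingSink(["e1", "e2", "e3"])
store = []
receive(sink, store, 2, fail=True)  # failed pull: the batch goes back to the sink
receive(sink, store, 2)
receive(sink, store, 2)
print(store)  # ['e1', 'e2', 'e3']
```

The failed first pull does not lose `e1` and `e2`; they are simply pulled again on the next attempt.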

Four demos for understanding Flume

1. Display netcat data on the console

bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1 -Dflume.root.logger=INFO,console

## Define the sources, channels, and sinks
agent1.sources = netcatSrc
agent1.channels = memoryChannel
agent1.sinks = loggerSink

## Configuration for netcatSrc
agent1.sources.netcatSrc.type = netcat
agent1.sources.netcatSrc.bind = localhost
agent1.sources.netcatSrc.port = 44445

## Configuration for loggerSink
agent1.sinks.loggerSink.type = logger

## Configuration for memoryChannel
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100

## Connect netcatSrc and loggerSink through memoryChannel
agent1.sources.netcatSrc.channels = memoryChannel
agent1.sinks.loggerSink.channel = memoryChannel
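
To try this demo, some text has to be sent to the port the netcat source listens on. The following Python sketch is one possible way to do that instead of using an interactive client. `send_lines` is a hypothetical helper, not part of Flume. It assumes the source replies with one line per event, which as far as I know is the netcat source's default behaviour (an `OK` per line).

```python
import socket


def send_lines(lines, host="localhost", port=44445, timeout=5.0):
    """Send each line to a netcat-style source and collect one reply line per event.

    The default host and port match netcatSrc in the configuration above.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
            for line in lines:
                # Each newline-terminated line becomes one Flume event.
                conn.sendall((line + "\n").encode("utf-8"))
                replies.append(reader.readline().strip())
    return replies
```

With agent1 running, calling `send_lines(["hello flume"])` should make the logger sink print the event on the agent's console.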

2. Save netcat data to HDFS, using the memory channel and the file channel respectively

bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1

telnet localhost 44445

## Define the sources, channels, and sinks
agent1.sources = netcatSrc
agent1.channels = memoryChannel
agent1.sinks = hdfsSink

## Configuration for netcatSrc
agent1.sources.netcatSrc.type = netcat
agent1.sources.netcatSrc.bind = localhost
agent1.sources.netcatSrc.port = 44445

## Configuration for hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:9999/user/hadoop-twq/spark-course/steaming/flume/%y-%m-%d
agent1.sinks.hdfsSink.hdfs.batchSize = 5
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true

## Configuration for memoryChannel
agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 100

## Connect netcatSrc and hdfsSink through memoryChannel
agent1.sources.netcatSrc.channels = memoryChannel
agent1.sinks.hdfsSink.channel = memoryChannel
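
The `hdfs.path` above ends in `%y-%m-%d`, which the HDFS sink replaces with the two-digit year, the month, and the day, so events land in one directory per day. With `hdfs.useLocalTimeStamp = true` the sink takes the time from the agent's local clock rather than from a timestamp header on the event. The following Python sketch only mimics that substitution for the three escapes used here. `expand_hdfs_path` is a hypothetical helper, not a Flume API.

```python
from datetime import datetime


def expand_hdfs_path(template, when):
    """Expand the %y, %m and %d escapes the way the demo's hdfs.path uses them.

    Only these three escapes are handled; the sink supports more.
    """
    return (
        template
        .replace("%y", when.strftime("%y"))  # last two digits of the year
        .replace("%m", when.strftime("%m"))  # month, 01..12
        .replace("%d", when.strftime("%d"))  # day of month, 01..31
    )


path = "hdfs://master:9999/user/hadoop-twq/spark-course/steaming/flume/%y-%m-%d"
print(expand_hdfs_path(path, datetime(2018, 3, 7)))
# hdfs://master:9999/user/hadoop-twq/spark-course/steaming/flume/18-03-07
```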

3. Save log file data to HDFS

bin/flume-ng agent --conf conf --conf-file conf/flume-conf.properties --name agent1

echo testdata >> webserver.log

## Define the sources, channels, and sinks
agent1.sources = logSrc
agent1.channels = fileChannel
agent1.sinks = hdfsSink

## Configuration for logSrc
agent1.sources.logSrc.type = exec
agent1.sources.logSrc.command = tail -F /home/hadoop-twq/spark-course/steaming/flume-course/demo3/logs/webserver.log

## Configuration for hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://master:9999/user/hadoop-twq/spark-course/steaming/flume/%y-%m-%d
agent1.sinks.hdfsSink.hdfs.batchSize = 5
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true

## Configuration for fileChannel
agent1.channels.fileChannel.type = file
agent1.channels.fileChannel.checkpointDir = /home/hadoop-twq/spark-course/steaming/flume-course/demo2-2/checkpoint
agent1.channels.fileChannel.dataDirs = /home/hadoop-twq/spark-course/steaming/flume-course/demo2-2/data

## Connect logSrc and hdfsSink through fileChannel
agent1.sources.logSrc.channels = fileChannel
agent1.sinks.hdfsSink.channel = fileChannel

In summary, Flume collects data and delivers it to a storage system: the data travels as events from a source, through a channel, to a sink.
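
The source, channel, and sink roles can be sketched as a small in-memory pipeline. This Python code is a toy model rather than Flume itself: `MemoryChannel` and `run_pipeline` are hypothetical names, and the defaults merely echo the `capacity = 100` and `hdfs.batchSize = 5` values from the demo configurations. A full channel here simply rejects the event, which simplifies how a real agent reports the condition.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer that sits between a source and a sink (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        """Called by the source. Returns False when the channel is full."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self, batch_size):
        """Called by the sink. Removes and returns up to batch_size events."""
        n = min(batch_size, len(self.events))
        return [self.events.popleft() for _ in range(n)]


def run_pipeline(lines, capacity=100, batch_size=5):
    """Source puts every line into the channel, then the sink drains it in batches."""
    channel = MemoryChannel(capacity)
    rejected = []
    for line in lines:  # the source side
        if not channel.put(line):
            rejected.append(line)
    batches = []
    while channel.events:  # the sink side
        batches.append(channel.take(batch_size))
    return batches, rejected


batches, rejected = run_pipeline([f"line-{i}" for i in range(12)])
print([len(b) for b in batches], rejected)  # [5, 5, 2] []
```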

Integrating Spark Streaming with Flume (push mode)