I. Key point

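The interfaces (configuration properties) of sources, channels, and sinks differ between Flume versions, so always use the ones documented for the version you run.
This article uses Flume 1.6.0 as its example; see http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.0/FlumeUserGuide.html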

II. Sources

1. Avro source

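(1) Function
Listens on an Avro port and receives events from external Avro client streams. Typical use: tiered (multi-hop) data collection.
(2) Required parameters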

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be avro
bind	–	hostname or IP address to listen on
port	–	Port # to bind to

(3) Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
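
To try the Avro source on its own, it has to be wired to a channel and a sink. The following is a minimal sketch, assuming a memory channel and a logger sink (neither appears in the example above) and a config file saved under the hypothetical name avro.conf. The commands in the trailing comments follow the form shown in the Flume user guide; the input file path is a placeholder.

```properties
# avro.conf (hypothetical file name): Avro source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

a1.channels.c1.type = memory

# The logger sink writes each event to the agent log at INFO level
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Start the agent:
#   bin/flume-ng agent --conf conf --conf-file avro.conf --name a1 -Dflume.root.logger=INFO,console
# Send a file to it with the bundled Avro client (placeholder path):
#   bin/flume-ng avro-client -H localhost -p 4141 -F /tmp/test.log
```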

2. Exec source

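(1) Function
Runs the given Unix command and turns each line of its standard output into an event. Typical use: following a log file with tail -F.
(2) Required parameters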

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be exec
command	–	The command to execute

(3) Example

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
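
If the command exits, the exec source stops producing events unless it is told to restart the command. A sketch of the optional restart settings follows; the values are illustrative, not recommendations.

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1
# Restart the command if it dies (default: false)
a1.sources.r1.restart = true
# Time in milliseconds to wait before attempting a restart (10000 is the default)
a1.sources.r1.restartThrottle = 10000
# Also log the command's stderr (default: false)
a1.sources.r1.logStdErr = true
```

Even with restart enabled, the exec source gives no delivery guarantee: if the channel is full or the agent dies, the output produced in the meantime is lost. The Spooling Directory Source is the more reliable choice when that matters.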

3. Spooling Directory Source

(1) Function
Watches a directory and ingests the files placed in it.
(2) Required parameters

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be spooldir.
spoolDir	–	The directory from which to read files from.

(3) Example

a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.fileHeader = true
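
Files dropped into the spooling directory must be complete and must not be modified afterwards. A sketch of the optional properties that control what happens to ingested files follows; the .tmp ignore pattern is an assumption about how in-progress files are named, not something the example above prescribes.

```properties
a1.channels = ch-1
a1.sources = src-1

a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
# Add a header holding the absolute path of the source file to every event
a1.sources.src-1.fileHeader = true
# Suffix appended to fully ingested files (.COMPLETED is the default)
a1.sources.src-1.fileSuffix = .COMPLETED
# "never" keeps ingested files, "immediate" deletes them (default: never)
a1.sources.src-1.deletePolicy = never
# Skip files whose names end in .tmp (assumed naming for files still being written)
a1.sources.src-1.ignorePattern = ^.*[.]tmp$
```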

III. Channels

1. Memory Channel

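(1) Function
Events are stored in an in-memory queue with a configurable maximum size. Typical use: flows that need higher throughput and can tolerate losing the staged data if the agent fails.
Drawback: the Memory Channel is not durable. It keeps all events in memory, so if the process stops abnormally, the events still in the channel cannot be recovered. Its capacity is also limited by the available memory.
(2) Required parameters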

Property Name	Default	Description
type	–	The component type name, needs to be memory

(3) Example

a1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
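
The same settings, annotated as a sketch of how the limits interact. The arithmetic in the comments follows the user guide's description of byteCapacity and byteCapacityBufferPercentage.

```properties
a1.channels = c1
a1.channels.c1.type = memory
# Maximum number of events held in the channel
a1.channels.c1.capacity = 10000
# Maximum number of events per transaction (a put from a source or a take by a sink);
# must not exceed capacity
a1.channels.c1.transactionCapacity = 10000
# Upper bound, in bytes, on the sum of the event bodies in the channel
a1.channels.c1.byteCapacity = 800000
# Percentage of byteCapacity set aside for headers, which are not counted:
# 800000 * (1 - 20/100) = 640000 bytes remain for event bodies
a1.channels.c1.byteCapacityBufferPercentage = 20
```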

2. File Channel

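(1) Function
A durable channel: events are persisted to disk, so they survive an agent restart, and the channel can keep buffering as long as disk space is available.
(2) Required parameters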

Property Name	Default	Description
type	–	The component type name, needs to be file.

(3) Example

a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

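Parameter notes:

checkpointDir: the directory where the checkpoint is stored. The checkpoint records the state of the channel's queue, that is, which events have already been taken and which have not, so the channel can recover after a restart.
dataDirs: the directories where event data is stored. Multiple directories can be given, separated by commas;
placing them on separate physical disks can improve File Channel performance.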
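
As a sketch of the multi-disk layout described above, assuming three separately mounted disks (the mount points are hypothetical):

```properties
a1.channels = c1
a1.channels.c1.type = file
# Checkpoint on its own disk
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
# One data directory per physical disk, comma-separated
a1.channels.c1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data
```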

IV. Sinks

1. HDFS sink

(1) Function
This sink writes events to the Hadoop Distributed File System (HDFS).
(2) Required parameters

Name	Default	Description
channel	–
type	–	The component type name, needs to be hdfs
hdfs.path	–	HDFS directory path (eg hdfs://namenode/flume/webdata/)

(3) Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
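
With round = true, roundValue = 10 and roundUnit = minute, the event timestamp is rounded down to the nearest 10 minutes before the path is expanded, so an event stamped 11:54:34 lands in the 1150/00 directory. The time escapes in hdfs.path need a timestamp header on each event. The sketch below, with illustrative values, shows one way to supply the time locally and to control file format and rolling.

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
# Use the agent's local time when events carry no "timestamp" header
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Write plain text instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll files every 600 seconds only; 0 disables rolling by size and by event count
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
```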

2. Hive sink

(1) Function
This sink streams events containing delimited text or JSON data directly into a Hive table or partition.
(2) Required parameters

Name	Default	Description
channel	–
type	–	The component type name, needs to be hive
hive.metastore	–	Hive metastore URI (eg thrift://a.b.com:9083 )
hive.database	–	Hive database name
hive.table	–	Hive table name

(3) Example

a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =id,,msg

3. HBase sink

(1) Function
Writes data to HBase.
(2) Required parameters

Property Name	Default	Description
channel	–
type	–	The component type name, needs to be hbase
table	–	The name of the table in Hbase to write to.
columnFamily	–	The column family in Hbase to write to.

(3) Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1
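
RegexHbaseEventSerializer splits the event body with a regular expression and writes each capture group to its own column. A sketch follows, assuming comma-separated bodies such as "alice,login"; the pattern and the column names are illustrative, not part of the example above.

```properties
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = hbase
a1.sinks.k1.table = foo_table
a1.sinks.k1.columnFamily = bar_cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# Two capture groups, split on the first comma (illustrative pattern)
a1.sinks.k1.serializer.regex = ([^,]*),([^,]*)
# One column name per capture group, in order (illustrative names)
a1.sinks.k1.serializer.colNames = user,action
a1.sinks.k1.channel = c1
```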

4. Avro sink

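(1) Function
The Avro sink forms one half of Flume's tiered collection support; the Avro source is the other half. Flume events sent to this sink are converted to Avro events and sent to the configured hostname/port pair. Events are taken from the configured channel in batches of the configured batch size.
(2) Required parameters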

Property Name	Default	Description
channel	–
type	–	The component type name, needs to be avro.
hostname	–	The hostname or IP address of the Avro source to connect to.
port	–	The port # the Avro source listens on.

(3) Example

a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545
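
To make the tiering concrete, here is a sketch of the two ends of one hop. The second agent, named a2 here as an assumption, runs on 10.10.10.10 and would normally have its own config file; both ends are shown together only for comparison.

```properties
# Agent a1, on the edge host: forwards events to the next tier
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

# Agent a2 (hypothetical name), on 10.10.10.10: receives what a1 sends
a2.sources = r1
a2.channels = c1
a2.sources.r1.type = avro
a2.sources.r1.channels = c1
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4545
```

The sink's hostname and port must match the address and port the downstream Avro source is bound to.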

5. Kafka sink

(1) Function
Writes data to the corresponding Kafka topic.
(2) Required parameters

Property Name	Default	Description
type	–	Must be set to org.apache.flume.sink.kafka.KafkaSink
brokerList	–	List of brokers the Kafka sink will connect to, to get the list of topic partitions. This can be a partial list of brokers, but at least two are recommended for HA. The format is a comma-separated list of hostname:port.

(3) Example

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
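
The sink fragment above needs a source and a channel to form a working agent. A sketch that combines it with the exec source and memory channel from earlier sections follows; the channel sizes are illustrative.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
# 0 = do not wait for an acknowledgement, 1 = wait for the leader only, -1 = wait for all replicas
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
```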
