Flume (3): Configuration

Configuration

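As described in the earlier parts, a Flume agent's configuration is read from a file that resembles the Java properties file format, with hierarchical property settings.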

Defining the flow

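To define the flow within a single agent, you link the sources and sinks through channels. List the sources, sinks and channels for the given agent, then point each source and sink to a channel. A source instance can specify multiple channels, but a sink instance can specify only one channel. The format is as follows: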

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1> <Channel2>

# set channel for source
<Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

# set channel for sink
<Agent>.sinks.<Sink>.channel = <Channel1>

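For example, an agent named agent_foo reads data from an external avro client and sends it to HDFS through a memory channel. The config file weblog.config could look like this: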

# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

# set channel for source
agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1

# set channel for sink
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1

This makes events flow from avro-appserver-src-1 to hdfs-sink-1 through the memory channel mem-channel-1. When the agent is started with weblog.config as its config file, it instantiates that flow.
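The wiring rules above (a source may list several channels, a sink exactly one, and every referenced channel must be declared) can be checked mechanically. The following is a minimal Python sketch, not Flume's own validation logic: `parse_properties` and `check_flow` are hypothetical helpers written for this illustration, and the parser only handles simple `key = value` lines.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_flow(props, agent):
    """Return a list of wiring problems for one agent (empty if none are found)."""
    problems = []
    declared = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        channels = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not channels:
            problems.append(f"source {source} has no channels")
        for channel in channels:
            if channel not in declared:
                problems.append(f"source {source} uses undeclared channel {channel}")
    for sink in props.get(f"{agent}.sinks", "").split():
        channels = props.get(f"{agent}.sinks.{sink}.channel", "").split()
        if len(channels) != 1:
            problems.append(f"sink {sink} must name exactly one channel")
        elif channels[0] not in declared:
            problems.append(f"sink {sink} uses undeclared channel {channels[0]}")
    return problems


# The weblog.config example from the text, embedded as a string.
WEBLOG_CONFIG = """
# list the sources, sinks and channels for the agent
agent_foo.sources = avro-appserver-src-1
agent_foo.sinks = hdfs-sink-1
agent_foo.channels = mem-channel-1

agent_foo.sources.avro-appserver-src-1.channels = mem-channel-1
agent_foo.sinks.hdfs-sink-1.channel = mem-channel-1
"""

print(check_flow(parse_properties(WEBLOG_CONFIG), "agent_foo"))
```

For the example config the list of problems is empty; giving the sink a second channel name would produce a "must name exactly one channel" entry.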

Configuring individual components

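After defining the flow, you need to set the properties of each source, sink and channel. This is done in the same hierarchical namespace fashion: you set the component type and the other property values specific to each component: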

# properties for sources
<Agent>.sources.<Source>.<someProperty> = <someValue>

# properties for channels
<Agent>.channels.<Channel>.<someProperty> = <someValue>

# properties for sinks
<Agent>.sinks.<Sink>.<someProperty> = <someValue>

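The property "type" must be set for each component so that Flume knows what kind of object it needs to create. Each source, sink and channel type has its own set of properties required for it to function as intended, and all of those need to be set as needed. The example below configures each component of a flow from avro-AppSrv-source to hdfs-Cluster1-sink through the memory channel mem-channel-1: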

agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = hdfs-Cluster1-sink
agent_foo.channels = mem-channel-1

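# set channel for sources, sinks
agent_foo.sources.avro-AppSrv-source.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink.channel = mem-channel-1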

# properties of avro-AppSrv-source
agent_foo.sources.avro-AppSrv-source.type = avro
agent_foo.sources.avro-AppSrv-source.bind = localhost
agent_foo.sources.avro-AppSrv-source.port = 10000

# properties of mem-channel-1
agent_foo.channels.mem-channel-1.type = memory
agent_foo.channels.mem-channel-1.capacity = 1000
agent_foo.channels.mem-channel-1.transactionCapacity = 100

# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata

Adding multiple flows in an agent

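A single Flume agent can contain several independent flows. You can list multiple sources, sinks and channels in a config, and these components can be linked to form multiple flows: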

# list the sources, sinks and channels for the agent
<Agent>.sources = <Source1> <Source2>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

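Then you can link the sources to their corresponding channels and the sinks to their corresponding channel to set up two different flows. For example, if you need two flows in an agent, one going from an external avro client to external HDFS and another from the output of a tail to an avro sink, this config does that: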

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

Configuring a multi agent flow

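To set up a multi-tier flow, the avro/thrift sink of the first hop must point to the avro/thrift source of the next hop. The first Flume agent then forwards events to the next Flume agent. For example, if you periodically send files (one file per event) with the avro client to a local Flume agent, that local agent can forward them to another agent that has the storage mounted.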

Weblog agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source
agent_foo.sinks = avro-forward-sink
agent_foo.channels = file-channel

# define the flow
agent_foo.sources.avro-AppSrv-source.channels = file-channel
agent_foo.sinks.avro-forward-sink.channel = file-channel

# avro sink properties
agent_foo.sinks.avro-forward-sink.type = avro
agent_foo.sinks.avro-forward-sink.hostname = 10.1.1.100
agent_foo.sinks.avro-forward-sink.port = 10000

# configure other pieces
#...

HDFS agent config:

# list sources, sinks and channels in the agent
agent_foo.sources = avro-collection-source
agent_foo.sinks = hdfs-sink
agent_foo.channels = mem-channel

# define the flow
agent_foo.sources.avro-collection-source.channels = mem-channel
agent_foo.sinks.hdfs-sink.channel = mem-channel

# avro source properties
agent_foo.sources.avro-collection-source.type = avro
agent_foo.sources.avro-collection-source.bind = 10.1.1.100
agent_foo.sources.avro-collection-source.port = 10000

# configure other pieces
#...

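Here we link the avro-forward-sink of the weblog agent to the avro-collection-source of the hdfs agent. As a result, events coming from the external appserver source are eventually stored in HDFS.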
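The link between the two hops depends on the sink's hostname and port agreeing with the next agent's source bind address and port. A minimal Python sketch of that consistency check follows; `hops_connect` is a hypothetical helper for illustration, and treating a bind address of 0.0.0.0 as matching any hostname is an assumption of the sketch rather than something the configs above state.

```python
def hops_connect(sink_props, source_props):
    """Check that a forwarding sink targets the address a next-hop source listens on."""
    same_type = sink_props["type"] == source_props["type"]  # avro to avro, thrift to thrift
    same_port = sink_props["port"] == source_props["port"]
    bind = source_props["bind"]
    # Assumption: a source bound to 0.0.0.0 accepts connections on any interface.
    same_host = bind == "0.0.0.0" or bind == sink_props["hostname"]
    return same_type and same_port and same_host


# Values copied from the weblog agent and HDFS agent configs above.
weblog_sink = {"type": "avro", "hostname": "10.1.1.100", "port": "10000"}
hdfs_source = {"type": "avro", "bind": "10.1.1.100", "port": "10000"}

print(hops_connect(weblog_sink, hdfs_source))
```

The check compares literal strings only, so a sink that names the host by DNS name while the source binds to an IP address would need name resolution that this sketch leaves out.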

Fan out flow

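As discussed in the previous section, Flume supports fanning out the flow from one source to multiple channels. There are two modes of fan out, replicating and multiplexing. In a replicating flow, the event is sent to all the configured channels. In a multiplexing flow, the event is sent to only a subset of qualifying channels. To fan out the flow, you specify a list of channels for the source and the policy for fanning it out. This is done by adding a channel "selector" that can be replicating or multiplexing. If it is a multiplexer, you then specify the selection rules. If you don't specify a selector, it is replicating by default: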

# List the sources, sinks and channels for the agent
<Agent>.sources = <Source1>
<Agent>.sinks = <Sink1> <Sink2>
<Agent>.channels = <Channel1> <Channel2>

# set list of channels for source (separated by space)
<Agent>.sources.<Source1>.channels = <Channel1> <Channel2>

# set channel for sinks
<Agent>.sinks.<Sink1>.channel = <Channel1>
<Agent>.sinks.<Sink2>.channel = <Channel2>

<Agent>.sources.<Source1>.selector.type = replicating

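The multiplexing selector has a further set of properties to bifurcate the flow. It requires a mapping from the value of an event attribute to a set of channels. The selector checks the configured attribute in the event header. If it matches a specified value, the event is sent to all the channels mapped to that value. If there is no match, the event is sent to the set of channels configured as the default: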

# Mapping for multiplexing selector
<Agent>.sources.<Source1>.selector.type = multiplexing
<Agent>.sources.<Source1>.selector.header = <someHeader>
<Agent>.sources.<Source1>.selector.mapping.<Value1> = <Channel1>
<Agent>.sources.<Source1>.selector.mapping.<Value2> = <Channel1> <Channel2>
<Agent>.sources.<Source1>.selector.mapping.<Value3> = <Channel2>
#...

<Agent>.sources.<Source1>.selector.default = <Channel2>

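The mapping allows the channels for different values to overlap.
The following example has a single flow that is multiplexed to two paths. The agent named agent_foo has a single avro source and two channels linked to two sinks: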

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

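The selector checks for a header called "State". If the value is "CA", the event is sent to mem-channel-1; if it is "AZ", it goes to file-channel-2; if it is "NY", it goes to both mem-channel-1 and file-channel-2. If the "State" header is not set or does not match any of the three, the event goes to mem-channel-1, which is designated as "default".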
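The routing decision described above amounts to a dictionary lookup with a fallback. A minimal Python sketch, intended only as a model of the rule and not as Flume's selector implementation (`select_channels` is a hypothetical name):

```python
# Mapping and default copied from the channel selector configuration above.
MAPPING = {
    "CA": ["mem-channel-1"],
    "AZ": ["file-channel-2"],
    "NY": ["mem-channel-1", "file-channel-2"],
}
DEFAULT = ["mem-channel-1"]


def select_channels(headers, header_name="State"):
    """Return the channels for an event, falling back to the default set."""
    value = headers.get(header_name)  # None when the header is not set
    return MAPPING.get(value, DEFAULT)


for headers in ({"State": "CA"}, {"State": "NY"}, {"State": "TX"}, {}):
    print(headers, "->", select_channels(headers))
```

An event without the header and an event with an unmapped value take the same path: both fall through to the default channel list.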

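The selector also supports optional channels. To specify optional channels for a header value, use the config parameter "optional" in the following way: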

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

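The selector attempts to write to the required channels first and fails the transaction if even one of these channels fails to consume the event. The transaction is then reattempted on all of the channels. Once all required channels have consumed the event, the selector attempts to write to the optional channels. A failure by any optional channel to consume the event is simply ignored and not retried.

If the optional channels and the required channels for a specific header value overlap, the overlapping channel is considered required, and a failure in that channel causes the entire set of required channels to be retried. For instance, in the example above, for the header value "CA" mem-channel-1 is considered a required channel even though it is marked as both required and optional, and a failure to write to it causes the event to be retried on all channels configured for the selector.

Note that if a header value has no required channels, the event is written to the default channels and a write to the optional channels for that value is also attempted. In other words, specifying optional channels still causes the event to be written to the default channels when no required channels are specified. If no channels are designated as default and there are no required channels, the selector attempts to write the event to the optional channels only. Any failures are simply ignored in that case.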
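The interaction between required, optional and default channels can be modelled in a few lines. The Python sketch below follows the rules stated in the text; it is not Flume's code, `resolve_channels` and `deliver` are hypothetical names, and the `put` callback stands in for a channel write.

```python
def resolve_channels(value, mapping, optional, default):
    """Split the channels for one header value into (required, optional)."""
    # With no mapping for the value, the default channels act as the required set.
    required = list(mapping.get(value, default))
    # A channel listed as both required and optional counts as required.
    extra = [ch for ch in optional.get(value, []) if ch not in required]
    return required, extra


def deliver(event, required, optional, put):
    """Write to required channels first; ignore failures on optional channels."""
    for channel in required:
        put(channel, event)  # an exception here propagates so the caller can retry
    for channel in optional:
        try:
            put(channel, event)
        except Exception:
            pass  # optional channel failures are ignored and not retried


# Values copied from the optional-channel configuration above.
MAPPING = {
    "CA": ["mem-channel-1"],
    "AZ": ["file-channel-2"],
    "NY": ["mem-channel-1", "file-channel-2"],
}
OPTIONAL = {"CA": ["mem-channel-1", "file-channel-2"]}
DEFAULT = ["mem-channel-1"]

print(resolve_channels("CA", MAPPING, OPTIONAL, DEFAULT))
```

For "CA" the sketch yields mem-channel-1 as required and file-channel-2 as optional, which matches the overlap rule described above.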

Flume Sources

Avro Source

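Listens on an Avro port and receives events from external Avro client streams. When paired with the built-in Avro Sink on another (previous hop) Flume agent, it can create tiered collection topologies. Required properties: channels, type, bind, port.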

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be avro
bind	–	Hostname or IP address to listen on
port	–	Port # to bind to
threads	–	Maximum number of worker threads to spawn
selector.type	replicating	replicating or multiplexing
selector.*		Depends on the selector.type value
interceptors	–	Space-separated list of interceptors
interceptors.*
compression-type	none	This can be "none" or "deflate". The compression-type must match the compression-type of the matching Avro sink or client that sends to this source
ssl	false	Set this to true to enable SSL encryption. You must also specify a "keystore" and a "keystore-password".
keystore	–	This is the path to a Java keystore file. Required for SSL.
keystore-password	–	The password for the Java keystore. Required for SSL.
keystore-type	JKS	The type of the Java keystore. This can be "JKS" or "PKCS12".
exclude-protocols	SSLv3	Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.
ipFilter	false	Set this to true to enable ipFiltering for netty
ipFilterRules	–	Define N netty ipFilter pattern rules with this config.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

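Example of ipFilterRules

ipFilterRules defines N netty ipFilter pattern rules separated by commas. Each pattern rule must be in this format:

<'allow' or 'deny'>:<'ip' or 'name' for computer name>:<pattern> or allow/deny:ip/name:pattern

example: ipFilterRules=allow:ip:127.*,allow:name:localhost,deny:ip:*

Note that the first rule to match applies, as the examples below show for a client on localhost:

"allow:name:localhost,deny:ip:*" allows the client on localhost and denies clients from any other IP.
"deny:name:localhost,allow:ip:*" denies the client on localhost and allows clients from any other IP.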
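The first-match behaviour can be illustrated with a small evaluator. This Python sketch is only a model: it uses shell-style wildcard matching from `fnmatch` as a stand-in for netty's pattern rules, and the result when no rule matches (`default=True`) is an assumption of the sketch. `parse_rules` and `is_allowed` are hypothetical names.

```python
from fnmatch import fnmatchcase


def parse_rules(spec):
    """Turn 'allow:ip:127.*,deny:ip:*' into a list of (action, kind, pattern)."""
    rules = []
    for item in spec.split(","):
        action, kind, pattern = item.strip().split(":", 2)
        rules.append((action, kind, pattern))
    return rules


def is_allowed(rules, ip, name, default=True):
    """Apply the first rule whose pattern matches the client's ip or name."""
    for action, kind, pattern in rules:
        candidate = ip if kind == "ip" else name
        if fnmatchcase(candidate, pattern):
            return action == "allow"
    return default  # assumption: no matching rule means the client is accepted


rules = parse_rules("allow:name:localhost,deny:ip:*")
print(is_allowed(rules, ip="127.0.0.1", name="localhost"))
print(is_allowed(rules, ip="10.0.0.5", name="appserver"))
```

Because evaluation stops at the first match, swapping the order of the two rules would deny every client, including the one on localhost.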

Thrift Source

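Listens on a Thrift port and receives events from external Thrift client streams. When paired with the built-in ThriftSink on another (previous hop) Flume agent, it can create tiered collection topologies. The Thrift source can be configured to start in secure mode by enabling kerberos authentication. agent-principal and agent-keytab are the properties the Thrift source uses to authenticate to the kerberos KDC. Required properties: channels, type, bind, port.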

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be thrift
bind	–	Hostname or IP address to listen on
port	–	Port # to bind to
threads	–	Maximum number of worker threads to spawn
selector.type	replicating	replicating or multiplexing
selector.*		Depends on the selector.type value
interceptors	–	Space-separated list of interceptors
interceptors.*
ssl	false	Set this to true to enable SSL encryption. You must also specify a "keystore" and a "keystore-password".
keystore	–	This is the path to a Java keystore file. Required for SSL.
keystore-password	–	The password for the Java keystore. Required for SSL.
keystore-type	JKS	The type of the Java keystore. This can be "JKS" or "PKCS12".
exclude-protocols	SSLv3	Space-separated list of SSL/TLS protocols to exclude. SSLv3 is always excluded in addition to the protocols specified.
kerberos	false	Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode accepts connections only from Thrift clients that have kerberos enabled and have successfully authenticated to the kerberos KDC.
agent-principal	–	The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC.
agent-keytab	–	The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC.

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

Exec Source

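The Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded unless the property logStdErr is set to true). If the process exits for any reason, the source also exits and produces no further data. This means configurations such as cat [named pipe] or tail -F [file] produce the desired results, whereas date probably will not: the former two commands produce streams of data, while the latter produces a single event and exits.

Required properties: channels, type, command.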

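Property Name	Default	Description
channels	–
type	–	The component type name, needs to be exec
command	–	The command to execute
shell	–	A shell invocation used to run the command, e.g. `/bin/sh -c`. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle	10000	Amount of time (in millis) to wait before attempting a restart
restart	false	Whether the executed cmd should be restarted if it dies
logStdErr	false	Whether the command's stderr should be logged
batchSize	20	The max number of lines to read and send to the channel at a time
batchTimeout	3000	Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type	replicating	replicating or multiplexing
selector.*		Depends on the selector.type value
interceptors	–	Space-separated list of interceptors
interceptors.*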

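Warning: The problem with ExecSource and other asynchronous sources is that, if putting an event into the channel fails, the source cannot guarantee that the client learns about it. In such cases the data is lost. For instance, one of the most commonly requested features is the tail -F [file]-like use case, where an application writes to a log file on disk and Flume tails the file, sending each line as an event. While this is possible, there is an obvious problem: what happens if the channel fills up and Flume can't send an event? Flume has no way of telling the application writing the log file that it needs to retain the log or that the event has not been sent. If this doesn't make sense, you need only know this: your application can never guarantee that data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning, and to be completely clear, there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source, the Taildir Source, or direct integration with Flume via the SDK.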

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.channels = c1

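The "shell" config is used to invoke the "command" through a command shell (such as Bash or Powershell). The "command" is passed as an argument to "shell" for execution. This allows the "command" to use shell features such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the "shell" config, the "command" is invoked directly. Common values for "shell": /bin/sh -c, /bin/ksh -c, cmd /c, powershell -Command, etc.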

a1.sources.tailsource-1.type = exec
a1.sources.tailsource-1.shell = /bin/bash -c
a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done
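The difference between running a command directly and running it through "shell" comes down to how the argument vector is built. The Python sketch below models that difference under the assumption that both values are split on whitespace and that the whole command is handed to the shell as one argument; `build_argv` is a hypothetical helper, not a Flume API.

```python
def build_argv(command, shell=None):
    """Build the argument vector for a command, with or without a shell."""
    if shell:
        # The shell invocation is split into words; the command stays one argument,
        # so the shell itself interprets wildcards, loops and pipes.
        return shell.split() + [command]
    # Without a shell the command is split on whitespace and executed directly,
    # so characters such as '*' or ';' are passed through literally.
    return command.split()


print(build_argv("tail -F /var/log/secure"))
print(build_argv("for i in /path/*.txt; do cat $i; done", shell="/bin/bash -c"))
```

Running the loop command without "shell" would try to execute a program named for, which is why the second example needs the shell setting.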

JMS Source

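JMS Source reads messages from a JMS destination such as a queue or topic. Being a JMS application it should work with any JMS provider, but it has only been tested with ActiveMQ. The JMS source provides a configurable batch size, message selector, user/pass, and message-to-Flume-event converter. Note that the vendor-provided JMS jars should be included in the Flume classpath using the plugins.d directory (preferred), -classpath on the command line, or the FLUME_CLASSPATH variable in flume-env.sh.

Required properties: channels, type, initialContextFactory, connectionFactory, providerURL, destinationName, destinationType.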

Property Name	Default	Description
channels	–
type	–	The component type name, needs to be jms
initialContextFactory	–	Initial Context Factory, e.g. org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory	–	The JNDI name the connection factory should appear as
providerURL	–	The JMS provider URL
destinationName	–	Destination name
destinationType	–	Destination type (queue or topic)
messageSelector	–	Message selector to use when creating the consumer
userName	–	Username for the destination/provider
passwordFile	–	File containing the password for the destination/provider
batchSize	100	Number of messages to consume in one batch
converter.type	DEFAULT	Class to use to convert messages to Flume events. See below.
converter.*	–	Converter properties.
converter.charset	UTF-8	Default converter only. Charset to use when converting JMS text messages to byte arrays.
createDurableSubscription	false	Whether to create a durable subscription. A durable subscription can only be used with destinationType topic. If true, "clientId" and "durableSubscriptionName" have to be specified.
clientId	–	JMS client identifier set on the Connection right after it is created. Required for durable subscriptions.
durableSubscriptionName	–	Name used to identify the durable subscription. Required for durable subscriptions.

Converter

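The JMS source allows pluggable converters, though the default converter likely works for most purposes. The default converter can convert Bytes, Text and Object messages to FlumeEvents. In all cases, the properties of the message are added as headers to the FlumeEvent.

BytesMessage: The bytes of the message are copied to the body of the FlumeEvent. No more than 2GB of data can be converted per message.
TextMessage: The text of the message is converted to a byte array and copied to the body of the FlumeEvent. The default converter uses UTF-8 by default, but this is configurable.
ObjectMessage: The object is written to a ByteArrayOutputStream wrapped in an ObjectOutputStream, and the resulting array is copied to the body of the FlumeEvent.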
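The conversion rules for Bytes and Text messages can be sketched as follows. This is a simplified Python model rather than the Java converter itself: `convert_message` is a hypothetical function, an event is represented as a plain dictionary, rendering property values as strings is an assumption of the sketch, and Object messages are left out because they rely on Java serialization.

```python
def convert_message(kind, payload, properties, charset="UTF-8"):
    """Model of the default conversion: properties become headers, payload becomes the body."""
    if kind == "bytes":
        body = bytes(payload)            # bytes are copied into the body as-is
    elif kind == "text":
        body = payload.encode(charset)   # text is encoded with the configured charset
    else:
        raise ValueError(f"unsupported message kind in this sketch: {kind}")
    # Assumption: header values are rendered as strings.
    headers = {key: str(value) for key, value in properties.items()}
    return {"headers": headers, "body": body}


event = convert_message("text", "order received", {"priority": 4, "region": "eu"})
print(event)
```

Passing a different `charset` here corresponds to changing converter.charset in the table above.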

Example for agent named a1:

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE