Flume Official Documentation Translation: Flume 1.7.0 User Guide (unreleased version)

Flume 1.7.0 User Guide

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

Apache Flume is a top level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.7.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration flexibilities available in the latest architecture.


System Requirements

    1. Java Runtime Environment - Java 1.7 or later
    2. Memory - Sufficient memory for configurations used by sources, channels or sinks
    3. Disk Space - Sufficient disk space for configurations used by channels or sinks
    4. Directory Permissions - Read/Write permissions for directories used by the agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
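The two halves of this definition can be pictured with a small sketch, assuming Python for illustration (Flume's real Event is a Java interface with a byte[] body and a Map of string headers; the class below is only a model, not Flume's API):

```python
# Toy model of a Flume event: an opaque byte payload plus an optional
# set of string attributes (headers). Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Event:
    body: bytes                                   # the byte payload
    headers: dict = field(default_factory=dict)   # optional string attributes

e = Event(body=b"Hello world!", headers={"host": "web01"})
assert e.body == b"Hello world!"
assert e.headers.get("host") == "web01"
```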


A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate the storage/retrieval, respectively, of the events in a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.

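These semantics can be sketched with a toy in-memory channel (a model for illustration only; the class and method names below are invented and are not Flume's actual Java API). A put is staged until the source's transaction commits; a take is remembered so a failed downstream write can roll the events back into the channel:

```python
# Toy model of Flume's transactional channel semantics (illustrative only).
from collections import deque

class ToyChannel:
    def __init__(self):
        self.events = deque()   # committed events, visible to the sink
        self._puts = []         # staged by the source's transaction
        self._takes = []        # staged by the sink's transaction

    # --- source side ---
    def put(self, event):
        self._puts.append(event)        # staged, not yet visible

    def commit_put(self):
        self.events.extend(self._puts)  # make events visible to the sink
        self._puts.clear()

    def rollback_put(self):
        self._puts.clear()              # source will retry delivery

    # --- sink side ---
    def take(self):
        event = self.events.popleft()
        self._takes.append(event)       # remembered in case of rollback
        return event

    def commit_take(self):
        self._takes.clear()             # safely stored at the next hop

    def rollback_take(self):
        # downstream store failed: return events to the channel, in order
        for event in reversed(self._takes):
            self.events.appendleft(event)
        self._takes.clear()

ch = ToyChannel()
ch.put("e1"); ch.commit_put()   # source transaction commits
e = ch.take()
ch.rollback_take()              # e.g. the HDFS write failed
assert len(ch.events) == 1      # the event is still in the channel
```

The point of the model is only that an event leaves a channel for good when the sink's transaction commits, i.e. after the next hop has it.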

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format. Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of each source, sink and channel in an agent and how they are wired together to form data flows.

Wiring the pieces together

The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for each sink and source. For example, an agent flows events from an Avro source called avroWeb to HDFS sink hdfs-cluster1 via a file channel called file-channel. The configuration file will contain names of these components and file-channel as a shared channel for both avroWeb source and hdfs-cluster1 sink.

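A sketch of what that wiring could look like in the properties file (only the component names avroWeb, file-channel and hdfs-cluster1 come from the example above; the agent name agent_foo, the port, and the HDFS path are hypothetical):

```properties
# agent named "agent_foo": avroWeb -> file-channel -> hdfs-cluster1
agent_foo.sources = avroWeb
agent_foo.channels = file-channel
agent_foo.sinks = hdfs-cluster1

agent_foo.sources.avroWeb.type = avro
agent_foo.sources.avroWeb.bind = 0.0.0.0
agent_foo.sources.avroWeb.port = 41414

agent_foo.channels.file-channel.type = file

agent_foo.sinks.hdfs-cluster1.type = hdfs
agent_foo.sinks.hdfs-cluster1.hdfs.path = hdfs://namenode/flume/events

# the shared channel ties the source and the sink together
agent_foo.sources.avroWeb.channels = file-channel
agent_foo.sinks.hdfs-cluster1.channel = file-channel
```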

Starting an agent

An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Now the agent will start running the sources and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then describes their types and configuration parameters. A given configuration file might define several named agents; when a given Flume process is launched a flag is passed telling it which named agent to manifest.
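For instance, the same file could define a second agent alongside a1; launching with --name a2 would then start only that agent (the a2 components and the port below are hypothetical, following the pattern of the example above):

```properties
# a second, independent agent in the same file
a2.sources = r2
a2.sinks = k2
a2.channels = c2

a2.sources.r2.type = netcat
a2.sources.r2.bind = localhost
a2.sources.r2.port = 44445

a2.sinks.k2.type = logger
a2.channels.c2.type = memory

a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
```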

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the console and we go without a custom environment script.
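A minimal flume-env.sh could look like the following (the heap sizes are illustrative values, not recommendations; the distribution ships a flume-env.sh.template that exposes JAVA_OPTS in the same way):

```shell
# conf/flume-env.sh -- sourced by the flume-ng script when --conf points here
export JAVA_OPTS="-Xms100m -Xmx2000m"
```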

From a separate terminal, we can then telnet port 44444 and send Flume an event:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK
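If telnet is not available, any TCP client will do; here is a minimal Python sketch (the send_event helper is our own, not part of Flume, and assumes the host and port from the example config):

```python
# Minimal TCP client for the netcat source (alternative to telnet).
import socket

def send_event(host, port, line):
    """Send one line to the netcat source and return its reply."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode() + b"\r\n")
        # the netcat source acknowledges each received line with "OK"
        return sock.recv(16).decode().strip()

# Usage (with the agent from example.conf running):
# send_event("localhost", 44444, "Hello world!")
```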

The original Flume terminal will output the event in a log message.

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

Congratulations - you’ve successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in much more detail.

Reposted from www.linuxidc.com/Linux/2016-12/138030.htm