Hadoop Big Data Technology (8): Flume Log Collection System

Table of contents

Materials

1. Overview of Flume

1. Understanding of Flume

2. The operating mechanism of Flume

(1) Source (data collector)

(2) Channel (buffer channel)

(3) Sink (receiver)

3. Flume's log collection system structure

(1) Simple structure

(2) Complex structure

2. Basic use of Flume

1. System requirements

2. Flume installation

(1) Download Flume

(2) Unzip

(3) Rename the directory

(4) Configure the Flume environment

3. Getting started with Flume

(1) Configure the Flume collection scheme

(2) Start Flume with the specified collection scheme

(3) Flume data collection test

3. Configuration description of Flume collection scheme

1. Flume Sources

(1) Avro Source

(2) Spooling Directory Source

(3) Taildir Source

(4) HTTP Source

2. Flume Channels

(1) Memory Channel

(2) File Channel

3. Flume Sinks

(1) HDFS Sink

(2) Logger Sink

(3) Avro Sink

4. Flume's reliability guarantee

1. Load balancing

(1) Build and configure the Flume machines

(2) Configure the Flume collection scheme

a. exec-avro.conf

b. netcat-logger.conf

(3) Start the Flume system

(4) Flume system load balancing test

2. Failover

(1) Configure the Flume collection scheme

a. exec-avro-failover.conf

b. avro-logger-memory.conf

(2) Start the Flume system

(3) Flume system failover test

5. Flume interceptors

1. Timestamp Interceptor

2. Static Interceptor

3. Search and Replace Interceptor

6. Case: log collection

1. Configure the collection scheme

(1) exec-avro_logCollection.conf

(2) avro-hdfs_logCollection.conf

2. Start the Hadoop cluster

3. Start the Flume system

4. Log collection system test

Reference books


Materials

Link: https://pan.baidu.com/s/19cSqan67QhB_x3vdnXsANQ?pwd=fh35 (extraction code: fh35)

1. Overview of Flume

1. Understanding of Flume

        Flume was originally a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data, provided by Cloudera; it was later donated to Apache and became a top-level open source project. Apache Flume is not limited to log data collection: because the data sources it collects from are customizable, Flume can also be used to transport large volumes of event data, including but not limited to network traffic data, social media data, email messages, and almost any other possible data source.

        Currently, Flume comes in two major lines: the 0.9.x releases, collectively referred to as Flume-og (original generation), and the 1.x releases, collectively referred to as Flume-ng (next generation). Because the early Flume-og suffered from unreasonable design, bloated code, and poor extensibility, after Flume joined Apache the developers refactored the Cloudera Flume code, supplemented and strengthened its features, and renamed it Apache Flume; hence the two completely different versions, Flume-og and Flume-ng. In actual development, the currently popular Flume-ng line is used for most Flume work.

2. The operating mechanism of Flume

        The core of Flume is to collect data from a data source (for example, a web server) through a data collector (Source), buffer the collected data in a Channel, and deliver it to a designated receiver (Sink). Refer to the diagram in the official documentation.

        The basic architecture of Flume contains an Agent, which is the core role in Flume. An Agent is a JVM process that carries the three core components, Source, Channel, and Sink, through which data flows from an external source to the next destination.

(1) Source (data collector)

        Collects source data, for example from a web server, writes the collected data into the Channel, and lets it flow toward the Sink;

(2) Channel (buffer channel)

        The underlying implementation is a buffer queue that caches the data from the Source and writes it to the Sink efficiently and accurately. Once all of the data has reached the Sink, Flume deletes it from the buffer channel;

(3) Sink (receiver)

        Receives all the data flowing into the Sink. Depending on requirements, the data can either be stored directly in a centralized way (for example, in HDFS) or passed on as a data source to another remote server or Source.

        During the entire transmission process, Flume encapsulates the flowing data into events, the basic unit of data transfer inside Flume. A complete event consists of headers and a body: the headers contain identification information, and the body is the data information collected by Flume.
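        As an illustration of the headers/body split, a Logger Sink prints each event roughly in the shape shown below (the exact format can vary by Flume version, and the type=access header is only an example value):

# Event: { headers:{type=access} body: 68 65 6C 6C 6F       hello }
# "headers" holds key/value identification info; "body" is the raw collected data, shown as hex plus text.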

3. Flume's log collection system structure

(1) Simple structure

        When the data source to be collected is relatively simple and unitary, a single Agent can be used directly for data collection and final storage.

(2) Complex structure

        When the data sources to be collected are distributed across different servers, a single Agent is no longer suitable. In that case, multiple Agents can be deployed according to business needs: an Agent is built on each web server whose data needs to be collected, and the data from these Agents is then used as the source of a next-level Agent, which aggregates it and finally stores it centrally in HDFS. In addition, during development it may be necessary to collect data from the same server and deliver it to different destinations through multiplexing flows; according to specific needs, the data collected by one Agent can flow through different Channels to different Sinks and then be transmitted or stored in the next stage.

2. Basic use of Flume

1. System requirements

        As a top-level Apache project, Flume has certain system requirements that must be met before it can be used for development; the official documentation is authoritative here. The specific requirements are as follows.

(1) Install a Java 1.8 or later runtime environment (for the Flume 1.8 version used here);

(2) Provide enough memory space for the configuration of the Source (data collector), Channel (buffer channel), and Sink (receiver);

(3) Provide enough disk space for the configuration of the Channel (buffer channel) and Sink (receiver);

(4) Ensure that the Agent has read and write permissions on the directories it operates on.

        Among the above requirements, the Java runtime version corresponds to the Flume version being installed: if Flume 1.6 were used, Java 1.6 or later would suffice; since the rest of this chapter is based on Flume 1.8.0, Java 1.8 or later is required.
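        A quick way to confirm this requirement is to check the installed Java version before installing Flume:

java -version    # should report version 1.8.x or later for Flume 1.8.0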

2. Flume installation

(1) Download Flume

        Download flume to the /export/software/ directory

https://dlcdn.apache.org/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz

(2) Unzip

Enter the /export/software/ directory and run:
tar -xzvf apache-flume-1.8.0-bin.tar.gz -C /export/servers/

(3) Rename the directory

Enter the /export/servers/ directory and run:
mv apache-flume-1.8.0-bin flume

(4) Configure the Flume environment

Configure flume-env.sh:
cd /export/servers/flume/conf
cp flume-env.sh.template flume-env.sh

vi flume-env.sh    # edit the file and add the following line
export JAVA_HOME=/export/servers/jdk

Configure /etc/profile:
vi /etc/profile    # edit the file and add the following lines
export FLUME_HOME=/export/servers/flume
export PATH=$PATH:$FLUME_HOME/bin
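After editing /etc/profile, reload it and verify that Flume is on the PATH (the version subcommand is part of the flume-ng script):

source /etc/profile
flume-ng version    # should print Apache Flume 1.8.0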

3. Getting started with Flume

(1) Configure the Flume collection scheme

        Configure netcat-logger.conf in the /export/servers/flume/conf directory; the relevant code is as follows.

# Example configuration: a single-node Flume setup

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = logger

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

a. The name of a collection scheme can be customized, but for convenience of management and use it is usually named after the data source type and the result type. For example, netcat-logger.conf means that netcat-type source data is collected and finally output as logger log information.

b. The collection scheme file can be stored in any location; when starting Flume, you must specify its exact location. For easier, unified management, collection schemes are usually stored in one place. In this case, all custom collection scheme files are saved in the /export/servers/flume/conf directory.

c. The sources, channels, and sinks in the collection plan are configured according to business requirements during specific writing, and cannot be defined arbitrarily. The data types supported by Flume can be learned in detail by viewing the official website (address https://flume.apache.org/FlumeUserGuide.html ), and different configuration properties need to be written for different sources type, channels type, and sinks type.

        Note: when writing a collection scheme, it is particularly easy to make mistakes in the association and binding of Source, Sink, and Channel, for example a1.sources.r1.channels = c1 and a1.sinks.k1.channel = c1 in netcat-logger.conf: the sources property (channels) has one more "s" than the sinks property (channel). This is because, within an Agent, the same Source can write to multiple Channels, so the plural form channels is used in its configuration, while a Sink can serve only one Channel, so the singular channel must be used.

(2) Start Flume with the specified collection scheme

Enter the /export/servers/flume directory and start Flume with the specified collection scheme:
flume-ng agent --conf conf/ --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

        After executing the above command, Flume starts with the previously written collection scheme netcat-logger.conf. According to the scheme, the Flume system listens for netcat-type source data sent to port 44444 of the local host (localhost), and collects and delivers the received information to the logger-type Sink.

Next, each part of the above command is explained as follows.

a. flume-ng agent: start an agent using flume-ng;

b. --conf conf/: the --conf option specifies the path of Flume's own configuration files; it can be abbreviated as -c;

c. --conf-file conf/netcat-logger.conf: the --conf-file option specifies the collection scheme written by the developer; it can be abbreviated as -f. Pay attention to the path of the scheme file; it is advisable to use an absolute path, otherwise a file-not-found error may be reported;

d. --name a1: start the agent named a1; the name must be consistent with the agent name in the collection scheme;

e. -Dflume.root.logger=INFO,console: output the collected and processed information to the console as logger log messages.
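Using the abbreviated options described above (the flume-ng script also accepts -n for --name), the same start command can be written in the shorter, equivalent form:

flume-ng agent -c conf/ -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console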

(3) Flume data collection test

If "-bash: telnet: command not found" appears, install the telnet tool with:
yum -y install telnet

Use telnet to connect to port 44444 on localhost, and keep sending messages as the source data that Flume will collect:
telnet localhost 44444

After connecting, send some messages:
hello
OK (message received)

world
OK (message received)

3. Configuration description of Flume collection scheme

1. Flume Sources

        When writing a Flume collection scheme, first clarify the type and origin of the data source to be collected; then match this information against the Flume Sources that Flume supports and select the corresponding collector type (source.type); finally, configure the required and optional collector properties for the selected type. Flume provides and supports many Sources; for details, see https://flume.apache.org/FlumeUserGuide.html#flume-sources. Some commonly used Flume Sources are described below.

(1) Avro Source

        Listens on an Avro port and receives event data from external Avro client streams. When paired with the Avro Sink on another Flume Agent, it can create tiered collection topologies; with Avro Source, effects such as multi-level flows, fan-out flows, and fan-in flows can be achieved.

Common configuration properties of Avro Source (required: channels, type, bind, port)
Property            Default   Description
channels            --
type                --        The component type name must be avro
bind                --        The hostname or IP address to listen on
port                --        The service port to listen on
threads             --        Maximum number of worker threads to spawn
ssl                 false     Set to true to enable SSL encryption; keystore and keystore-password must then be specified
keystore            --        Path to the Java keystore required for SSL
keystore-password   --        Password for the Java keystore required for SSL
Configure an Agent named a1 with an Avro Source collector:
a1.sources=r1
a1.channels=c1
a1.sources.r1.type=avro
a1.sources.r1.channels=c1
a1.sources.r1.bind=0.0.0.0
a1.sources.r1.port=4141

(2) Spooling Directory Source

        Spooling Directory Source monitors a specified directory on disk, detects new files placed in that directory, and reads the data from those files.

Common configuration properties of Spooling Directory Source (required: channels, type, spoolDir)
Property         Default      Description
channels         --
type             --           The component type name must be spooldir
spoolDir         --           The directory from which to read files
fileSuffix       .COMPLETED   Suffix appended to fully ingested files
deletePolicy     never        When to delete completed files: never or immediate
fileHeader       false        Whether to add a header storing the absolute path of the file
includePattern   ^.*$         Regular expression specifying which files to include
ignorePattern    ^$           Regular expression specifying which files to ignore
Configure an Agent named a1 with a Spooling Directory Source collector:
a1.channels=ch-1
a1.sources=src-1
a1.sources.src-1.type=spooldir
a1.sources.src-1.channels=ch-1
a1.sources.src-1.spoolDir=/var/log/apache/flumeSpool
a1.sources.src-1.fileHeader=true
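As a quick way to exercise this scheme (the file name below is hypothetical), copy a file into the spool directory configured above; once Flume has fully ingested it, the file is renamed with the .COMPLETED suffix:

cp /root/test.log /var/log/apache/flumeSpool/
ls /var/log/apache/flumeSpool/    # shows test.log.COMPLETED after ingestion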

(3) Taildir Source

        Taildir Source watches the specified files and monitors, in near real time, the new lines appended to each file. If new lines are being written to a file, this collector retries reading them while waiting for the write to complete.

Common configuration properties of Taildir Source (required: channels, type, filegroups, filegroups.<filegroupName>)
Property                     Default   Description
channels                     --
type                         --        The component type name must be TAILDIR
filegroups                   --        A space-separated list of file groups; each file group specifies a set of files to monitor
filegroups.<filegroupName>   --        Absolute path of the file group; regular expressions (not file-system patterns) can be used only in the file name
idleTimeout                  120000    Time (ms) after which an inactive file is closed; the source automatically reopens a closed file if new lines are appended to it
writePosInterval             3000      Interval (ms) at which the last read position of each file is written to the position file
batchSize                    100       Maximum number of lines read and sent to the channel at a time; the default is usually fine
backoffSleepIncrement        1000      Increment of the time delay before re-polling for new data when the last attempt found none
fileHeader                   false     Whether to add a header storing the absolute path of the file
fileHeaderKey                file      Header key to use when appending the absolute path file name to the event header
Configure an Agent named a1 with a Taildir Source collector:
a1.sources=r1
a1.channels=c1
a1.sources.r1.type=TAILDIR
a1.sources.r1.channels=c1
a1.sources.r1.positionFile=/var/log/flume/taildir_position.json
a1.sources.r1.filegroups=f1 f2
a1.sources.r1.filegroups.f1=/var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1=value1
a1.sources.r1.filegroups.f2=/var/log/test2/.*log.*
a1.sources.r1.headers.f2.headersKey1=value2
a1.sources.r1.headers.f2.headersKey2=value2-2
a1.sources.r1.fileHeader=true

(4) HTTP Source

        HTTP Source receives Flume events through HTTP POST and GET requests (GET should normally be used only for testing). HTTP requests are converted into Flume events by a pluggable handler that implements the HTTPSourceHandler interface; this handler takes an HttpServletRequest and returns a list of Flume events. All events produced from one HTTP request are committed to the channel in a single transaction, which improves efficiency on channels such as the file channel. If the handler throws an exception, the source returns HTTP status 400; if the channel is full, or the source can no longer append events to the channel, it returns HTTP status 503. All events sent in one POST request are treated as one batch and inserted into the channel in one transaction.

Common configuration properties of HTTP Source (required: channels, type, port)
Property    Default                                     Description
channels    --
type        --                                          The component type name must be http
port        --                                          The port the source binds to
bind        0.0.0.0                                     The hostname or IP address to listen on
handler     org.apache.flume.source.http.JSONHandler    Fully qualified class name of the handler
handler.*   --                                          Configuration parameters of the handler
Configure an Agent named a1 with an HTTP Source collector:
a1.sources=r1
a1.channels=c1
a1.sources.r1.type=http
a1.sources.r1.port=5140
a1.sources.r1.channels=c1
a1.sources.r1.handler=org.example.rest.RestHandler
a1.sources.r1.handler.nickname=random props
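For a quick test, events can be posted with curl. The sketch below assumes the default JSONHandler (i.e. the custom handler lines above are omitted); it sends one event whose headers and body follow the JSON format that handler expects:

curl -X POST -H 'Content-Type: application/json' \
     -d '[{"headers": {"source": "test"}, "body": "hello flume"}]' \
     http://localhost:5140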

2. Flume Channels

        A Channel is the repository where events are staged on an Agent: the Source adds events to the Channel, and the Sink removes them after reading. When configuring Channels, first be clear about the type of source data to be transmitted; then, based on this information and the actual development requirements, choose one of the Channels that Flume supports; finally, configure the required and optional Channel properties for the chosen type.

        See the official documentation: https://flume.apache.org/FlumeUserGuide.html#flume-channels

(1) Memory Channel

        Memory Channel stores events in an in-memory queue with a configurable maximum size. It is well suited to flows that need higher throughput, but staged data is lost if the Agent fails.

Common configuration properties of Memory Channel (required: type)
Property                       Default             Description
type                           --                  The component type name must be memory
capacity                       100                 Maximum number of events stored in the channel
transactionCapacity            100                 Maximum number of events the channel accepts from a Source or gives to a Sink per transaction
keep-alive                     3                   Timeout (s) for adding or removing an event
byteCapacityBufferPercentage   20                  Percentage of buffer between byteCapacity and the estimated total size of all events in the channel, to account for the data in headers
byteCapacity                   (see description)   Maximum total memory, in bytes, allowed for all events in this channel; only event bodies are counted, which is why the byteCapacityBufferPercentage parameter is provided. The computed default equals 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line)
Configure an Agent named a1 with a Memory Channel:
a1.channels=c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000

(2) File Channel

        The File Channel is Flume's durable channel. It writes all events to disk, so it does not lose data when a process or machine shuts down or crashes. The File Channel improves throughput by committing multiple events in a single transaction; once a transaction is committed, the data will not be lost.

Common configuration properties of File Channel (required: type)
Property              Default                            Description
type                  --                                 The component type name must be file
checkpointDir         ~/.flume/file-channel/checkpoint   Directory where checkpoint files are stored
useDualCheckpoints    false                              Back up the checkpoint; if set to true, backupCheckpointDir must be set
backupCheckpointDir   --                                 Directory for the backup checkpoint; it must not be the same as the data directories or the checkpoint directory
dataDirs              ~/.flume/file-channel/data         Directories where data is stored
transactionCapacity   10000                              Maximum transaction capacity
checkpointInterval    30000                              Time between checkpoints (ms)
maxFileSize           2146435071                         Maximum size of a single log file (bytes)
capacity              100000                             Maximum capacity of the channel
Configure an Agent named a1 with a File Channel:
a1.channels=c1
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/mnt/flume/checkpoint
a1.channels.c1.dataDirs=/mnt/flume/data
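Before starting an Agent with this channel, make sure the checkpoint and data directories exist (if they do not already) and are writable by the user running Flume; the paths below follow the example above:

mkdir -p /mnt/flume/checkpoint /mnt/flume/data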

3. Flume Sinks

        Data collected by Flume Sources flows through Channels into Sinks. At this point the Sink acts as a staging hub: it is configured according to downstream requirements, and ultimately either stores the data centrally (for example, directly in HDFS) or passes it on as the Source of another Agent. When configuring Sinks, first be clear about the destination and result type of the data to be transmitted; then, based on these requirements, choose one of the Sinks that Flume supports; finally, configure the required and optional Sink properties for the chosen type.

        For details, see the official documentation: https://flume.apache.org/FlumeUserGuide.html#flume-sinks

(1) HDFS Sink

        HDFS Sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, as well as compressed versions of both file types. HDFS Sink can roll files periodically (close the current file and create a new one) based on elapsed time, data size, or number of events, and it can also bucket/partition data by attributes such as the event timestamp or the originating machine. The HDFS directory path may contain formatting escape sequences that the HDFS sink replaces to generate the directory/file name used to store events. Using HDFS Sink requires Hadoop to be installed so that Flume can use the Hadoop jars to communicate with the HDFS cluster.

Common configuration properties of HDFS Sink (required: channel, type, hdfs.path)
Property                 Default     Description
channel                  --
type                     --          The component type name must be hdfs
hdfs.path                --          HDFS directory path (e.g. hdfs://namenode/flume/webdata/)
hdfs.filePrefix          FlumeData   Prefix for the files created by Flume in the HDFS directory
hdfs.round               false       Whether the timestamp should be rounded down (if true, affects all time-based escape sequences except %t)
hdfs.roundValue          1           Round down to the highest multiple of this value (in the unit configured by hdfs.roundUnit) that is less than the current time
hdfs.roundUnit           second      Unit of the rounding value (second, minute, or hour)
hdfs.rollInterval        30          Number of seconds to wait before rolling the current file (0 = never roll based on a time interval)
hdfs.rollSize            1024        File size that triggers a roll, in bytes (0 = never roll based on file size)
hdfs.rollCount           10          Number of events written to a file before it is rolled (0 = never roll based on the number of events)
hdfs.batchSize           100         Number of events written to a file before it is flushed to HDFS
hdfs.useLocalTimeStamp   false       Use the local time (instead of the timestamp from the event header) when replacing escape sequences
Configure an Agent named a1 with an HDFS Sink:
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.path=/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix=events-
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
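Note that the time-based escapes in hdfs.path require a timestamp header on each event (add a Timestamp interceptor upstream, or set hdfs.useLocalTimeStamp=true). Once events are flowing, the bucketed output can be inspected in HDFS, for example:

hdfs dfs -ls -R /flume/events    # lists the time-bucketed directories and the rolled files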

(2) Logger Sink

        Logger Sink logs events at the INFO level and is typically used for testing and debugging. The Logger Sink is the exception that does not require the extra configuration explained in the "logging raw data" section of the official documentation.

Common configuration properties of Logger Sink (required: channel, type)
Property        Default   Description
channel         --
type            --        The component type name must be logger
maxBytesToLog   16        Maximum number of bytes of the event body to log
Configure an Agent named a1 with a Logger Sink:
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=logger
a1.sinks.k1.channel=c1

(3) Avro Sink

        Avro Sink forms one half of Flume's tiered collection support. Flume events sent to this sink are converted into Avro events and sent to the configured hostname/port pair; events are taken from the configured Channel in batches of the configured batch size.

Common configuration properties of Avro Sink (required: channel, type, hostname, port)
Property          Default   Description
channel           --
type              --        The component type name must be avro
hostname          --        The hostname or IP address of the destination to send events to
port              --        The service port of the destination
batch-size        100       Number of events to batch together per send
connect-timeout   20000     Time (ms) allowed for the first (handshake) request
request-timeout   20000     Time (ms) allowed for requests after the first
Configure an Agent named a1 with an Avro Sink:
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=avro
a1.sinks.k1.channel=c1
a1.sinks.k1.hostname=10.10.10.10
a1.sinks.k1.port=4545

4. Flume's reliability guarantee

        In the introductory usage described earlier, the configured collection scheme uses a single Sink as the receiver of the collected data. Sometimes, however, the current Sink fails, or the volume of collection requests becomes large; in such cases a single-Sink configuration may no longer guarantee the reliability of a Flume deployment. For this, Flume provides Flume Sink Processors to solve the problem.

        A sink processor lets developers define a sink group that gathers multiple Sinks into one entity. The sink processor can then provide load balancing across the Sinks in the group, or fail over from one Sink to another when a Sink suffers a transient failure.

1. Load balancing

        The Load balancing sink processor provides the ability to load-balance traffic over multiple Sinks. It maintains an indexed list of active Sinks over which the load must be distributed. It supports round_robin and random selection mechanisms for distributing traffic; the default is round_robin, which can be overridden via configuration. Custom selection mechanisms are also supported through custom classes that inherit from AbstractSinkSelector.

        At runtime the selector picks the next available Sink according to the configured selection mechanism and invokes it. For both round_robin and random, if the selected Sink fails to deliver the events, the processor picks the next available Sink via the configured mechanism. This implementation does not blacklist the failing Sink; instead, it keeps optimistically trying every available Sink. If all Sink invocations fail, the selector propagates the failure to the sink runner.

        If the backoff property is enabled, the sink processor blacklists Sinks that fail. When the backoff timeout expires, if the Sink is still unresponsive, the timeout increases exponentially to avoid getting stuck waiting on an unresponsive Sink. With backoff disabled under the round_robin mechanism, the load of a failed Sink is simply passed on to the next Sink in the queue, so the distribution is no longer balanced.

Configuration properties of the Load balancing sink processor (required: sinks, processor.type)
Property                        Default       Description
sinks                           --            Space-separated list of the sinks participating in the sink group
processor.type                  default       The component type name must be load_balance
processor.backoff               false         Whether failed sinks are blacklisted (backed off)
processor.selector              round_robin   Selection mechanism: round_robin, random, or the fully qualified class name of a custom selector inheriting from AbstractSinkSelector
processor.selector.maxTimeOut   30000         Timeout for keeping a failed sink on the blacklist; if the sink still cannot be enabled after the specified time, the timeout grows exponentially

        The default value of processor.type is default, because processor.type offers three processing mechanisms: default, failover, and load_balance. default means that a single sink is configured, which is very simple to set up and use and does not require wrapping the sink in a sink group; failover and load_balance represent the configuration properties for failover and load balancing respectively.

Configure an Agent named a1 with a Load balancing sink processor:
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=load_balance
a1.sinkgroups.g1.processor.backoff=true
a1.sinkgroups.g1.processor.selector=random

(1) Build and configure the Flume machines

On hadoop01, copy the Flume installation to hadoop02 and hadoop03:
scp -r /export/servers/flume hadoop02.bgd01:/export/servers/
scp -r /export/servers/flume hadoop03.bgd01:/export/servers/

scp -r /etc/profile hadoop02.bgd01:/etc/profile
scp -r /etc/profile hadoop03.bgd01:/etc/profile

Run the following command on hadoop02 and hadoop03 respectively to refresh the configuration immediately:
source /etc/profile

(2) Configure the Flume collection scheme

a. exec-avro.conf

        On hadoop01.bgd01, configure the first-level collection: write the collection scheme exec-avro.conf in the /export/servers/flume/conf/ directory.

# First-level collection scheme for the load balancing sink processor
a1.sources = r1

# Configure 2 sinks, separated by a space
a1.sinks = k1 k2
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/123.log

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Configure sink1: flows to hadoop02 and is collected by the Agent on hadoop02
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02.bgd01
a1.sinks.k1.port = 52020

# Configure sink2: flows to hadoop03 and is collected by the Agent on hadoop03
a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03.bgd01
a1.sinks.k2.port = 52020

# Configure the sink group and its processor policy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000

b. netcat-logger.conf

        Configure the second-level collection on hadoop02.bgd01 and hadoop03.bgd01 respectively: write the collection scheme netcat-logger.conf in the /export/servers/flume/conf/ directory on each machine.

# Second-level collection scheme for the load balancing sink processor: the sink branch on hadoop02.bgd01

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02.bgd01
a1.sources.r1.port = 52020

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = logger

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Second-level collection scheme for the load balancing sink processor: the sink branch on hadoop03.bgd01

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03.bgd01
a1.sources.r1.port = 52020

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = logger

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

        The only difference between the two collection schemes is source.bind: on hadoop02.bgd01 it is source.bind=hadoop02.bgd01, while on hadoop03.bgd01 it is source.bind=hadoop03.bgd01. Both files define an Agent named a1 with source.type=avro, source.bind=hadoop02.bgd01/hadoop03.bgd01, and source.port=52020, deliberately matching the data type and transmission target of the Sinks of the previous Agent on hadoop01.bgd01; finally, the second-level scheme sets sink.type=logger, so the re-collected data is printed as log output.

(3) Start the Flume system

1. On hadoop02 and hadoop03 respectively, enter the /export/servers/flume directory and start Flume with the netcat-logger.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/netcat-logger.conf --name a1 -Dflume.root.logger=INFO,console

2. On hadoop01, enter the /export/servers/flume directory and start Flume with the exec-avro.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/exec-avro.conf --name a1 -Dflume.root.logger=INFO,console

(Startup output on hadoop02.bgd01, hadoop03.bgd01, and hadoop01.bgd01.)

(4) Flume system load balancing test

Create the logs directory under /root on hadoop01.bgd01:
mkdir -p /root/logs

On hadoop01, open a new terminal and run the following command to keep generating log data:
while true; do echo "access access ..." >> /root/logs/123.log; sleep 1; done

(Collection output on hadoop02.bgd01 and hadoop03.bgd01.)

2. Failover

        The Failover Sink Processor maintains a prioritized list of sinks, guaranteeing that events are processed as long as at least one sink is available. The failover mechanism works by relegating failed sinks to a pool in which they are assigned a cool-down period that grows before they are retried; once a sink successfully sends an event, it is restored to the live pool. Sinks have priorities associated with them: the larger the number, the higher the priority. If a sink fails while sending an event, the next sink with the highest priority is tried. If no priority is specified, priorities are determined by the order in which the sinks are listed in the configuration file.

Configuration properties of the Failover Sink Processor (required: sinks, processor.type, processor.priority.<sinkName>)
Property                        Default   Description
sinks                           --        Space-separated list of the sinks participating in the sink group
processor.type                  default   The component type name must be failover
processor.priority.<sinkName>   --        Priority value of the sink
processor.maxpenalty            30000     Maximum backoff time for a failed sink (ms)
Configure an Agent named a1 with a Failover Sink Processor:

a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=failover
a1.sinkgroups.g1.processor.priority.k1=5
a1.sinkgroups.g1.processor.priority.k2=10
a1.sinkgroups.g1.processor.maxpenalty=10000

(1) Configure the Flume collection scheme

a. exec-avro-failover.conf

        On hadoop01.bgd01, configure the first-level collection: write the collection scheme exec-avro-failover.conf in the /export/servers/flume/conf/ directory.

# First-level collection scheme for the failover sink processor
a1.sources = r1

# Configure 2 sinks, separated by a space
a1.sinks = k1 k2
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/456.log

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Configure sink1: flows to hadoop02 and is collected by the Agent on hadoop02
a1.sinks.k1.channel = c1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02.bgd01
a1.sinks.k1.port = 52020

# Configure sink2: flows to hadoop03 and is collected by the Agent on hadoop03
a1.sinks.k2.channel = c1
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop03.bgd01
a1.sinks.k2.port = 52020

# Configure the sink group and its processor policy
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1=5
a1.sinkgroups.g1.processor.priority.k2=10
a1.sinkgroups.g1.processor.maxpenalty = 10000

b. avro-logger-memory.conf

        Configure the second-level collection on hadoop02.bgd01 and hadoop03.bgd01 respectively: write the collection scheme avro-logger-memory.conf in the /export/servers/flume/conf/ directory on each machine.

# Second-level collection scheme for the failover sink processor: the sink branch on hadoop02.bgd01

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop02.bgd01
a1.sources.r1.port = 52020

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = logger

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Second-level collection scheme for the failover sink processor: the sink branch on hadoop03.bgd01

# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop03.bgd01
a1.sources.r1.port = 52020

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = logger

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(2) Start the Flume system

1. On hadoop02 and hadoop03 respectively, enter the /export/servers/flume directory and start Flume with the avro-logger-memory.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/avro-logger-memory.conf --name a1 -Dflume.root.logger=INFO,console

2. On hadoop01, enter the /export/servers/flume directory and start Flume with the exec-avro-failover.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/exec-avro-failover.conf --name a1 -Dflume.root.logger=INFO,console

(Startup output on hadoop01.bgd01, hadoop02.bgd01, and hadoop03.bgd01.)

(3) Flume system failover test

On hadoop01.bgd01, open a new terminal and run the following command to keep generating log data:
while true; do echo "access access ..." >> /root/logs/456.log; sleep 1; done

(Collection output on hadoop02.bgd01 and hadoop03.bgd01.)

5. Flume interceptors

        Flume Interceptors are mainly used to modify the events flowing through a Flume data stream. To use an interceptor, simply configure it in the collection scheme by referring to the official configuration properties. When multiple interceptors are configured, their names are separated by spaces, and the order in which they are configured is the order in which they intercept. Only a few commonly used interceptors are briefly described here; for details, see the official documentation:

https://flume.apache.org/FlumeUserGuide.html#flume-interceptors

1. Timestamp Interceptor

        The Timestamp Interceptor inserts the time at which it processes the event into the event headers. This interceptor inserts a header with the key timestamp (or a key name specified by the header property) whose value is the corresponding timestamp. If a timestamp header already exists, this interceptor can preserve the existing value.

Common configuration properties of the Timestamp Interceptor (required: type)
Property           Default     Description
type               --          The component type name must be timestamp
header             timestamp   Name of the header in which to place the generated timestamp
preserveExisting   false       Whether an existing timestamp should be preserved (true or false)
Configure a Timestamp Interceptor in an Agent named a1:
a1.sources=r1
a1.channels=c1
a1.sources.r1.channels=c1
a1.sources.r1.type=seq
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

2. Static Interceptor

        The Static Interceptor allows the user to append a header with a static value to all events. The current implementation does not support specifying multiple headers at once, but the user can define multiple Static Interceptors, each of which appends one header.

Common configuration properties of the Static Interceptor (required: type)
Property           Default   Description
type               --        The component type name must be static
preserveExisting   true      Whether the configured header should be preserved if it already exists
key                key       Name of the header to create
value              value     Static value of the header to create
Configure a Static Interceptor in an Agent named a1:
a1.sources=r1
a1.channels=c1
a1.sources.r1.channels=c1
a1.sources.r1.type=seq
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=static
a1.sources.r1.interceptors.i1.key=datacenter
a1.sources.r1.interceptors.i1.value=BEIJING

3. Search and Replace Interceptor

        The Search and Replace Interceptor provides simple string search-and-replace functionality based on Java regular expressions, with support for backtracking/group capture. It follows the same rules as the Matcher.replaceAll() method.

Common configuration properties of the Search and Replace Interceptor (required: type, searchPattern, replaceString)
Property        Default   Description
type            --        The component type name must be search_replace
searchPattern   --        The pattern to search for and replace
replaceString   --        The replacement string
charset         UTF-8     Character set of the event body; defaults to UTF-8
An example of configuring a Search and Replace Interceptor in an Agent named a1:
a1.sources=r1
a1.channels=c1
a1.sources.r1.channels=c1
a1.sources.r1.type=seq
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=search_replace
# Remove leading alphanumeric characters from the event body
a1.sources.r1.interceptors.i1.searchPattern=^[A-Za-z0-9_]+
a1.sources.r1.interceptors.i1.replaceString=
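As an illustration (the sample input below is hypothetical), the pattern and empty replacement above strip the leading run of alphanumeric/underscore characters from each event body:

# Event body before interception: "INFO123 user login ok"
# Event body after interception:  " user login ok"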

6. Case: log collection

1. Configure the collection scheme

(1) exec-avro_logCollection.conf

        Configure the same collection scheme on hadoop02.bgd01 and hadoop03.bgd01: write the collection scheme exec-avro_logCollection.conf in the /export/servers/flume/conf/ directory on each machine.

# Configure the Agent components

# Use 3 sources to collect different types of log data
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# Describe and configure the first source (data type, data source address, and its interceptor)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/access.log
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access

# Describe and configure the second source (data type, data source address, and its interceptor)
a1.sources.r2.type = exec
a1.sources.r2.command = tail -F /root/logs/nginx.log
a1.sources.r2.interceptors = i2
a1.sources.r2.interceptors.i2.type = static
a1.sources.r2.interceptors.i2.key = type
a1.sources.r2.interceptors.i2.value = nginx

# Describe and configure the third source (data type, data source address, and its interceptor)
a1.sources.r3.type = exec
a1.sources.r3.command = tail -F /root/logs/web.log
a1.sources.r3.interceptors = i3
a1.sources.r3.interceptors.i3.type = static
a1.sources.r3.interceptors.i3.key = type
a1.sources.r3.interceptors.i3.value = web

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 200000
a1.channels.c1.transactionCapacity = 100000

# Describe and configure the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop01.bgd01
a1.sinks.k1.port = 41414

# Bind the sources and the sink to the channel
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1

a1.sinks.k1.channel = c1

(2) avro-hdfs_logCollection.conf

        Configure the second-level log collection scheme on hadoop01.bgd01: write the collection scheme avro-hdfs_logCollection.conf in the /export/servers/flume/conf/ directory.

# Second-level collection scheme: aggregate data from the first-level Agents into HDFS

# Configure the Agent components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe and configure the source (data type, address of the data source)
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop01.bgd01
a1.sources.r1.port = 41414

# Describe and configure an interceptor, used later by %Y%m%d to obtain the time
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Describe and configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 20000
a1.channels.c1.transactionCapacity = 10000

# Describe and configure the sink (type of the outgoing data)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01.bgd01:9000/source/logs/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

# Do not roll files based on the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Do not roll files based on time
a1.sinks.k1.hdfs.rollInterval = 0

# Roll files based on size
a1.sinks.k1.hdfs.rollSize = 10485760

# Number of events written to HDFS per batch
a1.sinks.k1.hdfs.batchSize = 20

# Number of threads Flume uses to operate on HDFS (creating and writing files)
a1.sinks.k1.hdfs.threadsPoolSize = 10

# Timeout for HDFS operations (ms)
a1.sinks.k1.hdfs.callTimeout = 30000

# Bind the source and sink to the same channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start the Hadoop cluster

start-dfs.sh   # start HDFS
start-yarn.sh  # start YARN
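Before starting Flume, an optional quick check confirms that the Hadoop daemons are running:

jps    # should list processes such as NameNode, DataNode, ResourceManager, and NodeManager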

3. Start the Flume system

Start the Flume system on hadoop01:
Open a terminal, enter the /export/servers/flume directory, and start Flume with the avro-hdfs_logCollection.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/avro-hdfs_logCollection.conf --name a1 -Dflume.root.logger=INFO,console

On hadoop02 and hadoop03 respectively, enter the /export/servers/flume directory and start Flume with the exec-avro_logCollection.conf collection scheme:
flume-ng agent --conf conf/ --conf-file conf/exec-avro_logCollection.conf --name a1 -Dflume.root.logger=INFO,console

4. Log collection system test

1. On hadoop02, create the directory /root/logs; then open 3 terminals and run the following commands respectively to generate log data:
while true; do echo "access access ..." >> /root/logs/access.log; sleep 1; done
while true; do echo "nginx nginx ..." >> /root/logs/nginx.log; sleep 1; done
while true; do echo "web web ..." >> /root/logs/web.log; sleep 1; done

2. On hadoop03, create the directory /root/logs; then open 3 terminals and run the following commands respectively to generate log data:
while true; do echo "access access ..." >> /root/logs/access.log; sleep 1; done
while true; do echo "nginx nginx ..." >> /root/logs/nginx.log; sleep 1; done
while true; do echo "web web ..." >> /root/logs/web.log; sleep 1; done

        Return to the terminal window on hadoop01.bgd01 where the Flume system was started, and observe the log collection output.

        On hadoop01.bgd01, open the Firefox browser and enter http://hadoop01.bgd01:50070 (cluster IP/hostname plus port) in the address bar to open the Hadoop cluster UI; you can see that a new /source directory has been created in the cluster. Click into the /source directory to view its internal file storage structure.
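        The same check can also be done from the command line on hadoop01.bgd01; the path below follows the hdfs.path setting in the collection scheme above:

hdfs dfs -ls -R /source/logs    # shows one subdirectory per log type (access, nginx, web) and date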


Reference books

"Hadoop big data technology principle and application"
