Flume: Basic Principles of Data Collection and Usage

I. Overview

1. What is Flume

1) Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data. Flume runs only in Linux environments.
2) Flume has a stream-based architecture that is fault-tolerant, flexible, and structurally simple.
3) Flume and Kafka are typically used for real-time data collection, Spark and Storm for real-time processing, and Impala for real-time queries.

2. The basic architecture of Flume


Figure 1.1 flume architecture

For the Flume architecture, the diagram from the official website is sufficient.
First, a Flume agent is deployed on each data source; this agent is responsible for collecting the data.
An agent consists of three components: source, channel, and sink. In Flume, the basic unit of data transmission is the event. Let's go over these concepts.

(1)source

Collects data from a data source and transmits it to the channel. Sources support many collection methods, such as listening on a port, reading from a file, watching a directory, or receiving from an HTTP service.

(2)channel

Located between the source and the sink, the channel is a data staging area. Normally, the rate at which data flows in from the source differs from the rate at which the sink drains it, so a temporary storage area is needed to hold data until the sink can process it. The channel therefore acts like a buffer, or a queue.

(3)sink

Pulls data from the channel and writes it to the target. Multiple targets are supported, such as a local file, HDFS, Kafka, or the avro source of the next Flume agent.

(4)event

The basic unit of transmission in Flume. An event consists of two parts: a header and a body. The header can carry metadata (key-value pairs), and the body carries the data itself.

3. The Flume transmission process

Based on the above concepts, the basic flow is clear: the source monitors the data source; when new data arrives, it is fetched and packaged into an event, which is then pushed into the channel; the sink then pulls events from the channel and writes them to the target.
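As a rough sketch (a toy model, not Flume's actual implementation), the source-channel-sink flow looks like a producer and a consumer sharing a queue:

```python
from collections import deque

# Toy model of a Flume agent: the source wraps incoming data into
# events, the channel buffers them, and the sink drains the channel.
class Event:
    def __init__(self, body, headers=None):
        self.headers = headers or {}   # optional metadata (key-value pairs)
        self.body = body               # the payload bytes

channel = deque()                      # the channel: a simple in-memory queue

def source_collect(line):
    """Source: package new data into an event and push it into the channel."""
    channel.append(Event(line.encode()))

def sink_drain():
    """Sink: pull an event from the channel and deliver its body to the target."""
    return channel.popleft().body.decode()

source_collect("new log line")
print(sink_drain())   # -> new log line
```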

II. Using Flume

1. Flume deployment

Deploying Flume itself is very simple:
(1) Deploy JDK 1.8.
(2) Extract the Flume archive to the target directory, then add it to the environment variables.
(3) Modify the configuration file:

cd /opt/modules/apache-flume-1.8.0-bin

# Copy the template config file and rename it as the actual config file
cp conf/flume-env.sh.template conf/flume-env.sh

# Add the JDK home variable
vim conf/flume-env.sh
# Append this line:
export JAVA_HOME=/opt/modules/jdk1.8.0_144

This completes the configuration; there is nothing difficult about it. Using Flume centers on writing agent configuration files, which differ by scenario. In short, this means configuring the working properties of the three components: source, channel, and sink.

2. Defining an agent

Configuring an agent really means configuring its source, channel, and sink. The process takes five steps:

# 1. Define the agent's name, and the names of the sources, sinks, and channels it uses.
# There can be multiple sources, sinks, and channels.
<Agent>.sources = <Source>
<Agent>.sinks = <Sink>
<Agent>.channels = <Channel1>

# 2. Define the source's working properties.
# The basic format is: agentName.sources.sourceName.parameter = value
# The first parameter is always type, which specifies the source type.
<Agent>.sources.<Source>.type=xxxx
<Agent>.sources.<Source>.<parameter1>=xxxx
<Agent>.sources.<Source>.<parameter2>=xxxx
.........

# 3. Set the channel's working properties; the format is similar.
# The first parameter is always type, which specifies the channel type.
<Agent>.channels.<Channel1>.type=xxxxx
<Agent>.channels.<Channel1>.<parameter1>=xxxxx
<Agent>.channels.<Channel1>.<parameter2>=xxxxx
.........

# 4. Set the sink's working properties.
# The first parameter is always type, which specifies the sink type.
<Agent>.sinks.<Sink>.type=xxxxx
<Agent>.sinks.<Sink>.<parameter1>=xxxxx
<Agent>.sinks.<Sink>.<parameter2>=xxxxx
...............

# 5. Attach the source and the sink to a channel, connecting the two through it.
<Agent>.sources.<Source>.channels = <Channel1>
<Agent>.sinks.<Sink>.channel = <Channel1>

This is the complete agent definition process. Sources, channels, and sinks each come in different types, and the parameters to define vary by type. Let's look at the commonly used source, channel, and sink types (see the official documentation for the complete list).
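Putting the five steps together: a minimal complete agent that reads from a netcat source, buffers in a memory channel, and logs events to the console (the names a1/r1/c1/k1 are arbitrary; this mirrors the classic quick-start wiring):

```properties
# 1. name the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# 2. source: listen on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# 3. channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# 4. sink: log events to the console
a1.sinks.k1.type = logger
# 5. wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```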

3. Common source types

(1) netcat: read data from a TCP port

Common properties:
type: must be set to netcat
bind: hostname or IP to listen on
port: port to listen on

Example: listen on 0.0.0.0:6666
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 6666
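Conceptually, a netcat source is just a TCP listener whose received lines become event bodies. Here is a self-contained sketch of that idea (a hypothetical Python stand-in, not Flume code; against a real agent you would simply run `nc <host> 6666` and type lines):

```python
import socket
import threading

# Toy stand-in for a netcat source: accept one TCP connection,
# read one line, and treat it as an event body.
def netcat_source_once(server_sock, out):
    conn, _ = server_sock.accept()
    with conn:
        out.append(conn.makefile().readline().strip())

server = socket.socket()
server.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

events = []
listener = threading.Thread(target=netcat_source_once, args=(server, events))
listener.start()

# Client side: equivalent to `nc 127.0.0.1 <port>` sending one line
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"hello flume\n")
listener.join()
server.close()
print(events)   # -> ['hello flume']
```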

(2) exec: use a command's output as the data source

Common properties:
type: must be set to exec
command: the command to run
shell: the shell used to run the command, e.g. /bin/bash -c

Example: monitor a file for newly appended content
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/secure
a1.sources.r1.shell = /bin/bash -c

(3) spooldir: monitor a directory for new files

Common properties:
type: set to spooldir
spoolDir: path of the directory to monitor
fileSuffix: suffix appended to a file once its upload completes; defaults to .COMPLETED
fileHeader: whether to add a key to the event header recording the file's absolute path; defaults to false
ignorePattern: regex of files to ignore
There are many other parameters; see the official documentation for details.

Example:
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume1.8.0/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
# Ignore all files ending in .tmp; do not upload them
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

(4) avro: intermediate format for chaining Flume agents

This is a special source: it is usually the output format of one Flume agent's sink and serves as the input format of the next Flume agent.

Common properties:
type: must be set to avro
bind: hostname or IP to listen on; must be the IP or hostname of the host the agent runs on
port: port to listen on

Example:
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

(5) TAILDIR: monitor files or directory contents for changes (1.7 and later only)

spooldir has a limitation: once a file's upload has completed, no content may be appended to it, or an error is thrown and the new content is not read. So spooldir can only be used to watch for new files appearing in a directory; it cannot monitor changes to a file's content. Previously, the only option for that case was an exec source running something like tail -F xxx.log, but that approach is flawed and can easily lose data. Flume 1.7 introduced a new source, TAILDIR, which can directly monitor changes to file contents. Its usage:

Common properties:
type: TAILDIR (note: all uppercase)
filegroups: names of the file groups to monitor; there can be multiple groups
filegroups.<filegroupName>: the files a group contains; extended regular expressions may be used. A handy trick: /path/.* monitors content changes of every file in the directory
positionFile: a JSON file recording each monitored file's inode and position offset (pos)
fileHeader: whether to add a header

There are many more properties; see the official guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#taildir-source

Example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# two file groups
a1.sources.r1.filegroups = f1 f2
# contents of file group 1
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
# specify a file group using a regular expression
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000

Regarding the positionFile mentioned above, let's look at its format:

[{"inode":408241856,"pos":27550,"file":"/opt/modules/apache-flume-1.8.0-bin/logs/flume.log.COMPLETED"},
{"inode":406278032,"pos":0,"file":"/opt/modules/apache-flume-1.8.0-bin/logs/words.txt.COMPLETED"},
{"inode":406278035,"pos":0,"file":"/opt/modules/apache-flume-1.8.0-bin/logs/words.txt"},
{"inode":406278036,"pos":34,"file":"/opt/modules/apache-flume-1.8.0-bin/logs/test.txt"}]

Analysis:
1. Each file is represented by one JSON object; multiple objects make up an array-like structure.
2. Each JSON object contains:
    inode: see basic file-system concepts for what this means
    pos: the byte offset from which monitoring of the file's content starts
    file: the file's absolute path
3. Tips:
(1) If some files already exist when the directory starts being monitored, Flume by default begins listening from the end of each file. When a file's content is updated, Flume picks it up, sinks it, and then updates the pos value. Because of this, even if the Flume agent crashes suddenly, on the next start it automatically resumes from the pos recorded at crash time rather than from the current end of the file. No data is lost, and old data is not re-read.
(2) From (1), pos is a continuously updated read position. Sometimes there is a need to re-transmit every file in the monitored directory from the beginning. This is simple: set pos to 0 in the JSON file.
4. If no positionFile path is specified, it defaults to $HOME/.flume/taildir_position.json
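The "set pos to 0" trick can be scripted. A minimal sketch, assuming the record structure shown above and that the agent is stopped while the position file is rewritten:

```python
import json

def rewind(entries):
    """Return taildir position records with every offset reset to 0, so the
    next agent start re-reads each tracked file from the beginning."""
    return [dict(entry, pos=0) for entry in entries]

# A record in the structure shown above
entries = [{"inode": 406278036, "pos": 34,
            "file": "/opt/modules/apache-flume-1.8.0-bin/logs/test.txt"}]
print(json.dumps(rewind(entries)))
# For real use: json.load() the position file, rewind(), json.dump() it back.
```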

4. Common channel types

(1) memory: use memory as the staging space

Common properties:
type: must be set to memory
capacity: maximum number of events stored in the channel
transactionCapacity: maximum number of events per transaction

Example:
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

(2) file: use disk files as the staging space

Common properties:
type: must be set to file
checkpointDir: directory for checkpoint files
dataDirs: directories for data storage

Example:
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

(3) SPILLABLEMEMORY: memory plus files as the staging space

This type uses memory plus files as the channel: when the event volume exceeds the memory capacity, events spill over to files.

Common properties:
type: set to SPILLABLEMEMORY
memoryCapacity: maximum number of events stored in memory
overflowCapacity: maximum number of events stored in files
byteCapacity: maximum size of events stored in memory, in bytes
checkpointDir: directory for checkpoint files
dataDirs: directories for data storage

Example:
a1.channels.c1.type = SPILLABLEMEMORY
a1.channels.c1.memoryCapacity = 10000
a1.channels.c1.overflowCapacity = 1000000
a1.channels.c1.byteCapacity = 800000
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/flume/data

(4) kafka: Kafka as the channel

In production, Flume + Kafka is a common stack, though Kafka is generally used as the sink target instead.

Common properties:
type: set to org.apache.flume.channel.kafka.KafkaChannel
kafka.bootstrap.servers: the Kafka cluster servers, ip:port,ip2:port,...
kafka.topic: the Kafka topic
kafka.consumer.group.id: the consumer group id

Example:
a1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.channel1.kafka.bootstrap.servers = kafka-1:9092,kafka-2:9092,kafka-3:9092
a1.channels.channel1.kafka.topic = channel1
a1.channels.channel1.kafka.consumer.group.id = flume-consumer

5. Common sink types

(1) logger: output events as log messages

Common properties:
type: logger

Example:
a1.sinks.k1.type = logger

This type is fairly simple and is generally used for debugging.

(2) avro: intermediate format for chaining Flume agents

This type is mainly used as the input format for the next Flume agent; events are serialized into a byte stream.

Common properties:
type: avro
hostname: hostname or IP of the output target; this can be any host, not just the local machine
port: the port of the output target

Example:
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 10.10.10.10
a1.sinks.k1.port = 4545

(3) hdfs: write directly to HDFS

Common properties:
type: hdfs
hdfs.path: storage path, hdfs://namenode:port/PATH
hdfs.filePrefix: prefix added to the names of uploaded files
hdfs.round: whether to roll folders by time
hdfs.roundValue: the time value for rolling
hdfs.roundUnit: the unit of the rolling time
hdfs.useLocalTimeStamp: whether to use the local timestamp, true or false
hdfs.batchSize: number of events to accumulate before each flush to HDFS
hdfs.fileType: file type: DataStream (plain file), SequenceFile (binary, the default), or CompressedStream (compressed)
hdfs.rollInterval: how often to roll to a new file, in seconds
hdfs.rollSize: file roll size, in bytes
hdfs.rollCount: number of events before rolling the file; 0 means rolling does not depend on the event count
hdfs.minBlockReplicas: minimum number of block replicas

Example:
# Sink type: store in HDFS
a2.sinks.k2.type = hdfs
# Name paths by hour
a2.sinks.k2.hdfs.path = hdfs://bigdata121:9000/flume/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = king-
# Whether to roll folders by time
a2.sinks.k2.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How often to roll a new file, in seconds
a2.sinks.k2.hdfs.rollInterval = 600
# Roll size per file, in bytes
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is unrelated to the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1

(4) file_roll: store on the local file system

Common properties:
type: file_roll
sink.directory: storage path

Example:
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume

(5) kafka: store into a Kafka cluster

Common properties:
type: org.apache.flume.sink.kafka.KafkaSink
kafka.topic: the Kafka topic name
kafka.bootstrap.servers: comma-separated list of cluster servers
kafka.flumeBatchSize: number of events per flush to Kafka
kafka.producer.acks: minimum number of replicas that must acknowledge a write before an ack is returned
kafka.producer.compression.type: compression type

Example:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = mytopic
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.kafka.flumeBatchSize = 20
a1.sinks.k1.kafka.producer.acks = 1
a1.sinks.k1.kafka.producer.compression.type = snappy

6. Common interceptor types

Interceptors are optional. An interceptor is a component that works between the source and the channel, filtering data coming from the source before it is output to the channel.
Usage format:

# First name the interceptors, then configure each interceptor's working properties
<agent>.sources.<source>.interceptors = <interceptor>
<agent>.sources.<source>.interceptors.<interceptor>.<param> = xxxx

(1) timestamp: timestamp interceptor

Adds a field to the event header, a timestamp used for identification, e.g. headers: {timestamp: 111111}.

Common properties:
type: timestamp
headerName: the key name in the header; defaults to timestamp

Example:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
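What the interceptor does amounts to the following sketch (not Flume's code): stamp the event header with the current epoch time in milliseconds under the configured key.

```python
import time

def add_timestamp(headers, header_name="timestamp"):
    """Mimic the timestamp interceptor: add epoch millis under header_name."""
    stamped = dict(headers)
    stamped[header_name] = str(int(time.time() * 1000))
    return stamped

print(add_timestamp({}))   # e.g. {'timestamp': '1718000000000'}
```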

(2) host: hostname interceptor

Adds a field to the event header marking the host, e.g. headers: {host: bigdata121}.

Common properties:
type: host
hostHeader: the key name in the header; defaults to host
useIP: whether to use the IP or the hostname

Example:
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

(3) UUID: UUID interceptor

Adds a field to the event header, a UUID used for identification, e.g. headers: {id: 111111}.

Common properties:
type: org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
headerName: the key name in the header; defaults to id
prefix: prefix added to each UUID

(4) search_replace: search and replace

Matches with a regular expression, then replaces the matched text with the specified string.

Common properties:
type: search_replace
searchPattern: the regex to match
replaceString: the replacement string
charset: character set; defaults to UTF-8

Example: delete the leading run of specific characters
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
a1.sources.avroSrc.interceptors.search-replace.searchPattern = ^[A-Za-z0-9_]+
a1.sources.avroSrc.interceptors.search-replace.replaceString =

(5) regex_filter: regular-expression filter

Matches events against a regex; matching events are either kept or discarded.

Common properties:
type: regex_filter
regex: the regular expression
excludeEvents: true filters out matching events; false keeps only matching events

Example:
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^A.*
# With excludeEvents set to false, events NOT starting with A are filtered out; with excludeEvents set to true, events starting with A are filtered out.
a1.sources.r1.interceptors.i1.excludeEvents = true
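The excludeEvents semantics can be illustrated with a small sketch (plain Python, not Flume code):

```python
import re

def regex_filter(events, regex, exclude_events):
    """Mimic regex_filter: drop matching events if exclude_events is True,
    otherwise keep only the matching events."""
    pattern = re.compile(regex)
    if exclude_events:
        return [e for e in events if not pattern.search(e)]
    return [e for e in events if pattern.search(e)]

events = ["Apple line", "banana line", "Avocado line"]
print(regex_filter(events, r"^A.*", exclude_events=True))    # -> ['banana line']
print(regex_filter(events, r"^A.*", exclude_events=False))   # -> ['Apple line', 'Avocado line']
```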

(6) regex_extractor: regular-expression extraction

This interceptor uses regex group matching to capture multiple groups; each group's matched value is stored in the header under a customizable key.

a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
# specify the type as regex_extractor
a1.sources.r1.interceptors.i1.type = regex_extractor
# regex with capture groups
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
# one serializer alias per capture group
a1.sources.r1.interceptors.i1.serializers = s1 s2
# set each group's key name
a1.sources.r1.interceptors.i1.serializers.s1.name = cookieid
a1.sources.r1.interceptors.i1.serializers.s2.name = ip
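To see what the extractor produces, here is the same regex applied in Python (a demonstration with a made-up log line, not Flume code): each capture group lands in the header under the configured serializer name.

```python
import re

# Same regex as the interceptor config above, with two capture groups
regex = r"hostname is (.*?) ip is (.*)"
body = "hostname is bigdata111 ip is 192.168.1.111"   # made-up sample line

m = re.search(regex, body)
headers = {"cookieid": m.group(1), "ip": m.group(2)}
print(headers)   # -> {'cookieid': 'bigdata111', 'ip': '192.168.1.111'}
```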

(7) Custom interceptors

Implement the interface org.apache.flume.interceptor.Interceptor and its methods, for example:

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;

public class MyInterceptor implements Interceptor {
    @Override
    public void initialize() {
    }

    @Override
    public void close() {
    }

    /**
     * Intercepts events sent from the source to the channel.
     * Processes a single event.
     * @param event the incoming event to filter
     * @return the event after business processing
     */
    @Override
    public Event intercept(Event event) {
        // Get the byte data from the event
        byte[] arr = event.getBody();
        // Convert the data to uppercase
        event.setBody(new String(arr).toUpperCase().getBytes());
        // Return the modified event
        return event;
    }

    // Process a batch of events
    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> list = new ArrayList<>();
        for (Event event : events) {
            list.add(intercept(event));
        }
        return list;
    }

    // Used to build the interceptor instance
    public static class Builder implements Interceptor.Builder {
        // Read properties from the configuration file
        @Override
        public Interceptor build() {
            return new MyInterceptor();
        }

        @Override
        public void configure(Context context) {

        }
    }
}

pom.xml dependency:

<dependencies>
    <!-- Flume core dependency -->
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.8.0</version>
    </dependency>
</dependencies>

Specify the interceptor in the agent configuration file:

a1.sources.r1.interceptors = i1
# fully qualified class name + $Builder
a1.sources.r1.interceptors.i1.type = ToUpCase.MyInterceptor$Builder

Run the command:

bin/flume-ng agent -c conf/ -n a1 -f jar/ToUpCase.conf -C jar/Flume_Andy-1.0-SNAPSHOT.jar -Dflume.root.logger=DEBUG,console

-C specifies the path of an extra jar — here, the jar containing our custom interceptor

Alternatively, you can put the jar into the lib directory under the Flume installation directory.

III. Flume examples

1. Read a file into HDFS

# 1. Define the agent name a2, and the names of its source, sink, and channel
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# 2. Define the source (where the data comes from)
# Source type exec: run a command
a2.sources.r2.type = exec
# The command
a2.sources.r2.command = tail -F /tmp/access.log
# The shell to use
a2.sources.r2.shell = /bin/bash -c

# 3. Define the sink
# Sink type: store in HDFS
a2.sinks.k2.type = hdfs
# Name paths by hour
a2.sinks.k2.hdfs.path = hdfs://bigdata121:9000/flume/%H
# Prefix for uploaded files
a2.sinks.k2.hdfs.filePrefix = king-
# Whether to roll folders by time
a2.sinks.k2.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k2.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k2.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k2.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k2.hdfs.batchSize = 1000
# File type; compression is supported
a2.sinks.k2.hdfs.fileType = DataStream
# How often to roll a new file, in seconds
a2.sinks.k2.hdfs.rollInterval = 600
# Roll size per file, in bytes
a2.sinks.k2.hdfs.rollSize = 134217700
# Rolling is unrelated to the number of events
a2.sinks.k2.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k2.hdfs.minBlockReplicas = 1

# 4. Define the channel: type, capacity limit, transaction capacity limit
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# 5. Wire the source and the sink together through the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Start the Flume agent:

# --conf: the Flume configuration directory
# --name: the agent name
# --conf-file: the agent configuration file
# -Dflume.root.logger: print logs to the terminal
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a2 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-hdfs.conf \
-Dflume.root.logger=INFO,console

2. Chaining multiple Flume agents: one to many

flume1: outputs to flume2 and flume3
flume2: outputs to HDFS
flume3: outputs to a local file

flume1.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to multiple channels: enable replicating mode
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/test
a1.sources.r1.shell = /bin/bash -c

# sink k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

# sink k2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = bigdata111
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Attach the source to two channels; each channel feeds one sink
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

flume2.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume2/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders by time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often to roll a new file, in seconds
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size per file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# Rolling is unrelated to the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas
a2.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

flume3.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
# Note: this directory must be created in advance
a3.sinks.k1.sink.directory = /opt/flume3

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

When starting, start flume2 and flume3 first, and finally flume1. The start commands are the same as before and are not repeated.

3. Chaining multiple Flume agents: many to one

Multiple servers generate logs that each need to be collected locally and then aggregated for storage; this scenario is very common.
flume1 (monitoring a file) and flume2 (listening on a port) each collect their own data and sink it to flume3, which is responsible for aggregating it and writing to HDFS.
flume1.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

flume2.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = bigdata111
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

flume3.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = hdfs
a3.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume3/%H
# Prefix for uploaded files
a3.sinks.k1.hdfs.filePrefix = flume3-
# Whether to roll folders by time
a3.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a3.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a3.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a3.sinks.k1.hdfs.useLocalTimeStamp = true
# Number of events to accumulate before flushing to HDFS
a3.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a3.sinks.k1.hdfs.fileType = DataStream
# How often to roll a new file, in seconds
a3.sinks.k1.hdfs.rollInterval = 600
# Roll size per file, roughly 128 MB
a3.sinks.k1.hdfs.rollSize = 134217700
# Rolling is unrelated to the number of events
a3.sinks.k1.hdfs.rollCount = 0
# Minimum number of block replicas
a3.sinks.k1.hdfs.minBlockReplicas = 1

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

At startup, start flume3 first, then start flume1 and flume2:

$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume3.conf
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume2.conf
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume1.conf

To test, send data to the port with telnet bigdata111 44444,
or append data to the /opt/Andy file.

Origin: blog.51cto.com/kinglab/2447898