How to Use Flume

Let's start with a small, simple example.

We will use Flume to push the contents of a local file up to HDFS.
Let's go straight to the configuration file:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# This source tails the end of the local file /root/secure
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/secure

# This sink sends the data to HDFS
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://114.116.206.19:5009/flumehdfs/data
a1.sinks.k1.hdfs.filePrefix        = alert-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval      = 60
a1.sinks.k1.hdfs.rollSize          = 10485760
a1.sinks.k1.hdfs.rollCount         = 0
# DataStream output is not compressed, so hdfs.codeC is not set here;
# use fileType = CompressedStream together with hdfs.codeC = snappy if compression is wanted
a1.sinks.k1.hdfs.fileType          = DataStream
a1.sinks.k1.hdfs.writeFormat       = Text

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
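
To run this agent and then check the output, something along these lines should work; the config file path mirrors the flume-ng command used later in this post, and the HDFS directory comes from hdfs.path above:

bin/flume-ng agent --name a1 --conf conf --conf-file flumeconfigure/example.conf -Dflume.root.logger=INFO,console
hdfs dfs -ls /flumehdfs/data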

Then we can see the data in HDFS.

About the Flume HDFS sink parameters

hdfs.filePrefix	Default FlumeData. Prefix of the file names Flume creates in HDFS.
hdfs.fileSuffix	Default empty. Suffix appended to the file names.
hdfs.inUsePrefix	Default empty. Prefix used for temporary files while they are being written.
hdfs.inUseSuffix	Default .tmp. Suffix used for temporary files while they are being written.
hdfs.emptyInUseSuffix	Default false. Not important; just leave it as false.


hdfs.rollInterval	Default 30 seconds. Interval after which the current file is rolled. 0 means never roll based on time.
hdfs.rollSize	Default 1024 bytes. File size at which the current file is rolled. 0 means never roll based on size.
hdfs.rollCount	Default 10. Number of Events written before the file is rolled. 0 means never roll based on count.
These three parameters can be used to deal with the HDFS small-files problem.

hdfs.idleTimeout	Default 0. Timeout after which inactive files are closed; 0 means never time out.
hdfs.batchSize	Default 100. Number of Events flushed to HDFS per batch.
hdfs.codeC	Default empty. Compression codec, for example: gzip, bzip2, lzo, lzop, snappy.
hdfs.fileType	Default SequenceFile. Currently SequenceFile, DataStream or CompressedStream. (1) DataStream does not compress the output file, so do not set codeC with it. (2) CompressedStream requires hdfs.codeC to be set to an available codec.

hdfs.maxOpenFiles	Default 5000. Maximum number of files allowed open at the same time; if exceeded, the file opened first is closed.
hdfs.minBlockReplicas	Default empty. Minimum number of replicas required per HDFS block.
hdfs.writeFormat	Default Writable.
hdfs.threadsPoolSize	Default 10. Size of the thread pool for HDFS I/O operations.
hdfs.rollTimerPoolSize	Default 1. Number of threads per HDFS sink used to schedule timed file rolling.
hdfs.kerberosPrincipal	Default empty. Kerberos principal for authentication; can usually be ignored.
hdfs.kerberosKeytab	Default empty.
hdfs.round	Default false. Whether the event timestamp should be rounded down.
hdfs.roundValue	Default 1. Rounded down to the highest multiple of this value (in the unit configured by hdfs.roundUnit) that is less than the current time.
hdfs.roundUnit	Default second. Unit of the round-down value: second, minute or hour.
hdfs.timeZone	Default Local Time.
hdfs.useLocalTimeStamp	Default false. Use the local time instead of the timestamp from the Event header when resolving escape sequences in the path.
serializer	Default TEXT. The Event serializer.
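
As a small illustration of the rounding options above (the path layout and values here are my own example, not from the original config), round/roundValue/roundUnit are usually combined with time escape sequences in hdfs.path to bucket output files by time:

a1.sinks.k1.hdfs.path = hdfs://114.116.206.19:5009/flumehdfs/data/%Y%m%d/%H%M
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute

With these settings the %H%M part of the path is rounded down to 10-minute buckets.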

This tool is very good.
I wrote another blog post about Flume before, but it was not very good and rather general, so now I will describe Flume in more detail.

In fact, the core of Flume is this: a Source collects the data, a Channel buffers it locally, and a Sink sends it out; once the data has been sent on, it is deleted from the local buffer.

Data is transmitted in the form of Events.

Each unit of data transferred is called an Event. You can experiment with this small example:

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
#a1.sources.r1.bind = 114.55.37.70
a1.sources.r1.command = tail -F /root/secure
#a1.sources.r1.port = 5000

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Then, using this command, we can start moving data:

bin/flume-ng agent --name a1 --conf conf --conf-file flumeconfigure/example.conf -Dflume.root.logger=INFO,console

Explanation of the command-line arguments:

agent starts an agent.

--conf specifies the configuration directory Flume starts with, which holds files such as log4j.properties and flume-env.sh.

--conf-file specifies the path of the agent configuration file.

--name specifies which agent to start: in example.conf you could define two agents and choose which one to launch from the command line (see the sketch after this list).

-D passes extra configuration parameters; the command above sets the logger level and where the log output goes.
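
As a hedged sketch of that last point, if example.conf also defined a second agent named a2 (not shown in this post), you would start it with the same command and just change --name:

bin/flume-ng agent --name a2 --conf conf --conf-file flumeconfigure/example.conf -Dflume.root.logger=INFO,console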

Then we append some data to the secure file, and the logger sink prints the Events it receives.
An Event looks like this: it consists of a header, which is a key-value map, and a body.
The body is what we actually need; the key-value headers can be used for routing. The whole of Flume is really just a water pipe: each Channel is wired to one Sink, but one Source can flow into several Channels, and Events can be split across those Channels based on a key in the header.
The body is a byte array, i.e. our real data. Do not get this wrong: for the header, think of the usual protocols we talk about, such as HTTP, which carry a header holding things other than the transported body.
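
As a hedged sketch of that header-based fan-out, Flume's multiplexing channel selector routes each Event to a channel according to a header value; the header name and channel names below are only illustrative, not part of the examples in this post:

a1.channels = c1 c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.error = c1
a1.sources.r1.selector.mapping.info = c2
a1.sources.r1.selector.default = c2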

Flume reliability and recoverability

A Flume Event can only be deleted once the Sink has delivered it to the next Agent or persisted it to HDFS. The second point is easy to overlook: all data is handled transactionally. That is what reliability means here.
Recoverability, of course, comes down to how the Channel buffers data. There are two buffering forms, the File Channel and the In-Memory Channel, and it is easy to guess which recovers better: the File Channel clearly persists to disk, so after a crash or an error it can recover from disk, whereas the In-Memory Channel cannot; once the power goes out, everything is gone. There is actually also a Kafka Channel, which we can look at later when connecting Flume and Kafka.

Using the Taildir Source

This one is very important. In short, the Taildir Source lets Flume remember where it last read: it watches the specified files and, as soon as it detects new lines appended to any of them, tails them in near real time. It periodically writes the last read position of each file, in JSON format, to a given position file. If Flume stops or goes down for some reason, it can restart from the positions recorded in the existing position file. In other use cases, this source can also use the given position file to start tailing each file from an arbitrary position. When there is no position file at the specified path, it starts tailing from the first line of each file by default. Files are consumed in order of modification time: the file with the oldest modification time is consumed first. This source does not rename, delete, or otherwise modify the files it tails. Currently it does not support tailing binary files; it reads text files line by line.
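
For reference, a hedged illustration of what the position file usually looks like, with made-up inode and pos values and a file path matching the later example:

[{"inode":2496001,"pos":1234,"file":"/root/flume/flumedata/log/test.log"}]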

First, we should know that Flume has three ways to monitor files or directories:

  • Exec Source
    The Exec Source can tail a file with a tail -F command and sync the log to the sink in real time. The problem is that when the agent process dies and restarts, data may be consumed again. This can be addressed by adding a UUID, or by improving ExecSource.

  • Spooling Directory Source
    The Spooling Directory Source can monitor a directory and sync new files in that directory to the sink; files that have been fully synced can be deleted immediately or marked as done. It is suitable for syncing new files, but not for tailing a file that is being appended to in real time. If you need to monitor appended content in real time, you can modify SpoolDirectorySource.

  • Taildir Source
    The Taildir Source monitors a batch of files in real time and records the latest consumed position of each file, so there is no duplicate-consumption problem after the agent process restarts.
    Flume 1.8.0 is recommended, since version 1.8.0 fixed a bug in the Taildir Source that could lose data.

All in all, with the Taildir Source, no matter how the Flume task is stopped, as long as the position is recorded in the position file, the same task can continue reading from that position when Flume restarts, without re-reading what it has already read.

channels	The channel name(s).
type	Must be TAILDIR.
filegroups	Defines one or more file group names.
filegroups.<filegroupName>	A regex describing the file names in the file group. For example, filegroups.f3 = /usr/tmp/net.+ defines file group f3 as all files under /usr/tmp whose names start with net.

Optional configuration (this is also on the official site; it may differ between versions):
positionFile	Default ~/.flume/taildir_position.json. Where the Flume process stores its JSON position metadata.
headers.<filegroupName>.<headerKey>	Default empty. Header value set on events from this file group; multiple headers can be specified.
byteOffsetHeader	Default false. Whether to add the byte offset of the tailed line to a header named "byteoffset".
skipToEnd	Default false. Whether to skip to EOF for files that are not recorded in the position file.
idleTimeout	Default 120000 ms. Time after which inactive files are closed. If new lines are appended to a closed file, this source automatically reopens it.
writePosInterval	Default 3000 ms. Interval at which the last position of each file is written to the position file.
batchSize	Default 100. Maximum number of lines read and sent to the channel at a time. The default value is usually fine.
maxBatchCount	Default Long.MAX_VALUE. Controls the number of batches read consecutively from the same file. If the source is tailing multiple files and one of them is written at a fast rate, it can prevent the other files from being processed, because the busy file would be read in an endless loop. Lower this value in that case.

backoffSleepIncrement	Default 1000. Increment of the time delay before re-attempting to poll for new data, when the last attempt did not find any.
maxBackoffSleep	Default 5000. Maximum time delay between polling re-attempts, when the last attempt did not find any new data.
cachePatternMatching	Default true. For directories containing thousands of files, listing the directory and applying the file name regex can be time consuming. Caching the list of matching files can improve performance. The order in which files are consumed is also cached. The file system must track modification times with at least 1-second granularity.
fileHeader	Default false. Whether to add a header storing the absolute path of the file.
fileHeaderKey	Default file. The header key to use when fileHeader is enabled.
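
As a hedged sketch of the filegroups and headers options (the paths and header value below are illustrative, not from this post's setup), a source can tail several file groups and attach a header to one of them like this:

a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f1.headerKey1 = value1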

Examples

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = taildir
a1.sources.r1.positionFile = /root/flume/flumedata/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /root/flume/flumedata/log/.*log
a1.sources.r1.fileHeader = true
# Channel type
a1.channels.c1.type = file
# Data storage path
a1.channels.c1.dataDirs = /root/flume/flumedata/dataDirs
# Checkpoint path
a1.channels.c1.checkpointDir = /root/flume/flumedata/checkpoint
# Maximum number of events the channel can buffer
a1.channels.c1.capacity = 1000
# How many events the channel hands to the sink per transaction
a1.channels.c1.transactionCapacity = 100
# Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://114.116.206.19:5009/flumehdfs/data
a1.sinks.k1.hdfs.filePrefix        = alert-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval      = 60
a1.sinks.k1.hdfs.rollSize          = 10485760
a1.sinks.k1.hdfs.rollCount         = 0
# DataStream output is not compressed, so hdfs.codeC is not set here
a1.sinks.k1.hdfs.fileType          = DataStream
a1.sinks.k1.hdfs.writeFormat       = Text

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
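
To run this agent, you would use the same flume-ng command as before; the file name taildir.conf here is just an assumed name for the configuration above:

bin/flume-ng agent --name a1 --conf conf --conf-file flumeconfigure/taildir.conf -Dflume.root.logger=INFO,console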

First, we write something into the directory being monitored (a quick way to do that is sketched below).
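A hedged example of appending a test line; the file name test.log is just an assumed name under the monitored directory from the config above:

echo "hello taildir" >> /root/flume/flumedata/log/test.log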
The data also arrives in HDFS successfully.
Then we stop Flume and look at the position file, which shows how far each file has been read.
While Flume is stopped, we append a new message.
After restarting Flume, the position information is updated accordingly. Now let's check HDFS to see whether any data was duplicated.
We can see that a total of two files were generated, which shows there is no duplicate data.

Origin blog.csdn.net/weixin_43272605/article/details/104360351