Flume common use cases (2)

Flume collection cases

1: Collecting a directory to HDFS

Collection requirement: a specific directory on a server keeps generating new files; whenever a new file appears, it needs to be collected into HDFS.

Acquisition source, i.e. source: monitor a file directory with the spooldir source

Sink target, i.e. sink: the HDFS file system, with the hdfs sink

Delivery channel between source and sink, i.e. channel: either a file channel or a memory channel can be used

 

Write the configuration file:

# Define the names of the three components: source, sink, channel
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/date/
agent1.sources.source1.fileHeader = false

# Configure the interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = timestamp

# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# roll the file when it reaches this size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 102400
# roll the file when it reaches this number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
# roll the file after this time interval (seconds)
agent1.sinks.sink1.hdfs.rollInterval = 60
# enable rounding of the time-based directory path
agent1.sinks.sink1.hdfs.round = true
# round down to a multiple of 10
agent1.sinks.sink1.hdfs.roundValue = 10
# the rounding unit is minutes
agent1.sinks.sink1.hdfs.roundUnit = minute

# Configure the channel component
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
agent1.channels.channel1.keep-alive = 120

# Bind the three components: source, sink, channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
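
Assuming the configuration above is saved as conf/spooldir-hdfs.conf (the file name and the conf directory are only illustrative), the agent could be started roughly like this; note that --name must match the prefix used in the file (agent1):

bin/flume-ng agent --conf conf --conf-file conf/spooldir-hdfs.conf --name agent1 -Dflume.root.logger=INFO,console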

Note: when the spooldir source is used, the monitored directory must never receive two files with the same name, otherwise Flume throws an error.

capacity: the maximum number of events the channel can hold

transactionCapacity: the maximum number of events the channel accepts from a source, or hands to a sink, in a single transaction (so a sink's batchSize should not exceed this value)

keep-alive: the timeout, in seconds, allowed for adding an event to or removing an event from the channel

Other components: Interceptor

Interceptors are attached to a source and, in a preconfigured order, decorate or filter events as needed.

Built-in interceptors can add headers to events, such as a timestamp, the host name, or static marks.

Interceptors can also be custom-written: by inspecting the event payload (reading the original log line) they can implement arbitrary business logic, which makes them very powerful.
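
For example, the built-in static interceptor mentioned above could be chained behind the timestamp interceptor from the configuration above to stamp every event with a fixed header. A minimal sketch, where the interceptor name i2 and the datacenter/dc1 key-value pair are only illustrative:

agent1.sources.source1.interceptors = i1 i2
agent1.sources.source1.interceptors.i1.type = timestamp
agent1.sources.source1.interceptors.i2.type = static
agent1.sources.source1.interceptors.i2.key = datacenter
agent1.sources.source1.interceptors.i2.value = dc1

A header added this way can then be referenced in the sink, for example as %{datacenter} inside hdfs.path.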

 


2: Collecting a file to HDFS

Collection requirement: a business system generates its log with log4j and the log content keeps growing; the data appended to the log file needs to be collected into HDFS in real time.

 

According to the requirement, first define the three major elements:

Acquisition source, i.e. source: monitor file appends with the exec source running 'tail -F file'

Sink target, i.e. sink: the HDFS file system, with the hdfs sink

Delivery channel between source and sink, i.e. channel: either a file channel or a memory channel can be used

 

Write the configuration file:

# Define the names of the three components: source, sink, channel
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source component
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log

# Configure the interceptor
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Configure the sink component
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hadoop01:9000/file/%{hostname}/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# the time escapes in hdfs.path need a timestamp; since no timestamp interceptor is configured here, use the local time
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
# roll the file when it reaches this size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 102400
# roll the file when it reaches this number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
# roll the file after this time interval (seconds)
agent1.sinks.sink1.hdfs.rollInterval = 60
# enable rounding of the time-based directory path
agent1.sinks.sink1.hdfs.round = true
# round down to a multiple of 10
agent1.sinks.sink1.hdfs.roundValue = 10
# the rounding unit is minutes
agent1.sinks.sink1.hdfs.roundUnit = minute

# Configure the channel component
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
agent1.channels.channel1.keep-alive = 120

# Bind the three components: source, sink, channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
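
Assuming this file is saved as conf/tail-hdfs.conf (an illustrative name), the agent could be started and then tested by appending lines to the tailed log:

bin/flume-ng agent --conf conf --conf-file conf/tail-hdfs.conf --name agent1 -Dflume.root.logger=INFO,console

# in another shell, generate some test data
echo "test log line" >> /home/hadoop/logs/access_log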

3: Chaining multiple agents in series

Collection requirement: a business system generates its log with log4j and the log content keeps growing; the data appended to the log file needs to be collected into HDFS in real time, this time with two agents chained in series.

According to the requirement, first define the three major elements for each agent (the overall data flow is sketched after this list):

The first flume agent

Acquisition source, i.e. source: monitor file appends with the exec source running 'tail -F file'

Sink target, i.e. sink: a data sender that serializes events, the avro sink

Delivery channel between source and sink, i.e. channel: either a file channel or a memory channel can be used

The second flume agent

Acquisition source, i.e. source: a data receiver that deserializes events, the avro source

Sink target, i.e. sink: the HDFS file system, with the hdfs sink

Delivery channel between source and sink, i.e. channel: either a file channel or a memory channel can be used
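
The data flow through the two agents (host names and paths are those used in the configuration below):

agent1: exec source (tail -F /root/logs/test.log) -> memory channel -> avro sink -> hadoop02:41414
agent2 (on hadoop02): avro source listening on 41414 -> memory channel -> hdfs sink -> hdfs://hadoop01:9000/flumedata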

Write the configuration files:

flume-agent1

# Define the three components: source, sink, channel
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log

# Configure the sink
## the avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 10

# Configure the channel, using a memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind source, sink and channel together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

flume-agent2

a1.sources = r1
a1.sinks = s1
a1.channels = c1

## the avro source acts as a receiving server
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.sinks.s1.type = hdfs
a1.sinks.s1.hdfs.path = hdfs://hadoop01:9000/flumedata
a1.sinks.s1.hdfs.filePrefix = access_log
a1.sinks.s1.hdfs.batchSize = 100
a1.sinks.s1.hdfs.fileType = DataStream
a1.sinks.s1.hdfs.writeFormat = Text
a1.sinks.s1.hdfs.rollSize = 10240
a1.sinks.s1.hdfs.rollCount = 1000
a1.sinks.s1.hdfs.rollInterval = 10
a1.sinks.s1.hdfs.round = true
a1.sinks.s1.hdfs.roundValue = 10
a1.sinks.s1.hdfs.roundUnit = minute

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.s1.channel = c1
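
Because the avro sink of flume-agent1 connects to the avro source of flume-agent2, agent2 should be started first (on hadoop02) so that the sink can connect, then agent1 on the host holding the log file. The configuration file names below are only illustrative; both agents are named a1 in their files:

# on hadoop02, start the downstream agent first
bin/flume-ng agent --conf conf --conf-file conf/avro-hdfs.conf --name a1 -Dflume.root.logger=INFO,console

# then, on the upstream host, start the tail agent
bin/flume-ng agent --conf conf --conf-file conf/tail-avro.conf --name a1 -Dflume.root.logger=INFO,console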

 
