Example: Flume high-availability distributed collection of data to HDFS
1. Case introduction
The logs in the /home/hadoop/access, /home/hadoop/order, and /home/hadoop/login folders on three log servers (IPs: 192.168.100.9, 192.168.100.13, 192.168.100.100) need to be sent to another agent cluster.
The agent cluster uses two machines (IPs: 192.168.100.11 and 192.168.100.12). 192.168.100.11 acts as the master (priority 10) and 192.168.100.12 as the slave (priority 5).
The data collected by the agent cluster sinks into the HDFS system.
The collected data must be classified by host IP and by log type (access, order, login) when written to the HDFS file system.
The distributed architecture is shown in the figure; the configuration files are given below.
2. Configuration
- Agent configuration for the log clients. The configuration file is named flume-collect-local-log.conf:
a1.sources=r1 r2 r3
a1.sinks=k1 k2
a1.channels=c1
a1.sinkgroups = g1
# r1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/access
a1.sources.r1.fileHeader = false
# r2
a1.sources.r2.type = spooldir
a1.sources.r2.spoolDir = /home/hadoop/order
a1.sources.r2.fileHeader = false
# r3
a1.sources.r3.type = spooldir
a1.sources.r3.spoolDir = /home/hadoop/login
a1.sources.r3.fileHeader = false
# r1 interceptors
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type=static
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.key = source
a1.sources.r1.interceptors.i1.value = access
a1.sources.r1.interceptors.i2.type=host
a1.sources.r1.interceptors.i2.hostHeader = hostname
# r2 interceptors
a1.sources.r2.interceptors = i1 i2
a1.sources.r2.interceptors.i1.type=static
a1.sources.r2.interceptors.i1.preserveExisting = true
a1.sources.r2.interceptors.i1.key = source
a1.sources.r2.interceptors.i1.value = order
a1.sources.r2.interceptors.i2.type=host
a1.sources.r2.interceptors.i2.hostHeader = hostname
# r3 interceptors
a1.sources.r3.interceptors = i1 i2
a1.sources.r3.interceptors.i1.type=static
a1.sources.r3.interceptors.i1.preserveExisting = true
a1.sources.r3.interceptors.i1.key = source
a1.sources.r3.interceptors.i1.value = login
a1.sources.r3.interceptors.i2.type=host
a1.sources.r3.interceptors.i2.hostHeader = hostname
# k1
a1.sinks.k1.type=avro
a1.sinks.k1.hostname = 192.168.100.11
a1.sinks.k1.port = 11111
# k2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = 192.168.100.12
a1.sinks.k2.port = 11111
# c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# set sink group failover priorities
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
# bind r1, r2, r3 and k1, k2 to channel c1
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
Static interceptors are configured here to attach a source identifier (access/order/login) to each event; together with the hostname header set by the host interceptor, the HDFS sink uses these headers to partition the output path.
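As a quick illustration of how those headers feed the path template, here is a sketch of the expansion the HDFS sink performs for an event from r1 on 192.168.100.9 (illustrative only; the real substitution happens inside the sink):

```shell
# hdfs.path template: /flume-log/%{source}/%{hostname}/%y%m%d
# %{source}   <- static interceptor (access/order/login)
# %{hostname} <- host interceptor
# %y%m%d      <- event timestamp (set by the timestamp interceptor on the cluster side)
src=access
host=192.168.100.9
day=$(date +%y%m%d)
echo "/flume-log/$src/$host/$day"
```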
- Configuration for the agent cluster. The configuration file is named flume-collect-hdfs.conf:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# r1
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 11111
# interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.hdfs.path=/flume-log/%{source}/%{hostname}/%y%m%d
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=TEXT
a1.sinks.k1.hdfs.filePrefix=%Y-%m-%d
a1.sinks.k1.hdfs.fileSuffix=.txt
a1.sinks.k1.hdfs.rollSize = 1024
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.rollInterval = 60
# c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind r1 and k1 to c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
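One caveat: the memory channel used above loses buffered events if an agent crashes. If durability matters more than throughput, a file channel can be swapped in; a minimal sketch, with the checkpoint and data paths as assumptions:

```
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/hadoop/flume/checkpoint
a1.channels.c1.dataDirs = /home/hadoop/flume/data
```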
3. Execution
- Write test files to the access, order, and login directories on the log servers, e.g.:

access.log:
access 192.168.100.9

order.log:
order 192.168.100.9

login.log:
login 192.168.100.9
Write different content to the corresponding directories on each machine to verify how HDFS partitions the data.
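The test-file step above can be scripted. A minimal sketch; BASE defaults to a temporary directory here so it is safe to run as-is, so point it at /home/hadoop to match the example:

```shell
#!/bin/sh
# Generate one test file per category (access/order/login).
# BASE is an assumption for illustration; override it for a real deployment.
BASE=${BASE:-$(mktemp -d)}
for cat in access order login; do
  mkdir -p "$BASE/$cat"
  echo "$cat 192.168.100.9" > "$BASE/$cat/$cat.log"
done
echo "wrote test files under $BASE"
```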
- Start the Flume service on each agent cluster machine (192.168.100.11 and 192.168.100.12):
bin/flume-ng agent -c conf -f conf/flume-collect-hdfs.conf -n a1 -Dflume.root.logger=INFO,console
- Start the Flume service on each log server:
bin/flume-ng agent -c conf -f conf/flume-collect-local-log.conf -n a1 -Dflume.root.logger=INFO,console
4. View the results
The data in HDFS is partitioned under /flume-log/%{source}/%{hostname}/%y%m%d as configured; you can verify the layout with hdfs dfs -ls -R /flume-log.
Appendix: a Flume load-balancing example
In the client's collection conf file, setting a1.sinkgroups.g1.processor.type = load_balance enables load balancing across the sinks.
#a1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1 k2
#set group
a1.sinkgroups = g1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/logs/test.log
# set sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.100.11
a1.sinks.k1.port = 11111
# set sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = 192.168.100.12
a1.sinks.k2.port = 11111
#set sink group
a1.sinkgroups.g1.sinks = k1 k2
#set load-balance
a1.sinkgroups.g1.processor.type = load_balance
# default is round_robin; random is also available
a1.sinkgroups.g1.processor.selector = round_robin
# if backoff is enabled, the sink processor temporarily blacklists failed sinks
a1.sinkgroups.g1.processor.backoff = true
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
The server-side (agent cluster) Flume configuration is the same as above.
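For completeness, the load-balancing processor also accepts a random selector, and when backoff is enabled the blacklisting window can be capped with selector.maxTimeOut (milliseconds). A sketch, with an illustrative timeout value:

```
a1.sinkgroups.g1.processor.selector = random
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000
```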