Using the Flume Log Collection Framework

Author of the article: foochane 

Original link: https://foochane.cn/article/2019062701.html

Contents: Flume introduction; Flume operating mechanism; installing and deploying Flume; collecting static files to HDFS; collecting dynamic log files to HDFS; cascading two agents

Flume log collection framework

A complete offline big data processing system needs more than the core analysis stack of HDFS + MapReduce + Hive. It also needs supporting systems for data collection, result export, and task scheduling. The Hadoop ecosystem offers convenient open-source frameworks for each of these, as shown:

Typical large-scale offline data processing platform

1 Flume Introduction

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive volumes of log data. Flume can collect from many kinds of data sources, such as files, socket packets, folders, and Kafka, and can deliver (sink) the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka.

Ordinary collection requirements can be met with simple Flume configuration.

Flume also supports custom extensions for special scenarios, so it suits most everyday data collection work.

2 Flume operating mechanism

The core role in a Flume deployment is the agent. A Flume collection system is formed by connecting agents together. Each agent acts as a data courier and has three internal components:

  • Source: the collection component, which connects to the data source and obtains data
  • Sink: the delivery component, which passes data to the next agent or to the final storage system
  • Channel: the transport component, which carries data from the source to the sink
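
As a rough sketch of how the three components fit together in a configuration file, the fragment below wires one source to one sink through one channel. The netcat source listening on localhost:44444 and the logger sink are illustrative assumptions chosen to keep the sketch small; they are not part of the collection schemes described in this article.

```properties
# Hypothetical minimal agent: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port and turns each received line into an event (illustrative choice)
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between the source and the sink
a1.channels.c1.type = memory

# Sink: writes events to the agent's own log (illustrative choice)
a1.sinks.k1.type = logger

# Wiring: a source may feed several channels, a sink reads from exactly one
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note the asymmetry in the last two lines: the source property is the plural channels, while the sink property is the singular channel.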

Data collection with a single agent:

Single-agent data collection

Multiple agents connected in series:

Multi-level agents connected in series

3 Installing and deploying Flume

1 Download the package apache-flume-1.9.0-bin.tar.gz and extract it

2 Add JAVA_HOME to flume-env.sh in the conf folder

export JAVA_HOME=/usr/local/bigdata/java/jdk1.8.0_211

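3 Add a collection configuration file that matches your collection requirements; the file name can be anything

See the examples later in this article for details.

4 Start Flume

In a test environment:

$ bin/flume-ng agent -c conf/ -f ./dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

Command options:

  • -c: the directory holding Flume's own configuration files; you do not need to modify them
  • -f: your own configuration file, here dir-hdfs.conf in the current directory
  • -n: which agent in your configuration file to run; it must match the name defined in that file
  • -Dflume.root.logger: prints logs to the console at INFO level; use this only for testing, later the logs will go to a log file

In production, Flume should be started in the background: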

nohup bin/flume-ng  agent  -c  ./conf  -f ./dir-hdfs.conf -n  agent1 1>/dev/null 2>&1 &

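4 Collecting static files to HDFS

4.1 Collection requirements

A specific directory on a server keeps producing new files. Whenever a new file appears, it must be collected into HDFS.

4.2 Adding the configuration file

Create the file dir-hdfs.conf in the installation directory, then add the configuration.

First name the agent, here agent1. All the settings that follow are prefixed with agent1. You can use another name, such as agt1. One configuration file can hold several collection schemes; when starting an agent, pass the matching name.

Based on the requirements, first define the following three elements.

Source component

source: monitors a file directory: spooldir
spooldir has the following characteristics:

  • It monitors a directory; whenever a new file appears there, the file's contents are collected
  • Once a file has been fully collected, the agent automatically appends a suffix to its name: COMPLETED (configurable)
  • Files with a duplicate file name must not appear in the monitored directory

Sink component

sink: the HDFS file system: hdfs sink

Channel component

channel: either a file channel or a memory channel can be used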

# Name the three components
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Configure the source
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log/
agent1.sources.source1.fileSuffix = .FINISHED
# Maximum line length; any line in the files longer than this is truncated automatically, which loses data
agent1.sources.source1.deserializer.maxLineLength = 5120

# Configure the sink
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://Master:9000/access_log/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = app_log
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text

# roll: rules that control when the sink switches to a new file
## Roll by file size (bytes)
agent1.sinks.sink1.hdfs.rollSize = 512000
## Roll by number of events
agent1.sinks.sink1.hdfs.rollCount = 1000000
## Roll by time interval (seconds)
agent1.sinks.sink1.hdfs.rollInterval = 60

## Rules that control directory generation: timestamps are rounded down
## to a multiple of 10 minutes, so an event at 10:47 lands in .../10-40
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute

agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Configure the channel
agent1.channels.channel1.type = memory
## Number of events the channel can hold
agent1.channels.channel1.capacity = 500000
## Buffer size needed for Flume transaction control: 600 events
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and the sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

Channel parameters:

  • capacity: the maximum number of events that can be stored in the channel
  • transactionCapacity: the maximum number of events taken from the source or delivered to the sink in one transaction
  • keep-alive: the time allowed for adding an event to, or removing an event from, the channel
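
Section 4.2 notes that a file channel can be used in place of the memory channel. A minimal sketch of that variant follows, assuming the same agent and channel names as the configuration above. The two directory paths are hypothetical and would need to exist and be writable by the Flume process. A file channel keeps buffered events on disk, so they survive an agent restart, at some cost in throughput.

```properties
# Assumed alternative to the memory channel: a durable file channel
agent1.channels.channel1.type = file
# Hypothetical local directories for the checkpoint and the event data
agent1.channels.channel1.checkpointDir = /root/flume/checkpoint
agent1.channels.channel1.dataDirs = /root/flume/data
# Same limits as the memory channel example
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600
# Seconds to wait for space (on put) or for an event (on take) before timing out
agent1.channels.channel1.keep-alive = 3
```

Only the channel section changes; the source and sink bindings stay as they are.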

4.3 Starting Flume

$ bin/flume-ng agent -c conf/ -f dir-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

5 Collecting dynamic log files to HDFS

5.1 Collection requirements

For example, a business system writes its logs with log4j and the log content keeps growing. The data appended to the log file must be collected into HDFS in real time.

5.2 Configuration file

Configuration file name: tail-hdfs.conf
Based on the requirements, first define the following three elements:

  • Source: monitors file updates: exec with the command tail -F file
  • Sink: the HDFS file system: hdfs sink
  • Channel: the transport between the source and the sink; either a file channel or a memory channel can be used

Configuration file contents:


# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/app_weichat_login.log

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://Master:9000/app_weichat_login_log/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = weichat_log
a1.sinks.k1.hdfs.fileSuffix = .dat
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k1.hdfs.rollSize = 100
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60

a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute

a1.sinks.k1.hdfs.useLocalTimeStamp = true



# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

5.3 Starting Flume

Start command:

bin/flume-ng agent -c conf -f conf/tail-hdfs.conf -n a1

6 Cascading two agents

The first agent reads data from the tail command and sends it to an avro port. A second agent on another node is configured with an avro source to relay the data and write it to external storage.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/access.log


# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hdp-05
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2



# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The second agent receives data on the avro port and sinks it to HDFS.

Collection configuration file: avro-hdfs.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
## The avro source is a receiving service
a1.sources.r1.type = avro
a1.sources.r1.bind = hdp-05
a1.sources.r1.port = 4141


# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/taildata/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = tail-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 24
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 50
a1.sinks.k1.hdfs.batchSize = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Type of the generated file; the default is SequenceFile, use DataStream for plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Origin www.cnblogs.com/foochane/p/11110540.html