Introduction to Flume and entry-level use cases

1. Introduction to Flume

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data. Flume can collect source data in various forms, such as files and socket data packets, and can write the collected data to many external storage systems, such as HDFS, HBase, Hive, and Kafka.

  • Operating mechanism

The core role in the Flume distributed system is the agent: a Flume collection system is formed by connecting agents one to another.

Inside each agent there are three components:

Source: the collection source, which connects to the data source and acquires data.

Sink: the destination, which delivers the collected data to the next-level agent or to a storage system.

Channel: the data transmission channel inside the agent, which carries data from the source to the sink.

[figure: the internal structure of an agent: source, channel, sink]

  • Multi-level agent chaining

In addition, agents can be chained together in multiple levels, as shown below (a configuration sketch of such a chain follows the figure):

[figure: multiple agents chained in series]
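As a minimal sketch of a two-level chain (the hostnames node01/node02, port 4545, and the tailed file path are assumptions, not from the original): the first agent forwards events through an avro sink, and the second agent receives them through an avro source.

# agent1 (on node01): uses a command's output as the source and forwards events over Avro
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/app.log
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = node02
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# agent2 (on node02): receives Avro events from agent1 and logs them
agent2.sources = r2
agent2.channels = c2
agent2.sinks = k2

agent2.sources.r2.type = avro
agent2.sources.r2.bind = 0.0.0.0
agent2.sources.r2.port = 4545
agent2.sources.r2.channels = c2

agent2.channels.c2.type = memory

agent2.sinks.k2.type = logger
agent2.sinks.k2.channel = c2

Each agent is started separately with bin/flume-ng agent, passing its own -n name and -f configuration file.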

2. Flume installation and configuration files

  • Download

Go to the Flume download page and select the corresponding version to download.

download link:

http://flume.apache.org/download.html

Download the corresponding apache-flume-x.x.x-bin.tar.gz package.

  • Unzip the installation

Unzip the installation package:

tar -zxvf apache-flume-x.x.x-bin.tar.gz
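Optionally, move the extracted directory to a fixed location so that later paths are shorter (the target path below is just an example):

mv apache-flume-x.x.x-bin /usr/local/flume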

  • Modify the configuration file

Enter the conf directory under the extracted flume directory, and in flume-env.sh change the JAVA_HOME setting to an absolute path:

export JAVA_HOME=/java/jdk1.8.0_161
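To confirm that the installation and the JAVA_HOME setting are picked up correctly, you can print the Flume version from the flume home directory:

bin/flume-ng version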

3. Flume entry-level use case: logger output

  1. Create a flume-first.conf file to configure the flume agent

In flume's conf directory, use vim flume-first.conf to create and edit the file. The details are as follows:

# Define the names of the components in this agent; a1 is the agent's name
a1.sources = r1
a1.sinks = k1
a1.channels = c1


# Describe and configure the source component: r1
# a1.sources.r1.type specifies the source type
# Commonly used types:
#       netcat: listens on a given port and turns each line of received data into an event
#       avro: accepts Avro-serialized data
#       exec: uses the output of a command as the source
#       spooldir: watches a directory and parses new files as they appear
# 
# Each type has its own set of configuration options.

a1.sources.r1.type = netcat
# hostname the source binds to
a1.sources.r1.bind = localhost
# port the source binds to
a1.sources.r1.port = 44444


# Describe and configure the sink component: k1
# a1.sinks.k1.type specifies the sink type
# There are many sink types:
#       logger: logs events to the console
#       hdfs: writes to HDFS
#       hive: writes to Hive
#       avro: Avro-serialized output
#       file_roll: writes to local files
#       null: discards events
#       kafka: writes to Kafka
# 
# Each sink type has its own configuration options, which are not covered in detail here.

# use logger output
a1.sinks.k1.type = logger


# Describe and configure the channel component
# a1.channels.c1.type specifies the channel type
# Commonly used types:
#       memory: events are stored in memory
#       file: events are stored on disk
#       kafka: events are stored in Kafka
#
# Each type has its own configuration options

# this channel stores events in memory
a1.channels.c1.type = memory
# channel capacity (max number of events held)
a1.channels.c1.capacity = 1000
# max number of events per transaction
a1.channels.c1.transactionCapacity = 100


# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

In this configuration, source-channel-sink is a one-to-one relationship, but you can define multiple sources, channels, and sinks and bind them in a many-to-many fashion, as in the sketch below.
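For example, a minimal fan-out sketch (the component names are assumptions): the source's default channel selector is replicating, so every event is copied to both channels, while each sink drains exactly one channel.

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# the default (replicating) selector copies each event to both channels
a1.sources.r1.channels = c1 c2

# each sink is bound to exactly one channel
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2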

  2. Start flume with the flume-first.conf configuration
# Command-line options:
# -c: the directory containing Flume's own configuration files
# -f: the configuration file for this run
# -n: the agent name, which must match the name used in the configuration file
# -Dflume.root.logger=INFO,console: log INFO-level messages to the console

bin/flume-ng agent -c conf -f conf/flume-first.conf -n a1  -Dflume.root.logger=INFO,console
  3. Test

In another window, use the telnet command to test.

The port number is 44444, as specified in the configuration file:

telnet 127.0.0.1 44444

As shown in the figure:

[figure: lines typed into telnet appear as logger events in the Flume console]

If the telnet command is not available, you can install it with yum:

yum -y install telnet
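Alternatively, if nc (netcat) is installed, it can serve as the test client in the same way:

nc 127.0.0.1 44444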


4. Flume entry-level use case: HDFS output

  1. In flume's conf directory, create a configuration file flume-hdfs.conf with the following contents:
# Define the names of the three components
a1.sources = source1
a1.sinks = sink1
a1.channels = channel1


# Configure the source component
# set the type to spooldir
a1.sources.source1.type = spooldir
# the directory watched by the spooldir source
a1.sources.source1.spoolDir = /home/hadoop/logs/
# whether to add a header storing the file's absolute path; default false
a1.sources.source1.fileHeader = false


# Configure interceptors
# Flume provides several interceptors, such as host, timestamp, static, regex_filter, etc.
# Interceptors add information to the header fields of events at the source, for filtering, partitioning, and similar processing
#
# add an interceptor
a1.sources.source1.interceptors = i1
# set the interceptor type
a1.sources.source1.interceptors.i1.type = host
# name of the header that will carry the host value
a1.sources.source1.interceptors.i1.hostHeader = hostname
# if useIP is set to false the hostname is used; if true, the IP address is used
a1.sources.source1.interceptors.i1.useIP=false
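# (note, added for illustration: with this interceptor every event carries a
# header such as hostname=<machine hostname>, which sinks can then reference
# via the %{hostname} escape, e.g. inside hdfs.path)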


# Configure the sink component
#
# set the sink type to hdfs
a1.sinks.sink1.type = hdfs
# HDFS output path
a1.sinks.sink1.hdfs.path = /weblog/flume-log/%y-%m-%d/%H-%M
# file name prefix
a1.sinks.sink1.hdfs.filePrefix = access_log
# max number of open HDFS files; when this is reached, the oldest open file is closed; default 5000
a1.sinks.sink1.hdfs.maxOpenFiles = 5000
# number of events flushed to HDFS per batch; default 100
a1.sinks.sink1.hdfs.batchSize= 100
# file format: SequenceFile, DataStream, or CompressedStream; CompressedStream requires a valid hdfs.codeC value
a1.sinks.sink1.hdfs.fileType = DataStream
# output serialization format
a1.sinks.sink1.hdfs.writeFormat =Text
# roll the temporary file into a target file when it reaches this size in bytes; default 1024; 0 disables size-based rolling
a1.sinks.sink1.hdfs.rollSize = 1024
# roll the temporary file into a target file when this many events have been written; default 10; 0 disables count-based rolling
a1.sinks.sink1.hdfs.rollCount = 10
# roll the temporary file into the final target file after this many seconds; 0 disables time-based rolling
a1.sinks.sink1.hdfs.rollInterval = 60
# whether to round down event timestamps; if enabled, this affects all time escape sequences except %t; default false
a1.sinks.sink1.hdfs.round = true
# the value to round down to; default 1
a1.sinks.sink1.hdfs.roundValue = 10
# the rounding unit: second, minute, or hour; default second
a1.sinks.sink1.hdfs.roundUnit = minute
# whether to use the local time instead of the timestamp from the event header; default false
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
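# (worked example, added for illustration: with round=true, roundValue=10 and
# roundUnit=minute, an event at 10:37 resolves %H-%M to 10-30, so a new output
# directory is started at most once every 10 minutes)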


# the channel uses the memory type
a1.channels.channel1.type = memory
a1.channels.channel1.keep-alive = 120
a1.channels.channel1.capacity = 50000
a1.channels.channel1.transactionCapacity = 600

# bind the source and the sink to the channel
a1.sources.source1.channels = channel1
a1.sinks.sink1.channel = channel1
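Before starting, make sure the spooled directory exists (the spooldir source will raise an error if it is missing) and that the HDFS parent path is writable; a sketch, assuming the paths configured above:

mkdir -p /home/hadoop/logs
hdfs dfs -mkdir -p /weblog/flume-log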
  2. Start flume
bin/flume-ng agent -c conf -f conf/flume-hdfs.conf -n a1  -Dflume.root.logger=INFO,console
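To generate data, write a file elsewhere and then move it into the spooled directory, since spooldir expects files to be complete and immutable once they appear; fully ingested files are renamed with a .COMPLETED suffix by default. The file name below is just an example:

echo "hello flume" > /tmp/test-1.log
mv /tmp/test-1.log /home/hadoop/logs/
hdfs dfs -ls /weblog/flume-log/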
  3. View the results:

[figure: the collected log files under /weblog/flume-log/ in HDFS]
