38. Several use cases of Flume

In the previous article, we covered installing Flume and its basic usage. In this chapter, we walk through several concrete Flume use cases, including extracting local files or directories to HDFS in real time, and Flume's fan-in and fan-out. Follow the column "Break the Cocoon and Become a Butterfly - Big Data" for more related content~


Table of Contents

1. Extract local files to HDFS in real time

2. Extract local directories to HDFS in real time

3. Flume's fan-in

4. Flume's fan-out

4.1 Selector

4.2 Sink group


1. Extract local files to HDFS in real time

1. Requirements description

Monitor Hive logs and upload them to HDFS in real time.

2. Implementation

(1) To have Flume monitor files and upload them to HDFS, you first need to copy several JAR packages into Flume's lib directory, as shown below:

cp /opt/modules/hadoop-2.7.2/share/hadoop/hdfs/hadoop-hdfs-2.7.2.jar /opt/modules/flume/lib/
cp /opt/modules/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar /opt/modules/flume/lib/
cp /opt/modules/hadoop-2.7.2/share/hadoop/common/lib/hadoop-auth-2.7.2.jar /opt/modules/flume/lib/
cp /opt/modules/hadoop-2.7.2/share/hadoop/common/lib/commons-configuration-1.6.jar /opt/modules/flume/lib/
cp /opt/modules/hadoop-2.7.2/share/hadoop/common/lib/commons-io-2.4.jar /opt/modules/flume/lib/
cp /opt/modules/hadoop-2.7.2/share/hadoop/common/lib/htrace-core-3.1.0-incubating.jar /opt/modules/flume/lib/
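To quickly confirm the JARs are in place, you can list Flume's lib directory:

ls /opt/modules/flume/lib | grep -E 'hadoop|commons|htrace'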

(2) Create a configuration file

vim flume-filetoHDFS.conf
# Declare the source, channel, and sink
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the source
# To read a file on a Linux system, we have to run a Linux command. Since the Hive log lives on the Linux file system, we choose the exec (execute) source type.
a1.sources.r1.type = exec
# Location of the log file
a1.sources.r1.command = tail -F /tmp/root/hive.log
a1.sources.r1.shell = /bin/bash -c

# Configure the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/%Y%m%d/%H
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = logs-
# Whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS once
# (kept at 100 so it does not exceed the channel's transactionCapacity below)
a1.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How long (seconds) before rolling a new file
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file (about 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Run Flume

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-filetoHDFS.conf
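If you want the agent to log to the console while testing, you can also append the logger flag used in the later examples:

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-filetoHDFS.conf -Dflume.root.logger=INFO,console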

4. Start Hive to generate log output, then check whether the data appears on HDFS
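A minimal sketch of this check, assuming Hive is installed and using the path pattern from the sink config above:

# in another terminal, generate some Hive log traffic
hive -e "show databases;"

# list and inspect the files Flume has written (path pattern: /flume/%Y%m%d/%H)
hdfs dfs -ls -R /flume
hdfs dfs -cat /flume/$(date +%Y%m%d)/$(date +%H)/logs-*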

2. Extract local directories to HDFS in real time

1. Requirements description

Monitor a local directory and upload newly added files to HDFS in real time.

2. Implementation

Create a configuration file (here flume-dirtoHDFS.conf, matching the start command below) and add the following content:

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/files
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.fileHeader = true
# Ignore (do not upload) any file ending in .jar
a1.sources.r1.ignorePattern = ([^ ]*\.jar)

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/upload/%Y%m%d/%H
# Prefix for the uploaded files
a1.sinks.k1.hdfs.filePrefix = dir-
# Whether to roll folders based on time
a1.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a1.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a1.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS once
a1.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a1.sinks.k1.hdfs.fileType = DataStream
# How long (seconds) before rolling a new file
a1.sinks.k1.hdfs.rollInterval = 60
# Roll size of each file (about 128 MB)
a1.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a1.sinks.k1.hdfs.rollCount = 0

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Run Flume

bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-dirtoHDFS.conf

4. Check whether the files have been synchronized to HDFS
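A quick test, assuming the /root/files spool directory from the config above:

mkdir -p /root/files
echo "hello spooldir" > /root/files/test.txt
# Flume renames consumed files with the .COMPLETED suffix
ls /root/files
hdfs dfs -ls -R /flume/upload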

3. Flume's fan-in

As the name implies, fan-in means multiple data flows converging into a single Flume source, as shown in the following figure:

1. Requirements description

Flume on slave01 monitors a log file, and Flume on slave02 monitors the data stream on a certain port. The agents on the two slave nodes send their data to the Flume agent on the master node, which prints the received data to the console.

2. Implementation

(1) Distribute Flume to slave nodes

xsync /opt/modules/flume
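Here xsync is a custom cluster-sync script (an rsync wrapper) assumed from the earlier environment setup; plain scp achieves the same, for example:

scp -r /opt/modules/flume slave01:/opt/modules/
scp -r /opt/modules/flume slave02:/opt/modules/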

(2) Write the configuration file on the slave01 node

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/files/xzw.log
a1.sources.r1.shell = /bin/bash -c

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 6868

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Write the configuration file on the slave02 node

a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = netcat
a2.sources.r1.bind = slave02
a2.sources.r1.port = 44444

a2.sinks.k1.type = avro
a2.sinks.k1.hostname = master
a2.sinks.k1.port = 6868

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Write the configuration file on the master node

a3.sources = r1
a3.sinks = k1
a3.channels = c1

a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 6868

a3.sinks.k1.type = logger

a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

(5) Start Flume. Start the agent on master (a3) first, so the Avro sinks on the slave nodes have a server to connect to:

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-mtocon.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-s2tom.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-s1tom.conf -Dflume.root.logger=INFO,console

(6) Test

Append data to the log file:
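For example, using the file path from the slave01 config:

echo "hello from slave01" >> /root/files/xzw.log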

Send data to the port:
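For example, using the netcat client against the host and port that slave02's source listens on:

# connect, then type a few lines (each line becomes one event)
nc slave02 44444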

On the master's console, you can see that the data has been printed:

4. Flume's fan-out

Flume fan-out means that after receiving data, Flume sinks it to multiple destinations. We cover two approaches: one based on a channel selector, and one based on a sink group.

4.1 Selector

Implementing fan-out with a selector means using a single source with multiple channels and sinks, as shown in the following figure:

1. Requirements description

Flume A monitors changes to a file and transmits the new content in real time to Flume B, which sinks it to HDFS. At the same time, Flume A also transmits the content to Flume C, which stores it on the local file system.

2. Implementation

(1) Write the configuration file of Flume A

a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /tmp/root/hive.log
a1.sources.r1.shell = /bin/bash -c

# The avro sink acts as a data sender. Avro is a language-neutral data serialization and RPC framework created by Hadoop founder Doug Cutting.
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master 
a1.sinks.k1.port = 8888

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 6868

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

(2) Write the configuration file of Flume B

a2.sources = r1
a2.sinks = k1
a2.channels = c1

# The avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 8888

a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://master:9000/flume-b/%Y%m%d/%H
# Prefix for the uploaded files
a2.sinks.k1.hdfs.filePrefix = flume-b-
# Whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS once
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How long (seconds) before rolling a new file
a2.sinks.k1.hdfs.rollInterval = 600
# Roll size of each file (about 128 MB)
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(3) Write the configuration file of Flume C

a3.sources = r1
a3.sinks = k1
a3.channels = c2

a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 6868

a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /root/files/flume-c

a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
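One caveat: the file_roll sink expects its target directory to already exist, so create it before starting Flume C:

mkdir -p /root/files/flume-c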

(4) Start the three Flume agents separately, the downstream receivers (B and C) before the sender (A):

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-c.conf
bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-b.conf
bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-a.conf

(5) Test: the uploaded log content can be found both locally and on HDFS.
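A quick way to verify both destinations, assuming Hive is installed and the paths from the configs above:

# generate some Hive log traffic
hive -e "show tables;"

# Flume B's destination on HDFS
hdfs dfs -ls -R /flume-b
# Flume C's local destination
ls -l /root/files/flume-c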

4.2 Sink group

Implementing fan-out with a sink group means using a single source and channel with multiple sinks. This approach actually gives you Flume's load balancing: each event is delivered to exactly one of the sinks in the group. As shown below:

1. Requirements description

Flume D monitors a port and transmits what it receives in real time to Flume E and Flume F, both of which print the content to the console. Note that because the sinks form a load-balancing group, each individual event is delivered to either E or F, not to both.

2. Implementation

(1) Write the configuration file of Flume D

a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 8888

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 6868

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

(2) Write the configuration file of Flume E

a2.sources = r1
a2.sinks = k1
a2.channels = c1

a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 8888

a2.sinks.k1.type = logger

a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(3) Write the configuration file of Flume F

a3.sources = r1
a3.sinks = k1
a3.channels = c2

a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 6868

a3.sinks.k1.type = logger

a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

(4) Start the three Flume agents separately, E and F before D:

bin/flume-ng agent --conf conf/ --name a3 --conf-file conf/flume-f.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent --conf conf/ --name a2 --conf-file conf/flume-e.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent --conf conf/ --name a1 --conf-file conf/flume-d.conf -Dflume.root.logger=INFO,console

(5) Open the port and produce data
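For example, with the netcat client against the port Flume D listens on:

# connect, then type a few lines; they are load-balanced between E and F
nc localhost 44444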

You can see that the consoles of E and F have printed the transmitted content:

 

Well, those are a few use cases of Flume; they are fairly simple. If you run into any problems along the way, feel free to leave a comment and let me know what you encountered~


Origin blog.csdn.net/gdkyxy2013/article/details/112240221