[Flume for Big Data] 4. Advanced cases: replication and multiplexing, load balancing and failover, and aggregation

1 Replication and multiplexing

(1) Requirements: Flume-1 monitors file changes (using an Exec Source or a Taildir Source) and passes the changed content to Flume-2 over an Avro Sink / Avro Source connection; Flume-2 stores the data in HDFS. At the same time, Flume-1 passes the changed content to Flume-3, which writes it to the local file system.

(2) Analysis:
[Figure: replication topology, Flume-1 fans events out to Flume-2 (HDFS sink) and Flume-3 (local file_roll sink)]
Steps:
(1) Create a group1 folder under the /opt/module/flume-1.9.0/job directory, create a data folder under /opt/module/flume-1.9.0/, and create a flume folder under the data folder.

(2) Create flume-file-flume.conf in group1: configure one source that reads the log file, two channels, and two sinks that send the data to flume-flume-hdfs and flume-flume-dir respectively.

vim  flume-file-flume.conf

# Name the components on this agent 
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/logs/flume.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# The avro sink acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

# The avro sink acts as a data sender
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
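
For comparison, the multiplexing selector mentioned in the section title routes each event to a single channel based on an event header (usually set by an interceptor) instead of copying it to every channel. A minimal sketch, assuming a hypothetical header named state whose values decide between the two channels above:

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c1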

(3) Create flume-flume-hdfs.conf under group1: configure an Avro source that receives the output of the upstream Flume agent and a sink that writes to HDFS.

vim flume-flume-hdfs.conf

# Name the components on this agent 
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# The avro source acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# Set the file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll to a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll the file when it reaches roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume-flume-dir.conf under group1: configure an Avro source that receives the output of the upstream Flume agent and a sink that writes to a local directory.

vim flume-flume-dir.conf

# Name the components on this agent 
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/flume-1.9.0/data/flume

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel 
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Tip: The local output directory must already exist; if it does not, Flume will not create it.
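
For example, the directory used by the file_roll sink above can be created ahead of time:

mkdir -p /opt/module/flume-1.9.0/data/flume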

(5) Start HDFS first, then start flume-flume-hdfs, flume-flume-dir, and flume-file-flume in that order.
  The downstream (server) agents must be started first, then the upstream (client) agent.

myhadoop.sh start

bin/flume-ng agent -n a2 -c conf/ -f job/group1/flume-flume-hdfs.conf
bin/flume-ng agent -n a3 -c conf/ -f job/group1/flume-flume-dir.conf
bin/flume-ng agent -n a1 -c conf/ -f job/group1/flume-file-flume.conf
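
To generate test data, append a line to the log file monitored by the exec source (the path configured above); the content itself is arbitrary:

echo "hello flume" >> /opt/module/flume-1.9.0/logs/flume.log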

(6) Check the data on HDFS
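
For example, the output directory (the hdfs.path configured above) can be listed from the command line:

hadoop fs -ls -R /flume2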

2 Load balancing and failover

(1) Failover requirements: Flume1 monitors a port; the sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides the failover function.

(2) Analysis:
[Figure: failover topology, Flume1's sink group sends to Flume2 and Flume3]
Steps:
(1) Create a group2 folder under the /opt/module/flume-1.9.0/job directory, and create flume-netcat-flume.conf, flume-flume-console1.conf, and flume-flume-console2.conf in this folder.

(2) flume-netcat-flume.conf: configure one netcat source, one channel, and one sink group with two sinks that send data to flume-flume-console1.conf and flume-flume-console2.conf respectively.

# Name the components on this agent 
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel 
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

(3) Create flume-flume-console1.conf: configure an Avro source that receives the output of the upstream Flume agent and a logger sink that prints to the local console.

# Name the components on this agent 
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source 
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel 
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create flume-flume-console2.conf: configure an Avro source that receives the output of the upstream Flume agent and a logger sink that prints to the local console.

# Name the components on this agent 
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source 
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel 
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

(5) Start the agents with their configuration files in this order: flume-flume-console2.conf, flume-flume-console1.conf, flume-netcat-flume.conf.
  As before, the downstream (server) agents are started first, followed by the upstream (client) agent.

bin/flume-ng agent -c conf/ -n a3 -f job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a2 -f job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf/ -n a1 -f job/group2/flume-netcat-flume.conf

(6) Use the netcat tool to send content to port 44444 on the local machine

nc localhost 44444

(7) View the console logs of Flume2 and Flume3.
Flume3 (sink k2, priority 10, port 4142) has the higher priority, so it receives the events.
(8) Kill the Flume3 process and observe that Flume2's console now prints the events.
Load balancing requirements:
  Flume1 monitors a port; the sinks in its sink group connect to Flume2 and Flume3 respectively, and a LoadBalancingSinkProcessor provides the load-balancing function.

Steps:
  Only the a1.sinkgroups.g1.processor settings in flume-netcat-flume.conf need to change: delete the original failover-related lines and add the following; everything else stays the same.

a1.sinkgroups.g1.processor.type = load_balance
# Use a backoff algorithm when polling the sink group
a1.sinkgroups.g1.processor.backoff = true
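
Optionally, the sink-selection strategy can be set explicitly; according to the Flume documentation it supports round_robin (the default) and random:

a1.sinkgroups.g1.processor.selector = round_robin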

3 Aggregation

(1) Requirements: Flume-1 on hadoop102 monitors the file /opt/module/flume-1.9.0/data/group.log, and Flume-2 on hadoop103 monitors the data flow on a port. Flume-1 and Flume-2 send their data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console.

(2) Analysis:
[Figure: aggregation topology, Flume-1 (hadoop102) and Flume-2 (hadoop103) send to Flume-3 (hadoop104)]
Steps:
(1) Create a group3 folder under the /opt/module/flume-1.9.0/job directory; distribute the entire Flume installation to hadoop103 and hadoop104.
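
One way to copy the installation, assuming the lyx user shown in the prompts below and a writable /opt/module on the target hosts (any equivalent sync tool, e.g. rsync, works as well):

scp -r /opt/module/flume-1.9.0 lyx@hadoop103:/opt/module/
scp -r /opt/module/flume-1.9.0 lyx@hadoop104:/opt/module/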

(2) Create the configuration file flume1-logger-flume.conf on hadoop102: configure the Source to monitor the group.log file and the Sink to send data to the next-level Flume, and create a blank group.log file under /opt/module/flume-1.9.0/data (as shown below).
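
For example, the blank file can be created with:

mkdir -p /opt/module/flume-1.9.0/data
touch /opt/module/flume-1.9.0/data/group.log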

# Name the components on this agent 
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/flume-1.9.0/data/group.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(3) Create the configuration file flume2-netcat-flume.conf on hadoop103: configure the Source to monitor the data flow on port 44444 and the Sink to send data to the next-level Flume.

# Name the components on this agent 
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444

# Describe the sink 
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory 
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 10

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(4) Create the configuration file flume3-flume-logger.conf on hadoop104: configure the source to receive the data streams sent by flume1 and flume2, merge them, and sink the result to the console.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel 
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

(5) Start the agents on each host with their respective configuration files: flume3-flume-logger.conf on hadoop104, flume2-netcat-flume.conf on hadoop103, and flume1-logger-flume.conf on hadoop102.

[lyx@hadoop104 flume-1.9.0]$ bin/flume-ng agent -c conf/ -n a3 -f job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[lyx@hadoop103 flume-1.9.0]$ bin/flume-ng agent -c conf/ -n a2 -f job/group3/flume2-netcat-flume.conf
[lyx@hadoop102 flume-1.9.0]$ bin/flume-ng agent -c conf/ -n a1 -f job/group3/flume1-logger-flume.conf

(6) Add content to group.log in the /opt/module/flume-1.9.0/data directory on hadoop102

[lyx@hadoop102 data]$ echo 'hello' > group.log
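
Note that > truncates the file; appending with >> also works and avoids truncating the file that tail -F is following:

[lyx@hadoop102 data]$ echo 'hello flume' >> group.log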

(7) Send data to port 44444 on hadoop103

nc hadoop103 44444

(8) Check the data on hadoop104.
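
The logger sink on Flume-3 prints each received event to the console; the lines look roughly like the following (illustrative only, the body is shown as hex bytes followed by printable text):

Event: { headers:{} body: 68 65 6C 6C 6F                                  hello }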
