Flume study notes (2) - Flume advanced

Flume advanced

Flume transactions

The transaction processing flow is as follows:

Put

  • doPut: first write the batch of data into the temporary buffer putList
  • doCommit: check whether the channel's memory queue has enough space, and if so merge the putList data into it
  • doRollback: if the channel's memory queue does not have enough space, roll back the data

Take

  • doTake: take the data into the temporary buffer takeList and send it to the destination (e.g. HDFS); see the code sketch after this list
  • doCommit: if all the data is sent successfully, clear the temporary buffer takeList
  • doRollback: if an exception occurs while sending, the rollback returns the data in the temporary buffer takeList to the channel's memory queue
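
The take-side flow above corresponds to how a sink drives a channel transaction through Flume's Transaction API. Below is a minimal Java sketch adapted from the pattern in the Flume Developer Guide; the class name DemoSink and the delivery step are illustrative placeholders, not part of the original notes.

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

// Illustrative sink: one event per transaction; doTake/doCommit/doRollback
// happen inside the channel when take()/commit()/rollback() are called.
public class DemoSink extends AbstractSink implements Configurable {

    @Override
    public void configure(Context context) {
        // read sink properties here if needed
    }

    @Override
    public Status process() {
        Status status = Status.READY;
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();      // doTake: the event moves into takeList
            if (event != null) {
                // deliver event.getBody() to the destination (e.g. HDFS) here
            } else {
                status = Status.BACKOFF;       // nothing to take right now
            }
            txn.commit();                      // doCommit: takeList is cleared
        } catch (Throwable t) {
            txn.rollback();                    // doRollback: takeList is returned to the channel queue
            status = Status.BACKOFF;
        } finally {
            txn.close();
        }
        return status;
    }
}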

Flume Agent internal principles

ChannelSelector

The function of the ChannelSelector is to select which Channel an Event will be sent to.

There are two types: Replicating and Multiplexing.

  • ReplicatingSelector sends the same Event to all Channels
  • Multiplexing sends different Events to different Channels according to configured rules, typically based on a value in the Event header (see the sketch below)
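
A minimal sketch of a Multiplexing selector configuration, following the property names in the Flume user guide; the header name state and the channel mapping here are illustrative assumptions, not from the original notes:

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
a1.sources.r1.selector.default = c3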

SinkProcessor

There are three types of SinkProcessor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.

  • DefaultSinkProcessor corresponds to a single Sink
  • LoadBalancingSinkProcessor implements load balancing across the sinks in a group (see the sketch below)
  • FailoverSinkProcessor implements failover / error recovery
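
As a contrast with the failover configuration used in the practical case further down, here is a minimal sketch of a load_balance sink group, following the property names in the Flume user guide (the agent and sink names are illustrative):

a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin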

Flume topology

Simple series connection

Multiple Flume agents are connected sequentially, from the initial source all the way to the storage system that the final sink writes to.

It is not recommended to chain too many Flume agents: too many agents reduce the transmission rate, and if any one agent goes down during transmission, the whole pipeline is affected.

Replication and multiplexing

(Single source, multiple channels, sinks)

Flume supports streaming events to one or multiple destinations

In this mode, the same data can be copied to multiple channels, or different data can be distributed to different channels, and each sink can then transmit to a different destination.

Load balancing and failover

Flume supports logically grouping multiple sinks into a sink group; paired with different SinkProcessors, a sink group can provide load balancing or failover.

Agent1 here has three sinks, which are connected to agent2, agent3, and agent4 respectively. Even if some of the sinks fail, the data can still be synchronized to HDFS.

Aggregation

This is commonly used in production, for example for log collection:

Web applications are usually deployed across hundreds of servers, or even thousands or tens of thousands in large deployments, and the logs they generate are troublesome to process.

Aggregation can be used here: each server deploys a Flume agent to collect its logs and forwards them to a Flume agent that collects logs centrally, which then uploads them to HDFS, Hive, HBase, etc. for log analysis.

Flume practical cases

Replication and multiplexing

Requirement: use Flume-1 to monitor file changes

  1. Flume-1 passes the changes to Flume-2, and Flume-2 is responsible for storing them in HDFS
  2. Flume-1 passes the changes to Flume-3, and Flume-3 is responsible for outputting to Local FileSystem

Implementation process:
1. Create the folder group1 under the job directory and create the configuration file flume-file-flume.conf in it

The configuration file needs to have 1 source, 2 channels, and 2 sinks

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/apache-hive-3.1.2-bin/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# The avro sink is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

This configuration replicates the data to two channels and sinks; each sink then sends it to another agent for further processing.

2. Create the configuration file flume-flume-hdfs.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# The avro source is a data receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H
# Prefix of uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# How many time units before creating a new folder
a2.sinks.k1.hdfs.roundValue = 1
# Redefine the time unit
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many Events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# Set the file type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How long (seconds) before generating a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll size of each file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of Events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

The avro source of this agent receives data from sink k1 of the previous agent, and the hdfs sink then uploads it to HDFS.

3. Create the configuration file flume-flume-dir.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /home/why/data/flumeDemo/test1

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Parameter Description:

The sink type is file_roll; see the Flume 1.11.0 User Guide — Apache Flume

Events can be saved to the local file system

  • sink.directory: the path in the local file system where data is saved (note that the path must already exist); an optional related parameter is sketched below
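
Not used in the configuration above, but worth noting: by default the file_roll sink rolls to a new local file every 30 seconds; this interval can be changed (or time-based rolling disabled with 0) via sink.rollInterval, for example:

# optional: roll to a new local file every 600 seconds instead of the default 30
a3.sinks.k1.sink.rollInterval = 600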

4. Start the corresponding Flume processes:

nohup bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf &

nohup bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf &

nohup bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf &
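
To confirm that the three agents are running, one possible check (not part of the original steps) is to list the Java processes; Flume agents start with the main class org.apache.flume.node.Application:

jps -ml | grep org.apache.flume.node.Application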
5. The corresponding content can be seen in HDFS and in the local folder:

HDFS:

Local file system:

Load balancing and failover

Requirements: Use Flume1 to monitor a port. The sinks in its sink group are connected to Flume2 and Flume3 respectively. Use FailoverSinkProcessor to implement the failover function.

Implementation process:

1. Create the group2 folder in the /opt/module/flume/job directory and create the configuration file flume-netcat-flume.conf

Configure 1 netcat source, 1 channel, and 1 sink group (2 sinks), which deliver to flume-flume-console1 and flume-flume-console2 respectively

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444


# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

Parameter description: Flume 1.11.0 User Guide — Apache Flume

Define multiple sinks in one agent through a sink group, and configure which sink processor the group uses: Flume 1.11.0 User Guide — Apache Flume

2. Create flume-flume-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

The sink outputs to the local console.

3. Create flume-flume-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

The sink outputs to the local console.

4. Execute the commands:

bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console

bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console

bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

5. Use nc localhost 44444 to send data.

Since console2 (sink k2) is configured with a higher priority than console1 (sink k1), the data is received by console2;

Next, kill the console2 process, and the data will be received by console1:
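
One possible way to simulate the failure (an assumption, not spelled out in the original notes) is to find and kill the a3 agent's process:

# kill the agent running flume-flume-console2.conf; jps -ml prints "<pid> org.apache.flume.node.Application ..."
kill -9 $(jps -ml | grep flume-flume-console2 | awk '{print $1}')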

Aggregation

Requirements:

Flume-1 on hadoop102 monitors the file /home/why/data/flumeDemo/test3/test3.log

Flume-2 on hadoop103 monitors the data flow of a certain port

Flume-1 and Flume-2 send data to Flume-3 on hadoop104, and Flume-3 prints the final data to the console

Implementation process:

1. First create the directory group3 in the job folders of the three servers.

2. On hadoop102, create the configuration file flume1-logger-flume.conf; the source monitors the log file, and the sink sends data to the next-level Flume

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/why/data/flumeDemo/test3/test3.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. On hadoop103, create the configuration file flume2-netcat-flume.conf; the source monitors the data flow on port 44444, and the sink transmits data to the next-level Flume

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = localhost
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

Note that the sink destination of these two agents is the server hadoop104, so the hostname and port are the same.

4. On hadoop104, create the configuration file flume3-flume-logger.conf; the source receives the data streams sent by flume1 and flume2, and the sink outputs the data to the console.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

5. Execute the commands on the three servers respectively

hadoop104: bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console

hadoop102: bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf

hadoop103: bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf

6. Append content to the log file on hadoop102:

echo "hello" > /home/why/data/flumeDemo/test3/test3.log

On hadoop103, send data to port 44444 via nc hadoop103 44444;

Then the data can be received in hadoop104:
