Flume-custom interceptor, custom Source to read data from MySQL, custom Sink
Flume structure
Flume's execution process: Source → Channel Processor → Interceptors → Channel Selector → Channels → Sink Processor → Sinks
Both hops are transactional: a put transaction from Source to Channel, and a take transaction from Channel to Sink.
Chaining agents with Avro
To flow data across multiple agents (hops), the sink of the upstream agent and the source of the downstream agent must both be of type avro, and the sink must point at the hostname (or IP address) and port of that source. This is the basis of the more complex topologies below, but chaining too many agents is not recommended: each extra hop lowers the transfer rate, and if any agent in the chain goes down, the whole pipeline is affected.
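As a minimal sketch of one hop (the hostname and port here are illustrative, not taken from any of the cases below), the upstream avro sink and the downstream avro source must agree on host and port:

```properties
# Upstream agent a1: avro sink pointing at the downstream agent
a1.sinks.k1.type = avro
# hypothetical host where agent a2 runs
a1.sinks.k1.hostname = collector-host
a1.sinks.k1.port = 4141

# Downstream agent a2: avro source listening on that same host and port
a2.sources.r1.type = avro
a2.sources.r1.bind = collector-host
a2.sources.r1.port = 4141
```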
Replicating and multiplexing
Flume can fan an event stream out to one or more destinations. This is achieved by defining a channel selector on the source that either replicates events to every channel or selectively routes them to a subset.
In the official example, the source of the agent named foo fans its data stream out to three different channels. The channel selector that makes this choice can be either replicating or multiplexing.
With replicating, every event is sent to all channels. With multiplexing, an event is delivered only to the channels mapped to the value of a pre-configured header. In the official example below, if the event's state header is CZ, channel c1 is selected; if it is US, channels c2 and c3 are selected; otherwise the event falls through to channel c4:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4
The channel selector defaults to the replicating strategy.
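Written out explicitly (normally unnecessary since replicating is the default), and with the selector's optional property, which per the Flume user guide marks channels whose write failures are silently ignored, this looks like:

```properties
a1.sources.r1.channels = c1 c2
# replicating is the default and may be omitted
a1.sources.r1.selector.type = replicating
# a failed write to c2 does not fail the transaction; c1 stays required
a1.sources.r1.selector.optional = c2
```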
Load balancing and failover
There are three types of Sink Processor, namely DefaultSinkProcessor , LoadBalancingSinkProcessor and FailoverSinkProcessor .
DefaultSinkProcessor serves a single sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor operate on a sink group: the former spreads events across the group's sinks (load balancing), the latter fails over to a standby sink when the active one dies.
Aggregation
This topology is very common and practical. A web application is typically spread across many servers, each producing logs that are cumbersome to process individually. Flume solves this well: each server runs an agent that collects its local logs and forwards them to a single aggregating agent, which then writes to HDFS, Hive, HBase, etc. for analysis.
Transaction mechanism
Flume's transaction mechanism is similar to a database's: two independent transactions handle event delivery from Source to Channel and from Channel to Sink. For example, the spooling directory source creates an event for each line of a file; only once all events in the transaction have been delivered to the channel and the commit has succeeded does the source mark the file as complete. The Channel-to-Sink hop is handled the same way: if events cannot be delivered for some reason, the transaction is rolled back, and all of its events remain in the channel, waiting to be delivered again.
By its architecture, Flume should not lose data: both the Source-to-Channel and Channel-to-Sink hops are transactional, so neither link drops events. Data can still be lost in two situations: a memory channel whose agent crashes loses whatever was buffered, and a full channel blocks the source from writing, so data that was never written is lost at the producer.
Flume therefore does not lose data, but it can duplicate it: if a sink delivers a batch successfully but never receives the acknowledgement, the transaction rolls back and the sink sends the same data again.
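The put/take transaction semantics above can be modeled with a small, self-contained sketch (a toy stand-in, not Flume's real org.apache.flume API): staged puts become visible only on commit, and a rolled-back take returns events to the channel, which is exactly why a lost acknowledgement causes redelivery (duplication) rather than loss.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a Flume channel's transaction semantics.
public class ToyChannel {
    private final Deque<String> queue = new ArrayDeque<>();      // committed events
    private final Deque<String> putBuffer = new ArrayDeque<>();  // staged puts
    private final Deque<String> takeBuffer = new ArrayDeque<>(); // taken, not yet committed

    // Source side: stage an event; invisible until commit
    public void put(String event) { putBuffer.addLast(event); }

    // Sink side: remove an event for delivery, remembering it for rollback
    public String take() {
        String e = queue.pollFirst();
        if (e != null) takeBuffer.addLast(e);
        return e;
    }

    // Commit: staged puts become visible, taken events are forgotten
    public void commit() {
        queue.addAll(putBuffer);
        putBuffer.clear();
        takeBuffer.clear();
    }

    // Rollback: staged puts are dropped, taken events return to the head in order
    public void rollback() {
        putBuffer.clear();
        while (!takeBuffer.isEmpty()) queue.addFirst(takeBuffer.pollLast());
    }

    public int size() { return queue.size(); }

    public static void main(String[] args) {
        ToyChannel ch = new ToyChannel();
        ch.put("e1"); ch.put("e2");
        ch.commit();                   // put transaction committed: two events visible
        String e = ch.take();          // sink takes "e1" for delivery
        ch.rollback();                 // ack lost: "e1" goes back, will be redelivered
        System.out.println(ch.size()); // prints 2
    }
}
```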
Case 1: Single data source, multiple destinations
Case analysis
Flume1 monitors a file for changes. It passes each change to Flume2, which stores it in HDFS, and simultaneously to Flume3, which writes it to the local file system.
Case steps
- Create an empty file: touch date.txt
- Start HDFS and YARN: start-dfs.sh, start-yarn.sh
- Create three configuration files, flume1.conf, flume2.conf, flume3.conf:
The first agent is flume1: one source r1, two channels c1 and c2, and two sinks k1 and k2. The source type is taildir, monitoring the local file date.txt. Both sinks are of type avro on different ports, each connecting to one of the other two agents. Both channels are of type memory. To send data from one source to different destinations, each sink is bound to its own channel, so two channels and two sinks are needed.

# Name the components on this agent
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Replicate the data to all channels (the default; may be omitted)
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/flume-1.9.0/date.txt
a1.sources.r1.positionFile = /opt/flume-1.9.0/file/position.json

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 44444

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 55555

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
The second agent's source type is avro, connecting to the first agent; its sink type is hdfs.
# Name the components on this agent
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://master:9000/a/%Y%m%d/%H
a2.sinks.k1.hdfs.filePrefix = logs
a2.sinks.k1.hdfs.round = true
a2.sinks.k1.hdfs.roundValue = 1
a2.sinks.k1.hdfs.roundUnit = hour
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.batchSize = 100
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.rollInterval = 30
a2.sinks.k1.hdfs.rollSize = 134217700
a2.sinks.k1.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
The third agent's source type is also avro, connecting to the first agent; its sink type is file_roll.
# Name the components on this agent
a3.sources = r1
a3.channels = c1
a3.sinks = k1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 55555

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/flume-1.9.0/file

# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Start flume2, flume3, and then flume1. flume1 starts last because the avro sources in flume2 and flume3 act as servers and must be listening before flume1's avro sinks connect.
bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console
- Run date > date.txt to modify the file.
Case 2: Failover
Case analysis
Flume1 monitors a port; its sink group contains two sinks that connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor provides the failover behavior.
Case steps
- Create three configuration files, flume1.conf, flume2.conf, flume3.conf:
The first agent adds a sink-group configuration using the failover strategy. Note that k2 has a higher priority than k1, so the agent behind k2 (flume3) is active and the agent behind k1 (flume2) is on standby.

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 33333

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 44444

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 55555

# Sink groups
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 50
a1.sinkgroups.g1.processor.priority.k2 = 100
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
The second agent's sink type is logger.
# Name the components on this agent
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = master
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = logger

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
The third agent's configuration is the same as the second's except for the port number.
# Name the components on this agent
a3.sources = r1
a3.channels = c1
a3.sinks = k1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 55555

# Describe the sink
a3.sinks.k1.type = logger

# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Start flume2, flume3, and then flume1.
bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console
- Open a new terminal, run nc master 33333, and type something.
In the screenshots, the upper left is flume1, the upper right flume2, the lower left flume3, and the lower right the client. Since flume3 (behind k2) has a higher priority than flume2, flume3 is active and receives the messages.
If flume3 is then killed, flume2 is promoted from standby to active and starts receiving the messages.
When flume3 comes back up, it receives messages again, because its priority is higher than flume2's.
Case 3: Load balancing
Case steps
- Create three configuration files, flume1.conf, flume2.conf, flume3.conf. flume2.conf and flume3.conf are the same as in case 2; flume1.conf only changes the sink-group strategy. The flume1 configuration:

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2
a1.sinkgroups = g1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = master
a1.sources.r1.port = 33333

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 44444

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = master
a1.sinks.k2.port = 55555

# Sink groups
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
- Start flume2, flume3, and then flume1. Open a new terminal, run nc master 33333, and type something.
In the screenshots, the upper left is flume1, the upper right flume2, the lower left flume3, and the lower right the client.
Case 4: Aggregation
Case analysis
Flume1 on slave1 monitors a network port and Flume2 on slave2 monitors the local file date.txt; both send their data to Flume3 on master, which prints it to the console.
Case steps
- Create flume1.conf on slave1. The source type is netcat, listening on a port; the sink type is avro, connecting to flume3.

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 33333

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = master
a1.sinks.k1.port = 44444

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Create flume2.conf on slave2. The source type is exec, tailing a file; the sink type is avro, connecting to flume3.

# Name the components on this agent
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Describe/configure the source
a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /opt/flume-1.9.0/date.txt

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = master
a2.sinks.k1.port = 44444

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Create flume3.conf on master. The source type is avro, receiving the data sent by flume1 and flume2; the sink type is logger, writing the received data to the console.

# Name the components on this agent
a3.sources = r1
a3.channels = c1
a3.sinks = k1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = master
a3.sources.r1.port = 44444

# Describe the sink
a3.sinks.k1.type = logger

# Use a channel which buffers events in memory
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
- Start flume3 first (its avro source is the server and must be listening), then flume2 and flume1.
- On slave1, run nc localhost 33333 and send some data.
- On slave2, run date > date.txt.
- The received data appears on the master's console.