Flume: topologies (single data source with multiple sinks, failover, load balancing, aggregation, etc.)

Flume 1.9.0: installation, monitoring a port, monitoring a local file and uploading to HDFS, monitoring new files in a directory and uploading to HDFS, monitoring appended files (resumable tailing)

Flume: custom interceptors, a custom Source that reads data from MySQL, and a custom Sink

Flume structure

    Flume's execution flow: Source -> Channel Processor -> Interceptors -> Channel Selector -> Channel -> Sink Processor -> Sink
    Two transactions protect delivery along this path: a put transaction from Source to Channel and a take transaction from Channel to Sink.

Chaining agents in series (Avro)

    To flow data across multiple agents (hops), the sink of the upstream agent and the source of the downstream agent must both be of type avro, and the sink points at the hostname (or IP address) and port of that source. This is the basis of the more complex topologies below, but chaining too many agents is not recommended: each extra hop lowers the transfer rate, and if any agent in the chain goes down, the whole pipeline stops.
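A minimal sketch of one such hop (agent names, hostnames, and ports here are placeholders, not taken from the cases below): the upstream agent's avro sink points at the address the downstream agent's avro source binds to.

```properties
# Upstream agent "foo": avro sink pointing at the next hop
foo.sinks.k1.type = avro
foo.sinks.k1.hostname = next-hop-host
foo.sinks.k1.port = 4141

# Downstream agent "bar": avro source listening on that same host/port
bar.sources.r1.type = avro
bar.sources.r1.bind = next-hop-host
bar.sources.r1.port = 4141
```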

Replicating and multiplexing

    Flume supports fanning an event stream out to one or more destinations. This is done by defining a flow multiplexer that can replicate events to, or selectively route them among, one or more channels.
    In the figure's example, the source of the agent named foo splits the data stream across three different channels. The channel selector that makes this choice can be replicating or multiplexing.
    With replicating, every event is sent to all channels. With multiplexing, an event is delivered to the channel(s) whose pre-configured value matches one of the event's headers. In the official example below, if the event's state header is CZ, channel c1 is selected; if it is US, channels c2 and c3 are selected; otherwise channel c4 is selected:

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

    By default, the channel selector uses the replicating strategy.
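As a toy illustration (plain Python, not Flume's actual API), the two selector strategies boil down to the following routing rule, here reusing the state header from the official example above:

```python
# Toy model of a Flume channel selector: given an event's headers and a
# selector description, return the channels the event must be written to.

def select_channels(event_headers, selector):
    """Return the list of channel names this event is routed to."""
    if selector["type"] == "replicating":
        # Replicating: every event goes to all configured channels.
        return selector["channels"]
    # Multiplexing: route on a header value, falling back to the default.
    value = event_headers.get(selector["header"])
    return selector["mapping"].get(value, selector["default"])

# Mirrors the official multiplexing example above.
selector = {
    "type": "multiplexing",
    "header": "state",
    "mapping": {"CZ": ["c1"], "US": ["c2", "c3"]},
    "default": ["c4"],
}
print(select_channels({"state": "US"}, selector))  # ['c2', 'c3']
print(select_channels({"state": "JP"}, selector))  # ['c4'] (no mapping -> default)
```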

Load balancing and failover

    There are three kinds of Sink Processor: DefaultSinkProcessor, LoadBalancingSinkProcessor, and FailoverSinkProcessor.
    DefaultSinkProcessor works with a single sink, while LoadBalancingSinkProcessor and FailoverSinkProcessor work with a sink group: LoadBalancingSinkProcessor provides load balancing, and FailoverSinkProcessor provides failover.
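A toy sketch (plain Python, not Flume code) of the failover idea: the processor always routes to the live sink with the highest configured priority, here using the same priorities as Case 2 below:

```python
# Toy model of FailoverSinkProcessor: pick the highest-priority sink that
# is currently alive; lower-priority sinks only receive traffic when the
# ones above them have failed.

def choose_sink(priorities, alive):
    """Return the live sink with the highest priority, or None if all are down."""
    live = [s for s in priorities if alive.get(s)]
    if not live:
        return None
    return max(live, key=lambda s: priorities[s])

priorities = {"k1": 50, "k2": 100}
print(choose_sink(priorities, {"k1": True, "k2": True}))   # k2 (higher priority)
print(choose_sink(priorities, {"k1": True, "k2": False}))  # k1 (failed over)
```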

Aggregation

    This topology is very common and very practical. Web applications are usually distributed across many servers, each producing logs that are troublesome to process individually. Flume solves this well: each server runs an agent that collects its own logs and forwards them to a single aggregating agent, which then writes them to HDFS, Hive, HBase, etc. for log analysis.

Transaction mechanism

    Flume's transaction mechanism (similar to a database's): Flume uses two independent transactions for event delivery, one from Source to Channel and one from Channel to Sink. For example, the spooling directory source creates an event for each line of a file; only once all events in the transaction have been delivered to the channel and committed does the source mark the file as complete. The Channel-to-Sink hop is handled the same way: if an event cannot be delivered for some reason, the transaction is rolled back and all of its events remain in the channel, waiting to be delivered again.
    By design, these two hops cannot lose data, since both are transactional. The realistic loss scenarios are elsewhere: a memory channel loses its buffered events if the agent crashes, and a full channel blocks the source from writing, so data that was never written in can be lost upstream.
    Flume therefore does not lose data, but it can duplicate it: if a sink has already delivered events successfully but receives no acknowledgment, it sends them again.
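The put/take semantics can be modeled with a small in-memory sketch (illustrative Python only, not Flume's implementation): each batch either commits as a whole or rolls back, and rolled-back events stay in the channel for redelivery.

```python
# Toy model of Flume's two independent transactions around a channel.

class Channel:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []

    def put_batch(self, events):
        # Put transaction: either the whole batch is committed or none of it.
        if len(self.queue) + len(events) > self.capacity:
            return False            # rollback: channel full, source retries later
        self.queue.extend(events)   # commit
        return True

    def take_batch(self, n, deliver):
        # Take transaction: events leave the channel only after the sink
        # confirms delivery; otherwise they stay for redelivery.
        batch = self.queue[:n]
        if deliver(batch):
            del self.queue[:n]      # commit
            return True
        return False                # rollback: events remain in the channel

ch = Channel(capacity=3)
assert ch.put_batch(["e1", "e2"])             # committed
assert not ch.put_batch(["e3", "e4"])         # would overflow -> rolled back
assert not ch.take_batch(2, lambda b: False)  # sink failed -> events kept
assert ch.take_batch(2, lambda b: True)       # delivered -> events removed
assert ch.queue == []
```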

Case 1: Single data source, multiple sinks

Case analysis

    Flume1 monitors a file for changes and passes each change to Flume2, which stores it in HDFS. At the same time, Flume1 passes the changed content to Flume3, which writes it to the local file system.

Case steps

  1. Create an empty file: touch date.txt

  2. Start HDFS and Yarn: start-dfs.sh , start-yarn.sh

  3. Create three configuration files, flume1.conf, flume2.conf, flume3.conf:
        The first agent (a1) has one source r1, two channels c1 and c2, and two sinks k1 and k2. The source is of type taildir, monitoring the local file date.txt. Both sinks are of type avro on different ports, each connecting to one of the other two agents. The channels are of type memory. Because each sink binds to exactly one channel, sending the same data to several destinations requires multiple channels and sinks.

     # Name the components on this agent
     a1.sources = r1
     a1.channels = c1 c2
     a1.sinks = k1 k2
     
     # Replicate data to all channels (this is the default and may be omitted)
     a1.sources.r1.selector.type = replicating
     
     # Describe/configure the source
     a1.sources.r1.type = TAILDIR
     a1.sources.r1.filegroups = f1
     a1.sources.r1.filegroups.f1 = /opt/flume-1.9.0/date.txt
     a1.sources.r1.positionFile = /opt/flume-1.9.0/file/position.json
     
     # Describe the sink
     a1.sinks.k1.type = avro
     a1.sinks.k1.hostname = master
     a1.sinks.k1.port = 44444
     
     a1.sinks.k2.type = avro
     a1.sinks.k2.hostname = master
     a1.sinks.k2.port = 55555
     
     # Use a channel which buffers events in memory
     a1.channels.c1.type = memory
     a1.channels.c1.capacity = 1000
     a1.channels.c1.transactionCapacity = 100
     
     a1.channels.c2.type = memory
     a1.channels.c2.capacity = 1000
     a1.channels.c2.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a1.sources.r1.channels = c1 c2
     a1.sinks.k1.channel = c1
     a1.sinks.k2.channel = c2
    

        The second agent's source is of type avro, connected to the first agent; its sink is of type hdfs.

     # Name the components on this agent
     a2.sources = r1
     a2.channels = c1
     a2.sinks = k1
     
     # Describe/configure the source
     a2.sources.r1.type = avro
     a2.sources.r1.bind = master
     a2.sources.r1.port = 44444
     
     # Describe the sink
     a2.sinks.k1.type = hdfs
     a2.sinks.k1.hdfs.path = hdfs://master:9000/a/%Y%m%d/%H
     a2.sinks.k1.hdfs.filePrefix = logs
     a2.sinks.k1.hdfs.round = true
     a2.sinks.k1.hdfs.roundValue = 1
     a2.sinks.k1.hdfs.roundUnit = hour
     a2.sinks.k1.hdfs.useLocalTimeStamp = true
     a2.sinks.k1.hdfs.batchSize = 100
     a2.sinks.k1.hdfs.fileType = DataStream
     a2.sinks.k1.hdfs.rollInterval = 30
     a2.sinks.k1.hdfs.rollSize = 134217700
     a2.sinks.k1.hdfs.rollCount = 0
     
     # Use a channel which buffers events in memory
     a2.channels.c1.type = memory
     a2.channels.c1.capacity = 1000
     a2.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a2.sources.r1.channels = c1
     a2.sinks.k1.channel = c1
    

        The third agent's source is also of type avro, connected to the first agent; its sink is of type file_roll.

     # Name the components on this agent
     a3.sources = r1
     a3.channels = c1
     a3.sinks = k1
     
     # Describe/configure the source
     a3.sources.r1.type = avro
     a3.sources.r1.bind = master
     a3.sources.r1.port = 55555
     
     # Describe the sink
     a3.sinks.k1.type = file_roll
     a3.sinks.k1.sink.directory = /opt/flume-1.9.0/file
     
     # Use a channel which buffers events in memory
     a3.channels.c1.type = memory
     a3.channels.c1.capacity = 1000
     a3.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a3.sources.r1.channels = c1
     a3.sinks.k1.channel = c1
    
  4. Start flume2 and flume3 first, then flume1 last: the avro sources in flume2 and flume3 act as servers and must be listening before flume1's avro sinks connect to them.

     bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
     bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
     bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console
    
  5. Run date > date.txt to modify the file

Case 2: Failover

Case analysis

    Flume1 monitors a port; the two sinks in its sink group connect to Flume2 and Flume3 respectively, and a FailoverSinkProcessor implements failover between them.

Case steps

  1. Create three configuration files, flume1.conf, flume2.conf, flume3.conf:
        The first agent adds a sink group configured with the failover strategy. Note that k2 has a higher priority than k1, so the flume behind k2 is active while the flume behind k1 stands by.

     # Name the components on this agent
     a1.sources = r1
     a1.channels = c1
     a1.sinks = k1 k2
     a1.sinkgroups = g1
     
     # Describe/configure the source
     a1.sources.r1.type = netcat
     a1.sources.r1.bind = master
     a1.sources.r1.port = 33333
     
     # Describe the sink
     a1.sinks.k1.type = avro
     a1.sinks.k1.hostname = master
     a1.sinks.k1.port = 44444
     
     a1.sinks.k2.type = avro
     a1.sinks.k2.hostname = master
     a1.sinks.k2.port = 55555
     
     # Sink groups
     a1.sinkgroups.g1.sinks = k1 k2
     a1.sinkgroups.g1.processor.type = failover
     a1.sinkgroups.g1.processor.priority.k1 = 50
     a1.sinkgroups.g1.processor.priority.k2 = 100
     a1.sinkgroups.g1.processor.maxpenalty = 10000
     
     # Use a channel which buffers events in memory
     a1.channels.c1.type = memory
     a1.channels.c1.capacity = 1000
     a1.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a1.sources.r1.channels = c1
     a1.sinks.k1.channel = c1
     a1.sinks.k2.channel = c1
    

        The second agent's sink is of type logger.

     # Name the components on this agent
     a2.sources = r1
     a2.channels = c1
     a2.sinks = k1
     
     # Describe/configure the source
     a2.sources.r1.type = avro
     a2.sources.r1.bind = master
     a2.sources.r1.port = 44444
     
     # Describe the sink
     a2.sinks.k1.type = logger
     
     # Use a channel which buffers events in memory
     a2.channels.c1.type = memory
     a2.channels.c1.capacity = 1000
     a2.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a2.sources.r1.channels = c1
     a2.sinks.k1.channel = c1
    

        The third agent's configuration is similar to the second's; only the port number differs.

     # Name the components on this agent
     a3.sources = r1
     a3.channels = c1
     a3.sinks = k1
     
     # Describe/configure the source
     a3.sources.r1.type = avro
     a3.sources.r1.bind = master
     a3.sources.r1.port = 55555
     
     # Describe the sink
     a3.sinks.k1.type = logger
     
     # Use a channel which buffers events in memory
     a3.channels.c1.type = memory
     a3.channels.c1.capacity = 1000
     a3.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a3.sources.r1.channels = c1
     a3.sinks.k1.channel = c1
    
  2. Start flume2, flume3, flume1 respectively.

     bin/flume-ng agent -c conf -f flume2.conf -n a2 -Dflume.root.logger=INFO,console
     bin/flume-ng agent -c conf -f flume3.conf -n a3 -Dflume.root.logger=INFO,console
     bin/flume-ng agent -c conf -f flume1.conf -n a1 -Dflume.root.logger=INFO,console
    
  3. Start a new terminal, type nc master 33333 , and then type something.
        The upper left is flume1, the upper right is flume2, the lower left is flume3, and the lower right is the client. Since flume3 (sink k2) has a higher priority than flume2 (sink k1), flume3 is active and receives the messages.
        If flume3 is then killed, flume2 is promoted from standby to active and starts receiving the messages.
        When flume3 comes back up, its higher priority means it becomes active again and receives the messages.

Case 3: Load balancing

Case steps

  1. Create three configuration files, flume1.conf, flume2.conf, flume3.conf. flume2.conf and flume3.conf are identical to those in Case 2; flume1 only changes the sink group strategy. The flume1 configuration:

     # Name the components on this agent
     a1.sources = r1
     a1.channels = c1
     a1.sinks = k1 k2
     a1.sinkgroups = g1
     
     # Describe/configure the source
     a1.sources.r1.type = netcat
     a1.sources.r1.bind = master
     a1.sources.r1.port = 33333
     
     # Describe the sink
     a1.sinks.k1.type = avro
     a1.sinks.k1.hostname = master
     a1.sinks.k1.port = 44444
     
     a1.sinks.k2.type = avro
     a1.sinks.k2.hostname = master
     a1.sinks.k2.port = 55555
     
     # Sink groups
     a1.sinkgroups.g1.sinks = k1 k2
     a1.sinkgroups.g1.processor.type = load_balance
     a1.sinkgroups.g1.processor.backoff = true
     a1.sinkgroups.g1.processor.selector = random
     
     # Use a channel which buffers events in memory
     a1.channels.c1.type = memory
     a1.channels.c1.capacity = 1000
     a1.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a1.sources.r1.channels = c1
     a1.sinks.k1.channel = c1
     a1.sinks.k2.channel = c1
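The load_balance processor with backoff can be sketched in a few lines of toy Python (illustrative only, not Flume code): a sink that fails is skipped until its backoff window expires, and the random selector picks among the remaining live sinks.

```python
import random

# Toy model of LoadBalancingSinkProcessor with the random selector and
# backoff enabled: recently failed sinks are excluded until time passes.

def pick_sink(sinks, backoff_until, now, rng):
    """Randomly pick among sinks that are not currently backing off."""
    live = [s for s in sinks if backoff_until.get(s, 0) <= now]
    if not live:
        return None
    return rng.choice(live)

rng = random.Random(0)
backoff = {"k1": 0, "k2": 10}  # k2 failed recently; it backs off until t=10
print(pick_sink(["k1", "k2"], backoff, now=5, rng=rng))  # k1 (k2 backing off)
```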
    
  2. Start flume2, flume3, flume1 respectively. Open a new terminal, run nc master 33333, and type something.
        The upper left is flume1, the upper right is flume2, the lower left is flume3, and the lower right is the client.

Case 4: Aggregation

Case analysis

    Flume1 on slave1 monitors a data port, and Flume2 on slave2 monitors the local file date.txt. Both send their data to Flume3 on master, which prints the data to the console.

Case steps

  1. Create the configuration file flume1.conf on slave1. The source is of type netcat, listening on a port; the sink is of type avro, connecting to flume3.

     # Name the components on this agent
     a1.sources = r1
     a1.channels = c1
     a1.sinks = k1
     
     # Describe/configure the source
     a1.sources.r1.type = netcat
     a1.sources.r1.bind = localhost
     a1.sources.r1.port = 33333
     
     # Describe the sink
     a1.sinks.k1.type = avro
     a1.sinks.k1.hostname = master
     a1.sinks.k1.port = 44444
     
     # Use a channel which buffers events in memory
     a1.channels.c1.type = memory
     a1.channels.c1.capacity = 1000
     a1.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a1.sources.r1.channels = c1
     a1.sinks.k1.channel = c1
    

        Create the configuration file flume2.conf on slave2. The source is of type exec, monitoring a file; the sink is of type avro, connecting to flume3.

     # Name the components on this agent
     a2.sources = r1
     a2.channels = c1
     a2.sinks = k1
     
     # Describe/configure the source
     a2.sources.r1.type = exec
     a2.sources.r1.command = tail -F /opt/flume-1.9.0/date.txt
     
     # Describe the sink
     a2.sinks.k1.type = avro
     a2.sinks.k1.hostname = master
     a2.sinks.k1.port = 44444
     
     # Use a channel which buffers events in memory
     a2.channels.c1.type = memory
     a2.channels.c1.capacity = 1000
     a2.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a2.sources.r1.channels = c1
     a2.sinks.k1.channel = c1
    

        Create the configuration file flume3.conf on master. The source is of type avro, receiving the data sent by flume1 and flume2; the sink is of type logger, printing the received data to the console.

     # Name the components on this agent
     a3.sources = r1
     a3.channels = c1
     a3.sinks = k1
     
     # Describe/configure the source
     a3.sources.r1.type = avro
     a3.sources.r1.bind = master
     a3.sources.r1.port = 44444
     
     # Describe the sink
     a3.sinks.k1.type = logger
     
     # Use a channel which buffers events in memory
     a3.channels.c1.type = memory
     a3.channels.c1.capacity = 1000
     a3.channels.c1.transactionCapacity = 100
     
     # Bind the source and sink to the channel
     a3.sources.r1.channels = c1
     a3.sinks.k1.channel = c1
    
  2. Start flume3 first (its avro source is the server), then flume1 and flume2.

  3. On slave1, run nc localhost 33333 and send some data

  4. On slave2, run date > date.txt

  5. The received data appears on the master's console


Origin blog.csdn.net/H_X_P_/article/details/106568209