Flume: First Steps and Practice


Reference book: "Flume: Building a Highly Available, Scalable, Massive Log Collection System" by Hari Shreedharan (the Chinese edition of "Using Flume", O'Reilly).

The book above is referred to below as "the reference book"; text and pictures quoted from it are marked accordingly. The official documentation is referred to as "the official docs".

This article is a personal learning record, written from scratch. If anything is wrong, please let me know.


First Steps with Flume

· Introduction

Flume is a highly available, highly reliable, distributed system provided by Cloudera for efficiently collecting, aggregating, and moving massive amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

· Data flow model

  • Source: receives data from an external source (such as a web server), converts the received data into Flume events, and forwards them to one or more Channels. Flume supports many ways of receiving data, such as Avro, Thrift, the Twitter 1% firehose, and so on.
  • Channel: a temporary storage container that caches events received from the Source until they are consumed by a Sink, acting as a bridge between the Source and the Sink. Channels are fully transactional, which guarantees the consistency of the data between sending and receiving, and a Channel can be connected to any number of Sources and Sinks. Supported types include the JDBC Channel, File Channel, Memory Channel, and so on.
  • Sink: delivers data to a centralized store such as HDFS or HBase. It consumes events from a Channel and transfers them to the destination, which could be another Flume agent, HDFS, HBase, and so on.

The basic operating principle:

Diagram taken from the reference book

· Launch configuration

The configuration of a Flume Agent is stored in a local configuration file. This is a text file that follows the Java properties file format. One or more agents can be specified in the same configuration file. The configuration file describes the properties of each Source, Sink, and Channel in an agent and how they are wired together to form data flows.
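
As a minimal sketch (the agent and component names below are arbitrary), every property follows the pattern <agent-name>.<component-type>.<component-name>.<property> = <value>; a complete, runnable example appears in the exercises later in this article.

 # Hypothetical agent "agent1" with one source, one channel and one sink
 agent1.sources = src1
 agent1.channels = ch1
 agent1.sinks = snk1
 
 # Per-component properties
 agent1.sources.src1.type = netcat
 agent1.sources.src1.bind = localhost
 agent1.sources.src1.port = 44444
 agent1.channels.ch1.type = memory
 agent1.sinks.snk1.type = logger
 
 # Wiring: a source can write to several channels, a sink reads from exactly one
 agent1.sources.src1.channels = ch1
 agent1.sinks.snk1.channel = ch1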

Source

The Source is the component of a Flume Agent that is responsible for receiving data. It can receive data from other systems, such as the output of other applications, a Java Message Service (JMS) queue, or data sent over RPC by the Sink of another Flume Agent. A Source receives data from an external source, from another agent, or generates data itself, and writes that data to one or more Channels that have been configured for the Source in advance. This is the most fundamental responsibility of a Source.

Flume's configuration system validates the configuration of each Source and discards Sources that are misconfigured; it ensures that:

  • Each Source has at least one properly configured Channel connected to it.
  • Each Source has a type parameter defined.
  • The Source is in the agent's active list of sources.
    Once a Source is configured successfully, Flume's lifecycle management system will attempt to start it. The Source stops only when the agent itself is stopped or killed, or when the agent is reconfigured by the user.

One of Flume's most important features is the simplicity of scaling a Flume deployment horizontally. Scaling out is easy because it is easy to add new agents to a Flume topology and easy to configure a new agent to send data to other Flume agents. Likewise, once a new agent is added, agents that are already running can be configured to send data to the new agent simply by updating their configuration files. Below is a brief summary of several Sources mentioned in the official docs; for a detailed description, refer to other materials. Some practice examples follow in the latter part of this article.

  • Avro Source: Flume's main RPC Source. The Avro Source is designed to be a highly scalable RPC server that receives data from the Avro Sink of another Flume agent, or from client applications that use the Flume SDK to send data to a Flume agent. The Avro Source uses Avro's Netty-based inter-process communication (IPC) protocol.
  • Thrift Source: because the Avro Source could not receive data from non-JVM languages, Flume added Apache Thrift RPC support for cross-language communication. A Thrift Source can simply be thought of as a multi-threaded, high-performance Thrift server.
  • HTTP Source: Flume ships with an HTTP Source that can receive events via HTTP POST (the GET method should be used only for experimentation). HTTP requests are converted into Flume events by a pluggable "handler", which must implement the HTTPSourceHandler interface. The handler takes an HttpServletRequest and returns a list of Flume events. From the client's point of view, the HTTP Source behaves like a web server that accepts Flume events.
  • Spooling Directory Source: reads events from files placed in a monitored directory. The Source expects files in the directory to be immutable; once a file has been moved into the directory, it must not be written to again. Once a file has been fully consumed and all of its events have been successfully written to the Source's Channels, the Source either renames the file or deletes it, depending on the configuration. When renaming, the Source just appends a suffix to the file name rather than changing it completely. (A configuration sketch follows after this list.)
  • Syslog Source: reads syslog data and generates Flume events. Flume provides the Syslog UDP Source, Syslog TCP Source, and Multiport Syslog TCP Source. The UDP Source treats an entire message as a single event, while the TCP Sources create a new event for each string of characters separated by a newline ('\n').
  • Exec Source: runs a command configured by the user and generates events from its standard output. It can also read the command's error stream, convert the data into Flume events, and write them to Channels. The Source expects the command to keep producing data and consumes its output and error streams; as long as the command runs, the Source keeps running and continuously reads and processes the output.
  • JMS Source: a Source bundled with Flume that can read data from a Java Message Service queue or topic.
  • Custom Source: because different production environments inevitably have their own communication formats, users may need to implement the Source interface themselves to build a custom Source.
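
As an illustration, here is a minimal, untested configuration sketch of the Spooling Directory Source mentioned above; the agent name and directory path are placeholders, and the property names are taken from the official docs.

 a1.sources = r1
 a1.channels = c1
 a1.sources.r1.type = spooldir
 # directory to watch; files dropped here must not be modified afterwards
 a1.sources.r1.spoolDir = /var/log/flume-spool
 # rename fully consumed files by appending a suffix (the default) instead of deleting them
 a1.sources.r1.fileSuffix = .COMPLETED
 a1.sources.r1.deletePolicy = never
 a1.sources.r1.channels = c1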

Channel

A Channel is the buffer located between the Source and the Sink. Because of this, Channels allow Sources and Sinks to operate at different rates. Channels are the key to Flume's guarantee of no data loss (when configured properly, of course). Sources write data to one or more Channels, which are then read by one or more Sinks. A Sink can read from only one Channel, while multiple Sinks can read from the same Channel for better performance.

Channels allow Sources to operate on them with their own threading model without worrying about the Sinks reading from the Channel, and vice versa. The buffer sitting between Sources and Sinks also allows them to operate at different rates, since writes happen at the tail of the buffer and reads happen at its head. This also allows a Flume agent to absorb a Source's "peak hour" load even when the Sinks cannot read from the Channel fast enough.

Channels allow multiple Sources and Sinks to operate on them; this is where the transactional nature of Channels comes in. Every write to and read from a Channel happens within the context of a transaction. Only when a write transaction is committed do the events in that transaction become readable by any Sink. Likewise, if a Sink has successfully read an event, the event is not available to other Sinks unless that Sink rolls back its transaction.
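
To make the transaction semantics concrete, here is a minimal Java sketch (not from the reference book) of how a Sink-style component typically takes one event from a Channel inside a transaction, using the org.apache.flume.Channel and Transaction interfaces; the delivery step is a placeholder.

 import org.apache.flume.Channel;
 import org.apache.flume.Event;
 import org.apache.flume.Transaction;
 
 public class TransactionalReadSketch {
     // Sketch: take a single event from a channel within a transaction.
     public static void readOne(Channel channel) {
         Transaction tx = channel.getTransaction();
         try {
             tx.begin();
             Event event = channel.take();   // may be null if the channel is currently empty
             if (event != null) {
                 // ... deliver the event to its destination here (placeholder) ...
             }
             tx.commit();                    // only now is the event removed from the channel
         } catch (Throwable t) {
             tx.rollback();                  // the event stays in the channel and will be re-read
             throw new RuntimeException("Delivery failed, transaction rolled back", t);
         } finally {
             tx.close();
         }
     }
 }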

Several Channels described in the official docs:

  • Memory Channel: events are stored in an in-memory queue with a configurable maximum size. The Source writes to its tail and the Sink reads from its head. The Memory Channel supports very high throughput because it holds all data in memory. The Channel is thread-safe and can handle writes from several Sources and reads from several Sinks simultaneously. The Memory Channel is appropriate when data loss is acceptable, since this kind of Channel does not persist data to disk.

    The Memory Channel supports Flume's transactional model and maintains separate queues for transactions in progress. If a transaction is rolled back, its events are re-inserted at the head of the Channel in reverse order, so that they will be read again in the same order in which they were originally inserted. In this way, although Flume does not guarantee ordering in general, the Memory Channel does ensure that events are read in the order they were written. However, when some transactions are rolled back, events written later may reach the destination before events written earlier.

  • File Channel: Flume's persistent Channel. It writes all events to disk, so it does not lose data if the process is stopped or the machine goes down. The File Channel guarantees that even if the machine or the agent crashes and restarts, an event is removed from the Channel only after a Sink has taken it and committed the transaction. (A configuration sketch follows after this list.)

    The File Channel is designed for scenarios that require data persistence and cannot tolerate data loss. Because the Channel writes data to disk, it does not lose data on failure or downtime. As an added bonus, because it writes to disk, the File Channel can have a very large capacity, especially compared with the Memory Channel.

  • Spillable Memory Channel: as the name suggests, a Memory Channel that can spill over: the in-memory queue acts as the primary store and the disk as overflow, with an embedded File Channel managing the disk storage. When the in-memory queue is full, incoming events are stored in the File Channel. This sounds attractive, but the official docs explicitly note that this Channel is currently experimental and not recommended for production use.

  • Custom Channel: a custom Channel implementation.
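
As a minimal sketch of the File Channel mentioned above (the directory paths and sizes are placeholders, not taken from the reference book):

 a1.channels = c1
 a1.channels.c1.type = file
 # where the channel keeps its checkpoint and data files; must be writable by the agent
 a1.channels.c1.checkpointDir = /var/flume/checkpoint
 a1.channels.c1.dataDirs = /var/flume/data
 a1.channels.c1.capacity = 1000000
 a1.channels.c1.transactionCapacity = 10000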

Sink

The components that remove data from a Flume agent and write it to another agent or to some other data store are called Sinks. Sinks are fully transactional: before removing a batch of events from a Channel, each Sink starts a transaction with the Channel. Once the batch of events has been successfully written to the storage system or to the next Flume agent, the Sink commits the transaction with the Channel. Only after the transaction is committed does the Channel delete the events from its own internal buffer.
Sinks are configured using Flume's standard configuration system. Each agent can have no Sinks or several Sinks, and each Sink can read events from only one Channel. If no Channel is configured for a Sink, the Sink is removed from the agent. The configuration system ensures that:

  • Each Sink has at least one properly configured Channel connected to it.
  • Each Sink has a type parameter defined.
  • The Sink is in the agent's active list of Sinks.

Flume can aggregate Sinks into Sink groups, and each Sink group may contain one or more Sinks. If a Sink is not defined in any Sink group, it can be thought of as being in a group of which it is the only member. A brief summary:

  • HDFS Sink: writes events to the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files, and it can roll files periodically (closing the current file and creating a new one) based on elapsed time, data size, or number of events. It can also partition data by attributes such as the timestamp or the machine where the event originated. The HDFS directory path may contain escape sequences that the HDFS Sink replaces in order to generate the directory/file name used to store the events. Using this Sink requires Hadoop to be installed, so that Flume can use the Hadoop JARs to communicate with the HDFS cluster.
  • Hive Sink: streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. As soon as a batch of events is committed to Hive, they become immediately visible to Hive queries. Fields of the incoming event data are mapped to the corresponding columns of the Hive table.
  • Logger Sink: logs events at the INFO level. Typically used for testing/debugging purposes.
  • Avro Sink: converts Flume events into Avro events and sends them to the configured hostname/port pair. Events are taken from the configured Channel in batches of the configured batch size.
  • Thrift Sink: converts Flume events into Thrift events and sends them to the configured hostname/port pair. Events are taken from the configured Channel in batches of the configured batch size.
  • IRC Sink: takes messages from the attached Channel and relays them to the configured IRC destinations.
  • File Roll Sink: stores events on the local file system.
  • Null Sink: discards all events it receives.
  • HTTP Sink: takes events from the Channel and sends them to a remote server using HTTP POST requests; the event body is sent as the content of the POST.
  • Custom Sink: a custom Sink implementation.

In addition, there are Sinks not mentioned above, such as the Kafka Sink, ElasticSearch Sink, and so on, which may be added here later as needed.
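
For reference, a minimal sketch of a Kafka Sink based on the Flume 1.9 docs (the broker address and topic name are placeholders):

 a1.sinks = k1
 a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
 a1.sinks.k1.kafka.bootstrap.servers = kafka-broker:9092
 a1.sinks.k1.kafka.topic = flume-events
 a1.sinks.k1.flumeBatchSize = 100
 a1.sinks.k1.channel = c1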


Other components

Interceptors

Flume interceptors are plug-in components that sit between a Source and its Channels. Interceptors can process and filter events after the Source receives them and before the data is written to a Channel. Each interceptor instance processes only the events received by its own Source. Any number of interceptors can be added to a flow and chained together; the data written to the Channel is whatever the last interceptor in the chain passes through. (A small configuration sketch follows the list below.)

  • Timestamp Interceptor: inserts a timestamp into the event headers; commonly used on first-tier agents (the HDFS Sink, for example, uses this header when resolving time-based escape sequences in its path).
  • Host Interceptor: inserts the IP address or hostname of the server into the event headers.
  • Static Interceptor: lets the user append a header with a static value to all events. The current implementation does not allow more than one header to be specified; users can chain multiple Static Interceptors, each defining one static header.
  • Remove Header Interceptor: manipulates Flume event headers by removing one or more of them. It can remove a statically defined header, headers matching a regular expression, or headers in a list. If only one header needs to be removed, specifying it by name performs better than the other two methods.
  • UUID Interceptor: sets a universally unique identifier on every intercepted event.
  • Morphline Interceptor: runs each event through a Morphline configuration file, which defines a chain of commands that pipe records from one command to the next. This interceptor should not be used for heavyweight processing; for that, the Morphline Solr Sink is a better fit.
  • Regex Filtering Interceptor: interprets the event body as UTF-8 text and filters events by matching that text against a regular expression.
  • Search and Replace Interceptor: provides simple string search-and-replace functionality based on Java regular expressions.
  • Custom Interceptor: a custom interceptor implementing the org.apache.flume.interceptor.Interceptor interface.
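
As a small sketch (the source and interceptor names are arbitrary), chaining a timestamp and a host interceptor on a source looks like this; the custom-interceptor exercise later in this article uses the same mechanism.

 a1.sources.r1.interceptors = i1 i2
 a1.sources.r1.interceptors.i1.type = timestamp
 a1.sources.r1.interceptors.i2.type = host
 # store the hostname rather than the IP address in the header
 a1.sources.r1.interceptors.i2.useIP = false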

Channel Selector

A Channel selector determines which of the Source's Channels each received event is written to. If an event fails to be written to one of those Channels, the writes that already succeeded on the other Channels cannot be rolled back; Flume throws a ChannelException and the transaction fails.

Flume has two built-in Channel selectors (plus support for custom ones):

  • Replicating Channel Selector (default): if no selector is specified for a Source, the replicating selector is used by default. It copies each event to all of the Channels specified by the Source's channels parameter.
  • Multiplexing Channel Selector: a selector designed for dynamically routing events. It chooses the Channel an event is written to based on the value of a specific header, and it is usually used together with interceptors. (A minimal sketch follows this list.)
  • Custom Channel Selector: a custom selector that implements the ChannelSelector interface or extends the AbstractChannelSelector abstract class. When the Flume agent is started, the custom selector class and its dependencies must be on the agent's classpath.
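
A minimal sketch of the multiplexing selector (the header name and values are made up here; a complete working example appears in the custom-interceptor exercise below):

 a1.sources.r1.selector.type = multiplexing
 # route by the value of the "level" header, which an interceptor is assumed to have set
 a1.sources.r1.selector.header = level
 a1.sources.r1.selector.mapping.error = c1
 a1.sources.r1.selector.default = c2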

Sink Groups and Sink Processors

Flume instantiates one Sink Processor per Sink group to execute the Sinks in the group. A Sink group may contain any number of Sinks; it is generally used with RPC Sinks to send data between tiers with load balancing or failover. Sink groups are declared as components in the active list, just like Sources, Sinks, and Channels, using the sinkgroups keyword. Each Sink group is a named component, because an agent may have several Sink groups. Note that the Sinks in a group are not all active at the same time; only one of them sends data at any given moment. Therefore, Sink groups should not be used to drain a Channel more quickly; in that case, multiple Sinks should simply be configured on their own, without a Sink group, and set up to read from the same Channel.

  • Default Sink Processor: accepts only a single Sink. The user is not required to create a processor (Sink group) for a single Sink.
  • Failover Sink Processor: maintains a prioritized list of Sinks to guarantee that every event that arrives will be delivered as long as a Sink is available. The failover mechanism works by relegating failed Sinks to a pool where they are assigned a cool-down period that grows with sequential failures before they are retried. Once a Sink successfully sends an event, it is restored to the live pool. If a Sink fails while sending an event, the Sink with the next highest priority is tried next; for example, a Sink with priority 100 is activated before a Sink with priority 80. If no priority is specified, the priority is determined by the order in which the Sinks are listed in the configuration.
  • Load Balancing Sink Processor: provides the ability to load-balance across multiple Sinks. It maintains an indexed list of active Sinks over which the load must be distributed. The implementation supports round_robin and random selection mechanisms; round_robin is the default and can be overridden in the configuration. When invoked, the selector picks the next Sink using its configured selection mechanism and invokes it. This implementation does not blacklist a failing Sink by default; instead it optimistically continues to try every available Sink.

Some introductory exercises

· A basic example from the official docs

 vim HelloFlume.conf   # create the agent configuration file
 
 # example.conf: A single-node Flume configuration
 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1
 
 # Describe/configure the source
 # This example listens with a netcat source on port 44444 of the local machine
 a1.sources.r1.type = netcat
 a1.sources.r1.bind = localhost  
 a1.sources.r1.port = 44444
 
 # Describe the sink
 a1.sinks.k1.type = logger
 
 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
 
 #### End of configuration ####

 bin/flume-ng agent --name a1 --conf conf/ --conf-file learn/part1/HelloFlume.conf -Dflume.root.logger=INFO,console
 # Note these parameters:
 # --name is the name of the agent to start; the configuration above defines a1, so a1 is used here (short form: -n)
 # --conf is Flume's conf directory (short form: -c)
 # --conf-file is the configuration file used to start this agent, pointing at the file created above (short form: -f)
 # After a successful start the current terminal is blocked, so open another terminal

 nc localhost 44444
 hello flume

 Switch back to the blocked terminal to see the latest log output

 2019-9-18 09:52:55,583 (SinkRunner-PollingRunner-DefaultSinkProcessor)
  [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
  Event: { headers:{} body: 68 65 6C 6C 6F 20 66 6C  75 6D 65                hello flume }

· Monitoring a local log file and sending the changes to different destinations

The configuration is posted directly below:

 # file-flume.conf: watch a local file for changes and send the data to two other Flume agents via Avro sinks #
 #name
 a1.sources = r1
 a1.channels = c1 c2
 a1.sinks = k1 k2
 
 #configure the source
 # Use the TAILDIR source to watch for file changes; it can resume from the last recorded position fairly efficiently
 a1.sources.r1.type = TAILDIR
 a1.sources.r1.filegroups = f1
 a1.sources.r1.filegroups.f1 = /root/public/result/t2.txt
 a1.sources.r1.positionFile = /usr/local/soft/flume-1.9.0/learn/part2/position.json
 
 # Set the selector to replicating; this could be omitted because it is the default, but it is written out here to get familiar with it
 a1.sources.r1.selector.type = replicating
 
 #channel
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 a1.channels.c2.type = memory
 a1.channels.c2.capacity = 1000
 a1.channels.c2.transactionCapacity = 100
 

 #sink
 # The two sinks bind to different ports
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 12345
 
 a1.sinks.k2.type = avro
 a1.sinks.k2.hostname = master
 a1.sinks.k2.port = 12346
 
 #bind
 a1.sources.r1.channels = c1 c2
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c2
 # flume-hdfs.conf: receive data from an Avro source and upload it to HDFS via an HDFS sink #
 #name
 a2.sources = r1
 a2.channels = c1
 a2.sinks = k1
 
 #source
 a2.sources.r1.type = avro
 a2.sources.r1.bind = master
 a2.sources.r1.port = 12345
 
 #channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 #sink
 a2.sinks.k1.type = hdfs
 # HDFS path to upload to
 a2.sinks.k1.hdfs.path = hdfs://master:9000/flume/part2/events/%y-%m-%d/%H%M/%S
 # Prefix of the uploaded files
 a2.sinks.k1.hdfs.filePrefix = events
 # Whether to roll folders based on time
 a2.sinks.k1.hdfs.round = true
 # How many time units before a new folder is created
 a2.sinks.k1.hdfs.roundValue = 1
 # The time unit used for rolling
 a2.sinks.k1.hdfs.roundUnit = hour
 # Whether to use the local timestamp
 a2.sinks.k1.hdfs.useLocalTimeStamp = true
 # How many events to accumulate before flushing to HDFS once; a small value is used here to make this learning test easy to observe
 a2.sinks.k1.hdfs.batchSize = 100
 # File type; compression is supported
 a2.sinks.k1.hdfs.fileType = DataStream
 # How often (in seconds) to roll a new file
 a2.sinks.k1.hdfs.rollInterval = 30
 # Roll size of each file; it is best to set this slightly smaller than the HDFS block size
 a2.sinks.k1.hdfs.rollSize = 134217000
 # File rolling is independent of the number of events
 a2.sinks.k1.hdfs.rollCount = 0
 
 #bind
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1
 # file-local.conf: receive data from an Avro source and store it locally #

 #name
 a3.sources = r1
 a3.channels = c1
 a3.sinks = k1
 
 #source
 a3.sources.r1.type = avro
 a3.sources.r1.bind = master
 a3.sources.r1.port = 12346
 
 #channel
 a3.channels.c1.type = memory
 a3.channels.c1.capacity = 1000
 a3.channels.c1.transactionCapacity = 100
 
 #sink
 a3.sinks.k1.type = file_roll
 # Note: the local output directory must be created in advance; Flume will not create it for you and will raise an error instead
 a3.sinks.k1.sink.directory = /usr/local/soft/flume-1.9.0/learn/part2/localResult/
 
 #bind
 a3.sources.r1.channels = c1
 a3.sinks.k1.channel = c1

With the configuration in place, start Flume. Pay attention to the startup parameters and make sure the agent name matches the one used in each configuration file.

 bin/flume-ng agent --name a1 --conf conf/ --conf-file learn/part2/file-flume.conf
 bin/flume-ng agent --name a2 --conf conf/ --conf-file learn/part2/flume-hdfs.conf
 bin/flume-ng agent --name a3 --conf conf/ --conf-file learn/part2/flume-local.conf

Since a local file is being monitored, append some data to it in any way you like. The results:

 [root@master localResult]# hadoop fs -ls -R /flume
 drwxr-xr-x   - root supergroup          0 2019-9-18 14:33 /flume/part2
 drwxr-xr-x   - root supergroup          0 2019-9-18 14:33 /flume/part2/events
 drwxr-xr-x   - root supergroup          0 2019-9-18 14:33 /flume/part2/events/19-9-18
 drwxr-xr-x   - root supergroup          0 2019-9-18 14:33 /flume/part2/events/19-9-18/1400
 drwxr-xr-x   - root supergroup          0 2019-9-18 14:35 /flume/part2/events/19-9-18/1400/00
 -rw-r--r--   1 root supergroup       3648 2019-9-18 14:34 /flume/part2/events/19-9-18/1400/00/events.1569911635854
 -rw-r--r--   1 root supergroup       2231 2019-9-18 14:35 /flume/part2/events/19-9-18/1400/00/events.1569911670803

 [root@master localResult]# ls -lh /usr/local/soft/flume-1.9.0/learn/part2/localResult/
 总用量 8.0K
 -rw-r--r--. 1 root root 2.5K 9月  18 14:34 1569911627438-1
 -rw-r--r--. 1 root root 3.4K 9月  18 14:34 1569911627438-2
 -rw-r--r--. 1 root root    0 9月  18 14:34 1569911627438-3
 -rw-r--r--. 1 root root    0 9月  18 14:35 1569911627438-4
 -rw-r--r--. 1 root root    0 9月  18 14:35 1569911627438-5

· Load balancing and failover from the official docs: Sink groups and Sink processors

Failover

 #name
 a1.sources = r1
 a1.channels = c1
 a1.sinks = k1 k2
 
 #configure the source: monitor local file changes by running a command
 a1.sources.r1.type = exec
 a1.sources.r1.command = tail -F /root/public/result/t2.txt
 
 
 #channel
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 #sink
 a1.sinkgroups = g1
 a1.sinkgroups.g1.sinks = k1 k2
 a1.sinkgroups.g1.processor.type = failover
 a1.sinkgroups.g1.processor.priority.k1 = 5
 a1.sinkgroups.g1.processor.priority.k2 = 10
 a1.sinkgroups.g1.processor.maxpenalty = 10000
 
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = master
 a1.sinks.k1.port = 12345
 
 a1.sinks.k2.type = avro
 a1.sinks.k2.hostname = master
 a1.sinks.k2.port = 12346
 
 #bind
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c1
 The configurations used to start the other two Flume agents differ only in the port and agent name, so only one of them is shown

 #name
 a2.sources = r1
 a2.channels = c1
 a2.sinks = k1
 
 #source
 a2.sources.r1.type = avro
 a2.sources.r1.bind = master
 a2.sources.r1.port = 12345
 
 #channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 #sink
 a2.sinks.k1.type = logger
 
 #bind
 a2.sources.r1.channels = c1
 a2.sinks.k1.channel = c1
 bin/flume-ng agent -n a1 -c conf -f learn/part3/file-flume.conf
 bin/flume-ng agent -n a2 -c conf -f learn/part3/flume-sink1.conf -Dflume.root.logger=INFO,console
 bin/flume-ng agent -n a3 -c conf -f learn/part3/flume-sink2.conf -Dflume.root.logger=INFO,console

Because Sink k2 has a higher priority than k1, all log data is initially sent to k2. After shutting down the agent behind k2 with Ctrl+C, the data is delivered to k1 instead.

For the load-balancing configuration, only a few parameters need to change:

 a1.sinkgroups = g1
 a1.sinkgroups.g1.sinks = k1 k2
 a1.sinkgroups.g1.processor.type = load_balance
 a1.sinkgroups.g1.processor.backoff = true
 a1.sinkgroups.g1.processor.selector = random

Because each Avro Sink keeps an open connection to its Avro Source, adding more Sinks that write to the same agent adds more socket connections and consumes more resources on the second-tier agent. Think carefully before increasing the number of Sinks that write to the same agent.


Multi-node aggregation

The plan now is to have the Node1 and Node2 nodes produce data and to aggregate the collected log information on the Master machine. Straight to the configuration:

 # flume-node1.conf #
 #name
 a2.sources = r1
 a2.channels = c1 c2
 a2.sinks = k1 k2
 
 #source
 a2.sources.r1.type = exec
 a2.sources.r1.command = tail -F /usr/local/soft/flume-1.9.0/learn/part4/input/t1.txt
 
 a2.sources.r1.selector.type = replicating
 
 #channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 a2.channels.c2.type = memory
 a2.channels.c2.capacity = 1000
 a2.channels.c2.transactionCapacity = 100
 
 #sink
 a2.sinks.k1.type = avro
 a2.sinks.k1.hostname = master
 a2.sinks.k1.port = 12345
 
 a2.sinks.k2.type = logger
 
 #bind
 a2.sources.r1.channels = c1 c2
 a2.sinks.k1.channel = c1
 a2.sinks.k2.channel = c2  
 # flume-node2.conf #
 #name
 a3.sources = r1
 a3.channels = c1 c2
 a3.sinks = k1 k2
 
 #source
 a3.sources.r1.type = TAILDIR
 a3.sources.r1.positionFile = /usr/local/soft/flume-1.9.0/learn/part4/taildir_position.json
 a3.sources.r1.filegroups = f1
 a3.sources.r1.filegroups.f1 = /usr/local/soft/flume-1.9.0/learn/part4/input/t1.txt
 
 a3.sources.r1.selector.type = replicating
 
 #channel
 a3.channels.c1.type = memory
 a3.channels.c1.capacity = 1000
 a3.channels.c1.transactionCapacity = 100
 
 a3.channels.c2.type = memory
 a3.channels.c2.capacity = 1000
 a3.channels.c2.transactionCapacity = 100
 
 #sink
 a3.sinks.k1.type = avro
 a3.sinks.k1.hostname = master
 a3.sinks.k1.port = 12345
 
 a3.sinks.k2.type = logger
 
 #bind
 a3.sources.r1.channels = c1 c2
 a3.sinks.k1.channel = c1
 a3.sinks.k2.channel = c2
 #name
 a1.sources = r1
 a1.channels = c1 c2
 a1.sinks = k1 k2
 
 #configure the source
 a1.sources.r1.type = avro
 a1.sources.r1.bind = master
 a1.sources.r1.port = 12345
 
 a1.sources.r1.selector.type = replicating
 
 #channel
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 a1.channels.c2.type = memory
 a1.channels.c2.capacity = 1000
 a1.channels.c2.transactionCapacity = 100
 
 #sink
 
 a1.sinks.k1.type = file_roll
 a1.sinks.k1.sink.directory = /usr/local/soft/flume-1.9.0/learn/part4/result/
 
 a1.sinks.k2.type = logger
 
 #bind
 a1.sources.r1.channels = c1 c2
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c2

With this configuration, since the agents run on different machines it does not actually matter if the agent names are the same. Each configuration file has two channels and sinks, one of which prints the information to the console so that any errors are easy to observe. A simple script generates data slowly:

 #!/bin/bash
 hs=`hostname`
 for i in $(seq 1 20)
 do
     echo "来自${hs}的第${i}条日志" >> /usr/local/soft/flume-1.9.0/learn/part4/input/t1.txt
     sleep 1
 done
 MASTER:FLUME_HOME/bin/flume-ng agent -n a1 -c conf -f learn/part4/flume-master.conf -Dflume.root.logger=INFO,console
 NODE1:FLUME_HOME/bin/flume-ng agent -n a2 -c conf -f learn/part4/flume-node1.conf -Dflume.root.logger=INFO,console
 NODE2:FLUME_HOME/bin/flume-ng agent -n a3 -c conf -f learn/part4/flume-node2.conf -Dflume.root.logger=INFO,console
 NODE1:FLUME_HOME/learn/part4/input/generate.sh
 NODE2:FLUME_HOME/learn/part4/input/generate.sh

 ### After data generation and transfer have finished ###
 
 MASTER:FLUME_HOME/learn/part4/result ls -l
 总用量 8
 -rw-r--r--. 1 root root 368 9月  18 20:38 1569933489286-1
 -rw-r--r--. 1 root root 774 9月  18 20:38 1569933489286-2

· Interceptors and a custom interceptor

To digest how different interceptors can be combined, here is a small exercise with the following structure: the Master and Node2 agents each tail a local log file and send events to Node1, which routes them to different Channels based on the event headers.

Then the configuration is as follows:

 #flume-master.conf

 #name
 a1.sources = r1
 a1.channels =c1 c2
 a1.sinks =k1 k2
 
 
 #configure the source
 a1.sources.r1.type = exec
 a1.sources.r1.command = tail -F /usr/local/soft/flume-1.9.0/learn/part6/input/info.txt
 
 a1.sources.r1.interceptors = i1 i2 i3 
 a1.sources.r1.interceptors.i1.type = static  
 # Use the static interceptor to add a key-value pair to every event
 a1.sources.r1.interceptors.i1.key = des
 a1.sources.r1.interceptors.i1.value = UsingStaticInterceptor
 a1.sources.r1.interceptors.i2.type = host
 a1.sources.r1.interceptors.i2.useIP = false
 a1.sources.r1.interceptors.i3.type =  priv.landscape.interceptorDemo.LevelInterceptor$Builder  
 # Custom interceptor
 
 a1.sources.r1.selector.type = multiplexing
 a1.sources.r1.selector.header = level
 a1.sources.r1.selector.mapping.error = c1
 a1.sources.r1.selector.mapping.other = c2
 
 #channel
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100
 
 a1.channels.c2.type = memory
 a1.channels.c2.capacity = 1000
 a1.channels.c2.transactionCapacity = 100
 
 
 #sink
 a1.sinks.k1.type = avro
 a1.sinks.k1.hostname = node1
 a1.sinks.k1.port = 12345
 
 
 a1.sinks.k2.type = logger
 
 #bind
 a1.sources.r1.channels = c1 c2
 a1.sinks.k1.channel = c1
 a1.sinks.k2.channel = c2
 # The key Java code of the custom interceptor:

 public class LevelInterceptor implements Interceptor {
    private List<Event> eventList;

    @Override
    public void initialize() {
        eventList = new ArrayList<>();
    }

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        if (body.contains("ERROR")) {
            headers.put("level", "error");
        } else {
            headers.put("level", "other");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        eventList.clear();
        for (Event event : events) {
            eventList.add(intercept(event));
        }
        return eventList;
    }
     // ... (the rest of the class is omitted in the original post; a sketch of the missing pieces follows)
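     // A sketch (not from the original post) of what the omitted pieces typically look like:
     // the Interceptor interface also requires close(), and the "LevelInterceptor$Builder"
     // referenced in the configuration must implement org.apache.flume.interceptor.Interceptor.Builder
     // (imports of org.apache.flume.Context and org.apache.flume.interceptor.Interceptor assumed).
     @Override
     public void close() {
         // nothing to clean up in this simple example
     }
 
     public static class Builder implements Interceptor.Builder {
         @Override
         public Interceptor build() {
             return new LevelInterceptor();
         }
 
         @Override
         public void configure(Context context) {
             // no configuration parameters are needed here
         }
     }
  }
 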
 ## flume-node1.conf
 #name
 a2.sources = r1
 a2.channels = c1 c2
 a2.sinks = k1 k2
 
 #source
 a2.sources.r1.type = avro
 a2.sources.r1.bind = node1
 a2.sources.r1.port = 12345
 
 a2.sources.r1.selector.type = multiplexing
 
 a2.sources.r1.selector.header = host
 a2.sources.r1.selector.mapping.Master = c1
 a2.sources.r1.selector.mapping.Node2 = c2
 a2.sources.r1.selector.mapping.default = c2
 
 #channel
 a2.channels.c1.type = memory
 a2.channels.c1.capacity = 1000
 a2.channels.c1.transactionCapacity = 100
 
 a2.channels.c2.type = memory
 a2.channels.c2.capacity = 1000
 a2.channels.c2.transactionCapacity = 100
 
 #sink
 a2.sinks.k1.type = logger
 a2.sinks.k2.type = null
 
 #bind
 a2.sources.r1.channels = c1 c2
 a2.sinks.k1.channel = c1 
 a2.sinks.k2.channel = c2 
#flume-node2.conf
#name
a3.sources = r1
a3.channels = c1 c2
a3.sinks = k1 k2

#source
a3.sources.r1.type = exec
a3.sources.r1.command = tail -F /usr/local/soft/flume-1.9.0/learn/part6/input/info.txt

a3.sources.r1.interceptors = i1 i2
a3.sources.r1.interceptors.i1.type = regex_filter
a3.sources.r1.interceptors.i1.regex = \[ERROR\]
a3.sources.r1.interceptors.i2.type = host
a3.sources.r1.interceptors.i2.useIP = false

#channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

#sink
a3.sinks.k1.type = avro
a3.sinks.k1.hostname = node1
a3.sinks.k1.port = 12345

a3.sinks.k2.type = logger

#bind
a3.sources.r1.channels = c1 c2
a3.sinks.k1.channel = c1
a3.sinks.k2.channel = c2

· Building Events and an RPC client on top of the Flume SDK

An Event is the basic form of data in Flume. Add the Flume SDK dependency (Maven artifact org.apache.flume:flume-ng-sdk) to the project in your IDE and look at the Event interface:

 public interface Event {

  public Map<String, String> getHeaders();

  public void setHeaders(Map<String, String> headers);

  public byte[] getBody();

  public void setBody(byte[] body);
}

The default implementations of the Event interface are SimpleEvent and JSONEvent, which differ in their internal structure. An Event can be constructed quickly with the static methods of the EventBuilder class.
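
For instance, a small sketch (not from the original article) of building an event with a header via EventBuilder; the header name and value are arbitrary:

 import java.nio.charset.StandardCharsets;
 import java.util.HashMap;
 import java.util.Map;
 import org.apache.flume.Event;
 import org.apache.flume.event.EventBuilder;
 
 public class EventSketch {
     public static void main(String[] args) {
         Map<String, String> headers = new HashMap<>();
         headers.put("level", "info");    // an arbitrary example header
         Event event = EventBuilder.withBody(
                 "hello flume".getBytes(StandardCharsets.UTF_8), headers);
         System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
     }
 }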

Next, look at the RpcClient interface, which sends Flume events through its append methods. A custom RPC client can also be implemented by extending AbstractRpcClient.

 public interface RpcClient {
 
   public int getBatchSize();
 
   public void append(Event event) throws EventDeliveryException;
 
   public void appendBatch(List<Event> events) throws EventDeliveryException;
 
   public boolean isActive();
 
   public void close() throws FlumeException;
 
 }

(Figure: the RpcClient implementation class hierarchy.)

Now let's try the simplest possible code for sending an event to the agent:

 public class FlumeClient {
    public static void main(String[] args) throws EventDeliveryException {

        RpcClient client = RpcClientFactory.getDefaultInstance("master", 12345);
        client.append(EventBuilder.withBody("hello , 这里是RPC Client".getBytes()));
        client.close();
    }
 }

 ——————————————————————————————————————————————————————————————————————————————
 Flume Agent:
 2019-9-20 19:37:21,576 (SinkRunner-PollingRunner-DefaultSinkProcessor) 
 [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] 
 Event: { headers:{} body: 68 65 6C 6C 6F 20 2C 20 E8 BF 99 E9 87 8C E6 98 hello , ........ }

· To be continued

Planning, deploying, and monitoring Flume

Reading in progress...


Original article: www.cnblogs.com/novwind/p/11620626.html