Flume Basics

Outline

Flume, originally provided by Cloudera, is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission. Flume's stream-based architecture is flexible and simple.

Main function: read data from a server's local disk in real time and write it to HDFS.

Advantages:

  1. It can be integrated with almost any storage or processing system.
  2. When the input data rate exceeds the rate at which the target storage can be written (read and write rates are not synchronized), Flume buffers the data, reducing the pressure on HDFS.
  3. Flume's channels are transaction-based and use a two-transaction model (sender + receiver), ensuring that messages are delivered reliably.

Flume uses two separate transactions: one for passing events from the source to the channel, and one for passing events from the channel to the sink. Only when all the data in a transaction has been successfully committed to the channel is the source's read considered complete. Similarly, data is removed from the channel only after the sink has successfully written it out; on failure, the transaction is re-submitted.

Composition: an Agent consists of source + channel + sink.

The source is an abstraction of the data source; the sink is an abstraction of the data destination.

 

Source
Source is the component responsible for receiving data into the Flume Agent. The Source component can handle log data of many types and formats.
Common source types: Spooling Directory (spooldir), which collects files rolled into a monitored folder; Exec, which captures the output of an executed command; Syslog, which receives system logs; Avro, which receives events from an upstream Flume agent; Netcat, which receives data over the network.


Channel
Channel is a buffer located between the Source and the Sink, so it allows the Source and the Sink to operate at different rates. Channel is thread-safe and can handle writes from several Sources and reads from several Sinks simultaneously.
Flume comes with two built-in Channels: Memory Channel and File Channel.
Memory Channel is an in-memory queue. It is suitable for scenarios where data loss does not matter. If data loss is a concern, Memory Channel should not be used, because a program crash or machine restart will lose the data.
File Channel writes all events to disk, so no data is lost if the program exits or the machine goes down.
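How the two built-in channel types are declared in an agent's properties file, as a minimal sketch (the agent name a1 and the paths are illustrative):

    # Memory Channel: fast, but events are lost if the process or machine dies
    a1.channels.c1.type = memory

    # File Channel: events are persisted to disk and survive restarts
    a1.channels.c2.type = file
    a1.channels.c2.checkpointDir = /opt/flume/checkpoint
    a1.channels.c2.dataDirs = /opt/flume/data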

The channel selector determines which of a Source's Channels a particular event is written to; the selector informs the channel processor, which then writes the event to each selected Channel.

An event passed on by the Source is written to one or more Channels according to the selector. Flume provides three kinds of Channel Selector: Replicating Channel Selector (the default; all data is distributed to all Channels), Multiplexing Channel Selector (which selects the Channel(s) an event is sent to based on its headers), and custom selectors.

  1. Replicating selector: the Source writes a copy of each Event to several Channels at the same time, so different Sinks can obtain the same Event from different Channels; for example, one log stream can be written to both HDFS and Kafka by writing each Event into two Channels and sending them with two different types of Sink to two different external stores. The selector copies each event to all of the Channels specified in the Source's channels parameter. The replicating selector also has an optional parameter, optional, whose value is a space-separated list of channel names; channels listed there are considered optional, so if writing an event to one of them fails the failure is ignored, whereas a write failure on any other channel raises an exception. (See the configuration sketch after this list.)

  2. Multiplexing selector: it must be used together with an interceptor; based on a key in the Event's header it decides which Channel each Event should be written to. (See the configuration sketch after this list.)

  3. Custom selectors.

There is also a Kafka Channel, which can be used without a sink: data written to the channel is stored directly in Kafka.
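A sketch of both selector types in an agent's properties file; the agent/channel names, the header key, and its values are illustrative, not taken from the text above:

    # Replicating selector (the default): every event goes to c1, c2 and c3;
    # failures when writing to the optional channel c3 are ignored
    a1.sources.r1.channels = c1 c2 c3
    a1.sources.r1.selector.type = replicating
    a1.sources.r1.selector.optional = c3

    # Multiplexing selector: route by the value of the "state" header
    # (typically set by an interceptor)
    a1.sources.r2.channels = c1 c2
    a1.sources.r2.selector.type = multiplexing
    a1.sources.r2.selector.header = state
    a1.sources.r2.selector.mapping.US = c1
    a1.sources.r2.selector.mapping.CN = c2
    a1.sources.r2.selector.default = c1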

Sink

Common destinations for the data include: HDFS, Kafka, Logger (logs events at INFO level), Avro (the next Flume agent downstream), File, HBase, Solr, IPC, Thrift, and custom sinks.
The Sink continuously polls the Channel for events, removes them in batches, and writes these batches to a storage or indexing system, or sends them to another Flume Agent.
Sink is fully transactional. Before deleting a batch of data from the Channel, each Sink starts a transaction with the Channel. Once the batch of events has been successfully written to the storage system or to the next Flume Agent, the Sink commits the transaction with the Channel. Only after the transaction is committed does the Channel delete the events from its own internal buffer.

Sink groups allow multiple sinks to be organized into one entity. A Sink Processor can provide load balancing across all sinks in the group, or fail over from one sink to another when a sink fails. Simply put, one source corresponds to one sink group, i.e. multiple sinks; this looks similar to multiplexing/replication, but what matters here is reliability and performance, namely failover and load-balancing settings.

DefaultSinkProcessor accepts only a single Sink and does not force the user to create a processor (sink group) at all.
FailoverSinkProcessor provides failover: it can be configured with a priority list of sinks to ensure that every event is processed as long as at least one sink is alive. It works by relegating failed sinks to a pool where they are given a cool-down (freeze) period during which they are not used; once a sink successfully sends an event again, it is moved back to the live pool.
LoadBalancingSinkProcessor provides load balancing across multiple Sinks. Load can be distributed either ① round_robin (polling) or ② random; the default is round_robin.
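A sketch of a failover sink group in an agent's properties file (sink names, priorities, and the penalty value are illustrative): the sink with the highest priority is used until it fails, then the next one takes over.

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    # higher priority wins; k2 is the standby
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # maximum cool-down (ms) for a failed sink before it is retried
    a1.sinkgroups.g1.processor.maxpenalty = 10000

A load-balancing processor for the same kind of group is sketched in the Sink-group scenario near the end of this article.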

Transactions

Put transaction flow:

doPut: first write the batch of data into the temporary buffer putList; doCommit: check whether the channel's memory queue has enough space, then commit; doRollback: if the channel's memory queue does not have enough space, roll back the data.

In other words, the data is first put into putList and a commit is attempted; the channel is checked to see whether the transaction can be committed. If it succeeds, the committed events are removed from putList; if the commit fails, the data is rolled back to putList and submitted again later.

Take transaction flow:

doTake: first take the data into the temporary buffer takeList; doCommit: if all the data was sent successfully, clear the temporary buffer takeList; doRollback: if an exception occurs while the data is being sent, roll back by returning the data in the takeList temporary buffer to the channel's memory queue.

In other words, events are pulled into takeList and a commit is attempted; if the commit succeeds, the data is removed from takeList; if the commit fails, the events are returned to the channel and submitted again later.

With this transaction model, Flume may deliver duplicate data (at-least-once semantics).
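The size of a single put/take batch (that is, of putList/takeList) is bounded by the channel's transactionCapacity, while capacity bounds the whole queue. A minimal sketch with illustrative values:

    a1.channels.c1.type = memory
    # total number of events the channel's queue can hold
    a1.channels.c1.capacity = 10000
    # max events per transaction, i.e. the size of putList/takeList
    a1.channels.c1.transactionCapacity = 1000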

Event

The Event is the transmission unit, the basic unit of data transfer in Flume; data travels from source to destination in the form of events. An Event consists of an optional header and a body, a byte array containing the data. The header is a HashMap of string key-value pairs.

Interceptors (Interceptor)
An interceptor is a simple pluggable component that sits between the Source and the Channel(s) the Source writes to. Each interceptor instance handles only the events received by one Source.
Because the interceptor must finish its transformation before the event is written to the channel, the client (or upstream sink) that sent the event gets a response only after the interceptor has successfully transformed the event and the channel write has completed; a slow interceptor can therefore cause the sending side to time out.

Flume provides some common interceptors out of the box, and custom interceptors can also be written for log processing. To write a custom interceptor, follow these steps:

  • The Flume version used here is apache-flume-1.6.0.

  • Implement the org.apache.flume.interceptor.Interceptor interface (together with a nested Builder class that implements Interceptor.Builder).
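After packaging the class and dropping the jar into Flume's lib/ directory, the interceptor is attached to a source in the agent's properties file. The class name below is a hypothetical example, not from this article:

    a1.sources.r1.interceptors = i1 i2
    # hypothetical custom interceptor; Flume instantiates it through its Builder
    a1.sources.r1.interceptors.i1.type = com.example.flume.MyLogInterceptor$Builder
    # one of the built-in interceptors: adds a "timestamp" header to each event
    a1.sources.r1.interceptors.i2.type = timestamp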

 

Flume topology

① Chained (serial): multiple Flume agents connected in sequence, from the initial source through to the sink that delivers to the final storage system. This uses more channels, but the number of Flume layers should not be too large: bridging through too many agents not only reduces the transmission rate, but also means that if any Flume node in the chain goes down, the whole transmission pipeline is affected.
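Agents are chained by pointing an Avro sink on one agent at an Avro source on the next; a sketch with illustrative agent names, hostnames, and ports:

    # agent1 (upstream): forward events to agent2 over Avro RPC
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.hostname = hadoop102
    agent1.sinks.k1.port = 4141

    # agent2 (downstream): receive events from agent1
    agent2.sources.r1.type = avro
    agent2.sources.r1.bind = 0.0.0.0
    agent2.sources.r1.port = 4141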

 

② Single source with multiple channels and sinks: one channel may feed multiple sinks, or the source may feed multiple channels, each with its own sink:

                          ---> sink1                            ---> channel1 ---> sink1
    single source ---> channel ---> sink2              source
                          ---> sink3                            ---> channel2 ---> sink2

Flume supports sending the event flow to one or more destinations. In this mode the source copies events to multiple channels; each channel holds the same data, and different sinks can then deliver it to different destinations.

 

③ Load balancing and failover: Flume supports assigning multiple sinks to one logical sink group, and sends data to the different sinks in the group, mainly to solve load-balancing and failover problems.

Load balancing: several parallel channels/agents are polled in turn, which increases throughput and keeps the data safe (if one goes down, the others are unaffected; the overall buffer is larger, so if HDFS has a problem, the two layers of multiple parallel Flume agents can keep the data safe and absorb the backlog).

 

④ Flume agent aggregation: everyday web applications are usually spread over hundreds of servers, and large deployments may even have thousands or tens of thousands, so the logs they produce are very troublesome to handle. This combined Flume pattern solves the problem well: each server runs a Flume agent that collects its logs and sends them to a central log-collecting Flume agent, which then uploads them to HDFS, Hive, HBase, JMS, and so on for log analysis.

 

 

Example scenarios

1. Monitoring port data with netcat

Monitoring port data:
port(netcat) ---> flume ---> Sink(logger) to the console
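A minimal complete agent for this scenario; the agent name a1 and port 44444 are illustrative:

    # name the components
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # netcat source: listen on a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # memory channel as the buffer
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # logger sink: print events to the console at INFO level
    a1.sinks.k1.type = logger

    # wire source -> channel -> sink
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

It can be started with something like bin/flume-ng agent -n a1 -c conf -f netcat-logger.conf -Dflume.root.logger=INFO,console and tested by sending lines with nc localhost 44444.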

 

2. Reading a local file into HDFS in real time

Reading a local file into HDFS in real time:
hive.log(exec) ---> flume ---> Sink(HDFS)

To read a file on a Linux system, the command has to follow Linux command conventions. Since the Hive log lives on a Linux system, the source type chosen for reading the file is exec, short for execute, meaning that a Linux command is executed to read the file.
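A sketch of the key pieces for this scenario; the log path, NameNode address, and roll settings are illustrative:

    # exec source: tail the Hive log
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log

    # HDFS sink: write into time-bucketed directories
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://hadoop102:9000/flume/%Y%m%d/%H
    a1.sinks.k1.hdfs.filePrefix = logs-
    a1.sinks.k1.hdfs.fileType = DataStream
    # use the agent's clock for the %Y%m%d/%H escapes
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    # roll a new file every 60 s or ~128 MB, never by event count
    a1.sinks.k1.hdfs.rollInterval = 60
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollCount = 0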

3. Reading directory files into HDFS in real time

 

 

Reading directory files into HDFS in real time:
directory(spooldir) ---> flume ---> Sink(HDFS)
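A sketch of the spooldir source for this scenario (the directory and patterns are illustrative); the HDFS sink can be configured as in the previous scenario. Note that files must not be modified after they are dropped into the monitored directory:

    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /opt/module/flume/upload
    # rename files once they have been fully ingested
    a1.sources.r1.fileSuffix = .COMPLETED
    # skip files that are still being written (illustrative pattern)
    a1.sources.r1.ignorePattern = .*\.tmp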

4. Single data source, multiple outputs (selector)

Single Source with multiple Channels and Sinks

Single data source, multiple outputs (selector): single Source with multiple Channels and Sinks
hive.log(exec) ----> flume1 --Sink1(avro)--> flume2 ---> Sink(HDFS)
                            --Sink2(avro)--> flume3 ---> Sink(file_roll, local directory "data")
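The downstream agent flume3 in this scenario receives events over Avro and rolls them into local files; a sketch with illustrative host, port, and path:

    # flume3: avro source fed by flume1's Sink2
    flume3.sources.r1.type = avro
    flume3.sources.r1.bind = hadoop102
    flume3.sources.r1.port = 4142

    # file_roll sink: write events into a local directory (it must already exist)
    flume3.sinks.k1.type = file_roll
    flume3.sinks.k1.sink.directory = /opt/module/flume/data

    flume3.channels.c1.type = memory
    flume3.sources.r1.channels = c1
    flume3.sinks.k1.channel = c1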

5. Single data source, multiple outputs (Sink group)

Single Source and Channel with multiple Sinks (load balancing)

Flume load balancing and failover

The goal is to improve the fault tolerance and stability of the whole system, and it can be achieved with simple configuration: first define a Sink group; the same Sink group contains several child Sinks, and the Sinks in the group can be configured for either load balancing or failover.

 

Single data source, multiple outputs (Sink group): flume1 with load_balance
port(netcat) ---> flume1 --Sink1(avro)--> flume2 ---> Sink(Logger, console)
                         --Sink2(avro)--> flume3 ---> Sink(Logger, console)
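A sketch of flume1's sink group for this scenario (hosts and ports are illustrative); the two Avro sinks drain the same channel and the processor distributes events between them:

    flume1.sinkgroups = g1
    flume1.sinkgroups.g1.sinks = k1 k2
    flume1.sinkgroups.g1.processor.type = load_balance
    # temporarily blacklist a sink after it fails
    flume1.sinkgroups.g1.processor.backoff = true
    # round_robin (default) or random
    flume1.sinkgroups.g1.processor.selector = round_robin

    flume1.sinks.k1.type = avro
    flume1.sinks.k1.hostname = hadoop102
    flume1.sinks.k1.port = 4141
    flume1.sinks.k2.type = avro
    flume1.sinks.k2.hostname = hadoop102
    flume1.sinks.k2.port = 4142

    # both sinks drain the same channel
    flume1.sinks.k1.channel = c1
    flume1.sinks.k2.channel = c1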

6. Aggregating multiple data sources

Multiple Sources aggregate data into a single Flume agent
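Fan-in uses the same Avro pairing as in the chained topology above: every web-server agent's Avro sink points at the single Avro source of the collector agent. A sketch with illustrative names and addresses:

    # on each web server: an identical Avro sink aimed at the collector
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = collector-host
    a1.sinks.k1.port = 4141

    # on the collector: one Avro source fans in all upstream agents,
    # and an HDFS/HBase/Kafka sink then writes the merged stream out
    a2.sources.r1.type = avro
    a2.sources.r1.bind = 0.0.0.0
    a2.sources.r1.port = 4141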
