Flume Principles

One, What Is Flume

1.1 Definitions

Flume is a highly available, highly reliable, distributed system, originally provided by Cloudera, for collecting, aggregating, and transporting massive volumes of log data. Its architecture is based on streaming data flows.

1.2 Why Flume

The traditional approach of uploading data from the local file system to HDFS with the HDFS put command has poor real-time behavior, whereas Flume can monitor a file, a folder, or a port in real time.

Two, Flume Principles

2.1 Architecture diagram

[Figure: Flume architecture diagram]

Agent:

An Agent is a JVM process that moves data from a source to a destination in the form of events.

Event:

An Event is the basic unit of data transfer in Flume. It consists of headers and a body: the headers are a set of key-value pairs, and the body is a byte array that carries the payload.
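To make the header/body split concrete, here is a minimal Python sketch of an event. It is only an illustration: the real Event is a Java interface inside Flume, and the class name, field names, and sample values below are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event (not the real Java API)."""
    body: bytes                                             # raw payload
    headers: Dict[str, str] = field(default_factory=dict)   # string metadata


# Hypothetical sample event: the header key "type" is made up for this sketch.
event = Event(body=b"2020-03-02 INFO user login", headers={"type": "access"})
print(event.headers, event.body)
```

Headers are typically used for routing decisions, while the body is passed along untouched.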

Put transaction (Source to Channel):

doPut: write the batch of data into the temporary buffer putList

doCommit: check whether the Channel's memory queue has enough space to merge the batch

doRollback: if the Channel's memory queue does not have enough space, roll back the data

Take transaction (Channel to Sink):

doTake: take data from the Channel into the temporary buffer takeList

doCommit: if all the data is sent successfully, clear the temporary buffer takeList

doRollback: if an exception occurs during transmission, roll back by returning the data in the temporary buffer takeList to the Channel's memory queue
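The two transactions above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not Flume's implementation: the class names, the capacity counted in events, and returning False on a failed commit (instead of raising an exception) are all choices made for this example.

```python
from collections import deque


class MemoryChannel:
    """Toy bounded channel; capacity is counted in events (assumption)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()


class PutTransaction:
    def __init__(self, channel):
        self.channel = channel
        self.put_list = []                    # temporary buffer (putList)

    def do_put(self, event):
        self.put_list.append(event)

    def do_commit(self):
        free = self.channel.capacity - len(self.channel.queue)
        if len(self.put_list) > free:         # not enough room in the queue
            self.do_rollback()
            return False
        self.channel.queue.extend(self.put_list)
        self.put_list.clear()
        return True

    def do_rollback(self):
        self.put_list.clear()                 # drop the batch; caller may retry


class TakeTransaction:
    def __init__(self, channel):
        self.channel = channel
        self.take_list = []                   # temporary buffer (takeList)

    def do_take(self):
        if not self.channel.queue:
            return None
        event = self.channel.queue.popleft()
        self.take_list.append(event)
        return event

    def do_commit(self):
        self.take_list.clear()                # batch delivered, forget it

    def do_rollback(self):
        # Put the events back at the head of the queue in their original order.
        self.channel.queue.extendleft(reversed(self.take_list))
        self.take_list.clear()


channel = MemoryChannel(capacity=3)

put = PutTransaction(channel)
put.do_put("e1")
put.do_put("e2")
committed = put.do_commit()                   # two events fit into three slots

take = TakeTransaction(channel)
take.do_take()
take.do_take()
take.do_rollback()                            # simulate a failed write downstream
restored = list(channel.queue)

overflow = PutTransaction(channel)
overflow.do_put("e3")
overflow.do_put("e4")
overflow_committed = overflow.do_commit()     # only one free slot remains
```

The point of the temporary buffers is that a batch either reaches the queue (or the destination) as a whole or not at all.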

2.2 Three components

Source:

The Source is the component of a Flume Agent that is responsible for receiving data. Sources can handle log data of various types and formats, including Avro, Thrift, Exec, JMS, Spooling Directory, Taildir, Netcat, and others.

exec: runs a given Unix command at startup and expects that process to keep producing data on standard output (stderr is discarded unless logStdErr is set to true).

tailDir: monitors files in real time and records the read position of each file in a position file, so that reading can resume from the last position without losing data.

spooling directory: watches a specified folder for new files. Once a file has been read, a suffix is appended to its name to mark it as processed. Later changes to that file are ignored, so a file with the same name must not be placed in the folder again.
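The resume-from-position idea behind tailDir can be sketched in a few lines of Python. This is a simplified illustration, not Flume's code: the function name, the JSON layout of the position file, and the file names are assumptions, and the sketch does not handle partially written lines or file rotation.

```python
import json
import os
import tempfile


def read_new_lines(log_path, position_path):
    """Return the lines appended since the last call, using a saved byte offset."""
    positions = {}
    if os.path.exists(position_path):
        with open(position_path) as f:
            positions = json.load(f)

    offset = positions.get(log_path, 0)       # start from 0 for an unseen file
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        positions[log_path] = f.tell()         # remember where we stopped

    with open(position_path, "w") as f:
        json.dump(positions, f)
    return data.decode("utf-8").splitlines()


# Hypothetical demo files in a temporary directory.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "app.log")
position_path = os.path.join(workdir, "position.json")

with open(log_path, "w") as f:
    f.write("line 1\nline 2\n")
first = read_new_lines(log_path, position_path)

with open(log_path, "a") as f:
    f.write("line 3\n")
second = read_new_lines(log_path, position_path)
```

Because the offset lives on disk, a restarted reader picks up where the previous one stopped instead of re-reading or skipping data.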

Channel:

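The Channel is a buffer located between the Source and the Sink. Because the rate at which the Source receives data may not match the rate at which the Sink writes it out, the Channel is added as a buffer between them.

There are two common types of Channel: the file channel and the memory channel. The file channel is slow but safe, because events are persisted to disk. The memory channel is fast but not safe, because events held in memory are lost if the Agent process dies.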

Sink:

The Sink continuously polls the Channel for events, removes them in batches, and writes each batch to the destination.

The Sink is fully transactional. Before removing a batch of events from the Channel, each Sink starts a transaction with the Channel. Once the batch has been successfully written to the destination, the Sink commits the transaction with the Channel. Once the transaction is committed, the Channel deletes the events from its own internal buffer.

2.3 Flume topology

NOTE: A Sink never receives from multiple Channels, because that would be a mess.

1, Flume to Flume in series (the basis of the other topologies):

[Figure: Agents connected in series]

2, One Source to multiple Channels (this can be implemented in two ways, the replicating mechanism and the multiplexing mechanism):

[Figure: one Source feeding multiple Channels]

3, One Channel to multiple Sinks (load balancing or failover):

[Figure: one Channel feeding multiple Sinks]

4, Multiple Sinks to one Source (aggregation):

[Figure: multiple Sinks feeding one Source]
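The two ways of fanning one Source out to several Channels (topology 2) differ only in how the Channels are chosen for each event. A rough Python sketch, assuming events are plain dictionaries and Channels are plain lists; the function names, the header key "type", and the channel names c1 and c2 are made up for this example:

```python
def replicating_selector(event, channel_names):
    """Replicating: every event is copied to every channel."""
    return list(channel_names)


def multiplexing_selector(event, mapping, header, default):
    """Multiplexing: the value of one header decides which channels get the event."""
    return mapping.get(event["headers"].get(header), default)


channels = {"c1": [], "c2": []}


def deliver(event, selected):
    for name in selected:
        channels[name].append(event)


error_event = {"headers": {"type": "error"}, "body": b"disk full"}
info_event = {"headers": {"type": "info"}, "body": b"started"}

# Replicating: error_event lands in both c1 and c2.
deliver(error_event, replicating_selector(error_event, channels))

# Multiplexing: info_event is routed to c2 only, based on its "type" header.
mapping = {"error": ["c1"], "info": ["c2"]}
deliver(info_event, multiplexing_selector(info_event, mapping, "type", ["c1"]))
```

Replicating suits the case where every downstream consumer needs the full stream. Multiplexing suits the case where different kinds of events go to different destinations.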

2.5 Agent internal workflow

[Figure: Agent internal workflow]

Three, Advanced Flume

3.1 Failover

Flume's failover strategy works as follows. Suppose one Channel is connected to several Sinks, each of which is assigned a priority, for example k1: 10, k2: 5, k3: 1. At first, events from the Channel are written through k1, the Sink with the highest priority. If k1 goes down, events are written through k2 instead. The failover configuration has a parameter, processor.maxpenalty (30 seconds by default), which limits how long a failed Sink is kept out of use. If k1 recovers while events are being written through k2, but the 30 seconds have not yet passed, events continue to go through k2. Only after the 30 seconds have passed do they go through k1 again.

3.2 Custom Interceptor
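The priority-plus-penalty behavior described above can be simulated in Python. This is a simplified sketch, not Flume's failover processor: the class and method names are assumptions, time is passed in explicitly as seconds, and a fixed penalty is used, whereas the real processor treats the setting as an upper bound on a back-off period.

```python
class FailoverProcessor:
    """Toy model: choose the highest-priority sink that is not penalized."""

    def __init__(self, priorities, max_penalty=30.0):
        self.priorities = priorities          # sink name -> priority
        self.max_penalty = max_penalty        # seconds a failed sink sits out
        self.penalized_until = {}             # sink name -> time it may return

    def mark_failed(self, sink, now):
        self.penalized_until[sink] = now + self.max_penalty

    def active_sink(self, now):
        available = [
            sink for sink in self.priorities
            if self.penalized_until.get(sink, 0) <= now
        ]
        if not available:
            return None
        return max(available, key=self.priorities.get)


processor = FailoverProcessor({"k1": 10, "k2": 5, "k3": 1})

before = processor.active_sink(now=0)     # highest priority wins
processor.mark_failed("k1", now=0)        # k1 goes down at t = 0
during = processor.active_sink(now=10)    # k1 is still inside its penalty window
after = processor.active_sink(now=31)     # penalty has expired
```

In this model the sink in use is k1, then k2 while k1 is penalized, then k1 again once the penalty has expired, which matches the k1/k2 example in the text.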

3.3 Custom Source

3.4 Custom Sink

Origin blog.csdn.net/stable_zl/article/details/104623685