Article Directory
- 1. What is Flume
- 2. Flume Principles
- 2.1 Architecture Diagram
- 2.2 The Three Components
- 2.3 Flume Topologies
- 1. Series connection (Flume to Flume; the basis of the other topologies)
- 2. One Source to multiple Channels (this can be implemented in two ways: a replicating mechanism and a multiplexing mechanism)
- 3. One Channel to multiple Sinks (load balancing or failover)
- 4. Multiple Flume agents' Sinks to one Source (aggregation)
- 2.5 Agent Internals
- 3. Flume Advanced
1. What is Flume
1.1 Definition
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume is based on a streaming architecture.
1.2 Why Flume
Traditionally, data is uploaded from the local filesystem to HDFS with `hdfs dfs -put`, which is poor for real-time use; Flume, by contrast, can monitor a file, a folder, or a port in real time.
2. Flume Principles
2.1 Architecture Diagram
Agent:
An Agent is a JVM process that sends data from a source to a destination in the form of events.
Event:
An Event is the basic unit of data transfer in Flume. It is composed as a <K, V> pair: K is the header (a map of attributes) and V is the body (the payload).
Put transaction:
doPut: write the batch of data into the putList buffer
doCommit: check whether the channel's memory queue has enough space to merge the batch in
doRollback: if the channel's memory queue does not have enough space, roll the data back
Take transaction:
doTake: pull data into the temporary buffer takeList
doCommit: if all the data was written out successfully, clear the temporary buffer takeList
doRollback: if an exception occurs while the data is being transmitted, roll back, returning the data in the takeList buffer to the channel's memory queue
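The put/take flow above can be sketched with a toy channel. This is only an illustration of the transaction semantics, not Flume's actual implementation; all class and method names below are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of Flume's channel transaction semantics.
class SketchChannel {
    private final Deque<String> queue = new ArrayDeque<>(); // channel memory queue
    private final int capacity;
    private final List<String> putList = new ArrayList<>();  // put-side batch buffer
    private final List<String> takeList = new ArrayList<>(); // take-side batch buffer

    SketchChannel(int capacity) { this.capacity = capacity; }

    // --- put transaction (Source side) ---
    void doPut(String event) { putList.add(event); }         // batch the event into putList
    boolean doCommitPut() {                                  // merge putList into the queue if it fits
        if (queue.size() + putList.size() > capacity) {
            doRollbackPut();                                 // queue full: roll the batch back
            return false;
        }
        queue.addAll(putList);
        putList.clear();
        return true;
    }
    void doRollbackPut() { putList.clear(); }                // discard the staged batch

    // --- take transaction (Sink side) ---
    String doTake() {                                        // stage an event in takeList
        String e = queue.poll();
        if (e != null) takeList.add(e);
        return e;
    }
    void doCommitTake() { takeList.clear(); }                // destination write succeeded
    void doRollbackTake() {                                  // write failed: return events to the queue
        for (int i = takeList.size() - 1; i >= 0; i--) {
            queue.addFirst(takeList.get(i));
        }
        takeList.clear();
    }

    int size() { return queue.size(); }
}
```

Note how a failed take (`doRollbackTake`) puts the staged events back at the front of the queue, so nothing is lost between the Channel and the destination.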
2.2 The Three Components
Source:
The Source is the component responsible for receiving data into the Flume Agent. The Source can handle log data of various types and formats, including avro, thrift, exec, JMS, spooling directory, taildir, netcat, and others.
exec: executes a given Unix command at startup and expects the process to continually produce data on standard output (stderr is not captured by default unless logStdErr is set to true).
taildir: monitors data in real time and records the read position of each file in a position file, achieving resumable reads without data loss.
spooling directory: monitors a specified folder for newly added files; after a file is ingested, a suffix is appended to its name to mark it as done, and subsequent changes to the file are ignored. Therefore you must not modify files in the folder or place files with duplicate names into it.
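As an illustration, a taildir source might be configured like this (the agent name `a1`, the paths, and the file-group name are made-up example values):

```properties
# Taildir source: tails every file matching the pattern and records
# read offsets in the position file so restarts lose no data.
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.channels = c1
```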
Channel:
The Channel is a buffer that sits between the Source and the Sink. Because the rate at which the Source receives data may not match the rate at which the data is written out, the Channel is added as a buffer.
There are two kinds of Channel: the file channel and the memory channel. The file channel is slow but safe; the memory channel is fast but unsafe (events held only in memory are lost if the process dies).
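A sketch of the two channel types in an agent configuration (the agent name, channel names, capacities, and directories are made-up example values):

```properties
# Memory channel: fast, but events are lost on a crash.
a1.channels = c1 c2
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# File channel: slower, but events survive a restart.
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/flume/checkpoint
a1.channels.c2.dataDirs = /opt/flume/data
```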
Sink:
The Sink continuously polls the Channel for events, removes them in batches, and writes those batches to the destination.
The Sink is fully transactional. Before deleting a batch of events from the Channel, each Sink starts a transaction with the Channel. Once the batch of events has been successfully written to the destination, the Sink uses the Channel to commit the transaction; once the transaction is committed, the Channel deletes the events from its own internal buffer.
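For illustration, an HDFS sink bound to one channel might be configured like this (the agent name, channel name, and path are example values; `hdfs.batchSize` controls how many events one transaction takes from the channel before the write is flushed):

```properties
# HDFS sink draining channel c1 in transactional batches.
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.rollInterval = 30
```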
2.3 Flume Topologies
Note: there is no topology in which a single Sink receives from multiple Channels, because the data would get mixed up; a Sink is always bound to exactly one Channel.
1. Series connection (Flume to Flume; the basis of the other topologies)
2. One Source to multiple Channels (this can be implemented in two ways: a replicating mechanism and a multiplexing mechanism)
3. One Channel to multiple Sinks (load balancing or failover)
4. Multiple Flume agents' Sinks to one Source (aggregation)
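Topology 2 (one Source to multiple Channels) is driven by the channel selector. A sketch of both selector types (the agent/channel names and the `type` header are made-up example values):

```properties
# Replicating (the default): every event is copied to both channels.
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = replicating

# Multiplexing alternative: route each event by the value of a header.
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = type
# a1.sources.r1.selector.mapping.error = c1
# a1.sources.r1.selector.default = c2
```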
2.5 Agent Internals
3. Flume Advanced
3.1 Failover
Flume's failover strategy works as follows: suppose one Channel is connected to multiple Sinks, and each Sink is given a priority at configuration time, e.g. k1: 10, k2: 5, k3: 1. The Channel first writes its data to k1, the highest-priority Sink; if k1 goes down, the data is written to k2 instead. When configuring failover there is a parameter called maxpenalty (30 seconds by default): if k1 recovers while data is being written to k2, but the recovery happens within that penalty window, writing stays on k2; only after the 30 seconds have elapsed does writing go back to k1.
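The priorities and penalty described above map onto a sink group configuration roughly like this (the agent and sink names are example values; maxpenalty is given in milliseconds, so 30000 ms is the 30-second default):

```properties
# Failover sink processor: always use the highest-priority live sink.
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.priority.k3 = 1
a1.sinkgroups.g1.processor.maxpenalty = 30000
```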
3.2 Custom Interceptor
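A custom interceptor implements `org.apache.flume.interceptor.Interceptor` (with `initialize`, an `intercept` for a single event and one for a list, `close`, and a `Builder`). The self-contained sketch below mirrors only the intercept logic, using stand-in classes so it compiles without Flume on the classpath; it tags each event with a `type` header that a multiplexing channel selector could route on. All names here are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for Flume's Event: a header map plus a byte[] body.
class SimpleEvent {
    final Map<String, String> headers = new HashMap<>();
    final byte[] body;
    SimpleEvent(byte[] body) { this.body = body; }
}

// Sketch of the intercept logic a real Interceptor implementation would hold.
class TypeHeaderInterceptor {
    // Single-event form: inspect the body and set a routing header.
    SimpleEvent intercept(SimpleEvent event) {
        String text = new String(event.body);
        event.headers.put("type", text.contains("ERROR") ? "error" : "normal");
        return event;
    }

    // List form: Flume batches events, so the real interface takes a List too.
    List<SimpleEvent> intercept(List<SimpleEvent> events) {
        for (SimpleEvent e : events) intercept(e);
        return events;
    }
}
```

In a real deployment the interceptor is registered in the agent configuration (e.g. `a1.sources.r1.interceptors = i1` plus the builder class name), and the header it sets pairs naturally with the multiplexing selector shown in section 2.3.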
3.3 Custom Source
3.4 Custom Sink