Flume definition:
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. Its architecture is based on streaming data flows and is flexible and simple.
Why choose Flume
Main function: read data from a server's local disk in real time and write it to HDFS
Flume architecture
1. The simplest architecture
2. Flume processing flow
Description:
source: the data input end
Common source types: spooling directory, exec, syslog, avro, netcat, etc.
channel: a buffer that sits between the source and the sink
memory channel: buffers events in memory; suitable when some data loss is acceptable
file channel: persists events to disk, so data is not lost if the system goes down
sink: the data output end
Common sink destinations: HDFS, Kafka, logger, avro, file, and custom sinks
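The relationship between the three components can be sketched with a toy model. This is only an illustration in Python, not Flume's API: `ToyChannel`, `run_source`, and `run_sink` are hypothetical names, and the list `delivered` stands in for a real destination such as HDFS.

```python
from collections import deque


class ToyChannel:
    """Bounded FIFO buffer standing in for a channel (hypothetical, not Flume's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        # A full channel refuses new events instead of growing without bound.
        if len(self.events) >= self.capacity:
            raise OverflowError("channel is full")
        self.events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self.events.popleft() if self.events else None


def run_source(lines, channel):
    """Source side: push every input line into the channel."""
    for line in lines:
        channel.put(line)


def run_sink(channel, destination):
    """Sink side: drain the channel into the destination."""
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event)


channel = ToyChannel(capacity=10)
delivered = []
run_source(["log line 1", "log line 2"], channel)
run_sink(channel, delivered)
print(delivered)
```

The point of the buffer is that the source and the sink never talk to each other directly, so they can run at different speeds.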
Put transaction process:
doPut: writes the batch of data into the temporary buffer putList
doCommit: checks whether the channel's memory queue has enough space, and if so merges putList into it
doRollback: if the memory queue does not have enough space, rolls the data back
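The put steps above can be sketched as a small Python model. It is a simplified illustration under stated assumptions, not Flume's Java implementation: the classes `ToyMemoryChannel` and `PutTransaction` and the helper `put_batch` are hypothetical, and rollback here simply discards the staged batch so the caller can retry later.

```python
from collections import deque


class ToyMemoryChannel:
    """Hypothetical stand-in for a memory channel with a bounded queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()


class PutTransaction:
    def __init__(self, channel):
        self.channel = channel
        self.put_list = []  # temporary buffer, plays the role of putList

    def do_put(self, event):
        # Stage the event; the channel queue is not touched yet.
        self.put_list.append(event)

    def do_commit(self):
        # Merge the staged batch only if the queue has room for all of it.
        free = self.channel.capacity - len(self.channel.queue)
        if len(self.put_list) > free:
            raise OverflowError("not enough space in the channel queue")
        self.channel.queue.extend(self.put_list)
        self.put_list.clear()

    def do_rollback(self):
        # Drop the staged batch; the channel queue stays unchanged.
        self.put_list.clear()


def put_batch(channel, batch):
    """Run one put transaction; return True if the batch was committed."""
    tx = PutTransaction(channel)
    try:
        for event in batch:
            tx.do_put(event)
        tx.do_commit()
        return True
    except OverflowError:
        tx.do_rollback()
        return False


channel = ToyMemoryChannel(capacity=3)
first = put_batch(channel, ["a", "b"])   # fits: 2 of 3 slots used
second = put_batch(channel, ["c", "d"])  # needs 2 slots, only 1 free
print(first, second, list(channel.queue))
```

Because the batch is staged in `put_list` first, a failed commit leaves the channel exactly as it was: either the whole batch goes in or none of it does.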
Take transaction process:
doTake: extracts a batch of data from the channel into the temporary buffer takeList
doCommit: if all the data is sent successfully, clears the temporary buffer takeList
doRollback: if an error occurs while the data is being sent, returns the data in takeList to the channel
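The take steps above can be modeled the same way. Again, this is a hedged Python sketch and not Flume's actual code: `TakeTransaction`, `drain_batch`, and the `send` callback are hypothetical names, and a plain `deque` stands in for the channel queue.

```python
from collections import deque


class TakeTransaction:
    def __init__(self, queue):
        self.queue = queue    # the channel queue (a deque in this toy model)
        self.take_list = []   # temporary buffer, plays the role of takeList

    def do_take(self):
        # Move the oldest event from the queue into the temporary buffer.
        if not self.queue:
            return None
        event = self.queue.popleft()
        self.take_list.append(event)
        return event

    def do_commit(self):
        # Everything was delivered, so the buffered copies can be dropped.
        self.take_list.clear()

    def do_rollback(self):
        # Put the events back at the head of the queue in their original order.
        self.queue.extendleft(reversed(self.take_list))
        self.take_list.clear()


def drain_batch(queue, batch_size, send):
    """Run one take transaction; return True if the batch was delivered."""
    tx = TakeTransaction(queue)
    try:
        batch = []
        for _ in range(batch_size):
            event = tx.do_take()
            if event is None:
                break
            batch.append(event)
        send(batch)  # may raise OSError when the destination is unavailable
        tx.do_commit()
        return True
    except OSError:
        tx.do_rollback()
        return False


def failing_send(batch):
    raise OSError("destination unavailable")


queue = deque(["a", "b", "c"])
failed = drain_batch(queue, 2, failing_send)  # rolled back, queue restored
sent = []
ok = drain_batch(queue, 2, sent.extend)       # delivered, takeList cleared
print(failed, ok, sent, list(queue))
```

One consequence of this design is that a batch which was partly delivered before the error is sent again after the rollback, so the destination may see duplicates.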