Why not write directly from the source systems to the Hadoop cluster? Because the source side consists of tens of thousands of machines: if they all wrote to HDFS in real time, the NameNode would accumulate an enormous number of small files, putting great pressure on Hadoop. Therefore an intermediate system, Flume, is introduced. What Flume really does is push events in real time, handling data streams that are continuous and very large in volume.
Flume interprets data as a series of events.
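Conceptually, a Flume event is a byte-array body plus a map of string headers. The sketch below models that structure in Python for illustration only; it is not Flume's actual Java API, and the function name `make_event` is hypothetical.

```python
# Conceptual model of a Flume event (illustrative, not Flume's real API):
# an event carries a byte-array body plus optional string headers.

def make_event(body, headers=None):
    """Model an event as a dict with 'headers' and 'body'."""
    return {
        "headers": dict(headers or {}),  # metadata such as timestamp or host
        "body": body.encode("utf-8") if isinstance(body, str) else bytes(body),
    }

event = make_event("user clicked checkout", {"host": "web-01"})
print(event["headers"]["host"])  # -> web-01
print(event["body"])             # -> b'user clicked checkout'
```

Sources create events like this from incoming data; Sinks later read the body and headers when delivering them downstream.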
Each Flume Agent contains three main components: Source, Channel, and Sink. The figure shows the structure of a Flume Agent.
1. The Source Component
A Source receives data generated by other applications. There are Sources that produce their own data, but these are usually used only for testing. A Source can listen on one or more network ports to receive data, or read data from the local file system. Each Source must be connected to at least one Channel.
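As a concrete illustration of a Source listening on a network port, the fragment below sketches a `netcat` Source bound to a TCP port and wired to one Channel. The agent and component names (`agent1`, `source1`, `channel1`) and the port number are assumptions for the example, not values from the original text.

```properties
# Illustrative example: a netcat Source listening on a TCP port.
agent1.sources = source1
agent1.channels = channel1

agent1.sources.source1.type = netcat
agent1.sources.source1.bind = 0.0.0.0
agent1.sources.source1.port = 44444
# Each Source must be connected to at least one Channel:
agent1.sources.source1.channels = channel1
```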
The figure shows how the Source, the Channel selector, and the interceptors interact.
2. The Channel Component
In general, a Channel is a passive component: Sources write events into the Channel, and Sinks read events from it.
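A common choice is the memory Channel, which buffers events in RAM between the Source and the Sink. The fragment below is a minimal sketch; the agent and channel names and the capacity values are illustrative assumptions.

```properties
# Illustrative example: a memory Channel buffering up to 10000 events;
# the Source writes into it and the Sink reads from it.
agent1.channels = channel1
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100
```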
3. The Sink Component
A Sink continuously removes events from the Channel and delivers them to the next destination, such as HDFS or another Flume Agent.
4. Configuring a Flume Agent
Flume Agents are configured using the Java properties file format:
k1 = v1
k2 = v2
A Flume Agent may contain several instances of each component type (Sources, Sinks, Channels, and so on), so each component must be named. The configuration file must first list the names of the Sources, Sinks, Channels, and Sink groups in the following format; this list is called the active list:
agent1.sources = source1 source2
agent1.sinks = sink1 sink2 sink3 sink4
agent1.sinkgroups = SG1 SG2
agent1.channels = channel1 channel2
The configuration segment above defines a Flume Agent named agent1, with two Sources, four Sinks, two Sink groups, and two Channels.
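After the active list, each named component gets its own properties. The fragment below is a hedged sketch of how some of the declared components might be wired together; the component types, paths, and the choice of a load-balancing Sink group are illustrative assumptions, not part of the original configuration.

```properties
# Illustrative wiring for the active list above (types and values are examples).
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /var/log/app.log
agent1.sources.source1.channels = channel1

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = channel1

# A Sink group bundles several Sinks for load balancing or failover:
agent1.sinkgroups.SG1.sinks = sink1 sink2
agent1.sinkgroups.SG1.processor.type = load_balance
```

Note that a Source's `channels` property is plural (it may feed several Channels), while a Sink's `channel` property is singular (each Sink reads from exactly one Channel).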