Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data.
Flume can collect data from files, socket packets (network ports), folders, Kafka, and other data sources, and can output the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka.
1. Agent
The agent is the central role in Flume.
A Flume collection system is a simple or complex data transmission channel formed by connecting one or more agents.
Each agent is one member of the data transmission pipeline (data is encapsulated as Event objects) and contains three components internally:
Source: the collection component, used to interface with the data source and acquire data; Flume provides a variety of built-in implementations;
Channel: the transmission channel component, used to transfer data from the source to the sink;
Sink: the sink component, used to transfer data to the next downstream agent or to the final data storage system.
A single agent collecting data:
A single source feeding multiple channels and sinks:
Multiple agents chained in series:
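The source → channel → sink flow described above can be illustrated with a minimal conceptual sketch (a toy Python model invented for illustration, not the real Flume API):

```python
from collections import deque

# Toy model of one agent: source -> channel -> sink (NOT the Flume API).
class Channel:
    def __init__(self):
        self.queue = deque()

    def put(self, event):
        self.queue.append(event)

    def take(self):
        return self.queue.popleft() if self.queue else None

def source(channel, raw_lines):
    """Source: wrap each raw record as an event and put it into the channel."""
    for line in raw_lines:
        event = {"headers": {}, "body": line.encode()}
        channel.put(event)

def sink(channel, store):
    """Sink: drain the channel and deliver events to the destination."""
    while (event := channel.take()) is not None:
        store.append(event["body"].decode())

ch = Channel()
store = []
source(ch, ["log line 1", "log line 2"])
sink(ch, store)
print(store)  # ['log line 1', 'log line 2']
```

In a real deployment the sink of one agent can also forward events to the source of the next agent, which is how agents are chained in series.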
2. Event: the encapsulation of data transmitted inside Flume
Inside Flume, data exists in the form of Event objects.
After the Source component acquires raw data, it must wrap the data into Events and put them into the channel;
after the Sink component takes Events out of the channel, it converts them into other forms of data for output, according to the configuration.
An Event object consists of two main parts: Headers and Body.
Headers: a Map[String, String] collection, used to carry metadata in key-value form (flags, descriptions, etc.)
Body: a byte array that carries the actual data content, stored in byte form.
Event: { headers:{} body: 61 20 61 20 61 61 61 20 61 20 0D a a aaa a . }
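The Headers/Body split can be modeled with a short Python sketch (a conceptual illustration, not Flume's actual Event class; the header key is invented):

```python
# Conceptual model of a Flume Event: headers (a string-to-string map)
# plus a body (a byte array holding the payload).
def make_event(body, headers=None):
    return {"headers": headers or {}, "body": body}

event = make_event(b"a a aaa a", {"flag": "demo"})
print(event["headers"])        # {'flag': 'demo'}
print(event["body"].hex(" "))  # 61 20 61 20 61 61 61 20 61
```

Note how the hex dump of the body matches the style of the example Event above: the bytes 61 and 20 are the ASCII codes for 'a' and the space character.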
3. Transaction: the transaction control mechanism
Flume's transaction mechanism (similar to the transaction mechanism of a database):
Flume uses two independent transactions to handle event delivery from Source to Channel and from Channel to Sink, respectively. For example, the spooling directory source creates a transaction for each event batch of a file; only after all events in the transaction have been delivered to the Channel and the commit has succeeded does the Source mark that file as complete.
Similarly, transactions handle the delivery from Channel to Sink in the same way: if for some reason the events cannot be recorded, the transaction is rolled back, and all the events remain in the Channel, waiting to be redelivered.
The transaction mechanism involves the following important parameters:
a1.sources.s1.batchSize =100
a1.sinks.k1.batchSize = 100
a1.channels.c1.transactionCapacity = 100 (should be greater than or equal to the batch size of the source or the sink)
(transactionCapacity is the maximum number of events the channel handles per transaction)
Distinguish this from the channel's total event buffer capacity:
a1.channels.c1.capacity = 10000
So how does the transaction mechanism ensure data integrity? Consider the two agents below:
Data flow:
1. Source 1 generates an Event and puts it into Channel 1 via "put" and "commit" operations.
2. Sink 1 takes the Event out of Channel 1 via a "take" operation and sends it to Source 2.
3. Source 2 puts the Event into Channel 2 via "put" and "commit" operations.
4. Source 2 sends a success signal back to Sink 1, and Sink 1 then "commits" the "take" operation of step 2 (which actually removes the Event from Channel 1).
Note: at any moment, the Event exists completely and validly in at least one Channel.
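The steps above can be sketched with a simplified Python model of take-side transactions (a toy illustration of the commit/rollback semantics, not Flume's actual Transaction interface):

```python
from collections import deque

class Channel:
    """Toy channel: a take only removes an event permanently on commit;
    rollback returns all in-flight events to the channel."""
    def __init__(self):
        self.queue = deque()
        self.in_flight = deque()  # events taken but not yet committed

    def put(self, event):
        self.queue.append(event)

    def take(self):
        event = self.queue.popleft()
        self.in_flight.append(event)
        return event

    def commit(self):
        self.in_flight.clear()      # delivery confirmed: drop for good

    def rollback(self):
        while self.in_flight:       # delivery failed: put events back
            self.queue.appendleft(self.in_flight.pop())

ch = Channel()
ch.put("e1")

# Failed delivery: after rollback the event is still in the channel.
ch.take()
ch.rollback()
assert len(ch.queue) == 1

# Successful delivery: commit removes the event permanently.
ch.take()
ch.commit()
assert len(ch.queue) == 0 and len(ch.in_flight) == 0
```

This is exactly the guarantee stated in the note: until the downstream commit succeeds, the event remains recoverable from the upstream channel.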
4. Interceptors
Interceptors work after the source component: events generated by the source are passed through the interceptors to be processed as required.
Multiple interceptors can be combined into an interceptor chain.
Flume has a number of commonly used built-in interceptors.
Users can also develop their own custom interceptors according to their data-processing needs.
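How a chain of interceptors processes events can be sketched as follows (a conceptual Python model; real Flume interceptors are Java classes, and the two interceptor functions here are invented for illustration):

```python
import time

# Toy interceptors: each takes an event dict and returns it (possibly
# modified), or None to drop the event entirely.
def timestamp_interceptor(event):
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

def filter_interceptor(event):
    # Drop events whose body contains the word "DEBUG".
    return None if b"DEBUG" in event["body"] else event

def run_chain(event, interceptors):
    for interceptor in interceptors:
        event = interceptor(event)
        if event is None:   # a dropped event stops the chain
            return None
    return event

chain = [timestamp_interceptor, filter_interceptor]
kept = run_chain({"headers": {}, "body": b"INFO started"}, chain)
dropped = run_chain({"headers": {}, "body": b"DEBUG noisy"}, chain)
print("timestamp" in kept["headers"])  # True
print(dropped)                         # None
```

Chaining works because each interceptor receives the (possibly modified) output of the previous one, which matches how Flume applies configured interceptors in order.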
5. Selector
A selector allows logs from different projects to go to different channels and then on to different sinks.
Flume has two built-in selector types: replicating and multiplexing. If no selector is specified in the source configuration, the replicating channel selector is used by default. A replicating selector configuration looks like this:
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3
A multiplexing selector routes events to channels according to the value of a configured header:
a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.optional.US = c4
a1.sources.r1.selector.default = c4
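The multiplexing rules above can be sketched as a header-based routing function (a conceptual Python model of the selector, reusing the `state` header and channel names from the configuration; the `optional` channel behavior is not modeled):

```python
# Routing table mirroring the multiplexing configuration above.
MAPPING = {"CZ": ["c1"], "US": ["c2", "c3"]}
DEFAULT = ["c4"]

def select_channels(event):
    """Return the channels an event is routed to, based on its 'state' header."""
    state = event["headers"].get("state")
    return MAPPING.get(state, DEFAULT)

print(select_channels({"headers": {"state": "CZ"}}))  # ['c1']
print(select_channels({"headers": {"state": "US"}}))  # ['c2', 'c3']
print(select_channels({"headers": {"state": "DE"}}))  # ['c4'] (default)
```

By contrast, a replicating selector would simply return every configured channel for every event, regardless of its headers.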