Flume core components and concepts

Flume is a distributed, reliable system for the collection, aggregation, and transmission of massive volumes of log data.
Flume can collect data from files, socket packets (network ports), folders, Kafka, and other kinds of data sources, and can output (sink) the collected data to external storage systems such as HDFS, HBase, Hive, and Kafka.

1. Agent

The agent is the central role in Flume.
A Flume collection system is a simple or complex data transmission pipeline formed by connecting agents together.

Each agent is one link in the transmission of data (encapsulated as Event objects), and has three components inside:
Source: the collection component, used to connect to the data source and obtain data; Flume ships with many built-in implementations
Channel: the transmission channel component, used to carry data from the source to the sink
Sink: the sink component, used to deliver data to the next agent downstream or to the final data storage system
Typical topologies are shown in the figures below, followed by a minimal configuration sketch.
[Figure: a single data-collection agent]
[Figure: a single source with multiple sinks]
[Figure: multiple agents connected in series]
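To make the three components concrete, here is a minimal configuration sketch for a single agent, assuming a netcat source, a memory channel, and a logger sink (the agent name a1 and the component names r1/c1/k1 are illustrative):

# name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source: listens on a TCP port and turns each line into an Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# memory channel: buffers Events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# logger sink: prints Events to the log, useful for testing
a1.sinks.k1.type = logger

# bind source and sink to the channel (a sink takes exactly one channel)
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Such an agent can then be started with the flume-ng launcher, e.g. bin/flume-ng agent -n a1 -c conf -f example.conf (example.conf being wherever the properties above are saved).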

2. Event: the encapsulation of data transmitted inside Flume

Inside Flume, data exists in the encapsulated form of Events.
After the Source component obtains raw data, it wraps it into Events and puts them into the channel;
after the Sink component takes Events out of the channel, it converts them into other forms of output data according to the configuration.

An Event object consists of two main parts: headers and body.
Headers: a Map<String, String> collection that carries metadata in key-value form (flags, descriptions, etc.)
Body: a byte array that carries the actual data content, stored in the array as raw bytes
Example: Event: { headers:{} body: 61 20 61 20 61 61 61 20 61 20 0D                a a aaa a . }

3. Transaction: the transaction control mechanism

Flume's transaction mechanism (similar to a database's transaction mechanism):
Flume uses two separate transactions to handle event delivery from Source to Channel and from Channel to Sink. For example, the spooling directory source creates a transaction for each event batch read from a file; once all events in the transaction have been delivered to the Channel and the commit succeeds, the Source marks that file as completed.
Transactions handle the delivery from Channel to Sink in the same way: if for some reason the events cannot be recorded, the transaction rolls back, and all of the events stay in the Channel waiting to be redelivered.

The transaction mechanism involves the following important parameters:
a1.sources.s1.batchSize = 100
a1.sinks.k1.batchSize = 100
a1.channels.c1.transactionCapacity = 100 (should be greater than or equal to the batch size of the source and of the sink)
transactionCapacity is the maximum number of events the channel accepts or hands out within one transaction.
Keep this distinct from the channel's overall data buffer capacity:
a1.channels.c1.capacity = 10000
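Putting these together, a sketch of a memory channel where the per-transaction limit matches the batch sizes above (all values illustrative):

a1.channels.c1.type = memory
# total number of Events the channel can buffer at once
a1.channels.c1.capacity = 10000
# max Events taken in or handed out per transaction; keep >= source/sink batchSize
a1.channels.c1.transactionCapacity = 100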

So how does the transaction mechanism guarantee data integrity? Consider the two agents connected in series below:
[Figure: two agents connected in series]
Data flow:
1. Source 1 generates an Event and puts it into Channel 1 through "put" and "commit" operations
2. Sink 1 takes the Event out of Channel 1 through a "take" operation and sends it to Source 2
3. Source 2 puts the Event into Channel 2 through "put" and "commit" operations
4. Source 2 sends a success signal back to Sink 1, and Sink 1 then "commit"s the "take" operation of step 2 (which actually removes the Event from Channel 1)

Note: at any point in time, the Event exists complete and valid in at least one Channel.
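As a sketch of how two agents like these are chained in practice, an avro sink on agent 1 can feed an avro source on agent 2 (the agent names, the hostname node2, and port 4141 are all illustrative):

# agent 1: sink sends Events to the downstream agent over avro RPC
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node2
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

# agent 2 (on node2): source receives Events over avro RPC
a2.sources.r2.type = avro
a2.sources.r2.bind = 0.0.0.0
a2.sources.r2.port = 4141
a2.sources.r2.channels = c2

The avro source's acknowledgment of each RPC batch is the success signal that lets the sink commit its "take" in step 4.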

4. Interceptors

Interceptors work right after the source component: events generated by the source are passed through the interceptor for processing as required.
Multiple interceptors can be chained together into an interceptor chain.
Flume has a number of commonly used interceptors built in,
and users can also develop their own custom interceptors according to their data-processing needs!
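As an example of the built-in interceptors, a chain of the timestamp and host interceptors might be configured like this (agent, source, and interceptor names are illustrative):

a1.sources.r1.interceptors = i1 i2
# timestamp interceptor: adds a "timestamp" header with the processing time in millis
a1.sources.r1.interceptors.i1.type = timestamp
# host interceptor: adds a header carrying this agent's host name or IP
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname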

5. Selector

Selectors let logs from different projects go into different channels and on to different sinks.
Flume has two built-in selector types: replicating and multiplexing. If no selector is specified in the source's configuration, the replicating channel selector is used automatically.

Replicating selector (every Event is copied to all listed channels; c3 is optional, so a failed write to it is ignored):

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

Multiplexing selector (routes each Event by its "state" header: CZ goes to c1, US goes to c2 and c3 with c4 optional, and everything else falls through to c4):

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.optional.US = c4
a1.sources.r1.selector.default = c4
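For the multiplexing example to route anything, events must actually carry the "state" header. One way to set it on a given source is the built-in static interceptor (the value CZ here is illustrative):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# static interceptor: stamps a fixed key/value header onto every Event
a1.sources.r1.interceptors.i1.key = state
a1.sources.r1.interceptors.i1.value = CZ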