Flume technical principles

Reposted from: https://www.e-learn.cn/content/qita/690288

Flume is an open-source log collection system: a distributed, highly available and highly reliable system for aggregating massive volumes of log data. It supports customized data senders for many kinds of systems so that data can be collected; at the same time, Flume can perform simple processing on the data and write it to a variety of (customizable) recipients.

Flume is a streaming log collection tool. It can gather data from local files (spooling directory source), real-time logs (taildir, exec), REST messages, and sources such as Thrift, Avro, Syslog, and Kafka, perform simple processing on it, and write it to various customizable recipients.

  • Collects log data from a given directory and delivers it to a fixed destination (HDFS, HBase, Kafka).
  • Collects log data in real time (taildir) and delivers it to a destination.
  • Supports cascading (chaining multiple Flume agents together) and merging of data.
  • Supports data collection customized to the user's requirements.

 

Figure: Flume's position in FusionInsight

 

Flume is a distributed framework for collecting and aggregating event stream data.


Figure: Flume basic architecture

 

Flume basic architecture: a single Flume node can collect data directly. This is mainly used for data inside the cluster.
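As an illustration (not part of the original article), a minimal single-agent pipeline in Apache Flume's properties format might look like the sketch below; the paths and addresses are hypothetical.

    # Hypothetical single-agent pipeline: spooling directory -> memory channel -> HDFS
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: read files dropped into a spool directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/app
    a1.sources.r1.channels = c1

    # Channel: in-memory buffer (fast, not persistent)
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 1000

    # Sink: write events to HDFS, partitioned by day
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/app/%Y-%m-%d
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.hdfs.useLocalTimeStamp = true
    a1.sinks.k1.channel = c1

Such an agent is started with the standard launcher, e.g. bin/flume-ng agent --conf conf --conf-file single-node.conf --name a1.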


Figure: Flume multi-agent architecture

 

Flume multi-agent architecture: several Flume nodes can be connected together, with the first node collecting data from the source and relaying it through the chain to the final storage system. This is mainly used to bring data from outside the cluster into the cluster.
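As a sketch (agent names, hosts, and ports are hypothetical), a two-hop cascade is typically wired by pointing an Avro sink at the next hop's Avro source:

    # --- Agent "edge" on a node outside the cluster ---
    edge.sources = r1
    edge.channels = c1
    edge.sinks = k1
    edge.sources.r1.type = exec
    edge.sources.r1.command = tail -F /var/log/app/app.log
    edge.sources.r1.channels = c1
    edge.channels.c1.type = memory
    # Avro sink: forward events to the next Flume hop
    edge.sinks.k1.type = avro
    edge.sinks.k1.hostname = collector.example.com
    edge.sinks.k1.port = 4545
    edge.sinks.k1.channel = c1

    # --- Agent "collector" on a node inside the cluster ---
    collector.sources = r1
    collector.channels = c1
    collector.sinks = k1
    # Avro source: receive events from upstream agents
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545
    collector.sources.r1.channels = c1
    collector.channels.c1.type = file
    collector.sinks.k1.type = hdfs
    collector.sinks.k1.hdfs.path = hdfs://namenode/flume/app
    collector.sinks.k1.channel = c1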

 

Figure: Flume components

 

Components are detailed below:

  • Event: the unit in which Flume packages data; a single data record, and the basic unit of transmission in Flume.

  • Interceptor: filters and modifies the collected data according to the user's configuration.

  • Channel Selector: routes data into different Channels according to the user's configuration (see the sketch after this list).
  • Channel: temporarily buffers data.
  • Sink Runner: drives the Sink Processor, which in turn drives the Sink to fetch data from the Channel.
  • Sink Processor: its main strategies are load balancing, failover, and pass-through.
  • Sink: takes data out of the Channel and delivers it to different destinations.
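To make the Channel Selector concrete, here is a minimal sketch (the header name and its values are hypothetical) of a multiplexing selector that routes events by an event header:

    a1.sources = r1
    a1.channels = c1 c2
    # Multiplexing selector: route by the value of the "datatype" header
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = datatype
    a1.sources.r1.selector.mapping.access = c1
    a1.sources.r1.selector.mapping.error = c2
    a1.sources.r1.selector.default = c1
    a1.sources.r1.channels = c1 c2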

The Source receives events, or generates events through special mechanisms, and places them in batches into one or more Channels. There are two kinds of Source: driven and polling.

  • Driven Source: an external system actively sends data to Flume, driving Flume to receive it.
  • Polling Source: Flume itself periodically goes and fetches the data.

A Source must be associated with at least one Channel.
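As a sketch of the two kinds (ports and paths are hypothetical), an Avro source is driven by external senders, while a Taildir source polls files on its own:

    # Driven source: external clients push events to this Avro endpoint
    a1.sources.drv.type = avro
    a1.sources.drv.bind = 0.0.0.0
    a1.sources.drv.port = 4141
    a1.sources.drv.channels = c1

    # Polling source: Flume tails matching files and remembers its read position
    a1.sources.poll.type = TAILDIR
    a1.sources.poll.filegroups = f1
    a1.sources.poll.filegroups.f1 = /var/log/app/.*\.log
    a1.sources.poll.positionFile = /var/flume/taildir_position.json
    a1.sources.poll.channels = c1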

The Source types are as follows:

[The original table of Source types is missing from this capture; per the text above, they include Avro, Thrift, Exec, Spooling Directory, Taildir, Syslog, HTTP/REST, and Kafka.]

The Channel sits between the Source and the Sink. A Channel works like a queue, temporarily buffering incoming events; once the Sink has successfully delivered the events to the next hop's channel or to the final destination, they are removed from the Channel.

Different Channels provide different levels of persistence:

  • Memory Channel: no persistence. Messages are held in memory, which gives high throughput but no reliability; data may be lost.
  • File Channel: persists data, implemented on top of a WAL (Write-Ahead Log). It is more cumbersome to configure, since a data directory and a checkpoint directory must be set, and each File Channel needs its own checkpoint directory.
  • JDBC Channel: implemented on an embedded database (built-in Derby). Events are persisted, giving high reliability; it can be used in place of the equally persistent File Channel.

Channels support transactions, provide relatively weak ordering guarantees, and can be connected to any number of Sources and Sinks.
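A side-by-side sketch of the two most common channels (directories are hypothetical):

    # Memory channel: fast but volatile
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 1000

    # File channel: WAL-backed, survives process or node restarts
    a1.channels.c2.type = file
    a1.channels.c2.checkpointDir = /var/flume/checkpoint/c2
    a1.channels.c2.dataDirs = /var/flume/data/c2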

The Sink delivers events to the next hop or to the final destination and, once this succeeds, removes them from the Channel.

A Sink must be bound to exactly one Channel.

The Sink types are as follows:

[The original table of Sink types is missing from this capture; commonly used ones include HDFS, HBase, Kafka, Avro, Thrift, Logger, File Roll, and Null.]


Figure: Flume collecting log files

 

Flume can collect log files from outside the cluster and archive them in HDFS, HBase, or Kafka, where upper-layer applications can use the data for analysis and cleansing.
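For the Kafka destination, a sink sketch (broker addresses and topic are hypothetical) could look like:

    # Kafka sink: publish events to a Kafka topic for downstream consumers
    a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
    a1.sinks.k1.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
    a1.sinks.k1.kafka.topic = app-logs
    a1.sinks.k1.channel = c1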


Figure: Flume cascading

 

Flume supports cascading multiple Flume agents; within a cascaded node, data replication is also supported.

This scenario is mainly used to collect logs from nodes outside the FusionInsight cluster and funnel them through several Flume nodes until they finally converge into the cluster.
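The in-node replication mentioned above corresponds to Apache Flume's replicating channel selector; a sketch:

    # Replicating selector (the default): copy every event to all listed channels
    a1.sources.r1.selector.type = replicating
    a1.sources.r1.channels = c1 c2
    # c1 could feed an HDFS sink while c2 feeds a Kafka sink, producing two copies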


Figure: compression and encryption of messages between cascaded Flume nodes

 

Data transmission between cascaded Flume nodes supports compression and encryption, which improves transmission efficiency and security.

When data is transferred within a single Flume agent, encryption is unnecessary, because the exchange happens inside one process.
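In open-source Apache Flume terms (FusionInsight's packaging may differ), compression and TLS on a cascade hop are options of the Avro sink/source pair; the keystore paths and passwords below are hypothetical:

    # Sending side (Avro sink): compress and encrypt the stream to the next hop
    edge.sinks.k1.type = avro
    edge.sinks.k1.compression-type = deflate
    edge.sinks.k1.ssl = true
    edge.sinks.k1.truststore = /etc/flume/truststore.jks
    edge.sinks.k1.truststore-password = changeit

    # Receiving side (Avro source): settings must match
    collector.sources.r1.type = avro
    collector.sources.r1.compression-type = deflate
    collector.sources.r1.ssl = true
    collector.sources.r1.keystore = /etc/flume/keystore.jks
    collector.sources.r1.keystore-password = changeit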


Figure: Flume data monitoring

 

The amount of data received by the Source, buffered in the Channel, and written out by the Sink can all be displayed in the Manager graphical interface.
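The Manager GUI is FusionInsight-specific; in open-source Apache Flume, the same counters can be exposed as JSON over HTTP when starting the agent:

    # Expose Flume's internal counters over HTTP as JSON
    bin/flume-ng agent --conf conf --conf-file a1.conf --name a1 \
        -Dflume.monitoring.type=http \
        -Dflume.monitoring.port=34545
    # e.g. curl http://localhost:34545/metrics then shows counters such as
    # SOURCE.r1 EventReceivedCount, CHANNEL.c1 ChannelSize, SINK.k1 EventDrainSuccessCount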

Figure: how Flume transmission reliability works

 

While transferring data, Flume uses transactions to guarantee that no data is lost in transit, which strengthens transmission reliability. In addition, if the data buffered in the channel uses a File Channel, it survives process or node restarts.


Figure: handling failures during Flume transmission

 

While transferring data, if the next-hop Flume node fails or data reception goes wrong, Flume can automatically switch to another route and continue transmitting.
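In Apache Flume this failover behavior is configured with a sink group and the failover sink processor; the priorities below are hypothetical:

    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    # Failover processor: always use the highest-priority sink that is alive
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    a1.sinkgroups.g1.processor.maxpenalty = 10000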


Figure: how filtering works

 

While transferring data, Flume can perform simple filtering and cleansing on the data, dropping data that is of no interest. For more complex filtering, users need to develop a filter plugin suited to the particularities of their own data; Flume supports calling third-party filter plugins.
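In Apache Flume, simple filtering is typically done with interceptors. The built-in regex_filter interceptor is a minimal example (the pattern below is hypothetical); a custom third-party plugin implements the org.apache.flume.interceptor.Interceptor interface and is referenced by its Builder class:

    # Drop events whose body matches the pattern (e.g. DEBUG lines)
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = regex_filter
    a1.sources.r1.interceptors.i1.regex = ^DEBUG.*
    a1.sources.r1.interceptors.i1.excludeEvents = true

    # A hypothetical custom filter, referenced by its Builder class name:
    # a1.sources.r1.interceptors.i2.type = com.example.MyFilterInterceptor$Builder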
