The Road of Big Data, Week08_day02: Flume's Three Components (Source, Channel, Sink)

Before using Flume, let's first go over the characteristics of its components.

Advantages of Flume:

  1. Flume can store data generated by applications in any centralized store, such as HDFS or HBase.

  2. When the rate of data collection exceeds the rate at which data can be written out (that is, during collection peaks, when the incoming volume can exceed the write capacity of the target system), Flume buffers the data between the producers and the consumers, ensuring a smooth, steady flow between them.

  3. Flume provides contextual routing.

  4. Flume's pipelines are transaction-based, guaranteeing data consistency across sending and receiving.

  5. Flume is reliable, highly fault-tolerant, scalable, manageable, and customizable.

Features of Flume:

  1. Flume can efficiently collect server log data from multiple sites into HDFS/HBase.

  2. With Flume's fast transfer we can move data from multiple servers into Hadoop.

  3. Besides log data, Flume can also be used to collect large-scale event data, such as access events from social networking sites like Facebook and Twitter, or e-commerce sites like Amazon and Flipkart.

  4. Supports many kinds of incoming data sources as well as many outgoing data types.

  5. Supports multi-path flows, multi-channel fan-in and fan-out, and contextual routing.

  6. Can be scaled out horizontally.

============================================================================================================================================================

Flume's three components:

1. Source

  The Source is the data-collecting side: it is responsible for capturing data, formatting it, encapsulating it into events (Event), and then pushing those events into the Channel. Flume ships with many built-in Sources, supporting Avro, log4j, syslog, and HTTP POST (with a JSON body), so applications can talk directly to an existing Source such as AvroSource. If the built-in Sources cannot meet the requirements, Flume also supports custom Sources.


What each Source type does:

  Avro Source: listens on an Avro service port and collects Avro-serialized data;

  Thrift Source: listens on a Thrift service port and collects Thrift-serialized data;

  Exec Source: collects data from the standard output of a Unix command;

  JMS Source: a Java Message Service data source; JMS is a standalone, vendor-independent messaging API, and data sources conforming to the JMS specification can be collected from;

  Spooling Directory Source: collects files newly added to a designated directory as the data source;

  Kafka Source: collects data from a Kafka service;

  NetCat Source: binds to a port (TCP or UDP) and turns each line of text arriving on that port into an input Event;

  HTTP Source: collects data from monitored HTTP POST and GET requests.
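To make this concrete, here is a minimal agent configuration sketch wiring a NetCat source to a memory channel and a logger sink; the agent name a1, the bind address, and port 44444 are illustrative assumptions, not values from this article.

```properties
# netcat-logger.properties - a minimal sketch (names and port are assumptions)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# NetCat source: each line of text received on the port becomes one Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# buffer events in memory until the sink takes them
a1.channels.c1.type = memory

# print each event to the log (useful for testing)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Started with `flume-ng agent --conf conf --conf-file netcat-logger.properties --name a1 -Dflume.root.logger=INFO,console`, the agent should echo back anything sent to it with `nc localhost 44444`.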

=================================================================================================================================================

2. Channel

The Channel is the component that connects a Source and a Sink. You can think of it as a data buffer (a data queue): it can hold events temporarily in memory or persist them to local disk, keeping each event until a Sink has processed it.

Two of the more commonly used Channels are MemoryChannel and FileChannel.


Channel: a storage pool for data, the intermediate conduit.

Its main role is to accept the data handed over by the Source and carry it toward the destination specified for the Sink. Data in a Channel is deleted only once it has moved into the next channel or reached a terminal destination; if a Sink write fails, the data can be resent automatically, so nothing is lost, which makes the Channel very reliable. There are many Channel types, for example in-memory, JDBC data source, and file-based storage.

Common Channel types: Memory Channel, File Channel, JDBC Channel, Kafka Channel, etc.

For details, see: http://flume.apache.org/FlumeUserGuide.html#flume-channels

What each Channel type does:

  Memory Channel: uses memory as the data store.

  JDBC Channel: uses a JDBC data source as the data store.

  Kafka Channel: uses a Kafka service as the data store.

  File Channel: uses files as the data store.

  Spillable Memory Channel: uses both memory and files as the data store; data lives in memory first, and once the in-memory data reaches a threshold it is flushed to file.
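As a sketch of how these are declared in an agent's properties file (the capacities and directory paths below are illustrative assumptions):

```properties
# memory channel: fast, but events are lost if the agent dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# file channel: slower, but events survive an agent restart
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data

# spillable memory channel: an in-memory queue that overflows to disk
a1.channels.c3.type = SPILLABLEMEMORY
a1.channels.c3.memoryCapacity = 10000
a1.channels.c3.overflowCapacity = 1000000
```

The choice mirrors the reliability discussion at the end of this article: the memory channel favors speed, the file channel favors durability.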

=============================================================================================================================================================

3. Sink

The Sink takes events out of the Channel and sends the data onward: to a file system, a database, or Hadoop, or to the Source of another agent. When the volume of log data is small, the data can be stored in the file system, saved at a fixed time interval.
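For that last case, here is a File Roll Sink sketch that rolls over to a new local file at a fixed interval (the directory and the interval are assumed values):

```properties
# write events to local files, starting a new file every hour
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume-out
a1.sinks.k1.sink.rollInterval = 3600
```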


Sink: the final destination of the data.

Its main role is to take the data written into the Channel and materialize it in a specified form (storing it or presenting it). A Sink can take many forms, for example printing to the console, writing to HDFS, sending to an Avro service, or writing to a file.

Common Sink types: HDFS Sink, Hive Sink, Logger Sink, Avro Sink, Thrift Sink, File Roll Sink, HBaseSink, Kafka Sink, etc.

For details, see: http://flume.apache.org/FlumeUserGuide.html#flume-sinks (the HDFS Sink requires the HDFS configuration files and client libraries).

A common pattern is to have the sinks of multiple agents converge on one collector machine, which is then responsible for pushing the data to HDFS, as sketched below.
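A sketch of that fan-in topology, assuming a hypothetical collector host named collector01 listening on port 4141:

```properties
# on each web-server agent: forward events to the collector over Avro RPC
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = collector01
a1.sinks.k1.port = 4141

# on the collector agent: receive events from all the sending agents
a2.sources = r1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sources.r1.channels = c1
```

Pairing an Avro Sink with an Avro Source like this is the standard way to chain agents together.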

What each Sink type does:

HDFS Sink: transfers the data into an HDFS cluster.

Hive Sink: transfers the data into a Hive table.

Logger Sink: treats the data as log output (displayed according to the log level configured in Flume).

Avro Sink: the data is converted into Avro Events and sent to the configured RPC port.

Thrift Sink: the data is converted into Thrift Events and sent to the configured RPC port.

IRC Sink: sends the data to the specified IRC service and port.

File Roll Sink: transfers the data to local files.

Null Sink: discards the data, i.e. sends it to no destination.

HBaseSink: sends the data to an HBase database.

MorphlineSolrSink: sends the data to a Solr search server (or cluster).

ElasticSearchSink: sends the data to an Elasticsearch search server (or cluster).

Kafka Sink: sends the data to a Kafka service.
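As an example of the most common of these, an HDFS Sink stanza might look like the following sketch (the NameNode address, path pattern, and roll thresholds are illustrative assumptions):

```properties
# roll files by time and size; partition the HDFS path by event date
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```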

Flume uses transactions to guarantee reliability across the entire delivery of an Event.

A Sink may remove an Event from the Channel only after the Event has been stored into the next Channel, delivered to the next agent, or written to the external destination. This keeps every event in the data flow reliable, whether it travels within one agent or across several, because these transactions guarantee that it is stored successfully. For example, Flume supports keeping a file channel locally as a backup, whereas a memory channel holds events in an in-memory queue: fast, but events lost from it cannot be recovered.
