Flume Introduction
1 Overview
Flume, originally developed by Cloudera, is a distributed, highly reliable, and highly available system for collecting, aggregating, and transporting massive amounts of log data.
Flume can collect data from files, socket packets, folders, and other data sources, and deliver the collected data to HDFS, HBase, Hive, Kafka, and other external storage systems.
General collection requirements can be met with simple Flume configuration.
Flume also extends well to custom scenarios, so it can be used for most routine data collection needs.
2 Operating Mechanism
The core role in a Flume distributed system is the agent; a Flume collection system is formed by connecting agents together.
Each agent acts as a data relay and contains three components:
a) Source: the collection component, which connects to a data source to obtain data.
b) Sink: the delivery component, which passes collected data to the next-stage agent or to the final data storage system.
c) Channel: the transmission channel inside the agent, which moves data from the source to the sink.
3 Complex Structures
Multiple agents can be chained together in several topologies:
(1) First: two agents in series
(2) Second: multiple collection agents aggregated into one
(3) Third: the collected data delivered to different downstream systems
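As a sketch of the first topology (two agents in series), the upstream agent's Avro sink can point at the downstream agent's Avro source. The agent names (a1, a2), the host hadoop02, and port 4141 are assumptions for illustration:

```properties
# upstream agent (a1): its avro sink forwards events to the next agent
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop02
a1.sinks.k1.port = 4141

# downstream agent (a2, on hadoop02): its avro source receives from a1
a2.sources = r1
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop02
a2.sources.r1.port = 4141
```

The Avro sink/source pair is the standard way to hop events between agents, since both sides speak the same Avro RPC protocol.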
Flume in Practice
1 Flume Installation and Deployment
a) Installing Flume is very simple: just unpack it. A Hadoop environment is assumed; upload the installation package to the node that will collect the data.
b) Then extract it: tar -zxvf apache-flume-1.6.0-bin.tar.gz
c) Enter the flume directory, edit flume-env.sh under conf, and set JAVA_HOME in it.
Design a collection scheme according to the data collection needs, and describe it in a configuration file (the file name can be arbitrary).
Then start the Flume agent on the corresponding node, specifying that configuration file.
2 A Simple Case
a) Create a new file in Flume's conf directory:
vi netcat-logger.conf
# Name the components of this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe and configure the source component: r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop01
a1.sources.r1.port = 44444
# Describe and configure the sink component: k1
a1.sinks.k1.type = logger
# Describe and configure the channel component; memory buffering is used here
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Wire up the connections between source, channel, and sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
b) Start the agent to collect data:
bin/flume-ng agent -c conf -f conf/netcat-logger.conf -n a1 -Dflume.root.logger=INFO,console
-c conf specifies the directory of Flume's own configuration files
-f conf/netcat-logger.conf specifies the collection scheme we described
-n a1 specifies the name of our agent
3 Test
First, send data to the port the agent is monitoring, so the agent has data to collect. From any networked machine:
telnet agent-hostname port (e.g. telnet hadoop01 44444)
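If telnet is not available, any TCP client works, since the netcat source simply treats each incoming line of text as one event. The sketch below simulates the whole exchange locally: a stand-in listener plays the role of the netcat source (the real agent's host/port from the config would replace it), and a client plays the role of telnet.

```python
# Sketch: what "telnet hadoop01 44444" does against a netcat source.
# The listener here is a local stand-in, not Flume itself; the OS picks
# a free port instead of the configured 44444 to keep the demo portable.
import socket
import threading

received = []
ready = threading.Event()
port_holder = []

def fake_netcat_source():
    # Stand-in for the Flume netcat source: accept one connection and
    # read one line, which the real source would wrap in an Event.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))               # OS assigns a free port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()                              # tell the client we're up
    conn, _ = srv.accept()
    received.append(conn.recv(1024).decode())
    conn.close()
    srv.close()

t = threading.Thread(target=fake_netcat_source)
t.start()
ready.wait()

# The "telnet" side: connect and send one line of text.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port_holder[0]))
cli.sendall(b"hello flume\n")                # one line -> one event
cli.close()
t.join()

print(received[0].strip())
```

Against a real agent you would connect to the configured bind host and port (hadoop01:44444 in the example config) and watch the line appear in the logger sink's console output.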
Source Components

| Source type | Explanation |
| --- | --- |
| Avro Source | Supports the Avro protocol (actually Avro RPC); built-in support. |
| Exec Source | Produces data from the standard output of a Unix command. |
| Spooling Directory Source | Monitors a specified directory for data changes (new files). |
| NetCat Source | Monitors a port; each line of text arriving on the port becomes an Event. |
| Thrift Source | Supports the Thrift protocol; built-in support. |
| JMS Source | Reads data from a JMS system (message topic); tested with ActiveMQ. |
| Sequence Generator Source | A sequence generator source that continually produces sequence data. |
| Syslog Source | Reads syslog data and generates Events; supports both TCP and UDP. |
| HTTP Source | A data source based on HTTP POST or GET; supports JSON and BLOB representations. |
| Legacy Source | Compatible with Sources from the old Flume OG (0.9.x versions). |
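For instance, an Exec source that tails a log file can be declared in the same style as the netcat source above. The agent name a1 and the file path are assumptions:

```properties
# Exec source: stream new lines of an application log as events
a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
```

tail -F (rather than -f) is the usual choice here, since it keeps following the file across log rotation.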
Channel Components

| Channel type | Explanation |
| --- | --- |
| Memory Channel | Events are stored in memory. |
| File Channel | Events are stored in files on disk. |
| JDBC Channel | Events are stored in persistent storage; Flume currently has built-in support for the Derby database. |
| Spillable Memory Channel | Events are stored in memory; when the in-memory queue is full, they are persisted to disk files. |
| Pseudo Transaction Channel | For testing purposes only. |
| Custom Channel | A custom Channel implementation. |
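The memory channel in the simple case above loses buffered events if the agent dies; swapping in a file channel trades some throughput for durability. A minimal sketch (the directory paths are assumptions):

```properties
# File channel: events survive an agent restart at the cost of disk I/O
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /data/flume/checkpoint
a1.channels.c1.dataDirs = /data/flume/data
```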
Sink Components

| Sink type | Explanation |
| --- | --- |
| HDFS Sink | Writes data to HDFS. |
| Avro Sink | Converts data into Avro Events and sends them to the configured RPC port. |
| Thrift Sink | Converts data into Thrift Events and sends them to the configured RPC port. |
| IRC Sink | Plays data back on IRC. |
| File Roll Sink | Stores data on the local file system. |
| Null Sink | Discards all data. |
| HBase Sink | Writes data to an HBase database. |
| Morphline Solr Sink | Sends data to a Solr search server. |
| ElasticSearch Sink | Sends data to an ElasticSearch server (cluster). |
| Custom Sink | A custom Sink implementation. |
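To route the simple case's output into HDFS instead of the console, the logger sink can be replaced with an HDFS sink. A sketch, in which the NameNode address, path layout, and roll interval are assumptions:

```properties
# HDFS sink: roll a new file every 60 seconds under a dated directory
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop01:9000/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The %Y-%m-%d escapes require a timestamp on each event; hdfs.useLocalTimeStamp = true makes the sink stamp events with local time when the source does not provide one.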
Flume supports many more source, channel, and sink types; for the detailed reference, see the official user guide: http://flume.apache.org/FlumeUserGuide.html