I. Flume Introduction
1.1 What is Flume
Flume is a real-time log collection system developed by Cloudera that has been widely adopted in industry. Flume's initial releases are now referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's feature set expanded, Flume OG's shortcomings became apparent: bloated code, poorly designed core components, and no standard for core configuration. In the final OG release, 0.94.0, unstable log transfer was especially serious. To solve these problems, on October 22, 2011, Cloudera completed Flume-728, a landmark change that rebuilt the core components, core configuration, and code architecture; the rebuilt version is collectively known as Flume NG (next generation). Another reason for the rebuild was the move of Flume into Apache, at which point Cloudera Flume was renamed Apache Flume.
1.2 Flume core concepts
Agent
    The agent is Flume's smallest standalone deployment unit. A single agent is composed of three components: Source, Channel, and Sink.
Source
    Collects data from the Client and passes it to the Channel.
Channel
    Sits between the Source and Sink, connecting them; it works much like a queue and mainly serves as a buffer.
Sink
    Responsible for delivering data from the Channel to the next Source or to the final destination; events are removed from the Channel after successful delivery.
Client
    Packages raw logs into Events and sends them to one or more Agents.
Events
    The Event is Flume's basic unit of data transfer; for example, one line of text is serialized into one Event.
1.3 Flume features
Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data. It supports all kinds of customized data senders within a logging system for collecting data; at the same time, Flume provides the ability to do simple processing on the data and write it to various data receivers (such as text files, HDFS, HBase, etc.).
An event (Event) runs through the entire Flume data flow. Events are Flume's basic units of data: each carries the log payload (a byte array) together with header information. Events are generated by a Source from an external event source; when the Source captures an event it formats it, then pushes it into one or more Channels. You can think of a Channel as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the event (e.g., to a log store) or forwarding it on toward another Source.
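This flow can be sketched as a toy model (illustrative Python only, not Flume's actual Java API; all names here are made up for the example):

```python
from collections import deque

# Toy model of the Flume pipeline: an Event carries a byte-array
# body plus string headers; the Channel buffers events between
# a Source that produces them and a Sink that consumes them.
class Event:
    def __init__(self, body, headers=None):
        self.body = body
        self.headers = headers or {}

channel = deque()  # the Channel acts like a queue / buffer

def source_push(line):
    # A Source wraps raw data into an Event and pushes it to the Channel
    channel.append(Event(line.encode("utf-8"), {"source": "demo"}))

def sink_drain():
    # A Sink removes events from the Channel once it has handled them
    out = []
    while channel:
        out.append(channel.popleft().body)
    return out

source_push("hello flume")
print(sink_drain())  # [b'hello flume']
```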
1.4 Flume Agent
The basic Flume model:
An Agent consists mainly of three components: source, channel, and sink.
Source:
Receives data from the data generator, converts the received data into Flume Event format, and transfers it to one or more Channels. Flume supports many ways of receiving data, such as Avro, Thrift, the Twitter 1% firehose, etc.
Channel:
A channel is a transient store: it buffers the events received from the source until they are consumed by a sink, acting as a bridge between source and sink. Channels are fully transactional, which guarantees consistency between the data sent and the data received, and a channel can be linked with any number of sources and sinks. Supported types include: JDBC channel, File channel, Memory channel, and so on.
Sink:
A sink stores the data into a centralized store such as HBase or HDFS: it consumes the data (Events) from a channel and delivers it to the destination. The destination might be another source, or a store such as HDFS or HBase.
Examples of combining these components appear in the collection cases in section II below.
1.5 Flume reliability
When a node fails, logs can be routed to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest:
end-to-end: the agent first writes the received events to disk; the data is deleted only once the transfer succeeds, and is re-sent if the transfer fails.
Store on failure: (the strategy used by Scribe) when the receiving side crashes, the data is written locally, and sending resumes once the receiver has recovered.
Besteffort: the data is sent to the receiver without any subsequent verification.
1.6 Flume recoverability
Recoverability is also provided by the Channel. FileChannel is recommended: it persists events in the local file system (at the cost of performance).
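Switching the memory channel used in the examples below to a persistent file channel only requires changing the channel definition. A minimal sketch (the directory paths are assumptions for this example):

```properties
# Durable file channel: events survive an agent restart
a1.channels.c1.type = file
# Where the channel keeps its checkpoint and data files (example paths)
a1.channels.c1.checkpointDir = /wishedu/flume/checkpoint
a1.channels.c1.dataDirs = /wishedu/flume/data
```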
II. Real-time collection cases
2.1 Netcat+logger
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
linux: yum install nc -y
nc -l port      # makes nc itself listen as a server (not needed here; the Flume agent is the listener)
nc ip port      # connect as a client, then type lines to send events
windows: at the command line, type: telnet ip port to test
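The netcat source acknowledges each received line with "OK" by default, so sending can also be scripted. A minimal sketch (the host, port, and helper name are assumptions for this example):

```python
import socket

def send_event(host, port, line):
    """Send one line to the agent's netcat source and return its reply
    (the netcat source replies "OK" per event by default)."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall((line + "\n").encode("utf-8"))
        return s.recv(64).decode("utf-8").strip()

# Usage, assuming the agent above is running:
# send_event("localhost", 44444, "hello flume")  # replies "OK"
```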
2.2 Spool+memory+hdfs
Create a configuration file f1.conf:
# Define the names of the agent's source, channel, and sink
f1.sources = r1
f1.channels = c1
f1.sinks = k1
# Define the source
# The spooldir source watches a folder for new files
f1.sources.r1.type = spooldir
# The folder to monitor
f1.sources.r1.spoolDir = /opt/test/
# Define the channel
f1.channels.c1.type = memory
# Maximum number of events stored in the channel
f1.channels.c1.capacity = 10000
# Defaults to 100: the sink commits a transaction (i.e., sends the batch on to the next destination) after collecting up to this many events
# Must not be smaller than the sink's hdfs.batchSize
f1.channels.c1.transactionCapacity = 100
# Define an interceptor that adds a timestamp header to each event
f1.sources.r1.interceptors = i1
f1.sources.r1.interceptors.i1.type = timestamp
# Define the sink
f1.sinks.k1.type = hdfs
f1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/%Y%m%d
# Prefix of the generated log file names (default: FlumeData)
f1.sinks.k1.hdfs.filePrefix = events-
# File name suffix
f1.sinks.k1.hdfs.fileSuffix=.log
# File type: currently SequenceFile (the default), DataStream, or CompressedStream
# DataStream writes the output as uncompressed plain text; CompressedStream requires setting an available codec via hdfs.codeC
f1.sinks.k1.hdfs.fileType = DataStream
# Compression codec, one of: gzip, bzip2, lzo, lzop, snappy
#f1.sinks.k1.hdfs.codeC
# Roll the file after this many events are written (default 10; 0 = do not roll by event count)
f1.sinks.k1.hdfs.rollCount = 0
# Roll the file on HDFS once it reaches 1 MB (default 1024 bytes; 0 = do not roll by size)
f1.sinks.k1.hdfs.rollSize = 1048576
# Roll the file on HDFS every 60 seconds (default 30 seconds; 0 = do not roll by time)
f1.sinks.k1.hdfs.rollInterval = 60
# Timeout after which an inactive file is closed (0 = disable automatic closing of idle files; default 0)
#f1.sinks.k1.hdfs.idleTimeout
# Number of events flushed to HDFS per batch (default 100)
#f1.sinks.k1.hdfs.batchSize
# Sequence file write format: Text or Writable (default)
#f1.sinks.k1.hdfs.writeFormat
# Use the local time instead of the timestamp from the event headers (default false)
#f1.sinks.k1.hdfs.useLocalTimeStamp
# Wire the source, channel, and sink together
f1.sources.r1.channels = c1
f1.sinks.k1.channel = c1
Run the agent:
flume-ng agent -n f1 -c conf -f /wishedu/testdata/flume/f1.conf -Dflume.root.logger=INFO,console
-n is followed by the agent's name (it must match the name used in the configuration file)
-f is followed by the path of the configuration file you edited
2.3 Exec+memory+logger
Create a profile f2.conf:
# Define the names of the agent's source, channel, and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Define the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /wishedu/testdata/flume/logs/0.log
# Define the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Define the sink
a1.sinks.k1.type = logger
# Wire the source, channel, and sink together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
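To drive this exec source, append lines to the tailed file while the agent is running; each new line becomes one event. A minimal sketch (using a /tmp path as a stand-in for the real log path):

```python
import os

# Stand-in path; the agent above tails /wishedu/testdata/flume/logs/0.log
LOG = "/tmp/flume-demo/0.log"
os.makedirs(os.path.dirname(LOG), exist_ok=True)

# Each appended line is picked up by `tail -f` and becomes one Flume event
with open(LOG, "a") as f:
    for i in range(3):
        f.write(f"test event {i}\n")
```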
III. Installation and configuration
3.1 Extract and install
tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /wishedu/
3.2 Configure environment variables
vi /etc/profile
export FLUME_HOME=/wishedu/flume-1.6.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$FLUME_HOME/bin
Save and exit, then reload the profile:
source /etc/profile
Set JAVA_HOME in flume-env.sh.
3.3 Verification
# Check the Flume version:
[root@wishedu bin]# flume-ng version
Flume 1.6.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080
Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015
From source with checksum b29e416802ce9ece3269d34233baf43f
# If the above information appears, the installation succeeded
IV. Notes
4.1 When monitoring a directory (spooldir)
(1) Once Flume has processed a file in the monitored directory, it marks the file by appending the suffix .COMPLETED to its name; if Flume is restarted, it will no longer monitor files carrying this mark.
(2) A file that is being monitored must not be modified, or Flume will raise an error.
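The marking behavior can be illustrated with a toy re-implementation (plain Python, not Flume's code; the function name is made up for the example):

```python
import os
import tempfile

SUFFIX = ".COMPLETED"

def process_spool_dir(spool_dir):
    """Toy version of spooldir handling: ingest each new file once,
    then mark it with the .COMPLETED suffix so later runs skip it."""
    ingested = []
    for name in sorted(os.listdir(spool_dir)):
        if name.endswith(SUFFIX):
            continue                       # already marked; never re-read
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            ingested.append(f.read())
        os.rename(path, path + SUFFIX)     # mark the file as done
    return ingested

d = tempfile.mkdtemp()
with open(os.path.join(d, "a.log"), "w") as f:
    f.write("line1")
print(process_spool_dir(d))  # ['line1']
print(process_spool_dir(d))  # [] -- a.log.COMPLETED is skipped
```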