Flume in the Hadoop ecosystem: basic introduction and environment variable configuration

I. Flume Introduction

1.1 What is Flume

Flume is a real-time log collection system developed by Cloudera that has been widely adopted in industry. Its initial releases are now referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality expanded, the shortcomings of Flume OG became apparent: a bloated codebase, poorly designed core components, and non-standard core configuration. In the final OG release, 0.94.0, log transfer was especially unstable. To address these problems, on October 22, 2011 Cloudera completed Flume-728, a landmark change that rebuilt the core components, core configuration, and code architecture; the rebuilt version is collectively referred to as Flume NG (next generation). Another change was Flume's move into Apache, after which Cloudera Flume was renamed Apache Flume.

1.2 Flume core concepts

 

Agent

The agent is Flume's smallest standalone unit. A single agent is composed of three components: Source, Channel, and Sink.

Source

Collects data from the client and passes it to the Channel.

Channel

The Channel sits between the Source and the Sink and connects them. It behaves somewhat like a queue and mainly acts as a buffer.

Sink

The Sink is responsible for transferring data from the Channel to the next Source or to the final destination; an event is removed from the Channel only after it has been delivered successfully.

Client

The client packages raw logs into events and sends them to one or more agents.

Events

The event is Flume's basic unit of data transmission; for example, a line of text is serialized into one event.
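These concepts map directly onto Flume's properties-style configuration. A minimal sketch (the agent name a1 and the component names r1, c1, k1 are arbitrary placeholders chosen here for illustration):

# one agent (a1) = one source (r1) + one channel (c1) + one sink (k1)
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# every source and sink must be bound to a channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1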

1.3 Flume features

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large volumes of log data. It supports a variety of custom data senders for collecting data from logging systems, and it also provides simple data processing and the ability to write to various data receivers (such as text files, HDFS, HBase, etc.).

A Flume data flow is carried end to end by events. An event is the basic unit of data: it carries the log data as a byte array along with header information. Events are generated by a source external to the agent; when the Source captures an event, it formats it and then pushes it into one or more Channels. The Channel can be thought of as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the event to a log store or forwarding it to another Source.

 

1.4 Flume Agent

The basic Flume model (figure omitted in the original):

 

An agent consists mainly of three components: source, channel, and sink.

Source:

Receives data from a data generator, formats the received data as Flume events, and transfers them to one or more channels. Flume supports a variety of ways of receiving data, such as Avro, Thrift, the Twitter 1% firehose, etc.

Channel:

The channel is a temporary storage container: it caches events received from the source, in event format, until they are consumed by sinks. It acts as a bridge between sources and sinks. The channel is fully transactional, which guarantees the consistency of data between sending and receiving, and it can be linked to any number of sources and sinks. Supported types include the JDBC channel, file channel, memory channel, and so on.

sink:

The sink stores the data into a centralized store such as HDFS or HBase: it consumes the data (events) from the channel and delivers them to the destination. The destination may be another source, HDFS, HBase, etc.

Examples of how these components can be combined (the original figures are omitted here).
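As one illustration of chaining agents (a hedged sketch, not from the original post; the host name collector-host and port 4545 are made-up placeholders), the first agent's Avro sink can feed the second agent's Avro source:

# agent1: forwards its events over Avro to a collector machine
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host
agent1.sinks.k1.port = 4545

# agent2 (running on collector-host): receives the Avro stream from agent1
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545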

1.5 Flume reliability

When a node fails, logs can be transmitted to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the agent first writes the received event to disk; after the data is transferred successfully it is deleted, and if transmission fails it can be re-sent), store on failure (the strategy also used by Scribe: when the receiving side crashes, data is written locally and sending resumes once the receiver recovers), and best effort (data is sent to the receiver without any verification).

1.6 Flume recoverability

Recoverability is also provided by the Channel. The FileChannel is recommended: it persists events in the local file system (at the cost of lower performance).
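A minimal FileChannel sketch (the directory paths below are assumptions, not taken from the original post):

# file channel: persists events to local disk so they survive an agent restart
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /wishedu/flume/checkpoint
a1.channels.c1.dataDirs = /wishedu/flume/data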

II. Real-time collection examples

2.1 Netcat+logger

a1.sources = r1

a1.sinks = k1

a1.channels = c1

 

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = localhost

a1.sources.r1.port = 44444

 

# Describe the sink

a1.sinks.k1.type = logger

 

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

 

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

 

On Linux: yum install nc -y

        nc -l <port>        # listen on a port (server side)

        nc <ip> <port>      # connect to a listener as a client

On Windows, test from the command line with: telnet <ip> <port>
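To try this example (a sketch assuming the configuration above was saved as netcat-logger.conf, a made-up file name), start the agent and then send it a line with nc:

# start the agent defined above (agent name a1)
flume-ng agent -n a1 -c conf -f netcat-logger.conf -Dflume.root.logger=INFO,console

# in another terminal, connect to the netcat source and type a message
nc localhost 44444
hello flume

The logger sink should then print the received event (headers and body) to the agent's console.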

2.2 Spool+memory+hdfs

Create a configuration file f1.conf:

# Define the agent name and the names of its source, channel, and sink

f1.sources = r1

f1.channels = c1

f1.sinks = k1

 

# Define the source

# Watch a directory for new files

f1.sources.r1.type = spooldir

# The directory to monitor

f1.sources.r1.spoolDir = /opt/test/

 

# Define the channel

f1.channels.c1.type = memory

# The maximum number of events stored in the channel

f1.channels.c1.capacity = 10000

# Default is 100: the sink collects up to this many events before committing the transaction (i.e. sending them on to the next destination)

# Must not be smaller than the sink's hdfs.batchSize

f1.channels.c1.transactionCapacity = 100

 

# Define an interceptor that adds a timestamp header to each event

f1.sources.r1.interceptors = i1

f1.sources.r1.interceptors.i1.type = timestamp

 

# Define the sink

f1.sinks.k1.type = hdfs

f1.sinks.k1.hdfs.path = hdfs://hadoop:9000/flume/%Y%m%d

# Prefix for generated file names (default: FlumeData)

f1.sinks.k1.hdfs.filePrefix = events-

# File name suffix

f1.sinks.k1.hdfs.fileSuffix=.log

# File type: SequenceFile (default), DataStream, or CompressedStream

# DataStream writes uncompressed output; CompressedStream requires an available codec to be set via hdfs.codeC

f1.sinks.k1.hdfs.fileType = DataStream

# Compression codec, one of: gzip, bzip2, lzo, lzop, snappy

#f1.sinks.k1.hdfs.codeC

# Roll the file based on the number of events written to it (default 10; 0 = do not roll by event count)

f1.sinks.k1.hdfs.rollCount = 0

# Roll the HDFS file when it reaches 1 MB (default 1024 bytes; 0 = do not roll by size)

f1.sinks.k1.hdfs.rollSize = 1048576

# Roll the HDFS file every 60 seconds (default 30 seconds; 0 = do not roll by time)

f1.sinks.k1.hdfs.rollInterval = 60

# Timeout after which an inactive file is closed (0 = do not automatically close idle files; default 0)

#f1.sinks.k1.hdfs.idleTimeout

# Number of events flushed to HDFS per batch (default 100)

#f1.sinks.k1.hdfs.batchSize

# Format for sequence file records: Text or Writable (default)

#f1.sinks.k1.hdfs.writeFormat

# Use the local time instead of the timestamp from the event header (default false)

#f1.sinks.k1.hdfs.useLocalTimeStamp

 

# Bind the source and sink to the channel

f1.sources.r1.channels = c1

f1.sinks.k1.channel = c1

Run the agent:

flume-ng agent -n f1 -c conf -f /wishedu/testdata/flume/f1.conf -Dflume.root.logger=INFO,console

-n specifies the agent name (it must match the agent name used in the configuration file, here f1)

-f specifies the path to the edited configuration file
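To exercise this flow (a sketch; the file name access.log and the exact date directory are placeholders), drop a file into the monitored directory and then check HDFS:

# copy a log file into the spooling directory watched by the source
cp access.log /opt/test/

# the events end up under the date-partitioned HDFS path defined by hdfs.path
hdfs dfs -ls /flume/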

 

2.3 Exec + memory + logger

Create a configuration file f2.conf:

# Define the agent name and the names of its source, channel, and sink

a1.sources = r1

a1.channels = c1

a1.sinks = k1

# Define the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -f /wishedu/testdata/flume/logs/0.log

# Define the channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Define the sink

a1.sinks.k1.type = logger

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
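To test this example (a sketch; the configuration is assumed to be saved as /wishedu/testdata/flume/f2.conf), start the agent and append lines to the tailed file:

flume-ng agent -n a1 -c conf -f /wishedu/testdata/flume/f2.conf -Dflume.root.logger=INFO,console

# in another terminal, append a line to the tailed log file
echo "hello exec source" >> /wishedu/testdata/flume/logs/0.log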

III. Installation Configuration

3.1 Extract the installation package

tar -zxvf apache-flume-1.6.0-bin.tar.gz -C /wishedu/
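The archive extracts to a directory named apache-flume-1.6.0-bin; renaming it (an assumption here, since the original post does not show this step) keeps the path consistent with the FLUME_HOME used below:

mv /wishedu/apache-flume-1.6.0-bin /wishedu/flume-1.6.0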

3.2 Configure environment variables

vi /etc/profile

export FLUME_HOME=/wishedu/flume-1.6.0

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$FLUME_HOME/bin

Save and exit, then reload the profile:

source /etc/profile

Set JAVA_HOME in flume-env.sh.
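A sketch of that step (the template copy and the JDK path are assumptions; adjust the JDK path to your own installation):

cd /wishedu/flume-1.6.0/conf
cp flume-env.sh.template flume-env.sh
# then edit flume-env.sh and set, for example:
export JAVA_HOME=/usr/java/jdk1.7.0_79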

3.3 Verification

# View flume version:

[root@wishedu bin]# flume-ng version

Flume 1.6.0

Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git

Revision: 2561a23240a71ba20bf288c7c2cda88f443c2080

Compiled by hshreedharan on Mon May 11 11:15:44 PDT 2015

From source with checksum b29e416802ce9ece3269d34233baf43f

# If the above information appears, the installation was successful

IV. Notes

  4.1 When monitoring a directory

(1) After Flume processes a file in the monitored directory, it marks the file by appending the suffix .COMPLETED to its name; if Flume is restarted, it will not process that file again.

(2) A file that is being monitored must not be modified, otherwise Flume will raise an error.
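For example (a sketch; app.log is a made-up file name), after Flume ingests a file from the monitored directory it is renamed in place:

ls /opt/test/
# app.log.COMPLETED   <- the original app.log, marked as already processed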


Origin www.cnblogs.com/zxn0628/p/11318980.html