Flume, Kafka, and HDFS integration
First, commonly used components in Flume development: source, channel, and sink, all arranged within an agent
1、channel
| channel | description |
| --- | --- |
| file | Stores events in a transaction log on the local file system. Persistent: once an event is written to the channel, it survives even an agent restart. agent1.channels.channel1.type = file |
| memory | Caches events in memory with no persistence: if the agent restarts, events are lost, which is acceptable in some cases. Compared with the file channel, the memory channel is faster and has higher throughput. agent1.channels.channel1.type = memory |
| jdbc | Stores events in a database |
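For reference, a minimal file-channel sketch; the checkpoint and data directory paths below are illustrative assumptions, not values from this post:

# File channel with explicit checkpoint and data directories (paths are examples)
agent1.channels = channel1
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /var/flume/checkpoint
agent1.channels.channel1.dataDirs = /var/flume/data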
2、sink
| sink | description |
| --- | --- |
| HDFS | Writes events to HDFS as text, sequence files, Avro, or a custom format |
| HBase | Writes data to HBase using a configurable serializer |
| Logger | Logs events at INFO level via SLF4J; mainly used for testing |
| Kafka | Writes events to a Kafka topic (message queue) |
| Elasticsearch | Writes events to an Elasticsearch cluster in Logstash format |
| Null | Discards all events |
| Avro | Sends events over Avro RPC to an Avro source |
| File roll | Writes events to the local file system |
| Hive | Writes events in a fixed, delimited format into a Hive table or partition |
| IRC | Sends events to an IRC channel |
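For the Flume-to-Kafka direction the title mentions, a minimal Kafka sink sketch, assuming Flume 1.7+ property names; the broker address and topic are illustrative:

# Kafka sink: forwards events from channel1 to a Kafka topic (example values)
agent1.sinks = sink1
agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.sink1.kafka.bootstrap.servers = localhost:9092
agent1.sinks.sink1.kafka.topic = flume-events
agent1.sinks.sink1.channel = channel1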
3、source
| source | description |
| --- | --- |
| Avro | Listens on a port for events arriving from an Avro sink or the Flume SDK over Avro RPC |
| Exec | Runs a Unix command (e.g. tail -F /path/to/file) and converts each line read from standard output into an event. Note that this source cannot guarantee events are delivered to the channel; a spooling directory source or the Flume SDK may be a better choice. |
| HTTP | Listens on a port and converts HTTP requests into events using a pluggable handler, such as the JSON or binary-data handler |
| JMS | Reads messages from a JMS Queue or Topic and converts them into events |
| Spooling directory (spooldir) | Reads files placed in a spooling directory line by line and converts them into events |
| Netcat | Listens on a port and converts each line of text into an event |
| Syslog | Reads syslog lines and converts them into events |
| Thrift | Listens on a port for events arriving from a Thrift sink or the Flume SDK over Thrift RPC |
| Sequence generator | Generates events from an incrementing counter; mainly used for testing |
| Kafka | Reads messages from a Kafka topic and converts them into events |
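And the Kafka-to-Flume direction, a minimal Kafka source sketch, again assuming Flume 1.7+ property names; the broker address, topic, and group id are illustrative:

# Kafka source: consumes a topic and turns each message into an event (example values)
agent1.sources = source1
agent1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.source1.kafka.bootstrap.servers = localhost:9092
agent1.sources.source1.kafka.topics = flume-events
agent1.sources.source1.kafka.consumer.group.id = flume
agent1.sources.source1.channels = channel1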
4、interceptor
| interceptor | description |
| --- | --- |
| Timestamp | Adds a timestamp header to each event passing through the source, containing the time (in ms) at which the agent processed the event. See the separate Flume interceptor post for details. agent1.sources.source1.interceptors = interceptor1, agent1.sources.source1.interceptors.interceptor1.type = timestamp |
| UUID | Sets an ID header on every event: a globally unique identifier, useful later for removing duplicate data. agent1.sources.source1.interceptors.interceptor1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
| static | Sets a fixed header with a fixed value on all events; see the official documentation for details |
| Host | Sets a Host header on all events containing the agent's host name or IP address |
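Interceptors can also be chained on a single source. A sketch combining the static and Host interceptors from the table; the header key and value are illustrative assumptions:

# Chain a static header and the agent's host name onto every event
agent1.sources.source1.interceptors = static1 host1
agent1.sources.source1.interceptors.static1.type = static
agent1.sources.source1.interceptors.static1.key = datacenter
agent1.sources.source1.interceptors.static1.value = dc1
agent1.sources.source1.interceptors.host1.type = host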
Second, transactions and reliability
Flume uses two separate transactions: from source to channel, and from channel to sink.
The spooling directory source creates one event per line of a file; only once every event in the transaction has been committed to the channel does the source mark the file as completed.
Transactions work the same way from channel to sink: if an event cannot be delivered, the transaction is rolled back and all of its events stay in the channel.
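As a concrete illustration of "marked as completed": by default the spooldir source renames a fully committed file with a .COMPLETED suffix. A sketch, reusing the spool directory path from the fan-out example below:

# Spooling directory source: a file is renamed only after all of its
# events have been committed to the channel
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log
agent1.sources.source1.fileSuffix = .COMPLETED
agent1.sources.source1.channels = channel1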
Third, partitioning and interceptors
Large datasets are often organized into partitions. The benefit: a query that touches only a subset of the data (for example, when removing duplicate events) can be restricted to the relevant partitions. Flume event data is usually partitioned by time.
This requires changing the hdfs.path setting in the configuration file:
agent1.sinks.sink1.hdfs.path = /temp/flume/year=%Y/month=%m/day=%d
The %Y/%m/%d escapes are resolved from a timestamp header, which the source does not set by itself; a Flume interceptor has to add it.
Adding a timestamp interceptor to source1 adds a timestamp header to each event before it enters the channel:
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
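Putting the pieces together, a minimal sketch of a time-partitioned write; the source type and all paths reuse values that appear elsewhere in this post:

# Time-partitioned HDFS writes: the timestamp interceptor supplies the header
# that the %Y/%m/%d escapes in hdfs.path are resolved against
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log
agent1.sources.source1.channels = channel1
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
agent1.channels.channel1.type = file
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hdfs.path = /temp/flume/year=%Y/month=%m/day=%d
agent1.sinks.sink1.hdfs.fileType = DataStream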
Fourth, fan out
Fan out means delivering events from one source to multiple channels, and hence to multiple sinks.
Example: a Flume configuration that fans out a spooling directory source to an HDFS sink and a logger sink
# Fan out a spooling directory source to an HDFS sink and a logger sink
ag1.sources = source1
ag1.sinks = sink1a sink1b
ag1.channels = channel1a channel1b
ag1.sources.source1.channels = channel1a channel1b
ag1.sinks.sink1a.channel = channel1a
ag1.sinks.sink1b.channel = channel1b
ag1.sources.source1.type = spooldir
ag1.sources.source1.spoolDir = /root/log
ag1.sinks.sink1a.type = hdfs
ag1.sinks.sink1a.hdfs.path = /temp/flume
ag1.sinks.sink1a.hdfs.filePrefix = events
ag1.sinks.sink1a.hdfs.fileSuffix = .log
ag1.sinks.sink1a.hdfs.fileType = DataStream
ag1.sinks.sink1b.type = logger
ag1.channels.channel1a.type = file
ag1.channels.channel1b.type = memory
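To try this configuration, save it to a file (say fanout.conf; the name is an assumption) and start the agent with the flume-ng launcher:

flume-ng agent --conf ./conf --conf-file fanout.conf --name ag1 -Dflume.root.logger=INFO,console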