Flume common component configuration (II)

Notes from learning the Flume framework

Flume, Kafka, and HDFS integration

I. Configuring the commonly used Flume components: source, channel, and sink

agent: the process that hosts the sources, channels, and sinks configured below.

1. Channel

file

Events are stored in a transaction log on the local file system, so they are persistent: once an event has been written to the channel, it will not be lost even if the agent is restarted.

agent1.channels.channel1.type = file

memory

Events are cached in memory, with no persistent storage; if the agent is restarted, the events are lost. In some cases this is acceptable. Compared with the file channel, the memory channel is faster and has higher throughput.

agent1.channels.channel1.type = memory

jdbc

Events are stored in a database.
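
For reference, a minimal sketch of the two most common channel configurations, assuming an agent named agent1; the directory paths and capacity numbers are illustrative, not defaults:

# File channel: persists events in a local transaction log.
# The checkpoint and data directories below are illustrative paths.
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /var/flume/checkpoint
agent1.channels.channel1.dataDirs = /var/flume/data

# Memory channel: fast, but events are lost if the agent restarts.
agent1.channels.channel2.type = memory
agent1.channels.channel2.capacity = 10000
agent1.channels.channel2.transactionCapacity = 1000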

2. Sink

The main sink types and what they do:

HDFS: writes events to HDFS as text, sequence files, Avro, or a custom format.
HBase: writes events to HBase using a serializer.
Logger: logs events at INFO level using SLF4J; mainly used for testing.
Kafka: writes events to a Kafka topic (message queue).
Elasticsearch: writes events to an Elasticsearch cluster in Logstash format.
Null: discards all events.
Avro: sends events to another agent's Avro source over Avro RPC.
File roll: writes events to the local file system.
Hive: writes events in a fixed format to a Hive table or partition.
IRC: sends events to an IRC channel.
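
As a quick illustration of the Avro sink, here is a minimal sketch that forwards events to another agent over Avro RPC; the hostname and port are illustrative assumptions:

# Avro sink: sends events to the Avro source of a downstream agent.
# Hostname and port below are illustrative.
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hostname = collector.example.com
agent1.sinks.sink1.port = 4141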

 

3. Source

The main source types and what they do:

Avro: listens on a port for events sent over Avro RPC by an Avro sink or the Flume SDK.
Exec: runs a Unix command (for example tail -F /path/to/file) and converts each line read from standard output into an event. Note, however, that this source cannot guarantee that events are delivered to the channel; a spooling directory source or the Flume SDK may be a better choice.
HTTP: listens on a port and converts HTTP requests into events using a pluggable handler, such as a JSON handler or a binary data handler.
JMS: reads messages from a JMS queue or topic and converts them into events.
Spooling directory: reads files placed in a spooling directory line by line and converts each line into an event.
Netcat: listens on a port and converts each line of text into an event.
Syslog: reads syslog lines and converts them into events.
Thrift: listens on a port for events sent over Thrift RPC by a Thrift sink or the Flume SDK.
Sequence generator: generates events from an incrementing counter; mainly used for testing.
Kafka: reads messages from a Kafka topic and converts them into events.
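
As a quick illustration of the Netcat source, here is a minimal sketch that turns each line of text received on a TCP port into an event; the bind address and port are illustrative assumptions:

# Netcat source: each line received on the port becomes one event.
# Bind address and port below are illustrative.
agent1.sources.source1.type = netcat
agent1.sources.source1.bind = 0.0.0.0
agent1.sources.source1.port = 44444
agent1.sources.source1.channels = channel1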

4. Interceptor

The commonly used interceptors and what they do:

Timestamp

Adds a timestamp header to every event passing through the source; the header contains the time at which the agent processed the event, in milliseconds. See the separate post on Flume interceptors for details.

   agent1.sources.source1.interceptors = interceptor1

   agent1.sources.source1.interceptors.interceptor1.type = timestamp

UUID

Sets an ID header on every event; the value is a globally unique identifier, which is useful later for removing duplicate events.

agent1.sources.source1.interceptors.interceptor1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

Static

Sets a fixed header with a fixed value on all events. See the official documentation for details.

Host

Sets a host header on all events, containing the agent's hostname or IP address.
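
For reference, a minimal sketch that chains a static interceptor and a host interceptor on one source; the header key and value used by the static interceptor are illustrative assumptions:

# Two interceptors applied in order: a fixed header, then the host header.
agent1.sources.source1.interceptors = i1 i2

# Static interceptor: the key and value below are illustrative.
agent1.sources.source1.interceptors.i1.type = static
agent1.sources.source1.interceptors.i1.key = datacenter
agent1.sources.source1.interceptors.i1.value = dc1

# Host interceptor: adds the agent's hostname (or IP address) as a header.
agent1.sources.source1.interceptors.i2.type = host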

II. Transactions and reliability

Flume has two independent transactions: one from source to channel, and one from channel to sink.

The spooling directory source creates one event for each line of a file; once all the events in a transaction have been committed to the channel, the source marks the file as complete.

The transaction from channel to sink works in a similar way: if the events cannot be recorded, the transaction is rolled back and all of them remain in the channel.

III. Partitioning and interceptors

Large datasets are often organized into partitions. The benefit is that when a query involves only a subset of the data, its scope can be limited to specific partitions (for example, when removing duplicate events). Flume event data is usually partitioned by time.

To do this, change the hdfs.path setting in the configuration file:

agent1.sinks.sink1.hdfs.path = /temp/flume/year=%Y/month=%m/day=%d

The source does not set a timestamp header on events by itself; a Flume interceptor is needed to do this.

Adding a timestamp interceptor to source1 will add a timestamp header to each event before it is delivered to the channel:

agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
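
As a side note, the %Y/%m/%d escapes in hdfs.path are resolved from the timestamp header; if no timestamp interceptor is configured, the HDFS sink can instead be told to use the agent's local time:

agent1.sinks.sink1.hdfs.useLocalTimeStamp = true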

IV. Fan-out

Fan-out means delivering events from one source to multiple channels, and therefore to multiple sinks.

Example: a Flume configuration in which a spooling directory source fans out to an HDFS sink and a logger sink.

# Flume configuration: a spooling directory source fans out to an HDFS sink and a logger sink
ag1.sources = source1
ag1.sinks = sink1a sink1b
ag1.channels = channel1a channel1b

ag1.sources.source1.channels = channel1a channel1b
ag1.sinks.sink1a.channel = channel1a
ag1.sinks.sink1b.channel = channel1b

ag1.sources.source1.type = spooldir
ag1.sources.source1.spoolDir = /root/log

ag1.sinks.sink1a.type=hdfs
ag1.sinks.sink1a.hdfs.path = /temp/flume
ag1.sinks.sink1a.hdfs.filePrefix = events
ag1.sinks.sink1a.hdfs.fileSuffix = .log
ag1.sinks.sink1a.hdfs.fileType = DataStream

ag1.sinks.sink1b.type=logger

ag1.channels.channel1a.type = file
ag1.channels.channel1b.type = memory
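
Assuming this file is saved as ag1.properties (an illustrative name) in Flume's conf directory, the agent can be started with the standard flume-ng launcher; the --name argument must match the agent name used in the property keys:

flume-ng agent --conf conf --conf-file conf/ag1.properties --name ag1 -Dflume.root.logger=INFO,console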

 
