Flume, Kafka, and HDFS integration
First, commonly used components in Flume development: source, channel, and sink, all arranged within an agent
1、channel
| channel | description |
| --- | --- |
| file | Stores events in a transaction log on the local file system. Persistent: once an event is written to the channel, it survives even an agent restart. agent1.channels.channel1.type = file |
| memory | Caches events in memory with no persistence: if the agent restarts, events are lost, which is acceptable in some cases. Compared with the file channel, the memory channel is faster and has higher throughput. agent1.channels.channel1.type = memory |
| jdbc | Stores events in a database |
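For reference, a minimal file-channel sketch; the checkpoint and data directory paths below are illustrative assumptions, not values from this post:

# File channel with explicit checkpoint and data directories (paths are examples)
agent1.channels = channel1
agent1.channels.channel1.type = file
agent1.channels.channel1.checkpointDir = /var/flume/checkpoint
agent1.channels.channel1.dataDirs = /var/flume/data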
2、sink
| sink | description |
| --- | --- |
| HDFS | Writes events to HDFS as text, sequence files, Avro, or a custom format |
| HBase | Writes data to HBase using a configurable serializer |
| Logger | Logs events at INFO level via SLF4J; mainly used for testing |
| Kafka | Writes events to a Kafka topic (message queue) |
| Elasticsearch | Writes events to an Elasticsearch cluster in Logstash format |
| Null | Discards all events |
| Avro | Sends events over Avro RPC to an Avro source |
| File roll | Writes events to the local file system |
| Hive | Writes events in a fixed, delimited format into a Hive table or partition |
| IRC | Sends events to an IRC channel |
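For the Flume-to-Kafka direction the title mentions, a minimal Kafka sink sketch, assuming Flume 1.7+ property names; the broker address and topic are illustrative:

# Kafka sink: forwards events from channel1 to a Kafka topic (example values)
agent1.sinks = sink1
agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.sink1.kafka.bootstrap.servers = localhost:9092
agent1.sinks.sink1.kafka.topic = flume-events
agent1.sinks.sink1.channel = channel1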
3、source
| source | description |
| --- | --- |
| Avro | Listens on a port for events arriving from an Avro sink or the Flume SDK over Avro RPC |
| Exec | Runs a Unix command (e.g. tail -F /path/to/file) and converts each line read from standard output into an event. Note that this source cannot guarantee events are delivered to the channel; a spooling directory source or the Flume SDK may be a better choice. |
| HTTP | Listens on a port and converts HTTP requests into events using a pluggable handler, such as the JSON or binary-data handler |
| JMS | Reads messages from a JMS Queue or Topic and converts them into events |
| Spooling directory (spooldir) | Reads files placed in a spooling directory line by line and converts them into events |
| Netcat | Listens on a port and converts each line of text into an event |
| Syslog | Reads syslog lines and converts them into events |
| Thrift | Listens on a port for events arriving from a Thrift sink or the Flume SDK over Thrift RPC |
| Sequence generator | Generates events from an incrementing counter; mainly used for testing |
| Kafka | Reads messages from a Kafka topic and converts them into events |
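And the Kafka-to-Flume direction, a minimal Kafka source sketch, again assuming Flume 1.7+ property names; the broker address, topic, and group id are illustrative:

# Kafka source: consumes a topic and turns each message into an event (example values)
agent1.sources = source1
agent1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.source1.kafka.bootstrap.servers = localhost:9092
agent1.sources.source1.kafka.topics = flume-events
agent1.sources.source1.kafka.consumer.group.id = flume
agent1.sources.source1.channels = channel1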
4、interceptor
| interceptor | description |
| --- | --- |
| Timestamp | Adds a timestamp header to each event passing through the source, containing the time (in ms) at which the agent processed the event. See the separate Flume interceptor post for details. agent1.sources.source1.interceptors = interceptor1, agent1.sources.source1.interceptors.interceptor1.type = timestamp |
| UUID | Sets an ID header on every event: a globally unique identifier, useful later for removing duplicate data. agent1.sources.source1.interceptors.interceptor1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
| static | Sets a fixed header with a fixed value on all events; see the official documentation for details |
| Host | Sets a Host header on all events containing the agent's host name or IP address |
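Interceptors can also be chained on a single source. A sketch combining the static and Host interceptors from the table; the header key and value are illustrative assumptions:

# Chain a static header and the agent's host name onto every event
agent1.sources.source1.interceptors = static1 host1
agent1.sources.source1.interceptors.static1.type = static
agent1.sources.source1.interceptors.static1.key = datacenter
agent1.sources.source1.interceptors.static1.value = dc1
agent1.sources.source1.interceptors.host1.type = host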
Second, transactions and reliability
Flume uses two separate transactions: from source to channel, and from channel to sink.
The spooling directory source creates one event per line of a file; only once every event in the transaction has been committed to the channel does the source mark the file as completed.
Transactions work the same way from channel to sink: if an event cannot be delivered, the transaction is rolled back and all of its events stay in the channel.
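As a concrete illustration of "marked as completed": by default the spooldir source renames a fully committed file with a .COMPLETED suffix. A sketch, reusing the spool directory path from the fan-out example below:

# Spooling directory source: a file is renamed only after all of its
# events have been committed to the channel
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log
agent1.sources.source1.fileSuffix = .COMPLETED
agent1.sources.source1.channels = channel1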
Third, partitioning and interceptors
Large datasets are often organized into partitions. The benefit: a query that touches only a subset of the data (for example, when removing duplicate events) can be restricted to the relevant partitions. Flume event data is usually partitioned by time.
This requires changing the hdfs.path setting in the configuration file:
agent1.sinks.sink1.hdfs.path = /temp/flume/year=%Y/month=%m/day=%d
The %Y/%m/%d escapes are resolved from a timestamp header, which the source does not set by itself; a Flume interceptor has to add it.
Adding a timestamp interceptor to source1 adds a timestamp header to each event before it enters the channel:
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
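Putting the pieces together, a minimal sketch of a time-partitioned write; the source type and all paths reuse values that appear elsewhere in this post:

# Time-partitioned HDFS writes: the timestamp interceptor supplies the header
# that the %Y/%m/%d escapes in hdfs.path are resolved against
agent1.sources = source1
agent1.channels = channel1
agent1.sinks = sink1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/log
agent1.sources.source1.channels = channel1
agent1.sources.source1.interceptors = interceptor1
agent1.sources.source1.interceptors.interceptor1.type = timestamp
agent1.channels.channel1.type = file
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hdfs.path = /temp/flume/year=%Y/month=%m/day=%d
agent1.sinks.sink1.hdfs.fileType = DataStream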
Fourth, fan out
Fan out means delivering events from one source to multiple channels, and hence to multiple sinks.
Example: a Flume configuration that fans out a spooling directory source to an HDFS sink and a logger sink
# Fan out a spooling directory source to an HDFS sink and a logger sink
ag1.sources = source1
ag1.sinks = sink1a sink1b
ag1.channels = channel1a channel1b
ag1.sources.source1.channels = channel1a channel1b
ag1.sinks.sink1a.channel = channel1a
ag1.sinks.sink1b.channel = channel1b
ag1.sources.source1.type = spooldir
ag1.sources.source1.spoolDir = /root/log
ag1.sinks.sink1a.type = hdfs
ag1.sinks.sink1a.hdfs.path = /temp/flume
ag1.sinks.sink1a.hdfs.filePrefix = events
ag1.sinks.sink1a.hdfs.fileSuffix = .log
ag1.sinks.sink1a.hdfs.fileType = DataStream
ag1.sinks.sink1b.type = logger
ag1.channels.channel1a.type = file
ag1.channels.channel1b.type = memory
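To try this configuration, save it to a file (say fanout.conf; the name is an assumption) and start the agent with the flume-ng launcher:

flume-ng agent --conf ./conf --conf-file fanout.conf --name ag1 -Dflume.root.logger=INFO,console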