Flume

Flume is a real-time log collection system developed by Cloudera that has been widely recognized and adopted in industry. Its initial releases are now collectively referred to as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality grew, the drawbacks of Flume OG became apparent: a bloated codebase, poorly designed core components, and non-standard core configuration; in particular, the last OG release, 0.94.0, suffered from especially serious instability in log transmission. To solve these problems, on October 22, 2011 Cloudera completed Flume-728 and made milestone changes to Flume, refactoring the core components, core configuration, and code architecture. The refactored versions are collectively referred to as Flume NG (next generation). Another reason for the change was to bring Flume under the Apache umbrella, and Cloudera Flume was renamed Apache Flume. The IBM article "Flume NG: The First Revolution in Flume's History" describes the revolutionary change from Flume OG to Flume NG from the perspective of basic components and user experience. This article will not repeat those details, but briefly lists the main changes in Flume NG (1.x):

  • Sources and sinks are linked by channels.
  • Two main channel types: an in-memory channel (non-persistent, fast) and a JDBC-based channel (persistent).
  • Logical and physical nodes are no longer distinguished; all physical nodes are collectively called "agents", and each agent can run zero or more sources and sinks.
  • The master node and the dependency on ZooKeeper are no longer required, and the configuration file is simplified.
  • A plug-in architecture, with parts aimed at users and parts aimed at tool or system developers.
  • Using Thrift and Avro Flume sources, events can be sent from Flume 0.9.4 to Flume 1.x.

Note: the version of Flume used in this article is flume-1.4.0-cdh4.7.0, which can be used without any additional installation steps.

1. Some core concepts of Flume:

Component: Function
Agent: Runs Flume inside a JVM. One agent runs per machine, but an agent can contain multiple sources and sinks.
Client: Produces the data; runs in a separate thread.
Source: Collects data from the Client and passes it to the Channel.
Sink: Collects data from the Channel and delivers it to its destination; runs in a separate thread.
Channel: Connects Sources and Sinks, somewhat like a queue.
Events: The basic unit of data; can be log records, Avro objects, etc.

1.1 Data flow model

Flume uses the agent as its smallest independent unit of execution; an agent is a single JVM. A single agent consists of three components, Source, Sink and Channel, as shown in the following figure:

Figure 1: Agent component diagram

Flume's data flow is carried by events. An event is Flume's basic unit of data; it carries the log data (as a byte array) plus header information. Events are generated by sources external to the agent, such as the Web Server in the figure above. When the Source captures an event it applies specific formatting, then pushes the event into one or more Channels. You can think of the Channel as a buffer that holds events until the Sink has finished processing them. The Sink is responsible for persisting the log or pushing the event on to another Source.
It is a very straightforward design, and it is worth noting that Flume provides a large number of built-in Source, Channel and Sink types. Different types of Source, Channel and Sink can be freely combined; the combination is driven by the user's configuration file, which is very flexible. For example, a Channel can hold events temporarily in memory or persist them to the local hard disk, and a Sink can write logs to HDFS, HBase, or even to another Source.
If you think that is all Flume can do, you would be wrong. Flume lets users build multi-hop flows in which multiple agents work together, and it supports fan-in, fan-out, contextual routing and backup routes. As shown below:

A fan-out flow using a (multiplexing) channel selector
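
To make this concrete, here is a minimal configuration sketch of a fan-out flow using a multiplexing channel selector. It is only an illustration under assumed names (agent, r1, c1/c2, k1/k2) and an assumed header key "state"; none of these come from the article itself.

# Fan-out sketch: one source, two channels selected by an event header (illustrative)
agent.sources = r1
agent.channels = c1 c2
agent.sinks = k1 k2

# A netcat source, just to have something runnable to test with
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444

# Multiplexing selector: route by the value of the "state" event header;
# events without that header fall through to the default channel
agent.sources.r1.selector.type = multiplexing
agent.sources.r1.selector.header = state
agent.sources.r1.selector.mapping.US = c1
agent.sources.r1.selector.mapping.CN = c2
agent.sources.r1.selector.default = c1
agent.sources.r1.channels = c1 c2

# Two memory channels, one logger sink per branch
agent.channels.c1.type = memory
agent.channels.c2.type = memory
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1
agent.sinks.k2.type = logger
agent.sinks.k2.channel = c2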

1.2 High reliability

For software running in a production environment, high reliability is a must.
From the perspective of a single agent, Flume uses transaction-based delivery to guarantee that events are delivered reliably: the handover between Source and Sink is wrapped in transactions, and events are held in the Channel and not removed until they have been fully processed. This is the point-to-point reliability mechanism Flume provides.
From the perspective of a multi-hop flow, the sink of the upstream agent and the source of the downstream agent likewise run their own transactions to guarantee the reliability of the data.

1.3 Recoverability

Recoverability also relies on the Channel. FileChannel is recommended: events are persisted to the local file system (at the cost of lower performance).
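
For reference, a minimal FileChannel declaration might look like the sketch below; the agent name a1 and the paths are illustrative, and checkpointDir / dataDirs are where the channel persists its state and events on the local file system.

# Minimal FileChannel sketch (illustrative agent name and paths)
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data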

2. Introduction to the overall architecture of Flume

The overall Flume architecture is a three-tier source --> channel --> sink pipeline (see Figure 1 at the top), similar to a producer-consumer architecture, where data is passed through and decoupled by a queue (the channel).

Source: collects the log data, wraps it into events, and puts them into the Channel.
Channel: mainly provides queue-like functionality, simply buffering the data supplied by the Source.
Sink: takes the data out of the Channel and stores it in the corresponding file system or database, or submits it to a remote server.
The approach requiring the least change to an existing application is to read directly the log files the application already writes, which achieves essentially seamless integration without modifying the program.
For Sources that read files directly, there are two main options:

2.1 Exec source

Data is collected by running a Unix command; the most commonly used is tail -F [file].
This achieves near-real-time transmission, but data is lost while Flume is not running or when the command fails, and resuming from a breakpoint is not supported: because the position last read in the file is not recorded, there is no way to know where to continue reading on the next start. This matters especially when the log file keeps growing: if Flume's source goes down, whatever is appended to the log while the source is down can never be read by it. Flume does provide an ExecStream extension, with which you can write your own tool that watches for log growth and sends the appended lines to a Flume node, which then forwards them to the sink node. It would be more complete if a tail-style source itself recorded where it stopped when the node went down, so that transmission could resume from that point once the node comes back up.
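
For reference, a minimal exec source for the tail -F case might be configured as sketched below; the agent name a1 and the log path are illustrative.

# Exec source tailing a log file (illustrative names and path)
a1.sources = r1
a1.sources.r1.type = exec
a1.sources.r1.shell = /bin/bash -c
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1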

2.2 Spooling Directory Source

SpoolSource monitors a configured directory for newly added files and reads the data in those files, achieving quasi-real-time collection. Two points to note: 1. files copied into the spool directory must not be opened for editing afterwards; 2. the spool directory must not contain subdirectories. In practice it can be combined with log4j: configure log4j to roll the log file once per minute and copy the rolled file into the monitored spool directory (log4j has a TimeRolling plug-in that can roll files directly into the spool directory), which gets close to real-time monitoring. After Flume finishes transferring a file, it renames it with the suffix .COMPLETED (the suffix can also be changed in the configuration file).
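
A minimal Spooling Directory Source might be configured as sketched below; the agent name a1 and the directory are illustrative, and fileSuffix is the property that controls the .COMPLETED suffix mentioned above.

# Spooling directory source watching a drop directory (illustrative names and path)
a1.sources = s1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /data/logs/spool
a1.sources.s1.fileHeader = true
# Suffix appended to a file once it has been fully ingested (default is .COMPLETED)
a1.sources.s1.fileSuffix = .COMPLETED
a1.sources.s1.channels = c1
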
Comparison of ExecSource and SpoolSource: ExecSource can collect logs in real time, but when Flume is not running or the command errors out, the log data is lost and its integrity cannot be verified. SpoolSource cannot collect data in real time, but splitting files at one-minute intervals gets it close to real time. If the application cannot roll its log files by the minute, the two collection methods can be combined.
There are several Channel implementations: MemoryChannel, JDBC Channel, MemoryRecoverChannel and FileChannel. MemoryChannel achieves high throughput but cannot guarantee data integrity. MemoryRecoverChannel is deprecated; the official documentation recommends replacing it with FileChannel. FileChannel guarantees data integrity and consistency; when configuring it, it is recommended to put the directories used by the FileChannel and the directory holding the program's log files on different disks to improve efficiency.
For storage, the Sink can write data to the file system, a database or Hadoop. When the volume of log data is small, it can be kept on the file system with a suitable roll interval; when the volume is large, the log data can be stored in Hadoop to support later analysis.
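
For the small-volume case, a local file sink with a time-based roll can be sketched as below; the agent name a1, the directory and the interval are illustrative. Flume's file_roll sink writes events to the local file system and starts a new file every sink.rollInterval seconds.

# Local file_roll sink rolling every 10 minutes (illustrative names and path)
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /data/flume/out
a1.sinks.k1.sink.rollInterval = 600
a1.sinks.k1.channel = c1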

3. Common architecture and functional configuration examples

3.1 Let’s start with a simple one: single-node Flume configuration

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save the above configuration as: example.conf

Then we can start Flume:

bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

PS: -Dflume.root.logger=INFO,console is only for debugging; do not copy it into production as-is, or large volumes of log output will be dumped to the terminal.

-c/--conf specifies the configuration directory, -f/--conf-file the specific configuration file, and -n/--name the name of the agent.

Then we open another shell terminal window, telnet to the listening port, and send a message to see the effect:

$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK

The Flume terminal window will print the following information, indicating success:

12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }

So far, our first Flume Agent has been successfully deployed!

3.2 Single-node Flume directly writes to HDFS

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 100000
agent1.channels.ch1.keep-alive = 30

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
#agent1.sources.avro-source1.channels = ch1
#agent1.sources.avro-source1.type = avro
#agent1.sources.avro-source1.bind = 0.0.0.0
#agent1.sources.avro-source1.port = 41414
#agent1.sources.avro-source1.threads = 5

#define source monitor a file
agent1.sources.avro-source1.type = exec
agent1.sources.avro-source1.shell = /bin/bash -c
agent1.sources.avro-source1.command = tail -n +0 -F /home/storm/tmp/id.txt
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.threads = 5

# Define an HDFS sink that writes the events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = hdfs
agent1.sinks.log-sink1.hdfs.path = hdfs://192.168.1.111:8020/flumeTest
agent1.sinks.log-sink1.hdfs.writeFormat = Text
agent1.sinks.log-sink1.hdfs.fileType = DataStream
agent1.sinks.log-sink1.hdfs.rollInterval = 0
agent1.sinks.log-sink1.hdfs.rollSize = 1000000
agent1.sinks.log-sink1.hdfs.rollCount = 0
agent1.sinks.log-sink1.hdfs.batchSize = 1000
agent1.sinks.log-sink1.hdfs.txnEventMax = 1000
agent1.sinks.log-sink1.hdfs.callTimeout = 60000
agent1.sinks.log-sink1.hdfs.appendTimeout = 60000

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

Start the agent with the following command, and you can then see the results on HDFS.

../bin/flume-ng agent --conf ../conf/ -f flume_directHDFS.conf -n agent1 -Dflume.root.logger=INFO,console

PS: A common requirement in real environments is to tail logs on multiple agents, send them to a collector, and have the collector aggregate the data and write it to HDFS, rolling to a new file whenever the current file exceeds a certain size or a specified time interval has elapsed.
Flume implements two triggers for this: SizeTrigger (while writing to the HDFS output stream it counts the total bytes written; once a threshold is exceeded it creates a new file and output stream, points the write operation at the new stream, and closes the previous one) and TimeTrigger (a timer fires at the configured time, a new file and output stream are created, new writes are redirected to the new stream, and the previous one is closed).

3.3 Now for a common architecture: multi-agent aggregation written to HDFS

A fan-in flow using Avro RPC to consolidate events in one place

3.3.1 Configure the Flume client on each webserver log machine

# clientMainAgent
clientMainAgent.channels = c1
clientMainAgent.sources  = s1
clientMainAgent.sinks    = k1 k2
# clientMainAgent sinks group
clientMainAgent.sinkgroups = g1
# clientMainAgent Spooling Directory Source
clientMainAgent.sources.s1.type = spooldir
clientMainAgent.sources.s1.spoolDir  =/dsap/rawdata/
clientMainAgent.sources.s1.fileHeader = true
clientMainAgent.sources.s1.deletePolicy =immediate
clientMainAgent.sources.s1.batchSize =1000
clientMainAgent.sources.s1.channels =c1
clientMainAgent.sources.s1.deserializer.maxLineLength =1048576
# clientMainAgent FileChannel
clientMainAgent.channels.c1.type = file
clientMainAgent.channels.c1.checkpointDir = /var/flume/fchannel/spool/checkpoint
clientMainAgent.channels.c1.dataDirs = /var/flume/fchannel/spool/data
clientMainAgent.channels.c1.capacity = 200000000
clientMainAgent.channels.c1.keep-alive = 30
clientMainAgent.channels.c1.write-timeout = 30
clientMainAgent.channels.c1.checkpoint-timeout=600
# clientMainAgent Sinks
# k1 sink
clientMainAgent.sinks.k1.channel = c1
clientMainAgent.sinks.k1.type = avro
# connect to CollectorMainAgent
clientMainAgent.sinks.k1.hostname = flume115
clientMainAgent.sinks.k1.port = 41415 
# k2 sink
clientMainAgent.sinks.k2.channel = c1
clientMainAgent.sinks.k2.type = avro
# connect to CollectorBackupAgent
clientMainAgent.sinks.k2.hostname = flume116
clientMainAgent.sinks.k2.port = 41415
# clientMainAgent sinks group
clientMainAgent.sinkgroups.g1.sinks = k1 k2
# load_balance type
clientMainAgent.sinkgroups.g1.processor.type = load_balance
clientMainAgent.sinkgroups.g1.processor.backoff   = true
clientMainAgent.sinkgroups.g1.processor.selector  = random

../bin/flume-ng agent --conf ../conf/ -f flume_Consolidation.conf -n clientMainAgent -Dflume.root.logger=DEBUG,console

3.3.2 Configure the Flume server on the collector (sink) node

# collectorMainAgent
collectorMainAgent.channels = c2
collectorMainAgent.sources  = s2
collectorMainAgent.sinks    =k1 k2
# collectorMainAgent AvroSource
#
collectorMainAgent.sources.s2.type = avro
collectorMainAgent.sources.s2.bind = flume115
collectorMainAgent.sources.s2.port = 41415
collectorMainAgent.sources.s2.channels = c2

# collectorMainAgent FileChannel
#
collectorMainAgent.channels.c2.type = file
collectorMainAgent.channels.c2.checkpointDir =/opt/var/flume/fchannel/spool/checkpoint
collectorMainAgent.channels.c2.dataDirs = /opt/var/flume/fchannel/spool/data,/work/flume/fchannel/spool/data
collectorMainAgent.channels.c2.capacity = 200000000
collectorMainAgent.channels.c2.transactionCapacity=6000
collectorMainAgent.channels.c2.checkpointInterval=60000
# collectorMainAgent hdfsSink
collectorMainAgent.sinks.k2.type = hdfs
collectorMainAgent.sinks.k2.channel = c2
collectorMainAgent.sinks.k2.hdfs.path = hdfs://db-cdh-cluster/flume%{dir}
collectorMainAgent.sinks.k2.hdfs.filePrefix =k2_%{file}
collectorMainAgent.sinks.k2.hdfs.inUsePrefix =_
collectorMainAgent.sinks.k2.hdfs.inUseSuffix =.tmp
collectorMainAgent.sinks.k2.hdfs.rollSize = 0
collectorMainAgent.sinks.k2.hdfs.rollCount = 0
collectorMainAgent.sinks.k2.hdfs.rollInterval = 240
collectorMainAgent.sinks.k2.hdfs.writeFormat = Text
collectorMainAgent.sinks.k2.hdfs.fileType = DataStream
collectorMainAgent.sinks.k2.hdfs.batchSize = 6000
collectorMainAgent.sinks.k2.hdfs.callTimeout = 60000
collectorMainAgent.sinks.k1.type = hdfs
collectorMainAgent.sinks.k1.channel = c2
collectorMainAgent.sinks.k1.hdfs.path = hdfs://db-cdh-cluster/flume%{dir}
collectorMainAgent.sinks.k1.hdfs.filePrefix =k1_%{file}
collectorMainAgent.sinks.k1.hdfs.inUsePrefix =_
collectorMainAgent.sinks.k1.hdfs.inUseSuffix =.tmp
collectorMainAgent.sinks.k1.hdfs.rollSize = 0
collectorMainAgent.sinks.k1.hdfs.rollCount = 0
collectorMainAgent.sinks.k1.hdfs.rollInterval = 240
collectorMainAgent.sinks.k1.hdfs.writeFormat = Text
collectorMainAgent.sinks.k1.hdfs.fileType = DataStream
collectorMainAgent.sinks.k1.hdfs.batchSize = 6000
collectorMainAgent.sinks.k1.hdfs.callTimeout = 60000

../bin/flume-ng agent --conf ../conf/ -f flume_Consolidation.conf -n collectorMainAgent -Dflume.root.logger=DEBUG,console

The setup above follows a client/server-like architecture: each Flume client agent first forwards its machine's logs to the consolidation nodes, and those nodes write to HDFS, with load balancing between the two collectors. You can also configure a failover (high-availability) mode instead; a sketch follows.
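
For that high-availability variant, the load_balance sink processor in the client configuration above can be swapped for a failover processor, roughly as sketched below (the priorities are illustrative): the sink with the highest priority is used until it fails, and traffic then falls back to the other sink.

# Failover instead of load balancing between the two collector sinks (illustrative priorities)
clientMainAgent.sinkgroups = g1
clientMainAgent.sinkgroups.g1.sinks = k1 k2
clientMainAgent.sinkgroups.g1.processor.type = failover
clientMainAgent.sinkgroups.g1.processor.priority.k1 = 10
clientMainAgent.sinkgroups.g1.processor.priority.k2 = 5
# Time (ms) a failed sink is blacklisted before being retried
clientMainAgent.sinkgroups.g1.processor.maxpenalty = 10000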

4. Possible problems:

4.1 OOM problem:

Flume reports an error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
or:
java.lang.OutOfMemoryError: Java heap space
Exception in thread "SinkRunner-PollingRunner-DefaultSinkProcessor" java.lang.OutOfMemoryError: Java heap space

Flume starts with a default maximum heap size of only 20 MB, which easily leads to OOM in a production environment, so you need to add JVM startup parameters to flume-env.sh:

JAVA_OPTS="-Xms8192m -Xmx8192m -Xss256k -Xmn2g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"

You must then pass the -c conf option when starting the agent, otherwise the environment variables configured in flume-env.sh will not be loaded and will not take effect.

For details, see:

http://stackoverflow.com/questions/1393486/error-java-lang-outofmemoryerror-gc-overhead-limit-exceeded

http://marc.info/?l=flume-user&m=138933303305433&w=2

4.2 JDK version incompatibility issues:

2014-07-07 14:44:17,902 (agent-shutdown-hook) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.stop(HDFSEventSink.java:504)] Exception while closing hdfs://192.168.1.111:8020/flumeTest/FlumeData. Exception follows.
java.lang.UnsupportedOperationException: This is supposed to be overridden by subclasses.
        at com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
        at com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:193)

Try switching from JDK 7 back to JDK 6.

4.3 The problem of delay in writing small files to HDFS

As explained in Section 3.2 above, Flume's HDFS sink implements several roll (persistence) triggers: by size, by time interval, by number of events, and so on. If your files are too small and never seem to be persisted to HDFS, it is because the roll conditions have not yet been met, for example the configured number of events or file size has not been reached. You can tune the configuration from Section 3.2, for example:

agent1.sinks.log-sink1.hdfs.rollInterval = 20

This way, when no new logs are being generated and you want a quick flush, the file is rolled and persisted every 20 seconds; when several roll conditions are configured, the agent rolls as soon as the first one is satisfied.

Here are some common persistent triggers:

# Number of seconds to wait before rolling the current file (here 600 seconds)
agent.sinks.sink.hdfs.rollInterval=600

# File size to trigger a roll, in bytes (here 256 MB)
agent.sinks.sink.hdfs.rollSize = 268435456

# never roll based on number of events
agent.sinks.sink.hdfs.rollCount = 0

# Timeout after which inactive files get closed (in seconds)
agent.sinks.sink.hdfs.idleTimeout = 3600

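# Number of events written to the file before it is flushed to HDFS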
agent.sinks.sink.hdfs.batchSize = 1000

For more information on sink triggering mechanism and parameter configuration, please refer to: http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

http://stackoverflow.com/questions/20638498/flume-not-writing-to-hdfs-unless-killed

Note: small files should be avoided on HDFS, so be careful with the roll (persistence) triggers you configure.

4.4 Repeated writing and loss of data

Flume's HDFS sink uses transactions when writing data taken from the Channel: when a transaction fails it is rolled back and retried. However, because HDFS files cannot be modified in place, suppose 10,000 lines are to be written to HDFS and a network problem causes the write to fail after 5,000 lines: the transaction rolls back, and when the 10,000 records are then rewritten successfully, the first 5,000 lines end up duplicated. This is a consequence of the design of the HDFS file system rather than a bug that can simply be fixed; the practical options are to turn off batch writes (one event per transaction), or to enable monitoring and reconcile record counts at both ends.

The memory channel and the exec source may lose data; the file channel is reliable end to end, but its performance is worse than the former two.

In the end-to-end / store-on-failure modes, setting the ACK confirmation timeout too short (especially during peak hours) may also lead to duplicate writes.

4.5 The problem of tail resuming from a breakpoint:

You can record the line number reached while tailing, and on the next start resume transmission from the last recorded position, similar to:

agent1.sources.avro-source1.command = /usr/local/bin/tail  -n +$(tail -n1 /home/storm/tmp/n) --max-unchanged-stats=600 -F  /home/storm/tmp/id.txt | awk 'ARGIND==1{i=$0;next}{i++; if($0~/file truncated/)i=0; print i >> "/home/storm/tmp/n";print $1"---"i}' /home/storm/tmp/n -

The following points should be noted:

(1) When the log file is rotated, you need to update your breakpoint "pointer" record at the same time.

(2) You need to track files by file name.

(3) After Flume goes down, you need to resume from the recorded breakpoint "pointer" when it is brought back up.

(4) If the file happens to be rotated while Flume is down, there is a risk of data loss; you can only monitor the process and restart it as quickly as possible, or add logic that checks the file size and resets the pointer.

(5) Pay attention to your version of tail; update the coreutils package to the latest version.

4.6 How to modify, discard, and store data according to predefined rules in Flume?

To meet these requirements you need the Interceptor mechanism provided by Flume. A minimal configuration sketch is shown below; for more detail, refer to the links that follow:
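
As an illustration only (the agent/source names and the regex are assumptions, not taken from the article), Flume's built-in timestamp, host and regex_filter interceptors can be chained on a source like this; regex_filter with excludeEvents = true drops any event whose body matches the pattern:

# Interceptor chain on source r1: add timestamp/host headers, drop DEBUG lines (illustrative)
a1.sources.r1.interceptors = i1 i2 i3
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ^DEBUG
a1.sources.r1.interceptors.i3.excludeEvents = true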

(1) Interceptor for Flume-NG source code reading (original)

http://www.cnblogs.com/lxf20061900/p/3664602.html

(2) Flume-NG custom interceptor

http://sep10.com/posts/2014/04/15/flume-interceptor/

(3) Flume-ng production environment practice (4) Implement log format interceptor

http://blog.csdn.net/rjhym/article/details/8450728

(4) How does flume-ng output to the HDFS file name according to the source file name

http://abloz.com/2013/02/19/flume-ng-output-according-to-the-source-file-name-to-the-hdfs-file-name.html

5. References:

(1) Comparison of scribe, chukwa, kafka, and flume log systems

http://www.ttlsa.com/log-system/scribe-chukwa-kafka-flume-log-system-contrast/

(2) Those things about Flume-ng  http://www.ttlsa.com/?s=flume

About Flume-ng (3): Common Architecture Test  http://www.ttlsa.com/log-system/about-flume-ng-3/

(3)Flume 1.4.0 User Guide

http://archive.cloudera.com/cdh4/cdh/4/flume-ng-1.4.0-cdh4.7.0/FlumeUserGuide.html

(4) Flume log collection  http://blog.csdn.net/sunmeng_007/article/details/9762507

(5) Flume-NG + HDFS + HIVE log collection analysis

http://eyelublog.wordpress.com/2013/01/13/flume-ng-hdfs-hive-%E6%97%A5%E5%BF%97%E6%94%B6%E9%9B%86%E5%88%86%E6%9E%90/

(6) [Twitter Storm series] flume-ng+Kafka+Storm+HDFS real-time system construction

http://blog.csdn.net/weijonathan/article/details/18301321

(7) Flume-NG + HDFS + PIG log collection and analysis

http://hi.baidu.com/life_to_you/item/a98e2ec3367486dbef183b5e

Flume example one collects tomcat logs  http://my.oschina.net/88sys/blog/71529

flume-ng multi-node cluster example  http://my.oschina.net/u/1401580/blog/204052

Try flume-ng 1.1  http://heipark.iteye.com/blog/1617995

(8)Flafka: Apache Flume Meets Apache Kafka for Event Processing

http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

(9) The principle and use of Flume-ng

http://segmentfault.com/blog/javachen/1190000002532284

(10) Meituan log collection system based on Flume (1) Architecture and design

http://tech.meituan.com/mt-log-system-arch.html

(11) Flume-based Meituan log collection system (2) Improvement and optimization

http://tech.meituan.com/mt-log-system-optimization.html

(12)How-to: Do Real-Time Log Analytics with Apache Kafka, Cloudera Search, and Hue

http://blog.cloudera.com/blog/2015/02/how-to-do-real-time-log-analytics-with-apache-kafka-cloudera-search-and-hue/

(13)Real-time analytics in Apache Flume - Part 1

http://jameskinley.tumblr.com/post/57704266739/real-time-analytics-in-apache-flume-part-1
