How to choose between Flume and Kafka

The data collection layer is mainly built with one of two technologies: Flume or Kafka.

Flume: Flume is a pipeline-style data flow tool. It ships with many default source, channel, and sink implementations, so users can deploy it through configuration parameters alone, and it exposes an API for writing custom extensions.
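As a minimal sketch of that parameter-driven style (the agent name, file path, and HDFS URL below are placeholders, not from the original article), a Flume agent that tails a log file into HDFS can be wired together entirely in a properties file:

```properties
# Hypothetical agent "a1": tail a local log file and write it to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source: run a command and turn each output line into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# memory channel buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: roll events into files under a date-partitioned path
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

The agent is then started with `flume-ng agent --conf conf --conf-file example.conf --name a1`; no code is written unless the built-in components fall short.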

Kafka: Kafka is a persistent distributed message queue.
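Kafka, by contrast, is written to and read from through its client API. A minimal Java producer sketch (the broker address and topic name are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // append one message to the hypothetical "app-logs" topic
            producer.send(new ProducerRecord<>("app-logs", "host-1", "a log line"));
        }
    }
}
```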


  • Kafka is a very general system: you can have many producers and many consumers sharing multiple topics. In contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates with Hadoop's security features. Therefore, Cloudera recommends Kafka if the data will be consumed by multiple systems, and Flume if the data is destined for Hadoop.
  • Flume has many sources and sinks built in. Kafka, by contrast, has a much smaller ecosystem of ready-made producers and consumers, and community support for them is still thin. Hopefully this will improve over time, but for now, using Kafka means being prepared to write your own producer and consumer code (a minimal consumer sketch follows this list). Use Flume if the existing Flume sources and sinks meet your needs and you prefer a system that requires no development.

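To make the "write your own consumer code" point concrete, here is a minimal Java consumer sketch against the same hypothetical `app-logs` topic (the group id is arbitrary; `poll(Duration)` assumes a reasonably recent Kafka client):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "log-readers");             // hypothetical group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("app-logs"));
            while (true) {
                // block up to 500 ms waiting for new records
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```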

  • Flume can process data in flight using interceptors, which are useful for data masking or filtering; a sketch of a masking interceptor appears below. Kafka requires an external stream processing system to do the same.

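A rough sketch of the masking case, built on Flume's `Interceptor` API (the class name and the regex are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical interceptor that masks long digit runs (e.g. phone numbers)
// in each event body before the event reaches the sink.
public class MaskingInterceptor implements Interceptor {
    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        String masked = body.replaceAll("\\d{6,}", "******");
        event.setBody(masked.getBytes(StandardCharsets.UTF_8));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event); // mask each event body in place
        }
        return events;
    }

    @Override public void close() { }

    // Flume instantiates interceptors through a Builder named in the config.
    public static class Builder implements Interceptor.Builder {
        @Override public Interceptor build() { return new MaskingInterceptor(); }
        @Override public void configure(Context context) { }
    }
}
```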

  • Both Kafka and Flume are reliable systems that can guarantee zero data loss with proper configuration. However, Flume does not replicate events. If a disk on a Flume agent node fails, then even with the reliable file channel you lose access to those events until the disk is recovered. A Kafka pipeline with replicated topics does not have this problem, so if you need a highly available pipeline, Kafka is the better choice.

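The replication that gives Kafka this durability is set per topic when it is created, for example (flag syntax varies by Kafka version; `--bootstrap-server` is the newer form):

```bash
bin/kafka-topics.sh --create --topic app-logs \
    --partitions 3 --replication-factor 3 \
    --bootstrap-server localhost:9092
```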

  • Flume and Kafka work well together. If your design requires streaming data from Kafka to Hadoop, you can use a Flume agent with a Kafka source configured to read the data (see the config sketch below): you don't have to implement your own consumer, and you get all the benefits of Flume's HDFS and HBase integration. You can use Cloudera Manager to monitor the consumers, and you can even add interceptors for some stream processing along the way.

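A sketch of that wiring, assuming a reasonably recent Flume 1.x release (names and addresses are placeholders; older releases configured the Kafka source via ZooKeeper properties instead):

```properties
# Hypothetical agent "a1": read from Kafka, write to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Flume's built-in Kafka source acts as the consumer
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = app-logs
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

# reuse Flume's HDFS sink instead of writing consumer code
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/kafka/app-logs
a1.sinks.k1.hdfs.fileType = DataStream
```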

Flume and Kafka are commonly used in combination, usually as Flume + Kafka, with Flume collecting the data and Kafka buffering it. Conversely, if you want to reuse Flume's existing HDFS-writing capability, the Kafka + Flume arrangement shown above also works.


Reprinted from: https://my.oschina.net/frankwu/blog/355298
