Comparison of Kafka and Flume

Summary: (1) Kafka and Flume are both log systems. Kafka is distributed message middleware with its own storage, offering both push and pull styles of data access. Flume is divided into three parts: the agent (data collection), the collector (simple data processing and writing), and the storage, and each part can be customized. For example, the agent can take input via RPC (Thrift-RPC) or from text files, and the storage can be pointed at HDFS.

           (2) Kafka is better suited to log caching, while Flume's strength is data collection: many data sources are supported out of the box or can be customized, which reduces development work. The Flume + Kafka combination is therefore popular. If you also want to reuse Flume's ability to write to HDFS, you can chain them the other way round, as Kafka + Flume.

The collection layer is mainly built with one of two technologies: Flume or Kafka.

Flume: Flume follows a pipeline-flow model. It provides many default implementations, so users can deploy it through configuration parameters alone, and it exposes an API for custom extensions.
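As a concrete illustration of deploying Flume "through parameters", here is a minimal sketch of an agent definition in Flume's standard properties format. The agent name a1, the tailed file path, and the component names are illustrative assumptions, not taken from the original text:

```properties
# Minimal pipeline using only built-in components:
# exec source -> memory channel -> logger sink
# (agent name "a1" and the tailed path are illustrative assumptions)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: print events to the log for inspection
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Started with something like `flume-ng agent --name a1 --conf-file flume.conf`, this pipeline runs without writing any code, which is exactly the configuration-driven convenience described above.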

Kafka: Kafka is a persistent distributed message queue.
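To make the producer side concrete, here is a minimal Java sketch (the broker address, the topic name logs, and the message contents are assumptions for illustration):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address is an illustrative assumption
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each send appends the message to one partition of the (hypothetical) "logs"
            // topic, where it is persisted to disk until consumers read it
            producer.send(new ProducerRecord<>("logs", "host-1", "sample log line"));
        }
    }
}
```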

• Kafka is a very general-purpose system: you can have many producers and many consumers sharing multiple topics (a consumer sketch follows this point). In contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and integrates with Hadoop's security features. Cloudera therefore recommends Kafka when the data will be consumed by multiple systems, and Flume when the data is destined only for Hadoop.
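To make "many consumers sharing multiple topics" concrete, here is a minimal consumer sketch; the group id log-readers and the topic names are assumptions. Running several copies of this program with the same group id spreads the topics' partitions across them:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing a group.id divide the partitions among themselves
        props.put("group.id", "log-readers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // A single consumer can subscribe to several topics at once
            consumer.subscribe(Arrays.asList("logs", "metrics"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s [%d] %s%n",
                            record.topic(), record.partition(), record.value());
                }
            }
        }
    }
}
```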

• As you may know, Flume has many source and sink components built in. Kafka, by contrast, has a clearly smaller ecosystem of producers and consumers and weaker community support. Hopefully this will improve in the future, but for now, choosing Kafka means being ready to write your own producer and consumer code (along the lines of the sketches above). If the existing Flume sources and sinks meet your needs and you prefer a system that requires no development, use Flume.

• Flume can process data in flight using interceptors, which are useful for things like data masking or filtering (sketched below). Kafka requires an external stream-processing system to do this.
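As a sketch of how such an interceptor is attached, the following uses Flume's built-in search_replace interceptor to mask digit runs in event bodies; the agent and source names and the pattern are illustrative assumptions:

```properties
# Attach a masking interceptor to source r1 (names are illustrative)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace
# Replace every run of digits in the event body, e.g. to mask IDs or phone numbers
a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = ****
```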

• Both Kafka and Flume are reliable systems that, with proper configuration, can guarantee zero data loss. However, Flume does not replicate events. As a result, if a node running a Flume agent crashes, even with the reliable file channel you will lose the events held in that channel until the disks are recovered. Kafka, by contrast, replicates partitions across brokers (see the sketch below), so if you need a highly reliable pipeline, Kafka is the better choice.
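To show what that replication looks like in practice, here is a minimal Java sketch that creates a topic whose partitions are each replicated on three brokers, using Kafka's AdminClient (the topic name, partition count, and broker address are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: losing a single broker loses no events
            NewTopic topic = new NewTopic("logs", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```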

• Flume and Kafka work well together. If your design requires streaming data from Kafka into Hadoop, you can run a Flume agent with a Kafka source to read the data (sketched below): you don't have to implement your own consumer, you directly gain all of Flume's integration with HDFS and HBase, you can monitor the consumer with Cloudera Manager, and you can even add interceptors for some in-flight stream processing.
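A minimal sketch of such an agent, again in Flume's properties format with Flume 1.7+ style Kafka source keys; the broker address, topic, and HDFS path are assumptions:

```properties
# Kafka source -> file channel -> HDFS sink (names and paths are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Read from Kafka instead of writing your own consumer
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = localhost:9092
a1.sources.r1.kafka.topics = logs
a1.sources.r1.channels = c1

# Durable on-disk channel between Kafka and HDFS
a1.channels.c1.type = file

# Write plain event bodies into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```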

Because each message is appended to the end of its partition, Kafka's disk writes are sequential, which makes them very efficient (it has been shown that sequential disk writes can be faster than random writes to memory, and this is an important guarantee of Kafka's high throughput).

Flume and Kafka can be used in combination, and usually the Flume + Kafka arrangement is what you see. In fact, if you want to reuse Flume's existing HDFS-writing functionality, you can also use Kafka + Flume.

Both Kafka and Flume can transport data, but their focus differs.

Kafka pursues high throughput and high load (a single topic can have multiple partitions).

Flume pursues diversity: diversity of data sources and diversity of data flows.

If you have a single data source and want high throughput, you can use Kafka. If there are many data sources and many data flows, you can use Flume. Or you can use Kafka and Flume together.
