Kafka and Flume

  Kafka and Flume are both log systems. Kafka is distributed message middleware with its own storage that supports both push and pull data access. Flume is a pipeline-style flow tool divided into three parts: the agent (data collection), the collector (simple data processing and writing), and the storage (the destination store), and each part can be customized. For example, the agent can use RPC (Thrift-RPC), text (files), and so on, while the storage can be set to HDFS. Kafka is better suited to log buffering, but Flume's data-collection side is excellent: it supports many customizable data sources and reduces development effort. The Flume + Kafka mode is therefore popular; and if you also want to use Flume's ability to write to HDFS, the Flume + Kafka combination still covers that.
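As a concrete sketch of the Flume + Kafka mode described above, here is a minimal Flume agent configuration that tails a log file and buffers events into a Kafka topic. All names here (the agent name `agent1`, the file path, the topic `app_logs`, and `localhost:9092`) are illustrative assumptions, not values from the original post:

```properties
# Hypothetical Flume agent: tail an application log and publish events to Kafka.
agent1.sources  = r1
agent1.channels = c1
agent1.sinks    = k1

# Source: tail a log file (path is an assumption)
agent1.sources.r1.type = TAILDIR
agent1.sources.r1.filegroups = f1
agent1.sources.r1.filegroups.f1 = /var/log/app/app.log
agent1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 10000

# Sink: hand events off to a Kafka topic for downstream consumers
agent1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.k1.kafka.bootstrap.servers = localhost:9092
agent1.sinks.k1.kafka.topic = app_logs
agent1.sinks.k1.channel = c1
```

With this shape, Kafka sits behind Flume as the buffer, and any number of downstream systems (HDFS loaders, stream processors) can consume from the topic.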

  • Kafka is a very general-purpose system: many producers and many consumers can share many topics. In contrast, Flume is a dedicated tool designed to send data to HDFS and HBase; it is specially optimized for HDFS and integrates with Hadoop's security features. So if the data will be consumed by multiple systems, use Kafka; if the data is destined for Hadoop, use Flume.
  • Flume can use interceptors to process data in flight, which is very useful for data masking or filtering. Kafka requires an external stream-processing system to do this.
  • Both Kafka and Flume are reliable systems, and with proper configuration both can guarantee zero data loss. However, Flume does not replicate events. So if a node running a Flume agent crashes, even with the reliable file-channel mode, the events on it are unavailable until the disk is recovered. If you need a highly available pipeline, Kafka is the better choice.
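To make the interceptor point above concrete: Flume's real Interceptor API is Java, but the masking logic it typically performs in flight can be sketched as a standalone Python function. The regex patterns and replacement tokens here are assumptions for illustration only:

```python
import re

# Sketch of the kind of data masking a Flume interceptor performs on each
# event body as it passes through the pipeline. (Flume's actual Interceptor
# interface is Java; this only illustrates the transformation itself.)

PHONE = re.compile(r"\b\d{3}-\d{4}-\d{4}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def mask_event(body: str) -> str:
    """Mask phone numbers and e-mail addresses in a log event body."""
    body = PHONE.sub("***-****-****", body)
    body = EMAIL.sub("<masked-email>", body)
    return body

print(mask_event("user alice@example.com called 010-1234-5678"))
# → user <masked-email> called ***-****-****
```

In a Kafka-only pipeline, this same transformation would instead live in an external stream processor consuming from one topic and producing to another.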

Question: Why use Flume and Kafka together for log processing? Can Kafka be used alone, without Flume?
At the time, I considered using only Flume's interfaces, both input (sockets and files) and output (Kafka / HDFS / HBase, etc.). But given how the existing system's business is likely to develop, it is more important to leave some room for scalability in the design so the system can be extended flexibly later. The Flume + Kafka architecture may need one or two more machines for Flume log collection than Kafka alone, but to make it easier to extend log-data processing in the future, Flume + Kafka is the architecture to use.

  Flume: pipeline ---- personally, I think it fits scenarios with multiple producers, or scenarios that need to write to HBase, HDFS, and Kafka.
  Kafka: message queue ---- because Kafka uses a pull model, it fits scenarios with multiple consumers.
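The "pull model fits multiple consumers" point can be illustrated with a toy in-memory stand-in: the broker keeps an append-only log, and each consumer tracks its own offset and pulls at its own pace, so independent consumers can each read the same topic in full. This is a simplification I wrote for illustration; real Kafka clients poll a broker over the network and commit offsets to it:

```python
# Toy sketch of Kafka's pull model (in-memory; not a real Kafka client).

class TopicLog:
    """Append-only log standing in for a Kafka topic partition."""
    def __init__(self):
        self._log = []

    def append(self, record):
        self._log.append(record)

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset (one 'pull')."""
        return self._log[offset:offset + max_records]

class Consumer:
    """Each consumer owns its own read position, independent of others."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.topic.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

topic = TopicLog()
for i in range(5):
    topic.append(f"event-{i}")

hdfs_writer = Consumer(topic)   # e.g. a slow batch loader
alerting = Consumer(topic)      # e.g. a fast real-time monitor

print(hdfs_writer.poll(3))  # ['event-0', 'event-1', 'event-2']
print(alerting.poll(5))     # ['event-0', 'event-1', 'event-2', 'event-3', 'event-4']
```

Because consumers pull rather than having the broker push, a slow consumer never blocks a fast one, which is exactly why Kafka suits multi-consumer scenarios.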

Origin blog.csdn.net/ThreeAspects/article/details/105459208