Kafka Learning Road (1) Introduction to Kafka

1. Introduction

1.1 Overview

Kafka was originally developed at LinkedIn. It is a distributed, partitioned, replicated, multi-subscriber log system coordinated by ZooKeeper (it can also be used as an MQ system), commonly used for web/nginx logs, access logging, messaging services, and more. LinkedIn contributed it to the Apache Foundation in 2010, and it became a top-level open-source project.

The main application scenarios are: log collection system and message system.

The main design goals of Kafka are as follows:

  • Message persistence with O(1) time complexity, providing constant-time access performance even for terabytes of data or more.
  • High throughput: even on inexpensive commodity machines, a single machine can support the transmission of 100K messages per second.
  • Message partitioning across Kafka servers and distributed consumption, while preserving the order of messages within each partition.
  • Support for both offline and real-time data processing.
  • Scale-out: support for online horizontal expansion.

1.2 Introduction to Messaging Systems

A messaging system is responsible for passing data from one application to another; the applications only need to focus on the data, not on how it is transferred between them. Distributed messaging is based on reliable message queues that deliver messages asynchronously between client applications and the messaging system. There are two main messaging patterns: point-to-point and publish-subscribe. Most messaging systems, Kafka included, use the publish-subscribe model.

1.3 Point-to-Point Messaging Pattern

In a point-to-point messaging system, messages are persisted to a queue, and one or more consumers consume the data in it. A given message can be consumed only once: when a consumer consumes a piece of data from the queue, that data is deleted from the queue. This mode guarantees the order of data processing even when multiple consumers consume data at the same time. The architecture is illustrated below:

The producer sends a message to the queue, and only one consumer can receive it.
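The queue semantics above can be sketched in a few lines of Python (a toy illustration, not Kafka code): each message is handed to exactly one of the competing consumers and is then gone from the queue.

```python
from queue import Queue, Empty
from threading import Thread

# Toy point-to-point queue: two consumers compete for messages,
# and each message is delivered to exactly one of them.
q = Queue()
for i in range(6):
    q.put(f"msg-{i}")

received = {"c1": [], "c2": []}

def consume(name):
    while True:
        try:
            msg = q.get_nowait()   # removing a message deletes it from the queue
        except Empty:
            break
        received[name].append(msg)

t1 = Thread(target=consume, args=("c1",))
t2 = Thread(target=consume, args=("c2",))
t1.start(); t2.start(); t1.join(); t2.join()

# Every message was consumed exactly once across both consumers.
assert sorted(received["c1"] + received["c2"]) == [f"msg-{i}" for i in range(6)]
```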

1.4 Publish-Subscribe Messaging Pattern

In a publish-subscribe messaging system, messages are persisted to a topic. Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the data in them; the same data can be consumed by multiple consumers, and data is not deleted immediately after consumption. In a publish-subscribe messaging system, producers of messages are called publishers and consumers are called subscribers. An example diagram of this mode is as follows:

A message published to a topic is received only by the subscribers of that topic.
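A minimal pub-sub sketch in Python (the `Topic` class here is a hypothetical illustration, not a Kafka API): the topic retains every message, and each subscriber reads the full log independently.

```python
# Toy publish-subscribe topic: messages are appended, never deleted
# on consumption, and any subscriber can read them all.
class Topic:
    def __init__(self):
        self.log = []

    def publish(self, msg):
        self.log.append(msg)

    def read(self, offset=0):
        # Any subscriber may replay the log from any offset.
        return self.log[offset:]

topic = Topic()
for i in range(3):
    topic.publish(f"event-{i}")

sub_a = topic.read()   # both subscribers see all messages
sub_b = topic.read()
assert sub_a == sub_b == ["event-0", "event-1", "event-2"]
```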

2. Advantages of Kafka

2.1 Decoupling

It is extremely difficult at the start of a project to predict the needs it will encounter in the future. A messaging system inserts an implicit, data-based interface layer between processes, and both sides implement this interface. This allows you to extend or modify the two processes independently, as long as they obey the same interface constraints.

2.2 Redundancy (copy)

In some cases the process handling the data fails, and unless the data is persisted it is lost. Message queues avoid this risk by persisting data until it has been fully processed. In the "insert-get-delete" paradigm used by many message queues, your processing system must explicitly indicate that a message has been processed before it is removed from the queue, ensuring that your data is kept safe until you are done using it.

2.3 Extensibility

Because message queues decouple your processing, it is easy to increase the rate at which messages are enqueued and processed simply by adding more processors. There is no need to change code or adjust parameters: scaling out is as simple as adding consumers.

2.4 Flexibility & Peak Processing Capability

Applications must keep functioning when traffic surges, yet such bursts are uncommon; provisioning resources to stand by at all times for peak traffic would be a huge waste. Message queues allow critical components to withstand sudden access pressure without completely collapsing under overloaded requests.

2.5 Recoverability

When a part of the system fails, it does not affect the entire system. Message queues reduce the coupling between processes, so even if a process processing a message hangs, messages added to the queue can still be processed after the system is restored.

2.6 Order Guarantee

In most use cases, the order of data processing is important. Most message queues are inherently ordered and guarantee that data will be processed in a specific order. Kafka guarantees the ordering of messages within a Partition.

2.7 Buffering

In any critical system, there will be elements that require different processing times. For example, loading an image takes less time than applying filters. Message queues use a buffer layer to help tasks perform most efficiently - writes to the queue are processed as quickly as possible. This buffering helps control and optimize the speed at which data flows through the system.

2.8 Asynchronous Communication

Many times the user does not want or need to process the message immediately. Message queues provide asynchronous processing mechanisms that allow users to put a message on the queue, but not process it immediately. Put as many messages as you want into the queue, and then process them when needed.

3. Comparison of Commonly Used Message Queues

3.1 RabbitMQ

RabbitMQ is an open-source message queue written in Erlang. It supports many protocols (AMQP, XMPP, SMTP, STOMP), which makes it quite heavyweight and better suited to enterprise-level development. It implements a broker architecture, meaning messages are queued in a central broker before being sent to clients, and it has good support for routing, load balancing, and data persistence.

3.2 Redis

Redis is a NoSQL database based on key-value pairs, with active development and maintenance. Although it is a key-value storage system, it supports MQ functionality and can therefore be used as a lightweight queue service. In one benchmark comparing the enqueue and dequeue operations of RabbitMQ and Redis, each operation was executed 1 million times, with execution time recorded every 100,000 operations, for payload sizes of 128 bytes, 512 bytes, 1 KB, and 10 KB. The results showed that for enqueueing, Redis outperforms RabbitMQ when the payload is small, but becomes unbearably slow once the payload exceeds 10 KB; for dequeueing, Redis shows very good performance regardless of payload size, while RabbitMQ's dequeue performance is far lower than that of Redis.

3.3 ZeroMQ

ZeroMQ is known as one of the fastest message queuing systems, especially for high-throughput scenarios. It can implement the advanced/complex queues that RabbitMQ is not good at, but developers must combine multiple technical frameworks themselves, and this technical complexity is a challenge to applying this MQ successfully. ZeroMQ has a unique broker-less model: you do not need to install and run a message server or middleware, because your application itself plays the server role. You simply reference the ZeroMQ library (installable via NuGet for .NET) and can send messages between applications. However, ZeroMQ provides only non-persistent queues, so data is lost if a node goes down. Notably, Twitter's Storm used ZeroMQ as its default data-stream transport in versions earlier than 0.9.0 (since version 0.9, Storm has supported both ZeroMQ and Netty as transport modules).

3.4 ActiveMQ

ActiveMQ is a sub-project of Apache. Like ZeroMQ, it can implement queues with both broker and peer-to-peer techniques; like RabbitMQ, it can efficiently implement advanced application scenarios with a small amount of code.

3.5 Kafka/Jafka

Kafka is a sub-project of Apache: a high-performance, cross-language, distributed publish/subscribe message queue system. Jafka is incubated on top of Kafka as an upgraded version. Kafka has the following characteristics: fast persistence, with message persistence under O(1) system overhead; high throughput, reaching roughly 100K messages per second on an ordinary server; a completely distributed system, in which brokers, producers, and consumers all natively support distribution and automatic load balancing; and support for parallel data loading into Hadoop, making it a feasible solution for log data and offline analysis systems like Hadoop that also require real-time processing. Kafka unifies online and offline message processing through Hadoop's parallel loading mechanism. Compared with ActiveMQ, Apache Kafka is a very lightweight messaging system which, besides being very performant, also works well as a distributed system.

4. Terminology in Kafka

4.1 Overview

Before diving into Kafka, let's introduce the terminology in Kafka. The following figure shows the related terms of Kafka and the relationship between them:

In the figure above, a topic is configured with 3 partitions: Partition 1 has two offsets (0 and 1), Partition 2 has four offsets, and Partition 3 has one offset. The id of each replica is the same as the id of the machine where the replica resides.

If a topic has 3 replicas, then Kafka will create 3 identical replicas for each partition in the cluster. Each broker in the cluster stores one or more partitions. Multiple producers and consumers can produce and consume data at the same time.

4.2 broker

A Kafka cluster consists of one or more servers, the server nodes are called brokers.

The broker stores topic data. If a topic has N partitions and the cluster has N brokers, then each broker stores one partition of the topic.

If a topic has N partitions and the cluster has N+M brokers, then N brokers each store one partition of the topic, and the remaining M brokers store no partition data for that topic.

If a topic has N partitions and the cluster has fewer than N brokers, then some brokers store more than one partition of the topic. In a production environment, try to avoid this situation, as it can easily lead to unbalanced data in the Kafka cluster.
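The three cases above can be illustrated with a toy round-robin placement function (an assumption made purely for illustration; Kafka's actual replica-assignment logic is more involved):

```python
# Toy placement: spread partitions 0..N-1 over brokers round-robin.
def assign(num_partitions, brokers):
    placement = {b: [] for b in brokers}
    for p in range(num_partitions):
        placement[brokers[p % len(brokers)]].append(p)
    return placement

# N partitions, N brokers: one partition per broker.
assert assign(3, ["b0", "b1", "b2"]) == {"b0": [0], "b1": [1], "b2": [2]}
# Fewer partitions than brokers: some brokers hold nothing.
assert assign(2, ["b0", "b1", "b2"]) == {"b0": [0], "b1": [1], "b2": []}
# More partitions than brokers: some brokers hold several.
assert assign(4, ["b0", "b1"]) == {"b0": [0, 2], "b1": [1, 3]}
```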

4.3 Topic

Every message published to a Kafka cluster has a category called a topic. (Physically, messages of different topics are stored separately. Logically, although messages of a topic are stored on one or more brokers, users only need to specify the topic of the message to produce or consume data without caring where the data is stored.)

A topic is analogous to a table name in a database.

4.4 Partition

The data in a topic is divided into one or more partitions, and each topic has at least one partition. The data within each partition is stored in multiple segment files. Data within a partition is ordered, but ordering is not guaranteed across different partitions, so if a topic has multiple partitions, the overall order of data cannot be guaranteed at consumption time. In scenarios where the consumption order of messages must be strictly guaranteed, the number of partitions must be set to 1.
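How a message key decides the partition can be sketched as follows (a hedged illustration: Kafka's real default partitioner hashes the key with murmur2, whereas this sketch uses `hashlib.md5` purely as a deterministic stand-in). Messages sharing a key land in one partition, so their relative order is preserved.

```python
import hashlib

# Illustrative key-based partitioner (NOT Kafka's murmur2 default).
def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for seq in range(5):
    for key in ("user-a", "user-b"):
        partitions[partition_for(key, NUM_PARTITIONS)].append((key, seq))

# All messages for a given key share one partition, so their relative
# order is preserved; ordering across partitions is not guaranteed.
p_a = partition_for("user-a", NUM_PARTITIONS)
seqs_a = [seq for key, seq in partitions[p_a] if key == "user-a"]
assert seqs_a == [0, 1, 2, 3, 4]
```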

4.5 Producer

A producer is a publisher of data: this role publishes messages to Kafka topics. When the broker receives a message from a producer, it appends the message to the segment file currently used for appending data. A message sent by a producer is stored in one partition, and the producer can also specify which partition the data is stored in.
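The append-to-segment behavior can be pictured as a toy append-only log (a conceptual sketch, not Kafka's actual on-disk segment format): the broker appends each message and the message's position becomes its offset within the partition.

```python
# Toy append-only partition log: appending returns the new
# message's offset, and old messages stay readable by offset.
class PartitionLog:
    def __init__(self):
        self._messages = []

    def append(self, msg) -> int:
        self._messages.append(msg)
        return len(self._messages) - 1   # offset of the appended message

    def read(self, offset):
        return self._messages[offset]

log = PartitionLog()
assert log.append("first") == 0
assert log.append("second") == 1
assert log.read(1) == "second"
```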

4.6 Consumer

Consumers can read data from brokers. Consumers can consume data from multiple topics.

4.7 Consumer Group

Each consumer belongs to a specific consumer group (a group name can be specified for each consumer; if none is specified, the consumer belongs to the default group).
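Within a group, partitions are divided among the members so that each partition is consumed by exactly one group member. A simplified sketch of range-style assignment (modeled loosely on Kafka's RangeAssignor, reduced to a single topic for illustration):

```python
# Simplified range assignment: sort consumers, give each a
# contiguous slice of partitions; earlier consumers absorb remainders.
def range_assign(partitions, consumers):
    consumers = sorted(consumers)
    per = len(partitions) // len(consumers)
    extra = len(partitions) % len(consumers)
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        n = per + (1 if i < extra else 0)
        assignment[c] = partitions[start:start + n]
        start += n
    return assignment

# 5 partitions over 2 consumers: each partition has exactly one owner.
result = range_assign([0, 1, 2, 3, 4], ["c1", "c2"])
assert result == {"c1": [0, 1, 2], "c2": [3, 4]}
```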

4.8 Leader

Each partition has multiple replicas, exactly one of which is the leader. The leader is the replica currently responsible for reading and writing data for the partition.

4.9 Follower

A follower replicates the leader; all write requests are routed through the leader. Data changes are broadcast to all followers, which keep their data synchronized with the leader. If the leader fails, a new leader is elected from among the followers. If a follower crashes, gets stuck, or synchronizes too slowly, the leader removes it from the "in-sync replicas" (ISR) list, and a new follower is created.
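The ISR-shrinking rule can be sketched as: a replica whose last fetch from the leader is older than a lag threshold is dropped from the set (the parameter name `replica_lag_ms` below is illustrative; the real broker setting is `replica.lag.time.max.ms`).

```python
# Toy ISR maintenance: keep only replicas that fetched recently enough.
def shrink_isr(isr, last_fetch_ms, now_ms, replica_lag_ms=10_000):
    return [r for r in isr if now_ms - last_fetch_ms[r] <= replica_lag_ms]

now = 100_000
last_fetch = {"leader": now, "f1": now - 2_000, "f2": now - 30_000}

# f2 has not fetched for 30s (> 10s threshold), so it leaves the ISR.
assert shrink_isr(["leader", "f1", "f2"], last_fetch, now) == ["leader", "f1"]
```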
