[Big Data] Kafka Getting Started Guide

1. Introduction to Kafka

Apache Kafka is a high-throughput, distributed stream processing platform developed at LinkedIn and open-sourced in 2011. Its high scalability, high reliability, and low latency make it very popular in big data scenarios. Kafka can handle many types of data, such as events, logs, and metrics, and is widely used in real-time stream processing, log collection, monitoring, analytics, and other fields.

Kafka is typically used as a message queue or a stream processing platform. When used as a message queue, competing products include RabbitMQ, ActiveMQ, RocketMQ, Apache Pulsar, etc.

2. Kafka Architecture

The following introduces the three most important participants in the Kafka architecture:

  • Producer: responsible for sending messages to the Kafka cluster.
  • Consumer: responsible for pulling and consuming messages from the Kafka cluster.
  • Broker: a service node in the Kafka cluster, which can be thought of as one server. A Kafka cluster usually consists of multiple Brokers to achieve load balancing and fault tolerance.
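As a minimal illustration of these three roles, here is a toy in-memory sketch in Python. The class names and methods are invented for this example only; they are not the real Kafka client API:

```python
# Toy in-memory model of the three Kafka roles (illustration only).

class Broker:
    """Stores messages per topic, like a single Kafka broker."""
    def __init__(self):
        self.topics = {}  # topic name -> list of messages

    def append(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def fetch(self, topic, offset):
        return self.topics.get(topic, [])[offset:]

class Producer:
    """Sends messages to the broker."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        self.broker.append(topic, message)

class Consumer:
    """Pulls messages from the broker, tracking its own read offset."""
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0

    def poll(self, topic):
        messages = self.broker.fetch(topic, self.offset)
        self.offset += len(messages)
        return messages

broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker)
producer.send("logs", "app started")
producer.send("logs", "user login")
print(consumer.poll("logs"))  # ['app started', 'user login']
```

Note that the consumer pulls messages from the broker rather than having them pushed, which matches Kafka's pull-based consumption model.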

[Figure: Kafka architecture with Producers, Brokers, and Consumers]

3. Partitions and Replicas

Kafka introduces the concept of Topic in order to categorize messages. When a producer sends a message, it must specify the topic to send to; message subscribers then subscribe to this topic and consume its messages.

To improve performance, Kafka introduces the concept of Partition on top of Topic. A Topic is a logical concept, while a Partition is a physical grouping; one Topic can contain multiple Partitions. When sending a message, the producer sends it to one Partition of a Topic (either specified explicitly or chosen by a partitioner), and subscribers of the Topic consume the messages in its Partitions.
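A common way a partitioner picks a partition is by hashing the message key, so that all messages with the same key land in the same partition. Kafka's default Java partitioner uses murmur2 hashing; the sketch below substitutes CRC32 purely for simplicity:

```python
# Sketch of key-based partition selection. Real Kafka's default partitioner
# uses murmur2 hashing; zlib.crc32 is used here only as a stand-in.

import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Messages with the same key always map to the same partition."""
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 3
p1 = choose_partition(b"user-42", NUM_PARTITIONS)
p2 = choose_partition(b"user-42", NUM_PARTITIONS)
assert p1 == p2  # same key -> same partition, so per-key ordering is preserved
```

Because one key always maps to one partition, Kafka can guarantee ordering per key without coordinating across partitions.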

To improve throughput and scalability, Kafka distributes the different Partitions of a Topic across multiple Broker nodes, which makes full use of machine resources and makes it easy to add more Partitions.
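The placement idea can be sketched as a simple round-robin assignment. This is a simplification: Kafka's actual assignment also spreads replicas and can take rack awareness into account.

```python
# Round-robin placement of a topic's partitions onto brokers (simplified).

def assign_partitions(num_partitions, broker_ids):
    """Return {partition: broker} spreading partitions round-robin over brokers."""
    return {p: broker_ids[p % len(broker_ids)] for p in range(num_partitions)}

placement = assign_partitions(4, ["broker-0", "broker-1", "broker-2"])
# partition 0 -> broker-0, 1 -> broker-1, 2 -> broker-2, 3 -> broker-0
```

With partitions spread this way, produce and fetch load is shared across brokers, and adding brokers lets the cluster absorb more partitions.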

To ensure data safety and high service availability, Kafka introduces the concept of Replica on top of Partition. A Partition has multiple Replicas in a one-leader, multiple-follower relationship: there are two types, the Leader Replica and the Follower Replicas, and the Replicas of a Partition are distributed across different Broker nodes.

The Leader Replica handles read and write requests, while Follower Replicas only synchronize data from the Leader Replica and do not serve external requests. When the Leader Replica fails, a new Leader Replica is elected from among the Follower Replicas to continue serving requests, providing automatic failover.

The following figure shows the distribution of different Partitions of the same Topic on the Broker node:

[Figure: Partitions of the same Topic distributed across Broker nodes]
To improve replica synchronization and write efficiency, Kafka classifies Replicas. The set of all Replicas of a Partition, including the Leader Replica and the Follower Replicas, is collectively called the AR (Assigned Replicas). The Replicas that remain in sync with the Leader Replica form the ISR (In-Sync Replicas), and the Replicas that have fallen out of sync with the Leader Replica form the OSR (Out-of-Sync Replicas). Thus AR = ISR + OSR.

Before a message is considered committed, the Leader Replica waits for all replicas in the ISR to synchronize it. If a Follower Replica in the ISR lags too far behind the Leader Replica, it is moved to the OSR; if a Follower Replica in the OSR catches up with the Leader Replica, it is moved back to the ISR. When the Leader Replica fails, a new Leader Replica is elected only from the ISR.
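A rough sketch of this bookkeeping, using a simplified offset-lag criterion (real Kafka uses a time-based check controlled by `replica.lag.time.max.ms`; the function names here are invented for illustration):

```python
# ISR/OSR bookkeeping and leader election from the ISR (simplified sketch).

def split_isr_osr(leader_leo, follower_leos, max_lag):
    """AR = ISR + OSR: followers within max_lag of the leader count as in sync."""
    isr, osr = [], []
    for replica, leo in follower_leos.items():
        (isr if leader_leo - leo <= max_lag else osr).append(replica)
    return isr, osr

def elect_leader(isr):
    """On leader failure, a new leader is chosen only from the ISR."""
    if not isr:
        raise RuntimeError("no in-sync replica available")
    return isr[0]

isr, osr = split_isr_osr(100, {"r1": 99, "r2": 60}, max_lag=10)
# isr == ["r1"], osr == ["r2"]; if the leader fails, r1 can become leader
```

Electing only from the ISR ensures the new leader already holds every committed message, so no acknowledged data is lost during failover.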

4. Offset

To record the synchronization status of replicas and to bound the range of messages consumers may read, Kafka introduces the LEO (Log End Offset) and the HW (High Watermark).

  • LEO is the offset where the next message will be written in a partition, i.e. one past the largest offset currently in the partition. Each replica maintains its own LEO, which records the synchronization progress between the Leader Replica and the Follower Replicas.
  • HW is the minimum LEO across all replicas (Leader and Follower), and its value is shared by all replicas. In other words, messages with offsets below the HW are considered committed, and only these messages are visible to consumers; this guarantees that consumers read only consistent, committed data.

The following demonstrates the update process of LEO and HW:

(1) In the initial state, each of the three replicas holds two messages, 0 and 1. Every LEO is 2, and position 2 is empty, marking where the next message will be written. The HW is also 2, meaning all messages in the Leader Replica have been synchronized to the Follower Replicas, and consumers can consume the two messages 0 and 1.

[Figure: replica state after step (1)]
(2) The producer sends two messages to the Leader Replica. The Leader Replica's LEO increases by 2 to become 4. Since synchronization to the Follower Replicas has not yet started, the HW and the Follower Replicas' LEOs are unchanged, and consumers can still only consume messages below the HW, i.e. messages 0 and 1.

[Figure: replica state after step (2)]
(3) The Leader Replica starts synchronizing messages to the Follower Replicas, at different rates: Follower1 has synchronized both messages 2 and 3, while Follower2 has only synchronized message 2. At this point the LEO of the Leader and Follower1 is 4, while the LEO of Follower2 is 3. The HW, being the minimum offset all replicas have reached, is 3, so consumers can read only the three messages 0, 1, and 2.

[Figure: replica state after step (3)]
(4) Once synchronization completes, the LEO of all three replicas is 4 and the HW is also 4, so consumers can read all four messages: 0, 1, 2, and 3.

[Figure: replica state after step (4)]
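The four steps above can be replayed in a few lines, treating the HW as the minimum LEO across replicas:

```python
# Replaying the LEO/HW walkthrough: HW is the minimum LEO across all replicas,
# and consumers may only read offsets below the HW.

def high_watermark(leos):
    return min(leos.values())

leos = {"leader": 2, "follower1": 2, "follower2": 2}   # step (1)
assert high_watermark(leos) == 2                        # consumers see 0, 1

leos["leader"] = 4                                      # step (2): two new writes
assert high_watermark(leos) == 2                        # still only 0, 1 visible

leos["follower1"], leos["follower2"] = 4, 3             # step (3): partial sync
assert high_watermark(leos) == 3                        # 0, 1, 2 visible

leos["follower2"] = 4                                   # step (4): fully synced
assert high_watermark(leos) == 4                        # 0, 1, 2, 3 visible
```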

5. Consumer Group

To improve message processing efficiency, Kafka introduces the concept of consumer groups. A Consumer Group contains multiple consumers. One consumer group can subscribe to multiple topics at the same time, and one topic can also be subscribed to by multiple consumer groups at the same time.

To guarantee that the messages of a Partition are processed in order, within a consumer group each Partition's messages are handed to only one consumer of that group.
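This rule can be sketched as a partition assignment where each partition is owned by exactly one consumer in the group. A simple round-robin scheme is shown; real Kafka ships several assignors (range, round-robin, sticky):

```python
# Round-robin assignment of partitions to consumers within one group
# (simplified sketch of what a Kafka partition assignor does).

def assign_to_group(partitions, consumers):
    """Return {consumer: [partitions]} with each partition owned by one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign_to_group([0, 1, 2, 3], ["c1", "c2"])
# {'c1': [0, 2], 'c2': [1, 3]}: no partition is shared between consumers
```

Because no two consumers in a group ever share a partition, each partition's messages are processed by a single consumer in order; adding consumers beyond the partition count leaves the extras idle.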

[Figure: consumer groups consuming the Partitions of a Topic]

6. Summary

This article briefly introduced the Kafka architecture and the concepts involved in it, including Producer, Consumer, Broker, Topic, Partition, Leader Replica, Follower Replica, LEO (Log End Offset), HW (High Watermark), Consumer Group, etc. The next article will cover how Kafka handles message loss, duplicate consumption, message ordering, message persistence, the Leader election process, and more.


Origin: blog.csdn.net/be_racle/article/details/132818337