A simple guide to Kafka's message reliability strategy

Author: hymanzhang, Operations Development Engineer, Tencent IEG

Background

While developing an activity recently, the developers in our department needed to handle a lot of application background logic and capture the triggering of various events. In the design we planned to use a Kafka message queue to decouple the business logic, so that activity development and background development could proceed independently. However, the colleagues using it were not very familiar with the underlying principles and worried about the following issues:

  • In what business scenarios should I use a message queue?

  • Do I need to wait for an ack when I send a message?

  • After I send a message, will consumers definitely receive it?

  • After applying for a Kafka instance on Tencent Cloud, how should I set the various parameters?

  • Will my messages be lost when various failures occur?

  • Will the consumer receive a message multiple times? Will messages be lost after the consumer svr restarts?

These questions are perfectly normal; there are always problems of one kind or another when first encountering and using a new system. In most cases it is possible to get by with the recommended default values without understanding the internals. But we should gracefully improve our knowledge: learn the principles behind the system, so that at least when common problems occur we can analyze and handle them with a clear idea of what is going on.

When to use a message queue?

Simply put, three keywords: asynchrony, peak shaving, and decoupling. They can be understood as:

  • I don't care after I finish it

  • Too much work, let me take care of it slowly

  • I don't care how it happened/I don't care how to deal with it

Take the following picture as an example:

After a user submits a comment and it is written into the database, several pieces of logic need to capture the comment event. Handling all of these steps sequentially inside the request handler is cumbersome. Instead, we can notify each step through messages (asynchrony) and return without executing the remaining logic of the current request inline (decoupling), which is much cleaner. In addition, the message queue can act as a buffer that temporarily stores the published messages, so we no longer have to consider the failure scenarios caused by the latency of calling each downstream step.
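To make the decoupling concrete, here is a minimal sketch in Go using the sarama client library (the broker address, topic name, key and payload are invented for the example, and error handling is trimmed): the commenting interface publishes a "comment created" event asynchronously and returns, while each downstream step consumes the topic at its own pace.

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	producer, err := sarama.NewAsyncProducer([]string{"127.0.0.1:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close() // Close flushes any buffered messages before returning

	// Drain the error channel so the producer never blocks on it.
	go func() {
		for e := range producer.Errors() {
			log.Println("publish failed:", e)
		}
	}()

	// Publish the "comment created" event and return immediately; each
	// downstream step subscribes to the topic and handles it independently.
	producer.Input() <- &sarama.ProducerMessage{
		Topic: "comment-events", // hypothetical topic name
		Key:   sarama.StringEncoder("comment-12345"),
		Value: sarama.StringEncoder(`{"event":"comment_created","user":"u1"}`),
	}
}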

This article takes the reliability design of Kafka as an example; choosing between different message queues is not covered.

Basic concepts of Kafka

Before answering the questions raised at the beginning of the article, we need to briefly introduce a few concepts. Kafka has the following topological roles:

  • Consumer: the consumer, usually embedded as a client API in each business svr

  • Producer: the producer, usually embedded as a client API in each business svr

  • Kafka broker: a server in the Kafka cluster; the message data of topics is stored on the brokers

The producer pushes messages to the broker, which stores them; the consumer subscribes to and consumes messages in pull mode.

As shown in the figure, from the perspective of its storage structure Kafka has the following concepts:

  • Topic: a logical collection of messages in Kafka, which can be understood as a table; writing to different topics means writing to different tables.

  • Partition: a physical grouping under a topic. A topic can be divided into multiple partitions, and each partition is an ordered queue (a large file). Each message in a partition has an ordered offset.

  • Msg: message, the basic unit of communication. Each msg is written to only one of the partitions under a topic, and has a unique offset within that partition for positioning.

  • Replica: replica, a redundant backup of partition data used to achieve distributed data reliability; it introduces the problem of data consistency between replicas and therefore a certain degree of complexity.

  • Leader/follower: roles of a replica. The leader replica provides read and write services for the partition, while followers keep synchronously fetching messages from the leader; the message state between them is reconciled by a consistency strategy.

Kafka storage format

To better understand the message state consistency strategy on the broker, we first need to briefly introduce the message storage format. When the producer sends a message to the broker, it chooses which partition to store it in according to the partitioning rules. If the partitioning rules are set reasonably, messages are distributed evenly across partitions, achieving horizontal scaling.

The producer can think of a partition as a large sequential file; each msg is assigned a unique offset when it is stored. The offset is a logical offset used to distinguish individual messages.

As a file, a partition can itself have multiple replicas (leader/follower) distributed on different brokers. To answer how stored messages and their state avoid being lost across brokers, we must answer how the consistency of message state among the replicas is maintained: which messages count as committed for the producer, which messages have been persisted, and which messages will survive a node failure.

Message reliability guarantee when sending asynchronously

Going back to the questions raised at the beginning of the article: how is message reliability ensured when Kafka is used as an asynchronous message queue, and how do we answer those questions? The reliability guarantee can be explained in three parts.

Producer's reliability guarantee

Answering the producer's reliability guarantee means answering:

  1. Is there an ack after sending a message?

  2. After sending a message and receiving the ack, is the message guaranteed not to be lost?

Kafka allows the producer's ack strategy to be specified in its configuration when sending messages:

request.required.acks = -1 (all in-sync replicas acknowledge; strong reliability guarantee)
request.required.acks = 1 (the leader acknowledges receipt; the default)
request.required.acks = 0 (no acknowledgement, but high throughput)

If you want to configure Kafka as a CP (Consistency & Partition tolerance) system, the configuration needs to be as follows:

request.required.acks=-1
min.insync.replicas = ${N/2 + 1}
unclean.leader.election.enable = false
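As a rough producer-side sketch of the acks setting (Go with sarama; the broker address and topic are placeholders, and min.insync.replicas / unclean.leader.election.enable are broker- or topic-level settings rather than client options):

package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.RequiredAcks = sarama.WaitForAll // acks = -1: wait for the full ISR
	cfg.Producer.Return.Successes = true          // required by the SyncProducer
	cfg.Producer.Retry.Max = 3                    // retries may duplicate, but not lose, messages

	producer, err := sarama.NewSyncProducer([]string{"127.0.0.1:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "order-events", // hypothetical topic
		Value: sarama.StringEncoder("msg4"),
	})
	if err != nil {
		log.Println("no ack, the send needs to be retried:", err)
		return
	}
	log.Printf("acked at partition=%d offset=%d", partition, offset)
}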

As shown in the figure, with acks=-1 the ack for a new message is returned only after all the followers in the ISR (f1, f2 and f3) have copied it from the leader. Regardless of the kind of machine failure (all nodes or only some), the written msg4 will not be lost, and the message state satisfies the consistency requirement (C).

Under normal circumstances, the leader returns the ack to the producer after all the followers have copied the message.

Under abnormal conditions, what if the data has reached only some of the replicas (say f1 and f2 are in sync) when the leader goes down? Any in-sync follower may become the new leader; the producer receives an error and resends the data, so the data may be duplicated (but not lost). Data duplication is not considered here for the time being.

The min.insync.replicas parameter specifies the minimum number of replicas that must be in a normal synchronized state in the cluster; when the actual number falls below the configured value, the cluster stops serving. If it is configured as N/2 + 1, i.e. a majority, the algorithm guarantees strong consistency under this condition; when the required number is not met, the service stops, sacrificing availability.

Under abnormal circumstances, when the leader goes down, a new leader needs to be elected from among the followers, which could be f2 or f3.

If f3 is elected as the new leader, message truncation may occur because f3 has not yet synchronized msg4. Kafka uses unclean.leader.election.enable to control whether f3 may be elected leader in this case. The default was true in older versions and has been false since Kafka 0.11, to avoid message loss through truncation in this scenario.

Through the cooperation of acks, min.insync.replicas and unclean.leader.election.enable, Kafka configured as a CP system guarantees that it either stops serving or, once the ack has been returned, the message will not be lost and the message state stays consistent.

The default value of min.insync.replicas is 1, which favors availability: the cluster keeps serving as long as one replica is working. But the state of that working broker may not be correct at that point (think about why).

If you want to configure Kafka as an AP (Availability & Partition tolerance) system:

request.required.acks=1
min.insync.replicas = 1
unclean.leader.election.enable = false

With acks=1, the leader returns the ack as soon as it has received the message. This introduces a message loss problem: if the leader receives the fourth message but has not yet synchronized it to the followers when the leader machine goes down, and one of the followers is elected as the new leader, the fourth message is lost. Of course, this scenario also assumes unclean.leader.election.enable is configured as false; but with only the leader acknowledging, the probability that followers have not yet caught up increases greatly.

By combining the producer-side strategy with the general parameters of the Kafka cluster, you can tune the configuration according to the characteristics of your business system and find a reasonable balance between throughput and message reliability.

Broker's reliability guarantee

After a message has been sent to the broker by the producer, there are still many questions:

  • The partition leader has written the message successfully; when will the followers be synchronized?

  • The leader has written successfully; when can consumers read this message?

  • The leader writes successfully and then restarts; is the message state still correct after the restart?

  • The leader restarts; how is a new leader elected?

These questions all come down to the mechanism the cluster uses to keep the message state consistent across the different replicas once a message has landed on the brokers.

Kafka message backup and synchronization

Kafka solves message backup through a partitioned multi-replica strategy, and solves the data synchronization consistency problem, in a way comparable to a consensus algorithm, through the HW and LEO markers together with the concepts of ISR and OSR.

Partition multi-replica means that the replicas of a partition mentioned above are distributed on different brokers, providing automatic failover through data redundancy. The states of the different replicas give rise to the concepts of ISR and OSR.

  • ISR: In-Sync Replicas, the set of follower replicas that the leader maintains within a certain synchronization lag, including the leader replica itself

  • AR: Assigned Replicas, all replicas collectively

  • OSR: Out-of-Sync Replicas, followers whose synchronization with the leader lags beyond the threshold

ISR is a concept specific to Kafka's synchronization strategy and differs from consensus algorithms such as Raft. Raft requires a majority (N/2 + 1) of the nodes in the cluster to be healthy, and under that condition uses a fairly involved algorithm to ensure that the newly elected leader is in a consistent state. Kafka's ISR strategy, through the elasticity of the ISR list plus HW & LEO updates, strikes a balance between message consistency and throughput to a certain extent.

The ISR expresses the synchronization state of messages through the concepts of HW and LEO:

  • HW: High Watermark, commonly known as the high water mark. It marks a specific message offset; within a partition, consumers can only pull messages before this offset (this offset is not the same concept as the consumer offset).

  • LEO: LogEndOffset, the offset at the end of the log, i.e. the offset of the next message to be written to the current log file.

  • Leader HW = the minimum LEO among all replicas of the partition.

  • Follower HW = min(the follower's own LEO, the leader HW).

The leader stores not only its own HW & LEO but also the HW & LEO of the remote replicas

In simple terms, each replica stores its own HW and LEO; in addition, the leader stores the LEO of each remote replica, which it uses when recalculating its own HW. Because the remote LEOs are recorded on the leader with a delay, there can be a short-lived difference between a replica's real LEO and the LEO recorded on the leader, which causes some problems discussed later.

The update strategy of HW and LEO is as follows:

HW/LEO update process for a complete write request:

1. Initial state

All HW & LEO values on the leader are 0; the follower establishes a connection with the leader, and all HW & LEO values on both the follower and the leader are 0.

2. Follower's first fetch:

The producer sends a message to the leader, so the leader's LEO = 1. The follower starts a fetch carrying its own HW & LEO (both 0); the leader computes its HW = min(all follower LEOs) = 0 and records the follower's LEO as 0. The follower pulls one message, and the response carries the message together with the leader's HW (0) & LEO (1); the follower updates its own LEO to 1 and its own HW to min(own LEO (1), leader HW (0)) = 0.

3. Follower second fetch:

The follower fetches again, carrying its own HW (0) & LEO (1). The leader now updates its HW to 1, updates its recorded follower LEO to 1, and returns its HW (1) & LEO (1) to the follower, which then updates its own HW & LEO.
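To make the update rules easier to follow, here is a small illustrative sketch (not Kafka source code) that replays the two fetches above using the formulas leader HW = min(all LEOs known to the leader) and follower HW = min(own LEO, leader HW):

package main

import "fmt"

// minOffset is a helper for the min() in the HW formulas.
func minOffset(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func main() {
	var leaderLEO, leaderHW int64     // the leader's own log state
	var followerLEO, followerHW int64 // the follower's own log state
	var remoteFollowerLEO int64       // the follower LEO as recorded on the leader

	// The producer writes one message to the leader.
	leaderLEO = 1

	// First fetch: the follower reports LEO=0 and pulls the message.
	remoteFollowerLEO = 0
	leaderHW = minOffset(leaderLEO, remoteFollowerLEO) // still 0
	followerLEO = 1                                    // the follower appends the fetched message
	followerHW = minOffset(followerLEO, leaderHW)      // min(1, 0) = 0

	// Second fetch: the follower reports LEO=1, so the leader can advance its HW.
	remoteFollowerLEO = 1
	leaderHW = minOffset(leaderLEO, remoteFollowerLEO) // 1
	followerHW = minOffset(followerLEO, leaderHW)      // min(1, 1) = 1

	fmt.Println(leaderLEO, leaderHW, followerLEO, followerHW) // 1 1 1 1
}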

Returning to the problem just mentioned, this HW and LEO update strategy has an obvious flaw: the follower's HW is only updated on the follower's second fetch, using the values returned by the leader, while the leader's HW has already been updated by then. During this window, if the follower or the leader node fails, their HWs end up in an inconsistent state, which leads to further consistency problems. Consider the following scenario:

  • After the leader updates the partition HW, the follower's HW has not yet been updated; at this moment the follower restarts.

  • After the follower restarts, its LEO is set to its previous HW value (0), and the message is truncated (a temporary state).

  • The follower re-synchronizes from the leader; at this moment the leader goes down. If no new leader can be elected the partition is unavailable.

  • The follower is elected as the new leader, and msg 1 is permanently lost.

When Kafka is configured as an AP system, since min.insync.replicas is 1, the probability of the follower truncating after a restart increases greatly, and the situation may be even worse when more replicas exist. To fix this defect in the HW & LEO synchronization mechanism, newer versions of Kafka introduced the concept of the leader epoch.

Leader epoch consists of two parts:

  • Epoch: a version number. Whenever partition leadership changes, the version number increases; a leader with a smaller version number is considered an expired leader and can no longer exercise leader rights.

  • Start Offset: the offset of the first message written by the leader replica under that epoch.

A leader epoch of (1, 120) means that the leader's version number is 1 and that this version starts at the 120th message. The Kafka broker caches the leader epoch data of each partition in memory and also periodically persists it to a checkpoint file. When the leader replica writes messages to disk, the broker tries to update this cache: if this is the leader's first write under the current epoch, the broker adds a new leader-epoch entry to the cache; otherwise it leaves it unchanged. In this way, on every leader change the new leader replica queries this cache and takes the start offset of the corresponding leader epoch, avoiding data loss and inconsistency.

The diagram is as follows:

Kafka uses the ISR synchronization mechanism and its optimizations, together with HW & LEO, to balance data durability and throughput. The management of the ISR is ultimately reflected in ZooKeeper; its implementation and the leader election strategy are not expanded on here.
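To illustrate the leader-epoch idea described above, here is a rough sketch (the data structures and method names are invented for the example and are not Kafka's actual implementation): the broker keeps an ordered list of (epoch, start offset) entries per partition, and a restarted follower asks the new leader where its old epoch ends instead of truncating to its possibly stale HW.

package main

import "fmt"

// epochEntry and epochCache are illustrative only.
type epochEntry struct {
	Epoch       int32
	StartOffset int64
}

type epochCache struct {
	entries []epochEntry // ordered by epoch, e.g. {0, 0}, {1, 120}
}

// assign adds an entry only when leadership changes, i.e. the epoch grows.
func (c *epochCache) assign(epoch int32, startOffset int64) {
	if n := len(c.entries); n == 0 || c.entries[n-1].Epoch < epoch {
		c.entries = append(c.entries, epochEntry{epoch, startOffset})
	}
}

// endOffsetFor answers "where does the requested epoch end?": the start offset
// of the first later epoch, or the log end offset if there is none.
func (c *epochCache) endOffsetFor(epoch int32, logEndOffset int64) int64 {
	for _, e := range c.entries {
		if e.Epoch > epoch {
			return e.StartOffset
		}
	}
	return logEndOffset
}

func main() {
	cache := &epochCache{}
	cache.assign(0, 0)   // first leader, starting at offset 0
	cache.assign(1, 120) // new leader elected, its first message is offset 120

	// A follower that was last on epoch 0 asks the new leader where epoch 0 ends
	// and truncates to that offset rather than to its old HW.
	fmt.Println(cache.endOffsetFor(0, 300)) // 120
}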

Consumer's reliability strategy

The consumer's reliability strategy centers on the consumer's delivery semantics, namely:

  • When is a message consumed, and which message?

  • Can a message be lost during consumption?

  • Can a message be consumed more than once?

These delivery semantics can be selected through a few parameters of the Kafka consumer. In short, there are three scenarios:

1. AutoCommit (at most once: if the process dies after the commit but before processing finishes, the message is effectively lost)

enable.auto.commit = true

auto.commit.interval.ms

A consumer configured as above automatically commits the offset back to the broker after receiving messages, but if the business logic is interrupted before it completes, the message has in fact not been consumed successfully. This scenario is suitable for services with low reliability requirements. Here auto.commit.interval.ms is the auto-commit interval; for example, if it is set to commit once per second, then after a crash and restart within that second, consumption resumes from the last committed offset, and messages that were consumed but not yet committed within that second are consumed again.
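With sarama, the auto-commit behaviour described above is configured roughly as follows (field names as in recent sarama versions; the broker address and group name are placeholders, and the Consume loop is omitted):

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Consumer.Offsets.AutoCommit.Enable = true              // enable.auto.commit = true
	cfg.Consumer.Offsets.AutoCommit.Interval = 1 * time.Second // auto.commit.interval.ms = 1000

	// A crash after a background commit but before the business logic finishes
	// means the message is effectively lost (at-most-once).
	group, err := sarama.NewConsumerGroup([]string{"127.0.0.1:9092"}, "comment-workers", cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer group.Close()
}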

2. Manual commit (at least once: if the process dies before the commit, the message is delivered again after restart rather than lost)

enable.auto.commit = false

When manual commit is configured, the business developer must commit the offset only after the whole flow, from consuming the message to finishing the business logic, has completed. If the process restarts before it finishes, the previously consumed but uncommitted messages are consumed again; in other words, a message can clearly be delivered more than once, so the application and business logic must be designed to be idempotent.
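A hedged sketch of the manual-commit flow with sarama's consumer-group API (handleComment and the handler type are invented for the example; session.Commit() is available in recent sarama versions when auto-commit is disabled):

package consumer

import "github.com/Shopify/sarama"

// handleComment is a stand-in for the real, idempotent business logic.
func handleComment(payload []byte) error { return nil }

type commentHandler struct{}

func (commentHandler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (commentHandler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim commits the offset only after the business logic succeeds, so a
// crash before the commit leads to redelivery rather than loss.
func (commentHandler) ConsumeClaim(sess sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		if err := handleComment(msg.Value); err != nil {
			return err // not marked: the message will be delivered again
		}
		sess.MarkMessage(msg, "") // record the offset in the session
		sess.Commit()             // flush it to the broker (enable.auto.commit = false)
	}
	return nil
}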

Special attention should also be paid to the configuration of a few parameters of the sarama library in Go:

sarama.offset.initial (oldest, newest)
offsets.retention.minutes

initial = oldest means the consumer starts from the oldest message in the topic that it can still access, a position that may be greater than the committed position but is smaller than the HW; it is also affected by the broker's message retention time and offset retention time, so there is no guarantee that messages from the very beginning of the topic can still be consumed.

If it is set to newest, the consumer starts from the message after the committed position. If the consumer restarts without a valid committed offset (for example, auto commit was disabled and no manual commit happened), the earlier messages are skipped and will never be consumed. Pay special attention to this in particularly unstable environments or with non-persistent consumer instances.

By default, offsets.retention.minutes is 1440 minutes (24 hours); committed offsets older than this may be discarded by the broker.
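In sarama this corresponds roughly to the following consumer setting (constant names as in the library; it only takes effect when the group has no valid committed offset):

cfg := sarama.NewConfig()
// Where to start when no committed offset exists, or the committed offset has
// expired after offsets.retention.minutes: OffsetOldest or OffsetNewest.
cfg.Consumer.Offsets.Initial = sarama.OffsetOldest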

3. Exactly once (difficult to achieve: msg persistence and the commit must be atomic)

Exactly-once delivery semantics are hard to achieve. A message must not be redelivered after it has been consumed and committed, and at the same time the commit must happen only after the entire business logic has completed. Without Kafka itself providing an end-to-end interface for this scenario, it is almost impossible to achieve effectively in the general case. The usual approach is to persist the message atomically, and have the business logic asynchronously take messages out of that storage for processing.
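One common approximation, sketched below with database/sql (the table name, schema and MySQL-style SQL are invented for the example): the consumer writes the message and its coordinates into a local "inbox" table in one transaction, commits the Kafka offset only afterwards, and a separate worker later processes the stored rows; a unique key on the message coordinates turns redelivered messages into no-ops.

package inbox

import (
	"database/sql"

	"github.com/Shopify/sarama"
)

// persistMessage stores a consumed message durably before the Kafka offset is
// committed; the business logic consumes rows from the inbox table asynchronously.
func persistMessage(db *sql.DB, msg *sarama.ConsumerMessage) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// A unique key on (topic, partition_id, msg_offset) makes a redelivered
	// message a no-op, giving effectively-once processing downstream.
	if _, err := tx.Exec(
		`INSERT IGNORE INTO inbox (topic, partition_id, msg_offset, payload)
		 VALUES (?, ?, ?, ?)`,
		msg.Topic, msg.Partition, msg.Offset, msg.Value,
	); err != nil {
		return err
	}
	return tx.Commit() // only after this succeeds do we commit the Kafka offset
}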

Origin blog.csdn.net/Tencent_TEG/article/details/110015834