Kafka's data reliability guarantee


To guarantee that data sent by a producer reliably reaches the specified topic, each partition of the topic sends an ack (acknowledgement) back to the producer after receiving the producer's data. If the producer receives the ack, it sends the next round of messages; otherwise it resends the data.
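This ack-and-retry loop can be sketched in a few lines of Python. This is an illustrative simulation, not Kafka client code; `send_fn`, `flaky_broker`, and `max_retries` are hypothetical names standing in for the broker round-trip:

```python
def send_with_retry(send_fn, record, max_retries=3):
    """Resend a record until an ack arrives (illustrative sketch).

    send_fn(record) models a broker call that returns True when the ack
    is received and False when the ack times out or fails.
    """
    for _attempt in range(max_retries + 1):
        if send_fn(record):
            return True   # ack received: the producer can send the next round
    return False          # gave up after max_retries resends

# Usage: a fake broker that fails twice, then acks on the third try.
calls = {"n": 0}
def flaky_broker(record):
    calls["n"] += 1
    return calls["n"] >= 3

assert send_with_retry(flaky_broker, "event-1") is True
```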

1) Replica data synchronization strategy

| Scheme | Advantage | Disadvantage |
| --- | --- | --- |
| Send ack once more than half of the replicas have synchronized | Low latency | To tolerate the failure of n nodes when electing a new leader, 2n+1 replicas are required |
| Send ack once all replicas have synchronized | To tolerate the failure of n nodes when electing a new leader, only n+1 replicas are needed | High latency |

Kafka chose the second scheme, for the following reasons:
1. To tolerate the failure of n nodes, the first scheme requires 2n+1 replicas while the second requires only n+1. Since each Kafka partition holds a large amount of data, the first scheme would cause massive data redundancy.
2. Although the second scheme has higher network latency, network latency has relatively little impact on Kafka.
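The replica-count arithmetic behind reason 1 can be made concrete. A minimal sketch, with the two formulas taken directly from the table above:

```python
def replicas_majority(n):
    # Scheme 1: ack after a majority syncs. Surviving a failure of n nodes
    # while still holding a majority requires 2n + 1 replicas.
    return 2 * n + 1

def replicas_all(n):
    # Scheme 2 (the one Kafka chose): ack after all replicas sync.
    # Tolerating n failed nodes needs only n + 1 replicas.
    return n + 1

# To tolerate 2 failed nodes: 5 replicas vs. 3.
assert replicas_majority(2) == 5
assert replicas_all(2) == 3
```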

2) ISR

After adopting the second scheme, imagine the following scenario: the leader receives data and all followers begin synchronizing it, but one follower, because of some failure, falls behind and cannot catch up with the leader. The leader would then have to wait until that follower finished synchronizing before it could send the ack. How is this problem solved?

The leader maintains a dynamic in-sync replica set (ISR): the set of followers that are in sync with the leader. Once all followers in the ISR have finished synchronizing the data, the leader sends an ack to the producer. If a follower fails to synchronize with the leader for too long, it is kicked out of the ISR; the time threshold is set by the replica.lag.time.max.ms parameter. When the leader fails, a new leader is elected from the ISR.
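The ISR eviction rule can be sketched as a pure function. This is a simplified illustration, not broker code; the 30-second threshold is an assumed value standing in for replica.lag.time.max.ms:

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # stand-in for replica.lag.time.max.ms

def prune_isr(isr, last_caught_up_ms, now_ms):
    """Return the ISR with lagging followers evicted (simplified sketch).

    last_caught_up_ms maps follower id -> the last time (in ms) that
    follower was fully caught up with the leader.
    """
    return {f for f in isr
            if now_ms - last_caught_up_ms[f] <= REPLICA_LAG_TIME_MAX_MS}

now = 100_000
isr = prune_isr({"f1", "f2", "f3"},
                {"f1": 95_000, "f2": 60_000, "f3": 99_000},
                now)
assert isr == {"f1", "f3"}   # f2 lagged more than the threshold and is kicked out
```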

3) ack response mechanism

For some less important data, the reliability requirements are not very high and a small amount of data loss can be tolerated, so there is no need to wait for every follower in the ISR to receive the data successfully. Kafka therefore offers users three reliability levels; users can choose among the following configurations according to their reliability and latency requirements.

acks parameter configuration:
acks:
0: the producer does not wait for an ack from the broker. This gives the lowest latency: the broker may return before the data has been written to disk, so data can be lost if the broker fails;
1: the producer waits for the broker's ack; the partition leader returns the ack after the record has been written to its log. If the leader fails before the followers finish synchronizing, data will be lost;

-1 (all): the producer waits for the broker's ack, which is returned only after both the leader and the followers of the partition have persisted the record. If the leader fails after the followers finish synchronizing but before the broker sends the ack, the producer resends and the data is duplicated.
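The three acks levels can be condensed into one decision function. A minimal sketch, assuming a single leader plus a set of ISR followers; `ack_ready` is a hypothetical name, not a Kafka API:

```python
def ack_ready(acks, leader_written, followers_synced, total_followers):
    """When does the broker send the ack back, for each acks setting? (sketch)

    acks=0  -> the producer never waits, so the ack is effectively immediate
    acks=1  -> ack once the leader has written the record
    acks=-1 -> ack once the leader wrote it AND every ISR follower synced
    """
    if acks == 0:
        return True
    if acks == 1:
        return leader_written
    if acks == -1:           # "all"
        return leader_written and followers_synced == total_followers
    raise ValueError("acks must be 0, 1 or -1")

# Leader has written, but only 1 of 2 ISR followers has synced:
assert ack_ready(0, False, 0, 2) is True   # acks=0 never waits
assert ack_ready(1, True, 1, 2) is True    # leader write is enough
assert ack_ready(-1, True, 1, 2) is False  # still waiting on a follower
```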

4) Failure-handling details

LEO (Log End Offset): the largest offset in each replica;
HW (High Watermark): the largest offset visible to consumers, i.e., the smallest LEO in the ISR.
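Given those definitions, the HW is simply the minimum LEO across the ISR. A one-function sketch with made-up replica names:

```python
def high_watermark(leo_by_replica, isr):
    """HW = the smallest LEO among the in-sync replicas (sketch)."""
    return min(leo_by_replica[r] for r in isr)

leo = {"leader": 10, "f1": 8, "f2": 6}
# Consumers can only see offsets up to the slowest in-sync replica:
assert high_watermark(leo, {"leader", "f1", "f2"}) == 6
```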
(1) Follower failure: when a follower fails, it is temporarily kicked out of the ISR. After recovering, the follower reads the last HW recorded on its local disk, truncates the part of its log file above the HW, and synchronizes from the leader starting at the HW. Once the follower's LEO is greater than or equal to the partition's HW, i.e., once the follower has caught up with the leader, it can rejoin the ISR.
(2) Leader failure: after the leader fails, a new leader is elected from the ISR. To ensure data consistency across replicas, the remaining followers first truncate the parts of their log files above the HW and then synchronize data from the new leader.
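Both recovery paths boil down to the same step: cut the log back to the HW, then resync from the (new) leader. A minimal sketch using plain Python lists as logs, with hypothetical example entries:

```python
def truncate_and_sync(log, hw, leader_log):
    """On recovery or leader change: drop entries above HW, then copy
    the remainder from the leader (illustrative sketch)."""
    log = log[:hw]                 # discard the part of the log above HW
    log += leader_log[len(log):]   # resync the rest from the leader
    return log

follower = ["a", "b", "c", "x", "y"]   # diverged above HW = 3
leader   = ["a", "b", "c", "d", "e"]
assert truncate_and_sync(follower, 3, leader) == ["a", "b", "c", "d", "e"]
```

Note this procedure guarantees consistency between replicas, not that no committed-but-unreplicated data is lost; that part is governed by the acks setting above.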

Origin blog.csdn.net/qq_42706464/article/details/108843338