Kafka data loss problem

How does Kafka ensure that data is not lost?

1. Producer data is not lost
Kafka's ack mechanism: every time the producer sends a message, the broker returns an acknowledgement so the producer knows whether the message was received. The acks setting can be 0, 1, or -1 (all).

In synchronous mode, the ack mechanism can ensure that data is not lost. Setting ack to 0 is very risky and generally not recommended. Even with ack set to 1, data can still be lost if the leader goes down before the followers have replicated the message.

producer.type=sync 
request.required.acks=1
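For reference, here is a minimal sketch of the same idea with the modern Java producer client (the properties above belong to the older producer API). The broker address and topic name are placeholders: acks=all corresponds to -1, and blocking on the returned Future makes the send effectively synchronous.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SyncProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("acks", "all");                            // wait for leader and in-sync replicas
        props.put("retries", "3");                           // retry transient failures
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record = new ProducerRecord<>("demo-topic", "key", "value");
            // Blocking on get() makes the send effectively synchronous:
            // an exception here means the broker did not acknowledge the write.
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("written to %s-%d@%d%n", meta.topic(), meta.partition(), meta.offset());
        }
    }
}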


In asynchronous mode, the ack setting still applies, but the producer also buffers messages and sends them in batches. Flushing of the buffer is controlled by two thresholds: a time limit and a message-count limit. If the buffer fills up before the data can be sent out, a configuration option decides what happens: setting the enqueue timeout to -1 makes the producer block indefinitely, which means no data is dropped (production simply stops), as sketched in the example after the configuration below.
Even with -1, data can still be lost in asynchronous mode through careless operations, such as killing the producer process with kill -9 while the buffer still holds unsent messages, but that is an exceptional case.

producer.type=async 
request.required.acks=1 
queue.buffering.max.ms=5000 
queue.buffering.max.messages=10000 
queue.enqueue.timeout.ms = -1 
batch.num.messages=200
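For the asynchronous case, here is a sketch with the modern Java producer, again with a placeholder broker address and topic name: linger.ms and batch.size play the role of the time and count thresholds above (batch.size is in bytes, whereas the old client counted messages), buffer.memory bounds the unsent data, and a very large max.block.ms makes send() block instead of failing when the buffer is full.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AsyncProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
        props.put("acks", "all");
        props.put("linger.ms", "5000");                       // time threshold before a batch is sent
        props.put("batch.size", "16384");                     // size threshold per batch, in bytes
        props.put("buffer.memory", "33554432");               // total buffer for unsent records
        props.put("max.block.ms", String.valueOf(Long.MAX_VALUE)); // block rather than fail when the buffer is full
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        // The callback reports failures that happen after send() returns.
                        if (exception != null) {
                            exception.printStackTrace();
                        }
                    });
            producer.flush(); // drain the buffer before exiting, otherwise buffered records are lost
        }
    }
}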


Conclusion: the producer can lose data, but with the right configuration it can guarantee that messages are not lost.


2. Consumer data is not lost
The offset commit is used to ensure that data is not lost. Kafka records the offset of each consumed message, and on the next run the consumer resumes from the last committed offset.

Before version 0.8 of Kafka the offset information was saved in ZooKeeper; from 0.8 onwards it is saved to an internal topic. Even if the consumer crashes while running, on restart it reads the committed offset, finds where the previous consumption stopped, and continues from there. Because the offset is not committed after every single message, this can lead to repeated consumption, but messages are not lost.
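Here is a sketch of the commit-after-processing pattern with the Java consumer, assuming a placeholder group id, topic, and processing step: auto-commit is disabled and the offset is committed only after the records have been processed, so a crash causes re-consumption rather than loss.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder broker address
        props.put("group.id", "demo-group");                  // placeholder group id
        props.put("enable.auto.commit", "false");             // commit manually, only after processing succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // hypothetical processing step
                }
                // Committing after processing means a crash leads to re-consumption,
                // not message loss (at-least-once delivery).
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + ": " + record.value());
    }
}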

The only exception is when two consumer groups that should be doing different jobs are accidentally given the same group id in the program (for example via KafkaSpoutConfig.Builder.setGroupId). In that case the two groups share the same subscription: group A might consume the messages in partition1 and partition2 while group B consumes the messages in partition3, so each group sees only part of the data and its consumed stream is incomplete. To make sure each group gets its own complete copy of the message data, group ids must not be shared.
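A small sketch of the fix, with hypothetical group names: each logical consumer gets its own group.id, so each group independently receives a complete copy of the topic's messages.

import java.util.Properties;

public class GroupIdExample {
    // Two logical consumers that must each see the full stream need distinct group.id values.
    // The group names below are placeholders.
    static Properties groupA() {
        Properties props = new Properties();
        props.put("group.id", "order-processing");  // group A: receives every message in the topic
        return props;
    }

    static Properties groupB() {
        Properties props = new Properties();
        props.put("group.id", "order-auditing");    // group B: independently receives every message
        return props;
    }
}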


3. Broker data in the Kafka cluster is not lost
Each partition of a topic is usually given several replicas. When the producer writes, the message is routed to a leader according to the partitioning strategy (use the specified partition if one is given, otherwise hash the key if one is given, otherwise round-robin), and the followers (replicas) synchronize the data from the leader. With these backups in place, the message data is not lost even if a broker fails.
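A sketch of creating such a replicated topic with the AdminClient, assuming a placeholder topic name and broker address: three replicas per partition plus min.insync.replicas=2 means an acks=all write succeeds only after at least two replicas hold the message.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ReplicatedTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, 3 replicas per partition: the leader takes writes,
            // followers replicate, so a single broker failure loses no data.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            // Require at least 2 in-sync replicas before an acks=all write succeeds.
            topic.configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}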

 


Origin blog.csdn.net/qq_32445015/article/details/102667048