How Kafka solves the problem of message loss

Looking at Kafka's overall architecture, a message goes through three delivery stages:

  1. The Producer sends the message to the Broker
  2. The Broker processes the message and persists it
  3. The Consumer pulls and consumes the message from the Broker

Data loss can occur at each of these three stages, so under what conditions can Kafka guarantee that messages are not lost?

Message loss on the Producer side

To improve sending efficiency and reduce I/O operations, the Producer batches requests and sends them asynchronously. As a result, message loss on the Producer side usually means the message never reached the Broker at all.

A Producer send can fail for the following reasons:

  • Network: due to network jitter, the data never reaches the Broker
  • Message size: the message body is larger than the Broker will accept, so the Broker rejects the message

Solution

Data is lost on the Producer side because sending is asynchronous. If you use the fire-and-forget style, that is, calling Producer.send(msg) and returning immediately without a callback, then when the Broker fails to receive the message (for example because of a network problem), the loss goes completely unnoticed.

Therefore, message loss on the Producer side can be addressed from the following angles:

  • Send messages with the callback variant of send, so delivery failures become visible (see the sketch after this list)
  • Configure the ACK acknowledgement mechanism
  • Configure the number of retries
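
For instance, a minimal sketch of sending with a callback might look like the following; the broker address and topic name are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CallbackProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("demo-topic", "key", "value"); // illustrative topic
            // Send with a callback instead of fire-and-forget: the callback runs once the
            // Broker responds, so a failed delivery becomes visible and can be handled.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    // delivery failed (e.g. network jitter, record too large): log and/or retry
                    exception.printStackTrace();
                } else {
                    System.out.printf("delivered to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush(); // make sure buffered records are actually sent before exiting
        }
    }
}
```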

The Producer uses the acks configuration to decide when a message counts as successfully produced. The possible values are listed below, followed by a small configuration sketch:

  • 0: the send is considered successful as soon as the request is transmitted; if network jitter occurs at that moment, the data is lost
  • 1: the send is considered successful once the Leader partition has received the message. As long as the Leader partition stays alive, the data is not lost; but if the Leader goes down before the Follower partitions have synchronized the data and acknowledged it, the data is lost
  • -1 or all: the send is only considered successful after the Leader partition and all Follower partitions in the ISR have confirmed receipt. This gives the highest reliability, but it still does not guarantee zero loss: for example, if the ISR contains only the Leader partition, this degenerates into the acks = 1 case
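
As a small sketch of these settings using the ProducerConfig constants from the Kafka clients library (the broker address and the specific values are illustrative, not prescriptive):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class ReliableProducerConfig {
    // Reliability-related producer settings from the list above; values are illustrative.
    public static Properties reliableProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the Leader and all ISR Followers
        props.put(ProducerConfig.RETRIES_CONFIG, 3);  // retry transient failures such as network jitter
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // overall bound on send + retries
        return props;
    }
}
```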

Message loss on the Broker side

After the Broker receives the data, it persists the message to disk. To improve throughput and performance, it flushes asynchronously in batches, that is, it flushes to disk once a certain number of messages has accumulated or a time interval has elapsed.

The data first lands in the PageCache; when it actually reaches disk is decided by the operating system according to its own flush policy, unless fsync is called to force a flush. If the Broker crashes before the data has been synchronized to the Follower partitions and a new Leader partition is elected, the lagging messages are lost.

Because the Broker flushes messages asynchronously in batches, data loss is possible; and because Kafka does not force a synchronous flush to disk for every message, a single Broker on its own can still lose data.

Kafka relies on its multi-partition, multi-replica mechanism to minimize data loss. Even so, if data has been written to the PageCache but not yet flushed to disk and the Broker hosting it suddenly crashes or loses power, data can still be lost in extreme cases.

Solution

Messages are lost on the Broker side because of the asynchronous batch-flushing strategy: data is written to the PageCache first and only flushed to disk later.

Kafka therefore relies on multiple partitions and multiple replicas to minimize data loss, which can be controlled with the following parameters:

  • unclean.leader.election.enable: controls which Followers are eligible to be elected Leader. If a Follower whose data lags too far behind the Leader is elected as the new Leader, data is lost, so set this to false to prevent that from happening.
  • replication.factor: the number of replicas per partition. It is recommended to set replication.factor >= 3, so that if the Leader replica fails, a Follower replica can be elected as the new Leader and continue to provide service.
  • min.insync.replicas: how many replicas in the ISR a message must be written to before it is considered "committed". It is recommended to set min.insync.replicas > 1 to improve message durability and avoid data loss.

In addition, make sure that replication.factor > min.insync.replicas; if they are equal, then losing a single replica stops the whole partition from working. The recommended setting is replication.factor = min.insync.replicas + 1, which maximizes availability. A sketch of these topic-level settings follows.
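
As a sketch, assuming a local broker and an illustrative topic name, these settings could be applied when creating the topic with the Kafka AdminClient:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReliableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3 (replication.factor >= 3)
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3); // illustrative topic name
            // Topic-level overrides: min.insync.replicas = 2 (replication.factor = min.insync.replicas + 1)
            // and unclean leader election disabled so a lagging replica cannot become Leader.
            topic.configs(Map.of(
                    "min.insync.replicas", "2",
                    "unclean.leader.election.enable", "false"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```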

Message loss on the Consumer side

The message consumption process is mainly divided into two stages:

  • Pull data from the Broker
  • Process the message and commit the Offset

The Consumer must commit the Offset after pulling messages, and this is where data can be lost. The possible scenarios are:

  • The Offset may be committed automatically (enable.auto.commit = true), so the commit can happen regardless of whether processing has actually finished.
  • Commit the Offset right after pulling the messages, then process them. If the Consumer crashes while processing, the Offset has already been committed, so after a restart consumption resumes from the position after the committed Offset; the messages that were being processed are never processed again, and from the Consumer's point of view they are lost.
  • Pull the messages, process them first, then commit the Offset. If the Consumer crashes before the commit, the Offset has not been committed, so after a restart the messages are pulled again from the last committed Offset. No messages are lost, but there is duplicate consumption, which only the business logic itself can handle by being idempotent.

Solution

Messages are lost on the Consumer side when the Offset is committed right after pulling, before the messages are actually processed. To avoid losing data, the correct order is: pull the data, process the business logic, and only then commit the consumption Offset.

At the same time, set enable.auto.commit = false and commit offsets manually, as sketched below. For the duplicate-consumption case, the business logic itself must guarantee idempotency, so that processing the same message more than once has the effect of a single successful consumption.
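
A minimal sketch of this pull, process, then manually commit pattern (the broker address, group id, and topic name are illustrative):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // illustrative group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false"); // disable auto commit: commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic")); // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Business processing should be idempotent: if the consumer crashes before
                    // the commit below, these records will be re-delivered and processed again.
                    process(record);
                }
                consumer.commitSync(); // commit only after the whole batch has been processed
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("consumed %s@%d: %s%n", record.topic(), record.offset(), record.value());
    }
}
```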

Origin blog.csdn.net/xhaimail/article/details/132324586