[Kafka] Kafka Idempotence: Principle and Implementation Analysis


1. Overview

Reprinted from: https://www.cnblogs.com/smartloli/p/11922639.html

I recently spoke with some students who reported that in Kafka interviews they could answer detailed questions about Kafka's components, API usage, and the roles and principles of the Consumer and Producer, but got stuck on a topic they usually overlook: Kafka's idempotence. So today I will analyze the principle and implementation of Kafka's idempotence.

2. Content

2.1 Why does Kafka need idempotence?

When producing and sending messages, the Producer will inevitably send duplicates: when no Ack arrives, the retry mechanism kicks in and the same message is sent again. With idempotence enabled, repeated sends result in only one valid message. Kafka is a distributed messaging system, and its usage scenarios are common in distributed systems, such as message push systems and business platform systems (logistics platforms, bank settlement platforms, and so on). Take a bank settlement platform as an example: upstream business parties report data to the platform, and if one record is calculated and processed multiple times, the impact is very serious.

2.2 What are the factors affecting Kafka's idempotence?

When using Kafka, you often need to guarantee Exactly-Once semantics, but in a distributed system there are many uncontrollable factors, such as network failures, OOM, and Full GC. When the Kafka Broker confirms a write, the Ack may time out because of a network exception, Full GC, OOM, and so on, and the Producer will then resend the message. The possible situations are shown below:

[Figure: scenarios that cause the Producer to send duplicates]
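On the client side, deduplication is switched on through producer configuration. Below is a minimal sketch of the relevant settings; the broker address and retry count are illustrative placeholders, while `enable.idempotence` and the `acks=all` requirement are real Kafka producer configs:

```java
import java.util.Properties;

public class IdempotentProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Turning this on makes the producer attach a PID and sequence number
        // to each batch so the broker can discard duplicate retries.
        props.put("enable.idempotence", "true");
        // Idempotence requires acks=all, and retries should be enabled so a
        // lost Ack actually triggers the (now safe) resend.
        props.put("acks", "all");
        props.put("retries", "3");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("enable.idempotence")); // true
    }
}
```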

2.3 How is Kafka's idempotence achieved?

In order to achieve idempotence, Kafka introduced two concepts in its underlying design: the ProducerID and the SequenceNumber. What are they for?

  1. ProducerID: each newly initialized Producer is assigned a unique ProducerID, which is invisible to the client user.
  2. SequenceNumber: for each ProducerID, every topic-partition the Producer sends to has a corresponding SequenceNumber that starts at 0 and increases monotonically.

2.3.1 What was the problem before idempotence was introduced?

Before Kafka introduced idempotence, the Producer sent a message to the Broker, the Broker appended the message to the log, and then returned an Ack to the Producer. The flow is as follows:

[Figure: ideal send flow without idempotence]

The flow in the figure above is the ideal case. In reality there are many uncertain factors, for example a network exception while the Producer sends to the Broker, such as the following abnormal situation:

[Figure: duplicate append caused by a lost Ack]

In the figure above, when the Producer first sends a message to the Broker, the Broker appends the message (x2, y2) to the log, but the Ack back to the Producer fails (for example, due to a network exception). The Producer then triggers its retry mechanism and resends (x2, y2) to the Broker. After receiving it, the Broker appends the message to the log again, and this time successfully returns the Ack. As a result, two identical (x2, y2) messages end up in the log.
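To make this failure concrete, here is a toy model (not real broker code) of a broker with no deduplication: a retried message is simply appended a second time, exactly as in the figure.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: a broker that blindly appends every message it receives,
// so a client retry after a lost Ack duplicates the record.
public class NaiveBroker {
    private final List<String> log = new ArrayList<>();

    // Returns true as the Ack; the caller may never actually see it.
    public boolean append(String message) {
        log.add(message);
        return true;
    }

    public List<String> log() { return log; }

    public static void main(String[] args) {
        NaiveBroker broker = new NaiveBroker();
        broker.append("(x1,y1)");
        broker.append("(x2,y2)"); // Ack lost on the way back...
        broker.append("(x2,y2)"); // ...so the producer retries: duplicate!
        System.out.println(broker.log()); // [(x1,y1), (x2,y2), (x2,y2)]
    }
}
```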

2.3.2 What problem does idempotence solve?

Faced with this problem, Kafka introduced idempotence. How does idempotence prevent duplicate sends? Let's first take a look at the flowchart:

[Figure: ideal send flow with idempotence]

Again, this is the ideal flow; in reality there are many uncertain factors. For example, a network exception may occur while the Broker returns the Ack to the Producer, causing it to be lost, as shown in the following figure:

[Figure: retry deduplicated by PID and SequenceNumber]

The Producer sends message (x2, y2) to the Broker; the Broker receives it and appends it to the log, but the Ack back to the Producer is lost. The Producer's retry mechanism then sends (x2, y2) again. With idempotence enabled, however, every message carries a PID (ProducerID) and a SequenceNumber. The retry arrives with the same PID and SequenceNumber as before, and since the Broker has cached that pair for the message it already appended, it discards the duplicate. The log therefore contains only one (x2, y2) message, and no duplicate is stored.
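The broker-side check described above can be sketched as a toy model (not the real broker code, which tracks sequence numbers per partition and keeps a small window of recent batches): the broker remembers the last sequence number accepted per PID and drops anything it has already seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of broker-side deduplication: each message carries the
// producer's PID and a sequence number, and the broker drops any
// message whose sequence number it has already accepted for that PID.
public class DedupBroker {
    private final List<String> log = new ArrayList<>();
    // Last sequence number accepted per PID (per partition in real Kafka).
    private final Map<Long, Integer> lastSeq = new HashMap<>();

    public boolean append(long pid, int seq, String message) {
        Integer last = lastSeq.get(pid);
        if (last != null && seq <= last) {
            return true;            // duplicate retry: re-Ack, append nothing
        }
        log.add(message);
        lastSeq.put(pid, seq);
        return true;
    }

    public List<String> log() { return log; }

    public static void main(String[] args) {
        DedupBroker broker = new DedupBroker();
        broker.append(1L, 0, "(x1,y1)");
        broker.append(1L, 1, "(x2,y2)"); // Ack lost...
        broker.append(1L, 1, "(x2,y2)"); // ...retry with same seq is dropped
        System.out.println(broker.log()); // [(x1,y1), (x2,y2)]
    }
}
```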

2.3.3 How is ProducerID generated?

When the client creates a Producer, it instantiates it with code like the following:

// Instantiate a Producer object
Producer<String, String> producer = new KafkaProducer<>(props);

In the org.apache.kafka.clients.producer.internals.Sender class, run() calls a maybeWaitForPid() method to obtain a ProducerID. The implementation is as follows:

private void maybeWaitForPid() {
    if (transactionState == null)
        return;

    while (!transactionState.hasPid()) {
        try {
            Node node = awaitLeastLoadedNodeReady(requestTimeout);
            if (node != null) {
                ClientResponse response = sendAndAwaitInitPidRequest(node);
                if (response.hasResponse() && (response.responseBody() instanceof InitPidResponse)) {
                    InitPidResponse initPidResponse = (InitPidResponse) response.responseBody();
                    transactionState.setPidAndEpoch(initPidResponse.producerId(), initPidResponse.epoch());
                } else {
                    log.error("Received an unexpected response type for an InitPidRequest from {}. " +
                            "We will back off and try again.", node);
                }
            } else {
                log.debug("Could not find an available broker to send InitPidRequest to. " +
                        "We will back off and try again.");
            }
        } catch (Exception e) {
            log.warn("Received an exception while trying to get a pid. Will back off and retry.", e);
        }
        log.trace("Retry InitPidRequest in {}ms.", retryBackoffMs);
        time.sleep(retryBackoffMs);
        metadata.requestUpdate();
    }
}

3. Transactions

Another feature related to idempotence is transactions. A transaction in Kafka is similar to a transaction in a database: a series of operations in which a Producer produces messages and the offsets of consumed messages are committed are grouped into one transaction, forming an atomic operation that either succeeds or fails as a whole.

This must be distinguished from a database transaction: a database transaction covers a series of inserts, deletes, and updates, while a Kafka transaction covers a series of production and consumption operations treated as one atomic unit.

3.1 What is the purpose of Kafka's introduction of transactions?

Building on the Producer idempotence introduced above, the transaction attribute serves two purposes:

  1. Multiple messages sent by a Producer can be encapsulated into one atomic operation, which either succeeds or fails as a whole;
  2. In the consumer & producer pattern, a problem with the Consumer's offset commit can cause messages to be consumed, and therefore produced, repeatedly. In this pattern the Consumer's offset commit and the Producer's series of send operations need to be encapsulated into one atomic operation.

A scenario where this occurs:

For example, suppose the Consumer has finished processing messages up to offset 100, but its last committed offset is 50. If a rebalance is then triggered, another Consumer will resume from offset 50 and reconsume the messages between offsets 50 and 100.
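The numbers above can be worked through in a tiny sketch (the offsets 50 and 100 are the hypothetical values from the text): after a rebalance the new Consumer starts from the committed offset, so everything between the committed offset and the actual processing position is replayed.

```java
// Toy calculation: how many messages get reprocessed after a rebalance
// when processing ran ahead of the last committed offset.
public class RebalanceReplay {
    public static long replayedMessages(long committedOffset, long processedOffset) {
        // The new consumer resumes at committedOffset, so the range
        // [committedOffset, processedOffset) is consumed a second time.
        return Math.max(0, processedOffset - committedOffset);
    }

    public static void main(String[] args) {
        long committed = 50, processed = 100;
        System.out.println(replayedMessages(committed, processed)); // 50
    }
}
```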

3.2 What APIs does the transaction provide?

The Producer provides five transaction methods: initTransactions(), beginTransaction(), sendOffsetsToTransaction(), commitTransaction(), and abortTransaction(). They are defined in the org.apache.kafka.clients.producer.Producer<K,V> interface as follows:

// Initialize transactions; note that the transactional.id property must be set
void initTransactions();

// Begin a transaction
void beginTransaction() throws ProducerFencedException;

// Commit offsets within a transaction, on behalf of a Consumer
void sendOffsetsToTransaction(Map<TopicPartition, OffsetAndMetadata> offsets,
                              String consumerGroupId) throws ProducerFencedException;

// Commit the transaction
void commitTransaction() throws ProducerFencedException;

// Abort the transaction, similar to rolling back a database transaction
void abortTransaction() throws ProducerFencedException;
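The commit/abort semantics behind these methods can be illustrated with a self-contained toy model (this is not the real KafkaProducer, just a sketch of atomic visibility): messages sent inside a transaction only become visible on commitTransaction(), and abortTransaction() discards them all together.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of transactional commit/abort semantics: sends inside a
// transaction are buffered and only join the log on commitTransaction();
// abortTransaction() drops them atomically.
public class ToyTransactionalProducer {
    private final List<String> log = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private boolean inTransaction = false;

    public void beginTransaction() { inTransaction = true; }

    public void send(String message) {
        if (inTransaction) pending.add(message); else log.add(message);
    }

    public void commitTransaction() {
        log.addAll(pending);        // all-or-nothing: all sends become visible
        pending.clear();
        inTransaction = false;
    }

    public void abortTransaction() {
        pending.clear();            // nothing from the transaction survives
        inTransaction = false;
    }

    public List<String> log() { return log; }

    public static void main(String[] args) {
        ToyTransactionalProducer p = new ToyTransactionalProducer();
        p.beginTransaction();
        p.send("a"); p.send("b");
        p.abortTransaction();       // aborted: log stays empty
        p.beginTransaction();
        p.send("c"); p.send("d");
        p.commitTransaction();      // committed: both appear together
        System.out.println(p.log()); // [c, d]
    }
}
```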

3.3 What are the actual application scenarios of transactions?

In a Kafka transaction, atomic operations fall into three categories according to the type of operation involved. The details are as follows:

  1. Only the Producer produces messages. This scenario needs transactions to make a batch of sends atomic;
  2. Consuming and producing coexist, as in the Consumer & Producer pattern. This is the most common model in Kafka projects and is the main scenario requiring transactions;
  3. Only the Consumer consumes messages. This has little practical meaning: it gives the same result as manually committing offsets, and this scenario is not the purpose of introducing transactions.

4. Summary

Kafka's idempotence and transactions are important features, especially where data loss and data duplication are concerned. The design principles behind Kafka's idempotence are easy to understand, and its transactions resemble database transactions, so experience with databases makes Kafka transactions easier to grasp.


Original source: blog.csdn.net/qq_21383435/article/details/108818553