Kafka message duplication: a headache at first encounter that you can solve right away

I. Introduction

Data duplication is actually quite common: any stage of the pipeline can introduce duplicate data.

Message consumption is usually configured with a certain number of retries to ride out network fluctuations; the side effect is that messages may be duplicated.

Let's sort out the scenarios in which messages get duplicated:

1. "Producer side:"  an exception occurs, and the basic remedy is  "retry" .

  • Scenario 1: The  leader partition is unavailable: a  LeaderNotAvailableException is thrown, and the producer waits for a new  leader partition to be elected.

  • Scenario 2: The  Broker hosting the  Controller goes down: a  NotControllerException is thrown, and the producer waits for  Controller re-election.

  • Scenario 3: Network problems (disconnection, network partition, packet loss, etc.): a  NetworkException is thrown, and the producer waits for the network to recover.

2. "Consumer side:"  a batch of records is  polled and processed, but the  offset is not committed before the machine goes down. After a restart, the same batch is  polled again, and reprocessing it produces duplicate messages.
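The consumer-side failure window can be simulated without a broker. In this sketch (plain Java, no Kafka client; all names are illustrative), a batch is processed but the "crash" happens before the offset commit, so the restarted consumer re-polls from the stale committed offset and reprocesses the first two records:

```java
import java.util.ArrayList;
import java.util.List;

public class AtLeastOnceDemo {
    // A tiny stand-in for a topic partition's log.
    static final List<String> LOG = List.of("m0", "m1", "m2", "m3");

    public static void main(String[] args) {
        List<String> processed = new ArrayList<>();
        long committedOffset = 0;

        // First run: poll and process a batch of 2, then "crash" before committing.
        for (long o = committedOffset; o < committedOffset + 2; o++) {
            processed.add(LOG.get((int) o)); // processed, but offset never committed
        }
        // committedOffset is still 0 -- the commit was lost in the crash.

        // After restart: poll again starting from the last committed offset.
        for (long o = committedOffset; o < LOG.size(); o++) {
            processed.add(LOG.get((int) o));
        }

        // m0 and m1 were processed twice.
        System.out.println(processed); // [m0, m1, m0, m1, m2, m3]
    }
}
```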

How to deal with it?

"Let's first review the three message delivery semantics:"

  • At most once: the message is delivered at most once; it may be lost, but it is never redelivered. Example:  QoS = 0 in  mqtt.

  • At least once: the message is delivered at least once; it is never lost, but it may be redelivered. Example:  QoS = 1 in  mqtt.

  • Exactly once: the message is delivered exactly once; it is neither lost nor redelivered. Example:  QoS = 2 in  mqtt.

With these three semantics in mind, let's look at how to eliminate message duplication, i.e. how to achieve exactly once. There are three approaches:

  1. Kafka idempotent  Producer: makes the producer's sends idempotent. Its limitation: it only covers a single partition within a single session (a restart starts a new session, so the guarantee is lost).

  2. Kafka transactions: guarantee atomic writes across partitions and sessions, overcoming the idempotent  Producer's limitations.

  3. Consumer-side idempotence: make the consumer's processing of messages idempotent. This is the fallback plan.

1) Kafka Idempotent Producer

"Idempotence means" : no matter how many times the same operation is performed, the result is the same. In other words, executing a command once or many times has the same effect.

"Example of enabling idempotence: just add the corresponding configuration on the producer side"

Properties props = new Properties();  
props.put("enable.idempotence", true); // 1. enable idempotence  
props.put("acks", "all"); // 2. when enable.idempotence is true, this defaults to all  
props.put("max.in.flight.requests.per.connection", 5); // 3. note: must be <= 5

1. Set  enable.idempotence to turn idempotence on.

2. Configure  acks. Note: it must be  acks=all, otherwise an exception is thrown.

3. Configure  max.in.flight.requests.per.connection to be  <= 5, otherwise an  OutOfOrderSequenceException is thrown.

  • 0.11 <= Kafka < 1.1:  max.in.flight.requests.per.connection = 1

  • Kafka >= 1.1:  max.in.flight.requests.per.connection <= 5

To understand these constraints better, you need to know Kafka's idempotence mechanism:

1. Each time the  Producer starts, it requests a globally unique  pid from the  Broker. (The  pid changes after a restart, which is one of the scheme's drawbacks.)

2. Sequence Number: for each  <Topic, Partition>, the producer keeps a  Sequence that increases monotonically from 0; the  Broker caches this seq num as well.

3. Duplicate detection: for each  <pid, seq num>, the  Broker checks the corresponding queue  ProducerStateEntry.Queue (default queue length 5) to see whether the number already exists:

  • If  nextSeq == lastSeq + 1, i.e. broker-side seq + 1 == producer's incoming seq, accept.

  • If  nextSeq == 0 && lastSeq == Int.MaxValue, i.e. the entry was just initialized, also accept.

  • Otherwise the message is either a duplicate or out of order (something was lost), and it is rejected.
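The accept/reject rule above can be sketched in a few lines of plain Java (names are illustrative; the real logic lives in Kafka's broker-side ProducerStateEntry):

```java
public class SeqCheckDemo {
    // Simplified sketch of the broker-side duplicate check: accept a message
    // only if its sequence number is exactly one past the last cached one.
    static boolean accept(int lastSeq, int nextSeq) {
        // Just-initialized entry: accept the first sequence number.
        if (nextSeq == 0 && lastSeq == Integer.MAX_VALUE) return true;
        // Normal case: accept only the next sequence number in order.
        return nextSeq == lastSeq + 1;
    }

    public static void main(String[] args) {
        System.out.println(accept(Integer.MAX_VALUE, 0)); // fresh entry -> true
        System.out.println(accept(4, 5));  // in order -> true
        System.out.println(accept(4, 4));  // duplicate (retry) -> false
        System.out.println(accept(4, 6));  // gap (lost message) -> false
    }
}
```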

This design addresses two issues:

  1. "Duplicate messages:"  the  Broker crashes after persisting a message but before sending the  ack; the  Producer retries, which duplicates the message.

  2. "Out-of-order messages:"  avoids the scenario where one message fails to send, the next one succeeds, and the first then succeeds on retry, leaving the two out of order.

When should you use idempotence?

  1. If you already use  acks=all, enabling idempotence is a natural fit.

  2. If you use  acks=0 or  acks=1, your system favors high performance over strong data consistency; don't use idempotence.

2) Kafka Transactions

Kafka transactions solve the idempotent producer's drawback: idempotence only within a single session and a single partition.

" Tips:"  this topic is long enough to deserve its own article, so it is only sketched here.

"Example of using transactions: there are a producer side and a consumer side"

Properties props = new Properties();  
props.put("enable.idempotence", true); // 1. enable idempotence  
props.put("acks", "all"); // 2. when enable.idempotence is true, this defaults to all  
props.put("max.in.flight.requests.per.connection", 5); // 3. max in-flight requests  
props.put("transactional.id", "my-transactional-id"); // 4. set the transactional id  
  
Producer<String, String> producer = new KafkaProducer<String, String>(props);  
  
// initialize the transaction  
producer.initTransactions();  
  
try {  
    // begin the transaction  
    producer.beginTransaction();  
  
    // send data  
    producer.send(new ProducerRecord<String, String>("Topic", "Key", "Value"));  
   
    // commit the transaction once the data (and any offsets) are sent successfully  
    producer.commitTransaction();  
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {  
    // abort the transaction if sending data or offsets fails  
    producer.abortTransaction();  
} finally {  
    // close the Producer  
    producer.close();  
}

On the consumer side, the  Consumer needs the  isolation.level parameter configured:

  • "read_uncommitted:"  the default. The  Consumer can read any written message, whether the transactional  Producer committed or aborted the transaction. If you use a transactional  Producer, the  Consumer should not use this value.

  • "read_committed:"  the  Consumer reads only messages written by successfully committed transactions of a transactional  Producer. It also sees all messages written by non-transactional  Producers.
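As a sketch, a consumer reading transactional output would add isolation.level to its configuration. The broker address, group id, and deserializers below are placeholder values:

```java
import java.util.Properties;

public class ReadCommittedConfigDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Only read messages from successfully committed transactions.
        props.put("isolation.level", "read_committed");
        System.out.println(props.getProperty("isolation.level"));
    }
}
```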

3) Consumer-side idempotence

"How do we solve message duplication?" is really just another way of asking: how do we achieve idempotence on the consumer side?

As long as the consumer is idempotent, the problem of repeated consumption of messages will be solved.

"A typical solution uses a local message table:"

  • After the consumer pulls a message, it opens a database transaction, inserts the message id into a local message table (with a unique key on the id), and updates the order data in the same transaction.

  • If the message is a duplicate, the  insert fails with a unique-key violation, which rolls back the whole transaction.
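The message-table idea can be sketched in plain Java, using an in-memory set to stand in for a database table with a unique key (in a real system the insert and the business update would share one database transaction; all names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class MessageTableDemo {
    // Stands in for a DB table with a UNIQUE constraint on the message id.
    private final Set<String> messageTable = new HashSet<>();
    private int ordersUpdated = 0;

    // Returns true if the message was processed, false if it was a duplicate.
    boolean consume(String messageId) {
        // Insert into the message table first; a duplicate id fails the
        // insert, which in a real DB would roll back the whole transaction.
        if (!messageTable.add(messageId)) {
            return false; // duplicate: skip the business update
        }
        ordersUpdated++; // business update runs at most once per message id
        return true;
    }

    public static void main(String[] args) {
        MessageTableDemo c = new MessageTableDemo();
        System.out.println(c.consume("msg-1")); // true: first delivery
        System.out.println(c.consume("msg-1")); // false: redelivered duplicate
        System.out.println(c.ordersUpdated);    // 1: order updated exactly once
    }
}
```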

II. Case: using the Kafka idempotent Producer

Environment construction can refer to: https://developer.confluent.io/tutorials/message-ordering/kafka.html#view-all-records-in-the-topic

"The preparations are as follows:"

1. Zookeeper: started locally with  Docker

$ docker run -d --name zookeeper -p 2181:2181 zookeeper  
a86dff3689b68f6af7eb3da5a21c2dba06e9623f3c961154a8bbbe3e9991dea4

2. Kafka: version  2.7.1, compiled and started from source (see the source build above)

3. Start the producer: use the  examples in the Kafka source tree

4. Produce test messages: you can use the console script  Kafka provides

# for example: change the topic to your own  
$ cd ./kafka-2.7.1-src/bin  
$ ./kafka-console-producer.sh --broker-list localhost:9092 --topic test_topic

Create the  topic: 1 replica, 2 partitions

$ ./kafka-topics.sh --bootstrap-server localhost:9092 --topic myTopic --create --replication-factor 1 --partitions 2  
  
# describe the topic  
$ ./kafka-topics.sh --bootstrap-server localhost:9092 --topic myTopic --describe

"Producer code:"

import java.io.IOException;  
import java.nio.file.Files;  
import java.nio.file.Paths;  
import java.util.List;  
import java.util.Properties;  
  
import org.apache.kafka.clients.producer.KafkaProducer;  
import org.apache.kafka.clients.producer.Producer;  
import org.apache.kafka.clients.producer.ProducerConfig;  
import org.apache.kafka.clients.producer.ProducerRecord;  
import org.apache.kafka.common.serialization.StringSerializer;  
  
public class KafkaProducerApplication {  
  
    private final Producer<String, String> producer;  
    final String outTopic;  
  
    public KafkaProducerApplication(final Producer<String, String> producer,  
                                    final String topic) {  
        this.producer = producer;  
        outTopic = topic;  
    }  
  
    public void produce(final String message) {  
        final String[] parts = message.split("-");  
        final String key, value;  
        if (parts.length > 1) {  
            key = parts[0];  
            value = parts[1];  
        } else {  
            key = null;  
            value = parts[0];  
        }  
        final ProducerRecord<String, String> producerRecord  
            = new ProducerRecord<>(outTopic, key, value);  
        producer.send(producerRecord,  
                (recordMetadata, e) -> {  
                    if(e != null) {  
                        e.printStackTrace();  
                    } else {  
                        System.out.println("key/value " + key + "/" + value + "\twritten to topic[partition] " + recordMetadata.topic() + "[" + recordMetadata.partition() + "] at offset " + recordMetadata.offset());  
                    }  
                }  
        );  
    }  
  
    public void shutdown() {  
        producer.close();  
    }  
  
    public static void main(String[] args) {  
  
        final Properties props = new Properties();  
  
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");  
        props.put(ProducerConfig.ACKS_CONFIG, "all");  
  
        props.put(ProducerConfig.CLIENT_ID_CONFIG, "myApp");  
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);  
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);  
  
        final String topic = "myTopic";  
        final Producer<String, String> producer = new KafkaProducer<>(props);  
        final KafkaProducerApplication producerApp = new KafkaProducerApplication(producer, topic);  
  
        String filePath = "/home/donald/Documents/Code/Source/kafka-2.7.1-src/examples/src/main/java/kafka/examples/input.txt";  
        try {  
            List<String> linesToProduce = Files.readAllLines(Paths.get(filePath));  
            linesToProduce.stream().filter(l -> !l.trim().isEmpty())  
                    .forEach(producerApp::produce);  
            System.out.println("Offsets and timestamps committed in batch from " + filePath);  
        } catch (IOException e) {  
            System.err.printf("Error reading file %s due to %s %n", filePath, e);  
        } finally {  
            producerApp.shutdown();  
        }  
    }  
}

"After starting the producer, the console prints each record's key/value together with its topic, partition, and offset."

"Start consumer:"

$ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic

Modifying the  acks configuration

With idempotence enabled, what happens at producer startup if we adjust  acks?

  • Set  acks = 1

  • Set  acks = 0

Either way, it fails immediately:

Exception in thread "main" org.apache.kafka.common.config.ConfigException: Must set acks to all in order to use the idempotent producer.  
Otherwise we cannot guarantee idempotence.

Modify the configuration max.in.flight.requests.per.connection

"With idempotence enabled, what happens if we adjust this configuration?"

What happens when  max.in.flight.requests.per.connection > 5?  

"Of course it will report an error:"

Caused by: org.apache.kafka.common.config.ConfigException: Must set max.in.flight.requests.per.connection to at most 5 to use the idempotent producer.

 


Origin blog.csdn.net/z_ssyy/article/details/131519119