Data duplication is actually quite common, and it can be introduced at any point along the pipeline. Message consumption is usually configured with a number of retries to ride out network fluctuations; the side effect is that messages may be duplicated.
Let's sort out the scenarios in which messages get repeated:
- Producer side: when an exception is encountered, the basic response is to retry.
  - Scenario 1: the leader partition is unavailable; a `LeaderNotAvailableException` is thrown, and the producer waits for a new leader partition to be elected.
  - Scenario 2: the `Broker` hosting the `Controller` goes down; a `NotControllerException` is thrown, and the producer waits for `Controller` re-election.
  - Scenario 3: network exceptions (disconnection, network partition, packet loss, etc.); a `NetworkException` is thrown, and the producer waits for the network to recover.
- Consumer side: the consumer `poll`s and processes a batch of data but has not yet committed the `offset` when the machine restarts. After the restart, the same batch is `poll`ed and consumed again, producing duplicate messages.
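The consumer-side scenario can be made concrete with a minimal, self-contained simulation (plain Java, no Kafka client; all names here are made up for illustration): a "consumer" processes a batch, crashes before committing the offset, and the restarted consumer re-reads the very same batch.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetReplayDemo {

    // Simulates KafkaConsumer.poll(): read up to `max` records starting at `offset`.
    static List<String> poll(List<String> log, int offset, int max) {
        List<String> batch = new ArrayList<>();
        for (int i = offset; i < Math.min(offset + max, log.size()); i++) {
            batch.add(log.get(i));
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> partition = List.of("m0", "m1", "m2", "m3", "m4");
        int committedOffset = 0; // last committed offset, as stored on the broker

        // First run: poll a batch of 3 from the committed offset and process it...
        List<String> firstBatch = poll(partition, committedOffset, 3);
        // ...but the process dies BEFORE committing, so committedOffset stays 0.

        // After restart: poll again from the (unchanged) committed offset.
        List<String> secondBatch = poll(partition, committedOffset, 3);

        // The same messages are processed twice.
        System.out.println(firstBatch.equals(secondBatch)); // prints "true"
    }
}
```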
How do we deal with this? First, let's review the three delivery semantics for messages:
- At most once (`at most once`): each message is delivered at most once; messages may be lost, but are never delivered repeatedly. Example: `QoS = 0` in `mqtt`.
- At least once (`at least once`): each message is delivered at least once; messages are never lost, but may be delivered repeatedly. Example: `QoS = 1` in `mqtt`.
- Exactly once (`exactly once`): each message is delivered exactly once; messages are neither lost nor delivered repeatedly. Example: `QoS = 2` in `mqtt`.
With these three semantics in mind, let's look at how to eliminate message duplication, i.e. how to achieve exactly-once delivery. There are three approaches:
- Kafka idempotent `Producer`: guarantees that the producer sends messages idempotently. Its limitation is that it only covers a single partition within a single session (a restart starts a new session).
- Kafka transactions: guarantee that the producer sends messages idempotently, and lift the limitations of the idempotent `Producer`.
- Consumer-side idempotence: guarantees that the consumer processes messages idempotently. This is the fallback plan.
1) Kafka idempotent Producer
Idempotency means that no matter how many times the same operation is performed, the result is the same; for a given command, executing it any number of times has the same effect as executing it once.
Example of enabling idempotence: just add the corresponding configuration on the producer side.
Properties props = new Properties();
props.put("enable.idempotence", true); // 1. enable idempotence
props.put("acks", "all"); // 2. when enable.idempotence is true, this defaults to all
props.put("max.in.flight.requests.per.connection", 5); // 3. note: must be <= 5
- `enable.idempotence` turns idempotence on.
- `acks`: must be set to `acks=all`, otherwise an exception is thrown.
- `max.in.flight.requests.per.connection`: must be `<= 5`, otherwise an `OutOfOrderSequenceException` is thrown.
  - `0.11 <= Kafka < 1.1`: `max.in.flight.requests.per.connection = 1`
  - `Kafka >= 1.1`: `max.in.flight.requests.per.connection <= 5`
To understand this better, you need to understand Kafka's idempotence mechanism:
- Each time the `Producer` starts, it requests a globally unique `pid` from the `Broker`. (The `pid` changes after a restart, which is one of the drawbacks.)
- `Sequence Number`: for each `<Topic, Partition>` pair there is a `Sequence` that increases monotonically from 0; the `Broker` caches this `seq num` as well.
- Duplicate detection: using `<pid, seq num>`, the `Broker` looks up the corresponding queue `ProducerStateEntry.Queue` (default queue length 5) and checks whether the entry exists:
  - If `nextSeq == lastSeq + 1`, i.e. the server-side seq + 1 equals the seq sent by the producer, accept it.
  - If `nextSeq == 0 && lastSeq == Int.MaxValue`, i.e. the state was just initialized, also accept it.
  - Otherwise the message is either a duplicate or out of order, and in both cases it is rejected.
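The acceptance check above can be sketched in plain Java. This is a simplified model of the broker-side logic for one `<pid, partition>`; the class and method names are made up for illustration, and the real `ProducerStateEntry` lives in the Kafka server code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SeqCheckDemo {
    static final int QUEUE_LEN = 5; // default length of ProducerStateEntry.Queue

    // Cached seq nums for one <pid, partition>, newest last.
    private final Deque<Integer> cachedSeqs = new ArrayDeque<>();

    // Returns true if the incoming seq is accepted, false if it is rejected
    // (either a duplicate or a gap, i.e. out of order).
    public boolean tryAppend(int incomingSeq) {
        Integer lastSeq = cachedSeqs.peekLast();
        boolean accept =
                (lastSeq == null)                                    // nothing cached yet
             || (incomingSeq == lastSeq + 1)                         // server seq + 1 == producer seq
             || (incomingSeq == 0 && lastSeq == Integer.MAX_VALUE);  // seq wrapped after overflow
        if (accept) {
            if (cachedSeqs.size() == QUEUE_LEN) {
                cachedSeqs.pollFirst(); // keep only the last QUEUE_LEN entries
            }
            cachedSeqs.addLast(incomingSeq);
        }
        return accept;
    }

    public static void main(String[] args) {
        SeqCheckDemo state = new SeqCheckDemo();
        System.out.println(state.tryAppend(0)); // true  (first message)
        System.out.println(state.tryAppend(1)); // true  (lastSeq + 1)
        System.out.println(state.tryAppend(1)); // false (duplicate, rejected)
        System.out.println(state.tryAppend(3)); // false (gap, out of order, rejected)
    }
}
```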
This design addresses two issues:
- Message duplication: the `Broker` crashes after saving a message but before sending the `ack`. The `Producer` then retries, which would otherwise duplicate the message.
- Message disorder: it avoids the scenario where one message fails to send, the next one succeeds, and the retry of the first one then succeeds, leaving the messages out of order.
When should you use idempotence?
- If you are already using `acks=all`, enabling idempotence is fine too.
- If you are using `acks=0` or `acks=1`, your system favors high performance over strict data consistency; don't enable idempotence.
2) Kafka transactions
Kafka transactions solve the shortcoming of the idempotent Producer: idempotence only within a single session and a single partition.
Tips: this topic is long, so we only touch on it here and will cover it in a separate article.
Example of using transactions, covering both the producer and the consumer side:
Properties props = new Properties();
props.put("enable.idempotence", true); // 1. enable idempotence
props.put("acks", "all"); // 2. when enable.idempotence is true, this defaults to all
props.put("max.in.flight.requests.per.connection", 5); // 3. at most 5 in-flight requests
props.put("transactional.id", "my-transactional-id"); // 4. set the transactional id
Producer<String, String> producer = new KafkaProducer<String, String>(props);
// initialize the transaction
producer.initTransactions();
try {
    // begin the transaction
    producer.beginTransaction();
    // send data
    producer.send(new ProducerRecord<String, String>("Topic", "Key", "Value"));
    // commit the transaction when both the data and the offsets were sent successfully
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // abort the transaction when sending the data or the offsets fails
    producer.abortTransaction();
} finally {
    // close the Producer and Consumer
    producer.close();
    consumer.close();
}
On the consumer side, the `isolation.level` parameter needs to be set:
- `read_uncommitted`: the default. The `Consumer` can read any message that has been written, regardless of whether the transaction was committed or aborted. If your `Producer` uses Kafka transactions, do not use this value on the `Consumer`.
- `read_committed`: the `Consumer` only reads messages written by transactional `Producer`s whose transactions committed successfully. Of course, it also sees all messages written by non-transactional `Producer`s.
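A minimal consumer configuration for reading only committed transactional messages might look like the sketch below (the bootstrap server and group id are placeholders; only the `isolation.level` line is the point here):

```java
import java.util.Properties;

public class ReadCommittedConsumerConfig {

    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "my-group");                // placeholder group id
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Only deliver messages from committed transactions
        // (plus everything written non-transactionally).
        props.put("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("isolation.level")); // prints "read_committed"
    }
}
```

Pass this `Properties` object to `new KafkaConsumer<>(props)` as usual; everything else about the consumption loop stays the same.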
3) Consumer-side idempotence
"How do we solve message duplication?" is really just another way of asking: how do we make the consumer side idempotent?
As long as the consumer is idempotent, repeated consumption of a message is harmless.
A typical solution is a message (deduplication) table:
- After the consumer pulls a message, it opens a transaction, inserts the message `Id` into a local message table, and updates the order information within the same transaction.
- If the message is a duplicate, the `insert` fails with a uniqueness violation, which triggers a transaction rollback.
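The message-table pattern can be sketched with an in-memory stand-in for the database. Here a `HashSet` plays the role of the unique index on the message `Id`, and an integer counter stands in for the order update; in a real system the insert and the business update would share one database transaction. All names are made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class DedupConsumerDemo {
    // Stands in for the local message table's unique index on the message Id.
    private final Set<String> messageTable = new HashSet<>();
    private int ordersUpdated = 0; // stands in for the business update

    // "Transaction": insert the Id first; if the insert fails (duplicate),
    // roll back, i.e. skip the business update entirely.
    public boolean consume(String messageId) {
        if (!messageTable.add(messageId)) {
            return false; // duplicate -> rollback, nothing updated
        }
        ordersUpdated++;  // business update, same "transaction" as the insert
        return true;
    }

    public int ordersUpdated() {
        return ordersUpdated;
    }

    public static void main(String[] args) {
        DedupConsumerDemo consumer = new DedupConsumerDemo();
        System.out.println(consumer.consume("msg-1")); // prints "true"
        System.out.println(consumer.consume("msg-1")); // prints "false" (duplicate)
        System.out.println(consumer.ordersUpdated());  // prints "1"
    }
}
```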
2. Case: using the Kafka idempotent Producer
Environment setup can follow: https://developer.confluent.io/tutorials/message-ordering/kafka.html#view-all-records-in-the-topic
The preparations are as follows:
1. `Zookeeper`: started locally with `Docker`
$ docker run -d --name zookeeper -p 2181:2181 zookeeper
a86dff3689b68f6af7eb3da5a21c2dba06e9623f3c961154a8bbbe3e9991dea4
2. `Kafka`: version `2.7.1`, built and started from source (see the source build above)
3. Start the producer: use the `examples` module in the Kafka source tree
4. Start a console producer: you can use the script Kafka provides
# for example: change the topic to your own
$ cd ./kafka-2.7.1-src/bin
$ ./kafka-console-producer.sh --broker-list localhost:9092 --topic test_topic
Create the topic: 1 replica, 2 partitions
$ ./kafka-topics.sh --bootstrap-server localhost:9092 --topic myTopic --create --replication-factor 1 --partitions 2
# describe it
$ ./kafka-topics.sh --bootstrap-server localhost:9092 --topic myTopic --describe
Producer code:
public class KafkaProducerApplication {
private final Producer<String, String> producer;
final String outTopic;
public KafkaProducerApplication(final Producer<String, String> producer,
final String topic) {
this.producer = producer;
outTopic = topic;
}
public void produce(final String message) {
final String[] parts = message.split("-");
final String key, value;
if (parts.length > 1) {
key = parts[0];
value = parts[1];
} else {
key = null;
value = parts[0];
}
final ProducerRecord<String, String> producerRecord
= new ProducerRecord<>(outTopic, key, value);
producer.send(producerRecord,
(recordMetadata, e) -> {
if(e != null) {
e.printStackTrace();
} else {
System.out.println("key/value " + key + "/" + value + "\twritten to topic[partition] " + recordMetadata.topic() + "[" + recordMetadata.partition() + "] at offset " + recordMetadata.offset());
}
}
);
}
public void shutdown() {
producer.close();
}
public static void main(String[] args) {
final Properties props = new Properties();
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.CLIENT_ID_CONFIG, "myApp");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
final String topic = "myTopic";
final Producer<String, String> producer = new KafkaProducer<>(props);
final KafkaProducerApplication producerApp = new KafkaProducerApplication(producer, topic);
String filePath = "/home/donald/Documents/Code/Source/kafka-2.7.1-src/examples/src/main/java/kafka/examples/input.txt";
try {
List<String> linesToProduce = Files.readAllLines(Paths.get(filePath));
linesToProduce.stream().filter(l -> !l.trim().isEmpty())
.forEach(producerApp::produce);
System.out.println("Offsets and timestamps committed in batch from " + filePath);
} catch (IOException e) {
System.err.printf("Error reading file %s due to %s %n", filePath, e);
} finally {
producerApp.shutdown();
}
}
}
After starting the producer, the console prints one line per record with its key/value and the topic, partition, and offset it was written to.
Start the consumer:
$ ./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic myTopic
Modify the acks configuration
With idempotence enabled, what happens after the producer starts if we adjust `acks`?
- Set `acks = 1`
- Set `acks = 0`
Either setting directly reports an error:
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Must set acks to all in order to use the idempotent producer. Otherwise we cannot guarantee idempotence.
Modify the max.in.flight.requests.per.connection configuration
With idempotence enabled, what happens if we set `max.in.flight.requests.per.connection > 5`?
It will, of course, report an error:
Caused by: org.apache.kafka.common.config.ConfigException: Must set max.in.flight.requests.per.connection to at most 5 to use the idempotent producer.