Detailed explanation of manual offset commits for Kafka consumers

Foreword

The previous article, Kafka uses Java to implement a data production and consumption demo, introduced how to use Kafka for simple data transmission. This article focuses on the consumer side of Kafka.

Application scenarios

In the consumer from the last article, we used automatic offset commits.
However, automatic commits are unsuitable in many scenarios, because the offset is committed as soon as the data is pulled from Kafka; if processing then fails, data is easily lost, which matters especially when transaction control is required.
In many cases we need to pull data from Kafka, process it, and only then commit. For example, after pulling the data and writing it to MySQL successfully, we manually commit the Kafka offset.

By the way, here is what the offset actually is.
offset: the position up to which each consumer group has consumed within a Kafka topic.
Simply put, each message corresponds to an offset. If the offset is committed after each consumption, the next consumption starts right after the committed position.
For example, if a topic holds 100 messages and I have consumed and committed 50 of them, then messages at offsets 0 through 49 are done (offsets start from 0), and the next consumption starts at offset 50.
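You can inspect a group's committed offsets on the server side with the consumer-groups tool. A rough example (the group name test-group is a placeholder, and the flags vary by Kafka version; older versions use --zookeeper instead of --bootstrap-server):

 kafka-consumer-groups.sh --bootstrap-server master:9092 --describe --group test-group

The output lists, per partition, the committed offset (CURRENT-OFFSET), the end of the log (LOG-END-OFFSET), and the resulting LAG.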

Test

Having said that, let's start testing manual commits.
First, use the Kafka producer program to send 100 test messages to the Kafka cluster.

The program output shows that the send succeeded. Let's also verify on the Kafka server using the console consumer; the command is as follows:

 kafka-console-consumer.sh  --zookeeper master:2181  --topic KAFKA_TEST2 --from-beginning


Note:
1. master is a hostname I mapped to an IP on Linux; you can use the IP directly instead.
2. Because Kafka runs as a cluster, you can also consume from other machines in the cluster.

You can see that all 100 messages were sent successfully.

After the messages are sent, we use the Kafka consumer to consume the data.

Because we are testing manual commits, set enable.auto.commit to false and limit each poll to at most 10 records:

props.put("enable.auto.commit", "false");
props.put("max.poll.records", 10);

With enable.auto.commit set to false, committing manually only requires adding this line after processing:

consumer.commitSync();
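To show where this call sits, here is a minimal sketch of the whole poll-and-commit loop, assuming the consumer was built with the properties above and subscribed to the test topic (poll(100) matches the old-style client API used here):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        // ... process the record here, e.g. write it to MySQL ...
        System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
    }
    consumer.commitSync(); // commit only after the whole batch was processed successfully
}

If processing throws before commitSync() is reached, nothing is committed and the batch can be consumed again.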

First, let's try consuming without committing, to test whether the data can be consumed repeatedly.
Right-click and run the main method to consume, without committing the offset.

After the consumption succeeds, stop the program, then run the main method again, still without committing the offset.

We never committed manually and did not change the consumer group name, yet the same messages were consumed again!

Next, start testing manual commits.

  1. Test purpose:
    1. Test whether manually committed offsets can be consumed again.
    2. Test whether uncommitted offsets can be consumed again.
  2. Test method: after 50 messages have been consumed, commit manually; leave the remaining 50 uncommitted.
  3. Expected result: manually committed offsets cannot be consumed again, while uncommitted ones can.

To achieve this, the test only needs the following code added:

if (list.size() == 50) {
    consumer.commitSync(); // commit manually once exactly 50 messages have been consumed
}
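For clarity, a sketch of how this condition might sit inside the poll loop (imports as in the sketch above); list is assumed to be a collection that accumulates every value consumed so far, mirroring the test code:

ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
    list.add(record.value());  // accumulate everything consumed so far
    System.out.println("consumed: " + record.value());
}
if (list.size() == 50) {
    consumer.commitSync();     // commit exactly once, after the first 50
}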

After changing the code, run the program. The test output is as follows:

At first glance there seems to be no problem with the uncommitted messages.
But shouldn't an uncommitted offset be consumed repeatedly until it is committed?
If it were re-consumed, messageNo would keep growing: only the first 50 offsets were committed manually and the remaining 50 never were, so the output should not stop at 100 messages but should keep printing indefinitely.

So why does the test result differ from what we expected?
Didn't we just show that uncommitted offsets can be consumed repeatedly?
The difference comes from how the consumer was started in each test.
In the earlier repeated-consumption test I actually did start-stop-start, so the local consumer was initialized twice.
In the test just now, the consumer was initialized only once.
Why does initializing only once matter?
Because the offset is actually tracked in two places: the Kafka server keeps one record, and the local consumer client keeps another. Committing tells the server how far the group has consumed, but it does not rewind the client's local position, so the running consumer simply keeps reading forward instead of re-consuming.

Simply put, if there are 10 messages and the offset is committed at the 5th, the server knows the group has consumed up to the 5th message, and any other consumer in the same group will start from the 6th. But the local consumer client does not change because of this: it keeps consuming forward and does not jump back to the 6th message, which is why we saw the result above.
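The two records can be observed directly from the client. A minimal sketch, assuming a single-partition topic:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("KAFKA_TEST2", 0);
long localPosition = consumer.position(tp);           // where THIS client instance reads next
OffsetAndMetadata committed = consumer.committed(tp); // what the server has recorded for the group
System.out.println("local position = " + localPosition
        + ", committed = " + (committed == null ? "none" : committed.offset()));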

But once the project is running, it will not restart by itself, so we can change our approach.
That is, if some condition is triggered and the offset is therefore not committed, we can close the current consumer and create a new one; the new consumer will then re-consume from the last committed offset. Of course, its configuration must be the same as before.

Then change the previous commit code as follows:

if (list.size() == 50) {
    consumer.commitSync(); // commit the first 50
} else if (list.size() > 50) {
    consumer.close();      // close the old consumer
    init();                // re-create a consumer with the same configuration
    list.clear();          // reset the test accumulators
    list2.clear();
}
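For reference, a minimal sketch of what the init() helper might look like; the actual implementation is in the linked GitHub project, so treat the broker address and group name here as assumptions. The key point is that the new consumer uses exactly the same configuration and group as before:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

private static KafkaConsumer<String, String> consumer;

private static void init() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "master:9092"); // assumed broker address
    props.put("group.id", "test-group");           // must match the previous consumer's group
    props.put("enable.auto.commit", "false");
    props.put("max.poll.records", 10);
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("KAFKA_TEST2"));
}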

Note: since this is just a test, the conditions are written very simply for clarity; adapt them to your actual situation.

Note:
1. Because 10 records are pulled per poll, the consumer is re-initialized once 60 records have been consumed; records 50-60 are then pulled again but not committed, so this does not affect the result.
2. The print condition was changed slightly to make the output easier to read; it does not affect the program logic.

The test results show what we set out to demonstrate: uncommitted offsets can be consumed again.
This approach generally suffices for most needs.
For example, when pulling data from Kafka into a database: if a batch is stored successfully, commit the offset; otherwise do not commit, and pull the batch again.
However, this approach still cannot fully guarantee data integrity, for example if the program crashes mid-run.
Therefore, another method is to manually specify the offset from which to fetch, and record the offset (for example in the database) only after the data has been processed successfully. That approach will have to wait until the next article!
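As a rough preview, a minimal sketch of that approach, where loadOffsetFromDb is a hypothetical helper that reads the last offset our application persisted:

import java.util.Collections;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("KAFKA_TEST2", 0);
consumer.assign(Collections.singletonList(tp)); // assign the partition manually instead of subscribe()
long storedOffset = loadOffsetFromDb(tp);       // hypothetical: offset persisted by our own code
consumer.seek(tp, storedOffset);                // resume exactly where our own records say we stopped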

I have put the project on GitHub; take a look if you are interested!
Address: https://github.com/xuwujing/kafka

This is the end of this article, thank you for reading!
