Kafka Source Code Analysis - Sequence 6 - Consumer - Analysis of Consumption Strategy

Starting with this article, we move on to the analysis of the Consumer. Like the Producer, the Consumer comes in an old Scala version and a new Java version; here we only analyze the new Java version.

Before analyzing, let's take a look at the basic usage of Consumer:
     import java.util.Arrays;
     import java.util.Properties;
     import org.apache.kafka.clients.consumer.ConsumerRecord;
     import org.apache.kafka.clients.consumer.ConsumerRecords;
     import org.apache.kafka.clients.consumer.KafkaConsumer;

     Properties props = new Properties();
     props.put("bootstrap.servers", "localhost:9092");
     props.put("group.id", "test");
     props.put("enable.auto.commit", "true");
     props.put("auto.commit.interval.ms", "1000");
     props.put("session.timeout.ms", "30000");
     props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
     props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
     KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

     consumer.subscribe(Arrays.asList("foo", "bar")); //Core function 1: subscribe topic
     while (true) {
         ConsumerRecords<String, String> records = consumer.poll(100); //Core function 2: long poll, pull back multiple messages at a time
         for (ConsumerRecord<String, String> record : records)
             System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
     }

The Consumer is not thread-safe

As we mentioned earlier, the Kafka Producer is thread-safe: multiple threads can share one producer instance. The Consumer, however, is not.

In almost every public method of KafkaConsumer, we see this pattern:
    public ConsumerRecords<K, V> poll(long timeout) {
        acquire(); // acquire/release is not a multi-threaded lock; on the contrary, it detects concurrent calls and throws an exception as soon as one is found
        ...
        release();
    }
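
A minimal sketch of what this guard means in practice, assuming a consumer built as in the demo above: if two threads call poll at the same time, one of them fails fast with a ConcurrentModificationException ("KafkaConsumer is not safe for multi-threaded access").

     // Sketch: two overlapping poll() calls on one KafkaConsumer instance.
     Thread other = new Thread(new Runnable() {
         public void run() {
             consumer.poll(100); // second concurrent caller
         }
     });
     other.start();
     consumer.poll(100); // if the two polls overlap, one of them throws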

Consumer Group – Load Balancing Mode vs. Pub/Sub Mode

Each consumer instance must be given a group.id at initialization. This group.id determines how messages are delivered when multiple consumers consume the same topic: shared among the consumers, or broadcast to all of them.

Assume multiple consumers subscribe to the same topic, and that this topic has multiple partitions.

Load balancing mode: if multiple consumers belong to the same group, the topic's partitions are distributed among these consumers, so each message is delivered to exactly one consumer in the group.

Pub/Sub mode: if the consumers belong to different groups, all messages of the topic are broadcast to every group.
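
To make the two modes concrete, here is a minimal sketch; the configuration is the same as in the demo above, and the topic name "foo" and the group names are placeholders:

     // Load balancing: both consumers use group.id = "groupA", so the
     // partitions of "foo" are split between them and each message is
     // delivered to only one of the two.
     props.put("group.id", "groupA");
     KafkaConsumer<String, String> c1 = new KafkaConsumer<>(props);
     KafkaConsumer<String, String> c2 = new KafkaConsumer<>(props);
     c1.subscribe(Arrays.asList("foo"));
     c2.subscribe(Arrays.asList("foo"));

     // Pub/Sub: a consumer in a different group receives every message
     // of "foo" as well, independently of groupA.
     props.put("group.id", "groupB");
     KafkaConsumer<String, String> c3 = new KafkaConsumer<>(props);
     c3.subscribe(Arrays.asList("foo"));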

Partitions: automatic assignment vs. manual assignment

In the load balancing mode above, we call the subscribe function and specify only the topic, not the partitions. The partitions are then automatically distributed among all consumers in the group.

The other way is to explicitly specify which partitions of which topics the consumer consumes, using the assign function.
    public void subscribe(List<String> topics) {
        subscribe(topics, new NoOpConsumerRebalanceListener());
    }

    public void assign(List<TopicPartition> partitions) {
        ...
    }

A key point is that these two modes are mutually exclusive: when subscribe is used, assign cannot be used, and vice versa.
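
A short sketch of the assign mode (topic name and partition numbers are placeholders): the consumer reads exactly the partitions it is given, and no group rebalancing takes place.

     TopicPartition p0 = new TopicPartition("foo", 0);
     TopicPartition p1 = new TopicPartition("foo", 1);
     consumer.assign(Arrays.asList(p0, p1)); // manual mode: fixed partitions
     // calling subscribe(...) on this same instance afterwards would violate
     // the mutual exclusion and throw an exception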

In the code, these two modes are stored in two different variables:
public class SubscriptionState {
    ...
    private final Set<String> subscription;           // corresponds to subscribe mode
    private final Set<TopicPartition> userAssignment; // corresponds to assign mode
}

Accordingly, subscribe and assign both check these variables, and if a violation of the mutual exclusion is found, an exception is thrown, as sketched below.
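
A simplified sketch, not the verbatim Kafka source, of the kind of check performed:

    import java.util.*;
    import org.apache.kafka.common.TopicPartition;

    class SubscriptionStateSketch {
        private final Set<String> subscription = new HashSet<>();
        private final Set<TopicPartition> userAssignment = new HashSet<>();

        void subscribe(List<String> topics) {
            if (!userAssignment.isEmpty()) // assign mode already active
                throw new IllegalStateException("Subscription to topics and manual partition assignment are mutually exclusive");
            subscription.addAll(topics);
        }

        void assign(List<TopicPartition> partitions) {
            if (!subscription.isEmpty()) // subscribe mode already active
                throw new IllegalStateException("Subscription to topics and manual partition assignment are mutually exclusive");
            userAssignment.addAll(partitions);
        }
    }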

Consumption confirmation – consume offset vs. committed offset

As we mentioned earlier, "consumption confirmation" is a problem that all message middleware must solve: after fetching a message and processing it, the consumer sends an ack (confirmation) back to the middleware.

This gives two consumption positions, i.e. two offset values: the consume offset, the position up to which messages have been fetched, and the committed offset, the position acknowledged after processing.

Obviously, in asynchronous mode, the committed offset lags behind the consume offset.

A key point here: if the consumer crashes and restarts, it resumes from the committed offset, not the consume offset. This means messages may be consumed repeatedly.
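
The client exposes both positions, so the lag can be observed directly; a small sketch, assuming the consumer currently owns partition 0 of topic "foo":

     TopicPartition tp = new TopicPartition("foo", 0);
     long consumeOffset = consumer.position(tp);           // next offset to be fetched
     OffsetAndMetadata committed = consumer.committed(tp); // last acked position, may be null
     System.out.println("consume offset = " + consumeOffset
             + ", committed offset = " + (committed == null ? "none" : committed.offset()));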

In the 0.9 client, there are 3 ack strategies:
Strategy 1: Automatic, periodic ack. That is the way shown in the demo above:
     props.put("enable.auto.commit", "true");
     props.put("auto.commit.interval.ms", "1000");

Strategy 2: consumer.commitSync() // manual synchronous ack: call commitSync once after each message (or batch) is processed

Strategy 3: consumer.commitAsync() // manual asynchronous ack (a sketch of strategies 2 and 3 follows)
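
A sketch of strategies 2 and 3 side by side; process(record) is a hypothetical placeholder for the business logic, and in practice you would pick one of the two commit styles:

     ConsumerRecords<String, String> records = consumer.poll(100);
     for (ConsumerRecord<String, String> record : records)
         process(record); // hypothetical placeholder

     // Strategy 2: blocks until the broker has acknowledged the offsets
     consumer.commitSync();

     // Strategy 3: returns immediately; the callback reports the result
     consumer.commitAsync(new OffsetCommitCallback() {
         public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception e) {
             if (e != null)
                 System.err.println("async commit failed: " + e);
         }
     });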

Exactly Once – save the offset by yourself

As we said earlier, Kafka only guarantees that messages are not lost, i.e. at least once; it does not guarantee that messages are not duplicated.

Duplicate sending: the client cannot solve this by itself; the server would have to deduplicate every message, which is too costly.

Duplicate consumption: with the commitSync() above, we can commit after each batch of messages is processed. Does that solve "duplicate consumption"? Consider the code below:
     final int minBatchSize = 200; // the batch size is arbitrary here
     List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
     while (true) {
         ConsumerRecords<String, String> records = consumer.poll(100);
         for (ConsumerRecord<String, String> record : records) {
             buffer.add(record);
         }
         if (buffer.size() >= minBatchSize) {
             insertIntoDb(buffer);  // process the batch and persist it to the db
             consumer.commitSync(); // send the ack synchronously
             buffer.clear();
         }
     }

The answer is no! insertIntoDb and commitSync cannot be performed atomically: if the data has been persisted but the consumer crashes before commitSync, the messages will still be consumed again after restart.

So what is the solution to this problem?

The answer is to save the committed offset yourself instead of relying on the Kafka cluster to store it, and to make message processing and offset saving one atomic operation.

Kafka's official documentation lists the following two scenarios for saving offsets yourself:
// Relational database, accessed through transactions: after the consumer crashes and restarts, messages will not be consumed repeatedly
If the results of the consumption are being stored in a relational database, storing the offset in the database as well can allow committing both the results and offset in a single transaction. Thus either the transaction will succeed and the offset will be updated based on what was consumed or the result will not be stored and the offset won't be updated.

// Search engine: store the offset together with the data inside the index itself
If the results are being stored in a local store it may be possible to store the offset there as well. For example a search index could be built by subscribing to a particular partition and storing both the offset and the indexed data together. If this is done in a way that is atomic, it is often possible to have it be the case that even if a crash occurs that causes unsync'd data to be lost, whatever is left has the corresponding offset stored as well. This means that in this case the indexing process that comes back having lost recent updates just resumes indexing from what it has ensuring that no updates are lost.

The documentation also says that to save offsets yourself you need the following steps (a sketch putting them together follows the list):
Configure enable.auto.commit=false // disable automatic ack
Use the offset provided with each ConsumerRecord to save your position. // every time you get a message, save the corresponding offset
On restart restore the position of the consumer using seek(TopicPartition, long). // on the next restart, use consumer.seek to locate the offset you saved and resume consuming from there
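
Putting the three steps together, a sketch of the pattern; loadOffsetFromDb and saveToDbInOneTransaction are hypothetical helpers standing in for your own storage layer:

     props.put("enable.auto.commit", "false");  // step 1: disable automatic ack
     KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
     TopicPartition tp = new TopicPartition("foo", 0);
     consumer.assign(Arrays.asList(tp));
     consumer.seek(tp, loadOffsetFromDb(tp));   // step 3: restore the saved position

     while (true) {
         ConsumerRecords<String, String> records = consumer.poll(100);
         for (ConsumerRecord<String, String> record : records) {
             // step 2: result and offset are written in ONE db transaction,
             // so processing and "ack" become atomic
             saveToDbInOneTransaction(record.value(), tp, record.offset() + 1);
         }
     }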

With the above approach, we achieve "Exactly Once" on the consumer side: messages are neither lost nor duplicated.

Looking at the producer and consumer together: with Exactly Once on the consumer side, combined with deduplication checks in the DB, even "duplicate sending" by the producer causes no problem.
