Handling duplicate Kafka consumption after adding multiple Flume instances

We use Flume to load data from Kafka into Hive.

With a single Flume instance, the loading throughput tops out at about 10 MB/s (each Kafka record is roughly 100 bytes), so we planned to start multiple Flume instances, all specifying the same consumer group name.
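
For reference, a minimal, hypothetical agent configuration for each instance might look like the sketch below; the agent name, source/channel names, topic, and broker addresses are placeholders (channel and sink definitions omitted), and the only point that matters here is that every instance uses the same kafka.consumer.group.id:

    # run the same configuration on every Flume instance
    agent.sources = kafka-src
    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.kafka.bootstrap.servers = broker1:9092,broker2:9092
    agent.sources.kafka-src.kafka.topics = example-topic
    # all instances join the same consumer group, so Kafka splits the partitions among them
    agent.sources.kafka-src.kafka.consumer.group.id = flume-to-hive
    agent.sources.kafka-src.channels = mem-ch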

We know that Kafka consumption is partition-based: a partition can be consumed by only one Flume instance in the group at a time. When the second Flume instance starts, Kafka revokes the partitions already assigned to the first instance and then redistributes them across the group (the complete protocol is described at https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Client-side+Assignment+Proposal).

On the other hand, to satisfy transactional semantics, Flume must wait until each Kafka record has actually been put into the Channel before confirming it to Kafka (by calling consumer.commitSync). During this revoke-and-reassign process (a rebalance), records that Flume has already received but not yet put into the Channel (and therefore not yet confirmed to Kafka) are regarded by Kafka as unconsumed; in particular, records on partitions that end up assigned to other Flume instances will be consumed again by their new owners, causing duplicate loading.
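
To make the discussion below easier to follow, here is a heavily simplified sketch of the batch loop in Flume 1.8's KafkaSource.doProcess (paraphrased from the source linked further down, with error handling, interceptors, and metrics elided). The names it, eventList, tpAndOffsetMetadata, consumer, and batchUUID are the ones used in the real code; toEvent is a stand-in for the event-building logic:

    // simplified paraphrase of one doProcess() call: one Flume transaction = one batch of records
    while (eventList.size() < batchUpperLimit && System.currentTimeMillis() < maxBatchEndTime) {
      if (it == null || !it.hasNext()) {
        // fetch more records from Kafka; nothing has been committed yet
        ConsumerRecords<String, byte[]> records =
            consumer.poll(Math.max(0, maxBatchEndTime - System.currentTimeMillis()));
        it = records.iterator();
        // <-- the rebalance handling discussed below is inserted here
      }
      if (!it.hasNext()) {
        break;                                    // batch window expired with no more records
      }
      ConsumerRecord<String, byte[]> message = it.next();
      eventList.add(toEvent(message));            // wrap the record as a Flume Event (stand-in)
      tpAndOffsetMetadata.put(                    // remember the offset to commit later
          new TopicPartition(message.topic(), message.partition()),
          new OffsetAndMetadata(message.offset() + 1, batchUUID));
    }
    // only now are the events handed to the Channel and the offsets confirmed to Kafka
    getChannelProcessor().processEventBatch(eventList);
    consumer.commitSync(tpAndOffsetMetadata);

Everything that has been polled but has not yet reached the final commitSync is exactly the data at risk of being delivered again to another instance after a rebalance.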

Kafka provides the ConsumerRebalanceListener interface so that consumers can be notified through the onPartitionsRevoked and onPartitionsAssigned callbacks. Flume's Kafka Source uses it to detect a rebalance; the relevant code is as follows:
 

    // this flag is set to true in a callback when some partitions are revoked.
    // If there are any records we commit them.
    if (rebalanceFlag.get()) {
      rebalanceFlag.set(false);
      break;
    }
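
The flag itself is set from the rebalance listener that the source registers when it subscribes to its topics. Roughly (paraphrased from the same KafkaSource.java, not a verbatim excerpt, assuming an AtomicBoolean shared with the read loop), the wiring looks like this:

    // registered when the source subscribes, so the callbacks fire inside consumer.poll()
    class SourceRebalanceListener implements ConsumerRebalanceListener {
      private final AtomicBoolean rebalanceFlag;

      SourceRebalanceListener(AtomicBoolean rebalanceFlag) {
        this.rebalanceFlag = rebalanceFlag;
      }

      @Override
      public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // partitions are being taken away from this instance: tell the read loop to stop
        rebalanceFlag.set(true);
      }

      @Override
      public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // nothing special to do when new partitions arrive
      }
    }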

This handling is very simple: it breaks out of the read loop right away so the situation can be dealt with (the events gathered so far are put into the Channel and confirmed to Kafka). Looking at the surrounding code (https://github.com/apache/flume/blob/flume-1.8/flume-ng-sources/flume-kafka-source/src/main/java/org/apache/flume/source/kafka/KafkaSource.java), two flaws are apparent:

1. Putting events into the Channel takes time (interceptors and the channel selector run for every event), so the commit to Kafka (via consumer.commitSync) may not complete before the partitions have already been assigned to other Flume instances.

2. Records that have already been fetched into the local records buffer (and its iterator it) but not yet added to eventList are not handled at all.

So we modified the Flume source code to remedy these two defects. The idea is to discard everything that has been read ahead but not yet put into the Channel, and to rewind the consumer to the last committed offsets, so that after the rebalance each record is loaded only by whichever instance owns its partition:
 

    // If there are any records we commit them.
    if (rebalanceFlag.get()) {
      rebalanceFlag.set(false);
      // invalidate remaining records
      it = null;
      // and drop processed events
      eventList.clear();
      tpAndOffsetMetadata.clear();
      // and seek to committed offsets
      for (Map.Entry<String, List<PartitionInfo>> es : consumer.listTopics().entrySet()) {
        for (PartitionInfo pi : es.getValue()) {
          TopicPartition tp = new TopicPartition(pi.topic(), pi.partition());
          try {
            OffsetAndMetadata oam = consumer.committed(tp);
            if (oam != null) {
              consumer.seek(tp, oam.offset());
            }
          } catch (Exception e) {
            // log.warn("ignore seeking exception, {}", e);
          }
        }
      }
      log.info("read-ahead records have been dropped.");
      break;
    }

For details, please refer to https://github.com/hejiang2000/flume/blob/hejiang-kafka-source/flume-ng-sources/flume-kafka-source/src/main/java/org/apache/flume/source/kafka/KafkaSource.java.

In our tests, this change completely fixes the duplicate consumption problem in Kafka Source.
 

Origin blog.csdn.net/qq_32445015/article/details/107523657