Kafka consumer: multithreading in the consumer client

KafkaProducer is thread-safe, but KafkaConsumer is not. KafkaConsumer defines an acquire() method that checks whether the current operation is being performed by a single thread; if another thread is detected operating on the consumer, a ConcurrentModificationException is thrown:

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access.

Every public method of KafkaConsumer calls acquire() before performing the method's actual work; the only exception is the wakeup() method.
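Because wakeup() is the one method that is safe to call from another thread, it is the standard way to break a consumer out of a blocking poll(); a minimal sketch (the shutdown-hook placement is illustrative, not from the original text):

// Called from any other thread: makes a blocking poll() throw
// WakeupException in the consumer thread, which can then close cleanly.
Runtime.getRuntime().addShutdownHook(new Thread(kafkaConsumer::wakeup));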

The acquire() method is defined as follows:

private final AtomicLong currentThread
    = new AtomicLong(NO_CURRENT_THREAD); // member variable of KafkaConsumer

private void acquire() {
    long threadId = Thread.currentThread().getId();
    if (threadId != currentThread.get() &&
            !currentThread.compareAndSet(NO_CURRENT_THREAD, threadId))
        throw new ConcurrentModificationException
                ("KafkaConsumer is not safe for multi-threaded access");
    refcount.incrementAndGet();
}

The acquire() method detects concurrent access not with an actual lock but with a lightweight check: it records the owning thread's ID and keeps a count, ensuring that only one thread operates on the consumer at a time. acquire() is paired with a release() method; together they behave like lock and unlock operations.

Both acquire() and release() are private methods, so we never need to call them explicitly in application code, but understanding the mechanism behind them helps us write correct and effective program logic.
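For completeness, the paired release() method is essentially the mirror image of acquire(); a sketch consistent with the definitions above, matching the Kafka client source:

private void release() {
    if (refcount.decrementAndGet() == 0)
        currentThread.set(NO_CURRENT_THREAD); // no nested calls remain; free the owner slot
}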

The purpose of consuming messages with multiple threads is to improve overall consumption throughput. There are several possible multi-threaded implementations. The first and most common is thread confinement: each thread instantiates its own KafkaConsumer object.

One thread corresponds to one KafkaConsumer instance; we can call it a consumer thread. A consumer thread can consume the messages of one or more partitions, and all the consumer threads belong to the same consumer group. The concurrency of this implementation is limited by the actual number of partitions: when the number of consumer threads exceeds the number of partitions, the surplus threads sit permanently idle.

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FirstMultiConsumerThreadDemo {
    public static final String brokerList = "localhost:9092";
    public static final String topic = "topic-demo";
    public static final String groupId = "group.demo";

    public static Properties initConfig(){
        Properties props = new Properties();
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                StringDeserializer.class.getName());
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        return props;
    }

    public static void main(String[] args) {
        Properties props = initConfig();
        int consumerThreadNum = 4;
        for(int i=0;i<consumerThreadNum;i++) {
            new KafkaConsumerThread(props,topic).start();
        }
    }

    public static class KafkaConsumerThread extends Thread{
        private KafkaConsumer<String, String> kafkaConsumer;

        public KafkaConsumerThread(Properties props, String topic) {
            this.kafkaConsumer = new KafkaConsumer<>(props);
            this.kafkaConsumer.subscribe(Arrays.asList(topic));
        }

        @Override
        public void run(){
            try {
                while (true) {
                    ConsumerRecords<String, String> records =
                            kafkaConsumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        // process the messages    ①
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                kafkaConsumer.close();
            }
        }
    }
}

The inner class KafkaConsumerThread represents a consumer thread, each wrapping its own KafkaConsumer instance. The outer main() method starts several consumer threads, the number being specified by the consumerThreadNum variable. The number of partitions of a topic is usually known in advance, and consumerThreadNum can be set to a value no greater than it. If the partition count is not known, it can be obtained indirectly through KafkaConsumer's partitionsFor() method, and a reasonable consumerThreadNum can then be chosen, as sketched below.
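A minimal sketch of deriving the thread count from partitionsFor() (the helper method and the temporary metadata-only consumer are illustrative, not part of the original listing):

// Cap the desired thread count at the topic's partition count.
public static int decideThreadNum(Properties props, String topic, int desired) {
    try (KafkaConsumer<String, String> metadataConsumer =
                 new KafkaConsumer<>(props)) {
        int partitions = metadataConsumer.partitionsFor(topic).size();
        return Math.min(desired, partitions);
    }
}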

This multi-threaded implementation is not essentially different from running multiple consumer processes. Its advantage is that each thread consumes the messages of each of its partitions in order. Its drawback is equally obvious: every consumer thread must maintain its own TCP connection, so if both the partition count and consumerThreadNum are large, the system overhead is considerable.

If messages are processed very quickly, poll() is called more frequently and overall consumption performance rises; conversely, if message processing is slow, for example because it performs a transactional operation or waits synchronously for an RPC response, the poll() frequency falls and overall consumption performance falls with it. In general, poll() pulls messages very quickly; the bottleneck of overall consumption usually lies in processing the messages, and if we can improve that part, we can lift the overall consumption performance.

In the second approach, multiple consumer threads consume the same partition, implemented through the assign() and seek() methods (a fragment illustrating the idea follows). This breaks the original limit that the number of useful consumer threads cannot exceed the number of partitions, further improving consumption capacity. In this implementation, however, controlling processing order and offset commits becomes very complicated, so it sees little practical use.
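A minimal fragment of this idea (partitionId and startOffset are illustrative values; the original provides no listing for this approach):

// Each consumer thread targets the same partition with assign() instead of
// subscribe(), then positions itself with seek(); coordinating offsets and
// commits across such threads is the hard part noted above.
TopicPartition tp = new TopicPartition(topic, partitionId);
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.assign(Collections.singletonList(tp));
consumer.seek(tp, startOffset);
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));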

The third implementation turns the message-processing module itself into a multi-threaded component:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThirdMultiConsumerThreadDemo {
    public static final String brokerList = "localhost:9092";
    public static final String topic = "topic-demo";
    public static final String groupId = "group.demo";

    // initConfig() omitted; see Listing 14-1 (the first example above)
    public static void main(String[] args) {
        Properties props = initConfig();
        KafkaConsumerThread consumerThread = 
                new KafkaConsumerThread(props, topic,
                Runtime.getRuntime().availableProcessors());
        consumerThread.start();
    }

    public static class KafkaConsumerThread extends Thread {
        private KafkaConsumer<String, String> kafkaConsumer;
        private ExecutorService executorService;
        private int threadNumber;

        public KafkaConsumerThread(Properties props, 
                String topic, int threadNumber) {
            kafkaConsumer = new KafkaConsumer<>(props);
            kafkaConsumer.subscribe(Collections.singletonList(topic));
            this.threadNumber = threadNumber;
            executorService = new ThreadPoolExecutor(threadNumber, threadNumber,
                    0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1000),
                    new ThreadPoolExecutor.CallerRunsPolicy());
        }

        @Override
        public void run() {
            try {
                while (true) {
                    ConsumerRecords<String, String> records =
                            kafkaConsumer.poll(Duration.ofMillis(100));
                    if (!records.isEmpty()) {
                        executorService.submit(new RecordsHandler(records));
                    }    // ①
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                kafkaConsumer.close();
            }
        }

    }

    public static class RecordsHandler implements Runnable {
        public final ConsumerRecords<String, String> records;

        public RecordsHandler(ConsumerRecords<String, String> records) {
            this.records = records;
        }

        @Override
        public void run(){
            // process the records
        }
    }
}

The RecordsHandler class is used to process messages, and the KafkaConsumerThread class corresponds to a consumer thread that hands each batch of messages to RecordsHandler instances through a thread pool. Note that the last parameter of the ThreadPoolExecutor in KafkaConsumerThread is set to CallerRunsPolicy(): when the thread pool's processing capacity cannot keep up with the rate at which poll() pulls messages, the submitting thread runs the task itself instead of throwing a rejection exception. This third approach can also scale horizontally: starting multiple KafkaConsumerThread instances further increases overall consumption capacity.

Compared with the first implementation, this third approach not only scales better but also reduces the system resources consumed by TCP connections; its drawback is that preserving the processing order of messages becomes harder.

With the first implementation, a specific offset commit can be performed directly in the run() method of KafkaConsumerThread. With the third implementation, a shared variable, offsets, is introduced to participate in the commit:

After each RecordsHandler instance finishes processing its messages, it records the consumed offsets into the shared offsets map; after every poll(), KafkaConsumerThread reads the contents of offsets and commits them. Note that reads and writes of offsets must be guarded by a lock to prevent concurrency problems, and when writing to offsets we must take care not to overwrite a larger committed position with a smaller one. To this end, the run() method of the RecordsHandler class is implemented as follows:

// offsets is assumed to be a shared Map<TopicPartition, OffsetAndMetadata>
// visible to both KafkaConsumerThread and the RecordsHandler instances
for (TopicPartition tp : records.partitions()) {
    List<ConsumerRecord<String, String>> tpRecords = records.records(tp);
    // process tpRecords
    long lastConsumedOffset = tpRecords.get(tpRecords.size() - 1).offset();
    synchronized (offsets) {
        if (!offsets.containsKey(tp)) {
            offsets.put(tp, new OffsetAndMetadata(lastConsumedOffset + 1));
        } else {
            long position = offsets.get(tp).offset();
            if (position < lastConsumedOffset + 1) {
                // only ever move the position forward, never backward
                offsets.put(tp, new OffsetAndMetadata(lastConsumedOffset + 1));
            }
        }
    }
}

The corresponding offset commit is added to the KafkaConsumerThread class at the position marked ①:

synchronized (offsets) {
    if (!offsets.isEmpty()) {
        kafkaConsumer.commitSync(offsets);
        offsets.clear();
    }
}

This way of committing offsets carries a risk of message loss. Consider messages in the same partition: suppose one handler thread, RecordsHandler1, is still processing the messages at offsets 0 through 99 while another handler thread, RecordsHandler2, has already finished the messages at offsets 100 through 199 and committed its position. If RecordsHandler1 then fails with an exception, subsequent consumption can only resume from offset 200, and the messages at offsets 0 through 99 can never be consumed again: they are effectively lost. The code above guards against offset overwriting in the normal case, but it does not solve the overwriting problem in the failure case.

Solving this calls for a more elaborate mechanism. One solution, shown in the figure below, is structured around a sliding window. In the third implementation the consumer pulls batches of messages and hands them to multiple threads for processing; in the sliding-window variant the pulled messages are instead staged in a buffer from which multiple processing threads take messages, and the size of that buffer is the size of the sliding window. Overall little else changes; the difference lies in how consumption offsets are controlled.

[Figure: sliding-window consumption. Each square is one batch of messages; startOffset marks the start of the current window and endOffset its end.]

As the figure shows, each square represents one batch of messages and a sliding window contains several squares; startOffset marks the start of the current window and endOffset marks its end. Whenever the messages in the square pointed to by startOffset have all been consumed, their offsets can be committed; at the same time the window slides forward one square, the messages of the old startOffset square are evicted, and newly pulled messages enter the window. Because the window size is fixed, the buffer used to stage messages has a fixed size as well, so this memory overhead is under control.
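A minimal sketch of the offset side of this design (all names are illustrative; the text describes the idea but gives no reference implementation): batches are tracked by their starting offset, and startOffset advances, and thus commits, only over completed batches at the head of the window.

import java.util.HashSet;
import java.util.Set;

// Tracks which squares (batches) are done and slides startOffset forward
// over every contiguous completed batch at the head of the window.
public class SlidingWindowOffsets {
    private final long batchSize;                        // messages per square
    private long startOffset;                            // head of the window
    private final Set<Long> completed = new HashSet<>(); // finished batch starts

    public SlidingWindowOffsets(long startOffset, long batchSize) {
        this.startOffset = startOffset;
        this.batchSize = batchSize;
    }

    // Called by a processing thread when its batch is fully processed;
    // the returned value is the next offset that is safe to commit.
    public synchronized long markCompleted(long batchStartOffset) {
        completed.add(batchStartOffset);
        while (completed.remove(startOffset)) {
            startOffset += batchSize;
        }
        return startOffset;
    }
}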

The square size and the window size together determine the number of concurrent processing threads: one square corresponds to one processing thread. With a fixed window size, smaller squares give higher parallelism; with a fixed square size, a larger window gives higher parallelism. If the window is set too large, however, it not only increases memory overhead but also causes a large amount of duplicate consumption after a failure (such as a crash), and the cost of thread switching must be considered as well. It is advisable to pick a sensible value for the situation at hand; for both the squares and the window, too large and too small are both inappropriate.

If the messages in a square can never be marked as fully consumed, startOffset stalls. To let the window keep sliding, set a threshold: once startOffset has stalled for a certain time, retry consuming those messages locally; if the retry fails, move them to a retry queue; and if that still does not work, move them to a dead-letter queue. In real applications, messages that genuinely cannot be consumed are rare and usually stem from the business logic, for example a message whose content format does not match what the processing code expects, so the message cannot be handled at all; such cases can be avoided by improving the code or adopting a discard policy. If high reliability is required, messages that cannot pass through the business logic (these can be called dead letters) can instead be stored on disk, in a database, or in Kafka, and consumption then continues with the next message so that overall progress keeps advancing; a separate task can later analyze the dead letters to find the cause of the failures.
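As an illustration of the dead-letter idea (the topic name, the deadLetterProducer, and the handle() method are hypothetical, not from the original text), a processing thread might forward an unprocessable record and move on:

// On a business-logic failure, forward the record to a separate dead-letter
// topic so overall consumption keeps advancing; a later task can inspect
// that topic to diagnose the failures.
try {
    handle(record); // hypothetical business logic
} catch (Exception e) {
    deadLetterProducer.send(new ProducerRecord<>("topic-demo.DLT",
            record.key(), record.value()));
}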

Origin: www.cnblogs.com/luckyhui28/p/12003594.html