Kafka basics study notes arrangement


This article is a compilation of Kafka basic study notes:

This article is organized as Kafka basic study notes, mainly including advanced knowledge points of Kafka and SpringBoot integration of Kafka.

Most of the pictures in this article are personal hand-painted supplements, and the drawing software uses: draw.io

Complete code Demo project warehouse link: https://gitee.com/DaHuYuXiXi/kafak-demo

The pre-knowledge of this article refers to the introduction to kafka: the study notes of the introduction to Kafka


producer

Data sending process

The data sending process of Kafka producer client is divided into three stages:

insert image description here

  1. The main thread calls KafkaProducer to send data, the data is not sent to the kafka broker server, but buffered first
  2. The asynchronous thread sender is responsible for sending the buffered data to the kafka broker server
  • Using buffering can avoid server-side pressure caused by high concurrent requests, and buffering can also be used to send data in batches.
  • The asynchronous sender thread is responsible for sending data, which avoids the main thread sending data blocking, causing delays in core business responses.

Let's take a look at this process from the perspective of source code:

  • Let's take a look at the constructor of KafkaProducer first, keeping only the parts related to data transmission
    KafkaProducer(ProducerConfig config,
                  Serializer<K> keySerializer,
                  Serializer<V> valueSerializer,
                  ProducerMetadata metadata,
                  KafkaClient kafkaClient,
                  ProducerInterceptors<K, V> interceptors,
                  Time time) {
    
    
            ...
            //记录累加器
            this.accumulator = new RecordAccumulator(logContext,
                    config.getInt(ProducerConfig.BATCH_SIZE_CONFIG),
                    this.compressionType,
                    lingerMs(config),
                    retryBackoffMs,
                    deliveryTimeoutMs,
                    metrics,
                    PRODUCER_METRIC_GROUP_NAME,
                    time,
                    apiVersions,
                    transactionManager,
                    new BufferPool(this.totalMemorySize, config.getInt(ProducerConfig.BATCH_SIZE_CONFIG), metrics, time, PRODUCER_METRIC_GROUP_NAME));
            ....
            //数据发送线程        
            this.sender = newSender(logContext, kafkaClient, this.metadata);
            String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
            this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
            this.ioThread.start();
            ....
    }
  • The record accumulator RecordAccumulator, that is, the data produced by the producer KafkaProducer is not directly sent to the kafka broker, but is accumulated in batches and put into the RecordAccumulator first, and then sent to the kafka broker in batches.
  • Data sending is done by a separate sender thread, and one KafkaProducer corresponds to one Sender thread.
  • The record accumulator is used as the data buffer of KafkaProducer, and the specific structure is as follows
public class RecordAccumulator {
    
    
    ...
    private final ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches;
  • Batches are data buffers, and maintain a Deque double-ended queue for TopicPartition (a partition of a topic).
  • The data type stored in this double-ended queue is ProducerBatch, which represents a batch of data produced by the producer.
  • The append method of the RecordAccumulator class is used to add data to the data buffer
    public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Header[] headers,
                                     Callback callback,
                                     long maxTimeToBlock,
                                     boolean abortOnNewBatch,
                                     long nowMs) throws InterruptedException {
    
                            
        ...
        try {
    
    
            //获取已有的缓冲区,或者创建新的缓冲区(Deque)
            Deque<ProducerBatch> dq = getOrCreateDeque(tp);
            //锁住该缓冲区,避免用户异步编程操作导致数据发送数据顺序错乱的问题
            synchronized (dq) {
    
    
                if (closed)
                    throw new KafkaException("Producer closed while send in progress");
                  //tryAppend方法将一条消息数据的时间戳、key、value、
                 //header等信息追加到缓冲区中(Deque<ProducerBatch> )
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
                if (appendResult != null)
                    return appendResult;
            }
            ...
            synchronized (dq) {
    
    
                ...
                //再次尝试,查看是否能将消息成功追加到Deque的某个批次中
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
                if (appendResult != null) {
    
    
                    return appendResult;
                }
                //如果追加失败,那么创建一个新的批次,加入Deque尾部
                ...
                ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
                ...
                dq.addLast(batch);
                ...
                return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
            }
    }

Detailed analysis of tryAppend method:

  • When this method is called, it will check whether the given message can be added to the last ProducerBatch of the current Deque, if so, add the message to the batch, and return a FutureRecordMetadata object representing the metadata of the message;
  • 添加失败有两种情况: 当前Deque为空,或者当前批次已满
  • Otherwise, it will close the previous ProducerBatch and return null, and then create a new batch and stuff it at the end of the Deque.
  • Normally, this method returns a RecordAppendResult object that contains information about whether the record was written to disk, the partition assignment, and whether repartitioning is required.
  • 在Kafka Producer中,每个ProducerBatch都对应一个Broker分区,该方法的作用是向ProducerBatch批次中尝试添加一条消息,如果该批次已满或无法再分配分区,则会创建一个新的ProducerBatch,并将消息添加到其中。通过使用一个生产者批次来批量发送多条消息,可以提高消息发送的效率和吞吐量,并减少网络IO的消耗。
  • What information is contained in a message data produced by the producer
public ProducerRecord(String topic, Integer partition, Long timestamp, K key, V value, Iterable<Header> headers){
    
    
         ...
        //消息属于哪个主题 
        this.topic = topic;
        //消息属于哪个分区
        this.partition = partition;
        //消息的key
        this.key = key;
        //消息的value
        this.value = value;
        //消息的时间戳
        this.timestamp = timestamp;
        //消息的消息头
        this.headers = new RecordHeaders(headers);
    }
  • Make a summary of the above content, as shown in the following figure:
    insert image description here
  • RecordAccumulator as a data buffer, including several Deque double-ended queues
  • For a partition of a topic, maintain a Deque double-ended queue on the kafka producer client
  • Put several batches of ProducerBatch data into each queue Deque
  • Each batch ProducerBatch contains several data records ProducerRecord
  • Data with the same key will be sent to the same partition of the topic.

Notice:

  • Before the data is sent to the server, KafkaProducer will classify the data, buffer them in batches, and then send the data to the Kafka server asynchronously by a separate thread, thereby improving the efficiency of data sending.
  • The value of the message is classified according to topic, partition, key, and delivered in the order of timestamp

Bulk and scheduled sending

KafkaProducer will put the message into the buffer first, and then send it to the broker server asynchronously by a separate sender thread. Since the message is sent in batches, what is the condition to trigger batch sending?

  • batch.size: When the amount of buffered data to be sent to a partition exceeds the threshold set by batch.size, a batch send will be triggered, and all the data in the Deque queue will be sent to the Broker server at one time. The default value of batch.size The value is 16KB.

insert image description here

  • linger.ms: If the buffer has not reached the sending standard, when the time exceeds the value set by linger.ms, the data will also be sent. This is mainly because if the batch.size is set relatively large, in some non- The amount of data generated during the active time is relatively small, and it has not reached the threshold of batch.size, so the message will always stay in the buffer.

Notice:

  • As long as one of the above two conditions is met, the data will be sent to the server
  • The default value of linger.ms is 0, that is, data is sent when there is data, but since the sender is single-threaded, the producer has many buffer queues Deque, so the data in the buffer needs to wait for the sender thread to be idle before being sent.
  • The default value of linger.ms is 0. When the amount of data is large, the sender thread is very busy, and the buffer mechanism can guarantee the throughput; when the amount of data is small, the sender thread processes quickly, which can ensure low message delay.

buffer size

buffer.memory : Used to constrain the size of the memory buffer that Kafka Producer can use, the default is 32MB.

insert image description here

Notice:

  • The buffer.memory parameter is very important, especially when your kafka cluster has many topics and partitions, and the corresponding producer partition buffer queues are also very large.
  • If the buffer.memory setting is too small, the message may be quickly written into the memory buffer, but the Sender thread has no time to send the message to the Kafka server. It will cause the memory buffer to be filled quickly, and once it is filled, the user thread will be blocked and the message will not be written to Kafka.
  • The buffer.memory must be greater than the batch.size, otherwise an error of insufficient memory will be reported, do not exceed the physical memory, and adjust according to the actual situation.

sendSend a message

The producer data sending process is as follows:
insert image description here

  • The send method is used as the entry for sending messages
    @Override
    public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
    
    
        //调用拦截器对record进行预处理,该方法不会抛出异常
        ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);
        //发送消息
        return doSend(interceptedRecord, callback);
    }
  • doSend starts with do, it can be seen that it is the real entry point for message sending
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
    
    
        TopicPartition tp = null;
        try {
    
    
            //1.检测生产者是否已经关闭
            throwIfProducerClosed();
            //2、检查正要将数据发往的主题在kafka集群中的包含哪些分区
            //获取集群中一些元数据信息
            long nowMs = time.milliseconds();
            ClusterAndWaitTime clusterAndWaitTime;
            try {
    
    
                clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), nowMs, maxBlockTimeMs);
            } catch (KafkaException e) {
    
    
                ...
            }
            nowMs += clusterAndWaitTime.waitedOnMetadataMs;
            long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
            Cluster cluster = clusterAndWaitTime.cluster;
            byte[] serializedKey;
            try {
    
    
                //3.对消息的key进行序列化
                serializedKey = keySerializer.serialize(record.topic(), record.headers(), record.key());
            } catch (ClassCastException cce) {
    
    
                ...
            }
            byte[] serializedValue;
            try {
    
    
                //4.对消息的value进行序列化
                serializedValue = valueSerializer.serialize(record.topic(), record.headers(), record.value());
            } catch (ClassCastException cce) {
    
    
                ...
            }
            //5.分区器进行计算,决定此条消息发送到哪个分区
            int partition = partition(record, serializedKey, serializedValue, cluster);
            tp = new TopicPartition(record.topic(), partition);
            ...
            //6.预估消息发送消息的大小,内容包括key,value以及消息头
            int serializedSize = AbstractRecords.estimateSizeInBytesUpperBound(apiVersions.maxUsableProduceMagic(),
                    compressionType, serializedKey, serializedValue, headers);
            //7.检查发送消息的大小是否超过阈值
            ensureValidRecordSize(serializedSize);
            ...
            // 拦截器回调函数
            Callback interceptCallback = new InterceptorCallback<>(callback, this.interceptors, tp);
            //8.将消息添加到消息累加器(缓冲区)   
            RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey,
                    serializedValue, headers, interceptCallback, remainingWaitMs, true, nowMs);
            ...
            //9.如果添加进缓冲队列已经满了,或者是首次创建的,那么幻想sender线程进行数据发送 
            if (result.batchIsFull || result.newBatchCreated) {
    
    
                log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
                this.sender.wakeup();
            }
            //10.返回future对象
            return result.future;
        } 
        ...
    }

Summary of the dosend method (the above source code omits a lot of content, you can check it in conjunction with the specific source code):

  • Inside the dosend method, first check whether the Producer is closed, and then call the waitOnMetadata method to wait for the metadata to be available.
  • Next, partitions are calculated based on the recorded key-value pairs and cluster information, and messages are added to the buffer using the RecordAccumulator class.
  • If the buffer is full or a new batch of partitions needs to be created, the Sender thread is woken up and the backlog of message batches are sent to the Kafka Broker.
  • If an exception occurs during the sending process, the corresponding interceptor will be called and a FutureFailure object will be returned to indicate the failure result.
  • Otherwise, return a Future<RecordMetadata> object representing the successful result.
  • Currently, this method also contains logic for handling API exceptions and logging errors.

In general, this method implements the core logic of Kafka Producer sending messages, including obtaining metadata, calculating partitions, adding messages to buffers, handling exceptions and recording errors, etc. At the same time, it also supports the interceptor mechanism, allowing developers to customize the processing behavior of messages.


message reliability

To ensure the reliability of the production and consumption process of messages, Kafka needs to take care of the Broker server, producer client, and consumer client. Only when these three aspects ensure reliability can messages not be repeated and not lost.

This section stands on the producer client side to talk about how to ensure the reliability of the message. Kafka provides some producer configuration parameters to ensure:

  • message is not lost
  • Messages are not repeated

Release Confirmation Mechanism

  • The relevant parameters are as follows:
#新版本中
acks=all
#在一些比较旧的apache kafka老版本中,参数名称如下
request.required.acks=all

The ack parameter determines how the message is confirmed after the producer sends the message:

  • acks=0: After the producer writes the message into the buffer, the message is considered to be sent successfully
  • acks=1: As long as the message is successfully received by the leader replica of the corresponding partition, the message is considered to be sent successfully.

Since the kafka producer only communicates with the leader partition copy, the follower copy is responsible for replica synchronization from the leader partition. Therefore, if for some reason the partition copy of the current topic is re-elected as a leader, if the previous leader goes down after the election is completed, resulting in messages not being copied to the current leader, data loss will result.

  • acks=all or acks= -1 : All copies in the ISR are successfully written, and the leader needs to wait for all copies in the ISR set to complete message synchronization before confirming the message to the producer. This method is the most reliable and efficient It is also the lowest.

retry mechanism

  • The relevant parameters are as follows:
retries=Integer.MAX_VALUE
retry.backoff.ms=100
delivery.timeout.ms=120000

Notice:

  • When the producer sends a message, RetriableExceptionit will retry, that is, resend the message. retriesThe maximum number of retries allowed is configured; retry.backoff.msthe time interval between 2 retries is configured, in milliseconds; delivery.timeout.msthe timeout period for the message to be sent is configured, and no retry will be made after this time, and retriesthe parameters will be invalid.

Notice:

  • delivery.timeout.ms is a parameter in the Kafka producer configuration, which is used to specify the maximum waiting time for sending messages. Specifically, it defines an upper limit on how long the client waits for a response from the server when sending a message. If no confirmation from the server is received within the specified time, a DeliveryException will be thrown.

  • By default, delivery.timeout.ms has a value of 2 minutes (120000 milliseconds), which is usually enough to cover the vast majority of cases. However, in some cases, such as high network latency or busy servers, it may be necessary to increase this value in order to more fully utilize the fault tolerance and availability of the Kafka cluster.

  • It should be noted that delivery.timeout.ms is only applicable to asynchronous sending mode (that is, using the send method instead of the sendSync method). In synchronous sending mode, since each request will block, there is no timeout problem.

  • In general, delivery.timeout.ms is an important Kafka producer configuration parameter that can help control the maximum waiting time for sending messages, thereby improving the reliability and stability of message delivery. However, reasonable adjustments need to be made according to the actual situation to avoid problems of excessive waiting or message loss.

Notice:

  • retry.backoff.ms is a parameter in the Kafka producer configuration that controls how long to wait when retrying to send a message. Specifically, it defines how long the client waits before the next attempt after sending a message fails the first time.

  • By default, retry.backoff.ms has a value of 100 milliseconds, which is usually sufficient for most network failures and other unusual situations. However, in some cases, such as high network latency or a busy server, it may be necessary to increase this value for more robust handling of message delivery failures.

  • It should be noted that retry.backoff.ms is only applicable to asynchronous sending mode (that is, using the send method instead of the sendSync method). In synchronous sending mode, since each request will block, there is no retry problem.

  • In general, retry.backoff.ms is an important Kafka producer configuration parameter that can help control the time to wait when retrying to send messages, and improve the reliability and stability of message delivery. However, reasonable adjustments need to be made according to the actual situation to avoid problems of excessive waiting or message loss.

  • When will RetriableException be thrown?

    • The exception caused by the re-election of the partition leader belongs to RetriableException, and the producer of this type of exception will retry. Because after the Leader election is completed, resending the message will succeed.
    • For exceptions caused by incorrect configuration information, the producer will not retry, because the program cannot automatically modify the configuration after trying many times, and human intervention is still required. It is meaningless to retry message sending for such exceptions.
  • How to deal with the failure after retrying multiple times?

    • This situation is possible. After reaching retriesthe upper limit or delivery.timeout.msthe upper limit, the message sending has been retried several times, but the sending is still not successful. For this case, we still have to treat differently
      • If it is the user's order data, user's payment data, etc., such data must not be lost. When an exception occurs, the developer needs to catch the exception and handle the exception. Either put the data that failed to be sent into the database, or write the file and save it first. Wait for the exception to be resolved through human intervention, and then send it to Kafka again.
      • If the message data is data such as user web page clicks and product readings, the amount of data is large, and there are not many requirements for data processing delays. Even data loss under abnormal circumstances is not intolerable. For such cases, there is actually no need to do too much exception handling. Do a good job in alarm and log records, find and solve problems, and optimize from the perspectives of programs, kafka servers, and network performance.
  • Retrying may cause the problem of repeated consumption of messages. How to solve this problem?

    • The producer sends data to the broker for the first time. Due to network reasons, the producer has not been able to get the confirmation of the successful message written by the server .即:实际上消息数据已经在服务端写入成功,但是生产者没有接收到服务端的ack响应。
    • Since the producer did not receive confirmation of a successful write, it considers the message delivery to have failed. So the message is resent, and as a result, the message may be written multiple times.
    • Before kafka0.11.0.0, it was impossible to realize exactly once, that is, it was impossible to realize that the message was sent once and only once . Introduced from version 0.11.0.0 EOS(exactly once semantics,精确一次处理语义), by implementing idempotence and transaction processing of message data, message data can be sent exactly once.

message sequence problem

In this section, we discuss how the kafka production end ensures the orderly sending of messages? How many common methods are there?

insert image description here

  • The kafka producer buffer contains several buffer queues, and each buffer queue corresponds to a partition of a topic on the kafka server.
  • The data structure of the buffer queue is Deque, which is a double-ended queue. Data is put into one end and data is taken out from the other end.

Combined with the above figure, it can be seen that:

  • In the double-ended buffer queue in the producer, the order of messages can be guaranteed, and one end goes in and one end goes out.
  • Each double-ended queue corresponds to a topic partition of the Kafka server, so Kafka can guarantee the order of message data in a partition .

Therefore, to achieve the orderliness of messages, there are the following ideas:

  • Only one partition is created under the corresponding topic, then all data sending and consumption under this topic will be orderly —> topics with relatively small data volume can do this
  • Send messages that need to be ordered to the same partition through a custom partitioner
  • When sending a message, specify the key value, and messages with the same key will be sent to the same partition

How to avoid retries causing messages to be out of order

The kafka producer provides a retry mechanism for message sending, that is to say, after the message fails to be sent, the kafka producer will resend the message, then the following situation will occur:

  • After the first batch of messages is sent, the data sending fails due to some special reason (such as the topic partition is re-electing the Leader)
  • The second batch of messages is sent, and the server data is saved successfully.
  • Because the first batch of messages failed to send, Kafka retried to send the first batch of data, and this time it succeeded

This will cause the data sent to the kafka partition to be out of sequence. To avoid this problem, we need to set the relevant parameters of the production side, as follows:

max.in.flight.requests.per.connection=1

The function of this parameter is: For a kafka client request connection (which can be considered as a producer), once a batch of messages fails to be sent, before the batch of data is retried (resent) successfully, the next Batch message data sending is blocked. If the previous batch is unsuccessful, the next batch will never be sent out.

Notice:

  • max.in.flight.requests.per.connection is a parameter in the Kafka producer configuration that controls the number of unacknowledged requests that can be sent to the server per connection.
  • Specifically, it defines the maximum number of requests that can be sent to a TCP connection before no response from the server is received on that connection.
  • For example, if max.in.flight.requests.per.connection is set to 1, the client must receive a response to a previously sent request before sending the next request.
  • The default value of this parameter is 5, which means that there can be at most 5 unacknowledged requests on a TCP connection.
  • By increasing the value of this parameter, the performance of the Kafka client can be improved because it allows more requests to be sent and processed simultaneously.
  • However, if the setting is too high, this may cause Kafka Broker overload or network congestion, thereby affecting the availability and performance of the entire system.
  • 需要注意的是,max.in.flight.requests.per.connection只适用于异步发送模式(即使用send方法而不是sendSync方法)。在同步发送模式下,由于每个请求都会阻塞,所以不存在未确认的请求问题。
  • 总的来说,max.in.flight.requests.per.connection是一个重要的Kafka生产者配置参数,可以帮助优化生产者的性能和吞吐量,但需要根据实际情况进行合理的调整。

custom interceptor

  • By implementing the interceptor interface class ProducerInterceptor provided by kafka, the effect of the message interceptor can be realized, as follows:
public interface ProducerInterceptor<K, V> extends Configurable {
    
    
    /**
    * 该方法封装于KafkaProducer.send()方法中,运行在用户主线程
    * Producer确保在消息序列化前调用该方法,可以对消息进行任意操作,但慎重修改消息的topic、key和partition,会影响分区以及日志压缩
    */
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record);

    /**
    * 该方法在消息发送结果应答或者发送失败时调用,并且通常都是在callback()触发之前执行,运行在IO线程中
。实现该方法的代码逻辑尽量简单,否则影响消息发送效率
    */
    public void onAcknowledgement(RecordMetadata metadata, Exception exception);

    /**
    * 生产者的producer.close触发
    */
    public void close();
}
  • For example: statistics on the number of times the message was sent successfully or failed
public class RequestStatCalInterceptor implements ProducerInterceptor<String,String> {
    
    
    private static final String MSG_PREFIX="dhy:";
    private final AtomicInteger successCnt = new AtomicInteger(0);
    private final AtomicInteger errorCnt = new AtomicInteger(0);


    @Override
    public ProducerRecord<String,String> onSend(ProducerRecord<String,String> msg) {
    
    
        return new ProducerRecord<>(msg.topic(),msg.partition(),msg.timestamp(),msg.key(),MSG_PREFIX+msg.value());
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
    
    
        if (metadata == null) {
    
    
            errorCnt.getAndIncrement();
        }else {
    
    
            successCnt.getAndIncrement();
        }
    }

    @Override
    public void close() {
    
    
        double successRate = (double) successCnt.get() / (successCnt.get() + errorCnt.get());
        System.out.println("消息发送成功率:" + successRate*100 +"%");
    }

    @Override
    public void configure(Map<String, ?> configs) {
    
    

    }
}
  • application blocker

Notice:

  • The producer interceptor can be customized before the message is sent and before the producer callback, allowing the user to specify multiple Interceptors to act on a message in the order of configuration to form an interception chain
//拦截器的配置
props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, Collections.singletonList("interceptor.RequestStatCalInterceptor"));

You can also specify and configure an interceptor separately:

props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG, RequestStatCalInterceptor.class.getName());
  • test
    insert image description here

custom serializer

The kafka client producer serialization interface is as follows. If we need to implement serialization of custom data formats, we need to define a class to implement this interface.

What is serialization and deserialization:

  • Converting an object into a transportable and storable format (json, xml, binary, or even a custom format) is called serialization.
  • Deserialization is to convert a transportable and storable format into an object.
  • The serializer interface provided by kafak
/**
 * 将对象转成二进制数组的接口序列化实现类
 */
public interface Serializer<T> extends Closeable {
    
    

    /**
     * 参数configs会传入生产者配置参数,
     * 序列化器实现类可以根据生产者参数配置影响序列化逻辑
     * isKey布尔型,表示当前序列化的对象是不是消息的key,如果不是key就是value
     */
    default void configure(Map<String, ?> configs, boolean isKey) {
    
    
        // intentionally left blank
    }

    /**
     * 重要方法将对象data转换为二进制数组
     */
    byte[] serialize(String topic, T data);


    default byte[] serialize(String topic, Headers headers, T data) {
    
    
        return serialize(topic, data);
    }

    /**
     * 关闭序列化器
     * 此方法的实现必须是幂等的,因为可能被调多次
     */
    @Override
    default void close() {
    
    
        // intentionally left blank
    }
}
  • An example of serialization using Jackson, which is SpringBoot's default JSON processing framework
<dependency> 
    <groupId>com.fasterxml.jackson.core</groupId> 
    <artifactId>jackson-databind</artifactId> 
    <version>2.9.7</version> 
</dependency>
  • Define a class as the target class for our object serialization.
public class Peo {
    
    
   private String name;
   private Integer age;
   ...
}
  • custom serializer
public class JacksonSerializer implements Serializer<Peo> {
    
    
    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Peo data) {
    
    
        byte[] result=null;
        try {
    
    
            result=objectMapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
    
    
            e.printStackTrace();
        }
        return result;
    }
}

Note: ObjectMapper is thread-safe and can be shared and reused across multiple threads. Its thread safety mainly comes from the following two aspects:

  • ObjectMapper itself is immutable
  • ObjectMapper objects are not modified after creation and thus can be considered immutable objects. This means that multiple threads can access the same ObjectMapper instance at the same time without worrying about concurrent modification and race conditions.
  • ObjectMapper uses thread-safe data structures
  • ObjectMapper uses some thread-safe data structures, such as ThreadLocal, ConcurrentHashMap, and CopyOnWriteArrayList. These data structures are specifically designed for use in a multithreaded environment and provide efficient concurrent access and modification.
  • It should be noted that although ObjectMapper itself is thread-safe, the classes it uses (such as serialized and deserialized POJO classes) may not be thread-safe. Therefore, when dealing with complex data types, you need to consider the thread safety of these classes, and perform additional operations such as synchronization or replication when necessary to avoid race conditions and thread safety issues.
  • Apply serializer

Notice:

  • The kafka producer message can only choose one format, and the previous data cannot be JSON, and the next data can not be XML. So only one serializer can be configured.
        //序列化器的配置
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JacksonSerializer.class.getName());
  • Write test cases and check whether the serializer works with breakpoints
    @Test
    void testJacksonSerializer() throws ExecutionException, InterruptedException {
    
    
        Peo peo = new Peo();
        peo.setName("dhy");
        peo.setAge(18);
        KafkaProducer<String, Peo> producer = new KafkaProducer<>(props);
        RecordMetadata metadata = producer.send(new ProducerRecord<>(TEST_TOPIC, peo)).get();
        System.out.println("消息偏移量为:"+metadata.offset());
    }

custom partitioner

The default partition strategy of KafkaProducer is:

  • Default strategy 1: If the producer specifies a partition, send it directly to the partition
  • Default strategy 2: If no partition is specified but a key is specified, the partition will be selected according to the hash value of the key, and messages with the same key value will be sent to the same partition.
  • Default strategy 3: If neither the partition nor the key is specified, the round-robin strategy is used, which can ensure that messages are relatively evenly distributed to multiple partitions under the same topic.

insert image description here

insert image description here


  • In order to ensure the order in which producer messages are sent and the order in which consumers consume data, these messages must be sent to the same partition
  • If you want to send messages to the same partition, there are three ways:
    • The producer manually specifies the partition
    • For messages that need to be sent to the same partition, specify the same key
    • Custom partitioner implements partition logic

  • partitioner interface

/**
 * 分区器接口
 */
public interface Partitioner extends Configurable, Closeable {
    
    

    /**
     * 根据消息record信息对其进行重新分区
     *
     * @param topic 主题名称
     * @param key 用于分区的key对象
     * @param keyBytes 用于分区的key的二进制数组
     * @param value 生产者消息对象
     * @param valueBytes 生产者消息对象的二进制数组
     * @param cluster 当前kafka集群的metadata信息
     */
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster);

    /**
     * 当分区器执行完成时被调用
     */
    public void close();


    default public void onNewBatch(String topic, Cluster cluster, int prevPartition) {
    
    }
}
  • custom partitioner
/**
 * 通过对消息value进行hash,然后取余于分区数计算出消息要被路由到的分区
 */
public class ValuePartitioner implements Partitioner {
    
    
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    
    
        return partition(valueBytes, cluster.partitionsForTopic(topic).size());
    }

    private int partition(byte[] valueBytes, int numPartitions) {
    
    
        return Utils.toPositive(Utils.murmur2(valueBytes)) % numPartitions;
    }

    @Override
    public void close() {
    
    
    }

    @Override
    public void configure(Map<String, ?> configs) {
    
    
    }
}
  • Apply a custom partitioner: Specify a custom partitioner for the producer, so that after the configuration is complete, when the producer sends a message again, it will follow the partition rules defined in the partition method in the partitioner and send the data to the specified partition.
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, ValuePartitioner.class.getName());

Idempotence and transactions

In the section on Retry Mechanism of Producer Message Reliability, we talked about the retry mechanism after the Kafka producer fails to send data, and also introduced a possible abnormal situation:

  • The producer sends data to the broker. Due to network reasons, the producer may not be able to get the confirmation from the server (confirmation that the message is sent successfully). In fact, the message data has been successfully sent, and the Kafka server broker has successfully written it.
  • Since the producer did not receive confirmation of a successful write, it considers the message delivery to have failed. So the message is resent, and as a result, the message may be written a second time on the kafka broker server.

Normally, we cannot accept this situation, and the desired effect is exactly once (a batch of data is sent successfully once, and only once). This was not possible before version 0.11.0.0. After version 0.11.0.0, Kafka introduced idempotent and transaction mechanisms to support exactly once semantics.

Concept introduction:

  • Idempotence: Simply put, the result of multiple calls to the interface is consistent with the result of one call. For kafka, the result of sending a message once is the same as sending the message multiple times, and the message will not be repeatedly processed by consumers.
  • Transaction: The term transaction may be familiar to developers, and it usually refers to a series of operations that either succeed or fail (rollback). For kafka, transactions are used to ensure that multiple messages are either sent successfully (all written to the kafka broker data log), or are not written to the kafka data log.

Kafka realizes idempotence

  • Kafka is very easy to achieve idempotence, just set the producer client parameter enable.idempotence to true (the default value of this parameter is false)

Question: How does Kafka send repeated messages (retry) and still guarantee idempotence?

  • For this reason, the Kafka producer introduces the two concepts of producer id (hereinafter referred to as PID) and sequence number (sequence number).
    • Each kafka producer client is assigned a PID when it is initialized
    • PID + serial number can represent a unique message data, and each message of the producer corresponds to a unique serial number. Even if the same message is sent multiple times, the sequence number corresponding to the message will not change.
    • At the same time, the kafka broker server will save the start sequence number (start_seq) and end sequence number (end_seq) of the data batches that have been successfully sent for each producer (PID).
    • Therefore, when a new batch of messages is sent to the server, the serial number interval comparison will be performed first. Once an overlap occurs, it means that the messages with overlapping serial numbers have been successfully written on the server, and the duplicate message data can be Discard it to avoid repeated consumption on the consumer side.

Notice;

  • The idempotency we mentioned above is based on a certain partition, that is to say, Kafka's idempotence can only guarantee the idempotency of a single partition of a certain topic.
  • Therefore, if the retried message and the same message sent for the first time are sent to different partitions, idempotency will not take effect.
  • However, this situation usually does not occur, because even if the message fails to be sent and retried, the key, value, and topic information of the message itself does not change, the message partition algorithm does not change, and the number of partitions does not change. Under these premises, even if the same message is sent repeatedly, it will be sent to the same partition.
  • Kafka's idempotence mechanism can only guarantee the idempotency of a single partition of a topic, because idempotence is implemented based on the partition ID. Each partition has its own unique identifier against which messages are checked for idempotency. Therefore, if messages have the same key in multiple partitions, they will be treated as different messages in each partition, and global idempotence cannot be achieved.

Kafka implements transactions

Kafka's idempotence solves the problem that the same message is sent multiple times to the same partition. So if multiple different messages are sent to different partitions, how can we ensure that multiple messages are either sent successfully (all are written to the kafka broker data log), or are not written to the kafka data log?

This requires relying on kafka transactions to achieve:

  • The kafka producer needs to set the transactional.id parameter, which can be considered as the id of the transaction manager
  • The kafka transaction producer turns on idempotence, that is, enable.idempotence is set to true (if not explicitly set, KafkaProducer will set its value to true by default). If the user explicitly sets enable.idempotence to false, a ConfigException will be reported.

KafkaProducer provides 5 transaction-related methods, detailed as follows:

//初始化事务
void initTransactions();
//开启事务
void beginTransaction() throws ProducerFencedException;
//为消费者提供在事务内的位移提交的操作
void sendOffsetsToTransaction(Map<TopicPartition, OffsetAndMetadata> offsets, String consumerGroupId)throws ProducerFencedException;
//提交事务
void commitTransaction() throws ProducerFencedException;
//中止事务,类比事务的回滚
void abortTransaction() throws ProducerFencedException;

transaction isolation level

There is a parameter isolation.level in the Kafka consumer client. The default value of this parameter is "read_uncommitted", which means that the consumer application can see (consume) uncommitted transactions, and of course it is also visible to committed transactions.

This parameter can also be set to "read_committed", indicating that the consumer application cannot see the messages in the uncommitted transaction.

for example:

  • If the producer starts a transaction and sends three messages msg1, msg2, and msg3 to a partition value, the consumer application set to "read_committed" cannot consume these messages before executing the commitTransaction() or abortTransaction() method. However, these messages will be cached inside KafkaConsumer, and it will not be able to push these messages to the consumer application until the producer executes the commitTransaction() method.
  • Conversely, if the producer executes the abortTransaction() method, KafkaConsumer will discard these cached messages.

demo

/**
 * 生产者使用demo
 */
public class KafkaProducerTest {
    
    
    private static final String TEST_TOPIC = "test1";
    private Properties props;

    @BeforeEach
    public void prepareTest() {
    
    
        props = new Properties();
        //kafka broker列表
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5000);
        //可靠性确认应答参数
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        //发送失败,重新尝试的次数
        props.put(ProducerConfig.RETRIES_CONFIG, "3");
        //生产者数据key序列化方式
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        //序列化器的配置
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        //生产者端开启幂等
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, Boolean.TRUE);
        //生产者端开启事务
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "test-transaction");
    }


    @Test
    void testTransaction() throws ExecutionException, InterruptedException {
    
    
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        //0.初始化事务管理器
        producer.initTransactions();
        //1.开启事务
        producer.beginTransaction();
        try {
    
    
            //2.发送消息
            producer.send(new ProducerRecord<>(TEST_TOPIC, Integer.toString(1), "test1"));
            producer.send(new ProducerRecord<>(TEST_TOPIC, Integer.toString(2), "test2"));
            producer.send(new ProducerRecord<>(TEST_TOPIC, Integer.toString(3), "test3"));
            //3.提交事务
            producer.commitTransaction();
        } catch (ProducerFencedException e) {
    
    
            e.printStackTrace();
            //4.1 事务回滚
            producer.abortTransaction();
        } catch (KafkaException e) {
    
    
            e.printStackTrace();
            //4.2 事务回滚
            producer.abortTransaction();
        } finally {
    
    
            producer.close();
        }
    }
}

consumer

rebalance

  • What is partition rebalancing?
    • We know that a topic partition is consumed by a consumer in a consumer group.
    • The so-called partition rebalancing (rebalancing) refers to reestablishing the relationship between partitions and consumers compared to the first balance state.
      • When the service where the consumer group is located is started, the consumer will be assigned a topic partition that it can access data. This is the first time a relationship between a consumer and a partition is established, and it is the first partition balance
      • The so-called partition rebalancing refers to the action of re-establishing the relationship between consumers and partitions due to changes in certain external conditions during data consumption.

insert image description here

  • When does partition rebalancing happen?

    • The number of partitions consumed by the consumer group changes (adding partitions). Kafka currently only supports adding partitions for a certain topic
    • The number of consumers increases. When the consumer application in the original consumer group is running normally, a new service is started. The service contains consumers with the same groupId as the original consumer, resulting in an increase in consumers in the consumer group. .
    • When a consumer group subscribes to a topic in the form of a regular expression, when a new topic is created, and the topic name matches the subscription regular expression of the consumer group. Also triggers partition rebalancing.
    • A decrease in the number of consumers consuming a topic. Common situation: When a consumer fails to complete the data processing for a long time after pulling the data (the next data pull action is not performed), the Kafka server thinks that the consumer has hung up (that is, the Kafka server thinks that the consumption in the consumer group decreased in number).
  • What are the effects of rebalancing?

    • Rebalance will affect the processing performance of consumers. When rebalance occurs, consumers lose connection with partitions, cannot poll to pull data, and cannot submit consumption offsets.
    • The speed of rebalance is very slow. When you have many topic partitions and a large number of consumers in the consumer group, this process may last for dozens of minutes.
  • How to avoid rebalancing?

    • The first three points that lead to the rebalancing behavior are our proactive behaviors, which can avoid the operation of adding or subtracting consumers and increasing partitions during busy hours
    • For the fourth point, the number of consumers in the consumer group changes, such as: the number of consumers decreases. When a consumer pulls a batch of data and cannot complete the processing of this batch of data for a long time (without submitting the offset), the kafak server thinks that the consumer has hung up (that is, the number of consumers in the consumer group has decreased) , which will trigger Rebalance. In this case, we can avoid it by setting the following parameters:
session.timout.ms=10000
heartbeat.interval.ms=2000
max.poll.interval.ms=配置该值大于消费者批处理消息最长耗时(默认5分钟)
max.poll.records=500(默认值是500
  • The consumer maintains a heartbeat with the Kafka server. Once the server session.timout.msdoes not receive the consumer's heartbeat within the set time, the consumer is considered dead. So this value can be relatively large, such as 10s.
  • heartbeat.interval.msIt is the time interval for consumers to send heartbeats to the Kafka server. The smaller the value, the higher the frequency, and the lower the probability of misjudgment of heartbeat loss.
  • When a consumer pulls a batch of data and max.poll.interval.msstill does not execute the next data pull poll after the time expires (because the data processing timed out), the Kafka server considers the consumer to hang up. Therefore, in order to avoid rebalance, we should make the data processing time of a single batch (the default is 500 for pulling a batch) less than the max.poll.interval.msconfigured value.
  • The above is to avoid rebalance by increasing the timeout time of single batch processing. We can also max.poll.interval.msreduce the configuration value while keeping it unchanged max.poll.records. The less data pulled in a batch, the shorter the data processing time, thus avoiding rebalance problems caused by timeouts.

  • Why does rebalancing cause repeated consumption of messages?

    • During the rebalance time of the consumers in the consumer group, all consumers in the group will stop pulling data, and will be temporarily disconnected from the server.
    • Possible problems, for example: 500 pieces of data were fetched in the last batch, and a rebalance occurred before the data was processed, and the consumption offset of this batch could not be submitted. After the rebalance is completed, when the consumer consumes this partition again, according to the consumption offset recorded by the server, the pulled data is still the original 500 pieces, which leads to the problem of repeated consumption.
  • How to solve the problem of repeated consumption of messages caused by rebalancing?

    • We expect there to be a way to commit offsets before rebalance. This requirement can be met by implementing the ConsumerRebalanceListener interface. When a rebalance event occurs in a partition of a certain topic, the offset of the consumer is submitted. A specific example:
public class ConsumerBalance {
    
    
    private static KafkaConsumer<String, String> consumer;

    /**
     * 存储一个主题多个分区的当前消费偏移量
     */
    private static Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

    /**
     *  初始化消费者
     */
    static {
    
    
        Properties configs = initConfig();
        consumer = new KafkaConsumer<>(configs);
        //主题订阅
        consumer.subscribe(Collections.singletonList("test1"), new RebalanceListener(consumer));
    }

    /**
     * 初始化配置
     */
    private static Properties initConfig() {
    
    
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dhy-group");
        props.put("enable.auto.commit", true);
        props.put("auto.commit.interval.ms", 1000);
        props.put("session.timeout.ms", 30000);
        props.put("max.poll.records", 1000);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return props;
    }

    public static void main(String[] args) {
    
    

        while (true) {
    
    
            // 这里的参数指的是轮询的时间间隔,也就是多长时间去拉一次数据
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(3000));
            records.forEach((ConsumerRecord<String, String> record) -> {
    
    
                System.out.println("topic:" + record.topic()
                        + ",partition:" + record.partition()
                        + ",offset:" + record.offset()
                        + ",key:" + record.key()
                        + ",value" + record.value());
                // 每次消费记录消费偏移量,用于一旦发生rebalance时提交
                currentOffsets.put(new TopicPartition(record.topic(), record.partition()), new OffsetAndMetadata(record.offset() + 1, "no matadata"));
            });
            consumer.commitAsync();
        }
    }

    static class RebalanceListener implements ConsumerRebalanceListener {
    
    
        KafkaConsumer<String, String> consumer;

        public RebalanceListener(KafkaConsumer<String,String> consumer) {
    
    
            this.consumer = consumer;
        }

        /**
         * 在rebalance发生之前和消费者停止读取消息之后被调用
         */
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
    
    
            consumer.commitSync(currentOffsets);
        }

        /**
         * 在rebalance完成之后(重新分配了消费者对应的分区),消费者开始读取消息之前被调用。
         */
        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
    
    
            consumer.commitSync(currentOffsets);
        }
    }
}

The correct practice of consumer groups and thread pools

  • Error example 1: Multiple threads use a consumer
    • Create multiple threads to consume kafka data
    • Multiple threads use the same KafkaConsumer object
    • Use this KafkaConsumer object in a single thread to complete data fetching, processing, and submitting offsets.
      insert image description here

wrong reason:

  • KafkaConsumer is thread-unsafe because its design includes non-linear state and may lead to unexpected results in multi-threaded situations. Specifically, the following are some factors that can create thread safety issues:

    • When the poll() method is called, KafkaConsumer will have many internal state changes, such as the current partition, consumption displacement, etc. When these internal states are shared by multiple threads, there will be race conditions, resulting in consumption displacement errors or repeated consumption of messages.

    • KafkaConsumer will automatically submit the consumption displacement. If multiple threads call the commitSync() or commitAsync() method at the same time, a race condition will occur, resulting in an error in the consumption displacement submission.

    • When KafkaConsumer processes messages, it needs to use cache (such as offsetsForTimes cache) to improve efficiency. If multiple threads modify the cache at the same time, it will cause data inconsistency, or even NullPointerException and other exceptions.

  • Error example 2: Pull messages and hand them over to the thread pool for batch processing

insert image description here

Not recommended for use reasons:

  • This processing method is not an error, but it is just a consumer consuming data in the kafka message queue, not a consumer group. Kafka partitions cannot be fully utilized to improve the throughput of message processing.

  • Correct approach: use thread pool to implement consumer group

insert image description here

  • Because KafkaConsumer is not thread-safe, KafkaConsumer cannot be used across threads
  • Each thread holds a KafkaConsumer object
  • The implementation of multiple threads can use the thread pool, and the number of threads in the thread pool is equal to the number of consumers in the consumer group
class ConsumerGroupThreadPoolTest {
    
    
    @Test
    void test(){
    
    
        ExecutorService executorService = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
    
    
            executorService.execute(new MyConsumer());
        }
    }
}
  • MyConsumer calls the KafkaConsumer related method of pulling and canceling consumption data in the run method
class MyConsumer implements Runnable {

    private static final String TEST_TOPIC = "test1";
    private final KafkaConsumer<String, String> consumer;

    public MyConsumer() {
        Properties props = new Properties();
        // Kafka cluster address
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // consumer group name
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dhy_group");
        // key deserializer
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // value deserializer
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // create the consumer
        consumer = new KafkaConsumer<>(props);
    }

    @Override
    public void run() {
        consumeTemplate(MyConsumer::printRecord, null);
    }

    /**
     * recordConsumer handles a single record; it should catch its own exceptions
     * so that the surrounding while loop is not interrupted by them.
     */
    public void consumeTemplate(Consumer<ConsumerRecord<String, String>> recordConsumer, Consumer<KafkaConsumer<String, String>> afterCurrentBatchHandle) {
        consumer.subscribe(Collections.singletonList(TEST_TOPIC));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(100));
                for (ConsumerRecord<String, String> record : records) {
                    recordConsumer.accept(record);
                }
                if (afterCurrentBatchHandle != null) {
                    afterCurrentBatchHandle.accept(consumer);
                }
            }
        } finally {
            consumer.close();
        }
    }

    private static void printRecord(ConsumerRecord<String, String> record) {
        System.out.println("topic:" + record.topic()
                + ",partition:" + record.partition()
                + ",offset:" + record.offset()
                + ",key:" + record.key()
                + ",value:" + record.value());
        record.headers().forEach(System.out::println);
    }
}

interceptor

  • Consumer consumer-side interceptor interface
public interface ConsumerInterceptor<K, V> extends Configurable, AutoCloseable {

    /**
     * Called just before the records are returned to the client, i.e. before poll() returns.
     * The method may modify the record collection, and whatever it returns is what the client sees.
     * There is no limit on the number of records returned, so records can be filtered out
     * or new ones generated here.
     */
    public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records);

    // called after offsets have been committed
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets);

    // called when the interceptor is closed
    public void close();
}

  • Application scenarios of consumer-side interceptors: client monitoring, end-to-end system performance detection, message auditing and other scenarios
  • Use case: calculate the average delay between sending and receiving a batch of data.
    • The send time of each record is carried in the ConsumerRecord timestamp, which the producer sets when the message is constructed.
    • The receive time of a batch can be taken as the current time, System.currentTimeMillis().
    • The accumulated delay of all received records is kept in totalLatency and the number of batches processed so far in msgCount; dividing the two gives the average delay of a "batch of messages" from producer to consumer.
package interceptor;

import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

public class LatencyCalConsumerInterceptor implements ConsumerInterceptor<String, String> {

    /**
     * accumulated delay of all records received so far
     */
    private static final AtomicLong totalLatency = new AtomicLong();
    /**
     * number of batches processed so far
     */
    private static final AtomicLong msgCount = new AtomicLong();

    /**
     * Called before the consumer gets to process the data.
     */
    @Override
    public ConsumerRecords<String, String> onConsume(ConsumerRecords<String, String> records) {
        long latency = 0L;
        // accumulate the delay of every record in this batch
        for (ConsumerRecord<String, String> msg : records) {
            latency += (System.currentTimeMillis() - msg.timestamp());
        }
        // total delay accumulated so far
        long totalLatencyLong = totalLatency.addAndGet(latency);
        // total number of batches so far
        long msgCountLong = msgCount.incrementAndGet();
        System.out.println("Average delay from send to consumption for this batch: " + (totalLatencyLong / msgCountLong));
        return records;
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) {
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}
  • Apply a custom interceptor
props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG, LatencyCalConsumerInterceptor.class.getName());
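
If several interceptors are needed (for example latency statistics plus auditing), interceptor.classes accepts a comma-separated list and the interceptors are invoked in the configured order. A small sketch, where AuditConsumerInterceptor is a hypothetical second implementation:

props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
        LatencyCalConsumerInterceptor.class.getName() + "," + AuditConsumerInterceptor.class.getName());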

deserializer

  • Serialization: the Kafka producer serializes a Peo object into JSON, then turns the JSON into a byte[] stream for network transmission
  • Deserialization: the Kafka consumer receives the byte[] stream, deserializes it back into JSON, and reconstructs the Peo object from the JSON
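
For completeness, the producer side needs a matching Serializer implementation that turns a Peo object into a JSON byte[]. A minimal Jackson-based sketch (JacksonSerializer is not part of the demo project, just the mirror image of the deserializer shown below):

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

public class JacksonSerializer implements Serializer<Peo> {

    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Peo data) {
        try {
            // null stays null so that tombstone records are still possible
            return data == null ? null : objectMapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Failed to serialize Peo to JSON", e);
        }
    }
}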

Consumer deserialization interface:

public interface Deserializer<T> extends Closeable {

    /**
     * The configs parameter receives the consumer configuration, so an implementation
     * can adapt its deserialization logic to the consumer settings.
     * isKey indicates whether the object being deserialized is the message key;
     * if it is not the key, it is the value.
     */
    default void configure(Map<String, ?> configs, boolean isKey) {
    }

    // core deserialization method: turns the byte array into an object of type T
    T deserialize(String topic, byte[] data);

    default T deserialize(String topic, Headers headers, byte[] data) {
        return this.deserialize(topic, data);
    }

    default void close() {
    }
}
  • Example: Deserialize an object using Jackson
/**
 * Deserializer that reads a Peo object from JSON using Jackson
 */
public class JacksonDeserializer implements Deserializer<Peo> {

    private static final ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public Peo deserialize(String topic, byte[] data) {
        try {
            return objectMapper.readValue(data, Peo.class);
        } catch (IOException e) {
            // swallowing the exception and returning null keeps the poll loop alive,
            // but the record is effectively lost
            e.printStackTrace();
            return null;
        }
    }
}

  • Apply deserializer
        //value的反序列化器
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JacksonDeserializer.class.getName());
  • test
    @Test
    void testDeserializer(){
        //1. create the consumer
        KafkaConsumer<String, Peo> consumer = new KafkaConsumer<>(props);
        //2. subscribe to the topic
        consumer.subscribe(Collections.singletonList(TEST_TOPIC));

        try {
            while (true) {
                // poll in a loop
                // Duration is the timeout: if data is available, poll returns immediately;
                // if not, it returns after the timeout with an empty record set
                ConsumerRecords<String, Peo> records = consumer.poll(Duration.ofSeconds(100));
                for (ConsumerRecord<String, Peo> record : records) {
                    System.out.println(record.value());
                }
            }
        } finally {
            // close the consumer before the application exits;
            // this also closes the network connection and socket and immediately
            // triggers a rebalance (rebalancing is covered in a later section)
            consumer.close();
        }
    }

Integrate Spring Boot

quick start

<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
  • parameter configuration
spring:
  kafka:
    bootstrap-servers: localhost:9092
    producer: # 生产者
      retries: 3  #发送失败重试次数
      acks: all  #所有分区副本确认后,才算消息发送成功
      # 指定消息key和消息体的序列化编码方式
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
    consumer: #消费者
      # 指定消息key和消息体的反序列化解码方式,与生产者序列化方式一一对应
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
      # 该参数作用见下文注释
      properties:
        spring:
          json:
            trusted:
              packages: '*'

Notice:

  • The producer's serializer and the consumer's deserializer come in pairs: if the producer serializes the value as JSON, the consumer should also deserialize it as JSON

  • spring.kafka.consumer.properties.spring.json.trusted.packages is a Kafka consumer property that tells Spring Kafka which Java packages it may trust when deserializing JSON messages.

  • In Kafka, messages are usually serialized, and Spring Kafka uses its JSON serializer/deserializer to process messages in JSON format. To prevent deserialization vulnerabilities, however, Spring Kafka only trusts a few basic Java types by default, such as classes under the java.lang package. If a JSON message contains other types, such as custom POJO classes, Spring Kafka will refuse to deserialize it.

  • To get around this, the spring.kafka.consumer.properties.spring.json.trusted.packages property specifies which Java packages Spring Kafka should trust.

  • Add the package of your custom class to this property so that Spring Kafka can deserialize JSON messages into that class. For example:

spring.kafka.consumer.properties.spring.json.trusted.packages=com.example.myapp.pojo
  • This tells Spring Kafka to trust the classes under the com.example.myapp.pojo package.
  • Note that this property only takes effect when the JSON serializer/deserializer is used; with other serializers/deserializers it has no effect.
  • If you want to customize the log level, use the following configuration.
logging:
  level:
    org:
      springframework:
        kafka: ERROR # spring-kafka
      apache:
        kafka: ERROR # kafka

Build the producer environment:

  • target
@Data
public class User {
    
    
    private String name;
    private Integer age;
}
  • Producer Test Cases
@SpringBootTest(classes = KafkaSpringBootDemo.class)
class SpringKafkaTest {
    
    

    @Resource
    KafkaTemplate<String, User> kafkaTemplate;

    @Test
    void testProducer() {
    
    
        User user = new User();
        user.setAge(21);
        user.setName("大忽悠");
        kafkaTemplate.send(TEST_KAFKA_TOPIC, user);
        //阻塞等待观察结果
        System.in.read();
    }
}

Notice:

  • KafkaTemplate<String, User> is the template class Spring provides for Kafka producers. It is generic: the key of the messages sent above is of type String and the value (message body) is of type User.
  • Because value-serializer is configured as org.springframework.kafka.support.serializer.JsonSerializer, the User object is serialized into JSON before being sent to the Kafka server.
  • Note that the topic "test3" was not created on the server before the data was sent, yet the send succeeded. By default, when a producer sends data to a topic that does not exist, the topic is created automatically (with a single partition). A sketch of declaring the topic explicitly instead follows below.
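
Relying on auto-creation gives a topic with only one partition and the broker defaults, so outside of a quick demo it is usually better to declare the topic explicitly. With spring-kafka this can be done by exposing a NewTopic bean, which the auto-configured KafkaAdmin registers on startup. A minimal sketch (the partition count and replication factor are illustrative; TopicBuilder requires Spring Kafka 2.3+, otherwise new NewTopic(...) can be used directly):

import org.apache.kafka.clients.admin.NewTopic;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.TopicBuilder;

@Configuration
public class KafkaTopicConfig {

    @Bean
    public NewTopic testTopic() {
        // illustrative sizes: 3 partitions, replication factor 1
        return TopicBuilder.name("test3")
                .partitions(3)
                .replicas(1)
                .build();
    }
}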

Consumer environment construction:

@Component
@Slf4j
public class KafkaConsumer {
    
    
    @KafkaListener(topics = TEST_KAFKA_TOPIC , groupId = TEST_CONSUMER_GROUP)
    public void dealUser(User user) {
    
    
      log.info("kafka consumer msg: {}",user);
    }
}

Notice:

  • The core annotation is @KafkaListener: topics specifies which topics to consume, and groupId specifies the consumer group name
  • User can be used directly as the method parameter because the consumer's value-deserializer is org.springframework.kafka.support.serializer.JsonDeserializer, which deserializes the User object sent by the producer.
  • The consumer group here has only one consumer. To start multiple consumer threads, set @KafkaListener(concurrency = n); as a rule of thumb, the number of consumer threads should equal the number of topic partitions.

Run the producer test case and view the output:
insert image description here
If the following error occurs during the test:

Caused by: java.lang.IllegalArgumentException: The class 'springboot.pojo.User' is not in the trusted packages
  • Because the package path of springboot.pojo is not trusted on the consumer side.
  • If you need to be trusted, you need to configure spring.kafka.consumer.properties.spring.json.trusted.packages: springboot.pojo;
  • If configured as '*', all paths will be trusted.

producer

The list of parameters supported by the send method of KafkaTemplate is as follows:

  • topic: the name of the topic
  • partition: the partition number of the topic, starting from 0; the message is sent to this specific partition
  • timestamp: the timestamp, which defaults to the current time
  • key: the key of the message; it can be of any type but is usually a String. Messages with the same key go to the same partition, so ordering is guaranteed per key.
  • data: the message payload, which can be of any type
  • ProducerRecord: the encapsulation class for a message, containing the fields above; rarely used directly
  • Message: Spring's own Message wrapper, containing the payload and message headers; rarely used directly
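
A few of the commonly used overloads, reusing TEST_KAFKA_TOPIC and the User object from the examples above (the key "user-1" and partition 0 are illustrative):

// value only: the partition is chosen by the configured partitioner
kafkaTemplate.send(TEST_KAFKA_TOPIC, user);

// key + value: records with the same key always land in the same partition
kafkaTemplate.send(TEST_KAFKA_TOPIC, "user-1", user);

// explicit partition + key + value
kafkaTemplate.send(TEST_KAFKA_TOPIC, 0, "user-1", user);

// ProducerRecord exposes every field, including timestamp and headers
ProducerRecord<String, User> record =
        new ProducerRecord<>(TEST_KAFKA_TOPIC, 0, System.currentTimeMillis(), "user-1", user);
kafkaTemplate.send(record);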

send asynchronously

The send method is asynchronous by default: it does not wait for the server to acknowledge the message, so if something goes wrong the producer client is not aware of it.

In order to enable the producer to perceive whether the message has been sent successfully, there are two ways:

  • send synchronously
  • Asynchronous send + callback function

Adding a callback function is written as follows:

   @Test
    void testAsyncWithCallBack() throws IOException {
    
    
        User user = new User();
        user.setAge(21);
        user.setName("大忽悠");
        kafkaTemplate.send(TEST_KAFKA_TOPIC, user).addCallback(new ListenableFutureCallback<SendResult<String, User>>() {
    
    
            @Override
            public void onFailure(Throwable ex) {
    
    
                log.error("msg send err: ",ex);
            }

            @Override
            public void onSuccess(SendResult<String, User> result) {
    
    
                // 消息发送到的topic
                String topic = result.getRecordMetadata().topic();
                // 消息发送到的分区
                int partition = result.getRecordMetadata().partition();
                // 消息在分区内的offset
                long offset = result.getRecordMetadata().offset();
                log.info("msg send success,topic: {},partition: {},offset: {}",topic,partition,offset);
            }
        });
        System.in.read();
    }

insert image description here


send synchronously

By default, send() is an asynchronous call. To make it block synchronously, call get() on the future returned by send.

The no-argument get() has an overload get(long timeout, TimeUnit unit); if the server has not acknowledged the write within that time, a TimeoutException is thrown.

    @Test
    void testSync() throws IOException {
    
    
        User user = new User();
        user.setAge(21);
        user.setName("大忽悠");
        try {
    
    
            SendResult<String, User> result = kafkaTemplate.send(TEST_KAFKA_TOPIC, user).get();
            // 消息发送到的topic
            String topic = result.getRecordMetadata().topic();
            // 消息发送到的分区
            int partition = result.getRecordMetadata().partition();
            // 消息在分区内的offset
            long offset = result.getRecordMetadata().offset();
            log.info("msg send success,topic: {},partition: {},offset: {}",topic,partition,offset);
        } catch (InterruptedException | ExecutionException e) {
    
    
            log.error("send msg sync occurs err: ",e);
        }
        System.in.read();
    }
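
If the acknowledgement should not be waited for indefinitely, the timeout overload can be used instead; a small sketch reusing the test context above (the 10-second timeout is illustrative):

try {
    // wait at most 10 seconds for the broker acknowledgement
    SendResult<String, User> result = kafkaTemplate.send(TEST_KAFKA_TOPIC, user).get(10, TimeUnit.SECONDS);
    log.info("msg send success, offset: {}", result.getRecordMetadata().offset());
} catch (TimeoutException e) {
    log.error("no acknowledgement within the timeout: ", e);
} catch (InterruptedException | ExecutionException e) {
    log.error("send msg sync occurs err: ", e);
}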

insert image description here


Interceptor and partitioner configuration

insert image description here
Spring treats interceptors and partitioners as uncommon configuration properties, and uncommon native properties are all configured under properties. In other words, every property that the native API accepts through Properties for the producer can be supplied here.

spring:
  kafka:
    producer:
      properties:
        interceptor.classes: springboot.producer.interceptor.RequestStatCalInterceptor
        partitioner.class: springboot.producer.partitioner.ValuePartitioner

Notice:

  • Note the plural/singular: interceptor.classes versus partitioner.class. Only one partitioner can be configured, while several interceptors can be chained. A minimal sketch of such a custom partitioner follows below.
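
The configuration above refers to springboot.producer.partitioner.ValuePartitioner without showing it; a custom partitioner simply implements Kafka's Partitioner interface. A minimal sketch (the value-hash routing rule is illustrative, not the project's actual implementation):

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

public class ValuePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // route by the hash of the value instead of the key (illustrative rule)
        return value == null ? 0 : Math.floorMod(value.hashCode(), numPartitions);
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}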

Transactions

Enabling idempotence is simple: set the producer client parameter enable.idempotence to true.

spring:
  kafka:
    producer:
      properties:
        enable.idempotence: true

Kafka transactions build on idempotence; with Spring Kafka the producer additionally needs a transaction id (spring.kafka.producer.transaction-id-prefix) before transactional sends are allowed. Once transactions are combined with Spring there are two ways to use them: manually via the template method, or automatically via annotations. The order payment scenario serves as the example here:

  • When a user pays an order, a message is sent to Kafka so that points are added to the user's account
  • The payment result is then stored in the database
  • If the payment fails, an exception is thrown, but the Kafka message has already been sent. What we actually want is for "payment succeeded" and "points added" to be atomic: either both succeed or both fail

Having introduced this application scenario for Kafka transactions, let's demonstrate the manual (template) usage:

   @Test
    void testTransaction() {
    
    
        User user = new User();
        user.setAge(21);
        user.setName("大忽悠");
        //调用事务模板方法
        kafkaTemplate.executeInTransaction(operations -> {
    
    
            operations.send(TEST_KAFKA_TOPIC, user);
            //业务处理发生异常,事务回滚
            throw new RuntimeException("fail");
        });
    }

The automatic (annotation-driven) way is to use the @Transactional annotation, which requires some additional Kafka transaction configuration; it is not the recommended approach because it is easy to confuse with database transactions. A sketch of what it looks like follows.
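
A hedged sketch of the annotation-driven style: once spring.kafka.producer.transaction-id-prefix is configured, Spring Boot auto-configures a KafkaTransactionManager, and the business method can be made transactional. OrderService, payOrder and saveOrder are illustrative names, not code from the demo project:

@Service
public class OrderService {

    @Resource
    private KafkaTemplate<String, User> kafkaTemplate;

    // naming the transaction manager explicitly avoids accidentally picking up
    // a database transaction manager that may also be present
    @Transactional("kafkaTransactionManager")
    public void payOrder(User user) {
        // message that triggers the points increase
        kafkaTemplate.send(TEST_KAFKA_TOPIC, user);
        // illustrative business call; any exception thrown from here on
        // rolls the Kafka transaction back as well
        saveOrder(user);
    }

    private void saveOrder(User user) {
        // illustrative persistence logic
    }
}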


consumer

Use the @KafkaListener annotation to mark a certain consumer. There are several attributes in the annotation, and the functions are as follows:

public @interface KafkaListener {

   /**
    * The id of the consumer; if not configured one is generated by default. If configured it
    * overrides groupId. In the author's experience this attribute does not need to be set.
    */
   String id() default "";

   /**
    * Name of a bean of type org.springframework.kafka.config.KafkaListenerContainerFactory
    */
   String containerFactory() default "";

   /**
    * One of three choices: the names of the topics this consumer group listens to
    */
   String[] topics() default {};

   /**
    * One of three choices: listen to several topics matched by a pattern
    * (the author has never used this and does not recommend it)
    */
   String topicPattern() default "";

   /**
    * One of three choices: listen to specific partitions of specific topics.
    */
   TopicPartition[] topicPartitions() default {};

   /**
    * Never used by the author; its purpose is unclear to them
    */
   String containerGroup() default "";

   /**
    * The listener's error handler, introduced in a later section
    * @since 1.3
    */
   String errorHandler() default "";

   /**
    * The group id of the consumer group
    * @since 1.3
    */
   String groupId() default "";

   /**
    * Whether the id attribute should be used as the consumer group id
    * @since 1.3
    */
   boolean idIsGroup() default true;

   /**
    * Prefix of the client id of the clients in this consumer group, used to categorize Kafka clients
    * @since 2.1.1
    */
   String clientIdPrefix() default "";

   /**
    * Used in SpEL expressions to access this listener's configuration,
    * e.g. the SpEL expression for the topic list is "#{__listener.topicList}"
    * @return the pseudo bean name.
    * @since 2.1.2
    */
   String beanRef() default "__listener";

   /**
    * How many consumer threads this consumer group starts to consume in parallel
    * @since 2.2
    */
   String concurrency() default "";

   /**
    * Whether to start automatically, true or false
    * @since 2.2
    */
   String autoStartup() default "";

   /**
    * Kafka consumer property configuration; all Apache Kafka consumer properties are supported
    * except group.id and client.id
    * @since 2.2.4
    */
   String[] properties() default {};

   /**
    * Never used by the author.
    * When false and the return type is an {@link Iterable} return the result as the
    * value of a single reply record instead of individual records for each element.
    * Default true. Ignored if the reply is of type {@code Iterable<Message<?>>}.
    * @return false to create a single reply record.
    * @since 2.3.5
    */
   boolean splitIterables() default true;

   /**
    * Never used by the author.
    * Set the bean name of a
    * {@link org.springframework.messaging.converter.SmartMessageConverter} (such as the
    * {@link org.springframework.messaging.converter.CompositeMessageConverter}) to use
    * in conjunction with the
    * {@link org.springframework.messaging.MessageHeaders#CONTENT_TYPE} header to perform
    * the conversion to the required type. If a SpEL expression is provided
    * ({@code #{...}}), the expression can either evaluate to a
    * {@link org.springframework.messaging.converter.SmartMessageConverter} instance or a
    * bean name.
    * @return the bean name.
    * @since 2.7.1
    */
   String contentTypeConverter() default "";

}

Best Practices

Make custom configurations of common information such as the topic monitored by the consumer, the name of the consumer group, and the number of consumers in the consumer group (rather than hard-coded in the code), as follows:

dhyconsumer:
    topic: topic1,topic2
    group-id: dhy-group
    concurrency: 5

The annotation attributes support SpEL expressions, so the configuration can be read into the attribute values:

@KafkaListener(topics = "#{'${dhyconsumer.topic}'.split(',')}",
        groupId = "${dhyconsumer.group-id}",
        concurrency="${dhyconsumer.concurrency}")
public void readMsg(ConsumerRecord consumerRecord) {
    
    
     //监听到数据之后,进行处理操作
}

Designated consumption location

In some special scenarios, you want to consume certain partitions (not all partitions) in the Topic topic. Or start consumption from a specified offset for a certain partition.

@KafkaListener(topicPartitions = {
        @TopicPartition(topic = "topic1", partitions = {"0", "1"}),
        @TopicPartition(topic = "topic2", partitions = {"0", "4"},
                partitionOffsets = @PartitionOffset(partition = "0", initialOffset = "300"))
})
public void readMsg(ConsumerRecord<?, ?> record) {
}

In the example above, the consumer listens to partitions 0 and 1 of topic1 (the topic itself may have more partitions), and to partitions 0 and 4 of topic2, starting consumption of topic2's partition 0 at offset 300.


listener factory

@Configuration
public class KafkaInitialConfiguration {

    /**
     * Consumer factory used to build the listener containers
     */
    @Autowired
    private ConsumerFactory<String, String> consumerFactory;

    /**
     * @return a listener container factory with a message filter strategy
     */
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> myFilterContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // filtered-out messages are acknowledged and discarded
        factory.setAckDiscarded(true);
        // configure the record filter strategy
        factory.setRecordFilterStrategy(new RecordFilterStrategy() {
            @Override
            public boolean filter(ConsumerRecord consumerRecord) {
                // put the filtering logic here:
                // records for which this returns true are discarded
                return true;
            }
        });
        return factory;
    }
}

To use it, reference the bean name (the bean method name above, myFilterContainerFactory) in the listener:

@KafkaListener(containerFactory ="myFilterContainerFactory")

Notice:

  • ConsumerFactory is used to create consumer instances. Its role is to simplify the consumer creation process, especially when using custom configurations, it can provide consumers with more flexibility.
  • ConcurrentKafkaListenerContainerFactory is a factory class provided by Spring Kafka, which is used to create and configure Kafka message listener containers. It can create multiple concurrent listener containers, so as to realize the ability of multi-threaded processing of Kafka messages.

Notice:

  • ConcurrentMessageListenerContainer is a component in the Spring framework. Its function is to listen to and process messages concurrently in the message queue. It can process messages in multiple threads at the same time, thereby improving the efficiency of message processing.

  • Specifically, ConcurrentMessageListenerContainer can configure multiple MessageListener instances and assign them to different threads to process messages. In this way, the bottleneck of message processing can be avoided and the throughput of the system can be improved. At the same time, ConcurrentMessageListenerContainer also supports batch processing of messages, which can process multiple messages in one call, further improving processing efficiency.

  • In addition, ConcurrentMessageListenerContainer also provides some other functions, such as:

    • Supports dynamic adjustment of the number of concurrent consumers, and can automatically adjust the number of concurrent consumers according to the load of the message queue.
    • Support message transaction processing, which can guarantee the atomicity and consistency of messages.
    • Supports message retry and dead letter processing, and can handle message processing failures due to various reasons.
  • In short, ConcurrentMessageListenerContainer is a very practical component that can help us process messages in the message queue more efficiently.
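
The concurrency of such a container can be set either with the @KafkaListener(concurrency = ...) attribute shown earlier or directly on the factory; a small sketch on the factory from the listener-factory example above (3 is an illustrative value):

// on the ConcurrentKafkaListenerContainerFactory built above;
// more threads than partitions just leaves consumers idle
factory.setConcurrency(3);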

Notice:

  • KafkaMessageListenerContainer is a component in the Spring Kafka library. It acts as a container for Kafka message listeners, which can automatically manage the life cycle of Kafka consumers, and provides some convenient configuration options and processing logic.

  • Specifically, KafkaMessageListenerContainer can listen to Kafka messages by subscribing to one or more Kafka topics, and automatically call the registered message listener for processing when the message arrives. It also supports some advanced features such as:

    • Commit offsets manually to ensure messages are fully processed before committing offsets.
    • Supports batch processing of messages to improve processing efficiency.
    • Provides some error handling mechanisms such as retries and error logging.
  • In short, KafkaMessageListenerContainer can greatly simplify the development of Kafka message processing, and provides some advanced features to improve the reliability and efficiency of message processing.


Other property configuration

In addition to some configuration properties mentioned above, in fact, the native configuration properties supported by apache kafka consumer are much more than those provided by Spring. All Apache Kafka native configuration properties can be passed through the properties configuration:

@KafkaListener(properties = {"enable.auto.commit:false", "max.poll.interval.ms:6000"})

Header acquisition

We can get the message header by annotation:

  • @Payload: the message body, i.e. the data that was sent
  • @Header(KafkaHeaders.RECEIVED_MESSAGE_KEY): the key of the received message
  • @Header(KafkaHeaders.RECEIVED_PARTITION_ID): the partition the current message was read from
  • @Header(KafkaHeaders.RECEIVED_TOPIC): the topic name the listener received the message from
  • @Header(KafkaHeaders.RECEIVED_TIMESTAMP): the timestamp of the message
@KafkaListener(topics = "topic1")
public void  readMsg(@Payload String data,
                         @Header(KafkaHeaders.RECEIVED_MESSAGE_KEY) Integer key,
                         @Header(KafkaHeaders.RECEIVED_PARTITION_ID) int partition,
                         @Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
                         @Header(KafkaHeaders.RECEIVED_TIMESTAMP) long ts) {
    
    

}

message forwarding

Spring-Kafka only needs to pass a @SendTo annotation to realize message forwarding, and the return value of the annotated method is the content of the forwarded message:

@Component
public class KafkaConsumer {

    @KafkaListener(topics = {"topic2"})
    @SendTo("topic1")
    public String listen1(String data) {
        System.out.println("Business A received the message: " + data);
        return data + " (processed)";
    }

    @KafkaListener(topics = {"topic1"})
    public void listen2(String data) {
        System.out.println("Business B received the message: " + data);
    }
}

Manually committing and autocommitting offsets

There are two types of Spring Kafka listener modes (spring.kafka.listener.type configuration property):

  • single: the listener message parameter is an object
  • batch: The listener message parameter is a collection

The listener message parameter is a single object

    @KafkaListener(topics = TEST_KAFKA_TOPIC , groupId = TEST_CONSUMER_GROUP)
    public void dealUser(User user) {
    
    
      log.info("kafka consumer msg: {}",user);
    }
  • Automatically submit the consumption offset at regular intervals
# 开启自动提交消费offset(这个配置实际上专指按照周期自动提交)
spring.kafka.consumer.enable-auto-commit: true
# 进行自动提交操作的时间间隔
spring.kafka.consumer.auto-commit-interval: 10s
  • Every time a message data is processed, the consumer offset is automatically submitted. This method is the most reliable way to avoid repeated consumption of data, but it is also the method with the lowest execution efficiency.
# 禁用按周期自动提交消费者offset
spring.kafka.consumer.enable-auto-commit: false
# offset提交模式为record
spring.kafka.listener.ack-mode: record

Note: ack-mode has the following configuration modes

  • RECORD: commit the offset after each record is processed
  • BATCH (default): commit the offset after the whole batch of records returned by poll() has been processed
  • TIME: for a batch returned by poll(), commit the offset once the processing time exceeds spring.kafka.listener.ack-time
  • COUNT: commit the offset once the number of records processed from poll() reaches spring.kafka.listener.ack-count
  • COUNT_TIME: commit the offset as soon as either the TIME or the COUNT condition is met
  • MANUAL: manual commit; Acknowledgment.acknowledge() must be called, but the actual commit only happens after the current batch has been processed
  • MANUAL_IMMEDIATE: the offset is committed immediately when Acknowledgment.acknowledge() is called, even if only one or a few records of the batch have been processed so far
  • Submit the consumption offset manually
# 禁用自动提交消费offset
spring.kafka.consumer.enable-auto-commit: false
# offset提交模式为manual_immediate
spring.kafka.listener.ack-mode: manual_immediate  或  manual
    @KafkaListener(topics = TEST_KAFKA_TOPIC, groupId = TEST_CONSUMER_GROUP)
    public void dealUser(User user, Acknowledgment ack) {
    
    
        log.info("kafka consumer msg: {}", user);
        ack.acknowledge();
    }

Notice:

  • With ack-mode=manual, the dealUser method above receives one record at a time; after processing it and calling ack.acknowledge(), the consumer offset is not committed immediately. The call only marks the record as "ready to commit"; the real commit happens once the whole batch has been processed.
  • With ack-mode=manual_immediate, the consumer offset is committed immediately after each record is processed and acknowledged.

The listener message parameter is a collection

For the listener method to take a List parameter, spring.kafka.listener.type must be set to batch (it is not the default):

    @KafkaListener(topics = TEST_KAFKA_TOPIC, groupId = TEST_CONSUMER_GROUP)
    public void dealUser(List<User> user) {
    
    

    }

Notice:

  • This mode can be used when batch processing of message data is required, such as directly storing List collection data into the database.
  • Automatically submit consumption offsets by batch
# listener类型为批量batch类型(默认为single单条消费模式)
spring.kafka.listener.type: batch
# offset提交模式为batch(不可使用record - 启动报错)
spring.kafka.listener.ack-mode: batch
# 禁用自动按周期提交消费者offset
spring.kafka.consumer.enable-auto-commit: false

Notice:

  • One batch corresponds to one invocation of the listener method; after the listener has processed the current batch, the consumption offset of that batch is committed automatically
  • This is the most efficient mode, but if an exception occurs while processing, the offset is not committed; the next poll of this partition fetches the same records again (up to max.poll.records, 500 by default), so messages are easily consumed more than once
  • Manually submit offsets by batch
# listener类型为批量batch类型(默认为single单条消费模式)
spring.kafka.listener.type: batch
# offset提交模式为manual(不可使用record - 启动报错)
spring.kafka.listener.ack-mode: manual_immediate  或  manual
# 禁用自动按周期提交消费者offset
spring.kafka.consumer.enable-auto-commit: false
    @KafkaListener(topics = TEST_KAFKA_TOPIC, groupId = TEST_CONSUMER_GROUP)
    public void dealUser(List<User> user,Acknowledgment ack) {
    
    
        user.forEach(System.out::println); 
        ack.acknowledge();
    }

Consumption exception handling

Serialization exception handling

Poison pill message (one of the application scenarios)

  • A poison pill message is a special kind of message that is usually used to tell a consumer to stop consuming and dequeue. Such messages usually have special identifiers so that consumers can easily identify them and act accordingly. Once a consumer receives a poison pill message, it should immediately stop consuming and dequeue.

  • The reason for using poison pill messages is usually because in some cases, the consumer may not be able to process the messages in the queue normally, for example due to errors or exceptions. In this case, a poison pill message can be used to tell the consumer to stop consuming and dequeue to avoid further errors or problems.

  • If you are using message queues, then I suggest you consider the use of poison pill messages in your design. Make sure your consumers recognize and properly handle poison pill messages, and are able to stop consuming and dequeue if necessary. In addition, you should also consider how to handle messages after the poison pill message so that your application can continue to work normally.

  • Under what circumstances may cause the poison pill (Poison Pill) problem?

    • The data structure corresponding to topic A has always been User object (JSON serialization). One day, due to a program modification error, several string messages were accidentally sent to this topic
    • These string messages cannot be deserialized, and the Poison Pill phenomenon occurs. Consumers will be stuck in the infinite loop of "deserialization failure-retry-deserialization failure" and cannot process subsequent messages.
  • How to deal with the poison pill problem?

    • Use ErrorHandlingDeserializer to handle deserialization failure, and configure ErrorHandlingDeserializer deserializer in application.yaml.
spring:
  kafka:
    consumer:
      auto-offset-reset: earliest
      key-deserializer: org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
      properties:
        spring.json.trusted.packages: '*'
        spring.deserializer.key.delegate.class: org.apache.kafka.common.serialization.StringDeserializer
        spring.deserializer.value.delegate.class: org.springframework.kafka.support.serializer.JsonDeserializer

Configure the consumer's key-deserializer and value-deserializer as org.springframework.kafka.support.serializer.ErrorHandlingDeserializer,
and specify the actual key and value deserializers as its delegates:

  • spring.deserializer.key.delegate.class: org.apache.kafka.common.serialization.StringDeserializer
  • spring.deserializer.value.delegate.class: org.springframework.kafka.support.serializer.JsonDeserializer

Notice:

  • auto-offset-reset is a Kafka consumer configuration property, which is used to specify when the consumer resets the consumption offset (offset). When a consumer subscribes to a Kafka topic, it needs to know from which offset to start consuming messages. If the consumer has already consumed some messages, it needs to know from which offset it should start consuming next time.

  • The auto-offset-reset attribute is used to specify what should happen when the consumer does not store any offset or the stored offset is invalid. It has three optional values:

    1. earliest: Start consuming from the earliest available offset. This means that the consumer will start consuming from the topic's earliest message, regardless of whether the consumer has consumed some messages before.

    2. latest: start consumption from the latest available offset. This means that the consumer will start consuming from the topic's latest message, regardless of whether the consumer has consumed some messages before.

    3. none: throws an exception if the consumer does not store any offsets. This means that the consumer must specify an offset at startup, otherwise it will not be able to consume messages.

  • The function of the auto-offset-reset attribute is to ensure that the consumer can always consume the message in the topic, even if it has not consumed before or the stored offset is invalid

  • By default, the value of auto-offset-reset is latest, which means that the consumer will start consuming from the latest available message. If you want to start consuming from the earliest available message, you can set auto-offset-reset to earliest. This option is very important because it ensures that consumers do not miss any messages, thereby guaranteeing the integrity and accuracy of the data.

When the Key or Value fails to be deserialized, the deserializer configured by the delegate agent is used for deserialization.

If the deserialization fails, ErrorHandlingDeserializer can ensure that the Poison Pill message is processed and logged, and the Consumer offset can move forward so that the Consumer can continue to process subsequent messages.

The deserialization source code of ErrorHandlingDeserializer is as follows:

    @Override
    public T deserialize(String topic, byte[] data) {
        try {
            return this.delegate.deserialize(topic, data);
        }
        catch (Exception e) {
            return recoverFromSupplier(topic, null, data, e);
        }
    }

    private T recoverFromSupplier(String topic, Headers headers, byte[] data, Exception exception) {
        // if a failed-deserialization function has been configured it is called back here,
        // otherwise null is returned
        if (this.failedDeserializationFunction != null) {
            FailedDeserializationInfo failedDeserializationInfo =
                    new FailedDeserializationInfo(topic, headers, data, this.isForKey, exception);
            return this.failedDeserializationFunction.apply(failedDeserializationInfo);
        }
        else {
            return null;
        }
    }
  • The callback for key or value deserialization failures is configured with the following properties; the value is the fully qualified class name of the callback:
	/**
	 * Supplier for a T when deserialization fails.
	 */
	public static final String KEY_FUNCTION = "spring.deserializer.key.function";

	/**
	 * Supplier for a T when deserialization fails.
	 */
	public static final String VALUE_FUNCTION = "spring.deserializer.value.function";
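
Such a callback implements java.util.function.Function<FailedDeserializationInfo, T>. A hedged sketch that substitutes a recognizable placeholder User for an unreadable payload (class and package names are illustrative):

public class FailedUserProvider implements Function<FailedDeserializationInfo, User> {

    @Override
    public User apply(FailedDeserializationInfo info) {
        // log the poison pill and return a placeholder instead of null
        System.err.println("value deserialization failed on topic " + info.getTopic());
        User fallback = new User();
        fallback.setName("UNREADABLE_MESSAGE");
        return fallback;
    }
}

It would then be registered as a consumer property, for example:

spring.kafka.consumer.properties.spring.deserializer.value.function: springboot.consumer.FailedUserProvider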

Consumption exception handling

In addition to exceptions in the deserialization process, there may also be exceptions in the process of processing data in our consumer program, and there is also a global exception handling mechanism that can be used. Implement the KafkaListenerErrorHandler interface to handle exceptions that occur in the listener.

@Component
public class MyErrorHandler implements KafkaListenerErrorHandler {
    
    
    @Override
    public Object handleError(Message<?> message, ListenerExecutionFailedException exception) {
    
    
        return null;
    }

    @Override
    public Object handleError(Message<?> message, ListenerExecutionFailedException exception, Consumer<?, ?> consumer) {
    
    
        return null;
    }
}

The configuration looks like this:

@KafkaListener(topics = TEST_KAFKA_TOPIC, errorHandler = "myErrorHandler")
public void userDeal(@Payload ConsumerRecord consumerRecord) {
    // let every exception propagate out of the method instead of handling it here;
    // myErrorHandler handles them centrally
}
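
Besides a listener-level errorHandler, Spring Kafka also offers container-level error handling with retries and a dead-letter topic. A hedged sketch using DefaultErrorHandler (available since Spring Kafka 2.8; older versions use SeekToCurrentErrorHandler instead), with illustrative back-off values:

@Bean
public DefaultErrorHandler kafkaErrorHandler(KafkaTemplate<Object, Object> template) {
    // after 2 retries with a 1-second pause, publish the failed record to <topic>.DLT
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
    return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
}

The handler is then attached to the listener container factory with factory.setCommonErrorHandler(...).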

Additional notes

ObjectMapper date serialization problem

A plain ObjectMapper serializes Date values as long timestamps by default, whereas the ObjectMapper that Spring injects has already been configured to serialize dates as strings.

Notice:

  • The reason why ObjectMapper serializes the date type to Long timestamp by default is to ensure the consistency and reliability of data when it is transmitted between different systems. The long integer timestamp is a general time representation, which can be interpreted and converted between different programming languages ​​and operating systems, thereby avoiding the problem of inconsistent date formats.

  • Additionally, long timestamps are more precise and readable because they can be converted directly to dates and times without further parsing and processing. This is useful for data analysis and processing, as it makes it easier for developers to perform operations and calculations on dates and times.

  • If you would like to serialize date types into other formats, such as ISO 8601 date formats or custom formats, you can use ObjectMapper's date formatters to do so. This will allow you to customize the date format as needed and ensure consistent data transfer and parsing between different systems.

@SpringBootTest(classes = KafkaSpringBootDemo.class)
class JsonSerializerTest {
    
    
    @Resource
    private ObjectMapper objectMapper;

    private ObjectMapper objectMapperNew=new ObjectMapper();

    @Test
    void dateSerializerTest() throws JsonProcessingException {
    
    
        System.out.println("spring注入的ObjectMapper序列化结果: "+objectMapper.writeValueAsString(new Date()));
        System.out.println("手动new的ObjectMapper序列化结果: "+objectMapperNew.writeValueAsString(new Date()));
    }
}

insert image description here

When deserializing, that long number is not automatically recognized as a Date but as a Long, which makes deserialization fail. The ObjectMapper that Spring Kafka's JSON serializer uses by default is also created with new rather than taken from the Spring context, so Date values end up serialized as long timestamps. To avoid this, the serializer and deserializer can be built around Spring's ObjectMapper as follows:

@Configuration
public class ConsumerKafkaConfig {

    @Resource
    private ObjectMapper objectMapper;

    // consumer factory with a JSON value deserializer built on Spring's ObjectMapper
    @Bean
    public DefaultKafkaConsumerFactory<?, ?> cf(KafkaProperties properties) {
        Map<String, Object> props = properties.buildConsumerProperties();
        return new DefaultKafkaConsumerFactory<>(props,
                new StringDeserializer(),               // key is deserialized as String
                new JsonDeserializer<>(objectMapper));  // value is deserialized as JSON
    }

    // producer factory with a JSON value serializer built on Spring's ObjectMapper
    @Bean
    public DefaultKafkaProducerFactory<?, ?> pf(KafkaProperties properties) {
        Map<String, Object> props = properties.buildProducerProperties();
        return new DefaultKafkaProducerFactory<>(props,
                new StringSerializer(),               // key is serialized as String
                new JsonSerializer<>(objectMapper));  // value is serialized as JSON
    }

}

With these beans in place, the following application.yml settings should be removed; they would not take effect anyway, because the serializers are now supplied programmatically:

      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer

We can also use the configure() method in ObjectMapper to modify its configuration so that the date type is serialized as a string. The specific code is as follows:

    @Test
    void configureTest() throws JsonProcessingException {
    
    
        objectMapperNew.configure(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS, false);
        objectMapperNew.setDateFormat(new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ"));
        System.out.println("手动new的ObjectMapper序列化结果: "+objectMapperNew.writeValueAsString(new Date()));
    }

insert image description here

This disables serialization of dates as timestamps, and formats dates as ISO 8601-formatted strings. You can change the format string for the date format as needed.
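
For java.time types such as LocalDateTime, Jackson additionally needs the JavaTimeModule from the jackson-datatype-jsr310 module (included transitively by Spring Boot's Jackson starter); without it, serialization of those types fails outright. A small sketch:

ObjectMapper mapper = new ObjectMapper();
// add support for java.time types such as LocalDateTime
mapper.registerModule(new JavaTimeModule());
// write dates as ISO-8601 strings instead of numeric timestamps
mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);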

