Kafka usage details, best practices and troubleshooting

Kafka is a widely used distributed message middleware. Compared with RabbitMQ, it is characterized by nearly unlimited horizontal scaling while maintaining high reliability, high throughput and low latency. Accordingly, it has a higher market share than RabbitMQ (according to figures found online, Kafka is about 41%, RabbitMQ about 29%).

1. Common Kafka concepts

For everyday development, it is usually enough to understand the first 6 concepts; the remaining ones are mostly relevant to Kafka operations configuration or troubleshooting.

1. Producer

Refers to an external application that produces messages and delivers them to Kafka; it is not part of Kafka itself.

2. Consumer

Refers to an external application that connects to Kafka, receives/subscribes to messages, and performs subsequent processing; it is also not part of Kafka.
A consumer can consume multiple Kafka queues (topics) at the same time.

3. Consumer Group

Every consumer connecting to Kafka must specify a consumer group. Multiple consumers can specify the same consumer group, which prevents the same message from being consumed repeatedly within the group.
If two consumers subscribe to the same queue (topic) but specify different consumer groups, each message is delivered to both consumers.
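To make this concrete, here is a minimal spring-kafka sketch (the topic and group names are hypothetical): two listeners on the same topic with different groups each receive every message, while listeners sharing a group would split the messages between them.

@Component
public class GroupDemo {
    // group-a and group-b each independently receive every message of the topic
    @KafkaListener(topics = "beinetTest111", groupId = "group-a")
    public void handleA(String message) { /* process */ }

    @KafkaListener(topics = "beinetTest111", groupId = "group-b")
    public void handleB(String message) { /* process */ }
}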

4. Topic

The logical collection for sending and receiving messages in Kafka; each topic can be regarded as a queue.
Producers and consumers exchange messages by connecting to topics.

5. Partition

The physical collection in which Kafka stores messages. A topic can be divided into one or more partitions, which can be understood as sub-queues.
Each partition belongs to exactly one topic and can be consumed by only one consumer within a given group.
Every message received by a topic is delivered to a partition selected according to the message's key;
if the message has no key and no partitioning rule is defined, Kafka distributes messages evenly across the topic's partitions.
Note: within each partition, messages follow queue semantics and are strictly first-in first-out; ordering across different partitions cannot be guaranteed.
Therefore, if you want to ensure that consumers can consume in the order in which messages are delivered:

  • Create only one partition for the topic (this is not recommended), or
  • Give the batch of messages that must stay ordered the same key, e.g. use the user ID as the message key; messages with the same key are delivered to the same partition (see the sketch below)
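As a minimal sketch of the second option, assuming an injected KafkaTemplate and hypothetical userId/orderEvent variables:

// all messages of one user share a key, so they land in the same partition
// and keep their relative order
String key = String.valueOf(userId);
kafkaTemplate.send("beinetTest111", key, orderEvent);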

6. Offset

The unique number of a message within a partition, incrementing from 0. Topic + partition + offset uniquely locates a message.
Note: offsets are unique within a partition, but different partitions reuse the same offset values.
Consumers also record the offset of each message they consume, to track which message they are currently processing and to resume after disconnecting and reconnecting. Consumer offsets are themselves stored in Kafka.

7. Broker (cluster node)

A node in the Kafka cluster, usually one Kafka instance running on a server.

8. Replica

Each partition of a topic can have multiple replicas, and every replica stores exactly the same message data.
It is generally recommended that replicas of the same partition be stored on different brokers, so a broker failure does not lose the partition's data.

9. Leader/Follower

For each partition of a topic with multiple replicas, one replica serves as the leader and provides read/write service, while the rest act as followers and only synchronize data.
If the leader fails, one of the followers is elected as the new leader and service resumes.

10. ISR (in-sync replicas)

The set of in-sync replicas of a partition. Each partition maintains an ISR list: the followers currently in sync with the leader.
If a follower falls behind or cannot stay in sync, it is removed from the ISR list.
Only followers in the ISR list are eligible to be promoted to leader.
Note: the replica currently serving as leader is also part of the ISR.

11. LEO (log end offset)

LEO stands for Log End Offset: the offset that the next message written to the partition will receive. It does not point to an existing message.
Each replica of a partition has its own LEO.

12. HW (high watermark)

HW stands for High Watermark Offset: the highest message offset in the partition that has been committed and replicated to all in-sync replicas.
When the leader has received a message but has not yet replicated it, it does not update the HW.
The leader compares its own LEO with the LEOs of all followers and uses the smallest value as the new HW. For example, if the leader's LEO is 10 and its two followers' LEOs are 8 and 9, the HW becomes 8.

13. LAG (number of lagging messages)

For a consumer group, each partition of the topics it consumes has a LAG value: the difference between the total number of messages in the partition and the number the group has consumed.
It is usually computed as the partition's HW minus the consumer group's committed offset.
In practice, operations staff should monitor LAG and, for example, raise an alarm and intervene when it exceeds 10,000.

After understanding these concepts, here is a diagram of Kafka's working principle, found online:
(diagram: Kafka working principle)

2. Comparison between kafka and RabbitMQ

Compared with RabbitMQ, Kafka has the following characteristics:

  • Kafka's messages are not deleted immediately after consumption; they are removed only after a retention period, 7 days by default. RabbitMQ deletes messages once they are consumed.
    I like the fact that Kafka keeps messages, especially when data needs to be recovered for troubleshooting.
  • Kafka consumers support only pull mode, not push: consumers must actively poll Kafka for messages. By default they poll every 500 ms and fetch at most 500 records per poll. Polling is flexible, but it wastes time and resources when there are no messages.
    By default RabbitMQ supports push mode, actively pushing messages to consumers, which gives better real-time behavior.
  • Kafka connects producers and consumers through topics. Multiple consumers subscribed to the same topic can consume all of its messages; unneeded messages can only be identified and discarded by the consumers themselves.
    RabbitMQ receives messages through an Exchange and forwards them to specific Queues according to configured rules for consumers to consume; see my earlier article: https://youbl.blog.csdn.net/article/details/80401945
    With RabbitMQ you can configure many routes to avoid delivering messages to consumers that do not need them,
    but RabbitMQ also supports receiving and delivering messages directly through a Queue.
  • In terms of performance, RabbitMQ uses a single-threaded model and hits bottlenecks at large data volumes, while Kafka can scale out almost without limit.
  • Ordering: for each partition of a topic, Kafka guarantees message order because there is exactly one consumer (within a group); order across partitions is not guaranteed.
    RabbitMQ distributes messages evenly across multiple consumers, so order cannot be guaranteed, and redelivery after a failed consumption also breaks message order.

3. Kafka Best Practices

1. Producer configuration

  • The producer has an acks setting, described as follows:

    • 0: the producer returns success immediately after sending, without waiting for the broker's response; highest performance, highest probability of data loss;
    • 1: success is returned once the leader node has received the message; but if the leader crashes before synchronizing to the other replicas, data is lost;
    • all or -1: success is returned only after all in-sync replicas have synchronized, ensuring no data loss, but with the lowest performance.
      In production it is recommended to configure -1; the other two settings can lose data.
  • min.insync.replicas: the minimum required number of in-sync replicas, default 1; 2 is recommended (which in turn requires each topic to have at least 3 replicas),
    because with 1, the leader can fail after receiving data but before replicating it, and the data is lost.

  • retries: the retry count; set it high, the default is Integer.MAX_VALUE, to ensure the send eventually succeeds.
    Note: although the default retry count is large, retries are also bounded by a time setting, delivery.timeout.ms (2 minutes by default); once it expires the send is aborted even if retries remain.
    Also, if you set a large retries, send messages asynchronously to avoid thread blockage from synchronous calls hurting user experience or other business flows (see the async sending sketch after the configuration reference below).

  • Configuration reference:

spring:
  kafka:
    producer:
      bootstrap-servers: 10.1.1.1:9092
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      retries: 10000
    properties:
      delivery.timeout.ms: 2000 # max time to report success or failure for a send; default 120000 (2 minutes)
      linger.ms: 0              # max delay for the producer to batch records into a single request; default 0
      # see https://cwiki.apache.org/confluence/display/KAFKA/KIP-19+-+Add+a+request+timeout+to+NetworkClient
      request.timeout.ms: 1000  # wait time from batch ready to response, including network + server replication time
      batch.size: 1000
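Building on the retries advice above, here is a sketch of asynchronous sending with a lightweight callback (key and objData are placeholder variables, log is assumed to be a Lombok @Slf4j field). It assumes spring-kafka 3.x, where send() returns a CompletableFuture; older versions return a ListenableFuture and use addCallback instead. Keep the callback body small, since work done there eats into the delivery.timeout.ms budget:

kafkaTemplate.send("beinetTest111", key, objData)
        .whenComplete((result, ex) -> {
            if (ex != null) {
                // just log here; hand any heavy follow-up work to another thread
                log.error("send failed, key={}", key, ex);
            } else {
                log.debug("send ok, offset={}", result.getRecordMetadata().offset());
            }
        });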

2. Consumer configuration

  • To avoid message loss, consumers should enable manual ack and commit the offset only after the message's business logic has finished
  • As described in the infinite-loop problem later, it is recommended to use the String deserializer
  • Configure an appropriate batch pull size max-poll-records for your business; the default is 500
  • Configure an appropriate auto-offset-reset value for your business; the default is latest
    • latest: when the consumer has no previous consumption record for a partition (no committed offset), it pulls only new messages and ignores historical ones.
    • earliest: the opposite of latest; with no previous consumption record, processing starts from the earliest message.
    • none: an exception is thrown when there is no previous consumption record.
  • Configuration reference:
spring:
  kafka:
    consumer:
      bootstrap-servers: 10.1.1.1:9092
      max-poll-records: 100
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      auto-offset-reset: latest
    listener:
      type: batch
      ack-mode: manual_immediate

3. Others

  • Configure multiple bootstrap server URLs, so a single node failure does not break connectivity
  • If messages are sent asynchronously, do not put too much business logic in KafkaTemplate's success and failure callbacks. Callbacks run on a single thread, and logic there eats into the delivery.timeout.ms budget, which can make subsequent sends time out.
  • Likewise, consumers are single-threaded. If the consumption logic is too heavy it can exceed session.timeout.ms; the consumer is then considered offline, causing problems.

4. Introduction to Kafka tools

1. Graphical tools

Recommended: Offset Explorer, download address: https://www.kafkatool.com/download.html

2. Command line tools

The Kafka distribution ships with many built-in script tools that make it easy to query Kafka's status; they only need to be downloaded, no installation required.

  • Download address: https://kafka.apache.org/downloads
    After downloading and decompressing, the bin directory contains many .sh files for use on Linux;
    under Windows, use the .bat files in bin\windows\.
    The examples below use the Windows .bat commands (on Linux, run the corresponding .sh files)
  • For usage instructions, please refer to the official documentation: https://kafka.apache.org/documentation/

Query which consumers belong to a given consumer group, and those consumers' consumption status on each topic:

d:\kafka_2.13-3.4.0\bin\windows\kafka-consumer-groups.bat --describe --group=cb_consumers --bootstrap-server=10.0.0.1:9092
Field descriptions for the command output:

  • GROUP consumer group
  • TOPIC The topic of consumption
  • PARTITION consumption partition
  • CURRENT-OFFSET The message offset currently consumed
  • LOG-END-OFFSET Maximum message offset for the current partition
  • LAG lag message count
  • CONSUMER-ID consumer ID
  • HOST The host where the consumer resides
  • CLIENT-ID Client ID
    Note: LAG can be loosely understood as LOG-END-OFFSET minus CURRENT-OFFSET, but strictly LAG = HW minus CURRENT-OFFSET

5. Using Kafka in a Spring Boot project

1. Producer Demo code:

1.1. Add pom dependencies:

<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>

1.2. Add application.yml configuration:

spring:
  kafka:
    producer:
      bootstrap-servers: 10.1.1.1:9092
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
      retries: 2  # number of retries on failure

1.3. Java sending code:

private final KafkaTemplate<String, Object> kafkaTemplate; // injected bean

// send a message synchronously: get() blocks until the broker acknowledges
String topic = "beinetTest111";
Object result = kafkaTemplate.send(topic, "我是key", objData).get();
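Note that get() with no arguments blocks indefinitely; a bounded wait is usually safer. A small variation (the 10-second timeout is an arbitrary example):

// wait at most 10 seconds for the broker acknowledgement, then throw TimeoutException
Object result = kafkaTemplate.send(topic, "我是key", objData).get(10, TimeUnit.SECONDS);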

2. Consumer Demo code:

2.1. Add pom dependencies:

<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>

2.2. Add application.yml configuration:

spring:
  kafka:
    consumer:
      bootstrap-servers: 10.1.1.1:9092
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer 
    listener:
      type: batch
      ack-mode: manual_immediate

2.3. Java consuming code:

@KafkaListener(topics = "${kafka-topic.reports}")
public void consumerCreateTask(List<ConsumerRecord<String, Object>> consumerRecordList, Acknowledgment ack) {
    if (consumerRecordList == null || consumerRecordList.isEmpty())
        return;

    long start = System.nanoTime();
    // log the last record of the batch, i.e. the highest offset processed
    ConsumerRecord<String, Object> lastRecord = consumerRecordList.get(consumerRecordList.size() - 1);
    try {
        // convert to DTOs and run the business logic

        long elapsedTime = System.nanoTime() - start;
        log.debug("Topic:{} partition:{} offset:{} count:{} elapsed:{}ns",
                lastRecord.topic(),
                lastRecord.partition(),
                lastRecord.offset(),
                consumerRecordList.size(),
                elapsedTime);
    } catch (Exception exp) {
        long elapsedTime = System.nanoTime() - start;
        log.error("Topic:{} partition:{} offset:{} elapsed:{}ns error:",
                lastRecord.topic(),
                lastRecord.partition(),
                lastRecord.offset(),
                elapsedTime,
                exp);
    } finally {
        // commit whether processing succeeded or failed, to avoid an infinite loop on errors;
        // to avoid losing messages, back them up in the catch block
        ack.acknowledge();
    }
}

6. Common Kafka problems

1. There are multiple consumers, but one consumer never receives any messages

Within a consumer group, a topic with N partitions can keep at most N consumers busy.
For example, if a topic has 2 partitions, each partition can be assigned to only one consumer in the group, so at most 2 consumers are active; with 3 consumers in the group, one of them will inevitably sit idle with nothing to do.
Conversely, if the topic has 2 partitions but the group has only one consumer, the messages of both partitions are delivered to that single consumer.
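If more active consumers are needed, one option is to increase the topic's partition count with the bundled script (illustrative topic name and values; Kafka only allows increasing the partition count, never decreasing it):

d:\kafka_2.13-3.4.0\bin\windows\kafka-topics.bat --alter --topic beinetTest111 --partitions 4 --bootstrap-server 10.0.0.1:9092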

2. What is the partition allocation strategy of the topic?

When a topic has multiple partitions and there are multiple consumers, Kafka's implementation provides the following partition assignment strategies:

  • Range strategy (the default):
    For each topic the group consumes, all of that topic's partitions are assigned to the consumers in order. Each topic is processed separately, so imbalance can occur.
    For example: topic a has 3 partitions a0/a1/a2, topic b has 3 partitions b0/b1/b2, and there are 2 consumers C0/C1. The assignment goes roughly:
    step 1, assign topic a: a0->C0, a1->C1, a2->C0
    step 2, assign topic b: b0->C0, b1->C1, b2->C0
    As you can see, consumer C0 has to handle data from 4 partitions while C1 only handles 2, a clear imbalance problem.
  • Round-Robin strategy:
    All partitions are sorted and then assigned to all consumers one by one in round-robin fashion.
    For example: topic a has 3 partitions a0/a1/a2, topic b has 3 partitions b0/b1/b2, and there are 2 consumers C0/C1. The assignment goes roughly:
    step 1, assign topic a: a0->C0, a1->C1, a2->C0
    step 2, assign topic b: b0->C1, b1->C0, b2->C1
    Note: step 2 does not start over from the first consumer but continues where step 1 left off, which removes the Range strategy's imbalance.
    The final result is that each of the 2 consumers is responsible for 3 partitions.
    However, if two consumers' subscribed topics only partially overlap, imbalance can still occur.
    There is currently no configuration property to switch to this strategy; you must set partition.assignment.strategy in code, for example:
@Configuration
@RequiredArgsConstructor
public class KafkaConfiguration {
    private final KafkaProperties kafkaProperties;
    private final ConcurrentKafkaListenerContainerFactory<String, Object> kafkaFactory;

    @Bean("myKafkaFactory")
    public KafkaListenerContainerFactory<ConcurrentMessageListenerContainer<String, Object>> batchFactory() {
        Map<String, Object> props = kafkaProperties.buildConsumerProperties();
        // switch the assignment strategy from the default Range to Round-Robin
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, "org.apache.kafka.clients.consumer.RoundRobinAssignor");
        kafkaFactory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        return kafkaFactory;
    }
}

Then specify this factory bean on the consumer code:

@KafkaListener(id = "beinetHandler1", groupId = "beinetGroup", topicPattern = "beinetTest.*",
        containerFactory = "myKafkaFactory")
public void msgHandler(List<ConsumerRecord> message, Acknowledgment ack) {
    // ... business logic ...
}
  • Kafka's source code contains two other assignor implementations, which this article will not cover in depth:
    org.apache.kafka.clients.consumer.CooperativeStickyAssignor
    org.apache.kafka.clients.consumer.StickyAssignor

3. When consumers join or quit, can messages still be consumed normally?

Conclusion: as long as any consumers survive, all messages can still be consumed normally.
When a new consumer joins the group, or a consumer in the group goes offline/exits, a consumer rebalance is triggered: partitions are reassigned across all consumers.
While a rebalance is in progress, by default all consumers stop working until the assignment completes.

4. A colleague said messages had been written to Kafka, but no data arrived on the consumer side

  • First, confirm that Kafka does have the messages: use the Offset Explorer tool above to search for data in the corresponding topic; data was indeed there
  • In the tool, check the corresponding Group under Consumers: Lag is 0, indicating the messages were consumed normally
  • Check the consumer application's logs: no consumption logs were produced
  • Keep reading the consumer's logs and find this entry: cb_consumers: partitions assigned: []
    This means the consumer was assigned no partitions and is doing no work.
    Preliminary judgment: someone started the same consumer elsewhere and consumed the data.
  • The Offset Explorer tool cannot display consumer IP information; you can only use the kafka-consumer-groups.bat command above to see the consumer IPs,
    then ask operations whose machine each IP is.
  • Finally it was determined that the test environment was misconfigured and had been consuming the development environment's data.

5. Deserialization failure putting the consumer into an infinite loop

One day, after a release to the test environment, the program kept throwing the following exception from the moment it started, continuing for tens of minutes without interruption:

<#6d8d6458> j.l.IllegalStateException: No type information in headers and no default type provided
    at o.s.util.Assert.state(Assert.java:76)
    at o.s.k.s.s.JsonDeserializer.deserialize(JsonDeserializer.java:535)
    at o.a.k.c.c.i.Fetcher.parseRecord(Fetcher.java:1387)
    at o.a.k.c.c.i.Fetcher.access$3400(Fetcher.java:133)
    at o.a.k.c.c.i.Fetcher$CompletedFetch.fetchRecords(Fetcher.java:1618)
    at o.a.k.c.c.i.Fetcher$CompletedFetch.access$1700(Fetcher.java:1454)
    at o.a.k.c.c.i.Fetcher.fetchRecords(Fetcher.java:687)
    at o.a.k.c.c.i.Fetcher.fetchedRecords(Fetcher.java:638)
    at o.a.k.c.c.KafkaConsumer.pollForFetches(KafkaConsumer.java:1272)
    at o.a.k.c.c.KafkaConsumer.poll(KafkaConsumer.java:1233)
    at o.a.k.c.c.KafkaConsumer.poll(KafkaConsumer.java:1206)
    at j.i.r.GeneratedMethodAccessor109.invoke(Unknown Source)
    at j.i.r.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at j.l.reflect.Method.invoke(Unknown Source)
    at o.s.a.s.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
    at o.s.a.f.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:208)
    at c.s.proxy.$Proxy186.poll(Unknown Source)
    at o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer.doPoll(KafkaMessageListenerContainer.java:1413)
    at o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1250)
    at o.s.k.l.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1162)
    at j.u.c.Executors$RunnableAdapter.call(Unknown Source)
    at j.u.c.FutureTask.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

A Google search indicated that deserialization could not find the type information, and recommended against using JSON deserialization here.
The configuration change log indeed showed a recently added Kafka deserialization change:

spring: 
  kafka:
    producer:
      value-serializer: org.springframework.kafka.support.serializer.JsonSerializer
    consumer:
      value-deserializer: org.springframework.kafka.support.serializer.JsonDeserializer

Changing spring.kafka.consumer.value-deserializer to org.apache.kafka.common.serialization.StringDeserializer restored service.
It later turned out that a colleague wanted the consumer method parameter to receive the object directly instead of a String, hence the change.
Unfortunately, the failing consumer happens to consume messages produced by other projects, and those messages carry no type information.

Moreover, this exception is thrown deep inside Spring, before the business code runs, so a try/catch in the listener cannot catch it; combined with manual ack commits, the consumer entered an infinite loop.
To avoid this kind of problem, it is recommended to use StringDeserializer and do the deserialization yourself in code.
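A minimal sketch of that recommendation, assuming Jackson is on the classpath and a hypothetical OrderDto: consume the payload as a String and deserialize it yourself, so a malformed message becomes a catchable exception instead of a poison pill:

private static final ObjectMapper MAPPER = new ObjectMapper();

@KafkaListener(topics = "beinetTest111")
public void handle(String message, Acknowledgment ack) {
    try {
        OrderDto dto = MAPPER.readValue(message, OrderDto.class); // manual JSON deserialization
        // business logic on dto ...
    } catch (JsonProcessingException exp) {
        log.error("bad message, skipping: {}", message, exp); // log and move on, no infinite loop
    } finally {
        ack.acknowledge(); // commit either way
    }
}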

6. A single broker failure left consumers unable to commit offsets

In the production environment, 6 brokers were deployed for performance and failover. One day a broker failed and went offline. Failover should have been automatic; instead, all consumers started throwing the exception "error when storing group assignment during syncgroup",
and the fault did not clear until the broker was manually restored and brought back online.
Final findings:

  • Operations had configured the replica count of Kafka's internal topic __consumer_offsets to 2
  • At the same time min.insync.replicas=2 was configured, meaning the ISR list must contain at least 2 in-sync replicas
  • The failed broker happened to hold one replica of __consumer_offsets, leaving the topic with only one replica, which violates min.insync.replicas=2, so the topic stopped working
  • The __consumer_offsets topic receives and stores the consumption offsets of all consumer groups; when it stops working, consumers cannot commit offsets, so consumption breaks and data is consumed repeatedly

Knowing the cause, the fix was to raise the replica count of __consumer_offsets back to 3 (the default value; operations has corrected it)
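To verify the replica placement of this internal topic, the bundled script can describe it, in the same style as the earlier commands:

d:\kafka_2.13-3.4.0\bin\windows\kafka-topics.bat --describe --topic __consumer_offsets --bootstrap-server 10.0.0.1:9092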

7. None of the consumers consume any messages

If a consumer starts before its topic is created, the consumer may be unable to consume data; try restarting the consumer

8. Does Kafka in Spring have thread-safety issues?

The KafkaTemplate used by the producer is thread-safe; testing shows all messages are sent on the same thread.
Likewise, consumers are thread-safe: each consumer processes all the messages it receives on a single thread.

9. How to deal with a Kafka message backlog

  • If the messages are unimportant, you can simply delete the topic and recreate it; all messages under it disappear. Note that the topic must be recreated and the consumers restarted;
  • If all messages must be consumed, remember Kafka's partitioning constraint: each partition can have only one consumer (per group), so simply adding consumers does not solve the problem
    • Confirm whether the consumer is failing. Note that some developers swallow exceptions, which makes the consumer look healthy; check whether the business data keeps growing. If the consumer is failing, just fix the bug.
    • If consumption is normal, check whether a burst of messages has increased the volume. A simple test is whether the topic's LAG keeps dropping at a normal rate; observe for a few minutes.
    • If the backlog is not dropping normally, the basic conclusion is that consumption is too slow:
      • First use the tools to check the LAG of each partition under the topic. If one partition's LAG is especially high while the others are normal (not backlogged), message distribution is unbalanced and that partition is receiving too many messages; consider adjusting the producer's message key so the message counts across partitions stay balanced;
      • If message ordering is not required, process messages asynchronously through a thread pool in the code, and bound the pool size so the application does not OOM (see the sketch after this list);
      • Consider adding a few partitions and a few more consumers, so newly produced messages are spread across more partitions, reducing the pressure on the old consumers;
      • For the already-backlogged messages, consider adding a new consumer that forwards them to a temporary topic with several times more partitions and consumers for fast draining. Take care not to overwhelm downstream systems or the database and cause new problems.
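For the thread-pool option above, here is a sketch with a bounded pool (all sizes are illustrative; process() is a hypothetical business method). The bounded queue plus CallerRunsPolicy provides backpressure so a burst cannot exhaust memory; note that acknowledging before processing finishes trades away at-least-once delivery:

private final ExecutorService pool = new ThreadPoolExecutor(
        4, 8,                           // core / max threads
        60, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(1000), // bounded queue caps memory usage
        new ThreadPoolExecutor.CallerRunsPolicy()); // blocks the poller instead of dropping tasks

@KafkaListener(topics = "beinetTest111")
public void handle(List<ConsumerRecord<String, String>> records, Acknowledgment ack) {
    for (ConsumerRecord<String, String> record : records)
        pool.submit(() -> process(record));

    // commits before async processing finishes: in-flight messages are lost on a crash
    ack.acknowledge();
}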
