Troubleshooting in practice: online Kafka messages pile up while the consumers go offline

Kafka messages pile up online, and all consumers are offline. What's going on?

I recently handled an online incident: messages were accumulating on a Kafka topic, and all consumers of that topic had gone offline.

The troubleshooting process and the follow-up review were very interesting, and the failure gave me a deeper understanding of Kafka best practices.

Let's walk through this online failure together. The best-practice summary is at the end, so don't miss it.

1. Symptoms

  • Messages on an online Kafka topic suddenly started piling up
  • The consumer application reported that it was not receiving messages (no message-processing logs)
  • No consumers were registered in the topic's consumer group on the Kafka side
  • Neither the consumer application nor the Kafka cluster had any code or configuration changes in the past week

2. The investigation process

Neither the server nor the client had any notable exception logs, and production and consumption on other Kafka topics were normal, so we could basically conclude that the problem was on the consuming client side.

So we focused our troubleshooting on the client.

1) Using Arthas to switch the log level to debug at runtime

Since the client had no obvious exception logs, the only way to find clues was to change the application's log level to debug with Arthas.

Sure enough, this turned up something important:

2022-10-25 17:36:17,774 DEBUG [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] - [Consumer clientId=consumer-1, groupId=xxxx] Disabling heartbeat thread
 
2022-10-25 17:36:17,773 DEBUG [org.apache.kafka.clients.consumer.internals.AbstractCoordinator] - [Consumer clientId=consumer-1, groupId=xxxx] Sending LeaveGroup request to coordinator xxxxxx (id: 2147483644 rack: null)

In other words, the kafka-client itself sent a LeaveGroup request to the Kafka cluster, evicting itself from the consumer group. That is why all the consumers went offline.

2) Using Arthas to check the state of the relevant threads
We used the Arthas vmtool command to further inspect the state of the kafka-client threads.

We could see that the HeartbeatThread was in the WAITING state, and the coordinator state was UNJOINED.

At this point, combined with the source code, we could roughly infer that the client was evicting itself because consumption was taking too long.

So we immediately tried lowering max.poll.records to reduce the number of messages pulled in a single batch, and raising max.poll.interval.ms to avoid self-eviction caused by too long an interval between polls.
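For reference, a minimal sketch of that kind of adjustment, assuming a plain Java KafkaConsumer configured through Properties (the broker address, group id, and concrete values below are illustrative, not the ones from the incident):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TunedConsumerConfig {
    public static KafkaConsumer<String, String> buildConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "xxxx");                  // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Pull fewer records per poll so each batch finishes faster (default is 500).
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
        // Allow a longer gap between polls before the client evicts itself (default is 300000 ms).
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000);
        return new KafkaConsumer<>(props);
    }
}
```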

After these parameter changes went live, the consumers no longer went offline, but after consuming for a while they still stopped consuming.

3. The root cause

The developers responsible for the consumer reviewed the consumption logic, found an infinite loop in the business code, and confirmed the root cause.

A field in the message payload had taken on a new value, which sent the consumer's business logic into an infinite loop, so no subsequent messages could be consumed.
The blocked consumption caused that consumer to evict itself, the partitions were rebalanced, and the remaining consumers got stuck the same way and evicted themselves one by one.

At the core of this is the keep-alive mechanism between Kafka consumers and the Kafka cluster, which is worth a brief look.

 

The kafka-client runs an independent thread, HeartbeatThread, that sends periodic heartbeats to the Kafka cluster. This thread has nothing to do with the listener; it is completely separate.

Starting from the "Sending LeaveGroup request" line in the debug log, it is easy to locate the self-eviction logic.

 

Before sending a heartbeat, the HeartbeatThread compares the current time with the time of the last poll. Once the gap exceeds max.poll.interval.ms, it initiates self-eviction.
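To make the mechanism concrete, here is a heavily simplified, paraphrased sketch of that check. The real logic lives in the heartbeat thread of org.apache.kafka.clients.consumer.internals.AbstractCoordinator; the names and structure below are illustrative only:

```java
// Paraphrased sketch of the heartbeat thread's poll-timeout check (not the real source).
class HeartbeatSketch {
    long lastPollTimeMs;     // updated whenever the application thread calls poll()
    long maxPollIntervalMs;  // the max.poll.interval.ms setting (default 300000)

    void onHeartbeatTick(long nowMs) {
        if (nowMs - lastPollTimeMs > maxPollIntervalMs) {
            // poll() has not been called in time: the client assumes the consumer is stuck
            // and proactively leaves the group -- this produces the
            // "Sending LeaveGroup request to coordinator" / "Disabling heartbeat thread"
            // lines we saw in the debug log, followed by a rebalance.
            sendLeaveGroupRequest();
            disableHeartbeatThread();
        } else {
            sendHeartbeat(); // normal keep-alive to the group coordinator
        }
    }

    void sendLeaveGroupRequest()  { /* illustrative stub */ }
    void disableHeartbeatThread() { /* illustrative stub */ }
    void sendHeartbeat()          { /* illustrative stub */ }
}
```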

4. Further thoughts

Although we found the cause in the end, looking back, the investigation was not smooth. Two questions stand out:

  • Could kafka-client throw a clear exception when consuming a message times out, instead of just self-evicting and rebalancing?
  • Is there any way to detect an infinite loop in the consumption logic?

4.1 Could kafka-client throw a clear exception when message consumption times out?

4.1.1 Kafka does not seem to have such a mechanism

Setting a breakpoint in the consumption logic makes it easy to see the whole call chain.

 

On the consumer side, a thread pool is used to run each KafkaListener, and each listener is an independent thread.

That thread polls messages synchronously, and then a dynamic proxy calls back into the user-defined consumption logic, i.e. the business code we wrote in the @KafkaListener method.
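As a rough illustration, the listener thread's job boils down to a loop like the one below. This is a paraphrase, not spring-kafka's actual code (the real loop lives inside the listener container); the point is that poll() and the business logic share one thread:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Illustrative paraphrase of a listener thread: poll and business logic share one thread.
class ListenerLoopSketch implements Runnable {
    private final KafkaConsumer<String, String> consumer;

    ListenerLoopSketch(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The @KafkaListener method runs here, on the SAME thread as poll();
                // if it blocks or loops forever, poll() is never called again and the
                // heartbeat thread will eventually trigger self-eviction.
                handle(record);
            }
        }
    }

    private void handle(ConsumerRecord<String, String> record) {
        // user-defined consumption logic
    }
}
```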

 

So, two things can be learned from this.

First, if the business consumption logic is slow or stuck, polling will be affected.

The second point is that there is no parameter here to directly set a consumption timeout, and in fact that would not be easy to implement.

That is because a timeout interruption made here would also interrupt the poll, since they run on the same thread. So either the poll and the consumption logic would have to run on two separate worker threads, or a new polling thread would have to be started after the current one was interrupted.

So from the business side, a practical approach is to enforce the timeout yourself: a fairly general implementation is to hand the consumption logic to a thread pool inside the listener and use Future.get with a timeout to bound it.
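A minimal sketch of that idea, assuming spring-kafka; the topic name, group id, and 30-second timeout are illustrative placeholders, not values from the incident:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class TimeoutBoundedListener {

    // Single worker keeps per-partition ordering while isolating the business logic.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    @KafkaListener(topics = "demo-topic", groupId = "demo-group") // illustrative names
    public void onMessage(String message) throws Exception {
        Future<?> future = worker.submit(() -> handleBusinessLogic(message));
        try {
            // Bound the business logic so a stuck message cannot block poll() forever.
            future.get(30, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // try to interrupt the stuck task
            // log/alert here, or hand the record off to a dead letter topic
        }
    }

    private void handleBusinessLogic(String message) {
        // original consumption logic
    }
}
```

One caveat: future.cancel(true) only helps if the business code responds to interruption; a tight loop that never checks the interrupt flag will keep the worker thread occupied, so the pool size and the handling of such leftovers still matter.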

A quick search shows that Kafka 0.8 had a consumer.timeout.ms parameter, but current versions no longer have it, and I am not sure it had a similar effect anyway.

4.1.2 RocketMQ has a somewhat related mechanism

I then checked whether RocketMQ has a related implementation, and it does.

In RocketMQ, a consumeTimeout can be set on the consumer, and this timeout works much like the idea above.

The consumer starts an asynchronous thread pool that periodically runs cleanExpiredMsg() over the messages currently being consumed.
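For comparison, a minimal sketch of setting that timeout on a RocketMQ push consumer; the group, topic, and name-server address are placeholders, and consumeTimeout is specified in minutes:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;

public class RocketMqConsumeTimeoutSketch {
    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("demo_group"); // placeholder
        consumer.setNamesrvAddr("127.0.0.1:9876");                                // placeholder
        consumer.subscribe("demo_topic", "*");                                    // placeholder

        // Maximum time (in MINUTES) a message may block a consuming thread; messages that
        // exceed it are picked up by the periodic expired-message cleanup and sent back for retry.
        consumer.setConsumeTimeout(15);

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) ->
                ConsumeConcurrentlyStatus.CONSUME_SUCCESS);
        consumer.start();
    }
}
```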

 

Note that this mechanism does not apply if the messages are consumed in order (orderly consumption).

For concurrent consumption, a timeout check is performed; if a message has timed out, it is sent back to the broker via sendMessageBack() to be retried.

 

If a message is retried more than a certain number of times, it ends up in RocketMQ's dead-letter queue.

In fact, spring-kafka provides a similar kind of wrapping: you can configure a custom dead-letter topic and handle exceptions there.
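For example, with spring-kafka 2.8+ one possible setup looks like the following; it assumes Spring Boot wires the error-handler bean into the listener container factory (otherwise set it on the factory yourself). Records that still fail after the retries are published to "&lt;original topic&gt;.DLT" by default:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaDeadLetterConfig {

    // Retry a failed record twice with a 1 s back-off, then publish it to the
    // dead letter topic (by default "<original topic>.DLT") for later inspection.
    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
    }
}
```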

4.2 Is there any way to quickly detect an infinite loop?

Generally speaking, a thread stuck in an infinite loop causes symptoms such as CPU spikes or OOMs. In this incident there was no such abnormal behaviour, so we did not initially connect the problem to an infinite loop.

After this failure, I have a deeper understanding of the Kafka mechanisms involved: a poll-interval timeout is very likely caused by blocked consumption, or even an infinite loop.

So if a similar problem occurs again, where the consumer stops consuming but the KafkaListener thread is still alive, you can use Arthas's thread <id> command to inspect that thread's call stack and check whether some method is looping abnormally.

5. Best practices

From this failure we can also summarize some best practices for using Kafka:

  • When consuming from a message queue, design for abnormal conditions, including idempotence and long-running processing (even infinite loops).
  • Make client-side consumption as fast as possible; run the consumption logic on a separate thread, ideally with timeout control.
  • Reduce the number of topics a group subscribes to; a group should subscribe to at most 5 topics, and ideally to just one.
  • Tune the parameters along the following lines (see the worked example after this list):
    max.poll.records: lower this value; it should be much smaller than <records consumed per second by a single thread> × <number of consumption threads> × <max.poll.interval.ms, in seconds>.
    max.poll.interval.ms: this value should be greater than <max.poll.records> / (<records consumed per second by a single thread> × <number of consumption threads>) seconds.
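As a worked example with purely illustrative numbers: if a single consumption thread handles about 100 records per second, the listener runs 2 threads, and max.poll.interval.ms is 300000 ms (300 s), then one poll interval can absorb roughly 100 × 2 × 300 = 60,000 records, so max.poll.records should be set well below that. Looking at it the other way, with max.poll.records = 500 a batch needs about 500 / (100 × 2) = 2.5 s to process, so max.poll.interval.ms must comfortably exceed 2.5 s even before allowing for slow messages.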

Feel free to try these things out for yourselves.

 


Origin blog.csdn.net/dageliuqing/article/details/127659855