Spring Kafka consumption problems: "Commit cannot be completed since the group has already rebalanced", and consumers suddenly hanging and no longer consuming

One day, a large number of Kafka exceptions were reported in the production environment: CommitFailedException

org.apache.kafka.clients.consumer.CommitFailedException:
 Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
 This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms,
 which typically implies that the poll loop is spending too much time message processing.
 You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.

Later analysis revealed that the exception occurred because processing the records returned by a single poll() (max.poll.records, default 500) took too long, so the interval between two consecutive poll() calls exceeded the max.poll.interval.ms threshold (default five minutes). The remedies are either to increase max.poll.interval.ms or to reduce the number of records fetched per poll. I chose to reduce the number of records fetched per poll, and also adjusted session.timeout.ms.
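To make the failure mode concrete, here is a minimal sketch of a plain Kafka consumer loop; the broker address, group id, and topic name are placeholders, not from the original setup. If processing one batch takes longer than max.poll.interval.ms, the group rebalances and the next commit throws the CommitFailedException shown above:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollIntervalDemo {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // The two knobs discussed above:
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "200");        // fewer records per poll
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // default: 5 minutes

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // If records.count() * (time per record) exceeds max.poll.interval.ms,
                    // the coordinator evicts this consumer and commitSync() below throws
                    // CommitFailedException.
                    process(record);
                }
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic; keep the whole batch well under max.poll.interval.ms
    }
}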
The spring configuration is as follows:

spring:
  kafka:
    consumer:
      max-poll-records: 200

I also changed spring.kafka.properties.session.timeout.ms:

spring:
  kafka:
    properties:
      session:
        timeout:
          ms: 120000

This second change may not actually be necessary: since Kafka 0.10.1.0 (KIP-62), the allowed time between polls is governed by the separate max.poll.interval.ms parameter, while session.timeout.ms only controls heartbeat-based failure detection.


Case 2:
Recently another anomaly appeared in production: messages piled up unconsumed, as if the consumers had died. After restarting the service, consumption resumed, but stopped again after a while.
I first tried increasing the number of consumers and the number of pods (nodes), but that did not fully solve it; the problem persisted.
A thread dump showed that the Kafka coordinator heartbeat threads were all in the WAITING state, i.e. suspended and waiting indefinitely:

"kafka-coordinator-heartbeat-thread | CID_alikafka_xxx" #125 daemon prio=5 os_prio=0 tid=0x00007f1aa57fa000 nid=0x86 in Object.wait() [0x00007f1a8af80000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:920)
        - locked <0x00000000e798f558> (a org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

   Locked ownable synchronizers:
        - None

"kafka-coordinator-heartbeat-thread | CID_alikafka_xxx" #124 daemon prio=5 os_prio=0 tid=0x00007f1aa546b800 nid=0x85 in Object.wait() [0x00007f1a8b081000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:920)
        - locked <0x00000000e798f888> (a org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)

   Locked ownable synchronizers:
        - None

Later, after checking the official Spring Kafka documentation (https://docs.spring.io/spring-kafka/docs/2.6.3-SNAPSHOT/reference/html/), I found the explanation: the consumer was suspended because it exceeded max.poll.interval.ms, which defaults to five minutes. The real culprit was that the business processing after a message is received was too slow; that part will be optimized later.
As a stopgap, I increased spring.kafka.properties.max.poll.interval.ms to 600,000 ms (10 minutes).
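For reference, the same consumer settings can also be applied programmatically instead of through application.yml. A minimal sketch, assuming a simple String consumer; the bootstrap address and group id are placeholders:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 200);         // from case 1
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // 10 minutes, from case 2
        return new DefaultKafkaConsumerFactory<>(props);
    }
}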

Introduction to some parameters of spring-kafka

spring.kafka.consumer.max-poll-records 150  number of records fetched in a single poll (default 500)

spring.kafka.properties.max.poll.interval.ms  maximum allowed interval between two poll() calls (default 5 minutes)

spring.kafka.producer.batch-size  batch size in bytes for a single producer send (default 16384); applies to the message producer

spring.kafka.listener.concurrency  number of consumers; Kafka partitions are divided evenly among them, e.g. with 24 partitions and a value of 8, each consumer handles 3 partitions (see the sketch below)
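To illustrate the concurrency setting, a minimal sketch of a listener container factory; it assumes a consumerFactory bean like the one sketched above:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaListenerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // With 24 partitions and concurrency 8, each of the 8 consumer threads
        // is assigned 3 partitions.
        factory.setConcurrency(8);
        return factory;
    }
}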

Origin: blog.csdn.net/huangdi1309/article/details/109447899