Troubleshooting frequent Kafka consumer group rebalances

I came across an article online about CKafka consumer groups that rebalance repeatedly, and it offered some ideas:

See also: troubleshooting Kafka consumers that are frequently kicked out of the consumer group

I then wrote an article on the heartbeat and session timeout mechanisms in older and newer versions of Kafka

The following records the CKafka approach

Background

CKafka's consumer rebalancing mechanism is the same as open-source Kafka's. A rebalance is the process by which all Consumer instances in a consumer group agree on how to divide up the partitions of the subscribed topics. All Consumer instances take part, and the partition assignment is completed with the help of the Coordinator component. During the whole process, however, no instance can consume any messages, which disrupts normal consumption of business messages.
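To make the idea of "agreeing on how to divide up the partitions" concrete, here is a minimal sketch of a round-robin assignment. It is a simplified illustration, not Kafka's actual assignor; the function name `assign_round_robin` and the member names are made up for this example. It shows that when a third member joins, every member's partition list changes.

```python
def assign_round_robin(partitions, members):
    """Deal partitions out to members one at a time, in sorted order.

    Simplified illustration only; not Kafka's real partition assignor.
    """
    if not members:
        raise ValueError("a group needs at least one member")
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = list(range(6))  # a topic with 6 partitions (assumed)
before = assign_round_robin(partitions, ["c1", "c2"])
after = assign_round_robin(partitions, ["c1", "c2", "c3"])
print(before)  # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(after)   # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

In this sketch, adding `c3` moves partitions away from both existing members, which is why every member has to pause and take part.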

Rebalancing in CKafka has three main drawbacks:

1. It reduces consumer-side TPS and therefore overall consumer performance.

2. It is slow, and the pain grows with the number of members in the consumer group.

3. It is inefficient: every member of the group must participate, and every member has to reacquire its partitions before it can consume again.

Therefore, when consuming messages with CKafka, we should take care to avoid unnecessary rebalances.

Cause Analysis

To avoid rebalancing on the consumer side, we have to start with what triggers a rebalance.

A rebalance is triggered in three situations:

1. The membership of the consumer group changes.

2. The number of subscribed topics changes.

3. The number of partitions of a subscribed topic changes.
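The three triggers can be sketched as a comparison between two snapshots of a group's state. This is only an illustration of the rules above; the snapshot layout (a `members` set plus a `partitions` mapping from topic name to partition count) and the function name `rebalance_triggers` are assumptions made for this example, not a Kafka API.

```python
def rebalance_triggers(old, new):
    """Return the reasons a change from snapshot `old` to `new` would rebalance."""
    reasons = []
    # Trigger 1: a member joined or left (compared as sets of member ids).
    if old["members"] != new["members"]:
        reasons.append("group membership changed")
    # Trigger 2: the set of subscribed topics changed.
    if set(old["partitions"]) != set(new["partitions"]):
        reasons.append("subscribed topics changed")
    # Trigger 3: a topic present in both snapshots has a new partition count.
    common = set(old["partitions"]) & set(new["partitions"])
    if any(old["partitions"][t] != new["partitions"][t] for t in common):
        reasons.append("partition count changed")
    return reasons


old = {"members": {"c1", "c2"}, "partitions": {"orders": 6}}
new = {"members": {"c1"}, "partitions": {"orders": 8}}
print(rebalance_triggers(old, new))
# ['group membership changed', 'partition count changed']
```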

Changes in the number of subscribed topics or in the number of partitions of a subscribed topic are usually deliberate operations by the operations team, so the rebalances they cause are mostly unavoidable. The rest of this article therefore focuses on avoiding rebalances caused by changes in group membership. Whenever the number of Consumer instances in a Consumer Group changes, a rebalance is triggered. This is the most common cause: generally speaking, about 99% of the rebalances we encounter come from it.

An increase in the number of Consumer instances is easy to understand. When we start a Consumer application configured with the same group.id, we add a new Consumer instance to the Group. CKafka's Coordinator accepts the new instance, adds it to the Group, and reassigns the partitions. Adding Consumer instances is usually planned, for example to raise TPS or improve scalability, so it is not the kind of "unnecessary rebalance" we want to avoid.

A decrease in the number of Consumer instances falls into two cases. In the first, developers stop some Consumer instances themselves, which is normal. In the second, the Coordinator mistakenly considers a Consumer instance "stopped" and "kicks it out" of the Group. If this is what causes the rebalance, we must try to avoid it.

Solution

When a consumer program shows intermittently slow consumption or unexpected timeouts, the consumer group may be rebalancing. This can be verified in the CKafka console. The rest of this section covers how to avoid it.

After the Consumer Group completes a rebalance, each Consumer instance periodically sends a heartbeat request to the Coordinator to show that it is still alive. If a Consumer instance fails to send these heartbeats in time, the Coordinator considers it "dead", removes it from the Group, and starts a new round of rebalancing.

The consumer-side parameter session.timeout.ms controls this. Its default value is 10 seconds in Kafka clients before 3.0 (Kafka 3.0 raised the default to 45 seconds). With a 10-second timeout, if the Coordinator does not receive a heartbeat from a Consumer instance in the Group within 10 seconds, it considers that instance dead. In other words, session.timeout.ms determines how long a Consumer can stay silent and still count as alive. A second parameter, heartbeat.interval.ms, controls how often heartbeat requests are sent. The lower the value, the more frequently the Consumer instance sends heartbeats. Frequent heartbeats consume extra bandwidth, but the Consumer learns sooner that a rebalance is in progress, because the Coordinator signals a rebalance to each Consumer in its heartbeat response.
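The session timeout check can be sketched with a toy coordinator that records the last heartbeat time of each member. This is a simplified model for illustration, not the broker's implementation; the class `Coordinator`, its methods, and the use of explicit millisecond timestamps are all assumptions of this example.

```python
class Coordinator:
    """Toy coordinator that tracks the last heartbeat time per member."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_heartbeat_ms = {}

    def heartbeat(self, member, now_ms):
        """Record that `member` was alive at time `now_ms`."""
        self.last_heartbeat_ms[member] = now_ms

    def expire(self, now_ms):
        """Remove and return members silent for longer than the session timeout."""
        dead = sorted(
            member
            for member, seen_ms in self.last_heartbeat_ms.items()
            if now_ms - seen_ms > self.session_timeout_ms
        )
        for member in dead:
            del self.last_heartbeat_ms[member]
        return dead


coordinator = Coordinator(session_timeout_ms=10_000)
coordinator.heartbeat("c1", now_ms=0)
coordinator.heartbeat("c2", now_ms=0)
coordinator.heartbeat("c1", now_ms=9_000)     # c1 keeps sending heartbeats
expired = coordinator.expire(now_ms=12_000)   # c2 has been silent for 12 s
print(expired)  # ['c2']
```

In the real system, each removal like this would be followed by a new round of rebalancing for the remaining members.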

The first avoidance method:

The first type of unnecessary rebalance happens when heartbeats are not sent in time and the Consumer is "kicked out" of the Group. To prevent it, set session.timeout.ms and heartbeat.interval.ms carefully. Here are some recommended values:

1. Set session.timeout.ms = 6s.

2. Set heartbeat.interval.ms = 2s.

3. Ensure that the Consumer instance can send at least 3 rounds of heartbeat requests before being judged as "dead", that is, session.timeout.ms >= 3 * heartbeat.interval.ms.
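As a minimal sketch, the recommendation can be written as a configuration dictionary plus a check of the 3-heartbeat rule. The heartbeat interval of 2 s is taken here as the largest value that satisfies the rule with a 6 s session timeout. The server address, the group name, and the helper names are hypothetical. A dictionary keyed by these dotted property names is the form accepted by clients such as confluent-kafka, but that usage is an assumption and is not exercised here.

```python
# Hypothetical connection values; only the two timeout settings come from the text.
consumer_config = {
    "bootstrap.servers": "ckafka.example.internal:9092",  # assumed address
    "group.id": "demo-group",                             # assumed group name
    "session.timeout.ms": 6000,     # 6 s
    "heartbeat.interval.ms": 2000,  # 2 s, so 3 heartbeats fit in one session
}


def heartbeat_rounds(config):
    """Return how many whole heartbeat intervals fit in one session timeout."""
    return config["session.timeout.ms"] // config["heartbeat.interval.ms"]


def check_heartbeat_settings(config, min_rounds=3):
    """True if session.timeout.ms >= min_rounds * heartbeat.interval.ms."""
    return config["session.timeout.ms"] >= min_rounds * config["heartbeat.interval.ms"]


print(heartbeat_rounds(consumer_config))          # 3
print(check_heartbeat_settings(consumer_config))  # True
```

Running a check like this at startup is one way to catch a configuration where a single delayed heartbeat is enough to get the Consumer removed.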

The second avoidance method:

The second type of unnecessary rebalance is caused by the Consumer taking too long to process messages. Suppose, for example, that the Consumer must write each message to MongoDB after processing it. This is heavy consumption logic, and the slightest instability in MongoDB increases the Consumer's processing time. Here the max.poll.interval.ms parameter is critical: it bounds the time allowed between two consecutive poll() calls, and a Consumer that exceeds it leaves the Group and triggers a rebalance. To avoid unexpected rebalances, set this parameter somewhat higher than the maximum downstream processing time. For example, if a MongoDB write can take up to 7 minutes, set it to about 8 minutes.
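The arithmetic behind the MongoDB example can be sketched as follows. The helper name `max_poll_interval_ms` and the one-minute safety margin are assumptions chosen to reproduce the 7-minute to 8-minute example; the right margin depends on how much the downstream system varies.

```python
def max_poll_interval_ms(worst_case_processing_s, margin_s=60):
    """Worst-case processing time plus a safety margin, in milliseconds."""
    if worst_case_processing_s < 0 or margin_s < 0:
        raise ValueError("durations must be non-negative")
    return int((worst_case_processing_s + margin_s) * 1000)


worst_case_s = 7 * 60  # slowest observed MongoDB write: 7 minutes
consumer_config = {"max.poll.interval.ms": max_poll_interval_ms(worst_case_s)}
print(consumer_config)  # {'max.poll.interval.ms': 480000}
```

480000 ms is 8 minutes, matching the recommendation above.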

Origin blog.csdn.net/qq_32907195/article/details/112803112