Kafka study notes (XI): consumer group rebalancing and how to avoid it

Chapter 8 already covered how Rebalance works and what it is for, so we go straight to the topic here.

Coordinator

During a Rebalance, all Consumer instances take part and, with the help of the Coordinator, agree on how the subscribed partitions are assigned. So what is the Coordinator? In Kafka it is the component that serves Consumer Groups specifically: it runs Rebalances for the Group and provides offset management and group membership management.

Specifically, when a Consumer commits offsets, it actually commits them to the Broker that hosts its Coordinator. Likewise, when a Consumer starts, it sends its requests to that same Broker, and the Coordinator then performs consumer group registration, member record management and other metadata operations. Every Broker creates and starts its Coordinator component at startup. In other words, every Broker has its own Coordinator. So how does a Consumer Group find out which Broker hosts the Coordinator that serves it? The answer lies in Kafka's internal offsets topic, __consumer_offsets.

Currently, Kafka determines the Broker that hosts a Consumer Group's Coordinator in two steps:

Step 1: Determine which partition of the offsets topic stores the Group's data:

partitionId = Math.abs(groupId.hashCode() % offsetsTopicPartitionCount)

Step 2: Find the Broker that hosts the leader replica of that partition. That Broker is the Coordinator.
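
Step 1 can be sketched in plain Java. This is a minimal illustration of the formula above, not Kafka's own code: the class and method names are hypothetical, and the partition count of 50 is an assumption that matches the default of the broker setting offsets.topic.num.partitions.

```java
// Hypothetical helper illustrating Step 1; not part of the Kafka API.
final class CoordinatorPartition {

    // Returns the __consumer_offsets partition that stores the group's data,
    // using the formula given in the text.
    static int partitionFor(String groupId, int offsetsTopicPartitionCount) {
        return Math.abs(groupId.hashCode() % offsetsTopicPartitionCount);
    }

    public static void main(String[] args) {
        // 50 is assumed here; check offsets.topic.num.partitions on your cluster.
        int partitions = 50;
        String groupId = args.length > 0 ? args[0] : "a";
        System.out.println(groupId + " -> partition " + partitionFor(groupId, partitions));
    }
}
```

With the partition number in hand, Step 2 comes down to describing the __consumer_offsets topic (for example with kafka-topics.sh --describe) and reading off which Broker leads that partition.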

In practice, Consumer applications, especially those using the Java Consumer API, discover and connect to the correct Coordinator automatically, so we rarely need to think about this. The main value of knowing the algorithm is that it helps with troubleshooting. When a Consumer Group misbehaves and you need to check the Broker logs quickly, the algorithm tells you exactly which Broker is the Coordinator, so you do not have to search Broker by Broker.

How to avoid Rebalance?

Rebalance was covered earlier, but here is a brief recap to refresh the memory.

Rebalance drawbacks:

Rebalance hurts Consumer-side TPS, because once it starts, all Consumer instances stop consuming.
Rebalance is slow. If your Group has many members, you will certainly feel this pain.
Rebalance is inefficient. Every member of the Group takes part in the partition reassignment, and locality is usually ignored, even though respecting locality would greatly improve system performance by minimizing the impact of a Rebalance on the remaining Consumers.

A word on the third point. By default, Kafka does not retain the previous assignment when a Rebalance happens. For this reason, the community introduced StickyAssignor, the sticky partition assignment strategy, in version 0.11.0.0. "Sticky" means that on every Rebalance the strategy keeps as much of the previous assignment as possible and tries to minimize partition movement. Unfortunately, the strategy still has some bugs and requires upgrading to 0.11.0.0, so it is not yet widely used in production.
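
For reference, a Consumer opts into the sticky strategy through the partition.assignment.strategy setting. The sketch below only builds the configuration with java.util.Properties. The class name StickyConfig, the group id and the broker address are placeholders, and creating the actual consumer from these properties is left out.

```java
import java.util.Properties;

// Hypothetical helper; only the property keys and the assignor class name come from Kafka.
final class StickyConfig {

    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Ask the group to keep as much of the previous assignment as possible.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.StickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" and "demo-group" are placeholder values.
        build("localhost:9092", "demo-group")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```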

Even more regrettably, the community has no good solution for the first and second drawbacks. Although these problems cannot be solved, unnecessary Rebalances can be avoided. In real business scenarios many Rebalances are unplanned, and it is mostly these unplanned Rebalances that drag down client TPS, so they are the ones to avoid.

To avoid Rebalances, you first need to know when they happen:

The number of group members changes;
The number of subscribed topics changes;
The number of partitions of a subscribed topic changes.

The latter two are usually triggered by operations work and are generally hard to avoid, so the focus here is on the first case, which accounts for most of the Rebalances seen in practice. Adding Consumer instances is easy to understand: when we start a Consumer program configured with the same group.id, we are in effect adding an instance to the Group. The Coordinator accepts the new instance, adds it to the Group and reassigns partitions. Adding Consumer instances is usually a planned operation, done to raise TPS or improve scalability. In short, it is not the kind of "unnecessary Rebalance" we want to avoid.

What concerns us more is the case where the number of instances in the Group decreases. If you deliberately stop some Consumer instances, there is nothing more to say. The real problem is that in some situations the Coordinator wrongly concludes that a Consumer instance has "stopped" and "kicks it out" of the Group. So under what circumstances does the Coordinator consider an instance dead and remove it from the Group?

Next, look at three Consumer parameters that affect whether a Rebalance occurs:

session.timeout.ms: the default here is 10 seconds (Kafka 3.0 raised the default to 45 seconds). If the Coordinator does not receive a heartbeat from a Consumer instance in the Group within this interval, it considers the instance dead, removes it from the Group and starts a new round of Rebalance. In effect, this parameter determines how long a Consumer may stay silent and still count as alive.
heartbeat.interval.ms: controls how often a Consumer instance sends heartbeat requests. The smaller the value, the more frequent the heartbeats. Frequent heartbeats consume extra bandwidth, but the benefit is that the Consumer learns sooner that a Rebalance has started, because the Coordinator notifies each Consumer instance by putting the REBALANCE_NEEDED flag into the heartbeat response (in the Kafka protocol this shows up as the REBALANCE_IN_PROGRESS error code).
max.poll.interval.ms: defines the maximum interval between two poll calls. The default is 5 minutes. If your Consumer program cannot finish processing the messages returned by poll within 5 minutes, the Consumer proactively sends a "leave group" request and the Coordinator starts a new round of Rebalance.
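
The roles of session.timeout.ms and max.poll.interval.ms can be pictured with a toy model. The class below is purely illustrative and does not reproduce the real Coordinator or client code. MemberLiveness and its methods are made-up names, and time is passed in explicitly so the logic is easy to follow.

```java
// Toy model of the two liveness checks described above; not Kafka code.
final class MemberLiveness {
    private final long sessionTimeoutMs;
    private final long maxPollIntervalMs;
    private long lastHeartbeatMs;
    private long lastPollMs;

    MemberLiveness(long sessionTimeoutMs, long maxPollIntervalMs, long nowMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.maxPollIntervalMs = maxPollIntervalMs;
        this.lastHeartbeatMs = nowMs;
        this.lastPollMs = nowMs;
    }

    // Records that a heartbeat was received at the given time.
    void heartbeat(long nowMs) { lastHeartbeatMs = nowMs; }

    // Records that the application called poll at the given time.
    void poll(long nowMs) { lastPollMs = nowMs; }

    // Coordinator side: no heartbeat within the session timeout means the member is removed.
    boolean sessionExpired(long nowMs) {
        return nowMs - lastHeartbeatMs > sessionTimeoutMs;
    }

    // Consumer side: too long between two poll calls means the consumer leaves the group.
    boolean pollIntervalExceeded(long nowMs) {
        return nowMs - lastPollMs > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        // 10 s session timeout and 5 min poll interval, as in the text.
        MemberLiveness m = new MemberLiveness(10_000, 300_000, 0);
        m.heartbeat(3_000);
        System.out.println("session expired at 12 s: " + m.sessionExpired(12_000));
        System.out.println("session expired at 14 s: " + m.sessionExpired(14_000));
        System.out.println("poll interval exceeded at 6 min: " + m.pollIntervalExceeded(360_000));
    }
}
```

Either condition turning true ends in the same place: the member drops out of the Group and a new round of Rebalance starts.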

With these parameters in mind, the "unnecessary" Rebalances that can arise in practice fall into the following situations:

The first kind is caused by heartbeats not being sent in time, so that the Consumer is "kicked out" of the Group. You therefore need to set the first two parameters carefully. A recommended practice is session.timeout.ms = 6s and heartbeat.interval.ms = 2s, which ensures that an instance can send at least three heartbeat requests before it is judged "dead", i.e. session.timeout.ms >= 3 * heartbeat.interval.ms.
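
The recommended values and the three-heartbeat rule can be written down as a small configuration sketch. HeartbeatConfig is a hypothetical helper, and only the two property keys are Kafka's.

```java
import java.util.Properties;

// Hypothetical helper that applies the session.timeout.ms >= 3 * heartbeat.interval.ms rule.
final class HeartbeatConfig {

    static Properties build(int sessionTimeoutMs, int heartbeatIntervalMs) {
        if (sessionTimeoutMs < 3 * heartbeatIntervalMs) {
            throw new IllegalArgumentException(
                    "session.timeout.ms should be at least 3 * heartbeat.interval.ms");
        }
        Properties props = new Properties();
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        // Values recommended in the text: 6 s session timeout, 2 s heartbeat interval.
        System.out.println(build(6_000, 2_000));
    }
}
```

Rejecting a bad pair at startup is a deliberate choice in this sketch: a misconfigured ratio otherwise only shows up later as unexplained Rebalances.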

The second kind is a Rebalance caused by the Consumer taking too long to process messages. In one user's scenario, the Consumer wrote processed messages to MongoDB, and whenever MongoDB became slightly unstable the Consumer's processing time grew. It is best to set max.poll.interval.ms somewhat larger than the maximum downstream processing time. In short, leave plenty of time for your business logic.
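
One way to apply this advice is to derive max.poll.interval.ms from the slowest processing time you have observed. The helper below is a sketch. PollIntervalSizing, the 1.5x safety factor and the 4-minute sample figure are assumptions for illustration, not values from the column.

```java
import java.time.Duration;

// Hypothetical sizing helper; not part of the Kafka API.
final class PollIntervalSizing {

    // Returns a max.poll.interval.ms value with headroom above the worst observed
    // time needed to process one batch returned by poll.
    static long maxPollIntervalMs(Duration worstBatchProcessingTime, double safetyFactor) {
        if (safetyFactor < 1.0) {
            throw new IllegalArgumentException("safetyFactor must be >= 1.0");
        }
        return (long) Math.ceil(worstBatchProcessingTime.toMillis() * safetyFactor);
    }

    public static void main(String[] args) {
        // Assumed measurement: the slowest batch took 4 minutes end to end.
        long ms = maxPollIntervalMs(Duration.ofMinutes(4), 1.5);
        System.out.println("max.poll.interval.ms=" + ms);
    }
}
```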

If these parameters are set to the recommended values and Rebalances still occur, check the GC behavior on the Consumer side, for example whether frequent Full GC pauses are stalling the Consumer and triggering Rebalances. This is a common cause.
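
As a starting point for that check, the JVM exposes cumulative GC counts and times through the standard java.lang.management API. The sketch below prints them from inside the Consumer process. GcSnapshot is a hypothetical name, and what counts as "frequent" depends on your workload.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic helper built on the standard JVM management beans.
final class GcSnapshot {

    // Sums the time, in milliseconds, spent in all collectors since JVM start.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time ms: " + totalGcTimeMs());
    }
}
```

If the total jumps by several seconds between two snapshots taken around a Rebalance, long GC pauses are a plausible suspect for the missed heartbeats, and GC logs are the place to confirm it.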

Note: this series of articles is my study notes on the Geek Time column "Kafka Core Technology and Practice".

https://time.geekbang.org/column/article/101171

Prisoners prison - Mineko
