In-depth analysis of Kafka source code - Part 7 - Consumer: the coordinator protocol and how the heartbeat is implemented

As mentioned earlier, KafkaProducer is thread-safe and contains a Sender, which runs a background thread that continuously fetches messages from the queue and sends them.

The consumer, by contrast, is a purely single-threaded program. All the mechanisms discussed below, including the coordinator, rebalance, and heartbeat, are carried out inside this single-threaded poll function. As a result, there are no locks inside the consumer's code.

    // Client thread
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        . . .
    }



What is a coordinator?
Removing the dependency on ZooKeeper

In the client API before 0.9, the consumer depended on ZooKeeper, because all consumers in the same consumer group need to cooperate to perform the rebalance described below.

However, because of ZooKeeper's "herd effect" and "split brain" problems, different consumers in a group could end up owning the same partition, causing confusion in message consumption. For this reason, since 0.9 ZooKeeper is no longer used; instead, the Kafka cluster itself handles the coordination among consumers. The following is quoted from the original Kafka design documents:

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design#Kafka0.9ConsumerRewriteDesign-Failuredetectionprotocol

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design

The current version of the high level consumer suffers from herd and split brain problems, where multiple consumers in a group run a distributed algorithm to agree on the same partition ownership decision. Due to different view of the zookeeper data, they run into conflicts that makes the rebalancing attempt fail. But there is no way for a consumer to verify if a rebalancing operation completed successfully on the entire group. This also leads to some potential bugs in the rebalancing logic, for example, https://issues.apache.org/jira/browse/KAFKA-242


Why can a partition be owned by only one consumer within a group?

We know that consumers belonging to different consumer groups can consume the same partition to implement the Pub/Sub mode.

However, within a group, multiple consumers are not allowed to consume the same partition, which means that for a given topic and group, the number of partitions must be >= the number of consumers.

For example, for a topic with 4 partitions, there can be at most 4 consumers in a group. If you add more consumers, the extra ones will not be assigned any partition.

So why impose this restriction? The reasons are discussed in detail here:
http://stackoverflow.com/questions/25896109/in-apache-kafka-why-cant-there-be-more-consumer-instances-than-partitions

One reason is that there would be no way to guarantee the ordering of messages within a partition; the other is that the Kafka server keeps exactly one offset per (topic, partition, consumer_group_id), that is, the key (topic, partition, consumer_group_id) maps to a single offset. If multiple consumers consumed the same partition in parallel, committing (confirming) that single offset would become a problem.
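To make the offset bookkeeping concrete, here is a minimal illustrative model of "one offset per (group, topic, partition)". It is not the broker's actual data structure; the class name and key format are made up for illustration:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative model only: one committed offset per (group, topic, partition).
    // If two consumers in the same group consumed the same partition, they would
    // race on this single entry, which is why Kafka forbids it.
    public class OffsetBook {
        // Hypothetical key format: group + "/" + topic + "/" + partition
        private final Map<String, Long> committed = new ConcurrentHashMap<>();

        public void commit(String group, String topic, int partition, long offset) {
            committed.put(group + "/" + topic + "/" + partition, offset);
        }

        public long fetch(String group, String topic, int partition) {
            return committed.getOrDefault(group + "/" + topic + "/" + partition, -1L);
        }
    }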

With this premise in mind, let's analyze the partition allocation problem.
The coordinator protocol / the partition allocation problem

Problem statement: given a topic with 4 partitions (p0, p1, p2, p3) and a group with 3 consumers (c0, c1, c2), how should the partitions be assigned? If the allocation strategy is range-based, the result is:
c0: p0, p1; c1: p2; c2: p3
If the allocation strategy is round-robin:
c0: p0, p3; c1: p1; c2: p2
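To make the arithmetic behind these two results concrete, here is a simplified sketch of the two strategies. It is not Kafka's actual RangeAssignor/RoundRobinAssignor source, just the same idea reduced to integer partition ids:

    import java.util.*;

    public class AssignDemo {
        // Range: split the partitions into contiguous chunks; the first
        // (numPartitions % numConsumers) consumers get one extra partition.
        static Map<String, List<Integer>> range(List<String> consumers, int numPartitions) {
            Map<String, List<Integer>> result = new LinkedHashMap<>();
            int per = numPartitions / consumers.size();
            int extra = numPartitions % consumers.size();
            int p = 0;
            for (int i = 0; i < consumers.size(); i++) {
                int count = per + (i < extra ? 1 : 0);
                List<Integer> assigned = new ArrayList<>();
                for (int j = 0; j < count; j++) assigned.add(p++);
                result.put(consumers.get(i), assigned);
            }
            return result;
        }

        // Round-robin: deal the partitions out one by one.
        static Map<String, List<Integer>> roundRobin(List<String> consumers, int numPartitions) {
            Map<String, List<Integer>> result = new LinkedHashMap<>();
            consumers.forEach(c -> result.put(c, new ArrayList<>()));
            for (int p = 0; p < numPartitions; p++)
                result.get(consumers.get(p % consumers.size())).add(p);
            return result;
        }

        public static void main(String[] args) {
            List<String> consumers = Arrays.asList("c0", "c1", "c2");
            System.out.println(range(consumers, 4));      // {c0=[0, 1], c1=[2], c2=[3]}
            System.out.println(roundRobin(consumers, 4)); // {c0=[0, 3], c1=[1], c2=[2]}
        }
    }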

So how does this whole allocation process work? See the figure below:
[Figure: the 3-step allocation process]

Step 1: For each consumer group, the Kafka cluster selects a broker from the broker cluster as its coordinator. Therefore, the first step is to find this coordinator.

    private Map<TopicPartition, List<ConsumerRecord<K, V>>> pollOnce(long timeout) {
        coordinator.ensureCoordinatorKnown(); //The first line of the poll function is to find the coordinator. If not found, it will block here
        ...
 }

    public void ensureCoordinatorKnown() {
        while (coordinatorUnknown()) {
            RequestFuture<Void> future = sendGroupMetadataRequest();
            client.poll(future);

            if (future.failed()) {
                if (future.isRetriable())
                    client.awaitMetadataUpdate();
                else
                    throw future.exception();
            }
        }
    }

    private RequestFuture<Void> sendGroupMetadataRequest() {
        Node node = this.client.leastLoadedNode();
        if (node == null) {
            return RequestFuture.noBrokersAvailable();
        } else {
            GroupCoordinatorRequest metadataRequest = new GroupCoordinatorRequest(this.groupId);  // Send a request to the least-loaded node in the cluster, asking which broker is the coordinator for this group id
            return client.send(node, ApiKeys.GROUP_COORDINATOR, metadataRequest)
                    .compose(new RequestFutureAdapter<ClientResponse, Void>() {
                        @Override
                        public void onSuccess(ClientResponse response, RequestFuture<Void> future) {
                            handleGroupMetadataResponse(response, future);
                        }
                    });
        }
    }


Step 2: After finding the coordinator, send the JoinGroup request.

    private Map<TopicPartition, List<ConsumerRecord<K, V>>> pollOnce(long timeout) {
        coordinator.ensureCoordinatorKnown();  // Step 1: find the coordinator

        if (subscriptions.partitionsAutoAssigned())
            coordinator.ensurePartitionAssignment(); // Steps 2 + 3: JoinGroup + SyncGroup
        ...
    }

    public void ensureActiveGroup() {
        if (!needRejoin())
            return;

        if (needsJoinPrepare) {
            onJoinPrepare(generation, memberId);
            needsJoinPrepare = false;
        }

        while (needRejoin()) {
            ensureCoordinatorKnown();

           if (client.pendingRequestCount(this.coordinator) > 0) {
                client.awaitPendingRequests(this.coordinator);
                continue;
            }

            RequestFuture<ByteBuffer> future = performGroupJoin();
            client.poll(future);

            if (future.succeeded()) {
                onJoinComplete(generation, memberId, protocol, future.value());
                needsJoinPrepare = true;
                heartbeatTask.reset();
            } else {
                RuntimeException exception = future.exception();
                if (exception instanceof UnknownMemberIdException ||
                        exception instanceof RebalanceInProgressException ||
                        exception instanceof IllegalGenerationException)
                    continue;
                else if (!future.isRetriable())
                    throw exception;
                time.sleep(retryBackoffMs);
            }
        }
    }

 

Step 3: After JoinGroup returns, send SyncGroup to obtain the partitions assigned to this consumer.

    private RequestFuture<ByteBuffer> performGroupJoin() {
        if (coordinatorUnknown())
            return RequestFuture.coordinatorNotAvailable();

        // send a join group request to the coordinator
        log.debug("(Re-)joining group {}", groupId);
        JoinGroupRequest request = new JoinGroupRequest(
                groupId,
                this.sessionTimeoutMs,
                this.memberId,
                protocolType(),
                metadata());

        // create the request for the coordinator
        log.debug("Issuing request ({}: {}) to coordinator {}", ApiKeys.JOIN_GROUP, request, this.coordinator.id());
        return client.send(coordinator, ApiKeys.JOIN_GROUP, request)
                .compose(new JoinGroupResponseHandler());
    }


    private class JoinGroupResponseHandler extends CoordinatorResponseHandler<JoinGroupResponse, ByteBuffer> {

        @Override
        public JoinGroupResponse parse(ClientResponse response) {
            return new JoinGroupResponse(response.responseBody());
        }

        @Override
        public void handle(JoinGroupResponse joinResponse, RequestFuture<ByteBuffer> future) {
            // process the response
            short errorCode = joinResponse.errorCode();
            if (errorCode == Errors.NONE.code()) {
                log.debug("Joined group: {}", joinResponse.toStruct());
                AbstractCoordinator.this.memberId = joinResponse.memberId();
                AbstractCoordinator.this.generation = joinResponse.generationId();
                AbstractCoordinator.this.rejoinNeeded = false;
                AbstractCoordinator.this.protocol = joinResponse.groupProtocol();
                sensors.joinLatency.record(response.requestLatencyMs());
                if (joinResponse.isLeader()) {
                    onJoinLeader(joinResponse).chain(future); // Key point: right after JoinGroup returns, the SyncGroup message is sent
                } else {
                    onJoinFollower().chain(future);
                }
            } else if (errorCode == Errors.GROUP_LOAD_IN_PROGRESS.code()) {
               . . .
            }
        }
    }

 
Note that in the above three steps, there is a key point: the partition allocation strategy and allocation result are actually determined by the client, not by the coordinator. What does that mean? In step 2, after all consumers send JoinGroup messages to the coordinator, the coordinator will designate one of the consumers as the leader and the other consumers as the followers.

Partition allocation is then performed by the leader. Then in step 3, the leader sends the allocation result to the coordinator through the SyncGroup message, and other consumers also send the SyncGroup message to obtain the allocation result.

Why choose a leader among the consumers and let it perform the allocation, instead of having the coordinator assign partitions directly? Kafka's official documentation analyzes this in detail. One important reason is flexibility: if allocation were done on the server side, introducing a new allocation strategy would require redeploying the broker cluster, which is very costly for a cluster already running in production; with client-side allocation, the broker cluster does not need to be redeployed.
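This flexibility is visible directly in the client configuration: the assignment strategy is a consumer-side setting, partition.assignment.strategy, so switching strategies only means changing client configs. Below is a minimal sketch of how a consumer might be configured to use the round-robin assignor; the broker address and group id are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AssignorConfigSketch {
        public static KafkaConsumer<String, String> buildConsumer() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder
            props.put("group.id", "my-group");                   // placeholder
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // Client-side choice of assignment strategy; no broker redeployment needed.
            props.put("partition.assignment.strategy",
                      "org.apache.kafka.clients.consumer.RoundRobinAssignor");
            return new KafkaConsumer<>(props);
        }
    }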
The rebalance mechanism

The so-called rebalance means that, under certain conditions, partitions are redistributed among the consumers in a group. Under what conditions is a rebalance triggered?
Condition 1: A new consumer joins the group
Condition 2: An existing consumer dies
Condition 3: The coordinator dies and the cluster elects a new coordinator
Condition 4: Partitions are added to the topic
Condition 5: A consumer calls unsubscribe() to cancel its subscription to the topic

When any of these conditions occurs, all consumers in the group go through Step 2 + Step 3 again: JoinGroup + SyncGroup.
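On the client side, a rebalance triggered by any of these conditions can be observed by passing a ConsumerRebalanceListener to subscribe(); the callbacks run inside poll(), around the JoinGroup + SyncGroup cycle. Below is a minimal sketch, assuming an already constructed consumer and a placeholder topic name:

    import java.util.Collection;
    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class RebalanceListenerSketch {
        public static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before the rebalance: a good place to commit offsets / release resources.
                    System.out.println("Partitions revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called once the new assignment (the SyncGroup result) is known.
                    System.out.println("Partitions assigned: " + partitions);
                }
            });
        }
    }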

But the question remains: when a consumer dies, or a new consumer joins, how do the other consumers learn that they need to rebalance? The answer is the heartbeat, described below.
The implementation principle of heartbeat

Every consumer periodically sends heartbeat messages to the coordinator. Once the coordinator returns a specific error code, ILLEGAL_GENERATION, it means the previous generation of the group is no longer valid (the group has been disbanded), and the consumer must perform JoinGroup + SyncGroup again.

How is this periodic sending achieved? An intuitive idea is to start a background thread that sends heartbeats at fixed intervals, but maintaining a background thread would clearly add complexity to the implementation. As mentioned above, the consumer is a single-threaded program; here the periodic sending is achieved with a DelayedQueue.
DelayedQueue and HeartbeatTask

The basic idea is to put the HeartbeatTask into a DelayedQueue and then, in each iteration of the poll loop, take the task out of the queue and send the request (a task can only be taken out of the queue once its due time has arrived).
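The pattern can be sketched with java.util.concurrent.DelayQueue; this is only an illustration of the idea, not the Kafka client's own DelayedTaskQueue:

    import java.util.concurrent.DelayQueue;
    import java.util.concurrent.Delayed;
    import java.util.concurrent.TimeUnit;

    // Simplified illustration: a task becomes visible only after its deadline,
    // so the single poll thread can schedule future work without any background thread.
    class TimedTask implements Delayed {
        final Runnable work;
        final long deadlineMs;

        TimedTask(Runnable work, long deadlineMs) {
            this.work = work;
            this.deadlineMs = deadlineMs;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(deadlineMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    class PollLoopSketch {
        private final DelayQueue<TimedTask> tasks = new DelayQueue<>();

        void schedule(Runnable work, long delayMs) {
            tasks.add(new TimedTask(work, System.currentTimeMillis() + delayMs));
        }

        // Called from inside the single-threaded poll loop on every iteration.
        void runReadyTasks() {
            TimedTask task;
            while ((task = tasks.poll()) != null)   // non-blocking: only due tasks are returned
                task.work.run();
        }
    }

The actual HeartbeatTask below follows the same pattern: after each send (or failure), it re-schedules itself into the queue.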

    private class HeartbeatTask implements DelayedTask {

        private boolean requestInFlight = false; // Key variable: whether a HeartbeatRequest is currently in flight, i.e. sent but its response has not yet come back

        // reset() is essentially what (re)schedules the sending
        public void reset() {
            long now = time.milliseconds();
            heartbeat.resetSessionTimeout(now);
            client.unschedule(this);

            if (!requestInFlight)
                client.schedule(this, now); //If there is no requestInFlight, put HeartBeatTask in DelayedQueue
        }

        @Override
        public void run(final long now) {
            if (generation < 0 || needRejoin() || coordinatorUnknown()) {
                return;
            }

            if (heartbeat.sessionTimeoutExpired(now)) {
                coordinatorDead();
                return;
            }

            if (!heartbeat.shouldHeartbeat(now)) {
                client.schedule(this, now + heartbeat.timeToNextHeartbeat(now));
            } else {
                heartbeat.sentHeartbeat(now);
                requestInFlight = true;

                RequestFuture<Void> future = sendHeartbeatRequest();
                future.addListener(new RequestFutureListener<Void>() {
                    @Override
                    public void onSuccess(Void value) {
                        requestInFlight = false;
                        long now = time.milliseconds();
                        heartbeat.receiveHeartbeat(now);
                        long nextHeartbeatTime = now + heartbeat.timeToNextHeartbeat(now);
                        //put into delayedQueue
                        client.schedule(HeartbeatTask.this, nextHeartbeatTime);
                    }

                    // After the heartbeat returns, whether the response succeeds or fails, the next
                    // heartbeat is put back into the DelayedQueue, forming a periodic send loop.
                    @Override
                    public void onFailure(RuntimeException e) {
                        requestInFlight = false;
                        client.schedule(HeartbeatTask.this, time.milliseconds() + retryBackoffMs);
                    }
                });
            }
        }
    }

Rebalance detection

First, a side note: I personally think the network framework here is a bit redundant. sendHeartbeatRequest attaches both a callback mechanism (the HeartbeatCompletionHandler) and a listener mechanism (the code above) to its Future.

It is in the heartbeat's completion handler that the detection of a rebalance takes place: as the code below shows, a rebalance is triggered by the following response error codes:

 * GROUP_COORDINATOR_NOT_AVAILABLE (15)
 * NOT_COORDINATOR_FOR_GROUP (16)
 * ILLEGAL_GENERATION (22)
 * UNKNOWN_MEMBER_ID (25)
 * REBALANCE_IN_PROGRESS (27)
 * GROUP_AUTHORIZATION_FAILED (30)

    public RequestFuture<Void> sendHeartbeatRequest() {
        HeartbeatRequest req = new HeartbeatRequest(this.groupId, this.generation, this.memberId);
        return client.send(coordinator, ApiKeys.HEARTBEAT, req)
                .compose(new HeartbeatCompletionHandler());
    }

    private class HeartbeatCompletionHandler extends CoordinatorResponseHandler<HeartbeatResponse, Void> {
        @Override
        public HeartbeatResponse parse(ClientResponse response) {
            return new HeartbeatResponse(response.responseBody());
        }

        @Override
        public void handle(HeartbeatResponse heartbeatResponse, RequestFuture<Void> future) {
            sensors.heartbeatLatency.record(response.requestLatencyMs());
            short errorCode = heartbeatResponse.errorCode();
            if (errorCode == Errors.NONE.code()) {
                log.debug("Received successful heartbeat response.");
                future.complete(null);
            } else if (errorCode == Errors.GROUP_COORDINATOR_NOT_AVAILABLE.code()
                    || errorCode == Errors.NOT_COORDINATOR_FOR_GROUP.code()) {
                log.info("Attempt to heart beat failed since coordinator is either not started or not valid, marking it as dead.");
                coordinatorDead();
                future.raise(Errors.forCode(errorCode));
            } else if (errorCode == Errors.REBALANCE_IN_PROGRESS.code()) {
                log.info("Attempt to heart beat failed since the group is rebalancing, try to re-join group.");
                AbstractCoordinator.this.rejoinNeeded = true;
                future.raise(Errors.REBALANCE_IN_PROGRESS);
            } else if (errorCode == Errors.ILLEGAL_GENERATION.code()) {
                log.info("Attempt to heart beat failed since generation id is not legal, try to re-join group.");
                AbstractCoordinator.this.rejoinNeeded = true;
                future.raise(Errors.ILLEGAL_GENERATION);
            } else if (errorCode == Errors.UNKNOWN_MEMBER_ID.code()) {
                log.info("Attempt to heart beat failed since member id is not valid, reset it and try to re-join group.");
                memberId = JoinGroupRequest.UNKNOWN_MEMBER_ID;
                AbstractCoordinator.this.rejoinNeeded = true;
                future.raise(Errors.UNKNOWN_MEMBER_ID);
            } else if (errorCode == Errors.GROUP_AUTHORIZATION_FAILED.code()) {
                future.raise(new GroupAuthorizationException(groupId));
            } else {
                future.raise(new KafkaException("Unexpected errorCode in heartbeat response: "
                        + Errors.forCode(errorCode).exception().getMessage()));
            }
        }
    }



Key point: the so-called "triggering" here is actually just setting rejoinNeeded to true. Then, in the next poll loop, when rejoinNeeded is detected to be true, Step 2 + Step 3 above are repeated.
Failover

For this entire system, the consumer may fail, and the coordinator may also fail. Therefore, both parties need to check each other to see if the other party has hung up.

The detection method is the heartbeat described above: if the consumer finds that its heartbeat requests time out, or the coordinator has not received a heartbeat for a long time, each side concludes that the other has died.

Of course, there can be false positives. For example, the consumer may process messages very slowly (since it is single-threaded), causing the next heartbeat to be delayed. The coordinator then thinks the consumer has died, actively disconnects it, and triggers a rebalance.
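The window involved here is controlled by client-side configuration. Below is a minimal sketch of the two relevant settings; the values are examples only:

    import java.util.Properties;

    public class TimeoutConfigSketch {
        public static Properties timeouts() {
            Properties props = new Properties();
            // If the coordinator sees no heartbeat within this window, it marks the
            // consumer as dead and triggers a rebalance.
            props.put("session.timeout.ms", "30000");
            // Target interval between heartbeats; keep it well below session.timeout.ms.
            props.put("heartbeat.interval.ms", "3000");
            return props;
        }
    }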
If the consumer thinks the coordinator has died:
it starts again from Step 1 above, rediscovers the coordinator, and then performs JoinGroup + SyncGroup.

If the coordinator thinks a consumer has died:
it starts from Step 2 above, notifying all the remaining consumers to perform JoinGroup + SyncGroup.

 

http://blog.csdn.net/chunlongyu/article/details/52791874
