KafkaProducer Sender Thread Explained (with Detailed Flowchart)

Note: This article is based on Kafka 2.2.1.

Above "Kafka message source analysis process" has been described in detail the process KafkaProducer send method, the method simply appended to the message KafKaProducer cache, not really send a message to the broker, this paper discuss the future of Kafka Sender thread.

KafkaProducer starts a separate thread named "kafka-producer-network-thread | clientID", where clientID is the producer's client id.

1. The Sender Thread Explained

1.1 Class Diagram

(Figure: Sender class diagram)
Let us first look at what each of its attributes means:

  • KafkaClient client: the Kafka network communication client, encapsulating the network communication with the brokers.
  • RecordAccumulator accumulator: the message record accumulator; messages enter it through RecordAccumulator's append method.
  • Metadata metadata: the metadata manager, i.e. the routing information of topic partitions.
  • boolean guaranteeMessageOrder: whether message ordering must be guaranteed.
  • int maxRequestSize: the maximum request size for a single send call; the total size of a message, including the key and the serialized message body, cannot exceed this value. Set via the parameter max.request.size.
  • short acks: defines the condition under which a message counts as "committed" on the broker side; the selectable values are 0, 1, and -1 (all).
  • int retries: the number of retries.
  • Time time: time utilities.
  • boolean running: the thread state; true means running.
  • boolean forceClose: whether to force close, ignoring messages that are still being sent.
  • SenderMetrics sensors: the collector for message-sending statistics.
  • int requestTimeoutMs: the request timeout.
  • long retryBackoffMs: the time to wait before retrying after a failed request.
  • ApiVersions apiVersions: API version information.
  • TransactionManager transactionManager: the transaction manager.
  • Map<TopicPartition, List<ProducerBatch>> inFlightBatches: the in-flight batches, i.e. batches whose send is in progress.
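
Most of these attributes map directly to producer configuration. The following minimal sketch (the broker address and all values are placeholders, not recommendations) ties the attributes above to their configuration keys; note that setting max.in.flight.requests.per.connection to 1 is what turns on guaranteeMessageOrder:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;

public class SenderConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("max.request.size", "1048576");                // maxRequestSize
        props.put("acks", "1");                                  // acks: "0", "1" or "-1"/"all"
        props.put("retries", "3");                               // retries
        props.put("retry.backoff.ms", "100");                    // retryBackoffMs
        props.put("request.timeout.ms", "30000");                // requestTimeoutMs
        props.put("max.in.flight.requests.per.connection", "1"); // 1 => guaranteeMessageOrder = true
        // the constructor starts the "kafka-producer-network-thread | <client.id>" Sender thread
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send records here
        }
    }
}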

1.2 The run Method in Detail

Sender#run

public void run() {
    log.debug("Starting Kafka producer I/O thread.");
    while (running) {   
        try {
            runOnce();    // @1
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");
    while (!forceClose && (this.accumulator.hasUndrained() || this.client.inFlightRequestCount() > 0)) {    // @2
        try {
            runOnce();
        } catch (Exception e) {
            log.error("Uncaught error in kafka producer I/O thread: ", e);
        }
    }
    if (forceClose) {                                                                                                                                     // @3
        log.debug("Aborting incomplete batches due to forced shutdown");
        this.accumulator.abortIncompleteBatches();
    }
    try {
        this.client.close();                                                                                                                               // @4
    } catch (Exception e) {
        log.error("Failed to close network client", e);
    }
    log.debug("Shutdown of Kafka producer I/O thread has completed.");
}

Code @1: while the thread is in the running state, runOnce is the Sender thread's main processing method; it sends the cached messages to the brokers.
Code @2: if the Sender thread is asked to shut down and the shutdown is not forced, keep calling runOnce while the accumulator still holds undrained messages or requests are still in flight, so that the remaining messages are sent before exiting.
Code @3: if the Sender thread is force-closed, abort the incomplete batches, i.e. unfinished sends are rejected.
Code @4: close the KafkaClient, i.e. the network communication object.

Next, we discuss the implementation details of the method above.

1.2.1 The runOnce Method in Detail

Sender#runOnce

void runOnce() {
    // transactional-message logic is omitted here
    long currentTimeMs = time.milliseconds();
    long pollTimeout = sendProducerData(currentTimeMs);   // @1
    client.poll(pollTimeout, currentTimeMs);                            // @2
}

This article does not cover the implementation of transactional messages, so that part of the code is omitted.
Code @1: call sendProducerData to send the messages.
Code @2: what is the role of client.poll here?

Next, we explore these two methods in depth.

1.2.1.1 The sendProducerData Method

We now analyze its implementation step by step.

Sender#sendProducerData

Cluster cluster = metadata.fetch();
// get the list of partitions with data ready to send
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

Step 1: Based on the current time, determine which topic partitions in the buffered queues have reached the sending condition. The readiness condition is analyzed in detail in section 2.1.

Sender#sendProducerData

if (!result.unknownLeaderTopics.isEmpty()) {
    for (String topic : result.unknownLeaderTopics)
        this.metadata.add(topic);
    
    log.debug("Requesting metadata update due to unknown leader topics from the batched records: {}",
                result.unknownLeaderTopics);
    this.metadata.requestUpdate();
}

Step 2: If routing information cannot be found for some of the messages to be sent, add the affected topics to the metadata and request an update, so that the required routing information (the node hosting the partition leader) is first pulled from the broker.

Sender#sendProducerData

Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
    Node node = iter.next();
    if (!this.client.ready(node, now)) {
        iter.remove();
        notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
    }
}

Step 3: Remove the nodes that are not ready at the network level, and compute how long the earliest of them may remain in the not-ready state.

1. A node is ready at the network level when all of the following criteria hold:

  • There is no pending metadata update request.
  • The TCP connection between the producer and the broker has been established and the three-way handshake has completed.
  • If security mechanisms such as SSL or SASL are enabled, their handshake state is ready.
  • The number of in-flight requests on the connection has not reached the configured limit, 5 by default, set via max.in.flight.requests.per.connection.

2. client.pollDelayMs estimates how long the node will remain in the not-ready state; its rules are as follows:

  • If a TCP connection to the peer has been created and is in the connected state, return 0 when no throttling is in effect; if throttling is in effect, return the remaining throttle time.
  • If the TCP connection is still being established, return Long.MAX_VALUE, because the polling thread will be woken up once the connection completes.

Sender#sendProducerData

// create produce requests
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);

Step 4: Drain the message batches (ProducerBatch) to be sent for the ready partitions from the buffer, organized as a nodeId -> List<ProducerBatch> map. Note that once a ProducerBatch has been drained, no further messages can be appended to it, even if it still has free space. The drain logic is described in detail in section 2.2.

Sender#sendProducerData

addToInflightBatches(batches);
public void addToInflightBatches(Map<Integer, List<ProducerBatch>> batches) {
    for (List<ProducerBatch> batchList : batches.values()) {
        addToInflightBatches(batchList);
    }
}
private void addToInflightBatches(List<ProducerBatch> batches) {
    for (ProducerBatch batch : batches) {
        List<ProducerBatch> inflightBatchList = inFlightBatches.get(batch.topicPartition);
        if (inflightBatchList == null) {
            inflightBatchList = new ArrayList<>();
            inFlightBatches.put(batch.topicPartition, inflightBatchList);
        }
        inflightBatchList.add(batch);
    }
}

Step 5: Add the drained ProducerBatches to inFlightBatches, declared as Map<TopicPartition, List<ProducerBatch>> inFlightBatches, i.e. keyed by topic-partition and storing the drained batches. This structure holds the batches that have been sent but not yet acknowledged, giving the Sender thread a per-partition view of the send "backlog"; max.in.flight.requests.per.connection caps the number of outstanding requests, and once that cap is reached, sending on the corresponding queue is throttled.
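
Incidentally, the get-then-put bookkeeping above predates the Java 8 map API; with computeIfAbsent the same logic can be written more compactly (a behavior-equivalent sketch, not the actual Kafka code):

private void addToInflightBatches(List<ProducerBatch> batches) {
    for (ProducerBatch batch : batches) {
        // create the per-partition list on first access, then append the batch
        inFlightBatches.computeIfAbsent(batch.topicPartition, tp -> new ArrayList<>())
                       .add(batch);
    }
}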

Sender#sendProducerData

accumulator.resetNextBatchExpiryTime();
List<ProducerBatch> expiredInflightBatches = getExpiredInflightBatches(now);
List<ProducerBatch> expiredBatches = this.accumulator.expiredBatches(now);
expiredBatches.addAll(expiredInflightBatches);

Step 6: Find the expired message batches (ProducerBatch) in both inFlightBatches and the accumulator. A batch counts as expired when the difference between the current system time and the time the ProducerBatch was created exceeds 120s by default; the expiration time can be set via the parameter delivery.timeout.ms.
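
The expiry criterion itself is simple arithmetic; a distilled sketch of the test (the real check lives in ProducerBatch, and createdMs follows the source's field name):

// A batch expires once more than delivery.timeout.ms (default 120,000 ms)
// has elapsed since its creation.
static boolean hasReachedDeliveryTimeout(long deliveryTimeoutMs, long nowMs, long createdMs) {
    return deliveryTimeoutMs <= nowMs - createdMs;
}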

Sender#sendProducerData

if (!expiredBatches.isEmpty())
    log.trace("Expired {} batches in accumulator", expiredBatches.size());
for (ProducerBatch expiredBatch : expiredBatches) {
    String errorMessage = "Expiring " + expiredBatch.recordCount + " record(s) for " + expiredBatch.topicPartition
                + ":" + (now - expiredBatch.createdMs) + " ms has passed since batch creation";
    failBatch(expiredBatch, -1, NO_TIMESTAMP, new TimeoutException(errorMessage), false);
    if (transactionManager != null && expiredBatch.inRetry()) {
        // This ensures that no new batches are drained until the current in flight batches are fully resolved.
        transactionManager.markSequenceUnresolved(expiredBatch.topicPartition);
    }
}

Step 7: Process the timed-out batches: each expired batch is failed with a TimeoutException, which completes the ProduceRequestResult backing the FutureRecordMetadata returned by KafkaProducer#send, so a get() call on that future will no longer block.
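
From the caller's side, the effect of failBatch is that the future returned by KafkaProducer#send completes exceptionally instead of blocking forever. A usage sketch (the topic name is a placeholder):

import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.TimeoutException;

static void sendAndReport(KafkaProducer<String, String> producer) throws InterruptedException {
    try {
        RecordMetadata meta =
                producer.send(new ProducerRecord<>("demo-topic", "key", "value")).get();
        System.out.println("appended at offset " + meta.offset());
    } catch (ExecutionException e) {
        if (e.getCause() instanceof TimeoutException) {
            // the batch expired (delivery.timeout.ms) before it reached the broker
            System.err.println("batch expired: " + e.getCause().getMessage());
        }
    }
}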

Sender#sendProducerData

sensors.updateProduceRequestMetrics(batches);

Step 8: Update the statistics collectors. We do not analyze this in detail here; a follow-up article will take an in-depth look at Kafka Metrics.

Sender#sendProducerData

long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
pollTimeout = Math.min(pollTimeout, this.accumulator.nextExpiryTimeMs() - now);
pollTimeout = Math.max(pollTimeout, 0);
if (!result.readyNodes.isEmpty()) {
    log.trace("Nodes with data ready to send: {}", result.readyNodes);
    pollTimeout = 0;
}

Step 9: Compute the poll timeout for the next iteration: the minimum of the next ready-check delay, the not-ready timeout from Step 3, and the time until the next batch expiry, clamped below at 0. If any node already has data ready, the timeout is forced to 0 so that poll returns immediately.
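
A worked example with hypothetical values makes the computation concrete:

// Hypothetical inputs for the pollTimeout computation above.
long now = System.currentTimeMillis();
long nextReadyCheckDelayMs = 5000;          // earliest partition becomes sendable in 5s
long notReadyTimeout = 300;                 // earliest not-ready node may recover in 300ms
long nextExpiryTimeMs = now + 1200;         // earliest batch expires in 1.2s

long pollTimeout = Math.min(nextReadyCheckDelayMs, notReadyTimeout); // -> 300
pollTimeout = Math.min(pollTimeout, nextExpiryTimeMs - now);         // -> 300
pollTimeout = Math.max(pollTimeout, 0);                              // clamp negatives to 0
// and if readyNodes is non-empty, pollTimeout is forced to 0 so that
// client.poll returns immediately and the built requests are written out.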

Sender#sendProducerData

sendProduceRequests(batches, now);
private void sendProduceRequests(Map<Integer, List<ProducerBatch>> collated, long now) {
    for (Map.Entry<Integer, List<ProducerBatch>> entry : collated.entrySet())
        sendProduceRequest(now, entry.getKey(), acks, requestTimeoutMs, entry.getValue());
}

Step 10: Build one produce request per brokerId, i.e. the ProducerBatches destined for the same broker are packed together into a single request; only one produce request is sent per broker connection per drain cycle. Note that this step only builds the request and hands the batch data to NetworkClient via NetworkClient#send; it does not yet trigger a real network call.

That concludes sendProducerData. Since no real network request has been issued yet, when is it actually triggered?

We return to the runOnce method.

1.2.1.2 The NetworkClient poll Method

NetworkClient#poll

public List<ClientResponse> poll(long timeout, long now) {
    ensureActive();

    if (!abortedSends.isEmpty()) {
        // If there are aborted sends because of unsupported version exceptions or disconnects,
        // handle them immediately without waiting for Selector#poll.
        List<ClientResponse> responses = new ArrayList<>();
        handleAbortedSends(responses);
        completeResponses(responses);
        return responses;
    }

    long metadataTimeout = metadataUpdater.maybeUpdate(now);   // @1
    try {
        this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));    // @2
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }

    // process completed actions
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();            // @3
    handleCompletedSends(responses, updatedNow);
    handleCompletedReceives(responses, updatedNow);
    handleDisconnections(responses, updatedNow);
    handleConnections();
    handleInitiateApiVersionRequests(updatedNow);
    handleTimedOutRequests(responses, updatedNow);
    completeResponses(responses);                                               // @4
    return responses;
}

This article does not dig deeply into the network implementation; Kafka network communication will be covered in detail in a later article. For now, the key points:
Code @1: try to update the metadata.
Code @2: trigger the real network I/O. Selector#poll calls NIO's Selector#select and processes the channels whose read/write events are ready; when a write event is ready, the channel's queued messages are sent to the remote broker.
Code @3: collect the results: completed sends, completed receives, disconnections, API version handling, and timed-out requests.
Code @4: complete the responses in order; the result is set on the response credential returned by KafkaProducer#send and the sending client is woken up, completing one full message-send cycle.
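
For intuition about what happens underneath code @2, here is a generic java.nio polling loop; this is a simplified illustration only, not Kafka's own org.apache.kafka.common.network.Selector:

import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

// Generic NIO event loop: select the ready channels, then handle their events.
static void pollOnce(Selector selector, long timeoutMs) throws IOException {
    if (selector.select(timeoutMs) == 0)
        return;                                    // nothing became ready within the timeout
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();
        if (key.isWritable()) {
            // write queued request bytes to the broker's socket channel
        }
        if (key.isReadable()) {
            // read response bytes; Kafka hands these to handleCompletedReceives
        }
    }
}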

That concludes the Sender thread's sending flow. Next we give a flowchart, and then take a deeper look at some of the key methods in the process above.

1.2.2 Flowchart of the run Method

(Figure: flowchart of the Sender run method)
The flowchart above follows the preceding code analysis, with the key steps annotated. Reading the flowchart against the Sender class should deepen your understanding of the Sender thread.

Since most of the Sender's sending logic is implemented through calls into RecordAccumulator, we next analyze the RecordAccumulator methods involved in detail, to round out the understanding of the Sender flow.

2. RecordAccumulator Core Methods in Detail

2.1 The RecordAccumulator ready Method in Detail

Based on the cached messages, this method determines which partitions have reached the sending condition.

RecordAccumulator#ready

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    Set<String> unknownLeaderTopics = new HashSet<>();

    boolean exhausted = this.free.queued() > 0;
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {   // @1
        TopicPartition part = entry.getKey();
        Deque<ProducerBatch> deque = entry.getValue();

        Node leader = cluster.leaderFor(part);   // @2
        synchronized (deque) {
            if (leader == null && !deque.isEmpty()) {   // @3
                // This is a partition for which leader is not known, but messages are available to send.
                // Note that entries are currently not removed from batches when deque is empty.
                unknownLeaderTopics.add(part.topic());
            } else if (!readyNodes.contains(leader) && !isMuted(part, nowMs)) {    // @4
                ProducerBatch batch = deque.peekFirst();
                if (batch != null) {
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    boolean full = deque.size() > 1 || batch.isFull();
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();
                    if (sendable && !backingOff) {   // @5
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        // Note that this results in a conservative estimate since an un-sendable partition may have
                        // a leader that will later be found to have sendable data. However, this is good enough
                        // since we'll just wake up and then sleep again for the remaining time.
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);   
                    }
                }
            }
        }
    }
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}

Code @1: iterate over the producer cache ConcurrentHashMap<TopicPartition, Deque<ProducerBatch>> batches, picking out the batches that are ready to send.
Code @2: look up the leader of the partition (TopicPartition) in the producer's metadata cache; if it is not found but messages are queued for it, the topic is added to unknownLeaderTopics (code @3), so that a metadata update request is later sent to the broker to fetch the partition's routing information.
Code @4: if the leader is not already in readyNodes, the partition must additionally not be muted; isMuted relates to message ordering, which we set aside for now and revisit in a later section on ordered messages.
Code @5: this is where readiness is decided; let us first interpret the local variables (a worked example follows the list below).

  • long waitedTimeMs: how long the ProducerBatch has waited, equal to the difference between the current timestamp and lastAttemptMs; lastAttemptMs is set to the current time when the ProducerBatch is created or retried.
  • retryBackoffMs: the wait time before retrying after a failure, 100ms by default, configurable via retry.backoff.ms.
  • batch.attempts(): the batch's current retry count.
  • boolean backingOff: whether the batch is still in its backoff period, i.e. the batch is a retry (attempts > 0) and has waited less than retryBackoffMs; backingOff = true means the batch is not ready.
  • long timeToWaitMs: how long the Sender thread must wait before sending; if backingOff is true, timeToWaitMs = retryBackoffMs, otherwise it is lingerMs.
  • boolean full: whether the batch is full; true if either of the following holds:
    • the Deque<ProducerBatch> holds more than one batch, which implies the first one must already be full;
    • the first ProducerBatch is full (batch.isFull()).
  • boolean expired: whether the time waited is greater than or equal to the time to wait; much like a scheduled send, expired = true means the timer has fired and the send should proceed.
  • boolean exhausted: whether the producer's buffer pool is short of space (threads are blocked waiting for allocations, free.queued() > 0); if so, cached messages should be sent to the server immediately to free memory.
  • boolean sendable: whether the batch can be sent; true if any one of the following holds:
    • the batch is full (full = true);
    • it has waited the configured length of time (expired = true);
    • the buffer pool is exhausted and other threads need to allocate new batches (exhausted = true);
    • the producer's close method has been called (closed = true);
    • the producer's flush method has been called (flushInProgress()).
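
A worked example with hypothetical values: lingerMs = 5 ms, retryBackoffMs = 100 ms, and a fresh (never retried) batch that has waited 3 ms in a single-element deque:

long lingerMs = 5, retryBackoffMs = 100, waitedTimeMs = 3;
int attempts = 0, dequeSize = 1;
boolean batchFull = false, exhausted = false, closed = false, flushing = false;

boolean backingOff = attempts > 0 && waitedTimeMs < retryBackoffMs;      // false
long timeToWaitMs  = backingOff ? retryBackoffMs : lingerMs;             // 5
boolean full       = dequeSize > 1 || batchFull;                         // false
boolean expired    = waitedTimeMs >= timeToWaitMs;                       // false
boolean sendable   = full || expired || exhausted || closed || flushing; // false
long timeLeftMs    = Math.max(timeToWaitMs - waitedTimeMs, 0);           // 2 ms until ready
// The partition is not yet ready; the Sender will check again in at most 2 ms.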

2.2 The RecordAccumulator drain Method in Detail

RecordAccumulator#drain

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) { // @1
    if (nodes.isEmpty())
        return Collections.emptyMap();

    Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
    for (Node node : nodes) {                                                                                                                              
        List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);                      // @2
        batches.put(node.id(), ready);
    }
    return batches;
}

Code @1: first, the method's parameters:

  • Cluster cluster: cluster information.
  • Set<Node> nodes: the set of nodes that are ready.
  • int maxSize: the maximum number of bytes per request.
  • long now: the current time.

Code @2: iterate over all the ready nodes, calling the extraction method drainBatchesForOneNode for each and assembling a Map<Integer /* brokerId */, List<ProducerBatch>> batches.

Next, we focus on drainBatchesForOneNode.

RecordAccumulator#drainBatchesForOneNode

private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
    List<PartitionInfo> parts = cluster.partitionsForNode(node.id());   // @1
    List<ProducerBatch> ready = new ArrayList<>();
    int start = drainIndex = drainIndex % parts.size();                        // @2
    do {                                                                                                // @3 
        PartitionInfo part = parts.get(drainIndex);
        TopicPartition tp = new TopicPartition(part.topic(), part.partition()); 
        this.drainIndex = (this.drainIndex + 1) % parts.size();                     
            
        if (isMuted(tp, now))
            continue;

        Deque<ProducerBatch> deque = getDeque(tp);                              // @4
        if (deque == null)
            continue;

        synchronized (deque) {
            // invariant: !isMuted(tp,now) && deque != null
            ProducerBatch first = deque.peekFirst();                                         // @5
            if (first == null)
                continue;

            // first != null
            boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;   // @6
            // Only drain the batch if it is not during backoff period.
            if (backoff)                                                                                     
                continue;

            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {     // @7
                break;
            } else {
                if (shouldStopDrainBatchesForPartition(first, tp))                                  
                    break;

                // transactional-message code is omitted here; it will be studied in a later article.
                ProducerBatch batch = deque.pollFirst();                          // take the batch from the queue head
                batch.close();                                                                                            // @8
                size += batch.records().sizeInBytes();
                ready.add(batch);                                                                            

                batch.drained(now);                                                                             
            }
        }
    } while (start != drainIndex);
    return ready;
}

Code @1: obtain, by brokerId, all the partitions whose leader resides on this broker.
Code @2: initialize start. Let us first elaborate on start and drainIndex.

  • start: the partition index at which the current traversal begins.
  • drainIndex: the index one past the queue drained last time; it is kept across calls so that extraction does not always begin at partition 0, giving every partition a fair chance to be drained.

Code @3: loop over the node's partitions, extracting the accumulated data from the buffer.
Code @4: obtain the accumulated double-ended queue (Deque) from the producer's send buffer by topic + partition.
Code @5: take the first element from the head of the deque (newly appended messages go to the tail of the queue).
Code @6: if the current batch is a retry whose backoff time has not yet elapsed, skip this partition.
Code @7: if the total size of the messages already drained plus the next batch would exceed maxRequestSize, stop extracting.
Code @8: add the batch to the ready collection and close it, i.e. no further messages may be appended to this batch.
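
The drainIndex arithmetic in code @2 and @3 amounts to a round-robin scan over the node's partitions; a stripped-down, runnable illustration (partition names are made up):

import java.util.Arrays;
import java.util.List;

// Round-robin traversal in the style of drainBatchesForOneNode: start where the
// previous drain left off so that no partition is permanently starved.
public class DrainIndexDemo {
    static int drainIndex = 2;                                // left over from a previous drain

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("p0", "p1", "p2", "p3");
        int start = drainIndex = drainIndex % parts.size();   // start = 2
        do {
            System.out.println("draining " + parts.get(drainIndex));
            drainIndex = (drainIndex + 1) % parts.size();
        } while (start != drainIndex);                        // prints p2, p3, p0, p1
    }
}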

That concludes the message-draining flow. Inside NetworkClient's poll method, the Selector is driven to process the ready selection events and the drained messages are sent over the network to the broker. The concrete network implementation will be presented separately in subsequent articles.


Author: Ding Wei, author of "RocketMQ Technology Insider", RocketMQ community evangelist, maintainer of the public account "Middleware Interest Circle". He has published source-code analysis column series on Java, Java concurrency (JUC), Netty, Mycat, Dubbo, RocketMQ, Mybatis, and more. You can join his middleware Knowledge Planet to discuss high-concurrency and distributed service architectures and exchange notes on source code.



Origin juejin.im/post/5decfcf7f265da33c028a32e