KafkaProducer source code analysis

Kafka common terms

Broker : a Kafka server instance. A Kafka cluster consists of one or more brokers, which are responsible for receiving and processing client requests

Topic : the logical container that carries messages in Kafka. Every message published to Kafka belongs to a topic; topics are used to separate different business workloads

Partition : a physical concept. Each topic is made up of one or more partitions, and each partition is an ordered sequence of messages

Replica : Kafka copies the same message to multiple locations for data redundancy; each such copy is a replica. Replicas are defined at the partition level and come in two roles, leader and follower, which behave differently; each partition can be configured with multiple replicas to achieve high availability

Record : the message, the object that Kafka actually handles

Offset : the position of a message within its partition, a monotonically increasing value that never changes once assigned

Producer : an application that publishes new messages to topics

Consumer : an application that subscribes to topics and consumes messages from them

Consumer offset : records how far a consumer has progressed through a partition; every consumer maintains its own offsets

Consumer group : several consumers form a group and consume multiple partitions in parallel to achieve high throughput and availability ( the number of consumers in a group should not exceed the number of partitions, otherwise the extra consumers sit idle and waste resources )

Rebalance : when the number of consumer instances in a group changes, the partitions of the subscribed topics are automatically reassigned among the remaining consumer instances

The following figure illustrates some of the concepts above (drawn in PPT, which was quite a struggle and took a long time; recommendations for easier drawing tools are welcome)

(figure: illustration of the Kafka terms above)

Message production process

Let's start with a small KafkaProducer demo.

public static void main(String[] args) throws ExecutionException, InterruptedException {
        if (args.length != 2) {
            throw new IllegalArgumentException("usage: com.ding.KafkaProducerDemo bootstrap-servers topic-name");
        }

        Properties props = new Properties();
        // Kafka broker ip:port; separate multiple addresses with commas
        props.put("bootstrap.servers", args[0]);
        // Acknowledgement configuration
        // acks=0: the producer does not wait for any acknowledgement; weakest guarantee
        // acks=1: wait until at least the leader has written the message to its log; follower replication is not
        //         guaranteed, so if the leader crashes before the followers copy the data the message is lost
        // acks=all: the leader waits until all followers have replicated the message; strongest guarantee
        props.put("acks", "all");
        // Number of retries
        props.put("retries", 0);
        // Batch size in bytes; batching increases throughput
        props.put("batch.size", 16384);
        // How long to delay sending, letting more records accumulate into a batch
        props.put("linger.ms", 1);
        // Total memory used to buffer records waiting to be sent
        props.put("buffer.memory", 33554432);
        // Key serializer
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Value serializer
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Create the KafkaProducer; the Sender thread is started during construction
        Producer<String, String> producer = new KafkaProducer<>(props);
        for (int i = 0; i < 100; i++) {
            // Write the message into the RecordAccumulator
            Future<RecordMetadata> result = producer.send(new ProducerRecord<>(args[1], Integer.toString(i), Integer.toString(i)));
            RecordMetadata rm = result.get();
            System.out.println("topic: " + rm.topic() + ", partition: " +  rm.partition() + ", offset: " + rm.offset());
        }
        producer.close();
    }
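
The demo blocks on result.get() for every record. As a small sketch (not part of the original demo), the send inside the loop could instead be done asynchronously by passing a Callback, which the producer invokes once the broker acknowledges the record or the send ultimately fails:

producer.send(new ProducerRecord<>(args[1], Integer.toString(i), Integer.toString(i)),
        (metadata, exception) -> {
            if (exception != null) {
                // the send failed (after any configured retries)
                exception.printStackTrace();
            } else {
                System.out.println("acked, partition: " + metadata.partition() + ", offset: " + metadata.offset());
            }
        });

The callback runs on the producer's I/O thread, so it should stay lightweight.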

Instantiation

The KafkaProducer constructor carries out a series of initialization steps based on the configuration:

1. Resolve the clientId; if none is configured, an auto-incremented one of the form producer-N is generated

2. Resolve and instantiate the partitioner. You can implement your own partitioner, for example partitioning by key so that messages with the same key always go to the same partition, which is useful for guaranteeing ordering (a sketch of a custom partitioner follows this list). If no partitioner is configured, the default rule applies: when the message has a key, the partition is the hash of the key modulo the number of available partitions; when there is no key, a counter modulo the number of available partitions is used [describing the keyless case as "a random number modulo the available partitions" is not quite accurate: the counter starts at a random value but keeps incrementing afterwards, so it effectively behaves like round-robin]

3. Resolve the key and value serializers and instantiate them

4. Resolve and instantiate the interceptors (a sketch of an interceptor appears after the send-flow overview below)

5. Resolve and instantiate the RecordAccumulator, which is mainly used to buffer messages (the KafkaProducer main thread writes messages into the RecordAccumulator, while the Sender thread reads messages from it and sends them to Kafka)

6. Resolve the broker addresses

7. Create and start the Sender thread:

...
this.sender = newSender(logContext, kafkaClient, this.metadata);
this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
this.ioThread.start();
...
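
As mentioned in step 2, a custom partitioner can be plugged in via the partitioner.class property. Below is a minimal sketch (the class name and the keyless branch are made up for illustration) that reproduces the keyed part of the default rule, hashing the key and taking it modulo the number of partitions so that records with the same key always land in the same partition:

import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class OrderKeyPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // keyless records all go to partition 0 in this sketch
            return 0;
        }
        // hash of the key modulo the partition count, same idea as the default keyed rule
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

It would be registered in the demo above with props.put("partitioner.class", "com.ding.OrderKeyPartitioner") (the package is assumed to match the demo's).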

Message sending process

The entry point for sending a message is the KafkaProducer.send method; the main flow is as follows:

KafkaProducer.send
KafkaProducer.doSend
// fetch cluster metadata
KafkaProducer.waitOnMetadata 
// serialize the key and value
key/value serialize
// choose the partition
KafkaProducer.partition
// create the TopicPartition object recording the message's topic and partition
TopicPartition
// write the message into the accumulator
RecordAccumulator.append
// wake up the Sender thread
Sender.wakeup
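
One step the list above does not show: KafkaProducer.send first passes the record through any interceptors configured in step 4 of the instantiation before calling doSend (the interceptCallback seen later in the append call is how their onAcknowledgement hook gets wired in). A minimal interceptor sketch, assuming a made-up logging use case; it would be registered via the interceptor.classes property:

import java.util.Map;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class LoggingInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // runs on the main thread before the record is serialized and partitioned;
        // whatever record is returned here is what actually gets sent
        System.out.println("sending key=" + record.key());
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // runs when the broker acknowledges the record or the send fails
        if (exception != null)
            System.err.println("send failed: " + exception.getMessage());
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}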

RecordAccumulator

The RecordAccumulator is a message buffer used to cache messages, grouping them by TopicPartition.

Let's focus on the flow of RecordAccumulator.append, which adds a message to the buffer.

// track the number of threads currently performing an append
appendsInProgress.incrementAndGet();

// get, or create, the Deque (double-ended queue) for this TopicPartition
Deque<ProducerBatch> dq = getOrCreateDeque(tp);
...
private Deque<ProducerBatch> getOrCreateDeque(TopicPartition tp) {
    Deque<ProducerBatch> d = this.batches.get(tp);
    if (d != null)
        return d;
    d = new ArrayDeque<>();
    Deque<ProducerBatch> previous = this.batches.putIfAbsent(tp, d);
    if (previous == null)
        return d;
    else
        return previous;
}

// try to add the message to the buffer
// the lock guarantees that writes to the same TopicPartition stay ordered
synchronized (dq) {
    if (closed)
        throw new KafkaException("Producer closed while send in progress");
    // attempt the append
    RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
    if (appendResult != null)
        return appendResult;
}

private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, Deque<ProducerBatch> deque) {
    // take the ProducerBatch at the tail of the deque
    ProducerBatch last = deque.peekLast();
    if (last != null) {
        // got one, try to append the message to it
        FutureRecordMetadata future = last.tryAppend(timestamp, key, value, headers, callback, time.milliseconds());
        // not enough room, so null was returned
        if (future == null)
            last.closeForRecordAppends();
        else
            return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false);
    }
    // no batch available, return null
    return null;
}

public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Header[] headers, Callback callback, long now) {
    // not enough room, return null
    if (!recordsBuilder.hasRoomFor(timestamp, key, value, headers)) {
        return null;
    } else {
        // actually append the message
        Long checksum = this.recordsBuilder.append(timestamp, key, value, headers);
        ...
        FutureRecordMetadata future = ...
        // associate the future with the callback
        thunks.add(new Thunk(callback, future));
        ...
        return future;
    }
}

// we only get here when the append attempt failed (returned null); if tryAppend succeeded we already returned
// allocate memory from the BufferPool, to be used for a new ProducerBatch
buffer = free.allocate(size, maxTimeToBlock);

synchronized (dq) {
    // note: the append already failed above and memory has been allocated, so why try again?
    // because another thread may have created a ProducerBatch in the meantime, or the previous ProducerBatch
    // may have had some space freed by the Sender thread, so we try once more; if this succeeds, the
    // allocated buffer is released later in the finally block
    RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq);
    if (appendResult != null) {
        return appendResult;
    }

    // the retry failed as well, so create a new ProducerBatch
    MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
    ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, time.milliseconds());
    FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, headers, callback, time.milliseconds()));

    dq.addLast(batch);
    incomplete.add(batch);
    // set buffer to null so the finally block does not release this memory
    buffer = null;
    return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true);
}

finally {
    // if the second append attempt succeeded, release the memory that was allocated for a new ProducerBatch
    if (buffer != null)
        free.deallocate(buffer);
    appendsInProgress.decrementAndGet();
}

// write the message into the buffer
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, headers, interceptCallback, remainingWaitMs);
if (result.batchIsFull || result.newBatchCreated) {
    // the batch is full or a new ProducerBatch was created, so wake up the Sender thread
    this.sender.wakeup();
}
return result.future;

The Sender message-sending thread

Its main flow is as follows:

Sender.run
Sender.runOnce
Sender.sendProducerData
// fetch cluster metadata
Metadata.fetch
// find the nodes whose leader partitions are known and are ready to receive messages
RecordAccumulator.ready
// based on the ready nodes, take ProducerBatch entries out of each TopicPartition's Deque in the buffer
RecordAccumulator.drain
// turn the messages into produce requests, one per node
Sender.sendProduceRequests
// build the produce request for a node's messages
Sender.sendProduceRequest
KafkaClient.newClientRequest
// the calls below send the message
KafkaClient.send
NetworkClient.doSend
Selector.send
// the calls above do not actually perform I/O, they only set the send on the KafkaChannel
// poll performs the real I/O
KafkaClient.poll

Now let's trace this main flow of the Sender thread through the source code.

During instantiation, the KafkaProducer constructor starts a KafkaThread that runs the Sender:

// the KafkaProducer constructor starts the Sender
String ioThreadName = NETWORK_THREAD_PREFIX + " | " + clientId;
this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
this.ioThread.start();

// Sender->run()->runOnce()
long currentTimeMs = time.milliseconds();
// send the produced messages
long pollTimeout = sendProducerData(currentTimeMs);
// perform the real I/O
client.poll(pollTimeout, currentTimeMs);

// fetch cluster metadata
Cluster cluster = metadata.fetch();

// find the nodes whose leader partitions are known and are ready to receive messages
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// ReadyCheckResult holds the set of nodes that are ready to receive messages and whose leader partition is
// known, plus the set of topics for which no leader partition could be found
public final Set<Node> readyNodes;
public final long nextReadyCheckDelayMs;
public final Set<String> unknownLeaderTopics;

The ready method mainly iterates over the batches container (Map<TopicPartition, Deque<ProducerBatch>>) that messages were appended to above. For each TopicPartition it looks up the node hosting the leader partition in the cluster metadata; if there are messages to send but no leader node can be found, the topic is added to unknownLeaderTopics. For partitions whose leader node is known and which satisfy the sending conditions, that node is added to readyNodes.

// iterate over batches
for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
    TopicPartition part = entry.getKey();
    Deque<ProducerBatch> deque = entry.getValue();
    // look up the node hosting the leader partition in the cluster metadata
    Node leader = cluster.leaderFor(part);
    synchronized (deque) {
        if (leader == null && !deque.isEmpty()) {
            // no leader node was found but there are messages to send, so record the topic
            unknownLeaderTopics.add(part.topic());
        } else if (!readyNodes.contains(leader) && !isMuted(part, nowMs)) {
                ....
                if (sendable && !backingOff) {
                    // add the ready node
                    readyNodes.add(leader);
                } else {
                   ...
}

Back in the Sender, the returned unknownLeaderTopics are traversed: each topic is added back to the metadata, and metadata.requestUpdate is invoked to request a metadata refresh.

for (String topic : result.unknownLeaderTopics)
    this.metadata.add(topic);
...
this.metadata.requestUpdate();

The ready nodes then go through a final check: nodes whose connection is not ready are removed, judged mainly by KafkaClient.ready.

Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
    Node node = iter.next();
    // call KafkaClient.ready to check whether the connection to this node is ready
    if (!this.client.ready(node, now)) {
        // remove nodes that are not ready
        iter.remove();
        notReadyTimeout = Math.min(notReadyTimeout, this.client.pollDelayMs(node, now));
    }
}

Next, the produce requests for the messages are created.

// take each TopicPartition's Deque out of the RecordAccumulator, then take ProducerBatch entries from the head of the deque as the data to send
Map<Integer, List<ProducerBatch>> batches = this.accumulator.drain(cluster, result.readyNodes, this.maxRequestSize, now);

The messages are then encapsulated into a ClientRequest:

ClientRequest clientRequest = client.newClientRequest(nodeId, requestBuilder, now, acks != 0, requestTimeoutMs, callback);

KafkaClient.send is then called to send the message (no real I/O is performed yet); this involves the KafkaChannel. Kafka uses NIO for its network communication.

// NetworkClient.doSend method
String destination = clientRequest.destination();
RequestHeader header = clientRequest.makeHeader(request.version());
...
Send send = request.toSend(destination, header);
InFlightRequest inFlightRequest = new InFlightRequest(clientRequest,header,isInternalRequest,request,send,now);
this.inFlightRequests.add(inFlightRequest);
selector.send(send);

...

// Selector.send method
String connectionId = send.destination();
KafkaChannel channel = openOrClosingChannelOrFail(connectionId);
if (closingChannels.containsKey(connectionId)) {
    this.failedSends.add(connectionId);
} else {
    try {
        channel.setSend(send);
    ...

At this point the send is almost fully prepared; calling KafkaClient.poll performs the actual I/O.

client.poll(pollTimeout, currentTimeMs);
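
Kafka wraps java.nio in its own Selector and KafkaChannel classes, but the underlying pattern is the standard NIO one: queue the data first, then let a select loop perform the I/O. The sketch below uses raw java.nio (not Kafka's classes; the address and payload are placeholders) to illustrate the split described above, where setSend only queues the buffer and poll does the actual read/write:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class NioSendSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        // placeholder address; a real producer connects to brokers from bootstrap.servers
        boolean connected = channel.connect(new InetSocketAddress("localhost", 9092));
        // "setSend": only attach the pending buffer to the key, nothing goes on the wire yet
        ByteBuffer pending = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        channel.register(selector, connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT, pending);

        // "poll": select() plus write() is where the real I/O happens
        while (selector.select(1000) > 0) {
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isConnectable() && channel.finishConnect()) {
                    key.interestOps(SelectionKey.OP_WRITE);      // connection established, now ask to write
                } else if (key.isWritable()) {
                    ByteBuffer buf = (ByteBuffer) key.attachment();
                    channel.write(buf);                          // the actual network write
                    if (!buf.hasRemaining()) {
                        channel.close();
                        return;
                    }
                }
            }
        }
    }
}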

The figure below summarizes the Sender thread's flow.

(figure: Sender thread flow)

With the description above we have walked through the main flow of Kafka message production: the main thread writes messages into the RecordAccumulator, while the background Sender thread takes messages out of the RecordAccumulator and sends them to Kafka using NIO. The figure below sums it up.

(figure: overall producer flow: main thread writes to the RecordAccumulator, the Sender thread drains it and sends to Kafka)

Postscript

This is my first attempt at writing a source-code-analysis article for this public account. To be honest, I didn't know how to present the code: screenshots, pasting whole files and so on all felt wrong, so I settled on this approach of describing the main flow, omitting the unrelated code, and adding flowcharts.

Last week I attended a Huawei Cloud Kafka training class and briefly looked at the Kafka production and consumption code, so I decided to do a simple walkthrough. I started reading the source at noon on Sunday, 8.17; by the time I had written past midnight it still wasn't finished, so I got up early on Monday morning to complete the article. Of course this article skips over many details; I will keep digging deeper, keep daring to try, and keep improving. Keep it up!

Reference material

Huawei Cloud hands-on training

Geek Time Kafka column

Original article: juejin.im/post/5d7e1d04e51d4561cb5ddf4c