Big Data and Kafka Series: Producer Source Code Analysis (Part 3)

The previous post covered partitioning and interceptors in the Kafka producer's send path. This post picks up where it left off.

    int partition = partition(record, serializedKey, serializedValue, cluster);
    int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
    ensureValidRecordSize(serializedSize);
    tp = new TopicPartition(record.topic(), partition);
    long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
    log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
    // producer callback will make sure to call both 'callback' and interceptor callback
    Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
    RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
    if (result.batchIsFull || result.newBatchCreated) {
        log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
        this.sender.wakeup();
    }
    return result.future;

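By now the produced message has passed every checkpoint: it knows its partition, and the interceptors have run. So it can finally be sent, right? Not quite. Kafka does not send a message the moment it is produced. The message first goes into a buffer and is sent only after a configured delay or once enough data has accumulated. Every framework has its own buffering scheme; Kafka's is RecordAccumulator.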

public RecordAccumulator(int batchSize,
                             long totalSize,
                             CompressionType compression,
                             long lingerMs,
                             long retryBackoffMs,
                             Metrics metrics,
                             Time time) {
        this.drainIndex = 0;
        this.closed = false;
        this.flushesInProgress = new AtomicInteger(0);
        this.appendsInProgress = new AtomicInteger(0);
        this.batchSize = batchSize;
        this.compression = compression;
        this.lingerMs = lingerMs;
        this.retryBackoffMs = retryBackoffMs;
        this.batches = new CopyOnWriteMap<>();
        String metricGrpName = "producer-metrics";
        this.free = new BufferPool(totalSize, batchSize, metrics, time, metricGrpName);
        this.incomplete = new IncompleteRecordBatches();
        this.muted = new HashSet<>();
        this.time = time;
        registerMetrics(metrics, metricGrpName);
    }

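batchSize is the size of a RecordBatch (batch.size). totalSize comes from buffer.memory and sets the size of the BufferPool inside RecordAccumulator. compression is the compression type applied to batches in RecordAccumulator. lingerMs comes from linger.ms: the producer groups the send requests that arrive between two transmissions into one batch to improve throughput, and if the batch is still smaller than batch.size it waits up to this extra delay for more records. When messages are produced faster than they can be sent, this setting reduces load to some extent because fewer requests go out. retryBackoffMs comes from retry.backoff.ms. metrics carries the monitoring metrics. Now for the accumulator.append method: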
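To make the mapping from configuration keys to constructor parameters concrete, here is a minimal sketch that only builds the relevant properties. The values are illustrative assumptions, not recommendations, and the class name ProducerBufferConfigSketch is made up for this example; it deliberately avoids the Kafka client classes so it runs on its own.

```java
import java.util.Properties;

final class ProducerBufferConfigSketch {
    // Builds the buffering-related producer settings discussed above.
    // All values are illustrative assumptions.
    static Properties bufferingProps() {
        Properties props = new Properties();
        props.setProperty("batch.size", "16384");        // bytes per RecordBatch  -> batchSize
        props.setProperty("buffer.memory", "33554432");  // bytes in the BufferPool -> totalSize
        props.setProperty("linger.ms", "5");             // extra wait for more records -> lingerMs
        props.setProperty("compression.type", "gzip");   // -> compression
        props.setProperty("retry.backoff.ms", "100");    // -> retryBackoffMs
        return props;
    }

    public static void main(String[] args) {
        bufferingProps().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a real application these properties would be passed to the KafkaProducer constructor together with the bootstrap servers and serializers.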

    public RecordAppendResult append(TopicPartition tp,
                                     long timestamp,
                                     byte[] key,
                                     byte[] value,
                                     Callback callback,
                                     long maxTimeToBlock) throws InterruptedException {
        // We keep track of the number of appending thread to make sure we do not miss batches in
        // abortIncompleteBatches().
        appendsInProgress.incrementAndGet();
        try {
            // check if we have an in-progress batch
            Deque<RecordBatch> dq = getOrCreateDeque(tp);
            synchronized (dq) {
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");
                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null)
                    return appendResult;
            }

            // we don't have an in-progress record batch try to allocate a new batch
            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
            log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
            synchronized (dq) {
                // Need to check if producer is closed again after grabbing the dequeue lock.
                if (closed)
                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
                if (appendResult != null) {
                    // Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
                    free.deallocate(buffer);
                    return appendResult;
                }
                MemoryRecordsBuilder recordsBuilder = MemoryRecords.builder(buffer, compression, TimestampType.CREATE_TIME, this.batchSize);
                RecordBatch batch = new RecordBatch(tp, recordsBuilder, time.milliseconds());
                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);
                incomplete.add(batch);
                return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true);
            }
        } finally {
            appendsInProgress.decrementAndGet();
        }
    }

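First, append increments the appendsInProgress counter with incrementAndGet. The finally block restores it with decrementAndGet when the call exits, whether the append succeeded or threw. The counter tracks how many threads are currently appending, so that no batch is missed when the client forces a shutdown through KafkaProducer.close() and the sender calls the accumulator's abortIncompleteBatches() to drop unfinished requests and release resources.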
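The counting pattern can be sketched in isolation. This is a simplified stand-in, not Kafka's code: the class AppendCounterSketch and the fail flag are hypothetical, and the only point is that the finally block runs on both the normal and the exceptional path.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AppendCounterSketch {
    private final AtomicInteger appendsInProgress = new AtomicInteger(0);

    // Hypothetical stand-in for append(): returns the number of appenders
    // observed while inside the call, or throws when 'fail' is set.
    int append(boolean fail) {
        appendsInProgress.incrementAndGet();
        try {
            if (fail)
                throw new IllegalStateException("simulated append failure");
            return appendsInProgress.get();
        } finally {
            // Runs whether the body returned or threw.
            appendsInProgress.decrementAndGet();
        }
    }

    // What a shutdown path could poll before aborting batches.
    boolean hasAppendsInProgress() {
        return appendsInProgress.get() > 0;
    }

    public static void main(String[] args) {
        AppendCounterSketch acc = new AppendCounterSketch();
        System.out.println(acc.append(false));
        try {
            acc.append(true);
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
        System.out.println(acc.hasAppendsInProgress());
    }
}
```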

Next, it looks up the Deque<RecordBatch> that belongs to the TopicPartition built from this ProducerRecord.

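If no Deque is associated with the TopicPartition yet, an empty Deque is created and stored in batches under that TopicPartition. With the Deque in hand, append calls RecordAccumulator.tryAppend() to attempt the write. This step is synchronized on the Deque itself, so appends to the same TopicPartition can only run one at a time: while one thread is appending, any other thread writing to the same TopicPartition has to wait. This is what keeps writes to a single partition ordered inside the buffer.

Before looking at tryAppend() in detail, it helps to be clear about how RecordAccumulator, BufferPool, RecordBatch, MemoryRecords, and ByteBuffer relate to one another. Instantiating a RecordAccumulator creates a BufferPool, and the BufferPool maintains a Deque<ByteBuffer>. A RecordBatch is made up of Records for the same TopicPartition. Each RecordBatch holds a MemoryRecords object (built through the MemoryRecordsBuilder named recordsBuilder in the listing above), and underneath MemoryRecords is a ByteBuffer that serves as the message buffer. A Record therefore ends up in a ByteBuffer obtained from the BufferPool.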
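The lookup-or-create step and the per-partition lock can be sketched as follows. This is a simplified illustration under stated assumptions: partitions are keyed by plain strings rather than TopicPartition, records are strings rather than RecordBatch objects, and a ConcurrentHashMap stands in for the CopyOnWriteMap used in the constructor.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PartitionDequeSketch {
    // Keyed by "topic-partition" strings for simplicity (assumption).
    private final ConcurrentMap<String, Deque<String>> batches = new ConcurrentHashMap<>();

    // Returns the existing deque for the partition, or installs a new empty one.
    Deque<String> getOrCreateDeque(String tp) {
        Deque<String> d = batches.get(tp);
        if (d != null)
            return d;
        d = new ArrayDeque<>();
        Deque<String> previous = batches.putIfAbsent(tp, d);
        // If another thread won the race, use its deque so everyone shares one lock.
        return previous == null ? d : previous;
    }

    void append(String tp, String record) {
        Deque<String> dq = getOrCreateDeque(tp);
        synchronized (dq) { // one appender per partition at a time
            dq.addLast(record);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionDequeSketch acc = new PartitionDequeSketch();
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++)
                    acc.append("demo-0", "r" + i);
            });
            workers[t].start();
        }
        for (Thread w : workers)
            w.join();
        System.out.println(acc.getOrCreateDeque("demo-0").size()); // prints 4000
    }
}
```

Because the lock is the deque and not the whole map, threads writing to different partitions do not block each other.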

private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, Deque<RecordBatch> deque) {
        RecordBatch last = deque.peekLast();
        if (last != null) {
            FutureRecordMetadata future = last.tryAppend(timestamp, key, value, callback, time.milliseconds());
            if (future == null)
                last.close();
            else
                return new RecordAppendResult(future, deque.size() > 1 || last.isFull(), false);
        }
        return null;
    }

    // RecordBatch.tryAppend(), called by RecordAccumulator.tryAppend() above
    public FutureRecordMetadata tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, long now) {
        if (!recordsBuilder.hasRoomFor(key, value)) {
            return null;
        } else {
            long checksum = this.recordsBuilder.append(timestamp, key, value);
            this.maxRecordSize = Math.max(this.maxRecordSize, Record.recordSize(key, value));
            this.lastAppendTime = now;
            FutureRecordMetadata future = new FutureRecordMetadata(this.produceFuture, this.recordCount,
                                                                   timestamp, checksum,
                                                                   key == null ? -1 : key.length,
                                                                   value == null ? -1 : value.length);
            if (callback != null)
                thunks.add(new Thunk(callback, future));
            this.recordCount++;
            return future;
        }
    }

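When tryAppend() runs, it first peeks at the RecordBatch at the tail of the deque. If that RecordBatch is not null, it calls RecordBatch.tryAppend() to try to write the Record into the message buffer. RecordBatch.tryAppend() first checks whether there is room for the new Record. If there is not, it returns null and leaves the rest to the accumulator. Otherwise it writes the Record into the ByteBuffer (recordsBuilder.append in the listing above; older client versions did this through a Compressor) and then does the following.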

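(1) Update the RecordBatch's maxRecordSize to the larger of its current value and the total length of the Record just written. maxRecordSize feeds Kafka's metrics; the sender hands it to the corresponding Sensor.

(2) Update lastAppendTime. This field is updated after every append() and records the time of the most recent append.

(3) Construct a future of type FutureRecordMetadata, which implements the Future interface. The future is built from the relative offset of the new Record within the RecordBatch (this.recordCount in the listing), the timestamp, the Record's CRC32 checksum, the serialized sizes of the Record's key and value, and a result of type ProduceRequestResult (this.produceFuture in the listing). All Records in the same RecordBatch share one result. The sender thread uses it to signal whether the Records in the RecordBatch were committed successfully, and it holds the RecordBatch's base offset (baseOffset) and TopicPartition.

(4) Create a Thunk from the future and the callback and add it to the thunks list.

(5) Increment recordCount, which counts the Records in the RecordBatch.

(6) Return the future.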
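The check-then-write flow above can be sketched with a toy batch. This is not Kafka's record format: the 4-byte length prefixes, the class name BatchSketch, and the use of -1 in place of a null future are all assumptions made to keep the example self-contained.

```java
import java.nio.ByteBuffer;

final class BatchSketch {
    private final ByteBuffer buffer;
    private int recordCount = 0;
    private int maxRecordSize = 0;
    private long lastAppendTime = -1L;

    BatchSketch(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity);
    }

    // Toy framing (assumption): a 4-byte length prefix for the key and for the value.
    static int recordSize(byte[] key, byte[] value) {
        return 8 + (key == null ? 0 : key.length) + (value == null ? 0 : value.length);
    }

    boolean hasRoomFor(byte[] key, byte[] value) {
        return buffer.remaining() >= recordSize(key, value);
    }

    // Returns the record's relative offset in the batch, or -1 when there is no room
    // (the listing returns a null future in that case).
    int tryAppend(byte[] key, byte[] value, long now) {
        if (!hasRoomFor(key, value))
            return -1;
        putBytes(key);
        putBytes(value);
        maxRecordSize = Math.max(maxRecordSize, recordSize(key, value)); // step (1)
        lastAppendTime = now;                                            // step (2)
        return recordCount++;                                            // steps (3) and (5)
    }

    private void putBytes(byte[] bytes) {
        if (bytes == null) {
            buffer.putInt(-1); // null marker, mirroring the -1 sizes in the listing
            return;
        }
        buffer.putInt(bytes.length);
        buffer.put(bytes);
    }

    int recordCount() { return recordCount; }
    int maxRecordSize() { return maxRecordSize; }
    long lastAppendTime() { return lastAppendTime; }

    public static void main(String[] args) {
        BatchSketch batch = new BatchSketch(32);
        byte[] value = new byte[10];
        System.out.println(batch.tryAppend(null, value, System.currentTimeMillis())); // prints 0
        System.out.println(batch.tryAppend(null, value, System.currentTimeMillis())); // prints -1
    }
}
```

The second append fails because each record needs 18 bytes in this toy framing and only 14 of the 32 remain after the first one.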

After this, if future is null, meaning the RecordBatch has no room for another Record, the RecordBatch is closed. Otherwise a RecordAppendResult is built from the future. Besides the future, its constructor takes two booleans, batchIsFull and newBatchCreated, which indicate whether the RecordBatch is full and whether the current RecordBatch was newly created.

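batchIsFull is set to true when the deque holding the RecordBatch has more than one element, or when the current RecordBatch can no longer be written to (writable is false, or the buffer's write limit writeLimit is no greater than the size estimated by the compressor).

If the deque has no RecordBatch for the TopicPartition, or the RecordBatch has no room for the new Record, append first compares the space the Record needs with batchSize and requests the larger of the two from the BufferPool. By the time the buffer arrives, another thread writing to the same TopicPartition may have created a RecordBatch, or the sender may have processed some Records and freed space, so there may now be room for the new Record. To be safe, append calls tryAppend() again. If that write succeeds, it returns the buffer it just obtained to the BufferPool. Otherwise it creates a new RecordBatch on that buffer and writes the Record into it.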
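The try, allocate, re-check sequence can be sketched as below. It is a simplified model under stated assumptions: sizes are plain integers, the pool is reduced to a byte counter, and the names RecheckAfterAllocateSketch, Batch, and allocatedBytes are hypothetical. The allocation happens between the two synchronized blocks, as in the listing, so a slow allocation does not hold the deque lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

final class RecheckAfterAllocateSketch {
    // Toy batch that only tracks how many bytes are used (assumption).
    static final class Batch {
        private final int capacity;
        private int used;
        Batch(int capacity) { this.capacity = capacity; }
        boolean tryAppend(int recordSize) {
            if (used + recordSize > capacity)
                return false;
            used += recordSize;
            return true;
        }
    }

    private final Deque<Batch> dq = new ArrayDeque<>();
    private final AtomicLong allocatedBytes = new AtomicLong(); // stand-in for the pool
    private final int batchSize;

    RecheckAfterAllocateSketch(int batchSize) { this.batchSize = batchSize; }

    // Caller must hold the lock on dq.
    private boolean tryAppendToLast(int recordSize) {
        Batch last = dq.peekLast();
        return last != null && last.tryAppend(recordSize);
    }

    // Returns true when a new batch had to be created for this record.
    boolean append(int recordSize) {
        synchronized (dq) {
            if (tryAppendToLast(recordSize))
                return false;
        }
        int size = Math.max(batchSize, recordSize);
        allocatedBytes.addAndGet(size);          // "allocate", outside the lock
        synchronized (dq) {
            if (tryAppendToLast(recordSize)) {   // someone else made room meanwhile
                allocatedBytes.addAndGet(-size); // give the buffer back
                return false;
            }
            Batch batch = new Batch(size);
            batch.tryAppend(recordSize);
            dq.addLast(batch);
            return true;
        }
    }

    int batchCount() { synchronized (dq) { return dq.size(); } }
    long allocatedBytes() { return allocatedBytes.get(); }

    public static void main(String[] args) {
        RecheckAfterAllocateSketch acc = new RecheckAfterAllocateSketch(10);
        System.out.println(acc.append(4));  // true: first batch created
        System.out.println(acc.append(4));  // false: fits in the existing batch
        System.out.println(acc.append(50)); // true: oversized record gets its own batch
    }
}
```

Note the Math.max: a record larger than batchSize still gets a buffer, just one sized for that record alone.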

After the write, the new RecordBatch is added to the TopicPartition's deque and to the accumulator's incomplete set, and a RecordAppendResult is returned to KafkaProducer. In KafkaProducer.doSend(), if either batchIsFull or newBatchCreated on the RecordAppendResult is true, the sender thread is woken up, and the RecordAppendResult's future is returned.

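During the Record append operation, the component that actually writes the Record is the Compressor. Based on one of the four compression types supported by this version, none (no compression), gzip, snappy, and lz4, together with a ByteBufferOutputStream and a default buffer size of 1024 bytes, the Compressor instantiates a DataOutputStream. ByteBufferOutputStream extends OutputStream; its only field is a ByteBuffer, and it provides write methods on that ByteBuffer. So the Compressor ultimately writes the Record into a ByteBuffer.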
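The stream layering described here can be sketched with JDK classes only. This is an illustration, not Kafka's implementation: BufferOutputStream is a hypothetical stand-in for ByteBufferOutputStream, only the none and gzip cases are shown because snappy and lz4 need third-party libraries, and the buffer here is fixed-size, so writing past its capacity throws BufferOverflowException.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

final class ByteBufferStreamSketch {
    // Minimal OutputStream whose only state is the ByteBuffer it writes into.
    static final class BufferOutputStream extends OutputStream {
        private final ByteBuffer buffer;

        BufferOutputStream(ByteBuffer buffer) { this.buffer = buffer; }

        @Override
        public void write(int b) {
            buffer.put((byte) b);
        }

        @Override
        public void write(byte[] bytes, int off, int len) {
            buffer.put(bytes, off, len);
        }
    }

    // Wraps the buffer in a DataOutputStream, optionally through gzip
    // with a 1024-byte internal buffer.
    static DataOutputStream wrap(ByteBuffer buffer, boolean gzip) throws IOException {
        OutputStream out = new BufferOutputStream(buffer);
        if (gzip)
            out = new GZIPOutputStream(out, 1024);
        return new DataOutputStream(out);
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer plain = ByteBuffer.allocate(64);
        try (DataOutputStream out = wrap(plain, false)) {
            out.writeInt(7);
            out.writeUTF("hi");
        }
        System.out.println("uncompressed bytes written: " + plain.position());

        ByteBuffer zipped = ByteBuffer.allocate(256);
        try (DataOutputStream out = wrap(zipped, true)) {
            out.writeInt(7);
            out.writeUTF("hi");
        }
        System.out.println("gzip bytes written: " + zipped.position());
    }
}
```

Whichever wrapper is chosen, every byte ends up in the ByteBuffer, which is the point the paragraph above makes about the Compressor.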
