Background:

To improve throughput, the Kafka producer client buffers messages before sending them.

RecordAccumulator implements this message buffer, which is what raises producer throughput.
It also decouples the KafkaProducer main thread from the Sender thread.

Design

Let's start with the overall architecture.

Overall architecture diagram

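RecordAccumulator consists of three main parts:

The batch collection, ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches: the buffer that actually holds the messages
The memory pool, BufferPool (backed by a Deque<ByteBuffer> of free buffers): allocates memory for the messages
RecordAccumulator's own business logic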

The batch collection: batches

batches is a CopyOnWriteMap. Copy-on-write suits read-heavy, write-light workloads: every update copies the current map, applies the change to the copy, and then publishes the copy as the new map.

Why is the buffer a read-heavy, write-light workload?

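Each entry in the map is a <tp, Deque<ProducerBatch>> pair. Entries are rarely added or removed, because that only happens when the set of partitions being written to grows or shrinks. Most of the time the map is not updated at all; the dominant operation is looking up the Deque for a given tp, which is a read.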

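The upside is that reads and writes never conflict even though reads take no lock, which improves throughput.
The downside is that every write copies the whole map, so memory usage is high, which is why the approach only suits read-heavy, write-light workloads.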

bufferPool

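BufferPool manages the reuse of ByteBuffer objects; in effect it implements its own memory management.

BufferPool hands ByteBuffer objects to callers, and when a caller is done with a buffer, the pool keeps it for the next thread that needs one. Reusing memory this way reduces the cost of garbage collection.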
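The reuse idea can be illustrated with a minimal sketch. SimpleBufferPool and pooledCount() are hypothetical names invented for this illustration; the sketch is single-threaded and leaves out the memory limit and the blocking that the real BufferPool has.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified pool: single-threaded, no memory limit, no blocking.
final class SimpleBufferPool {
    private final int poolableSize;
    private final Deque<ByteBuffer> free = new ArrayDeque<>();

    SimpleBufferPool(int poolableSize) {
        this.poolableSize = poolableSize;
    }

    // Reuse a pooled buffer when the request matches the pooled size; otherwise allocate a new one.
    ByteBuffer allocate(int size) {
        if (size == poolableSize && !free.isEmpty())
            return free.pollFirst();
        return ByteBuffer.allocate(size);
    }

    // Only buffers of the pooled size are kept; other sizes are left to the garbage collector.
    void deallocate(ByteBuffer buffer) {
        if (buffer.capacity() == poolableSize) {
            buffer.clear();
            free.addLast(buffer);
        }
    }

    int pooledCount() {
        return free.size();
    }

    public static void main(String[] args) {
        SimpleBufferPool pool = new SimpleBufferPool(16 * 1024);
        ByteBuffer first = pool.allocate(16 * 1024);
        pool.deallocate(first);
        ByteBuffer second = pool.allocate(16 * 1024);
        // The same object is handed out again, so no new buffer was created.
        System.out.println(first == second);
    }
}
```

Buffers of any other size are simply allocated and dropped, which is why matching the pooled size matters for reuse.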

Code analysis:

CopyOnWriteMap

Internally it holds a plain, non-thread-safe Map and wraps it with copy-on-write logic to make it thread-safe.

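The underlying Map field is declared volatile, which guarantees visibility across threads: as soon as the map reference points to a new object, other threads see it.
Reads take no lock at all, because they read an immutable snapshot that writes never touch. Any number of threads can therefore read concurrently with very high performance.
Writes go through synchronized methods such as putIfAbsent, which all share one lock and are therefore thread-safe. If the key already exists, putIfAbsent returns the existing value and does not write the new one.
Together these keep access from multiple KafkaProducer threads thread-safe.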



public class CopyOnWriteMap<K, V> implements ConcurrentMap<K, V> {

    // volatile guarantees visibility across threads

    private volatile Map<K, V> map;

    public CopyOnWriteMap() {

        this.map = Collections.emptyMap();
    }


    public CopyOnWriteMap(Map<K, V> map) {

        this.map = Collections.unmodifiableMap(map);

    }


    @Override

    public boolean containsKey(Object k) {

        return map.containsKey(k);

    }

    @Override

    public boolean containsValue(Object v) {

        return map.containsValue(v);

    }

    @Override

    public Set<java.util.Map.Entry<K, V>> entrySet() {

        return map.entrySet();

    }

    @Override

    public V get(Object k) {

        return map.get(k);
    }

    @Override

    public boolean isEmpty() {

        return map.isEmpty();

    }

    @Override

    public Set<K> keySet() {
        return map.keySet();
    }

    @Override

    public int size() {
        return map.size();
    }

    @Override

    public Collection<V> values() {
        return map.values();
    }

    @Override
    public synchronized void clear() {
        this.map = Collections.emptyMap();
    }

    // Write path: modify a copy, then publish it as the new read-only snapshot for readers.
    @Override
    public synchronized V put(K k, V v) {
        Map<K, V> copy = new HashMap<K, V>(this.map);
        V prev = copy.put(k, v);
        this.map = Collections.unmodifiableMap(copy);
        return prev;
    }

    @Override
    public synchronized void putAll(Map<? extends K, ? extends V> entries) {
        Map<K, V> copy = new HashMap<K, V>(this.map);
        copy.putAll(entries);
        this.map = Collections.unmodifiableMap(copy);
    }

    @Override
    public synchronized V remove(Object key) {
        Map<K, V> copy = new HashMap<K, V>(this.map);
        V prev = copy.remove(key);
        this.map = Collections.unmodifiableMap(copy);
        return prev;
    }


    @Override
    public synchronized V putIfAbsent(K k, V v) {
        if (!containsKey(k))
            return put(k, v);
        else
            return get(k);
    }


    @Override
    public synchronized boolean remove(Object k, Object v) {
        if (containsKey(k) && get(k).equals(v)) {
            remove(k);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public synchronized boolean replace(K k, V original, V replacement) {
        if (containsKey(k) && get(k).equals(original)) {
            put(k, replacement);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public synchronized V replace(K k, V v) {
        if (containsKey(k)) {
            return put(k, v);
        } else {
            return null;
        }
    }
}
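The read-mostly access pattern described earlier (look up the Deque, and only write when the partition is seen for the first time) can be sketched as follows. This is an illustration, not the article's code: DequeLookup is a hypothetical class, the keys are plain strings, and java.util.concurrent.ConcurrentHashMap stands in for CopyOnWriteMap since both implement ConcurrentMap.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class DequeLookup {
    // Stand-in for the batches map; keys such as "topic-0" are for illustration only.
    private final ConcurrentMap<String, Deque<String>> batches = new ConcurrentHashMap<>();

    Deque<String> getOrCreateDeque(String tp) {
        // Common case: a plain read finds the deque.
        Deque<String> d = batches.get(tp);
        if (d != null)
            return d;
        // Rare case: first message for this partition, so a write is needed.
        d = new ArrayDeque<>();
        Deque<String> previous = batches.putIfAbsent(tp, d);
        // If another thread inserted first, use its deque and drop ours.
        return previous == null ? d : previous;
    }

    public static void main(String[] args) {
        DequeLookup lookup = new DequeLookup();
        Deque<String> first = lookup.getOrCreateDeque("topic-0");
        Deque<String> second = lookup.getOrCreateDeque("topic-0");
        // Both calls return the same deque; only the first one wrote to the map.
        System.out.println(first == second);
    }
}
```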

BufferPool fields


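public class BufferPool {
    private final long totalMemory;        // total memory, 32 MB by default
    private final int poolableSize;        // size of each pooled buffer, 16 KB by default
    private final ReentrantLock lock;
    private final Deque<ByteBuffer> free;  // pooled buffers available for reuse
    private final Deque<Condition> waiters; // Conditions of the threads blocked waiting for memory
    private long nonPooledAvailableMemory; // memory available outside the pool
    // ... remaining fields, constructor, and methods omitted
}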

allocate() source

public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
    // 1. Reject requests larger than the total memory.
    if (size > this.totalMemory)
        throw new IllegalArgumentException("Attempt to allocate " + size
                                           + " bytes, but there is a hard limit of "
                                           + this.totalMemory
                                           + " on memory allocations.");

    ByteBuffer buffer = null;
    // 2. Take the lock to guarantee thread safety.
    this.lock.lock();
    if (this.closed) {
        this.lock.unlock();
        throw new KafkaException("Producer closed while allocating memory");
    }
    try {
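        // 3. Is the request exactly the pooled buffer size (16 KB by default), and is a pooled buffer free?
        if (size == poolableSize && !this.free.isEmpty())
            // If so, take a ByteBuffer straight from the pool.
            return this.free.pollFirst();
        // Total size of the pooled memory that is currently free.
        int freeListSize = freeSize() * this.poolableSize;
        // 4. Non-pooled memory plus free pooled memory is enough to satisfy the request.
        if (this.nonPooledAvailableMemory + freeListSize >= size) {
            // 5. If nonPooledAvailableMemory alone is too small, keep releasing pooled ByteBuffers
            //    from the free queue into nonPooledAvailableMemory until it covers size.
            freeUp(size);
            this.nonPooledAvailableMemory -= size;
            // Otherwise there is not enough memory right now, so the thread has to block and wait.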
        } else {
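            int accumulated = 0;
            // Create a Condition for this waiting thread.
            Condition moreMemory = this.lock.newCondition();
            try {
                // Maximum time the thread may block.
                long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
                // Register the Condition in the waiters queue.
                this.waiters.addLast(moreMemory);
                // Keep looping until enough memory has been accumulated.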
                while (accumulated < size) {
                    long startWaitNs = time.nanoseconds();
                    long timeNs;
                    boolean waitingTimeElapsed;
                    try {
                        // Not enough memory: block, with a timeout.
                        waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
                    } finally {
                        long endWaitNs = time.nanoseconds();
                        timeNs = Math.max(0L, endWaitNs - startWaitNs);
                        recordWaitTime(timeNs);
                    }

                    if (this.closed)
                        throw new KafkaException("Producer closed while allocating memory");
                    if (waitingTimeElapsed) {
                        this.metrics.sensor("buffer-exhausted-records").record();
                        throw new BufferExhaustedException("Failed to allocate memory within the configured max blocking time " + maxTimeToBlockMs + " ms.");
                    }
                    remainingTimeToBlockNs -= timeNs;
                    // Nothing accumulated yet, the request is for the pooled size, and the pool has a free ByteBuffer: take it.
                    if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
                        buffer = this.free.pollFirst();
                        accumulated = size;
                    } else {
                        // Try to top up nonPooledAvailableMemory from the pool.
                        freeUp(size - accumulated);
                        int got = (int) Math.min(size - accumulated, this.nonPooledAvailableMemory);
                        this.nonPooledAvailableMemory -= got;
                        // Running total of the memory reserved so far.
                        accumulated += got;
                    }
                }
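                // Success: reset so the finally block gives nothing back.
                accumulated = 0;
            } finally {
                this.nonPooledAvailableMemory += accumulated; // on failure, return what was reserved to nonPooledAvailableMemory
                this.waiters.remove(moreMemory); // remove this thread's Condition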
            }
        }
    } finally {
        try {
            if (!(this.nonPooledAvailableMemory == 0 && this.free.isEmpty()) && !this.waiters.isEmpty())
                this.waiters.peekFirst().signal();
        } finally {
            lock.unlock();
        }
    }
    if (buffer == null)
        // No pooled buffer was taken: allocate a new ByteBuffer from the memory reserved above.
        return safeAllocateByteBuffer(size);
    else
        // Return the pooled ByteBuffer.
        return buffer;
}
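allocate() relies on a freeUp() helper that the listing does not show. The following is a minimal sketch of what such a helper plausibly does, assuming it moves pooled buffers into the non-pooled budget until the request fits. FreeUpSketch and its constructor are hypothetical and exist only to make the example runnable.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

final class FreeUpSketch {
    final Deque<ByteBuffer> free = new ArrayDeque<>();
    long nonPooledAvailableMemory;

    // Hypothetical constructor: pre-fills the pool so the effect of freeUp() is visible.
    FreeUpSketch(int pooledBuffers, int poolableSize, long nonPooled) {
        for (int i = 0; i < pooledBuffers; i++)
            free.addLast(ByteBuffer.allocate(poolableSize));
        this.nonPooledAvailableMemory = nonPooled;
    }

    // Drop pooled buffers and credit their capacity to the non-pooled budget
    // until the budget covers size or the pool is empty.
    void freeUp(int size) {
        while (!free.isEmpty() && nonPooledAvailableMemory < size)
            nonPooledAvailableMemory += free.pollLast().capacity();
    }

    public static void main(String[] args) {
        // Four pooled 1024-byte buffers plus 512 bytes of non-pooled memory.
        FreeUpSketch pool = new FreeUpSketch(4, 1024, 512);
        pool.freeUp(2000);
        // Two buffers were released: 512 + 1024 + 1024 = 2560 bytes now cover the request.
        System.out.println(pool.free.size() + " pooled buffers left, "
                + pool.nonPooledAvailableMemory + " bytes non-pooled");
    }
}
```

Under this reading, a request larger than the pooled size is paid for by shrinking the pool, which is why oversized messages reduce how much buffer reuse is possible.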

allocate() flowchart

deallocate() source

public void deallocate(ByteBuffer buffer, int size) {
    lock.lock();
    try {
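        // If the buffer is exactly the pooled size, clear it and put it back on the free queue.
        if (size == this.poolableSize && size == buffer.capacity()) {
            buffer.clear();
            this.free.add(buffer);
        } else {
            // Otherwise just credit its size back to the non-pooled memory.
            this.nonPooledAvailableMemory += size;
        }
        // Wake up the first waiting thread, if there is one.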
        Condition moreMem = this.waiters.peekFirst();
        if (moreMem != null)
            moreMem.signal();
    } finally {
        lock.unlock();
    }
}

RecordAccumulator.append() source:


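public RecordAppendResult append(TopicPartition tp,       // target topic partition
                                 long timestamp,          // record timestamp
                                 byte[] key,              // record key
                                 byte[] value,            // record value
                                 Header[] headers,        // record headers
                                 Callback callback,       // producer callback
                                 long maxTimeToBlock,     // maximum time to block
                                 boolean abortOnNewBatch, // return without appending if a new batch would have to be created
                                 long nowMs               // current time of the call
                                  ) throws InterruptedException {

    // Count the threads that are currently appending.
    appendsInProgress.incrementAndGet();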
    ByteBuffer buffer = null;
    if (headers == null) headers = Record.EMPTY_HEADERS;
    try {
        // Part 1:
        // 1. Get the ProducerBatch deque for tp from batches, creating it if it does not exist.
        Deque<ProducerBatch> dq = getOrCreateDeque(tp);
        // 2. First lock: every thread appending to the same Deque<ProducerBatch> competes for it.
        synchronized (dq) {
            // Check whether the producer has been closed.
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            // 3. Try to append the message to the last batch in the deque.
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            // 4. If the last ProducerBatch in the deque has room, the append normally succeeds and the result is returned.
            if (appendResult != null)
                return appendResult;
        }

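        // Part 2: the last batch does not have enough room.
        // The caller asked not to create a new batch on this call.
        if (abortOnNewBatch) {
            // 5. Return to KafkaProducer.doSend(), which then calls append() a second time with abortOnNewBatch=false.
            return new RecordAppendResult(null, false, false, true);
        }
        // 6. This is the second call to append() from KafkaProducer.doSend().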
        byte maxUsableMagic = apiVersions.maxUsableProduceMagic();
        int size = Math.max(this.batchSize, AbstractRecords.estimateSizeInBytesUpperBound(maxUsableMagic, compression, key, value, headers));
        log.trace("Allocating a new {} byte message buffer for topic {} partition {} with remaining timeout {}ms", size, tp.topic(), tp.partition(), maxTimeToBlock);

        // 7. The last ProducerBatch in the deque has no room, so request a new ByteBuffer from the BufferPool.
        buffer = free.allocate(size, maxTimeToBlock);
        nowMs = time.milliseconds();
        // 8. Second lock.
        synchronized (dq) {
            if (closed)
                throw new KafkaException("Producer closed while send in progress");
            // 9. Second attempt: another thread may already have added a new batch; if so, use it and give up our own buffer.
            RecordAppendResult appendResult = tryAppend(timestamp, key, value, headers, callback, dq, nowMs);
            if (appendResult != null) {
                return appendResult;
            }
            // 10. Build a ProducerBatch on top of the ByteBuffer just obtained from the BufferPool.
            MemoryRecordsBuilder recordsBuilder = recordsBuilder(buffer, maxUsableMagic);
            ProducerBatch batch = new ProducerBatch(tp, recordsBuilder, nowMs);
            // 11. Append the message to the new batch.
            FutureRecordMetadata future = Objects.requireNonNull(batch.tryAppend(timestamp, key, value, headers,
                    callback, nowMs));
            // 12. Add the new ProducerBatch to the end of dq.
            dq.addLast(batch);
            incomplete.add(batch);
            buffer = null;
            return new RecordAppendResult(future, dq.size() > 1 || batch.isFull(), true, false);
        }
    } finally {
        if (buffer != null)
            free.deallocate(buffer);
        appendsInProgress.decrementAndGet();
    }
}

How RecordAccumulator switches batches within a Deque

The tryAppend() method looks up the last ProducerBatch in the corresponding Deque and appends the message to it.

Why lock at all?

Steps 2 and 8 take a synchronized lock on the Deque. The Deque is not a thread-safe object, so access to it has to be locked.

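Why lock twice instead of doing everything inside one synchronized block?

Suppose thread A sends a large message and has to request new space from the BufferPool, and the pool is short on memory or the request is large, so the allocation takes a long time. With a single synchronized block, thread A would wait on the BufferPool while still holding the Deque's lock. Thread B sends a small message that would fit in the remaining space of the last ProducerBatch, but because thread A has not released the lock, B has to wait too. With many threads like B, a lot of threads block unnecessarily and throughput drops. The design principle at work here is "hold locks for as short a time as possible".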

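Why retry the append after taking the lock the second time?

To avoid wasting buffer space when several threads request memory from the BufferPool concurrently. The scenario, illustrated in the figure below, is this: thread A finds that the last ProducerBatch is out of room, requests memory, creates a new ProducerBatch, and adds it to the tail of the Deque. Thread B runs concurrently with A and also adds a newly created ProducerBatch to the tail. B's batch then becomes the tail, so A's batch is no longer the tail even though its space has barely been used, and the result is memory fragmentation.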
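The two-phase locking pattern discussed above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the Kafka implementation: TwoPhaseAppend, Batch, and allocateBatch() are hypothetical, messages are reduced to byte counts, and allocation is instant rather than blocking.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TwoPhaseAppend {
    // Hypothetical fixed-capacity batch standing in for ProducerBatch.
    static final class Batch {
        final int capacity;
        int used;

        Batch(int capacity) {
            this.capacity = capacity;
        }

        boolean tryAppend(int bytes) {
            if (used + bytes > capacity)
                return false;
            used += bytes;
            return true;
        }
    }

    private final Deque<Batch> dq = new ArrayDeque<>();
    private final int batchSize;

    TwoPhaseAppend(int batchSize) {
        this.batchSize = batchSize;
    }

    // Must be called while holding the lock on dq.
    private boolean tryAppendToLast(int bytes) {
        Batch last = dq.peekLast();
        return last != null && last.tryAppend(bytes);
    }

    // Stand-in for the potentially slow BufferPool allocation; runs outside the lock.
    private Batch allocateBatch(int bytes) {
        return new Batch(Math.max(batchSize, bytes));
    }

    // Returns true if a new batch had to be created for this message.
    boolean append(int bytes) {
        synchronized (dq) {                 // first lock: fast path into the last batch
            if (tryAppendToLast(bytes))
                return false;
        }
        Batch fresh = allocateBatch(bytes); // no lock is held while allocating
        synchronized (dq) {                 // second lock: re-check before adding a batch
            if (tryAppendToLast(bytes))
                return false;               // another thread added a batch; fresh is discarded
            fresh.tryAppend(bytes);
            dq.addLast(fresh);
            return true;
        }
    }

    int batchCount() {
        synchronized (dq) {
            return dq.size();
        }
    }

    public static void main(String[] args) {
        TwoPhaseAppend acc = new TwoPhaseAppend(100);
        System.out.println(acc.append(60)); // empty deque, so a batch is created
        System.out.println(acc.append(30)); // fits into the existing batch
        System.out.println(acc.append(30)); // 90 + 30 exceeds 100, so a second batch is created
        System.out.println(acc.batchCount());
    }
}
```

The re-check inside the second synchronized block is what keeps two threads from each adding a half-empty batch to the tail.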

drain() source:

public Map<Integer, List<ProducerBatch>> drain(Cluster cluster, Set<Node> nodes, int maxSize, long now) {
    if (nodes.isEmpty())
        return Collections.emptyMap();
    Map<Integer, List<ProducerBatch>> batches = new HashMap<>();
    for (Node node : nodes) {
        List<ProducerBatch> ready = drainBatchesForOneNode(cluster, node, maxSize, now);
        batches.put(node.id(), ready);
    }
    return batches;
}

How drain() transforms the data:

drainBatchesForOneNode():

private List<ProducerBatch> drainBatchesForOneNode(Cluster cluster, Node node, int maxSize, long now) {
    int size = 0;
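    // 1. Get all partitions hosted on this node.
    List<PartitionInfo> parts = cluster.partitionsForNode(node.id());
    // Batches that will be sent to this node.
    List<ProducerBatch> ready = new ArrayList<>();
    // Resume from where the previous drain stopped instead of from 0. Always starting at 0 would
    // keep sending the first few partitions and starve the later ones.
    int start = drainIndex = drainIndex % parts.size();
    do {
        // 2. Get the partition's details.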
        PartitionInfo part = parts.get(drainIndex);
        TopicPartition tp = new TopicPartition(part.topic(), part.partition());
        this.drainIndex = (this.drainIndex + 1) % parts.size();
        if (isMuted(tp))
            continue;
        // 3. Get the Deque for this topic partition.
        Deque<ProducerBatch> deque = getDeque(tp);
        if (deque == null)
            continue;
        synchronized (deque) {
            ProducerBatch first = deque.peekFirst();
            if (first == null)
                continue;
            boolean backoff = first.attempts() > 0 && first.waitedTimeMs(now) < retryBackoffMs;
            if (backoff)
                continue;
            if (size + first.estimatedSizeInBytes() > maxSize && !ready.isEmpty()) {
                break;
            } else {
                if (shouldStopDrainBatchesForPartition(first, tp))
                    break;
                boolean isTransactional = transactionManager != null && transactionManager.isTransactional();
                ProducerIdAndEpoch producerIdAndEpoch =
                    transactionManager != null ? transactionManager.producerIdAndEpoch() : null;

                // 4. Key point: each drain takes at most one ProducerBatch per topic partition.
                ProducerBatch batch = deque.pollFirst();
                if (producerIdAndEpoch != null && !batch.hasSequence()) {
                    transactionManager.maybeUpdateProducerIdAndEpoch(batch.topicPartition);
                    batch.setProducerState(producerIdAndEpoch, transactionManager.sequenceNumber(batch.topicPartition), isTransactional);
                    transactionManager.incrementSequenceNumber(batch.topicPartition, batch.recordCount);
                    log.debug("Assigned producerId {} and producerEpoch {} to batch with base sequence " +
                            "{} being sent to partition {}", producerIdAndEpoch.producerId,
                    producerIdAndEpoch.epoch, batch.baseSequence(), tp);
                    transactionManager.addInFlightBatch(batch);
                }
                batch.close();
                size += batch.records().sizeInBytes();
                // 5. Add the batch to the ready list.
                ready.add(batch);
                batch.drained(now);
            }
        }
    } while (start != drainIndex);
    return ready;
}
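The effect of keeping drainIndex across calls can be seen in a stripped-down sketch. RoundRobinDrain is hypothetical: partitions are reduced to list indexes, each with one pending batch of a given size, and muting, backoff, and transactions are left out. It mirrors the order of operations in the listing above, where the index advances before the size check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RoundRobinDrain {
    private int drainIndex; // kept between calls so the next drain resumes here

    // sizes.get(i) is the size of the first pending batch of partition i (must be non-empty).
    // Returns the partitions whose batch was taken in this drain.
    List<Integer> drain(List<Integer> sizes, int maxSize) {
        List<Integer> ready = new ArrayList<>();
        int size = 0;
        int start = drainIndex = drainIndex % sizes.size();
        do {
            int partition = drainIndex;
            drainIndex = (drainIndex + 1) % sizes.size();
            int batchSize = sizes.get(partition);
            // Stop once the request would exceed maxSize, but always take at least one batch.
            if (size + batchSize > maxSize && !ready.isEmpty())
                break;
            size += batchSize;
            ready.add(partition);
        } while (start != drainIndex);
        return ready;
    }

    public static void main(String[] args) {
        RoundRobinDrain drainer = new RoundRobinDrain();
        List<Integer> sizes = Arrays.asList(10, 10, 10, 10);
        // With room for two batches per request, the second call does not start at partition 0 again.
        System.out.println(drainer.drain(sizes, 20));
        System.out.println(drainer.drain(sizes, 20));
    }
}
```

Without the persistent index, both calls would return partitions 0 and 1, and partitions 2 and 3 would never be drained while the first two keep producing.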

Which nodes have requests that are ready to send

ready() source

public ReadyCheckResult ready(Cluster cluster, long nowMs) {
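    // Broker nodes that are ready to be sent to.
    Set<Node> readyNodes = new HashSet<>();
    long nextReadyCheckDelayMs = Long.MAX_VALUE;
    // Topics that have a partition whose leader replica cannot be found.
    Set<String> unknownLeaderTopics = new HashSet<>();
    // Are any threads waiting for the BufferPool to free memory?
    boolean exhausted = this.free.queued() > 0;
    // 1. Iterate over every entry in batches.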
    for (Map.Entry<TopicPartition, Deque<ProducerBatch>> entry : this.batches.entrySet()) {
        Deque<ProducerBatch> deque = entry.getValue();
        synchronized (deque) {
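            // 2. Peek at the first ProducerBatch; null means the deque is empty.
            ProducerBatch batch = deque.peekFirst();
            if (batch != null) {
                TopicPartition part = entry.getKey();
                // 3. Look up the node that hosts the partition leader.
                Node leader = cluster.leaderFor(part);
                // Without a known leader the batch cannot be sent.
                if (leader == null) {
                    // The leader is missing from the metadata but there is data to send, so record the topic for handling.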
                    unknownLeaderTopics.add(part.topic());
                } else if (!readyNodes.contains(leader) && !isMuted(part)) {
                    // How long the batch has been waiting unsent.
                    long waitedTimeMs = batch.waitedTimeMs(nowMs);
                    // Backing off: the batch has been retried and has waited less than the retry backoff.
                    boolean backingOff = batch.attempts() > 0 && waitedTimeMs < retryBackoffMs;
                    long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
                    // The deque holds more than one batch, or the first batch is full.
                    boolean full = deque.size() > 1 || batch.isFull();
                    // The batch has waited at least as long as it is required to.
                    boolean expired = waitedTimeMs >= timeToWaitMs;
                    // 4. Any one of these five conditions makes the node sendable.
                    boolean sendable = full || expired || exhausted || closed || flushInProgress();
                    // Sendable and not backing off.
                    if (sendable && !backingOff) {
                        // 5. Add the leader to readyNodes.
                        readyNodes.add(leader);
                    } else {
                        long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
                        // 6. Time remaining: required wait minus time already waited.
                        nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
                    }
                }
            }
        }
    }
    return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
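The decision logic in ready() is easier to experiment with once it is pulled out into a pure function. ReadyCheck.isSendable() below is a hypothetical helper written for this illustration; it takes the values that the listing reads from the batch, the deque, and the accumulator as plain parameters.

```java
final class ReadyCheck {
    // Same boolean logic as the listing, with every input passed in explicitly.
    static boolean isSendable(int dequeSize, boolean batchFull, int attempts, long waitedTimeMs,
                              long lingerMs, long retryBackoffMs,
                              boolean exhausted, boolean closed, boolean flushInProgress) {
        boolean backingOff = attempts > 0 && waitedTimeMs < retryBackoffMs;
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
        boolean full = dequeSize > 1 || batchFull;
        boolean expired = waitedTimeMs >= timeToWaitMs;
        boolean sendable = full || expired || exhausted || closed || flushInProgress;
        return sendable && !backingOff;
    }

    public static void main(String[] args) {
        // A fresh, partly filled batch that has waited 3 ms with a 5 ms linger is not ready yet.
        System.out.println(isSendable(1, false, 0, 3, 5, 100, false, false, false));
        // The same batch once the linger time has elapsed.
        System.out.println(isSendable(1, false, 0, 5, 5, 100, false, false, false));
        // A full deque whose first batch is still inside its retry backoff window.
        System.out.println(isSendable(2, false, 1, 50, 5, 100, false, false, false));
    }
}
```

The third case shows that backoff takes priority: even a full deque is held back until the retry backoff has passed.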

Final words

I have published a booklet on Juejin that analyzes Kafka at the source-code level.

You are welcome to support it: 《Kafka 源码精讲》
