"Kafka Column" - 002 Detailed Explanation of Kafka Producers

1. Kafka producer workflow

When a message is sent, two threads work together: the Main thread and the Sender thread. The Main thread processes the record, determines which partition it should go to, and buffers it temporarily. The Sender thread then transmits the buffered data to the Kafka Broker.

(Figure: overview of the producer send workflow. Picture from: Shang Silicon Valley)

1. Main thread details

1) Create RecordAccumulator

The Main thread will create a container, which we will call the RecordAccumulator for the time being. Its default size is 32 MB, and it acts as a buffer. Inside the buffer there is a map, ConcurrentMap<TopicPartition, Deque<ProducerBatch>> batches, that holds the data waiting to be sent: the key points to the target partition, and the value is a double-ended queue of ProducerBatch objects holding the records to be sent.
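The total size of this buffer is controlled by the buffer.memory parameter. A minimal sketch of overriding the 32 MB default (the 64 MB value below is purely illustrative; java.util.Properties and org.apache.kafka.clients.producer.ProducerConfig are used as in the full examples in section 3):

Properties properties = new Properties();
// buffer.memory: total bytes the RecordAccumulator may use; the default is 33554432 (32 MB)
properties.setProperty(ProducerConfig.BUFFER_MEMORY_CONFIG, String.valueOf(64L * 1024 * 1024));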

(Figure: internal structure of the RecordAccumulator)

2) Process the data

Before data is stored in the buffer, it has to pass through interceptors, serializers and partitioners:

  1. Interceptors are rarely used on the producer side; we can define custom interception rules to modify or filter records (a minimal interceptor sketch follows this list)
  2. Serializers are easy to understand: the key.serializer and value.serializer parameters specify how the key and value are serialized
  3. The partitioner determines which partition the data should be sent to and appends it to the corresponding ProducerBatch
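As a hedged illustration of the interceptor hook, the sketch below implements org.apache.kafka.clients.producer.ProducerInterceptor and simply prefixes every value; the class name and prefix are made up for this example:

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Map;

public class PrefixInterceptor implements ProducerInterceptor<String, String> {
    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Runs on the Main thread before serialization and partitioning
        return new ProducerRecord<>(record.topic(), record.partition(), record.key(), "prefix-" + record.value());
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Runs when the broker acknowledges the record or the send fails
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

It is registered on the producer through the interceptor.classes parameter (ProducerConfig.INTERCEPTOR_CLASSES_CONFIG).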

Main thread workflow chart, drawn by myself:


2. Detailed explanation of Sender thread

The Sender thread pulls the data accumulated in the ProducerBatch queues and sends it to the Kafka Broker over the network (Kafka's own binary protocol on top of TCP, not HTTP). This raises several questions for the Sender thread:

  1. When to pull data?
  2. How can I confirm that the message was sent successfully?
  3. How many simultaneous requests can be made?

1) When to pull the data in the ProducerBatch?

Among the producer parameters there is batch.size, which defaults to 16 KB. It controls the size of each ProducerBatch in the double-ended queue: once the accumulated data in a batch reaches the configured value, the Sender thread pulls it and sends it off.

But what if the data never reaches the configured size? The Sender cannot hold off forever; from the user's point of view, consumers would never receive the produced data, which is unacceptable. So there is another parameter, linger.ms: if a batch still has not reached batch.size after the Sender has waited longer than linger.ms, the data is pulled anyway. The default value of linger.ms is 0 ms, which means data is pulled as soon as it arrives.
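A minimal sketch of tuning the two parameters together, set on the same Properties object that is later passed to the KafkaProducer constructor (see the full examples in section 3; the concrete values are illustrative, not recommendations):

// batch.size: pull a batch once it holds 32 KB of data
properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(32 * 1024));
// linger.ms: or pull it after waiting at most 50 ms, whichever happens first
properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "50");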

2) How do we confirm that a message was sent successfully?

In a production environment, sending a message does not always go smoothly: network jitter, a Kafka Broker going down and similar situations can all cause persistence to fail. This raises the question: under what conditions does the Producer consider a message to have been sent successfully? This is where the acks parameter comes in; it has three possible values:

  1. acks=0: the producer does not wait for any acknowledgement from the server; the record is added to the buffer and immediately considered sent
  2. acks=1: the Leader writes the record to its local log and responds without waiting for full acknowledgement from all replicas; in this case, if the Leader fails right after acknowledging the record, the record is lost (this was the client default before Kafka 3.0)
  3. acks=all: equivalent to acks=-1; the Leader waits for the full set of in-sync replicas to acknowledge the record, which guarantees that the record is not lost as long as at least one in-sync replica stays alive; this is the strongest guarantee and the default since Kafka 3.0

If you are worried about send failures, you can also enable the retry mechanism through the retries parameter; when a send fails, the Sender thread retries automatically. A small configuration sketch follows.
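The sketch below sets both options on the producer's Properties (the retry count of 3 is only an example):

// acks=all: wait for every in-sync replica to acknowledge before treating the send as successful
properties.setProperty(ProducerConfig.ACKS_CONFIG, "all");
// retries: number of automatic retries the Sender performs after a failed send
properties.setProperty(ProducerConfig.RETRIES_CONFIG, "3");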

3) How many requests can be in flight at the same time?

The Sender thread on the producer side caches a queue of in-flight requests; by default up to 5 unacknowledged requests per Broker connection are allowed, and this can be changed through the max.in.flight.requests.per.connection parameter.

Since Kafka 1.x, the broker can cache the metadata of the five most recent requests sent by a producer (when idempotence is enabled), so message ordering is still guaranteed as long as there are no more than five in-flight requests.
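If strict ordering matters more than throughput, one option is to allow only a single unacknowledged request per connection, as in this sketch (the value 1 is just an example):

// max.in.flight.requests.per.connection: at most one unacknowledged request, trading throughput for strict order
properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");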

Sender thread workflow chart, drawn by myself:


2. Common producer parameters

This section lists some commonly used producer configuration options; you only need to be broadly familiar with them.

| Parameter | Purpose |
| --- | --- |
| key.serializer / value.serializer | Serializer classes for the key and value of sent messages; the fully qualified class name must be given |
| buffer.memory | Total size of the RecordAccumulator buffer; default 32 MB |
| batch.size | Size of each ProducerBatch batch in the buffer; default 16 KB |
| linger.ms | If a batch has not reached batch.size, the Sender sends the data after waiting linger.ms; default 0 ms (no delay); production environments usually set around 50 ms |
| acks | 0: no acknowledgement is awaited before the data is considered sent; 1: the Leader responds once it has received the data; -1/all (default): the Leader responds only after all nodes in the ISR have received the data |
| max.in.flight.requests.per.connection | Number of requests the Sender thread keeps in flight, i.e. requests allowed without an ack; default 5 |
| retries | Number of retries after a send fails; if ordering must be preserved, set the number of in-flight requests to 1, otherwise later messages may be sent successfully first |
| retry.backoff.ms | Interval between two retries; default 100 ms |
| compression.type | Whether the producer compresses data when sending; default none; gzip, snappy, lz4 and zstd are supported; production environments generally use snappy |

3. Example: sending messages to Kafka with the API

Versions:

  • Kafka 3.2
  • kafka-clients 3.2

Add the dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.2.0</version>
</dependency>

1. Synchronous send

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class BaseProducer {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        // Configure the producer
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // Send the value "hello" to the given topic synchronously; get() blocks until the broker responds
        RecordMetadata hello = producer.send(new ProducerRecord<>("topic-test", "hello")).get();
        System.out.println(hello);
        producer.close();
    }
}

2. Asynchronous send with a callback

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class BaseProducer {
    public static void main(String[] args) {
        // Configure the producer
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(properties);

        // Send the value "hello" to the given topic asynchronously; the callback runs when the broker responds
        producer.send(new ProducerRecord<>("topic-test", "hello"), new Callback() {
            @Override
            public void onCompletion(RecordMetadata metadata, Exception exception) {
                if (exception != null) {
                    System.out.println("Send failed, reason: " + exception.getMessage());
                } else {
                    System.out.println("Send succeeded, topic=" + metadata.topic()
                            + ", partition=" + metadata.partition()
                            + ", offset=" + metadata.offset()
                            + ", timestamp=" + metadata.timestamp());
                }
            }
        });
        producer.close();
    }
}

4. Producer partitioning strategy

**Why does Kafka use partitions?** Partitioning has two benefits:

  1. Better use of storage resources: a topic can have multiple partitions, and partitions can be spread across different brokers, so massive amounts of data are split into pieces and stored on different servers; by controlling how partitions are assigned, the cluster can balance its load
  2. Higher parallelism: the producer can send data to a specific partition and the consumer can consume from a specific partition, which gives an effect similar to multi-threading

1. Default partitioner

The default partitioner, DefaultPartitioner, implements the default partitioning strategy, and the comments describing it can be found in its source code.

(Figure: DefaultPartitioner comments in the Kafka source code)

Translated, the comments say:

  1. A partition is specified: the data is written directly to that partition
  2. No partition, but a key is present: the partition is chosen as hash(key) % number of partitions of the topic
  3. No partition and no key: the sticky partitioner is used; a partition is picked at random and reused for as long as possible, and when that partition's ProducerBatch is full or completed, another partition is picked at random (never the one just used); a small sketch of the three cases follows this list
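The three cases map onto the different ProducerRecord constructors; a minimal sketch (the topic name, key and values are made up):

// 1. Partition specified explicitly: the record goes straight to partition 0
producer.send(new ProducerRecord<>("topic-test", 0, "key-a", "with partition"));
// 2. No partition, but a key: partition = hash(key) % number of partitions of the topic
producer.send(new ProducerRecord<>("topic-test", "key-a", "with key only"));
// 3. No partition and no key: the sticky partitioner picks a partition and reuses it
producer.send(new ProducerRecord<>("topic-test", "no key, no partition"));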

2. Custom Partitioner

First of all, we have to create a partitioner: by implementing the org.apache.kafka.clients.producer.Partitioner interface and its partition() method, we can define our own partitioning rules:

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

public class MyPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // Send records whose key is "1" to partition 1, everything else to partition 0
        int partition = 0;
        if ("1".equals(key)) {
            partition = 1;
        }
        return partition;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}

When creating the producer, we can register the custom partitioner through the partitioner.class parameter:

Properties properties = new Properties();
properties.setProperty(ProducerConfig.PARTITIONER_CLASS_CONFIG, MyPartitioner.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(properties);


Origin juejin.im/post/7120511144703295502