Kafka: A Distributed Stream Processing Platform

1. Introduction

Apache Kafka is a distributed stream processing platform with the following characteristics:

  • Supports publishing and subscribing to messages, similar to message queues such as RabbitMQ and ActiveMQ;
  • Supports real-time data processing;
  • Guarantees reliable message delivery;
  • Persists messages to storage, and makes them fault tolerant by keeping multiple replicas across the cluster;
  • High throughput: a single Broker can easily handle thousands of partitions and millions of messages per second.

2. Basic concepts

2.1 Messages And Batches
The basic unit of data in Kafka is the message. To reduce network overhead and improve efficiency, multiple messages are grouped into a batch and written together.

2.2 Topics And Partitions
Kafka messages are categorized by topic. A topic can be divided into several partitions, and each partition is a commit log: messages are appended to the end of a partition and read back in first-in, first-out order. Partitions give Kafka both data redundancy and scalability. Because the partitions of a topic can be placed on different servers, a single topic can span multiple servers and deliver more performance than one server could.

Because a topic contains multiple partitions, message order is not guaranteed across the topic as a whole; it is guaranteed only within a single partition.
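
To make the append-only behavior concrete, here is a minimal sketch of a single partition as an in-memory log. PartitionLog and its methods are hypothetical names invented for this illustration, not Kafka classes; a real partition is stored on disk and replicated.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one partition; not a Kafka class. */
class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    /** Appends a message to the end of the log and returns its offset. */
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    /** Returns every message from the given offset onward, in write order. */
    List<String> readFrom(long offset) {
        return new ArrayList<>(messages.subList((int) offset, messages.size()));
    }

    public static void main(String[] args) {
        PartitionLog partition = new PartitionLog();
        partition.append("a");
        partition.append("b");
        long offset = partition.append("c");
        System.out.println("last offset = " + offset);
        System.out.println(partition.readFrom(1));
    }
}
```

Each append receives the next offset, and a read from any offset returns the messages in the order they were written, which is the ordering guarantee that holds inside one partition.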
2.3 Producers And Consumers

  1. Producers
    Producers create messages. By default, a producer spreads messages evenly across all partitions of a topic and does not care which partition a given message lands in. To send messages to a specific partition, supply a custom partitioner.

  2. Consumers
    Every consumer belongs to a consumer group and is responsible for consuming messages. A consumer can subscribe to one or more topics and reads messages in the order they were produced. Consumers keep track of which messages they have read by means of offsets. An offset is an increasing number that Kafka assigns to each message as it is written, and it is unique within a given partition. The consumer saves the last offset it read for each partition in ZooKeeper or in Kafka, so its read position is not lost if it shuts down or restarts.
    Within one consumer group, a partition is read by only one consumer, but consumers in different groups can read the same partition. Consumer groups that read the same topic do not affect one another.
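
The one-consumer-per-partition rule inside a group can be sketched as follows. This is a simplified illustration only: GroupAssignmentSketch and the consumer names are hypothetical, and Kafka's real assignment strategies are more involved than dealing partitions out in turn.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration only; Kafka's real assignors are more involved. */
class GroupAssignmentSketch {

    /** Deals partitions 0..partitionCount-1 out to the consumers in turn. */
    static Map<String, List<Integer>> assign(List<String> consumers, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            // Every partition gets exactly one owner within the group
            String owner = consumers.get(partition % consumers.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> group = new ArrayList<>();
        group.add("consumer-1");
        group.add("consumer-2");
        System.out.println(assign(group, 3));
    }
}
```

With two consumers and three partitions, one consumer owns two partitions and the other owns one; no partition is shared inside the group. A second group would compute its own independent assignment over the same partitions.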

2.4 Brokers And Clusters
An independent Kafka server is called a Broker. A Broker receives messages from producers, assigns offsets to them, and commits them to disk. It also serves consumers: it responds to requests to read partitions and returns messages that have been committed to disk.

Brokers are the building blocks of a cluster. Each cluster elects one Broker as the cluster controller, which handles administrative tasks such as assigning partitions to Brokers and monitoring Brokers.

Within a cluster, each partition is owned by one Broker, called the leader of that partition. A partition can also be assigned to multiple Brokers, in which case it is replicated. Replication gives the partition's messages redundancy: if one Broker fails, another Broker can take over as leader.

3. Using the producer

3.1 Project dependencies

This project is built with Maven. To call the Kafka producer API, import the kafka-clients dependency:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.2.0</version>
</dependency>

3.2 Create a producer

When creating a Kafka producer, the following three properties must be specified:

  • bootstrap.servers : the address list of the brokers. The list does not need to contain every broker, because the producer discovers the rest of the cluster from the brokers it is given. Listing at least two brokers is recommended for fault tolerance;
  • key.serializer : the serializer for keys;
  • value.serializer : the serializer for values.

Sample code for creating a producer:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class SimpleProducer {

    public static void main(String[] args) {

        String topicName = "Hello-Kafka";

        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop001:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        /* Create the producer */
        Producer<String, String> producer = new KafkaProducer<>(props);

        for (int i = 0; i < 10; i++) {
            ProducerRecord<String, String> record = new ProducerRecord<>(topicName, "hello" + i,
                                                                         "world" + i);
            /* Send the message */
            producer.send(record);
        }
        /* Close the producer */
        producer.close();
    }
}
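
The recommendation to list at least two brokers could look like the sketch below. It uses only java.util.Properties, so it runs without a Kafka dependency. BootstrapConfigSketch and baseConfig are hypothetical helper names, and hadoop002 is an assumed second broker that does not appear in the original setup.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not part of the Kafka API. */
class BootstrapConfigSketch {

    /** Builds the three mandatory producer properties for String keys and values. */
    static Properties baseConfig(String... brokers) {
        Properties props = new Properties();
        // bootstrap.servers takes a comma-separated list of host:port pairs
        props.put("bootstrap.servers", String.join(",", brokers));
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // hadoop002 is an assumed second broker, added only for fault tolerance
        Properties props = baseConfig("hadoop001:9092", "hadoop002:9092");
        System.out.println(props.getProperty("bootstrap.servers"));
    }
}
```

The resulting Properties object can be passed to new KafkaProducer<>(props) in place of the hand-built one in SimpleProducer.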

3.3 Testing

  1. Start Kafka
    Kafka depends on ZooKeeper, which must be started first. You can start the ZooKeeper bundled with Kafka or one you installed yourself:

# Start a standalone ZooKeeper installation
bin/zkServer.sh start

# Start the ZooKeeper bundled with Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties

Start a single-node Kafka for testing:

bin/kafka-server-start.sh config/server.properties

# Create a topic for testing
bin/kafka-topics.sh --create \
                    --bootstrap-server hadoop001:9092 \
                    --replication-factor 1 --partitions 1 \
                    --topic Hello-Kafka

# List all topics
bin/kafka-topics.sh --list --bootstrap-server hadoop001:9092

  2. Start a consumer
    Start a console consumer to observe what is written:

bin/kafka-console-consumer.sh --bootstrap-server hadoop001:9092 --topic Hello-Kafka --from-beginning

3.4 Possible problems
One problem you may run into: after the producer program starts, it hangs in a waiting state. This usually happens when Kafka is started with the default configuration. To fix it, change the listeners setting in the server.properties file:

# hadoop001 is the hostname of the machine running my Kafka service; replace it with your own hostname or IP address
listeners=PLAINTEXT://hadoop001:9092

4. Send a message

The sample program above calls send and does nothing afterwards, so there is no way to know whether the message was delivered. To learn the result of a send, use either synchronous or asynchronous sending.

4.1 Synchronous sending
The send method returns a Future<RecordMetadata>. Calling get() on it blocks until the broker responds and returns a RecordMetadata, which contains the topic, partition, and offset of the sent message. The rewritten code is as follows:

for (int i = 0; i < 10; i++) {
    try {
        ProducerRecord<String, String> record = new ProducerRecord<>(topicName, "k" + i, "world" + i);
        /* Send synchronously: get() blocks until the broker responds */
        RecordMetadata metadata = producer.send(record).get();
        System.out.printf("topic=%s, partition=%d, offset=%s \n",
                metadata.topic(), metadata.partition(), metadata.offset());
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }
}

The output looks like the following. The offsets depend on how many messages have already been written to the partition, and all records go to partition 0, because the Hello-Kafka topic was created with --partitions 1 and therefore has only one partition.

topic=Hello-Kafka, partition=0, offset=40 
topic=Hello-Kafka, partition=0, offset=41 
topic=Hello-Kafka, partition=0, offset=42 
topic=Hello-Kafka, partition=0, offset=43 
topic=Hello-Kafka, partition=0, offset=44 
topic=Hello-Kafka, partition=0, offset=45 
topic=Hello-Kafka, partition=0, offset=46 
topic=Hello-Kafka, partition=0, offset=47 
topic=Hello-Kafka, partition=0, offset=48 
topic=Hello-Kafka, partition=0, offset=49 

4.2 Asynchronous sending
Usually we care less about successful sends than about failed ones, so Kafka provides asynchronous sending with a callback. The code is as follows:

for (int i = 0; i < 10; i++) {
    ProducerRecord<String, String> record = new ProducerRecord<>(topicName, "k" + i, "world" + i);
    /* Send asynchronously and listen for the callback */
    producer.send(record, new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                System.out.println("Handle the exception here");
            } else {
                System.out.printf("topic=%s, partition=%d, offset=%s \n",
                        metadata.topic(), metadata.partition(), metadata.offset());
            }
        }
    });
}

5. Custom partitioner
Kafka has a default partitioning mechanism (described here for the 2.2.0 client used in this article; since Kafka 2.4 the default for null keys is sticky partitioning instead of round robin):

  • If the key is null, messages are distributed evenly across the partitions with a round-robin algorithm;
  • If the key is not null, Kafka hashes the key with its built-in hash algorithm and maps the result to a partition.

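
The two default rules can be sketched as below. This is an illustration under stated assumptions, not Kafka's implementation: DefaultPartitioningSketch is a hypothetical class, Arrays.hashCode stands in for the murmur2 hash that Kafka's own partitioner uses, and the round-robin counter simply starts at zero.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: Kafka hashes keys with murmur2, not Arrays.hashCode. */
class DefaultPartitioningSketch {
    private final AtomicInteger counter = new AtomicInteger(0);

    int partition(byte[] keyBytes, int numPartitions) {
        if (keyBytes == null) {
            // No key: rotate through the partitions in turn
            return Math.floorMod(counter.getAndIncrement(), numPartitions);
        }
        // Keyed: identical key bytes always map to the same partition
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        DefaultPartitioningSketch sketch = new DefaultPartitioningSketch();
        byte[] key = "user-1".getBytes(StandardCharsets.UTF_8);
        System.out.println("keyed:   " + sketch.partition(key, 3));
        System.out.println("keyed:   " + sketch.partition(key, 3));
        System.out.println("no key:  " + sketch.partition(null, 3));
        System.out.println("no key:  " + sketch.partition(null, 3));
    }
}
```

The point of the keyed branch is that all messages with the same key land in the same partition, so their relative order is preserved; messages without a key are simply spread out.
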
In some cases, you may have your own partitioning requirements, which can be implemented with a custom partitioner. Here is an example of a custom partitioner:

5.1 Custom partitioner

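import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;

/**
 * Custom partitioner
 */
public class CustomPartitioner implements Partitioner {

    private int passLine;

    @Override
    public void configure(Map<String, ?> configs) {
        /* Read the pass line from the producer configuration */
        passLine = (Integer) configs.get("pass.line");
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value,
                         byte[] valueBytes, Cluster cluster) {
        /* The key is the score: scores at or above the pass line go to partition 1, all others to partition 0 */
        return (Integer) key >= passLine ? 1 : 0;
    }

    @Override
    public void close() {
        System.out.println("The partitioner is closed");
    }
}

The partitioner, and any configuration parameters it needs, must be specified when creating the producer: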

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ProducerWithPartitioner {

    public static void main(String[] args) {

        String topicName = "Kafka-Partitioner-Test";

        Properties props = new Properties();
        props.put("bootstrap.servers", "hadoop001:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        /* Register the custom partitioner */
        props.put("partitioner.class", "com.heibaiying.producers.partitioners.CustomPartitioner");
        /* Pass the parameter the partitioner needs */
        props.put("pass.line", 6);

        Producer<Integer, String> producer = new KafkaProducer<>(props);

        for (int i = 0; i <= 10; i++) {
            String score = "score:" + i;
            ProducerRecord<Integer, String> record = new ProducerRecord<>(topicName, i, score);
            /* Send asynchronously; only read the metadata when the send succeeded */
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("%s, partition=%d, \n", score, metadata.partition());
                }
            });
        }

        producer.close();
    }
}

5.2 Test
You need to create a topic with at least two partitions:

bin/kafka-topics.sh --create \
                    --bootstrap-server hadoop001:9092 \
                    --replication-factor 1 --partitions 2 \
                    --topic Kafka-Partitioner-Test

The output is as follows. Scores greater than or equal to 6 are assigned to partition 1, and scores below 6 are assigned to partition 0.

score:6, partition=1, 
score:7, partition=1, 
score:8, partition=1, 
score:9, partition=1, 
score:10, partition=1, 
score:0, partition=0, 
score:1, partition=0, 
score:2, partition=0, 
score:3, partition=0, 
score:4, partition=0, 
score:5, partition=0, 

The partitioner is closed
6. Other producer properties
The producer created above specifies only the broker address, the key serializer, and the value serializer. Kafka's producer has many more configurable properties, including the following:

  1. acks

The acks parameter specifies how many partition replicas must receive the message before the producer considers the write successful:

  • acks=0 : the message is considered sent as soon as it leaves the producer, without waiting for any response from the server;
  • acks=1 : the producer receives a success response as soon as the partition leader receives the message;
  • acks=all : the producer receives a success response only after all in-sync replicas have received the message.

  2. buffer.memory

Sets the size of the producer's memory buffer.

  3. compression.type

By default, messages are sent uncompressed. To compress them, set this parameter to one of snappy, gzip, or lz4.

  4. retries

The number of times the producer resends a message after an error. Once this limit is reached, the producer stops retrying and returns an error.

  5. batch.size

When multiple messages are bound for the same partition, the producer puts them in the same batch. This parameter sets the amount of memory, in bytes, that one batch can use.

  6. linger.ms

How long the producer waits for more messages to join a batch before sending it.

  7. client.id

The client ID, which the server uses to identify where a message came from.

  8. max.in.flight.requests.per.connection

How many requests the producer can send before receiving a response from the server. A higher value uses more memory but also improves throughput. Setting it to 1 ensures that messages are written to the server in the order they were sent, even when retries occur.

  9. timeout.ms, request.timeout.ms & metadata.fetch.timeout.ms

  • timeout.ms : how long the broker waits for the in-sync replicas to acknowledge a message;
  • request.timeout.ms : how long the producer waits for a response from the server when sending data;
  • metadata.fetch.timeout.ms : how long the producer waits for a response from the server when fetching metadata (for example, which broker is the partition leader).

  10. max.block.ms

How long the producer may block when calling send() or when fetching metadata with partitionsFor(). These methods block when the producer's send buffer is full or when metadata is unavailable. Once the blocking time reaches max.block.ms, the producer throws a timeout exception.

  11. max.request.size

Controls the size of a request sent by the producer. It limits both the largest single message that can be sent and the total size of all messages in a single request. For example, with a value of 1000K, the largest single message is 1000K, or the producer can send one request containing a batch of 1000 messages of 1K each.

  12. receive.buffer.bytes & send.buffer.bytes

These two parameters set the sizes of the TCP socket receive and send buffers, respectively. A value of -1 means the operating system default is used.
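
As a rough illustration of how several of these properties combine, the sketch below collects a reliability-oriented set of values. The class name ReliableProducerConfigSketch and every value shown are assumptions chosen for illustration, not recommendations from the original article; suitable values depend on the workload.

```java
import java.util.Properties;

/** Hypothetical helper; the values are illustrative, not tuned recommendations. */
class ReliableProducerConfigSketch {

    static Properties reliabilityFirst() {
        Properties props = new Properties();
        // Wait until all in-sync replicas have the message
        props.put("acks", "all");
        // Resend a few times on errors before giving up
        props.put("retries", "3");
        // One in-flight request keeps ordering intact even when retries occur
        props.put("max.in.flight.requests.per.connection", "1");
        // Allow up to 16 KB per batch and wait up to 5 ms for a batch to fill
        props.put("batch.size", "16384");
        props.put("linger.ms", "5");
        // Trade some CPU for smaller requests
        props.put("compression.type", "lz4");
        return props;
    }

    public static void main(String[] args) {
        reliabilityFirst().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These entries would be added to the same Properties object that carries bootstrap.servers and the two serializers before it is passed to the KafkaProducer constructor. Lowering acks or raising max.in.flight.requests.per.connection moves the trade-off toward throughput at the cost of delivery or ordering guarantees.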

7. Consumer code example

The following three options are mandatory when creating a consumer:

  • bootstrap.servers : the address list of the brokers. The list does not need to contain every broker, because the consumer discovers the rest of the cluster from the brokers it is given. Listing at least two brokers is recommended for fault tolerance;
  • key.deserializer : the deserializer for keys;
  • value.deserializer : the deserializer for values.

In addition, you need to specify the topics to subscribe to, using either of these two APIs:

  • consumer.subscribe(Collection<String> topics) : subscribes to the given collection of topics;
  • consumer.subscribe(Pattern pattern) : subscribes to every topic whose name matches the regular expression.

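To get a feel for pattern-based subscription, the sketch below checks which topic names a regular expression would select. It uses only java.util.regex and assumes whole-name matching; TopicPatternSketch is a hypothetical helper, and Hello-World is an invented topic name added for contrast.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper that previews which topic names a pattern selects. */
class TopicPatternSketch {

    static List<String> matching(Pattern pattern, List<String> topics) {
        List<String> result = new ArrayList<>();
        for (String topic : topics) {
            // matches() requires the whole topic name to fit the pattern
            if (pattern.matcher(topic).matches()) {
                result.add(topic);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hello-World is an invented topic used only for this example
        List<String> topics = Arrays.asList("Hello-Kafka", "Hello-World", "Kafka-Partitioner-Test");
        System.out.println(matching(Pattern.compile("Hello-.*"), topics));
    }
}
```

The same Pattern object could then be handed to consumer.subscribe(Pattern pattern), so that topics created later with a matching name are picked up as well.
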
Finally, you only need to call the polling API (poll) periodically to request data from the server. Once a consumer has subscribed to a topic, polling handles all the details, including group coordination, partition rebalancing, heartbeats, and data fetching, so developers can focus on the data returned from the partitions and on their business logic. An example follows:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.Collections;
import java.util.Properties;

String topic = "Hello-Kafka";
String group = "group1";
Properties props = new Properties();
props.put("bootstrap.servers", "hadoop001:9092");
/* Specify the consumer group ID */
props.put("group.id", group);
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

/* Subscribe to the topic(s) */
consumer.subscribe(Collections.singletonList(topic));

try {
    while (true) {
        /* Poll for data */
        ConsumerRecords<String, String> records = consumer.poll(Duration.of(100, ChronoUnit.MILLIS));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("topic = %s,partition = %d, key = %s, value = %s, offset = %d,\n",
                    record.topic(), record.partition(), record.key(), record.value(), record.offset());
        }
    }
} finally {
    consumer.close();
}

7.1 Synchronous commit
A synchronous commit is made by calling consumer.commitSync(). With no arguments, it commits the latest offsets returned by the most recent poll.

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.of(100, ChronoUnit.MILLIS));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record);
    }
    /* Commit synchronously */
    consumer.commitSync();
}

If a commit fails, commitSync retries it, which maximizes the chance that the offsets are committed but also lowers the program's throughput. For this reason, Kafka also provides an asynchronous commit API.

7.2 Asynchronous commit
An asynchronous commit improves throughput because the program can go on to request more data without waiting for the Broker's response. The code is as follows:

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.of(100, ChronoUnit.MILLIS));
    for (ConsumerRecord<String, String> record : records) {
        System.out.println(record);
    }
    /* Commit asynchronously and define a callback */
    consumer.commitAsync(new OffsetCommitCallback() {
        @Override
        public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
            if (exception != null) {
                System.out.println("Handle the error here");
                offsets.forEach((x, y) -> System.out.printf("topic = %s,partition = %d, offset = %s \n",
                        x.topic(), x.partition(), y.offset()));
            }
        }
    });
}

The drawback of asynchronous commit is that a failed commit is not retried automatically, and in fact cannot safely be. Suppose the program issues commits for offsets 200 and 300 in quick succession. If the commit of 200 fails while the later commit of 300 succeeds, retrying 200 could overwrite 300 with the smaller offset. Synchronous commit does not have this problem, because the commit request for 300 is not issued until the server has confirmed the commit of 200. For this reason, it is sometimes necessary to combine synchronous and asynchronous commits.

Note: Although failed commits are not retried automatically, you can retry them manually. Maintain a Map<TopicPartition, Long> holding the latest offset you have submitted for each partition. When a commit fails, compare the failed offset with the latest submitted offset for the same topic and partition: if it is smaller, a commit with a larger offset has already been issued and no retry is needed; otherwise, retry manually.
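
The bookkeeping described in the note can be sketched without a Kafka dependency. CommitRetrySketch and its methods are hypothetical names, and a plain String key such as "Hello-Kafka-0" stands in for TopicPartition so that the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; a String key stands in for Kafka's TopicPartition. */
class CommitRetrySketch {
    // Highest offset submitted so far for each partition
    private final Map<String, Long> lastSubmitted = new HashMap<>();

    /** Call this whenever a commit request is issued. */
    void recordSubmit(String partition, long offset) {
        lastSubmitted.merge(partition, offset, Long::max);
    }

    /** A failed commit is worth retrying only if nothing newer has been submitted. */
    boolean shouldRetry(String partition, long failedOffset) {
        Long latest = lastSubmitted.get(partition);
        return latest == null || failedOffset >= latest;
    }

    public static void main(String[] args) {
        CommitRetrySketch tracker = new CommitRetrySketch();
        tracker.recordSubmit("Hello-Kafka-0", 200L);
        tracker.recordSubmit("Hello-Kafka-0", 300L);
        System.out.println("retry 200? " + tracker.shouldRetry("Hello-Kafka-0", 200L));
        System.out.println("retry 300? " + tracker.shouldRetry("Hello-Kafka-0", 300L));
    }
}
```

In the 200/300 scenario, the failed commit of 200 is skipped because 300 has already been submitted, while a failure of 300 itself would still be retried. In a real consumer, shouldRetry would be consulted inside the commitAsync callback.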

7.3 Combining synchronous and asynchronous commits
In the following example, asynchronous commits are used during normal polling to preserve throughput. Because the consumer is about to close at the end, a final synchronous commit is made to maximize the chance that the offsets are committed.

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.of(100, ChronoUnit.MILLIS));
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record);
        }
        // Commit asynchronously during normal polling
        consumer.commitAsync();
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        // The consumer is about to close, so commit synchronously to make sure the commit goes through
        consumer.commitSync();
    } finally {
        consumer.close();
    }
}

Origin blog.csdn.net/zouyang920/article/details/130421133