Distributed - Message Queue Kafka: Kafka consumer message consumption and parameter configuration

1. Kafka consumer consumes messages

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class CustomConsumer {

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.65.132.2:9093");
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group-hh");

        // Create the consumer
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        // Subscribe to topic "test"
        consumer.subscribe(Collections.singletonList("test"));
        // Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : consumerRecords) {
                System.out.printf("topic = %s, partition = %d, offset = %d, key = %s, value = %s%n",
                        record.topic(), record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}

01. Create consumers


Before reading messages, you need to create a KafkaConsumer object. Creating a KafkaConsumer object is very similar to creating a KafkaProducer object - put the properties you want to pass to the consumer in the Properties object.

Properties properties = new Properties();
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.65.132.2:9093");
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group-hh");

// Create the consumer
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);

For simplicity, only the four essential properties are provided here: bootstrap.servers, key.deserializer, value.deserializer, and group.id.

① bootstrap.servers specifies the connection string used to reach the Kafka cluster, a list of broker host:port pairs.

② key.deserializer and value.deserializer are used to convert byte arrays into Java objects.

③ group.id specifies which consumer group the consumer belongs to. The default value is "". If it is left empty, an exception is thrown: Exception in thread "main" org.apache.kafka.common.errors.InvalidGroupIdException: The configured groupId is invalid. Generally, this parameter should be set to a name with some business meaning.

02. Subscribe to topics


① After creating the consumer, the next step is to start subscribing to the topic. The subscribe() method receives a list of topics as a parameter.

// Subscribe to a single topic, "test"
consumer.subscribe(Collections.singletonList("test"));
// Subscribe to multiple topics
consumer.subscribe(Arrays.asList("test", "test1"));

② You can also pass a regular expression to the subscribe() method. A regular expression can match multiple topics, so if someone creates a new topic whose name matches the expression, a rebalance is triggered and the consumer can then read messages from the new topic. This subscription method is useful if the application needs to read from multiple topics and can handle different types of data.

consumer.subscribe(Pattern.compile("test.*"));

One of the overloads of subscribe() takes a ConsumerRebalanceListener parameter, which is used to register a rebalance listener.
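
Below is a minimal sketch of that overload, reusing the consumer and the "test" topic from the example above; the listener bodies here only log the partitions and are purely illustrative (a common use of onPartitionsRevoked() is to commit offsets before the partitions are taken away):

consumer.subscribe(Collections.singletonList("test"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before the partitions are revoked, e.g. a good place to commit offsets
        System.out.println("Partitions revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after the new assignment has been received
        System.out.println("Partitions assigned: " + partitions);
    }
});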

③ Besides subscribing to topics with the KafkaConsumer.subscribe() method, a consumer can also subscribe directly to specific partitions of particular topics. For this, KafkaConsumer provides the assign() method, which takes a single parameter, partitions, specifying the set of partitions to subscribe to.

public class KafkaConsumer<K, V> implements Consumer<K, V> {

    @Override
    public void assign(Collection<TopicPartition> partitions) {
        // ...
    }

    @Override
    public List<PartitionInfo> partitionsFor(String topic) {
        return partitionsFor(topic, Duration.ofMillis(defaultApiTimeoutMs));
    }
}

public final class TopicPartition implements Serializable {

    private static final long serialVersionUID = -613627415771699627L;

    private int hash = 0;
    private final int partition;
    private final String topic;

    public TopicPartition(String topic, int partition) {
        this.partition = partition;
        this.topic = topic;
    }

    public int partition() {
        return partition;
    }

    public String topic() {
        return topic;
    }
    // ...
}

The TopicPartition class has only two attributes: topic and partition, which respectively represent the topic to which the partition belongs and its own partition number. This class can be mapped to what we usually call the concept of topic-partition.

// Subscribe to partition 2 of topic "test"
consumer.assign(Collections.singletonList(new TopicPartition("test", 2)));

What if we don't know in advance how many partitions the topic has? The partitionsFor() method of KafkaConsumer can be used to query the metadata of the specified topic. In the PartitionInfo class, topic is the topic name, partition is the partition number, leader is the location of the partition's leader replica, replicas is the partition's AR set, inSyncReplicas is the partition's ISR set, and offlineReplicas is the partition's OSR set.

public class PartitionInfo {

    private final String topic;
    private final int partition;
    private final Node leader;
    private final Node[] replicas;
    private final Node[] inSyncReplicas;
    private final Node[] offlineReplicas;
}
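
As a sketch of how the two can be combined (assuming the consumer and the "test" topic from the earlier example), the metadata returned by partitionsFor() can be used to assign all partitions of a topic to the consumer:

List<TopicPartition> partitions = new ArrayList<>();
List<PartitionInfo> partitionInfos = consumer.partitionsFor("test");
if (partitionInfos != null) {
    // Build a TopicPartition for every partition reported in the topic metadata
    for (PartitionInfo info : partitionInfos) {
        partitions.add(new TopicPartition(info.topic(), info.partition()));
    }
}
consumer.assign(partitions);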

Subscribing to topics with the subscribe() method gives the consumer automatic rebalancing: with multiple consumers, the partition-to-consumer mapping is assigned automatically according to the partition assignment strategy, and when consumers join or leave the group, the assignment is adjusted automatically, providing load balancing and automatic failover. Subscribing to partitions with the assign() method provides no such automatic balancing. This can in fact be seen from the method parameters: subscribe() has overloads that take a ConsumerRebalanceListener, while assign() does not.

03. Polling to pull data

At the core of the consumer API is a simple poll() that requests data from the server.

// Consume data
while (true) {
    ConsumerRecords<String, String> consumerRecords = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> record : consumerRecords) {
        System.out.printf("topic = %s, partition = %d, offset = %d, key = %s, value = %s%n",
                          record.topic(), record.partition(), record.offset(), record.key(), record.value());
    }
}

This is an infinite loop. A consumer is actually a long-running application that requests data from Kafka by continuously polling. The consumer must continue to poll Kafka, otherwise it will be considered "dead" and the partitions it consumed will be handed over to other consumers in the group. The parameter passed to poll() is a timeout interval, which is used to control the blocking time of poll() (blocking occurs when there is no available data in the consumer buffer). If this parameter is set to 0 or data is available, poll() will return immediately, otherwise it will wait for the specified number of milliseconds. The poll() method returns a list of records. Each record in the list contains topic and partition information, the offset of the record in the partition, and the key-value pair of the record. We usually traverse this list and process the records one by one.

Polling is not just about getting data. The first time the consumer's poll() method is called, it needs to find the GroupCoordinator, join the group, and receive the partition assigned to it. If rebalancing is triggered, the entire rebalancing process will also be performed in polling, including executing related callbacks. Therefore, any errors that may occur in consumers or callbacks will eventually be converted into exceptions thrown by the poll() method.

It should be noted that if poll() is not called for more than max.poll.interval.ms, the consumer will be considered "dead" and will be expelled from the consumer group. Therefore, avoid doing anything in a polling loop that might cause unpredictable blocking.

The type of each message consumed by the consumer is ConsumerRecord, which corresponds to the message type ProducerRecord sent by the producer:

public class ConsumerRecord<K, V> {

    private final String topic;
    private final int partition;
    private final long offset;
    private final long timestamp;
    private final TimestampType timestampType;
    private final int serializedKeySize;
    private final int serializedValueSize;
    private final Headers headers;
    private final K key;
    private final V value;
    private final Optional<Integer> leaderEpoch;
}

Among these fields, topic and partition indicate the topic the message belongs to and the partition it resides in. offset is the offset of the message within that partition. timestamp is the message's timestamp, and timestampType indicates its type: CreateTime (when the message was created) or LogAppendTime (when the message was appended to the log). headers holds the message headers, and key and value are the message key and message value; the value is usually what business applications read.
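
For illustration, these extra fields can be read inside the same polling loop shown earlier; Header here is org.apache.kafka.common.header.Header, and the header value bytes are decoded as a plain string only for display:

for (ConsumerRecord<String, String> record : consumerRecords) {
    // Timestamp and its type (CreateTime or LogAppendTime)
    System.out.println(record.timestamp() + " (" + record.timestampType() + ")");
    // Message headers, if the producer set any
    for (Header header : record.headers()) {
        System.out.println(header.key() + " = " + new String(header.value()));
    }
}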

2. Kafka consumer parameter configuration

01. fetch.min.bytes

This attribute specifies the minimum number of bytes for the consumer to obtain records from the server. The default is 1 byte. When the broker receives the consumer's request to obtain data, if the amount of available data is less than the size specified by fetch.min.bytes, it will wait until there is enough available data before returning the data. This reduces the load on consumers and brokers because they don't need to transmit messages back and forth when the topic traffic is not very heavy (or during low-traffic periods of the day). If the consumer has high CPU usage when there is not much data available, or to reduce the load on the broker when there are many consumers, the value of this property can be set larger than the default value. However, it should be noted that in low throughput situations, increasing this value will increase latency.

02. fetch.max.wait.ms

By setting fetch.min.bytes, Kafka can wait until there is enough data before returning it to the consumer. fetch.max.wait.ms specifies how long the broker will wait, and defaults to 500 milliseconds. If not enough data flows into Kafka, the consumer's request for data is not satisfied immediately, resulting in a delay of up to 500 milliseconds. If you want to reduce potential latency, set this property to a smaller value. If fetch.max.wait.ms is set to 100 milliseconds and fetch.min.bytes is set to 1 MB, then after receiving the consumer's request Kafka returns as soon as it has 1 MB of data; if it does not, it returns after 100 milliseconds, whichever condition is satisfied first.
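
A minimal configuration sketch combining the two properties on the Properties object from the consumer example above; the 1 MB and 100 ms values are just the illustrative numbers from this paragraph, not recommendations:

properties.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024 * 1024); // wait for at least 1 MB of data
properties.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 100);       // but never wait longer than 100 ms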

03. fetch.max.bytes

This property specifies the maximum number of bytes of data returned by Kafka (the default is 50 MB). The consumer stores the data returned by the server in memory, so this property limits the amount of memory the consumer uses for that data. Note that records are sent to the client in batches; if a batch the broker has to send exceeds the size specified by this property, the limit is ignored, which ensures that consumers can continue to make progress. It is worth noting that the broker side has a matching configuration property, which the Kafka administrator can use to limit the maximum fetch size as well. This broker-side property can be useful because the larger the requested amount of data, the more data must be read from disk and sent over the network, which can cause resource contention and increase the broker's load.

04. max.poll.records

This attribute is used to control the number of records returned by a single call to the poll() method. You can use it to control the number of records (not the size of the records) that the application needs to process during each polling loop.
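For example, to cap each poll() at 500 records on the same Properties object as before (an illustrative value, which also happens to be the client default):

properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500); // at most 500 records per poll()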

05. max.partition.fetch.bytes

This property specifies the maximum number of bytes the server returns to the consumer from each partition (default is 1MB). When the KafkaConsumer.poll() method returns ConsumerRecords, the records returned from each partition will not exceed the bytes specified by max.partition.fetch.bytes. Note that using this property to control consumer memory usage complicates things because you have no control over how many partitions are included in the response from the broker. Therefore, in this case, it is recommended to use fetch.max.bytes instead, unless there are special needs, such as requiring a similar amount of data to be read from each partition.
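
A sketch of both limits on the same Properties object, using purely illustrative sizes; as noted above, fetch.max.bytes is usually the better knob for bounding consumer memory:

properties.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 32 * 1024 * 1024);      // at most 32 MB per fetch response
properties.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1024 * 1024); // at most 1 MB per partition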

06. session.timeout.ms and heartbeat.interval.ms

session.timeout.ms specifies how long a consumer can go without contacting the server and still be considered "alive". The default is 10 seconds. If a consumer does not send a heartbeat to the group coordinator within the time specified by session.timeout.ms, it is considered "dead", and the coordinator triggers a rebalance and assigns its partitions to other consumers in the group. session.timeout.ms is closely related to heartbeat.interval.ms.

heartbeat.interval.ms specifies how often the consumer sends heartbeats to the coordinator, and session.timeout.ms specifies how long the consumer can not send heartbeats. Therefore, we usually set these two properties at the same time. heartbeat.interval.ms must be smaller than session.timeout.ms, usually the former is 1/3 of the latter. If session.timeout.ms is 3 seconds, then heartbeat.interval.ms should be 1 second. Setting session.timeout.ms smaller than the default can detect and recover from crashes faster, but can also lead to unnecessary rebalancing. Setting session.timeout.ms larger than the default reduces unexpected rebalancing, but takes longer to detect crashes.
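
A sketch using the values from the example above (3-second session timeout, heartbeats roughly every second, i.e. about 1/3 of the timeout), again set on the consumer's Properties object:

properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 3000);    // considered dead after 3 s without heartbeats
properties.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 1000); // send a heartbeat every second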

07. max.poll.interval.ms

This property specifies how long a consumer can go without polling before it is considered "dead". As mentioned earlier, heartbeats and session timeouts are Kafka's primary mechanisms for detecting "dead" consumers and revoking their partitions. We also mentioned that heartbeats are sent by a background thread, and that thread may keep sending heartbeats even while the consumer's main thread is deadlocked and no longer reading data from its partitions. The easiest way to know whether a consumer is still processing messages is to check whether it is still requesting data. However, the time between requests is difficult to predict: it depends on the amount of data available, on how the consumer processes the data, and sometimes on the latency of other services. In applications that need time to process each record, max.poll.records can be used to limit the amount of data returned and thus bound how long the application takes before calling poll() again. But even with max.poll.records set, the interval between poll() calls remains difficult to predict, so max.poll.interval.ms acts as a safety net. It must be set large enough that a healthy consumer rarely hits it, but small enough that a problematic consumer cannot seriously impact the application. The default value is 5 minutes.
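
A sketch that makes the 5-minute default explicit and pairs it with a record cap so the time between poll() calls stays bounded; the 100-record cap is an illustrative assumption, not a recommendation:

properties.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // at most 5 minutes between poll() calls
properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);        // keep per-loop processing time short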

08. default.api.timeout.ms

If no timeout is explicitly specified when calling a consumer API, the consumer uses the value of this property. The default is 1 minute; because it is larger than the default request timeout, a retry can fit within it. The poll() method is an exception because it requires an explicit timeout.

09. request.timeout.ms

This property specifies the maximum amount of time a consumer can wait before receiving a response from the broker. If the broker does not respond within the specified time, the client will close the connection and try to reconnect. Its default value is 30 seconds. It is not recommended to set it smaller than the default value. Give the broker enough time to handle other requests before giving up, because there is little benefit in sending requests to an already overloaded broker, and disconnecting and reconnecting will only cause greater overhead.

10. auto.offset.reset

This attribute specifies what the consumer should do when reading a partition that has no offset or an invalid offset (because the consumer has been offline for a long time, the record corresponding to the offset has expired and been deleted). Its default value is latest, which means that if there is no valid offset, the consumer will start reading from the latest record (the record written to Kafka after the consumer started). Another value is earliest, which means that if there is no valid offset, the consumer will read records starting from the starting position. If you set auto.offset.reset to none and try to read a record with an invalid offset, the consumer will throw an exception.
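
For example, to start from the beginning of the partition whenever no valid offset exists:

properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // or "latest" (default) / "none"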

11. partition.assignment.strategy

We know that partitions will be assigned to consumers in the group. PartitionAssignor determines which partitions should be assigned to which consumer based on the given consumers and the topics they subscribe to. Kafka provides several default allocation strategies.

① Range

This strategy assigns several contiguous partitions of each topic to consumers. Assume that consumer C1 and consumer C2 subscribe to both topic T1 and topic T2, and each topic has 3 partitions. Then consumer C1 may be assigned to partition 0 and partition 1 of these two topics, and consumer C2 will be assigned to partition 2 of these two topics. Because each topic has an odd number of partitions and all follow the same allocation strategy, the first consumer will be allocated more partitions than the second consumer. This occurs whenever this strategy is used and the number of partitions is not divisible by the number of consumers.

② Round robin (RoundRobin)

This strategy assigns all partitions of all subscribed topics to consumers one by one, in order. If the round-robin strategy is used to assign partitions to consumer C1 and consumer C2, consumer C1 gets partition 0 and partition 2 of topic T1 and partition 1 of topic T2, while consumer C2 gets partition 1 of topic T1 and partitions 0 and 2 of topic T2. Generally speaking, if all consumers subscribe to the same topics (which is often the case), the round-robin strategy assigns all consumers the same number of partitions (or at most one fewer).

③ Sticky

The sticky partition assignor has two design goals: to distribute partitions as evenly as possible, and, during a rebalance, to keep as many of the existing partition assignments as possible, reducing the overhead of moving partitions from one consumer to another. If all consumers subscribe to the same topic, the initial assignment of the sticky assignor is as balanced as that of the round-robin assignor, and subsequent reassignments stay just as balanced while moving fewer partitions. If consumers in the same group subscribe to different topics, the sticky assignor produces a more balanced assignment than the round-robin assignor.

④ Cooperative sticky

This allocation strategy is the same as the sticky allocator, except that it supports cooperative (incremental) rebalancing, during which consumers can continue to read messages from partitions that have not been reallocated.

The partition strategy can be configured through partition.assignment.strategy. The default value is org.apache.kafka.clients.consumer.RangeAssignor, which implements the range strategy. You can also change it to org.apache.kafka.clients.consumer.RoundRobinAssignor, org.apache.kafka.clients.consumer.StickyAssignor or org.apache.kafka.clients.consumer.CooperativeStickyAssignor. You can also use a custom allocation strategy. If so, you need to set partition.assignment.strategy to the name of the custom class.
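
For example, a sketch that switches the consumer to the cooperative sticky assignor mentioned above, set on the same Properties object as before:

properties.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");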

12. client.id

This attribute can be any string; the broker uses it to identify requests sent from the client, such as fetch requests. It is commonly used in logs, metrics, and quotas.

13. group.instance.id

This attribute can be any unique string; it gives the consumer static group membership, acting as a fixed name for the consumer within its group.

14. receive.buffer.bytes and send.buffer.bytes

These two attributes specify the sizes of the TCP receive and send buffers used by the socket when reading and writing data. If they are set to -1, the operating system defaults are used. If the producer or consumer and the broker are located in different data centers, these values can be increased appropriately, because cross-data-center networks generally have higher latency and lower bandwidth.

15. offsets.retention.minutes

This is a broker-side configuration property, but it also affects consumer behavior. As long as a consumer group has active members (members that maintain their membership by sending heartbeats), the last committed offset of each partition is retained by Kafka and can be retrieved after a rebalance or a restart. However, if a consumer group loses all of its members, Kafka only retains the offsets for the time specified by this property (7 days by default). Once the offsets are deleted, even if the group becomes active again, it behaves like a brand-new group with no memory of past consumption.

Origin blog.csdn.net/qq_42764468/article/details/132276934