【Kafka】Kafka consumer


1. Consumption method

  • Pull mode: the consumer actively pulls data from the broker. This is the approach Kafka takes.
  • Push mode: Kafka does not use this method, because the broker would determine the sending rate, and it is difficult to match the consumption rate of every consumer. For example, if the broker pushes at 50 MB/s, consumer1 and consumer2 may not be able to keep up.

The disadvantage of the pull mode is that if Kafka has no data, the consumer may get stuck in a loop that keeps returning empty results.



1.1 Consumer workflow


  • Consumers are independent of each other, and one consumer can consume data from multiple partitions. If consumer A consumes partition 0, consumer B (in a different consumer group) can also consume partition 0.
  • Within a consumer group, each partition's data can only be consumed by one of the consumers in the group.
  • A consumer keeps track of how far it has consumed by recording its offset in the system topic __consumer_offsets.

1.2 Principle of Consumer Group

Consumer Group (CG): a group consisting of multiple consumers. The condition for forming a consumer group is that all of its consumers have the same group.id.

  • Each consumer in the group consumes data from different partitions, and a partition can only be consumed by one consumer within the group.
  • Consumer groups do not affect each other. Every consumer belongs to some consumer group; in other words, the consumer group is the logical subscriber.




1.3 Consumer group initialization process

coordinator: assists with consumer group initialization and partition assignment.

Coordinator node selection: hashcode(group.id) % 50 (50 is the default number of partitions of __consumer_offsets).

For example, if hashcode(group.id) = 1, then 1 % 50 = 1, so partition 1 of the __consumer_offsets topic is used; the broker hosting that partition acts as the coordinator (the "boss") of this consumer group. All consumers in the group also commit their offsets to that partition.
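
As a rough illustration of that selection rule, the following sketch (a standalone example, assuming the default of 50 __consumer_offsets partitions; Kafka internally uses a non-negative hash, approximated here with Math.abs) computes which partition, and therefore which broker's coordinator, a group id maps to:

public class CoordinatorPartitionSketch {

    public static void main(String[] args) {
        String groupId = "test";          // example group id
        int offsetsTopicPartitions = 50;  // default offsets.topic.num.partitions
        // The group maps to this __consumer_offsets partition; the broker hosting
        // that partition acts as the group's coordinator.
        int partition = Math.abs(groupId.hashCode()) % offsetsTopicPartitions;
        System.out.println("group \"" + groupId + "\" -> __consumer_offsets partition " + partition);
    }
}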



1.4 Detailed consumption process of consumer groups



1.5 Important Parameters of Consumers

  • bootstrap.servers: a list of host/port pairs used to establish the initial connection to the Kafka cluster.
  • key.deserializer and value.deserializer: the deserialization types for the key and value of received messages. Be sure to write the full class name.
  • group.id: marks the consumer group to which the consumer belongs.
  • enable.auto.commit: default true; the consumer periodically commits offsets to the server automatically.
  • auto.commit.interval.ms: if enable.auto.commit is true, this defines how often the consumer commits offsets to Kafka; default 5s.
  • auto.offset.reset: what to do when there is no initial offset in Kafka or the current offset no longer exists on the server (for example, the data has been deleted). earliest: automatically reset the offset to the earliest offset. latest (default): automatically reset the offset to the latest offset. none: throw an exception to the consumer if no previous offset exists for the consumer group. anything else: throw an exception to the consumer.
  • offsets.topic.num.partitions: the number of partitions of __consumer_offsets; default 50.
  • heartbeat.interval.ms: the heartbeat interval between a Kafka consumer and the coordinator; default 3s. This value must be less than session.timeout.ms and should be no higher than 1/3 of session.timeout.ms.
  • session.timeout.ms: the session timeout between a Kafka consumer and the coordinator; default 45s. If exceeded, the consumer is removed and the consumer group rebalances.
  • max.poll.interval.ms: the maximum time a consumer may take to process a batch of messages; default 5 minutes. If exceeded, the consumer is removed and the consumer group rebalances.
  • fetch.min.bytes: default 1 byte; the minimum number of bytes the consumer fetches from the server in one batch.
  • fetch.max.wait.ms: default 500ms; if the minimum number of bytes has not accumulated on the server by this time, the data is returned anyway.
  • fetch.max.bytes: default 52428800 (50 MB); the maximum number of bytes the consumer fetches from the server in one batch. If a single batch on the server is larger than this value, it can still be pulled, so this is not an absolute maximum. The batch size is also affected by message.max.bytes (broker config) and max.message.bytes (topic config).
  • max.poll.records: the maximum number of records returned by one poll; default 500.
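
In the Java client these parameters are usually set through the constants on ConsumerConfig. A minimal sketch (the broker address and the values are placeholders, not recommendations):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerConfigSketch {

    public static Properties baseConfig() {
        Properties properties = new Properties();
        // Initial connection to the cluster (placeholder address)
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Deserializers for key and value (full class names)
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Where to start when no committed offset exists
        properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Session / heartbeat / poll tuning (defaults are listed above)
        properties.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);
        properties.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000);
        properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
        return properties;
    }
}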

2. Consumer APIs

2.1 The case of independent consumers

Requirement: create an independent consumer to consume data from the topic first.


Note: the consumer group id must be configured in consumer API code. If a consumer is started from the command line without specifying a group id, a random group id is filled in automatically.

Sample code:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumer {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
        }
    }
}

2.2 Subscription partition

Requirement: Create an independent consumer to consume data from partition 0 of the first topic.


The code example is as follows:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerPartition {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Assign partition 0 of topic first
        ArrayList<TopicPartition> topicPartitions = new ArrayList<>();
        topicPartitions.add(new TopicPartition("first", 0));
        kafkaConsumer.assign(topicPartitions);

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
        }
    }
}

2.3 Consumer group case

Requirement: verify that the data of one partition of a topic can only be consumed by one consumer within a consumer group.


Copy the class from 2.1 to create CustomConsumer1 and CustomConsumer2. This gives a consumer group consisting of three consumers. Send data with the following producer code:

import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomProducerCallback {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster (bootstrap.servers)
        properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Serializer types for key and value (full class names)
        properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // 1. Create a Kafka producer
        KafkaProducer<String, String> kafkaProducer = new KafkaProducer<String, String>(properties);

        // 2. Send data with a callback
        for (int i = 0; i < 500; i++) {
            kafkaProducer.send(new ProducerRecord<>("first", "hello,world" + i), new Callback() {
                @Override
                public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                    if (e == null) {
                        System.out.println("topic: " + recordMetadata.topic() + "\tpartition: " + recordMetadata.partition());
                    }
                }
            });
        }

        // 3. Close the producer
        kafkaProducer.close();
    }
}

Run the three consumer programs; the three consoles show that each consumer consumes data from different partitions.


3. Partition allocation and rebalancing

  1. A consumer group consists of multiple consumers, and a topic consists of multiple partitions. The question is which consumer consumes which partition's data.
  2. Kafka has four mainstream partition assignment strategies:
    • Range
    • RoundRobin
    • Sticky
    • CooperativeSticky
  3. The assignment strategy can be changed with the partition.assignment.strategy parameter. The default is Range + CooperativeSticky; Kafka can use several assignment strategies at the same time (a configuration sketch follows this list).
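
As a sketch of how more than one strategy can be configured at once, the default pair can be passed as a list of assignor class names (the class names below are from the Kafka client; the surrounding class is only for illustration):

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public class AssignmentStrategyConfig {

    public static void main(String[] args) {
        Properties properties = new Properties();
        // Mirror the default: Range plus CooperativeSticky, passed as a list of assignor class names
        properties.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                Arrays.asList(
                        "org.apache.kafka.clients.consumer.RangeAssignor",
                        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor"));
        System.out.println(properties);
    }
}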



3.1 Principle of Range Allocation Strategy

Range assignment works on a per-topic basis.

  • First, the partitions of the topic are sorted by partition number, and the consumers are sorted alphabetically.
  • Suppose there are now 7 partitions and 3 consumers: the sorted partitions are 0, 1, 2, ..., 5, 6 and the sorted consumers are C0, C1, C2.
  • The number of partitions each consumer consumes is determined by partitions / consumers. If the division is not exact, the first few consumers each consume one extra partition.
    • For example, 7/3 = 2 remainder 1, so consumer C0 consumes one extra partition.
    • 8/3 = 2 remainder 2, so C0 and C1 each consume one extra partition.

Note: if only one topic is involved, C0 consuming one extra partition has little impact. But with N topics, C0 consumes one extra partition for every one of them, so it ends up with N more partitions than the other consumers, which easily leads to data skew.
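
A minimal sketch of the range arithmetic described above, with 7 partitions and 3 consumers (it reproduces only the calculation, not the real RangeAssignor):

// Each consumer gets partitions/consumers partitions; the first (partitions % consumers)
// consumers get one extra partition each.
public class RangeMathSketch {

    public static void main(String[] args) {
        int partitions = 7, consumers = 3;
        int base = partitions / consumers;      // 2
        int remainder = partitions % consumers; // 1, so C0 gets one extra partition
        int start = 0;
        for (int c = 0; c < consumers; c++) {
            int count = base + (c < remainder ? 1 : 0);
            System.out.println("C" + c + " -> partitions " + start + ".." + (start + count - 1));
            start += count;
        }
        // Output: C0 -> partitions 0..2, C1 -> partitions 3..4, C2 -> partitions 5..6
    }
}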


3.1.1 Range partition allocation strategy case

1) First modify the topic first to have 7 partitions

bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --alter --topic first --partitions 7

2) Copy the CustomConsumer class to create CustomConsumer1 and CustomConsumer2. The three consumers CustomConsumer, CustomConsumer1, and CustomConsumer2 then form a consumer group with the group id "test". Start all three consumers at the same time.

3) Start the CustomProducer producer and send 500 messages; they are sent to different partitions at random. The producer code is the same CustomProducerCallback class shown in section 2.3.


4) Observe which partitions each of the three consumers consumes.

CustomConsumer consumes partitions 0,1,2; CustomConsumer1 consumes partitions 3,4; CustomConsumer2 consumes partitions 5,6.


3.1.2 Range partition allocation rebalancing case

(1) Stop consumer 0 and quickly resend messages to observe the result (within 45s; the sooner the better).

Result:

  • Consumer 1: consumes data from partitions 3 and 4
  • Consumer 2: consumes data from partitions 5 and 6
  • The partitions previously assigned to consumer 0 are reassigned as a whole to consumer 1 or consumer 2

Note: after consumer 0 goes down, the consumer group uses the 45s session timeout to decide whether it has really left, so it has to wait. Once 45s have passed and the consumer is judged to have left, its partitions are reassigned to the remaining consumers.

(2) Send messages again after 45s and observe the result.

  • Consumer 1: consumes data from partitions 0, 1, 2, and 3.
  • Consumer 2: consumes data from partitions 4, 5, and 6.

Explanation: consumer 0 has been removed from the consumer group, so the partitions are reassigned according to the Range strategy.


3.2 Principle of RoundRobin allocation strategy

RoundRobin works across all the topics in the cluster.

  • RoundRobin: the round-robin strategy lists all the partitions and all the consumers, sorts them by hashcode, and then assigns partitions to consumers one by one in round-robin fashion (see the sketch below).
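
A minimal sketch of the round-robin step itself, again with 7 partitions and 3 consumers (it reproduces only the polling distribution, not the hashcode sorting of the real RoundRobinAssignor):

// Partitions are handed out one by one, cycling through the consumers.
public class RoundRobinMathSketch {

    public static void main(String[] args) {
        int partitions = 7, consumers = 3;
        StringBuilder[] result = new StringBuilder[consumers];
        for (int c = 0; c < consumers; c++) {
            result[c] = new StringBuilder("C" + c + " -> partitions");
        }
        for (int p = 0; p < partitions; p++) {
            result[p % consumers].append(" ").append(p); // partition p goes to consumer p % consumers
        }
        for (StringBuilder line : result) {
            System.out.println(line); // C0 -> partitions 0 3 6, C1 -> partitions 1 4, C2 -> partitions 2 5
        }
    }
}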



3.2.1 Case Study of RoundRobin Partition Allocation Strategy

(1) In each of the three consumers (CustomConsumer, CustomConsumer1, and CustomConsumer2), change the partition assignment strategy to RoundRobin.

// Change the partition assignment strategy
properties.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, "org.apache.kafka.clients.consumer.RoundRobinAssignor");

(2) Restart 3 consumers, repeat the steps of sending messages, and watch the partition results.

  • CustomConsumer consumes partitions 1 and 4
  • CustomConsumer1 consumes partitions 2 and 5
  • CustomConsumer2 consumes partitions 0, 3, and 6

3.2.2 RoundRobin Partition Allocation Rebalancing Case

(1) Stop consumer 0 and quickly resend messages to observe the result (within 45s; the sooner the better).

  • Consumer 1: consumes data from partitions 2 and 5
  • Consumer 2: consumes data from partitions 4 and 1
  • The partitions previously assigned to consumer 0 (0, 3, and 6) are redistributed in round-robin fashion and consumed by consumer 1 and consumer 2

Explanation: after consumer 0 goes down, the consumer group uses the 45s session timeout to decide whether it has really left, so it has to wait. Once 45s have passed and the consumer is judged to have left, its partitions are reassigned to the remaining consumers.

(2) Resend the message again to watch the result (after 45s)

  • Consumer No. 1: consume data from partitions 0, 2, 4, and 6
  • Consumer No. 2: Consume data from partitions 1, 3, and 5

Explanation: Consumer 0 has been kicked out of the consumer group, so it is reassigned according to the RoundRobin method.


3.3 Principle of Sticky Allocation Strategy

Definition of sticky partitioning: the assignment result is "sticky", that is, before a new assignment is made, the previous assignment is taken into account and changed as little as possible, which saves a lot of overhead.

Sticky partitioning is an assignment strategy that Kafka introduced in version 0.11.x. It first tries to place partitions on consumers as evenly as possible; when a consumer in the group fails, it tries to keep the original assignment of the remaining consumers unchanged.

For example, with 7 partitions and 3 consumers, the partitions are placed as evenly as possible, but not necessarily contiguously: the assignment might be 0,1 / 2,3,5 / 4,6, or 0,2 / 1,4,6 / 3,5, and so on.


3.3.1 Sticky Partition Allocation Strategy Case

Requirement: use the topic first with 7 partitions; prepare 3 consumers using the sticky partition strategy, consume, and observe the partition assignment. Then stop one of the consumers and observe the assignment again.

(1) Modify the partition allocation strategy to sticky.

// Change the partition assignment strategy
ArrayList<String> strategies = new ArrayList<>();
strategies.add("org.apache.kafka.clients.consumer.StickyAssignor");

properties.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, strategies);

(2) Send 500 messages using the same producer.

It can be seen that the partitions are distributed across the consumers as evenly as possible.




3.3.2 Sticky Partition Allocation Rebalancing Case

(1) Stop consumer 0 and quickly resend messages to observe the result (within 45s; the sooner the better).

  • Consumer 1: consumes data from partitions 2, 3, and 5.
  • Consumer 2: consumes data from partitions 4 and 6.
  • The partitions previously assigned to consumer 0 (0 and 1) are divided as evenly as possible according to the sticky rule and consumed by consumer 1 and consumer 2.

Explanation: after consumer 0 goes down, the consumer group uses the 45s session timeout to decide whether it has really left, so it has to wait. Once 45s have passed and the consumer is judged to have left, its partitions are reassigned to the remaining consumers.

(2) Resend the message again to view the result (after 45s).

  • Consumer No. 1: consumes data from partitions 2, 3, and 5.
  • Consumer No. 2: Consume data from partitions 0, 1, 4, and 6.

Explanation: Consumer 0 has been kicked out of the consumer group, so it is reassigned according to the sticky method.


4. Offsets

4.1 Where offsets are maintained by default


The __consumer_offsets topic stores data as key-value pairs: the key is group.id+topic+partition number, and the value is the current offset. Kafka periodically compacts this topic internally, keeping only the latest record for each group.id+topic+partition number key.
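
Besides consuming the internal topic from the command line (as shown below), the committed offset of a group can also be read programmatically with KafkaConsumer.committed(). A minimal sketch (broker address, topic, and partition are placeholders):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommittedOffsetSketch {

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test"); // the group whose offset we want to read

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties)) {
            TopicPartition tp = new TopicPartition("first", 0);
            // Returns the last committed offset for this group and partition (value may be null)
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(Collections.singleton(tp));
            System.out.println("committed offset for " + tp + ": " + committed.get(tp));
        }
    }
}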

(1) Consumption offset case

0) Idea: __consumer_offsets is a Kafka topic, so it can be consumed like any other topic.

1) Add exclude.internal.topics=false to the configuration file config/consumer.properties. The default is true, meaning internal (system) topics cannot be consumed; change it to false in order to view the system topic's data.

2) Use the command line to create a new topic

bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --topic atguigu --partitions 2 --replication-factor 2

3) Start a console producer and produce data to the topic atguigu

bin/kafka-console-producer.sh --topic atguigu --bootstrap-server hadoop102:9092

4) Start a console consumer to consume data from the topic atguigu

bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic atguigu --group test

Note: Specify the name of the consumer group to better observe the data storage location (key is group.id+topic+partition number)

5) Consume the system topic __consumer_offsets to view the committed offsets

[atguigu@hadoop102 kafka]$ bin/kafka-console-consumer.sh --topic __consumer_offsets --bootstrap-server hadoop102:9092 --consumer.config config/consumer.properties --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --from-beginning

[offset,atguigu,1]::OffsetAndMetadata(offset=7, 
leaderEpoch=Optional[0], metadata=, commitTimestamp=1622442520203, 
expireTimestamp=None)
[offset,atguigu,0]::OffsetAndMetadata(offset=8, 
leaderEpoch=Optional[0], metadata=, commitTimestamp=1622442520203, 
expireTimestamp=None)

4.2 Automatically submit offset

In order to enable us to focus on our own business logic, Kafka provides the function of automatically submitting offsets.

Related parameters for automatically submitting offset:

  • enable.auto.commit: whether to enable automatic offset commits; default true.
  • auto.commit.interval.ms: the interval at which offsets are committed automatically; default 5s.


Sample code:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerAutoOffset {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Enable automatic offset commits
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
        // Commit interval: 1s
        properties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
        }
    }
}

4.3 Submit offset manually

Although automatically committing offsets is simple and convenient, the commit happens on a timer, so it is difficult for developers to control exactly when offsets are committed. Kafka therefore also provides APIs for committing offsets manually.

There are two ways to manually submit offsets:

  • commitSync (synchronous commit): blocks until the offset commit completes before the next batch of data is consumed.
  • commitAsync (asynchronous commit): sends the commit request and immediately moves on to consume the next batch. Recommended.

What the two have in common is that both commit the highest offset of the current batch. The difference is that a synchronous commit blocks the current thread until the commit succeeds and automatically retries on failure (commits can still fail for uncontrollable reasons), whereas an asynchronous commit has no retry mechanism, so the commit may fail.



4.3.1 Submit offset synchronously

Because synchronous offset commits retry on failure, they are more reliable, but they are less efficient because the thread keeps waiting for the commit result.

The code example is as follows:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerByHandSync {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Disable automatic commits so offsets are committed manually
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
            // Commit offsets manually (synchronous)
            kafkaConsumer.commitSync();
        }
    }
}

4.3.2 Submit offset asynchronously

Although committing offsets synchronously is more reliable, it blocks the current thread until the commit succeeds, so throughput suffers considerably. In most cases, asynchronous commits are therefore preferred.

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerByHandAsync {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Disable automatic commits so offsets are committed manually
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
            // Commit offsets manually (asynchronous)
            kafkaConsumer.commitAsync();
        }
    }
}
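
Since commitAsync has no retry mechanism, it can be useful to pass an OffsetCommitCallback so that failed commits are at least logged. A minimal sketch, assuming the same consumer as above (additional imports: java.util.Map, org.apache.kafka.clients.consumer.OffsetAndMetadata, org.apache.kafka.clients.consumer.OffsetCommitCallback, org.apache.kafka.common.TopicPartition); it replaces the plain kafkaConsumer.commitAsync() call in the loop:

// Asynchronous commit with a callback so that failures are visible
kafkaConsumer.commitAsync(new OffsetCommitCallback() {
    @Override
    public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets, Exception exception) {
        if (exception != null) {
            // No automatic retry; log the failure and rely on a later commit (or a final commitSync on close)
            System.err.println("offset commit failed for " + offsets + ": " + exception.getMessage());
        }
    }
});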

4.4 Specify offset consumption

auto.offset.reset = earliest | latest | none; the default is latest.

What should happen when there is no initial offset in Kafka (the consumer group is consuming for the first time) or the current offset no longer exists on the server (e.g., the data has been deleted)?

  1. earliest: automatically reset the offset to the earliest offset (equivalent to --from-beginning on the command line).
  2. latest (default): automatically reset the offset to the latest offset.
  3. none: throw an exception to the consumer if no previous offset is found for the consumer group.
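
If the goal is simply to control where a new consumer group starts, setting the parameter is enough. A minimal sketch of the relevant line (added to the consumer Properties used in the earlier examples):

// Start from the earliest offset when the group has no committed offset yet
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

To jump to an arbitrary offset regardless of committed offsets, seek() can be used instead, as in the example below.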


The code example is as follows:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerSeek {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // Wait until the partition assignment has been made
        Set<TopicPartition> assignment = kafkaConsumer.assignment();
        while (assignment.size() == 0) {
            kafkaConsumer.poll(Duration.ofSeconds(1));
            assignment = kafkaConsumer.assignment();
        }
        // Seek each assigned partition to offset 100
        for (TopicPartition topicPartition : assignment) {
            kafkaConsumer.seek(topicPartition, 100);
        }

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
        }
    }
}

Note: After each execution, the consumer group name needs to be modified


4.5 Specified Time Consumption

Requirement: in production you may find that the data consumed in the last few hours is abnormal and want to re-consume by time, for example re-consume the data of the previous day. How can this be done?

import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CustomConsumerSeekTime {

    public static void main(String[] args) {

        // 0. Configuration
        Properties properties = new Properties();
        // Connect to the cluster
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.101.66:9092,192.168.101.67:9092,192.168.101.68:9092");
        // Deserializers for key and value
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer group id
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");

        // 1. Create a consumer
        KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);

        // 2. Subscribe to the topic first
        ArrayList<String> topics = new ArrayList<>();
        topics.add("first");
        kafkaConsumer.subscribe(topics);

        // Wait until the partition assignment has been made
        Set<TopicPartition> assignment = kafkaConsumer.assignment();
        while (assignment.size() == 0) {
            kafkaConsumer.poll(Duration.ofSeconds(1));
            assignment = kafkaConsumer.assignment();
        }
        // Map each partition to the target timestamp (one day ago)
        HashMap<TopicPartition, Long> topicPartitionLongHashMap = new HashMap<>();
        for (TopicPartition topicPartition : assignment) {
            topicPartitionLongHashMap.put(topicPartition, System.currentTimeMillis() - 1 * 24 * 3600 * 1000);
        }
        // Convert the timestamps to offsets
        Map<TopicPartition, OffsetAndTimestamp> topicPartitionOffsetAndTimestampMap = kafkaConsumer.offsetsForTimes(topicPartitionLongHashMap);

        // Seek each partition to the offset corresponding to the timestamp
        for (TopicPartition topicPartition : assignment) {
            OffsetAndTimestamp offsetAndTimestamp = topicPartitionOffsetAndTimestampMap.get(topicPartition);
            if (offsetAndTimestamp != null) {  // null if there is no message at or after the timestamp
                kafkaConsumer.seek(topicPartition, offsetAndTimestamp.offset());
            }
        }

        // 3. Consume data
        while (true) {
            ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> consumerRecord : consumerRecords) {
                System.out.println(consumerRecord);
            }
        }
    }
}

5. Missed consumption and repeated consumption

Repeated consumption: the data has already been consumed, but the offset was not committed, so it is consumed again.

Missed consumption: the offset is committed first and the data consumed afterwards, which may cause some data to be missed.


How can we avoid both missed consumption and repeated consumption? See consumer transactions below.


6. Consumer transactions

To achieve exactly-once consumption on the consumer side, the consumption process and the offset commit must be bound together atomically. This requires saving Kafka's offsets in a custom medium that supports transactions (such as MySQL).
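
A common pattern is sketched below: offsets are kept in an external store and restored with seek() on startup. A HashMap stands in for the transactional store here; in a real system the processing result and the new offset would be written to MySQL in one database transaction. Broker address, topic, and partition are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TransactionalOffsetSketch {

    // Stand-in for a transactional store such as a MySQL table
    private static final Map<TopicPartition, Long> offsetStore = new HashMap<>();

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "test");
        // Offsets are managed externally, so automatic commits are disabled
        properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(properties);
        TopicPartition tp = new TopicPartition("first", 0);
        consumer.assign(Collections.singletonList(tp));

        // On startup, resume from the offset recorded in the external store (if any)
        Long saved = offsetStore.get(tp);
        if (saved != null) {
            consumer.seek(tp, saved);
        }

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Process the record and save the next offset in ONE transaction,
                // so that either both happen or neither does
                System.out.println(record.value());
                offsetStore.put(tp, record.offset() + 1); // next offset to read
            }
        }
    }
}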



7. Data backlog

  1. If Kafka's consumption capacity is insufficient, consider increasing the number of partitions of the topic and, at the same time, increasing the number of consumers in the group so that the number of consumers equals the number of partitions (both steps are required).


  2. If downstream processing is not fast enough, increase the amount of data pulled per batch (see the sketch below). If each batch pulls too little data (data pulled / processing time < production rate), the data processed falls behind the data produced, which also causes a backlog.
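
For the second case, the relevant knobs on the consumer side are the fetch and poll sizes. A sketch of raising them in the consumer Properties used earlier (the values are only illustrative):

// Pull more data per fetch/poll when each batch is too small for the production rate
properties.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 104857600);  // default 52428800 (50 MB)
properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);      // default 500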



8. Kafka-Kraft cluster deployment

1) Unzip another copy of the Kafka installation package

tar -zxvf kafka_2.12-3.0.0.tgz -C /opt/module/

2) Rename to kafka2

mv kafka_2.12-3.0.0/ kafka2

3) Modify the /opt/module/kafka2/config/kraft/server.properties configuration file on hadoop102

Several parameters to pay attention to:

  • process.roles: the roles of the Kafka node; controller acts as the master (taking over the role that ZooKeeper used to play) and broker acts as the worker node
  • node.id: the id of the node
  • controller.quorum.voters: the list of all controllers
  • advertised.listeners: the address the broker advertises to clients
  • log.dirs: the data storage directory; it is recommended to store data under kafka2/data

4) Distribute kafka2

xsync kafka2/

  • On hadoop103 and hadoop104, node.id must be changed accordingly, and its value must correspond to an entry in controller.quorum.voters.
  • On hadoop103 and hadoop104, advertised.listeners must be modified to use each host's own hostname.

5) Initialize the cluster data directory

First generate a unique ID for the storage directory.

bin/kafka-storage.sh random-uuid

# The console prints a cluster ID like the following
J7s9e8PPTKOO47PxzI39VA

Use this ID to format the kafka storage directory (every node runs once)

bin/kafka-storage.sh format -t J7s9e8PPTKOO47PxzI39VA -c /opt/module/kafka2/config/kraft/server.properties

6) Start the kafka cluster (every node runs once)

bin/kafka-server-start.sh -daemon config/kraft/server.properties

7) Stop the kafka cluster (every node runs once)

bin/kafka-server-stop.sh

8.1 Kraft cluster start and stop scripts

1) Create the file kf2.sh script file in the /root/bin directory

vim kf2.sh

The script is as follows:

#! /bin/bash
case $1 in
"start"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo " --------start $i Kafka2-------"
        ssh $i "/opt/module/kafka2/bin/kafka-server-start.sh -daemon /opt/module/kafka2/config/kraft/server.properties"
    done
};;
"stop"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo " --------stop $i Kafka2-------"
        ssh $i "/opt/module/kafka2/bin/kafka-server-stop.sh"
    done
};;
esac

2) Add execution permission

chmod +x kf2.sh

3) Start the cluster command

kf2.sh start

4) Stop the cluster command

kf2.sh stop


Origin blog.csdn.net/Decade_Faiz/article/details/131629804