Kafka Learning (3) -------- Kafka Core: the Consumer

Following the previous post on what Kafka is (https://www.cnblogs.com/tree1123/p/11226880.html),

this one studies the core consumer API. The Kafka consumer has changed several times across versions and is particularly easy to get confused about, so always be clear about which version you are studying.

First, the old version of the consumer

Only the old versions (before 0.9) distinguish between a high-level consumer and a low-level consumer. Many articles mention these two: the low-level consumer is more flexible but has to maintain a lot of things itself; the high-level consumer is more rigid but requires much less maintenance.

The high-level consumer is the consumer group.

A low-level consumer is a single, standalone consumer; there is no consumer group, and the consumers have no relationship with each other.

1、low-level consumer

The low-level consumer is implemented on top of SimpleConsumer, with which you manage consumption entirely yourself.

Storm's storm-kafka plugin uses SimpleConsumer.

Its advantage is flexibility: it can fetch messages from any position.

If you need to read data repeatedly, consume only some of the partitions, or consume with precise control, you have to use it,

but you must handle offset commits, find the leader broker of each partition, and deal with leader changes yourself.

Methods of the interface:
fetch
send                    (send a request)
getOffsetsBefore
commitOffsets
fetchOffsets
earliestOrLatestOffset
close

Steps for usage:

Referring to the official site, it is fairly involved; pulling a message takes several steps:

Find an active Broker and find out which Broker is the leader for your topic and partition.

Determine who the replica Brokers are for your topic and partition.

Build the request defining what data you are interested in.

Fetch the data.

Identify and recover from leader changes.

You can also look up offsets and other metadata; sample code follows.

// Find the leader broker for the given topic and partition from the topic metadata
SimpleConsumer consumer = new SimpleConsumer(seed, a_port, 100000, 64 * 1024, "leaderLookup");
List<String> topics = Collections.singletonList(a_topic);
TopicMetadataRequest req = new TopicMetadataRequest(topics);
kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);
List<TopicMetadata> metaData = resp.topicsMetadata();

// Walk the partition metadata to find the leader of partition a_partition
String leader = null;
for (TopicMetadata item : metaData) {
    for (PartitionMetadata part : item.partitionsMetadata()) {
        if (part.partitionId() == a_partition) {
            leader = part.leader().host();
        }
    }
}

// Get offset information for the partition, e.g. the latest offset
// whichTime is kafka.api.OffsetRequest.LatestTime() or EarliestTime()
TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
OffsetResponse response = consumer.getOffsetsBefore(request);

long[] offsets = response.offsets(topic, partition);
long lastOffset = offsets[0];

This API is not used much nowadays, unless you have special needs, for example writing your own monitoring, where you may need more metadata.

2、high-level consumer

The main class used is ConsumerConnector.

It hides per-topic, per-partition offset management (the consumer group automatically reads the last offset from ZooKeeper).

It also handles broker failover and load balancing between consumers and partitions (when partitions or consumers are added or removed, Kafka rebalances automatically).

The low-level consumer has to implement all of these features itself.

Main methods:
createMessageStreams
createMessageStreamsByFilter
commitOffsets
setConsumerRebalanceListener
shutdown

The consumer group implements its core functionality through ZooKeeper.

The ZooKeeper directory structure is as follows:

/consumers/groupId/ids/consumerId

Records the consumer's subscription information and is also used to monitor whether the consumer is alive. This is an ephemeral node and is deleted automatically when the session fails.

/consumers/groupId/owners/topic/partition

Records which consumer thread id owns each partition; written when a rebalance is executed.

/consumers/groupId/offsets/topic/partition

Records the committed offset of the consumer group for the given partition.

The consumer is designed for multi-threaded use: you create only one consumer instance, and if the topic has multiple partitions it will consume them with multiple threads automatically.

Steps for usage:

   Properties properties = new Properties();
   properties.put("zookeeper.connect", "ip1:2181,ip2:2181,ip3:2181"); // declare the ZooKeeper ensemble
   properties.put("group.id", "group03");
   ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(properties));

   Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
   topicCountMap.put(topic, 1); // use one stream (thread) for this topic
   Map<String, List<KafkaStream<byte[], byte[]>>> messageStreams = consumer.createMessageStreams(topicCountMap);
   KafkaStream<byte[], byte[]> stream = messageStreams.get(topic).get(0); // take the single stream; with multiple threads you would handle multiple partitions here
   ConsumerIterator<byte[], byte[]> iterator = stream.iterator();
   while (iterator.hasNext()) {
        String message = new String(iterator.next().message());
        System.out.println("Received: " + message);
   }

// auto.offset.reset defaults to largest
// to consume from the beginning: properties.put("auto.offset.reset", "smallest");

Quite simple. Before 0.9 this is what most of us used, including the Spring integration. But a new consumer appeared in version 0.9.

Second, the new version of the consumer

First, a word about versions:

Kafka Streams was added after 0.10.0.0, and Kafka Streams started to stabilize with Kafka 1.0.

Security features came after 0.9.0.0 and stabilized after 0.10.0.1.

The new consumer became stable after 0.10.1.0.

There are even two storm-kafka packages:

storm-kafka uses the old consumer;

storm-kafka-client uses the new consumer.

In 0.9.0.0 Kafka introduced newly developed Java producers and consumers and deprecated the legacy Scala versions.

Version   | Recommended producer | Recommended consumer | Reason
0.8.2.2   | old                  | old                  | the new producer is not yet stable
0.9.0.x   | new                  | old                  | the new producer is stable
0.10.0.x  | new                  | old                  | the new consumer is not yet stable
0.10.1.0  | new                  | new                  | the new consumer is stable
0.10.2.x  | new                  | new                  | both are stable

The old version relies on ZooKeeper for offset management; the new version does not rely on ZooKeeper.

            | Language | Package                              | Main classes
old version | Scala    | kafka.consumer.*                     | ZookeeperConsumerConnector, SimpleConsumer
new version | Java     | org.apache.kafka.clients.consumer.*  | KafkaConsumer

Core concepts of the new version:

consumer group

Consumers use a consumer group name (group.id) to identify themselves; each message of a topic is delivered to exactly one consumer instance in each subscribing consumer group.

1. A consumer group contains multiple consumers.

2. Within one group, each message of a topic is delivered to only one consumer instance in that group.

3. A message of a topic can be delivered to multiple groups.

consumer-side offset

Records the position each consumer has reached in each partition.

Kafka does not keep this position on the server side; it is kept by the consumer group and persisted periodically.

The old version periodically stores it in ZooKeeper under the path /consumers/groupid/offsets/topic/partitionid.

The new version stores offsets in an internal topic: __consumer_offsets (two leading underscores), which has 50 partitions.

So the new consumer no longer needs ZooKeeper at all.

The old version could be configured with offsets.storage=kafka to commit offsets there too, but it was rarely used.

The structure of __consumer_offsets: key = group.id + topic + partition, value = offset.

consumer group rebalance

A standalone consumer does not rebalance.

Rebalance defines how all the consumers in a consumer group divide up all of the partitions.

Single-threaded sample code:
Properties props = new Properties();
props.put("bootstrap.servers", "kafka01:9092,kafka02:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

props.put("auto.offset.reset", "earliest");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        }
    }
} finally {
    consumer.close();
}

Very simple, one only needs to configure the server groupid autocommit kafka serialization autooffsetreset (wherein bootstrap.server group.id key.deserializer value.deserializer must be specified);

2, consumer objects constructed with these Properties (KafkaConsumer other configurations, can pass into the serialization);

3, subscribe subscribe topic list (you can use regular subscription Pattern.compile ( "kafka. *")

You must be specified using a regular listener subscribe (Pattern pattern, ConsumerRebalanceListener listener)); can be rewritten to implement this interface logic partitioning changes. If you set enable.auto.commit = true to ignore this logic.

4, and then poll the message loop (where 1000 is timeout, if not a lot of data, also one second, etc.);

5, the message processing (printing the offset key value here write processing logic).

6, close KafkaConsumer (may pass a default timeout value is 30 seconds to wait).
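As mentioned in step 3, here is a minimal sketch of a regex subscription; the topic pattern and the mostly empty listener are my own illustration, not code from the original post:

// Regex subscription requires the two-argument subscribe with a rebalance listener.
// Imports assumed: java.util.regex.Pattern, java.util.Collection,
// org.apache.kafka.common.TopicPartition, org.apache.kafka.clients.consumer.ConsumerRebalanceListener
consumer.subscribe(Pattern.compile("kafka.*"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // called before a rebalance takes these partitions away; nothing to do in this sketch
    }
    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // called after a rebalance hands these partitions to this consumer
        System.out.println("assigned: " + partitions);
    }
});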

Detailed Properties:

bootstrap.servers (it is best not to use IPs but host names, because Kafka uses host names internally unless it has been configured with IPs)

deserializer: the consumer receives byte arrays from the broker, and the deserializer turns them back into objects.

A dozen or so are built in: StringDeserializer, LongDeserializer, DoubleDeserializer, etc.

You can also create a custom deserializer: implement the Deserializer interface and write deserialization logic that matches your serializer.
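Purely as an illustration (this class is not from the original post), a minimal custom deserializer could look like the sketch below; you would then put its fully qualified name into key.deserializer or value.deserializer:

import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;

// A trivial custom deserializer: byte[] -> upper-cased String
public class UpperCaseDeserializer implements Deserializer<String> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no configuration needed for this sketch
    }

    @Override
    public String deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        return new String(data, StandardCharsets.UTF_8).toUpperCase();
    }

    @Override
    public void close() {
        // nothing to release
    }
}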

Besides the four that must always be passed (bootstrap.servers, group.id, key.deserializer, value.deserializer), the notable settings are:

session.timeout.ms, the coordinator's failure-detection time.

It determines how quickly a dead consumer is detected so a rebalance can happen promptly. The default is 10 seconds; it can be set lower to reduce message delay.

max.poll.interval.ms, the maximum time allowed for the consumer's processing logic.

Raise this value when the processing is complex, to avoid unnecessary rebalances: if the gap between two polls exceeds this parameter, Kafka decides the consumer cannot keep up and kicks it out of the group; its offsets cannot be committed, so messages get consumed again. The default is 5 minutes.

auto.offset.reset, the strategy when there is no committed offset or the committed offset is out of range.

Note that once a consumer group has started from scratch, or has restarted after successfully committing offsets, and simply keeps consuming, this parameter has no effect.

The three values are:

earliest: when a partition has a committed offset, consume from that offset; when there is no committed offset, consume from the earliest offset.

latest: when a partition has a committed offset, consume from that offset; when there is no committed offset, consume only the data produced after the consumer started.

none: when every subscribed partition has a committed offset, consume from those offsets; if any partition has no committed offset, throw an exception.

(Note: before kafka-0.10.1.X, auto.offset.reset took the values smallest and largest, with offsets stored in ZooKeeper; in the new versions we are discussing, from kafka-0.10.1.X on, the values are earliest, latest and none, with offsets stored in Kafka's special topic __consumer_offsets.)

enable.auto.commit, whether to commit offsets automatically.

true means auto commit; false means the user must commit manually. If each message must be processed exactly once, you need to control the commit yourself, so set it to false.

fetch.max.bytes, the maximum number of bytes the consumer fetches in a single request.

max.poll.records, the maximum number of messages a single poll returns.

The default is 500; if the per-message processing is very light, you can raise this value to increase throughput.

heartbeat.interval.ms, the time within which the other group members learn about a rebalance.

This value must be smaller than session.timeout.ms: a consumer that has already been declared dead has no chance to perceive the rebalance at all.

connections.max.idle.ms, the interval after which idle connections are closed.

The default is 9 minutes; set it to -1 to never close them.

Detailed poll method:

(The old version used one thread per partition; the new version uses one thread to manage multiple socket connections.)

The new KafkaConsumer actually runs two threads: the main thread handles fetching messages, rebalancing, talking to the coordinator, committing offsets, and so on;

the other is a background heartbeat thread.

Given the configuration above, the poll method locates the offsets and returns once enough data is available or the specified timeout has elapsed.

The Java consumer is not thread-safe: using one KafkaConsumer from multiple threads throws a "KafkaConsumer is not safe for multi-threaded access" exception. You can protect it with a synchronization lock.

The poll timeout parameter (the 1000 above) means: even if there is not much data, return after at most one second. If, for example, messages are written on a 5-second schedule, you can set the timeout to 5000 to maximize efficiency.

If you have no such periodic work, set it to Long.MAX_VALUE to wait indefinitely until enough data arrives. In that case you should catch WakeupException so the consumer can be shut down.
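A minimal sketch of that pattern, assuming the same props as in the single-threaded example above; the shutdown hook is just one possible way to trigger wakeup():

// Poll "forever" and let another thread break the loop with wakeup(),
// the only KafkaConsumer method that is safe to call from another thread.
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo"));

Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));

try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
        }
    }
} catch (WakeupException e) {
    // expected during shutdown, nothing to do
} finally {
    consumer.close();
}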

Detailed consumer offset:

The consumer needs to commit its offset information to Kafka periodically. As we have seen, the new version commits it to the __consumer_offsets topic.

A major role of the offset is to implement delivery semantics:

At most once: messages may be lost but are never redelivered.

At least once: messages may be redelivered but are never lost.

Exactly once: messages are neither lost nor redelivered; each is delivered exactly once.

If the consumer commits the offset before processing, you get at most once.

If it commits after processing, you get at least once, which is the default.
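A hedged sketch of the difference in commit timing, assuming enable.auto.commit = false and a hypothetical process() method:

// At most once: commit first, then process.
// If processing fails after the commit, those messages are lost but never redelivered.
ConsumerRecords<String, String> batch = consumer.poll(1000);
consumer.commitSync();
for (ConsumerRecord<String, String> record : batch) {
    process(record); // hypothetical processing method
}

// At least once (the default behaviour): process first, then commit.
// If the consumer dies before the commit, the messages are redelivered.
ConsumerRecords<String, String> batch2 = consumer.poll(1000);
for (ConsumerRecord<String, String> record : batch2) {
    process(record); // hypothetical processing method
}
consumer.commitSync();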

A consumer deals with several position values:

last committed offset, current position, high watermark, log end offset

for example: 0 .. 1 (last committed) .. 5 (current position) .. 10 (watermark) .. 15 (log end offset)

Last committed offset: the offset value the consumer committed most recently;

Current position: the position the consumer has polled up to but not yet committed;

High watermark: managed by the partition, not the consumer; the consumer cannot read messages above the watermark;

Log end offset: also managed by the partition; the largest offset, never smaller than the watermark.
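The new consumer lets you inspect some of these positions directly; a minimal sketch (the topic name and partition number are assumptions):

// Imports assumed: org.apache.kafka.common.TopicPartition,
// org.apache.kafka.clients.consumer.OffsetAndMetadata, java.util.Collections
TopicPartition tp = new TopicPartition("testtopic", 0);

long position = consumer.position(tp);                 // current position: next offset to fetch
OffsetAndMetadata committed = consumer.committed(tp);  // last committed offset (null if none yet)
long logEnd = consumer.endOffsets(Collections.singletonList(tp)).get(tp); // log end offset

System.out.printf("position=%d, committed=%s, logEndOffset=%d%n", position, committed, logEnd);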

The new consumer selects one broker to act as the coordinator of the consumer group; it manages group membership, distributes the assignment plan, and handles offset commits. If a consumer crashes, the partitions it was responsible for are assigned to another consumer; if offsets were not committed in time, messages may be consumed again.

When offsets are committed multiple times, Kafka keeps only the most recent commit.

By default the consumer auto-commits offsets every 5 seconds; the interval can be changed with auto.commit.interval.ms.

Automatic commits reduce development effort but can lead to repeated consumption; for precise consumption you still need manual commits. Set enable.auto.commit = false, then call consumer.commitSync() or consumer.commitAsync(): Sync is synchronous and blocks, Async is asynchronous and does not block. Both methods can take parameters specifying which partitions to commit, which is more precise.

(In the old version automatic commit was controlled by auto.commit.enable and the default interval was 60 seconds.)
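A minimal manual-commit sketch for the new consumer, committing partition by partition with commitSync; the topic name is an assumption and this follows the common pattern rather than code from the original post:

// Imports assumed: java.util.Collections, java.util.List,
// org.apache.kafka.common.TopicPartition, org.apache.kafka.clients.consumer.OffsetAndMetadata
props.put("enable.auto.commit", "false"); // turn off automatic commits

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("testtopic"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (TopicPartition partition : records.partitions()) {
        List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
        for (ConsumerRecord<String, String> record : partitionRecords) {
            System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
        }
        // commit only this partition: the committed offset is the last processed offset + 1
        long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
        consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
    }
}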

Detailed rebalance:

A rebalance is how a consumer group allocates all the partitions of the topics it subscribes to.

Normally, with 10 partitions and 5 consumers, the group assigns 2 partitions to each consumer.

Each partition is assigned to exactly one consumer instance. When a consumer has a problem, the allocation is redone; this process is the rebalance.

(The old version managed rebalances through ZooKeeper; the new version selects a broker as the group coordinator to manage them.)

Rebalance trigger conditions:

1. A new consumer joins, or a consumer leaves or crashes.

2. The topics the group subscribes to change, for example with a regex subscription.

3. The number of partitions of the subscribed topics changes.

The first case is the most common, and the consumer has not necessarily crashed; it may just be processing too slowly. To avoid frequent rebalances, tune request.timeout.ms, max.poll.records and max.poll.interval.ms.

Rebalance partitioning strategies:

partition.assignment.strategy selects the strategy; you can also create a custom assignor (a configuration sketch follows this list).

range strategy (the default): split the partitions into contiguous ranges and give one range to each consumer.

round-robin strategy: assign partitions one by one in turn.

sticky strategy (added in 0.11.0.0, better): the range strategy can become uneven when subscribing to multiple topics.

sticky has two goals; when the two conflict, the first takes priority over the second:

  1. make the assignment as even as possible;
  2. keep the assignment as close as possible to the previous one.
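Switching strategies is just a configuration entry; a sketch using the assignor classes that ship with the new consumer:

// round-robin instead of the default range assignor
props.put("partition.assignment.strategy",
          "org.apache.kafka.clients.consumer.RoundRobinAssignor");

// sticky assignor (available from 0.11.0.0)
// props.put("partition.assignment.strategy",
//           "org.apache.kafka.clients.consumer.StickyAssignor");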

rebalance generation: a generation mechanism guards against duplicate commits around a rebalance; a delayed commit that carries an old generation's offsets is rejected with an ILLEGAL_GENERATION error.

Rebalance process:

1. Determine which broker is the coordinator and establish a socket connection to it.

Determination algorithm: Math.abs(groupId.hashCode()) % the value of offsets.topic.num.partitions (default 50).

Find which broker holds the leader replica of that __consumer_offsets partition (out of the 50); that broker is the group's coordinator. (A one-line sketch of this computation follows the three steps.)

2. Join the group.

All consumers send a JoinGroup request to the coordinator; the coordinator collects all the requests and the members' subscription information, and elects one consumer as the leader (the leader is a consumer; the coordinator is a broker).

3. Synchronize the assignment plan.

The leader works out the assignment plan and sends it to the coordinator in a SyncGroup request; the coordinator returns the plan to each requesting consumer.
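A one-line sketch of that determination algorithm (the group id value is an assumption):

String groupId = "test";            // assumed group id
int offsetsTopicNumPartitions = 50; // default of offsets.topic.num.partitions
int coordinatorPartition = Math.abs(groupId.hashCode()) % offsetsTopicNumPartitions;
// the broker holding the leader replica of __consumer_offsets partition
// 'coordinatorPartition' is this group's coordinator
System.out.println(coordinatorPartition);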

Kafka also supports committing offsets somewhere other than __consumer_offsets; you can customize this by implementing a ConsumerRebalanceListener and putting your own logic around the rebalance there.
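A hedged sketch of such a listener; saveOffsetToExternalStore and readOffsetFromExternalStore are hypothetical stand-ins for your own storage (a database, for example):

import java.util.Collection;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {

    private final KafkaConsumer<String, String> consumer;

    public SaveOffsetsOnRebalance(KafkaConsumer<String, String> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // before losing partitions, persist the current positions to your own store
        for (TopicPartition partition : partitions) {
            saveOffsetToExternalStore(partition, consumer.position(partition)); // hypothetical
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // after receiving partitions, seek to the offsets kept in your own store
        for (TopicPartition partition : partitions) {
            consumer.seek(partition, readOffsetFromExternalStore(partition)); // hypothetical
        }
    }

    private void saveOffsetToExternalStore(TopicPartition partition, long offset) {
        // hypothetical: write (partition, offset) to a database or file
    }

    private long readOffsetFromExternalStore(TopicPartition partition) {
        // hypothetical: read the saved offset; 0 if there is none
        return 0L;
    }
}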

Multi-threaded sample code:
Develop this according to your own needs; here is just a simple example: start as many consumers as there are partitions, one consumer per partition.
Three classes:
Main:
public static void main(String[] args) {
        
        String bootstrapServers = "kafka01:9092,kafka02:9092"; 
        String groupId = "test";
        String topic = "testtopic";
        int consumerNum = 3;
        ConsumerGroup cg = new ConsumerGroup(consumerNum,bootstrapServers,groupId,topic);
        cg.execute();
}



import java.util.ArrayList;
import java.util.List;


public class ConsumerGroup {
    
    private List<ConsumerRunnable> consumers;
    
    public ConsumerGroup(int consumerNum,String bootstrapServers,String groupId,String topic){
        
        consumers = new ArrayList<>(consumerNum);
        
        for (int i = 0; i < consumerNum; i++) {
            ConsumerRunnable consumerRunnable = new ConsumerRunnable(bootstrapServers, groupId, topic);
            consumers.add(consumerRunnable);
        }
    }
    
    public void execute(){
        
        for(ConsumerRunnable consumerRunnable:consumers){
            new Thread(consumerRunnable).start();
        }
    }
}



import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerRunnable implements Runnable{
    
    private final KafkaConsumer<String,String> consumer;
    
    public ConsumerRunnable(String bootstrapServers,String groupId,String topic){
        
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset","earliest");
        this.consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topic));
    }

    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(10);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
        }
    }
}
standalone consumer

Some requirements call for consuming specified partitions with consumers that do not interfere with each other: one standalone consumer crashing does not affect the others.

This is similar to the old low-level consumer.

Sample code below: partitions are subscribed with consumer.assign.

public static void main(String[] args) {
        
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092,kafka02:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        
        props.put("auto.offset.reset","earliest");
        
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        List<TopicPartition> partitions = new ArrayList<>();
        List<PartitionInfo> allpartitions = consumer.partitionsFor("testtopic");
        if(allpartitions!=null && !allpartitions.isEmpty()){
            for(PartitionInfo partitionInfo:allpartitions){
                partitions.add(new TopicPartition(partitionInfo.topic(),partitionInfo.partition()));
            }
            consumer.assign(partitions);
        }
        
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(10);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
        }
        
    }
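Because assign() bypasses group management, a standalone consumer is also free to jump to any position with seek(); a small sketch (partition and offset values are assumptions):

// After consumer.assign(partitions), each partition's position can be set by hand
TopicPartition tp = new TopicPartition("testtopic", 0); // assumed partition
consumer.seek(tp, 100L);                                // start from an arbitrary offset (assumed value)
// or jump to either end (Collection-based signatures from 0.10.1 on):
// consumer.seekToBeginning(Arrays.asList(tp));
// consumer.seekToEnd(Arrays.asList(tp));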

That is an overview of Kafka consumers; for the finer details you still need to study the official documentation carefully.


Origin: www.cnblogs.com/tree1123/p/11243668.html