After learning what Kafka is (https://www.cnblogs.com/tree1123/p/11226880.html), the next step is to learn the core consumer API. The Kafka consumer has gone through several versions and they are easy to confuse, so be sure to work out which version you are dealing with before studying it.
First, the old version of the consumer
Only the old consumer (before 0.9) distinguishes between a high-level consumer and a low-level consumer. Many articles mention the two: the low-level consumer is more flexible but you have to maintain a lot of things yourself, while the high-level consumer is more rigid but needs little maintenance.
The high-level consumer is the consumer group.
The low-level consumer is a single standalone consumer: it belongs to no consumer group and has no relationship with any other consumer.
1、low-level consumer
The low-level consumer is implemented by SimpleConsumer, which lets you manage the consumer yourself.
Storm's Kafka plug-in, storm-kafka, uses SimpleConsumer.
The advantage is flexibility: you can fetch messages from any position.
If you need to read data repeatedly, consume only some of the partitions, or control consumption precisely, you have to use it.
But you must handle offset commits yourself, find the leader broker of each partition, and deal with leader changes.
Methods in the interface:
fetch
send (sends a request)
getOffsetsBefore
commitOffsets
fetchOffsets
earliestOrLatestOffset
close
Steps for usage:
Following the official documentation, pulling messages takes several somewhat involved steps:
Find an active broker and find out which broker is the leader for your topic and partition
Determine who the replica brokers are for your topic and partition
Build the request defining what data you are interested in
Fetch the data
Identify and recover from leader changes
You can also query information such as offsets and metadata. The code looks like this:
// Find the leader replica of the given partition from the topic metadata
SimpleConsumer consumer = new SimpleConsumer(seed, a_port, 100000, 64 * 1024, "leaderLookup");
List<String> topics = Collections.singletonList(a_topic);
TopicMetadataRequest req = new TopicMetadataRequest(topics);
kafka.javaapi.TopicMetadataResponse resp = consumer.send(req);
List<TopicMetadata> metaData = resp.topicsMetadata();
String leader = null;
for (TopicMetadata item : metaData) {
    for (PartitionMetadata part : item.partitionsMetadata()) {
        // a_partition is the id of the partition we want to read
        if (part.partitionId() == a_partition && part.leader() != null) {
            leader = part.leader().host();
        }
    }
}
// Get offset information for the partition, for example the last offset
TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
// whichTime selects the offset to look up, e.g. kafka.api.OffsetRequest.LatestTime()
requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
OffsetResponse response = consumer.getOffsetsBefore(request);
long[] offsets = response.offsets(topic, partition);
long lastoffset = offsets[0];
This API is rarely used nowadays. Unless you have special needs, such as writing your own monitoring tool that requires more metadata, you will probably not need it.
2、high-level consumer
Main class: ConsumerConnector
It hides the offset management of each partition of each topic (the consumer group automatically reads its last committed offset from ZooKeeper).
It handles broker failover and load balancing when partitions or consumers are added or removed (Kafka rebalances automatically when the number of partitions or consumers changes).
With the low-level consumer you have to implement these features yourself.
The main methods are:
createMessageStreams
createMessageStreamsByFilter
commitOffsets
setConsumerRebalanceListener
shutdown
The group implements its core functionality through ZooKeeper.
The ZooKeeper directory structure is as follows:
/consumers/groupId/ids/consumer.id
Records the consumer's subscription information and is also used to monitor whether the consumer is alive. This is an ephemeral node and is deleted automatically when the session ends.
/consumers/groupId/owners/topic/partition
Stores the id of the consumer thread that owns each partition; it is written when a rebalance runs.
/consumers/groupId/offsets/topic/partition
Stores the consumer group's offset for the given partition.
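As a rough illustration of the layout above, the sketch below builds the three paths for a given group, consumer id, topic and partition. The class and method names (ZkConsumerPaths, idsPath, ownersPath, offsetsPath) are made up for this example and are not part of any Kafka API; only the path layout comes from the text.

```java
// Hypothetical helper: builds the ZooKeeper paths used by the old high-level consumer.
// Only the path layout is taken from the article; the class itself is illustrative.
class ZkConsumerPaths {
    // Ephemeral node that registers a live consumer in the group
    static String idsPath(String groupId, String consumerId) {
        return "/consumers/" + groupId + "/ids/" + consumerId;
    }

    // Node that records which consumer thread owns a partition
    static String ownersPath(String groupId, String topic, int partition) {
        return "/consumers/" + groupId + "/owners/" + topic + "/" + partition;
    }

    // Node that holds the group's committed offset for a partition
    static String offsetsPath(String groupId, String topic, int partition) {
        return "/consumers/" + groupId + "/offsets/" + topic + "/" + partition;
    }

    public static void main(String[] args) {
        System.out.println(idsPath("group03", "consumer-1"));
        System.out.println(ownersPath("group03", "testtopic", 0));
        System.out.println(offsetsPath("group03", "testtopic", 0));
    }
}
```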
The consumer supports a multi-threaded design: you create only one consumer instance, and if there is more than one partition it can consume them with multiple threads.
Steps for usage:
Properties properties = new Properties();
properties.put("zookeeper.connect", "ip1:2181,ip2:2181,ip3:2181"); // ZooKeeper connection string
properties.put("group.id", "group03");
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(properties));
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, 1); // number of streams (consumer threads) to create for this topic
Map<String, List<KafkaStream<byte[], byte[]>>> messageStreams = consumer.createMessageStreams(topicCountMap);
// Take the single stream; with multiple threads, hand each stream to its own thread here
KafkaStream<byte[], byte[]> stream = messageStreams.get(topic).get(0);
ConsumerIterator<byte[], byte[]> iterator = stream.iterator();
while (iterator.hasNext()) {
    String message = new String(iterator.next().message());
    System.out.println("Received: " + message);
}
// auto.offset.reset defaults to largest
// To consume from the beginning: properties.put("auto.offset.reset", "smallest");
This is quite simple, and it is what most deployments before 0.9 used, including the Spring integrations. From 0.9 onward, however, a new consumer appeared.
Second, the new version of the consumer
First, a word about versions:
Kafka Streams was added in 0.10.0.0 and started to stabilize around Kafka 1.0.
Kafka security was added in 0.9.0.0 and became stable after 0.10.0.1.
The new consumer became stable from 0.10.1.0.
Even Storm has two Kafka packages:
storm-kafka uses the old consumer
storm-kafka-client uses the new consumer
In 0.9.0.0 Kafka moved away from the Scala clients: the new clients are written in Java, and the Scala producer and consumer are referred to as the old (legacy) versions.
Version | Recommended producer | Recommended consumer | Reason |
---|---|---|---|
0.8.2.2 | Old version | Old version | The new producer is not yet stable |
0.9.0.x | New version | Old version | The new producer is stable |
0.10.0.x | New version | Old version | The new consumer is not yet stable |
0.10.1.0 | New version | New version | The new consumer is stable |
0.10.2.x | New version | New version | Both are stable |
The old consumer relies on ZooKeeper for offset management; the new one does not.
Version | Language | Package | Main classes |
---|---|---|---|
Old version | Scala | kafka.consumer.* | ZookeeperConsumerConnector, SimpleConsumer |
New version | Java | org.apache.kafka.clients.consumer.* | KafkaConsumer |
Core concepts of the new consumer:
consumer group
Consumers label themselves with a consumer group name (group.id). Each message in a topic is delivered to only one consumer instance within each consumer group that subscribes to it.
1, A consumer group contains one or more consumers.
2, Within a group, each message of a topic is delivered to only one consumer instance in the group.
3, A topic's messages can be delivered to multiple groups.
consumer-side offset
Records the position each consumer has consumed up to in each partition.
Kafka does not maintain this on the broker side; the consumer group keeps it and persists it periodically.
The old consumer periodically writes the offset to ZooKeeper under the path /consumers/groupid/offsets/topic/partitionid
The new consumer stores the offset in an internal topic, __consumer_offsets (with two leading underscores), which has 50 partitions by default.
So the new consumer does not need ZooKeeper at all.
The old consumer can also commit offsets to this topic by setting offsets.storage=kafka, but this is rarely used.
Structure of __consumer_offsets: key = group.id + topic + partition, value = offset
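Conceptually, the offsets topic behaves like a map from (group, topic, partition) to the last committed offset, where a later commit for the same key replaces the earlier one. The sketch below models that with an in-memory map. OffsetStore and its methods are illustrative names, and the real topic stores binary-encoded records rather than strings.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory model of __consumer_offsets:
// key = group + topic + partition, value = offset.
class OffsetStore {
    private final Map<String, Long> offsets = new HashMap<>();

    private static String key(String group, String topic, int partition) {
        return group + "|" + topic + "|" + partition;
    }

    // A commit overwrites any earlier offset stored under the same key
    void commit(String group, String topic, int partition, long offset) {
        offsets.put(key(group, topic, partition), offset);
    }

    // Returns the committed offset, or null if the group never committed for this partition
    Long fetch(String group, String topic, int partition) {
        return offsets.get(key(group, topic, partition));
    }

    public static void main(String[] args) {
        OffsetStore store = new OffsetStore();
        store.commit("test", "foo", 0, 42L);
        store.commit("test", "foo", 0, 57L);  // replaces 42
        store.commit("other", "foo", 0, 7L);  // a different group keeps its own offset
        System.out.println(store.fetch("test", "foo", 0));   // 57
        System.out.println(store.fetch("other", "foo", 0));  // 7
        System.out.println(store.fetch("test", "foo", 1));   // null
    }
}
```

Because the group id is part of the key, two groups reading the same partition never overwrite each other's progress.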
consumer group rebalance
A standalone consumer has no rebalance.
Rebalance defines how all the partitions are allocated among all the consumers in a consumer group.
Single-threaded sample code:
Properties props = new Properties();
props.put("bootstrap.servers", "kafka01:9092,kafka02:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
try {
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
        }
    }
} finally {
    consumer.close();
}
It is very simple:
1, Configure the Kafka servers, group id, auto commit, deserializers and auto.offset.reset (bootstrap.servers, group.id, key.deserializer and value.deserializer must be specified);
2, Construct the consumer object from these Properties (KafkaConsumer has other constructors that accept the deserializers directly);
3, Subscribe to a list of topics with subscribe. You can also subscribe by regular expression, e.g. Pattern.compile("kafka.*"). A regex subscription must specify a listener: subscribe(Pattern pattern, ConsumerRebalanceListener listener). Implement this interface to customize what happens when the partition assignment changes; if enable.auto.commit = true you can ignore this logic.
4, Poll for messages in a loop (1000 is the timeout: if there is not enough data, poll waits up to one second);
5, Process the messages (here the offset, key and value are printed; put your processing logic here);
6, Close the KafkaConsumer (you can pass a timeout; by default it waits up to 30 seconds).
Properties in detail:
bootstrap.servers (prefer host names over IPs, since Kafka uses host names internally, unless it has been configured with IPs)
deserializer: the consumer receives byte arrays from the broker, and the deserializer turns them back into objects.
About a dozen are provided by default: StringDeserializer, LongDeserializer, DoubleDeserializer, ...
You can also write your own: create a class that implements the Deserializer interface and override its methods with the logic for your format.
Besides the four required settings (bootstrap.servers, group.id, key.deserializer, value.deserializer), there are:
session.timeout.ms "coordinator failure detection time"
The time it takes to detect that a consumer has died so that a rebalance can happen promptly. The default is 10 seconds; a smaller value detects failures sooner and reduces message delay.
max.poll.interval.ms "maximum time for the consumer's processing logic"
When processing is complex, raise this value to avoid unnecessary rebalances. If the interval between two polls exceeds it, Kafka considers the consumer unable to keep up and kicks it out of the group; its offsets then cannot be committed, which causes repeated consumption. The default is 5 minutes.
auto.offset.reset "what to do when there is no committed offset or the offset is out of range"
If a group has already committed offsets, then after a start or restart it continues from the committed offset and this parameter has no effect.
The three values are:
earliest: if a partition has a committed offset, consume from it; otherwise consume from the earliest offset
latest: if a partition has a committed offset, consume from it; otherwise consume only data produced after the consumer starts
none: if every partition of the topic has a committed offset, consume from those offsets; if any partition has no committed offset, throw an exception
(Note: before kafka-0.10.1.X the auto.offset.reset values were smallest and largest (offsets stored in ZooKeeper).
For the new consumer discussed here, from kafka-0.10.1.X on, the values are earliest, latest and none (offsets stored in the special Kafka topic __consumer_offsets).)
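The auto.offset.reset rules above can be read as a small decision function for one partition. The following is a minimal sketch of that logic, not Kafka's implementation: the class name OffsetResetDemo, the method startPosition and the use of IllegalStateException as a stand-in for the real exception are all assumptions made for illustration.

```java
// Sketch of how the starting position is chosen for one partition.
// committed == null means the group has no committed offset for the partition.
class OffsetResetDemo {
    static long startPosition(Long committed, String policy, long earliest, long latest) {
        // A committed offset that is still inside the log always wins; the policy is ignored
        if (committed != null && committed >= earliest && committed <= latest) {
            return committed;
        }
        // No committed offset, or it is out of range: fall back to the policy
        switch (policy) {
            case "earliest":
                return earliest;   // start from the oldest available record
            case "latest":
                return latest;     // only read records produced from now on
            case "none":
                throw new IllegalStateException("no valid committed offset and auto.offset.reset=none");
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        System.out.println(startPosition(30L, "latest", 0L, 100L));    // 30
        System.out.println(startPosition(null, "earliest", 0L, 100L)); // 0
        System.out.println(startPosition(null, "latest", 0L, 100L));   // 100
    }
}
```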
enable.auto.commit: whether offsets are committed automatically
true commits automatically; false requires you to commit manually. If you need records to be processed only once, it is best to control commits yourself and set this to false.
fetch.max.bytes: the maximum number of bytes the consumer fetches in a single request
max.poll.records: the maximum number of records returned by a single poll
The default is 500. If each record is cheap to process, you can raise this value to increase the consumption rate.
heartbeat.interval.ms: how quickly the other group members learn about a rebalance
This value must be smaller than session.timeout.ms; otherwise the consumer would be declared dead before it could learn about the rebalance at all.
connections.max.idle.ms: how long before idle connections are closed
The default is 9 minutes; set it to -1 to never close them.
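To make the settings above concrete, here is a small sketch that collects them in a Properties object and checks the one constraint the text states, that heartbeat.interval.ms must be smaller than session.timeout.ms. The class ConsumerConfigSketch and the helper heartbeatIsValid are hypothetical, and the values are illustrative rather than recommendations.

```java
import java.util.Properties;

// Sketch: gathers the consumer settings discussed above and validates one constraint.
// The values are examples only.
class ConsumerConfigSketch {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092,kafka02:9092");
        props.put("group.id", "test");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");   // commit manually
        props.put("auto.offset.reset", "earliest");
        props.put("session.timeout.ms", "10000");
        props.put("heartbeat.interval.ms", "3000");
        props.put("max.poll.interval.ms", "300000");
        props.put("max.poll.records", "500");
        return props;
    }

    // heartbeat.interval.ms must be smaller than session.timeout.ms
    static boolean heartbeatIsValid(Properties props) {
        long heartbeat = Long.parseLong(props.getProperty("heartbeat.interval.ms"));
        long session = Long.parseLong(props.getProperty("session.timeout.ms"));
        return heartbeat < session;
    }

    public static void main(String[] args) {
        System.out.println(heartbeatIsValid(build()));
    }
}
```

The same Properties object could then be passed to the KafkaConsumer constructor shown in the single-threaded sample.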
The poll method in detail:
(Old consumer: multiple threads for multiple partitions. New consumer: one thread manages multiple socket connections.)
Strictly speaking the new KafkaConsumer uses two threads. The main thread is responsible for fetching messages, rebalancing, talking to the coordinator, committing offsets and so on.
The other is a background heartbeat thread.
Based on the configuration above, poll locates the offset and returns once enough data is available or once the wait exceeds the specified timeout.
The Java consumer is not thread-safe. Using one KafkaConsumer from multiple threads throws a ConcurrentModificationException with the message "KafkaConsumer is not safe for multi-threaded access". You can guard it with a synchronization lock.
As mentioned, the 1000 passed to poll is a timeout: if there is not enough data, poll waits up to one second and then returns. If, for example, you write messages out on a 5-second schedule, you can set the timeout to 5000 for the best efficiency.
If you have no periodic task, you can set it to Long.MAX_VALUE so that poll waits indefinitely until data arrives. In that case you need to catch WakeupException.
Consumer offsets in detail:
The consumer needs to commit its offsets to Kafka periodically. As we saw, the new consumer commits them to the __consumer_offsets topic.
An important role of the offset is to implement delivery semantics:
At most once: messages may be lost but are never repeated
At least once: messages may be repeated but are never lost
Exactly once: messages are neither lost nor repeated
If the consumer commits the offset before processing the messages, it achieves at most once.
If it commits after processing, it achieves at least once. This is the default.
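The difference between committing before and after processing is easiest to see when a crash lands between the two steps. The toy simulation below involves no Kafka at all: a consumer handles offsets 0 to 4, crashes once between its two steps while handling offset 2, and restarts from the committed offset. DeliverySemanticsDemo and its run method are names invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of commit ordering. The consumer crashes once, between its
// two steps, while handling offset 2, then restarts from the committed offset.
class DeliverySemanticsDemo {
    static List<Integer> run(boolean commitBeforeProcessing) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0;   // next offset to read after a restart
        int position = 0;
        boolean crashed = false;
        while (position < 5) {
            // First step: either commit or process
            if (commitBeforeProcessing) {
                committed = position + 1;
            } else {
                processed.add(position);
            }
            // Crash between the two steps, only once
            if (position == 2 && !crashed) {
                crashed = true;
                position = committed;   // restart from the last committed offset
                continue;
            }
            // Second step: the other of the two actions
            if (commitBeforeProcessing) {
                processed.add(position);
            } else {
                committed = position + 1;
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("commit first:  " + run(true));   // [0, 1, 3, 4]       offset 2 is lost
        System.out.println("process first: " + run(false));  // [0, 1, 2, 2, 3, 4] offset 2 is repeated
    }
}
```

Committing first loses offset 2 (at most once); processing first handles it twice (at least once).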
The consumer's position information:
(Diagram: offsets 0, 1 ... 5 ... 10 ... 15 along the partition log, marking the last committed offset, the current position, the high watermark and the log end offset.)
Last committed offset: the offset value the consumer last committed;
Current position: the position the consumer has polled up to but not yet committed;
High watermark: managed by the partition log; the consumer cannot read messages above the high watermark;
Log end offset: also managed by the partition log; the largest offset in the log, never smaller than the high watermark.
The new consumer picks one broker as the coordinator of the consumer group; it handles group membership, the partition assignment and offset commits. If a consumer crashes, the partitions it owned are assigned to other consumers; if its offsets were not committed, messages may be consumed again.
If offsets are committed multiple times, Kafka keeps only the most recent commit.
By default the consumer auto-commits offsets every 5 seconds; the interval can be set with auto.commit.interval.ms.
Auto commit saves development effort but may cause repeated consumption; for precise consumption you need to commit manually. Set enable.auto.commit = false and then call consumer.commitSync() or consumer.commitAsync(). commitSync is synchronous and blocks; commitAsync is asynchronous and does not block. Both methods accept parameters specifying which partitions and offsets to commit, which gives finer control.
(In the old consumer the auto-commit setting is auto.commit.enable, and the default interval is 60 seconds.)
Rebalance in detail:
Rebalance is how a consumer group allocates all the partitions of its topics.
Normally, with 10 partitions and 5 consumers, the consumer group assigns two partitions to each consumer on average.
Each partition is assigned to only one consumer instance. When a consumer has a problem, the assignment process runs again; this process is the rebalance.
(In the old consumer ZooKeeper manages the rebalance; in the new consumer a broker is chosen as the group coordinator to manage it.)
Conditions that trigger a rebalance:
1, A new consumer joins, or a consumer leaves or crashes.
2, The topics the group subscribes to change, for example with a regex subscription.
3, The number of partitions of a subscribed topic changes.
The first is the most common. The consumer has not necessarily crashed; it may just be processing too slowly. To avoid frequent rebalances, tune request.timeout.ms, max.poll.records and max.poll.interval.ms.
Rebalance partition assignment strategies:
partition.assignment.strategy sets the strategy; for a custom strategy, create your own partition assignor
range strategy (the default): splits the partitions into contiguous ranges and assigns one range to each consumer.
round-robin strategy: assigns partitions to consumers in turn.
sticky strategy (introduced in 0.11.0.0, generally better): the range strategy becomes uneven when subscribing to multiple topics.
sticky follows two principles; when they conflict, the first takes priority over the second:
- The partition assignment should be as even as possible;
- The assignment should stay as close as possible to the previous one.
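To show how the range and round-robin strategies differ, here is a simplified single-topic sketch. It is not Kafka's assignor code: real assignors work on sorted member ids and can span several topics, and the names AssignmentDemo, range and roundRobin exist only for this example. Consumers are identified by their index in the result list.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified single-topic versions of the range and round-robin strategies.
class AssignmentDemo {
    // Range: contiguous blocks; the first (partitions % consumers) consumers get one extra
    static List<List<Integer>> range(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        int perConsumer = partitions / consumers;
        int extra = partitions % consumers;
        int next = 0;
        for (int c = 0; c < consumers; c++) {
            int count = perConsumer + (c < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                owned.add(next++);
            }
            result.add(owned);
        }
        return result;
    }

    // Round-robin: deal the partitions out one at a time
    static List<List<Integer>> roundRobin(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<Integer>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(range(10, 5));       // [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
        System.out.println(range(10, 3));       // [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
        System.out.println(roundRobin(10, 3));  // [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
    }
}
```

Because range is applied topic by topic, the same leading consumers pick up the extra partition for every topic, which is where the unevenness with multiple topics comes from.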
The rebalance generation mechanism guards against stale offset commits during a rebalance: a delayed commit that carries an old generation is rejected with an ILLEGAL_GENERATION error.
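The generation check can be pictured as a counter that the coordinator bumps on every rebalance and compares against the generation carried by each commit. The toy model below illustrates only that idea; GenerationDemo and its methods are invented for the example, and the error names are plain strings rather than Kafka's protocol types.

```java
// Toy model of the generation check; not Kafka's actual coordinator code.
class GenerationDemo {
    private int generation = 1;        // bumped on every rebalance
    private long committedOffset = -1; // -1 means nothing committed yet

    void rebalance() {
        generation++;
    }

    // Returns "NONE" on success, or "ILLEGAL_GENERATION" for a commit from an older generation
    String commit(int memberGeneration, long offset) {
        if (memberGeneration != generation) {
            return "ILLEGAL_GENERATION";
        }
        committedOffset = offset;
        return "NONE";
    }

    int generation() {
        return generation;
    }

    long committedOffset() {
        return committedOffset;
    }

    public static void main(String[] args) {
        GenerationDemo coordinator = new GenerationDemo();
        int oldGeneration = coordinator.generation();
        System.out.println(coordinator.commit(oldGeneration, 100)); // NONE
        coordinator.rebalance();
        System.out.println(coordinator.commit(oldGeneration, 90));  // ILLEGAL_GENERATION
        System.out.println(coordinator.committedOffset());          // 100
    }
}
```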
Rebalance process:
1, Determine which broker hosts the coordinator and establish a socket connection.
The algorithm: Math.abs(groupID.hashCode) % offsets.topic.num.partitions (default 50)
Find the broker that hosts the leader replica of that __consumer_offsets partition; that broker is the coordinator of this group.
2, Join group
All consumers send a JoinGroup request to the coordinator. Once it has received all the requests, the coordinator elects one consumer as the leader (the leader is a consumer; the coordinator is a broker) and sends the members' subscription information to the leader.
3, Sync the assignment
The leader works out the assignment and sends it to the coordinator in a SyncGroup request; every consumer also sends a SyncGroup request, and the coordinator returns each consumer's assignment in the response.
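The partition lookup in step 1 is a one-line computation, sketched below. CoordinatorLookup and partitionFor are illustrative names. The guard for Integer.MIN_VALUE is an addition made in this sketch, because Math.abs(Integer.MIN_VALUE) is still negative in Java.

```java
// Sketch of the coordinator partition lookup from step 1 of the rebalance process.
class CoordinatorLookup {
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        // Guard added in this sketch: Math.abs(Integer.MIN_VALUE) would stay negative
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // "test".hashCode() is 3556498, so with 50 partitions the group maps to partition 48
        System.out.println(partitionFor("test", 50));
    }
}
```

The broker that leads the resulting __consumer_offsets partition acts as the coordinator for that group.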
Kafka also supports storing offsets somewhere other than __consumer_offsets. In that case you need to implement a ConsumerRebalanceListener and handle the rebalance logic there.
Multi-threaded sample code:
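Develop this according to your own needs. This is only a simple example: start as many consumers as there are partitions, one per partition.
Three classes: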
Main:
public class Main {
    public static void main(String[] args) {
        String bootstrapServers = "kafka01:9092,kafka02:9092";
        String groupId = "test";
        String topic = "testtopic";
        int consumerNum = 3; // one consumer per partition
        ConsumerGroup cg = new ConsumerGroup(consumerNum, bootstrapServers, groupId, topic);
        cg.execute();
    }
}
import java.util.ArrayList;
import java.util.List;
public class ConsumerGroup {
    private List<ConsumerRunnable> consumers;
    public ConsumerGroup(int consumerNum, String bootstrapServers, String groupId, String topic) {
        consumers = new ArrayList<>(consumerNum);
        for (int i = 0; i < consumerNum; i++) {
            ConsumerRunnable consumerRunnable = new ConsumerRunnable(bootstrapServers, groupId, topic);
            consumers.add(consumerRunnable);
        }
    }
    // Start one thread per consumer; each thread owns its own KafkaConsumer
    public void execute() {
        for (ConsumerRunnable consumerRunnable : consumers) {
            new Thread(consumerRunnable).start();
        }
    }
}
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
public class ConsumerRunnable implements Runnable {
    private final KafkaConsumer<String, String> consumer;
    public ConsumerRunnable(String bootstrapServers, String groupId, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        this.consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList(topic));
    }
    @Override
    public void run() {
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(10);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        } finally {
            // Release the consumer's resources if the loop ever exits
            consumer.close();
        }
    }
}
standalone consumer
Some requirements call for a consumer that reads specific partitions. Such consumers do not interfere with each other, and one standalone consumer crashing does not affect the others.
This is similar to the old low-level consumer.
Sample code follows; consumer.assign is the method that subscribes to specific partitions.
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
public class StandaloneConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka01:9092,kafka02:9092");
        props.put("group.id", "test");
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "1000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Look up every partition of the topic and assign them all to this consumer
        List<TopicPartition> partitions = new ArrayList<>();
        List<PartitionInfo> allPartitions = consumer.partitionsFor("testtopic");
        if (allPartitions != null && !allPartitions.isEmpty()) {
            for (PartitionInfo partitionInfo : allPartitions) {
                partitions.add(new TopicPartition(partitionInfo.topic(), partitionInfo.partition()));
            }
            consumer.assign(partitions);
        }
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(10);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}
That covers the Kafka consumer. For specific details you still need to study the official documentation carefully.