Definition:
- Kafka is a distributed message queue based on the publish/subscribe model.
- It is an open-source distributed event streaming platform that is commonly used for data pipelines, streaming analytics, data integration, and mission-critical applications.
Consumption patterns:
- Point-to-point model (less used): consumers actively pull data, and a message is removed once it has been received.
- Publish/subscribe model: producers push messages to queues (topics), and consumers subscribe to the messages they need.
Basic concepts:
- Producer: the client that produces messages
- Consumer: the client that consumes messages
- Consumer Group: consumers with the same consumer group ID form a consumer group; a single consumer also consumes on behalf of a consumer group
- Broker: a Kafka server
- Topic: message topic, a classification of data
- Partition: a topic consists of multiple partitions
- Replica: each partition has multiple replicas
- Leader: the replicas of a partition consist of a leader and followers; production and consumption go only through the leader
Producer sending process:
- producer -> send(ProducerRecord) -> interceptors -> serializer -> partitioner
- batch.size: once the accumulated data reaches batch.size, the sender sends it; the default is 16 KB
- linger.ms: if the data has not reached batch.size, the sender waits for the time set by linger.ms and then sends it. The unit is ms; the default is 0 ms, meaning no delay
- compression.type: the data compression method
- RecordAccumulator: the buffer, 32 MB by default
- acks: the acknowledgment mode
  - 0: the producer does not wait for any response after sending data
  - 1: the leader responds once it has received the data
  - -1 (all): the leader responds only after all in-sync replicas (ISR) have received the data
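The interaction between batch.size and linger.ms can be sketched as a single decision: a batch goes out when it is full or when it has waited long enough. This is a simplified model for illustration, not Kafka's actual sender code; the class and method names are hypothetical.

```java
public class SenderReadinessSketch {
    /**
     * Returns true when a batch should be sent: either it has reached
     * batch.size, or it has waited at least linger.ms.
     */
    public static boolean readyToSend(int batchBytes, long waitedMs, int batchSize, long lingerMs) {
        return batchBytes >= batchSize || waitedMs >= lingerMs;
    }

    public static void main(String[] args) {
        int batchSize = 16 * 1024; // default batch.size (16 KB)
        System.out.println(readyToSend(20_000, 0, batchSize, 10)); // batch is full
        System.out.println(readyToSend(1_000, 3, batchSize, 10));  // still within linger.ms
        System.out.println(readyToSend(1_000, 10, batchSize, 10)); // linger.ms has elapsed
        System.out.println(readyToSend(1, 0, batchSize, 0));       // linger.ms = 0: no delay
    }
}
```

With the default linger.ms of 0 every record is immediately eligible, which is why raising linger.ms a little tends to produce larger batches.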
General consumption logic:
Consumer Group (CG): consumers with the same groupid form a consumer group.
- Each consumer in a consumer group consumes data from different partitions. A partition can be consumed by only one consumer within a group.
- Consumer groups do not affect each other.
- When a group has more consumers than there are partitions, some consumers will be idle.
coordinator: assists with initializing the consumer group and assigning partitions.
- Each broker node has one coordinator. A group's coordinator is chosen by hashCode(groupid) % 50, where 50 is the number of partitions of __consumer_offsets. For example, if hashCode(groupid) is 1, then 1 % 50 = 1, and the coordinator on the node that hosts the leader of partition 1 of __consumer_offsets becomes the group's coordinator.
- The coordinator selects one consumer in the group as the leader. The leader draws up a consumption plan and returns it to the coordinator, which then distributes the plan to the other consumers.
- The coordinator keeps a heartbeat with each consumer, every 3 seconds by default (heartbeat.interval.ms). After a 45-second timeout (session.timeout.ms), the consumer is removed and a rebalance is triggered.
- If a consumer takes too long to process its data, 5 minutes by default (max.poll.interval.ms), the consumer is also removed and a rebalance is triggered.
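The coordinator selection rule above can be sketched in a few lines. This assumes the hash is Java's String.hashCode made non-negative before taking the remainder; the class and method names are hypothetical.

```java
public class CoordinatorLookupSketch {
    /** Maps a group id to a partition of __consumer_offsets. */
    public static int partitionFor(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        // Guard against Integer.MIN_VALUE, whose absolute value overflows.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // "group-1" is just an example group id; 50 is the partition count mentioned above.
        System.out.println(partitionFor("group-1", 50));
    }
}
```

The broker that hosts the leader replica of the resulting partition then acts as the coordinator for that group.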
Consumption process:
- Create the consumer network client, ConsumerNetworkClient, to interact with Kafka.
- Initialize the fetch request: the minimum fetch size per batch (fetch.min.bytes), the 500 ms timeout after which data is returned even if the minimum has not been reached (fetch.max.wait.ms), and the upper limit on the size of fetched data (fetch.max.bytes).
- Send the fetch request -> the onSuccess() callback pulls the data -> the data is put into a message queue in batches.
- The consumer takes data from the message queue in batches (500 records at most) -> deserialization -> interceptors -> data processing.
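The fetch-related limits in the steps above correspond to consumer configuration keys. A minimal sketch of setting them explicitly follows; the values shown are the client defaults as far as I know them, and the class name is hypothetical.

```java
import java.util.Properties;

public class ConsumerFetchSettingsSketch {
    /** Builds the fetch-related consumer settings discussed above. */
    public static Properties fetchProps() {
        Properties props = new Properties();
        props.put("fetch.min.bytes", "1");        // minimum data returned per fetch
        props.put("fetch.max.wait.ms", "500");    // wait at most 500 ms for fetch.min.bytes
        props.put("fetch.max.bytes", "52428800"); // upper limit per fetch (50 MB)
        props.put("max.poll.records", "500");     // records handed over per poll()
        return props;
    }

    public static void main(String[] args) {
        fetchProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These properties would be merged with the connection and deserializer settings before creating the consumer.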
Consumption plan (partition assignment strategy); the default is Range + CooperativeSticky:
- Range: applied per topic. The topic's partitions and the consumers are sorted, and the number of partitions divided by the number of consumers determines how many partitions each consumer gets; if the division is not exact, the first few consumers each get one extra partition. This easily leads to data skew.
- RoundRobin: a round-robin strategy applied across all topics. It lists all topic partitions and all consumers, sorts them by hash code, and assigns partitions to consumers with a round-robin algorithm.
- Sticky: when performing a new assignment, it stays as close as possible to the previous assignment result; it first tries to distribute partitions as evenly as possible, assigning them to consumers randomly.
- CooperativeSticky: uses the same strategy as Sticky but supports cooperative rebalancing, so consumers can keep consuming from partitions that have not been reassigned.
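To make the difference between Range and RoundRobin concrete, here is a small sketch for a single topic, with partitions and consumers represented only by their sorted index. It is an illustration of the two rules described above, not Kafka's assignor implementation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AssignmentSketch {
    /** Range: contiguous blocks; the first (partitions % consumers) consumers get one extra. */
    public static List<List<Integer>> range(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        int base = partitions / consumers;
        int extra = partitions % consumers;
        int next = 0;
        for (int c = 0; c < consumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> assigned = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                assigned.add(next++);
            }
            result.add(assigned);
        }
        return result;
    }

    /** RoundRobin: partitions are dealt out to consumers one at a time. */
    public static List<List<Integer>> roundRobin(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(range(7, 3));      // [[0, 1, 2], [3, 4], [5, 6]]
        System.out.println(roundRobin(7, 3)); // [[0, 3, 6], [1, 4], [2, 5]]
    }
}
```

With Range the extra partition always lands on the first consumer of every topic, so with many topics that consumer accumulates the surplus, which is the data skew mentioned above.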
Offset: marks the consumption position
- Before 0.9: offsets are maintained in ZooKeeper
- From 0.9 on: offsets are maintained in a built-in topic, __consumer_offsets
- Data is stored as key-value pairs; key: groupid + topic + partition number, value: the offset
- Automatic offset commit: enabled by default (enable.auto.commit is true); the offset is committed automatically every 5 seconds
- Manual offset commit: the offset is committed manually while consuming
  - Synchronous: wait until the offset has been committed successfully before consuming the next batch
  - Asynchronous: continue consuming without waiting; there is no retry after a failure
- Consuming from a specified offset (auto.offset.reset):
  - earliest: automatically reset the offset to the earliest offset, like --from-beginning
  - latest (default): automatically reset the offset to the latest offset
  - none: throw an exception to the consumer if no previous offset is found for the consumer group
// Assumes `properties` already contains bootstrap.servers, the key/value deserializers, and group.id
// Enable automatic offset commits
properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true);
// Auto-commit interval: 5 s
properties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "5000");

// Alternatively, disable auto-commit and commit offsets manually
properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(properties);
// Define the topics
ArrayList<String> topics = new ArrayList<>();
topics.add("first");
// Subscribe
kafkaConsumer.subscribe(topics);
while (true) {
    ConsumerRecords<String, String> consumerRecords = kafkaConsumer.poll(Duration.ofSeconds(1));
    if (!consumerRecords.isEmpty()) {
        for (ConsumerRecord<String, String> record : consumerRecords) {
            System.out.println(record);
        }
    }
    // Commit the offsets manually (asynchronously)
    kafkaConsumer.commitAsync();
}
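The rule for where a consumer group starts reading (a committed offset if one exists, otherwise the earliest/latest/none policy) can be modelled with a plain map. This is an in-memory sketch of the behaviour, not the broker's storage format; the key layout "group|topic|partition" and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class OffsetResetSketch {
    // Committed offsets keyed by group + topic + partition.
    private final Map<String, Long> committed = new HashMap<>();

    private static String key(String group, String topic, int partition) {
        return group + "|" + topic + "|" + partition;
    }

    public void commit(String group, String topic, int partition, long offset) {
        committed.put(key(group, topic, partition), offset);
    }

    /** Returns the offset to start from, applying the reset policy only when nothing was committed. */
    public long startPosition(String group, String topic, int partition,
                              long logStart, long logEnd, String resetPolicy) {
        Long offset = committed.get(key(group, topic, partition));
        if (offset != null) {
            return offset;
        }
        switch (resetPolicy) {
            case "earliest":
                return logStart;
            case "latest":
                return logEnd;
            default: // "none"
                throw new IllegalStateException("No committed offset for group " + group);
        }
    }

    public static void main(String[] args) {
        OffsetResetSketch store = new OffsetResetSketch();
        System.out.println(store.startPosition("group-1", "first", 0, 5, 100, "earliest"));
        System.out.println(store.startPosition("group-1", "first", 0, 5, 100, "latest"));
        store.commit("group-1", "first", 0, 42);
        System.out.println(store.startPosition("group-1", "first", 0, 5, 100, "latest"));
    }
}
```

Once an offset has been committed the policy no longer matters, which is why a group that has already consumed a topic does not re-read it when restarted with earliest.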
Consuming from a specified time:
// Get the partitions assigned to this consumer
Set<TopicPartition> partitions = kafkaConsumer.assignment();
// Poll until the partition assignment has been completed
while (partitions.isEmpty()) {
    kafkaConsumer.poll(Duration.ofSeconds(1));
    partitions = kafkaConsumer.assignment();
}
// Convert the timestamp into the corresponding offset for each partition
Map<TopicPartition, Long> timestampsToSearch = new HashMap<>();
for (TopicPartition topicPartition : partitions) {
    // One day ago
    timestampsToSearch.put(topicPartition, System.currentTimeMillis() - 24L * 3600 * 1000);
}
Map<TopicPartition, OffsetAndTimestamp> offsetsForTimeMap = kafkaConsumer.offsetsForTimes(timestampsToSearch);
for (TopicPartition partition : partitions) {
    OffsetAndTimestamp offsetAndTimestamp = offsetsForTimeMap.get(partition);
    // The value is null when the partition has no record at or after the timestamp
    if (offsetAndTimestamp != null) {
        kafkaConsumer.seek(partition, offsetAndTimestamp.offset());
    }
}
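Conceptually, the time-to-offset lookup returns the first offset whose timestamp is at or after the requested time. The sketch below models a partition as an array of timestamps indexed by offset and assumes the timestamps never decrease; it illustrates the semantics only and is not how the broker implements the lookup. The names are hypothetical.

```java
public class OffsetForTimeSketch {
    /** Returns the first offset whose timestamp is >= target, or -1 if there is none. */
    public static long offsetForTime(long[] timestamps, long target) {
        int lo = 0;
        int hi = timestamps.length; // search the half-open range [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (timestamps[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < timestamps.length ? lo : -1;
    }

    public static void main(String[] args) {
        long[] timestamps = {100, 200, 300};
        System.out.println(offsetForTime(timestamps, 150)); // 1
        System.out.println(offsetForTime(timestamps, 301)); // -1
    }
}
```

The -1 case corresponds to the null entry that the consumer code has to check for before calling seek.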
Kafka file storage mechanism:
- A topic is a logical concept and a partition is a physical concept; each partition corresponds to one log file, which stores the data produced by producers.
- Data produced by the producer is continuously appended to the end of the log file. To keep the log file from growing too large and making lookups slow, Kafka uses a segmenting and indexing mechanism.
- Each partition is divided into multiple segments, and each segment contains .index, .log, .timeindex, and .snapshot files.
- These files are located in one folder, named topic name + partition number, for example first-0.
- Sparse index: roughly every 4 KB of data written to the log file, one index entry is written to the index file.
- The offset saved in the index file is a relative offset, which keeps the space taken by each offset value small, so the offset can be stored in a fixed size.
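A lookup through a sparse index can be sketched with a sorted map: find the largest indexed relative offset that does not exceed the target, then scan the log forward from that byte position. The sketch assumes segment files are named after their base offset, zero-padded to 20 digits; the index contents and all names are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class SparseIndexSketch {
    /** Segment file name: the base offset left-padded with zeros to 20 digits. */
    public static String segmentFileName(long baseOffset) {
        return String.format(Locale.ROOT, "%020d.log", baseOffset);
    }

    /**
     * index maps relative offset -> byte position in the .log file.
     * Returns the byte position from which to start scanning for targetOffset.
     */
    public static long scanStart(long baseOffset, TreeMap<Long, Long> index, long targetOffset) {
        long relativeOffset = targetOffset - baseOffset;
        Map.Entry<Long, Long> entry = index.floorEntry(relativeOffset);
        return entry == null ? 0L : entry.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> index = new TreeMap<>();
        index.put(0L, 0L);      // one entry roughly every 4 KB of log data
        index.put(35L, 4096L);
        index.put(71L, 8192L);
        System.out.println(segmentFileName(1000));      // 00000000000000001000.log
        System.out.println(scanStart(1000, index, 1050)); // 4096
    }
}
```

Storing offsets relative to the segment's base offset is what lets each index entry stay small and fixed in size.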
Log cleanup and compaction strategies:
- Kafka's default log retention time is 7 days.
- Compaction strategy (compact): for records with the same key, only the latest value is retained.
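The effect of the compact policy can be illustrated on an in-memory list of key/value records: after compaction only the most recent value of each key remains. This is a sketch of the outcome, not of how the broker's log cleaner works; the names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {
    /** Keeps only the latest value per key, ordered by each key's last occurrence. */
    public static Map<String, String> compact(List<String[]> records) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : records) {
            latest.remove(record[0]);         // drop the older value, if any
            latest.put(record[0], record[1]); // re-insert at the latest position
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[]{"k1", "v1"},
                new String[]{"k2", "v2"},
                new String[]{"k1", "v3"});
        System.out.println(compact(records)); // {k2=v2, k1=v3}
    }
}
```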
Why Kafka reads and writes efficiently:
- Kafka itself is a distributed cluster that uses partitioning, so it has a high degree of parallelism.
- Data is read using a sparse index, which quickly locates the data to be consumed.
- Sequential disk writes: data produced by Kafka producers is written to the log file by appending to the end of the file, which is a sequential write.
- Zero copy: Kafka's data processing is handled by the producers and consumers. The broker's application layer does not care about the stored data, so the data does not need to pass through the application layer, and transmission efficiency is high.
- Page cache: Kafka relies heavily on the page cache provided by the underlying operating system. On a write, the operating system just writes the data to the page cache. On a read, the page cache is searched first, and the data is read from disk only if it is not found there. In effect, the page cache uses as much free memory as possible as a disk cache.
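In Java, zero-copy transfer is exposed through FileChannel.transferTo, which lets the operating system move bytes between channels without copying them into application buffers. The sketch below uses a second temporary file as a stand-in for the network socket a broker would write to; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    /** Moves all bytes from source to target with FileChannel.transferTo. */
    public static long transfer(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = in.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    /** Writes text to a temp file, transfers it to a second temp file, and reads it back. */
    public static String roundTrip(String text) {
        try {
            Path source = Files.createTempFile("segment", ".log");
            Path target = Files.createTempFile("target", ".bin");
            try {
                Files.write(source, text.getBytes(StandardCharsets.UTF_8));
                transfer(source, target);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello kafka"));
    }
}
```

Because the application never inspects the bytes, they can go from the page cache to the destination channel directly, which matches the point above that the broker does not need to touch the stored data.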
Commonly used scripts:
- Topic-related commands:
- List topics:
sh kafka-topics.sh --bootstrap-server localhost:9092 --list
- Create a topic (name: first, 1 partition, 3 replicas). The number of replicas cannot exceed the number of brokers in the cluster.
sh kafka-topics.sh --bootstrap-server localhost:9092 --topic first --create --partitions 1 --replication-factor 3
- Describe a topic:
sh kafka-topics.sh --bootstrap-server localhost:9092 --topic first --describe
- Change the number of partitions of a topic (it can only be increased):
sh kafka-topics.sh --bootstrap-server localhost:9092 --topic first --alter --partitions 3
- Produce messages:
sh kafka-console-producer.sh --bootstrap-server localhost:9092 --topic first
- Consume messages (the second command reads from the beginning of the topic):
sh kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic first
sh kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic first --from-beginning
Simple Spring Boot integration:
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>
server:
  port: 8200
spring:
  mvc:
    pathmatch:
      matching-strategy: ant_path_matcher
  application:
    name: @artifactId@
  kafka:
    bootstrap-servers:
      - 192.168.1.250:32010
    # Producer settings
    producer:
      # Serializers
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: org.apache.kafka.common.serialization.StringSerializer
      properties:
        linger.ms: 10 # sender wait time
        # SASL authentication settings
        # sasl.mechanism: PLAIN
        # security.protocol: SASL_PLAINTEXT
        # sasl.jaas.config: org.apache.kafka.common.security.plain.PlainLoginModule required username="admin" password="admin";
      # Buffer size: 32 MB
      buffer-memory: 33554432
      # Batch size: 16 KB (the value is in bytes)
      batch-size: 16384
      # All ISR replicas must acknowledge
      #acks: -1
      # Transaction ID prefix; used with @Transactional to make sending multiple messages atomic
      #transaction-id-prefix: "transaction-id-xx"
    # Consumer settings
    consumer:
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      #group-id: xiaoshu-1
      enable-auto-commit: false
      # Start from the earliest message; once consumed, the offset is recorded and the same group-id will not consume it again
      # Offsets are tracked per consumer group
      auto-offset-reset: earliest
      # Batch consumption: maximum number of records per poll
      #max-poll-records: 50
      # SASL authentication settings
      # properties:
      #   sasl.mechanism: PLAIN
      #   security.protocol: SASL_PLAINTEXT
      #   sasl.jaas.config: org.apache.kafka.common.security.plain.PlainLoginModule required username="admin" password="admin";
    listener:
      # Offsets are committed after Acknowledgment.acknowledge() is called manually (use manual_immediate to commit right away)
      ack-mode: manual
      # Batch consumption, used with @KafkaListener(batch = "true")
      #type: batch
Producing messages:
@Resource
private KafkaTemplate<String, String> kafkaTemplate;

// Add @Transactional(rollbackFor = RuntimeException.class), together with the acks setting, to send multiple messages atomically
@ApiOperation(value = "Push a message to Kafka")
@GetMapping("/sendMsg")
public String sendMsg(String topic, String msg) {
    kafkaTemplate.send(topic, msg).addCallback(success -> {
        if (success == null) {
            System.out.println("Message send failed");
            return;
        }
        // Topic the message was sent to
        String topicName = success.getRecordMetadata().topic();
        // Partition the message was sent to
        int partition = success.getRecordMetadata().partition();
        // Offset of the message within the partition
        long offset = success.getRecordMetadata().offset();
        System.out.println("Message sent: " + topicName + "-" + partition + "-" + offset);
    }, failure -> {
        System.out.println("Message send failed: " + failure.getMessage());
    });
    return "ok";
}
Consuming messages:
@Slf4j // Assumption: Lombok is on the classpath and provides the `log` field used below
@Configuration
public class KafkaConsumer {
    private static final String TOPIC_DLT = ".DLT";
    @Autowired
    private KafkaTemplate<String, Object> kafkaTemplate;

    /**
     * Each partition is consumed by one consumer in the consumer group, and each consumer is independent.
     * One listener per partition: 2 partitions, 2 listeners.
     * @param record the received record
     * @param consumer the underlying Kafka consumer
     */
    @KafkaListener(groupId = "group-1", topicPartitions = {
            @TopicPartition(topic = "four", partitions = {"0"})}, batch = "false")
    public void consumerTopic1(ConsumerRecord<String, String> record, Consumer<?, ?> consumer) {
        String value = record.value();
        String topic1 = record.topic();
        long offset = record.offset();
        int partition = record.partition();
        try {
            log.info("Received message: " + value + ", topic: " + topic1 + ", offset: " + offset + ", partition: " + partition);
            // TODO on exception, forward to the corresponding dead-letter topic ↓
            // int i = 1 / 0;
        } catch (Exception e) {
            log.error("Processing failed, forwarding to the dead-letter topic", e);
            kafkaTemplate.send(topic1 + TOPIC_DLT, value);
        } finally {
            consumer.commitAsync();
        }
    }
    @KafkaListener(groupId = "group-1", topicPartitions = {
            @TopicPartition(topic = "four", partitions = {"1"})}, batch = "false")
    public void consumerTopic2(ConsumerRecord<String, String> record, Consumer<?, ?> consumer) {
        String value = record.value();
        String topic1 = record.topic();
        long offset = record.offset();
        int partition = record.partition();
        try {
            log.info("Received message: " + value + ", topic: " + topic1 + ", offset: " + offset + ", partition: " + partition);
            // TODO on exception, forward to the corresponding dead-letter topic ↓
            // int i = 1 / 0;
        } catch (Exception e) {
            log.error("Processing failed, forwarding to the dead-letter topic", e);
            kafkaTemplate.send(topic1 + TOPIC_DLT, value);
        } finally {
            consumer.commitAsync();
        }
    }
    /**
     * Listen on topic1 and forward to topic2
     */
    @KafkaListener(topics = {"topic1"}, groupId = "group-4")
    @SendTo("topic2")
    public String onMessage7(ConsumerRecord<?, ?> record) {
        return record.value() + "-forwarded message";
    }

    @KafkaListener(topics = {"topic2"}, groupId = "group-5")
    public void onMessage8(ConsumerRecord<?, ?> record) {
        System.out.println("Received forwarded message: " + record.value());
    }
}