overview
Core idea
Producers publish messages to Kafka topics, and consumers that need those messages subscribe to the topics.
This model introduces several important Kafka concepts:
- Producer: the party that produces messages.
- Consumer: the party that consumes messages.
- Broker: a single, independent Kafka instance.
  - Multiple Kafka brokers form a Kafka cluster.
  - Each broker contains two important concepts: Topic and Partition.
- Topic: producers send messages to a specific topic, and consumers receive messages by subscribing to that topic.
- Partition: a partition is part of a topic. A topic can have multiple partitions, and the partitions of one topic can be distributed across different brokers, which means a single topic can span multiple brokers. A partition in Kafka can be viewed as a message queue.
- Consumer group: when multiple consumers in the same group subscribe to a topic, each message is delivered to only one consumer in the group.
producer
example
sending process
- The producer sends messages through KafkaProducer. Each message passes through an interceptor, a serializer, and a partitioner, and is then stored in the RecordAccumulator.
- The Kafka producer does not send each message to the cluster immediately. It accumulates a batch of messages in the RecordAccumulator double-ended queue and sends them together, which reduces network overhead and improves performance. The RecordAccumulator involves two configuration parameters:
  - batch.size: the sender thread is invoked to send a batch only once the messages in the queue reach this size.
  - linger.ms: if the data does not reach batch.size, it is sent once the time set by linger.ms (unit: ms) has elapsed. The default is 0, meaning no delay.
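The interaction of these two parameters can be sketched as a simple flush predicate (an illustrative simplification, not the actual client code):

```java
public class FlushDecision {
    // Returns true when the accumulator should hand the batch to the
    // sender thread: either batch.size bytes are buffered, or the batch
    // has waited at least linger.ms.
    static boolean shouldFlush(int bufferedBytes, int batchSize,
                               long waitedMs, long lingerMs) {
        return bufferedBytes >= batchSize || waitedMs >= lingerMs;
    }

    public static void main(String[] args) {
        // batch.size = 16384 (16 KB, the client default), linger.ms = 5
        System.out.println(shouldFlush(16384, 16384, 0, 5)); // true: size reached
        System.out.println(shouldFlush(100, 16384, 5, 5));   // true: linger expired
        System.out.println(shouldFlush(100, 16384, 0, 5));   // false: keep batching
    }
}
```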
- The sender thread groups messages destined for the same broker into a sequence of send requests. If the broker does not respond, up to 5 in-flight request instances are cached per connection.
- The acks mechanism for confirming send requests:
  - acks=0: the producer does not wait for any acknowledgment from the broker.
  - acks=1: the broker responds once the leader has written the message to its log.
  - acks=all (i.e. -1): the broker responds only after both the leader and all replicas in the ISR (the set of followers that must stay synchronized) have persisted the message.
- If a send request fails, it is retried. If no acknowledgment arrives from the broker within the configured number of retries, the send is abandoned and an error is returned.
Send messages synchronously/asynchronously
Producer parameter configuration
acks - acknowledgment mechanism
retries - number of retries
compression.type - message compression type
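A minimal sketch of setting these parameters together, using the real Kafka client property names (the values are illustrative):

```java
import java.util.Properties;

public class ProducerReliabilityConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("acks", "all");   // wait for the leader and all ISR replicas
        props.put("retries", "3");  // retry failed sends before giving up
        // at most 5 in-flight (unacknowledged) requests per broker connection
        props.put("max.in.flight.requests.per.connection", "5");
        props.put("compression.type", "lz4"); // compress batches before sending
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("acks")); // prints all
    }
}
```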
Production experience
improve throughput
- Increase batch.size, and extend linger.ms to match.
- Set compression.type to compress messages.
- Appropriately increase the RecordAccumulator buffer size (buffer.memory).
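These tuning knobs map to producer properties like so (a sketch; the values are illustrative, not recommendations):

```java
import java.util.Properties;

public class ThroughputConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("batch.size", "32768");       // 32 KB batches (client default is 16 KB)
        props.put("linger.ms", "20");           // wait up to 20 ms to fill a batch
        props.put("compression.type", "lz4");   // trade CPU for smaller requests
        props.put("buffer.memory", "67108864"); // 64 MB RecordAccumulator buffer
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("compression.type")); // prints lz4
    }
}
```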
data reliability
Data reliability is guaranteed by the acks mechanism:
with acks=all, both the leader and every ISR replica must persist the message before the broker responds. What if an ISR node fails during this period?
A follower that has not sent a synchronization request or synchronized data with the leader for too long is removed from the ISR. The time threshold is set by the replica.lag.time.max.ms parameter, which defaults to 30 s.
Data duplication - idempotence and transactions
idempotence
Idempotency means that no matter how much duplicate data is sent by the Producer to the Broker, the Broker will only persist one piece of data to ensure no duplication.
Several concepts:
- PID (Producer ID): Kafka assigns the producer a new PID each time it restarts
- Partition: partition number
- Sequence Number: Monotonically increasing sequence number
Criterion for duplicate data: when messages with the same <PID, Partition, SeqNumber> key are submitted, the broker persists only one of them. Idempotence therefore only guarantees no duplicates within a single partition and a single session.
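The broker-side deduplication rule can be simulated with a map keyed on <PID, Partition> that tracks the last persisted sequence number (a simplification: the real broker keeps a window of recent sequence numbers per producer):

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotenceCheck {
    // Highest sequence number persisted per (PID, partition).
    private final Map<String, Long> lastSeq = new HashMap<>();

    // Returns true if the record is new and should be persisted,
    // false if it duplicates an already-persisted record.
    boolean persist(long pid, int partition, long seqNumber) {
        String key = pid + "/" + partition;
        Long last = lastSeq.get(key);
        if (last != null && seqNumber <= last) {
            return false; // same <PID, Partition, SeqNumber>: drop the retry
        }
        lastSeq.put(key, seqNumber);
        return true;
    }

    public static void main(String[] args) {
        IdempotenceCheck broker = new IdempotenceCheck();
        System.out.println(broker.persist(1L, 0, 0L)); // true: first delivery
        System.out.println(broker.persist(1L, 0, 0L)); // false: retried duplicate
        System.out.println(broker.persist(2L, 0, 0L)); // true: new PID after restart
    }
}
```

The last line shows why idempotence is per-session: a restarted producer gets a new PID, so the broker cannot recognize its resent data as duplicates.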
How to use idempotence: it is controlled by the parameter enable.idempotence, which defaults to true; setting it to false disables idempotence.
Producer transactions
Idempotence only prevents duplicates within a single partition and a single session; once the producer restarts, duplicate data may still be produced. In this case, producer transactions are needed.
To enable transactions, idempotence must be enabled first!
data order
- Within a single partition, messages are stored in the order the broker receives them, so ordering is naturally guaranteed.
- Across multiple partitions there is no ordering between partitions. To ensure order, the consumer side must maintain an ordered window, receiving messages and sliding the window forward in sequence.
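The consumer-side ordered window can be sketched as a reorder buffer that releases messages only in sequence order (illustrative; the sequence numbers here are assumed to be attached by the producer):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrderingWindow {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextSeq = 0; // the next sequence number we may release

    // Buffers out-of-order messages and releases them strictly in
    // sequence order once every gap before them has been filled.
    List<String> accept(long seq, String msg) {
        pending.put(seq, msg);
        List<String> released = new ArrayList<>();
        while (pending.containsKey(nextSeq)) {
            released.add(pending.remove(nextSeq));
            nextSeq++;
        }
        return released;
    }

    public static void main(String[] args) {
        OrderingWindow w = new OrderingWindow();
        System.out.println(w.accept(1, "b")); // []     : still waiting for seq 0
        System.out.println(w.accept(0, "a")); // [a, b] : gap filled, window slides
    }
}
```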
Partition mechanism
The Kafka partition mechanism allows a topic's messages to be stored in different partitions on different brokers.
partition strategy
By default, records with a key are assigned to a partition by hashing the key; records without a key are spread across partitions (round-robin in older clients, sticky batching in newer ones).
custom partitioner
Step 1: Create a custom partitioner class that implements the Partitioner interface.
Step 2: Override the partition() method with the custom partitioning rule.
Step 3: Register the custom partitioner in the producer configuration.
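The core of step 2 is the partition-selection rule. Below is a minimal sketch of a key-hash rule as a standalone method; a real implementation would live inside the partition() method of a class implementing the Partitioner interface, and Kafka's own default partitioner hashes with murmur2 rather than hashCode():

```java
public class OrderIdPartitioner {
    // Map a message key to a partition index in [0, numPartitions).
    static int partition(String key, int numPartitions) {
        if (key == null) {
            return 0; // real clients spread keyless records across partitions
        }
        // abs() after % keeps the result in range even for negative hashes
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int p = partition("order-42", 3); // hypothetical order-id key
        System.out.println(p >= 0 && p < 3); // prints true
    }
}
```

Because the same key always hashes to the same partition, all messages for one key stay in order on one partition.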
Broker
consumer
consumption example
consumer group
Consumption patterns
The consumer actively pulls data from the broker.
consumption process
- Consumers obtain messages from partitions by pulling; within a consumer group, each partition is consumed by only one consumer.
- After consuming messages, the consumer commits its offset to the broker's internal topic __consumer_offsets, recording its consumption position so that it can resume from there after a restart or failure.
commits and offsets
As consumers consume messages, they track their position (offset) in each partition and automatically send a message containing those partition offsets to a special topic called __consumer_offsets.
A rebalance is triggered if a consumer crashes or a new consumer joins the group. For example, if consumer 2 fails, partition 3 and partition 4 are reassigned to other consumers.
In auto-commit mode, the rebalancing mechanism can cause problems, because the offset last committed by the dead consumer may not match the offset the newly assigned consumer should actually process next.
The committed offset is less than the offset being processed:
If the committed offset is less than the offset of the last message being processed, messages between the two offsets will be processed repeatedly.
The committed offset is greater than the offset being processed:
If the committed offset is greater than the offset of the last message being processed, messages between the two offsets will be lost.
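The two failure modes differ only in the sign of the gap between the committed offset and the last processed offset:

```java
public class RebalanceGap {
    // After a rebalance, the new consumer resumes from committedOffset.
    // Positive result: that many messages will be processed again;
    // negative result: that many messages are skipped (lost).
    static long reprocessedCount(long committedOffset, long processedOffset) {
        return processedOffset - committedOffset;
    }

    public static void main(String[] args) {
        System.out.println(reprocessedCount(1000, 1500)); // prints 500 (duplicates)
        System.out.println(reprocessedCount(2000, 1500)); // prints -500 (500 lost)
    }
}
```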
Offset commit methods
Manual commit
First set enable.auto.commit to false:
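For example (the group.id value here is a placeholder):

```java
import java.util.Properties;

public class ManualCommitConfig {
    static Properties build() {
        Properties props = new Properties();
        // disable auto-commit so offsets are committed only explicitly
        props.put("enable.auto.commit", "false");
        props.put("group.id", "demo-group"); // hypothetical group name
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("enable.auto.commit")); // prints false
    }
}
```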
Synchronous Commit
The simplest and most reliable way to commit offsets is commitSync(). This API commits the latest offset returned by poll(), returns as soon as the commit succeeds, and throws an exception if the commit fails.
Because commitSync() commits the latest offset returned by poll(), make sure to call it only after all records from that poll have been processed; otherwise there is still a risk of losing messages.
If a rebalance occurs, all messages from the most recent committed batch up to the point of the rebalance will be reprocessed.
As long as no unrecoverable error occurs, commitSync() keeps retrying until the commit succeeds. If the commit ultimately fails, all we can do is record the exception in the error log.
Asynchronous commits
Synchronous commits have the disadvantage that the application blocks until the broker responds to the commit request, which limits the throughput of the application. We can improve throughput by reducing commit frequency, but if rebalancing occurs, it will increase the number of duplicate messages. At this time, you can use the asynchronous submission API. We just send the commit request without waiting for the broker's response.
commitSync() keeps retrying until it commits successfully or encounters an unrecoverable error, but commitAsync() does not, and this is a weakness of commitAsync(). The reason it does not retry is that by the time the response arrives, a larger offset may already have been committed successfully. Suppose we send a request to commit offset 2000, and a transient network problem prevents the broker from receiving the request, so it never responds. Meanwhile, we process another batch of messages and successfully commit offset 3000. If commitAsync() retried the commit of offset 2000, it might succeed after offset 3000 was committed; a rebalance at that moment would produce duplicate messages.
commitAsync() also supports a callback, which is executed when the broker responds. Callbacks are commonly used to log commit errors or generate metrics; if you want to use one for retrying, you must be careful about the commit order.
Synchronous and asynchronous mixed submission
In general, the occasional commit failure without retrying is not a big problem: if the failure was caused by a transient issue, a subsequent commit will succeed. But if this is the last commit before closing the consumer or before a rebalance, we must make sure it succeeds. In that case, we should consider a hybrid commit approach:
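A sketch of the pattern, using a tiny stand-in broker instead of a real KafkaConsumer so the control flow is visible: commitAsync on the hot path, and a retried commitSync in the finally block before closing.

```java
// Stand-in for a broker connection: fails the first few commit attempts
// to show why the final commitSync must retry.
class FlakyBroker {
    private int failuresLeft;
    private long committedOffset = -1;
    FlakyBroker(int failures) { this.failuresLeft = failures; }
    boolean tryCommit(long offset) {
        if (failuresLeft > 0) { failuresLeft--; return false; }
        committedOffset = Math.max(committedOffset, offset);
        return true;
    }
    long committedOffset() { return committedOffset; }
}

public class HybridCommit {
    // commitAsync analogue: fire-and-forget, never retried.
    static void commitAsync(FlakyBroker broker, long offset) {
        broker.tryCommit(offset); // a failure would only be logged in real code
    }

    // commitSync analogue: retried until the commit succeeds.
    static void commitSync(FlakyBroker broker, long offset) {
        while (!broker.tryCommit(offset)) { /* retry */ }
    }

    public static void main(String[] args) {
        FlakyBroker broker = new FlakyBroker(2);
        long offset = 0;
        try {
            for (int batch = 0; batch < 3; batch++) {
                offset += 10;                // "process" a batch of records
                commitAsync(broker, offset); // cheap commit on the hot path
            }
        } finally {
            commitSync(broker, offset);      // guaranteed commit before close
        }
        System.out.println(broker.committedOffset()); // prints 30
    }
}
```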
High availability design
cluster
- Cluster mode: a Kafka cluster is composed of multiple brokers; if one broker goes down, brokers on other machines can still serve requests.
backup
- Replication: to keep messages in Kafka safe, data is replicated, and two types of replicas are defined:
  - Leader replica: producers send messages to the leader replica first; each partition has exactly one leader.
  - Follower replicas: the leader synchronizes its messages to the followers. There can be multiple followers, divided into two categories:
    - ISR (in-sync replicas): followers that must stay synchronized with the leader.
    - Normal followers: replicas that synchronize with the leader asynchronously.
- When the leader fails, a new leader needs to be elected. The election principles are as follows:
- Select from the ISR first, because the message data in the ISR is synchronized with the leader
- If the follower in the ISR list is not available, it can only be selected from other followers.
- Extreme case: if all replicas fail, there are two options:
  - Wait for a replica in the ISR to come back alive and elect it as leader: the data is reliable, but the recovery time is unbounded.
  - Elect the first replica that comes back alive, whether or not it is in the ISR: availability is restored as quickly as possible, but the data may be incomplete.
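The election preference above can be sketched as: try live ISR members first, then fall back to any live replica (in real Kafka this unclean fallback is gated by the unclean.leader.election.enable setting):

```java
import java.util.Arrays;
import java.util.List;

public class LeaderElection {
    // Picks a new leader: prefer the first live replica in the ISR,
    // otherwise any live replica (possibly losing un-synced data).
    static String electLeader(List<String> isr, List<String> allReplicas,
                              List<String> liveBrokers) {
        for (String replica : isr) {
            if (liveBrokers.contains(replica)) return replica;
        }
        for (String replica : allReplicas) {
            if (liveBrokers.contains(replica)) return replica;
        }
        return null; // no replica alive: the partition is offline
    }

    public static void main(String[] args) {
        List<String> isr = Arrays.asList("b1", "b2");
        List<String> all = Arrays.asList("b1", "b2", "b3");
        System.out.println(electLeader(isr, all, Arrays.asList("b2", "b3"))); // b2
        System.out.println(electLeader(isr, all, Arrays.asList("b3")));       // b3
    }
}
```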
SpringBoot integrates Kafka
basic use