Kafka source code analysis - series 1 - strategies and semantics of message queues

- Introduction to key concepts of Kafka
- Various strategies and semantics of message queues

As a message queue, Kafka is already quite well known in the industry. Compared with traditional RabbitMQ/ActiveMQ, Kafka is distributed by design, and supports data partitioning, replication, and convenient cluster expansion.

At the same time, Kafka is a highly reliable, persistent message queue, and this reliability does not come at the expense of performance.

Conversely, in business scenarios where message loss is acceptable, Kafka can run in a non-ACK, asynchronous manner to maximize performance.

Starting from this article, this series will comprehensively analyze Kafka as a message middleware, from shallow to deep, from usage to principles to source code. (The Kafka source code used is 0.9.0.)

Introduction to key concepts

topic

The following is the logical structure of Kafka: each topic is a user-defined queue; producers put messages into the queue, consumers fetch messages from it, and topics are independent of each other.

broker

A broker corresponds to the physical structure of Kafka: each broker is usually a physical machine on which an instance of the Kafka server runs, and all these broker instances together form a Kafka server cluster.

Each broker assigns itself a unique broker id. The broker cluster is managed by a ZooKeeper cluster: every broker registers itself on ZooKeeper, so when a machine goes down or a new machine joins, ZooKeeper is notified. In 0.9.0, the producer/consumer no longer relies on ZooKeeper to obtain the cluster's configuration information, but obtains the configuration of the entire cluster through any broker. In other words, only the server side depends on ZooKeeper; the client does not.
partition

A Kafka topic is stored on each machine as files, and these files are grouped into directories: a partition corresponds to one directory. For example, a topic named abc that is divided into 10 partitions appears on a machine's disk as:
abc_0
abc_1
abc_2
abc_3
...
abc_9

In each of these directories, a set of message files is stored, and messages are written in append-log fashion. This will be explained in detail later.

replica/leader/follower

All messages of each partition of each topic are not stored in just one copy, but are stored redundantly on multiple brokers, thereby improving the reliability of the system.

These multiple machines are called a replica set.

In this replica set, one leader needs to be elected, and the rest are followers. That is, master/slave.

When a message is sent, it is sent only to the leader, and the leader then replicates it to the followers.

This raises a question: after the leader receives a message, does it return to the producer immediately, or does it wait until all followers have written the message before returning? We will explain this in detail later.

Key point: replica/leader/follower here are logical concepts, and they are relative to a "partition", not a "topic". That is to say, different partitions of the same topic can have different replica sets.

For example:
"abc_0" <1,3,5> // abc_0's replica set is brokers 1, 3, 5; the leader is 1, the followers are 3 and 5
"abc_1" <1,3,7> // abc_1's replica set is brokers 1, 3, 7; the leader is 1, the followers are 3 and 7
"abc_2" <3,7,9>
"abc_3" <1,7,9>
"abc_4" <1,3,5>

Various strategies and semantics of message queues

On the surface, using a message queue seems very simple: one end puts messages in, the other end takes them out. But this simple put-and-take involves a number of strategies.

Producer strategy

Whether to ACK

The so-called ACK means that after the server receives a message, it saves it and only then returns to the client, rather than returning immediately. Obviously, whether or not to ACK is an important factor affecting performance. In Kafka, request.required.acks has 3 values, corresponding to 3 strategies:

request.required.acks
// 0: return without waiting for the server's ack; highest performance, but data may be lost
// 1: the leader confirms that the message is saved, then returns
// all: the leader and all followers confirm that the message is saved, then return; the most reliable

Remark: in versions before 0.9.0, -1 is used to indicate all.
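As a minimal sketch (not from the source; assumes a local broker at localhost:9092), here is how the ack level is set on the new Java producer client. Note that in the new client the property is named acks, rather than the old request.required.acks:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AckDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // any broker; the client does not need zk
        props.put("acks", "all"); // "0", "1" or "all", matching the 3 strategies above
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("abc", "key", "value")); // topic "abc" from the example above
        producer.close();
    }
}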

Synchronous sending vs asynchronous sending

So-called asynchronous sending means that the client has a local buffer: messages are first stored in the local buffer, and a background thread then sends them.

In 0.8.2 and earlier versions, synchronous sending and asynchronous sending were implemented separately, in the Scala language. Starting from 0.8.2, a new Java client API was introduced. In this API, synchronous sending is actually implemented indirectly on top of asynchronous sending:
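A minimal sketch of this (a fragment, reusing the producer configured in the earlier sketch): the new Java API exposes only an asynchronous send() that returns a java.util.concurrent.Future, and blocking on that Future with get() yields a synchronous send.

// Asynchronous: send() puts the record into the local buffer and returns a Future
// immediately; the background I/O thread sends it later.
Future<RecordMetadata> future = producer.send(new ProducerRecord<>("abc", "hello"));

// Synchronous: block on the Future until the broker acks (or an exception is thrown).
RecordMetadata metadata = future.get();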

Under asynchronous sending, the following 4 parameters need to be configured:

(1) The maximum size of the local buffer
buffer.memory // the default is 33554432, i.e. 32M

(2) Whether the client blocks or throws an exception when the buffer is full
block.on.buffer.full
// true (the default): block
// false: throw an exception

(3) How much data can be sent in one batch
batch.size // the default is 16384 bytes, i.e. 16K

(4) How long to wait at most before a batch is sent
linger.ms // the default is 0
// similar to Nagle's algorithm in TCP/IP: > 0 means requests accumulate in the queue and are then sent in batches

Obviously, asynchronous sending improves sending performance, but if the client crashes, the data still sitting in the buffer may be lost.

RabbitMQ and ActiveMQ both emphasize reliability, so they allow neither non-ACK sending nor an asynchronous sending mode. Kafka provides this flexibility, letting users make their own trade-off between performance and reliability.

(5) The maximum size of a message
max.request.size // the default is 1048576, i.e. 1M

This parameter affects the batch size: if a single message is larger than the batch size (16K), the batch grows accordingly. A combined configuration sketch of parameters (1)-(5) follows.
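Putting the five parameters together, a minimal configuration sketch (values are the defaults quoted above, used the same way as the Properties in the earlier producer sketch):

Properties props = new Properties();
props.put("buffer.memory", "33554432");      // (1) 32M local buffer
props.put("block.on.buffer.full", "true");   // (2) block rather than throw when the buffer is full
props.put("batch.size", "16384");            // (3) 16K per batch
props.put("linger.ms", "0");                 // (4) do not linger; send as soon as possible
props.put("max.request.size", "1048576");    // (5) 1M maximum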

Consumer strategy

Push vs Pull

All message queues have to face a question: does the broker push messages to the consumer, or does the consumer actively pull messages from the broker?

Kafka chose the pull approach. Why? Because pulling is more flexible: how often messages should be fetched, and whether they can be delayed and fetched in batches, is information that only the consumer itself knows best!

Therefore, control is handed to the consumer, which controls its own consumption rate: when it processes messages slowly it can slow down, and when it processes them quickly it can speed up. With push, implementing such a flexible control strategy requires an additional protocol for the consumer to tell the broker to slow down or speed up, which increases implementation complexity.

In addition, in pull mode, the consumer can easily and adaptively control whether messages are consumed in batches or, to minimize latency, fetched one at a time as soon as they arrive.
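A minimal sketch of the pull model (a fragment; assumes a KafkaConsumer<String, String> named consumer that has subscribed to a topic, and a hypothetical process() business handler):

while (true) {
    // The consumer decides when to pull and how long to block (here at most 100 ms),
    // so it fully controls its own consumption rate and batching.
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        process(record); // hypothetical handler; not part of the Kafka API
}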

Confirmation of consumption

On the consumer side, a problem that all message queues must solve is "consumption confirmation": the consumer fetches a message and then crashes while processing it. If the broker considers the message consumed at that moment, the message is lost.

One solution is for the consumer to send a confirm message to the broker after it finishes consuming. Once the broker receives the confirm, it deletes the message.

To achieve this, the broker must maintain the state of every message (sent/consumed), which obviously complicates the broker's implementation. There is also another problem: if the consumer crashes after consuming the message but while sending the confirm, the message will be consumed again, i.e. repeated consumption.

Kafka does not solve this problem directly; instead, it introduces an offset rollback mechanism that solves it indirectly. In Kafka, messages are stored for a week before being deleted, and within a partition messages are stored in order of increasing offset, so a consumer can roll back to a historical offset and re-consume.

Of course, the resulting problem of repeated consumption must be solved by the consumer itself.
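A minimal sketch of the offset rollback from the consumer side, using the new Java consumer in 0.9.0 (broker address, topic name, and offset are illustrative): seek() rewinds a manually assigned partition to a historical offset so its messages can be re-consumed.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        TopicPartition tp = new TopicPartition("abc", 0);
        consumer.assign(Arrays.asList(tp)); // manual assignment, no group rebalancing
        consumer.seek(tp, 42L);             // roll back to a historical offset

        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> r : records)
            System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
        consumer.close();
    }
}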

Broker strategy

The order of messages

In some business scenarios, message order must not be disturbed: the sending order and the consumption order must be strictly consistent. In Kafka, however, the same topic is divided into multiple partitions, and these partitions are independent of each other.

The reason a topic is divided into multiple partitions is to improve concurrency: multiple partitions are produced/consumed in parallel, but then there is no way to guarantee global message order.

One solution is to use only one partition per topic, but this obviously limits flexibility.

Another way is to give all related messages the same key; messages with the same key fall into the same partition, so their order is preserved. A sketch follows.
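A minimal sketch (reusing the producer from the earlier sketches; "order-123" is an illustrative key): all three sends carry the same key, so the default partitioner hashes them to the same partition of topic abc and their relative order is preserved.

producer.send(new ProducerRecord<>("abc", "order-123", "created"));
producer.send(new ProducerRecord<>("abc", "order-123", "paid"));
producer.send(new ProducerRecord<>("abc", "order-123", "shipped"));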

The message flushing mechanism

As we all know, the operating system itself has a page cache.

Even if we use unbuffered I/O, a message does not land on disk immediately; it first sits in the operating system's page cache, and the operating system controls when the page cache contents are written back to disk. At the application layer, the function that forces a write-back is fsync.

We can call fsync for every single message to force it to disk, but this hurts performance and increases disk I/O. Alternatively, we can let the operating system decide when to flush.
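On the broker side, Kafka exposes this trade-off through two flush settings (shown here in the same style as the producer parameters above; by default neither is set, so the operating system decides when to write back):

log.flush.interval.messages // fsync after this many messages accumulate; 1 means fsync for every message (safest, slowest)
log.flush.interval.ms // fsync at most after this many milliseconds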

No duplication, no loss: Exactly Once

A perfect message queue should be "neither duplicated nor lost", which contains 4 semantics:
- messages are not stored repeatedly;
- messages are not consumed repeatedly;
- messages are not lost in storage;
- messages are not lost in consumption.

Let's talk about the first one, repeated storage: the sender sends a message, and the server's response times out. Was the message stored successfully, or not?
To solve this problem, the sender needs to attach a primary key to each message, and the server needs to record all received messages in order to deduplicate. Obviously, the cost of achieving this is very high.

Repeated consumption: as mentioned above, avoiding this requires consumer-side confirms. But that, in turn, introduces other problems. For example, what if the consumer crashes after finishing consumption but while sending the confirm? What should be done with a message that has been sent but has no confirmed status?

Loss in storage: this has been solved by the ACK strategy discussed above.

Loss in consumption: as with loss in storage, a confirm is needed.

To sum up: it is very difficult to truly achieve "no duplication, no loss", that is, exactly once. It requires the coordination and cooperation of the broker, the producer, the consumer, and the business itself.

What Kafka guarantees is that messages are not lost, that is, at least once. As for the problem of repeated consumption, the business needs to handle it itself, for example by adding a deduplication table (a sketch follows).
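A minimal sketch of such business-side deduplication (not from the source; a HashSet stands in for a durable deduplication table, and messageId is a key the producer is assumed to embed in every message):

import java.util.HashSet;
import java.util.Set;

public class DedupHandler {
    // Stand-in for a persistent deduplication table (e.g. a DB table with a unique key on messageId).
    private final Set<String> dedupTable = new HashSet<>();

    public void handle(String messageId, String payload) {
        if (!dedupTable.add(messageId))
            return; // already consumed once: skip the duplicate delivery
        // ... actual business processing runs at most once per messageId
    }
}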
