04-Detailed explanation of Kafka design principle

Kafka core controller (Controller)

A Kafka cluster contains one or more brokers, and one of them is elected as the controller (Kafka Controller), which is responsible for managing the state of all partitions and replicas in the entire cluster.

  • When the leader replica of a partition fails, the controller is responsible for electing a new leader replica for that partition.
  • When a change in a partition's ISR set is detected, the controller is responsible for notifying all brokers to update their metadata.
  • When the kafka-topics.sh script is used to increase the number of partitions of a topic, the controller is also responsible for making the other nodes aware of the new partitions.

Controller election mechanism

When the Kafka cluster starts, one broker is automatically elected as the controller to manage the entire cluster. During the election, every broker in the cluster tries to create an ephemeral /controller node on ZooKeeper, and ZooKeeper guarantees that only one broker can succeed. The broker that creates the node becomes the controller of the cluster.

When the broker acting as controller goes down, its ephemeral ZooKeeper node disappears. The other brokers in the cluster keep watching this node; when they detect that it has disappeared, they compete to create it again, following the same election mechanism described above, and ZooKeeper guarantees that exactly one broker becomes the new controller.
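As a quick way to observe this, here is a minimal sketch that uses the ZooKeeper Java client to read the ephemeral /controller node, assuming ZooKeeper is reachable at localhost:2181; the node's data records the broker id of the current controller:

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ControllerNodeCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble used by the Kafka cluster (address is an assumption).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        Stat stat = new Stat();
        // /controller is the ephemeral node created by the broker that won the election.
        byte[] data = zk.getData("/controller", false, stat);
        System.out.println("/controller data: " + new String(data));
        // A non-zero ephemeralOwner confirms the node is ephemeral: it disappears when the
        // controller broker's ZooKeeper session ends, which is what triggers re-election.
        System.out.println("ephemeralOwner: " + stat.getEphemeralOwner());
        zk.close();
    }
}
```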

A broker that holds the controller role carries more responsibilities than an ordinary broker, as follows:

  1. Monitor broker-related changes. Add a BrokerChangeListener to the /brokers/ids/ node in ZooKeeper to handle broker changes.
  2. Monitor topic-related changes. Add a TopicChangeListener to the /brokers/topics node in ZooKeeper to handle topics being added and removed; add a TopicDeletionListener to the /admin/delete_topics node in ZooKeeper to handle topic deletion actions.
  3. Read all current information about topics, partitions, and brokers from ZooKeeper and manage it accordingly. Add a PartitionModificationsListener to the /brokers/topics/[topic] node of every topic to monitor changes in the topic's partition allocation.
  4. Update the cluster's metadata and synchronize it to the other ordinary broker nodes.

Partition replica leader election mechanism

The controller senses that the broker hosting a partition's leader has gone down (the controller watches many ZooKeeper nodes and can therefore tell whether a broker is alive). With the parameter unclean.leader.election.enable=false, the controller picks the first broker in the ISR list as the new leader (the order of the ISR reflects how early a replica entered it, so the first one is likely the replica with the most synchronized data). If unclean.leader.election.enable=true, then when every replica in the ISR is down, the leader may be elected from replicas outside the ISR list. This setting improves availability, but the newly elected leader may have far less data.
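A minimal sketch of setting this switch at the topic level with the Java AdminClient, assuming a broker at localhost:9092 and a hypothetical topic my-topic:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class UncleanElectionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            // Topic-level override: prefer consistency (no unclean election) over availability.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // hypothetical topic
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, Collections.singletonList(op)))
                 .all().get();
        }
    }
}
```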

A replica must meet two conditions to enter the ISR list:

  1. The replica node must not be network-partitioned: it must keep its session with ZooKeeper alive and stay connected to the leader replica over the network.
  2. The replica must replicate all writes on the leader and must not fall too far behind. (How far a replica may lag behind the leader is determined by the replica.lag.time.max.ms configuration; a replica that has not caught up with the leader within this time is removed from the ISR list.)

Offset recording mechanism for consumed messages

Each consumer periodically commits the offsets of the partitions it consumes to Kafka's internal topic __consumer_offsets. In each commit, the key is consumerGroupId+topic+partition number and the value is the current offset. Kafka periodically compacts the messages in this topic, keeping only the latest record for each key.

 Because __consumer_offsets may receive highly concurrent requests, Kafka gives it 50 partitions by default (configurable via offsets.topic.num.partitions), so the load can be spread across more machines.

The following formula determines which partition of __consumer_offsets a consumer's offsets are committed to:

Formula: hash(consumerGroupId) % (number of partitions of the __consumer_offsets topic)
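A minimal sketch of this formula, assuming a hypothetical group id and the default of 50 partitions (Kafka's own implementation uses its internal non-negative hash of the group id):

```java
public class OffsetsPartitionSketch {
    public static void main(String[] args) {
        String consumerGroupId = "my-consumer-group"; // hypothetical group id
        int offsetsTopicPartitions = 50;              // default offsets.topic.num.partitions
        // Non-negative hash of the group id, modulo the partition count of __consumer_offsets.
        int partition = (consumerGroupId.hashCode() & 0x7fffffff) % offsetsTopicPartitions;
        System.out.println("Offsets for this group go to __consumer_offsets-" + partition);
    }
}
```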

Consumer Rebalance mechanism

Rebalance means that when the number of consumers in a consumer group changes, or the number of partitions being consumed changes, Kafka redistributes the mapping between consumers and partitions. For example, if a consumer in the group dies, the partitions assigned to it are automatically handed over to other consumers; if it restarts, some partitions are returned to it.

Note: rebalance only applies to subscribe, which does not specify the partitions to consume. If partitions are assigned explicitly through assign, Kafka does not perform a rebalance.
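A minimal sketch of the two modes, assuming a broker at localhost:9092, a hypothetical group my-consumer-group, and a hypothetical topic my-topic:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SubscribeVsAssign {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // subscribe(): the group coordinator assigns partitions, so this consumer takes part in rebalance.
        try (KafkaConsumer<String, String> groupManaged = new KafkaConsumer<>(props)) {
            groupManaged.subscribe(Arrays.asList("my-topic"));
        }

        // assign(): the application fixes the partitions itself, so Kafka never rebalances this consumer.
        try (KafkaConsumer<String, String> selfManaged = new KafkaConsumer<>(props)) {
            selfManaged.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
        }
    }
}
```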

The following situations may trigger a consumer rebalance:

  1. The number of consumers in the consumer group increases or decreases
  2. Partitions are dynamically added to a topic
  3. The consumer group subscribes to more topics

During a rebalance, consumers cannot consume messages from Kafka, which hurts Kafka's TPS. If the Kafka cluster has many nodes, say hundreds, a rebalance can take a long time, so try to avoid triggering rebalances during the system's peak hours.

The Rebalance process is as follows

When a consumer joins a consumer group, the consumer, consumer group, and group coordinator will go through the following stages.

The first stage: select the group coordinator

GroupCoordinator: each consumer group selects a broker as its group coordinator, which is responsible for monitoring the heartbeats of all consumers in the group, determining whether any consumer has gone down, and then starting a consumer rebalance.

When each consumer in the consumer group starts, it sends a FindCoordinatorRequest to a node in the Kafka cluster to locate its corresponding GroupCoordinator and establishes a network connection with it.

Group coordinator selection method:

First compute, with the formula above, which partition of __consumer_offsets the consumer group commits its offsets to; the broker hosting the leader replica of that partition is the group coordinator of this consumer group.

The second stage: join the consumer group (JOIN GROUP)

After successfully finding the GroupCoordinator of the consumer group, the consumer enters the join stage. At this stage, the consumer sends a JoinGroupRequest to the GroupCoordinator and processes the response. The GroupCoordinator then selects the first consumer that joined the group as the group leader, sends the consumer group information to this leader, and the leader is responsible for drawing up the partition assignment plan.

The third stage (SYNC GROUP)

The consumer leader sends a SyncGroupRequest to the GroupCoordinator, and the GroupCoordinator then distributes the partition assignment plan to each consumer; each consumer connects to the broker hosting the leader of its assigned partitions and starts consuming messages.

Consumer Rebalance partition allocation strategy:

There are three main rebalance strategies: range, round-robin, and sticky.

Kafka provides the consumer client parameter partition.assignment.strategy to set the partition allocation strategy between consumers and the topics they subscribe to. The default is the range strategy.
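A minimal sketch of switching away from the default range assignor, assuming a broker at localhost:9092 and a hypothetical group id:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignmentStrategyConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Default is RangeAssignor; StickyAssignor is the third built-in option.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // consumer.subscribe(...) would follow here.
        }
    }
}
```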

Assume a topic has 10 partitions (0-9) and three consumers are consuming it:

The range strategy sorts by partition number. Let n = number of partitions / number of consumers = 3 and m = number of partitions % number of consumers = 1; then each of the first m consumers gets n+1 partitions, and the remaining (number of consumers - m) consumers get n partitions each.

For example, partitions 0~3 go to one consumer, partitions 4~6 go to another, and partitions 7~9 go to the third.
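A tiny sketch of the range arithmetic above, reproducing the 4/3/3 split for 10 partitions and 3 consumers:

```java
public class RangeAssignmentSketch {
    public static void main(String[] args) {
        int partitions = 10; // partitions 0..9
        int consumers = 3;
        int n = partitions / consumers; // 3
        int m = partitions % consumers; // 1
        int start = 0;
        for (int c = 1; c <= consumers; c++) {
            int count = (c <= m) ? n + 1 : n; // the first m consumers each get one extra partition
            System.out.printf("consumer%d -> partitions %d..%d%n", c, start, start + count - 1);
            start += count;
        }
    }
}
```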

The round-robin strategy allocates partitions in turn. For example, partitions 0, 3, 6, and 9 go to one consumer, partitions 1, 4, and 7 to another, and partitions 2, 5, and 8 to the third.

The sticky strategy's initial allocation is similar to round-robin, but during a rebalance it needs to guarantee the following two principles:

1) The partition distribution should be as even as possible.

2) As far as possible, the partition allocation remains the same as the last allocation.

When the two conflict, the first goal takes precedence over the second. In this way, the original partition assignment is preserved to the greatest extent possible.

For example, for the range allocation above, if the third consumer fails, re-allocating with the sticky strategy gives the following result:

In addition to its original partitions 0~3, consumer1 is assigned partition 7

In addition to its original partitions 4~6, consumer2 is assigned partitions 8 and 9

Analysis of the producer's publishing message mechanism

1. Writing method

The producer publishes messages to the broker in push mode, and each message is appended to the partition as a sequential disk write (sequential disk writes are more efficient than random memory writes, which is part of what guarantees Kafka's throughput).

2. Message routing

When the producer sends a message to the broker, it chooses which partition to store it in according to the partitioning algorithm. The routing mechanism is:

1. If the partition is specified, use it directly; 2. If no partition is specified but a key is, a partition is chosen by hashing the key; 3. If neither a partition nor a key is specified, a partition is chosen by round-robin.
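A minimal producer sketch of the three routing cases, assuming a broker at localhost:9092 and a hypothetical topic my-topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MessageRoutingSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Partition given explicitly: the record always goes to partition 2.
            producer.send(new ProducerRecord<>("my-topic", 2, "order-1001", "value-a"));
            // 2. Key given, no partition: the partitioner hashes the key to pick the partition.
            producer.send(new ProducerRecord<>("my-topic", "order-1001", "value-b"));
            // 3. Neither given: records are spread across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("my-topic", "value-c"));
        }
    }
}
```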

3. Writing process


1. The producer first finds the leader of the partition from the "/brokers/.../state" node in ZooKeeper
2. The producer sends the message to the leader
3. The leader writes the message to its local log
4. The followers pull the message from the leader, write it to their local logs, and then send an ACK to the leader
5. After the leader receives ACKs from all replicas in the ISR, it advances the HW (high watermark, the offset of the last committed message) and sends an ACK to the producer

Detailed explanation of HW and LEO

HW, commonly known as the high watermark, is short for HighWatermark. It is the smallest LEO (log-end-offset) among the replicas in a partition's ISR, and consumers can consume at most up to the position of the HW. Each replica also has its own HW, and the leader and followers are each responsible for updating their own HW state. A message newly written to the leader cannot be consumed immediately: the leader waits until the message has been synchronized by all replicas in the ISR and the HW has been updated, and only then can consumers see it. This guarantees that if the broker hosting the leader fails, the message can still be obtained from the newly elected leader. Read requests from other brokers inside the cluster are not restricted by the HW.

The following figure illustrates in detail the flow of ISR, HW and LEO after the producer produces messages to the broker:

As you can see, Kafka's replication mechanism is neither fully synchronous replication nor purely asynchronous replication. Fully synchronous replication requires that all working followers have replicated a message before it is committed, which severely limits throughput. With asynchronous replication, followers replicate data from the leader asynchronously, and a message is considered committed as soon as the leader has written it to its log; in that case, if the leader suddenly goes down while the followers are still lagging behind, data is lost. Kafka's ISR approach strikes a good balance between not losing data and throughput. Recall the message sender's acks parameter, which controls how durably a sent message is persisted, and let's combine HW and LEO to look at the case of acks=1.

Combine HW and LEO to see the case of acks=1
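A minimal producer sketch of the acks settings discussed above, assuming a broker at localhost:9092 and a hypothetical topic my-topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=1: the leader acks once the message is in its own log (its LEO advances).
        // The HW only moves after the ISR followers fetch the message, so if the leader
        // crashes in that window the acked message can still be lost.
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        // acks=-1 (all) would make the leader wait for the whole ISR before acking,
        // i.e. the ack only comes back after the HW has advanced past the message.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```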


Log segmentation

The message data of a Kafka partition is stored in a folder named after the topic name plus the partition number. Within the partition, messages are stored in segments, and each segment's messages are stored in a separate log file, which makes it easy to delete old segment files quickly. Kafka stipulates that a segment's log file is at most 1GB, a restriction intended to make it easy to load the log file into memory for processing:

```
# Offset index file for a subset of the messages. Every time roughly 4KB (configurable) of messages
# are written to the partition, Kafka records the offset of the current message in this index file.
# To locate a message by offset, Kafka first does a quick lookup in this file, then reads the exact
# message from the log file.
00000000000000000000.index
# Message storage file, mainly storing offsets and message bodies.
00000000000000000000.log
# Timestamp index file. Every time roughly 4KB (configurable) of messages are written to the partition,
# Kafka records the send timestamp of the current message together with its offset in this timeindex file.
# To locate a message's offset by time, Kafka searches this file first.
00000000000000000000.timeindex

00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex

00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex
```

A number like 9936472 is the starting offset of the messages contained in that log segment file, which also means that nearly 10 million messages had already been written to this partition.

The Kafka broker has a parameter, log.segment.bytes, which limits the size of each log segment file; the maximum is 1GB.

When a log segment file is full, a new log segment file is automatically opened for writing to prevent a single file from being too large and affecting the read and write performance of the file. This process is called log rolling, and the log segment file being written is called active log segment.

Finally, a data diagram of the zookeeper node is attached:

https://note.youdao.com/yws/public/resource/d9fed88c81ff75e6c0e6364012d19fef/xmlnote/2F76FF53FBF643E785B18CD0F0C2D3D2/83219


Origin blog.csdn.net/nmjhehe/article/details/114500875