Kafka's high watermark backup mechanism, illustrated

High availability is an essential feature of a distributed system. Kafka achieves it through log-based leader-follower replication: each partition has multiple replicas, only one of which is the leader, and it alone serves produce and consume requests. The remaining follower replicas continuously send fetch requests to the leader to synchronize messages. If the leader never fails during the cluster's operation, the followers never play an active role. The problem is that no system can guarantee failure-free operation: when the broker hosting the leader replica crashes, one of the follower replicas becomes the partition's new leader. That raises a few questions: can messages be lost or diverge when a new leader is elected? How does Kafka keep messages consistent across a leader change? And how exactly is data synchronized between leader and followers? With these questions in mind, let's read on and unveil Kafka's high watermark backup mechanism together.

High watermark concepts

Before explaining high watermark backup, we must first clarify a few key terms and their meanings. The diagram below illustrates the offset information of a Kafka partition's replicas:

As shown above, the green portion represents messages that are fully replicated and visible to consumers, while the purple portion represents messages that are not yet fully replicated and therefore invisible to consumers.

LEO (log end offset): the log end offset, i.e., the offset of the last message written to the replica's underlying log file plus one. Whenever a message is written to a replica, its LEO is updated automatically.

HW (high watermark): as the name suggests, this is the high watermark value. The HW is never greater than the LEO; messages with offsets below the HW are considered "committed" (fully backed up) and are visible to consumers.

The leader keeps two kinds of LEO values: its own LEO and the remote LEO values. A remote LEO is the leader's record of a follower replica's LEO. In other words, each follower's LEO is stored in two places: one copy on the leader and one copy on the follower itself.

What is the remote LEO used for?

It is the key to determining the HW: whenever the leader wants to update its HW, it compares all the LEO values it knows (including its own) and takes the smallest one as the new HW.
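To make this concrete, here is a minimal Java sketch of the leader-side bookkeeping. It is an illustration only, not Kafka's actual implementation, and all names (LeaderState, remoteLeo, and so on) are hypothetical:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative leader-side replica state; all names are hypothetical.
    class LeaderState {
        long leaderLeo = 0L;                              // the leader's own LEO
        Map<Integer, Long> remoteLeo = new HashMap<>();   // followerId -> last reported LEO

        // HW = the smallest LEO among the leader and all tracked followers.
        long highWatermark() {
            long hw = leaderLeo;
            for (long leo : remoteLeo.values()) {
                hw = Math.min(hw, leo);
            }
            return hw;
        }
    }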

The following sections describe how the LEO and HW values are updated:

LEO update mechanism:

  1. The leader replica updates its own LEO when a producer sends a message: the leader's LEO becomes the offset of the message it just stored, plus one;
  2. A follower replica updates its own LEO when it fetches messages from the leader and writes them to its local log file: its LEO becomes the offset of the latest message synced from the leader, plus one (both append paths are sketched after this list);
  3. The leader updates its remote LEO for a follower when it processes that follower's fetch request: every fetch request carries the follower's current LEO, and the leader uses that value to update the corresponding remote LEO.
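Update points 1 and 2 are really the same operation seen from two sides: appending a record advances a replica's LEO to the offset of the last written message plus one. A minimal sketch under the same assumptions as above (hypothetical names, not Kafka's code):

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative replica log; the append path is the same for leader and follower.
    class ReplicaLog {
        private final List<byte[]> records = new ArrayList<>();
        private long logEndOffset = 0L;   // LEO: the offset the next record will receive

        long append(byte[] record) {
            records.add(record);
            long assigned = logEndOffset; // this record's offset
            logEndOffset += 1;            // LEO = last written offset + 1
            return assigned;
        }
    }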

leader HW update mechanism:

The leader's HW updates fall into two cases: updates during failure handling and updates during normal operation:

Failure-case updates:

  1. When a replica becomes the leader: when a follower replica is elected as the partition's new leader, Kafka attempts to update the HW;
  2. When a replica is kicked out of the ISR: if a replica fails to keep up with the leader, or its broker crashes, and it is consequently removed from the ISR, the leader also checks whether the HW needs updating; after all, the HW is computed only from the LEOs of replicas remaining in the ISR.

Normal-case updates:

  1. When a producer writes a message to the leader replica: writing a message updates the leader's LEO, so the leader needs to check whether the HW should be updated;
  2. When the leader processes a follower's FETCH request: the fetch request carries the follower's LEO, which the leader uses to update the corresponding remote LEO, and the leader then checks whether the HW needs updating (both paths are sketched after this list).
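Here is a sketch of those two normal-case paths in the same illustrative style (again, hypothetical names rather than Kafka's actual code):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative leader-side HW update paths; names are hypothetical.
    class Leader {
        long leo = 0L;                                     // the leader's own LEO
        long hw = 0L;                                      // high watermark
        final Map<Integer, Long> remoteLeo = new HashMap<>();

        // Path 1: a producer write advances the leader's LEO, then re-checks the HW.
        void onProduce() {
            leo += 1;                                      // the append itself is omitted
            maybeUpdateHw();
        }

        // Path 2: a follower fetch reports its LEO; update the remote LEO, re-check the HW.
        void onFetch(int followerId, long followerLeo) {
            remoteLeo.put(followerId, followerLeo);
            maybeUpdateHw();
            // ... then respond with the requested records plus the current hw
        }

        private void maybeUpdateHw() {
            long min = leo;
            for (long l : remoteLeo.values()) min = Math.min(min, l);
            if (min > hw) hw = min;                        // the HW only moves forward
        }
    }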

follower HW update mechanism:

  1. A follower updates its HW after updating its LEO: each fetch response from the leader includes the leader's current HW; the follower compares it with its own current LEO and takes the smaller of the two as its new HW (see the sketch below).
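The follower side of the rule, in the same illustrative style:

    // Illustrative follower-side HW update; not Kafka's actual code.
    class Follower {
        long leo = 0L;
        long hw = 0L;

        // Apply a fetch response: write the records, advance the LEO, then update the HW.
        void onFetchResponse(int recordCount, long leaderHw) {
            leo += recordCount;            // fetched records written to the local log
            hw = Math.min(leaderHw, leo);  // the follower's HW never exceeds the leader's
        }
    }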

The high watermark backup process, illustrated

Now that the concepts behind Kafka's high watermark backup mechanism are clear, let me use diagrams to help you understand the backup process. Assume the partition has two replicas and min.insync.replicas = 1:

Step 1: the leader and follower replicas initialize their values, and the follower sends a fetch request to the leader; since the leader has no data yet, no synchronization takes place;

Step 2: a producer sends message m1 to the partition's leader replica; once the message is written, the leader updates its LEO to 1;

Step 3: the follower sends a fetch request carrying its current offset = 0. While handling the request, the leader updates the remote LEO to 0, compares the LEO values and takes the minimum, 0, so leader HW = 0; it then responds to the follower with the message data and leader HW = 0. After writing the message, the follower updates its LEO to 1, compares the leader's HW with its own LEO, and takes the minimum as its new HW, so follower HW = 0. In other words, a follower's HW never exceeds the leader's HW.

Step 4: the follower sends a second fetch request carrying its new current offset = 1. While handling it, the leader updates the remote LEO to 1, compares the LEO values and takes the minimum, 1, so leader HW = 1. The leader has no new message data this time, so it directly returns leader HW = 1 to the follower. The follower compares the leader's HW with its own current LEO and takes the minimum as its new HW, so follower HW = 1.
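Putting the four steps together, here is a small runnable trace of the exchange under the same assumptions (two replicas, the same values as in the steps above; illustration only):

    // Runnable trace of the two fetch rounds described above.
    public class HwTrace {
        public static void main(String[] args) {
            long leaderLeo = 0, leaderHw = 0, remoteLeo = 0;
            long followerLeo = 0, followerHw = 0;

            leaderLeo = 1;                                 // step 2: producer writes m1

            remoteLeo = 0;                                 // step 3: fetch with offset = 0
            leaderHw = Math.min(leaderLeo, remoteLeo);     // leader HW = 0
            followerLeo = 1;                               // follower writes m1
            followerHw = Math.min(leaderHw, followerLeo);  // follower HW = 0

            remoteLeo = 1;                                 // step 4: fetch with offset = 1
            leaderHw = Math.min(leaderLeo, remoteLeo);     // leader HW = 1
            followerHw = Math.min(leaderHw, followerLeo);  // follower HW = 1

            System.out.printf("leader:   LEO=%d HW=%d%n", leaderLeo, leaderHw);
            System.out.printf("follower: LEO=%d HW=%d%n", followerLeo, followerHw);
        }
    }

Note how the leader's HW only reaches 1 on the second fetch round; this one-round lag is exactly the defect discussed next.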

Defects of the high-watermark-based backup mechanism

As the steps above show, updating the remote LEO stored on the leader always requires one extra round of fetch RPCs to complete. That lag means that during a leader switch there can be data loss and data inconsistency. I'll use the following diagrams to illustrate both problems:

  • Data loss

As mentioned earlier, the HW is only updated by the next round of fetch RPCs. As shown in the figure, there are two replicas, A and B, where B is the leader and A is the follower. A sends its second fetch request, and upon receiving it B updates its HW to 2; but A crashes before processing the response, i.e., without updating its own HW. When A restarts, it automatically adjusts its LEO down to its pre-crash HW value, performing a log truncation, and then sends a fetch request to B. Unfortunately, at that moment B also goes down, and Kafka elects A as the partition's new leader. When B restarts, it sends a fetch request to A, obtains A's HW from the fetch response, and updates its local HW to 1 (it was 2 before), so B performs a log truncation as well. As a result, the message at offset = 1 is permanently deleted.

You may ask: why does the follower replica need to truncate its log at all?

This is because the leader records a message first and followers only pull it in a subsequent sync, so in flight the leader's LEO is ahead of the followers' (the offsets differ during the process even though they eventually converge). Suppose a leader switch happens at such a moment: a follower with a smaller LEO may be elected as the new leader, and the new leader's log becomes the partition's standard, which means some followers' LEO values may exceed it. Therefore, before synchronizing, a follower must first obtain the LastOffset value from the leader (explained later); if its current LEO is greater than LastOffset, it must truncate its log before fetching data from the leader.

You may also ask: won't log truncation itself cause data loss?

As mentioned earlier, messages above the HW are neither "committed" nor "backed up", so they are invisible to consumers; those messages have not been acknowledged to users. In other words, truncating the log back to the HW does not lose any data that is committed from the user's perspective.

  • Data inconsistency / divergence

For the scenario below to occur, one of the following conditions must be met:

  1. Before the crash, B was already out of the ISR, and unclean.leader.election.enable = true, i.e., replicas outside the ISR are allowed to become the leader (see the config snippet after this list);
  2. B's message was written to the pagecache but had not yet been flushed to disk.
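Both configuration keys below are real Kafka broker settings (unclean.leader.election.enable from condition 1, and min.insync.replicas from the earlier walkthrough); a broker-side snippet permitting the scenario might look like this, with values for illustration only:

    # Allow a replica that is not in the ISR to be elected leader (risks data loss).
    unclean.leader.election.enable=true

    # The partition stays writable as long as a single in-sync replica remains.
    min.insync.replicas=1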

The partition has two replicas, where A is the leader and B is the follower. A has already written two messages and updated its HW to 2, while B has written only one message, with HW = 1. A and B go down at the same time; B restarts first and becomes the leader replica. A producer then sends a message, which is stored on B, since B is the partition's only live replica at this point; when B writes it, it updates its HW to 2. Now A restarts, finds that the leader's HW is 2, the same as its own HW, and therefore does not truncate its log. The result is that A's log and B's log hold different messages at offset = 1, i.e., the logs have diverged.

leader epoch

The HW is updated asynchronously and with a delay, yet it serves as the sole marker of whether a log entry has been successfully backed up, which leads to the data loss and data inconsistency problems above. To solve them, Kafka introduced the leader epoch mechanism: in each replica's log directory it creates a leader-epoch-checkpoint file to persist the leader epoch information, which looks like this:

Each entry takes the form (epoch, offset). The epoch is the leader's version number, a monotonically increasing integer that is incremented by 1 on every leader change; the offset is the offset of the first message written by that generation of leader. For example:

(0, 0)
(1, 300)

This means the second-generation leader (epoch 1) started writing messages from offset 300, and therefore the first-generation leader (epoch 0) wrote the messages at offsets 0 through 299.
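On disk, the leader-epoch-checkpoint file is plain text. Assuming the usual checkpoint layout (first line: format version, second line: entry count, then one "epoch startOffset" pair per line; stated here as an assumption, not verified against every Kafka version), the two entries above would be stored roughly as:

    0
    2
    0 0
    1 300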

The concrete working mechanism of the leader epoch is as follows:

1) When a replica becomes the leader:

At this point, if a producer sends new messages, the leader first appends the new leader epoch together with its current LEO to the leader-epoch-checkpoint file.

2) When a replica becomes a follower:

  1. The replica sends a LeaderEpochRequest to the leader; the request carries the follower's latest epoch version;
  2. The leader returns a LastOffset to the follower. If follower last epoch = leader last epoch, then LastOffset = leader LEO; otherwise LastOffset is the start offset of the smallest leader epoch greater than the follower's last epoch. For example: suppose follower last epoch = 1, and the leader holds (1, 20) (2, 80) (3, 120); then LastOffset = 80 (see the sketch after this list);
  3. After the follower obtains LastOffset, it compares its current LEO with LastOffset; if its LEO is greater than LastOffset, it truncates its log from LastOffset;
  4. The follower then starts sending fetch requests to the leader to keep the messages in sync.
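Here is an illustrative sketch of steps 2 and 3 under the assumptions above, using a sorted epoch cache; the class and method names are hypothetical, not Kafka's internal API:

    import java.util.TreeMap;

    // Illustrative leader-epoch lookup and follower truncation; not Kafka's code.
    class EpochCache {
        // epoch -> start offset of the first message written under that epoch
        private final TreeMap<Integer, Long> epochs = new TreeMap<>();
        private final long leaderLeo;

        EpochCache(long leaderLeo) { this.leaderLeo = leaderLeo; }

        void record(int epoch, long startOffset) { epochs.put(epoch, startOffset); }

        // Step 2: compute LastOffset for the follower's last epoch.
        long lastOffsetFor(int followerLastEpoch) {
            if (followerLastEpoch == epochs.lastKey()) {
                return leaderLeo;   // same last epoch: LastOffset = leader LEO
            }
            // Otherwise: the start offset of the smallest epoch greater than the follower's.
            return epochs.get(epochs.higherKey(followerLastEpoch));
        }
    }

    public class EpochDemo {
        // Step 3: the follower truncates only if its LEO is ahead of LastOffset.
        static long truncateTo(long followerLeo, long lastOffset) {
            return Math.min(followerLeo, lastOffset);
        }

        public static void main(String[] args) {
            EpochCache leader = new EpochCache(150);       // leader LEO = 150 (made-up value)
            leader.record(1, 20);
            leader.record(2, 80);
            leader.record(3, 120);
            System.out.println(leader.lastOffsetFor(1));   // 80, matching the example above
            System.out.println(truncateTo(100, 80));       // a follower with LEO 100 truncates to 80
        }
    }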

With the working mechanism of leader epoch in hand, let's see how it fixes the two defects of the high watermark backup:

(1) Solving data loss:

As shown above, when A restarts it sends a LeaderEpochRequest to B. Since B has appended no new messages, B's epoch equals the request epoch (= 0), so B returns LastOffset = leader LEO = 2 to A. A receives LastOffset, finds it equal to its current LEO, and therefore performs no log truncation. If B then goes down and A becomes the leader, when B comes back it repeats A's steps: it likewise needs no log truncation, and no data is lost.

(2) Solving data inconsistency / divergence

As shown above, A and B go down at the same time; B restarts first and becomes the partition's leader. A message then arrives from the producer, and the leader epoch is updated to 1. When A comes back up, it sends a LeaderEpochRequest (follower epoch = 0) to B. B sees that the follower's epoch is not equal to its own latest epoch, so it finds the smallest epoch greater than the follower's, epoch = 1, and returns LastOffset = that epoch's start offset = 1. A receives LastOffset, finds it smaller than its current LEO, truncates its log at position LastOffset, and then starts sending fetch requests to B to synchronize messages, thereby avoiding the data inconsistency / divergence problem.

For more articles, follow my WeChat public account "Backend Advanced", which focuses on backend technology.

Follow the account and reply "backend" to receive free backend-related e-books.

Feel free to share this article; please keep the source when reposting.


Origin: juejin.im/post/5e057670e51d45582b2a470b