Kafka replica mechanism and the ISR

Topic, partition, and replica are the three building blocks of Kafka's topic layer. Each topic has at least one partition, and each partition is replicated. Kafka defines two types of replicas: the leader replica and the follower replica. With N replicas of a partition, exactly one is the leader and the other N-1 are followers.

You may have memorized all of this by rote, which probably leaves you with a lot of questions:

  • Why does Kafka have a replica mechanism?
  • Why are there two roles, leader replica and follower replica?
  • What is the relationship between the leader replica and the follower replicas?
  • What happens to the follower replicas when the leader replica goes down?

Today we will turn these from abstract concepts into something you can play with.

Benefits of replicas

Let's set Kafka aside for a moment and ask a broader question: what are the benefits of replicas in a distributed system in general?

The first answer, without hesitation, is high availability. This is easy to understand: it is like backing up important files to a USB flash drive so that you have two copies. If your computer fails or is compromised, you still have the data.

Distributed systems do the same. By providing data redundancy, the system can continue to operate even if some components of the system fail, increasing overall availability and data durability.

Is that the only benefit? Obviously not. You may think of MySQL read replicas. They help MySQL absorb read load: all writes go to the primary, and reads are distributed across the replicas, which can greatly improve read throughput.

That's right: in a distributed system, replication can also provide scalability. Adding machines that hold replicas increases the throughput of read operations.

Unfortunately, Kafka does not get this benefit, because Kafka's follower replicas do not serve client requests.

We will come back to why Kafka is so stingy that it holds the data but does not let anyone read it.

For now, just remember that Kafka's replica mechanism has only one benefit: it ensures high availability through data redundancy.

What is a replica

As you already know, the data of a partition is kept in multiple replicas. So what exactly is a replica? There is nothing mysterious about it: a replica is a commit log to which messages can only be appended. All replicas of the same partition store the same sequence of messages, and they are placed on different brokers, so the data stays available even when some brokers go down.
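To make the idea of an append-only commit log concrete, here is a minimal Python sketch. The CommitLog class and the broker names are made up for illustration and are not Kafka's storage code. Both logs are filled directly here only to show the redundancy; in Kafka the follower pulls its data from the leader.

```python
class CommitLog:
    """Append-only log: messages are added at the end and never modified."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Return a copy of all messages from the given offset onward."""
        return list(self._messages[offset:])

    @property
    def end_offset(self):
        """Offset that the next appended message will receive."""
        return len(self._messages)


# Two replicas of the same partition on different (hypothetical) brokers.
replicas = {"broker1": CommitLog(), "broker2": CommitLog()}
for log in replicas.values():
    for msg in ["m0", "m1", "m2"]:
        log.append(msg)

# Simulate losing broker1: broker2 still holds the same message sequence.
del replicas["broker1"]
surviving = replicas["broker2"].read(0)
print(surviving)
```

There is deliberately no update or delete method: offsets only grow, which is what lets every replica store the same sequence in the same order.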

As shown in the figure, TopicA has three partitions: part0, part1, and part2. part0 has two replicas, a leader replica and a follower replica, located on broker1 and broker2 respectively.

[Figure: TopicA's partitions part0, part1, and part2, with part0's leader and follower replicas on broker1 and broker2]

Replica roles

Why does Kafka define two types of replicas, the leader replica and the follower replica? And why is there exactly one leader and N-1 followers?

To answer this question, start from the requirement. A partition can have multiple replicas, and their contents must be consistent. How do we ensure that all replicas hold the same data?

First of all, we need a reference point: whose data is authoritative? Otherwise every replica could claim that its own version is the right one, and there would be no way to settle the dispute. That is why Kafka has the leader role.

So we take the leader's data as the standard. This is the leader-based replication mechanism: the leader is responsible for interacting with producers, and the followers pull data from the leader, as shown below:

[Figure: the producer writes to the leader replica, and the follower replicas pull data from the leader]

The key points to take away from this figure are:

1. Replicas fall into two categories: leader replica and follower replica. When a partition is created, one replica is elected as the leader replica, and the remaining replicas automatically become follower replicas.

2. In Kafka, follower replicas do not serve clients. No follower replica responds to read or write requests from consumers or producers. All requests must be sent to the broker that hosts the leader replica, and that broker handles them. A follower's only task is to asynchronously pull messages from the leader replica and write them to its own commit log, thereby staying in sync with the leader. (This is the default behavior. Since Kafka 2.4, consumers can optionally be configured to fetch from a follower, but this article describes the default.)

3. When the leader replica goes down, Kafka detects the failure promptly through the watch mechanism provided by ZooKeeper and starts a new round of leader election, choosing one of the follower replicas as the new leader. When the old leader restarts, it can rejoin the cluster only as a follower replica.
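The pull model in point 2 can be sketched in a few lines of Python. This is a toy model, not Kafka's fetch protocol: the Replica class, the fetch_from method, and the max_messages batch size are all hypothetical names chosen for illustration.

```python
class Replica:
    def __init__(self, broker):
        self.broker = broker
        self.log = []  # append-only list of messages

    def fetch_from(self, leader, max_messages=2):
        """Pull up to max_messages that this replica does not have yet.

        Returns the number of messages copied in this round.
        """
        start = len(self.log)  # next offset this replica needs
        batch = leader.log[start:start + max_messages]
        self.log.extend(batch)
        return len(batch)


leader = Replica("broker1")
follower = Replica("broker2")

# Producers write to the leader only.
for i in range(5):
    leader.log.append(f"msg-{i}")

# The follower pulls asynchronously, so after one fetch it is still behind.
follower.fetch_from(leader)
lag_after_one_fetch = len(leader.log) - len(follower.log)

# Repeated fetches eventually bring it level with the leader.
while follower.fetch_from(leader):
    pass
caught_up = follower.log == leader.log

print(lag_after_one_fetch, caught_up)
```

The follower always asks for messages starting at its own log end, so the leader needs no per-follower push logic, and a follower that falls behind simply needs more fetch rounds.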

Pay special attention to the second point: follower replicas do not serve clients. Remember that when we discussed the benefits of replication, we said Kafka does not get horizontal scaling of reads? This is the reason. In short, Kafka's follower replicas provide nothing beyond high availability. From a client's point of view, they do no work.

That being the case, why is Kafka designed this way? In fact, this design has two advantages.

1. It makes read-your-writes easy to implement.

Read-your-writes, as the name implies, means that once you have successfully written a message to Kafka with the producer API, you can immediately read that message back with the consumer API.

For example, after you post on Weibo, you want to see your post immediately. This is a typical read-your-writes scenario. If follower replicas were allowed to serve reads, then because replication is asynchronous, a follower might not yet have pulled the latest message from the leader, and the client would not see the message it had just written.

2. It makes monotonic reads easy to implement.

What is a monotonic read? For a given consumer, across multiple reads, a message never appears in one read and then disappears in a later one.

Suppose follower replicas were allowed to serve reads, and there are two followers, F1 and F2, both pulling from the leader asynchronously. If F1 has pulled the leader's latest message and F2 has not, a consumer that first reads from F1 and then from F2 may see the latest message in the first read and find it missing in the second. That violates monotonic read consistency. If all read requests are handled by the leader, Kafka achieves monotonic read consistency easily.
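The F1/F2 scenario can be reproduced with plain Python lists. This is only an illustration of the consistency problem, with made-up message names; it does not use any Kafka API.

```python
# Leader log, including the newest message "m2".
leader = ["m0", "m1", "m2"]

# F1 has already pulled "m2"; F2 has not caught up yet.
f1 = list(leader)
f2 = leader[:2]

# A consumer reads from F1 first, then from F2.
first_read = set(f1)
second_read = set(f2)

# Messages seen in the first read but missing from the second.
vanished = first_read - second_read
print(vanished)

# If both reads go to the leader, nothing can vanish between them.
vanished_via_leader = set(leader) - set(leader)
print(vanished_via_leader)
```

The same lists also show the read-your-writes problem: a producer that has just written "m2" and then reads from F2 would not see its own message.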

What happens if the leader goes down?

We have stressed repeatedly that only the leader replica serves clients. Both consumer reads and producer writes are handled by the leader replica.

So what happens if the leader goes down? The obvious answer: when the first in line is gone, the second in line takes over. But which follower should that be? When there are many follower replicas, how does Kafka choose?

The answer is not hard: it must be a high-quality backup.

But don't all replicas hold the same messages? How can one be higher quality than another?

All replicas being identical is only the ideal case. Because a follower pulls data from the leader asynchronously, it may not be fully caught up with the leader at any given moment.

By default (note: only by default), only a follower replica that is considered in sync with the leader can be elected leader.

There are many reasons why a replica can fall out of sync with the leader, such as:

  • Slow replica: the follower replica has been unable to keep up with the leader's write progress for a period of time. The most common cause is an I/O bottleneck on the follower, which makes it persist messages to its log more slowly than it fetches them from the leader;
  • Stuck replica: the follower replica has stopped fetching messages from the leader for a long time, for example because of a GC pause or because the replica has failed;
  • Bootstrapping replica: when the user increases the replication factor of a topic, the new follower replicas are not in sync until they have caught up with the leader's log.

If you read carefully, one phrase in the list above stands out:

"has been unable to keep up with the leader's write progress for a period of time"

It raises a question: how long is "a period of time"?

Just as a school might define an excellent grade as 80 points or above, deciding whether a follower replica is high quality requires a concrete standard.

This standard is the broker-side parameter replica.lag.time.max.ms, the maximum time a follower replica may lag behind the leader replica. The default discussed here is 10 seconds (Kafka 2.5 and later raised the default to 30 seconds). As long as a follower has lagged behind the leader for no more than this time, Kafka considers the follower in sync with the leader. Conversely, once the lag exceeds this time, the follower is considered out of sync.
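The rule boils down to a single comparison. The sketch below assumes we track, for each follower, the last time it was fully caught up with the leader; the function and variable names are hypothetical, and the 10-second value is the one used in this article.

```python
REPLICA_LAG_TIME_MAX_MS = 10_000  # value of replica.lag.time.max.ms used in the text


def is_in_sync(now_ms, last_caught_up_ms, max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """A follower is in sync if it was caught up within the last max_lag_ms."""
    return now_ms - last_caught_up_ms <= max_lag_ms


now = 20_000
print(is_in_sync(now, last_caught_up_ms=12_000))  # lag of 8 s
print(is_in_sync(now, last_caught_up_ms=9_000))   # lag of 11 s
```

Note that the criterion is time-based, not message-based: a follower that is many messages behind but catching up quickly can still count as in sync, while one that has been stalled for too long does not.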

This brings us to one of Kafka's best-known terms: the ISR.

In-sync Replicas (ISR)

The replicas in the ISR are the replicas that are in sync with the leader. Conversely, follower replicas that are not in the ISR are considered out of sync with the leader.

The first thing to be clear about is that the leader replica is always in the ISR. In other words, the ISR is not just a set of follower replicas; it always includes the leader. In some cases the ISR contains only the leader.

As for how Kafka decides whether a follower is in sync, that is the broker-side parameter replica.lag.time.max.ms described above.

If a follower keeps replicating more slowly than the leader receives writes, then once it has been behind for longer than replica.lag.time.max.ms it is considered out of sync with the leader and can no longer stay in the ISR. At that point Kafka automatically shrinks the ISR and removes the replica from it.

Note that if the replica later catches up with the leader, it is added back to the ISR. The ISR is therefore a dynamically adjusted set, not a static one.
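The shrinking and expanding of the ISR can be sketched as a function that recomputes the set from each follower's last caught-up time. This is a simplification with hypothetical names (compute_isr, replica ids 1 to 3): it applies the same time-based rule in both directions, which is enough to show the dynamics but is not how the broker implements it.

```python
def compute_isr(leader_id, last_caught_up_ms, now_ms, max_lag_ms=10_000):
    """Return the ISR as a set of replica ids.

    last_caught_up_ms maps each follower id to the last time (in ms)
    it was fully caught up with the leader.
    """
    isr = {leader_id}  # the leader is always a member of the ISR
    for replica_id, caught_up_ms in last_caught_up_ms.items():
        if now_ms - caught_up_ms <= max_lag_ms:
            isr.add(replica_id)
    return isr


leader_id = 1

# t = 20 s: follower 2 is fresh, follower 3 has been behind for 15 s.
isr_t20 = compute_isr(leader_id, {2: 15_000, 3: 5_000}, now_ms=20_000)

# t = 30 s: follower 3 has caught up again, follower 2 has stalled.
isr_t30 = compute_isr(leader_id, {2: 15_000, 3: 29_000}, now_ms=30_000)

# If every follower lags, the ISR contains only the leader.
isr_alone = compute_isr(leader_id, {2: 0, 3: 0}, now_ms=30_000)

print(isr_t20, isr_t30, isr_alone)
```

Follower 3 is dropped at 20 seconds and is back at 30 seconds, which is the "kicked out, then added back" behavior described above.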

Unclean Leader Election

Let's go back to the discussion of what happens when the leader goes down. One point there deserves emphasis: by default, and only by default, a follower must be in sync with the leader to be eligible for election.

This is controlled by the broker parameter unclean.leader.election.enable. The default is false, meaning that only replicas in the ISR can be elected leader. If you set it to true, replicas outside the ISR can be elected as well.

Enabling unclean leader election may cause data loss, but it means a partition can always get a leader and keep serving clients, which improves availability. Conversely, disabling unclean leader election preserves data consistency and avoids message loss, at the expense of availability.
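The trade-off can be seen in a small election sketch. It is an illustrative model only: elect_leader and its parameters are hypothetical, and picking the first eligible replica in list order is an assumption made to keep the example deterministic.

```python
def elect_leader(replicas, isr, alive, unclean_election_enabled=False):
    """Pick a new leader id, or return None if no replica is eligible.

    replicas: ordered list of replica ids for the partition
    isr:      set of replica ids that were in sync when the leader failed
    alive:    set of replica ids whose brokers are currently up
    """
    # Clean election: only live replicas that are in the ISR qualify.
    for replica_id in replicas:
        if replica_id in alive and replica_id in isr:
            return replica_id
    # Unclean election: fall back to any live replica, accepting data loss.
    if unclean_election_enabled:
        for replica_id in replicas:
            if replica_id in alive:
                return replica_id
    return None


replicas = [1, 2, 3]

# Leader 1 fails while it was the only ISR member; 2 and 3 are out of sync.
clean = elect_leader(replicas, isr={1}, alive={2, 3})
unclean = elect_leader(replicas, isr={1}, alive={2, 3},
                       unclean_election_enabled=True)

# Leader 1 fails while follower 3 was still in the ISR.
normal = elect_leader(replicas, isr={1, 3}, alive={2, 3})

print(clean, unclean, normal)
```

With the flag off, the first case leaves the partition without a leader until an in-sync replica returns. With the flag on, replica 2 takes over immediately, but any messages it had not yet pulled are lost.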

If you have heard of the CAP theorem, you know that a distributed system can generally satisfy only two of consistency, availability, and partition tolerance at the same time. On this question, Kafka lets you choose between C and A.

You can decide whether to enable unclean leader election based on your business scenario. However, I strongly recommend leaving it disabled. Availability can be improved in other ways, and sacrificing data consistency for it is rarely worth it.

Summary

1. Kafka's replicas have only one benefit: ensuring high availability.

2. A replica is a commit log to which messages can only be appended. Kafka replicas are divided into leader replicas and follower replicas. Each partition has exactly one leader and N-1 followers.

3. Only the leader replica serves clients. The only thing a follower replica does is asynchronously pull messages from the leader replica.

4. The ISR is the set consisting of the leader replica and the follower replicas that are in sync with it. Whether a follower is in sync is determined by the broker-side parameter replica.lag.time.max.ms.

5. Whether to allow unclean leader election is your choice. To favor availability, set unclean.leader.election.enable to true and allow out-of-sync followers to be elected. To favor consistency, set it to false.

Origin blog.csdn.net/Erica_1230/article/details/113808081