An In-Depth Look at Kafka Data Reliability [Part 3]


1. Multi-replica data synchronization strategy

To ensure that messages sent by the Producer reliably reach the specified Topic, each Partition of the Topic must send an ACK (acknowledgement) back to the Producer after it receives a message. If the Producer receives the ACK, it sends the next batch; otherwise it resends the data.

[Figure: Data reliability assurance]
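As a minimal client-side sketch of this send/ACK loop (the broker address, topic name, and retry count below are illustrative assumptions, not part of the original text), a Java producer can block on the returned future until the acknowledgement arrives and rely on retries to resend when it does not:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class AckAwareProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.RETRIES_CONFIG, 3); // resend automatically if no ACK arrives

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on get() waits for the partition's ACK before sending the next record.
            RecordMetadata meta = producer.send(
                    new ProducerRecord<>("demo-topic", "key", "value")).get();
            System.out.printf("acknowledged at partition %d, offset %d%n",
                    meta.partition(), meta.offset());
        }
    }
}
```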

1.1 Overview of multiple replicas

To improve message reliability, each Partition of a Kafka Topic has N replicas. Among these N replicas, one is the Leader and the rest are Followers. The Leader handles all read and write requests for the Partition, while the Followers synchronize data from the Leader. The figure below shows a Kafka cluster with four Brokers, where the Topic has three Partitions.

[Figure: Multi-replica layout]
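To make this concrete, here is a minimal sketch (topic name, partition count, and replication factor are illustrative assumptions) that creates a Topic whose Partitions each consist of one Leader and two Follower replicas:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3: one Leader plus two Followers per partition.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```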

1.2 The in-sync replica set (ISR)

When should a Partition send the ACK? The Leader should send the ACK only after the Followers have finished synchronizing with it; this guarantees that if the Leader goes down, a new Leader can be elected from the Followers. But if one Follower fails or falls behind, the Leader would have to wait for it to finish synchronizing before sending the ACK. How is this problem solved? This is where the ISR (in-sync replica set) comes in. The ISR is maintained by the Leader and is also called the in-sync replica queue: the set of Followers that stay synchronized with the Leader. Once the Followers in the ISR have completed data synchronization, the Leader sends the ACK to the Producer. If a Follower fails to synchronize within the configured time (replica.lag.time.max.ms), it is kicked out of the ISR. When the Leader goes down, a new Leader is elected from the ISR.
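The current ISR of each Partition can be inspected from a client; a minimal sketch using the AdminClient (the topic name and broker address are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("demo-topic"))
                    .all().get().get("demo-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // The ISR is the Leader plus every Follower currently keeping up with it.
                System.out.printf("partition %d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}
```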

1.3 How data is replicated between Leader and Follower

Before looking at how replication works, two concepts need to be introduced: HW (High Watermark) and LEO (Log End Offset):

  • LEO (Log End Offset): the offset of the last message in each replica's log.
  • HW (High Watermark): the smallest LEO among all the replicas; it marks the last position of the Partition that Consumers can see, i.e., only data before the HW is visible to the Consumer.

Figure:

[Figure: HW and LEO]
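As a toy illustration of these two concepts (this is not Kafka's internal code; the replica names and offsets below are made up), the HW can be computed as the minimum LEO across the replicas, and only offsets before the HW are visible to the Consumer:

```java
import java.util.Map;

public class HighWatermarkSketch {
    public static void main(String[] args) {
        // LEO of each replica: the Leader has 8 messages, the Followers still lag behind.
        Map<String, Long> leoByReplica = Map.of("leader", 8L, "follower-1", 6L, "follower-2", 7L);

        // HW = the smallest LEO among the replicas.
        long hw = leoByReplica.values().stream().mapToLong(Long::longValue).min().orElse(0L);

        System.out.println("HW = " + hw);                               // 6
        System.out.println("Consumer-visible offsets: 0.." + (hw - 1)); // only data before the HW
    }
}
```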

Both the Leader and the Followers maintain their own HW. A newly written message cannot be consumed immediately; the Consumer has to wait until the Followers in the ISR have finished copying it from the Leader. The figure below shows the replication flow after a new message is written to a Partition:

[Figure: How ISR, HW and LEO change during replication]

As the figure shows, Kafka's data replication is neither purely asynchronous nor purely synchronous; it strikes a good balance between reliability and throughput.

3. The ACK response mechanism and Exactly Once semantics

When the Producer sends data to the Leader, Kafka provides three ACK response levels, which let you trade off data reliability against latency. The level is configured through request.required.acks (called acks in the newer Java producer).

3.1 acks = 0

The Producer does not wait for the Broker's ACK. This setting gives the lowest latency, but data may be lost if the Broker fails, so it is the least reliable option. It corresponds to At Most Once semantics: each message is sent at most once, so it is never duplicated but is not guaranteed not to be lost.

3.2 acks = 1

The Producer waits for the Broker's ACK; the Partition Leader returns the ACK after it has successfully written the message to disk. If the Leader fails before the Followers finish synchronizing, that data is lost.

3.3 acks = -1

The Producer waits until the Partition's Leader and all of its Followers (in the ISR) have successfully written the message before the ACK is returned. If the Leader fails after data synchronization completes but before it sends the ACK, the Producer resends the message, which produces duplicate data. This corresponds to At Least Once semantics: each message is delivered at least once, so data is never lost, but it may be duplicated.
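A minimal producer-side sketch of the three levels (the broker address and topic name are assumptions; in practice you would pick exactly one acks value):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=0  : do not wait for any ACK           -> lowest latency, At Most Once
        // acks=1  : Leader's ACK only                 -> data lost if the Leader fails first
        // acks=-1 : Leader plus the ISR Followers     -> no loss, possible duplicates, At Least Once
        props.put(ProducerConfig.ACKS_CONFIG, "-1"); // "-1" and "all" are equivalent

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}
```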

3.4 Exactly Once

At Least Once + idempotence = Exactly Once. Kafka's idempotence is guaranteed by a PID (producer ID) that the Broker assigns when the Producer is initialized. Every message sent to a given Partition carries a Sequence Number (SN), and the Broker caches the triple (PID, Partition, SN); when a message with the same triple is submitted again, the Broker persists it only once. However, the PID changes when the Producer restarts, and different Partitions have different triples, so idempotence alone cannot guarantee Exactly Once across partitions or across sessions.
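A minimal sketch of enabling idempotence on the producer (the broker address and topic name are assumptions); together with acks=all this gives Exactly Once within a single partition and producer session, as described above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Idempotence requires acks=all and deduplicates resends via (PID, Partition, Sequence Number).
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Even if this send is retried internally, the Broker persists the record only once.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        }
    }
}
```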

4. Replica failure handling

4.1 Follower failure

When a Follower fails, it is temporarily kicked out of the ISR. After the Follower recovers, it reads the last HW recorded on its local disk, truncates the part of its log file that is above the HW, and starts synchronizing from the HW toward the Leader. Once the Follower's LEO is greater than or equal to the Partition's HW, i.e., the Follower has caught up with the Leader, it is re-added to the ISR.

4.2 Leader failure

When the Leader fails, a new Leader is elected from the ISR. To stay consistent with the new Leader, each of the remaining Followers first truncates the part of its log file above its own HW and then synchronizes data from the new Leader.
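Both recovery paths boil down to the same step: truncate the log above the HW, then copy from the (new) Leader. A toy sketch of that step (this is not Kafka's internal code; the log is modeled as an in-memory list of strings and the offsets are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class TruncateToHwSketch {
    // Drop every entry at offset >= hw, then copy the Leader's log from hw onward.
    static List<String> recover(List<String> localLog, long hw, List<String> leaderLog) {
        List<String> recovered = new ArrayList<>(localLog.subList(0, (int) hw));
        recovered.addAll(leaderLog.subList((int) hw, leaderLog.size()));
        return recovered;
    }

    public static void main(String[] args) {
        List<String> follower = List.of("m0", "m1", "m2", "m3");       // LEO = 4, but recorded HW = 2
        List<String> leader   = List.of("m0", "m1", "x2", "x3", "x4"); // the (new) Leader's log

        // After recovery the replica matches the Leader and can rejoin the ISR
        // once its LEO is >= the partition HW.
        System.out.println(recover(follower, 2, leader)); // [m0, m1, x2, x3, x4]
    }
}
```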

5. Leader election

Kafka dynamically maintains the ISR of each Partition in ZooKeeper. When the Leader goes down, Followers from the ISR are considered in turn as the new Leader. If it happens that all the Followers in the ISR are down as well, there are two options:

  • Wait for some Follower in the ISR to recover and choose it as the new Leader.
  • Choose the first Follower that recovers as the new Leader, even though it may not be in the ISR.

Which option to choose is a trade-off between availability and consistency.
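Kafka exposes this trade-off as the unclean.leader.election.enable setting: false means only ISR members may become Leader (consistency first), true allows any recovered replica to become Leader (availability first). A minimal sketch that sets it on a topic through the AdminClient (the topic name and broker address are assumptions):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class UncleanElectionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // false = wait for an ISR replica (consistency); true = allow any replica (availability).
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> configs =
                    Map.of(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(configs).all().get();
        }
    }
}
```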



Source: juejin.im/post/5ded009d6fb9a016323d717d