In-depth understanding of Kafka series (5) - Reliable data transmission in Kafka

Series Article Directory

Kafka: The Definitive Guide series articles

Preface

This series contains my notes and thoughts from reading the book "Kafka: The Definitive Guide".

Main text

This article mainly explains reliable data transmission in Kafka.

Reliability assurance and replication

Databases have their own reliability mechanisms, such as MySQL's ACID transaction guarantees. Kafka provides corresponding guarantees as well, in the following respects:

  1. Kafka guarantees the ordering of messages within a partition. If message B is written after message A (by the same producer, to the same partition), Kafka ensures that B's offset is greater than A's, and consumers will read message A before message B.
  2. A message is considered "committed" only after it has been written to all in-sync replicas of the partition (not necessarily flushed to disk). The producer can choose among different acknowledgement modes via the acks setting.
  3. As long as at least one replica remains alive, committed messages will not be lost.
  4. Consumers can only read committed messages.
  5. All read and write requests go through the leader replica; follower replicas only serve as data backups.

These basic guarantees can be used to build a reliable system, but relying on them alone does not make the system fully reliable. Next, we will discuss Kafka's reliability in terms of its replication mechanism and the important configuration parameters.

Kafka's replication mechanism and its multi-replica partition architecture are the core of Kafka's reliability guarantee. Here is a brief review.

  1. A Kafka topic is divided into multiple partitions, and partitions are the basic data blocks.
  2. Kafka guarantees that the messages within a partition are ordered.
  3. Each partition can have multiple replicas, of which only one is the leader replica; the others are follower replicas.
  4. The leader replica handles all requests, while follower replicas only keep themselves in sync with the leader. When the leader replica becomes unavailable, one of the followers is elected to replace it.
  5. The partition leader is by definition an in-sync replica; a follower replica is considered in-sync only if it meets the following 3 conditions.

1. It has an active session with ZooKeeper.
2. It has fetched messages from the leader within the past 10 seconds.
3. The messages it fetched within the past 10 seconds are the most recent ones (it has almost no lag behind the leader).


Broker in a reliable system

Three important broker configuration parameters affect how reliably Kafka stores messages. Let's go through each of them.

Replication factor

Topic-level configuration parameter: replication.factor
Broker-level configuration parameter: default.replication.factor

Kafka's default replication factor is 3, which means that each partition is replicated on 3 different brokers. If the replication factor is N, data can still be read from and written to the topic even if N-1 brokers fail, so a higher replication factor means better availability and reliability. The downside is that a replication factor of N requires at least N brokers and N copies of the data, taking up N times the disk space; the larger N is, the larger the overhead.

If it is acceptable for a topic to be unavailable while a broker restarts, a replication factor of 1 is enough.
A replication factor of 2 means one broker failure can be tolerated, but with only 2 replicas the cluster may still become unavailable during restarts.
Therefore, a replication factor of 3 is generally sufficient.
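To make this concrete, here is a minimal sketch that creates a topic with a replication factor of 3 using the Kafka Java AdminClient. The broker address localhost:9092, the topic name "orders", and the partition count are assumptions for illustration only; the same names are reused in the later sketches.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReliableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: each partition is stored on 3 different brokers.
            NewTopic topic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```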

Leader election

The parameter unclean.leader.election.enable is configured at the broker (cluster) level; in the version of Kafka covered by the book it defaults to true. When a partition leader becomes unavailable, an in-sync replica is normally elected as the new leader. If no data is lost during the election, that is, all committed data also exists on the newly elected replica, the election is called complete (clean). If this parameter is set to true, an out-of-sync replica is allowed to become the leader, which is an incomplete (unclean) election.

If out-of-sync replicas are not allowed to become the new leader, the partition stays offline until the old leader comes back, so the cluster may be unavailable for a long time and availability suffers.
If an out-of-sync replica is allowed to become the new leader, all messages written to the old leader after that replica fell out of sync are lost, which leads to data inconsistency.

In short, if a system requires high availability and can accept a certain degree of data inconsistency, set this parameter to true. Systems that require strong data consistency, such as banks, set it to false.
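If you want to enforce complete elections for one particular topic, the same setting can also be overridden at the topic level. Below is a minimal sketch using the AdminClient's incrementalAlterConfigs call (available in newer Java clients); the topic name "orders" and the broker address are the same assumptions as before.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class DisableUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

        try (AdminClient admin = AdminClient.create(props)) {
            // Forbid out-of-sync replicas from becoming the leader for this topic.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> changes = new HashMap<>();
            changes.put(topic, Collections.singletonList(op));
            admin.incrementalAlterConfigs(changes).all().get();
        }
    }
}
```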

Least synchronized copy

Parameter (the name is the same at the topic and broker level): min.insync.replicas

First, recall Kafka's definition of its reliability guarantee: a message is considered committed only after it has been written to all in-sync replicas. This parameter specifies the minimum number of in-sync replicas required.

1. If the value is set to 1, a message is considered committed as long as there is a single in-sync replica, but if that only in-sync replica goes down, the partition stops working.
2. If the number of in-sync replicas falls below the configured value, a producer trying to send messages will receive a NotEnoughReplicasException.
3. Consumers can still read the existing data. When case 2 occurs, the unavailable replicas must be recovered, for example by restarting their broker, and you then wait for them to catch up and become in-sync again. (A configuration sketch follows this list.)
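The following sketch shows how min.insync.replicas interacts with a producer that uses acks=all: when too few replicas are in sync, the send fails. It assumes the topic "orders" was configured with replication.factor=3 and min.insync.replicas=2, and that a broker is reachable at localhost:9092.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class MinInsyncDemo {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", "created")).get();
            System.out.println("committed");
        } catch (ExecutionException e) {
            // With fewer in-sync replicas than min.insync.replicas the write is rejected;
            // depending on retry settings the cause surfaces as NotEnoughReplicasException
            // or, after the retries are exhausted, as a timeout.
            if (e.getCause() instanceof NotEnoughReplicasException) {
                System.err.println("Not enough in-sync replicas: " + e.getCause().getMessage());
            } else {
                System.err.println("Send failed: " + e.getCause());
            }
        }
    }
}
```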

Clearly, these three parameters have a significant impact on reliable data transmission in Kafka. However, broker-level configuration alone is not enough to guarantee reliability; the producer and the consumer must be configured reliably as well.


Producer in a reliable system

First, let's look at the producer's acknowledgement (acks) modes:

  1. acks=0

The producer assumes a message has been written to Kafka successfully as soon as it manages to send it over the network.
The problem is that the chance of losing messages in this mode is very high. Errors can still occur along the way: a Java object must be serialized before transmission, and serialization may fail or the network card may fail, in which case the message never reaches Kafka.
acks=0 is generally used for benchmarking, because this mode gives high throughput and high bandwidth utilization. In a word: it sacrifices accuracy for speed.

  2. acks=1

The leader returns an acknowledgement or an error response as soon as it receives the message and writes it to the partition's data file. This mode is more reliable than acks=0. If a clean leader election occurs (the original leader crashes and the remaining in-sync follower replicas elect a new leader), the producer will receive related exceptions during the election; if it handles these errors correctly, it retries the send and the message eventually reaches the new leader.
However, this mode can still lose data. For example, a message is written to the leader successfully, but the leader crashes before the message is replicated to the follower replicas.

  3. acks=all

The leader waits until all in-sync replicas have received the message before returning an acknowledgement or an error response. It is recommended to combine acks=all with the min.insync.replicas setting, which determines how many replicas must receive the message before the write is acknowledged.
Benefit: this is the safest mode.
Disadvantage: the throughput is the lowest.
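For reference, here is a minimal sketch of how the three modes are expressed in the Java producer configuration; the comments summarize the trade-offs described above.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class AcksModes {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "0"   : fire-and-forget; highest throughput, messages can be lost without any error.
        // "1"   : leader acknowledgement only; data is lost if the leader crashes before
        //         followers have replicated the message.
        // "all" : wait for all in-sync replicas; safest, lowest throughput. Combine with the
        //         topic/broker setting min.insync.replicas to control how many replicas must
        //         hold the message before it counts as committed.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        System.out.println(props);
    }
}
```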

Second, let's look at the producer from the perspective of its retry parameters:

Generally speaking, some errors returned by the broker can be resolved simply by retrying. The errors a producer needs to deal with fall into two categories:

  1. Errors the producer can handle automatically by retrying, such as LEADER_NOT_AVAILABLE (returned by the broker).
  2. Errors the developer must handle manually, because the producer cannot resolve them by retrying, such as INVALID_CONFIG (returned by the broker).

Suggestions here:

1. If you want to catch the exception and retry a few more times: set the number of retries a bit higher.
2. If you want to discard the message directly: there is little point in retrying, so keep the number of retries low.
3. If you want to save the failed message somewhere and come back to process it later: you can stop retrying and persist the message instead.
(A retry configuration sketch follows this list.)
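As an illustration, here is a minimal sketch of the retry-related producer settings. The delivery.timeout.ms parameter exists only in newer Java clients, so its availability is an assumption about your client version.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class RetryConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Retriable errors such as LEADER_NOT_AVAILABLE are handled by retrying automatically.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // "a bit higher": keep retrying...
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);        // ...waiting 100 ms between attempts...
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // ...until 2 minutes have passed in total.
        // Setting retries to 0 instead means a failed send is simply dropped.
        System.out.println(props);
    }
}
```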

Kafka's cross-data center replication tool MirrorMaker will perform unlimited retries by default.
Of course, retrying a failed send also carries a risk. For example, the producer may not receive the broker's acknowledgement because of a network problem even though the message was actually written successfully; when the producer retries and the send succeeds again, the broker ends up with two identical messages. What then?

Answer: you can attach a unique identifier to each message to detect duplicates, and consumers can clean them up when reading, which keeps message processing idempotent.
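One simple way to do this, sketched below, is to attach a client-generated ID as a record header; the header name "message-id" is only an example.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UniqueIdRecord {
    // Attach a client-generated ID so that consumers (or a downstream store) can
    // detect and drop duplicates caused by producer retries.
    static ProducerRecord<String, String> withMessageId(String topic, String key, String value) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, value);
        record.headers().add("message-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```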

Small summary

To achieve highly reliable transmission on the producer side, set acks=all, configure a sensible minimum number of in-sync replicas, and handle the errors the client cannot resolve automatically, such as message-size errors, authentication errors, serialization errors, the producer exhausting its retry limit, or the producer's memory buffer filling up. These errors must be handled by the programmer at the code level.
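A common place to handle such errors is the producer's send callback. The sketch below only illustrates the pattern; what you do in each branch depends on your application.

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.errors.RetriableException;

public class ReliabilityCallback implements Callback {
    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            return; // the broker acknowledged the write
        }
        if (exception instanceof RetriableException) {
            // The client already retried up to its configured limits; perhaps park the
            // record somewhere and resend it later.
        } else {
            // Non-retriable error (for example an oversized message or an authentication
            // failure): it has to be dealt with in application code, e.g. logged, alerted
            // on, or written to a dead-letter store.
        }
    }
}
```

It is passed as the second argument of producer.send(record, new ReliabilityCallback()).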


Consumers in reliable systems

Having discussed how to produce data while preserving Kafka's reliability guarantees, let's look at how to read data in the same spirit. As noted earlier, only data that has been committed to Kafka, that is, written to all in-sync replicas, is available to consumers. This means the messages consumers receive are already consistent.

What does a consumer have to do, then?

Answer: the only thing it needs to do is track which messages have been read and which have not.

Such tracking is inseparable from one concept: the offset. Consumers have several parameters that matter greatly for reliability and are tied to the offset.

  1. group.id

If two consumers have the same group.id and subscribe to the same topic, each of them is assigned a subset of the topic's partitions, which means each consumer reads only a subset of all the messages (the group as a whole reads them all).

  2. auto.offset.reset

1. This property specifies what the consumer should do when it reads a partition for which it has no committed offset, or the offset is invalid. There are two values:
latest: when the offset is invalid, the consumer starts reading from the latest records.
earliest: when the offset is invalid, the consumer reads the partition's records from the beginning.

  3. enable.auto.commit

1. This property specifies whether the consumer commits offsets automatically; the default is true.
2. The main drawback of automatic commits is that you cannot control duplicate processing of messages, and if messages are handed off to another background thread, the auto-commit mechanism may commit an offset before a message has actually been processed.
3. To avoid duplicated or lost data as much as possible (in essence, an offset committed too late causes reprocessing, while one committed too early causes messages to be skipped), set this property to false and commit offsets manually.

  4. auto.commit.interval.ms

If parameter 3 (enable.auto.commit) is set to true, this parameter controls how frequently offsets are committed automatically; the default is once every 5 seconds. (A consumer configuration sketch follows.)
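Pulling these parameters together, here is a minimal sketch of a consumer that disables auto-commit and commits offsets manually only after the records have been processed. The broker address, the group id "order-readers", and the topic "orders" are assumptions for illustration.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReliableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-readers");            // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");        // start from the beginning if there is no valid offset
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");          // we commit offsets ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));           // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                                           // your business logic
                }
                consumer.commitSync();                                         // commit only after processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}
```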

Having covered the important consumer configuration parameters, let's look at committing offsets manually. Here are a few points to pay attention to.

  1. Always commit offsets after the messages have been processed.
  2. The commit frequency is a trade-off between performance and the number of duplicate messages replayed after a crash.
  3. Make sure you know exactly which offsets you are committing (commit the offset of messages that have been processed, not merely read).
  4. Consider consumer rebalances. (A rebalance occurs when consumers join or leave the group.)
  5. Ensure the idempotence of message processing. For example, when writing to systems that support unique keys, such as relational databases, Elasticsearch, or key-value stores, you can build a unique key from topic + partition + offset; in other words, this combination uniquely identifies a Kafka record (see the sketch below).
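A minimal sketch of building such a key from a consumed record:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class RecordKeys {
    // topic + partition + offset uniquely identifies a Kafka record, so it can serve as the
    // primary key when writing to a relational database, Elasticsearch, or a key-value store:
    // writing the same record twice simply overwrites the same row instead of duplicating it.
    static String uniqueKey(ConsumerRecord<?, ?> record) {
        return record.topic() + "-" + record.partition() + "-" + record.offset();
    }
}
```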

Summary

This article is essentially a short summary built on top of the previous articles. From three angles, Kafka's broker side, producer side, and consumer side, it highlights several important points and parameters and relates them to Kafka's reliability.
For example:

  1. Kafka's broker side: the replication factor, the minimum number of in-sync replicas, and whether only complete (clean) leader elections are allowed.
  2. Kafka's producer side: the impact of acks, the number of retries, and which errors must be handled in code.
  3. Kafka's consumer side: offset-related configuration, how offsets are committed, consumer retries, how to ensure idempotence, and so on.

The next article will cover Kafka's data pipelines (Kafka Connect).


Origin blog.csdn.net/Zong_0915/article/details/109644270