Kafka reading notes

The concepts of idempotence and transactions in Kafka

1. Idempotence: simply put, calling an interface multiple times produces the same result as calling it once.

1.1. In Kafka, idempotence is enabled on the producer with: properties.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
With idempotence enabled, the defaults are retries=Integer.MAX_VALUE and acks=-1 (all), and max.in.flight.requests.per.connection must be no greater than 5 (default 5).
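
A minimal configuration sketch (the broker address is a placeholder, not from the original notes):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties properties = new Properties();
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Enable idempotence; overriding retries/acks/max.in.flight with conflicting values causes a configuration error.
properties.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
Producer<String, String> producer = new KafkaProducer<>(properties, new StringSerializer(), new StringSerializer());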

1.2. To implement producer idempotence, Kafka introduces two concepts: the producer id (PID) and the sequence number.
For each PID, every partition the producer sends messages to has a corresponding sequence number, and these sequence numbers increase monotonically from zero. Each time the producer sends a message, the sequence number for the corresponding <PID, partition> pair is incremented by 1.
The broker maintains a sequence number in memory for each <PID, partition> pair. For each message it receives, the broker accepts it only when its sequence number (SN_new) is exactly one greater than the sequence number maintained on the broker side (SN_old), i.e. SN_new = SN_old + 1.
If SN_new < SN_old + 1, the message has already been written and the broker can simply discard it. If SN_new > SN_old + 1, some data in between has not been written and the ordering is broken, which implies possible message loss; the corresponding producer will throw an exception.
Since Kafka only maintains this state per <PID, partition>, idempotence is guaranteed only for a single partition within a single producer session.
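
As an illustration only, the broker-side check described above can be sketched roughly as follows; this is not the actual broker code, which is considerably more involved (it tracks batches, producer epochs, and persisted producer state):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;

class SequenceNumberCheck {
    // last accepted sequence number per <PID, partition>, keyed here as "pid-partition"
    private final Map<String, Integer> lastSeq = new HashMap<>();

    // Returns true to append the message, false to discard it as a duplicate.
    boolean accept(long pid, int partition, int snNew) {
        String key = pid + "-" + partition;
        int snOld = lastSeq.getOrDefault(key, -1);
        if (snNew == snOld + 1) {          // exactly the next expected sequence: accept
            lastSeq.put(key, snNew);
            return true;
        } else if (snNew < snOld + 1) {    // already written: duplicate, discard
            return false;
        } else {                           // gap: possible loss or reordering, fatal for the producer
            throw new OutOfOrderSequenceException("expected " + (snOld + 1) + ", got " + snNew);
        }
    }
}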

2. Transactions: guarantee the atomicity of operations, meaning that multiple operations either all succeed or all fail; there is no partial success or partial failure.

2.1. To use transactions in Kafka, set: properties.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transaction_Id"); the idempotence configuration must also be enabled.

2.2. Kafka offers two transaction isolation levels for consumers: read_uncommitted (the consumer application can see messages of uncommitted transactions, and of course committed ones as well) and read_committed (the consumer application cannot see messages of uncommitted transactions).
For example, if a producer opens a transaction and sends three messages msg1, msg2, and msg3 to a partition, then before commitTransaction() or abortTransaction() is executed, a consumer application set to "read_committed" cannot consume these messages; they are cached inside the KafkaConsumer, and only after the producer executes commitTransaction() are they delivered to the consumer application. Conversely, if the producer
executes abortTransaction(), the KafkaConsumer discards these cached messages without delivering them to the consumer application.
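
A minimal consumer sketch for the read_committed level (the group id and topic name are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumerGroup");
// Only messages of committed transactions (and non-transactional messages) are returned; the default is "read_uncommitted".
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props, new StringDeserializer(), new StringDeserializer());
consumer.subscribe(Collections.singletonList("my-topic"));
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));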

2.3. The __transaction_state topic stores the transaction state for each transactional ID. The transaction state information for a given transactional ID is sent to the partition hash("transaction_Id") % partitionNum of that topic.

Sample code for transactions in Kafka: 

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.errors.AuthorizationException;
import org.apache.kafka.common.errors.OutOfOrderSequenceException;
import org.apache.kafka.common.errors.ProducerFencedException;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "my-transactional-id");
Producer<String, String> producer = new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());

// Register the transactional.id with the transaction coordinator and fence any older producer with the same id.
producer.initTransactions();

try {
    producer.beginTransaction();
    for (int i = 0; i < 100; i++)
        producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // We can't recover from these exceptions, so our only option is to close the producer and exit.
    producer.close();
} catch (KafkaException e) {
    // For all other exceptions, just abort the transaction and try again.
    producer.abortTransaction();
}
producer.close();

The concept of __consumer_offsets in Kafka

The __consumer_offsets topic is created the first time a consumer in the cluster consumes a message. Its defaults are offsets.topic.replication.factor=3 and offsets.topic.num.partitions=50.
The offsets committed by each consumer group are sent to the partition hash(consumer-groupId) % partitionNum of this topic.
The retention of these committed offsets on the broker side is controlled by offsets.retention.minutes: 10080 (7 days) in version 2.0.0 and later, 1440 (1 day) before that.
If a consumer resumes consumption after an interval longer than offsets.retention.minutes, its previously committed offsets will have been removed, and the starting position can only be chosen according to the client's auto.offset.reset parameter.

Viewing the content of __consumer_offsets:
For example, to view the offset records of the consumer group consumerGroupId, first calculate the partition number corresponding to that consumer group.

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic __consumer_offsets --partition 20 --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter"
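
The partition number used above can be derived, for example, like this (a sketch assuming the default offsets.topic.num.partitions = 50; Kafka's internal abs handling of Integer.MIN_VALUE differs slightly from Math.abs):

String groupId = "consumerGroupId";
int offsetsTopicPartitions = 50;  // default offsets.topic.num.partitions
int partition = Math.abs(groupId.hashCode()) % offsetsTopicPartitions;
System.out.println("__consumer_offsets partition: " + partition);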


The concepts of the AR and ISR sets in Kafka

1. A partition contains one or more replicas, one of which is the leader replica and the rest of which are follower replicas, and each replica resides on a different broker node. Only the leader replica serves client requests; the follower replicas are only responsible for synchronizing data from it. All the replicas of a partition are collectively called the AR (Assigned Replicas).

2. The ISR (In-Sync Replicas) is the set of replicas that are in sync with the leader replica. Replicas outside the ISR set, i.e. replicas that have fallen behind in synchronization or have failed (for example, replicas that are not alive), are collectively called failed replicas, and a partition that contains failed replicas is also called an under-replicated partition.

LEO and HW concepts in Kafka

1. LEO (Log End Offset) identifies the position of the next message to be written, i.e. the position following the last message in the log; each replica of a partition has its own LEO.
2. The smallest LEO among the replicas in the ISR is the HW, commonly known as the high watermark. Consumers can only pull messages before the HW.
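For example (hypothetical numbers): if the three replicas of a partition have LEOs of 7, 5, and 6 and all three are in the ISR, the HW of the partition is 5, so consumers can only fetch the messages at offsets 0 through 4.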


Kafka's topic partition management:

1. Query under-replicated partitions
bin/kafka-topics.sh --zookeeper localhost:2181/kafka --describe --topic my_topic --under-replicated-partitions

Kafka consumer management:

1. View a consumer group's consumption progress
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumerGroup

2. View the state of a consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumerGroup --state

3. View the member information of each consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumerGroup --members

4. View the partition assignment of each consumer group member
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group consumerGroup --members --verbose

5. Delete the specified consumer group
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --delete --group consumerGroup

Kafka consumption offset management

1. Reset the offsets of all topics in a consumer group; there must be no active consumers in the group:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group consumerGroup --all-topics --reset-offsets --to-earliest --execute

2. Reset the offset of a specified topic partition in a consumer group:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group consumerGroup --topic topic-monitor:2 --reset-offsets --to-latest --execute

In the previous two examples, the --to-earliest and --to-latest parameters adjusted the consumer offsets to the beginning and the end of the partitions, respectively.

In addition, the kafka-consumer-groups.sh script also offers more options:

--by-duration <String: duration>: reset offsets to the earliest offset within the given duration back from the current time; the duration format is "PnDTnHnMnS".

--from-file <String: path to CSV file>: reset offsets to the positions recorded in the CSV file.

--shift-by <Long: number-of-offsets>: shift offsets to the current offset plus number-of-offsets; number-of-offsets can be negative.

--to-current: reset offsets to the current position.

--to-datetime <String: datetime>: reset offsets to the earliest offset whose timestamp is greater than the given time; the datetime format is "YYYY-MM-DDTHH:mm:SS.sss".

--to-offset <Long: offset>: reset offsets to the specified offset.
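
For example (a hypothetical invocation, reusing the group and topic names from above), the following would move the group's offsets on topic-monitor back by 10 messages:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group consumerGroup --topic topic-monitor --reset-offsets --shift-by -10 --execute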


Manually deleting messages in Kafka:

The kafka-delete-records.sh script can be used to delete messages before a specified offset.

First, specify for each partition the offset before which messages should be deleted, and write a delete.json file like the following:

{
	"partitions":[
		{
			"topic":"my_topic",
			"partition":0,
			"offset":10
		},
		{
			"topic":"my_topic",
			"partition":1,
			"offset":10
		},
		{
			"topic":"my_topic",
			"partition":2,
			"offset":10
		}
	],
	"version":1
}

Execute the delete command:
bin/kafka-delete-records.sh --bootstrap-server localhost:9092 --offset-json-file delete.json 


Origin blog.csdn.net/qq_32323239/article/details/103549265