Kafka class notes: day two

Kafka's partition and replica mechanism

Producer's partition write strategy

  • Round-robin (polling) strategy: messages are distributed evenly across the partitions so that the load on each partition stays as balanced as possible
    • This is the default strategy when the message key is null
  • Random strategy (not used in practice)
  • Key-based write strategy: partition = hash(key) % number of partitions
  • Custom partitioning strategy (similar to specifying a custom partitioner in MapReduce); a minimal sketch follows this list
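
A minimal sketch of the custom partitioning strategy, assuming the Java producer client; the class name MyPartitioner and its null-key fallback are made up for illustration:

    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    import java.util.Map;

    // Hypothetical custom partitioner: hashes the key, falls back to partition 0 when the key is null
    public class MyPartitioner implements Partitioner {

        @Override
        public void configure(Map<String, ?> configs) { }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0;                                                   // no key: use a fixed partition
            }
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;  // hash(key) % partitions
        }

        @Override
        public void close() { }
    }

It would be registered on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, MyPartitioner.class.getName()).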

Out of order problem

  • In Kafka, the producer follows a write strategy; if the topic has multiple partitions, the data ends up stored in different partitions
  • When the number of partitions is greater than 1, the messages are scattered across the partitions, so global ordering cannot be guaranteed
  • If there is only one partition, the messages are ordered

Consumer Group Rebalance Mechanism

  • Rebalancing: in some situations the partitions consumed by the consumers in a consumer group change, leading to an uneven assignment (for example, two consumers share three partitions and one consumer crashes, leaving partitions with no consumer). The Kafka consumer group then starts the rebalance mechanism to redistribute the partitions among the consumers in the group.
  • Trigger conditions
    • The number of consumers changes
      • A consumer crashes
      • A consumer is added
    • The number of topics changes
      • A topic is deleted
    • The number of partitions changes
      • A partition is deleted
      • A partition is added
  • Adverse effects
    • While a rebalance is in progress, all consumers stop consuming and take part in the rebalance together; work resumes only after every consumer has been assigned its partitions (the rebalance ends). A sketch for observing this from a consumer follows this list.
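
A small sketch of watching a rebalance happen from the client side, assuming the Java consumer; broker address, group id and topic name are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;

    public class RebalanceDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "demo-group");                // assumed group id
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("test"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // called before partitions are taken away: commit offsets / flush state here
                    System.out.println("Revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // called once the rebalance has finished with the new assignment
                    System.out.println("Assigned: " + partitions);
                }
            });
            // ... poll loop would go here
        }
    }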

Consumer's partition allocation strategy

Partition allocation strategy: make sure every consumer gets a fair share of the partitions; no consumer should end up with far more partitions than another, and none with far fewer.

  • Range allocation strategy: Kafka's default allocation strategy
    • n = number of partitions / number of consumers (integer division)
    • m = number of partitions % number of consumers
    • The first m consumers each consume n+1 partitions
    • The remaining consumers each consume n partitions
  • RoundRobin allocation strategy
    • Partitions are assigned to consumers one by one in turn
  • Sticky allocation strategy
    • Without a rebalance, it behaves the same as the round-robin strategy
    • When a rebalance occurs, the round-robin strategy redoes the whole round-robin assignment; the sticky strategy keeps the assignment as close as possible to the previous one and only spreads the partitions that must be reassigned evenly over the remaining available consumers
    • This reduces the cost of moving partitions between consumers (a configuration sketch follows this list)
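
A minimal configuration sketch for selecting the assignment strategy, assuming the Java consumer client; broker address and group id are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.RoundRobinAssignor;

    import java.util.Properties;

    public class AssignmentStrategyDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringDeserializer");
            // the default is RangeAssignor; RoundRobinAssignor and StickyAssignor are also built in
            props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                      RoundRobinAssignor.class.getName());

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            // ... subscribe and poll as usual
            consumer.close();
        }
    }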

Copy ACK mechanism

The producer continuously writes data to Kafka, and every write returns a result indicating whether it succeeded. This is controlled by the acks configuration.

  • acks = 0: the producer just sends and does not wait for any acknowledgment; data may be lost. Best performance
  • acks = 1: the producer waits until the leader partition has written the message successfully, then sends the next one
  • acks = -1/all: the message must be written to the leader partition and successfully replicated to the corresponding replicas before the next message is sent; worst performance, strongest guarantee

Choose the acks setting according to the business requirements: if the highest performance is needed and losing some data has little impact, choose 0 or 1; if data must never be lost, it must be configured as -1/all. A configuration sketch follows.
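
A configuration sketch for acks, assuming the Java producer client; broker address and topic name are placeholders:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class AcksDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ACKS_CONFIG, "all");   // "0" fastest, "1" leader only, "all"/"-1" safest

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("test", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();            // the write was not acknowledged
                } else {
                    System.out.println("written to partition " + metadata.partition()
                            + " at offset " + metadata.offset());
                }
            });
            producer.close();
        }
    }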

A partition has a leader and followers. To ensure that all consumers see consistent data, only the partition leader reads and writes messages; the followers only synchronize the data as backups.

High-Level API and Low-Level API

  • High-level API: Kafka manages consumption directly, handling distribution and the data
    • Offsets are stored in ZK
    • Kafka's rebalance controls which partitions each consumer is assigned
    • Development is relatively simple; developers do not need to care about the underlying details
    • Fine-grained control is not possible
  • Low-level API: the consuming program controls the logic itself
    • Manage the offsets yourself; they can be stored in ZK, MySQL, Redis, HBase, or Flink's state store
    • A consumer can be told to pull data from a specific partition
    • Fine-grained control is possible
    • Kafka's built-in strategies no longer apply, so the consumption mechanism must be implemented by the program itself (a sketch follows this list)
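
A sketch of the low-level style, assuming the Java consumer client: the consumer is assigned one partition directly and the offset is kept in an external store. The topic, partition number and the loadOffsetFromStore/saveOffsetToStore helpers are hypothetical placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ManualOffsetConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "demo-group");
            props.put("enable.auto.commit", "false");           // Kafka no longer tracks progress for us
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            TopicPartition tp = new TopicPartition("test", 2);
            consumer.assign(Collections.singletonList(tp));      // direct assignment: no subscription, no rebalance
            consumer.seek(tp, loadOffsetFromStore(tp));          // resume from the externally stored offset

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());          // process the record
                    saveOffsetToStore(tp, record.offset() + 1);  // e.g. MySQL/Redis/HBase, ideally in one transaction
                }
            }
        }

        private static long loadOffsetFromStore(TopicPartition tp) { return 0L; }   // placeholder
        private static void saveOffsetToStore(TopicPartition tp, long offset) { }   // placeholder
    }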

Kafka principle

leader and follower

  • In Kafka, leader and follower are roles of a partition, not of a broker
  • When Kafka creates a topic, it tries to place the partition leaders on different brokers, which is effectively load balancing
  • Leader responsibility: read and write data
  • Follower responsibilities: synchronize data and take part in elections (after the leader crashes, one follower is elected as the new leader of the partition)
  • Note the difference from ZooKeeper
    • The ZK leader handles reads and writes, and ZK followers can serve reads
    • The Kafka leader handles reads and writes, and Kafka followers cannot serve reads or writes (this keeps the data seen by every consumer consistent). Since a topic has multiple partitions, each with its own leader, data operations are still load-balanced across brokers.

AR \ ISR \ OSR

  • AR (Assigned Replicas): all replicas of a partition under a topic
  • ISR: In-Sync Replicas, the replicas that are currently in sync (roughly, the followers that are alive and keeping up)
  • OSR: Out-of-Sync Replicas, replicas that have fallen out of sync
  • AR = ISR + OSR (a sketch for inspecting these follows this list)
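
A sketch for inspecting the leader, AR and ISR of each partition with the AdminClient, assuming the Java client; allTopicNames() is the accessor in recent client versions, and broker and topic are placeholders:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    import java.util.Collections;
    import java.util.Properties;

    public class IsrDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            try (Admin admin = Admin.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singletonList("test"))
                                             .allTopicNames().get().get("test");
                for (TopicPartitionInfo p : desc.partitions()) {
                    System.out.println("partition " + p.partition()
                            + " leader=" + p.leader()
                            + " AR="     + p.replicas()   // all replicas
                            + " ISR="    + p.isr());      // replicas currently in sync
                }
            }
        }
    }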

leader election

  • Controller: the controller is the "boss" of the Kafka cluster; it is a role taken by one broker
    • The controller is made highly available through an election in ZK
  • Leader: a role of a partition
    • Partition leaders are elected quickly from the ISR
  • If Kafka relied on ZK for every leader election, the pressure on ZK could be large: when a broker crashes, it does not host just one leader but many, so many leaders need to be re-elected. Electing from the ISR keeps this fast.
  • Leader load balancing
    • If a broker crashes, partition leaders may become unevenly distributed, i.e. one broker ends up holding the leaders of several partitions of a topic
    • With the following command, leadership can be moved back to the broker holding the preferred replica, so that leaders are evenly distributed
    bin/kafka-leader-election.sh --bootstrap-server node1.itcast.cn:9092 --topic test --partition=2 --election-type preferred
    

Kafka read and write process

  • Write process
    • Find the leader of the partition through ZooKeeper; the leader is responsible for writing
    • The producer writes the data to the leader
    • The followers in the ISR synchronize the data and return an ACK to the leader
    • The leader returns an ACK to the producer
  • Read process
    • Find the leader of the partition through ZooKeeper; the leader is responsible for reading
    • Find the consumer's current offset through ZooKeeper
    • Pull data sequentially starting from that offset
    • Commit the offset (automatic commit: commit every few seconds; manual commit: commit as part of the processing transaction). A commit sketch follows this list.
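
A sketch of the read loop with manual offset commit, assuming the Java consumer client; broker address, group id and topic are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class CommitDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "demo-group");
            props.put("enable.auto.commit", "false");           // switch from periodic auto-commit to manual commit
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("test"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.println(record.offset() + ": " + record.value());
                    }
                    consumer.commitSync();   // commit only after the whole batch has been processed
                }
            }
        }
    }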

Kafka's physical storage

  • Kafka's data organization structure
    • topic
    • partition
    • segment
      • .log data file
      • .index (sparse index)
      • .timeindex (index based on time)
  • How data is located when reading
    • The consumer's offset is a global offset within the partition
    • The segment containing this offset is located first
    • The global offset is then converted into the segment's local offset
    • Using the local offset, the physical position of the record is found through the .index sparse index
    • Reading then proceeds sequentially from that position (a conceptual sketch of the lookup follows this list)
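
A conceptual sketch of the lookup only (not Kafka's internal code): segment files are named by their base offset, so the segment holding a global offset is the one with the largest base offset that is still less than or equal to the requested offset, and the local offset is the difference. The base offsets below are made up:

    import java.util.TreeMap;

    public class SegmentLookup {
        public static void main(String[] args) {
            // hypothetical segment base offsets of one partition
            TreeMap<Long, String> segments = new TreeMap<>();
            segments.put(0L,      "00000000000000000000.log");
            segments.put(368795L, "00000000000000368795.log");
            segments.put(737412L, "00000000000000737412.log");

            long globalOffset = 400000L;
            long baseOffset  = segments.floorKey(globalOffset);   // segment that contains the offset
            long localOffset = globalOffset - baseOffset;         // position then resolved via the .index file

            System.out.println("segment=" + segments.get(baseOffset) + ", localOffset=" + localOffset);
        }
    }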

Semantics of message passing

Flink, for example, provides guarantees for each of these semantics, including an Exactly-Once guarantee (implemented with a two-phase transactional commit).

  • At-most once: at most once (the data is consumed regardless of whether processing succeeds, so data may be lost)
  • At-least once: at least once (messages may be consumed repeatedly)
  • Exactly-Once: exactly once (a transactional guarantee that each message is processed only once; a producer-side sketch follows this list)
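
A producer-side sketch of Kafka's transactional API, which is the building block for exactly-once pipelines; broker address, transactional.id and topic are placeholders, and the error handling is simplified:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class TransactionalDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx-id");        // assumed transactional id

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("test", "key", "value"));
                producer.commitTransaction();   // either everything in the transaction becomes visible, or nothing
            } catch (Exception e) {
                producer.abortTransaction();    // simplified: fatal exceptions would require closing the producer
            } finally {
                producer.close();
            }
        }
    }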

Kafka's messages are not lost

  • Broker messages are not lost: because there are replicas that continuously synchronize from the leader, a broker crash does not cause data loss, unless the partition has only one replica
  • Producer messages are not lost: use the ACK mechanism (configured as all/-1); with 0 or 1, data may be lost
  • Consumer consumption is not lost: the key is controlling the offset
    • At-least once: some data may be consumed more than once
    • Exactly-Once: each message is consumed only once

Data backlog

  • Data backlog means that messages stay in the partitions without being consumed, because the consumers are slowed down by external IO or other time-consuming operations (for example a Full GC causing a stop-the-world pause)
  • In production there must be a monitoring system so that a backlog can be handled as soon as it appears. Even though downstream Spark Streaming/Flink can apply a back-pressure mechanism, a large accumulation of data inevitably hurts the latency of a real-time system

Data cleaning & quota speed limit

  • Data cleaning
    • Log deletion: when messages meet certain conditions (age, log size, offset), Kafka marks the log segments for deletion (the segment files get a .deleted suffix) and a log-management task periodically cleans them up
      • The default retention is 7 days
    • Log compaction
      • Applies to key-value data where one key can correspond to several different versions of the value
      • After compaction, only the latest version of each key is kept
  • Quota rate limit
    • The rates of producers and consumers can be limited
    • This prevents Kafka from running so fast that it occupies all the IO resources of the server (broker); a topic-configuration sketch follows this list
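
A sketch for adjusting retention and compaction of a single topic with the AdminClient, assuming the Java client; broker, topic and the values are placeholders, and client/user quotas themselves are normally set through the kafka-configs tooling:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class TopicConfigDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
                Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(
                        new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"),   // 7 days, the default
                                          AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("cleanup.policy", "compact"),   // keep only the latest value per key
                                          AlterConfigOp.OpType.SET)));
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }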

Origin: blog.csdn.net/xianyu120/article/details/115270387