Producer partition write rules
Overview
When a producer writes a message to a topic, Kafka distributes the data across partitions according to the following rules: if the message specifies a partition, it is written to that partition; if no partition is specified, the partition is chosen by one of the strategies below:
Classification
Hash partitioning (the default when a key is specified)
Advantage: messages with the same key always go to the same partition.
Disadvantage: since hashing the same key always yields the same partition, heavily used keys can cause data skew.
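The idea can be sketched in a few lines. Note this is an illustration of hash-mod partitioning only: Kafka's real default partitioner uses the murmur2 hash of the serialized key, not `String.hashCode()`, but the mapping principle is the same.

```java
public class HashPartitionDemo {
    // Illustrative only: hash the key, mask off the sign bit so the result
    // is non-negative, then take it modulo the partition count.
    static int partitionFor(String key, int partitionCount) {
        return (key.hashCode() & Integer.MAX_VALUE) % partitionCount;
    }

    public static void main(String[] args) {
        int partitions = 3;
        // The same key always maps to the same partition:
        System.out.println(
            partitionFor("user-1001", partitions) == partitionFor("user-1001", partitions));
    }
}
```

Because the mapping is purely a function of the key, a single "hot" key concentrates all of its messages on one partition, which is exactly the data-skew disadvantage described above.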
Round-robin partitioning (the default when no key is specified)
- The most commonly used strategy; it distributes messages evenly across all partitions.
- If the key is null when a message is produced, the round-robin algorithm spreads messages evenly over the partitions.
Advantage: data is distributed evenly.
Disadvantage: messages with the same key may end up in different partitions.
Before Kafka 2.4, the default for null keys was round-robin; from 2.4 onward it is the sticky partitioner.
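Round-robin partitioning can be sketched as a shared counter taken modulo the partition count. This is a simplified illustration, not Kafka's actual implementation (which also has to handle availability and, since 2.4, sticky batching):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinDemo {
    // Thread-safe counter shared by all sends.
    private final AtomicInteger counter = new AtomicInteger(0);

    // Each call advances the counter, so successive messages cycle through
    // partitions 0, 1, 2, 0, 1, 2, ... for a 3-partition topic.
    int nextPartition(int partitionCount) {
        return (counter.getAndIncrement() & Integer.MAX_VALUE) % partitionCount;
    }

    public static void main(String[] args) {
        RoundRobinDemo rr = new RoundRobinDemo();
        for (int i = 0; i < 6; i++) {
            System.out.print(rr.nextPartition(3) + " "); // prints 0 1 2 0 1 2
        }
    }
}
```

The sticky partitioner keeps the even distribution in aggregate but "sticks" to one partition until the current batch is full, which produces larger batches and better throughput than switching partitions on every message.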
Random partitioning (rarely used)
Randomly assigns each message to a partition. In early versions the default strategy was random, which also spread messages roughly evenly across partitions, but round-robin performs better, so the random strategy is now rarely used.
Custom partitioning
Implementation:
1. Create a custom partitioner
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;
import java.util.Random;
/**
 * @ClassName UserPartition
 * @Description A custom partitioner that implements random partitioning
 * @Date 2021/3/31 9:21
 * @Create By Frank
 */
public class UserPartition implements Partitioner {
/**
 * Returns the partition number for this record.
 * @param topic the topic name
 * @param key the key object
 * @param keyBytes the serialized key bytes
 * @param value the value object
 * @param valueBytes the serialized value bytes
 * @param cluster the cluster metadata
 * @return the chosen partition number
 */
@Override
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
//Get the number of partitions for this topic
Integer count = cluster.partitionCountForTopic(topic);
//Create a random-number generator
Random random = new Random();
//Randomly pick a partition number in [0, count)
int part = random.nextInt(count);
return part;
}
@Override
public void close() {
//Release resources
}
@Override
public void configure(Map<String, ?> configs) {
//Read configuration
}
}
2. In the Kafka producer configuration, register the custom partitioner by its fully qualified class name:
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, UserPartition.class.getName());
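In context, a minimal producer configuration using this partitioner might look like the sketch below. The broker address is a placeholder, and this assumes `UserPartition` is on the producer's classpath:

```java
Properties props = new Properties();
// Placeholder broker address; replace with your cluster's bootstrap servers.
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Register the custom partitioner; Kafka instantiates it by class name.
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, UserPartition.class.getName());
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```

The producer will now call `UserPartition.partition(...)` for every record that does not explicitly specify a partition.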
Replication mechanism
Overview
Replicas back up partition data: when a broker in the Kafka cluster loses its partition data or goes down, the replicas on other brokers remain available.
The producer's acks parameter
The acks parameter controls how strictly the producer requires replicas to acknowledge a write before a message is considered sent. Different acks settings trade performance against durability.
Parameter classification
acks=0:
The producer sends the next message immediately, regardless of whether the Kafka cluster has received the previous one.
Advantages and disadvantages:
Advantage: best performance, fastest.
Disadvantage: high probability of data loss.
acks=1:
The producer sends data to Kafka; Kafka returns an ack as soon as the partition's leader replica has been written successfully, and only then does the producer send the next message.
Advantages and disadvantages:
Advantage: a balance between performance and safety.
Disadvantage: data loss is still possible, though the probability is relatively small.
acks=-1 (or all):
The producer sends data to Kafka; Kafka waits until all in-sync replicas of the partition have been written, then returns an ack, and the producer sends the next message.
Advantages and disadvantages:
Advantage: the safest for data.
Disadvantage: the slowest.
Supplement: what if Kafka does not return an ack?
- The producer waits a limited time for the ack; if Kafka does not return one within that time, the producer treats the send as failed.
- The producer's retry mechanism then resends the message to Kafka.
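The acks level and the retry behavior described above are both producer settings. A hedged configuration fragment (property names are the standard `ProducerConfig` constants; the specific values here are illustrative choices, not recommendations from the original text):

```java
// Wait for all in-sync replicas before considering a send successful.
props.put(ProducerConfig.ACKS_CONFIG, "all");
// Resend a message if no ack arrives (illustrative retry count).
props.put(ProducerConfig.RETRIES_CONFIG, 3);
// Upper bound on the total time a send may take, including retries (illustrative).
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000);
```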
Remaining problem: Kafka returns an ack only after the data has been written successfully. If the broker goes down after writing the data but before the ack reaches the producer, the retry will write the same message again, causing data duplication. How can this be solved?