Kafka partition and replica mechanism

Producer partition write rules

Overview

When a producer writes a message to a topic, Kafka assigns the message to a partition according to a set of rules. If the producer specifies a partition, the message is written to that partition. If no partition is specified, the partition is chosen according to the following rules:

Classification

Hash partition (the default when a key is specified)

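Advantages: Records with the same key always go to the same partition.
Disadvantages: Because every record with a given key produces the same hash remainder, a key that occurs far more often than the others overloads its partition and causes data skew.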
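Both points can be made concrete with a small sketch that counts how many records land in each partition when one key dominates the traffic. This is only an illustration under assumptions: `HashSkewDemo`, `partitionFor` and `distribute` are hypothetical names, and `String.hashCode` stands in for the murmur2 hash that Kafka's built-in partitioner applies to the serialized key bytes.

```java
import java.util.Arrays;
import java.util.List;

class HashSkewDemo {
    // Simplified stand-in for hash partitioning: hash the key, take the remainder.
    // The mask clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Count the records per partition for a given sequence of keys.
    static int[] distribute(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[partitionFor(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 8 of the 10 records share the key "hot", so its partition gets at least 8 records.
        List<String> keys = Arrays.asList(
                "hot", "hot", "hot", "hot", "hot", "hot", "hot", "hot", "a", "b");
        System.out.println(Arrays.toString(distribute(keys, 3)));
    }
}
```

The partition that holds the hot key receives most of the records no matter how many partitions the topic has, which is exactly the skew described above.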

Round-robin partition (used when no key is specified)

  • This default strategy is also the most widely used one; it ensures that messages are distributed evenly across all partitions.
  • If the key is null when the message is produced, the round-robin algorithm spreads the messages evenly over the partitions.

Advantages: Data is distributed more evenly.
Disadvantages: Records with the same key can end up in different partitions.
Before Kafka 2.4, records without a key used round-robin partitioning; from 2.4 onward they use sticky partitioning.
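The difference between round-robin and sticky partitioning can be sketched as follows. This is a simplified model, not Kafka's implementation: the class and method names are hypothetical, a "batch" is modeled as a fixed number of records (`batchRecords`), and the next partition is picked by rotation so that the output is deterministic, whereas the real sticky partitioner switches when the producer's batch is complete and picks the next partition at random.

```java
import java.util.Arrays;

class KeylessPartitioners {
    // Round-robin: every record moves on to the next partition.
    static int[] roundRobin(int records, int numPartitions) {
        int[] assigned = new int[records];
        for (int i = 0; i < records; i++) {
            assigned[i] = i % numPartitions;
        }
        return assigned;
    }

    // Sticky (simplified): stay on one partition until a batch of
    // batchRecords records is complete, then move to another partition.
    static int[] sticky(int records, int numPartitions, int batchRecords) {
        int[] assigned = new int[records];
        for (int i = 0; i < records; i++) {
            assigned[i] = (i / batchRecords) % numPartitions;
        }
        return assigned;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(roundRobin(6, 3))); // [0, 1, 2, 0, 1, 2]
        System.out.println(Arrays.toString(sticky(6, 3, 2)));  // [0, 0, 1, 1, 2, 2]
    }
}
```

Both variants spread the records evenly over time, but the sticky variant fills one batch per partition at a time, which means fewer and larger requests to the brokers.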

Random partition (rarely used)

With the random strategy, each message is assigned to a randomly chosen partition. In earlier versions the default partition strategy was random, which was also meant to spread messages evenly over the partitions. The later round-robin strategy does this better, so the random strategy is rarely used.
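Taken together, the built-in rules described so far amount to a fixed order of checks: an explicitly specified partition first, then the key hash, then the keyless strategy. A minimal sketch of that order, assuming round-robin as the keyless strategy; `PartitionChooser` and `choose` are hypothetical names, and `Arrays.hashCode` stands in for the murmur2 hash Kafka applies to the serialized key:

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper that illustrates the order in which the rules are applied.
class PartitionChooser {
    private final AtomicInteger counter = new AtomicInteger(0);

    int choose(Integer explicitPartition, byte[] keyBytes, int numPartitions) {
        // Rule 1: a partition specified by the caller always wins.
        if (explicitPartition != null) {
            return explicitPartition;
        }
        // Rule 2: with a key, hash it and take the remainder.
        // The mask clears the sign bit so the result is never negative.
        if (keyBytes != null) {
            return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }
        // Rule 3: without a key, rotate through the partitions.
        return (counter.getAndIncrement() & 0x7fffffff) % numPartitions;
    }
}
```

Calling `choose(2, null, 3)` returns the specified partition 2, while repeated calls with neither a partition nor a key walk through partitions 0, 1, 2 and start again at 0.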

Custom partition
Implementation code:
1. Create a custom partitioner

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

import java.util.Map;
import java.util.Random;

/**
 * @ClassName UserPartition
 * @Description Custom partitioner that implements random partitioning
 * @Date 2021/3/31 9:21
 * @Create By     Frank
 */
public class UserPartition implements Partitioner {

    // Random number generator, created once instead of once per record
    private final Random random = new Random();

    /**
     * Returns the partition number for this record.
     * @param topic      name of the topic
     * @param key        the key
     * @param keyBytes   serialized bytes of the key
     * @param value      the value
     * @param valueBytes serialized bytes of the value
     * @param cluster    cluster metadata object
     * @return the partition number
     */
    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        // Get the number of partitions of the topic
        Integer count = cluster.partitionCountForTopic(topic);
        // Pick a random partition number in the range [0, count)
        return random.nextInt(count);
    }

    @Override
    public void close() {
        // Release resources
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // Read the configuration
    }
}

2. In the Kafka producer configuration, specify the class name of the custom partitioner
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, UserPartition.class.getName()); // import UserPartition from the package that contains it

Replica mechanism

Overview

Replicas are backups of the data. When a broker in the Kafka cluster goes down or loses the data of a partition, the replicas on other brokers are still available.

Producer's acks parameter

The acks parameter determines how strictly a message must be written to the replicas before the producer considers it sent. The different acks settings trade performance against data safety.

Parameter classification

acks set to 0:
acks=0: The producer sends the next message immediately, regardless of whether the Kafka cluster has received the previous one.
Advantages and disadvantages:

Advantages: Best performance; sending is fast.
Disadvantages: Data is lost easily, with a relatively high probability.

acks set to 1:
acks=1: The producer sends the data to Kafka. Kafka waits until the leader replica of the partition has been written successfully and then returns an ack, after which the producer sends the next message.
Advantages and disadvantages:

Advantages: A balance between performance and safety.
Disadvantages: Data can still be lost, but the probability is relatively small.

acks set to -1 or all:
acks=all/-1: The producer sends the data to Kafka. Kafka waits until all in-sync replicas of the partition have been written and then returns an ack, after which the producer sends the next message.
Advantages and disadvantages:

Advantages: Data is safe.
Disadvantages: Slow.
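The three levels are selected with the producer setting `acks`. The following sketch only builds the configuration and does not connect to a cluster; `AcksConfigSketch` and `producerProps` are hypothetical names, the broker address is a placeholder, and the serializer settings a real producer also needs are left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class AcksConfigSketch {
    private static final List<String> VALID = Arrays.asList("0", "1", "all", "-1");

    // Build producer properties for the given acks level.
    static Properties producerProps(String acks) {
        if (!VALID.contains(acks)) {
            throw new IllegalArgumentException("acks must be 0, 1, all or -1: " + acks);
        }
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address (assumption)
        props.put("acks", acks); // "0" = no wait, "1" = leader only, "all"/"-1" = in-sync replicas
        return props;
    }
}
```

For example, `producerProps("all")` yields the safest and slowest of the three settings, and `producerProps("0")` the fastest and least safe.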

Supplement: What if Kafka does not return an ack?

  • The producer waits a limited time for the Kafka cluster to return the ack. If no ack arrives within that time, the producer treats the write as failed, because the data may have been lost.
  • The producer has a retry mechanism that resends the record to Kafka.
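The waiting time and the retry behaviour are controlled by producer settings. A minimal configuration sketch: the setting names are Kafka producer configuration keys, while the class name `RetryConfigSketch` is hypothetical and the values are illustrative choices, not recommendations.

```java
import java.util.Properties;

class RetryConfigSketch {
    static Properties retryProps() {
        Properties props = new Properties();
        props.put("request.timeout.ms", "30000");   // how long to wait for the broker's response
        props.put("retries", "3");                  // how many times a failed record is resent
        props.put("retry.backoff.ms", "100");       // pause between two attempts
        props.put("delivery.timeout.ms", "120000"); // upper bound for the whole send, retries included
        return props;
    }
}
```

These properties would be merged with the rest of the producer configuration before the producer is created.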

Remaining problem: the ack is returned only after the Kafka cluster has written the data successfully. If Kafka fails while returning the ack, the producer never receives it and resends the record, so the data is duplicated. How can this be solved?
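The duplication can be reproduced with a toy model in plain Java. Nothing here is Kafka code: `ToyBroker` and `sendWithRetry` are hypothetical stand-ins for the broker and the producer's retry mechanism, and the lost ack is simulated with a flag.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateOnRetryDemo {
    // Toy broker: appends every record it receives to its log.
    static class ToyBroker {
        final List<String> log = new ArrayList<>();
        private boolean dropNextAck;

        ToyBroker(boolean dropFirstAck) {
            this.dropNextAck = dropFirstAck;
        }

        // Returns true if the ack reaches the producer, false if it is lost.
        boolean write(String record) {
            log.add(record);          // the write itself succeeds
            if (dropNextAck) {
                dropNextAck = false;  // simulate a failure while returning the ack
                return false;
            }
            return true;
        }
    }

    // Toy producer: resend until an ack arrives or the attempts run out.
    // Returns the number of attempts used, or -1 if no ack was received.
    static int sendWithRetry(ToyBroker broker, String record, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (broker.write(record)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker(true);
        int attempts = sendWithRetry(broker, "order-42", 3);
        // prints: 2 attempts, log = [order-42, order-42]
        System.out.println(attempts + " attempts, log = " + broker.log);
    }
}
```

The first write succeeds but its ack is lost, so the retry writes the same record a second time and the log ends up with two copies.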

Click on the link to learn how Kafka solves this problem

Origin blog.csdn.net/zh2475855601/article/details/115346569