[Kafka Progressive Series 007] Analyzing the Kafka Producer from an Interview Perspective

1. Why does Kafka use partitions?

Kafka organizes messages in a three-level structure: topic - partition - message. Each message is stored in exactly one partition; it is never spread across multiple partitions.

The role of partitions is to provide load balancing and achieve high system scalability.

Different partitions are placed on different broker nodes, and reads and writes are performed at partition granularity. Each node can independently serve the read and write requests for its own partitions, and overall system throughput can be increased simply by adding nodes.

In addition, partitions help with business-level message ordering: ordering is guaranteed within a single partition, so a business ordering requirement can be met by specifying a message key that routes related messages to the same partition, as the sketch below shows.
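
A minimal sketch of key-based ordering, assuming a String-keyed producer; the topic name "orders", the key orderId, and the event payloads are illustrative, not from the original article:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderingSketch {
    // Both events share the same key (an order ID), so they are routed to the
    // same partition and therefore keep their relative send order.
    public static void sendOrderEvents(KafkaProducer<String, String> producer, String orderId) {
        producer.send(new ProducerRecord<>("orders", orderId, "ORDER_CREATED"));
        producer.send(new ProducerRecord<>("orders", orderId, "ORDER_PAID"));
    }
}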

2. What are the common partitioning strategies?

A partitioning strategy determines which partition the producer sends each message to. Common strategies are:

  • Round-robin strategy: messages are assigned to partitions in sequence. This is the default strategy, has the best load-balancing behavior, and is generally recommended.
  • Random strategy: each message is sent to a randomly chosen partition.
  • Key-based strategy: messages with the same key are sent to the same partition, and messages within each partition are processed in order.
  • Custom strategy: implement the org.apache.kafka.clients.producer.Partitioner interface (a sketch follows this list).
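
A minimal sketch of a custom partitioner. The routing rule here (keys prefixed with "audit-" are pinned to partition 0) is made up purely for illustration:

import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class AuditAwarePartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        // Hypothetical rule: pin "audit-" keys to partition 0.
        if (key != null && key.toString().startsWith("audit-")) {
            return 0;
        }
        if (keyBytes == null) {
            return 0;
        }
        // Otherwise hash the key bytes, mirroring the default keyed behavior.
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}

It is registered on the producer with props.put("partitioner.class", AuditAwarePartitioner.class.getName());.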

3. Producer compression algorithm

Compression trades CPU time for disk space and network I/O: for a small CPU cost, it yields smaller disk usage and less data transferred over the network.

(1) Compression

Compression in Kafka occurs in two places: the producer and the broker.

On the producer side, compression is enabled with props.put("compression.type", "gzip");.
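
A minimal configuration sketch; the broker address and String serializers are placeholder assumptions:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducerSketch {
    public static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Enable producer-side compression; besides gzip, Kafka also supports
        // snappy, lz4, and (since 2.1) zstd.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        return new KafkaProducer<>(props);
    }
}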

The broker side usually does not recompress messages; it sends out what it received intact. There are, however, two scenarios in which the broker does recompress messages:

  • The broker is configured with a compression algorithm different from the producer's
  • A message format conversion occurs on the broker

The broker sets its compression algorithm with compression.type. If its value differs from the producer's, the broker decompresses received messages and recompresses them with its own algorithm (this can easily cause CPU usage to spike).

The message format has gone through multiple versions. If a message set mixes new- and old-version messages, the broker converts the new-version messages to the old version, which likewise involves decompressing and recompressing.

(2) Decompression

Generally speaking, decompression happens on the consumer side: the producer compresses the message, the broker receives and stores it, and the consumer receives and decompresses it.

The producer records which compression algorithm it used in the message set (note: the message set, not each individual message), and when the consumer reads a message set it decompresses the messages inside according to that algorithm.

The broker also has to decompress every message set it receives in order to validate the messages (note: again the message set, not individual messages; messages are not recompressed unless one of the two broker-side recompression scenarios above occurs).

(3) When to enable compression

  • Enable compression when the machine running the producer has ample spare CPU; otherwise compression will drive CPU usage up and starve other work;
  • Enable compression when bandwidth is limited, as it can greatly reduce network consumption;
  • Try to avoid the extra decompression/recompression caused by message format conversion.

4. How does the producer ensure messages are not lost?

When an application sends a message to Kafka, several factors can cause the send to fail:

  • Network jitter: the message sent by the producer never reaches the broker;
  • The message is larger than the broker accepts and is rejected;
  • The broker is down (nothing can fix this in code; get operations to recover it as soon as possible).

All of the above create the false appearance that the producer has "lost" messages.

So how do we solve these problems?

On the producer side, use the send API with a callback, producer.send(msg, callback), rather than the fire-and-forget producer.send(msg). The callback tells you whether the broker received the message; if a send fails, you know about it and can respond appropriately, for example by retrying.

A detailed example of an asynchronous send with a callback:

/**
 * Asynchronously send a Kafka message.
 *
 * @param topic   target topic
 * @param key     message key
 * @param message message payload
 */
public void asyncSendMsg(String topic, String key, String message) {
    ProducerRecord<String, String> producerRecord = new ProducerRecord<>(topic, key, message);
    producer.send(producerRecord, (recordMetadata, e) -> {
        // send failed: log it so the failure can be handled (e.g. retried)
        if (e != null) {
            log.error("kafka msg send error, topic={}, key={}, message={}", topic, key, message, e);
            return;
        }

        // send success
        if (recordMetadata != null) {
            log.info("kafka msg send success, topic={}, key={}, partition={}, offset={}, timestamp={}",
                    topic, key, recordMetadata.partition(), recordMetadata.offset(), recordMetadata.timestamp());
        } else {
            log.info("kafka msg send success but metadata is null, topic={}, key={}", topic, key);
        }
    });
}

In production, message loss is prevented with the following configuration:

(1) Producer configuration:

  • Use the send API with a callback: producer.send(msg, callback);

  • Set acks=all, meaning a message is considered "committed" only once all in-sync replicas have received it;

  • Set retries > 0 to enable automatic producer retries: when the network suffers transient jitter, a send may fail, and retries > 0 lets the producer retry automatically instead of losing the message. (A combined sketch of these settings follows this list.)
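
A sketch combining the producer-side settings above; the concrete values are illustrative choices, not mandated by the article:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class NoLossProducerConfigSketch {
    public static Properties noLossProducerProps() {
        Properties props = new Properties();
        // All in-sync replicas must acknowledge before a message counts as "committed".
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry automatically on transient failures such as network jitter.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}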

(2) Broker configuration

  • unclean.leader.election.enable = false: a broker whose replica lags far behind the original leader may lose messages if it is elected as the new leader; setting this to false forbids such brokers from becoming leader;
  • replication.factor >= 3: keep at least 3 replicas so each message is stored on multiple brokers, avoiding loss when a single broker goes down;
  • min.insync.replicas > 1: a message is considered "committed" only after it has been written to at least this many replicas; note that this lower bound only takes effect when acks=-1 (i.e. acks=all);
  • replication.factor > min.insync.replicas: if the two are equal, losing just one replica makes the partition unwritable. For example, if both are 2 and one replica goes down, min.insync.replicas=2 can no longer be satisfied. The usual recommendation is replication.factor = min.insync.replicas + 1.

(3) Consumer configuration

Set enable.auto.commit to false and commit offsets manually in code.
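
A minimal sketch of manual offset commits; the broker address, group id, and topic name are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumerSketch {
    public static void consume() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Disable auto-commit so offsets only advance after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value()); // stands in for real processing
                }
                consumer.commitSync(); // commit only after the batch is processed
            }
        }
    }
}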

5. Kafka producer TCP connection management

Kafka's producers, consumers, and brokers all communicate over TCP, which allows Kafka to use some of TCP's more advanced capabilities, such as multiplexing requests over a single connection.

(1) When the producer creates TCP connections to brokers

  • When the producer creates a KafkaProducer instance, it creates and starts a Sender thread in the background; when this thread runs, it creates TCP connections to brokers.

    As soon as the bootstrap.servers parameter is specified, the producer creates TCP connections to the configured brokers and then sends them metadata requests to fetch cluster metadata.

    Note: by default a producer will end up creating TCP connections to all brokers in the cluster, even though it may actually communicate with only a few of them.

  • When the producer tries to send a message to an unknown topic (one for which it has no partition or node metadata), it asks the Kafka cluster for a metadata update to fetch the latest metadata;

  • The producer also refreshes metadata periodically according to the metadata.max.age.ms parameter, which defaults to 5 minutes.

(2) When TCP connections are closed

  • The user closes it actively: producer.close();
  • Kafka closes it automatically: the producer-side connections.max.idle.ms parameter controls how long an idle TCP connection to a broker is kept open. The default is 9 minutes, so Kafka closes any connection that has carried no requests for more than 9 minutes; setting it to -1 keeps connections open permanently. (See the sketch below.)
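
A small sketch of the connection-related settings discussed above; the broker address and serializers are placeholders, and the values shown are simply the documented defaults:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ConnectionConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Refresh metadata every 5 minutes (the default).
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 300000);
        // Close idle TCP connections after 9 minutes (the default); -1 keeps them forever.
        props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 540000);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // ... send messages ...
        producer.close(); // the user-initiated close described above
    }
}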

6. How does the Kafka producer ensure messages are not sent repeatedly?

The producer side offers three levels of message-delivery reliability:

  • At most once: the producer sends each message only once, so messages are never duplicated but may be lost; for example, transient network jitter prevents the broker from receiving a send, and the message is gone. This is achieved by disabling producer retries.
  • At least once: messages are never lost but may be delivered more than once; this is Kafka's default guarantee, since the producer retries failed sends.
  • Exactly once: messages are neither lost nor delivered more than once.

In practice we almost always need to guarantee that messages are not lost, so the second or third level is generally chosen.

So, on the premise that messages are not lost, how does the producer guarantee they are not sent repeatedly? Kafka provides two mechanisms:

  • Idempotence
  • Transactions

(1) Idempotent Producer

Kafka has provided an idempotent producer since version 0.11.0.0: simply set props.put("enable.idempotence", true) and, with no other code changes, Kafka deduplicates messages automatically. When the producer resends a message with the same identifying fields, the broker recognizes the duplicate and discards it.
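
A one-line configuration sketch of the flag described above:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotentProducerSketch {
    public static Properties idempotentProps() {
        Properties props = new Properties();
        // With idempotence enabled, the broker deduplicates retried sends,
        // per partition and per producer session.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        return props;
    }
}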

However, the idempotent producer only guarantees no duplicates within a single partition of a topic, and the guarantee only holds within one producer session: if the process restarts, idempotence is lost.

(2) Transactional Producer

A transactional producer guarantees that messages are written atomically to multiple partitions, and unlike the idempotent producer, its exactly-once guarantee survives producer restarts.
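
A sketch of a transactional send; the broker address, transactional.id, and topic names are placeholder assumptions. Note that consumers must also set isolation.level=read_committed to see only committed messages:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "demo-tx-id"); // illustrative id

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both sends commit atomically: either both become visible or neither does.
                producer.send(new ProducerRecord<>("topic-a", "k", "v1")); // placeholder topics
                producer.send(new ProducerRecord<>("topic-b", "k", "v2"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                producer.abortTransaction(); // neither message becomes visible
            }
        }
    }
}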

Origin: blog.csdn.net/noaman_wgs/article/details/105646647