An In-Depth Look at the Internal Mechanics of the Kafka Producer

Overall, the Kafka Producer is the client that sends data to the Kafka cluster. Its components are shown in the figure below:

Basic components:

  • Producer Metadata – manages the metadata the producer needs: topics and partitions in the cluster, the broker nodes that act as partition leaders, and so on.
  • Partitioner – computes the partition for a given record.
  • Serializers – record key and value serializers; a serializer converts objects into byte arrays.
  • Producer Interceptors – interceptors that may mutate records before they are sent.
  • Record Accumulator – accumulates records and groups them into batches per topic partition.
  • Transaction Manager – manages transactions and maintains the state needed for idempotent production.
  • Sender – a background thread that sends data to the Kafka cluster.

Configuring the Kafka Producer

The Kafka Producer has three configuration parameters that must be specified:

  • bootstrap.servers — List of host/port pairs used to establish the initial connection to the Kafka cluster. Format: "host1:port1,host2:port2,…"
  • key.serializer — Represents the fully qualified class name of a key serializer that implements the org.apache.kafka.common.serialization.Serializer interface.
  • value.serializer — Represents the fully qualified class name of a value serializer that implements the org.apache.kafka.common.serialization.Serializer interface.
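As a minimal sketch of these three required settings, the dict below mirrors the Java property names; the `validate_config` helper is hypothetical, for illustration only, and is not part of any Kafka client API.

```python
# Illustrative sketch, not the real Kafka client: the three required
# producer settings held in a plain dict, plus a hypothetical helper
# that reports which required keys are missing.

REQUIRED_KEYS = ("bootstrap.servers", "key.serializer", "value.serializer")

def validate_config(config: dict) -> list:
    """Return the required keys that are missing from the config."""
    return [k for k in REQUIRED_KEYS if k not in config]

config = {
    "bootstrap.servers": "host1:9092,host2:9092",
    "key.serializer": "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer": "org.apache.kafka.common.serialization.StringSerializer",
}
```

A config missing any of the three keys would be rejected by the real client at construction time with a `ConfigException`.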

Sending Data to Kafka

The Kafka Producer sends messages asynchronously and returns a Future that represents the result of the send. Users can also provide a callback that is invoked once the record is acknowledged by the Kafka broker. While this looks simple, a lot happens behind the scenes.

  1. The producer passes the message through the configured list of interceptors. For example, an interceptor might alter the message and return an updated version.
  2. The serializers convert the record key and value into byte arrays.
  3. If a partition was not specified explicitly, the default or configured partitioner computes the topic partition.
  4. The record accumulator appends the message to a producer batch, using the configured compression algorithm.

At this time, the message is still in memory and has not been sent to the Kafka broker. The Record Accumulator groups in-memory messages by topic and partition.

The Sender thread groups batches whose partitions have the same broker as leader into a single request and sends them. At this point, the messages actually leave the producer for Kafka.
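The pipeline above can be sketched as a small stdlib-only simulation. All names here are illustrative stand-ins, not the real client API: interceptors rewrite the record, the key and value are serialized, a partition is computed, the batch is accumulated per (topic, partition), and the sender groups ready batches by leader broker.

```python
# Simplified simulation of the send path described above; every name
# here is illustrative, not part of the Kafka client API.
from collections import defaultdict

def send(record, interceptors, serialize, partition_for, accumulator):
    for interceptor in interceptors:        # 1. interceptors may rewrite the record
        record = interceptor(record)
    key = serialize(record["key"])          # 2. key and value become byte arrays
    value = serialize(record["value"])
    p = partition_for(record)               # 3. compute the topic partition
    # 4. accumulate into an in-memory batch per (topic, partition)
    accumulator[(record["topic"], p)].append((key, value))

def drain_by_leader(accumulator, leader_of):
    """Sender step: group batches by the broker that leads each partition."""
    requests = defaultdict(dict)
    for (topic, partition), batch in accumulator.items():
        requests[leader_of(topic, partition)][(topic, partition)] = batch
    return requests
```

Until `drain_by_leader` runs, everything sits in the accumulator's in-memory batches, which mirrors the point made above: appending to the accumulator does not mean the message has reached a broker.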

Time Spent Sending

Kafka Producer provides configuration parameters to control the time spent in various stages:

  • max.block.ms — Time to wait for metadata retrieval and buffer allocation
  • linger.ms — Time to wait for additional records to be sent
  • retry.backoff.ms — How long to wait before retrying a failed request
  • request.timeout.ms — Time to wait for response from Kafka broker
  • delivery.timeout.ms — introduced later as part of KIP-91 to provide users with guaranteed timeout caps without having to adjust producer component internals
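These timeouts are related: delivery.timeout.ms is expected to be at least linger.ms + request.timeout.ms, since a batch may wait up to linger.ms before being sent and then spend up to request.timeout.ms in flight. The helper below is an illustrative sketch of that sanity check, not the client's actual validation code.

```python
# Illustrative sketch of the relationship between the timing parameters:
# a batch can wait linger.ms in the accumulator and then request.timeout.ms
# in flight, so delivery.timeout.ms must cover at least their sum.
def delivery_timeout_is_valid(linger_ms: int, request_timeout_ms: int,
                              delivery_timeout_ms: int) -> bool:
    return delivery_timeout_ms >= linger_ms + request_timeout_ms
```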

Data Durability

Users can control the durability of messages written to Kafka through the acks configuration parameter. The allowed values are:

  • acks=0 — the producer does not wait for any broker acknowledgment.
  • acks=1 — the producer waits only for the partition leader to write the message, without waiting for the followers.
  • acks=all — the producer waits for all in-sync replicas to acknowledge the message. This costs latency but represents the strongest available guarantee.

There are some subtleties around in-sync replicas that need clarifying when using acks=all. On the Kafka side, two settings and one piece of current state affect the behavior:

  • the topic replication factor
  • the min.insync.replicas setting
  • the number of replicas currently in sync, including the leader itself

min.insync.replicas specifies the minimum number of in-sync replicas required for a request with acks=all. If this requirement cannot be met, the broker rejects the producer's request outright, without even attempting the write. The table below illustrates the possible scenarios.

During transient failures, the number of in-sync replicas may drop below the total number of replicas, but as long as it stays at or above min.insync.replicas, requests with acks=all will succeed.
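The broker-side rule just described reduces to a single comparison. The function below is an illustrative sketch of that check, not broker code:

```python
# Illustrative sketch of the broker-side admission check for acks=all:
# the write is accepted only if the current number of in-sync replicas
# (leader included) meets min.insync.replicas.
def accepts_acks_all(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    return in_sync_replicas >= min_insync_replicas
```

For example, with replication factor 3 and min.insync.replicas=2, losing one follower still allows writes, but losing two makes the broker reject acks=all requests up front.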

Users can mitigate transient failures and increase durability by resending failed requests. This is controlled by retries (default Integer.MAX_VALUE) and delivery.timeout.ms (default 120000). Retries may introduce duplicate messages and change message ordering. These side effects can be mitigated by setting enable.idempotence=true, at the cost of some throughput.
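The retry loop can be sketched as follows; this stdlib-only function mirrors the interplay of retries and retry.backoff.ms, with the sleep injected so the sketch stays testable. It is an illustration, not the client's actual retry machinery.

```python
# Illustrative sketch of retry-with-backoff: try up to retries + 1 times,
# waiting backoff_ms (cf. retry.backoff.ms) between attempts. The sleep
# callable is injected so the sketch can be tested without real delays.
def send_with_retries(attempt_send, retries: int, backoff_ms: int,
                      sleep=lambda ms: None) -> bool:
    for _attempt in range(retries + 1):
        if attempt_send():
            return True                 # broker acknowledged the request
        sleep(backoff_ms)               # back off before the next attempt
    return False                        # gave up after exhausting retries
```

Note how a send that succeeded on the broker but whose acknowledgment was lost would be retried anyway, which is exactly the duplication risk that enable.idempotence=true addresses.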

Partitioning

Messages in a topic are organized into partitions. Users can control partition assignment through the message key or a pluggable Partitioner implementation. The partitioner is set via the partitioner.class configuration, which should be the fully qualified name of a class implementing the org.apache.kafka.clients.producer.Partitioner interface.

Kafka provides three implementations out of the box: DefaultPartitioner, RoundRobinPartitioner and UniformStickyPartitioner.

DefaultPartitioner — if the message key is null, it uses the current sticky partition and switches to a new one when the next batch starts. For non-null keys, the partition is computed as murmur2hash(key) % total number of topic partitions.

RoundRobinPartitioner — distributes messages equally among all available partitions in a round-robin fashion, ignoring message keys. A partition is considered available if it has a broker assigned as its leader.

UniformStickyPartitioner — ignores message keys entirely; it uses the current sticky partition and switches to a new one when the next batch starts.
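The keyed path of the DefaultPartitioner boils down to hashing the key and taking it modulo the partition count. In the sketch below, Python's `zlib.crc32` stands in for Kafka's murmur2 hash, which Python's standard library does not provide; the real client uses murmur2, so actual partition numbers will differ.

```python
import zlib

# Illustrative sketch of the keyed partitioning formula; zlib.crc32 is a
# stand-in for Kafka's murmur2 hash, so this reproduces the shape of the
# computation (hash(key) % partition count), not the real assignments.
def partition_for_key(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions
```

The important property is visible even with the substitute hash: the same key always maps to the same partition, which is what gives keyed messages per-key ordering.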


Source: blog.csdn.net/weixin_39636364/article/details/128254055