Detailed explanation of ContainerProperties.AckMode in spring-kafka

  Recently we hit a performance problem in production that nearly caused an outage. We then changed a single line of code and message processing became dozens of times faster. That sounds exaggerated, but the numbers are real: a wrong configuration in production can cost an order of magnitude in performance. The story of our problem shows how.

  Our production system is connected to Tencent Cloud's IoT platform. Every event uploaded by an IoT device reaches us through Tencent Cloud's ckafka. As the number of devices and the volume of events grew, consuming from ckafka became a bottleneck: data backed up during peak periods, and the resulting processing delays caused business problems. The simplest fix is to add partitions and consumers, and that is what we did when the problem first appeared half a year ago. Doubling the partitions doubled the throughput. Half a year later, however, we hit the bottleneck again.

  After investigating, we found that processing a single Kafka message took 6 ms. Breaking the execution down step by step showed that those 6 ms were spent mainly on sending the ack to Tencent Cloud. The RTT from our data center to Tencent Cloud happens to be about 6 ms, so almost all of the time went to network round trips. That is bounded by physical distance and cannot be reduced. We then noticed, by accident, that our code set spring-kafka's AckMode to MANUAL_IMMEDIATE. In this mode the consumer commits every single message to the server individually. After we changed the setting to AckMode.MANUAL, the time per message dropped from 6 ms to under 0.2 ms, a speedup of more than 30 times. Even without further scaling, the headroom should now last us for years. Why does such a small configuration change make such a difference, and what other modes are there?
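  The figures above translate directly into per-consumer throughput. The following is only a back-of-the-envelope sketch: it assumes a single consumer thread that handles records strictly one after another, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope throughput for one consumer thread that handles
// records strictly one after another. Names are illustrative only.
final class AckThroughput {

    // Records per second when each record costs perRecordMillis in total.
    static double recordsPerSecond(double perRecordMillis) {
        return 1000.0 / perRecordMillis;
    }

    public static void main(String[] args) {
        double immediate = recordsPerSecond(6.0); // one commit round trip per record
        double batched = recordsPerSecond(0.2);   // commit cost shared by a whole batch
        System.out.printf("MANUAL_IMMEDIATE: about %.0f records/s%n", immediate);
        System.out.printf("MANUAL:           about %.0f records/s%n", batched);
        System.out.printf("speedup:          about %.0fx%n", batched / immediate);
    }
}
```

  At 6 ms per record a single consumer tops out at roughly 167 records per second, while 0.2 ms per record allows roughly 5,000, which is where the factor of about 30 comes from.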

  In fact, spring-kafka offers not just MANUAL and MANUAL_IMMEDIATE but the following seven ack modes, each with its own behavior and suitable scenarios.

  • RECORD : Confirm immediately after each record is processed.
  • BATCH : Confirm once all records returned by a poll() call have been processed; a single commit covers the whole batch.
  • TIME : Confirm once the configured time interval has elapsed since the last commit, covering the records processed in that period.
  • COUNT : Confirm once the configured number of records has been processed since the last commit.
  • COUNT_TIME : Combines TIME and COUNT; confirm as soon as either condition is met.
  • MANUAL : The listener calls Acknowledgment.acknowledge() itself; the acknowledgments are queued and committed in batches.
  • MANUAL_IMMEDIATE : The listener calls Acknowledgment.acknowledge() itself; each call is committed right away, so every message gets its own commit.

  These seven modes fall into two groups, manual confirmation and automatic confirmation. MANUAL and MANUAL_IMMEDIATE are manual; the rest are automatic. The core difference between the two groups is whether your code has to call Acknowledgment.acknowledge() explicitly. Let's look at them one by one.
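  For orientation, here is a minimal sketch of where the ack mode is set and what the explicit call looks like in a listener. It assumes a Spring application with spring-kafka on the classpath and an existing ConsumerFactory bean; the class names, the topic "device-events" and the group "iot-consumers" are placeholders, not names from our project.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.EnableKafka;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Configuration
@EnableKafka
class KafkaAckConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // The ack mode lives on the container properties.
        // MANUAL batches the commits; MANUAL_IMMEDIATE commits per acknowledge().
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
        return factory;
    }
}

@Component
class DeviceEventListener {

    // Topic and group id are placeholder names.
    @KafkaListener(topics = "device-events", groupId = "iot-consumers")
    public void onEvent(ConsumerRecord<String, String> record, Acknowledgment ack) {
        handle(record.value());
        ack.acknowledge(); // only reached when handle() did not throw
    }

    private void handle(String payload) {
        // business logic goes here (hypothetical)
    }
}
```

  With one of the automatic modes, the Acknowledgment parameter and the acknowledge() call are simply left out of the listener.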

Manual confirmation

MANUAL:

  In this mode, the consumer has to call Acknowledgment.acknowledge() itself after processing a message. The confirmations are then committed in batches: the commit is deferred until a batch of messages has been processed and only then sent to Kafka. The advantage is efficiency, because the number of round trips to the Kafka server goes down. The disadvantage shows up when the consumer crashes halfway through a batch. Because the processed messages were not yet confirmed to the Kafka server, they are pulled again after the restart, so part of the data is consumed twice.
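  The crash scenario can be made concrete with a toy model. This is a sketch with no Kafka client involved, under the assumption that nothing from the current batch has been committed when the crash happens; the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition; no Kafka client involved.
final class BatchCrashDemo {

    // Offsets that are delivered a second time after a crash, when offsets are
    // committed only once the whole poll() batch has been processed.
    static List<Long> redeliveredAfterCrash(long committedOffset, int batchSize,
                                            int processedBeforeCrash) {
        if (processedBeforeCrash < 0 || processedBeforeCrash > batchSize) {
            throw new IllegalArgumentException("processedBeforeCrash out of range");
        }
        List<Long> redelivered = new ArrayList<>();
        // Nothing from this batch was committed, so the restarted consumer
        // resumes at committedOffset and sees the processed records again.
        for (long offset = committedOffset;
             offset < committedOffset + processedBeforeCrash; offset++) {
            redelivered.add(offset);
        }
        return redelivered;
    }

    public static void main(String[] args) {
        // Batch of 500 records starting at offset 1000, crash after 250 of them.
        List<Long> duplicates = redeliveredAfterCrash(1000L, 500, 250);
        System.out.println("records consumed twice: " + duplicates.size());
    }
}
```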

MANUAL_IMMEDIATE:

  In this mode, too, the consumer has to call Acknowledgment.acknowledge() itself after processing a message. Unlike in MANUAL mode, however, the confirmation is sent to Kafka as soon as acknowledge() is called instead of waiting for a batch to finish. This increases the number of round trips to the Kafka server and becomes a significant bottleneck when network latency is high. In return, confirmations reach Kafka as early as possible, so even if the consumer crashes, at most a single message is consumed twice.
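  The cost difference between the two manual modes can be estimated with a small model. This is a sketch under simplifying assumptions: every commit is synchronous and costs one network round trip, and records always arrive in full batches. The batch size of 500 is Kafka's default for max.poll.records; the other numbers and all names are illustrative.

```java
// Rough cost model: every commit is assumed to block for one round trip.
final class AckCostModel {

    // MANUAL_IMMEDIATE: one commit per record.
    static long commitsImmediate(long records) {
        return records;
    }

    // MANUAL: one commit per poll() batch (rounded up for a partial last batch).
    static long commitsBatched(long records, int batchSize) {
        return (records + batchSize - 1) / batchSize;
    }

    // Total time spent waiting on commits.
    static double totalCommitMillis(long commits, double rttMillis) {
        return commits * rttMillis;
    }

    public static void main(String[] args) {
        long records = 100_000;
        int batchSize = 500;   // Kafka's default max.poll.records
        double rttMillis = 6.0;

        long immediate = commitsImmediate(records);
        long batched = commitsBatched(records, batchSize);
        System.out.println("MANUAL_IMMEDIATE commits: " + immediate
                + ", waiting " + totalCommitMillis(immediate, rttMillis) + " ms");
        System.out.println("MANUAL commits:           " + batched
                + ", waiting " + totalCommitMillis(batched, rttMillis) + " ms");
    }
}
```

  For 100,000 records at a 6 ms RTT, the model gives 100,000 commits and 600 seconds of waiting for MANUAL_IMMEDIATE, against 200 commits and 1.2 seconds for MANUAL.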

  The advantage of manual confirmation is that the consumer's own code decides whether a message was consumed successfully, and a message that was not will not be confirmed. This ensures no data is lost, which is what distributed systems call at-least-once delivery. The core difference between the two manual modes is per-message versus batch confirmation, and batching can improve performance significantly. I described three ways to improve the performance of IO-intensive services in detail in a blog post last month, if you are interested.

Automatic confirmation

  RECORD, BATCH, TIME, COUNT, and COUNT_TIME are all automatic confirmations: your code does not need to call Acknowledgment.acknowledge() explicitly. Messages the consumer has pulled are confirmed automatically, whether or not they were really consumed successfully, so the automatic modes can lose data. Note, however, that an automatic mode can produce duplicates as well as losses, so it does not amount to at-most-once semantics either. Although all five are automatic, they differ from one another.

RECORD and BATCH

  First, let's look at RECORD and BATCH. These two are the automatic counterparts of the manual modes above: RECORD corresponds to MANUAL_IMMEDIATE and BATCH to MANUAL. RECORD confirms after every single record, so it suffers the same performance problem when network latency is high. BATCH confirms in batches: the messages returned by a poll() are confirmed together once they have been processed. As before, if the consumer crashes before the confirmation succeeds, the messages are pulled again. And if the consumer fails to process a message for some other reason but the message is confirmed anyway, that message is lost.
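  The loss case can be illustrated with a toy model. It is a sketch with no Kafka client involved, and it rests on an assumption: the listener in the automatic case swallows its own failures (for example by catching and logging exceptions), so from the container's point of view every record was handled. All names are made up for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model, no Kafka client involved. The handler returns true on success.
final class AckLossDemo {

    // Automatic mode with a listener that swallows failures: the listener
    // returns normally for every record, so the whole batch is committed.
    static int committedAutomatic(List<String> batch, Predicate<String> handler) {
        for (String record : batch) {
            if (!handler.test(record)) {
                System.out.println("auto: failed but still committed: " + record);
            }
        }
        return batch.size();
    }

    // Manual mode: acknowledge only after success and stop at the first
    // failure, so the failed record and everything after it stay uncommitted.
    static int committedManual(List<String> batch, Predicate<String> handler) {
        int committed = 0;
        for (String record : batch) {
            if (!handler.test(record)) {
                break;
            }
            committed++;
        }
        return committed;
    }

    public static void main(String[] args) {
        List<String> batch = List.of("e1", "bad", "e3");
        Predicate<String> handler = r -> !r.equals("bad");
        System.out.println("auto committed:   " + committedAutomatic(batch, handler));
        System.out.println("manual committed: " + committedManual(batch, handler));
    }
}
```

  In the automatic case all three records count as committed and "bad" is never seen again. In the manual case only "e1" is committed, so "bad" and "e3" are delivered again later.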

TIME

  TIME mode confirms on a timer. For example, with the interval set to 5 s, the consumer commits to Kafka every 5 s the messages consumed in that window. The catch: with a high-frequency data stream and a long interval, a large number of unconfirmed messages can pile up, and all of them are pulled again after a crash. The COUNT mode described next avoids this.
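  How large the unconfirmed backlog can get is simple arithmetic. The sketch below assumes a steady message rate; the rate of 10,000 records per second is an invented example, and the names are illustrative. In spring-kafka the interval itself is set in milliseconds with ContainerProperties.setAckTime(long).

```java
// Worst-case number of consumed-but-uncommitted records in TIME mode,
// assuming a steady input rate.
final class TimeModeBacklog {

    static long maxUncommitted(long recordsPerSecond, long ackTimeMillis) {
        return recordsPerSecond * ackTimeMillis / 1000;
    }

    public static void main(String[] args) {
        // 10,000 records/s with a 5 s commit interval.
        System.out.println("records replayed after a crash, worst case: "
                + maxUncommitted(10_000, 5_000));
    }
}
```

  At 10,000 records per second and a 5 s interval, up to 50,000 records may be replayed after a crash.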

COUNT

  In COUNT mode, confirmation is triggered by the number of records consumed, for example once every 100 records, which caps the amount of unconfirmed data. With an extremely low-frequency stream, though, say one record every few minutes so that 100 records take several hours to accumulate, data stays unconfirmed long after it was consumed. If the consumer restarts or the group rebalances in that window, everything since the last commit is consumed again.
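  The low-frequency case is again plain arithmetic. This sketch assumes evenly spaced records; the 3-minute gap is an invented stand-in for "a few minutes", and the names are illustrative. In spring-kafka the threshold is set with ContainerProperties.setAckCount(int).

```java
import java.time.Duration;

// How long consumed records stay uncommitted in COUNT mode when records
// arrive at a fixed, slow pace.
final class CountModeDelay {

    static Duration timeToReachCount(int ackCount, Duration gapBetweenRecords) {
        return gapBetweenRecords.multipliedBy(ackCount);
    }

    public static void main(String[] args) {
        // One record every 3 minutes, commit every 100 records.
        Duration wait = timeToReachCount(100, Duration.ofMinutes(3));
        System.out.println("hours until the first commit: " + wait.toHours());
    }
}
```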

COUNT_TIME

  COUNT_TIME combines TIME and COUNT to offset the weaknesses of each: a commit happens as soon as either the time interval or the record count is reached, which makes it more adaptable. So when choosing among TIME, COUNT, and COUNT_TIME, I personally think you can default to COUNT_TIME unless you know the characteristics of your data well enough to tell which of the other two fits better.
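  The "whichever comes first" rule can be sketched as a small trigger class. This is a simplified model, not spring-kafka's implementation: it checks the conditions after every record, whereas the container, as far as I understand it, evaluates them once the records of a poll() have been processed. The class name and the explicit clock parameter are assumptions made to keep the example deterministic.

```java
// Simplified model of a COUNT_TIME style trigger: a commit is due as soon as
// either the record count or the elapsed time reaches its threshold.
final class CountTimeTrigger {

    private final int ackCount;
    private final long ackTimeMillis;
    private int pending;
    private long lastCommitMillis;

    CountTimeTrigger(int ackCount, long ackTimeMillis, long nowMillis) {
        this.ackCount = ackCount;
        this.ackTimeMillis = ackTimeMillis;
        this.lastCommitMillis = nowMillis;
    }

    // Call after each processed record; returns true when a commit is due.
    boolean onRecordProcessed(long nowMillis) {
        pending++;
        boolean countReached = pending >= ackCount;
        boolean timeReached = nowMillis - lastCommitMillis >= ackTimeMillis;
        if (countReached || timeReached) {
            pending = 0;
            lastCommitMillis = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CountTimeTrigger trigger = new CountTimeTrigger(3, 5_000, 0);
        long[] arrivals = {100, 200, 300, 400, 6_000};
        for (long t : arrivals) {
            System.out.println("t=" + t + " ms, commit due: " + trigger.onRecordProcessed(t));
        }
    }
}
```

  With a count of 3 and an interval of 5 s, the third record triggers a commit through the count condition, and the record arriving at 6,000 ms triggers one through the time condition even though only two records are pending.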

Summary

  To summarize: if you cannot tolerate data loss, you must choose a manual mode, and if network latency is relatively high, choose MANUAL (batched commits). Note that even the manual modes cannot guarantee freedom from duplicates; to get that, consumption has to be made idempotent by other means, such as database transactions. If you can accept losing some data (monitoring data, for example), you can consider an automatic mode, but I personally do not recommend RECORD, because it causes serious performance problems when network latency is high. Choose among the remaining modes according to your data volume and network conditions; the performance difference between modes can be significant.

Origin blog.csdn.net/xindoo/article/details/132652579