Troubleshooting a Kafka consumption delay

Recently, Kafka consumption delays occasionally occurred in production, even though the amount of data in the system is not large. So why is consumption delayed?
The analysis is as follows.

Basic ideas:
1. Check whether data is backlogged on the Kafka machines, i.e. whether the delay is caused by an excessive amount of data.
2. Measure the time from a message being sent to Kafka successfully until it is consumed (before any business processing is done).
3. Measure the time from a message being consumed until its business processing completes.

1. Check how the topic's data is distributed across partitions on the Kafka machines

The consumer group description shows that the group has one topic, the topic has 6 partitions, and the consumers are distributed across two machines (two IPs), with three consumers on each machine.
Focus on the log backlog of each partition (check the LAG column):
partition-3: backlog of 0
partition-1: backlog of 1
partition-2: backlog of 3
partition-0: backlog of 0
partition-4: backlog of 0
partition-5: backlog of 0
The backlog is clearly not serious, so the delay is not caused by a large amount of data.
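
As a rough illustration of this check, the Python sketch below extracts the per-partition lag from consumer group description text and totals it. It assumes the text has a header row with PARTITION and LAG columns, as printed by tools such as `kafka-consumer-groups.sh --describe`. The topic name and offsets in the sample are hypothetical; only the lag values mirror the ones listed above.

```python
from typing import Dict

def parse_lag(describe_output: str) -> Dict[int, int]:
    """Map partition number -> lag, locating columns by header name."""
    rows = [ln.split() for ln in describe_output.splitlines() if ln.strip()]
    header = rows[0]
    p_idx = header.index("PARTITION")
    lag_idx = header.index("LAG")
    lags = {}
    for row in rows[1:]:
        if not row[lag_idx].isdigit():
            continue  # skip rows without a numeric lag value
        lags[int(row[p_idx])] = int(row[lag_idx])
    return lags

# Hypothetical sample: topic name and offsets are made up for illustration.
SAMPLE = """
TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
demo-topic  3          1200            1200            0
demo-topic  1          1500            1501            1
demo-topic  2          1497            1500            3
demo-topic  0          1100            1100            0
demo-topic  4          1300            1300            0
demo-topic  5          1250            1250            0
"""

lags = parse_lag(SAMPLE)
total_lag = sum(lags.values())
worst_partition = max(lags, key=lags.get)
print(f"total lag={total_lag}, worst partition={worst_partition}")
```

A total lag of a few messages, as here, points away from data volume as the cause.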

2. Measure the time from a message being sent to Kafka successfully until it is consumed (before any business processing is done)

Statistical analysis of the logs shows that this step basically takes only milliseconds, so it is not the cause of the delay.
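
One way to gather this figure, sketched below in Python, is to subtract each record's timestamp from the time it was received and summarize the differences. The sketch assumes the record timestamp is the producer-side creation time in epoch milliseconds and that the producer and consumer clocks are reasonably synchronized. The function names and sample values are hypothetical.

```python
import time
from statistics import median

def delivery_latency_ms(record_timestamp_ms, received_at_ms=None):
    """Milliseconds between record creation and receipt by the consumer."""
    if received_at_ms is None:
        received_at_ms = int(time.time() * 1000)
    # Clamp at zero so small clock skew does not yield negative latencies.
    return max(0, received_at_ms - record_timestamp_ms)

def summarize(latencies_ms):
    """Basic statistics for a non-empty list of latencies."""
    return {
        "count": len(latencies_ms),
        "median_ms": median(latencies_ms),
        "max_ms": max(latencies_ms),
    }

# Hypothetical (created, received) timestamp pairs in epoch milliseconds.
pairs = [(1000, 1003), (2000, 2005), (3000, 3012)]
stats = summarize([delivery_latency_ms(c, r) for c, r in pairs])
print(stats)
```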

3. Measure the time from a message being consumed until its business processing completes

Statistical analysis of the logs shows that this step also basically takes milliseconds, but occasionally a message takes 20 minutes to process. The delay is therefore certainly caused by some operation performed while processing the consumed data.
Looking at the corresponding logs, I found that business processing sometimes needs to call an external interface, and that the address of this interface had been blocked. No timeout was set on the HTTP call, so the service hung there and waited 20 minutes before the call finally timed out, which delayed the consumption of subsequent messages.
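
A spike like this is easier to spot when each message's processing time is measured and slow cases are logged. The Python sketch below wraps a handler with a timer. `timed`, `handle_message`, and the one-second threshold are hypothetical choices for illustration, not taken from the original service.

```python
import logging
import time
from functools import wraps

logger = logging.getLogger("consumer.timing")

def timed(slow_threshold_s=1.0):
    """Decorator that records a handler's duration and warns when it is slow."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                wrapper.last_elapsed_s = elapsed  # exposed for inspection
                if elapsed >= slow_threshold_s:
                    logger.warning("%s took %.3fs", handler.__name__, elapsed)
        return wrapper
    return decorator

@timed(slow_threshold_s=1.0)
def handle_message(payload):
    # Placeholder for the real business processing.
    return payload.upper()

result = handle_message("order-created")
print(result, f"{handle_message.last_elapsed_s:.6f}s")
```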

4. Resolution

In the end, setting the HTTP request timeout to 15 seconds solved the problem.
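
The article does not say which language or HTTP client the service uses, so the following is only a minimal Python sketch of the same idea using the standard library. `call_external` and the example URL are hypothetical; the 15-second value is the one mentioned above.

```python
import urllib.request
from typing import Optional

REQUEST_TIMEOUT_S = 15.0  # the timeout value the article settled on

def call_external(url: str, timeout_s: float = REQUEST_TIMEOUT_S) -> Optional[bytes]:
    """Return the response body, or None if the call fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return None

# Hypothetical usage: fall back instead of blocking the consumer thread.
# body = call_external("http://example.invalid/api")
# if body is None:
#     ...  # skip, retry later, or route the message elsewhere
```

In this client the timeout applies to each blocking socket operation rather than to the request as a whole, so a slowly trickling response can still exceed 15 seconds in total. Without any timeout, an unreachable address can hold the consumer for as long as the operating system keeps the connection attempt alive.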

Origin blog.csdn.net/wuxiaolongah/article/details/110977378