The interviewer asked me: how do you solve a Kafka client timeout caused by a thread lock?

This article is shared from Huawei Cloud Community "Kafka client timeout problem caused by thread lock", author: Zhang Jian.

Problem background

In one environment, some of the Kafka client's send requests timed out. The topology is very simple.

[Figure: topology diagram, cke_114.png]

Troubleshooting process

We first checked the client environment and the JVM. The network between the virtual machine hosting the JVM and the Kafka server was normal, and garbage collection (GC) times were within the expected range. Nothing looked abnormal.

Next, we turned to the Kafka server, ran some basic checks, and reviewed Kafka's request-processing timeout logs. The metadata and produce requests we cared about had not timed out on the server.

The investigation had reached a dead end. We did find a known Kafka server-side issue that can cause timeouts for connected clients (https://github.com/apache/kafka/pull/10059), but a brief analysis showed that it was not the cause here.

Along the way we also noticed some anomalies in the environment. At the time we either did not consider them core issues or could not explain them, so we did not investigate them further.

  • The JVM in question had a very high thread count, more than 10,000. Although that is high, we did not expect it to have any substantial impact on a 4U container.
  • The thread responsible for metrics reporting had high CPU usage, taking about 1/4 to 1/2 of a CPU core. This did not seem like a big problem for a 4U container either.

With the investigation at an impasse, we considered other approaches and tried capturing packets to look for clues. The traffic used SASL authentication and SSL encryption, so the capture was very hard to read. We could only roughly infer the content of each message from its length and response time.

In the process we found a very important clue: the client itself declared the timeout and closed the connection, even though the server had in fact responded to the request that timed out.

We then turned on trace-level logging in the Kafka client. The Kafka client logs relatively little, to our regret, but we did find the output of log.debug("Disconnecting from node {} due to request timeout.", nodeId);.

The network-related code path:

try {
    // The request is sent here
    client.send(request, time.milliseconds());
    while (client.active()) {
        List<ClientResponse> responses = client.poll(Long.MAX_VALUE, time.milliseconds());
        for (ClientResponse response : responses) {
            if (response.requestHeader().correlationId() == request.correlationId()) {
                if (response.wasDisconnected()) {
                    throw new IOException("Connection to " + response.destination() + " was disconnected before the response was read");
                }
                if (response.versionMismatch() != null) {
                    throw response.versionMismatch();
                }
                return response;
            }
        }
    }
    throw new IOException("Client was shutdown before response was read");
} catch (DisconnectException e) {
    if (client.active())
        throw e;
    else
        throw new IOException("Client was shutdown before response was read");
}

This poll method is more than a simple poll: it also performs a timeout check. See the handleTimedOutRequests method called from within poll.

@Override
public List<ClientResponse> poll(long timeout, long now) {
    ensureActive();

    if (!abortedSends.isEmpty()) {
        // If there are aborted sends because of unsupported version exceptions or disconnects,
        // handle them immediately without waiting for Selector#poll.
        List<ClientResponse> responses = new ArrayList<>();
        handleAbortedSends(responses);
        completeResponses(responses);
        return responses;
    }

    long metadataTimeout = metadataUpdater.maybeUpdate(now);
    try {
        this.selector.poll(Utils.min(timeout, metadataTimeout, defaultRequestTimeoutMs));
    } catch (IOException e) {
        log.error("Unexpected error during I/O", e);
    }

    // process completed actions
    long updatedNow = this.time.milliseconds();
    List<ClientResponse> responses = new ArrayList<>();
    handleCompletedSends(responses, updatedNow);
    handleCompletedReceives(responses, updatedNow);
    handleDisconnections(responses, updatedNow);
    handleConnections();
    handleInitiateApiVersionRequests(updatedNow);
    // The key timeout check
    handleTimedOutRequests(responses, updatedNow);
    completeResponses(responses);

    return responses;
}
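
To make the role of this timeout check concrete, here is a minimal sketch of an age-based check on in-flight requests. It is a simplified model written for illustration, not Kafka's implementation: the class StallTimeoutModel, the InFlight record, and the 30-second timeout value are all assumptions. It shows that a check of this kind looks only at the client's own clock, so a check that runs late is enough to mark a request as expired.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of an age-based timeout check; NOT Kafka's actual code. */
final class StallTimeoutModel {

    /** Hypothetical in-flight request record: correlation id plus send time. */
    static final class InFlight {
        final int correlationId;
        final long sendTimeMs;

        InFlight(int correlationId, long sendTimeMs) {
            this.correlationId = correlationId;
            this.sendTimeMs = sendTimeMs;
        }
    }

    /** Returns the ids of requests whose age at nowMs exceeds requestTimeoutMs. */
    static List<Integer> timedOut(List<InFlight> inFlight, long nowMs, long requestTimeoutMs) {
        List<Integer> expired = new ArrayList<>();
        for (InFlight r : inFlight) {
            if (nowMs - r.sendTimeMs > requestTimeoutMs) {
                expired.add(r.correlationId);
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        List<InFlight> inFlight = new ArrayList<>();
        inFlight.add(new InFlight(1, 0L));
        long requestTimeoutMs = 30_000L; // assumed value for the example

        // Healthy client: the check runs 50 ms after the send.
        System.out.println("healthy: " + timedOut(inFlight, 50L, requestTimeoutMs));
        // Stalled client: the thread was frozen, so the check runs 35 s after the send.
        System.out.println("stalled: " + timedOut(inFlight, 35_000L, requestTimeoutMs));
    }
}
```

In this model the first call prints an empty list and the second prints [1]: nothing about the server changed between the two calls, only the moment at which the client got around to checking.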

From this we inferred that the client was hanging for a period of time, which led to the timeout and disconnection. We used Arthas to trace the relevant Kafka code in depth and found that even trivial operations (such as A.field) sometimes took several seconds. This reinforced our suspicion that the problem lay in the JVM: at some points the JVM stalled the whole process, but not because of GC.

[Figure: cke_115.png]
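
One way to look for stalls of this kind without relying on GC logs is a small watchdog that sleeps in short steps and records how late it wakes up. The sketch below is illustrative and was not part of the original investigation. PauseDetector and its parameters are hypothetical names. A large overshoot suggests that something paused the thread, but it does not say what, and an overloaded host can also produce one.

```java
import java.util.concurrent.TimeUnit;

/** Minimal stall detector: sleeps in short steps and records how late each wake-up is. */
final class PauseDetector {

    /**
     * Sleeps sleepMs milliseconds up to `iterations` times and returns the largest
     * overshoot (elapsed time minus requested sleep) in milliseconds, never negative.
     * Stops early if the thread is interrupted.
     */
    static long maxOvershootMillis(int iterations, long sleepMs) {
        long maxOvershootNanos = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;
            }
            long elapsed = System.nanoTime() - start;
            long overshoot = elapsed - TimeUnit.MILLISECONDS.toNanos(sleepMs);
            maxOvershootNanos = Math.max(maxOvershootNanos, overshoot);
        }
        return TimeUnit.NANOSECONDS.toMillis(maxOvershootNanos);
    }

    public static void main(String[] args) {
        // Roughly one second of sampling with 10 ms steps.
        long worst = maxOvershootMillis(100, 10L);
        System.out.println("worst wake-up delay: " + worst + " ms");
    }
}
```

Run in a background thread of the affected process, a detector like this would be expected to report delays of seconds at the moments the Kafka client times out, if the whole JVM is indeed being paused.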

To track this down, we went back to the high CPU usage of the monitoring thread. We found that the thread's execution hotspot was the getThreadInfo method in sun.management.ThreadImpl.

"metrics-1@746" prio=5 tid=0xf nid=NA runnable
   java.lang.Thread.State: RUNNABLE
      at sun.management.ThreadImpl.getThreadInfo(Native Method)
      at sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:185)
      at sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:149)

We further found that in some versions of JDK 8, reading thread information requires taking a lock.

At this point the root cause was clear: the excessive number of threads, combined with a JVM-wide lock held while thread information is collected, caused the problem. The following demo reproduces it:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ThreadLockSimple {
    public static void main(String[] args) {
        // Start 15,000 idle threads to imitate the oversized thread count.
        for (int i = 0; i < 15_000; i++) {
            new Thread(new Runnable() {
                @Override
                public void run() {
                    try {
                        Thread.sleep(200_000);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
            }).start();
        }

        // Heartbeat: prints a timestamp every second. Gaps between timestamps reveal stalls.
        ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
        executorService.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                System.out.println("take " + " " + System.currentTimeMillis());
            }
        }, 1, 1, TimeUnit.SECONDS);

        // Metrics thread: reads ThreadInfo for every thread once per second and prints the cost.
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        ScheduledExecutorService metricsService = Executors.newSingleThreadScheduledExecutor();
        metricsService.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                long start = System.currentTimeMillis();
                ThreadInfo[] threadInfoList = threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds());
                System.out.println("threads count " + threadInfoList.length + " cost :" + (System.currentTimeMillis() - start));
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}

There are several possible solutions:

  • Reduce the unreasonably high number of threads. A count this high may indicate a thread leak.
  • Upgrade the JDK to JDK 11 or JDK 17 (recommended).
  • Temporarily disable thread-related monitoring.

Choose the solution that fits your actual situation. I hope this is helpful to you.
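
As a rough aid for the first and third options, the sketch below shows two helpers. Both are suggestions that were not part of the original investigation, and ThreadDiagnostics and its method names are hypothetical. The first reads only the aggregate counters from ThreadMXBean instead of building a ThreadInfo for every thread, which should be much cheaper, though you should measure it in your own environment. The second counts live threads by name prefix, a one-off check that can hint at where a thread leak comes from.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative helpers (hypothetical names) for inspecting thread usage. */
final class ThreadDiagnostics {

    /** Aggregate counters only: live, daemon, peak, total started. No per-thread ThreadInfo. */
    static long[] counters() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return new long[] {
            bean.getThreadCount(),
            bean.getDaemonThreadCount(),
            bean.getPeakThreadCount(),
            bean.getTotalStartedThreadCount()
        };
    }

    /** Strips a trailing run of digits and dashes, e.g. "pool-1-thread-3" becomes "pool-1-thread". */
    static String prefixOf(String threadName) {
        return threadName.replaceAll("[-\\d]+$", "");
    }

    /** Counts live threads per name prefix, enumerating from the root thread group. */
    static Map<String, Integer> countByNamePrefix() {
        ThreadGroup root = Thread.currentThread().getThreadGroup();
        while (root.getParent() != null) {
            root = root.getParent();
        }
        // activeCount() is only an estimate, so leave headroom in the array.
        Thread[] threads = new Thread[root.activeCount() * 2 + 16];
        int n = root.enumerate(threads, true);
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            counts.merge(prefixOf(threads[i].getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] c = counters();
        System.out.println("live=" + c[0] + " daemon=" + c[1] + " peak=" + c[2] + " started=" + c[3]);
        for (Map.Entry<String, Integer> e : countByNamePrefix().entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A prefix that accounts for thousands of threads is a natural place to start looking for a pool that is created repeatedly or never shut down.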


Origin my.oschina.net/u/4526289/blog/10322316