RocketMQ in Practice | Troubleshooting a Message Queue Accumulation Problem

1. Background

[Figure: system links involved in this problem]

The systems involved in this problem are shown in the figure above. Their basic responsibilities are:

  • Proxy: provides a request-proxying service. It acts as a unified agent for the various requests sent to downstream systems, so that upstream systems do not need to be aware of differences in how each downstream service is used; it also provides fine-grained features such as rate limiting and traffic routing.
  • Latu SDK: provides a unified image-download capability across storage platforms. It is usually deployed together with the image algorithm models.
  • Inference engine: provides model inference services and supports production deployment of various machine learning and deep learning models.

2. Troubleshooting

Note: the message queue used in this article is known internally as MetaQ; its open-source counterpart is RocketMQ. It is referred to as "MQ" below.

2.1 Problem Description

One morning I received an alert indicating that messages were accumulating on the Proxy system's entry MQ topic. Opening the console showed that of the 500 Proxy machines, one had a serious backlog while the rest were consuming normally:

[Figure: MQ console showing the backlog concentrated on a single machine]

2.2 Root cause in one sentence

Let's get straight to the point and state the root cause up front (see the sections below for the detailed troubleshooting process):

  • Accumulation on a single machine: one consumer thread on the machine was stuck. Although the remaining threads kept consuming normally, MQ's offset mechanism means the consumption position could not advance.
  • Stuck HTTP download: the HttpClient version in use has a bug in which the timeout does not take effect under certain conditions, which can leave a thread blocked indefinitely.

2.3 Troubleshooting process

1. Is the machine consuming too slowly?

My first reaction was that this machine was consuming too slowly: the same volume of messages could be digested quickly by the other machines but kept piling up here. However, a detailed comparison in the MQ console showed that this machine's business processing time and consumption TPS were similar to those of the other machines, and all machines had the same specifications:

[Figure: MQ console comparison of processing time and consumption TPS across machines]

Next, I looked at the flame graph produced by Arthas. The two slender flames on the right are our system's business logic (they are taller because the RPC call stack is deeper). The graph shows no obvious "flat-top" pattern and the flames are relatively narrow, which means the machine has no obvious time-consuming hot spots while executing business logic. This rules out slow business processing as the cause of the backlog:

[Figure: Arthas flame graph]

2. Are the system metrics normal?

Logging in to the machine showed that CPU, memory, and load were all normal and similar to the healthy machines; no obvious clues were found:

[Figure: CPU, memory, and load metrics]

The machine also showed no obvious Full GC activity:

[Figure: GC metrics]

3. Caused by rate limiting?

Tip: Proxy applies rate limiting when proxying requests; traffic exceeding the limit blocks and waits, which protects the downstream synchronous services.

So if today's traffic were unusually high and exceeded what the downstream service could bear, the excess requests would be blocked by the RateLimiter. A large number of MQ consumer threads would then be blocked, the overall consumption speed of the Proxy system would drop, and the end result would be accumulation on the entry topic.

[Figure: rate limiting in the Proxy request path]
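To make the hypothesis concrete, here is a minimal sketch of how such a blocking limiter could sit in the consumer path, assuming Guava's RateLimiter; the class name, permit rate, and method are illustrative, not the actual Proxy code:

```java
import com.google.common.util.concurrent.RateLimiter;

/**
 * Minimal sketch of a blocking rate limit in the consumer path,
 * assuming Guava's RateLimiter. Rate and names are illustrative.
 */
public class ThrottledProxyCall {
    // Allow at most 100 downstream requests per second from this instance.
    private final RateLimiter rateLimiter = RateLimiter.create(100.0);

    public void forwardToDownstream(Object request) {
        // acquire() blocks until a permit is available, so when traffic exceeds
        // the limit, the MQ consumer threads calling this method simply wait here.
        rateLimiter.acquire();
        // ... synchronous RPC call to the downstream model service would go here ...
    }
}
```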

However, neither the machine's logs nor its monitoring showed significant blocking due to rate limiting; traffic was relatively smooth:

[Figure: rate-limiting logs and monitoring]

Moreover, if the cause really were high ingress traffic, all Proxy machines (MQ consumers) should show a similar level of accumulation; it would not all pile up on one machine while the other 499 remained normal. So excessive ingress traffic can be ruled out.

4. MQ data skew?

The 500 Proxy machines are expected to share the messages of the entry MQ topic evenly. Could the data distribution be skewed, so that this machine received far more messages than the others?

Checking the code of Proxy's upstream system: it does not apply any custom shuffle when sending messages to Proxy, so it uses MQ's default selectOneMessageQueue strategy, which sends the current message to the queue selected by index % queue_size. Looking further into the index logic: it is initialized to a random number and incremented by 1 on each access:

[Figure: selectOneMessageQueue and index increment logic]

Putting these two points together, the effect is that messages are distributed evenly across the queues from left to right, wrapping around automatically via the modulo:

[Figure: messages distributed evenly across the queues]

In summary, MQ's default shuffle strategy spreads messages evenly across all queues, so we can rule out skewed message distribution as the cause of the backlog on this Proxy machine.
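For illustration, a simplified sketch of the round-robin selection described above (not the actual RocketMQ source; a generic type stands in for the real MessageQueue class):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Simplified sketch of the default queue-selection logic described above.
 * Not the actual RocketMQ source.
 */
public class RoundRobinQueueSelector {
    // Initialized to a random value, then incremented on every send.
    private final AtomicInteger index =
            new AtomicInteger(ThreadLocalRandom.current().nextInt(1000));

    public <Q> Q selectOneMessageQueue(List<Q> queues) {
        // index % queue_size walks the queues left to right and wraps around.
        int i = Math.floorMod(index.getAndIncrement(), queues.size());
        return queues.get(i);
    }
}
```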

5. CPU steal?

Tip: CPU steal is the percentage of CPU time that a virtual machine's processes lose because the host is serving other processes or virtual machines. A high steal value usually means degraded performance for the processes inside the VM.

For example, a machine whose specification says 4 cores may effectively get only 2 cores after being "stolen". The per-request RT would not change much (individual cores perform about the same), but with fewer effective cores the machine's overall throughput drops, its consumption capacity weakens, and a backlog forms.

However, checking showed that st was normal, ruling out this possibility:

[Figure: top output showing a normal st value]

6. A clue: the MQ consumption offset has not moved!

After going around in circles without finding anything abnormal, and since the symptom was MQ accumulation, I turned to the middleware logs for clues. Sure enough, a targeted search for the stuck machine's queueId showed that the consumption offset of this queue had not advanced for a long time and was stuck at a fixed position:

[Figure: middleware log showing the queue's consumption offset unchanged]

Tip: MQ's pull mechanism first stores pulled messages in an in-memory cache with a capacity of 1,000; consumer threads then consume the messages from this cache. When the cache is full, the client stops pulling from the queue.

[Figure: MQ pull-and-cache consumption model]
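For reference, a minimal sketch of the corresponding knobs on the open-source RocketMQ push consumer; the group, topic, address, and thread counts below are illustrative placeholders, not the Proxy system's real configuration:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.exception.MQClientException;

public class ProxyConsumerSketch {
    public static void main(String[] args) throws MQClientException {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("proxy_consumer_group"); // hypothetical group
        consumer.setNamesrvAddr("127.0.0.1:9876");   // illustrative address
        consumer.subscribe("PROXY_ENTRY_TOPIC", "*"); // hypothetical topic

        // Local cache limit per queue: once 1,000 messages are buffered,
        // the client stops pulling from that queue (1,000 is also the default).
        consumer.setPullThresholdForQueue(1000);

        // Number of concurrent consumer threads working off the local cache.
        consumer.setConsumeThreadMin(20);
        consumer.setConsumeThreadMax(20);

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            // business processing would go here
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```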

This raised a suspicion: had consumption on this Proxy machine stopped, or slowed to a crawl, so that the local cache stayed full, causing MQ to stop pulling from the queue and the offset to stop advancing?

But as noted above, according to the MQ console, the system metrics, and the machine logs, the machine looked normal and no different from the others. So why was its consumption offset not moving, and why did the backlog keep getting worse?

7. The root cause: one consumer thread is stuck

Tip: for messages in the local cache, MQ starts multiple threads (the number is user-configurable) to consume them concurrently. To guarantee that no message is lost, the committed offset only records the earliest message that has not yet been fully consumed.

This mechanism is sound. Under At-Least-Once semantics, a message may be consumed more than once but must never be missed. Suppose two threads pull two messages at the same time and the later message finishes first. Since the earlier message might still fail, its offset cannot be skipped: committing the later offset directly would make a failed message unrecoverable. Hence the meaning of the consumption offset is: all messages at and before this position have been successfully consumed (somewhat like Flink's watermark mechanism).

Applying this mechanism to our problem: if any one of this Proxy machine's many consumer threads gets stuck, the consumption offset of the whole queue stays at the offset of the stuck message. Although the other threads keep consuming normally, they cannot push the offset forward. Meanwhile the upstream keeps sending messages into the queue, so messages flow in while the offset stands still, and the backlog (offset of the newest message minus the committed consumption offset) only grows, which shows up on the console as ever-worsening accumulation!
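A minimal sketch (not RocketMQ source) of this watermark-style bookkeeping: the committable offset is pinned to the earliest message still in flight, so one stuck message freezes it even while later offsets complete:

```java
import java.util.concurrent.ConcurrentSkipListSet;

/**
 * Illustrative sketch, not RocketMQ source: the committable position is the
 * smallest offset still in flight, similar to a watermark.
 */
public class OffsetTracker {
    private final ConcurrentSkipListSet<Long> inFlight = new ConcurrentSkipListSet<>();

    /** Called when a consumer thread starts processing a message. */
    public void begin(long offset) {
        inFlight.add(offset);
    }

    /** Called when a consumer thread finishes a message successfully. */
    public void complete(long offset) {
        inFlight.remove(offset);
    }

    /**
     * Offset that is safe to commit: everything before the earliest in-flight
     * message. A single stuck message keeps this value frozen.
     */
    public long committableOffset(long nextOffsetToPull) {
        return inFlight.isEmpty() ? nextOffsetToPull : inFlight.first();
    }
}
```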

Based on this analysis, I checked the status of all MQ consumer threads with jstack, and indeed found that thread No. 251 was permanently in the RUNNABLE state!

[Figure: jstack output showing thread 251 in the RUNNABLE state]

This was a strong suspect for the stuck consumer thread. In the Proxy system's business scenario, most of the time is spent in synchronous RPC calls to the deep learning model (a few hundred milliseconds at best, a few seconds at worst), so a consumer thread should spend most of its time waiting for the synchronous call to return. Yet thread 251 was always RUNNABLE, so something had to be wrong.
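As a complement to jstack, thread states can also be inspected programmatically via the JMX ThreadMXBean. The sketch below is illustrative, not from the original project, and assumes RocketMQ's default "ConsumeMessageThread_" name prefix for push-consumer worker threads:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/**
 * Illustrative helper that lists consumer threads and their states,
 * similar to what jstack shows. Adjust the name prefix if your setup differs.
 */
public class ConsumerThreadInspector {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadName().startsWith("ConsumeMessageThread_")) {
                System.out.printf("%s -> %s%n", info.getThreadName(), info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
    }
}
```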

Printing the thread's details further located the exact piece of code it was stuck in:

[Figure: stack trace showing the thread stuck in getImageDetail]

Here, the getImageDetail method internally downloads images over HTTP so the deep learning model can run inference on them. Searching the business logs confirmed the picture: no log entries from this thread could be found. It had been stuck since 10 o'clock the previous night and had produced no new logs since then; because the machine's logs are rolled over and cleaned up, none of the remaining logs contained any output from this thread:

[Figure: business log search showing no output from the stuck thread]

At this point the root cause of the severe backlog on individual Proxy machines was found: one consumer thread on the machine had been stuck downloading an image over HTTP, which prevented the consumption offset of the entire queue from advancing, so the backlog kept growing.

8. Why does the HTTP download get stuck forever?

Although the root cause of the accumulation had been found, and restarting the application or taking the machine offline could fix it in the short term, a hidden danger remained: the same problem had been recurring every few days. The root cause of the HTTP hang had to be investigated thoroughly to solve the problem once and for all.

After staking out the problem a few more times, I collected some image URLs that caused threads to get stuck. None of them were internal image addresses, none of them could be opened, and none ended with a .jpg suffix:

https://ju1.vmhealthy.cn
https://978.vmhealthy.cn
https://xiong.bhjgkjhhb.shop

But here is the problem: even for such extreme, unreachable URLs, we had set a 5-second timeout on HttpClient, so a request should be stuck for at most 5 seconds. Why did the timeout seem not to take effect, leaving the thread stuck for more than ten hours?

Tip: HTTP must establish a connection before transferring data, which gives rise to two timeouts: 1. the connection timeout, covering connection establishment; 2. the socket timeout, covering data transfer.

[Figure: connection timeout vs. socket timeout]

Inspection showed that the current code set only the socket timeout and not the connection timeout, so I suspected that these requests were stuck in the connect stage, hanging forever because no connect timeout was set. However, after fixing the code and requesting those addresses again, the thread still got stuck, so further investigation was needed.
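For reference, a minimal Apache HttpClient 4.x sketch with both timeouts (plus the connection-pool wait timeout) set explicitly; the values and URL are illustrative:

```java
import java.io.IOException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ImageDownloadSketch {
    public static void main(String[] args) throws IOException {
        // Connect timeout covers connection establishment; socket timeout covers
        // waiting for data once the connection is up.
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5_000)            // connection timeout
                .setSocketTimeout(5_000)             // socket (read) timeout
                .setConnectionRequestTimeout(1_000)  // waiting for a pooled connection
                .build();

        try (CloseableHttpClient client = HttpClients.custom()
                     .setDefaultRequestConfig(config)
                     .build();
             CloseableHttpResponse response = client.execute(new HttpGet("https://example.com/a.jpg"))) {
            System.out.println(response.getStatusLine());
        }
    }
}
```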

9. The root cause of the HTTP hang

Tip: root cause found. Certain versions of HttpClient have a bug in requests over SSL connections: the client first tries to establish the connection and only afterwards applies the timeout, i.e. the two steps are in the wrong order. So if the request hangs during the connection phase, the timeout has not yet been set and the HTTP request stays stuck indefinitely...

[Figure: HttpClient code where the timeout is applied only after the SSL connection is established]
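The following is a deliberately simplified illustration of the ordering problem, not the actual HttpClient source: if the read timeout is applied only after the connect/handshake step, a handshake that hangs is never bounded by it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Simplified illustration of the ordering issue; not the actual HttpClient code. */
public class TimeoutOrderingSketch {

    // Buggy ordering: the read timeout is applied only after the (possibly
    // hanging) connect/handshake, so a stalled handshake never times out.
    static void buggyOrder(Socket socket, String host, int port, int soTimeoutMs) throws IOException {
        socket.connect(new InetSocketAddress(host, port), 5_000);
        // ... TLS handshake would happen here and can block indefinitely ...
        socket.setSoTimeout(soTimeoutMs); // too late if the handshake hangs
    }

    // Fixed ordering: set the read timeout before any blocking read can happen.
    static void fixedOrder(Socket socket, String host, int port, int soTimeoutMs) throws IOException {
        socket.setSoTimeout(soTimeoutMs); // now also bounds the handshake reads
        socket.connect(new InetSocketAddress(host, port), 5_000);
        // ... TLS handshake here is bounded by the socket timeout ...
    }
}
```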

Looking back at the stuck URLs above: every one of them starts with https! Case closed: the HttpClient version used by the project had this bug. After upgrading HttpClient and replaying the constructed test requests, the threads no longer got stuck. The fix was rolled out to all online machines, and the recurring problem of backlogs on a small number of machines has not appeared since.

After many twists and turns, the problem was finally solved for good.

2.4 Overall review

Starting from the outermost symptom, accumulation on individual Proxy machines, the investigation worked inward through several key points until the root cause was found. With everything now clear, the complete causal chain from the inside out looks like this:

–> The Proxy system downloads images with HttpClient and then calls the image models for inference

–> The HttpClient version in use has a bug: when accessing https addresses, the timeout does not take effect

–> The Proxy system happened to encounter a few https addresses that happened to hang (a low-probability event), and because the timeout did not take effect, those requests stayed stuck indefinitely

–> Under MQ's At-Least-Once offset mechanism, the consumption offset stays at the offset of the stuck message (even though the other threads keep consuming normally)

–> The upstream system keeps sending messages to Proxy, so messages keep flowing in while the offset does not move, and the backlog keeps growing

–> Once the backlog exceeds a threshold, a monitoring alert fires and the user notices it

3. Summary

The troubleshooting process was somewhat tortuous, but it yields several general lessons and methods, summarized in this section:

  • **Make good use of troubleshooting tools:** learn to use jstack, Arthas, JProfiler, and similar tools, and pick the right one for each scenario. They make it far easier to locate anomalies and find clues, and thus to work step by step toward the root cause.
  • **Stay sensitive to anomalies:** sometimes the clues are in front of us early on, but for various reasons (many open questions at the start, interference factors not yet ruled out, etc.) they go unnoticed. Spotting them takes experience.
  • **Search widely for information:** besides searching internal documentation and consulting colleagues, learn to look up first-hand information on external English-language sites.
  • **Persevere when it gets hard:** for some hidden problems, the cause cannot be found and fixed the first time they appear; it may take several rounds of investigation to reach the root cause.
  • **Fill in the underlying fundamentals:** had the MQ offset mechanism been clear from the start, several detours could have been avoided. When using middleware, take the time to learn how it works underneath.
  • **There are no "metaphysical" problems:** code and machines do not lie. Behind every seemingly inexplicable problem lies a rigorous, reasonable cause.

References:

  • HttpClient bug: https://issues.apache.org/jira/browse/HTTPCLIENT-1478
  • Connection timeout vs. socket timeout: https://stackoverflow.com/questions/7360520/connectiontimeout-versus-sockettimeout
