ActiveMQ usage problem record

A record of problems encountered while running ActiveMQ and how they were resolved.
Shortly before New Year's Day 2019, an ActiveMQ service went live, version 5.13.2, using file-based persistence to implement master/slave failover.
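The broker configuration itself is not quoted in the post; as a hedged sketch, file-based master/slave in ActiveMQ 5.x is usually the shared-file-system variant, where both brokers point their KahaDB store at the same shared directory and the file lock decides which one is master. The directory path below is a placeholder:

```xml
<!-- activemq.xml: both brokers use the same shared store directory;         -->
<!-- whichever broker acquires the KahaDB lock file becomes master, the      -->
<!-- other blocks as slave until the lock is released. Path is illustrative. -->
<persistenceAdapter>
    <kahaDB directory="/shared/activemq/kahadb"/>
</persistenceAdapter>
```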
On the evening of 2019-01-17, system A (the producer) started sending queue messages to system B (the consumer) through MQ without notifying the broker side first, roughly 200,000 messages in total. Around 9 am the next morning MQ was found to have stopped responding; the service recovered after a restart, but system B's processing speed was noticeably slower than before.
On the evening of 2019-01-19, system A reported that it could not connect to MQ when pushing messages. Around 7 am on 01-20 the maximum number of connections (maximumConnections in activemq.xml) was raised from 3000 to 5000, and after a restart system A reported it could push data normally again. At noon that day the MQ service went down; the back-end log gave the cause as "too many open files" - at some point the process had opened more files and network connections than the system limit allowed. The problem was solved by adding "* - nofile 60000" to /etc/security/limits.conf. (Subsequent monitoring showed the MQ process's handle count peaking at 2900.)
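The connector definition is not shown in the post; a minimal sketch of where maximumConnections lives in activemq.xml, based on the stock OpenWire connector (the port and frame-size option are the defaults, not values from the post):

```xml
<!-- activemq.xml: per-connector cap on concurrent client connections,     -->
<!-- raised here from 3000 to 5000. Every TCP connection also consumes a   -->
<!-- file descriptor, so the OS nofile limit (limits.conf) has to keep up. -->
<transportConnectors>
    <transportConnector name="openwire"
        uri="tcp://0.0.0.0:61616?maximumConnections=5000&amp;wireFormat.maxFrameSize=104857600"/>
</transportConnectors>
```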
On the evening of 2019-01-21, system A began pushing new data again, and in the early hours of 01-22 the MQ service went down with OutOfMemoryError in the back-end log. At the time both the initial and maximum JVM heap of the MQ process were 2000M, of which ActiveMQTextMessage instances occupied 938M and connections about 503M. Both came mainly from the message backlog that built up because the consumer could not keep up with the producer's push rate.
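The post does not show where the 2000M heap was configured; in a stock ActiveMQ 5.x installation the broker heap is normally set through bin/env. A hedged sketch of the adjustment made the next morning (variable name from the stock script, verify against the local installation):

```sh
# bin/env: JVM heap for the broker process.
# Reflects the 01-22 change from 2000M to 4096M described below.
ACTIVEMQ_OPTS_MEMORY="-Xms4096m -Xmx4096m"
ACTIVEMQ_OPTS="$ACTIVEMQ_OPTS_MEMORY $ACTIVEMQ_OPTS"
```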

On 01-22, after the initial and maximum JVM memory of the MQ service were raised to 4096M at 10 am, the consumer was found to be processing messages very slowly. top on the consumer servers showed one 64-core server under heavy load, with basically every core at 100%, while the other three servers were almost idle. Watching the back-end logs, the high-load server was clearly producing output much faster than the low-load ones, so the suspicion was that the low-load servers were not receiving MQ messages to process. The MQ back-end log showed the underlying transport connections of a large number of slow consumers being closed; when several consumers on a client share one connection, closing it takes all of them down. After the slowConsumerStrategy setting in activemq.xml was changed to false and the broker restarted, processing speed recovered, but this only dealt with part of the slow-consumer problem; the real cause of the slowdown had not been found. Slow-consumer reference link: https://blog.csdn.net/asdfsadfasdfsa/article/details/53501723.
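The policy itself is not quoted in the post; "changed to false" most likely refers to the abortConnection flag of abortSlowConsumerStrategy, which is what makes the broker drop the whole shared connection rather than just the slow consumer. A hedged sketch of the relevant policyEntry (destination pattern and values are assumptions):

```xml
<!-- activemq.xml: with abortConnection="true" the broker aborts the slow    -->
<!-- consumer's entire transport connection, taking down every consumer that -->
<!-- shares it; "false" aborts only the slow consumer itself. maxSlowDuration -->
<!-- is how long a consumer may stay slow before action is taken (default    -->
<!-- 30s, later tuned to 60s and then 90s, see the entries below).           -->
<policyEntry queue=">">
    <slowConsumerStrategy>
        <abortSlowConsumerStrategy abortConnection="false" maxSlowDuration="30000"/>
    </slowConsumerStrategy>
</policyEntry>
```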
On the morning of 01-23, the processing speed of consumer system B was found to have slowed down again. Nothing abnormal showed up in the MQ thread count, handle count, GC, top, etc. System B again had an unbalanced load across its servers: one server was running at high load with basically every core at 100%, while the load on the other three would not come up at all. Following the suggestions of more experienced colleagues and related material online, the three low-load system B servers were stopped and the number of parallel consumers on the high-load server was reduced to 10; after a restart, processing speed returned to normal. Since MQ pushes data to consumer system B with the default prefetchSize = 1000, it can happen that the high-load server keeps getting messages to process while the low-load servers get none, so system B's prefetch was adjusted to 4 (default 1000) and the number of parallel consumers per server to 60. After a restart, the load on the previously idle servers came up and processing speed returned to normal.
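A minimal Java sketch of the client-side prefetch change, assuming the plain ActiveMQ JMS API; the broker URL, queue name and the way system B actually starts its 60 parallel consumers are not in the post and are placeholders here:

```java
import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.ActiveMQPrefetchPolicy;

public class PrefetchTuningSketch {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://mq-host:61616"); // placeholder URL

        // Lower the queue prefetch from the default 1000 so a single fast
        // server cannot buffer the whole backlog while the others sit idle.
        ActiveMQPrefetchPolicy prefetchPolicy = new ActiveMQPrefetchPolicy();
        prefetchPolicy.setQueuePrefetch(4);
        factory.setPrefetchPolicy(prefetchPolicy);

        Connection connection = factory.createConnection();
        connection.start();

        // One session + consumer per worker; system B ran 60 of these per server.
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("B.SYSTEM.QUEUE"); // placeholder name
        MessageConsumer consumer = session.createConsumer(queue);
        consumer.setMessageListener(message -> {
            // business processing of the message goes here
        });
    }
}
```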
At about 11 am on 2019-01-25, system B's prefetch was adjusted to 30, and the time limit MQ uses to judge slow consumers was raised from the default 30 seconds to 60 seconds. The number of consumers on the queue still fluctuates, but it is more stable than before, and the speed is also faster, with roughly 10,000 messages processed in 24 minutes.
At 14:00 on 2019-01-25, the slow-consumer time limit maxSlowDuration was adjusted to 90 seconds.
On the morning of 2019-04-02, system A reported that it could not connect to MQ. The MQ log showed: could not accept connection from tcp://IP:PORT: java.lang.IllegalStateException: Timer already cancelled, and checking the connection status showed that many of the TCP connections were close_connection. Further back in the log an exception had been thrown first: java.lang.OutOfMemoryError: unable to create new native thread. Since the number of threads could not be reduced quickly (it is tied to the number of parallel consumers), the JVM memory parameters Xmx and Xms were temporarily set to 4000m (previously 6144m), which solved the problem. (See "Understanding the Java Virtual Machine: JVM Advanced Features and Best Practices"; reference link: https://www.cnblogs.com/zhangshiwen/p/5793908.html.) The maximum number of threads the system allows was adjusted afterwards.
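The rule of thumb behind this fix: the memory available for native thread stacks is roughly the process address space minus the Java heap and other JVM areas, divided by the per-thread stack size, so shrinking Xmx leaves room for more threads. The OS limits raised afterwards are not quoted in the post; a hedged sketch of the usual places to check and adjust them on Linux:

```sh
# Per-user limit on processes/threads; values are illustrative, not from the post.
#   /etc/security/limits.conf
#     *  soft  nproc  65535
#     *  hard  nproc  65535

ulimit -u                            # current per-user process/thread limit
cat /proc/sys/kernel/threads-max     # system-wide thread limit
cat /proc/sys/kernel/pid_max         # max PIDs (each thread consumes one)
```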
On 2019-05-05, after MQ had been running for five days (restarted on 04-30), jstat -gcutil PID showed that only a small part of the old generation was released after each full GC, and because so little old-generation space was being freed, full GCs were gradually becoming more frequent. Checking the number of MQ connections at that point (netstat -n | grep PID | wc -l) gave 9002, while MQ was configured for a maximum of 9000 connections, and some connections had hung. System B accounted for more than 90% of the connections; after discussion it turned out that system B was not using connection pooling. The problem was solved once system B switched to a connection pool.
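What system B changed is not shown; a minimal Java sketch of the pooling approach using PooledConnectionFactory from activemq-pool (pool sizes and broker URL are assumptions):

```java
import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;

public class ConnectionPoolSketch {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory rawFactory =
                new ActiveMQConnectionFactory("tcp://mq-host:61616"); // placeholder URL

        // Reuse a small, bounded set of broker connections instead of opening
        // a fresh TCP connection per operation, which is what pushed the
        // broker towards its 9000-connection cap.
        PooledConnectionFactory pooledFactory = new PooledConnectionFactory(rawFactory);
        pooledFactory.setMaxConnections(8);                       // illustrative sizing
        pooledFactory.setMaximumActiveSessionPerConnection(100);  // illustrative sizing

        // Hand pooledFactory to the JMS code (or framework) in place of the raw
        // factory; call pooledFactory.stop() on shutdown to close pooled connections.
    }
}
```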
