Troubleshooting another production outage (business system)

Preface: this downtime issue was far harder to troubleshoot than the one in the previous article (the mental journey of solving an online OOM). The memory leak in that article left traces to follow; this time the outage left no sign at all in the business logs. There are also many points in this article worth digging into; I hope readers find the parts that interest them, dig deeper, and leave a comment so we can all improve together.

Symptom: Zabbix raised an alert that the production application was down. Logging in to the production host through the bastion machine and checking the processes in the application container, I found that the business application had no corresponding process. My first instinct was that the process had been killed by the system under some condition, but checking the container logs turned up nothing abnormal.

Questions:

        1. Why did the process disappear without any abnormal log output? If the system really did kill the application process, what condition triggered it?

        2. Why has this never happened in the test and UAT environments, and what is different about them?

Investigation:

        First, check the production host's system resources and the corresponding container's configuration; the production machine has 8 GB of memory in total:

          

         Then look at the environment settings in the application's Tomcat startup script:

#!/bin/sh

# chkconfig: 345 88 14
# description: Tomcat Daemon
set -m
#Change to your Jdk directory
SCRIPT_HOME=/home/appadmin/scripts/glink

export JAVA_HOME=/Data/software/jdk1.8.0_65

#Change to your Tomcat directory
CATALINA_HOME=/Data/software/tomcat-glink
export JAVA_OPTS="-server -Xms512m -Xmx4G 
-XX:PermSize=128M 
-XX:MaxNewSize=512m 
-XX:MaxPermSize=256m 
-Djava.awt.headless=true 
-XX:+PrintGCDetails 
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCTimeStamps 
-Xloggc:$CATALINA_HOME/logs/gc.log
-Djava.awt.headless=true
-Denv=prod
-Dproject=GLINK
-Dprofile=Production_us
-Dallow.push=true
-Dsave.directory=/Data/diamond/glink/config
-Dcat.app.name=GLINK_PROD
-Duser.timezone=GMT+08"

# Determine and execute action based on command line parameter
#-Ddiamond.conf.path=/Data/diamond/glink/diamond.properties
echo "Starting Best Glink Tomcat..."
DATE=`date "+%Y-%m-%d %H:%M:%S"`
echo "$DATE : user $USER starts tomcat." >> $SCRIPT_HOME/start.log
sh $CATALINA_HOME/bin/startup.sh

The script shows that the JVM's maximum heap (-Xmx) is set to 4 GB. We then checked the startup settings of the other application deployed on this machine: the result is the same, its maximum heap is also 4 GB. Continuing through the remaining Java processes on the machine, we found a local message queue (ActiveMQ) deployed there as well, with its maximum memory set to 1 GB. At this point a guess takes shape: the machine simply does not have enough total memory. When the application needed more memory, the heap settings meant the JVM did not trigger a GC first but simply requested more memory from the OS; the OS found that total memory was exhausted and killed the application process outright.
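A rough worst-case budget makes the problem obvious (assuming each JVM actually grows to its configured maximum): 4 GB + 4 GB + 1 GB = 9 GB of heap alone, before counting permgen/metaspace, thread stacks, code cache, off-heap buffers and the operating system itself, on a machine with roughly 8 GB (7967 MB) of physical memory.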

To verify this theory, we looked at the relevant system logs:

less /var/log/messages

There it is, plain as day: Feb 23 00:01:09 us-app01 kernel: Out of memory: Kill process 14157 (java) score 619 or sacrifice child. The system did kill the application process. According to the material we found, the kernel accounts each new allocation against low memory, and when low memory is exhausted and a new allocation is requested, the kernel triggers the OOM killer. Why did the kernel pick this application process to kill? My guess is, first, that it is a user process, so terminating it does not affect the normal operation of the system itself; and second, that because the application's maximum available memory was set unreasonably high, it was the process asking for new memory to store objects when the shortage occurred. Since the process had already been killed, we could not see its actual memory usage and allocation at the time, so as a reference we looked at the memory of another machine in the same environment:

On that machine the total system memory is 7967 MB and the total low memory is also 7967 MB (on a 64-bit system, total memory and low memory are the same), and 7832 MB of the low memory is already consumed (low memory consumed = used + buff/cache).

Everything points to the same conclusion: because this application and the other application deployed on the same machine both had unreasonably large maximum memory settings (8 GB of total system memory, a 4 GB maximum heap for each of the two applications, plus a local message queue on top of that), the system's low memory was exhausted during a memory request and the kernel killed the application process.

Temporary solution:

       Reduce the maximum heap (-Xmx) configured in the application startup scripts and restart the applications.

Digging deeper:

       Some time after the restart (a few days later), we checked the memory used by the application:

     

     We knew the application's business volume was small, nowhere near enough data to justify that much memory, so we decided to do some feasible tuning of the application from the code side.

First, take a heap dump with jmap (for example: jmap -dump:format=b,file=heap.hprof <pid>), import it into JProfiler, and analyze it:

The figure above shows char[] occupying 651 MB of memory while String occupies only about 10 MB, so clearly something is wrong with how character data is being produced or handled. Let's look further at where char[] is used, which other large char[] instances exist, and what data they hold:

Here we see nearly 20 char[] instances larger than 10 MB, the largest already at 30 MB. Let's check what data they store:

Inside we can see some business data records, and also some task information from the company's internal task center (screenshots not included here). That gave us a vague suspicion that the task platform was doing something, such as collecting logs. Let's keep following the references:

        Seeing the rabbit (RabbitMQ) classes here, we could basically conclude that the reference belongs to the task center's client, since RabbitMQ is not used anywhere else in the project. We asked the task center's maintainers, and their answer was that they do not collect application logs (which was our own expectation too: this is not the log platform's client, and the task center team would not do something that foolish). Yet what we saw in the char[] really was task platform information, so what was going on? We read the task platform client's source code and its public API looking for clues, and finally locked onto one issue: the client uses a thread pool whose default maximum is 100 threads (which is a bit of a trap). Briefly, the task center is implemented on top of a message queue: the server sends task messages and the client consumes them when they arrive.
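To make the shape of that client concrete, here is a minimal sketch under stated assumptions: the class and method names (TaskCenterClientSketch, onTaskMessage) are hypothetical and only illustrate a consumer whose work is dispatched onto a pool that defaults to 100 long-lived threads.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of a task-center style consumer: every incoming task
// message is handed to a worker pool whose default maximum is 100 threads.
public class TaskCenterClientSketch {

    private static final int DEFAULT_MAX_THREADS = 100; // the questionable default

    private final ExecutorService workers;

    public TaskCenterClientSketch() {
        this(DEFAULT_MAX_THREADS);
    }

    public TaskCenterClientSketch(int maxThreads) {
        // a fixed pool: worker threads are created up to maxThreads and are never
        // destroyed, so anything cached per thread lives as long as the pool
        this.workers = Executors.newFixedThreadPool(maxThreads);
    }

    // called whenever a task message arrives from the queue
    public void onTaskMessage(Runnable taskHandler) {
        workers.submit(taskHandler);
    }
}

The important property for what follows is that these worker threads are long-lived: whatever a logging framework caches per thread stays reachable for the life of the pool.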

        Looking again at the thread in the screenshot, it is indeed one of the threads in the task center client's pool, so we searched further for places where char[] holding business data could be produced. After going through the code globally, we found that the application's task message handlers never use char[] directly. A colleague in the group then asked whether log4j's handling of log output might be the cause. Given what we were seeing, that made sense, so we went through every task handler looking for log statements that print large objects, and found several (orders, inventory, and so on):

private List<AmazonOrderInfo> getAmazonOrderNotInLocal(CustomerSalesChannelVO channel) {
        List<AmazonOrderInfo> amazonOrderInfoList = new ArrayList<>();
        try {
            Date nowTime = new Date();
            Date toTime = new Date(nowTime.getTime() - 121000);
            Date fromTime = new Date(nowTime.getTime() - 720000);
            amazonOrderInfoList = amazonOrderService.getAmazonOrderInfoNotInLocal(channel, fromTime, toTime);
            // problematic: concatenates and logs the entire order list
            BizLogger.info("get amazon new order when amazon inventory feedback, result : " + amazonOrderInfoList);
        } catch (Exception e) {
            BizLogger.syserror("getAmazonOrderNotInLocal error,", e);
        }
        return amazonOrderInfoList;
    }

    // Get the inventory temporarily held by platform orders that failed to be created locally for Amazon
    private List<AmazonOrderItemVO> getAmazonErrorOrderInLocal(String customerCode) {
        AmazonOrderSO so = new AmazonOrderSO();
        so.setCustomerCode(customerCode);
        so.setStatus(AmazonOrderStatus.ERROR);
        List<AmazonOrderItemVO> amazonOrderItemVOList = amazonOrderService.getAmazonOrderItemVOList(so);
        // problematic: logs the entire item list (note that the service is also queried a second time below)
        BizLogger.info("get amazon error order item in local when amazon inventory feedback, result : " + amazonOrderItemVOList);
        return amazonOrderService.getAmazonOrderItemVOList(so);
    }

    // a third example, from another handler:
    if (CollectionUtils.isNotEmpty(inventoryAgeVOList)) {
        // problematic: logs the entire inventory-age list
        BizLogger.info("send inventoryAge message." + inventoryAgeVOList);
        sendInventoryAgeSyncMessage(asnVO.getCustomerCode(), inventoryAgeVOList);
    }

We looked into this and learned that log4j keeps a per-thread buffer that grows to the size of the largest record that thread has ever printed, and at that point the whole picture made sense.
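To illustrate the mechanism, here is a deliberately simplified sketch, not log4j's actual implementation: a thread-local StringBuilder is reused for formatting, grows to fit the largest message ever formatted on that thread, and is never shrunk.

// Simplified sketch of a per-thread formatting buffer; NOT log4j's real code.
// The buffer is reused across log calls and its capacity only ever grows.
public final class PerThreadLogBuffer {

    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    public static String format(String message) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0);      // clears the content but keeps the grown capacity
        sb.append(message);   // a 30 MB message grows the buffer to ~30 MB for good
        return sb.toString();
    }

    private PerThreadLogBuffer() {
    }
}

Combined with a pool of long-lived worker threads, every thread ends up permanently holding a buffer as large as the biggest object it ever logged.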

       Since the task center client starts a thread pool with a default maximum of 100 threads to consume the task messages sent by the server, and the business application has many inventory- and order-related tasks configured whose handlers log large objects, and since the pool's threads are not destroyed after each execution, a large number of these per-thread cache areas accumulate.
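A rough cross-check against the heap dump, using only the numbers observed above: roughly 20 buffers in the 10-30 MB range already account for about 20 × 10 MB to 20 × 30 MB, i.e. 200-600 MB, which is consistent with the 651 MB of char[] in the dump; in the worst case, 100 pool threads each retaining a 30 MB buffer would pin about 3 GB of heap.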

       This also answers our second original question: why did this never happen in the test and UAT environments, and what is different? There are four differences:

  • The system environment differs in memory size and JVM version.
  • The volume of order and inventory data differs.
  • Few jobs are configured in the test and UAT environments, and most are stopped once testing is done.
  • Applications are restarted much more frequently in the test environment.

Final fix:

(1) Based on the application's actual load, reduce the maximum number of threads in the task center client's message-consuming thread pool to 10.

(2) Remove the log statements that print large objects.
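A minimal sketch of what the two changes look like, under stated assumptions: the consumer-pool field stands in for the real task center client configuration, and System.out stands in for BizLogger, since neither real API is shown in full here.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the two fixes; hypothetical stand-ins, not the production code.
public class FinalFixSketch {

    // (1) size the consumer pool for this application's actual load, not the 100-thread default
    private final ExecutorService taskConsumers = Executors.newFixedThreadPool(10);

    // (2) log a compact summary instead of concatenating the whole collection
    static void logOrders(List<?> amazonOrderInfoList) {
        System.out.println("get amazon new order when amazon inventory feedback, size : "
                + amazonOrderInfoList.size());
    }
}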

At the same time, we gave feedback to the task center team about the unreasonable default maximum thread count in their client's thread pool, so that other project teams do not run into the same thing.

Summary:

1. When low memory is exhausted, the system kills user processes whose termination will not affect its own stable operation.

2. The maximum number of threads in an application's thread pools must be set sensibly for the machine environment and the application itself. Pay particular attention to the thread count settings of third-party, message-based toolkits, to prevent the objects referenced by their threads from occupying an excessive amount of cache memory.

3. It was made clear at the team meeting that logging large objects in task handlers is not allowed.
