Production environment: diagnosing and resolving Linux Tomcat downtime

For small and medium-sized companies that use Tomcat as their Java container, the Tomcat service can easily go down at runtime if the system has not been tuned, and the Tomcat logs usually contain no useful information. The actual tuning in this case was done by the company's architect, who has a very deep understanding of JVM tuning, while the author's understanding is relatively shallow, so this article will not explain the tuning principles in depth; it only records the process of analysis and tuning, in the hope of giving some ideas to operations colleagues who run into Tomcat downtime.

1. Preliminary analysis of tomcat downtime

    The Tomcat service in production went down every few days with no obvious pattern. The only pattern was that when it did go down, it happened within 10 to 60 minutes of Tomcat being restarted for a version upgrade, though not after every restart. Once Tomcat had been running for more than a day it would not go down, and after restarting it again it stayed up. This preliminarily rules out the newly deployed code as the main cause (in fact the code did play a part, which will be discussed later).

    View tomcat's catalina.out log:

2015-1-5 13:35:41 org.apache.coyote.http11.Http11NioProtocol pause
INFO: Pausing Coyote HTTP/1.1 on http-8890
2015-1-5 13:35:42 org.apache.catalina.core.StandardService stop
INFO: Stopping service Catalina

2015-1-5 13:35:42 org.apache.coyote.http11.Http11NioProtocol destroy
INFO: Stopping Coyote HTTP/1.1 on http-8890
Exception in thread "Timer-1" java.lang.NullPointerException
         at com.qhfax.invest.balanceAccount.common.util.TaskBalanceAccount.run(TaskBalanceAccount.java:75)
         at java.util.TimerThread.mainLoop(Timer.java:512)
         at java.util.TimerThread.run(Timer.java:462)

    The log shows that Tomcat first paused the HTTP connector on port 8890 and then stopped the Catalina service.

    Clearly, this log tells us nothing useful about why Tomcat went down, so what should we do next?

2. Tomcat memory analysis and tuning - enable GC logging

    Since the log does not show why Tomcat went down, we start by analyzing Tomcat's memory. Enabling the GC log lets us record what the JVM does each time it cleans up memory. The specific steps are as follows:

    1. JVM startup parameters in {tomcat_home}/bin/catalina.sh. First, here is the production Tomcat configuration before tuning (the core of Tomcat tuning is the JVM parameter settings):

Original configuration:
Edit {tomcat_home}/bin/catalina.sh and add the following parameters right after the # comments at the top of the file:
export JRE_HOME=/usr/java/jdk1.6.0_38
export CATALINA_HOME=/home/resin/tomcat
JAVA_OPTS="-Xms1024m -Xmx1024m -XX:PermSize=512m -XX:MaxPermSize=512m"

JVM startup parameter description:

    -Xms1024m sets the JVM's initial (minimum) heap size to 1024M

    -Xmx1024m sets the JVM's maximum heap size to 1024M

    -XX:PermSize=512m sets the initial size of the permanent generation (non-heap memory) to 512M

    -XX:MaxPermSize=512m sets the maximum size of the permanent generation to 512M
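
    Before changing anything, it can help to double-check which options the running Tomcat JVM was actually started with. Below is a minimal sketch (not from the original article), assuming a single Tomcat instance running from /home/resin/tomcat as configured above:

# Sketch: inspect the startup options of the running Tomcat JVM (Linux).
# Assumes exactly one java process started from /home/resin/tomcat.
PID=$(ps -ef | grep java | grep '/home/resin/tomcat' | grep -v grep | awk '{print $2}')
echo "Tomcat PID: $PID"
# /proc/<pid>/cmdline holds the full command line, NUL-separated; print it as one readable line
tr '\0' ' ' < /proc/$PID/cmdline; echo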

 

Modified configuration:
export JRE_HOME=/usr/java/jdk1.6.0_38
export CATALINA_HOME=/home/resin/tomcat
JAVA_OPTS="-server -Xms2048m -Xmx2048m -Xmn512m -XX:+UseParallelOldGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/home/resin/tomcat/logs/gc.log"

Description of JVM startup parameters after tuning:

    -server starts the JVM in server mode; startup is slower, but performance once running is considerably better

    -Xmn512m sets the young generation size to 512M. Whole heap size = young generation size + old generation size + permanent generation size. The permanent generation is generally fixed at around 64M by default, so increasing the young generation reduces the old generation. This value has a large impact on system performance; Sun officially recommends setting it to 3/8 of the whole heap.

    -XX:+UseParallelOldGC enables the parallel old-generation garbage collector, generally used on multi-threaded, multi-processor machines

    -XX:+PrintGCDateStamps prefixes each GC log entry with a date stamp; GC logging itself has no noticeable impact on Java application performance

    -XX:+PrintGCDetails prints detailed information about each collection

    -Xloggc:/****/****/tomcat/logs/gc.log path of the GC log file
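
    After restarting Tomcat with the new JAVA_OPTS, it is worth confirming that the GC log is actually being written. A short sketch, using the -Xloggc path from the configuration above:

# Sketch: confirm GC logging is active after the restart.
ls -lh /home/resin/tomcat/logs/gc.log
# Follow the log; minor collections appear as "[GC ...]" and full collections as "[Full GC ...]"
tail -f /home/resin/tomcat/logs/gc.log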

2. Tomcat configuration file server.xml
Original configuration:
<Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" />
     <Connector port="8890" protocol="HTTP/1.1"
             URIEncoding="UTF-8" connectionUploadTimeout="36000000" disableUploadTimeout="false" connectionTimeout="60000"
                redirectPort="8443" />
Modified configuration:
  <!-- Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener" gcDaemonProtection="false" / -->    # this line has been commented out
     <Connector executor="tomcatThreadPool"
                port="8890" protocol="org.apache.coyote.http11.Http11NioProtocol"
                connectionUploadTimeout="36000000"
                disableUploadTimeout="false"
                connectionTimeout="60000"
                redirectPort="8443" />

 Note: the Listener line is commented out to reduce the JVM's problem of frequent periodic Full GCs.
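
    The "Full GC (System)" entries that show up later in gc.log are full collections triggered by explicit System.gc() calls. As an alternative or complement to commenting out the listener, and purely as a hedged sketch that was not part of the original tuning, HotSpot can be told to ignore explicit GC requests:

# Sketch (not applied in this article): make the JVM ignore explicit System.gc() calls.
# This removes the periodic "Full GC (System)" pauses, but it also disables any explicit
# collections that the application or its libraries may legitimately rely on.
JAVA_OPTS="$JAVA_OPTS -XX:+DisableExplicitGC"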

            After this initial tuning, Tomcat went down once more, so the tuning clearly had to continue. Looking at gc.log:

2015-01-07T14:29:45.658+0800: 169496.327: [GC [PSYoungGen: 5069K->4064K(514304K)] 40126K->39121K(2087168K), 0.0034770 secs] [Times: user=0.04 sys=0.00, real=0.01 secs] 
2015-01-07T14:29:45.661+0800: 169496.331: [Full GC (System) [PSYoungGen: 4064K->0K(514304K)] [ParOldGen: 35057K->38587K(1572864K)] 39121K->38587K(2087168K) [PSPermGen: 51269K->51269K(72128K)], 0.3473340 secs] [Times: user=1.42 sys=0.00, real=0.35 secs] 
2015-01-07T14:55:31.114+0800: 171041.784: [GC [PSYoungGen: 212923K->9959K(514048K)] 251510K->54259K(2086912K), 0.0073020 secs] [Times: user=0.06 sys=0.00, real=0.00 secs] 
2015-01-07T14:55:31.122+0800: 171041.791: [Full GC (System) [PSYoungGen: 9959K->0K(514048K)] [ParOldGen: 44299K->34703K(1572864K)] 54259K->34703K(2086912K) [PSPermGen: 51276K->51274K(69440K)], 0.2653740 secs] [Times: user=0.79 sys=0.01, real=0.27 secs] 
2015-01-07T14:55:31.531+0800: 171042.200: [GC [PSYoungGen: 5345K->2304K(507776K)] 40048K->37007K(2080640K), 0.0024170 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
2015-01-07T14:55:31.533+0800: 171042.203: [Full GC (System) [PSYoungGen: 2304K->0K(507776K)] [ParOldGen: 34703K->36747K(1572864K)] 37007K->36747K(2080640K) [PSPermGen: 51364K->51363K(67264K)], 0.2703940 secs] [Times: user=0.83 sys=0.00, real=0.28 secs] 
2015-01-07T14:55:37.374+0800: 171048.044: [GC [PSYoungGen: 10021K->10878K(508416K)] 46768K->47625K(2081280K), 0.0026770 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
2015-01-07T14:55:37.377+0800: 171048.046: [Full GC (System) [PSYoungGen: 10878K->0K(508416K)] [ParOldGen: 36747K->34645K(1572864K)] 47625K->34645K(2081280K) [PSPermGen: 51364K->51363K(65216K)], 0.2716380 secs] [Times: user=0.86 sys=0.00, real=0.27 secs] 
2015-01-07T14:55:37.670+0800: 171048.339: [GC [PSYoungGen: 3948K->2880K(510016K)] 38594K->37525K(2082880K), 0.0025790 secs] [Times: user=0.03 sys=0.00, real=0.00 secs] 
2015-01-07T14:55:37.672+0800: 171048.342: [Full GC (System) [PSYoungGen: 2880K->0K(510016K)] [ParOldGen: 34645K->37317K(1572864K)] 37525K->37317K(2082880K) [PSPermGen: 51363K->51363K(63104K)], 0.2714200 secs] [Times: user=0.84 sys=0.00, real=0.27 secs] 
2015-01-07T15:33:44.205+0800: 173334.875: [GC [PSYoungGen: 153420K->14231K(505728K)] 190738K->51548K(2078592K), 0.0048450 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] 
2015-01-07T15:33:44.210+0800: 173334.880: [Full GC (System) [PSYoungGen: 14231K->0K(505728K)] [ParOldGen: 37317K->34594K(1572864K)] 51548K->34594K(2078592K) [PSPermGen: 51364K->51359K(61440K)], 0.2908710 secs] [Times: user=0.89 sys=0.00, real=0.29 secs] 
2015-01-07T15:33:44.514+0800: 173335.183: [GC [PSYoungGen: 3254K->2560K(506816K)] 37849K->37154K(2079680K), 0.0024490 secs] [Times: user=0.03 sys=0.00, real=0.00 secs] 
2015-01-07T15:33:44.516+0800: 173335.186: [Full GC (System) [PSYoungGen: 2560K->0K(506816K)] [ParOldGen: 34594K->36883K(1572864K)] 37154K->36883K(2079680K) [PSPermGen: 51360K->51360K(60224K)], 0.2721010 secs] [Times: user=0.82 sys=0.00, real=0.27 secs]

We found that Tomcat was performing frequent Full GCs on a periodic schedule, which shows that approaching Tomcat tuning from the JVM memory side is the right direction.
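
To quantify how frequent these explicit full collections are, gc.log can be summarized with standard text tools. A small sketch against the log format shown above:

# Sketch: count the System.gc()-triggered full collections and list their timestamps
grep -c "Full GC (System)" /home/resin/tomcat/logs/gc.log
# Print only the date stamps of each explicit full GC to make the periodic pattern visible
grep "Full GC (System)" /home/resin/tomcat/logs/gc.log | awk '{print $1}'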

3. Tomcat memory analysis and tuning - enable HeapDump

        Enabling gc.log alone still could not reveal the specific cause of the Tomcat crashes, i.e. what exactly was triggering them. At this point the memory state at the moment of an OutOfMemory error has to be captured and analyzed, so the architect performed a second round of tuning:

Adjust the JVM startup parameters

Original: JAVA_OPTS="-server -Xms2048m -Xmx2048m -Xmn512m -XX:+UseParallelOldGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/home/resin/tomcat/logs/gc.log"
Modified:
JAVA_OPTS="-server -Xms2048m -Xmx2048m -Xmn768m -XX:PermSize=128m -XX:MaxPermSize=256m -XX:+UseParallelOldGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/resin/tomcat/dumpfile/heap.bin -Xloggc:/home/resin/tomcat/logs/gc.log"

Description of the JVM startup parameters after this round of tuning

-XX:PermSize=128m raises the initial method area (permanent generation) size from the default of about 20M to 128M

-XX:MaxPermSize=256m caps the method area at a maximum of 256M

-XX:+UseParallelOldGC equivalent to "Parallel Scavenge" + "Parallel Old"; both are multithreaded, parallel collectors

-XX:+HeapDumpOnOutOfMemoryError enables a HeapDump when an OutOfMemoryError occurs

-XX:HeapDumpPath=/*****/****/tomcat/dumpfile/heap.bin output path of the HeapDump file
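
One practical detail with -XX:HeapDumpPath (an observation added here, not from the original article): the target directory must exist and be writable by the Tomcat user, otherwise no dump file will be produced when the OutOfMemoryError occurs. A hedged sketch using the paths from this configuration:

# Sketch: prepare the heap dump directory ahead of time
mkdir -p /home/resin/tomcat/dumpfile
chown resin: /home/resin/tomcat/dumpfile    # assumes Tomcat runs as the "resin" user
# After an OutOfMemoryError, check whether the dump was written (it can be roughly as large as the 2G heap)
ls -lh /home/resin/tomcat/dumpfile/heap.bin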

 

4. How to analyze the HeapDump file

    After the previous round of tuning, Tomcat ran stably for two months, but after the new year the service went down once more, and this time a HeapDump file was generated, so the next step is to analyze that HeapDump file.

  1. First transfer the HeapDump file from the server to your local machine (see the sketch after this list).

  2. Use MemoryAnalyzer-1.4.0.20140604-win32.win32.x86 (Eclipse Memory Analyzer). The usage is shown in the screenshots below; the tool was given to me by the architect. 51cto download address: click here

    Or open this link: http://down.51cto.com/data/2006879

  3. Open MemoryAnalyzer.exe and import the HeapDump file.
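
    As mentioned in step 1, the dump first has to be copied off the server. A hedged sketch of that transfer (the user and host names are placeholders, not from the original article):

# Sketch: pull the heap dump down to the local machine for analysis in MemoryAnalyzer.
# "resin@prod-server" is a placeholder; substitute the real user and host of the production box.
scp resin@prod-server:/home/resin/tomcat/dumpfile/heap.bin ./heap.bin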

[Screenshots of importing the heap dump into MemoryAnalyzer and viewing the analysis report are omitted here.]

    At this point we have roughly found the cause of this Tomcat crash. What remains is to hand the analysis report to the boss; the follow-up work belongs to the developers. As can be seen, the memory leak was caused by an endless loop in the code/application, so further JVM tuning would have been pointless.

5. Summary

    The above is only meant to give operations colleagues who hit this kind of problem some ideas for tracking down the root cause of such thorny issues. The JVM tuning involved is something the author has not yet studied in depth, so there may be mistakes; corrections from the experts are very welcome!

    On the subject of JVM tuning specifically, the author has collected a few solid articles and shares them here:

    JVM principles and tuning - a collection of web links

Original article:
http://vekergu.blog.51cto.com/9966832/1619640

 
