Troubleshooting nginx 502 Errors in Production

Production requests are returning 502, which surfaces as errors on the client and hurts the customer experience, so the cause needs to be investigated.

 

Scenarios:

1 An IT system calls the enterprise API, cannot get data, and receives a 502

2 The client side (C) gets a 502 when initiating a request; the page cannot be displayed or no data is returned

---------------------

 

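Troubleshooting approach

Hypothesis: tomcat is slow or unresponsive, so nginx returns 502 to the client. (Strictly, nginx returns 502 when it gets no valid response from the upstream, for example a refused or reset connection; a plain read timeout is normally reported as 504.)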

Connection timed out?

 

--------------

Check whether the business code is at fault

1 The business code of the webapp is robust enough to handle various errors

----------

 Check the accessLog of nginx and tomcat

 Comparing the webapp's accessLog with nginx's accessLog shows that the failing requests never reach the webapp, let alone the business layer

 

1 The access log is on by default in nginx but was not enabled in tomcat here. nginx already records the access history, so a tomcat access log is normally redundant; tomcat logging is done at the business level.

 

2 Check the nginx access log. The log format is configurable and should print srcIp (which IP the request came from) and dstIp (which tomcat the request was forwarded to)

 

Sample entry with status 200:
172.16.3.137 - - [09/May/2016:11:19:01 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 200 176 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.088" "-" "0.088" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
Entries for the same endpoint with status 502:
172.16.3.137 - - [09/May/2016:11:02:28 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.092" "-" "0.092" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:04:48 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.088" "-" "0.088" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:06:17 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.087" "-" "0.087" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:06:41 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.091" "-" "0.091" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon

List the selectQuota requests that did not return 200:
grep selectQuota /opt/nginx/logs/enterprise.app.access.log |grep -v "HTTP/1.1\" 200"


Controller class involved: APIMerchantO2OController.class
502 entries whose log line includes the upstream tomcat address (172.16.3.140:8081):
172.16.3.132 - - [09/May/2016:11:40:47 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.132" "-" "0.839" "-" "0.839" "-" 172.16.3.140:8081 on
172.16.3.132 - - [09/May/2016:11:53:17 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.132" "-" "0.102" "-" "0.102" "-" 172.16.3.140:8081 on

 

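3 Check tomcat's access log (enabling it requires a tomcat restart). The point of turning this log on is to compare it against nginx's accessLog and determine whether the requests nginx received were actually forwarded to tomcat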

[root@host140 ~]# tailf /opt/enterprise_server/tomcat_enterprise/logs/localhost_access_log.2016-05-09.txt
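
One way to make the comparison in step 3 concrete is to count requests per minute in each log and diff the two lists. The sketch below is only an illustration: the function name `per_minute` is made up, and it assumes both logs carry the bracketed timestamp in the fourth whitespace-separated field, which holds for the nginx lines shown above and for tomcat's common access-log pattern.

```shell
# Hypothetical helper: print "minute count" for every minute in which
# the given path appears in an access log.
# Assumes field 4 looks like "[09/May/2016:11:19:01".
per_minute() {
  awk -v p="$1" '
    index($0, p) { n[substr($4, 2, 17)]++ }   # key: dd/Mon/yyyy:HH:MM
    END { for (m in n) print m, n[m] }
  ' "$2" | sort
}

# Usage (log paths taken from the commands in this post):
#   per_minute /api/merchant/o2o/selectQuota /opt/nginx/logs/enterprise.app.access.log > /tmp/nginx.cnt
#   per_minute /api/merchant/o2o/selectQuota /opt/enterprise_server/tomcat_enterprise/logs/localhost_access_log.2016-05-09.txt > /tmp/tomcat.cnt
#   diff /tmp/nginx.cnt /tmp/tomcat.cnt
```

If nginx balances across several tomcats, the nginx log should first be filtered down to the lines for the tomcat being compared, otherwise the counts will differ for that reason alone.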

 

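4 From nginx's access log, determine whether the 502s come from a single machine or from several

   Analysis: the code is identical everywhere, but one or more machines may have been misconfigured by ops. Recall that of the 8 CAS machines, exactly one had a wrong redis configuration, which was exasperating

   Conclusion: every machine can fail, each with a small probability; once a machine starts failing, its failure rate looks higher than that of the others

  

    This rules out a single bad machine: all machines have the problem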
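
The per-machine check in step 4 can be sketched as a small awk tally. This is an illustration under assumptions, not the command that was actually run: the function name is made up, and it relies on the status code being field 9 and the upstream address being the second-to-last field, as in the later sample lines above.

```shell
# Hypothetical helper: count 502 responses per upstream address.
# Assumes the status code is field 9 and the upstream address is the
# second-to-last field of each nginx access-log line.
count_502_by_upstream() {
  awk '$9 == 502 { n[$(NF - 1)]++ } END { for (u in n) print n[u], u }' "$@" | sort -rn
}

# Usage:
#   count_502_by_upstream /opt/nginx/logs/enterprise.app.access.log
```

A single address dominating the output points at one misconfigured machine; counts spread across all upstreams point at a problem common to every machine.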

 

---------------

Use curl directly on the tomcat host that serves m.qbao.com to send a batch of requests to enterprise and reproduce the error

 

for i in {0..1000}; do sleep 1;curl -H Host:enterprise.qbao.com "http://XX.XX.X.X/api/merchant/o2o/selectQuota?userId=32601744"; done
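
A variant of the loop above that records only the status code of each request makes the failure rate easier to read. The helper names `probe` and `tally` and the 5-second timeout are assumptions for illustration; the curl options used (`-s`, `-o`, `-m`, `-w '%{http_code}'`, `-H`) are standard.

```shell
# Hypothetical helpers; the 5-second timeout is an assumption.
# probe URL COUNT: send COUNT requests, one per second, printing only
# the HTTP status code of each (000 when curl gets no response).
probe() (
  url=$1; count=$2; i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -m 5 -w '%{http_code}\n' -H 'Host: enterprise.qbao.com' "$url"
    sleep 1
    i=$((i + 1))
  done
)

# tally: read status codes on stdin and print "code count" per code.
tally() {
  awk '{ n[$1]++ } END { for (c in n) print c, n[c] }' | sort
}

# Usage, with the same placeholder address as above:
#   probe "http://XX.XX.X.X/api/merchant/o2o/selectQuota?userId=32601744" 1000 | tally
```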

--------------

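Restarting the machine is a programmer's ultimate move. Does it fix the problem here?

     -- It did: after the restart, the machine that had been returning 502 with high probability stopped doing so, while the other machines still returned 502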

 

========================================

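Conclusions:

1 The business code is not the cause; the requests never reach the business layer. The preliminary finding is that requests are being lost

    (1) The business code is robust enough to print an error log on failure, yet nothing was logged when the 502s occurred

    (2) The webapp's accessLog does not match nginx's accessLog

 

2  The errors are not specific to one machine; all machines are affected

 

3 After the failing machine was restarted, it stopped returning 502 for a while, whereas the other machines kept failing. This shows that a restart resolves or at least mitigates the problem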

============================================

 

Next, analyze why a restart helps

 

 

 

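The odd part:

Possibly tomcat has a performance problem (memory usage)

Or the link between nginx and tomcat is at fault: the request reaches tomcat but is never picked up by the application layer, or tomcat picks it up and is then overloaded!!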

 

Startup script:

export JAVA_OPTS="-server -d64 -Djava.awt.headless=true -Xms4096m -Xmx4096m -Xmn1024m -Xss256k -XX:PermSize=64m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=5000 -XX:CMSInitiatingOccupancyFraction=80 -XX:+DisableExplicitGC -XX:+CMSClassUnloadingEnabled -XX:CMSFullGCsBeforeCompaction=10 -Djava.net.preferIPv4Stack=true"

 

Analysis: 1 The young generation is too small and the old generation too large  2 Full GC logs are not printed; GC logging should be enabled
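
For the second point, GC logging could be switched on by appending a few flags to the startup script. This is a sketch, not the configuration that was actually deployed: the flags below are the ones accepted by the pre-Java-9 HotSpot JVMs that still take the PermSize options, and the log path is an assumption.

```shell
# Flags for pre-Java-9 HotSpot JVMs (the ones that still accept PermSize).
# The log path is an assumption; point it at a directory tomcat can write to.
GC_LOG=/opt/enterprise_server/tomcat_enterprise/logs/gc.log
export JAVA_OPTS="${JAVA_OPTS:-} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$GC_LOG"
```

With these flags every collection, including full GCs, is written to the log with a wall-clock timestamp, which can then be lined up against the 502 timestamps in the nginx access log.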

 

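tomcat overload: how to tell whether tomcat is already overloaded

 

Monitor the size and growth rate of each generation. When full GCs become frequent, kill the process and start it again; a restart may be faster than a full GC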
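
The monitoring idea above can be sketched with `jstat -gcutil`, which ships with the JDK and prints per-generation utilization plus a cumulative full GC count in its FGC column. The helper names, the polling interval and the threshold are assumptions; the sketch only reports a spike and leaves the restart decision to the operator.

```shell
# Hypothetical helper: read `jstat -gcutil` output on stdin and print the
# cumulative full GC count. The FGC column is located by header name
# because its position differs between JDK versions.
fgc_from_jstat() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FGC") col = i; next }
       col { v = $col }
       END { if (v != "") print v }'
}

# Hypothetical watcher: every INTERVAL seconds, report when the full GC
# count of PID has risen by at least LIMIT since the previous sample.
watch_fgc() (
  pid=$1; interval=$2; limit=$3
  prev=$(jstat -gcutil "$pid" | fgc_from_jstat)
  [ -n "$prev" ] || exit 1
  while sleep "$interval"; do
    cur=$(jstat -gcutil "$pid" | fgc_from_jstat)
    [ -n "$cur" ] || break          # process gone or jstat failed
    if [ $((cur - prev)) -ge "$limit" ]; then
      echo "pid $pid: $((cur - prev)) full GCs in the last ${interval}s"
    fi
    prev=$cur
  done
)

# Usage (values are illustrative):
#   watch_fgc <tomcat-pid> 60 3
```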

 

 

 

 

 

 

  

   

 

 

 

 

 

 

 
