Troubleshooting production Nginx 502 errors

Production is returning 502 errors, causing client-side failures and a noticeably poor user experience. This needs to be investigated.

 

Scenarios:

1  Other systems calling the enterprise API cannot get data back; they receive 502.

2  When the client (C-side) sends a request it gets a 502: the page cannot render, or no data is returned.

---------------------

 

Troubleshooting approach

502: when the upstream Tomcat is slow or unresponsive, Nginx returns 502 to the client.

A connection timeout?
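If a timeout between Nginx and Tomcat is suspected, the proxy timeout directives are worth checking first. A minimal sketch with Nginx's default values (the upstream name tomcat_backend is hypothetical):

location / {
    proxy_pass http://tomcat_backend;   # hypothetical upstream name
    proxy_connect_timeout 60s;          # establishing the connection to Tomcat
    proxy_send_timeout    60s;          # between two successive writes to Tomcat
    proxy_read_timeout    60s;          # between two successive reads from Tomcat
}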

 


--------------

First, rule out a problem in the business code

1  The webapp's business code is robust: every error path is handled and logged.

----------

Compare the Nginx and Tomcat access logs

Comparing the webapp's access-log records with Nginx's access log shows that the failing requests never actually made it to the webapp; they never entered the business layer.

1  Nginx enables its access log by default, while Tomcat's is off. With Nginx already recording the access history, Tomcat's own access log normally adds little, since the webapp logs accesses at the business level anyway.

 

2  Inspect Nginx's access log. The log format is configurable; it should record the source IP (which IP the request came from) and the upstream address (which Tomcat the request was forwarded to).
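In nginx.conf this is done with a log_format that includes $remote_addr (the source IP) and $upstream_addr (the Tomcat the request was proxied to). A minimal sketch, not the exact production format; the format name enterprise_fmt is hypothetical:

log_format enterprise_fmt '$remote_addr - $remote_user [$time_local] "$request" '
                          '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                          '"$host" $request_time $upstream_response_time $upstream_addr';
access_log /opt/nginx/logs/enterprise.app.access.log enterprise_fmt;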

 

A normal request (200):

172.16.3.137 - - [09/May/2016:11:19:01 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 200 176 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.088" "-" "0.088" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon

The failing requests (502):

172.16.3.137 - - [09/May/2016:11:02:28 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.092" "-" "0.092" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:04:48 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.088" "-" "0.088" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:06:17 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.087" "-" "0.087" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon
172.16.3.137 - - [09/May/2016:11:06:41 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.137" "-" "0.091" "-" "0.091" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" access_logon

Filtering out the successful requests:

grep selectQuota /opt/nginx/logs/enterprise.app.access.log | grep -v "HTTP/1.1\" 200"

(The business-layer handler for this endpoint is APIMerchantO2OController.class.)

After the upstream address was added to the log format, the 502 entries show which Tomcat each request was forwarded to:

172.16.3.132 - - [09/May/2016:11:40:47 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.132" "-" "0.839" "-" "0.839" "-" 172.16.3.140:8081 on
172.16.3.132 - - [09/May/2016:11:53:17 +0800] "POST /api/merchant/o2o/selectQuota HTTP/1.1" 502 606 "-" "Apache-HttpClient/4.2.1 (java 1.5)" "-" "enterprise.qbao.com" "172.16.3.132" "-" "0.102" "-" "0.102" "-" 172.16.3.140:8081 on

 

3  Inspect Tomcat's access log (enabling it requires a Tomcat restart; see the server.xml sketch below). The point of this log is to compare against Nginx's access log and determine whether the requests Nginx received were actually handed off to Tomcat.

[root@host140 ~]# tailf /opt/enterprise_server/tomcat_enterprise/logs/localhost_access_log.2016-05-09.txt
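Enabling Tomcat's access log means adding (or uncommenting) the AccessLogValve in conf/server.xml and restarting Tomcat. A minimal sketch using the stock pattern:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"
       prefix="localhost_access_log" suffix=".txt"
       pattern="%h %l %u %t &quot;%r&quot; %s %b" />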

 

4  From Nginx's access log, determine whether the 502s come from just one machine or from several.

   Analysis: the code is identical everywhere, but it cannot be ruled out that ops misconfigured one or a few machines. (Recall the CAS cluster, where exactly one of the 8 machines had a wrong Redis configuration; exasperating.)

   Finding: every machine can fail, each with some small probability; a machine that has already failed appears to fail more often than the others.

  

    This rules out a single bad machine: the problem exists on all of them.
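A quick way to tally 502s per upstream Tomcat is a one-liner over the Nginx log. A sketch that assumes the later log layout shown above, where $9 is the status code and the upstream address is the second-to-last field:

awk '$9 == 502 { print $(NF-1) }' /opt/nginx/logs/enterprise.app.access.log | sort | uniq -c | sort -rn

If the counts are spread across several upstream addresses rather than concentrated on one, the problem is cluster-wide.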

 

---------------

curl: send a batch of requests to enterprise directly from the Tomcat host serving m.qbao.com, to reproduce the error

 

for i in {0..1000}; do sleep 1;curl -H Host:enterprise.qbao.com "http://XX.XX.X.X/api/merchant/o2o/selectQuota?userId=32601744"; done
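A variant that prints only the HTTP status code per request (curl's standard -s, -o, and -w options) makes the 502s easy to spot and count:

for i in {0..1000}; do sleep 1; curl -s -o /dev/null -w "%{http_code}\n" -H Host:enterprise.qbao.com "http://XX.XX.X.X/api/merchant/o2o/selectQuota?userId=32601744"; done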

--------------

Restarting the machine is a programmer's ultimate move; will it make the problem go away?

     -- After the restart the problem indeed went away: the machine that had been returning 502s at a high rate stopped, while the other machines still produced 502s.

 

========================================

Conclusions:

1  The business code is not the cause: the requests never reached the business layer. Preliminary diagnosis: requests are being lost.

    (1) The business code is robust and logs every failure, yet when the 502s occurred no error log was written at all.

    (2) The webapp's access log does not match Nginx's access log.

 

2  The failures are not confined to one particular machine; every machine exhibits the problem.

 

3  After restarting a failing machine, it stops returning 502s for a while, while the other machines keep failing. So a restart resolves, or at least mitigates, the problem.

============================================

 

Continue the analysis: why does a restart help?

 

What makes it odd:

Tomcat's performance may have degraded, e.g. its memory usage.

Or the link between Nginx and Tomcat is broken in some way: the requests reach Tomcat but are never picked up by the application layer, or Tomcat picks them up and is then overloaded!
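One way to check for requests that reach Tomcat's socket but are never accepted by the application is to watch the listen socket's accept queue. A sketch assuming Tomcat listens on 8081 as in the logs above (on Linux, for a listening socket ss reports the current accept-queue depth in Recv-Q and its limit in Send-Q):

ss -lnt sport = :8081

# If Recv-Q stays near Send-Q, connections are piling up faster than the
# application accepts them, i.e. Tomcat is overloaded or stuck.
watch -n 1 "ss -lnt sport = :8081"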

Startup script:

export JAVA_OPTS="-server -d64 -Djava.awt.headless=true -Xms4096m -Xmx4096m -Xmn1024m -Xss256k -XX:PermSize=64m -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+UseCMSCompactAtFullCollection -XX:CMSMaxAbortablePrecleanTime=5000 -XX:CMSInitiatingOccupancyFraction=80 -XX:+DisableExplicitGC -XX:+CMSClassUnloadingEnabled -XX:CMSFullGCsBeforeCompaction=10 -Djava.net.preferIPv4Stack=true"

Analysis: 1. The young generation is too small relative to the old generation (Xmn1024m out of a 4096m heap). 2. Full-GC logging is not enabled and should be added.
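GC logging can be turned on with the standard HotSpot flags for this JDK generation (the log path here is an assumption):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/enterprise_server/tomcat_enterprise/logs/gc.log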

Tomcat overload: how do we tell whether Tomcat is already overloaded?

Monitor each heap region's size and growth rate; when full GCs become frequent, kill the process outright and start it again. A restart may well be faster than the full GCs. A sketch of such a watchdog follows.
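A watchdog sketch built on jstat. Everything here is an assumption for illustration: the process lookup, the threshold, and the restart command; note that $8 is the FGC (full-GC count) column in JDK 7's jstat -gcutil output, and its position differs on other JDK versions.

#!/bin/bash
# Restart Tomcat when full GCs start happening too frequently.
PID=$(pgrep -f tomcat_enterprise | head -1)               # hypothetical process match
PREV=$(jstat -gcutil "$PID" | tail -1 | awk '{print $8}')
while sleep 60; do
    CUR=$(jstat -gcutil "$PID" | tail -1 | awk '{print $8}')
    if [ $((CUR - PREV)) -ge 5 ]; then                    # 5+ full GCs in the last minute
        kill -9 "$PID"
        /opt/enterprise_server/tomcat_enterprise/bin/startup.sh   # assumed startup script
        break
    fi
    PREV=$CUR
done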



Reposted from curious.iteye.com/blog/2296965