Troubleshooting online server issues

Common online issues include:

  1. How do you troubleshoot high CPU usage on an online server?

  2. How do you troubleshoot a sudden load spike on an online server?

  3. How do you troubleshoot frequent Full GC on an online server?

  4. How do you troubleshoot a deadlock on an online server?


1: How to troubleshoot high CPU usage on an online server

Discovering the problem:

 Before every big promotion, our testers stress-test the online services. During the test we watch each service's CPU, memory, load, RT, QPS, and other indicators.

 During one stress test, the testers found that CPU usage rose sharply once the QPS of one of our interfaces climbed to 500.

CPU usage, also known as CPU utilization, describes how much of the CPU is occupied over a period of time. The higher the usage, the more programs the machine is running at that moment; the lower the usage, the fewer.

Locating the problem:

 Locating the process: log in to the server and run the top command to check CPU usage.

top is a commonly used performance analysis tool on Linux. It shows the resource usage of each process in real time, similar to the Windows Task Manager.

 In the top output, the Java process with PID 1893 had reached 181% CPU usage, so we could basically conclude that our Java application was what drove the server's overall CPU usage up.

 Locating the thread

  As we know, a Java application runs as a single process with multiple threads, so next we look at the CPU usage of each thread inside the Java process with PID 1893, again using the top command.

  From the output of top -Hp 1893 we can see that, within process 1893, the thread with ID 4519 uses the most CPU.

  Locating the code

  With top we have pinned down the specific thread that drives CPU usage up. The next step is to find out which line of code is at fault.

  First, convert the thread ID 4519 to hexadecimal with printf, because jstack prints thread IDs in hex: printf "%x\n" 4519 gives 11a7.
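
  The conversion can be wrapped in a small helper. This is only a sketch: the function name tid_to_hex is made up for illustration, and 4519 is the thread ID from the example above.

```shell
#!/bin/sh
# tid_to_hex TID
# Convert a decimal thread ID (as shown by top -Hp) to the lowercase
# hexadecimal form that jstack prints after "nid=0x".
# The function name is hypothetical, chosen for this example.
tid_to_hex() {
    printf '%x\n' "$1"
}

tid_to_hex 4519   # prints 11a7
```

  The result is the string to search for in the jstack output.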

 

  Next, use jstack to view the stack information and filter for that thread (jstack <process id> | grep <hex thread id>).

 

  From the stack output we can clearly see that line 30 of BeanValidator.java is the likely problem.
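
  If the thread dump has been saved to a file first, the matching thread's whole stack block can be pulled out in one step. The following is a rough sketch, not the exact command used in the investigation: the function name thread_stack and the file path in the usage comment are assumptions, and it relies on jstack separating thread blocks with a blank line.

```shell
#!/bin/sh
# thread_stack DUMP_FILE TID
# Print the stack block of the thread whose decimal ID is TID from a
# saved jstack dump. The function name is hypothetical.
thread_stack() {
    ts_nid=$(printf '0x%x' "$2")
    # The trailing space in the key keeps nid=0x11a7 from matching nid=0x11a70.
    awk -v key="nid=$ts_nid " '
        index($0, key) { show = 1 }   # header line of the wanted thread
        show           { print }
        show && /^$/   { exit }       # a blank line ends the thread block
    ' "$1"
}

# Usage (assumes jstack is on the PATH and 1893 is the Java PID):
#   jstack 1893 > /tmp/1893.stack
#   thread_stack /tmp/1893.stack 4519
```

  Working from a saved dump also keeps a copy of the scene in case the process has to be restarted.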

 Solving the problem

  The next step is to fix it. Looking at the code, we found that we had written a custom BeanValidator that wraps Hibernate Validator, and its validate method initialized a Validator instance by calling Validation.buildDefaultValidatorFactory().getValidator(). Analysis showed that creating this instance is relatively time-consuming.

  We refactored the code so that the Validator instance is no longer initialized inside the method but is created once when the class is initialized, which solved the problem.

 Summary

  The steps above show a fairly complete process for locating an online problem. The main commands used are top, printf, and jstack.

  In addition, Arthas, Alibaba's open-source online troubleshooting tool, can locate this kind of problem with its thread command, which lists the busiest threads together with their stacks.

  That covers how to troubleshoot high CPU usage on an online server. If you are interested, follow-up articles can cover how to investigate load spikes, frequent GC, and other issues.

 


 

2: How to troubleshoot a load spike on an online server

What is load

Load is an important indicator on a Linux machine that directly reflects the machine's current state.

Here is how load is defined:

In UNIX computing, the system load is a measure of the amount of computational work that a computer system performs. The load average represents the average system load over a period of time. It conventionally appears in the form of three numbers which represent the system load during the last one-, five-, and fifteen-minute periods.(wikipedia)

  In short: on UNIX systems, system load is a measure of the current CPU workload, defined as the average number of threads in the run queue over a specific time interval. The load average is the machine's average load over a period of time. The lower this value, the better. If the load is too high, the machine cannot keep up with requests and other operations, and it may even crash.

  High load on Linux mainly comes from three sources: CPU usage, memory usage, and IO consumption. Overuse of any one of them can make the server load rise sharply.

Checking the machine load

  On Linux there are several commands for viewing a machine's load, including uptime, top, and w.

The uptime command

  The uptime command prints how long the system has been running and the system's average load. Its output shows, in order: the current time, how long the system has been running, the number of users currently logged in, and the system's average load over the last 1, 5, and 15 minutes.

  The last part of the line, "load average", is the system's average load. It consists of three numbers, from which we can judge whether the system load is high or low.

  The three numbers 1.74 1.87 1.97 are the system's average load over the last 1, 5, and 15 minutes. We usually call them load1, load5, and load15.
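
  For scripting, the three values can be extracted from the uptime line. This is a minimal sketch: the function name load_averages is made up, and it assumes the line contains "load average:" (or "load averages:") followed by numbers that use a dot as the decimal separator.

```shell
#!/bin/sh
# load_averages
# Read an uptime-style line on stdin and print "load1 load5 load15".
# The function name is hypothetical.
load_averages() {
    awk -F'load averages?: *' 'NF > 1 {
        gsub(",", "", $2)   # drop the commas between the three values
        print $2
    }'
}

# Usage:
#   uptime | load_averages
#   uptime | load_averages | { read load1 load5 load15; echo "load1=$load1"; }
```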

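The w command

  The main purpose of the w command is to show information about the users currently logged in to the system. Unlike who, however, w is more powerful: it also shows the current time, how long the system has been up, the number of logged-in users, and the system's average load over the last 1, 5, and 15 minutes. It then lists data for each user in this order: login account, terminal name, remote host name, login time, idle time, JCPU, PCPU, and the command line of the currently running process.

  In the w output we can see that the current system time is 14:08, the system has been up for 23 hours and 41 minutes, and 3 users are logged in. The system's average load over the last 1, 5, and 15 minutes is 1.74, 1.87, and 1.97, the same as reported by uptime. The per-user data printed below that is not covered in detail here.

The top command

  top is a commonly used performance analysis tool on Linux that shows the resource usage of each process in real time, similar to the Windows Task Manager. The first line of its output also shows the load average.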

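Normal load range

  What counts as a normal load for a machine has always been controversial, and different people see it differently. For a single CPU, some consider a load above 0.7 to be out of the normal range, some think anything up to 1 is fine, and others accept a per-CPU load of up to 2.

  There are so many different views because, besides the CPU, machines differ in other ways: the programs they run, their memory, and even the temperature of the machine room.

  For example, some machines run large batch jobs on a schedule. During that window the load may spike, while at other times it stays low. Should we investigate during the spike?

  My advice is to establish a baseline for the metric based on your own machines (for example, the average over the past month). As long as the daily load stays reasonably close to the baseline it is acceptable; if it deviates too far, someone should step in and check.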
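
  A baseline check along these lines could be sketched as follows. Everything here is illustrative: the function names, the file name in the usage comment, and especially the default tolerance of 50% are assumptions, since the right tolerance depends on your own machines.

```shell
#!/bin/sh
# baseline_avg
# Read one load sample per line on stdin and print the mean.
# Function names and the default tolerance are hypothetical.
baseline_avg() {
    awk 'NF { sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'
}

# deviates_from_baseline LOAD BASELINE [TOLERANCE]
# Succeeds (exit status 0) when LOAD differs from BASELINE by more than
# TOLERANCE, given as a fraction of the baseline (default 0.5, i.e. 50%).
deviates_from_baseline() {
    awk -v load="$1" -v base="$2" -v tol="${3:-0.5}" 'BEGIN {
        diff = load - base
        if (diff < 0) diff = -diff
        exit !(diff > base * tol)
    }'
}

# Usage (load_samples.txt is a hypothetical file of recorded load1 values):
#   base=$(baseline_avg < load_samples.txt)
#   deviates_from_baseline 3.5 "$base" && echo "load is far from the baseline"
```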

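  Still, there ought to be a recommended threshold. Ruan Yifeng (阮一峰) gave the following advice on his blog:

When the system load stays above 0.7, you must start investigating where the problem lies, to keep things from getting worse.

When the system load stays above 1.0, you must find a solution and bring the value down.

When the system load reaches 5.0, your system has a very serious problem: it has been unresponsive for a long time or is close to hanging. You should never let the system reach this value.

  These thresholds are all for a single CPU, but many computers today are multi-core. So for a typical system, whether it is overloaded is judged relative to the number of CPUs. If we take 0.7 as the safe line for a single-core machine, the load of a four-core machine is best kept below 3 (4 * 0.7 = 2.8).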
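
  The per-core arithmetic can be written down as a small check. This is a sketch under the 0.7-per-core assumption from the text; the function names are made up, and the per-core limit can be overridden.

```shell
#!/bin/sh
# load_threshold CORES [PER_CORE_LIMIT]
# Print the load considered safe for a machine with CORES CPUs.
# PER_CORE_LIMIT defaults to 0.7, the single-core safe line used above.
# Function names are hypothetical.
load_threshold() {
    awk -v cores="$1" -v per_core="${2:-0.7}" \
        'BEGIN { printf "%.1f\n", cores * per_core }'
}

# is_overloaded LOAD CORES [PER_CORE_LIMIT]
# Succeeds (exit status 0) when LOAD is above the threshold.
is_overloaded() {
    awk -v load="$1" -v limit="$(load_threshold "$2" "${3:-0.7}")" \
        'BEGIN { exit !(load + 0 > limit + 0) }'
}

load_threshold 4   # prints 2.8

# Usage on a live machine (nproc reports the number of available CPUs):
#   is_overloaded "$load1" "$(nproc)" && echo "load is above the safe line"
```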

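  One more point: the load average has three values, the 1-minute, 5-minute, and 15-minute system load, and all three are useful when troubleshooting.

  In general, the 1-minute load reflects a recent, temporary state, while the 15-minute load reflects a sustained state rather than a temporary problem. If load15 is high and load1 is low, the situation is improving; conversely, it may be getting worse.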
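
  That comparison is easy to automate. The sketch below only compares the two values directly; the function name and the labels it prints are made up for illustration.

```shell
#!/bin/sh
# load_trend LOAD1 LOAD15
# Compare the 1-minute and 15-minute load averages and print a label.
# The function name and labels are hypothetical.
load_trend() {
    awk -v l1="$1" -v l15="$2" 'BEGIN {
        if (l1 + 0 > l15 + 0)      print "rising"    # recent load above the sustained level
        else if (l1 + 0 < l15 + 0) print "falling"   # recent load below the sustained level
        else                       print "steady"
    }'
}

load_trend 1.74 1.97   # prints falling
```

  With the values from the uptime example, load1 is below load15, so the load is easing off.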

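How to reduce the load

 The causes of high load can be complex; it may be a hardware problem or a software problem.

 If it is a hardware problem, the machine's performance simply is not enough, and the fix is simple: replace the machine.

 As mentioned earlier, CPU usage, memory usage, and IO consumption can all cause high load. If it is a software problem, it may be caused by Java threads that stay busy for a long time, by large amounts of memory being held continuously, and so on. We suggest checking the code from these angles:

  1. Is there a memory leak causing frequent GC?

  2. Has a deadlock occurred?

  3. Are large fields being read or written?

  4. Could database operations be the cause? Check the SQL statements.

 One more suggestion: if the load on an online machine spikes, consider dumping the stack and heap memory first, then restarting to relieve the problem temporarily, and only then think about rolling back and investigating.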

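Troubleshooting approach for a load spike in a Java web application

1. Use uptime to check the current load and confirm that it has spiked.

2. Use top to find the ID of the process with high CPU usage.

  We find that the process with PID 1893 is using 181% CPU. It is a Java process, so we can basically conclude that this is a software problem.

3. Use top (top -Hp <process id>) to see which thread has high CPU usage.

4. Use printf to get the hexadecimal form of that thread ID.

5. Use jstack to see which method the thread is executing (jstack <process id> | grep <hex thread id>). (See "Java Command Learning Series (2): Jstack".)

  From the thread's stack log we can see that the thread with high CPU usage is executing com.hollis.test.util.BeanValidator.validate(BeanValidator.java:30) in my code. We can then check whether that class is being used incorrectly.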

 

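Someone online has written a script that automatically helps us roughly locate the JVM threads driving the load spike. The script is roughly as follows:

#!/bin/ksh
# Usage: script [top_n] [pid]
typeset top=${1:-10}                        # number of busiest threads to inspect (default 10)
typeset pid=${2:-$(pgrep -u $USER java)}    # target JVM; the default assumes a single java process for this user
typeset tmp_file=/tmp/java_${pid}_$$.trace

# Save the thread dump first so the stacks are captured close to the sampling moment
$JAVA_HOME/bin/jstack $pid > $tmp_file
# List all threads sorted by CPU usage (ascending), keep the busiest $top,
# then keep only threads of the target process and print "tid<TAB>%cpu"
ps H -eo user,pid,ppid,tid,time,%cpu --sort=%cpu --no-headers\
        | tail -$top\
        | awk -v "pid=$pid" '$2==pid{print $4"\t"$6}'\
        | while read line;
do
        # Convert the decimal tid to the hex nid used in the jstack output
        typeset nid=$(echo "$line"|awk '{printf("0x%x",$1)}')
        typeset cpu=$(echo "$line"|awk '{print $2}')
        # Print that thread's stack block, tagging its first line with the CPU share
        awk -v "cpu=$cpu" '/nid='"$nid"'/,/^$/{print $0"\t"(isF++?"":"cpu="cpu"%");}' $tmp_file
done

rm -f $tmp_file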

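Now let's break down how it works and explain where scripts like this apply.
  1. Use top to find the Java process whose usage has spiked and record its pid.
  2. Use jstack to write out the Java thread stacks and preserve the scene: jstack -l 30142 > 30142.stack
  3. Find the threads currently using the most CPU with ps H -eo user,pid,ppid,tid,time,%cpu --sort=%cpu
    USER: owner of the process, PID: process ID, PPID: parent process ID, TID: thread ID
    %CPU: share of CPU used by the thread (note that this value is computed from /proc, so there is a time lag)
  4. Merge the information: ps gives the TID, and converting it from decimal to hexadecimal yields the JVM thread ID shown by jstack
    typeset nid="0x"$(echo "$line"|awk '{print $1}'|xargs -I{} echo "obase=16;{}"|bc|tr 'A-Z' 'a-z')
  5. Finally, match and merge the ps and jstack output to get the information we want most.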


 

 

 Sources:

  https://mp.weixin.qq.com/s/aZ2Otci6TntXdsoyMcwBVw  

  https://blog.csdn.net/huangyimo/article/details/80401638

 

Origin www.cnblogs.com/myseries/p/11230839.html