SpringBoot suspended animation with tens of thousands of requests piling up: how do you fight the fire?

Foreword:

In the reader community (50+) of Nien, a 40-year-old architect, many friends cannot get offers, or cannot get good ones.

Nien often helps everyone optimize projects and resumes and dig out technical highlights. When coaching on resumes, online troubleshooting and Java tuning are essential topics.

The problem is that many friends have no grounding at all in tuning or online troubleshooting, let alone high-concurrency scenarios. Yesterday, in Nien's architect community, a friend ran into an online problem:

"A SpringBoot jar is deployed; the service is alive, but requests get no response. Arthas shows memory and CPU are normal, yet the Tomcat thread pool is full. What should I do?"

SpringBoot suspended animation: what should I do?

For architects and senior developers, tuning and online troubleshooting are core skills; they are the inner strength behind all inner strength.

The Nien team has combined senior architecture experience with industry cases to compile the "Java Tuning Bible" series of PDF e-books; six parts are planned, including this article:

(1) Tuning Bible 1: Master JMeter distributed load testing from scratch, with a 100K+ QPS load-testing practice (completed)

(2) Tuning Bible 2: From 70s to 20ms, a 3500x performance optimization practice, with a solution everyone can use (completed)

(3) Tuning Bible 3: How to tune MySQL? 7 deadly tricks to speed up slow SQL by 100x and achieve MySQL tuning freedom (completed)

(4) Tuning Bible 4: SpringBoot suspended animation: how to put out the fire? (this article)

(5) Tuning Bible 5: Master JVM tuning from scratch and achieve JVM tuning freedom (in writing)

(6) Tuning Bible 6: Master Linux and Tomcat tuning from scratch and achieve infrastructure tuning freedom (in writing)

The above articles will be published in succession on the "Technology Free Circle" official account. The complete "Java Tuning Bible" PDF is available from Nien.

Article Directory

Online SpringBoot suspended animation: an urgent phenomenon

The SpringBoot application becomes inaccessible. There are 3 specific symptoms:

  • Symptom 1. Client requests get no response;
  • Symptom 2. No log output when requests arrive;
  • Symptom 3. The SpringBoot process is still alive

Symptom 1. Client requests get no response

A front-end request shows as pending.

An RPC client call manifests as a connection timeout.

Take the Feign client's timeout configuration as an example.
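As a reference, a minimal sketch of the timeout settings for Spring Cloud OpenFeign (property names from Spring Cloud OpenFeign; the values are illustrative):

# application.properties
# how long (ms) to wait when establishing the connection
feign.client.config.default.connectTimeout=5000
# how long (ms) to wait for the response
feign.client.config.default.readTimeout=5000

With these in place, a server in suspended animation makes the Feign caller fail fast with a timeout instead of hanging.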

Symptom 2. No log output when requests arrive

Checking the business log shows that logging has stopped: there is no recent access log.

Symptom 3. The SpringBoot process is still alive

Checking the process with jps or ps shows that the service process exists.

Judging suspended animation:

If all three symptoms are present, you can basically conclude that the SpringBoot process is in suspended animation.

Six common causes of suspended animation

  1. Java threads are deadlocked, or all threads are blocked;
  2. The connections in the database connection pool are exhausted, so acquiring a connection waits forever;
  3. A memory leak leads to OutOfMemory, and allocations keep failing for lack of space; this is common when the server has plenty of free memory but the memory allotted to the JVM is exhausted;
  4. The jar package was replaced while the service was running, but the service was not restarted; this is a problem caused by not following the rules;
  5. The disk is full, so everything that needs to write data fails; the client receives 500 errors and connection timeouts;
  6. The thread pool is full and no more threads can be allocated to handle requests, usually because a large number of threads are blocked on some request; the client sees connection timeouts.

Nien's five-trick analysis method

Many different situations can stop SpringBoot from processing business, so the investigation must proceed on several fronts.

  • The first trick: check the logs
  • The second trick: check system resources
  • The third trick: check for JVM thread deadlocks
  • The fourth trick: check JVM memory
  • The fifth trick: check the operating system configuration

The first trick: check the logs

First, check the running state of the service to determine the surface-level reason the business cannot be handled: is it SpringBoot suspended animation or a business exception?

Check the project log for obvious errors, and deal with any errors found.

  • Check the local log files
  • Or the distributed logs in ELK

Error messages are very important: in general, 70% of problems can be diagnosed from them.

If the service has multiple nodes, keep one node in its faulty state for root-cause analysis, and restart the others so they restore normal service as soon as possible.

The second trick: check system resources

Check the network, disk, CPU, etc. of the machine hosting the service, one by one.

First check CPU resources

step1: Get the lay of the land for CPU resources (how many CPU cores are there?)

  1. Number of physical CPUs: the number of CPUs actually installed on the motherboard; count the distinct physical id values
  2. Number of CPU cores: the number of processing units on a single CPU, e.g. dual-core or quad-core (cpu cores)
  3. Number of logical CPUs: simply put, hyper-threading makes 1 physical core appear to the operating system as 2 cores.

Hyper-threading doubles the execution resources available to the operating system and significantly improves overall system performance. In that case, logical CPUs = number of physical CPUs x cores per CPU x 2.

Total cores = number of physical CPUs × cores per physical CPU.

Total logical CPUs = number of physical CPUs × cores per physical CPU × hyper-threads per core.

# View CPU information (model)
cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
# View the number of physical CPUs
cat /proc/cpuinfo| grep "physical id"| sort| uniq| wc -l
# View the number of cores per physical CPU
cat /proc/cpuinfo| grep "cpu cores"| uniq
# View the number of logical CPUs
cat /proc/cpuinfo| grep "processor"| wc -l

The formula again: total logical CPUs = physical CPUs x cores per physical CPU x hyper-threads per core.

That brings in one more indicator: the number of hyper-threads. So, is hyper-threading enabled on this CentOS 7 machine?

cat /proc/cpuinfo | grep -e "cpu cores"  -e "siblings" | sort | uniq

Determine whether hyper-threading is enabled:

  • Logical CPUs > physical CPUs x CPU cores: hyper-threading is enabled
  • Logical CPUs = physical CPUs x CPU cores: hyper-threading is not enabled or not supported
  • If cpu cores equals siblings, hyper-threading is not enabled; otherwise it is.

Judging from the output above, hyper-threading is not enabled on Nien's virtual machine.

The CPU of Nien's virtual machine has 1 x 4 = 4 cores, each core running 1 thread, so there are 4 logical CPUs.

step2: Check the usage of all CPU cores

View the usage of every CPU core:

mpstat -P ALL 1 

  • %idle is the proportion of idle CPU time
  • %iowait (shown as %wa in top) is the percentage of CPU time spent waiting for I/O

Check whether the CPU is maxed out.

If the CPU is very busy, check the CPU usage per process.

Check per-process CPU usage

Use the top command to view per-process CPU usage:

top

Or use ps aux | sort -nr -k3 | head -10 to list the top 10 CPU consumers:

ps aux|sort -nr -k3|head -10  

This makes it easy to find the processes hogging the CPU; then analyze further.

Next, check memory resources

Use free -m to view overall memory usage and determine whether memory is full.

It displays memory information in megabytes; the output looks like this:

$free -m

The top command also shows memory usage at the process level:

top

step3: check IO

Determine whether I/O is saturated.

Use the iostat command to check I/O

iostat provides rich I/O status data.

iostat command parameters
iostat [options] [<interval>] [<count>]

-c: show CPU usage
-d: show disk usage
-N: show disk array (LVM) information
-n: show NFS usage
-k: display in KB
-m: display in MB
-t: report the characters read from and written to the terminal per second, plus CPU information
-V: show version information
-x: show extended details
-p [disk]: show disks and partitions

CPU usage statistics: user time, system time, idle rate, etc.; the following indicators sum to 100%:

  • %user: percentage of time the CPU spends in user mode.
  • %nice: percentage of time the CPU spends in user mode at a modified NICE priority.
  • %system: percentage of time the CPU spends in system (kernel) mode.
  • %iowait: percentage of time the CPU waits for I/O to complete. If %iowait is too high, the disk has an I/O bottleneck
  • %steal: percentage of time the virtual CPU involuntarily waits while the hypervisor services another virtual processor.
  • %idle: percentage of CPU idle time. A high %idle means the CPU is relatively idle. If %idle is high but the system responds slowly, the CPU may be waiting for memory allocation, and memory should be increased. If %idle stays below 10, the system's CPU capacity is low, and CPU is the resource that most needs attention.

Block device I/O statistics: the amount of data read and written per second, the total amount of read and written data, etc.

Read indicators:

  • r/s: read I/O requests completed per second (rio/s); a high value can indicate a lot of random I/O
  • rkB/s: kilobytes read per second; half of rsect/s, because each sector is 512 bytes
  • rrqm/s: read requests merged per second (rmerge/s)
  • %rrqm: percentage of read requests merged together before being sent to the device
  • r_await: average time per read operation. Watch this one: for an HDD, anything above 20ms suggests too many requests queuing, since a normal seek takes only about 10ms
  • rareq-sz: average size (in KB) of read requests sent to the device

Write indicators:

  • w/s: write I/O requests completed per second (wio/s)
  • wkB/s: kilobytes written per second; half of wsect/s
  • wrqm/s: write requests merged per second (wmerge/s)
  • %wrqm: percentage of write requests merged together before being sent to the device
  • w_await: average time (in milliseconds) per write request, including queueing time and service time. As with reads, for an HDD a value above 20ms suggests too many requests queuing, since a normal seek takes only about 10ms
  • wareq-sz: average size (in KB) of write requests sent to the device.

Discard indicators:

  • d/s: The number of discard requests completed by the device per second (after coalescing).
  • dkB/s: The number of kB discarded from the device per second
  • drqm/s: The number of merge discard requests queued to the device per second
  • %drqm: The percentage of discard requests that were merged together before being sent to the device.
  • d_await: Average time in milliseconds to issue a discard request for a device to be serviced. This includes the time spent on requests in the queue and the time spent servicing requests.
  • dareq-sz: Average size (in kilobytes) of discard requests issued to the device.

Other indicators:

  • aqu-sz: average queue length of requests issued to the device (formerly avgqu-sz). A high value deserves attention: too much I/O may be queuing up
  • %util: the percentage of time in one second spent on I/O operations, i.e. the device's bandwidth utilization. For devices that serve requests serially, saturation occurs as this value approaches 100%. For devices that process requests in parallel, such as RAID arrays and modern SSDs, the number does not reflect their real limit. A high value means I/O is likely the bottleneck, but a low value does not prove I/O is not. As a rule of thumb, %util above 70% indicates high I/O pressure.

You can combine this with vmstat: the b column (processes waiting for resources) and the wa column (percentage of CPU time in I/O wait; above 30% means high I/O pressure).

Use iotop to check per-process disk I/O

iotop is a top-like tool that displays disk activity in real time.

iotop monitors the I/O usage information output by the Linux kernel and displays the current I/O usage of processes or threads in the system.

It shows per-process/per-thread read and write I/O bandwidth, as well as the percentage of time each thread/process spends waiting on swap-in and on I/O.

The Total DISK READ and Total DISK WRITE values represent the total read and write bandwidth between processes and kernel threads on one side and the kernel block device subsystem on the other.

The Actual DISK READ and Actual DISK WRITE values represent the actual disk I/O bandwidth between the kernel block device subsystem and the underlying hardware (HDD, SSD, etc.).

iotop command parameters:

-o, --only: only show processes or threads actually doing I/O. Besides passing the flag, you can press o while running.
-b, --batch: non-interactive mode, generally used for logging.
-n NUM, --iter=NUM: set the number of iterations; unlimited by default. Useful in non-interactive mode.
-d SEC, --delay=SEC: set the interval between samples; default 1 second; non-integer values such as 1.1 are accepted.
-p PID, --pid=PID: monitor a specific process/thread.
-u USER, --user=USER: monitor the I/O produced by a specific user.
-P, --processes: show processes only; by default iotop shows all threads.
-a, --accumulated: show accumulated I/O instead of bandwidth.
-k, --kilobytes: use kB units instead of human-friendly units. Useful for scripting in non-interactive mode.
-t, --time: add a timestamp; non-interactive mode.
-q, --quiet: suppress the header lines; non-interactive mode. Three levels:
-q: only show column names on the first sample
-qq: never show column names
-qqq: never show the I/O summary

Interactive keys: like the top command, iotop supports the following keys.

left/right arrows: change the sort column.
r: reverse the sort order.
o: toggle --only.
p: toggle --processes.
a: toggle --accumulated.
q: quit.
i: change a thread's I/O priority.

Command examples:

 # refresh every 2 seconds, output 5 times
 iotop  -d 2 -n 5

# non-interactive; output info for pid 1404 (here, 1404 is an nginx process)
 iotop -botq -p 1404

This shows each process's disk read and write rates.

View the I/O status of each disk:

  • kB_read/s: data read from the device per second;
  • kB_wrtn/s: data written to the device per second;
  • kB_read: total data read;
  • kB_wrtn: total data written; all units are kilobytes.

Check await and %util for each disk:

iostat -dx 1|awk '{print $1"\t"$10"\t"$11"\t"$12}'   

step4: Check the network

Network usage is also an important metric to monitor.

When bandwidth is insufficient, request response times increase sharply.

To absorb sudden bursts of concurrency, keep the server's bandwidth usage below 80%.

Note that the physical network card caps the maximum bandwidth the server can use.

Use nload to view the network

First install nload; on CentOS:

yum install nload -y

After the installation completes, run nload directly:

nload

nload then displays the current network usage.

The usage is split into data flowing into the network card and data flowing out of it.

  • Incoming: traffic flowing into the network card, i.e. downstream bandwidth
  • Outgoing: traffic flowing out of the network card, i.e. upstream bandwidth

If the current speed is close to the maximum speed, bandwidth usage is close to 100%.

Indicator descriptions:

  • Curr: current speed
  • Avg: average speed
  • Min: minimum speed
  • Max: maximum speed
  • Ttl: total traffic
Use iftop to view the network

If that is still not enough, use iftop.

Install iftop on CentOS with yum install iftop -y.

Common iftop options:

iftop -P (dynamically shows all connections with traffic, including port analysis)
-i: specify the network interface to monitor
-n: show hosts by IP, without DNS reverse resolution
-B: display traffic in bytes (the default is bits)
-p: run iftop in promiscuous mode; iftop can then act as a network sniffer
-N: show only port numbers, not the service names behind them
-P: show host and port information; very useful
-F: show the inbound and outbound traffic of a specific network segment
-m: set the maximum of the traffic scale at the top of the display; the scale is divided into five segments

A scale bar is displayed at the top of the interface and serves as the reference for the traffic bars.

# "TX": traffic sent from the network card
# "RX": traffic received by the network card
# "TOTAL": total traffic sent and received
# "cum": total traffic from when iftop started until now
# "peak": peak traffic
# "rates": average traffic over the last 2s, 10s, and 40s

iftop interactive keys:

Key      Meaning
P        pause/resume the display (Display unpaused/paused)
h        help (help / Display)
b        toggle the average-traffic scale bars (Bars on/off)
B        switch the bars between 2s/10s/40s averages (Bars show 2s/10s/40s average)
T        show/hide cumulative totals per connection (show / hide cumulative totals)
j/k      scroll down/up (vi-style hjkl)
l        screen filter (IP, hostname, or port; fuzzy match; ctrl+backspace to delete)
L        toggle logarithmic / linear scale (logarithmic scale / linear scale)
q        quit
n        toggle DNS resolution (DNS resolution off/on; depends on the hosts file)
s/d      show/hide source / destination host information
S/D      show/hide source / destination port information
t        cycle the display: received only, sent only, two lines per host, one line per host
N        toggle port numbers vs. service names (only well-known ports are recognized; port resolution on/off)
p        toggle full port display (port display off/on)
1/2/3    sort by the 2s / 10s / 40s average-traffic column (sort by col 1/2/3)
<        sort by source IP/hostname (sort by source)
>        sort by destination IP/hostname (sort by dest)
o        freeze/unfreeze the current connection display (order frozen/unfrozen)

The third trick: check for JVM thread deadlocks

Focus on whether threads are blocked at some position.

Use jstack -F pid to find thread deadlocks: export the thread stacks, then inspect the thread states.

1. View the thread states inside the service process

top -H -p pid

For example: top -H -p 1293

Or: ps -mp pid -o THREAD,tid,time

ps -mp  1293 -o THREAD,tid,time

2. Convert the abnormal thread ID to hexadecimal

printf "%x\n" nid

3. View the abnormal thread's stack information

jstack pid | grep <hex thread id>

For example:
jstack 1293 | grep 100
jmap -histo 1293|head -100

Export to a file:

jstack -l PID >> a.log

4. Analyze thread deadlock
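Before diving into a real dump, it helps to know what a deadlock looks like. The following minimal example (illustrative, not from the original case) makes two threads acquire two locks in opposite order; running jstack against it reports "Found one Java-level deadlock":

public class DeadlockDemo {

    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (LOCK_A) {
                sleep(100);                // give thread-2 time to grab LOCK_B
                synchronized (LOCK_B) { }  // blocks forever: thread-2 holds LOCK_B
            }
        }, "thread-1").start();

        new Thread(() -> {
            synchronized (LOCK_B) {
                sleep(100);                // give thread-1 time to grab LOCK_A
                synchronized (LOCK_A) { }  // blocks forever: thread-1 holds LOCK_A
            }
        }, "thread-2").start();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}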

Nien's tip: the steps for analyzing a thread deadlock are fairly involved. For details, see the "Tuning Bible" video.

Once the "Tuning Bible" PDF is finished, the companion video will be released.

The fourth trick: check JVM memory

  1. If there is plenty of free memory, you can rule out insufficient memory;
  2. If the garbage collection logs are normal, for both the young and old generations, you can basically rule out insufficient memory;
  3. If no object instances in memory occupy unusually large space, you can basically rule out insufficient memory;
  4. With insufficient memory ruled out, a memory leak may have occurred, and a leak investigation is needed.

So the core is to check for memory overflow and memory leaks.

Memory overflow and memory leak:

  • Memory overflow: the requested memory exceeds the maximum heap space.
  • Memory leak: heap memory is occupied by useless objects that are never released in time; the occupied memory accumulates and eventually ends in memory overflow.
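As a small illustration (not from the original post), the classic leak shape is a static collection that only grows: everything it holds stays reachable, so GC can never reclaim it.

import java.util.ArrayList;
import java.util.List;

public class LeakExample {

    // lives as long as the class itself, i.e. effectively forever
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void handleRequest() {
        // 1 MB added per request and never removed: a classic memory leak
        CACHE.add(new byte[1024 * 1024]);
    }
}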

Troubleshoot possible causes of memory overflow

Memory overflow happens when garbage objects pile up faster than the JVM can reclaim them. When this occurs, review the code for the following patterns (a small illustration of point (6) follows this list):

(1) Do classes and reference variables overuse static modifiers, e.g. public static Students students;?

In class attributes, it is best to apply static only to primitive types or strings, e.g. public static int i = 0; public static String str;

(2) Does the application use heavy or infinite recursion (creating many new objects inside the recursion)?

(3) Does the application use large or infinite loops (creating many new objects inside the loop)?

(4) Does the application query all records from a database table at once? Loading everything in one shot with more than 100,000 rows can cause memory overflow by itself, so use paginated queries instead.

(5) Are there arrays, Lists, or Maps holding object references for a long time? Such references keep the corresponding objects from being released and pile them up in memory.

(6) Is "+" used to concatenate non-literal strings?

Because String is immutable, every "+" produces a new object. Too many temporary String objects accumulate faster than the JVM can reclaim them and can cause memory overflow.
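For point (6), compare the two loops below (illustrative, not from the original post): concatenating with "+" allocates a new String on every iteration, while a StringBuilder reuses one growing buffer.

// creates on the order of 100,000 intermediate String objects
String s = "";
for (int i = 0; i < 100_000; i++) {
    s += i;
}

// allocates essentially one growing buffer
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 100_000; i++) {
    sb.append(i);
}
String result = sb.toString();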

Four memory overflow scenarios

  1. Stack overflow (StackOverflowError)
  2. Heap overflow (OutOfMemoryError: Java heap space)
  3. Permanent generation overflow (OutOfMemoryError: PermGen space)
  4. Direct memory overflow
1. Heap overflow

If no heap memory can be allocated when creating an object, the JVM throws OutOfMemoryError: Java heap space.

Example of heap overflow:

package com.ghs.test;

import java.util.ArrayList;
import java.util.List;

/**
 * VM Args: -Xms20m -Xmx20m -XX:+HeapDumpOnOutOfMemoryError
 */
public class OOMTest {

    public static void main(String[] args) {
        List<byte[]> list = new ArrayList<>();
        int i = 0;
        while (true) {
            // keep a reference to every 5 MB block so GC cannot reclaim any of them
            list.add(new byte[5 * 1024 * 1024]);
            System.out.println("Allocation count: " + (++i));
        }
    }
}

Run result:

Allocation count: 1
Allocation count: 2
Allocation count: 3

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid2464.hprof ...
Heap dump file created [16991068 bytes in 0.047 secs]

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.ghs.test.OOMTest.main(OOMTest.java:16)

Note: the heap dump file is generated in the project root directory.

As the output shows, the memory overflow occurred on the fourth allocation.

2. Stack overflow

When stack space is insufficient, two cases must be handled:

  • If the stack depth requested by a thread exceeds the maximum the virtual machine allows, a StackOverflowError is thrown
  • If the virtual machine cannot obtain enough memory when extending the stack, an OutOfMemoryError is thrown

Note: most current virtual machine stacks are dynamically extensible.

Here is a case

public class StackSOFTest {

    int depth = 0;

    public void sofMethod() {
        depth++;
        sofMethod(); // unbounded recursion: each call adds a stack frame
    }

    public static void main(String[] args) {
        StackSOFTest test = null;
        try {
            test = new StackSOFTest();
            test.sofMethod();
        } finally {
            System.out.println("Recursion depth: " + test.depth);
        }
    }
}

Run result:

Recursion depth: 982
Exception in thread "main" java.lang.StackOverflowError
    at com.ghs.test.StackSOFTest.sofMethod(StackSOFTest.java:8)
    at com.ghs.test.StackSOFTest.sofMethod(StackSOFTest.java:9)
    at com.ghs.test.StackSOFTest.sofMethod(StackSOFTest.java:9)
...subsequent stack frames omitted
3. Method area overflow

The method area stores Class metadata. Here we use CGLib to manipulate bytecode directly and generate a large number of dynamic classes:

import java.lang.reflect.Method;

import net.sf.cglib.proxy.Enhancer;
import net.sf.cglib.proxy.MethodInterceptor;
import net.sf.cglib.proxy.MethodProxy;

public class MethodAreaOOMTest {

    public static void main(String[] args) {
        int i = 0;
        try {
            while (true) {
                // each Enhancer.create() generates a new class in the method area
                Enhancer enhancer = new Enhancer();
                enhancer.setSuperclass(OOMObject.class);
                enhancer.setUseCache(false);
                enhancer.setCallback(new MethodInterceptor() {
                    @Override
                    public Object intercept(Object obj, Method method, Object[] args, MethodProxy proxy) throws Throwable {
                        return proxy.invokeSuper(obj, args);
                    }
                });
                enhancer.create();
                i++;
            }
        } finally {
            System.out.println("Run count: " + i);
        }
    }

    static class OOMObject {
    }
}

Run result:

Run count: 56
Exception in thread "main" 
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "main"
4. Direct memory overflow

Direct memory capacity can be specified with -XX:MaxDirectMemorySize.

If it is not specified, it defaults to the same value as the maximum Java heap (set by -Xmx).

How can a direct memory overflow be simulated? NIO uses direct memory, so it could be simulated through NIO.

In the following example, NIO is skipped and Unsafe is used directly to allocate direct memory.

import java.lang.reflect.Field;

import sun.misc.Unsafe;

public class DirectMemoryOOMTest {

    /**
     * VM Args: -Xms20m -Xmx20m -XX:MaxDirectMemorySize=10m
     */
    public static void main(String[] args) {
        int i = 0;
        try {
            // grab the Unsafe singleton via reflection (its field is private)
            Field field = Unsafe.class.getDeclaredFields()[0];
            field.setAccessible(true);
            Unsafe unsafe = (Unsafe) field.get(null);
            while (true) {
                // allocate 1 MB of direct (off-heap) memory per iteration
                unsafe.allocateMemory(1024 * 1024);
                i++;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.out.println("Allocation count: " + i);
        }
    }
}

Run result:

Exception in thread "main" java.lang.OutOfMemoryError
    at sun.misc.Unsafe.allocateMemory(Native Method)
    at com.ghs.test.DirectMemoryOOMTest.main(DirectMemoryOOMTest.java:20)
Allocation count: 27953

Memory overflow summary:

  • Stack overflow: the program requires more stack depth than is available.
  • Heap overflow: distinguish a memory leak from insufficient capacity. For a leak, examine how the objects are referenced from GC Roots; for insufficient capacity, increase the -Xms and -Xmx parameters.
  • Permanent generation overflow: Class objects are not released, Class objects take too much space, or there are too many classes.
  • Direct memory overflow: look for the places in the system that use direct memory.

Troubleshooting memory overflow and memory leaks

The following tools are typically used.

1. Use jstat to check memory usage and garbage collection, and look for anomalies in either;

jstat is a command-line tool shipped with the JDK, mainly used to print JVM performance statistics.

It covers: garbage collection (GC) data, compilation data, and class loading information.

jstat's biggest advantage is that it captures these numbers in real time while the JVM is running.

With jstack we can also generate a thread snapshot of the virtual machine at the current moment, a collection of the method stacks being executed by every thread, used to locate the causes of long pauses such as deadlocks, infinite loops, and long waits on external resources.

2. Use jmap -heap to check the memory allocation and see whether the memory space is full, preventing new allocations;

Sometimes thread snapshots alone are not enough.

To observe object instances, use the jmap command.

jmap shows memory information: the number of instances per class and the memory they occupy.

jmap -histo[:live] prints each class's instance count, memory usage, and fully-qualified class name.

Note: JVM-internal class names are prefixed with "*". With the live sub-parameter, only live objects are counted.
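For orientation, the shape of jmap -histo output looks like this (the numbers are illustrative, and com.example.demo.Order is a hypothetical class; [B is the JVM's notation for byte[]):

 num     #instances         #bytes  class name
----------------------------------------------
   1:        250000      200000000  [B
   2:        120000        5760000  java.lang.String
   3:         30000        2640000  com.example.demo.Order

Classes from your own packages near the top of this list deserve the most attention.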

3. Use the GC log to find the cause of collections; this requires the service to have been started with GC logging enabled;

4. Arthas online investigation

Approximate steps:

step1: use jps to get the process ID
  • Use jps to find the target process's local virtual machine ID (LVMID), because the subsequent steps need it to determine which JVM process to monitor.
  • The command format is:
jps [ options ] [ hostid ]

jps -l is the most common form.

step2: use jstat to view GC statistics
  • View the percentage of each memory space that is used.
  • Command format: jstat [ option vmid [interval[s|ms] [count]] ]

Example:

jstat -gcutil 20954 1000
  • -gcutil shows each space's usage as a percentage of its capacity.
  • 20954 is the pid.
  • 1000 means: sample every 1000 milliseconds, indefinitely.
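The output of the command above, reconstructed here from the numbers discussed next (the Metaspace columns were not given in the original), looks roughly like:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
 64.38   0.00  97.14  29.07     -      -     32    1.016     3    0.400    1.416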

The young generation's Eden space (E, for Eden) is 97.14% used,

the two Survivor spaces (S0, S1, for Survivor0 and Survivor1) are 64.38% and 0% used respectively,

and the old generation (O, for Old) is 29.07% used.

Minor GC (YGC, Young GC) has occurred 32 times since the program started, totaling 1.016 seconds (YGCT).

Full GC (FGC) has occurred 3 times, with total Full GC time 0.4 seconds (FGCT).

Total GC time (GCT) is 1.416 seconds.

step3: analyze memory leaks and large objects

There are usually two approaches:

  • Approach 1: lightweight online analysis:

Step 1: use jmap to view the top N objects occupying the most memory in the process.

Step 2: knowing which objects consume the memory, locating the code is not difficult.

  • Approach 2: heavyweight offline analysis:

Use jmap, the "Java memory imaging tool", to generate a heap dump snapshot (commonly called a heap dump or dump file).

Dump the heap and analyze it with a tool such as MAT; but dumping takes a long time, the file is huge, and it then has to be dragged from the server back to a local machine and imported into the tool.

Approach 1: lightweight online analysis of memory leaks and large objects:

Step 1: use jmap to view the top N objects occupying the most memory in the process.

jmap command format: jmap [ option ] vmid

Use it like this: jmap -histo:live 1293

To view the top 20 objects occupying the most memory in the process:

jmap -histo pid|head -N

For example:
jmap -histo 1293|head -20

Large objects belonging to your own project's classes will show up here; the fourth column is the class name.

Step 2: use Arthas to locate the code where the large object lives

Arthas is a Java diagnostic tool open-sourced by Alibaba in September 2018. It supports JDK 6+, runs in an interactive command-line mode with Tab auto-completion, and makes it easy to locate and diagnose problems in running programs online. As of this writing, it has earned 17,000+ GitHub stars.

Arthas's official documentation is very detailed, and this article draws on it. The Issues of the open-source GitHub project contain not only bug reports but also a large number of use cases worth studying.

Open source address: https://github.com/alibaba/arthas

Official documentation: https://alibaba.github.io/arthas

Thanks to Arthas's powerful and rich feature set, what it can do exceeds imagination.

The following are just a few common usage scenarios; once familiar with Arthas, you can explore more on your own.

  1. Is there a global view of the system's health?
  2. Why has CPU usage risen again? Where is the CPU being spent?
  3. Are concurrent threads deadlocked? Is anything blocked?
  4. A request takes a long time to run: where does the time go? How do you monitor it?
  5. Which jar was this class loaded from? Why are all kinds of class-related Exceptions thrown?
  6. Why is the code I changed not executing? Did I forget to commit? Did I branch wrong?
  7. When a problem cannot be debugged online, can you only add logs and republish?
  8. Is there a way to monitor the JVM's real-time running state?

Here are some common Arthas commands:

Command       Description
dashboard     real-time data panel for the current system
thread        view the thread stack information of the current JVM
watch         observe a method's execution data
trace         trace a method's internal call path and report the time spent at each node
stack         output the call paths by which the current method is invoked
tt            "time tunnel" of method executions: records the input parameters and return value of each call of a given method, so calls can be examined across time
monitor       monitor method execution
jvm           view current JVM information
vmoption      view and update JVM diagnostic options
sc            view class information loaded by the JVM
sm            view method information of loaded classes
jad           decompile the source of a loaded class
classloader   view the classloader inheritance tree, URLs, and class loading info
heapdump      heap dump, similar to the jmap command
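As a quick taste, a typical Arthas session might look like this (the commands are from the official documentation; the traced class and method names are hypothetical):

# download arthas-boot and attach to a Java process chosen from the list
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar

# inside the Arthas console:
dashboard                                     # overall health panel
thread -n 3                                   # the 3 busiest threads
thread -b                                     # the thread currently blocking others
trace com.example.demo.UserService getUser    # time each node on the call path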

Nien's tip: the steps for using Arthas to locate the code behind a large object are fairly involved.

For details, see the "Tuning Bible" video.

Once the "Tuning Bible" PDF is finished, the companion video will be released.

Approach 2: heavyweight offline analysis of memory leaks and large objects:

Use jmap to dump the heap, then analyze the dump with MAT, JProfiler, or JVisualVM.

jmap -dump:live,format=b,file=/dump.bin pid

For example:
jmap -dump:live,format=b,file=/dump.bin 1293

Nien's tip: after exporting with jmap, analyzing and locating the large object's code with JVisualVM also involves fairly cumbersome steps.

For details on using jmap and JVisualVM to locate the large object's code, see the "Tuning Bible" video.

Once the "Tuning Bible" PDF is finished, the companion video will be released.

The fifth trick: check the operating system configuration

  • the limit on open file handles
  • the limit on memory maps
  • the limit on threads

and so on.

Online case 1: OOM causes suspended animation

This case comes from the Internet; the original is here:

https://blog.csdn.net/weixin_42130396/article/details/126020231

Phenomenon

In a production accident, querying too much data from the database at once caused a thread OOM: a Java heap space exception.

Note the data scale: the table held 150,000 rows, and the JVM heap was 2 GB.

The online behavior: when the thread OOMed, the Java process did not die immediately.

Here is the catch: we usually say that when an OOM occurs, the program dies. What happened here: an OOM occurred, and the JVM did not die.

Special note: the JVM does not necessarily exit on OOM.

Nien wrote a detailed article about this:

Meituan interview question: after an OOM, will the JVM definitely exit? Why?

Accident scene

Monitoring found that one of the online services had hung: the process was still there, but SpringBoot was in suspended animation.

Logging into the server and checking the logs revealed java.lang.OutOfMemoryError: GC overhead limit exceeded.

The error means: the JVM is spending most of its time on GC with almost nothing reclaimed (lots of large objects being created and no garbage collected).

The exception log showed the OOM was triggered by hibernate executing repository.findAll().

Identifying the problem

Export the hprof file to a local machine and analyze it with jvisualvm, which ships with the JDK.

First, look at what is big.

This time we were not lucky enough to find useful clues just by clicking through to "the thread that caused the OutOfMemoryError", so let's see which objects are big.

You can see that list.size() is actually 159,482. Expand it to see what objects the list stores.

Expand metadata -> fields -> fullName (or tableName and columnName) and look for familiar tables and fields; this locates which table's data was loaded in full.

So far, we have established that JPA queried all the data in the department table.

Combined with the scheduled task running at that time, the code was finally located:

List<SysDepartmentMainData> departments = departmentMainDataRepository.findAll();

But here comes the question: how could there be more than 150,000 department rows?

In this project, the department table is maintained by a scheduled task: every day the table is cleared and then fully re-populated from the company master data.

The method is transactional, so there should never be a state where the table was cleared but the master-data call failed and left it empty.

The company master data contains only a little over 4,000 departments, so where did 150,000 rows come from?

That meant examining the code that writes the table, and there the problem was found; the cause and the final fix follow.
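The original post's final fix is not reproduced above. As a minimal sketch (assuming a Spring Data JPA repository, with process() as a hypothetical handler), replacing the one-shot findAll() with paged reads keeps the heap bounded:

// org.springframework.data.domain.Page / PageRequest
int page = 0;
Page<SysDepartmentMainData> batch;
do {
    batch = departmentMainDataRepository.findAll(PageRequest.of(page++, 1000));
    process(batch.getContent()); // handle one bounded batch at a time
} while (batch.hasNext());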

Online case 2: Insufficient operating system threads leads to suspended animation

Phenomenon

Monitoring found that one of the online services had hung: the process was still there, but SpringBoot was in suspended animation.

However, the operating system looked normal:

CPU, memory, I/O, etc. were all fine.

Troubleshooting process

a. Memory status query

jstat -gcutil pid

No problem.

b. Memory snapshot export

sudo -u wwwroot `jmap -dump:live,format=b,file=heap001 pid`

If the command above fails to export, run:

sudo -u wwwroot `jmap -F -dump:live,format=b,file=heap001 pid`

No problem.

c. Stack information export

sudo -u wwwroot `jstack pid > aaa.txt`

d. View memory information

free

No problem.

e. ulimit check

Note that limits differ per user.

Here there was a problem:

not enough threads.

Solution

1. Clear page cache

echo 3 > /proc/sys/vm/drop_caches

2. Adjust the ulimit configuration

The symptoms of an unreasonable limits configuration are:

1. Requests cannot enter the program

2. No error logs

3. Thread count and database connection count are not high

4. The program process looks normal

1. View the current limits: ulimit -a
2. Edit /etc/security/limits.conf and add or modify the configuration (it may already exist). After saving, close the session (e.g. putty) and reconnect, then verify with ulimit -a that it took effect:

* soft nproc 35535
* hard nproc 65535
* soft nofile 35535
* hard nofile 65535

3. If the change above does not take effect, check or edit /etc/security/limits.d/20-nproc.conf with the same content.
4. If neither step 2 nor step 3 takes effect, contact ops to investigate.

Online case 3: Slow request processing leads to suspended animation

Background:

While the project was running, it suddenly could not handle new requests.

Incoming TPS was in the tens of thousands, and on every machine it suddenly dropped to 0, with great impact on the online business!

We immediately checked the service's running state: it was still running, so this counts as SpringBoot suspended animation.

Troubleshooting process:

Step 1: check logs, system resources, and JVM memory

No problem.

Step 2: check the status of the service's port.

Use netstat -anp | grep <port> to view the port's details.

As shown below, the output includes protocol type, source IP+port, TCP state, and so on. Here we mainly care about the number of TCP connections and the state of each one.

 netstat -an |grep 80  
tcp4       0      0  192.168.31.24.51380    120.253.255.102.443    ESTABLISHED
tcp4       0      0  192.168.31.24.50935    192.30.71.80.443       ESTABLISHED
tcp4      31      0  192.168.31.24.50863    3.80.20.154.443        CLOSE_WAIT 
tcp6       0      0  fe80::b7ca:330b:.1028  fe80::5059:ee74:.1025  ESTABLISHED
tcp6       0      0  fe80::b7ca:330b:.1024  fe80::5059:ee74:.1024  ESTABLISHED

You can also grep specific TCP state codes to count the connections in each state on the port.

netstat -an |grep 80 |grep CLOSE_WAIT
tcp4      31      0  192.168.31.24.50863    3.80.20.154.443        CLOSE_WAIT 
tcp4      31      0  192.168.31.24.50855    3.80.20.154.443        CLOSE_WAIT 
tcp4      31      0  192.168.31.24.50854    3.80.20.154.443        CLOSE_WAIT 
tcp4      31      0  192.168.31.24.50853    3.80.20.154.443        CLOSE_WAIT 
tcp4      31      0  192.168.31.24.50852    3.80.20.154.443        CLOSE_WAIT

In addition, ss -lnr | grep 80 shows the Send-Q and Recv-Q of the port,

which also reflects whether the port's TCP is backed up.

Step 3: the TCP connection queue is full, causing suspended animation

From the port's TCP state in the previous step, the conclusion was: TCP connections were not being released correctly, the port's TCP connection queue filled up, subsequent requests could not enter the queue, and Tomcat froze!

Finally, with the failure mode identified, go back into the service to find the cause.

It may be that the corresponding HTTP requests never respond, or that the service processes requests far more slowly than they arrive, so the connection queue gradually fills up at runtime.

Solutions:

  1. In such a situation, first try restarting the service; it may hold up for a while before slipping back into suspended animation. At the same time, throttle the upstream side of the business to reduce or even stop incoming requests and minimize business loss.
  2. Roll back the version. Such problems usually follow a new release, so roll back to a known-good version first.
  3. Identify the root cause and fix it specifically.

Online case 4: Log output to the console leads to suspended animation

The following is a stuck file download; in essence, another case of online suspended animation:

Problem: file download stuck

On a large file-download cluster (more than 100 instances), a user told us one day:

"Why has a file been downloading all day without finishing?"

Judging from the logs and the downloaded file, the download was not merely slow: the program was stuck, i.e. in suspended animation.

Analysis process

  1. Analyzing the logs and tracing the program showed nothing wrong with the code, and the logs could not explain it
  2. jstack -l pid for a stack trace; part of the stack information follows:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode):

"Attach Listener" #186 daemon prio=9 os_prio=0 tid=0x00007fbf98001000 nid=0x28e8 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"pool-47-thread-4" #185 prio=5 os_prio=0 tid=0x00007fbf4c44c800 nid=0x2b76 runnable [0x00007fbf348e3000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:326)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        - locked <0x0000000080223058> (a java.io.BufferedOutputStream)
        at java.io.PrintStream.write(PrintStream.java:480)
        - locked <0x0000000080223038> (a java.io.PrintStream)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at ch.qos.logback.core.joran.spi.ConsoleTarget$1.write(ConsoleTarget.java:37)
        at ch.qos.logback.core.encoder.LayoutWrappingEncoder.doEncode(LayoutWrappingEncoder.java:131)
        at ch.qos.logback.core.OutputStreamAppender.writeOut(OutputStreamAppender.java:187)
        at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:212)
        at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:100)
        at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
        at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:48)
        at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
        at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.info(Logger.java:591)
        at com.baidu.adb.client.download.AbstractPartitionDownloader$Worker.run(AbstractPartitionDownloader.java:99)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - <0x0000000587fa71f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x000000008021cbe8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
 
  "pool-47-thread-3" #184 prio=5 os_prio=0 tid=0x00007fbf4c44b800 nid=0x2b74 waiting on condition [0x00007fbf35af5000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000008021cbe8> (a java.util.concurrent.locks.ReentrantLock$FairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
  ....

As the stack shows, pool-47-thread-4 is RUNNABLE and holds 2 locks:

Locked ownable synchronizers:
        - <0x0000000587fa71f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x000000008021cbe8> (a java.util.concurrent.locks.ReentrantLock$FairSync)

pool-47-thread-3 and pool-47-thread-2 are in WAITING state, both waiting on condition, and both waiting for 0x000000008021cbe8: a deadlock.

Look again at pool-47-thread-4's stack: at ch.qos.logback.core.joran.spi.ConsoleTarget$1.write(ConsoleTarget.java:37). In other words, logback has not released the lock, and it is stuck in the console appender.

Checking the logback configuration confirms that one of the logger's appenders is the console appender (see logback.xml).

Solution

If the logback worker thread cannot make progress, it must be because its blocking queue is full.

The worker thread is waiting to enqueue its log entry; because it is waiting, we know it released the lock on the queue, but it still holds the lock on the print stream.

The console-writing thread is blocked trying to acquire the lock on the print stream, so they are deadlocked.

The minimal fix, avoiding code changes, is to swap the console appender for one that does not need to acquire a lock on the print stream.
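For instance, a minimal logback.xml along those lines (file appender only; the paths and pattern are illustrative):

<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/app.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/app.%d{yyyy-MM-dd}.log</fileNamePattern>
      <maxHistory>7</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <!-- no ConsoleAppender here: a background service should not write to the console -->
    <appender-ref ref="FILE"/>
  </root>
</configuration>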

Conclusion:

The program prints output destined for the terminal, but once it runs in the background, that output can no longer reach a terminal.

After too much is printed, the buffer fills up and the program stops making progress, causing suspended animation.

Solutions:

  1. A program running in the background should not print output to the terminal.
     Since this program uses log4j, disable the Console output configured in log4j.xml.
  2. Or use nohup to append the printed output to the nohup.out file; note that over time this produces a very large file.

Online case 5: Server-side CLOSE_WAIT online suspended animation

The following is a real case of online suspended animation:

A Spring Boot project exposes REST interfaces; every so often, calls hit 500 errors and connection timeouts.

Identifying the problem

After ruling out memory, CPU, and full-disk scenarios, we stumbled on the fact that this machine had a large number of network connections in CLOSE_WAIT.

Observe the CLOSE_WAIT situation and count the close_wait connections:

lsof -i:8091 |grep "CLOSE_WAIT"  |wc -l

The number of CLOSE_WAIT connections was 67 and growing.
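To see the distribution of all TCP states at a glance, a common one-liner (assuming standard Linux netstat output, where the state is the 6th column):

# count connections per TCP state, most frequent first
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn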

Why were there so many connections in CLOSE_WAIT? To answer that, we need the four-way handshake of TCP connection teardown.

Since TCP connections are full-duplex, each direction must be closed separately. Assume the close is initiated by the client.

When the client finishes transferring data or needs to disconnect:

  1. The client sends a FIN segment to the server (sequence number M).
    1.1 By calling the close(socket) API, the client terminates the client-to-server direction of the connection.
    1.2 It means the client will send no more data to the server (but the server can still send to the client).
    1.3 The client's state becomes FIN_WAIT_1.
  2. After receiving the FIN, the server sends an ACK back to the client (sequence number M+1).
    2.1 The server's state becomes CLOSE_WAIT.
    2.2 On receiving the ACK (M+1), the client's state becomes FIN_WAIT_2.
  3. The server in turn sends a FIN to the client (sequence number N),
    3.1 meaning the server also terminates its direction of the connection,
    3.2 by calling the close(socket) API.
    3.3 The server's state becomes LAST_ACK.
  4. After the client receives the FIN, it sends an ACK to the server (sequence number N+1).
    4.1 The client's state becomes TIME_WAIT.
  5. The server receives the ACK (N+1).
    5.1 The server's state becomes CLOSED.
  6. After waiting 2MSL,
    6.1 the client's state also becomes CLOSED.
  7. At this point, a complete TCP connection has been closed.

Two basic questions:

  1. Q: When does the server's CLOSE_WAIT appear?
    A: After the server receives the client's FIN.
  2. Q: When does CLOSE_WAIT transition to the next state?
    A: After the server sends its FIN to the client.

Back to the earlier question: why were there so many CLOSE_WAIT connections?

If the server never sends its FIN to the client (never calls the close() API), the CLOSE_WAIT state persists forever.

Cause analysis:

From the above, CLOSE_WAIT means the server side never initiated close(), which almost always points to a problem in the server program. Normally the server waits for client requests; when a client exits and asks to close the connection, the server should close() the corresponding connection in due course. CLOSE_WAIT is the window between the server returning the ACK and the server actually calling close().

Ordinarily this window is very short, far too short to stop the whole application from responding, with one exception: at some moment, concurrency reaches Tomcat's request-handling limit, the entire request-handling thread pool is fully occupied, and every thread is waiting to close.

The online SpringBoot application above fits exactly this scenario.

Troubleshooting

The troubleshooting approach: dump the thread information with jstack and see where HTTP handling is blocked.

jps

jstack $PID > stack.txt

First find the process pid, then export the thread information with the jstack command.

In stack.txt, search for BLOCKED: a large number of threads turn out to be blocked.

Search the file contents:
grep BLOCKED  stack.txt

Search for Tomcat NIO threads that are BLOCKED:

# -C 3 shows each match plus 3 lines of context
grep -C 3 "BLOCKED" stack.txt | grep "http-nio-exec"

http-nio-xx-exec- is the name prefix of the threads with which Tomcat handles requests. Most of them were blocked: this is the root cause of the suspended animation.

The stack in the thread dump also tells us exactly which line of code the blocking occurs on.

Following the code: sure enough, it was calling two HTTP addresses that could no longer be reached.
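The problematic code itself is not shown in the original. As a hedged sketch of the usual hardening (here with Spring's RestTemplate; the timeout values are illustrative), set explicit timeouts so a dead endpoint fails fast instead of pinning a Tomcat thread:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.client.SimpleClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

@Configuration
public class RestClientConfig {

    @Bean
    public RestTemplate restTemplate() {
        SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
        factory.setConnectTimeout(2000); // ms: give up quickly on a dead endpoint
        factory.setReadTimeout(5000);    // ms: bound how long a response may take
        return new RestTemplate(factory);
    }
}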

After fixing the code and releasing it to the development environment, keep observing CLOSE_WAIT and count the connections:

lsof -i:8091 |grep "CLOSE_WAIT"  |wc -l

Before the fix it was 67 and growing.

After the fix it dropped to 0.

OK: the online close_wait problem is solved, and with it the SpringBoot suspended animation.

SpringBoot online service in suspended animation with normal CPU and memory: the solution

Troubleshooting

Same old story: in a cluster environment, several nodes of one service stop responding; if this is not resolved in time, it may snowball into an avalanche.

First check the service logs for errors, and habitually check the service's CPU and memory status.

To recap: the service reported no errors, and CPU and memory were normal, so we troubleshoot with the steps below.

Troubleshooting steps

1. Information gathering and analysis

Because the service health check got no response while CPU and memory were normal, go straight to the stack to see what the threads are doing:

jstack -l PID >> a.log

In jstack output, Java threads are mainly in the following states:

RUNNABLE: the thread is running or waiting on I/O

BLOCKED: the thread is waiting for a monitor lock (synchronized keyword)

TIMED_WAITING: the thread is waiting to be woken, with a time limit

WAITING: the thread is waiting indefinitely to be woken

Here, the threads were found to be all WAITING.

"http-nio-8888-exec-6666" #8833 daemon prio=5 os_prio=0 tid=0x00001f2f0016e100 nid=0x667d waiting on condition [0x00002f1de3c5200]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000007156a29c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at com.alibaba.druid.pool.DruidDataSource.takeLast(DruidDataSource.java:1897)
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1458)
....
at org.apache.ibatis.executor.CachingExecutor.query(CachingExecutor.java:109)
at com.github.pagehelper.PageInterceptor.intercept(PageInterceptor.java:143)
at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
at com.sun.proxy.$Proxy571.query(Unknown Source)

2. Locate key information and track source code

The key information is as follows:

  at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
  at com.alibaba.druid.pool.DruidDataSource.takeLast(DruidDataSource.java:1897)  

The DruidDataSource.takeLast method:

DruidConnectionHolder takeLast() throws InterruptedException, SQLException {
    try {
        while (poolingCount == 0) {
            emptySignal(); // send signal to CreateThread to create a connection

            if (failFast && isFailContinuous()) {
                throw new DataSourceNotAvailableException(createError);
            }

            notEmptyWaitThreadCount++;
            if (notEmptyWaitThreadCount > notEmptyWaitThreadPeak) {
                notEmptyWaitThreadPeak = notEmptyWaitThreadCount;
            }
            try {
                // none of the pooled connections has been released, all are occupied;
                // with no available connection, the requesting thread parks here
                notEmpty.await(); // signaled by recycle or creator
            } finally {
                notEmptyWaitThreadCount--;
            }
            notEmptyWaitCount++;

            if (!enable) {
                connectErrorCountUpdater.incrementAndGet(this);
                throw new DataSourceDisableException();
            }
        }
    } catch (InterruptedException ie) {
        ...
    }
    ...
    return last;
}

The core lines in the source are:

// none of the pooled connections has been released, all are occupied;
// with no available connection, the requesting thread parks here
notEmpty.await(); // signaled by recycle or creator

When obtaining a connection from the Druid pool, if none of the pooled connections has been released and all are occupied, there is no available connection, and the requesting thread blocks in notEmpty.await().
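This is also why the maxWait parameter matters: with a positive maxWait, Druid waits with a deadline instead of parking indefinitely, so an exhausted pool surfaces as a timeout exception rather than a hung thread. A minimal reproduction sketch (the JDBC URL, credentials, and sizes are placeholders):

import com.alibaba.druid.pool.DruidDataSource;
import java.sql.Connection;

public class PoolExhaustionDemo {

    public static void main(String[] args) throws Exception {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/test"); // placeholder URL
        ds.setUsername("root");                        // placeholder credentials
        ds.setPassword("secret");
        ds.setMaxActive(1);   // a 1-connection pool, to exhaust it immediately
        ds.setMaxWait(2_000); // wait at most 2s instead of parking forever

        Connection held = ds.getConnection(); // borrow the only connection
        try {
            ds.getConnection(); // blocks ~2s, then fails with a timeout exception
        } catch (Exception e) {
            System.out.println("second borrow failed: " + e);
        } finally {
            held.close(); // return the connection to the pool
            ds.close();
        }
    }
}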

Combining the stack with the application log, the problem code was located: when an error occurred, the borrowed connection was never released, so other threads stayed stuck in await.
The problem code is as follows:

try {
    SqlSession sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH);
    TestMapper mapper = sqlSession.getMapper(TestMapper.class);
    mapper.insetList(list);
    sqlSession.flushStatements();
} catch (Exception e) {
    // BUG: sqlSession is never closed, neither here nor on the success path,
    // so its connection is never returned to the Druid pool
    e.printStackTrace();
}

Reproducing the problem

Based on the information above, the problem was reproduced in the multi-active environment: the monitoring check stops responding because all the worker threads are occupied and waiting.

The tomcat thread pool is full.

Tomcat default parameters:

# Maximum number of worker threads, default 200
server.tomcat.max-threads=200
# Maximum number of connections, default 10000
server.tomcat.max-connections=10000
# Length of the accept (waiting) queue, default 100
server.tomcat.accept-count=100
# Minimum number of spare worker threads, default 10
server.tomcat.min-spare-threads=10

The configuration parameters of the Druid connection pool are as follows:

| Attribute | Description | Suggested value |
| --- | --- | --- |
| username | Username used to log in to the database | |
| password | Password used to log in to the database | |
| initialSize | Default 0; number of connections created in the pool when the program starts | 10-50 is enough |
| maxActive | Default 8; maximum number of active connections the pool supports | |
| maxWait | Default -1; when the program requests a connection from the pool and the wait exceeds maxWait, the request is considered failed (the pool has no available connection), in milliseconds; -1 means wait forever | 100 |
| minEvictableIdleTimeMillis | Once a pooled connection has been idle for this many milliseconds, the pool reclaims it on the next idle-connection check; should be smaller than the firewall timeout net.netfilter.nf_conntrack_tcp_timeout_established | see description |
| timeBetweenEvictionRunsMillis | How often idle connections are checked, in milliseconds; a non-positive value disables the check | |
| keepAlive | If the program has not closed a connection and its idle time exceeds minEvictableIdleTimeMillis, execute the SQL in validationQuery to keep it alive so the pool does not kill it; applies to at most minIdle connections | true |
| minIdle | Default 8; when reclaiming idle connections, keep at least minIdle of them | same as initialSize |
| removeAbandoned | Require the program to close a connection within removeAbandonedTimeout seconds of borrowing it, otherwise Druid forcibly reclaims it whether active or idle, to stop leaked connections | false; set to true when connections are found not being closed |
| removeAbandonedTimeout | The time limit, in seconds, after which Druid forcibly reclaims a borrowed connection | should be greater than the longest business operation |
| logAbandoned | Whether to log the stack trace when Druid forcibly reclaims a connection | true |
| testWhileIdle | When the program requests a connection, first check whether it is still valid before handing it out (efficient) | true |
| validationQuery | SQL used to check whether a pooled connection is still usable; Druid executes it against the database, and a normal return means the connection is valid | |
| testOnBorrow | Check connection validity every time a connection is borrowed (inefficient, hurts performance) | false |
| testOnReturn | Check connection validity every time a connection is returned (inefficient, hurts performance) | false |
| poolPreparedStatements | Cache SQL issued via prepareStatement(String sql) and prepareStatement(String sql, int resultSetType, int resultSetConcurrency) | true |
| maxPoolPreparedStatementPerConnectionSize | Maximum number of cached SQL statements per connection | 20 |
| filters | Plugins; common ones are monitoring/statistics (filter:stat), logging (filter:log4j or slf4j), and SQL-injection defense (filter:wall) | stat,wall,slf4j |
| connectProperties | Connection properties, e.g. statistics settings such as druid.stat.mergeSql=true;druid.stat.slowSqlMillis=5000 | |

Solution

1. Configure timeout parameters for the Druid connection pool

spring: 
  redis:
    host: localhost
    port: 6379
    password: 
  datasource:
    druid:
      stat-view-servlet:
        enabled: true
        loginUsername: admin
        loginPassword: 123456
    dynamic:
      druid:
        initial-size: 10
        min-idle: 5
        maxActive: 100
        maxWait: 60000
        timeBetweenEvictionRunsMillis: 60000
        minEvictableIdleTimeMillis: 300000
        validationQuery: SELECT 1 FROM DUAL
        testWhileIdle: true
        testOnBorrow: false
        testOnReturn: false
        poolPreparedStatements: true
        maxPoolPreparedStatementPerConnectionSize: 20
        filters: stat,slf4j,wall
        connectionProperties: druid.stat.mergeSql=true;druid.stat.slowSqlMillis=5000

2. Close the connection promptly on exceptions

When an exception occurs, close the connection promptly so it goes back to the pool; closing in a finally block guarantees this on the success path as well:

SqlSession sqlSession = null;
try {
    sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH);
    TestMapper mapper = sqlSession.getMapper(TestMapper.class);
    mapper.insetList(list);
    sqlSession.flushStatements();
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (sqlSession != null) {
        sqlSession.close(); // always close, so the connection is returned to the pool
    }
}
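Since SqlSession implements Closeable, the same guarantee can be written more concisely with try-with-resources (a sketch equivalent to the fix above; here an exception propagates to the caller instead of being swallowed):

// try-with-resources closes the session on success and failure alike,
// so the connection always goes back to the Druid pool
try (SqlSession sqlSession = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
    TestMapper mapper = sqlSession.getMapper(TestMapper.class);
    mapper.insetList(list);
    sqlSession.flushStatements();
}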

SpringBoot suspended-animation detection and self-healing

  • First, configure SpringBoot to exit automatically when an OOM occurs
  • Second, configure suspended-animation detection and self-healing

Exit SpringBoot automatically when an OOM occurs

A problem encountered when developing applications with springboot: when an out-of-memory error occurs, springboot does not automatically exit, because the application may handle the oom itself (an oom error can be caught by try/catch like any other Throwable).
Solution: add a command-line option when launching the jar:

-XX:+CrashOnOutOfMemoryError

With this option, the JVM exits automatically when an out-of-memory error occurs and writes an error log file in the program's working directory, recording which thread triggered the oom, the memory usage at the time, and so on.
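A throwaway class to verify the behavior locally (the class name and heap size are illustrative):

import java.util.ArrayList;
import java.util.List;

// Run with:  java -Xmx32m OomDemo
//   -> the catch block runs and the JVM stays alive
// Run with:  java -Xmx32m -XX:+CrashOnOutOfMemoryError OomDemo
//   -> the JVM exits at the moment the OOM is raised and writes a crash log
public class OomDemo {

    public static void main(String[] args) {
        List<byte[]> hoard = new ArrayList<>();
        try {
            while (true) {
                hoard.add(new byte[1024 * 1024]); // allocate 1 MB per loop until the heap is gone
            }
        } catch (OutOfMemoryError e) {
            System.out.println("caught the OOM, JVM still alive: " + e);
        }
    }
}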

Configure suspended-animation detection and self-healing

To guard against this problem, a stopgap is to probe the site's health-check url at fixed intervals and judge from the result whether tomcat is still serving.

Write a shell script: when the health-check url does not return a 200 status code, force-restart the SpringBoot service. The probe url in the script below is a placeholder; substitute your service's real endpoint.

#!/bin/bash

# placeholder health-check URL; replace with the service's real endpoint
URL="http://127.0.0.1:8091/actuator/health"

n=`curl -I -s "$URL" | grep "200 OK" | wc -l`

if [ $n -ne 1 ]
then
    source /etc/profile
    /usr/local/app/bin/deploy.sh stop
    /usr/local/app/bin/deploy.sh start
fi

Run the detection-and-self-healing script above at a fixed interval with crond, as in the entry below.
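A crontab entry of this shape runs the check once a minute (the script path and log path are hypothetical; point them at wherever you saved the script above):

* * * * * /usr/local/app/bin/health_check.sh >> /var/log/health_check.log 2>&1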

Finally, some seasoned advice

Some hands-on experience to make everyday work calmer:

  1. Always add -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath= to the tuning parameters, so that on OOM the JVM dumps memory automatically; this makes later analysis and problem-solving much easier.
  2. Do not set the heap especially large: the larger the heap, the larger the dump file generated on OOM, and the harder it is to analyze. No more than 8G is recommended.
  3. There are quite a few ways to actively dump the JVM's memory, but whichever you pick, an active dump triggers a stop-the-world pause, so choose your moment. For example, arthas' heapdump command dumps the JVM's memory by triggering a full GC, which is STW behind the scenes; pick the timing carefully, or the boss may come to see you with a knife.
  4. Do pull down the code provided here and run it to get a feel for it. Better still, write a version of your own that differs from mine, to deepen understanding and memory.

JVM-related startup parameters

-XX:+HeapDumpOnOutOfMemoryError

Just as the name says: when an OutOfMemoryError occurs, dump the heap, so that you get a memory snapshot from the moment of the failure.

-XX:HeapDumpPath=/usr/local/heap-dump/

Also easy to understand: it configures the path the heap dump is written to, which makes the files easy to manage. Here it is set to /usr/local/heap-dump/; you can of course choose another directory to suit your needs.
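Putting the two flags together, a launch command might look like this (the jar name and heap sizes are placeholders):

java -Xms4g -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/heap-dump/ -jar app.jar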



