A record of troubleshooting a handle leak (Too many open files)

Today is the Lantern Festival, which means the New Year holiday is coming to an end. Before the holiday I watched the server logs and did maintenance every day, and there were no major problems. During the Chinese New Year holiday, however, the logs grew far too large. Normally about 1 MB of logs is generated per day, but when I checked the server one day during the holiday I found 42 GB of logs, and the online features were also behaving a little abnormally. I got nervous and quickly set off on a troubleshooting trip.

1. Back up the abnormal log and restore the online service
Back up the abnormal log with the Linux cp command:

cp [OPTION]... SOURCE... DIRECTORY

After restarting the server process, the service returned to normal.

2. Check the log and make a preliminary diagnosis

After the service was restored, I started looking for the cause of the abnormality. Because the abnormal log was too large to download quickly, I used the split command to cut the log file into 1 GB blocks. 1 GB was still too big, so I split the first block again into 100 MB pieces.

split -b 100M mylog.txt

Finally, I downloaded the first 100 MB piece to inspect it.
The downloaded log looked like this:

java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.DatagramChannelImpl.<init>(DatagramChannelImpl.java:142)
at sun.nio.ch.SelectorProviderImpl.openDatagramChannel(SelectorProviderImpl.java:46)
at java.nio.channels.DatagramChannel.open(DatagramChannel.java:182)
at lbms.plugins.mldht.kad.utils.AddressUtils.getDefaultRoute(AddressUtils.java:254)
at lbms.plugins.mldht.kad.RPCServerManager.startNewServers(RPCServerManager.java:147)
at lbms.plugins.mldht.kad.RPCServerManager.refresh(RPCServerManager.java:51)
at lbms.plugins.mldht.kad.DHT.update(DHT.java:929)
at lbms.plugins.mldht.kad.DHT.lambda$started$11(DHT.java:767)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

java.net.SocketException: Too many open files
at sun.nio.ch.Net.socket0(Native Method)
at sun.nio.ch.Net.socket(Net.java:411)
at sun.nio.ch.DatagramChannelImpl.<init>(DatagramChannelImpl.java:142)
at sun.nio.ch.SelectorProviderImpl.openDatagramChannel(SelectorProviderImpl.java:46)
at java.nio.channels.DatagramChannel.open(DatagramChannel.java:182)
at lbms.plugins.mldht.kad.utils.AddressUtils.getDefaultRoute(AddressUtils.java:254)
at lbms.plugins.mldht.kad.RPCServerManager.startNewServers(RPCServerManager.java:147)
at lbms.plugins.mldht.kad.RPCServerManager.refresh(RPCServerManager.java:51)
at lbms.plugins.mldht.kad.DHT.update(DHT.java:929)
at lbms.plugins.mldht.kad.DHT.lambda$started$11(DHT.java:767)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
...

As shown above, the java.net.SocketException: Too many open files exception was being written to the log in an endless loop. So I searched the major IT sites for related issues and tried some of the suggested solutions.
The general explanation was that the Linux system's file handles were not enough and had been used up.
Following the advice found online, I used ulimit -a to check the open-file limit and then raised it permanently in the system's limits configuration; querying again showed the limit was now set to 65535.
I thought that would be the end of it.
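Incidentally, when the service runs on a HotSpot/OpenJDK JVM on Linux, the process's own descriptor usage and its limit can also be checked from inside the JVM, which is a convenient way to confirm that a raised limit has actually taken effect. A minimal sketch (not part of the original troubleshooting; the class name is mine):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

public class FdLimitCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // How many descriptors this JVM currently holds vs. its limit.
            System.out.println("open file descriptors: " + unix.getOpenFileDescriptorCount());
            System.out.println("max file descriptors:  " + unix.getMaxFileDescriptorCount());
        } else {
            System.out.println("File descriptor counts are not exposed on this platform.");
        }
    }
}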

3. Suspect a handle leak and confirm the suspicion
However, that was not the end of it: after running for another week, the server threw the same exception again.
So I reconsidered the exception and suspected it might be caused by a handle leak, and investigated along the lines of an article on statistical methods for detecting program handle leaks.
Following its hints, I wrote a small background program on the Linux system to monitor the number of handles held by the process.
The following command queries the number of handles currently held by a process:

 ls -l /proc/<process ID>/fd | wc -l 

The monitor records the handle count every 600 seconds; a minimal sketch of what such a monitor can look like is given below.
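The original monitoring program is not reproduced here. As a rough illustration only (the class name and output file are placeholders of mine, and it must run as a user allowed to read the target's /proc entry), a Java version of such a monitor could look like this:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.LocalDateTime;
import java.util.stream.Stream;

public class FdMonitor {
    public static void main(String[] args) throws Exception {
        String pid = args[0];                              // target process ID
        Path fdDir = Paths.get("/proc", pid, "fd");        // one entry per open handle
        Path out = Paths.get("fd_count.log");
        while (true) {
            long count;
            try (Stream<Path> fds = Files.list(fdDir)) {
                count = fds.count();
            }
            String line = LocalDateTime.now() + " " + count + System.lineSeparator();
            Files.write(out, line.getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            Thread.sleep(600_000L);                        // sample every 600 seconds
        }
    }
}

Run it with the server's process ID as the argument (for example, java FdMonitor 12345, where 12345 is a placeholder PID), and the resulting fd_count.log can be pulled into Excel to draw the trend chart.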
After letting it run for a day, I pulled the samples into Excel and drew a line chart of the handle count over time.
The steadily climbing line chart further confirmed that a handle leak was the cause.

4. Locate the leak and fix it

The handle leak had to be caused by the program itself. A handle usually refers to a file, a socket, or some similar resource, so the leak was most likely a Java file stream or a socket that was not being closed. I therefore went through every place in the code where file operations are performed, again and again, but found nothing left unclosed.
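For reference, the pattern I was auditing for is the standard one: any stream or socket opened in Java should be closed even when an exception is thrown, which try-with-resources guarantees. A minimal illustration (not code from the project):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CloseProperly {
    // The reader (and its underlying file descriptor) is closed automatically,
    // whether readLine() succeeds or throws.
    static String firstLine(String file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(file))) {
            return reader.readLine();
        }
    }
}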
So I moved the battlefield to my local IDE, debugging with the Windows Task Manager open, exercising each business function repeatedly and watching its handle and thread usage.
For most functions, the threads and handles they occupied were released as soon as the operation finished. Only one function was different: even after it ended, its thread and handle counts were never released. So I tested that function more deeply and used the JVM stack tool to inspect what each thread was doing:

	jstack -l <process ID>
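jstack prints every thread's name, state, and stack trace. As an in-process alternative (not something used in the original investigation), the same state information can also be pulled through the standard ThreadMXBean API, for example to count live threads per state:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCount {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // dumpAllThreads(false, false): stack and state only, no lock details.
        Map<Thread.State, Long> byState = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            byState.merge(info.getThreadState(), 1L, Long::sum);
        }
        byState.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}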

After this function ended, a large number of its threads were still in the RUNNABLE state, busy receiving socket data packets.
Guided by these thread dumps, I went back to the code and finally found the culprit in one branch of the project's business logic: when a runtime object's execution timed out, the shutdown and cleanup step was never performed. As a result, whenever a user's request timed out, the front end returned a timeout prompt, but the back end never closed or recycled anything and just kept running.
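The project's actual code is not shown here, but the shape of the bug and of the fix can be sketched. In the hypothetical handler below (class, method, and parameter names are all made up for illustration), the old behaviour was to return on timeout without any cleanup; the fix is to cancel the worker and close the channel in a finally block, which both releases the file handle and unblocks the thread stuck in receive():

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutCleanupExample {

    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    static String handleRequest(long timeoutMillis) throws Exception {
        DatagramChannel channel = DatagramChannel.open();      // one file handle per request
        Future<String> worker = POOL.submit(() -> {
            ByteBuffer buf = ByteBuffer.allocate(1024);
            channel.bind(new InetSocketAddress(0));            // ephemeral local port
            channel.receive(buf);                              // may block indefinitely
            return new String(buf.array(), 0, buf.position());
        });
        try {
            return worker.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Old behaviour: return here with no cleanup, leaking the channel (one fd)
            // and leaving the worker thread RUNNABLE forever on every timed-out request.
            return "request timed out";
        } finally {
            worker.cancel(true);   // interrupt the worker if it is still running
            channel.close();       // unblocks receive() and releases the file handle
        }
    }
}

Closing in a finally block (or with try-with-resources) guarantees the descriptor is released whether the request completes, fails, or times out.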

5. Verify that the leak is resolved
After applying the fix, I redeployed the service, kept observing it online, and drew the handle-count line chart again.

The handle count now stays stable below 600, so this handle leak investigation has come to an end.

6. Summary
Through this experience, I took a fresh, more careful look at the code. Running into a problem is not terrible: with in-depth research, it can be solved bit by bit.
I hope my experience can provide some troubleshooting ideas and a reference for others who run into the same problem.

Origin: blog.csdn.net/weixin_43901067/article/details/114113036