Troubleshooting and fixing "too many open files" in a program

1. The problem

Recently, the log of one of our projects kept reporting the error "socket: too many open files", which prevented the program from working normally. The errors looked like this:

10:32:18.043725 ......err=Post "http://172.31.16.249:9074/trade/api/v1/getSpotEntrustList": dial tcp 172.31.16.249:9074: socket: too many open files
10:32:18.043743 ......err:="http://172.31.16.249:9074/trade/api/v1/getSpotEntrustList": dial tcp 172.31.16.249:9074: socket: too many open files
10:32:18.064471 ......err=Post "http://172.31.16.249:9074/trade/api/v1/getSpotEntrustList": dial tcp 172.31.16.249:9074: socket: too many open files

The error "too many open files" is often encountered, because this is a common error in Linux systems, and it also often occurs in cloud servers. Most of the articles on the Internet simply modify the limit on the number of open files. There is no complete solution to the problem at all.

As the message literally says, too many files are open and the system cannot open any more file handles. "File" here really means file handle (file descriptor). In most cases the error is caused by a file handle leak: handles keep being opened but are never properly closed after use, so the number of open files grows without bound.

File handles can leak for many reasons, not just from opening regular files; common sources are sockets, pipes, database connections, and files. Under normal circumstances the server itself will not suddenly report this error. It is almost always the business program we deployed on the cloud server that opens too many files without closing them, so that the number of files open at the same time exceeds the system limit. There are two typical situations:

  • Situation 1: The program legitimately needs to open a large number of file handles, more than the system's open-file limit allows. In this case we need to raise the system limit; the concrete steps are described below.
  • Situation 2: The program does not close file handles after using them, typically network connections or files that are never closed. In this case we need to fix the bug in the program so that every opened file and every network connection is closed when we are done with it (see the sketch after this list).
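
To make Situation 2 concrete, here is a minimal Go sketch. It assumes the service is written in Go, which the net/http-style errors in the log suggest, and shows the most common leak and its fix: an HTTP response body that is never closed keeps its TCP connection, and therefore its fd, alive.

package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetch shows the usual fix: every response body must be closed when we are
// done with it, otherwise the underlying TCP connection (and its fd) leaks.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	// Forgetting this defer is the classic cause of a steadily growing fd count.
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// URL copied from the error log above, purely for illustration.
	data, err := fetch("http://172.31.16.249:9074/trade/api/v1/getSpotEntrustList")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("got", len(data), "bytes")
}

The same rule applies to files opened with os.Open, database rows, and so on: whatever is opened must be closed on every code path, which is what defer guarantees.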

2. Problem analysis

2.1 Troubleshooting

We use the lsof command to investigate. First, a quick introduction to lsof and the concepts behind it.

File descriptor (fd): in Linux, everything can be regarded as a file. A file descriptor is an index the kernel creates to manage opened files efficiently. It is a non-negative integer (usually a small one) that refers to an open file, and every system call that performs I/O goes through a file descriptor.
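
A tiny Go sketch (purely illustrative) shows what these small integers look like from inside a program: descriptors 0, 1 and 2 are already taken by stdin, stdout and stderr, and the next file opened gets the next free number.

package main

import (
	"fmt"
	"os"
)

func main() {
	// 0, 1 and 2 are the standard streams.
	fmt.Println(os.Stdin.Fd(), os.Stdout.Fd(), os.Stderr.Fd()) // 0 1 2

	f, err := os.Open("/etc/hosts") // any existing file will do
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// The kernel hands out the lowest free descriptor, typically 3 here.
	fmt.Println("fd of /etc/hosts:", f.Fd())
}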

The Linux command lsof (list open files) lists the files opened on the system; typing lsof in a terminal shows everything currently open. The meaning of each lsof field is:

# lsof

COMMAND     PID   TID    USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
systemd       1          root  cwd       DIR              259,2       256         64 /
systemd       1          root  rtd       DIR              259,2       256         64 /
systemd       1          root  txt       REG              259,2   1632960    4200757 /usr/lib/systemd/systemd
cpay_api  11103 11110    root  593w      REG              259,0     50213    5506167 /opt/logs/cpay_api/WARNING.log

COMMAND: name of the program
PID: process identifier
TID: thread identifier
USER: owner of the process
FD: file descriptor
TYPE: file type
DEVICE: device number
SIZE/OFF: size of the file
NODE: inode number
NAME: exact name of the opened file

Common values in the FD column are cwd, rtd, txt, mem, and plain numbers.

cwd is the current working directory, rtd the root directory, txt the program's executable file, and mem a memory-mapped file.

FDs that start with a number, such as "0u", "1u", "2u", are the file handles the business program actually has open.

The following command shows the number of open files per process: the first column is the count and the second column is the pid. Because lsof lists threads as well as the default FD types (cwd, rtd, txt, mem, ...), these counts are larger than the real number of open FDs. So use this command only to rank processes, then take the pid and check the real number of open FDs:

# lsof -n |awk '{print $2}'|sort|uniq -c|sort -nr|more
   6816 11103
    455 2132
    371 2120
    234 555
    225 18907

Although the counts above are inflated, the ranking is basically correct. Next, use the following command to check how many fds the suspicious pid actually has open and see whether the number is too high; as a rule of thumb, more than 1000 is considered high. Once a specific process is identified, the corresponding program needs to be examined.

# lsof -p 11103| awk '{print $4}' |grep "^[0-9]" |wc -l 
853

So the process with pid 11103 actually has 853 FDs open. The earlier count of 6816 appears because many threads share those 853 open files, which inflates the per-line total well beyond the real value. We can also confirm the result by counting the entries in /proc/<pid>/fd:

# ll /proc/11103/fd|wc -l
859

Only 859 fds are open under the directory of process 11103, which is a normal number of open files. Incidentally, this is also why disk space is not released after a file is deleted: the handle of the deleted file has not been closed. The same method can be used to troubleshoot that situation.
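
If you want the service itself to report this number, for example for periodic logging or metrics, a small Go sketch (an illustrative addition, not part of the original troubleshooting) can count the entries in /proc/self/fd, the same directory the commands above inspect from the outside:

package main

import (
	"fmt"
	"os"
)

func main() {
	// /proc/self/fd lists the descriptors of the current process; its entry
	// count matches what "ls /proc/<pid>/fd | wc -l" reports externally
	// (plus the descriptor ReadDir itself uses while reading the directory).
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		panic(err)
	}
	fmt.Println("open fds:", len(entries))
}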

2.2 Problem Solving

Running ulimit -a shows the limits configured for the current user; the open files field is the maximum number of file handles a process may open:

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 14113
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Here open files is 1024. The following two methods raise the user's open-file limit.

  • Method 1: Temporary modification
ulimit -n 65535

This temporarily raises the number of open files to 65535, but the setting is lost after the system reboots or the user logs out.

  • Method 2: Permanent modification

The other way is to modify the system configuration file. On CentOS, for example, the default configuration file is

/etc/security/limits.conf

Add the following to this configuration file:

* soft nofile 65535
* hard nofile 65535

The first column is the user name; * means all users. "* - nofile 65535" is the simplest global setting.

nproc is the maximum number of processes, and nofile is the maximum number of open files.

The soft limit is usually set lower than or equal to the hard limit. The soft limit is the value the kernel actually enforces, while the hard limit acts as a ceiling: a process may raise its own soft limit, but never above the hard limit.
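
A process can also inspect and adjust its own limits at runtime. The Go sketch below (assuming Linux; offered only as an illustration) reads RLIMIT_NOFILE and raises the soft limit up to the hard limit, which is exactly what an unprivileged process is allowed to do:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Read the current nofile limits of this process; these are the same
	// numbers shown in the "Max open files" row of /proc/<pid>/limits.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("soft=%d hard=%d\n", rl.Cur, rl.Max)

	// An unprivileged process may raise its soft limit up to the hard limit;
	// raising the hard limit itself requires CAP_SYS_RESOURCE.
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
}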

2.3 Raising supervisord's default open-file limit

If our service process is managed by supervisord, then even after raising the system-wide limit as above, checking the limits actually applied to the process shows the following:

# cat /proc/9675/limits
Limit                     Soft Limit           Hard Limit           Units     
......
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
......

The process still has a soft limit of 1024 files and a hard limit of 4096, even though we raised the open-file limit in limits.conf / limits.d.

The reason is that supervisord has its own setting, minfds, which controls the number of files it can open. That setting is inherited by every child process supervisord starts, so it overrides whatever was configured in limits.d.

It's set to 1024 by default and can be increased to whatever you want (or need).

# vim /etc/supervisord.conf 
[supervisord] 
... 
minfds = 65535

Increase supervisord's minfds value to 65535, then restart supervisord so the managed service is started again with the new limit.

systemctl restart supervisord

Checking the process's open-file limit again shows it has been raised to 65535:

# cat /proc/9675/limits
Limit                     Soft Limit           Hard Limit           Units     
......
Max open files            65535                65535                files     
Max locked memory         65536                65536                bytes     
......

3. Additional commands


ulimit -n being 1024 means that any single process run by the root user can open at most 1024 files; it does not mean that root can open only 1024 files in total.

ls /proc/<pid>/fd | wc -l counts the fds opened by the process with the given pid.

lsof -p <pid> | awk '{print $4}' | grep "^[0-9]" | wc -l should give roughly the same count as the command above.

lsof -u root shows open-file information at the user level. The lsof output includes memory-mapped .so files, which in principle are not fds controlled by the application itself, so they normally need to be filtered out.

lsof -u root | awk '{print $4}' | grep "^[0-9]" | wc -l counts the number of open fds at the user level.

cat /proc/<pid>/limits shows all the limit values for the process with that pid; this is generally the authoritative reference.



Modify the configuration to raise a user's system open-file limit:

vi /etc/security/limits.conf

root soft nofile 8192
root hard nofile 8192

The first column is the user name; * means all users. "* - nofile 8192" is the simplest global setting.


cat /proc/sys/fs/file-max shows the maximum number of file handles the kernel can open, normally about 10% of the memory size in KB. We usually do not need to set this value ourselves unless it really is too small.

cat /proc/sys/fs/file-nr: the first number is the number of files currently open system-wide; the third number is the same as /proc/sys/fs/file-max, the maximum number of file handles the kernel can open.
