A "Too many open files" failure

Yesterday, our project's ElasticSearch service hung. To be precise, the hang was not because the process had died (Supervisor keeps it alive), but because the service had become unavailable. We had once had an outage caused by an improper ES_HEAP_SIZE setting, so my knee-jerk judgment was that ES_HEAP_SIZE was the culprit again. After logging in to the server, however, I found the log full of "Too many open files" errors.
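
For the record, a quick way to confirm how often the error appears is to count it in the log (the log path here is only an assumption; adjust it to your installation):

shell> grep -c 'Too many open files' /var/log/elasticsearch/elasticsearch.log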

So what is the maximum number of open files actually in effect for the ElasticSearch process? It can be confirmed through proc:

shell> cat /proc/<PID>/limits
The "Max open files" row in the result is 4096. We can also take a closer look at which files ElasticSearch has open:

shell> ls /proc/<PID>/fd

The problem looked very simple: just raise the corresponding configuration item. In ElasticSearch this setting is called MAX_OPEN_FILES, but unfortunately, setting it turned out to have no effect.
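
If the full listing is too noisy, a simple count of the open descriptors is enough to compare against the limit (a minimal sketch using the same <PID> placeholder):

shell> ls /proc/<PID>/fd | wc -l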

In my experience, such problems are usually caused by operating system limits, but everything I inspected looked normal:

shell> cat /etc/security/limits.conf

* soft nofile 65535
* hard nofile 65535
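
For completeness, the effective soft and hard limits of the current shell can be checked as well; both looked fine here (standard ulimit invocations):

shell> ulimit -Sn
shell> ulimit -Hn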


The investigation had hit a dead end, so I started looking for tricks that might at least relieve the symptom quickly. I found an article by @-shenxian-, "Dynamically modify the rlimit of a running process", which describes how to change the threshold of a running process on the fly. Although my tests all reported success, ElasticSearch still refused to work properly:

shell> echo -n 'Max open files=65535:65535' > /proc/<PID>/limits
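
To verify whether such a write actually took effect, re-reading the limits file is enough (a sketch with the same <PID> placeholder):

shell> grep 'Max open files' /proc/<PID>/limits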
In addition, I also checked the kernel parameters fs.file-nr and fs.file-max; in short, I went through every file-related parameter I could think of, and even hard-coded "ulimit -n 65535" into the startup script, but all of these efforts seemed pointless.
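
For reference, those two kernel parameters can be read in one go with sysctl:

shell> sysctl fs.file-nr fs.file-max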

Just when it seemed there was no way out, my colleague @Xuan Mairen solved the mystery in one sentence: turn off Supervisor's process management and start the ElasticSearch process manually. Once I did, everything went back to normal.
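
For illustration, the manual switch looked roughly like this; the program name "elasticsearch" in Supervisor and the installation path are assumptions:

shell> supervisorctl stop elasticsearch
shell> sudo -u elasticsearch /usr/share/elasticsearch/bin/elasticsearch -d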

Why is this so? When Supervisor's process management is used, Supervisor acts as the parent process and forks out the ElasticSearch process as its child. Because of this parent-child relationship, the maximum number of files the child process may open cannot exceed the limit in effect for the parent, and Supervisor's minfds option sets that limit to a value that is too small by default, which is what broke the ElasticSearch process.
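
If you want to keep Supervisor in charge, the corresponding fix is to raise minfds in supervisord.conf and restart supervisord so the higher limit is inherited by its children; the value below is only an example:

; /etc/supervisord.conf (excerpt)
[supervisord]
minfds=65535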

The cause of this failure was actually very simple, but I got stuck in thinking shaped by past experience, which is worth reflecting on.

Reprinted from: http://huoding.com
