Notes on a program that fails to start

2023.06.27

Background

A standalone program, referred to below as test-bin, is launched from the start.sh script. The script looks roughly like this:

#!/bin/bash
killall test-api
killall test-bin

test-bin 1>/dev/null
test-api 1>/dev/null

test-bin is started from start.sh because the whole system consists of several services plus some other setup steps, all of which are kept together in that script.

A second standalone program, written in Go and referred to below as test-api, exposes an external interface that can restart the services. When test-api receives a restart request, it runs the start.sh script to restart every program. The code is roughly as follows:

func Restart(c *gin.Context) {
    go func() {
        out, err := exec.Command(utils.ShellToUse, "-c", "start.sh").Output()
        if err != nil {
            model.Loger("Error", fmt.Sprintf("restart error: %s, out: %s", err.Error(), out))
            return
        }
    }()

    response.Ok(c)
}

Symptoms

  1. When start.sh is run manually in a terminal, all services start and the system runs normally.

  2. When test-api restarts the services after receiving a request, test-bin does not start properly. There should be two test-bin processes, but only one is found. The other one is missing with no obvious cause, and no core file is left behind.

Locating and analyzing the problem

Two test-bin processes should be running, but only one is. Checking the file descriptors of the test-bin process that did start shows the following:

bash-5.0# ls -l /proc/29764/fd
total 0
lr-x------ 1 root root 64 Jun 16 10:00 0 -> /dev/null
l-wx------ 1 root root 64 Jun 16 10:00 1 -> /dev/null
l-wx------ 1 root root 64 Jun 16 10:00 2 -> 'pipe:[100456522]'
lrwx------ 1 root root 64 Jun 16 10:00 3 -> /dev/zero

For comparison, when start.sh is run from a terminal and both test-bin processes start normally, the file descriptors of a test-bin process look like this:

bash-5.0# ls -l /proc/34480/fd
total 0
lr-x------ 1 root root 64 Jun 16 10:03 0 -> /dev/null
l-wx------ 1 root root 64 Jun 16 10:03 1 -> /dev/null
lrwx------ 1 root root 64 Jun 16 10:03 10 -> 'anon_inode:[eventpoll]'

The stderr (fd 2) of the process started through test-api has been changed: it is a pipe (pipe:[100456522]). I wanted to inspect it with lsof, but that command is not available on the system, so I gave up on that.
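
Where lsof is missing, the same question (who else holds this pipe?) can be answered from /proc directly. The following is a minimal sketch, assuming a Linux system with /proc mounted and enough privileges to read other processes' fd directories. The function names findPipeHolders and selfHoldsOwnPipe are hypothetical helpers introduced for illustration only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// findPipeHolders returns the PIDs of all processes that have an open file
// descriptor whose /proc symlink equals target, e.g. "pipe:[100456522]".
// Entries that cannot be read (permissions, exited processes) are skipped.
func findPipeHolders(target string) []string {
	var pids []string
	seen := map[string]bool{}
	links, _ := filepath.Glob("/proc/[0-9]*/fd/*")
	for _, link := range links {
		dest, err := os.Readlink(link)
		if err != nil || dest != target {
			continue
		}
		// link looks like /proc/<pid>/fd/<n>; element 2 of the split is the PID.
		pid := strings.Split(link, "/")[2]
		if !seen[pid] {
			seen[pid] = true
			pids = append(pids, pid)
		}
	}
	return pids
}

// selfHoldsOwnPipe creates a pipe in this process and checks that the scan
// reports the current PID as one of its holders.
func selfHoldsOwnPipe() (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()
	defer w.Close()

	target, err := os.Readlink(fmt.Sprintf("/proc/self/fd/%d", w.Fd()))
	if err != nil {
		return false, err
	}
	self := fmt.Sprint(os.Getpid())
	for _, pid := range findPipeHolders(target) {
		if pid == self {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := selfHoldsOwnPipe()
	fmt.Println("current process found as pipe holder:", ok, err)
}
```

Calling findPipeHolders("pipe:[100456522]") on the affected machine would list every process that still has either end of that pipe open. If test-api is not among them, nobody is reading from it.
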
Going back over the whole flow, I found that test-api runs the start.sh script with the following statement:

out, err := exec.Command(utils.ShellToUse, "-c", "start.sh").Output()

This statement collects the output of the start.sh script. How does it do that? With pipes. In general, the parent process creates a pipe and then forks a child. The parent closes the write end of the pipe, while the child closes the read end and redirects its standard output to the write end.
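
The mechanism can be reproduced by hand. The sketch below wires a pipe to a child's stderr explicitly instead of letting os/exec do it; it illustrates the general principle and is not the actual os/exec implementation. The helper name captureStderr and the sh one-liner are assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// captureStderr runs script through shell and returns what it wrote to
// stderr, setting up the pipe manually to make each step visible.
func captureStderr(shell, script string) (string, error) {
	r, w, err := os.Pipe() // the parent creates the pipe
	if err != nil {
		return "", err
	}
	defer r.Close()

	cmd := exec.Command(shell, "-c", script)
	cmd.Stderr = w // the child's fd 2 becomes the write end
	if err := cmd.Start(); err != nil {
		w.Close()
		return "", err
	}
	w.Close() // the parent closes its own copy of the write end

	// ReadAll returns at EOF, i.e. once every holder of the write end
	// (the child and anything that inherited fd 2 from it) has closed it.
	data, err := io.ReadAll(r)
	if err != nil {
		return "", err
	}
	return string(data), cmd.Wait()
}

func main() {
	out, err := captureStderr("sh", "echo oops >&2")
	fmt.Printf("captured %q, err=%v\n", out, err)
}
```

Because the read only ends when every copy of the write end is closed, any long-running process that inherits the descriptor keeps the pipe alive.
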

Output() returns the child's stdout and, when Cmd.Stderr is not set, also captures the child's stderr so that it can be attached to the returned error. Both stdout and stderr of the child are therefore redirected to pipes, which is what we saw above:

bash-5.0# ls -l /proc/29764/fd
total 0
lr-x------ 1 root root 64 Jun 16 10:00 0 -> /dev/null
l-wx------ 1 root root 64 Jun 16 10:00 1 -> /dev/null
l-wx------ 1 root root 64 Jun 16 10:00 2 -> 'pipe:[100456522]'
lrwx------ 1 root root 64 Jun 16 10:00 3 -> /dev/zero

File descriptor 2 points to the pipe. But why does file descriptor 1 not point to a pipe as well?

The reason is the way test-bin is started in the start.sh script:

test-bin 1>/dev/null 

Both stdout and stderr of the start.sh process are redirected to pipes. When start.sh launches test-bin, it redirects stdout to /dev/null but leaves stderr alone, so the stderr of the test-bin process is inherited from start.sh and still points to the pipe.

In theory, having stderr redirected to a pipe is harmless, so why did only one of the test-bin processes start? Code analysis showed that the test-bin process still running is a child process waiting for the main test-bin process to finish initializing, but the main process had exited without a trace (no core file could be found). Why did the main process exit? Its log file looked normal, and it started fine when launched manually from a terminal. Going over the code once more, the start.sh script contains the following:

#!/bin/bash
killall test-api
killall test-bin

test-bin 1>/dev/null
test-api 1>/dev/null

test-api is killed first, and only then is test-bin started. As analyzed above, test-api reads the child's output from the pipe. When the child (start.sh) kills the test-api process, the read end of the pipe is closed. test-bin inherits the write end of the pipe from start.sh, so if it writes anything to stderr during startup, it is writing to a pipe that has no reader and it receives a SIGPIPE signal. The default action for this signal is to terminate the process.
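
The broken-pipe condition is easy to reproduce in isolation. The sketch below closes the read end of a pipe and then writes to the write end. It assumes a Unix-like system; the helper names are hypothetical. In a Go program, a write to a broken pipe on a descriptor other than 1 or 2 fails with EPIPE instead of killing the process, which makes the condition observable. A program that uses the default SIGPIPE disposition, as test-bin evidently did, is terminated instead.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// writeToClosedPipe closes the read end of a fresh pipe and then writes to
// the write end, returning the error from that write.
func writeToClosedPipe() error {
	r, w, err := os.Pipe()
	if err != nil {
		return err
	}
	defer w.Close()
	r.Close() // no reader is left, as after test-api has been killed

	_, err = w.Write([]byte("startup message\n"))
	return err
}

// isBrokenPipe reports whether err wraps the EPIPE errno.
func isBrokenPipe(err error) bool {
	return errors.Is(err, syscall.EPIPE)
}

func main() {
	err := writeToClosedPipe()
	fmt.Println("write error:", err, "broken pipe:", isBrokenPipe(err))
}
```
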

To confirm this, I added a signal handler for SIGPIPE to the test-bin program, and it did indeed receive the signal. With the cause located, the fix is straightforward.
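
The post does not show the handler or say which language test-bin is written in. Purely as an illustration of the diagnostic step, this is how such a check might look if the program were written in Go. The function name waitForSIGPIPE is hypothetical, and the signal is sent by the process to itself here rather than by the kernel after a failed write.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitForSIGPIPE registers a handler for SIGPIPE, delivers the signal to the
// current process, and reports whether the handler observed it in time.
func waitForSIGPIPE() bool {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGPIPE)
	defer signal.Stop(ch)

	// Stand-in for the kernel raising SIGPIPE on a write to a broken pipe.
	if err := syscall.Kill(os.Getpid(), syscall.SIGPIPE); err != nil {
		return false
	}

	select {
	case <-ch:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("SIGPIPE observed:", waitForSIGPIPE())
}
```

Once a handler is installed, the signal no longer terminates the process, so logging from the handler is enough to prove that SIGPIPE was the cause of the silent exit.
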

To sum up: when test-api runs the start.sh script, the script's stderr is redirected to a pipe. The script then kills the test-api process, which closes the read end of that pipe. test-bin is started without redirecting stderr elsewhere, so when it writes to stderr it is writing to a pipe whose read end is already closed. That triggers SIGPIPE, and the process exits.
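
The post does not say which fix was applied. One possible approach on the test-api side, sketched below under that assumption, is to launch the script without capturing its output at all. When the Stdin, Stdout and Stderr fields of exec.Cmd are left nil, os/exec connects the child to the null device, so no pipe back to test-api exists. The helper name startDetached is hypothetical, and "sh" with an inline script stands in for utils.ShellToUse and start.sh.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startDetached launches script through shell without capturing its output.
// Stdin, Stdout and Stderr stay nil, so the child is attached to the null
// device instead of a pipe owned by the caller.
func startDetached(shell, script string) error {
	cmd := exec.Command(shell, "-c", script)
	if err := cmd.Start(); err != nil {
		return err
	}
	// Reap the child in the background so it does not linger as a zombie.
	go func() { _ = cmd.Wait() }()
	return nil
}

func main() {
	if err := startDetached("sh", "echo this goes nowhere >&2"); err != nil {
		fmt.Println("start failed:", err)
	}
}
```

An alternative on the script side would be to redirect stderr as well when starting test-bin, for example by adding 2>&1 after 1>/dev/null in start.sh, so that the process never inherits the pipe in the first place.
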

Origin blog.csdn.net/EmptyStupid/article/details/131447081