The init process in containers

Background

The process identifier (PID) is a unique identifier the Linux kernel assigns to each process. Readers familiar with Docker will know that every process belongs to a PID namespace: a container has its own set of PIDs, and these PIDs are mapped to PIDs on the host system. The first process started when the Linux kernel boots gets PID 1; this is normally the init process, such as systemd or SysV init. Likewise, the first process started in a container gets PID 1 in the container's PID namespace. Docker and Kubernetes communicate with the processes inside a container using signals: to stop a container, they send a signal to the process with PID 1 inside it.

In a container environment, PIDs and Linux signals raise two issues worth considering.

Issue 1: how the Linux kernel handles signals

The Linux kernel treats a process with PID 1 differently from other processes. The kernel does not register default signal handlers for it, so SIGTERM and SIGINT are ignored by default, and the process can only be terminated with SIGKILL. Killing with SIGKILL prevents the application from exiting cleanly, which can leave data half-written, requests half-processed, or state inconsistent.

Issue 2: how the classic init system's job of reaping orphaned processes gets done

On the host, the init process (such as systemd) also reaps orphaned processes. An orphaned process (one whose parent has exited) is re-parented to PID 1, and PID 1 reaps it when it terminates. Inside a container, this responsibility falls on the container's own PID 1 process; if it does not reap correctly, the container risks exhausting memory or other resources.

Common solutions

For some applications these problems are insignificant and can be ignored, but for others, such as data-processing applications, they are critical and must be strictly prevented. There are several solutions:

Solution 1: Run as PID 1 and register a signal handler

The simplest approach is to start the process with the CMD or ENTRYPOINT instruction in the Dockerfile. For example, in the following Dockerfile, nginx is the first and only process started.

FROM debian:9

RUN apt-get update && \
    apt-get install -y nginx

EXPOSE 80

CMD [ "nginx", "-g", "daemon off;" ]

The nginx process registers its own signal handlers. If the program is one we wrote ourselves, we need to do the same in our own code.

Because our process is PID 1, it is guaranteed to receive and handle signals correctly. This easily solves the first issue, but not the second. If your application never spawns unwanted child processes, the second issue does not arise, and this relatively simple solution is enough.
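When the PID 1 process is a program we wrote ourselves, the handler has to be registered explicitly. A minimal sketch of the idea in shell (the handler body, messages, and file name are illustrative, not from the original article):

```shell
#!/bin/sh
# entrypoint.sh -- register a SIGTERM handler so the process can
# exit gracefully even when it runs as PID 1 in a container.

graceful_shutdown() {
    echo "caught SIGTERM, cleaning up"
    # flush data, close connections, etc. (placeholder)
    exit 0
}
trap graceful_shutdown TERM

echo "started as PID $$"
# Sleep in the background and wait on it, so the trap fires promptly
# instead of only after a long foreground command returns.
while true; do
    sleep 1 &
    wait $!
done
```

The same principle applies in any language: without an explicit handler, a PID 1 process simply never sees SIGTERM.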

Note that we can accidentally make something other than our own program the first process in the container, for example with this Dockerfile:

FROM centos:7

ADD command /usr/bin/command
CMD cd /usr/bin/ && ./command

We only wanted to run command at startup, but we find that the first process has become a shell:

[root@425523c23893 /]# ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  1 07:05 pts/0    00:00:00 /bin/sh -c cd /usr/bin/ && ./command
root         6     1  0 07:05 pts/0    00:00:00 ./command

Docker automatically determines whether your startup command is composed of multiple commands; if so, it uses a shell to interpret them. If it is a single command, then even though it is wrapped in a layer of shell, the container's first process is still the business process itself. For example, if the Dockerfile says CMD bash -c "/usr/bin/command", the container's first process is the business process, as follows:

[root@c380600ce1c4 /]# ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  2 13:09 ?        00:00:00 /usr/bin/command

So writing the Dockerfile correctly also lets us avoid many problems.

Sometimes we need to prepare the environment inside the container before the main process can run properly. In that case we usually have the container execute a shell script at startup, whose job is to prepare the environment and then start the main process. With this approach, however, the shell script, not our process, becomes PID 1. So you must start the main process from the script with the shell built-in exec: exec replaces the script's shell with our program, which therefore becomes PID 1.
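A minimal sketch of such an entrypoint script (the preparation steps and file name are placeholders): the setup runs in the shell, and exec then replaces the shell with the real command, which keeps the shell's PID.

```shell
#!/bin/sh
# entrypoint.sh -- prepare the environment, then exec the main process.
# exec replaces this shell with the command, so the command inherits
# the shell's PID (PID 1 when this script is the container entrypoint).

echo "preparing environment in shell PID $$"
export APP_READY=1              # placeholder preparation step

exec "$@"                       # the real process takes over this PID
```

With ENTRYPOINT ["/entrypoint.sh"] and CMD ["nginx", "-g", "daemon off;"], nginx would end up as PID 1.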

Solution 2: Use a dedicated init process

As on a traditional host, you can also use an init process to handle these issues. However, traditional init systems (e.g. systemd or SysV init) are too large and complex; it is better to use an init process created specifically for containers, such as tini.

With a dedicated init process, the init process runs as PID 1 and does the following:

  • Registers the correct signal handlers and forwards signals to the business process
  • Reaps zombie processes
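For illustration, the nginx Dockerfile above could bundle tini like this (a sketch assuming Debian's tini package, which installs the binary at /usr/bin/tini):

```dockerfile
FROM debian:9

RUN apt-get update && \
    apt-get install -y nginx tini

EXPOSE 80

# tini runs as PID 1, forwards signals to nginx, and reaps zombies
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD [ "nginx", "-g", "daemon off;" ]
```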

In Docker you can use this solution via the --init option of the docker run command. Kubernetes, however, does not currently support this option directly, so you have to arrange for the init process yourself before the startup command.

Problems in practice

The two solutions above may look neat, but in practice each has drawbacks.

Solution 1 requires strict guarantees: the user process must be the first process, and it must not fork extra child processes. Sometimes we need to run a shell script at startup to prepare the environment, or run multiple commands such as 'sleep 10 && cmd'; in these cases the container's first process is a shell, and we run into the problem that signals are not forwarded. If we forbid shell syntax in users' startup commands, the user experience suffers, and as a PaaS platform we want to provide a simple, friendly environment that handles these issues for users. On the other hand, multi-process containers are unavoidable: even if we ensure only a single process is started, a running process may itself fork children. We cannot guarantee that the third-party components or open-source programs we use never spawn children, and a moment's inattention lands us in the second problem, the embarrassing situation where zombie processes are never reaped.

Solution 2 requires an init process inside the container to take on all of these tasks. The common practice today is to build the init process into the image and let it handle all of the problems above. This certainly works, but getting everyone to adopt it is hard to accept. First, it is invasive to user images: users must modify their existing Dockerfiles, either adding the init process explicitly or building on a base image that contains it. Second, it is a management burden: if the init process is upgraded, every image has to be rebuilt, which is hardly acceptable. Even tini, which Docker supports by default, has some other issues, which we will come back to.

Ultimately, as a PaaS platform, we want to give users a convenient environment that solves these problems for them:

  1. The user process can receive signals and exit gracefully.
  2. Users may spawn multiple processes, and zombie processes are reaped for them in the multi-process case.
  3. Users' startup commands are unconstrained: they may use any shell syntax, and we still solve problems 1 and 2 above.

Solution

If we want to be non-invasive to users, it is best to use a mechanism natively supported by Docker or Kubernetes.
The docker run --init option was introduced above; the init process Docker provides natively is in fact tini. tini supports forwarding signals to the whole process group: the -g flag or the TINI_KILL_PROCESS_GROUP environment variable enables this behavior. With it enabled, we can run tini as the first process and have it deliver signals to all child processes, which easily solves problem 1. For example, if we run docker run -d --init ubuntu:14.04 bash -c "cd /home/ && sleep 100", the processes inside the container look like this:

root@24cc26039c4d:/# ps -ef 
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  2 14:50 ?        00:00:00 /sbin/docker-init -- bash -c cd /home/ && sleep 100
root         6     1  0 14:50 ?        00:00:00 bash -c cd /home/ && sleep 100
root         7     6  0 14:50 ?        00:00:00 sleep 100

Here process 1 is docker-init, i.e. the tini process, responsible for forwarding signals to its child processes and reaping zombies. tini's child is process 6, the bash process, which interprets the shell command and can therefore run multiple commands. But there is a problem: tini only watches its direct child, and when that child exits, the whole container is considered exited; in this example that child is bash, process 6. If we send SIGTERM to the container, the user process may have registered a signal handler that needs some time to finish after receiving the signal, but because bash has not registered a SIGTERM handler, it exits immediately, causing tini to exit and thus the whole container to exit, while the user process's signal handler is cut off before it finishes. We need a way to make bash ignore this signal. A colleague pointed out that bash in interactive mode does not handle SIGTERM, which is worth trying: simply prefix the startup command with bash -ci. Indeed, starting the user process with bash in interactive mode makes bash ignore SIGTERM, so the container waits for the business process's signal handler to complete before exiting.

This solves the problems above nicely. It also brings a small extra benefit: containers exit faster. Kubernetes' container-exit logic, like Docker's, is to send SIGTERM and then SIGKILL. Most users do not handle SIGTERM, and since the default behavior of a container's PID 1 process is to ignore the signal, the SIGTERM is wasted and the container is only deleted after waiting out terminationGracePeriodSeconds. If the user does not handle SIGTERM, why not just exit as soon as SIGTERM arrives? Under our current solution, a user who has registered a signal handler gets to handle the signal properly, and a container that has not registered one exits immediately on SIGTERM, which speeds up shutdown.

At present, Kubernetes' CRI does not expose a way to enable Docker's tini, so to use tini under Kubernetes one can only change the code; the author's cluster implements this with a code change. To solve users' pain points we have both the ability and the obligation to change the code for reasonable needs, and besides, this change is small and very simple.

Postscript

When putting containers into practice you will run into all kinds of real-world problems, and open-source projects cannot cover every need. We must understand them well enough to make small adaptations on top of the community's work so that they fit our enterprise scenarios perfectly.


Origin www.cnblogs.com/gaorong/p/12443539.html