Seeing Through the Container: The Implementation Principles of Linux Containers

Source | Multiple selection parameters

Responsible Editor | Program Pot

The core function of container technology is to create a "boundary", that is, an independent "operating environment", for a process by constraining and modifying its dynamic behavior. Below, we use C and Namespace technology to build a container by hand and demonstrate the most basic implementation principles of Linux containers.

What is a container? A container is really just a special kind of process, one that runs in its own "running environment". For example, it has its own file system instead of using the host's. (The file system is the part that impressed me most, and it is also a good entry point for understanding containers.)

Suppose there is a small program that calculates the sum of some values. Its input comes from one file, and the result is written to another file. For this program to run, it needs its input data in addition to its own binary file. Both are stored on disk, and together they are what we usually call a "program": an executable image of the code.

Once the "program" is executed, it is no longer just a binary file on disk. It becomes a collection of data in memory, values in registers, instructions on the stack, open files, and the status information of various devices. The sum of this execution environment while the program is running is the process, and that sum is what we mean by the dynamic behavior of the process.

The core function of container technology is to create a "boundary", that is, an independent "operating environment", by constraining and modifying the dynamic behavior of the process. So how is this boundary created?

  • For most Linux containers, such as Docker, Cgroups is the main mechanism used to create constraints;

  • Namespace is the main mechanism used to modify the process's view of the system.

Below, we use C and Namespace technology to build a container by hand and demonstrate the most basic implementation principles of Linux containers.

Implement a container yourself


There are three main system calls for Namespace in Linux:

  • clone()---creates a new process (it is also the system call behind thread creation) and accepts flags that place the child in new Namespaces.

  • unshare()---moves the calling process out of a Namespace it currently shares and into a new one.

  • setns()---adds the calling process to an existing Namespace.
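The rest of this article only exercises clone(), so here is a small sketch of unshare() for comparison. It forks a child, lets the child leave the parent's UTS Namespace, and renames the host only inside the new one. The function name try_unshare_uts and its return codes are assumptions made for illustration; unshare(CLONE_NEWUTS) normally requires root privileges, so the sketch reports a refusal instead of treating it as fatal.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that leaves the parent's UTS namespace with unshare() and
 * sets a new host name there.
 * Returns 0 if the child was isolated and the parent's host name is
 * unchanged, 1 if unshare() was not permitted, -1 on any other error. */
int try_unshare_uts(const char *child_name) {
    char before[256] = {0}, after[256] = {0};
    if (gethostname(before, sizeof(before) - 1) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: detach from the shared UTS namespace. */
        if (unshare(CLONE_NEWUTS) != 0) _exit(1);
        /* This rename is only visible inside the new namespace. */
        if (sethostname(child_name, strlen(child_name)) != 0) _exit(2);
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    if (gethostname(after, sizeof(after) - 1) != 0) return -1;
    if (strcmp(before, after) != 0) return -1;  /* the rename leaked out */

    int code = WEXITSTATUS(status);
    return code == 0 ? 0 : (code == 1 ? 1 : -1);
}
```

A caller could run try_unshare_uts("container_dawn") and then check the host name with the hostname command: whatever the return value, the parent's host name should be unchanged.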

We first use clone() to create a child process without any Namespace flags. Running the program shows that the child's PID simply follows on from the parent's in the host's PID sequence; it is not 1.

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
/* Stack for the child process; clone() does not allocate one for us. */
static char container_stack[STACK_SIZE];

char* const container_args[] = {
    "/bin/bash",
    NULL
};

/* Entry point of the child process: replace it with a shell. */
int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());
    execv(container_args[0], container_args);
    /* execv() only returns if it failed. */
    perror("execv");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    /* The stack grows downward, so pass the top of the buffer. */
    int container_id = clone(container_main, container_stack + STACK_SIZE, SIGCHLD, NULL);
    if (container_id == -1) {
        perror("clone");
        return 1;
    }
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}

In the next piece of code, we give the new process its own PID Namespace and UTS Namespace. Running it shows that the PID of the child process is 1 and that the bash shell started in the child displays the host name container_dawn. Is it starting to smell like a container? The child's PID is 1 inside its PID Namespace because the isolation mechanism makes the child believe that it is process No. 1. This is essentially a sleight of hand: in the host's process space, the same process still has a real PID, such as 14624.

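/* Headers, STACK_SIZE, container_stack and container_args are the same as in the previous listing. */

int container_main(void* arg) {
    /* Inside the new PID Namespace, getpid() returns 1. */
    printf("Container [%5d] - inside the container!\n", getpid());
    /* Only affects the new UTS Namespace; the length (14) excludes the trailing NUL. */
    sethostname("container_dawn", 14);
    execv(container_args[0], container_args);
    /* execv() only returns if it failed. */
    perror("execv");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    /* New UTS and PID Namespaces for the child; this requires root privileges. */
    int container_id = clone(container_main, container_stack + STACK_SIZE,
                                CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
    if (container_id == -1) {
        perror("clone");
        return 1;
    }
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}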
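To make the two views of the same process visible side by side, the child can report the PID it sees back to the parent through a pipe. The following is a sketch under stated assumptions: the names report_pid and compare_pid_views are hypothetical, and CLONE_NEWUSER is added (it is not in the listing above) so that the call has a chance of working without root on systems that allow unprivileged user Namespaces.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

/* Runs inside the new PID namespace: send back the PID this process sees. */
static int report_pid(void *arg) {
    int fd = *(int *)arg;
    pid_t inner = getpid();
    return write(fd, &inner, sizeof(inner)) == (ssize_t)sizeof(inner) ? 0 : 1;
}

/* Fills host_view with the PID the parent sees and container_view with the
 * PID the child sees. Returns 0 on success, -1 if clone() is refused or the
 * child could not report. */
int compare_pid_views(pid_t *host_view, pid_t *container_view) {
    int fds[2];
    if (pipe(fds) != 0) return -1;

    /* CLONE_NEWUSER is an assumption for unprivileged use; as root it can be dropped. */
    pid_t child = clone(report_pid, child_stack + STACK_SIZE,
                        CLONE_NEWUSER | CLONE_NEWPID | SIGCHLD, &fds[1]);
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[1]);  /* the parent only reads */
    ssize_t n = read(fds[0], container_view, sizeof(*container_view));
    close(fds[0]);
    waitpid(child, NULL, 0);
    if (n != (ssize_t)sizeof(*container_view)) return -1;

    *host_view = child;  /* clone() returns the PID as seen from the caller's namespace */
    return 0;
}
```

When the call succeeds, container_view is the PID inside the new PID Namespace, while host_view is an ordinary PID on the host, which matches the behavior described above.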

Finally, let's change the file system that this process can see. We first use docker export to export the file system of a busybox container into a rootfs directory. As the figure shows, the rootfs directory already contains special directories such as /proc and /sys.

Next, we call chroot() in the code to change the root directory of the child process to the rootfs directory above. Running the program shows that the PID of the child process is 1 and that the child treats the rootfs directory as its root directory.

char* const container_args[] = {
    "/bin/sh",   /* busybox provides /bin/sh, not /bin/bash */
    NULL
};

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());

    /* Enter rootfs, then make it the root directory of this process. */
    if (chdir("./rootfs") != 0 || chroot("./") != 0) {
        perror("chdir/chroot");
        return 1;
    }

    execv(container_args[0], container_args);
    /* execv() only returns if it failed. */
    perror("execv");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    /* CLONE_NEWNS additionally gives the child its own Mount Namespace. */
    int container_id = clone(container_main, container_stack + STACK_SIZE,
                                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (container_id == -1) {
        perror("clone");
        return 1;
    }
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
 

Note that the shell has to change, because busybox has no /bin/bash. If container_args still pointed to /bin/bash, execv() would fail: once chroot() has changed the child's root directory, the path is resolved inside rootfs, so the program it looks for is rootfs/bin/bash.

At this point, a basic container has been implemented. Next, let's implement the basic principle behind Docker volumes (assuming you already know what a volume is). In the code, we mount the /tmp/t1 directory onto the rootfs/mnt directory using the MS_BIND flag. With a bind mount, what you see in rootfs/mnt (the /mnt directory once you are inside the container) is actually the content of /tmp/t1. Any change you make under rootfs/mnt is really a change to /tmp/t1, so rootfs/mnt is just another entry point to /tmp/t1. Before running the experiment, make sure that both directories, /tmp/t1 and rootfs/mnt, have been created. See the picture after the code for the experimental results.

char* const container_args[] = {
    "/bin/sh",
    NULL
};

int container_main(void* arg) {
    printf("Container [%5d] - inside the container!\n", getpid());

    /* Mimic a Docker volume: bind-mount /tmp/t1 onto rootfs/mnt. */
    if (mount("/tmp/t1", "rootfs/mnt", "none", MS_BIND, NULL) != 0) {
        perror("mnt");
    }

    /* Isolate the directory tree: make rootfs the root directory. */
    if (chdir("./rootfs") != 0 || chroot("./") != 0) {
        perror("chdir/chroot");
        return 1;
    }

    execv(container_args[0], container_args);
    /* execv() only returns if it failed. */
    perror("execv");
    return 1;
}

int main() {
    printf("Parent [%5d] - start a container!\n", getpid());
    int container_id = clone(container_main, container_stack + STACK_SIZE,
                                CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);
    if (container_id == -1) {
        perror("clone");
        return 1;
    }
    waitpid(container_id, NULL, 0);
    printf("Parent - container stopped!\n");
    return 0;
}
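The listing above assumes that both ends of the bind mount already exist. A small pre-flight step can create them if they are missing. This is only a sketch; ensure_dir and prepare_volume_dirs are hypothetical helper names, and mkdir() here creates a single directory level, so the parent directories must already exist.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create `path` as a directory if it is missing.
 * Returns 0 if the directory exists afterwards, -1 otherwise. */
int ensure_dir(const char *path) {
    if (mkdir(path, 0755) == 0) return 0;
    if (errno != EEXIST) return -1;
    /* Something is already there; make sure it is a directory. */
    struct stat st;
    return (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) ? 0 : -1;
}

/* Prepare the source and the target of the bind mount. */
int prepare_volume_dirs(const char *source, const char *target) {
    return (ensure_dir(source) == 0 && ensure_dir(target) == 0) ? 0 : -1;
}
```

For the experiment in this article, main could call prepare_volume_dirs("/tmp/t1", "rootfs/mnt") before clone() and stop if it returns -1.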

In addition to the PID, UTS, and Mount Namespaces used above, Linux also provides IPC, Network, and User Namespaces.
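Each of these Namespaces is selected with its own CLONE_NEW* flag, and the flags are combined with a bitwise OR exactly as in the clone() calls above. The lookup table below is an illustrative sketch: the short names and the functions ns_flag_for and ns_flags_for are assumptions, while the constants themselves come from the kernel header <linux/sched.h> (the same values are exposed by <sched.h> when _GNU_SOURCE is defined).

```c
#include <linux/sched.h>   /* CLONE_NEW* constants (kernel UAPI header) */
#include <stddef.h>
#include <string.h>

struct ns_flag {
    const char *name;  /* short name, chosen for this sketch */
    int flag;          /* flag passed to clone() or unshare() */
};

static const struct ns_flag NS_FLAGS[] = {
    {"pid",  CLONE_NEWPID},
    {"uts",  CLONE_NEWUTS},
    {"mnt",  CLONE_NEWNS},
    {"ipc",  CLONE_NEWIPC},
    {"net",  CLONE_NEWNET},
    {"user", CLONE_NEWUSER},
};

/* Return the flag for one namespace name, or 0 if the name is unknown. */
int ns_flag_for(const char *name) {
    for (size_t i = 0; i < sizeof(NS_FLAGS) / sizeof(NS_FLAGS[0]); i++) {
        if (strcmp(NS_FLAGS[i].name, name) == 0)
            return NS_FLAGS[i].flag;
    }
    return 0;
}

/* OR together the flags for several names; returns -1 if any name is unknown. */
int ns_flags_for(const char *const *names, size_t count) {
    int flags = 0;
    for (size_t i = 0; i < count; i++) {
        int flag = ns_flag_for(names[i]);
        if (flag == 0) return -1;
        flags |= flag;
    }
    return flags;
}
```

For example, asking ns_flags_for for "uts", "pid" and "mnt" yields the same value as CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS, to which a caller would still add SIGCHLD before passing it to clone().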

Summary


As the above shows, creating a container is not fundamentally different from creating a normal process. The parent process first creates a child process; in the case of a container, the child process then gets an independent resource environment through the isolation mechanisms provided by the kernel.

Similarly, when you use Docker, there is no real "Docker container" running on the host. What Docker starts is still the user's original application process; when creating that process, Docker specifies the set of Namespace parameters to enable for it. As a result, the process can only "see" the resources, files, devices, state, and configuration scoped by its Namespaces. The host and other unrelated programs, by contrast, are completely invisible to it. The process believes that it is process No. 1 in its PID Namespace, can only see the directories and files mounted in its Mount Namespace, and can only access the network devices in its Network Namespace. The process therefore runs in an independent "running environment", and that is the container.
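Whether two processes really live in different Namespaces can be checked from the host by comparing their entries under /proc/<pid>/ns: processes in the same namespace show the same device ID and inode number there. The helper below is a sketch with a hypothetical name, same_namespace, and it assumes the caller has permission to inspect both processes.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both PIDs are in the same namespace of the given type
 * ("pid", "uts", "mnt", ...), 0 if they are in different namespaces,
 * and -1 if either entry cannot be inspected. */
int same_namespace(pid_t a, pid_t b, const char *type) {
    char path_a[64], path_b[64];
    struct stat sa, sb;

    snprintf(path_a, sizeof(path_a), "/proc/%ld/ns/%s", (long)a, type);
    snprintf(path_b, sizeof(path_b), "/proc/%ld/ns/%s", (long)b, type);

    /* stat() follows the link to the namespace object itself. */
    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0) return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Run as root on the host, same_namespace(getpid(), container_pid, "pid") would be expected to return 0 for a process started with CLONE_NEWPID, where container_pid stands for the PID returned by clone().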

Therefore, I would like to repeat what I said at the beginning: a container is really just a special kind of process. It is simply that this process and all the resources it needs to run are packaged together. Compared with virtual machines, containers are in essence processes that are merely divided into different "running environments" on the same operating system, so they consume fewer resources and deploy faster.

Origin blog.csdn.net/FL63Zv9Zou86950w/article/details/112914054