Analysis of the underlying principles of Docker

Author: vitovzhong, Tencent TEG Application Development Engineer

The essence of a container is a process that shares the kernel with the other processes on the host. Unlike a process that executes directly on the host, however, a container process runs in its own independent namespaces. Namespaces isolate resources between processes, so that processes a and b can see a given set of resources while process c cannot.

1. Evolution

The desire for unified development, testing, and production environments predates the emergence of Docker. Let's first look at the solutions that appeared before Docker.

1.1 vagrant

Vagrant was the first technical solution the author encountered for the problem of inconsistent environment configuration. It is written in Ruby and was released by HashiCorp in January 2010. Under the hood, Vagrant runs a virtual machine, with VirtualBox as the default provider. A fully configured virtual machine is called a box. Users can freely install dependent libraries and software services inside the virtual machine and then publish the box. With a few simple commands, anyone can pull the box and reproduce the environment.

# Pull an Ubuntu 12.04 box
$ vagrant init hashicorp/precise32

# Boot the virtual machine
$ vagrant up

# List the boxes available locally
$ vagrant box list


If you need to run multiple services, you can also write a Vagrantfile to bring up mutually dependent services together, much like docker-compose today.

config.vm.define "web" do |web|
  web.vm.box = "apache"
end

config.vm.define "db" do |db|
  db.vm.box = "mysql"
end

1.2 LXC (LinuX Container)

In 2008, Linux 2.6.24 merged the cgroups feature into the mainline kernel. LinuX Containers (LXC) is a project developed by Canonical on top of technologies such as namespaces and cgroups, aimed at the container world. Its goal was a well-isolated container environment running on a Linux system, and it first appeared on the Ubuntu operating system.

In 2013, Docker was officially unveiled at the PyCon conference. At the time, Docker was developed on Ubuntu 12.04 and was essentially a tool built on top of LXC: it hid the usage details of LXC (much as Vagrant hides the underlying virtual machine), letting users create their own container environment with a single docker run command.

2. Technology Development

Container technology is virtualization at the operating-system level; it can be summarized as using Linux kernel features such as cgroups and namespaces to encapsulate and isolate processes. Long before Docker, Linux already provided the basic technologies that today's Docker relies on. Docker seemed to become popular all over the world overnight, but the underlying technology accumulated over many years. We will walk through a few key technical milestones.

2.1 Chroot

Software is commonly divided into system software and application software, and the programs running in a container are application software. A process in a container essentially runs on the host and shares the kernel with the host's other processes. Every application needs a suitable environment to run, including dependencies such as shared libraries. To avoid library conflicts between different applications, it is natural to ask whether they can be isolated so that each sees a different set of libraries. Based on this simple idea, the chroot system call first appeared in 1979. Let's try an example. On a devcloud cloud host, I have prepared an Alpine rootfs in my home directory, as follows:

Execute in this directory:

chroot rootfs/ /bin/bash

Then print /etc/os-release and you will see "Alpine Linux", indicating that the newly started shell is rooted in the rootfs directory and isolated from the devcloud host's filesystem.
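The steps above can be sketched end to end. The rootfs/ layout below is a hypothetical minimal skeleton, just enough to illustrate the idea (a real rootfs could be obtained, for example, by extracting the output of docker export of an alpine container); the chroot call itself requires root, so it is shown as a comment:

```shell
# Prepare a minimal rootfs skeleton (hypothetical layout; a complete one
# would come from e.g. `docker export $(docker create alpine) | tar -C rootfs -x`)
mkdir -p rootfs/bin rootfs/etc
printf 'NAME="Alpine Linux"\n' > rootfs/etc/os-release

# As root, start a shell whose filesystem root is rootfs/:
#   chroot rootfs/ /bin/sh
# Inside that shell, /etc/os-release is the file created above:
cat rootfs/etc/os-release
```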

2.2 Namespace

Simply put, namespaces are a technology provided by the Linux kernel for resource isolation between processes, so that processes a and b can see a given set of resources while process c cannot. The feature first entered the kernel in Linux 2.4.19 in 2002. With the introduction of the user namespace in Linux 3.8 in 2013, all of the namespaces required by the containers we know today were in place.

Linux provides multiple namespaces to isolate multiple different resources. The essence of a container is a process, but unlike a process that executes directly on the host, the container process runs in its own independent namespace. Therefore, the container can have its own root file system, its own network configuration, its own process space, and even its own user ID space.

Let's look at a simple example to get an intuitive sense of what a namespace is and where it can be observed. On the devcloud cloud host, execute ls -l /proc/self/ns to see the namespaces supported by the current system.
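A minimal sketch of that inspection, assuming a Linux host: each entry under /proc/self/ns is a symlink whose target encodes the namespace type and an inode number, and two processes are in the same namespace exactly when those targets match.

```shell
# List all namespace links of the current process
ls -l /proc/self/ns

# The symlink target looks like pid:[4026531836]; the number identifies
# the namespace, so equal targets mean "same namespace"
readlink /proc/self/ns/pid
```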

Then we use the unshare command to run a bash so that it does not use the current pid namespace:

unshare --pid --fork --mount-proc bash

Then run ps -a to see which processes exist in the current pid namespace:

Execute ls -l /proc/self/ns in the new bash, and you will find that its pid namespace differs from the previous one.

Since docker is implemented on top of the kernel's namespace feature, we can easily verify this by executing:

docker run --pid host --rm -it alpine sh

This runs a simple alpine container that shares its pid namespace with the host. Executing ps -a inside the container then shows the same set of processes as on the devcloud machine, and ls -l /proc/self/ns/ shows that the container's pid namespace is identical to the host's.

2.3 cgroups

Cgroups (control groups) are not a namespace but a resource management mechanism used to implement virtualization: they determine which of the resources allocated to a container we can manage, and how much of each resource the container may use. A process in a container runs in an isolated environment and behaves as if it were operating on a system independent of the host, which makes containerized applications safer than running directly on the host. For example, you can set an upper limit on memory usage: once the memory used by the process group (the container) reaches the limit, further allocations trigger the OOM (out of memory) killer, so one process's excessive memory consumption cannot affect the operation of other processes.
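The memory-limit example can be sketched as follows. The docker command is shown commented because it needs a running docker daemon (and the path inside it assumes cgroup v1; on cgroup v2 the file is memory.max instead); the runnable part just lists the controllers the kernel exposes:

```shell
# Controllers the running kernel supports (cpu, memory, pids, ...)
cat /proc/cgroups

# With docker available, cap a container at 100 MiB; allocating past this
# limit inside the container triggers the OOM killer for that cgroup only:
#   docker run --rm -m 100m alpine \
#     cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroup v1 path
```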

Let's try an example. Run an alpine container on the devcloud machine, restricted to 1.5 cores of CPU time on the first two CPUs only:

docker run --rm -it --cpus "1.5" --cpuset-cpus 0,1 alpine

Then open a new terminal to see what resources on the system we can control:

cat /proc/cgroups

The leftmost column lists the resources that can be controlled. Next we need to find the directory where the resource-allocation information is stored:

mount | grep cgroup

Then we find the cgroups configuration of the alpine image we just ran:

cat /proc/$(docker inspect --format='{{.State.Pid}}' $(docker ps -ql))/cgroup

By splicing the two paths together, you can see this container's resource configuration. Let's first verify that the CPU limit is 1.5 cores:

cat /sys/fs/cgroup/cpu,cpuacct/docker/c1f68e86241f9babb84a9556dfce84ec01e447bf1b8f918520de06656fa50ab4/cpu.cfs_period_us

The output is 100000, which can be regarded as the accounting period (one unit); now look at the quota:

cat /sys/fs/cgroup/cpu,cpuacct/docker/c1f68e86241f9babb84a9556dfce84ec01e447bf1b8f918520de06656fa50ab4/cpu.cfs_quota_us

The output is 150000; dividing by the period gives exactly the 1.5 cores we set. Next, verify that only the first two cores are used:

cat /sys/fs/cgroup/cpuset/docker/c1f68e86241f9babb84a9556dfce84ec01e447bf1b8f918520de06656fa50ab4/cpuset.cpus

Output 0-1.
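The quota/period division above can be checked in one line; the two values are the ones read from cpu.cfs_quota_us and cpu.cfs_period_us:

```shell
period=100000   # cpu.cfs_period_us: accounting window in microseconds
quota=150000    # cpu.cfs_quota_us: CPU time allowed per window
awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.1f cores\n", q / p }'
# prints "1.5 cores"
```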

So far the container's resource configuration matches our settings, but is usage actually limited to 1.5 cores on CPU0 and CPU1? Let's look at the current CPU usage:

docker stats $(docker ps -ql)

Because no program is running in the alpine container, CPU usage is 0. Now go back to the terminal where we first started the container and run an infinite loop:

i=0; while true; do i=i+i; done

Let's observe the current CPU usage:

It is close to 1, but why not 1.5? Because the loop we just started can only run on one core. If we open another terminal, enter the alpine container, and run the same infinite loop, CPU usage stabilizes at 1.5, showing that resource usage is indeed restricted.

We now have some understanding of how docker containers achieve resource isolation between processes. In terms of isolation alone, Vagrant had already achieved it. So why did docker become so popular worldwide? Because it lets users package the container environment into an image for distribution, and images are built incrementally, which greatly lowers the barrier to entry.

3. Storage

An image is the basic unit of Docker deployment. It contains the program files and the environment resources the program depends on. A Docker image is mounted inside the container at a mount point. A container can be roughly understood as a runtime instance of an image; by default, a writable layer is added on top of the image layers, so changes made inside a container are normally contained in this writable layer.

3.1 Union File System (UFS)

A union file system combines multiple file directories in different physical locations and mounts them at a single directory, forming one abstract, unified file system.

As shown in the figure above, from the UFS perspective on the right, lowerdir and upperdir are two different directories, and UFS merges the two into the merged layer presented to the caller. From the docker perspective on the left, lowerdir is the image and upperdir is equivalent to the container's default writable layer. Files modified in a running container can be saved as a new image with the docker commit command.
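The lowerdir/upperdir behavior can be sketched with a manual overlay mount. The directory preparation runs as a normal user; the mount itself needs root, so it is shown commented:

```shell
mkdir -p lower upper work merged
echo 'from lower' > lower/a.txt
echo 'from upper' > upper/b.txt
echo 'upper wins' > upper/a.txt   # shadows lower/a.txt in the merged view

# As root, combine the layers (see mount(8) and the kernel overlayfs docs):
#   mount -t overlay overlay \
#     -o lowerdir=lower,upperdir=upper,workdir=work merged
# merged/ would then contain a.txt ("upper wins") and b.txt;
# writes into merged/ land in upper/, never in lower/.
```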

3.2 Docker image storage management

With the layering concept of UFS, we can understand a simple Dockerfile like this:

FROM alpine
COPY foo /foo
COPY bar /bar

and understand the meaning of its build-time output: each instruction adds a new layer on top of the previous one.
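A hedged sketch of observing that layering directly (the docker commands are commented since they need a daemon; the ctx/ build context and the layered-demo tag are names invented for this example):

```shell
# Create a build context matching the Dockerfile above
mkdir -p ctx
printf 'hello foo\n' > ctx/foo
printf 'hello bar\n' > ctx/bar
printf 'FROM alpine\nCOPY foo /foo\nCOPY bar /bar\n' > ctx/Dockerfile

# With a docker daemon available:
#   docker build -t layered-demo ctx
#   docker history layered-demo   # one layer per COPY, atop alpine's layers
```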

But where are the image files pulled by docker pull stored on the local machine, and how are they managed? Let's verify in practice. First confirm the storage driver docker currently uses on devcloud (overlay2 by default):

docker info --format '{{.Driver}}'

And the storage path where images are kept after download (/var/lib/docker by default):

docker info --format '{{.DockerRootDir}}'

My docker installation uses a modified storage path, configured as /data/docker-data, so we will use that as the example. First look at the structure of this directory:

tree -L 1 /data/docker-data

Note the image and overlay2 directories: the former stores image metadata, the latter stores the file contents of each layer. Let's look more closely at the image directory structure:

tree -L 2 /data/docker-data/image/

Note the imagedb directory. Taking our newly pulled alpine image as an example, let's see how docker manages images. Execute:

docker pull alpine:latest

Then check its image ID: docker image ls alpine:latest

Remember this ID a24bb4013296, now you can look at the changes in the imagedb directory:

tree -L 2 /data/docker-data/image/overlay2/imagedb/content/ | grep a24bb4013296

A new file named after the image ID has appeared; it is a JSON file containing the image's parameter information:

jq . /data/docker-data/image/overlay2/imagedb/content/sha256/a24bb4013296f61e89ba57005a7b3e52274d8edd3ae2077d04395f806b63d83e

Next, let's see what changes after running an image. Start an alpine container and let it sleep for 10 minutes:

docker run --rm -d alpine sleep 600

Then find its overlay mount point:

docker inspect --format='{{.GraphDriver.Data}}' $(docker ps -ql) | grep MergedDir

Combined with the union file system described in the previous section, you can list it:

ls /data/docker-data/overlay2/74e92699164736980c9e20475388568f482671625a177cb946c4b136e4d94a64/merged

This is the merged file system presented inside the alpine container. Now enter the container:

docker exec -it $(docker ps -ql) sh

Then open a new terminal and check what has changed in the running container compared with the image:

docker diff $(docker ps -ql)

A shell history file has been added under the /root directory. Now let's manually create a hello.txt file inside the container:

echo 'Hello Docker' > hello.txt

Now look at the changes in the upperDir directory, the writable layer the container adds on top of the image by default:

ls /data/docker-data/overlay2/74e92699164736980c9e20475388568f482671625a177cb946c4b136e4d94a64/diff

This verifies that the overlay2 driver merges the image content and the writable layer into the file system the container uses. Multiple running containers share one base image while each has an independent writable layer, saving storage space.

At this time, we can also answer where the actual content of the image is stored:

cat /data/docker-data/overlay2/74e92699164736980c9e20475388568f482671625a177cb946c4b136e4d94a64/lower

View these layers:

ls /data/docker-data/overlay2/l/ZIIZFSQUQ4CIKRNCMOXXY4VZHY/

These are the image layers that form the lower levels of the union mount.

Summary

This article covered the underlying technologies used by Docker, including namespaces, cgroups, and the overlay2 union file system, focusing on how the isolated environment on the host evolved and how it is implemented. Working through the commands by hand gives these concepts a concrete feel. Next time I hope to introduce Docker's network implementation mechanism.


Origin blog.csdn.net/Tencent_TEG/article/details/109505143