Container technology: operating system virtualization built on Cgroups and Namespaces

The development history of operating system virtualization (container technology)

In 1979, Version 7 UNIX introduced the Chroot feature. Chroot is now considered the prototype of operating system level virtualization; it is essentially an isolation technology at the file system layer of the operating system.

In 2006, Google released the Process Container technology running on Linux. Its goal was to provide operating-system-level resource limitation, prioritization, resource accounting, and process control capabilities for Processes, similar to what Virtual Machine (computer virtualization) technology provides for whole machines.

In 2007, Google pushed for the Process Container code to be merged into the Linux Kernel. Because the name Container already carried many different meanings inside the Kernel, Process Container was renamed Control Groups, or Cgroups for short, to avoid naming confusion.

In 2008, the Linux community combined Chroot, Cgroups, Namespaces, SELinux, Seccomp and other technologies and released LXC (Linux Containers) v0.1.0. LXC achieves complete, lightweight operating system virtualization by combining the resource quota management capability of Cgroups with the resource view isolation capability of Namespaces.

On March 15, 2013, at the Python developer conference PyCon held in Santa Clara, California, Solomon Hykes, founder and CEO of dotCloud, first unveiled Docker, at the time a wrapper around LXC, in a lightning talk of only 5 minutes, and open-sourced the code on GitHub after the conference.

Chroot

chroot is a System Call interface (SCI) that a User Process can invoke. It makes a specified directory the root directory (Root Directory) of the Process, so that all subsequent file system operations of that Process are confined to the specified directory tree. Hence the name: Change Root.

The function prototype of chroot() is very simple:

  • Invocation authority : root privilege (the CAP_SYS_CHROOT capability).
  • Parameters :
    • path: a pointer to a string holding the absolute path of the directory that will become the new root directory of the Process.
  • Return value :
    • Success: returns 0;
    • Failure: returns -1 (and sets errno).
#include <unistd.h>

int chroot(const char *path);

It should be noted that after the root directory of the Process has been changed, the Process can only access files and resources under the new root directory and its subdirectories. Therefore, before calling chroot(), make sure that everything the Process needs to access exists under the new root directory.
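
For illustration, here is a minimal sketch (assuming a hypothetical, pre-populated directory /srv/jail that already contains /bin/sh and the libraries it needs); chroot() is typically followed by chdir("/") so that the working directory also moves inside the new root:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical, pre-populated directory that already contains /bin/sh
     * and the libraries it needs. */
    const char *new_root = "/srv/jail";

    if (chroot(new_root) != 0) {          /* requires root (CAP_SYS_CHROOT) */
        perror("chroot");
        return EXIT_FAILURE;
    }
    if (chdir("/") != 0) {                /* move the cwd inside the new root */
        perror("chdir");
        return EXIT_FAILURE;
    }

    /* From here on, "/" refers to /srv/jail on the host. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");                      /* only reached if exec failed */
    return EXIT_FAILURE;
}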

chroot() is currently mainly used for:

  1. Security isolation scenario : Limit the access range of the Process to improve system security.
  2. Debugging environment scenario : Create an environment isolated from the main system for debugging, testing and running Process.
  3. System rescue scenario : When the Linux operating system is damaged or attacked, you can use chroot to switch the Process to the root directory of the damaged system for repair and rescue operations.

It can be seen that chroot() does provide isolation for a Process at the Linux File System level, but it is not a complete security boundary and cannot prevent other kinds of attacks. To achieve real security isolation between Processes, additional security measures are needed.

Cgroups

Cgroups (Control Groups) is an operating system resource quota and management technology that the Linux Kernel provides for User Processes and Kernel Threads. It mainly covers the following four aspects:

  1. Resource quota : Limit the usage quota of a system resource by a process.
  2. Priority : When resource competition occurs, which processes should prioritize the use of resources.
  3. Auditing : Monitor and report on resource limits and usage by processes.
  4. Control : Control the state of the process, for example: running, suspended, resumed.

The core concepts in the design and implementation of Cgroups include:

  1. libcgroups : Provides a set of programming interface libraries and applications.
  2. Tasks : A unified abstraction of User Processes and Kernel Threads. Inside the Kernel they are distinguished only by the parameters passed to the clone() SCI, and both are described by a task_struct.
  3. Subsystems : Type definitions for controllable resources.
  4. Control Group (cgroup) : It is a resource control group description used to associate several Tasks and Subsystems. The lowercase cgroup is used below to describe a specific Control Group.
  5. Cgroup Filesystem : Provide cgroup configuration entry to Userspace through VFS (Virtual File System) unified file interface.

Cgroup Subsystems

Cgroups defines various types of system resources that can be controlled as Subsystems (subsystems), including:

  • cpu : Limit a Task's CPU time usage (scheduling weight and bandwidth quota).
  • cpuset : Limit the set of CPU Cores used by the Task.
  • cpuacct : Statistics Task CPU usage report (Accounting).
  • memory : Limit the Memory capacity used by the Task.
  • hugetlb : Limit the huge page memory capacity of Task.
  • devices : Limit the devices that the Task can access.
  • blkio : Limit the Block I/O usage of Task.
  • net_cls : Tag the network packets of a Task with a class identifier (Network Classifier) so that Traffic Control can limit its Net I/O usage.
  • net_prio : Set the Task's network traffic (Network Traffic) processing priority.
  • namespace : Restrict Tasks to use different Namespaces.
  • freezer : Suspend or resume the specified Task.
  • perf_event : Allows monitoring with the perf tool.
  • pids : Limit the number of Tasks associated with a cgroup.
  • etc.

These Subsystems mainly provide the corresponding configuration entries; the actual enforcement of the resource limits reuses the existing functional modules of the Kernel itself, for example:

  • The cpu Subsystem relies on the Kernel Process Scheduler implementation.
  • The memory Subsystem relies on the Kernel Memory Manager implementation.
  • The net_cls Subsystem relies on the Kernel Traffic Control implementation.
  • etc.

You can view the Cgroup Subsystems supported in the system through the CLI:

$ sudo yum install libcgroup-tools

$ lssubsys -a
cpuset
cpu,cpuacct
blkio
memory
devices
freezer
net_cls,net_prio
perf_event
hugetlb
pids
rdma

Cgroup Filesystem

Cgroups provides a unified cgroup configuration entry to Userspace through the Kernel VFS (Virtual File System) file interface.

You can view the mount path and content of the current Cgroup Filesystem through the CLI:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
tmpfs            16G     0   16G   0% /sys/fs/cgroup

$ ll /sys/fs/cgroup/
total 0
drwxr-xr-x. 4 root root  0 Jun  1 16:22 blkio
lrwxrwxrwx. 1 root root 11 Jun  1 16:22 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Jun  1 16:22 cpuacct -> cpu,cpuacct
drwxr-xr-x. 4 root root  0 Jun  1 16:22 cpu,cpuacct
drwxr-xr-x. 2 root root  0 Jun  1 16:22 cpuset
drwxr-xr-x. 4 root root  0 Jun  1 16:22 devices
drwxr-xr-x. 2 root root  0 Jun  1 16:22 freezer
drwxr-xr-x. 2 root root  0 Jun  1 16:22 hugetlb
drwxr-xr-x. 4 root root  0 Jun  1 16:22 memory
lrwxrwxrwx. 1 root root 16 Jun  1 16:22 net_cls -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jun  1 16:22 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Jun  1 16:22 net_prio -> net_cls,net_prio
drwxr-xr-x. 2 root root  0 Jun  1 16:22 perf_event
drwxr-xr-x. 4 root root  0 Jun  1 16:22 pids
drwxr-xr-x. 4 root root  0 Jun  1 16:22 systemd

It can be seen that by default each Subsystem gets its own Cgroup Filesystem mount, which contains the files needed to set resource quotas and to associate Tasks with the cgroup. For example:

$ ll /sys/fs/cgroup/memory/
total 0
-rw-r--r--.  1 root root 0 Jun  1 16:22 cgroup.clone_children
--w--w--w-.  1 root root 0 Jun  1 16:22 cgroup.event_control
-rw-r--r--.  1 root root 0 Jun  1 16:22 cgroup.procs
-r--r--r--.  1 root root 0 Jun  1 16:22 cgroup.sane_behavior
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.failcnt
--w-------.  1 root root 0 Jun  1 16:22 memory.force_empty
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.failcnt
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.limit_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.max_usage_in_bytes
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.slabinfo
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.tcp.failcnt
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.tcp.limit_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.tcp.usage_in_bytes
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.kmem.usage_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.limit_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.max_usage_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.memsw.failcnt
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.memsw.limit_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.memsw.max_usage_in_bytes
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.memsw.usage_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.move_charge_at_immigrate
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.numa_stat
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.oom_control
----------.  1 root root 0 Jun  1 16:22 memory.pressure_level
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.soft_limit_in_bytes
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.stat
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.swappiness
-r--r--r--.  1 root root 0 Jun  1 16:22 memory.usage_in_bytes
-rw-r--r--.  1 root root 0 Jun  1 16:22 memory.use_hierarchy
-rw-r--r--.  1 root root 0 Jun  1 16:22 notify_on_release
-rw-r--r--.  1 root root 0 Jun  1 16:22 release_agent
drwxr-xr-x. 55 root root 0 Jun  1 16:23 system.slice
-rw-r--r--.  1 root root 0 Jun  1 16:22 tasks
drwxr-xr-x.  2 root root 0 Jun  1 16:23 user.slice

Among these files, the core cgroup interface files are prefixed with cgroup.:

  • cgroup.clone_children : controls whether a newly created Child cgroup inherits the configuration of its Parent cgroup (it only takes effect for the cpuset controller). The default value is 0, i.e. no inheritance.
  • cgroup.procs : the list of PIDs (thread group IDs) attached to this cgroup; in the Root cgroup it records all PIDs in the Hierarchy.
  • etc.

The files prefixed with memory. are the Controller interface files, corresponding to the resource-distribution models of the Cgroups design, including:

  1. Weight : distribute a resource in proportion to configured weights.
  2. Limit (max) : prevent a resource from being used beyond a fixed ceiling.
  3. Protection : guarantee an amount of a resource, as either a hard or a soft (best-effort) protection.
  4. Allocation : exclusively allocate a finite resource.

The remaining files are management interface files, for example:

  • notify_on_release : indicates whether to run release_agent when the last Task of this cgroup exits.
  • release_agent : a path to a program that is executed to automatically clean up the cgroup once it is no longer used.
  • tasks : records the list of Tasks associated with this cgroup.

Cgroup Hierarchy

Because Cgroups expose their operation entry through the Filesystem, they also gain support for the Cgroup Hierarchy (hierarchical) form of organization, which is represented as a tree structure.

When a user creates a Child cgroup inside a Parent cgroup, the required configuration files are created automatically in the Child cgroup, and whether it inherits the relevant configuration of the Parent cgroup is configurable. As shown below.

$ mkdir /sys/fs/cgroup/memory/cgrp1/

$ ls /sys/fs/cgroup/memory/cgrp1/
cgroup.clone_children  memory.kmem.limit_in_bytes          memory.kmem.tcp.usage_in_bytes  memory.memsw.max_usage_in_bytes  memory.soft_limit_in_bytes  tasks
cgroup.event_control   memory.kmem.max_usage_in_bytes      memory.kmem.usage_in_bytes      memory.memsw.usage_in_bytes      memory.stat
cgroup.procs           memory.kmem.slabinfo                memory.limit_in_bytes           memory.move_charge_at_immigrate  memory.swappiness
memory.failcnt         memory.kmem.tcp.failcnt             memory.max_usage_in_bytes       memory.numa_stat                 memory.usage_in_bytes
memory.force_empty     memory.kmem.tcp.limit_in_bytes      memory.memsw.failcnt            memory.oom_control               memory.use_hierarchy
memory.kmem.failcnt    memory.kmem.tcp.max_usage_in_bytes  memory.memsw.limit_in_bytes     memory.pressure_level            notify_on_release

Finally, the Filesystem of each cgroup contains a tasks file, which holds the list of Tasks associated with that cgroup. To add a User Process to a cgroup, write its PID into the tasks file, as follows:

$ cd /sys/fs/cgroup/memory/cgrp1    # enter cgrp1
$ echo 1029 > tasks                 # add the Process with PID 1029 to the tasks list of cgrp1
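
Because the Cgroup Filesystem is an ordinary file interface, the same steps can also be done programmatically. The sketch below assumes Cgroups v1 with the memory Subsystem mounted at /sys/fs/cgroup/memory, is run as root, and uses a hypothetical cgroup name demo; it creates the cgroup, sets a 100 MiB memory quota, and adds the calling Process to it:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Helper: write a string into a cgroup control file. */
static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return -1;
    }
    fputs(value, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* "demo" is a hypothetical cgroup name. */
    const char *cg = "/sys/fs/cgroup/memory/demo";
    char buf[32];

    /* 1. Create the cgroup: the Kernel populates it with the memory.* files. */
    if (mkdir(cg, 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return EXIT_FAILURE;
    }

    /* 2. Set the resource quota: a 100 MiB memory limit. */
    if (write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
                   "104857600") != 0)
        return EXIT_FAILURE;

    /* 3. Associate the current Task with the cgroup by writing its PID to tasks. */
    snprintf(buf, sizeof(buf), "%d", (int)getpid());
    if (write_file("/sys/fs/cgroup/memory/demo/tasks", buf) != 0)
        return EXIT_FAILURE;

    printf("this process is now limited to 100 MiB of memory\n");
    return EXIT_SUCCESS;
}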

Operating Rules for Cgroups

When using Cgroups, certain operating rules must be followed, otherwise errors will occur. The purpose of these rules is to avoid conflicting resource-quota configurations.

  1. A Hierarchy can have multiple Subsystems attached; for example, the cpu and memory Subsystems can be attached to the same Hierarchy.

  2. A Subsystem that is already attached to one Hierarchy can additionally be attached only to an empty Hierarchy; it cannot be attached to a Hierarchy that already has a different Subsystem attached. For example, if the cpu Subsystem is attached to Hierarchy A and the memory Subsystem to Hierarchy B, the cpu Subsystem cannot also be attached to B; it can only be attached to another, empty Hierarchy C.

  3. Within one Hierarchy, a Task can belong to exactly one cgroup, but it can belong to one cgroup in each of several different Hierarchies. This guarantees that, for any given resource, the quota that applies to a Task is unambiguous.

  4. When a child process is forked, it automatically inherits the cgroups of its parent process; after the fork it can be moved to other cgroups as needed.

Code Implementation of Cgroups

Now look back at the declaration and definition of cgroup in Kernel, which is designed as a tree data structure:

struct cgroup {
    ...
    // The following three fields organize cgroups into a tree data structure
    struct list_head sibling;   // sibling nodes
    struct list_head children;  // child nodes
    struct cgroup *parent;      // parent node

    struct dentry *dentry;      // directory entry (dentry) of this cgroup

    // Subsystem state objects associated with this cgroup
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    ...
};

By default, during Kernel startup a rootnode (root node) is automatically instantiated, and the cgroup FS of every Subsystem is associated with this rootnode.

static struct cgroupfs_root rootnode;

struct cgroupfs_root {
    ...
    struct super_block *sb;            // superblock of the mounted cgroup FS (used by VFS)
    ...
    struct list_head subsys_list;      // list of Subsystems attached to this Root cgroup
    struct cgroup top_cgroup;          // the Root cgroup object
    int number_of_cgroups;             // number of cgroups under this Root cgroup
    ...
};

If users want to manually mount a Subsystem onto another cgroup FS, they can also do so with the mount command, as shown below:

$ mount -t cgroup -o memory memory /sys/fs/cgroup/memory1

In addition, the subsys field of struct cgroup points to a list of Subsystem State objects (per-subsystem resource accounting structures), through which the cgroup is associated with the concrete Subsystems.

struct cgroup_subsys_state {
    ...
    struct cgroup *cgroup; // points back to the owning cgroup
    atomic_t refcnt;       // reference counter
    unsigned long flags;   // flag bits
};

struct mem_cgroup {
    ...
    // Common part shared by all Subsystem state objects
    struct cgroup_subsys_state css;

    // Private part, specific to the memory controller
    struct res_counter res;  // accounts the memory usage of the associated tasks
    struct mem_cgroup_lru_info info;
    int prev_priority;
    struct mem_cgroup_stat stat;
};

It can be seen that a cgroup has a one-to-many relationship with its Subsystem State objects.

At the same time, because a Task can be associated with multiple cgroups, a many-to-many relationship between Tasks and Subsystems is ultimately formed. For example:

  • ProcessA belongs to /sys/fs/cgroup/memory/cgrp1/cgrp3 and /sys/fs/cgroup/cpu/cgrp2/cgrp3, so ProcessA is associated with two Cgroup Subsystems States, mem_groupA and task_groupA.
  • ProcessB belongs to /sys/fs/cgroup/memory/cgrp1/cgrp4 and /sys/fs/cgroup/cpu/cgrp2/cgrp3, so ProcessB is associated with the two Cgroup Subsystems States mem_groupB and task_groupA.

The task_struct of a Task records the Cgroup Subsystem States associated with it through its cgroups field, which points to a css_set, as follows:

struct task_struct {
    ...
    struct css_set *cgroups;
    ...
};

struct css_set {
    ...
    // Collects the Subsystem state objects of the different cgroups this task belongs to
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
};

Finally, an M×N Linkage (a many-to-many mapping) is formed between User Processes and cgroups.

Namespaces

Linux Namespaces is an operating system level resource view isolation technology: it partitions global Linux resources so that each resource instance is visible only within the scope of its Namespace.

There are many types of Namespaces, which basically cover the basic elements required to form an operating system:

  1. UTS namespace (system hostname)
  2. Time namespace (system time)
  3. PID namespace (system process number)
  4. IPC namespace (system interprocess communication)
  5. Mount namespace (system file system)
  6. Network namespace (system network)
  7. User namespace (system user permissions)
  8. Cgroup namespace (system Cgroup)

User Processes are the main objects served by Namespaces, and three SCIs are mainly involved:

  1. clone() : create a new Process and, via Namespace type flags, place it in newly created Namespace Instances at the same time.
  2. setns() : move the calling Process into an existing, specified Namespace Instance.
  3. unshare() : move the calling Process into newly created Namespace Instances, disassociating it from the ones it currently shares.

Each Namespace type has its own clone type flag: CLONE_NEWUTS, CLONE_NEWTIME, CLONE_NEWPID, CLONE_NEWIPC, CLONE_NEWNS (Mount), CLONE_NEWNET, CLONE_NEWUSER, and CLONE_NEWCGROUP.

Through the /proc/{pid}/ns directory, you can see which Namespace Instances the specified Process runs in; each Namespace Instance has a unique identifier.

$ ls -l --time-style='+' /proc/$$/ns
total 0
lrwxrwxrwx. 1 root root 0  ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0  mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0  net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0  pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0  user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0  uts -> uts:[4026531838]
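
To see these identifiers change, here is a minimal sketch (run as root, since unshare() with a namespace flag needs CAP_SYS_ADMIN) that reads /proc/self/ns/ipc before and after moving the Process into a new IPC Namespace Instance:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Print the identifier of the IPC namespace the calling process is in. */
static void print_ipc_ns(const char *when)
{
    char link[128];
    ssize_t n = readlink("/proc/self/ns/ipc", link, sizeof(link) - 1);
    if (n < 0) {
        perror("readlink");
        exit(EXIT_FAILURE);
    }
    link[n] = '\0';
    printf("%s: %s\n", when, link);   /* e.g. ipc:[4026531839] */
}

int main(void)
{
    print_ipc_ns("before unshare");

    /* Move this process into a brand-new IPC Namespace Instance. */
    if (unshare(CLONE_NEWIPC) != 0) {
        perror("unshare");            /* needs root / CAP_SYS_ADMIN */
        return EXIT_FAILURE;
    }

    print_ipc_ns("after unshare");    /* a different ipc:[...] identifier */
    return EXIT_SUCCESS;
}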

In the end, by creating Namespace Instances of different types to provide isolation of operating system resource views, and combining them with cgroups of different types to provide operating system resource quotas, users obtain a basic operating system container: the Process Container.

UTS namespace

The UTS namespace provides isolation of the Hostname and Domain Name for a Container.

The Processes in the Container can call sethostname() and setdomainname() as needed, so that each Container can behave as an independent node on the network.
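
As a minimal sketch (run as root, since both unshare(CLONE_NEWUTS) and sethostname() need privileges; the hostname container-1 is hypothetical), a Process can enter a new UTS namespace and change the hostname it sees without affecting the host:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char name[64];

    if (unshare(CLONE_NEWUTS) != 0) {             /* new UTS namespace */
        perror("unshare");
        return EXIT_FAILURE;
    }

    /* Only the hostname seen inside this namespace changes;
     * the host's hostname is left untouched. */
    if (sethostname("container-1", strlen("container-1")) != 0) {
        perror("sethostname");
        return EXIT_FAILURE;
    }

    gethostname(name, sizeof(name));
    printf("hostname inside the new UTS namespace: %s\n", name);
    return EXIT_SUCCESS;
}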

PID namespace

The PID namespace provides the isolation of the process ID for the Container.

Each Container has its own process environment, and the Container's init Process is the PID 1 process, which acts as the ancestor of all other processes in the Container. To achieve process isolation, a process with PID 1 must be created first; it has the following characteristics (a minimal sketch follows the list):

  • If a child process becomes orphaned (its parent exits without waiting for it), the init Process adopts it and is responsible for reclaiming its resources when it terminates.
  • If the init Process itself is terminated, the Kernel sends SIGKILL to all remaining processes in this PID namespace to terminate them.
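
Here is the minimal sketch referenced above (run as root, since CLONE_NEWPID needs CAP_SYS_ADMIN): the child created by clone() becomes PID 1 inside the new PID namespace, while the parent still sees the child's ordinary PID in the host namespace:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char child_stack[STACK_SIZE];

/* Entry point of the child: inside the new PID namespace it is the init Process. */
static int child_fn(void *arg)
{
    (void)arg;
    printf("child : getpid() = %ld  (PID 1 inside the new namespace)\n",
           (long)getpid());
    return 0;
}

int main(void)
{
    /* clone() creates the child directly inside a new PID namespace. */
    pid_t child = clone(child_fn, child_stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        perror("clone");
        return EXIT_FAILURE;
    }

    printf("parent: child PID in the host namespace = %ld\n", (long)child);
    waitpid(child, NULL, 0);
    return EXIT_SUCCESS;
}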

IPC namespace

The IPC namespace provides isolation of IPC (inter-process communication) mechanisms for Containers, including semaphores, message queues, and shared memory.

Each Container has the following /proc file interface:

  • /proc/sys/fs/mqueue : POSIX Message Queues interface type;
  • /proc/sys/kernel : System V IPC interface type;
  • /proc/sysvipc : System V IPC interface type.

Mount namespace

The Mount namespace provides isolation of Filesystem mount points for the Container, thereby isolating its view of the VFS.

Each Container has the following /proc file interface, which can form an independent rootfs (Root file system):

  • /proc/[pid]/mounts
  • /proc/[pid]/mountinfo
  • /proc/[pid]/mountstats

In fact, the Mount namespace grew out of the continuous improvement of Chroot. The rootfs created for a Container contains only the files, directories, and configuration of an operating system distribution; it does not include Kernel files.
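
A minimal sketch, assuming it runs as root on a systemd-based system (where mounts are shared by default, so mount propagation must first be made private): after unshare(CLONE_NEWNS), a tmpfs mounted inside the new Mount namespace is not visible to the rest of the host:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <unistd.h>

int main(void)
{
    /* 1. Enter a new Mount namespace (requires root / CAP_SYS_ADMIN). */
    if (unshare(CLONE_NEWNS) != 0) {
        perror("unshare");
        return EXIT_FAILURE;
    }

    /* 2. Stop mount events from propagating back to the host namespace
     *    (systemd makes "/" a shared mount by default). */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) != 0) {
        perror("mount MS_PRIVATE");
        return EXIT_FAILURE;
    }

    /* 3. Mount a tmpfs on /mnt: only this Mount namespace can see it. */
    if (mount("tmpfs", "/mnt", "tmpfs", 0, "size=16m") != 0) {
        perror("mount tmpfs");
        return EXIT_FAILURE;
    }

    /* A shell started here inherits the isolated mount table. */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return EXIT_FAILURE;
}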

Network namespace

Network namespace provides the isolation of network resources for Container, including:

  • Network devices
  • IPv4 and IPv6 protocol stacks (IPv4, IPv6 protocol stack)
  • IP routing tables
  • Firewall rules
  • Sockets
  • /proc/[pid]/net
  • /sys/class/net
  • /proc/sys/net

It should be noted that a physical Network device can exist in only one Namespace Instance at a time, so Network namespaces are usually used in combination with virtual network devices (for example, veth pairs).

User namespace

The User namespace provides a Container with isolation of user permissions and security attributes, including: User IDs, User Group IDs, the Root directory, and special permissions (Capabilities).

Each Container has the following /proc file interfaces (a minimal usage sketch follows the list):

  • /proc/[pid]/uid_map
  • /proc/[pid]/gid_map
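
Each of these map files can be written only once per namespace. The following minimal sketch (assuming the host allows unprivileged user namespaces; no root needed) unshares a User namespace and maps the caller's own UID/GID to 0 inside it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Helper: write a short string into a /proc file. */
static void write_file(const char *path, const char *buf)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0 || write(fd, buf, strlen(buf)) < 0) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    close(fd);
}

int main(void)
{
    uid_t uid = geteuid();
    gid_t gid = getegid();
    char map[64];

    /* Enter a new User namespace. */
    if (unshare(CLONE_NEWUSER) != 0) {
        perror("unshare");
        return EXIT_FAILURE;
    }

    /* Map UID/GID 0 inside the namespace to our real IDs outside.
     * Format of uid_map / gid_map: "<inside-id> <outside-id> <count>" */
    snprintf(map, sizeof(map), "0 %d 1", (int)uid);
    write_file("/proc/self/uid_map", map);

    write_file("/proc/self/setgroups", "deny");  /* must precede writing gid_map */
    snprintf(map, sizeof(map), "0 %d 1", (int)gid);
    write_file("/proc/self/gid_map", map);

    printf("euid inside the new User namespace: %d\n", (int)geteuid());  /* 0 */
    return EXIT_SUCCESS;
}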

Application of Docker to Cgroups and Namespaces

When we create a Docker Container, we can view the cgroups and namespaces of the Container.

  1. Check the container ID (cfca1212d140) and PID (2240) configuration.
$ docker ps
CONTAINER ID   IMAGE                   COMMAND   CREATED         STATUS       PORTS     NAMES
cfca1212d140   centos:centos7.9.2009   "bash"    18 months ago   Up 2 hours             vim-ide

$ docker inspect --format='{{.State.Pid}}' cfca1212d140
2240
  2. Check the cgroups configuration of the Container.
$ ll /sys/fs/cgroup/memory/docker/
total 0
drwxr-xr-x. 2 root root 0 Jun  2 03:40 cfca1212d1407a89632a439e974e246d1f6edd0bbef9079f06addf2613e1d46f

$ cat /sys/fs/cgroup/memory/docker/cfca1212d1407a89632a439e974e246d1f6edd0bbef9079f06addf2613e1d46f/cgroup.procs 
2240

$ cat /sys/fs/cgroup/memory/docker/cfca1212d1407a89632a439e974e246d1f6edd0bbef9079f06addf2613e1d46f/memory.limit_in_bytes
9223372036854771712
The value 9223372036854771712 is the default of memory.limit_in_bytes, which effectively means no memory limit has been set for this Container.
  3. Check the namespaces configuration of the Container.
$ ls -l --time-style='+' /proc/2240/ns
total 0
lrwxrwxrwx. 1 root root 0  ipc -> ipc:[4026532433]
lrwxrwxrwx. 1 root root 0  mnt -> mnt:[4026532431]
lrwxrwxrwx. 1 root root 0  net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0  pid -> pid:[4026532434]
lrwxrwxrwx. 1 root root 0  user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0  uts -> uts:[4026532432]

