K8S+DevOps Architect Practical Course | Implementation Principles

Video source: Bilibili, "Docker & k8s tutorial: the ceiling, absolutely the best one taught on Bilibili; learn k8s with this set and get all the core knowledge of Docker"

These notes organize the instructor's course content together with my own test notes, written while studying and shared here for everyone. Any infringing content will be removed on request. Thank you for your support!

See also the summary post: K8S+DevOps Architect Practical Course | Summary

The core problems virtualization has to solve: resource isolation and resource limitation

Virtual machines use hardware virtualization: a hypervisor layer provides complete isolation of resources.
Containers are virtualization at the operating system level: they use the kernel's Cgroup and Namespace features, which are implemented entirely in software.

Namespace resource isolation

A namespace is an abstraction over global resources. Resources are placed in different namespaces, and the resources in each namespace are isolated from those in the others.

Classification	System call parameter	Kernel version
Mount namespaces	CLONE_NEWNS	Linux 2.4.19
UTS namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC namespaces	CLONE_NEWIPC	Linux 2.6.19
PID namespaces	CLONE_NEWPID	Linux 2.6.24
Network namespaces	CLONE_NEWNET	Started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces	CLONE_NEWUSER	Started in Linux 2.6.23, completed in Linux 3.8

View the namespaces of the current process:

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 16 18:17 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Sep 16 18:17 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Sep 16 18:17 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Sep 16 18:17 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Sep 16 18:17 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Sep 16 18:17 uts -> uts:[4026531838]

$ nohup ping www.baidu.com &
$ ps aux | grep ping
root      18889  0.1   0.0  150088   1996 pts/0    S   20:25   0:00 ping www.baidu.com
$ ls -l /proc/18889/ns

To the operating system, a Docker container is just a process. We can therefore reproduce the basic principle of container resource isolation by hand:

On Linux, a process is typically created through the clone() system call, whose prototype is as follows:

int clone(int(*child_func)(void*), void *child_stack, int flags, void*arg);

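child_func: the entry function that the child process runs.
child_stack: the stack space used by the child process. Where the stack grows downward, pass the address of the top of the stack memory.
flags: which CLONE_* flags to use, ORed with the signal sent to the parent when the child exits (SIGCHLD in the examples below).
arg: the user argument passed to child_func.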

Example 1: Give a process its own UTS namespace

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
  "/bin/bash",
  NULL
};

/* Entry point of the child process, which runs in the new UTS namespace */
int container_main(void* arg)
{
  printf("Container - inside the container!\n");
  sethostname("container", 9); /* set the hostname; 9 is strlen("container") */
  execv(container_args[0], container_args);
  /* execv() only returns if it failed */
  printf("Something's wrong!\n");
  return 1;
}

int main()
{
  printf("Parent - start a container!\n");
  /* The stack grows downward, so pass the top of the stack buffer */
  int container_pid = clone(container_main, container_stack+STACK_SIZE, CLONE_NEWUTS | SIGCHLD, NULL);
  if (container_pid == -1) {
    perror("clone"); /* creating a UTS namespace requires root (CAP_SYS_ADMIN) */
    return 1;
  }
  waitpid(container_pid, NULL, 0);
  printf("Parent - container stopped!\n");
  return 0;
}

Compile and test:

$ yum install gcc
$ gcc -o ns_uts ns_uts.c
$ ./ns_uts
$ hostname

$ echo $$
19102

# Open a new terminal and compare the namespace IDs of the two processes: only the uts entry differs
$ ls -l /proc/19102/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 16 20:47 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 uts -> uts:[4026532441]

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Sep 16 20:47 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Sep 16 20:47 uts -> uts:[4026531838]

# Also test what happens when CLONE_NEWUTS is not passed

Example 2: Give the container its own PID namespace

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {
  "/bin/bash",
  NULL
};

/* Entry point of the child process, which runs in the new UTS and PID namespaces */
int container_main(void* arg)
{
  printf("Container[%5d] - inside the container!\n", getpid());
  sethostname("container", 9); /* set the hostname; 9 is strlen("container") */
  execv(container_args[0], container_args);
  /* execv() only returns if it failed */
  printf("Something's wrong!\n");
  return 1;
}

int main()
{
  printf("Parent[%5d] - start a container!\n", getpid());
  /* The stack grows downward, so pass the top of the stack buffer */
  int container_pid = clone(container_main, container_stack+STACK_SIZE, CLONE_NEWUTS | CLONE_NEWPID | SIGCHLD, NULL);
  if (container_pid == -1) {
    perror("clone"); /* creating these namespaces requires root (CAP_SYS_ADMIN) */
    return 1;
  }
  waitpid(container_pid, NULL, 0);
  printf("Parent - container stopped!\n");
  return 0;
}

Compile and test:

$ gcc -o ns_pid ns_pid.c
$ ./ns_pid
$ echo $$

How to determine whether processes belong to the same namespace:

$ ./ns_pid
Parent[ 8061] - start a container!
$ pstree -p 8061
ns_pid(8061)───bash(8062)───pstree(8816)
$ ls -l /proc/8061/ns
lrwxrwxrwx 1 root root 0 Jun 24 12:51 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 net -> net:[4026531968]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 uts -> uts:[4026531838]
$ ls -l /proc/8062/ns
lrwxrwxrwx 1 root root 0 Jun 24 12:51 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 net -> net:[4026531968]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 pid -> pid:[4026534845]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Jun 24 12:51 uts -> uts:[4026534844]

## The pid and uts entries use different namespaces from the parent process, while the others are inherited from the parent

To sum up: when Docker starts a container, it calls the Linux kernel's namespace interfaces to create an isolated space. The following namespaces can be chosen freely at creation time, and Docker sets them up by default (user namespace remapping has to be enabled explicitly).

pid: isolates processes (PID: process ID)
net: manages network interfaces (NET: network)
ipc: manages access to IPC resources (IPC: interprocess communication, that is, semaphores, message queues, and shared memory)
mnt: manages file system mount points (MNT: mount)
uts: isolates the hostname and domain name
user: isolates users and user groups

CGroup resource limit

Namespaces guarantee isolation between containers, but they cannot control how many resources each container may consume. If one container runs a CPU-intensive task, it degrades the performance and efficiency of tasks in other containers, so the containers interfere with each other and compete for resources. Once the isolation of a process's virtual resources is solved, limiting the resource usage of multiple containers becomes the main problem.

Control Groups (CGroups for short) can limit and account for the use of physical resources on the host machine, such as CPU, memory, disk I/O, and network bandwidth. Each CGroup is a group of processes constrained by the same criteria and parameters. All we actually need to do is add the container process to the specified CGroup.

UnionFS union file system

Linux namespaces and cgroups solve resource isolation and resource limitation for containers, respectively, which is why containers are so lightweight: a single machine can usually run dozens or even hundreds of them. Do these containers share the same image, or does each one make its own copy of the image and then run independently? If every container copied the full file system, it would cause at least the following problems:

Containers would start and run more slowly
Containers and images would put heavy pressure on the host's disk space

How to solve this problem: Docker's storage driver

Image layered storage
UnionFS

A Docker image is composed of a series of layers, and each layer represents one instruction in the Dockerfile. Take the following Dockerfile as an example:

FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py

The Dockerfile here contains 4 instructions, each of which creates a layer. The following shows the structure of a container running on the image built from this Dockerfile:

The image is built by stacking these layers one on top of another, and the layers in the image are read-only. When we run a container, a new writable layer is added on top of these base layers. This is what we usually call the container layer: all changes made to the running container (writing new files, modifying existing files, deleting files) are written to it.

The container layer relies mainly on copy-on-write (CoW): a file is copied only when it needs to be written, which is the case when an existing file is modified. CoW lets all containers share the image's file system, and all reads are served from the image. Only when a file needs to be written is it copied from the image into the container's own file system and modified there. No matter how many containers share the same image, writes always go to the container's own copy, so the image's source files are never modified. When multiple containers modify the same file, each container gets its own copy in its own file system; the copies are isolated and do not affect each other. CoW therefore improves disk utilization effectively.

The files of each image layer are stored in different directories. How are the files from these directories combined into a single view?

UnionFS is a file system service for Linux that combines multiple file systems under a single mount point. It merges (unions) layers stored in different directories into one directory; the whole process is called a union mount.

The picture above shows the implementation of AUFS, which is one of Docker's storage drivers. Docker supports several storage drivers, including aufs, devicemapper, overlay2, zfs, and Btrfs. In recent Docker versions, overlay2 has replaced aufs as the recommended storage driver, but aufs is still used as the default driver on machines without an overlay2 driver.