Docker-based Slurm job management system
Aliyun server settings
Reference video: https://www.bilibili.com/video/BV177411K7bH
Step 1 - Apply for Alibaba Cloud server
You can apply for a one-month Alibaba Cloud host for free. I applied for a one-month, 1-core, 2 GB cloud server with 4 Mbps of bandwidth and a 40 GB system disk. The installed system is CentOS 8.4 (64-bit).
Step 2 - Modify the instance
After entering the cloud server ECS console, click the running instance i-uf689okdsil887t0h11x; the public IP address shown there is what you will use for SSH login. Next, modify the instance hostname and reset the instance password, then restart the instance immediately so the changes take effect.
Step 3 - Open the security group and perform port mapping
A cloud server purchased on Alibaba Cloud needs its security group configured, otherwise it cannot be accessed from the outside. Click Configure Rules in the operation bar, enter the security group, and add the port numbers you need to open; the later example uses port 8888. Be sure to keep the default open ports, which include 22 (used for SSH below). The port numbers I added are as shown in the picture above; if necessary, you can come back and add more later.
Step 4 - Use xshell to connect remotely
Go to the official website to download Xshell 7 and install it. Create a new session, fill in your Alibaba Cloud public IP, and then enter the user name root and the password you just set to log in to the server. If you see "Welcome to Alibaba Cloud Elastic Compute Service!", you have successfully entered the server.
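If you prefer a plain OpenSSH client to Xshell, the same login works from any terminal (a minimal sketch; substitute your instance's public IP):
```shell
ssh root@<your-public-ip>   # enter the password you just reset when prompted
```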
# Enter the command shown in the login prompt to activate the Cockpit web console
[root@Iceland ~]# systemctl enable --now cockpit.socket
Created symlink /etc/systemd/system/sockets.target.wants/cockpit.socket → /usr/lib/systemd/system/cockpit.socket.
# Check the server's current environment
[root@Iceland ~]# pwd
/root
[root@Iceland ~]# cd /
[root@Iceland /]# ls
bin dev home lib64 mnt proc run srv tmp var
boot etc lib media opt root sbin sys usr
[root@Iceland ~]# uname -r # check the OS kernel version
4.18.0-305.3.1.el8.x86_64
[root@Iceland /]# cat /etc/os-release # view detailed OS information
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
Install Docker on the server
Official website documentation: https://docs.docker.com/engine/install/centos/
Step 1 - Uninstall the old version of docker
[root@Iceland /]# sudo yum remove docker \
> docker-client \
> docker-client-latest \
> docker-common \
> docker-latest \
> docker-latest-logrotate \
> docker-logrotate \
> docker-engine
No match for argument: docker
No match for argument: docker-client
No match for argument: docker-client-latest
No match for argument: docker-common
No match for argument: docker-latest
No match for argument: docker-latest-logrotate
No match for argument: docker-logrotate
No match for argument: docker-engine
No packages marked for removal.
Dependencies resolved.
Nothing to do.
Complete! # since this is a fresh server, none of these old Docker packages were present
Step 2 - Set up the package repository
[root@Iceland /]# yum install -y yum-utils # install yum-utils
Last metadata expiration check: 2:09:26 ago on Sat 28 Aug 2021 06:38:17 PM CST.
Dependencies resolved.
=============================================================================================
Package Architecture Version Repository Size
=============================================================================================
Installing:
yum-utils noarch 4.0.18-4.el8 baseos 71 k
Transaction Summary
=============================================================================================
Install 1 Package
Total download size: 71 k
Installed size: 22 k
Downloading Packages:
yum-utils-4.0.18-4.el8.noarch.rpm 1.7 MB/s | 71 kB 00:00
---------------------------------------------------------------------------------------------
Total 1.6 MB/s | 71 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : yum-utils-4.0.18-4.el8.noarch 1/1
Running scriptlet: yum-utils-4.0.18-4.el8.noarch 1/1
Verifying : yum-utils-4.0.18-4.el8.noarch 1/1
Installed:
yum-utils-4.0.18-4.el8.noarch
Complete!
[root@Iceland /]# yum-config-manager \ # add the Docker stable repository
> --add-repo \
> https://download.docker.com/linux/centos/docker-ce.repo
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
# the overseas repository is slow; later we will use the Alibaba Cloud mirror
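As the comment above notes, the official repository can be slow from mainland China. A commonly used alternative is Alibaba Cloud's docker-ce mirror (a sketch, assuming the mirror path below is still available):
```shell
yum-config-manager \
    --add-repo \
    https://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
```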
Step 3 - Install docker engine
[root@Iceland /]# yum install docker-ce docker-ce-cli containerd.io # install these 3 components
Docker CE Stable - x86_64 18 kB/s | 15 kB 00:00
Dependencies resolved.
=====================================================================================================
Package Arch Version Repository Size
=====================================================================================================
Installing:
containerd.io x86_64 1.4.9-3.1.el8 docker-ce-stable 30 M
docker-ce x86_64 3:20.10.8-3.el8 docker-ce-stable 22 M
docker-ce-cli x86_64 1:20.10.8-3.el8 docker-ce-stable 29 M
Installing dependencies:
container-selinux noarch 2:2.164.1-1.module_el8.4.0+886+c9a8d9ad appstream 52 k
docker-ce-rootless-extras x86_64 20.10.8-3.el8 docker-ce-stable 4.6 M
docker-scan-plugin x86_64 0.8.0-3.el8 docker-ce-stable 4.2 M
fuse-common x86_64 3.2.1-12.el8 baseos 21 k
fuse-overlayfs x86_64 1.6-1.module_el8.4.0+886+c9a8d9ad appstream 73 k
fuse3 x86_64 3.2.1-12.el8 baseos 50 k
fuse3-libs x86_64 3.2.1-12.el8 baseos 94 k
libcgroup x86_64 0.41-19.el8 baseos 70 k
libslirp x86_64 4.3.1-1.module_el8.4.0+575+63b40ad7 appstream 69 k
slirp4netns x86_64 1.1.8-1.module_el8.4.0+641+6116a774 appstream 51 k
Enabling module streams:
container-tools rhel8
Transaction Summary
=====================================================================================================
Install 13 Packages
Total download size: 90 M
Installed size: 377 M
Is this ok [y/N]: y # enter y when prompted
Downloading Packages:
(1/13): container-selinux-2.164.1-1.module_el8.4.0+886+c9a8d9ad.noar 1.4 MB/s | 52 kB 00:00
(2/13): fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64.rpm 1.9 MB/s | 73 kB 00:00
(3/13): libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64.rpm 1.3 MB/s | 69 kB 00:00
(4/13): fuse-common-3.2.1-12.el8.x86_64.rpm 1.4 MB/s | 21 kB 00:00
(5/13): slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64.rpm 2.7 MB/s | 51 kB 00:00
(6/13): fuse3-3.2.1-12.el8.x86_64.rpm 4.0 MB/s | 50 kB 00:00
(7/13): libcgroup-0.41-19.el8.x86_64.rpm 4.6 MB/s | 70 kB 00:00
(8/13): fuse3-libs-3.2.1-12.el8.x86_64.rpm 4.7 MB/s | 94 kB 00:00
(9/13): docker-ce-20.10.8-3.el8.x86_64.rpm 5.5 MB/s | 22 MB 00:03
(10/13): docker-ce-rootless-extras-20.10.8-3.el8.x86_64.rpm 3.5 MB/s | 4.6 MB 00:01
(11/13): containerd.io-1.4.9-3.1.el8.x86_64.rpm 4.7 MB/s | 30 MB 00:06
(12/13): docker-scan-plugin-0.8.0-3.el8.x86_64.rpm 3.5 MB/s | 4.2 MB 00:01
(13/13): docker-ce-cli-20.10.8-3.el8.x86_64.rpm 3.6 MB/s | 29 MB 00:08
-----------------------------------------------------------------------------------------------------
Total 11 MB/s | 90 MB 00:08
warning: /var/cache/dnf/docker-ce-stable-fa9dc42ab4cec2f4/packages/containerd.io-1.4.9-3.1.el8.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Docker CE Stable - x86_64 3.1 kB/s | 1.6 kB 00:00
Importing GPG key 0x621E9F35:
Userid : "Docker Release (CE rpm) <[email protected]>"
Fingerprint: 060A 61C5 1B55 8A7F 742B 77AA C52F EB6B 621E 9F35
From : https://download.docker.com/linux/centos/gpg
Is this ok [y/N]: y # enter y when prompted
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : docker-scan-plugin-0.8.0-3.el8.x86_64 1/13
Running scriptlet: docker-scan-plugin-0.8.0-3.el8.x86_64 1/13
Installing : docker-ce-cli-1:20.10.8-3.el8.x86_64 2/13
Running scriptlet: docker-ce-cli-1:20.10.8-3.el8.x86_64 2/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Installing : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 3/13
Installing : containerd.io-1.4.9-3.1.el8.x86_64 4/13
Running scriptlet: containerd.io-1.4.9-3.1.el8.x86_64 4/13
Running scriptlet: libcgroup-0.41-19.el8.x86_64 5/13
Installing : libcgroup-0.41-19.el8.x86_64 5/13
Running scriptlet: libcgroup-0.41-19.el8.x86_64 5/13
Installing : fuse3-libs-3.2.1-12.el8.x86_64 6/13
Running scriptlet: fuse3-libs-3.2.1-12.el8.x86_64 6/13
Installing : fuse-common-3.2.1-12.el8.x86_64 7/13
Installing : fuse3-3.2.1-12.el8.x86_64 8/13
Installing : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 9/13
Running scriptlet: fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 9/13
Installing : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64 10/13
Installing : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64 11/13
Installing : docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Running scriptlet: docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Installing : docker-ce-3:20.10.8-3.el8.x86_64 13/13
Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64 13/13
Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 13/13
Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64 13/13
Verifying : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch 1/13
Verifying : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64 2/13
Verifying : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64 3/13
Verifying : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64 4/13
Verifying : fuse-common-3.2.1-12.el8.x86_64 5/13
Verifying : fuse3-3.2.1-12.el8.x86_64 6/13
Verifying : fuse3-libs-3.2.1-12.el8.x86_64 7/13
Verifying : libcgroup-0.41-19.el8.x86_64 8/13
Verifying : containerd.io-1.4.9-3.1.el8.x86_64 9/13
Verifying : docker-ce-3:20.10.8-3.el8.x86_64 10/13
Verifying : docker-ce-cli-1:20.10.8-3.el8.x86_64 11/13
Verifying : docker-ce-rootless-extras-20.10.8-3.el8.x86_64 12/13
Verifying : docker-scan-plugin-0.8.0-3.el8.x86_64 13/13
Installed:
container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch
containerd.io-1.4.9-3.1.el8.x86_64
docker-ce-3:20.10.8-3.el8.x86_64
docker-ce-cli-1:20.10.8-3.el8.x86_64
docker-ce-rootless-extras-20.10.8-3.el8.x86_64
docker-scan-plugin-0.8.0-3.el8.x86_64
fuse-common-3.2.1-12.el8.x86_64
fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64
fuse3-3.2.1-12.el8.x86_64
fuse3-libs-3.2.1-12.el8.x86_64
libcgroup-0.41-19.el8.x86_64
libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64
slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64
Complete!
Although Docker is now installed, the daemon has not been started yet (like any service, it must be started before it can be used).
Step 4 - Start docker and verify
[root@Iceland /]# systemctl start docker
[root@Iceland /]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete
Digest: sha256:7d91b69e04a9029b99f3585aaaccae2baa80bcf318f4a5d2165a9898cd2dc0a1
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon. # the client contacts the daemon
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.(amd64) # pull the image
3. The Docker daemon created a new container from that image which runs the
executable that produces the output you are currently reading. # create and run a container from the image
4. The Docker daemon streamed that output to the Docker client, which sent it
to your terminal. # the daemon streams the output to the terminal
To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/
For more examples and ideas, visit:
https://docs.docker.com/get-started/
The most important part of the output above is its explanation of the 4 steps Docker performs. At this point Docker is installed and working.
[root@Iceland /]# docker version
Client: Docker Engine - Community
Version: 20.10.8
API version: 1.41
Go version: go1.16.6
Git commit: 3967b7d
Built: Fri Jul 30 19:53:39 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.8
API version: 1.41 (minimum version 1.12)
Go version: go1.16.6
Git commit: 75249d8
Built: Fri Jul 30 19:52:00 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.9
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
runc:
Version: 1.0.1
GitCommit: v1.0.1-0-g4144b63
docker-init:
Version: 0.19.0
GitCommit: de40ad0
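Optionally, Docker can also be set to start automatically after a reboot (an extra, optional step):
```shell
sudo systemctl enable docker   # start the docker service automatically at boot
```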
Tip: Alibaba Cloud image accelerator
Log in to Alibaba Cloud --> Container Registry --> Image Tools --> Image Accelerator, then copy and run the 4 commands listed for CentOS.
[root@Iceland /]# sudo mkdir -p /etc/docker # create the directory
[root@Iceland /]# sudo tee /etc/docker/daemon.json <<-'EOF' # write the registry mirror config file
> {
> "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
> }
> EOF
{
"registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
}
[root@Iceland /]# sudo systemctl daemon-reload # reload the systemd configuration
[root@Iceland /]# sudo systemctl restart docker # restart docker
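To confirm the accelerator took effect, the configured address should appear under Registry Mirrors in docker info (a quick optional check):
```shell
docker info | grep -A 1 "Registry Mirrors"
```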
Docker network configuration
Understanding the docker0 bridge
[root@Iceland /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo # local loopback address
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0 # Alibaba Cloud private network address
valid_lft 315352421sec preferred_lft 315352421sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0 # docker0 bridge address
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
Each Docker container communicates within a network segment by bridging to docker0 (which behaves like a router). Notably, every container gets a veth pair: one end of the pair lives inside the container and the other end attaches to docker0, and the two ends are created and removed together. This is what keeps containers independent of each other while still letting them communicate efficiently with one another and with the external network.
Test
[root@Iceland ~]# docker run -d -P --name tomcat01 tomcat # -P maps the container port to a random host port; create and run a new container
Unable to find image 'tomcat:latest' locally
latest: Pulling from library/tomcat
1cfaf5c6f756: Pull complete
c4099a935a96: Pull complete
f6e2960d8365: Pull complete
dffd4e638592: Pull complete
a60431b16af7: Pull complete
4869c4e8de8d: Pull complete
9815a275e5d0: Pull complete
c36aa3d16702: Pull complete
cc2e74b6c3db: Pull complete
1827dd5c8bb0: Pull complete
Digest: sha256:1af502b6fd35c1d4ab6f24dc9bd36b58678a068ff1206c25acc129fb90b2a76a
Status: Downloaded newer image for tomcat:latest
b530e79cc32b45ed6222496013b66ab663eaef74c83dc62610b252b18d1a3310
[root@Iceland ~]# docker exec -it tomcat01 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo # local loopback address
valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0 # bridged address; interfaces 6 and 7 form a veth pair
valid_lft forever preferred_lft forever
[root@Iceland ~]# ping 172.17.0.2 # the container can be pinged directly from the host command line by its IP
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=64 time=0.064 ms
^C
--- 172.17.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.064/0.078/0.101/0.016 ms
[root@Iceland ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
valid_lft 315312613sec preferred_lft 315312613sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
7: veth0a09b40@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default # compared with before, this extra interface (number 7) is the host side of the veth pair belonging to the container we just created
link/ether e6:88:6f:4a:e9:4c brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::e488:6fff:fe4a:e94c/64 scope link
valid_lft forever preferred_lft forever
Docker assigns a pair of interfaces to each container for communication between the container and the bridge. With this mechanism, containers are isolated from one another yet can still communicate efficiently, which lays the groundwork for the inter-node communication of the Slurm cluster deployed later.
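The pairing can also be confirmed from the host without entering a container, using a couple of read-only commands (assuming the tomcat01 container from above is still running):
```shell
ip link show type veth          # lists the host-side veth interfaces attached to docker0
docker network inspect bridge   # shows which containers (and their IPs) are attached to docker0
```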
Containers communicate with each other using --link
Since a container's IP address may change, we would like to use --link so that containers can reach each other by name instead of by IP.
[root@Iceland ~]# docker run -d -P --name tomcat02 tomcat
07758a3a228c004fbf6cc8092b714d1249f921c4ba9360846206fc7915083f97
[root@Iceland ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
07758a3a228c tomcat "catalina.sh run" 5 seconds ago Up 4 seconds 0.0.0.0:49154->8080/tcp, :::49154->8080/tcp tomcat02
b530e79cc32b tomcat "catalina.sh run" 50 minutes ago Up 50 minutes 0.0.0.0:49153->8080/tcp, :::49153->8080/tcp tomcat01
[root@Iceland ~]# docker exec -it tomcat02 ping tomcat01
ping: tomcat01: Name or service not known # by container name alone, one container cannot reach another
# adding the --link option at run time solves this
[root@Iceland ~]# docker run -d -P --name tomcat03 --link tomcat02 tomcat
6e185946062f3af377eb58c34408471685cca20d8ca0b2873b24514856eda7d8
[root@Iceland ~]# docker exec -it tomcat03 ping tomcat02 # linking 03 to 02 lets them communicate
PING tomcat02 (172.17.0.3) 56(84) bytes of data.
64 bytes from tomcat02 (172.17.0.3): icmp_seq=1 ttl=64 time=0.131 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=2 ttl=64 time=0.091 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=3 ttl=64 time=0.076 ms
^C
--- tomcat02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 53ms
rtt min/avg/max/mdev = 0.076/0.099/0.131/0.024 ms
# but 02 cannot ping 03 in the other direction, because the link would have to be configured both ways
Querying tomcat03's hosts file shows that --link simply adds a one-way mapping to 02 in the container's /etc/hosts.
[root@Iceland ~]# docker exec -it tomcat03 cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3 tomcat02 07758a3a228c # here 02 is bound
172.17.0.4 6e185946062f
[root@Iceland ~]# docker exec -it tomcat02 cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.3 07758a3a228c
This also shows how inconvenient docker0 is: --link is a legacy, officially defined feature with real limitations, it cannot be customized, the mapping would have to be set up in both directions, and the default docker0 bridge does not support access by container name.
Advanced - building a custom network
Network modes (a short usage sketch follows this list):
- bridge: the default docker0 bridge mode
- none: do not configure any network
- host: share the network stack with the host
- container: share another container's network namespace (very limited use)
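As a quick illustration of the host and none modes mentioned in the list, a sketch (the alpine image is assumed only because it is small):
```shell
docker run --rm --network host alpine ip addr   # shares the host's network stack; no port mapping needed
docker run --rm --network none alpine ip addr   # no network at all except the loopback interface
```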
[root@Iceland ~]# docker network --help
Usage: docker network COMMAND
Manage networks
Commands:
connect Connect a container to a network
create Create a network # use create to build a custom bridge network
disconnect Disconnect a container from a network
inspect Display detailed information on one or more networks
ls List networks
prune Remove all unused networks
rm Remove one or more networks
Run 'docker network COMMAND --help' for more information on a command.
# on the default docker0 network, containers cannot be reached by name
[root@Iceland ~]# docker rm -f $(docker ps -aq) # first remove the previous containers and their network configuration
6e185946062f
07758a3a228c
b530e79cc32b
[root@Iceland ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[root@Iceland ~]# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
tomcat latest 266d1269bb29 10 days ago 668MB
[root@Iceland ~]# ip addr # only the original 3 network interfaces are left
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
valid_lft 315310384sec preferred_lft 315310384sec
inet6 fe80::216:3eff:fe29:ef40/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:5dff:fee9:e1b7/64 scope link
valid_lft forever preferred_lft forever
# create a new bridge network: --driver [network type] --subnet [subnet range] --gateway [gateway address]
[root@Iceland ~]# docker network create --driver bridge --subnet 192.168.0.0/16 --gateway 192.168.0.1 mynet
57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e
[root@Iceland ~]# docker network ls
NETWORK ID NAME DRIVER SCOPE
9223b334e60a bridge bridge local
8d96801ccaf3 host host local
57c914464f0a mynet bridge local
a5ff794b6d74 none null local
[root@Iceland ~]# docker network inspect mynet
[
{
"Name": "mynet",
"Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
"Created": "2021-08-29T09:07:38.248210817+08:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": {
},
"Config": [
{
"Subnet": "192.168.0.0/16", # 看到网络已经设置好了
"Gateway": "192.168.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
},
"Options": {
},
"Labels": {
}
}
]
Test the custom network
[root@Iceland ~]# docker run -d -P --name tomcat-net-01 --net mynet tomcat
c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898
[root@Iceland ~]# docker run -d -P --name tomcat-net-02 --net mynet tomcat
91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb
[root@Iceland ~]# docker network inspect mynet
[
{
"Name": "mynet",
"Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
"Created": "2021-08-29T09:07:38.248210817+08:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": {
},
"Config": [
{
"Subnet": "192.168.0.0/16",
"Gateway": "192.168.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb": {
"Name": "tomcat-net-02",
"EndpointID": "4df2dc1c5314bb02ae69ef7b47e32e658cb3aaaf7c65074bfddfe38629ba65be",
"MacAddress": "02:42:c0:a8:00:03",
"IPv4Address": "192.168.0.3/16", # 看到这里的IP就是我们定义的192.168.0.3
"IPv6Address": ""
},
"c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898": {
"Name": "tomcat-net-01",
"EndpointID": "7d92fc552cb88f410b207075e473afde36f63020dc63f0de7923fd7137e19b1f",
"MacAddress": "02:42:c0:a8:00:02",
"IPv4Address": "192.168.0.2/16", # 看到这里的IP就是我们定义的192.168.0.2
"IPv6Address": ""
}
},
"Options": {
},
"Labels": {
}
}
]
The advantage of a custom bridge network is that different networks (different subnets) are isolated from each other, while interconnection within a network is complete: the two containers can ping each other by IP or by name, which fixes the --link problem.
[root@Iceland ~]# docker exec -it tomcat-net-01 ping 192.168.0.3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.119 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=0.080 ms
^C
--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 64ms
rtt min/avg/max/mdev = 0.080/0.097/0.119/0.016 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.102 ms
^C
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 27ms
rtt min/avg/max/mdev = 0.101/0.106/0.116/0.010 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping tomcat-net-01 # pinging directly by container name also works
PING tomcat-net-01 (192.168.0.2) 56(84) bytes of data.
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=3 ttl=64 time=0.086 ms
^C
--- tomcat-net-01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 51ms
rtt min/avg/max/mdev = 0.086/0.094/0.098/0.005 ms
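A container that still sits on the default docker0 bridge can also be attached to mynet afterwards with docker network connect, after which it reaches the mynet containers by name (a sketch, assuming tomcat01 is recreated first):
```shell
docker run -d -P --name tomcat01 tomcat            # a container on the default bridge
docker network connect mynet tomcat01              # also attach it to mynet
docker exec -it tomcat01 ping -c 3 tomcat-net-01   # now reachable by name
```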
Docker Compose - batch container orchestration
Official documentation: https://docs.docker.com/compose/
With plain Docker, a single container goes through the process Dockerfile --> docker build --> docker run, with each container operated by hand. That approach does not scale to a large cluster, so Docker Compose automates running multiple containers from a configuration file.
Official introduction
Using Compose is basically a three-step process:
- Define your app’s environment with a Dockerfile so it can be reproduced anywhere.
- Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
- Run docker compose up and the Docker compose command starts and runs your entire app. You can alternatively run docker-compose up using the docker-compose binary.
Two important concepts of Compose:
- Service: an individual container / application component
- Project: a set of associated containers that together make up the application
Step 1 - Install compose
# first download the compose binary from GitHub
[root@Iceland ~]# sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# make the file executable
[root@Iceland ~]# sudo chmod +x /usr/local/bin/docker-compose
[root@Iceland ~]# docker-compose version # verify the installation
docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l 10 Sep 2019
Step 2 - Official Website Example
The official website example is a Python application: a hit counter that uses Redis. Because the server is slow and the downloads kept failing, it is not demonstrated here; the approximate steps are as follows:
- Step 1: Write the application app.py
- Step 2: Write a Dockerfile that packages the application into an image (a standalone application, not yet wired together)
- Step 3: Write the docker-compose yaml file (defining the whole service and the environment it needs) - the core file
- Step 4: Start the Compose project (docker-compose up runs the whole set of services)
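For reference, the day-to-day Compose lifecycle commands used in such a project look like this (generic commands, not specific to the counter example):
```shell
docker-compose up -d    # create and start all services in the background
docker-compose ps       # check the state of the services
docker-compose logs -f  # follow the combined logs of all services
docker-compose down     # stop and remove the containers and the default network
```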
Slurm cluster construction experiment
Reference blog: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601
Step 1 - Slurm Architecture Description
We'll create a Slurm cluster using docker-compose, which allows us to create an environment from a docker image (built by the author). Docker-compose will create containers and networks to communicate in isolated environments. Each container is a component of the cluster.
- slurmmaster is a container with slurmctld (Slurm's central management daemon).
- slurmnode[1-3] is a container with slurmd (Slurm's compute node daemon).
- slurmjupyter is a container with JupyterLab, which lets us use JupyterLab as a cluster client. As end users, we will interact with Slurm through the browser.
- cluster_default is the network docker-compose creates to join all the containers together; containers inside the network can see each other.
The following scheme shows how all components interact.
Step 2 - Write the yaml file
Since pre-built images are used, the whole project needs only a single yaml file, which defines the images to pull and how to run them; after that, entering docker-compose up -d on the command line brings everything up.
# create a cluster directory to hold the files
[root@Iceland ~]# mkdir cluster
[root@Iceland ~]# ls
cluster composetest
[root@Iceland ~]# cd cluster
[root@Iceland cluster]# vim docker-compose.yml
The docker-compose.yml file is as follows:
services:
  slurmjupyter:                               # the slurmjupyter container
    image: rancavil/slurm-jupyter:19.05.5-1   # image from rancavil (the author Rodrigo Ancavil's repository)
    hostname: slurmjupyter
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 8888:8888
  slurmmaster:
    image: rancavil/slurm-master:19.05.5-1
    hostname: slurmmaster
    user: admin
    volumes:
      - shared-vol:/home/admin
    ports:
      - 6817:6817
      - 6818:6818
      - 6819:6819
  slurmnode1:                                 # parameters for compute node container 1
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode1
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode1
    links:
      - slurmmaster                           # as with the custom network earlier, node1 can reach the master by name; same below
  slurmnode2:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode2
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode2
    links:
      - slurmmaster
  slurmnode3:
    image: rancavil/slurm-node:19.05.5-1
    hostname: slurmnode3
    user: admin
    volumes:
      - shared-vol:/home/admin
    environment:
      - SLURM_NODENAME=slurmnode3
    links:
      - slurmmaster
volumes:
  shared-vol:
Step 3 - Run docker-compose up
[root@Iceland cluster]# docker-compose up -d # start the deployment; the pull and creation output follows
Creating network "cluster_default" with the default driver # docker-compose automatically creates the custom network defined in the yaml
Creating volume "cluster_shared-vol" with default driver
Pulling slurmjupyter (rancavil/slurm-jupyter:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-jupyter
83ee3a23efb7: Pull complete
db98fc6f11f0: Pull complete
f611acd52c6c: Pull complete
87f6e2c4791b: Pull complete
1301353d4fa3: Pull complete
3347f4fbce33: Pull complete
0cf1a37339f3: Pull complete
e78d0881f8c1: Pull complete
37049fe9d876: Pull complete
a8fa566a7a57: Pull complete
24af49ba4a2f: Pull complete
97b9029f86ee: Pull complete
Digest: sha256:17a72e8e4c5d687359c2923af7166e84f9bd3b63146145421bbac006ce141d45
Status: Downloaded newer image for rancavil/slurm-jupyter:19.05.5-1
Pulling slurmmaster (rancavil/slurm-master:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-master
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
e216e1a311d3: Pull complete
ab998a26ee04: Pull complete
499f3426618c: Pull complete
b5b815649fa6: Pull complete
2f04debb872c: Pull complete
4050a9c6f8d3: Pull complete
Digest: sha256:1979f86166b58213380604dcd7c1fcdb2438a40c44add2ff356be47160a97ab3
Status: Downloaded newer image for rancavil/slurm-master:19.05.5-1
Pulling slurmnode1 (rancavil/slurm-node:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-node
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
d82ef016a552: Pull complete
5865a097296e: Pull complete
0602a8c59a76: Pull complete
6f2545f38103: Pull complete
608c665d03da: Pull complete
c80540692f3b: Pull complete
Digest: sha256:ae650d12fbdaddd29208d7638aa0498c655bfe5a33f4fd07d57e51eb211f18c2
Status: Downloaded newer image for rancavil/slurm-node:19.05.5-1
Creating cluster_slurmmaster_1 ... done
Creating cluster_slurmjupyter_1 ... done
Creating cluster_slurmnode1_1 ... done
Creating cluster_slurmnode2_1 ... done
Creating cluster_slurmnode3_1 ... done
[root@Iceland cluster]# docker-compose ps # all 5 containers are up and running
Name Command State Ports
-------------------------------------------------------------------------------------------------------------
cluster_slurmjupyter_1 /etc/slurm-llnl/docker-ent ... Up 0.0.0.0:8888->8888/tcp,:::8888->8888/tcp
cluster_slurmmaster_1 /etc/slurm-llnl/docker-ent ... Up 3306/tcp,
0.0.0.0:6817->6817/tcp,:::6817->6817/tcp,
0.0.0.0:6818->6818/tcp,:::6818->6818/tcp,
0.0.0.0:6819->6819/tcp,:::6819->6819/tcp
cluster_slurmnode1_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
cluster_slurmnode2_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
cluster_slurmnode3_1 /etc/slurm-llnl/docker-ent ... Up 6817/tcp, 6818/tcp, 6819/tcp
Enter the server's IP address followed by :8888 in the browser to open the JupyterLab interface we are running.
The Slurm queue extension is already installed.
Click its button to enter the Slurm Queue management interface.
Click the terminal button on the launcher page to open a shell inside the container.
admin@slurmjupyter:~$ scontrol show node # show node information; all 3 nodes are present
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:15:59 SlurmdStartTime=2021-08-29T06:38:14
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode2 NodeHostName=slurmnode2 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode3 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUTot=1 CPULoad=0.31
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=slurmnode3 NodeHostName=slurmnode3 Version=19.05.5
OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021
RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=slurmpar
BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
CfgTRES=cpu=1,mem=1M,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
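Other standard Slurm client commands also work from this terminal, for example checking the partition and the job queue (output omitted; it will reflect the 3 idle nodes):
```shell
admin@slurmjupyter:~$ sinfo                             # partitions and node states
admin@slurmjupyter:~$ squeue                            # jobs currently queued or running
admin@slurmjupyter:~$ scontrol show partition slurmpar  # details of the slurmpar partition
```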
Step 4 - Run a Slurm example
First create a new file in JupyterLab, rename it test.py, and enter the following code, which simply makes the process sleep for 15 s:
#!/usr/bin/env python3
import time
import os
import socket
from datetime import datetime as dt
if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))
Next create a script file job.sh that distributes the work to slurmnode[1-3]: it broadcasts test.py to the nodes with sbcast, sets the number of tasks to 3, and writes the results to result.out.
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#
#SBATCH --ntasks=3
#
sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py
Then open the Slurm Queue management interface and click Submit Job to hand the work to the cluster. Here we simply submit the job.sh file: select the file type, enter the path /home/admin/job.sh, and click Submit Job.
Remember to click Reload so the job is loaded into the queue and starts running on the cluster. After about 15 seconds, a result.out output file appears in the sidebar; double-click it to see the result computed in parallel by the 3 compute nodes.
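The same job can also be submitted from the JupyterLab terminal with the standard Slurm commands, as an alternative to the Slurm Queue page:
```shell
admin@slurmjupyter:~$ sbatch job.sh      # submit the batch script; prints the assigned job ID
admin@slurmjupyter:~$ squeue             # watch the job while it runs
admin@slurmjupyter:~$ cat result.out     # inspect the output once the job finishes
```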
This completes the Slurm example. Because the free server has only 1 core and 2 GB of memory, a more complex job such as matrix multiplication cannot be submitted.
Finally, remember to shut the services down:
[root@Iceland cluster]# docker-compose stop
Stopping cluster_slurmnode1_1 ... done
Stopping cluster_slurmnode2_1 ... done
Stopping cluster_slurmnode3_1 ... done
Stopping cluster_slurmjupyter_1 ... done
Stopping cluster_slurmmaster_1 ... done
[root@Iceland cluster]# docker-compose ps
Name Command State Ports
--------------------------------------------------------------------------
cluster_slurmjupyter_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmmaster_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode1_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode2_1 /etc/slurm-llnl/docker-ent ... Exit 137
cluster_slurmnode3_1 /etc/slurm-llnl/docker-ent ... Exit 137
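docker-compose stop only stops the containers. To also remove the containers and the cluster_default network (and, optionally, the shared volume), down can be used instead:
```shell
docker-compose down      # remove the containers and the network, keep the volume
docker-compose down -v   # additionally remove the cluster_shared-vol volume
```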
Special thanks to Kuangshen's Docker video series on Bilibili: "As long as learning doesn't kill you, keep learning like it will."