Docker-based Slurm job management system

Alibaba Cloud server setup

Reference video: https://www.bilibili.com/video/BV177411K7bH

Step 1 - Apply for Alibaba Cloud server

You can apply for a one-month Alibaba Cloud host for free. I applied for a one-month cloud server with 1 core, 2 GB of RAM, 4 Mbps of bandwidth, and a 40 GB system disk, running the 64-bit version of CentOS 8.4.

Step 2 - Modify the instance

After entering the cloud server ECS console, click the running instance i-uf689okdsil887t0h11x; the instance detail page shows the server's public IP address, which will be used later for the SSH login. Next, modify the instance hostname and reset the instance password, then choose "restart immediately" so the changes take effect.

Step 3 - Open the security group and perform port mapping

A cloud server purchased on Alibaba Cloud needs its security group configured, otherwise it cannot be accessed from outside. Click "configuration rule" in the operation bar, enter the security group, and add the port numbers you need to open. A later example uses port 8888. Be sure to keep the default open ports, which include 22 (used for SSH afterwards). If more ports are needed, you can always come back and add them later.
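Once the rules are saved, you can check from your local machine that a port is actually reachable; a quick sketch, with `<public-ip>` standing in for the instance's public address:

```shell
# <public-ip> is a placeholder for the instance's public IP.
nc -zv <public-ip> 22     # should succeed once the rule for 22 is active
nc -zv <public-ip> 8888   # succeeds only after a service listens on 8888
```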

Step 4 - Use xshell to connect remotely

Download Xshell 7 from the official website and install it. Create a new session, fill in your Alibaba Cloud public IP, then log in with the user name root and the password just set on the server. If you see "Welcome to Alibaba Cloud Elastic Compute Service!", you are on the server.
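If you prefer a plain terminal over Xshell, the same login works with OpenSSH; `<public-ip>` is a placeholder for the instance's public address:

```shell
ssh root@<public-ip>    # enter the instance password set in Step 2
```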

# Run the command suggested at login to enable the Cockpit web console
[root@Iceland ~]# systemctl enable --now cockpit.socket
Created symlink /etc/systemd/system/sockets.target.wants/cockpit.socket → /usr/lib/systemd/system/cockpit.socket.
# check the server's current environment
[root@Iceland ~]# pwd
/root
[root@Iceland ~]# cd /
[root@Iceland /]# ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
[root@Iceland ~]# uname -r		# show the OS kernel version
4.18.0-305.3.1.el8.x86_64
[root@Iceland /]# cat /etc/os-release		# show detailed OS information
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"

Install Docker on the server

Official website documentation: https://docs.docker.com/engine/install/centos/

Step 1 - Uninstall the old version of docker

[root@Iceland /]# sudo yum remove docker \
> docker-client \
> docker-client-latest \
> docker-common \
> docker-latest \
> docker-latest-logrotate \
> docker-logrotate \
> docker-engine
No match for argument: docker
No match for argument: docker-client
No match for argument: docker-client-latest
No match for argument: docker-common
No match for argument: docker-latest
No match for argument: docker-latest-logrotate
No match for argument: docker-logrotate
No match for argument: docker-engine
No packages marked for removal.
Dependencies resolved.
Nothing to do.
Complete!		# this is a fresh server, so none of these legacy docker packages exist

Step 2 - Install the mirror repository

[root@Iceland /]# yum install -y yum-utils		# install yum-utils
Last metadata expiration check: 2:09:26 ago on Sat 28 Aug 2021 06:38:17 PM CST.
Dependencies resolved.
=============================================================================================
 Package               Architecture       Version                   Repository          Size
=============================================================================================
Installing:
 yum-utils             noarch             4.0.18-4.el8              baseos              71 k

Transaction Summary
=============================================================================================
Install  1 Package

Total download size: 71 k
Installed size: 22 k
Downloading Packages:
yum-utils-4.0.18-4.el8.noarch.rpm                            1.7 MB/s |  71 kB     00:00    
---------------------------------------------------------------------------------------------
Total                                                        1.6 MB/s |  71 kB     00:00     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                     1/1 
  Installing       : yum-utils-4.0.18-4.el8.noarch                                       1/1 
  Running scriptlet: yum-utils-4.0.18-4.el8.noarch                                       1/1 
  Verifying        : yum-utils-4.0.18-4.el8.noarch                                       1/1 

Installed:
  yum-utils-4.0.18-4.el8.noarch                                                              

Complete!
[root@Iceland /]# yum-config-manager \		# add the stable docker repository
>     --add-repo \
>     https://download.docker.com/linux/centos/docker-ce.repo
Adding repo from: https://download.docker.com/linux/centos/docker-ce.repo		
# the overseas repository is slow; an Alibaba Cloud mirror is configured later

Step 3 - Install docker engine

[root@Iceland /]# yum install docker-ce docker-ce-cli containerd.io		# install these 3 components
Docker CE Stable - x86_64                                             18 kB/s |  15 kB     00:00    
Dependencies resolved.
=====================================================================================================
 Package                    Arch    Version                                  Repository         Size
=====================================================================================================
Installing:
 containerd.io              x86_64  1.4.9-3.1.el8                            docker-ce-stable   30 M
 docker-ce                  x86_64  3:20.10.8-3.el8                          docker-ce-stable   22 M
 docker-ce-cli              x86_64  1:20.10.8-3.el8                          docker-ce-stable   29 M
Installing dependencies:
 container-selinux          noarch  2:2.164.1-1.module_el8.4.0+886+c9a8d9ad  appstream          52 k
 docker-ce-rootless-extras  x86_64  20.10.8-3.el8                            docker-ce-stable  4.6 M
 docker-scan-plugin         x86_64  0.8.0-3.el8                              docker-ce-stable  4.2 M
 fuse-common                x86_64  3.2.1-12.el8                             baseos             21 k
 fuse-overlayfs             x86_64  1.6-1.module_el8.4.0+886+c9a8d9ad        appstream          73 k
 fuse3                      x86_64  3.2.1-12.el8                             baseos             50 k
 fuse3-libs                 x86_64  3.2.1-12.el8                             baseos             94 k
 libcgroup                  x86_64  0.41-19.el8                              baseos             70 k
 libslirp                   x86_64  4.3.1-1.module_el8.4.0+575+63b40ad7      appstream          69 k
 slirp4netns                x86_64  1.1.8-1.module_el8.4.0+641+6116a774      appstream          51 k
Enabling module streams:
 container-tools                    rhel8                                                           

Transaction Summary
=====================================================================================================
Install  13 Packages

Total download size: 90 M
Installed size: 377 M
Is this ok [y/N]: y		# type y when prompted
Downloading Packages:
(1/13): container-selinux-2.164.1-1.module_el8.4.0+886+c9a8d9ad.noar 1.4 MB/s |  52 kB     00:00    
(2/13): fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64.rpm  1.9 MB/s |  73 kB     00:00    
(3/13): libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64.rpm      1.3 MB/s |  69 kB     00:00    
(4/13): fuse-common-3.2.1-12.el8.x86_64.rpm                          1.4 MB/s |  21 kB     00:00    
(5/13): slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64.rpm   2.7 MB/s |  51 kB     00:00    
(6/13): fuse3-3.2.1-12.el8.x86_64.rpm                                4.0 MB/s |  50 kB     00:00    
(7/13): libcgroup-0.41-19.el8.x86_64.rpm                             4.6 MB/s |  70 kB     00:00    
(8/13): fuse3-libs-3.2.1-12.el8.x86_64.rpm                           4.7 MB/s |  94 kB     00:00    
(9/13): docker-ce-20.10.8-3.el8.x86_64.rpm                           5.5 MB/s |  22 MB     00:03    
(10/13): docker-ce-rootless-extras-20.10.8-3.el8.x86_64.rpm          3.5 MB/s | 4.6 MB     00:01    
(11/13): containerd.io-1.4.9-3.1.el8.x86_64.rpm                      4.7 MB/s |  30 MB     00:06    
(12/13): docker-scan-plugin-0.8.0-3.el8.x86_64.rpm                   3.5 MB/s | 4.2 MB     00:01    
(13/13): docker-ce-cli-20.10.8-3.el8.x86_64.rpm                      3.6 MB/s |  29 MB     00:08    
-----------------------------------------------------------------------------------------------------
Total                                                                 11 MB/s |  90 MB     00:08     
warning: /var/cache/dnf/docker-ce-stable-fa9dc42ab4cec2f4/packages/containerd.io-1.4.9-3.1.el8.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Docker CE Stable - x86_64                                            3.1 kB/s | 1.6 kB     00:00    
Importing GPG key 0x621E9F35:
 Userid     : "Docker Release (CE rpm) <[email protected]>"
 Fingerprint: 060A 61C5 1B55 8A7F 742B 77AA C52F EB6B 621E 9F35
 From       : https://download.docker.com/linux/centos/gpg
Is this ok [y/N]: y		# type y when prompted
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                             1/1 
  Installing       : docker-scan-plugin-0.8.0-3.el8.x86_64                                      1/13 
  Running scriptlet: docker-scan-plugin-0.8.0-3.el8.x86_64                                      1/13 
  Installing       : docker-ce-cli-1:20.10.8-3.el8.x86_64                                       2/13 
  Running scriptlet: docker-ce-cli-1:20.10.8-3.el8.x86_64                                       2/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Installing       : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           3/13 
  Installing       : containerd.io-1.4.9-3.1.el8.x86_64                                         4/13 
  Running scriptlet: containerd.io-1.4.9-3.1.el8.x86_64                                         4/13 
  Running scriptlet: libcgroup-0.41-19.el8.x86_64                                               5/13 
  Installing       : libcgroup-0.41-19.el8.x86_64                                               5/13 
  Running scriptlet: libcgroup-0.41-19.el8.x86_64                                               5/13 
  Installing       : fuse3-libs-3.2.1-12.el8.x86_64                                             6/13 
  Running scriptlet: fuse3-libs-3.2.1-12.el8.x86_64                                             6/13 
  Installing       : fuse-common-3.2.1-12.el8.x86_64                                            7/13 
  Installing       : fuse3-3.2.1-12.el8.x86_64                                                  8/13 
  Installing       : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    9/13 
  Running scriptlet: fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    9/13 
  Installing       : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                       10/13 
  Installing       : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                    11/13 
  Installing       : docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Running scriptlet: docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Installing       : docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Running scriptlet: container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch          13/13 
  Running scriptlet: docker-ce-3:20.10.8-3.el8.x86_64                                          13/13 
  Verifying        : container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch           1/13 
  Verifying        : fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                    2/13 
  Verifying        : libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                        3/13 
  Verifying        : slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                     4/13 
  Verifying        : fuse-common-3.2.1-12.el8.x86_64                                            5/13 
  Verifying        : fuse3-3.2.1-12.el8.x86_64                                                  6/13 
  Verifying        : fuse3-libs-3.2.1-12.el8.x86_64                                             7/13 
  Verifying        : libcgroup-0.41-19.el8.x86_64                                               8/13 
  Verifying        : containerd.io-1.4.9-3.1.el8.x86_64                                         9/13 
  Verifying        : docker-ce-3:20.10.8-3.el8.x86_64                                          10/13 
  Verifying        : docker-ce-cli-1:20.10.8-3.el8.x86_64                                      11/13 
  Verifying        : docker-ce-rootless-extras-20.10.8-3.el8.x86_64                            12/13 
  Verifying        : docker-scan-plugin-0.8.0-3.el8.x86_64                                     13/13 
Installed:
  container-selinux-2:2.164.1-1.module_el8.4.0+886+c9a8d9ad.noarch                                   
  containerd.io-1.4.9-3.1.el8.x86_64                                                                
  docker-ce-3:20.10.8-3.el8.x86_64                                                                  
  docker-ce-cli-1:20.10.8-3.el8.x86_64                                                 
  docker-ce-rootless-extras-20.10.8-3.el8.x86_64                                         
  docker-scan-plugin-0.8.0-3.el8.x86_64                                                         
  fuse-common-3.2.1-12.el8.x86_64                                                                
  fuse-overlayfs-1.6-1.module_el8.4.0+886+c9a8d9ad.x86_64                                
  fuse3-3.2.1-12.el8.x86_64                                                                
  fuse3-libs-3.2.1-12.el8.x86_64                                               
  libcgroup-0.41-19.el8.x86_64                                                          
  libslirp-4.3.1-1.module_el8.4.0+575+63b40ad7.x86_64                                      
  slirp4netns-1.1.8-1.module_el8.4.0+641+6116a774.x86_64                                    
Complete!

Although the packages are now installed, the Docker daemon has not been started yet; like any service, it must be started before use.
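Since this is a long-lived server, it can also be worth enabling Docker to start on boot rather than starting it manually after each reboot; a sketch:

```shell
systemctl enable --now docker   # start the daemon now and on every boot
```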

Step 4 - Start docker and verify

[root@Iceland /]# systemctl start docker
[root@Iceland /]# docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
b8dfde127a29: Pull complete 
Digest: sha256:7d91b69e04a9029b99f3585aaaccae2baa80bcf318f4a5d2165a9898cd2dc0a1
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.		# the client contacts the daemon
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub. (amd64)		# the daemon pulls the image
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.		# a container is created from the image and run
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.		# the daemon streams the output to your terminal

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash
Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/
For more examples and ideas, visit:
 https://docs.docker.com/get-started/

The key information above is the explanation of the four steps Docker performs; at this point Docker is fully installed.

[root@Iceland /]# docker version
Client: Docker Engine - Community
 Version:           20.10.8
 API version:       1.41
 Go version:        go1.16.6
 Git commit:        3967b7d
 Built:             Fri Jul 30 19:53:39 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true
Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:52:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b63
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Tips: Alibaba Cloud Mirror Accelerator

Log in to Alibaba Cloud --> Container Registry --> Image Tools --> Image Accelerator, then copy and run the four commands listed for CentOS.

[root@Iceland /]# sudo mkdir -p /etc/docker		# create the directory
[root@Iceland /]# sudo tee /etc/docker/daemon.json <<-'EOF'		# write the mirror address config
> {
>   "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
> }
> EOF
{
  "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
}
[root@Iceland /]# sudo systemctl daemon-reload		# reload the systemd daemon
[root@Iceland /]# sudo systemctl restart docker		# restart docker
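A malformed daemon.json stops the Docker daemon from starting at all, so it can be worth staging and validating the file before restarting. A sketch (the staging path `/tmp/daemon.json` is an arbitrary choice; the mirror URL is the one issued to this account):

```shell
# Stage the mirror config in /tmp, validate it as JSON, and only then
# copy it into place and restart Docker.
tee /tmp/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://lisay8ar.mirror.aliyuncs.com"]
}
EOF
python3 -m json.tool /tmp/daemon.json > /dev/null && echo "daemon.json is valid JSON"
# sudo cp /tmp/daemon.json /etc/docker/daemon.json && sudo systemctl restart docker
```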

Docker network configuration

Understanding the docker0 bridge

[root@Iceland /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo		# local loopback address
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0	# Alibaba Cloud private network address
       valid_lft 315352421sec preferred_lft 315352421sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0	# docker0 address
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever

Each Docker container communicates within its network segment by bridging through docker0 (which acts like a router). Notably, each container gets a veth pair of virtual interfaces connecting it to docker0; the two ends appear and disappear together. This is what keeps containers isolated from one another while still letting them interconnect efficiently and reach the external network.

Test

[root@Iceland ~]# docker run -d -P --name tomcat01 tomcat		# -P maps the exposed port to a random host port; creates and runs a new container
Unable to find image 'tomcat:latest' locally
latest: Pulling from library/tomcat
1cfaf5c6f756: Pull complete 
c4099a935a96: Pull complete 
f6e2960d8365: Pull complete 
dffd4e638592: Pull complete 
a60431b16af7: Pull complete 
4869c4e8de8d: Pull complete 
9815a275e5d0: Pull complete 
c36aa3d16702: Pull complete 
cc2e74b6c3db: Pull complete 
1827dd5c8bb0: Pull complete 
Digest: sha256:1af502b6fd35c1d4ab6f24dc9bd36b58678a068ff1206c25acc129fb90b2a76a
Status: Downloaded newer image for tomcat:latest
b530e79cc32b45ed6222496013b66ab663eaef74c83dc62610b252b18d1a3310
[root@Iceland ~]# docker exec -it tomcat01 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo		# local loopback address
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0		# bridged address; interfaces 6 and 7 form a veth pair
       valid_lft forever preferred_lft forever
[root@Iceland ~]# ping 172.17.0.2		# the container is reachable by IP straight from the host shell
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.101 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from 172.17.0.2: icmp_seq=3 ttl=64 time=0.064 ms
^C
--- 172.17.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.064/0.078/0.101/0.016 ms
[root@Iceland ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
       valid_lft 315312613sec preferred_lft 315312613sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever
7: veth0a09b40@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 		# the one interface added since the previous listing: number 7, the host side of the pair belonging to the new container
    link/ether e6:88:6f:4a:e9:4c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::e488:6fff:fe4a:e94c/64 scope link 
       valid_lft forever preferred_lft forever

Docker assigns a pair of interfaces to each container for communication between the container and the bridge. With this technique, containers are isolated from one another yet can communicate efficiently, laying the foundation for the inter-node communication of the Slurm cluster deployed later.
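The host-side halves of these veth pairs can be listed directly; a sketch, assuming the default bridge name docker0:

```shell
ip -brief link show master docker0   # one vethXXXX entry per running container
```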

Containers use --link to communicate with each other

Since a container's IP may change, the hope is to use --link so containers can reach each other by container name instead of by IP.

[root@Iceland ~]# docker run -d -P --name tomcat02 tomcat
07758a3a228c004fbf6cc8092b714d1249f921c4ba9360846206fc7915083f97
[root@Iceland ~]# docker ps
CONTAINER ID   IMAGE     COMMAND             CREATED          STATUS          PORTS                                         NAMES
07758a3a228c   tomcat    "catalina.sh run"   5 seconds ago    Up 4 seconds    0.0.0.0:49154->8080/tcp, :::49154->8080/tcp   tomcat02
b530e79cc32b   tomcat    "catalina.sh run"   50 minutes ago   Up 50 minutes   0.0.0.0:49153->8080/tcp, :::49153->8080/tcp   tomcat01
[root@Iceland ~]# docker exec -it tomcat02 ping tomcat01
ping: tomcat01: Name or service not known		# one container cannot reach another directly by container name
# adding the --link run-time flag solves this
[root@Iceland ~]# docker run -d -P --name tomcat03 --link tomcat02 tomcat
6e185946062f3af377eb58c34408471685cca20d8ca0b2873b24514856eda7d8
[root@Iceland ~]# docker exec -it tomcat03 ping tomcat02		# with 03 linked to 02, they can communicate
PING tomcat02 (172.17.0.3) 56(84) bytes of data.
64 bytes from tomcat02 (172.17.0.3): icmp_seq=1 ttl=64 time=0.131 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=2 ttl=64 time=0.091 ms
64 bytes from tomcat02 (172.17.0.3): icmp_seq=3 ttl=64 time=0.076 ms
^C
--- tomcat02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 53ms
rtt min/avg/max/mdev = 0.076/0.099/0.131/0.024 ms
# but the reverse, 02 pinging 03, fails: the link must be configured in each direction

Inspecting tomcat03's hosts file shows that --link is equivalent to adding a one-way mapping for 02 to its hosts configuration.

[root@Iceland ~]# docker exec -it tomcat03 cat /etc/hosts
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
172.17.0.3	tomcat02 07758a3a228c		# here tomcat02 is bound
172.17.0.4	6e185946062f
[root@Iceland ~]# docker exec -it tomcat02 cat /etc/hosts
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
ff00::0	ip6-mcastprefix
ff02::1	ip6-allnodes
ff02::2	ip6-allrouters
172.17.0.3	07758a3a228c

This also shows how inconvenient docker0 is: --link is an official mechanism with real limitations, it cannot be customized, and binding every pair in both directions is impractical. Moreover, docker0 itself does not support access by container name.

Advanced: building a custom network

Network modes

  • bridge: bridged via docker0 (default)
  • none: no network configured
  • host: share the host's network
  • container: share another container's network (very limited)
[root@Iceland ~]# docker network --help
Usage:  docker network COMMAND
Manage networks
Commands:
  connect     Connect a container to a network
  create      Create a network		# create builds a custom bridge network
  disconnect  Disconnect a container from a network
  inspect     Display detailed information on one or more networks
  ls          List networks
  prune       Remove all unused networks
  rm          Remove one or more networks
Run 'docker network COMMAND --help' for more information on a command.
# docker0 is the default network and does not support access by container name
[root@Iceland ~]# docker rm -f $(docker ps -aq)		# first remove the earlier containers and their network configuration
6e185946062f
07758a3a228c
b530e79cc32b
[root@Iceland ~]# docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
[root@Iceland ~]# docker images
REPOSITORY   TAG       IMAGE ID       CREATED       SIZE
tomcat       latest    266d1269bb29   10 days ago   668MB
[root@Iceland ~]# ip addr		# only the original 3 interfaces are left
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:16:3e:29:ef:40 brd ff:ff:ff:ff:ff:ff
    inet 172.30.31.209/20 brd 172.30.31.255 scope global dynamic noprefixroute eth0
       valid_lft 315310384sec preferred_lft 315310384sec
    inet6 fe80::216:3eff:fe29:ef40/64 scope link 
       valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:5d:e9:e1:b7 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:5dff:fee9:e1b7/64 scope link 
       valid_lft forever preferred_lft forever
# create a new bridge network: --driver [network type] --subnet [subnet range] --gateway [gateway address]
[root@Iceland ~]# docker network create --driver bridge --subnet 192.168.0.0/16 --gateway 192.168.0.1 mynet
57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e
[root@Iceland ~]# docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
9223b334e60a   bridge    bridge    local
8d96801ccaf3   host      host      local
57c914464f0a   mynet     bridge    local
a5ff794b6d74   none      null      local
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",		# the network is configured as requested
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {}
    }
]

Test custom network

[root@Iceland ~]# docker run -d -P --name tomcat-net-01 --net mynet tomcat
c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898
[root@Iceland ~]# docker run -d -P --name tomcat-net-02 --net mynet tomcat
91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb
[root@Iceland ~]# docker network inspect mynet
[
    {
        "Name": "mynet",
        "Id": "57c914464f0a0e9423483cf16dd5c71dc02c65d02218149e14a3fc169a45ad5e",
        "Created": "2021-08-29T09:07:38.248210817+08:00",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "192.168.0.0/16",
                    "Gateway": "192.168.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "91ce2929f0083f0bba803fa12ccf11b1b0cff36b3c807ada42e5fbe1aadef1cb": {
                "Name": "tomcat-net-02",
                "EndpointID": "4df2dc1c5314bb02ae69ef7b47e32e658cb3aaaf7c65074bfddfe38629ba65be",
                "MacAddress": "02:42:c0:a8:00:03",
                "IPv4Address": "192.168.0.3/16",		# 192.168.0.3, from the subnet we defined
                "IPv6Address": ""
            },
            "c2e8c4d6ec1af68bea8dcad213a9c693151859667f26336c596aedf4189aa898": {
                "Name": "tomcat-net-01",
                "EndpointID": "7d92fc552cb88f410b207075e473afde36f63020dc63f0de7923fd7137e19b1f",
                "MacAddress": "02:42:c0:a8:00:02",
                "IPv4Address": "192.168.0.2/16",		# 192.168.0.2, from the subnet we defined
                "IPv6Address": ""
            }
        },
        "Options": {},
        "Labels": {}
    }
]

The advantage of a custom bridge network is that different networks (different subnets) are isolated from each other, while connectivity inside a network is complete: the two containers can ping each other by name in both directions, which fixes the --link problem.

[root@Iceland ~]# docker exec -it tomcat-net-01 ping 192.168.0.3
PING 192.168.0.3 (192.168.0.3) 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=1 ttl=64 time=0.119 ms
64 bytes from 192.168.0.3: icmp_seq=2 ttl=64 time=0.092 ms
64 bytes from 192.168.0.3: icmp_seq=3 ttl=64 time=0.080 ms
^C
--- 192.168.0.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 64ms
rtt min/avg/max/mdev = 0.080/0.097/0.119/0.016 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.116 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.101 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.102 ms
^C
--- 192.168.0.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 27ms
rtt min/avg/max/mdev = 0.101/0.106/0.116/0.010 ms
[root@Iceland ~]# docker exec -it tomcat-net-02 ping tomcat-net-01		# pinging directly by container name also works
PING tomcat-net-01 (192.168.0.2) 56(84) bytes of data.
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=1 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=2 ttl=64 time=0.098 ms
64 bytes from tomcat-net-01.mynet (192.168.0.2): icmp_seq=3 ttl=64 time=0.086 ms
^C
--- tomcat-net-01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 51ms
rtt min/avg/max/mdev = 0.086/0.094/0.098/0.005 ms
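As an aside, the `docker network inspect` JSON shown above can also be consumed programmatically instead of being read by eye. A minimal Python sketch, assuming the field names from the output above (the sample string here is a trimmed copy of that output):

```python
import json

# Sample fragment modeled on the `docker network inspect mynet` output above.
inspect_output = """
[
    {
        "Name": "mynet",
        "Containers": {
            "91ce2929f008": {
                "Name": "tomcat-net-02",
                "IPv4Address": "192.168.0.3/16"
            },
            "c2e8c4d6ec1a": {
                "Name": "tomcat-net-01",
                "IPv4Address": "192.168.0.2/16"
            }
        }
    }
]
"""

def container_ips(inspect_json: str) -> dict:
    """Map container name -> IPv4 address (prefix length stripped)."""
    network = json.loads(inspect_json)[0]
    return {
        c["Name"]: c["IPv4Address"].split("/")[0]
        for c in network["Containers"].values()
    }

if __name__ == "__main__":
    print(container_ips(inspect_output))
```

In practice you would feed it the output of `docker network inspect mynet` directly, e.g. via `subprocess.run`.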

Docker Compose - batch container orchestration

Official documentation: https://docs.docker.com/compose/

With plain Docker, a single container goes through Dockerfile -> docker build -> docker run, and each container must be operated by hand. This does not scale to a large number of containers, so Docker Compose automates running multiple containers from a single configuration file.

Official introduction:

Using Compose is basically a three-step process:

  1. Define your app’s environment with a Dockerfile so it can be reproduced anywhere.
  2. Define the services that make up your app in docker-compose.yml so they can be run together in an isolated environment.
  3. Run docker compose up and the Docker compose command starts and runs your entire app. You can alternatively run docker-compose up using the docker-compose binary.

Two important concepts in Compose:

  • Service: a single container/application (a web server, a database, etc.)
  • Project: a group of associated containers that together make up a complete application

Step 1 - Install compose

# First download the compose binary from GitHub
[root@Iceland ~]# sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
# Make the file executable
[root@Iceland ~]# sudo chmod +x /usr/local/bin/docker-compose
[root@Iceland ~]# docker-compose version		# confirms the installation succeeded
docker-compose version 1.29.2, build 5becea4c
docker-py version: 5.0.0
CPython version: 3.7.10
OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019

Step 2 - Official Website Example

The official website example is a Python application: a counter that uses Redis. Because the server is too slow and the download kept failing, it is not demonstrated here; the rough steps are as follows:

  • Step 1: write the application app.py
  • Step 2: package the application into an image with a Dockerfile (a standalone app, not yet wired up)
  • Step 3: write the docker-compose yaml file that defines the whole service and the environment it needs (the core file)
  • Step 4: start the Compose project (docker-compose up runs the whole set of services)
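For reference, the compose file for that official counter example looks roughly like the sketch below (based on my reading of the Compose getting-started guide, so treat the exact ports and image tag as assumptions): web is built from the local Dockerfile, and redis is pulled as a ready-made image.

```yaml
services:
  web:
    build: .         # build the app.py application from the local Dockerfile
    ports:
      - "8000:5000"  # host port 8000 -> the app's port 5000 inside the container
  redis:
    image: "redis:alpine"
```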

Slurm cluster construction experiment

Reference blog: https://medium.com/analytics-vidhya/slurm-cluster-with-docker-9f242deee601

Step 1 - Slurm Architecture Description

We'll create a Slurm cluster using docker-compose, which allows us to create an environment from a docker image (built by the author). Docker-compose will create containers and networks to communicate in isolated environments. Each container is a component of the cluster.

  • slurmmaster is a container with slurmctld (Slurm's central management daemon).
  • slurmnode[1-3] is a container with slurmd (Slurm's compute node daemon).
  • slurmjupyter is a container with JupyterLab, which lets us use JupyterLab as a client to interact with the cluster. As end users, we will interact with Slurm through the browser.
  • cluster_default network: docker-compose will create a network to hold all the containers; containers inside the network can see each other.

The following scheme shows how all the components interact.

insert image description here

Step 2 Write yaml file

Since prebuilt images are used, the whole project needs only a yaml file, which defines which images to pull; then a single docker-compose up -d on the command line runs everything.

# Create a folder named cluster to hold the files
[root@Iceland ~]# mkdir cluster
[root@Iceland ~]# ls
cluster  composetest
[root@Iceland ~]# cd cluster
[root@Iceland cluster]# vim docker-compose.yml

The docker-compose.yml file is as follows:

services:
  slurmjupyter:		# the slurmjupyter client container
        image: rancavil/slurm-jupyter:19.05.5-1		# "rancavil" in the image repository is the author's name, Rodrigo Ancavil
        hostname: slurmjupyter
        user: admin
        volumes:
                - shared-vol:/home/admin
        ports:
                - 8888:8888
  slurmmaster:
        image: rancavil/slurm-master:19.05.5-1
        hostname: slurmmaster
        user: admin
        volumes:
                - shared-vol:/home/admin
        ports:
                - 6817:6817
                - 6818:6818
                - 6819:6819
  slurmnode1:		# parameters for compute-node container 1
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode1
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode1
        links:
                - slurmmaster		# like the custom network earlier, this lets node1 reach the master; same for the nodes below
  slurmnode2:
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode2
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode2
        links:
                - slurmmaster
  slurmnode3:
        image: rancavil/slurm-node:19.05.5-1
        hostname: slurmnode3
        user: admin
        volumes:
                - shared-vol:/home/admin
        environment:
                - SLURM_NODENAME=slurmnode3
        links:
                - slurmmaster
volumes:
        shared-vol:

Step 3 - Run docker-compose up

[root@Iceland cluster]# docker-compose up -d		# start the deployment; the installation steps follow
Creating network "cluster_default" with the default driver	# docker-compose automatically creates the custom network from the yaml
Creating volume "cluster_shared-vol" with default driver
Pulling slurmjupyter (rancavil/slurm-jupyter:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-jupyter
83ee3a23efb7: Pull complete
db98fc6f11f0: Pull complete
f611acd52c6c: Pull complete
87f6e2c4791b: Pull complete
1301353d4fa3: Pull complete
3347f4fbce33: Pull complete
0cf1a37339f3: Pull complete
e78d0881f8c1: Pull complete
37049fe9d876: Pull complete
a8fa566a7a57: Pull complete
24af49ba4a2f: Pull complete
97b9029f86ee: Pull complete
Digest: sha256:17a72e8e4c5d687359c2923af7166e84f9bd3b63146145421bbac006ce141d45
Status: Downloaded newer image for rancavil/slurm-jupyter:19.05.5-1
Pulling slurmmaster (rancavil/slurm-master:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-master
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
e216e1a311d3: Pull complete
ab998a26ee04: Pull complete
499f3426618c: Pull complete
b5b815649fa6: Pull complete
2f04debb872c: Pull complete
4050a9c6f8d3: Pull complete
Digest: sha256:1979f86166b58213380604dcd7c1fcdb2438a40c44add2ff356be47160a97ab3
Status: Downloaded newer image for rancavil/slurm-master:19.05.5-1
Pulling slurmnode1 (rancavil/slurm-node:19.05.5-1)...
19.05.5-1: Pulling from rancavil/slurm-node
83ee3a23efb7: Already exists
db98fc6f11f0: Already exists
f611acd52c6c: Already exists
87f6e2c4791b: Already exists
d82ef016a552: Pull complete
5865a097296e: Pull complete
0602a8c59a76: Pull complete
6f2545f38103: Pull complete
608c665d03da: Pull complete
c80540692f3b: Pull complete
Digest: sha256:ae650d12fbdaddd29208d7638aa0498c655bfe5a33f4fd07d57e51eb211f18c2
Status: Downloaded newer image for rancavil/slurm-node:19.05.5-1
Creating cluster_slurmmaster_1  ... done
Creating cluster_slurmjupyter_1 ... done
Creating cluster_slurmnode1_1   ... done
Creating cluster_slurmnode2_1   ... done
Creating cluster_slurmnode3_1   ... done
[root@Iceland cluster]# docker-compose ps		# all 5 containers are running normally
         Name                       Command               State                      Ports                   
-------------------------------------------------------------------------------------------------------------
cluster_slurmjupyter_1   /etc/slurm-llnl/docker-ent ...   Up      0.0.0.0:8888->8888/tcp,:::8888->8888/tcp   
cluster_slurmmaster_1    /etc/slurm-llnl/docker-ent ...   Up      3306/tcp,                                  
                                                                  0.0.0.0:6817->6817/tcp,:::6817->6817/tcp,  
                                                                  0.0.0.0:6818->6818/tcp,:::6818->6818/tcp,  
                                                                  0.0.0.0:6819->6819/tcp,:::6819->6819/tcp   
cluster_slurmnode1_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp               
cluster_slurmnode2_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp               
cluster_slurmnode3_1     /etc/slurm-llnl/docker-ent ...   Up      6817/tcp, 6818/tcp, 6819/tcp 

Now enter the server's IP address followed by :8888 in the browser to see the JupyterLab interface we are running.

insert image description here

This is the preinstalled Slurm Queue extension:

insert image description here

Click this button to enter the Slurm Queue management interface

insert image description here

Click the command line button on the previous page to enter the internal view

insert image description here

admin@slurmjupyter:~$ scontrol show node		# show node info: all 3 nodes are present
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode1 NodeHostName=slurmnode1 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:15:59 SlurmdStartTime=2021-08-29T06:38:14
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode2 NodeHostName=slurmnode2 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=slurmnode3 Arch=x86_64 CoresPerSocket=1 
   CPUAlloc=0 CPUTot=1 CPULoad=0.31
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=slurmnode3 NodeHostName=slurmnode3 Version=19.05.5
   OS=Linux 4.18.0-305.3.1.el8.x86_64 #1 SMP Tue Jun 1 16:14:33 UTC 2021 
   RealMemory=1 AllocMem=0 FreeMem=141 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=slurmpar 
   BootTime=2021-08-28T11:16:00 SlurmdStartTime=2021-08-29T06:38:15
   CfgTRES=cpu=1,mem=1M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
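If you want to check node state from a script rather than by eye, the scontrol output above can be reduced to name/state pairs. A minimal sketch (the sample string is trimmed from the output above):

```python
import re

# Sample lines modeled on the `scontrol show node` output above.
scontrol_output = """\
NodeName=slurmnode1 Arch=x86_64 CoresPerSocket=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
NodeName=slurmnode2 Arch=x86_64 CoresPerSocket=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
"""

def node_states(text: str) -> dict:
    """Map NodeName -> State from `scontrol show node` output."""
    states, current = {}, None
    for line in text.splitlines():
        m = re.match(r'NodeName=(\S+)', line)  # header line of each node record
        if m:
            current = m.group(1)
        m = re.search(r'\bState=(\S+)', line)
        if m and current:
            states[current] = m.group(1)
    return states

if __name__ == "__main__":
    print(node_states(scontrol_output))
```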

Step 4 - Run a Slurm example

First create a new file in JupyterLab, rename it to test.py, and enter the following code, which simply makes the program sleep for 15 seconds:

#!/usr/bin/env python3
  
import time
import os
import socket
from datetime import datetime as dt
if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    print('PID  : {}'.format(os.getpid()))
    print('Executing for 15 secs')
    time.sleep(15)
    print('Process finished {}\n'.format(dt.now()))

insert image description here

Next create a script file job.sh to distribute work to slurmnode[1-3]. It broadcasts test.py to the nodes, sets the number of tasks to 3, and writes the output to the file result.out:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=result.out
#
#SBATCH --ntasks=3
#
sbcast -f test.py /tmp/test.py
srun python3 /tmp/test.py

insert image description here

Then enter the Slurm Queue management interface and click Submit Job to submit the work for the cluster. Here we simply submit the job.sh file: select the file type, set the path to /home/admin/job.sh, and click Submit Job.

insert image description here

Remember to click Reload to load the job into the system so the work starts running in the cluster. After about 15 seconds, a result.out output file appears in the sidebar; double-click it to see the result computed in parallel by the 3 compute nodes.

insert image description here

This completes the Slurm example. Because the free server has only 1 core and 2 GB of memory, a more complex matrix-multiplication job could not be submitted.
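For reference, on a machine with more resources, such a matrix-multiplication job could look like the pure-Python sketch below (the file name matmul.py and the matrix size are my own choices); it would be broadcast and launched by job.sh exactly like test.py:

```python
#!/usr/bin/env python3
# Hypothetical matmul.py: a CPU-bound job in the same style as test.py.

import random
import socket
from datetime import datetime as dt

def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

if __name__ == '__main__':
    print('Process started {}'.format(dt.now()))
    print('NODE : {}'.format(socket.gethostname()))
    n = 100  # matrix size; raise this on a larger server
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = matmul(a, b)
    print('Computed a {}x{} product'.format(n, n))
    print('Process finished {}\n'.format(dt.now()))
```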

Finally, remember to stop the service:

[root@Iceland cluster]# docker-compose stop
Stopping cluster_slurmnode1_1   ... done
Stopping cluster_slurmnode2_1   ... done
Stopping cluster_slurmnode3_1   ... done
Stopping cluster_slurmjupyter_1 ... done
Stopping cluster_slurmmaster_1  ... done
[root@Iceland cluster]# docker-compose ps
         Name                       Command                State     Ports
--------------------------------------------------------------------------
cluster_slurmjupyter_1   /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmmaster_1    /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode1_1     /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode2_1     /etc/slurm-llnl/docker-ent ...   Exit 137        
cluster_slurmnode3_1     /etc/slurm-llnl/docker-ent ...   Exit 137 

Special thanks to Kuangshen's Docker video series on Bilibili: "As long as studying doesn't kill you, study like it will."

Origin blog.csdn.net/m0_47455189/article/details/120003893