[Getting Started] Build your own docker container [Cluster] [SSH] [Docker Hub]


We already know that we can create a container from an existing image, and that the container encapsulates the dependencies and operating environment the project needs. When we are ready to run the program on a different cluster, we do not need to reconfigure the environment; once a good image has been pulled, it can be used directly after a small configuration change.
The workflow is: create a container b based on image A, perform a series of operations in b (install the various dependencies), create a new image B from b, and push B to Docker Hub as a backup of container b. The detailed steps follow.

1 pull an (official) image

Pull an official image from Docker Hub. It can also be another user's image, but since you cannot be sure whether its internal configuration meets your needs, it is best to pull an official image and modify it yourself.
The official nvidia/cuda repository on Docker Hub: https://hub.docker.com/r/nvidia/cuda/tags

docker pull [OPTIONS] IMAGE_NAME[:TAG]

sudo docker pull nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04

In general, sudo must be added in front of docker commands. sudo is omitted in the paragraphs below; don't forget to add it when actually running the commands!

According to the CUDA version required by the project, select the image to pull from the official nvidia account on Docker Hub. There are two main variants, runtime and devel (some images also have a base variant); the most obvious difference between them is size. See the appendix for the detailed differences. Here we choose the devel variant, with Ubuntu 18.04 to match the local operating system.

After executing the above instructions, check whether the image was successfully pulled down:

docker image ls -a

2 Create a container based on the image

docker run creates and runs a container:

docker run [OPTIONS] IMAGE [COMMAND] [ARG...]

docker run -it --name gait nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04 bash

If a container already exists (created before, temporarily closed or exited), use the start command:

docker start [OPTIONS] CONTAINER [CONTAINER...]

docker start -i gait
docker start gait

The former (with -i) enters interactive mode, while the latter starts the container in the background without entering an interactive terminal.
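
To see which containers already exist and whether they are running or exited, you can list them first:

docker ps -a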

3 Configure the container

Now we have created a new container from the image, but none of the required dependencies are in it yet. You can use inspect to view the container information (it returns a lot of content, most of which I can't make sense of yet):

docker inspect [OPTIONS] NAME|ID [NAME|ID...]

docker inspect gait
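
Since inspect returns a large JSON document, a Go-template filter can pull out a single field; for example, to show only the image the container was created from:

docker inspect --format '{{.Config.Image}}' gait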

Before configuring the container, first check that the image it is based on is correct by typing:

nvcc -V

This prints the CUDA compiler version information.
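
The output should look roughly like the following (the exact copyright and build strings will differ):

nvcc: NVIDIA (R) Cuda compiler driver
...
Cuda compilation tools, release 11.3, V11.3.x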

3.1 Install python

This step may be omitted.
Update the package list and system:

apt update
apt upgrade

3.1.1 Install build tools and dependencies

Install some common development and build tools and dependent libraries:

apt install build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev wget

3.1.2 Download and unzip the python source code package

wget https://www.python.org/ftp/python/3.8.16/Python-3.8.16.tgz
tar -xf Python-3.8.16.tgz

3.1.3 Enter the source code directory and configure installation options

cd Python-3.8.16
./configure --enable-optimizations

3.1.4 Compile and install python

make altinstall

Use altinstall instead of install here to avoid replacing the system's default python version.

3.1.5 Check whether the installation is successful

python3.8 --version

3.2 Install torch and related configurations

It is recommended to install from a whl file, which is more stable and avoids problems such as slow downloads or having to configure a remote proxy.

3.2.1 Install pip

python3.8 -m ensurepip --upgrade

(The original note installed pip here with apt-get and a python3.8-pip package, but that package does not exist in Ubuntu 18.04, and apt is preferred over apt-get nowadays. Since Python 3.8 was built from source above, make altinstall normally already installs pip3.8; ensurepip bootstraps it if it is missing.)

3.2.2 Install torch

Find the whl file for the required version on the PyTorch official website, download it locally, upload it to the server, and finally copy it from the server into the container:

docker cp [OPTIONS] SRC_PATH DEST_PATH
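
For example, assuming the whl file sits in the current directory on the server and /root/ inside the gait container is the (illustrative) destination:

docker cp ./torch-1.10.0+cu113-cp38-cp38-linux_x86_64.whl gait:/root/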

※ Update pip first (my older pip could not install the whl file with the install command; I don't know why, but updating it fixed the problem):

python3.8 -m pip install --upgrade pip
pip install torch-1.10.0+cu113-cp38-cp38-linux_x86_64.whl

Check if the installation is successful

python3.8
import torch
print(torch.cuda.is_available())

It should return True (you can also take a look at pip list).

3.3 Install other dependencies

pyyaml, tensorboard, opencv-python, tqdm, kornia

pip install pyyaml==6.0 tensorboard==2.11.0 opencv-python==4.6.0.66 tqdm==4.64.1 kornia==0.6.10

4 Package the container into an image and push it to Docker Hub

Remember to log in to your Docker Hub account; you can log in beforehand or at this step.
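
Logging in is a single command; it prompts for your Docker Hub user name and password (or an access token):

docker login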

docker commit [OPTIONS] CONTAINER [REPOSITORY[:TAG]]
docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
docker push [OPTIONS] NAME[:TAG]

docker commit gait gait_img:v1.0
docker tag gait_img:v1.0 carrothu0727/gait_img:v1.0
docker push carrothu0727/gait_img:v1.0

The part after the colon is the tag; if no tag is specified, latest is used by default.

5 Submit the job on the cluster, fix runtime errors, and install the missing dependencies

5.1 Cluster submission job

5.1.1 Basic settings


Task name: generally the default; you can also change it to something easier to recognize.
Resource pool: cluster cloud
Multi-machine multi-card: single machine (a single machine is used as the example)

5.1.2 Task settings


Task role name: the default (cannot be changed)
Resource specification: choose whichever is available; here I just picked one at random
Docker image: if a container packaged from a previous job is already stored on the cluster platform, you can select it directly from the drop-down menu; if not, turn off the selection switch on the right and enter the image name on Docker Hub so the platform can pull it
Command: the default, sleep infinity

5.1.3 Storage configuration

Storage configuration

Add a new storage volume and select the storage node from the drop-down menu. The storage volume holds the datasets, code and other files related to the project. The default path is /root/data1; you can change it to a name that is easy to remember. This is the only step where the path is shown explicitly, so check it now; you will need it after the job is submitted, so don't forget it.
/root/data1 here is the mount point. Mounting means attaching the top-level directory of a device or volume to a directory under the Linux root directory (preferably an empty one), so that accessing this directory is equivalent to accessing the device.
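
For comparison, the same mounting idea with plain Docker on a local machine would look like the sketch below (the host path is a placeholder; on the cluster the platform does this for you through the storage-volume setting):

docker run -it --name gait -v /path/on/host/data:/root/data1 nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04 bash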

5.1.4 Environment variable configuration

Just leave it at the default; nothing needs to be selected.

After the above four steps (5.1.1~5.1.4) are completed, the job status shows "Creating". It usually takes 10 to 30 minutes; the wait depends on the image size.

After the creation is completed, "Running" is displayed, which means the job has been submitted successfully!

5.2 Running errors and solutions (skip)

The following are some errors that occurred when running my own project files. They are purely a record and you can skip them, since every project will run into different errors; just prescribe the right medicine for yours~

ERROR1——No_bz2

ModuleNotFoundError: No module named '_bz2'

Copy the missing compiled module file from the system's python3.8 installation into the corresponding directory of the locally built python3.8, and fill in whatever module is reported as missing.

ERROR2——No_lzma

ModuleNotFoundError: No module named '_lzma'

The solution is the same as above.
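
An alternative that avoids copying files around (a sketch, assuming Python 3.8 was built from source as in 3.1): install the missing development libraries and rebuild Python so that the _bz2 and _lzma modules are compiled in.

apt install libbz2-dev liblzma-dev
cd Python-3.8.16
./configure --enable-optimizations
make altinstall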

ERROR3——CUDA OOM

RuntimeError: CUDA out of memory.

This is a big problem, refer to this article:
Solve the CUDA: Out Of Memory problem caused by Pytorch's video memory fragmentation by setting max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF
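
The gist of that approach is to set the allocator option before launching training; a minimal sketch (128 is only an illustrative starting value, and train.py stands in for your own entry script):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python3.8 train.py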

ERROR4——cuDNN cannot be found

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

Many things can cause this error. Some people say a smaller batch size fixes it. I forget exactly how I solved it in the end; it was probably reducing the batch size...

ERROR5——No ANTIALIAS

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

The newest versions of Pillow no longer have ANTIALIAS, so it needs to be downgraded:

pip install pillow==9.4.0

0 Connect to the cluster over ssh

Once the installation and configuration above are done, we can open the terminal directly on the web page (that is, enter the container on the cluster) to run the program, and the job is more or less "done"! One thing cannot be ignored, though: the web connection is unstable, and the terminal window often drops for no apparent reason. On the one hand, we can use the tmux terminal multiplexer so that a dropped connection does not lose everything; on the other hand, an ssh connection is recommended instead of connecting through the web page at all. The following is a brief introduction to installing and configuring ssh and connecting PyCharm to the cluster over ssh.
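
A minimal tmux workflow looks like this (the session name train is just an example):

tmux new -s train
# run the program inside the session; if the connection drops, log back in and reattach with:
tmux attach -t train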

0.1 Install ssh in docker container

This is still done in the container configured locally at the beginning, not the one on the cluster; if the cluster container is shut down, anything installed there is gone the next time.
The original tutorial used apt-get, which is the older package-management front end; apt is now the recommended one!

apt update
apt install openssh-server

0.2 Set root password

This password will be used later when local pycharm remotely connects to the cluster.

passwd

0.3 Modify configuration file

vim /etc/ssh/sshd_config

Comment out PermitRootLogin prohibit-password, which forbids root from logging in over ssh with a password;
add PermitRootLogin yes, which allows root to log in over ssh.
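
After editing, the relevant part of /etc/ssh/sshd_config should look like this:

#PermitRootLogin prohibit-password
PermitRootLogin yes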

0.4 Restart the ssh service

/etc/init.d/ssh restart

At this point, we have completed a container with ssh connection function. Follow the above steps to push it to Docker Hub, and then proceed with the following operations.

0.5 Expose the ssh port when submitting the cluster job

Add an ssh auto-start command to the job's command field:

/etc/init.d/ssh restart && sleep infinity

New port setting:
Service: Other
Container port: 22 (sshd inside the container listens on port 22 by default; see the SFTP notes in the appendix)
Host port: 11111 (pick any free port yourself; it will be used when configuring PyCharm)

0.6 pycharm configure ssh connection


Type: select SFTP
Host: the cluster container's IP (that is, the host IP on the cluster)
User name: root
Password: the one set earlier
After filling in the above, Test Connection should succeed!
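
You can also verify the connection from any terminal first (the host IP is a placeholder; 11111 is the host port chosen in 0.5):

ssh -p 11111 root@<cluster-host-ip>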

Appendix 1: The difference between runtime, devel and base

runtime: the package variant used to run compiled applications. It usually contains only the minimum runtime libraries and dependencies the application needs to run properly on the target system; the runtime libraries provide the functions and resources required at run time. Header files, static libraries and other development files are generally not included, because they are not necessary for the application to run.

devel: the package variant used for developing and compiling software. It contains the header files, static libraries, dynamic libraries and other development tools needed to compile and build applications, so developers can compile and test applications on their own systems. The development files corresponding to the runtime libraries are usually included so developers can link against and use those libraries.

base: the most basic package variant, containing only the minimum files and functionality, and usually the basis on which other variants are built. It may include some core libraries and tools, but typically no runtime libraries or development tools; additional components are installed on top as needed to implement specific functionality.

In general, the runtime version is used to run applications, the devel version is used to develop and compile applications, and the base version is the most basic version and serves as the basis for other versions. When developing and deploying software, select the appropriate version as needed to meet development and operation needs.
Image size: devel > runtime > base

Appendix 2: Differences between FTP, FTPS, and SFTP

FTP (File Transfer Protocol), FTPS (FTP over SSL/TLS), and SFTP (SSH File Transfer Protocol) are different protocols used for file transfer. The differences between them are as follows:

FTP (File Transfer Protocol): the standard network protocol for transferring files between clients and servers, using clear-text transmission. It provides no encryption, so files and credentials may be transmitted in clear text during transfer, which poses security risks. It uses two separate connections (a control connection and a data connection) for file transfer.

FTPS (FTP over SSL/TLS): an extension of FTP that provides secure file transfer by adding an SSL/TLS encryption layer on top of FTP. SSL/TLS protects both the control connection and the data connection, ensuring confidentiality and data integrity. Its default control-connection port is 990, the data connection uses the same port as FTP (usually 20), and the server needs an SSL/TLS certificate to provide encryption and authentication.

SFTP (SSH File Transfer Protocol): a protocol for transferring files over a secure channel provided by the SSH (Secure Shell) protocol. It uses the SSH session for authentication and data encryption, giving file transfers secure protection. It uses a single connection, for both control commands and data transfer, inside the SSH channel; its default port is 22, the same as SSH.

Summary:
FTP is the most basic file transfer protocol and does not provide encryption;
FTPS provides encryption and security on FTP by adding an SSL/TLS layer;
SFTP is a protocol that uses the SSH protocol for encryption and secure file transfer.

Reference blog:

Solve the CUDA: Out Of Memory problem caused by Pytorch's video memory fragmentation by setting max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

[docker-cuda]——The difference between base, runtime and devel


Origin blog.csdn.net/weixin_45074807/article/details/131812147