1. Introduction
2. Concept
3. Installation
3.1 Installation preparation
3.1.1 Install prerequisites (curl, X11 utilities; remove old Docker versions)
sudo apt-get install libcurl3-gnutls=7.47.0-1ubuntu2
sudo apt install curl
sudo apt-get install x11-xserver-utils
sudo apt-get remove docker docker-engine docker.io containerd runc
xhost +
3.2 Install docker
3.2.1 Install Docker with the convenience script
curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun
Check if the installation is successful
sudo docker help
If `docker help` prints the list of Docker subcommands, the installation succeeded.
3.2.2 Install nvidia-docker2
# 1. Add NVIDIA's GPG key
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
# 2. Detect the distribution (e.g. ubuntu18.04)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
# 3. Add the nvidia-docker apt repository
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# 4. Update package lists
sudo apt-get update
# 5. Install nvidia-docker2
sudo apt-get install -y nvidia-docker2
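The `distribution` line in step 2 works by sourcing os-release, which defines the `ID` and `VERSION_ID` variables, then concatenating them. A minimal runnable sketch, using a sample file instead of the real /etc/os-release:

```shell
# Simulate /etc/os-release with a temp file so the sketch runs anywhere.
osr=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="18.04"\n' > "$osr"
# Same construct as step 2, pointed at the sample file:
distribution=$(. "$osr"; echo $ID$VERSION_ID)
echo "$distribution"   # prints: ubuntu18.04
rm -f "$osr"
```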
Modify the file:
sudo vim /etc/docker/daemon.json
The `runtimes` block below is the parameter added here:
{
"registry-mirrors": ["https://xxxxxxx.mirror.aliyuncs.com"],
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
3.2.3 Mirror acceleration
Pulling images from Docker Hub can be slow or unreliable in China; in that case you can configure a registry mirror (accelerator), for example the HKUST mirror or Alibaba Cloud. Taking Alibaba Cloud as an example: after logging in to the Alibaba Cloud console, select Image Accelerator in the left menu to see your exclusive accelerator address.
Then write the following content in /etc/docker/daemon.json
(if the file does not exist, please create a new file):
{"registry-mirrors":["https://XXX.mirror.aliyuncs.com/"]}
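A malformed daemon.json prevents dockerd from starting, so it is worth validating the JSON before restarting the service. A sketch run against a temp copy here; on a real host, point it at /etc/docker/daemon.json instead:

```shell
# Write the sample config to a temp file and validate it with python3.
cfg=$(mktemp)
printf '{"registry-mirrors":["https://XXX.mirror.aliyuncs.com/"]}\n' > "$cfg"
if python3 -m json.tool "$cfg" > /dev/null 2>&1; then
  echo "daemon.json: valid JSON"
else
  echo "daemon.json: INVALID JSON, fix it before restarting docker" >&2
fi
rm -f "$cfg"
```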
Then restart the service:
sudo systemctl daemon-reload
sudo systemctl restart docker
3.2.4 Local login
Docker officially maintains a public registry, Docker Hub, which contains most of the base images we need.
First register an account, then log in locally.
Note: log in with your Docker Hub username and password, not your email address.
sudo docker login
3.2.5 Add user permissions
- Create a group named docker. If the group already exists, an error will be reported. You can ignore this error:
sudo groupadd docker
- Add the current user to the group docker:
sudo gpasswd -a ${USER} docker
- Restart the docker service (please use with caution in production environment):
sudo systemctl restart docker
- Add access and execution permissions:
sudo chmod a+rw /var/run/docker.sock
- After the operation is completed, verify it; sudo is no longer needed:
docker info
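`docker info` now works without sudo because your user belongs to the `docker` group. The membership check can be scripted; `in_group` below is a helper name invented for this sketch, and on a real host the second argument would be `"$(id -nG)"`:

```shell
# in_group GROUP "GROUP LIST": succeed if GROUP appears in the space-separated list.
in_group() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Sample group list standing in for the output of `id -nG`:
if in_group docker "adm sudo docker"; then
  echo "docker group present"
else
  echo "docker group missing - log out and back in first"
fi
```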
4. Operation
4.1 Pull the image
sudo docker pull lingjunlh/torch1.9.1-cuda11.1
4.2 Containers
4.2.1 Create container
4.2.1.1 Configure Docker so containers can display GUI windows on the host
- Host terminal runs
DISPLAY=:0.0
xhost +
- View the environment variable
echo ${DISPLAY}
4.2.1.2 Create the container
sudo nvidia-docker run -it --privileged=true -p 7777:8888 --gpus all --ipc=host -v /data:/data -e DISPLAY=unix$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -e GDK_SCALE -e GDK_DPI_SCALE --name test1 b7a4c /bin/bash
- `-i`: interactive operation
- `-t`: allocate a terminal
- `-p 7777:8888`: map port 7777 on the host to port 8888 in the container
- `--privileged=true`: give the container extended privileges (used here to access GPU resources)
- `--ipc=host`: let the container share memory with the host
- `--name test1`: give the container a custom name
- `-v /data:/data` (e.g. `-v /home/shcd/Documents/gby:/gby`): mount the host directory into the container at the given path, so the folder's contents are shared between container and host. Since all changes inside a container are lost once it is removed, mounting a directory like this keeps the container's data on the host.
- `b7a4c`: the (abbreviated) ID of the image you installed; you can look it up with the `docker images` command, or write the full image name instead, e.g. ufoym/deepo:all-py36-jupyter
- `/bin/bash`: the command placed after the image name; here we want an interactive shell, hence /bin/bash
4.2.2 Enter the running container
docker attach d6a0f155273a
- d6a0f155273a: container id
4.2.3 Exit the container
ctrl+D
4.2.4 Delete container
1) First you need to stop all containers
docker stop $(docker ps -a -q)
2) Delete all containers (to delete a single one, replace the substitution with that container's id)
docker rm $(docker ps -a -q)
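`$(docker ps -a -q)` is ordinary command substitution: the inner command's output becomes the outer command's arguments. A daemon-free sketch, where the invented `fake_ps` stands in for `docker ps -a -q`:

```shell
# fake_ps emits two container IDs, one per line, like `docker ps -a -q`.
fake_ps() { printf 'd6a0f155273a\n96f7f14e99ab\n'; }

# Same shape as: docker stop $(docker ps -a -q)
for id in $(fake_ps); do
  echo "stopping $id"   # on a real host: docker stop "$id"
done
```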
4.2.5 Start container
sudo docker start <container-id>
Other commands
docker logs — view a container's running log by its ID:
docker logs -tf <container-id>
docker logs --tail <num> <container-id>  # num = number of log lines to show
docker top — view the processes running inside a container:
docker top <container-id>
docker inspect — view a container's metadata:
docker inspect <container-id>
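The `--tail num` option behaves like piping the full log through the shell's own `tail`; a sketch that simulates a three-line log without a running container:

```shell
# Simulated log; `tail -n 2` keeps the last two lines,
# mirroring `docker logs --tail 2 <container-id>`.
printf 'line1\nline2\nline3\n' | tail -n 2
```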
4.3 Uninstall
sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
sudo apt-get purge -y nvidia-docker2
5. Commonly used commands
5.1 Commands
docker ps — list running containers
docker ps -a — list all containers
docker images — list images
docker rm container-id — remove the container with the given id
docker stop/start container-id — stop/start the container with the given id
docker rmi image-id — remove the image with the given id
docker volume ls — list volumes
docker network ls — list networks
docker ps -s — show container sizes
5.2 Docker View container and image size
- View overall size
docker system df
- View the detailed size of each image and container
docker system df -v
5.3 Stopping and killing containers
- When `docker stop` runs, it first sends a TERM signal to the container so it can perform any protective cleanup it must do before exiting, then lets the container stop on its own. If the container has not stopped after a grace period, Docker forcibly terminates it with the equivalent of `kill -9`.
sudo docker stop test
- test: the container name
- When `docker kill` runs, regardless of the container's state or what program is running, it immediately sends the equivalent of `kill -9` to forcibly terminate the container.
5.4 Delete image
docker rmi image-id
5.5 Copy files with docker cp
Function: copy files between the host and a Docker container.
# Copy a file from the container to the host
docker cp [OPTIONS] CONTAINER:SRC_PATH DEST_PATH
# Copy a file from the host into the container
docker cp [OPTIONS] SRC_PATH CONTAINER:DEST_PATH
Example
- Copy the host's /www/runoob directory into container 96f7f14e99ab, renaming it to www:
docker cp /www/runoob 96f7f14e99ab:/www
- Copy the /www directory of container 96f7f14e99ab to the host's /tmp directory:
docker cp 96f7f14e99ab:/www /tmp/
Build and run an image
Example
- Download the required PyTorch 1.8 wheels; the following three files are needed:
torch-1.8.2+cu111-cp38-cp38-linux_x86_64.whl
torchaudio-0.8.2-cp38-cp38-linux_x86_64.whl
torchvision-0.9.2+cu111-cp38-cp38-linux_x86_64.whl
- Open dockerfile
gedit Dockerfile
- Dockerfile
# Install the Python runtime environment
#
################################################
# Base image to build the new image from
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
RUN rm /etc/apt/sources.list.d/cuda.list
# Author
MAINTAINER SunPengfei
# Set environment variables
ENV TZ Asia/Shanghai
ENV LANG zh_CN.UTF-8
# Copy the downloaded whl files into the image
#COPY torch-1.10.1+cu111-cp38-cp38-linux_x86_64.whl /tmp
#COPY torchaudio-0.10.0+cu111-cp38-cp38-linux_x86_64.whl /tmp
#COPY torchvision-0.11.0+cu111-cp38-cp38-linux_x86_64.whl /tmp
# Run commands
# Switch apt sources to the Aliyun mirror
RUN sed -i 's#http://archive.ubuntu.com/#http://mirrors.aliyun.com/#' /etc/apt/sources.list \
&& sed -i 's#http://security.ubuntu.com/#http://mirrors.aliyun.com/#' /etc/apt/sources.list
# Update package sources and install software
RUN apt-get update -y \
&& apt-get -y install iputils-ping \
&& apt-get -y install wget \
&& apt-get -y install net-tools \
&& apt-get -y install vim \
&& apt-get -y install openssh-server \
&& apt-get -y install python3.8 \
&& apt-get -y install python3-pip python3-dev python3.8-dev \
&& apt-get -y install libgl1 \
&& apt-get -y install git \
&& cd /usr/local/bin \
&& rm -f python \
&& rm -f python3 \
&& rm -f pip \
&& rm -f pip3 \
&& ln -s /usr/bin/python3.8 python \
&& ln -s /usr/bin/python3.8 python3 \
&& ln -s /usr/bin/pip3 pip \
&& ln -s /usr/bin/pip3 pip3 \
&& python -m pip install --upgrade pip \
&& cd /tmp \
&& pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html \
&& apt-get clean \
&& rm -rf /tmp/* /var/lib/apt/lists/* /var/tmp/*
- build
sudo docker build -t ubuntu18:v0 .
- `-t ImageName:TagName`: the name and tag to give the image
- `.`: the directory containing the Dockerfile
Commit and save
Committing a container generates an image:
docker commit -m="description" -a="author" <container-id> <target-image-name>:[TAG]
- `-a`: the author of the committed image
- `-m`: the commit message
- `-p`: pause the container while committing
Persisting
There are two approaches: commit the container first and then save the resulting image, or export the container directly.
Save an image:
docker save ID > xxx.tar
docker load < xxx.tar
Save a container:
docker export ID > xxx.tar
docker import xxx.tar container:v1
How to solve insufficient disk space in Docker
1. Check the usage of all disks on the server:
df -h
In the output, check the system disk (here 188G total, much smaller than the other disks). It was full before, but after the migration described below there is plenty of space.
2. Check the space size of docker image and container storage directory
du -sh /var/lib/docker/
3. Stop the docker service
service docker stop
4. Migrate docker to a large-capacity disk
4.1 Method 1: Create a symbolic link (recommended)
- Enter root
su root
- Move file location
# Move the files to the new location
cp -a /var/lib/docker /data/
- Create a soft link
# Create the symbolic link
sudo ln -fs /data/docker /var/lib/docker
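The symlink trick can be rehearsed on throwaway directories before touching /var/lib/docker: after `ln -fs`, reads through the link resolve to the migrated copy. All paths below are temporary stand-ins:

```shell
# Stand-ins for /data/docker (migrated data) and /var/lib/docker (the link).
root=$(mktemp -d)
mkdir -p "$root/data/docker"
echo "layerdata" > "$root/data/docker/marker"

# Same shape as: sudo ln -fs /data/docker /var/lib/docker
ln -fs "$root/data/docker" "$root/var_lib_docker"

# Reading through the link reaches the migrated file:
cat "$root/var_lib_docker/marker"   # prints: layerdata
rm -rf "$root"
```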
- Reload
# Reload the configuration and restart
systemctl daemon-reload
systemctl restart docker
service docker start
- Verify: if your images are still listed, the migration worked
docker images
4.2 Method 2
- First create the directory
mkdir -p <large-disk-dir>/docker/lib/
- migrate
rsync -avz /var/lib/docker /mnt/docker/lib/
5. Edit /etc/docker/daemon.json and add a parameter binding Docker to the migrated directory
Modify file:
sudo vim /etc/docker/daemon.json
The `data-root` entry below is the parameter added here:
{
"registry-mirrors": ["https://xxxxxxx.mirror.aliyuncs.com"],
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"data-root":"/data/docker/lib/docker"
}
6. Reload and restart the docker service
systemctl daemon-reload && systemctl restart docker
But systemctl still failed for me, so I restarted Docker with the following command instead:
service docker restart
7. Check whether docker is bound to the new directory
docker info
If the Docker Root Dir changes from /var/lib/docker to the directory you specified, the migration is successful.
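The check can also be scripted by grepping for the `Docker Root Dir` line. A sketch where the invented `fake_info` stands in for real `docker info` output:

```shell
# Sample `docker info` output after a successful migration.
fake_info() {
  printf 'Server Version: 24.0.2\n Docker Root Dir: /data/docker/lib/docker\n'
}

# Extract the directory after "Docker Root Dir: ".
fake_info | awk -F': ' '/Docker Root Dir/ {print $2}'
```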
8. Delete the old docker directory (only after confirming the new location works)
rm -rf /var/lib/docker
9. Set up a proxy
9.1 Method 1 (works)
Configure /etc/default/docker on the host:
export http_proxy="http://127.0.0.1:8889/"
export https_proxy="http://127.0.0.1:8889/"
export HTTP_PROXY="http://127.0.0.1:8889/"
export HTTPS_PROXY="http://127.0.0.1:8889/"
export all_proxy="socks5h://localhost:1089"
export ALL_PROXY="socks5h://localhost:1089"
Restart docker
sudo systemctl daemon-reload
sudo systemctl restart docker
9.2 Method 2 (did not work in my tests)
Open the file on the host machine:
sudo vim ~/.docker/config.json
Add a proxy entry.
9.3 Method 3 (did not work in my tests)
export ALL_PROXY='socks5://127.0.0.1:1080'
The IP address here is the host machine's IP.
9.4 Shared network
When the container shares the host's network, the proxy can be used directly inside the container. Pass the --network=host parameter when creating the container:
sudo nvidia-docker run -it --privileged=true -p 7777:8888 --network=host --gpus all --ipc=host -v /data:/data --name test1 b7a4c /bin/bash
Then set the proxy inside the container, e.g. a global proxy:
export ALL_PROXY='socks5://127.0.0.1:1080'
- Map the proxy port into the container and use it there: pass the -p parameter to docker run to map the proxy port, then use it inside the container, for example:
docker run -p 1080:1080 .....
export ALL_PROXY='socks5://127.0.0.1:1080'
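`export` is what makes the proxy variable visible to programs started from that shell; a quick sketch showing a child process inheriting it:

```shell
# The exported variable is inherited by the child shell started below.
export ALL_PROXY='socks5://127.0.0.1:1080'
sh -c 'echo "$ALL_PROXY"'   # prints: socks5://127.0.0.1:1080
```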
10. Docker library configuration
1. libGL
Report an error
ImportError: libGL.so.1: cannot open shared object file
Install
apt-get update && apt-get install libgl1
2. ping installation
apt-get install -y iputils-ping
3. Apex installation
git clone https://github.com/NVIDIA/apex
cd apex
python setup.py install --cpp_ext --cuda_ext
- Error: RuntimeError: Error compiling objects for extension
Check out an older commit and compile again:
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
4. boost
Report an error
fatal error: boost/geometry.hpp: No such file or directory
Solution
apt-get update
apt-get install libboost-all-dev
5. cuDNN
- Go to the corresponding website and download the matching version of cuDNN; for Linux machines choose the x86_64 build.
- Unzip
tar -xzvf cudnn-10.1-linux-x64-v8.0.5.39.tgz  # the .tgz is the downloaded cuDNN archive
- Move the corresponding file
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/include/cudnn_version.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
11. docker clear cache
docker system prune
Command:
- Cleans the disk by deleting stopped containers, unused volumes and networks, and dangling images (i.e. untagged images):
- Stopped containers
- Volumes not used by any container
- Networks not associated with any container
- All dangling images
docker system prune -a
Command:
- A more thorough cleanup: additionally deletes all images not used by any container.
Note that these two commands delete your stopped containers and unused Docker images, so think carefully before running them.
2. docker remote debugging
2.1 vscode plug-in installation
remote-ssh
remote development
2.2 docker container configuration
- Start the container and install ssh
apt-get update
apt-get install openssh-server
- Set the password for remote login
If you want to log in to the container directly using the root account, set the root password
passwd
- Add root account login permission
- Edit the file:
vim /etc/ssh/sshd_config
Make the following changes:
# Comment out:
# PermitRootLogin prohibit-password
# Add:
PasswordAuthentication yes
PermitRootLogin yes
# Port: use the container's port
Port 9901
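Before restarting the service it is cheap to confirm the directives were written as intended; a sketch that greps a sample file (substitute /etc/ssh/sshd_config inside the real container):

```shell
# Sample sshd_config fragment standing in for the real file.
conf=$(mktemp)
printf 'PasswordAuthentication yes\nPermitRootLogin yes\nPort 9901\n' > "$conf"

# Count the directives we expect to find; prints 3 when all are present.
grep -c -E '^(PasswordAuthentication yes|PermitRootLogin yes|Port 9901)$' "$conf"
rm -f "$conf"
```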
Restart ssh
service ssh restart
2.3 vscode configuration
Press ctrl+shift+p
- Open the SSH configuration file
- Add a host entry:
# Any name you like
Host 2080Ti
# Host IP
HostName 10.119.XXX.XXX
# Docker root user
User root
# User ubuntu
# Container ssh port
Port 9901
2.4 Connection
Press ctrl+shift+p and connect to the host name you configured.
2.5 Remote visualization such as open3d
2.5.1 Container installation
- Installation inside the container
apt-get install x11-xserver-utils
apt-get install x11-apps
- Before logging in to the container, run the following in the current host terminal. (If it does not work, restart the container and run these commands again.)
DISPLAY=:0.0
xhost +
- After logging into the container, run again
DISPLAY=:0.0
xhost +
- View the environment variable
echo ${DISPLAY}
- Modify the configuration file under ubuntu server
I have seen the remote connection fail with these symptoms:
- the client vscode hangs while connecting to the container
- but connecting to the remote machine itself works.
The following configuration change resolves it:
- open a file
vim /etc/ssh/sshd_config
- Change
AllowTcpForwarding no
AllowAgentForwarding no
to
AllowTcpForwarding yes
AllowAgentForwarding yes
- After saving, restart the sshd service
systemctl restart sshd
2.5.2 Local installation
- Install vcxsrv locally (a free download is available).
- When choosing a custom install path, pick one that avoids permission problems and is easy to find later.
- Click through the installer until it completes.
- Start the service: open XLaunch and note the Display number 0; accept the defaults ([Next]) for the remaining steps.
- Modify the vcxsrv configuration: find the configuration files in the installation directory. The 0 corresponds to the DISPLAY:0 mentioned above.
- Add the IP of the remote server and save.
2.5.3 vscode configuration
- Open the file C:\Users\<username>\.ssh\config and add the following 3 lines:
ForwardX11 yes
ForwardX11Trusted yes
ForwardAgent yes
2. Launch configuration
Place the following in .vscode/launch.json:
"env": {
    "DISPLAY": ":0.0"
}
2.6 TensorBoard
- In the vscode terminal, run conda activate env_name
- Enter the tf_log directory and run:
tensorboard --logdir=work_dirs_name --port='6009'
- Click the printed URL to view it in a browser.
3. Package a local image and copy it to other hosts to run
Besides pull, the other way to obtain an image is to package a local image and copy it to other hosts. If connectivity between the local and remote registries is broken in a real environment, distributing a pre-packaged image to the other Docker nodes is also a workable solution.
The specific steps are as follows:
The specific steps are as follows:
- Run the following command to find the name and version (the TAG) of the image to package:
docker images
- Two ways to package an image (choose either one):
docker save <image-name>:<tag> > /root/<archive-name>.tar
docker save -o /root/<archive-name>.tar <image-name>:<tag>
- Distribute the packaged archive to the /root/ directory of the other hosts
- Load the image from the tarball:
docker load < /root/<archive-name>.tar
- Check the ID of the loaded image:
docker images
- If the loaded image's name and tag show as <none>, assign them with the tag command:
docker tag <image-id> <image-name>:<tag>