Development machine scenario overview
At present, many companies manage GPU resources for AI development in a simple, coarse-grained way: the cards are held by the development team, and operations staff assign them to development engineers on a whole-card basis. In AI work that mixes development, testing, and even training tasks on the same resources, users run into several problems.
From a management perspective, resources cannot be managed and scheduled uniformly, and good monitoring and usage statistics are hard to achieve. From the algorithm engineers' perspective, resources are tight, must be negotiated with colleagues, and cannot be requested and used flexibly and dynamically.
To address these problems, the OrionX AI computing resource pooling solution aims to improve GPU utilization, provide a flexible scheduling platform, manage computing resources uniformly for users, and achieve elastic scaling and on-demand use.
Based on the usage habits of algorithm engineers, we have summarized three development machine scenarios:
- SSH mode: the algorithm engineer is given a machine, whether a physical machine, a virtual machine, or a container, and connects to it remotely to develop algorithms and use its resources.
- Jupyter mode: Jupyter has become a widely used tool among algorithm engineers in recent years. Many companies have built it into an integrated development tool, and it can be deployed in containers or virtual machines for algorithm engineers to use.
- CodeServer mode: a server-hosted, browser-based edition of VS Code. Many companies have adopted this tool in recent years. Resource usage is similar to Jupyter, and it is likewise deployed in virtual machines or containers.
We will introduce OrionX vGPU best practices for these three scenarios across three articles. Today, let us start with SSH.
Environment preparation
The environment includes physical machines or virtual machines, network environments, GPU cards, operating systems, and container platforms.
Hardware environment
This POC environment uses three virtual machines: one CPU node and two GPU nodes, each GPU node equipped with one T4 card.
- Operating system: Ubuntu 18.04
- Management network: Gigabit TCP
- Remote call network: 100G RDMA
Kubernetes environment
Install Kubernetes on the three nodes, either with kubeadm or with a deployment tool such as:
- kubekey
- kuboard-spray
The currently deployed Kubernetes environment is as follows:
root@sc-poc-master-1:~# kubectl get node
NAME              STATUS   ROLES                         AGE    VERSION
sc-poc-master-1   Ready    control-plane,master,worker   166d   v1.21.5
sc-poc-worker-1   Ready    worker                        166d   v1.21.5
sc-poc-worker-2   Ready    worker                        166d   v1.21.5
The master node is the CPU node, and the two workers are the T4 GPU nodes.
OrionX vGPU pooled environment
Refer to VirtAI Tech's "OrionX Implementation Plan - K8s Version" for deployment. Afterwards, we can view the OrionX components in the orion namespace:
root@sc-poc-master-1:~# kubectl get pod -n orion
NAME                                READY   STATUS    RESTARTS   AGE
orion-container-runtime-hgb5p       1/1     Running   3          63d
orion-container-runtime-qmghq       1/1     Running   1          63d
orion-container-runtime-rhc7s       1/1     Running   1          46d
orion-exporter-fw7vr                1/1     Running   0          2d21h
orion-exporter-j98kj                1/1     Running   0          2d21h
orion-gui-controller-all-in-one-0   1/1     Running   2          87d
orion-plugin-87grh                  1/1     Running   6          87d
orion-plugin-kw8dc                  1/1     Running   8          87d
orion-plugin-xpvgz                  1/1     Running   8          87d
orion-scheduler-5d5bbd5bc9-bb486    2/2     Running   7          87d
orion-server-6gjrh                  1/1     Running   1          74d
orion-server-p87qk                  1/1     Running   4          87d
orion-server-sdhwt                  1/1     Running   1          74d
Development machine scenario: SSH remote connection mode
For this test, we start a Pod with OrionX vGPU resources allocated, then enter it through kubectl exec for development. The Pod uses the official TensorFlow image tensorflow/tensorflow:2.4.3-gpu.
The deployment YAML is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: tf-243
  namespace: orion
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tf-243
  template:
    metadata:
      labels:
        name: tf-243
    spec:
      #nodeName: sc-poc-master-1
      hostNetwork: true
      schedulerName: orion-scheduler
      containers:
      - name: tf-243
        image: tensorflow/tensorflow:2.4.3-gpu
        imagePullPolicy: Always
        #imagePullPolicy: IfNotPresent
        command: ["bash", "-c"]
        args: ["while true; do sleep 30; done;"]
        resources:
          requests:
            virtaitech.com/gpu: 1
          limits:
            virtaitech.com/gpu: 1
        env:
        - name: ORION_GMEM
          value: "5000"
        - name: ORION_RATIO
          value: "60"
        - name: ORION_VGPU
          value: "1"
        - name: ORION_RESERVED
          value: "1"
        - name: ORION_CROSS_NODE
          value: "1"
        - name: ORION_EXPORT_CMD
          value: "orion-smi -j"
        - name: ORION_CLIENT_ID
          value: "orion"
        - name: ORION_GROUP_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        - name: ORION_K8S_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ORION_K8S_POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
For detailed OrionX parameters, please refer to the "OrionX User Manual". Commonly used parameters include:
- ORION_VGPU: the number of OrionX vGPUs requested, such as 1 or more.
- ORION_GMEM: the requested GPU memory size in MB; 5000 means 5 GB of GPU memory.
- ORION_RATIO: the amount of computing power requested, expressed as a percentage of a physical card. 60 means 60% of a single physical card's computing power; the minimum unit is 1% and the maximum is 100%, where 100% means the vGPU occupies the resources of the entire physical card.
- ORION_RESERVED: whether the requested resources use reserved mode. OrionX supports two allocation modes, reserved and non-reserved. In reserved mode, resources are pre-allocated when the Pod starts, whether or not it is running an AI task, much like using a physical GPU. In non-reserved mode, no resources are allocated at Pod startup; they are allocated only when an AI task actually runs, and they are released automatically when the task ends.
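As a quick sanity check of these numbers, the request in the YAML above can be restated in plain words with a few lines of shell. This is purely illustrative; none of it is part of OrionX:

```shell
# The values requested by the deployment above.
ORION_VGPU=1; ORION_GMEM=5000; ORION_RATIO=60; ORION_RESERVED=1

# 5000 MB expressed in GiB (5000 / 1024 ~= 4.9).
gib=$(awk -v m="${ORION_GMEM}" 'BEGIN { printf "%.1f", m/1024 }')
mode=$([ "${ORION_RESERVED}" = "1" ] && echo reserved || echo non-reserved)

echo "${ORION_VGPU} vGPU(s), ${gib} GiB GPU memory, ${ORION_RATIO}% of one physical card, ${mode} mode"
# -> 1 vGPU(s), 4.9 GiB GPU memory, 60% of one physical card, reserved mode
```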
Apply for resources in reserved mode
In this test, we first apply for resources in reserved mode and deploy the TensorFlow image:
# kubectl create -f 09-deploy-tf-243-gpu.yaml
# kubectl get pod
NAME                      READY   STATUS              RESTARTS   AGE
tf-243-84657d76b5-jqqc8   0/1     ContainerCreating   0          30m
# wait for the image to be pulled and the Pod to start
The OrionX GUI shows that the OrionX vGPU resources have been allocated. Since this is reserved mode, resources are pre-allocated whether or not they are used.
After starting the Pod, we enter the Pod to check whether the applied resources have taken effect.
root@sc-poc-master-1:~# kubectl exec -it tf-243-84657d76b5-jqqc8 -- bash
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@sc-poc-master-1:/# env | grep -iE "vgpu|gmem|ratio|rese"
ORION_GMEM=5000
ORION_RESERVED=1
ORION_RATIO=60
ORION_VGPU=1
Checking the environment variables shows that the applied resources have taken effect.
Since this is the official image and TensorFlow's base OS image is Ubuntu 18.04, we switch to a domestic package mirror and then install Git. We will then download the TensorFlow benchmark as a test script to simulate development and testing.
# sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list
# apt update
# apt install git -y
# git clone https://github.com/tensorflow/benchmarks
After downloading the TensorFlow benchmark, we can run it directly to simulate a training task. This mirrors a real R&D workflow: write code in the Pod, then run it. The benchmark is run as follows:
python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--forward_only=False \
--data_name=imagenet \
--model=resnet50 \
--num_batches=200 \
--num_gpus=1 \
--batch_size=32
After it runs normally, the window shows the output: about 81.3 images trained per second. We can also watch the task in the OrionX GUI task interface; it is allocated 60% of the computing power and 5 GB of GPU memory, matching the requested resources.
Apply for resources in non-reserved mode
The previous test used reserved mode. This time, change the ORION_RESERVED parameter in the deployment YAML to 0, which selects non-reserved mode. In non-reserved mode, no resources are allocated when the Pod starts; resources are allocated only while tasks are running. This mode can multiply the time-division multiplexing efficiency of GPU resources, and it also allows GPU resources to be over-committed in R&D and test scenarios to further improve utilization.
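A rough worked example shows why this raises utilization. The 15,000 MB figure for a T4's usable GPU memory is an assumption for illustration, not a number from this POC:

```shell
# Reserved mode: memory is held for the Pod's whole lifetime, so one card
# with ~15,000 MB fits at most floor(15000 / 5000) = 3 Pods that each
# request ORION_GMEM=5000.
echo $(( 15000 / 5000 ))   # -> 3
# Non-reserved mode: memory is held only while a task is actually running,
# so far more such Pods can be scheduled as long as their tasks rarely overlap.
```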
After the Pod starts, the OrionX GUI shows that no resources have been applied for.
We run the same benchmark task again and observe that the resources are applied for when the task starts.
python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--forward_only=False \
--data_name=imagenet \
--model=resnet50 \
--num_batches=200 \
--num_gpus=1 \
--batch_size=32
After the task completes, the OrionX GUI shows that the resources have been released dynamically. Through this mode, we can improve the time-division multiplexing efficiency of GPU resources and increase average utilization by 3-5 times.
Connecting to the Pod remotely via SSH
The previous steps entered the Pod through the kubectl command line. In many companies, algorithm engineers are not given this permission directly. So can we reach a Pod from a remote machine, just like a remote virtual machine? The answer is yes. Next, we introduce how to develop in a Pod remotely over SSH. The prerequisites are the same as for a remote virtual machine: open the SSH port, configure a username and password (or a key pair), and expose a reachable address and port for the connection.
- Open the SSH port of the Pod
To open the SSH port, we need to modify the original image: install the SSH service and start sshd when the container starts. For convenience of testing, we also clone the benchmark script into the image.
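A minimal sketch of such a modified image is shown below. It is illustrative only (the exact Dockerfile used in this POC is not reproduced here) and assumes the Ubuntu `openssh-server` package and a foreground sshd as the container's main process:

```dockerfile
# Illustrative sketch: extend the official TensorFlow image with an SSH server.
FROM tensorflow/tensorflow:2.4.3-gpu

# Install the SSH server and Git, and clone the benchmark used for testing.
RUN apt-get update && \
    apt-get install -y --no-install-recommends openssh-server git && \
    rm -rf /var/lib/apt/lists/* && \
    mkdir -p /run/sshd /root/.ssh && \
    git clone https://github.com/tensorflow/benchmarks /root/benchmarks

EXPOSE 22
# Run sshd in the foreground so it is the container's main process.
CMD ["/usr/sbin/sshd", "-D"]
```

Building this with a tag such as tensorflow/tensorflow:2.4.3-gpu-ssh matches the image name used in the deployment YAML later in this article.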
- Configure the key
To ensure security, this time we connect to the Pod with an SSH key. First, generate a key pair with Xshell or directly on the Linux command line, then mount the public key into the Pod through a ConfigMap. The ConfigMap is as follows:
# cat ssh-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ssh-config
data:
  sshd_config: |
    PasswordAuthentication no
    ChallengeResponseAuthentication no
    UsePAM no
  authorized_keys: |
    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwgGoBqurkauiio8zhh6HRX/8PJ0daXmnW38EyvIghW9au7qG3yBjxEzsDcPpUILne1gMmb6WO1+IdENPIsqZ1ycsfrKjpCbeXUKL7vbuUasBKlkoG/xvhCy1G+GTEwSdyPQnjYsE5cnTedIvbd0wfSjgtMqa3D4fKT/1eCBoGs8n4yPKOZo8l/jKFv5/ph8qi5uvNPMdWx43+4prpOVN8oPLWRSFJ1WZ8zTRGOwnkdi0LZLrbQ7OqMaEsUKrMndAH56e9MToex2J3ngbTYceFGo2SWCKGAmy32RFvmoxHfCjUQlcGvElNh5OEPlBSGMc5RLXQlrzpD5iVm7hkzgxzQ==
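For example, a key pair can be generated on the Linux command line as follows. The file name orion_dev_key is an arbitrary illustrative choice:

```shell
# Generate an RSA key pair for connecting to the Pod.
# "./orion_dev_key" is an illustrative file name, not one mandated by OrionX.
ssh-keygen -t rsa -b 2048 -f ./orion_dev_key -N "" -q
# The contents of the .pub file go into the authorized_keys field of the
# ConfigMap; the private key stays on your workstation.
cat ./orion_dev_key.pub
```

The private key is later passed to Xshell (or `ssh -i`); the public key is what the ConfigMap mounts into /root/.ssh/authorized_keys.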
- Modify the image startup parameters and mount the ConfigMap. The Pod startup YAML is as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: tf-243
  namespace: orion
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tf-243
  template:
    metadata:
      labels:
        name: tf-243
    spec:
      nodeName: sc-poc-worker-1
      #hostNetwork: true
      schedulerName: orion-scheduler
      containers:
      - name: tf-243
        image: tensorflow/tensorflow:2.4.3-gpu-ssh
        imagePullPolicy: IfNotPresent
        #command: ["bash", "-c"]
        #args: ["while true; do sleep 30; done;"]
        resources:
          requests:
            virtaitech.com/gpu: 1
          limits:
            virtaitech.com/gpu: 1
        env:
        - name: ORION_GMEM
          value: "5000"
        - name: ORION_RATIO
          value: "60"
        - name: ORION_VGPU
          value: "1"
        - name: ORION_RESERVED
          value: "0"
        - name: ORION_CROSS_NODE
          value: "1"
        - name: ORION_EXPORT_CMD
          value: "orion-smi -j"
        - name: ORION_CLIENT_ID
          value: "orion"
        - name: ORION_GROUP_ID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        - name: ORION_K8S_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ORION_K8S_POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        ports:
        - containerPort: 22
        volumeMounts:
        - name: ssh-volume
          subPath: sshd_config
          mountPath: /etc/ssh/sshd_config
        - name: ssh-volume
          subPath: authorized_keys
          mountPath: /root/.ssh/authorized_keys
      volumes:
      - name: ssh-volume
        configMap:
          name: ssh-config
- Connect to the Pod remotely via NodePort
If your environment allows, you can use a LoadBalancer; here we provide a connection address for the Pod through a NodePort. Create the Service as follows:
apiVersion: v1
kind: Service
metadata:
  name: ssh-service
spec:
  type: NodePort
  ports:
  - protocol: TCP
    port: 22
    targetPort: 22
  selector:
    name: tf-243
- Deploy the TensorFlow image and connect remotely via Xshell
Deploy the YAML above directly; the result is as follows:
# kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
tf-243-9fb569b9d-klmtj   1/1     Running   0          12m
# kubectl get svc
NAME          TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
ssh-service   NodePort   172.16.7.214   <none>        22:32550/TCP   2d
The Service shows that port 22 is mapped to NodePort 32550. We can connect by pointing Xshell at any node's IP on port 32550. Running the benchmark again with the earlier script works normally. At this point, we can develop in the Pod remotely over SSH.
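For reference, from a plain Linux terminal the equivalent connection looks like the sketch below. The node IP is a placeholder for one of your worker nodes, and ./orion_dev_key is a hypothetical private-key file; echo only prints the command so the sketch stays self-contained:

```shell
# Placeholder values for illustration; substitute a real worker-node IP.
NODE_IP=192.168.1.10
NODE_PORT=32550
# Print the SSH command that would reach the Pod through the NodePort.
echo ssh -i ./orion_dev_key -p "${NODE_PORT}" "root@${NODE_IP}"
```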
These are the best practices for OrionX vGPU in the SSH development machine scenario. We will share OrionX vGPU development practices in Jupyter and CodeServer modes in future articles. Feel free to leave a comment to discuss!