Best Practices for OrionX vGPU in R&D and Test Scenarios: SSH Mode

Development machine scenario overview

At present, many companies manage GPU resources for AI development in a simple, crude way: most GPUs are managed by the development team and assigned to individual engineers by operations on a per-machine basis. In AI development and testing, many development, testing, and even training tasks end up sharing the same resources, and users run into several problems:

From a management perspective, resources cannot be managed and scheduled uniformly, and good monitoring and resource accounting are hard to achieve. From the algorithm engineers' perspective, resources are tight and must be negotiated with colleagues, and there is no way to apply for and use resources flexibly and dynamically.

To address these problems, the OrionX AI computing resource pooling solution aims to improve GPU utilization, provide a flexible scheduling platform, manage computing resources uniformly, and achieve elastic scaling and on-demand use.

Based on the usage habits of algorithm engineers, we have summarized three development machine scenarios:

  • SSH mode: The algorithm engineer is given a machine, whether a physical machine, a virtual machine, or a container, and connects to it remotely to develop algorithms and use resources.

  • Jupyter mode: Jupyter has become a frequently used tool among algorithm engineers in recent years. Many companies have built it into an integrated development tool, and it can be deployed in containers or virtual machines for algorithm engineers to use.

  • CodeServer mode: the server version of Microsoft's VSCode. Many companies have adopted this tool in recent years. Resources are used in a similar way to Jupyter, and it is likewise deployed in virtual machines or containers.

We will introduce OrionX vGPU best practices for these three scenarios in three articles. Today, let's start with SSH.

Environment preparation

The environment includes physical machines or virtual machines, network environments, GPU cards, operating systems, and container platforms.

Hardware environment

This POC environment uses three virtual machines: one CPU node and two GPU nodes. Each GPU node has one T4 card.

  • Operating system: Ubuntu 18.04

  • Management network: Gigabit TCP

  • Remote call network: 100G RDMA

Kubernetes environment

To install the K8s environment on the three nodes, you can use kubeadm or a deployment tool such as the following (a minimal kubeadm sketch appears after this list):

  • kubekey

  • kuboard-spray
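For illustration only, a minimal kubeadm-based bootstrap might look like the sketch below (the version matches the cluster shown next; the pod network CIDR and the choice of CNI plugin are assumptions for this example):

# On the master node: initialize the control plane
kubeadm init --kubernetes-version=v1.21.5 --pod-network-cidr=10.244.0.0/16

# Make kubectl usable for the current user
mkdir -p $HOME/.kube && cp /etc/kubernetes/admin.conf $HOME/.kube/config

# Install a CNI plugin, then on each worker node run the join command
# printed by kubeadm init, e.g.:
# kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>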

The currently deployed Kubernetes environment is as follows:

root@sc-poc-master-1:~# kubectl get node
NAME              STATUS   ROLES                         AGE    VERSION
sc-poc-master-1   Ready    control-plane,master,worker   166d   v1.21.5
sc-poc-worker-1   Ready    worker                        166d   v1.21.5
sc-poc-worker-2   Ready    worker                        166d   v1.21.5

The master node is the CPU node, and the two worker nodes are the T4 GPU nodes.

OrionX vGPU pooled environment

Refer to VirtAI Tech's "OrionX Implementation Plan - K8s Version" for deployment. Afterwards, we can view the OrionX components in the orion namespace:

root@sc-poc-master-1:~# kubectl get pod -n orion 
NAME                                 READY   STATUS    RESTARTS   AGE
orion-container-runtime-hgb5p        1/1     Running   3          63d
orion-container-runtime-qmghq        1/1     Running   1          63d
orion-container-runtime-rhc7s        1/1     Running   1          46d
orion-exporter-fw7vr                 1/1     Running   0          2d21h
orion-exporter-j98kj                 1/1     Running   0          2d21h
orion-gui-controller-all-in-one-0    1/1     Running   2          87d
orion-plugin-87grh                   1/1     Running   6          87d
orion-plugin-kw8dc                   1/1     Running   8          87d
orion-plugin-xpvgz                   1/1     Running   8          87d
orion-scheduler-5d5bbd5bc9-bb486     2/2     Running   7          87d
orion-server-6gjrh                   1/1     Running   1          74d
orion-server-p87qk                   1/1     Running   4          87d
orion-server-sdhwt                   1/1     Running   1          74d

Development machine scenario: SSH remote connection mode

For this test, we start a Pod, allocate OrionX vGPU resources to it, and then enter the Pod through kubectl exec to develop. The Pod uses the official TensorFlow image tensorflow/tensorflow:2.4.3-gpu.

The deployment YAML file is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: tf-243
  namespace: orion
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tf-243
  template:
    metadata:
      labels:
        name: tf-243
    spec:
      #nodeName: sc-poc-master-1
      hostNetwork: true
      schedulerName: orion-scheduler
      containers:
      - name: tf-243
        image: tensorflow/tensorflow:2.4.3-gpu 
        imagePullPolicy: Always 
        #imagePullPolicy: IfNotPresent
        command: ["bash", "-c"]
        args: ["while true; do sleep 30; done;"]
        resources:
          requests:
            virtaitech.com/gpu: 1
          limits:
            virtaitech.com/gpu: 1
        env:
          - name: ORION_GMEM
            value: "5000"
          - name: ORION_RATIO
            value: "60"
          - name: ORION_VGPU
            value: "1"
          - name: ORION_RESERVED
            value: "1"
          - name: ORION_CROSS_NODE
            value: "1"
          - name: ORION_EXPORT_CMD
            value: "orion-smi -j"
          - name: ORION_CLIENT_ID
            value: "orion"
          - name: ORION_GROUP_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid
          - name: ORION_K8S_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ORION_K8S_POD_UID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid

For detailed OrionX parameters, please refer to the "OrionX User Manual". Here are some commonly used parameters:

  • ORION_VGPU: the number of OrionX vGPUs requested, such as 1 or more

  • ORION_GMEM: the requested GPU memory size in MB; 5000 means 5 GB of GPU memory

  • ORION_RATIO: the amount of compute requested. OrionX vGPU compute is divided as a percentage of a physical card, so the value here is a percentage: 60 means requesting 60% of a single physical card's compute. The minimum unit is 1% and the maximum is 100%; 100% means the vGPU requests the entire physical card.

  • ORION_RESERVED: whether the requested resources use reserved mode. OrionX supports two allocation modes, reserved and non-reserved. In reserved mode, resources are pre-allocated when the Pod starts, whether or not the Pod is running an AI task, much like using a physical GPU. In non-reserved mode, no resources are allocated when the Pod starts; they are allocated only when an AI task actually runs, and released automatically when the task ends.

Requesting resources in reserved mode

For this test we first request resources in reserved mode and deploy the TF image:

# kubectl create -f 09-deploy-tf-243-gpu.yaml

# kubectl get pod
NAME                                READY   STATUS              RESTARTS   AGE
tf-243-84657d76b5-jqqc8             0/1     ContainerCreating   0          30m
# waiting for the image pull to finish and the container to start

The OrionX GUI shows that the OrionX vGPU resource has been allocated. Since this is reserved mode, resources are pre-allocated whether or not they are used.

After the Pod starts, we enter it to check whether the requested resources have taken effect.

root@sc-poc-master-1:~# kubectl exec -it tf-243-84657d76b5-jqqc8  -- bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@sc-poc-master-1:/# env | grep -iE "vgpu|gmem|ratio|rese"
ORION_GMEM=5000
ORION_RESERVED=1
ORION_RATIO=60
ORION_VGPU=1

The environment variables confirm that the requested resources have taken effect.

Since this is the official image and TensorFlow's base OS image is Ubuntu 18.04, we switch to a domestic apt mirror and then install Git. We will then download the TensorFlow benchmark as a test script to simulate development and testing.

#  sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list
# apt update
# apt install git -y
# git clone https://github.com/tensorflow/benchmarks

After downloading the TensorFlow benchmark, we can run it directly to simulate a training task. This matches the actual R&D workflow: write code in the Pod, then run it. The benchmark is run as follows:

python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--forward_only=False \
--data_name=imagenet \
--model=resnet50 \
--num_batches=200 \
--num_gpus=1 \
--batch_size=32

Once it runs normally, the benchmark prints its results in the window. The output shows about 81.3 images trained per second. We can also watch the task through the OrionX GUI task interface: the task is allocated 60% of the compute and 5 GB of GPU memory, matching the requested resources.

Requesting resources in non-reserved mode

The previous test used reserved mode. This time we change the ORION_RESERVED parameter in the Deployment YAML to 0, which selects non-reserved mode. In non-reserved mode no resources are allocated when the Pod starts; they are allocated only while a task is actually running. This can multiply the efficiency of time-division multiplexing of GPU resources, and it also allows GPU resources to be over-committed in R&D and test scenarios to further improve utilization. Only one value changes, as shown below.
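The only change in the Deployment's env list relative to the reserved-mode YAML is:

          - name: ORION_RESERVED
            value: "0"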

After the Pod starts, the OrionX GUI shows that no resources have been allocated.

We run the same benchmark task again and find that resources are allocated as soon as the task starts:

python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
--forward_only=False \
--data_name=imagenet \
--model=resnet50 \
--num_batches=200 \
--num_gpus=1 \
--batch_size=32

After the task completes, the OrionX GUI shows that the resources have been released dynamically. Through this mode we can improve the time-sharing multiplexing efficiency of GPU resources and raise average utilization by 3-5x.

Remote access to the Pod via SSH

The previous steps entered the Pod through the kubectl command line. In many companies, algorithm engineers are not granted that permission directly. So can we reach a Pod remotely, the same way we reach a remote virtual machine? The answer is yes. Next we introduce how to develop in a Pod remotely over SSH. The prerequisites are the same as for a remote virtual machine: open the SSH port, configure a username and password (or a key), and expose a remote connection port and connection method.

  • Open the SSH port of the Pod

To open the SSH port, we need to modify the original image: install the SSH service and start sshd when the container starts. For convenience of testing, we also clone the benchmark script into the image.
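A minimal Dockerfile sketch, assuming the Ubuntu 18.04 base of the official image (the package list, paths, and mirror choice are illustrative, not the article's original file):

FROM tensorflow/tensorflow:2.4.3-gpu

# Switch to a domestic apt mirror, then install the SSH server and git
RUN sed -i 's/archive.ubuntu.com/mirrors.aliyun.com/g' /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y --no-install-recommends openssh-server git && \
    rm -rf /var/lib/apt/lists/*

# sshd needs its runtime directory; prepare root's .ssh for the mounted key
RUN mkdir -p /var/run/sshd /root/.ssh && chmod 700 /root/.ssh

# Clone the benchmark script into the image for testing
RUN git clone https://github.com/tensorflow/benchmarks /root/benchmarks

EXPOSE 22

# Run sshd in the foreground as the container's main process
CMD ["/usr/sbin/sshd", "-D"]

The image can then be built and tagged, for example as tensorflow/tensorflow:2.4.3-gpu-ssh, the tag referenced in the Deployment below.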

  • Configure the key

To keep access secure, this time we connect to the Pod with a key rather than a password. First generate a key pair, either through Xshell or directly on the Linux command line, and then mount the public key into the Pod through a ConfigMap.
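As a sketch, a key pair can be generated on a Linux client as follows (the file name orion_pod_key is an arbitrary choice for this example):

# ssh-keygen -t rsa -b 2048 -f ~/.ssh/orion_pod_key -N ""
# cat ~/.ssh/orion_pod_key.pub

The contents of the .pub file go into the authorized_keys field of the ConfigMap below: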

# cat ssh-configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: ssh-config
data:
  sshd_config: |
    PasswordAuthentication no
    ChallengeResponseAuthentication no
    UsePAM no
  authorized_keys: |
    ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAwgGoBqurkauiio8zhh6HRX/8PJ0daXmnW38EyvIghW9au7qG3yBjxEzsDcPpUILne1gMmb6WO1+IdENPIsqZ1ycsfrKjpCbeXUKL7vbuUasBKlkoG/xvhCy1G+GTEwSdyPQnjYsE5cnTedIvbd0wfSjgtMqa3D4fKT/1eCBoGs8n4yPKOZo8l/jKFv5/ph8qi5uvNPMdWx43+4prpOVN8oPLWRSFJ1WZ8zTRGOwnkdi0LZLrbQ7OqMaEsUKrMndAH56e9MToex2J3ngbTYceFGo2SWCKGAmy32RFvmoxHfCjUQlcGvElNh5OEPlBSGMc5RLXQlrzpD5iVm7hkzgxzQ==
  • Modify the image's startup parameters and mount the ConfigMap. The Pod's Deployment YAML is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
  name: tf-243
  namespace: orion
spec:
  replicas: 1
  selector:
    matchLabels:
      name: tf-243
  template:
    metadata:
      labels:
        name: tf-243
    spec:
      nodeName: sc-poc-worker-1
      #hostNetwork: true
      schedulerName: orion-scheduler
      containers:
      - name: tf-243
        image: tensorflow/tensorflow:2.4.3-gpu-ssh
        imagePullPolicy: IfNotPresent
        #imagePullPolicy: IfNotPresent
        #command: ["bash", "-c"]
        #args: ["while true; do sleep 30; done;"]
        resources:
          requests:
            virtaitech.com/gpu: 1
          limits:
            virtaitech.com/gpu: 1
        env:
          - name: ORION_GMEM
            value: "5000"
          - name: ORION_RATIO
            value: "60"
          - name: ORION_VGPU
            value: "1"
          - name: ORION_RESERVED
            value: "0"
          - name: ORION_CROSS_NODE
            value: "1"
          - name: ORION_EXPORT_CMD
            value: "orion-smi -j"
          - name: ORION_CLIENT_ID
            value: "orion"
          - name: ORION_GROUP_ID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid
          - name: ORION_K8S_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ORION_K8S_POD_UID
            valueFrom:
              fieldRef:
                fieldPath: metadata.uid
        ports:
        - containerPort: 22
        volumeMounts:
        - name: ssh-volume
          subPath: sshd_config
          mountPath: /etc/ssh/sshd_config
        - name: ssh-volume
          subPath: authorized_keys
          mountPath: /root/.ssh/authorized_keys
      volumes:
      - name: ssh-volume
        configMap:
          name: ssh-config   
  • Connect to the Pod remotely via NodePort

If possible you can use a LoadBalancer; here we provide a connection address for the Pod through a NodePort. Create the Service as follows:

apiVersion: v1
kind: Service
metadata:
  name: ssh-service
spec:
  type: NodePort
  ports:
  - protocol: TCP
    port: 22
    targetPort: 22
  selector:
    name: tf-243
  • Deploy the TensorFlow image and connect remotely via Xshell

Apply the above YAML files and check the deployment:

# kubectl get pod
NAME                                READY   STATUS    RESTARTS   AGE
tf-243-9fb569b9d-klmtj              1/1     Running   0          12m

# kubectl get svc
NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                        AGE
ssh-service                       NodePort    172.16.7.214     <none>        22:32550/TCP                                   2d

The Service shows that port 22 is mapped to NodePort 32550. We can connect by pointing Xshell at any node's IP with port 32550 (an equivalent plain ssh command is sketched below). Running the earlier benchmark script again works normally. At this point, we can develop in the Pod remotely through SSH.
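Equivalently, from a plain Linux terminal, the connection looks like this sketch (replace <node-ip> with any node's IP; the key path matches the example pair generated earlier):

# ssh -i ~/.ssh/orion_pod_key -p 32550 root@<node-ip>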

The above is the best practice for the OrionX vGPU development machine scenario in SSH mode. We will continue to share OrionX vGPU development practices for the Jupyter and CodeServer modes in future articles. Feel free to leave a comment to discuss!

Origin: blog.csdn.net/m0_49711991/article/details/127766705