OpenPAI Research Handbook

OpenPAI Research

OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. OpenPAI supports on-premises, cloud, and hybrid environments of all sizes, and the platform can be customized and extended according to user needs, making everyday AI tasks easier for users and administrators.

1.1 OpenPAI architecture

The main design goal of OpenPAI is to support full-process AI development by multiple users; to this end, it introduces a Marketplace that allows models and data to be shared among users. In v0.14.0 and earlier versions the framework is built on Kubernetes, Hadoop, and YARN; from v1.0.0 onward it is managed entirely by Kubernetes. The architecture of v1.0.0 is shown in Figure 1.1.
Figure 1.1 OpenPAI v1.0.0 architecture

1.2 OpenPAI Installation Manual

1.2.1 Cluster planning

According to the requirements of the OpenPAI manual, at least three machines are required to build a cluster. The following takes the configuration of three machines as an example to introduce the installation and deployment process of OpenPAI. The functions and addresses of the three machines are shown in Table 1.1.

In addition, the OpenPAI project documentation recommends Ubuntu 16.04 LTS as the operating system, although Ubuntu 18.04 also works. The master and worker must be physical machines (if a virtual machine is used, the GPU has to be passed through to it, which is not suitable for a production environment), while dev can be a virtual machine with at least 40 GB of disk space (it is only used for installing and maintaining the system, so no physical machine resources are wasted). Finally, since Kubernetes does not support swap, do not add a swap partition when installing the system, otherwise the node will hang when it restarts.
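If a swap partition already exists on a node, it can be disabled afterwards. The commands below are a minimal sketch for Ubuntu, assuming root privileges; device names and fstab layout are the distribution defaults and may differ on your machines.

sudo swapoff -a                              # turn off all swap immediately
sudo sed -i '/ swap / s/^/#/' /etc/fstab     # comment out swap entries so they stay off after reboot
free -h                                      # verify that the Swap line now shows 0B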

1.2.2 Basic environment preparation

(1) Network settings: Configure the cluster network according to the IP addresses in the cluster plan, mainly the IP address, subnet mask, gateway, and DNS name server. After setting the network, restart to apply the changes so that the three machines can reach each other within the same LAN.
(2) Remote access settings: first install openssh-server on each machine, then configure ssh to allow users with root privileges to log in, and finally start the ssh service for access.
(3) Install Docker: Since OpenPAI implements its container cluster management on top of Kubernetes, Docker must be installed on each machine first. First update the operating system configuration; then run the commands to install Docker; finally install nvidia-container-runtime and add the Alibaba Cloud image mirror address (a minimal sketch follows this list).
(4) Configure the master node: on top of the steps above, the master node additionally only needs NTP configured.
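As an illustration of step (3), the following is a minimal sketch for Ubuntu. The package choices, the NVIDIA repository setup, and the Alibaba Cloud mirror address (account-specific, shown here as a placeholder) are assumptions and should be adapted to your environment.

sudo apt-get update
sudo apt-get install -y docker.io                   # or install docker-ce from Docker's official repository
sudo apt-get install -y nvidia-container-runtime    # assumes the NVIDIA container runtime repository has been added
# register nvidia as the default runtime and add a registry mirror (placeholder mirror id)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } },
  "registry-mirrors": ["https://<your-mirror-id>.mirror.aliyuncs.com"]
}
EOF
sudo systemctl restart docker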

1.2.3 Install OpenPAI

(1) Clone the OpenPAI project and select the OpenPAI version to be installed.
(2) Configure the config file and layout file: Since gcr.azk8s.cn and shaiictestblob01.blob.core.chinacloudapi.cn used in the officially recommended configuration are no longer maintained, the config file has to be set up with a combination of Alibaba Cloud mirrors and Docker Hub images; the layout file is configured according to the hardware parameters.
(3) Install and deploy Kubernetes: execute the installation script directly; if the console exits normally, the installation succeeded. After deployment completes, information such as the config, ID, and username is printed (a sketch of the overall flow follows this list).
(4) Enter the master node IP address in a browser to start using OpenPAI for development.
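As a rough end-to-end sketch of steps (1) to (3): for v1.x the installation scripts live under contrib/kubespray in the microsoft/pai repository. The branch name and script invocations below are taken from the public repository and may differ between releases, so treat them as assumptions and follow the installation guide of the version you check out.

git clone https://github.com/microsoft/pai.git
cd pai
git checkout pai-1.0.y                # example release branch; pick the version to install
cd contrib/kubespray
# edit the config file (image sources) and the layout file (hardware layout) here, then:
/bin/bash quick-start-kubespray.sh    # installs and deploys Kubernetes
/bin/bash quick-start-service.sh      # deploys the OpenPAI services on top of it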

1.3 OpenPAI basic management

1.3.1 Front-end management interface

The Web portal provides some basic management functions. After successful installation, if you log in as an administrator, you can find several management buttons on the left column, as shown in Figure 1.2.

Figure 1.2 Front-end management interface

(1) Service interface: The service page displays the OpenPAI services deployed in Kubernetes.

Figure 1.3 Service interface

(2) Hardware utilization interface: The hardware page displays the CPU, GPU, memory, disk and network utilization of each node in the cluster. Different utilizations are displayed in different colors. If you hover the mouse over these colored circles, then the page will display the exact utilization percentage.

Figure 1.4 Hardware Utilization Interface

(3) User management interface: The user management interface is used to create, modify, and delete users. When creating a user, you can choose one of two user types: administrator or non-administrator. This page is displayed only when OpenPAI is deployed in basic authentication mode (the default authentication mode). If the cluster uses AAD to manage users, this page is unavailable.

Figure 1.5 User Management Interface

(4) Abnormal job interface: The abnormal jobs section is provided for administrators on the home page. A job that has run for more than 5 days or whose GPU usage is below 10% is considered abnormal, and the administrator can choose to stop it if necessary.

Figure 1.6 Abnormal job interface

1.3.2 Setting up data storage

Currently, there are many ways and types of data storage. For convenience of use and management, Kubernetes introduces the concepts of PV and PVC. A PV (Persistent Volume) is analogous to a disk partition and is an abstraction of the underlying shared storage; a PVC (Persistent Volume Claim) is a resource request issued by the user to the Kubernetes system. In OpenPAI, PVs are mainly used for data storage, and the storage setup follows these steps:
(1) Create a PV and PVC on Kubernetes as PAI storage (a minimal sketch follows Figure 1.7).
(2) Confirm that the worker nodes have the correct environment.
(3) Authorize PVCs to specific user groups.

Figure 1.7 Schematic diagram of the working mechanism of PV and PVC
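As an illustration of step (1), below is a minimal sketch of an NFS-backed PV/PVC pair; the server address, path, capacity, and names are placeholders, and OpenPAI's storage documentation describes additional conventions (such as which namespace the PVC must live in) that the PVC has to follow before it can be granted to user groups.

# pv-and-pvc.yaml (hypothetical file name); NFS-backed example with placeholder values
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-storage-pv
  labels:
    name: nfs-storage
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /data
    server: 10.0.0.1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-storage          # the PVC name is what gets authorized to user groups
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      name: nfs-storage

Apply it with kubectl apply -f pv-and-pvc.yaml and check that the PVC is bound (kubectl get pvc) before authorizing it to a group.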

1.3.3 Setting up a virtual cluster

OpenPAI supports two schedulers: the Kubernetes default scheduler and the HiveD scheduler.
HiveD is a Kubernetes scheduler designed for deep learning. It supports virtual cluster division, topology-aware resource guarantees, and performance-optimized gang scheduling, none of which the Kubernetes default scheduler provides. Currently only HiveD supports virtual cluster settings; the Kubernetes default scheduler does not.
Suppose there are 3 nodes, worker1, worker2, and worker3, all in the default virtual cluster. To create two virtual clusters instead, one called default containing two nodes and the other called new containing one node, the GPU virtual clusters can be configured as follows:

Code snippet 1.1: Setting up GPU virtual clusters
# services-configuration.yaml
...
hivedscheduler:
  config: |
    physicalCluster:
      skuTypes:
        DT:
          gpu: 1
          cpu: 5
          memory: 56334Mi
      cellTypes:
        DT-NODE:
          childCellType: DT
          childCellNumber: 4
          isNodeLevel: true
        DT-NODE-POOL:
          childCellType: DT-NODE
          childCellNumber: 3
      physicalCells:
      - cellType: DT-NODE-POOL
        cellChildren:
        - cellAddress: worker1
        - cellAddress: worker2
        - cellAddress: worker3
    virtualClusters:
      default:
        virtualCells:
        - cellType: DT-NODE-POOL.DT-NODE
          cellNumber: 2
      new:
        virtualCells:
        - cellType: DT-NODE-POOL.DT-NODE
          cellNumber: 1
...

If a CPU-only machine is added, the OpenPAI manual recommends setting up a pure-CPU virtual cluster for it; do not mix CPU nodes and GPU nodes in the same virtual cluster. A sketch of such a configuration follows.
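The fragment below is a hedged sketch of what that separation could look like, assuming a hypothetical CPU-only node worker4; it extends the skuTypes, cellTypes, and physicalCells sections of code snippet 1.1 and adds a separate cpu virtual cluster. The sku sizes are illustrative values, not figures from the manual.

# services-configuration.yaml (excerpt to be merged into the sections shown in code snippet 1.1)
      skuTypes:
        CPU:
          cpu: 1
          memory: 10240Mi
      cellTypes:
        CPU-NODE:
          childCellType: CPU
          childCellNumber: 8
          isNodeLevel: true
        CPU-NODE-POOL:
          childCellType: CPU-NODE
          childCellNumber: 1
      physicalCells:
      - cellType: CPU-NODE-POOL
        cellChildren:
        - cellAddress: worker4
    virtualClusters:
      cpu:
        virtualCells:
        - cellType: CPU-NODE-POOL.CPU-NODE
          cellNumber: 1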

1.3.4 Setting up the Docker image cache

The Docker image cache is implemented in OpenPAI as the docker-cache service, which helps users avoid the waits caused by exceeding the registry pull limit when deploying services or submitting jobs. The Docker image cache is configured as a pull-through cache with Azure Blob Storage or the Linux file system as the storage backend. In addition, through the provided docker-cache configuration distribution script, users can easily use their own Docker registry or pull-through cache.
Docker image cache provides three usage methods:
(1) Start a cache service that uses Azure Blob Storage as the storage backend: set the relevant fields in the config.yaml configuration file to Azure Blob Storage during installation and complete the installation, as shown in code snippet 1.2.

Code snippet 1.2: Using Azure Blob Storage as the storage backend
enable_docker_cache: true
docker_cache_storage_backend: "azure"
docker_cache_azure_account_name: "forexample"
docker_cache_azure_account_key: "forexample"

(2) Start a cache service that uses the Linux file system as the storage backend: similarly, set the relevant fields in the configuration file to the Linux file system and complete the installation, as shown in code snippet 1.3.

Code snippet 1.3: Using the Linux file system as the storage backend
enable_docker_cache: true
docker_cache_storage_backend: "filesystem"
# docker_cache_azure_account_name: ""
# docker_cache_azure_account_key: ""
# docker_cache_azure_container_name: "dockerregistry"
docker_cache_fs_mount_path: "/var/lib/registry"

(3) Use a custom registry: For users who want the OpenPAI cluster to use a custom registry, a simple way is to modify ./contrib/kubespray/docker-cache-config-distribute.yml, which is responsible for modifying the Docker daemon configuration of each node. By default, the playbook adds port 30500 of the kube-master node as the entry point of the docker-cache service, so you only need to replace the string {{ hostvars[groups['kube-master'][0]]['ip'] }}:30500 in this file with the address of the custom registry, as shown in code snippet 1.4.

Code snippet 1.4: Using a custom registry
  roles:
    - role: '../roles/docker-cache/install'
      vars:
        enable_docker_cache: true
        docker_cache_host: "{{ hostvars[groups['kube-master'][0]]['ip'] }}:30500"
  tasks:
    - name: Restart service docker config from /etc/docker/daemon.json after update
      ansible.builtin.systemd:
        name: docker
        daemon_reload: true
        state: restarted
