Volcano community v1.8.0 is officially released

Volcano community v1.8.0 has been officially released. This version adds the following new features:

  • Support vGPU scheduling and isolation

  • Support preemption of vGPU and user-defined resources

  • Added JobFlow workflow orchestration engine

  • Node load-aware scheduling and rescheduling support diverse monitoring systems

  • Optimize Volcano's ability to schedule general services

  • Optimize the release and archiving of the Volcano charts package

Support vGPU scheduling and isolation

Since ChatGPT took off, research and development on large AI models has surged, and many different kinds of large models have been launched one after another. Because their training tasks require enormous computing power, the supply of GPU-centered computing power has become key infrastructure for the development of the large model industry. In practice, users face pain points such as low utilization and inflexible allocation of GPU resources. To meet business needs they must purchase a large amount of redundant heterogeneous computing power, which is itself expensive and places a heavy burden on enterprises.

Starting from version 1.8, Volcano provides an abstract, general framework for shareable devices (GPU, NPU, FPGA, ...), on which developers can build custom types of shared devices. Volcano has already implemented GPU virtualization on top of this framework, supporting GPU device multiplexing, resource isolation and other capabilities. Details are as follows:

  • GPU sharing: Each task can request a portion of a GPU card's resources, and one GPU card can be shared among multiple tasks.

  • Device memory control: GPU device memory can be allocated by absolute size (for example: 3000M) or by proportion (for example: 50%), which provides resource isolation for virtualized GPUs.
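As a rough illustration of GPU sharing with a device memory limit, a Pod might request a slice of one card as sketched below. The resource names `volcano.sh/vgpu-number` and `volcano.sh/vgpu-memory` are given as recalled from the Volcano vGPU user guide and should be checked against the installed device plugin. The Pod name and image are hypothetical placeholders.

```yaml
# Minimal sketch, not an official example.
# Assumptions: the Volcano vGPU device plugin is installed and exposes
# the resource names below; the Pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  schedulerName: volcano            # let Volcano schedule this Pod
  containers:
  - name: cuda-app
    image: example.com/cuda-app:latest   # hypothetical image
    resources:
      limits:
        volcano.sh/vgpu-number: 1        # share one physical GPU card
        volcano.sh/vgpu-memory: 3000     # device memory limit, matching the 3000M example
```

Several such Pods can land on the same physical card as long as the sum of their device memory requests fits on it.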

For more information about vGPU, refer to:

Support preemption of vGPU and user-defined resources

Until now, Volcano has supported preemption for basic resources such as CPU and memory. It has not supported preemption well for GPU resources, or for resources that users manage themselves (such as NPU or network resources) through scheduling plug-ins developed on top of the Volcano framework.

In version 1.8, Volcano refactored the node filtering logic (the PredicateFn callback function) and added a Status type to its return value. The Status indicates whether a node meets the conditions for placing the job in scenarios such as scheduling and preemption. GPU preemption has been released on top of this optimized framework, and scheduling plug-ins that users have developed based on Volcano can be adapted and upgraded according to their business scenarios.
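From the user's side, one way to exercise vGPU preemption is to submit a higher-priority job that requests a vGPU, as sketched below. This is an illustration under assumptions. The PriorityClass and Job names and the image are hypothetical, the `volcano.sh/vgpu-number` resource name is as recalled from the vGPU user guide, and the scheduler configuration is assumed to include the `preempt` action.

```yaml
# Minimal sketch, not an official example.
# Assumptions: "preempt" is listed in the scheduler's actions, the vGPU
# device plugin is installed, and all names and the image are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000                      # higher value means higher priority
globalDefault: false
description: "Jobs allowed to preempt lower-priority workloads."
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vgpu-train
spec:
  schedulerName: volcano
  priorityClassName: high-priority  # job-level priority used for preemption decisions
  minAvailable: 1
  tasks:
  - name: worker
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: example.com/trainer:latest   # hypothetical image
          resources:
            limits:
              volcano.sh/vgpu-number: 1       # the resource contended for
```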

For more information about supporting extended resource preemption, refer to: https://github.com/volcano-sh/volcano/pull/2916

Added JobFlow workflow orchestration engine

Workflow orchestration engines are widely used in high-performance computing, AI biomedicine, image processing, beauty, game AGI, scientific computing and other scenarios. They help users simplify the management of parallelism and dependencies among multiple tasks and greatly improve overall computing efficiency.

JobFlow is a lightweight task flow orchestration engine focused on orchestrating Volcano jobs. It provides Volcano with job probes, job completion dependencies, job failure rate tolerance and other job dependency types, and supports complex process control primitives. Its specific capabilities are as follows:

  • Support large-scale job management and complex task flow orchestration

  • Support real-time queries of the running status and task progress of all associated jobs

  • Support automatic running and scheduled start of jobs to reduce manual effort

  • Multiple action policies can be set for different tasks, and the corresponding action is triggered when a task meets certain conditions, such as retry on timeout or migration on node failure
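To make the dependency model concrete, the sketch below declares one JobTemplate and a JobFlow in which job `b` starts only after job `a` completes. The field names follow the JobFlow design document as recalled and should be verified against the installed CRDs. The template names, the flow name and the image are hypothetical, and a second JobTemplate named `b` (not shown) is assumed to exist.

```yaml
# Minimal sketch, not an official example.
# Assumptions: the JobFlow CRDs (flow.volcano.sh/v1alpha1) are installed;
# names and the image are placeholders; a JobTemplate "b" is defined elsewhere.
apiVersion: flow.volcano.sh/v1alpha1
kind: JobTemplate
metadata:
  name: a
spec:                               # same shape as a Volcano Job spec
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - name: step
    replicas: 1
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: main
          image: example.com/step-a:latest   # hypothetical image
---
apiVersion: flow.volcano.sh/v1alpha1
kind: JobFlow
metadata:
  name: demo-flow
spec:
  jobRetainPolicy: delete           # clean up generated jobs when the flow ends
  flows:
  - name: a                         # refers to JobTemplate "a"
  - name: b                         # refers to JobTemplate "b"
    dependsOn:
      targets: ["a"]                # b waits for a to complete
```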

The JobFlow task running demo is as follows:

For more information about JobFlow, refer to: https://github.com/volcano-sh/volcano/blob/master/docs/design/jobflow/README.md

Node load-aware scheduling and rescheduling support diverse monitoring systems

The state of a Kubernetes cluster changes in real time as tasks are created and terminated. In some scenarios (such as adding and deleting nodes, changes in Pod and Node affinity, or dynamic changes in the job life cycle), problems can arise such as unbalanced resource utilization across cluster nodes, node performance bottlenecks, and nodes going offline. In these cases, scheduling and rescheduling based on real load can help solve these problems.

Before Volcano version 1.8, real-load scheduling and rescheduling could obtain metrics only from Prometheus. Starting from version 1.8, Volcano optimizes the framework for obtaining monitoring metrics, adds support for the ElasticSearch monitoring system, and allows more types of monitoring systems to be integrated smoothly with a small amount of adaptation work.
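A scheduler configuration that switches the metrics source to ElasticSearch might look roughly like the sketch below. The `metrics` block keys (`type`, `address`, `interval`) are given as recalled from the Volcano usage-based scheduling documentation and should be treated as assumptions. The address is a hypothetical placeholder.

```yaml
# Minimal sketch, not an official example.
# Assumptions: the key names under "metrics" match the installed Volcano
# version; the ElasticSearch address is a placeholder.
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
- plugins:
  - name: usage                     # load-aware scheduling plugin
  - name: nodeorder
metrics:
  type: elasticsearch               # "prometheus" was the only source before 1.8
  address: http://elasticsearch.example.svc:9200   # hypothetical endpoint
  interval: 30s                     # how often node metrics are refreshed
```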

For more information on supporting multiple monitoring systems, refer to:

Optimize Volcano's ability to schedule general services

Add Kubernetes default scheduler plug-in switch

Volcano is a unified scheduling system that supports not only computing jobs such as AI and big data but also microservice workloads. It is compatible with Kubernetes default scheduler plug-ins such as PodTopologySpread, VolumeZone, VolumeLimits, NodeAffinity, and PodAffinity, and these default scheduling plug-in capabilities are enabled by default in Volcano.
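As an illustration of a general service scheduled by Volcano, the Deployment sketched below sets `schedulerName: volcano` and relies on the PodTopologySpread capability through a standard `topologySpreadConstraints` entry. The Deployment name, labels and image are hypothetical, and the cluster is assumed to label nodes with `topology.kubernetes.io/zone`.

```yaml
# Minimal sketch, not an official example.
# Assumptions: names, labels and image are placeholders; nodes carry the
# standard topology.kubernetes.io/zone label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      schedulerName: volcano        # hand scheduling to Volcano instead of the default scheduler
      topologySpreadConstraints:
      - maxSkew: 1                  # zones may differ by at most one replica
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25
```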

Since Volcano 1.8, the Kubernetes default scheduling plug-ins can be individually turned on or off through the configuration file; all of them are on by default. To turn off some plug-ins, for example PodTopologySpread and VolumeZone, set the corresponding values in the predicates plug-in to false. The configuration is as follows:

actions: "allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
    arguments:
      predicate.VolumeZoneEnable: false
      predicate.PodTopologySpreadEnable: false
  - name: proportion
  - name: nodeorder

For more information, please refer to: https://github.com/volcano-sh/volcano/issues/2748

Enhanced ClusterAutoscaler compatibility

On Kubernetes, Volcano is used not only as a scheduler for batch computing services but also, increasingly, as a scheduler for general services. Horizontal node scaling (Cluster Autoscaler) is one of the core capabilities of Kubernetes and plays an important role in handling surges in user traffic and saving operating costs. Volcano optimizes job scheduling and related logic to enhance compatibility and interaction with ClusterAutoscaler, mainly in the following two aspects:

  • Pods that enter the pipeline state during the scheduling phase trigger capacity expansion promptly
  • Candidate nodes are scored in gradients to reduce the impact of terminating pods in the cluster on the scheduling load, and to prevent pods from entering an invalid pipeline state and triggering cluster expansion by mistake

For more information, refer to:

Refined management of Node resources to enhance resilience

Before version 1.8, when a node reported abnormal information for some reason (for example, from a device plugin) and the total amount of some resource on the node was less than the amount already allocated, Volcano considered the node's data inconsistent, isolated the node, and stopped scheduling any new workloads to it. In version 1.8, node resources are managed at a finer granularity. For example, when the total GPU capacity of a node is less than the GPU resources already allocated, pods that request GPU resources are not scheduled to the node, while jobs that request only non-GPU resources can still be scheduled to it normally.

For more information, refer to: https://github.com/volcano-sh/volcano/issues/2999

Optimize the release and archiving of the Volcano charts package

As more and more users run Volcano in production and cloud environments, a simple and standard installation procedure is crucial. Since version 1.8, Volcano has optimized how the charts package is published and archived, standardized the installation and usage process, and migrated historical versions (v1.6, v1.7) to the new Helm repository. Usage is as follows:

  • Add the Volcano charts repository
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
  • Query all installable versions of Volcano
helm search repo volcano -l
  • Install the latest version of Volcano
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
  • Install the specified version of Volcano, for example: 1.7.0
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace --version 1.7.0

For more information about the Volcano charts package, refer to: https://github.com/volcano-sh/helm-charts


Origin www.oschina.net/news/254878/volcano-1-8-0-released