Implementation principle of resource reservation after Volcano v1.2

1. Background introduction

Before Volcano v1.2, resource reservation was implemented through the reserve action. For implementation details, see:

Interpretation of Volcano Job Resource Reservation Design Principles-Cloud Community-HUAWEI CLOUD

The reserve action completes resource reservation by binding the selected target job to nodes. The reserve action, the elect action, and the Reservation plugin together constitute the resource reservation mechanism, and the reserve action must be configured after the allocate action. The reserve action has been deprecated since v1.2 and replaced by the sla plugin. The rest of this article focuses on the SLA approach.

2. Introduction to SLA

SLA stands for Service Level Agreement. When users submit jobs to Volcano, they may need to attach specific constraints to a job; for example, a maximum waiting time (JobWaitingTime) prevents the job from starving. These constraints can be seen as a service level agreement between Volcano and its users. The sla plugin is provided to receive and enforce SLA settings for an individual job or for the whole cluster.

3. Scenarios

Users can tune the SLA parameters of their own clusters according to business needs. For example, in clusters with strict real-time requirements, JobWaitingTime can be set as small as possible. In clusters dominated by batch computing jobs, JobWaitingTime can be set to a larger value. The concrete SLA values and their tuning should be chosen based on the actual workloads and on performance evaluation results.

4. Implementation principle

1. The sla plugin provides the sla-waiting-time argument to implement job resource reservation. sla-waiting-time is the maximum time a job may stay in the Pending or inqueue state without being allocated. Once sla-waiting-time has elapsed, the sla plugin makes the job inqueue immediately in the enqueue action. In the allocate action, the plugin then locks the idle resources that have been pre-allocated to the job's Pods, even if the job is not yet Ready. In this way, the sla plugin implements the election of large jobs and resource reservation for them, replacing the elect and reserve actions of v1.1.0.

2. sla-waiting-time can be set for a single job or for all jobs in the cluster.

For a single job, set it in the job's annotations in the following format:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    sla-waiting-time: 1h2m3s

For all jobs, set the sla-waiting-time field in the arguments of the sla plugin in volcano-scheduler-configmap, in the following format:

  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang
    - name: sla
      arguments:
        sla-waiting-time: 1h2m3s
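
The two configuration levels can be pictured with a short Python sketch. It is an illustration only, not Volcano's Go code: the helper names parse_duration and effective_waiting_time are hypothetical, the parser handles only the h, m and s units that appear in the examples here, and the rule that a job-level annotation takes precedence over the cluster-wide argument is an assumption that this article does not spell out.

```python
import re
from datetime import timedelta
from typing import Optional

# Seconds per unit; only the units seen in "1h2m3s" are supported in this sketch.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_duration(text: str) -> timedelta:
    """Parse a duration such as '1h2m3s' into a timedelta (hypothetical helper)."""
    pos = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            # Something other than <number><unit> sits between two tokens.
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        # Empty input, no tokens at all, or trailing characters.
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total)


def effective_waiting_time(
    annotations: dict, cluster_default: Optional[str]
) -> Optional[timedelta]:
    """Return the sla-waiting-time that applies to one job, or None if unset.

    Assumption: the job annotation, when present, overrides the plugin argument.
    """
    raw = annotations.get("sla-waiting-time", cluster_default)
    return parse_duration(raw) if raw is not None else None
```

Under these assumptions, a job annotated with sla-waiting-time: 30m in a cluster configured with 1h2m3s would get a 30-minute limit, while a job without the annotation would fall back to the cluster-wide 1h2m3s (3723 seconds).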

3. The sla plugin registers three callback functions: JobEnqueueableFn, JobPipelinedFn and JobOrderFn:

(1) JobEnqueueableFn returns Permit when a job in the Pending state has waited longer than sla-waiting-time. The enqueue action then moves the job to inqueue immediately, even if other plugins return Reject or Abstain for that transition.

(2) JobPipelinedFn returns Permit when a job in the inqueue state has waited longer than sla-waiting-time. The job then becomes Pipelined immediately, even if other plugins return Reject or Abstain for that transition. As a result, the allocate action reserves resources for the job's pods even though the job is not yet Ready.

(3) JobOrderFn adjusts the position of a job in the waiting queue used by the enqueue and allocate actions. The closer a job's waiting time is to its sla-waiting-time, the higher it scores in the sla plugin's JobOrderFn and the more likely it is to sit near the front of the priority queue, so it gets earlier access to idle resources and is moved to inqueue and allocated with higher priority.
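
The behaviour of the three callbacks can be pictured with the following Python sketch. It is a simplified model rather than Volcano's implementation: the Job class, the Vote enum and the function names are hypothetical, only the sla plugin's own vote is modelled (how the scheduler merges votes from several plugins is left out), and ordering jobs by the time remaining until their deadline is one reading of "closer to sla-waiting-time means nearer the front of the queue".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Vote(Enum):
    PERMIT = 1    # force the transition, whatever other plugins say
    ABSTAIN = 0   # no opinion; leave the decision to other plugins
    REJECT = -1   # listed for completeness; never returned by this sketch


@dataclass
class Job:
    name: str
    created: datetime                  # when the job started waiting
    waiting_time: Optional[timedelta]  # sla-waiting-time, None if not set


def sla_vote(job: Job, now: datetime) -> Vote:
    """Rule shared by JobEnqueueableFn and JobPipelinedFn in this sketch."""
    if job.waiting_time is None:
        return Vote.ABSTAIN
    waited = now - job.created
    return Vote.PERMIT if waited > job.waiting_time else Vote.ABSTAIN


def order_key(job: Job, now: datetime) -> timedelta:
    """Sort key for JobOrderFn: less time left before the deadline sorts first."""
    if job.waiting_time is None:
        return timedelta.max  # jobs without an SLA go last
    return job.created + job.waiting_time - now


now = datetime(2024, 1, 1, 12, 0)
jobs = [
    Job("batch", now - timedelta(minutes=10), timedelta(hours=1)),
    Job("urgent", now - timedelta(minutes=50), timedelta(hours=1)),
    Job("overdue", now - timedelta(hours=2), timedelta(hours=1)),
    Job("no-sla", now - timedelta(hours=3), None),
]
queue = sorted(jobs, key=lambda j: order_key(j, now))
print([j.name for j in queue])  # ['overdue', 'urgent', 'batch', 'no-sla']
```

In this model only the "overdue" job receives Permit, because it alone has waited longer than its one-hour limit; the other jobs are left to the remaining plugins, but the ones with less time remaining are still ordered ahead of the rest.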

5. Execution flowchart of the sla plugin

[Figure: execution flowchart of the sla plugin]

6. Reference materials

Actions | Volcano

Plugins | Volcano

volcano/sla-plugin.md at master · volcano-sh/volcano · GitHub

Origin blog.csdn.net/lovebaby1689/article/details/126831505