Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing

Summary

Efficient scheduling of data-parallel jobs on cloud-scale clusters is critical to job performance, system throughput, and resource utilization, and it becomes increasingly challenging as cluster sizes grow and workloads become more complex and diverse. This article introduces Apollo, a highly scalable, coordinated scheduling framework that has been deployed on Microsoft's production clusters, where it efficiently schedules thousands of jobs (millions of tasks) every day across tens of thousands of machines. Apollo makes scheduling decisions in a distributed manner, with each scheduler using global cluster information through a loosely coordinated mechanism. Each scheduling decision takes future resource availability into account and optimizes a variety of performance and system factors together in a single unified model. Apollo is robust in the face of unexpected system dynamics, and it gracefully uses idle system resources while still providing guaranteed resources when needed.

1 Introduction

Systems like MapReduce make data-parallel computation easy to program and allow jobs to process terabytes of data on large clusters of commodity hardware. Each data-processing job consists of many tasks with dependencies between them that describe the execution order. A task is the basic unit of computation that is dispatched to a server for execution.

Efficient scheduling (tracking task dependencies and assigning tasks to servers for execution once they are ready) is critical to overall system performance and service quality. The growing popularity and diversity of data-parallel computing make scheduling increasingly difficult. For example, the production clusters used for data-parallel computing keep expanding, with individual clusters exceeding 20,000 servers. Thousands of users from many different organizations submit jobs to a cluster every day, producing peak rates of thousands of scheduling requests per second. The submitted jobs are diverse in nature, with widely varying characteristics in data volume, complexity of computation logic, degree of parallelism, and resource requirements. The scheduler must (i) scale to clusters of tens of thousands of servers and make tens of thousands of scheduling decisions per second; (ii) maintain fair sharing of resources among different users and groups; and (iii) make high-quality scheduling decisions that take into account factors such as data locality, job characteristics, and server load, so as to minimize job latency while fully utilizing the resources in the cluster. This article introduces the Apollo scheduling framework, which has been fully deployed to schedule jobs in Microsoft's cloud-scale production clusters serving a variety of online services. Apollo efficiently schedules billions of tasks every day and meets the scheduling challenges of large clusters with the following approaches.

  • To strike a balance between scalability and scheduling quality, Apollo adopts a distributed and (loosely) coordinated scheduling framework, in which independent schedulers make their own decisions optimistically while incorporating aggregated cluster utilization information. This design achieves an appropriate balance: it avoids the suboptimal (and often conflicting) decisions made by independent schedulers in a fully distributed architecture, while eliminating the scalability bottleneck and single point of failure of a centralized design.
  • To make high-quality scheduling decisions, Apollo schedules each task on the server that minimizes its estimated task completion time. The estimation model incorporates a variety of factors and allows the scheduler to make weighted trade-offs rather than considering only data locality or server load. The data parallelism of the computation further allows Apollo to continuously refine its task execution time estimates from the runtime statistics of similar tasks observed during job execution.
  • To supply cluster information to the schedulers, Apollo introduces a lightweight, hardware-independent mechanism for advertising load on servers. Combined with a local task queue on each server, this mechanism provides a view of future resource availability on every server, which the schedulers use in their decisions.
  • To cope with unexpected cluster dynamics, suboptimal estimates, and other abnormal runtime behavior (all of which are the norm in large clusters), Apollo is made robust through a series of correction mechanisms that dynamically adjust and rectify suboptimal decisions at runtime. Apollo uses a unique deferred-correction mechanism that resolves only those conflicts between schedulers that have significant impact, and this approach works very well in practice.
  • To improve cluster utilization while maintaining low job latency, Apollo introduces opportunistic scheduling, which creates two classes of tasks: regular tasks and opportunistic tasks. Apollo ensures low latency for regular tasks while using opportunistic tasks to fill the gaps left by regular tasks and raise utilization. Apollo also uses a token-based mechanism to manage capacity and avoid system overload by limiting the total number of regular tasks.
  • To avoid service interruption or performance degradation when Apollo replaced the previous scheduler deployed in production, Apollo supports staged deployment to production clusters and validation at scale. These constraints have received little attention in research, but they are crucial in practice, and this article shares the experience of achieving these demanding goals.

Observations from production show that Apollo schedules more than 20,000 tasks per second in a production cluster of over 20,000 machines. It also achieves high scheduling quality: 95% of regular tasks experience a queuing delay of under 1 second, while the cluster maintains a consistently high (over 80%) and balanced CPU utilization.

2 Scheduling at production scale

Apollo is the underlying scheduling framework of Microsoft's distributed computing platform, which supports large-scale data analysis for a wide range of business needs. A typical cluster contains tens of thousands of servers interconnected by a network. A distributed file system stores data in partitions with replicas spread across servers, similar to GFS and HDFS. All computation jobs are written in SCOPE, a SQL-like high-level scripting language extended with user-defined processing logic. An optimizer converts each job into a directed acyclic graph (DAG) execution plan, in which vertices are tasks and edges are data flows between tasks; each task represents a basic computation unit. Tasks that perform the same computation on different partitions of the same input are logically grouped together into stages, and the number of tasks in a stage is its degree of parallelism (DOP).


Figure 1: An example SCOPE execution graph

Figure 1 shows an example SCOPE execution graph, greatly simplified from an important production job that collects user click information and derives insights into advertising effectiveness. Conceptually, the job performs a join between an unstructured user log and a structured input that is pre-partitioned by the join key. The plan first partitions the unstructured input according to the partitioning scheme of the other input: stages S1 and S2 respectively partition the data and aggregate each partition. The partitioned join is then performed in stage S4. Based on the amount of input data, the DOP is set to 312 for S1, 10 for S5, and 150 for S2, S3, and S4.

2.1 Capacity management and token

To ensure fairness and predictable performance, the system uses a token-based mechanism to allocate capacity to jobs. Each token represents the right to execute one regular task, consuming up to a predefined amount of CPU and memory, on a machine in the cluster. For example, if a job is allocated 100 tokens, it can run up to 100 tasks at the same time, each consuming at most the predefined maximum CPU and memory.

For security and resource sharing, a virtual cluster is created for each user group. Each virtual cluster is allocated a certain amount of capacity in terms of tokens and maintains a queue of all submitted jobs. A job submission specifies the target virtual cluster, the necessary credentials, and the number of tokens required for execution. Each virtual cluster applies various admission-control policies and decides how and when to assign its tokens to the submitted jobs; jobs that have not yet obtained the required tokens wait in the virtual cluster's queue. The system also supports capabilities such as job priorities, suspension, upgrade, and cancellation.

Once a job obtains the required tokens and starts to execute, its scheduler runs the optimized execution plan: it assigns tasks to servers while respecting the token allocation, enforces task dependencies, and provides fault tolerance.

2.2 The essence of job scheduling

Scheduling a job involves the following functions (a minimal loop illustrating these steps is sketched after the list):

    1. Ready list: maintain a list of tasks that are ready to be scheduled; initially, the list contains the leaf tasks that operate on the original input (for example, the tasks in stages S1 and S3 in Figure 1);
    2. Task priority: sort the ready list appropriately;
    3. Capacity management: manage the capacity allocated to the job and decide when to schedule a task according to the capacity-management policy;
    4. Task scheduling: decide where to schedule each task and dispatch it to the selected server;
    5. Failure recovery: monitor scheduled tasks, start recovery actions when a task fails, and mark the job as failed if recovery is not possible;
    6. Task completion: when a task completes, check the tasks that depend on it in the execution graph; if all of a task's dependencies have completed, move it to the ready list;
    7. Job completion: repeat the whole process until all tasks in the job have completed.
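
As a concrete illustration of this loop, the following is a minimal, self-contained sketch in Python. It simulates only the ready list, priority ordering, task completion, and job completion (steps 1, 2, 6, and 7); capacity management, server selection, and failure recovery are reduced to placeholder comments. The function and names are hypothetical and not part of Apollo.

```python
# Minimal, self-contained sketch (hypothetical names, not Apollo code).
def run_job(deps, priority):
    """deps: task -> set of prerequisite tasks; priority: task -> static priority."""
    remaining = {t: set(p) for t, p in deps.items()}
    done = set()
    while len(done) < len(deps):                                # 7. repeat until all tasks finish
        ready = [t for t, p in remaining.items()
                 if not p and t not in done]                    # 1. ready list
        ready.sort(key=priority.get, reverse=True)              # 2. order by priority
        for task in ready:
            # 3./4. a real scheduler would check capacity and pick a server here
            print(f"dispatching {task}")
            done.add(task)                                      # 6. assume the task completes
            for prereqs in remaining.values():                  #    (5. failure recovery omitted)
                prereqs.discard(task)                           #    unblock dependent tasks
    print("job complete")

# Example DAG in the spirit of Figure 1: S1 and S3 are leaves, S2 depends on S1,
# and S4 depends on S2 and S3.
run_job({"S1": set(), "S3": set(), "S2": {"S1"}, "S4": {"S2", "S3"}},
        {"S1": 3, "S2": 2, "S3": 1, "S4": 0})
```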

2.3 Production workload characteristics

Production workload characteristics heavily influenced Apollo's design. The production computing clusters run more than 100,000 jobs per day, and at any point in time hundreds of jobs are running concurrently. To serve a wide variety of business scenarios and requirements, these jobs differ in almost every dimension. For example, a large job may process terabytes to petabytes of data, contain complex business logic with dozens of joins, aggregations, and user-defined functions, have hundreds of stages and more than a million tasks in its execution plan, and take several hours to complete. A small job, by contrast, may process only gigabytes of data and finish within seconds. In SCOPE, different jobs are also allocated different amounts of resources. As business needs change over time, the workloads evolve as well. This diversity poses a huge challenge to the underlying scheduling framework, which has to handle it effectively. The rest of this section describes several job characteristics observed in the production environment to illustrate the diversity and dynamics of the computing workloads.


Figure 2: Heterogeneous workloads

In SCOPE, the DOP of each stage is chosen based on the amount of data to be processed and the complexity of each computation. Even within a single job, the DOP changes across stages as the data volume changes over the course of the job. Figure 2(a) shows the distribution of stage DOP in the production environment: it ranges from one to tens of thousands. Almost 40% of the stages have a DOP below 100, accounting for less than 2% of the total workload, while more than 98% of tasks belong to stages whose DOP exceeds 100. The large scale of these stages allows the scheduler to extract statistical information from some tasks to infer the behavior of other tasks in the same stage, and Apollo uses this to make informed decisions. Job sizes vary greatly, from a single vertex to millions of vertices per job graph. As shown in Figure 2(b), the amount of data processed per job ranges from gigabytes to tens of petabytes. Task execution times vary from under 100 milliseconds to several hours, as shown in Figure 2(c); 50% of tasks run for less than 10 seconds and are therefore very sensitive to scheduling latency. Some tasks require external files (such as executables, configuration, and lookup tables) to execute, which incurs an initialization cost. In some cases these external files are larger than the actual input to be processed, which means task placement should be based on where these files are cached rather than where the input is located. Overall, this large number of jobs creates a very high scheduling-request rate, with peaks above 100,000 requests per second, as shown in Figure 2(d).

The dynamic and diverse characteristics of the production computing workloads and the cluster environment impose several challenges on the scheduling framework, including scalability, efficiency, robustness, and balanced resource usage. The Apollo scheduling framework is designed to address these challenges on Microsoft's large production clusters.

3 Apollo framework

To support the scale and scheduling rate required by production workloads, Apollo adopts a distributed and coordinated architecture, in which the scheduling of each job is performed independently while incorporating aggregated global cluster load information.

3.1 Architecture overview


Figure 3: Apollo architecture diagram

Figure 3 shows Apollo's architecture. A Job Manager (JM), also called a scheduler, manages the life cycle of each job. The global cluster load information used by each JM is provided through the cooperation of two other modules in the Apollo framework: a ResourceMonitor (RM) for each cluster and a ProcessNode (PN) on each server. The PN process running on each server is responsible for managing the local resources on that server and performing local scheduling, while the RM continuously aggregates load information from the PNs across the entire cluster, providing each JM with a global view of the cluster status so that it can make informed scheduling decisions.

Although the RM is regarded as a single logical module, it can be implemented in different configurations with different mechanisms, because it essentially solves the problem of monitoring the dynamically changing state of a large collection of distributed resources. For example, it could use a tree hierarchy or a directory service with an eventually consistent protocol; Apollo's architecture can accommodate any such configuration. In this system, the RM is implemented with a Paxos-based master-slave configuration. The RM is never on the critical path of performance: even if it is temporarily unavailable, for example during a transient master-slave switchover caused by a machine failure, Apollo can continue making scheduling decisions (at reduced quality). In addition, once a task is scheduled to a PN, the JM obtains up-to-date load information directly from the PN's frequent status updates.

To better predict future resource utilization and optimize scheduling quality, each PN maintains a local queue of the tasks assigned to that server, infers future resource availability from the queue, and advertises it in the form of a waiting time matrix. Apollo then uses an estimation-based approach to make task scheduling decisions: specifically, it consults the waiting time matrices aggregated by the RM together with the characteristics of the task to be scheduled, such as the location of its inputs. However, cluster dynamics pose many challenges in practice: the waiting time matrices may be stale, estimates may be suboptimal, and the cluster environment can at times be unpredictable. Apollo therefore introduces correction mechanisms that improve robustness and dynamically adjust scheduling decisions at runtime. Finally, providing guaranteed resources to jobs (for example, to ensure SLAs) and achieving high cluster utilization are difficult to balance simultaneously, because cluster load and job resource demands fluctuate constantly. Apollo addresses this through opportunistic scheduling, which creates a second class of tasks to use idle resources.

3.2 PN queue and waiting time matrix

The PN on each server manages a queue of the tasks assigned to that server and uses it to provide projections of future resource availability. When a JM schedules a task on a server, it sends a task creation request containing (i) fine-grained resource requirements (CPU cores and memory), (ii) an estimated runtime, and (iii) a list of files required to run the task (such as executables and configuration files). After receiving the task creation request, the PN copies the required files to a local directory. The PN monitors CPU and memory usage, considers the resource requirements of the queued tasks, and executes them when capacity becomes available. It maximizes resource utilization by running as many tasks as possible, subject to the CPU and memory requirements of individual tasks. The PN queue is mostly FIFO, but it can be reordered; for example, a later task requiring fewer resources can fill a gap without affecting the expected start times of other tasks.
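
The sketch below illustrates the kind of information carried by a task creation request and how a PN might admit queued tasks when capacity becomes available, including a later small task filling a gap. The field and method names are hypothetical, and the gap-filling check is simplified (it does not verify that the expected start times of other queued tasks are unaffected).

```python
from dataclasses import dataclass, field

@dataclass
class TaskCreationRequest:                       # hypothetical field names
    task_id: str
    cpu_cores: int                               # (i) fine-grained resource requirements
    memory_gb: int
    est_runtime_sec: float                       # (ii) estimated running time
    required_files: list = field(default_factory=list)  # (iii) executables, configs, lookup tables

@dataclass
class ProcessNode:
    total_cores: int
    total_memory_gb: int
    queue: list = field(default_factory=list)    # mostly FIFO queue of pending requests
    running: list = field(default_factory=list)

    def enqueue(self, req):
        # The real PN would also start copying req.required_files into its local
        # cache here, before resources become available.
        self.queue.append(req)

    def start_ready_tasks(self):
        free_cores = self.total_cores - sum(t.cpu_cores for t in self.running)
        free_mem = self.total_memory_gb - sum(t.memory_gb for t in self.running)
        for req in list(self.queue):             # mostly FIFO; a later small task may
            if req.cpu_cores <= free_cores and req.memory_gb <= free_mem:   # fill a gap
                self.queue.remove(req)
                self.running.append(req)
                free_cores -= req.cpu_cores
                free_mem -= req.memory_gb
```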

Task queues allow the scheduler to dispatch tasks to PNs proactively, based on future resource availability rather than on instantaneous availability. Apollo considers a task's wait time (until sufficient resources become available) together with other task characteristics to optimize task scheduling. Task queues also make it possible to copy required files before resources become available, masking task initialization cost and avoiding idle gaps between tasks. This direct-dispatch mechanism provides the efficiency needed by tasks (especially small ones), for which any negotiation protocol would introduce significant overhead.

The PN also provides feedback to the JM to help improve the accuracy of task runtime estimates. Initially, the JM uses conservative estimates provided by the query optimizer, based on the operators and the data volumes involved in a task. Tasks in the same stage perform the same computation on different datasets and tend to have similar runtime characteristics, so statistics from the execution of earlier tasks help refine the runtime estimates of later tasks. Once a task starts running, the PN monitors its overall resource usage and reports status updates to the JM, including memory usage, CPU time, execution time, and I/O throughput. The JM then uses this information, together with other factors such as operator characteristics and input size, to refine resource usage estimates and predict the expected runtime of tasks from the same stage.

The PN also exposes the current load on its server, which is aggregated by the RM. Ideally, this load representation should convey a projection of future resource availability, mask the heterogeneity of servers in the data center (for example, servers with 64GB and 128GB of memory have different capacities), and be compact enough for frequent updates. Apollo's solution is a waiting time matrix, in which each cell contains the expected wait time for a task requiring a certain amount of CPU and memory. Figure 3 contains an example of such a matrix: the cell <12GB, 4 cores> has the value 10, meaning that a task requiring 4 CPU cores and 12GB of memory would have to wait about 10 seconds on this PN before receiving its resource quota. The PN maintains this matrix of expected wait times for tasks with various resource quotas based on the currently running and queued tasks: the algorithm simulates local task execution and estimates how long a future task with a given CPU/memory requirement would wait before it could run on that PN. The PN updates the matrix frequently using the actual resource situation and the latest task runtime and resource estimates. Finally, the PN sends the matrix, together with a timestamp, to every JM that has tasks running or queued on the PN, and it also sends the matrix to the RM via a heartbeat mechanism.
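
To make the idea concrete, here is a minimal sketch of how one cell of the waiting time matrix could be derived by simulating the tasks on a PN. It assumes that every running and queued task holds its resources until its projected finish time (which tends to overestimate wait on busy servers) and that the server is large enough for the requested quota; this is an illustration of the idea, not Apollo's actual algorithm.

```python
def expected_wait(tasks, total_cores, total_mem, need_cores, need_mem):
    """tasks: list of (cores, mem, projected_finish_seconds_from_now) for the
    tasks currently running or queued on this PN."""
    used_cores = sum(c for c, _, _ in tasks)
    used_mem = sum(m for _, m, _ in tasks)
    wait = 0.0
    for cores, mem, finish in sorted(tasks, key=lambda t: t[2]):
        if (total_cores - used_cores >= need_cores
                and total_mem - used_mem >= need_mem):
            break                       # enough headroom: the new task could start at `wait`
        wait = finish                   # otherwise wait for the next task to finish
        used_cores -= cores
        used_mem -= mem
    return wait

# Example matching the Figure 3 cell: one task holding 4 cores / 12GB for another
# 10 seconds on a 4-core, 16GB server; a new <4 cores, 12GB> task waits ~10s.
print(expected_wait([(4, 12, 10.0)], total_cores=4, total_mem=16,
                    need_cores=4, need_mem=12))   # -> 10.0
```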

3.3 Estimate-based scheduling

To decide which server a task should be scheduled to, a JM uses the aggregated view of waiting time matrices provided by the RM together with the characteristics of the task to be scheduled. Apollo's estimation-based approach considers, in a single unified model, the various (often conflicting) factors that affect the quality of a scheduling decision.


Figure 4: Example of task scheduling

An example illustrates the importance of considering multiple factors simultaneously, as well as the benefit of having a local queue on each server. Figure 4(a) shows a simplified cluster with two racks of four servers each, connected through a hierarchical network. Suppose data can be read from a local disk at 160MB/s, from within the same rack at 100MB/s, and from another rack at 80MB/s. Consider scheduling an I/O-intensive task with two inputs: a 100MB input stored on server A and a 5GB input stored on server C. Figure 4(b) shows four scheduling choices: servers A and B are immediately available, and server C has the best data locality, but D is actually the best of the four choices. This can only be recognized when data locality and wait time are considered together. The example also illustrates the value of local queues: without a local queue on each server, any scheduling mechanism that considers only instantaneous resource availability would pick a suboptimal choice such as A or B.
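
The following sketch works through the arithmetic of this example. The bandwidths are the ones given above; the per-server wait times are assumed values chosen only for illustration, since the figure's actual numbers are not reproduced in the text.

```python
LOCAL, SAME_RACK, CROSS_RACK = 160.0, 100.0, 80.0      # MB/s, from the text

def io_time(server, rack_of, inputs):
    """inputs: list of (size_mb, server holding that input)."""
    total = 0.0
    for size_mb, holder in inputs:
        if holder == server:
            bw = LOCAL
        elif rack_of[holder] == rack_of[server]:
            bw = SAME_RACK
        else:
            bw = CROSS_RACK
        total += size_mb / bw
    return total

rack_of = {"A": 1, "B": 1, "C": 2, "D": 2}              # A,B in one rack; C,D in the other
inputs = [(100, "A"), (5000, "C")]                      # the task's two inputs
assumed_wait = {"A": 0, "B": 0, "C": 40, "D": 5}        # seconds; illustrative assumption

for s in "ABCD":
    print(s, assumed_wait[s] + io_time(s, rack_of, inputs))
# With these assumed waits: A ≈ 63.1s, B ≈ 63.5s, C ≈ 72.5s, D ≈ 56.3s,
# so D wins even though A and B are idle and C has the best data locality.
```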

Therefore, Apollo takes all of these factors into account and performs scheduling by estimating task completion time. First, the completion time of a task assuming no failure, denoted Esucc, is estimated as Esucc = I + W + R, where the three terms are defined as follows.

I is the initialization time spent fetching the files required by the task, which may be zero if they are already cached locally. The expected wait time, denoted W, is obtained by looking up the waiting time matrix of the target server. The task runtime, denoted R, consists of I/O time and CPU time; the I/O time is computed as the input size divided by the expected I/O throughput, where the I/O may come from local memory, local disk, or the network, each with a different bandwidth. The estimate of R initially incorporates information from the optimizer and is then refined using runtime statistics of tasks in the same stage.

Second, the possibility of task failure is taken into account to compute the final completion time estimate, denoted C. In a real large-scale environment, hardware failures, maintenance, repairs, and software deployments are inevitable; to account for their impact, the RM also gathers information on upcoming and past maintenance schedules for each server. Overall, the probability of success, Psucc, is used to compute C as C = Psucc × Esucc + Kfail × (1 − Psucc) × Esucc, where the empirically determined penalty constant Kfail models the cost of a server failure on the task completion time.
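
As a small illustration of these two formulas, the sketch below computes C from I, W, R, and Psucc. The value of Kfail is arbitrary here; in Apollo it is determined empirically.

```python
K_FAIL = 10.0   # empirically determined penalty constant; this value is illustrative only

def estimate_completion(init_time, wait_time, runtime, p_succ):
    e_succ = init_time + wait_time + runtime                     # Esucc = I + W + R
    return p_succ * e_succ + K_FAIL * (1.0 - p_succ) * e_succ    # C

# A server with a 5% failure risk looks noticeably worse than a healthy one:
print(estimate_completion(0.0, 5.0, 50.0, 1.00))   # 55.0
print(estimate_completion(0.0, 5.0, 50.0, 0.95))   # 79.75
```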

Task priority. Besides the completion time estimate, the order in which tasks are executed also has a large effect on overall job latency. For example, in the job graph of Figure 1, tasks in S1 run for 1 minute on average and tasks in S2 run for 2 minutes on average, with potential partition skew pushing some of them up to 10 minutes, while tasks in S3 run for 30 seconds on average. Executing S1 and S2 efficiently is therefore critical to achieving the shortest runtime, so the scheduler should allocate resources to S1 and S2 before considering S3. Within S2, the scheduler should start the vertices with the largest inputs as early as possible, because they are the most likely to be on the critical path of the job.

The optimizer marks a static priority for the tasks of each stage by analyzing the job DAG and calculating the potential critical paths of the job's execution; within a stage, tasks are prioritized by input size. Apollo schedules tasks and allocates resources in descending order of priority. Since a job contains a finite number of tasks, tasks with low static priority cannot starve: eventually they become the only remaining tasks to execute and are run sooner or later.

Stable matching. For efficiency, Apollo schedules tasks with similar priorities in batches, turning the task-scheduling problem into a matching problem between tasks and servers. For each task, Apollo could search all servers in the cluster for the best match, but on large clusters this becomes too expensive. Instead, Apollo limits the search space for a task to a set of candidate servers, including (i) the set of servers on which a significant amount of its input is located, (ii) the servers in the same racks as those in the first set, and (iii) two servers chosen at random from a set of lightly loaded servers.


Figure 5: Matching example

A greedy algorithm could be applied to each task in turn, choosing at each step the server with the earliest estimated completion time. However, the result of the greedy algorithm is sensitive to the order in which tasks are matched and often leads to suboptimal decisions. Figure 5 shows an example in which three tasks are scheduled. Assume that Task1 and Task2 both read data from server A, and Task3 reads data from server B, as shown by the dotted lines, and that each server has capacity to start one task. The greedy matcher first matches Task1 with server A, then matches Task2 with server B because Task1 has already been scheduled on A, and finally matches Task3 with server C, as shown by the solid lines. A better match would assign Task3 to server B to exploit its data locality.

Apollo therefore uses a variant of the stable matching algorithm to match tasks with servers. For each task in a batch, Apollo finds the server with the earliest estimated completion time and makes that the task's proposal. A server that receives only one proposal accepts it. When multiple tasks propose to the same server, a conflict arises; in this case the server picks the task that would save the most completion time. The tasks not selected withdraw their proposals and enter the next iteration, which tries to match the remaining tasks and servers. The algorithm iterates until all tasks have been assigned or the maximum number of iterations is reached. As Figure 5 shows, the stable matcher matches Task2 to C and Task3 to B, which makes effective use of data locality and improves job performance.
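
A compact sketch of this proposal/acceptance process is shown below. For simplicity, a server that receives several proposals keeps the task with the lowest estimated completion time (an approximation of "the task that saves the most time"), and at most one task is matched per server, as in Apollo's batching. This is an illustration, not Apollo's actual code.

```python
def stable_match(est, max_iters=10):
    """est[task][server] = estimated completion time; at most one task per server."""
    assignment, taken = {}, set()
    for _ in range(max_iters):
        unmatched = [t for t in est if t not in assignment]
        if not unmatched:
            break
        proposals = {}                                   # server -> proposing tasks
        for t in unmatched:
            options = {s: c for s, c in est[t].items() if s not in taken}
            if options:
                proposals.setdefault(min(options, key=options.get), []).append(t)
        for server, tasks in proposals.items():
            winner = min(tasks, key=lambda t: est[t][server])   # simplified acceptance rule
            assignment[winner] = server
            taken.add(server)
    return assignment

# Figure 5-style example: Task1 and Task2 are fastest on A, Task3 on B.
est = {"Task1": {"A": 10, "B": 20, "C": 30},
       "Task2": {"A": 12, "B": 25, "C": 20},
       "Task3": {"A": 25, "B": 11, "C": 22}}
print(stable_match(est))   # {'Task1': 'A', 'Task3': 'B', 'Task2': 'C'}
```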

The scheduler then sorts all matched pairs by their quality to decide the dispatch order; a match is considered higher quality if its task has a lower wait time on the matched server. The scheduler walks through the sorted matches and dispatches them in order until its allocated capacity is exhausted. If opportunistic scheduling is enabled, the scheduler keeps dispatching tasks until the opportunistic-scheduling limit is reached.

To simplify the matching algorithm and trade off efficiency against quality, Apollo assigns at most one task to each server within a batch. Otherwise, the server's waiting time matrix would need to be updated to account for the newly assigned task, which adds to the complexity of the algorithm (each match assumes that no new tasks have been assigned to the server). This simplification may lead to suboptimal matches, as it is effectively another variant of a greedy algorithm. Apollo mitigates the impact in two ways: if a suboptimal match is of low quality, sorting matches by quality causes the dispatch of that task to be deferred and then re-evaluated; and even if a suboptimal match is dispatched, the correction mechanisms described in Section 3.4 catch the situation and reschedule the task when necessary.

3.4 Correction mechanism

In Apollo, each JM schedules tasks independently and at high frequency, without coordination delay; this is critical for scheduling workloads with a large number of small tasks. However, due to the distributed nature of the scheduling, several JMs may make competing decisions at the same time. In addition, the information used for scheduling decisions (such as the waiting time matrices) may be stale, and task wait times and runtimes may be under- or over-estimated. Apollo has built-in mechanisms to address these challenges and dynamically adjusts scheduling decisions using the latest status information.

Unlike previous proposals (such as Omega) that resolve conflicts immediately at scheduling time, Apollo optimistically dispatches tasks to the PN queues first and corrects them afterwards (deferred correction). This design choice is based on extensive observation that conflicts are not always harmful. If different job managers simultaneously schedule tasks onto the same server that has sufficient resources, both tasks can run in parallel; or the tasks previously scheduled on that server may complete soon and release resources early, so no conflict needs to be resolved at all. In such cases, the local queues make a deferred-correction mechanism possible and avoid the unnecessary overhead of eagerly detecting and resolving conflicts. The correction mechanisms continuously re-evaluate scheduling decisions with up-to-date information and make appropriate adjustments when necessary.

Duplicate scheduling. When a JM receives fresh information from a PN during task creation or upgrade, or while monitoring its queued tasks and the wait time elapsed so far, it compares this information against the information on which the original scheduling decision was based. The scheduler re-evaluates the decision if:

(i) the updated expected wait time is significantly higher than the originally expected wait time;

(ii) the expected wait time is greater than the average among tasks in the same stage;

(iii) the elapsed wait time is already greater than the average.

The first case indicates that the task completion time on that server was underestimated, while the second and third cases indicate that the match between the task and the server is of low quality. Re-evaluating the decision triggers scheduling a duplicate of the task on a newly chosen server; once one copy of the task starts, the other copies are discarded.
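
The three conditions can be summarized in a small predicate like the one below; the parameter names and the threshold for "significantly higher" are illustrative assumptions, not Apollo's actual values.

```python
def should_reschedule(new_expected_wait, original_expected_wait,
                      elapsed_wait, stage_average_wait,
                      significant_factor=2.0):          # threshold is an assumed value
    return (
        new_expected_wait > significant_factor * original_expected_wait   # (i)
        or new_expected_wait > stage_average_wait                         # (ii)
        or elapsed_wait > stage_average_wait                              # (iii)
    )

# If this returns True, the JM schedules a duplicate on a newly chosen server
# and discards the losing copies once one of them starts.
```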

Randomization. Multiple JMs may schedule tasks to the same lightly loaded PN without knowing about each other, resulting in scheduling conflicts. Apollo adds a small random number to each estimated completion time, which reduces the chance of different JMs choosing the same server. The number is usually proportional to the communication interval between the JM and the PN and has no significant impact on the quality of scheduling decisions.

Confidence. The aggregated cluster information obtained from the RM contains waiting time matrices of different ages, some of which may be stale. The scheduler attaches a lower confidence to older matrices, because the wait times are likely to have changed since the matrix was computed. When the confidence in a waiting time matrix is low, the scheduler produces a pessimistic estimate by looking up the wait time of a task that consumes more CPU and memory.

Straggler detection. Stragglers are tasks that make progress much more slowly than their peers and can severely affect job performance. Apollo's straggler-detection mechanism monitors the rate at which data is processed and the rate at which CPU is consumed to predict the remaining time of each task; other tasks in the same stage serve as the baseline for comparison. When the estimated time to rerun a task is significantly less than the time needed for it to finish, a duplicate copy is started. The copies execute in parallel until the first one finishes, or until the duplicate catches up with the original. The scheduler also monitors the I/O rate to detect stragglers caused by slow intermediate inputs: when a task is slowed down by abnormal I/O latencies, it can rerun a copy of the upstream task to provide an alternative I/O path.
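
A simplified version of this check might look as follows: estimate the remaining time from the observed processing rate, use the stage average as the expected cost of a rerun, and start a duplicate when the rerun is much cheaper. The 0.5 threshold is an assumed value for illustration only.

```python
def is_straggler(bytes_total, bytes_processed, elapsed_sec, stage_avg_runtime_sec,
                 threshold=0.5):                          # assumed threshold
    rate = bytes_processed / max(elapsed_sec, 1e-9)       # observed data-processing rate
    remaining = (bytes_total - bytes_processed) / max(rate, 1e-9)
    rerun_estimate = stage_avg_runtime_sec                # peers as the baseline for a rerun
    return rerun_estimate < threshold * remaining

# Example: 10% progress after 5 minutes while peers finish in about 2 minutes.
print(is_straggler(1000, 100, 300, 120))   # True -> start a duplicate copy
```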

3.5 Opportunistic scheduling

Besides achieving high-quality scheduling at scale, Apollo is also designed to improve cluster utilization. Cluster utilization fluctuates over time for several reasons. First, not all users submit jobs at the same time or fully consume their allocated capacity; a typical example is that cluster load on weekdays is consistently higher than on weekends. Second, resource demands differ even between runs of the same job: recurring jobs with identical computation logic consume different amounts of resources as the size of their input data changes. Finally, a complete job usually goes through multiple stages with different levels of parallelism and varying resource requirements. These load fluctuations give the scheduler an opportunity to improve job performance by raising utilization, but at the cost of reduced predictability. How to make wise use of occasionally idle computing resources without compromising SLAs remains challenging.

Apollo introduces opportunistic scheduling to make good use of idle resources whenever they are available. Tasks can execute either in regular mode, with sufficient tokens to cover their resource consumption, or in opportunistic mode, without any allocated resources. Each scheduler first uses its tokens to schedule regular tasks; if all tokens are in use and there are still tasks to schedule, opportunistic scheduling can be used to dispatch opportunistic tasks. Running opportunistic tasks at lower priority on each server prevents them from degrading the performance of regular tasks, and any opportunistic task can be preempted or terminated if the server comes under high load.

An immediate challenge is to prevent one job from unfairly consuming all the idle resources. For opportunistic tasks, Apollo uses randomized allocation to achieve probabilistic fairness of resources. In addition, when tokens become available and are assigned to opportunistic tasks, Apollo upgrades those tasks to regular tasks.

Randomized allocation mechanism. Ideally, opportunistic resources should be shared fairly among jobs, in proportion to their token allocations. This is particularly challenging because the load of the whole cluster, and of each individual server, fluctuates over time, which makes it difficult (or even impossible) to guarantee absolute instantaneous fairness. Instead, Apollo focuses on avoiding the worst case, in which a few jobs occupy all the available capacity of the cluster, and aims at fairness on average.

Apollo achieves this by bounding each job's opportunistic allowance in proportion to its allocated tokens: for example, a job with n tokens may dispatch at most cn opportunistic tasks, where c is a constant. When a PN has spare capacity and its regular queue is empty, it picks a random task from its opportunistic-task queue to execute, regardless of when the job dispatched it. If the resources required by the selected task exceed the available resources, random selection continues until no task can be executed. Compared with a FIFO queue, the advantage of this algorithm is that it allows jobs that start later to share capacity quickly.

Since a job's degree of parallelism varies over its lifetime, the number of tasks it has ready to schedule also varies. As a result, a job may not always be able to dispatch enough opportunistic tasks to make full use of its opportunistic allowance. The system can be further enhanced by allowing each scheduler to increase the weight of its opportunistic tasks in the random-selection process to compensate for a lower task count; for example, a weight of 2 means a task is twice as likely to be selected. The total weight of all opportunistic tasks dispatched by a job must not exceed its opportunistic allowance.
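
The sketch below shows a PN choosing an opportunistic task by weighted random selection when its regular queue is empty, so that a task with weight 2 is twice as likely to be picked. The data layout and the capacity check are simplified and hypothetical, not Apollo's actual implementation.

```python
import random

def pick_opportunistic(queue, free_cores, free_mem):
    """queue: list of dicts like {"job": ..., "cores": ..., "mem": ..., "weight": ...}."""
    candidates = list(queue)
    while candidates:
        weights = [t["weight"] for t in candidates]
        chosen = random.choices(candidates, weights=weights, k=1)[0]   # weight 2 => picked twice as often
        if chosen["cores"] <= free_cores and chosen["mem"] <= free_mem:
            queue.remove(chosen)
            return chosen                        # run this opportunistic task
        candidates.remove(chosen)                # does not fit; keep drawing
    return None                                  # nothing runnable right now

# A job whose parallelism has dropped can raise the weights of its queued tasks,
# bounded so that their total weight never exceeds its c*n opportunistic allowance.
```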

Under an idealized workload in which all tasks run for the same time and consume the same resources, and on a perfectly balanced cluster, this strategy lets jobs share opportunistic resources on average in proportion to their token allocations. In reality, however, tasks differ widely in runtime and resource demands, and the number of tasks each job has ready changes continuously as tasks complete and new tasks become ready. In addition, a job may never have enough parallelism to take full advantage of its opportunistic allowance. Designing a fully decentralized mechanism that guarantees strong fairness in such a dynamic environment remains a challenging topic for future work.

Task upgrade. Opportunistic tasks are likely to starve if the servers they are queued on come under resource pressure; in principle, an opportunistic task could wait in a queue indefinitely. To avoid job starvation, opportunistically scheduled tasks can be upgraded to regular tasks once tokens have been allocated to them. Because a job needs at least one token to run and contains a finite number of tasks, the scheduler can eventually convert starving opportunistic tasks into regular tasks, which prevents the job from starving.

After an opportunistic task is dispatched, the scheduler keeps tracking it in its ready list until the task completes. When scheduling regular tasks, the scheduler considers both unscheduled tasks and previously dispatched opportunistic tasks that are still waiting to execute, and it assigns its tokens to tasks in descending order of priority. Upgrading an opportunistic task in place on the same machine is not required, but it is often preferable because no further initialization time is needed. By tracking overall consumption, the scheduler prefers to upgrade opportunistic tasks on machines with fewer regular tasks, while waiting for machines under heavier transient load to release resources. This strategy makes better use of tokens and achieves better load balancing.
