Apache Spark will support stage-level resource control and scheduling

Background

Those who are familiar with Spark know that when a Spark job is started, we need to specify the number of Executors as well as their memory and CPU. However, a running Spark job may contain many stages, and different stages may require different resources. Because of Spark's current design, we cannot set fine-grained resources for each stage. Moreover, it is difficult even for a senior engineer to estimate a configuration accurately enough that the parameters set at job submission suit every stage of the job.

Consider this scenario: we have a Spark job with two stages. The first stage performs basic ETL on the input data; it typically launches a large number of tasks, but each task needs only a small amount of memory and few cores (for example, 1 core). The second stage takes the ETL output as input to an ML algorithm; it needs only a few tasks, but each task requires a lot of memory, GPUs, and CPU.

Business scenarios like this are common: we need different resources for different stages. However, Spark currently does not support such fine-grained resource configuration, so we have to request a large amount of resources when the job starts (as sketched below), which can waste resources, especially in machine learning scenarios.
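For illustration, here is a minimal sketch of how resources are fixed for the whole application today. The configuration keys are standard Spark settings; the application name and the sizing numbers are made up for the example:

import org.apache.spark.{SparkConf, SparkContext}

// One executor shape for the entire application, fixed at submission time.
val conf = new SparkConf()
  .setAppName("etl-then-ml")                // hypothetical application
  .set("spark.executor.instances", "100")   // sized for the many small ETL tasks
  .set("spark.executor.memory", "4g")       // but the later ML stage may need far more
  .set("spark.executor.cores", "1")

val sc = new SparkContext(conf)
// Every stage of every job in this application runs on executors of this one shape.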

It is worth noting that Thomas Graves, a principal system software engineer at Nvidia, filed an issue with the community, SPIP: Support Stage level resource configuration and scheduling, which aims to let Spark support stage-level resource configuration and scheduling. As the name suggests, this is a SPIP (Spark Project Improvement Proposal). SPIPs are meant for major user-facing or cross-cutting changes rather than small incremental improvements, so it is clear that this feature modifies Spark substantially and will have a relatively large impact on users.

After submitting the SPIP, the author sent an email to the community explaining its purpose and the problems it aims to solve, and then put it to a vote on whether it should be developed. The good news is that the community vote has concluded with 6 votes in favor and 1 against, which means the SPIP passed and will move into development.

Design

With all that background out of the way, let's look at how this feature is designed. To implement it, some new APIs need to be added to the existing RDD class so that users can specify the resources required to compute an RDD, for example the following two methods:


def withResources(resources: ResourceProfile): this.type
def getResourceProfile(): Option[ResourceProfile]

The withResources method sets the ResourceProfile of the current RDD and returns the same RDD instance. The resources specified in a ResourceProfile include CPU, memory, and additional resources (GPU, FPGA, etc.). It could also be used for other purposes, such as limiting the number of tasks in a stage or setting shuffle parameters, but to keep the design and implementation simple, only the resources Spark already supports are considered for now: for a task, CPU and additional resources (GPU, FPGA, etc.); for an executor, CPU, memory, and additional resources (GPU, FPGA, etc.). Adding these methods to the existing RDD class means that every RDD subclass derived from RDD supports setting resources, including, of course, the RDDs created from input files.

When writing a program, the user sets the ResourceProfile via withResources; naturally, resources cannot be requested without limit. Both executor-level and task-level resources can be set through ResourceProfile.require. The interface looks like this:


def require(request: TaskResourceRequest): this.type
def require(request: ExecutorResourceRequest): this.type

class ExecutorResourceRequest(
    val resourceName: String,
    val amount: Int, // potentially make this handle fractional resources
    val units: String = "", // units, needed for memory resources
    val discoveryScript: Option[String] = None,
    val vendor: Option[String] = None)

class TaskResourceRequest(
    val resourceName: String,
    val amount: Int) // potentially make this handle fractional resources
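Since both require overloads return this.type, the calls can also be chained. A small sketch under the definitions above (the resource names and amounts are only illustrative):

// Builds one profile with both executor- and task-level requests.
val chained = new ResourceProfile()
  .require(new ExecutorResourceRequest("memory", 4096, "MiB")) // executor-level request
  .require(new ExecutorResourceRequest("gpu", 1, discoveryScript = Some("/opt/gpuScripts/getGpus")))
  .require(new TaskResourceRequest("gpu", 1))                  // task-level request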

The reason ExecutorResourceRequest and TaskResourceRequest are wrapped in a ResourceProfile is that it makes later extensions easy. For example, a ResourceProfile.prefer method could be added so that the job runs if the requested resources can be acquired and fails otherwise.

Note that this feature relies on Spark's dynamic allocation mechanism. If the user has not enabled dynamic allocation (spark.dynamicAllocation.enabled=false) or has not set a ResourceProfile on the RDD, the existing resource request mechanism is used; otherwise the new mechanism takes effect.
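For reference, a minimal sketch of the prerequisite configuration (standard Spark settings; the external shuffle service is the usual companion of dynamic allocation, and the executor bounds are only examples):

import org.apache.spark.SparkConf

// Stage-level profiles only take effect with dynamic allocation enabled;
// otherwise Spark keeps using the application-wide resource requests.
val dynConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")      // commonly required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")  // example bounds
  .set("spark.dynamicAllocation.maxExecutors", "50")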

Because each RDD can have its own ResourceProfile, and the DAGScheduler may combine the transformations of multiple RDDs into a single stage, Spark needs to resolve conflicting resource requests from RDDs in the same stage. Some operations, such as reduceByKey, also span two stages, in which case Spark needs to apply the ResourceProfile to both stages.
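The following hypothetical sketch, assuming the SPIP API and classes described above (sc is a SparkContext), illustrates both cases: two RDDs with different profiles that end up in one stage, and a reduceByKey whose profile has to cover two stages:

// Two profiles built with the proposed classes; resource names follow the article's examples.
val cpuProfile = new ResourceProfile()
cpuProfile.require(new ExecutorResourceRequest("cores", 1))

val gpuProfile = new ResourceProfile()
gpuProfile.require(new ExecutorResourceRequest("gpu", 1, discoveryScript = Some("/opt/gpuScripts/getGpus")))
gpuProfile.require(new TaskResourceRequest("gpu", 1))

// zip is a narrow dependency, so rddA and rddB are computed in the same stage;
// that stage now carries two different profiles that Spark must reconcile.
val rddA = sc.parallelize(1 to 100, 4).withResources(cpuProfile)
val rddB = sc.parallelize(101 to 200, 4).withResources(gpuProfile)
val zipped = rddA.zip(rddB)

// reduceByKey crosses a shuffle boundary: the profile set on the map side
// needs to be applied to both the map stage and the reduce stage.
val counts = sc.parallelize(Seq(("a", 1), ("b", 1)), 4)
  .withResources(gpuProfile)
  .reduceByKey(_ + _)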

How to use

With the methods above added to RDD, we can set per-executor and per-task resources as follows:


val rp = new ResourceProfile()
rp.require(new ExecutorResourceRequest("memory", 2048, "MiB")) // 2 GB per executor
rp.require(new ExecutorResourceRequest("cores", 2))
rp.require(new ExecutorResourceRequest("gpu", 1, discoveryScript = Some("/opt/gpuScripts/getGpus")))
rp.require(new TaskResourceRequest("gpu", 1))

val rdd = sc.makeRDD(1 to 10, 5).mapPartitions { it =>
  val context = TaskContext.get()
  context.resources().get("gpu").get.addresses.iterator
}.withResources(rp)

val gpus = rdd.collect()

The ResourceProfile above specifies that each executor requires 2 GB of memory, 2 cores, and one GPU, and each task requires one GPU.
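To connect this back to the ETL-plus-ML scenario from the beginning, here is a hypothetical end-to-end sketch using the rp profile built above (the input path and the per-partition work are placeholders, and the API is the one proposed in the SPIP):

// Only the ML-style stage carries the GPU/high-memory profile; the ETL stage
// keeps the default resources.
val etlRDD = sc.textFile("hdfs:///input/events")   // placeholder input path
  .map(_.split(","))                               // cheap ETL work

val mlRDD = etlRDD
  .mapPartitions(iter => iter.map(_.length))       // stand-in for the real ML step
  .withResources(rp)

// getResourceProfile returns Option[ResourceProfile]: None for the ETL RDD,
// Some(rp) for the RDD that carries the profile.
assert(etlRDD.getResourceProfile().isEmpty)
assert(mlRDD.getResourceProfile().contains(rp))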

Summary

This article gives only a brief introduction to the feature; there are many more issues to consider in the actual design and development. For details, see SPARK-27495, and for the corresponding design document see Stage Level Scheduling SPIP Appendices API/Design. Because this is a relatively large feature, it may take several months to implement. With it, we should be able to use cluster resources much more sensibly.
