Spark _09 Resource Scheduling and Task Scheduling

Disclaimer: This is an original article by the blogger, licensed under CC 4.0 BY-SA. Please include the original source link and this statement when reproducing it.
Original link: https://blog.csdn.net/qq_41946557/article/details/102750558

  • Spark resource scheduling and task scheduling process:

After the cluster starts, each Worker node reports its resources to the Master node, so the Master has a view of the resources of the whole cluster.

When a Spark Application is submitted, a DAG (directed acyclic graph) is formed according to the dependencies between the RDDs in the Application.

After the job is submitted, Spark creates two objects on the Driver side: DAGScheduler and TaskScheduler. DAGScheduler is the high-level task scheduler. Its main role is to divide the DAG into a number of Stages according to the wide (shuffle) dependencies between RDDs, and then submit those Stages to TaskScheduler in the form of TaskSets. TaskScheduler is the low-level task scheduler; a TaskSet is simply a collection of tasks, one task per partition, so the number of tasks in a Stage is that Stage's degree of parallelism. TaskScheduler iterates over each TaskSet and sends every task to an Executor on a compute node for execution (more precisely, to the Executor's thread pool).
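The flow above can be sketched as a toy model in Scala. This is purely illustrative: the names ToyDAGScheduler, ToyTaskScheduler, Task, and TaskSet are invented here and only mimic the roles of Spark's real components, which are far more involved.

```scala
import java.util.concurrent.Executors

// One task per partition of a stage.
case class Task(stageId: Int, partition: Int) {
  def run(): String = s"stage $stageId / partition $partition done"
}

// A TaskSet packages all tasks of one stage.
case class TaskSet(stageId: Int, tasks: Seq[Task])

object ToyDAGScheduler {
  // Pretend the DAG contains one wide (shuffle) dependency, so the job
  // splits into two stages; each stage gets one task per partition.
  def submitJob(numPartitions: Int): Seq[TaskSet] =
    (0 to 1).map(s => TaskSet(s, (0 until numPartitions).map(p => Task(s, p))))
}

object ToyTaskScheduler {
  // Iterate over each TaskSet and hand every task to a thread pool,
  // standing in for an Executor's thread pool on a worker node.
  def submit(taskSets: Seq[TaskSet]): Unit = {
    val pool = Executors.newFixedThreadPool(4)
    for (ts <- taskSets; task <- ts.tasks)
      pool.submit(new Runnable { def run(): Unit = println(task.run()) })
    pool.shutdown()
  }
}
```

For example, `ToyTaskScheduler.submit(ToyDAGScheduler.submitJob(numPartitions = 3))` runs one task per partition across the two stages.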

The Executor's thread pool reports each task's status back to TaskScheduler. When a task fails, TaskScheduler is responsible for retrying it by resending it to an Executor, three times by default. If the task still fails after three retries, the Stage containing that task fails. Stage failures are retried by DAGScheduler, which resends the TaskSet to TaskScheduler, four times by default. If the Stage still fails after four retries, the job fails; and when a job fails, the Application fails.
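For reference, these retry limits map to Spark configuration properties. A sketch for recent Spark versions (note that `spark.task.maxFailures` counts total attempts, not retries, so its default of 4 corresponds to the "retry three times" described above):

```
spark.task.maxFailures                4
spark.stage.maxConsecutiveAttempts    4
```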

TaskScheduler not only retries failed tasks, it also retries straggling (lagging, slow) tasks, i.e. tasks that run much more slowly than the others. If a task is running slowly, TaskScheduler launches a new task that performs the same processing logic as the slow one; whichever of the two tasks finishes first, its result is the one used. This is Spark's speculative execution mechanism. Speculative execution is off by default in Spark and can be enabled through the spark.speculation property.
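Speculative execution is controlled by a small family of properties. A sketch, shown with their usual defaults (exact defaults may vary across Spark versions):

```
spark.speculation             true
spark.speculation.interval    100ms   # how often to check for slow tasks
spark.speculation.multiplier  1.5     # how much slower than the median counts as slow
spark.speculation.quantile    0.75    # fraction of tasks that must finish before checking
```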

Note:

  • For ETL-type jobs that write into a database, speculative execution should be disabled, otherwise duplicate data may be written.
  • In a data-skew situation, enabling speculative execution is likely to restart tasks that run the same logic over and over, so the task may remain in a never-ending state.
  • Diagram of Spark's resource scheduling and task scheduling process (original image not reproduced here)

 

  • Coarse-grained vs. fine-grained resource application
  • Coarse-grained resource application (Spark)

All resources are requested before the Application executes. Once the resource request succeeds, tasks are scheduled; the resources are only released after all tasks have finished executing.

Pros: because all resources are requested before the Application executes, each task can use them directly without requesting its own resources first. Tasks start faster, so tasks finish faster, Stages finish faster, the job finishes faster, and the Application executes faster.

Cons: resources are not released until the last task has finished executing, so the cluster's resources cannot be fully utilized.

  • Fine-grained resource application (MapReduce)

The Application does not request resources before executing; it executes directly, and each task in the job requests its own resources before running, releasing them as soon as it finishes.

Pros: the cluster's resources can be fully utilized.

Cons: since each task requests its own resources, task startup is slower, and the Application runs correspondingly slower.

 
