Spark Architecture and Operating Mechanism (6) - Task Scheduling

    Once the DAGScheduler has submitted the TaskSets it built to the TaskScheduler, its part in task scheduling is complete: the DAGScheduler only divides a job's directed acyclic graph (DAG) into Stages and produces the order in which those Stages execute. It is the TaskScheduler that actually decides which physical node runs each Task of a Stage. The relationship between the DAG, the DAGScheduler, and the TaskScheduler is shown below:

    [Figure: the job's DAG is divided into Stages by the DAGScheduler; each Stage's TaskSet is handed to the TaskScheduler for task-level scheduling]

    The TaskScheduler creates a TaskSetManager for each TaskSet, and that TaskSetManager is responsible for managing and scheduling the Tasks in its TaskSet. The TaskSetManager does not talk to the underlying physical nodes directly, however; every TaskSetManager's interaction with the physical nodes is mediated by the TaskScheduler.
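
    As a minimal sketch of this relationship (the classes below are simplified stand-ins invented for illustration, not Spark's real TaskSchedulerImpl and TaskSetManager, which carry far more state), a TaskScheduler wraps each TaskSet submitted by the DAGScheduler in its own TaskSetManager:

        // Simplified stand-ins for illustration only.
        case class Task(id: Int, partition: Int)
        case class TaskSet(stageId: Int, tasks: Seq[Task])

        class TaskSetManager(val taskSet: TaskSet) {
          // Queue of tasks in this TaskSet that still need to run.
          private val pending = scala.collection.mutable.Queue(taskSet.tasks: _*)
          def hasPendingTasks: Boolean = pending.nonEmpty
          def nextTask(): Option[Task] =
            if (pending.nonEmpty) Some(pending.dequeue()) else None
        }

        class SimpleTaskScheduler {
          private val managers = scala.collection.mutable.Buffer[TaskSetManager]()

          // Called by the DAGScheduler once a Stage's TaskSet is ready:
          // exactly one TaskSetManager is created per TaskSet.
          def submitTasks(taskSet: TaskSet): Unit =
            managers += new TaskSetManager(taskSet)
        }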

    1) The TaskScheduler communicates with the underlying Master and Worker nodes through a SchedulerBackend, the scheduler's backend process. The TaskScheduler chooses a SchedulerBackend that matches the deployment mode; SchedulerBackends exist for YARN, Mesos, Standalone, and other deployments;
    2) The TaskScheduler offers the available physical resources to the TaskSetManager; the TaskSetManager decides which physical resource each Task runs on and hands the scheduling plan back to the TaskScheduler, which then submits the Tasks to the Spark cluster for actual execution (a simplified sketch follows this list).
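
    Continuing the simplified types from the sketch above (WorkerOffer's fields and this core-filling loop are assumptions modeled loosely on Spark's resource-offer flow, not its actual implementation):

        // The SchedulerBackend reports free executor slots as offers;
        // TaskSetManagers map pending tasks onto them, and the resulting
        // plan is returned for actual launch on the cluster.
        case class WorkerOffer(executorId: String, host: String, freeCores: Int)
        case class TaskDescription(taskId: Int, executorId: String)

        def resourceOffers(offers: Seq[WorkerOffer],
                           managers: Seq[TaskSetManager]): Seq[TaskDescription] = {
          val plan = scala.collection.mutable.Buffer[TaskDescription]()
          for (offer <- offers; _ <- 1 to offer.freeCores) {
            // Fill each free core with the next pending task, if any.
            managers.find(_.hasPendingTasks).flatMap(_.nextTask()).foreach { t =>
              plan += TaskDescription(t.id, offer.executorId)
            }
          }
          plan.toSeq
        }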

    After the TaskScheduler submits a Task to the Spark cluster, it tracks the Task's execution and retries the submission if it fails. Tasks in the same TaskSet execute in parallel, so if a Task assigned to some physical node runs for an unusually long time and holds up the whole TaskSet, the TaskScheduler will pick a new physical node and launch a copy of that Task there. This straggler handling is known as speculative execution.
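
    Speculative execution is controlled by the spark.speculation family of settings and is off by default. A minimal SparkConf sketch (the values shown match Spark's documented defaults and are illustrative, not tuning advice):

        import org.apache.spark.SparkConf

        // Re-launch suspiciously slow tasks on other nodes.
        val conf = new SparkConf()
          .set("spark.speculation", "true")            // disabled by default
          .set("spark.speculation.interval", "100ms")  // how often to check for stragglers
          .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish first
          .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow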

    The TaskSetManager's scheduling strategy is locality-aware task scheduling: from the available resources, it tries to match each Task with the Executor that offers the best data locality. Concretely, Spark first determines where each partition of the Stage's input RDD resides. Since every Task is a series of operations on one partition, Spark tries to keep those operations in one place: ideally the successive operations of a Task run in the same process (PROCESS_LOCAL); if that cannot be satisfied, it falls back to the same computing node (NODE_LOCAL); failing that, the same rack (RACK_LOCAL); and finally any node (ANY). The TaskSetManager then selects the most suitable computing resource and submits the resource request to the TaskScheduler.
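
    How long the scheduler waits at each locality level before falling back to the next is governed by the spark.locality.wait settings (3s is Spark's documented default; the per-level keys inherit the global value unless overridden):

        import org.apache.spark.SparkConf

        // Wait time per locality level before degrading:
        // PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY
        val conf = new SparkConf()
          .set("spark.locality.wait", "3s")          // default for all levels
          .set("spark.locality.wait.process", "3s")  // same-process (Executor) level
          .set("spark.locality.wait.node", "3s")     // same-node level
          .set("spark.locality.wait.rack", "3s")     // same-rack level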
 
    An application is eventually split into multiple TaskSets. Because some TaskSets have no dependencies between them, several runnable TaskSets may be handed to the TaskScheduler at the same time. For this situation Spark provides two scheduling modes: FIFO and FAIR.
   
     FIFO: first-in, first-out scheduling; this is Spark's default scheduling mode.
    
     FAIR: fair scheduling. In FAIR mode every computing task has equal priority, and Spark allocates computing resources to tasks in a round-robin fashion. FAIR mode also provides the concept of scheduling pools: a user can place important computing tasks into a dedicated pool and raise that pool's weight so that the tasks in it receive higher priority. In addition, when multiple users share a Spark cluster, FAIR mode can set up a scheduling pool for each user, with all pools at equal priority, ensuring that every user gets a fair share of the cluster's computing resources.
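
    As a concrete illustration (the pool name "important" and the file path below are invented for this example), enabling FAIR mode and routing jobs to a weighted pool might look like this; pool definitions live in an XML allocation file referenced by spark.scheduler.allocation.file:

        import org.apache.spark.{SparkConf, SparkContext}

        // fairscheduler.xml defines the pools, e.g.:
        //   <pool name="important">
        //     <schedulingMode>FAIR</schedulingMode>
        //     <weight>2</weight>      <!-- twice the share of a weight-1 pool -->
        //     <minShare>4</minShare>  <!-- guaranteed minimum cores -->
        //   </pool>
        val conf = new SparkConf()
          .setAppName("fair-scheduling-demo")
          .set("spark.scheduler.mode", "FAIR")
          .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
        val sc = new SparkContext(conf)

        // Jobs submitted from this thread now run in the "important" pool.
        sc.setLocalProperty("spark.scheduler.pool", "important")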
