Application Execution & Chapter 4: Spark Task Scheduling Mechanism


Execution of the application

(1) RDD dependencies

[Figure: RDD dependency graph]

ShuffledRDD

"ShuffledRDD" is a class in the Apache Spark framework, which is used to represent the distributed data set after the Shuffle operation.

In Spark, Shuffle refers to the operation of repartitioning and reordering data. When the data needs to be regrouped, aggregated or sorted, Spark will perform the Shuffle operation to redistribute the data to different computing nodes for processing. Shuffle is a very critical operation in Spark, which has a significant impact on performance and resource consumption.

The "ShuffledRDD" class is a special RDD (Resilient Distributed Dataset) that represents a Shuffled dataset. It includes the following key features and functions:

  1. Data repartitioning: ShuffledRDD repartitions data to different computing nodes for subsequent data processing operations. The method of partitioning can be configured according to the needs of the application.

  2. Data reordering: ShuffledRDD will reorder data to meet the ordering needs of the application. It can sort the data by a specified collation, or by the key in a key-value pair.

  3. Shuffle dependency: There is a Shuffle dependency between ShuffledRDD and the upstream RDD, indicating that the Shuffle operation is generated by the upstream RDD. This ensures that Shuffle operations can be recalculated and resumed in the event of a compute node failure.

  4. Persistence and serialization: ShuffledRDD can improve performance through persistence (caching), and supports data serialization and deserialization operations to improve data transmission efficiency.

By using "ShuffledRDD", developers can use Spark's Shuffle operation to repartition and reorder data. This is important for applications that require data aggregation, grouping, sorting, etc. ShuffledRDD provides highly scalable and fault-tolerant features, making it possible to perform Shuffle operations on large-scale data sets.

ShuffleDependency

"ShuffleDependency" is a class in Apache Spark that represents the dependency between two stages in a Spark job involving Shuffle operations.

Spark performs a Shuffle operation when data needs to be repartitioned, aggregated, or sorted. The "ShuffleDependency" class is used to represent the dependencies of this Shuffle operation, which contains the following key features and functions:

  1. Partition method: It specifies the partition method of data during Shuffle operation. This information is used to determine the number of output partitions and how the data is distributed among the partitions.

  2. Aggregation function: ShuffleDependency allows specifying an aggregation function for aggregating data during Shuffle operations. Useful when performing operations like reduceByKey or aggregateByKey.

  3. Shuffle mapping stage: ShuffleDependency indicates the dependency from the mapping stage to the reducing stage, in which the Shuffle operation takes place. It captures the relationship between map output partitions and reduce input partitions.

  4. Serialization and deserialization: ShuffleDependency supports serialization and deserialization of Shuffle metadata, including partition information, Shuffle output and dependencies.

ShuffleDependency plays an important role in the execution plan of Spark jobs, which captures the information needed for data shuffling and reallocation. It enables Spark to efficiently perform Shuffle operations on the cluster, and supports tasks such as data aggregation, grouping, and sorting.
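
A minimal sketch (again assuming spark-shell with `sc` available): the `dependencies` field of the post-shuffle RDD exposes the ShuffleDependency that links it to its map-side parent:

```scala
val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map((_, 1)).reduceByKey(_ + _)

// The shuffled RDD's single dependency is a ShuffleDependency on the map-side RDD.
counts.dependencies.foreach(d => println(d.getClass.getSimpleName))   // ShuffleDependency
```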

MapPartitionsRDD

"MapPartitionsRDD" is a class in Apache Spark that is used to represent the result of performing a mapPartitions operation on an RDD.

In Spark, mapPartitions is a transformation operation that allows developers to perform batch operations on each partition of an RDD instead of element-by-element. mapPartitions applies a function to each partition's data and produces a new RDD.

The "MapPartitionsRDD" class is a special type that represents an RDD after a mapPartitions operation. It includes the following key features and functions:

  1. Partition information: MapPartitionsRDD inherits from the parent RDD and retains the partition information of the parent RDD. It can create corresponding partitions according to the partitioning method of the parent RDD.

  2. Dependency: There is a dependency between MapPartitionsRDD and the parent RDD, which is used to build the dependency graph of RDDs in the job.

  3. Data transformation: MapPartitionsRDD transforms the data in each partition by applying a user-supplied function. The function takes an iterator as an input parameter, can perform arbitrary operations on the elements in the iterator, and returns a new iterator as the output result.

  4. Lazy evaluation: Similar to other RDDs, MapPartitionsRDD is evaluated lazily, and computation is only triggered when an action is performed.

By using "MapPartitionsRDD", developers can perform batch processing on each partition of an RDD, thereby improving the efficiency of data processing. This is useful for scenarios that require complex operations on each partition, such as data transformations, filtering, calculations, etc.

It should be noted that "MapPartitionsRDD" is an internal class of Spark, used to represent the internal structure and transformation operations of RDDs, and is not directly exposed to users. Users usually use RDD transformation operations (such as mapPartitions) to manipulate data without directly manipulating the "MapPartitionsRDD" class.

OneToOneDependency

"OneToOneDependency" is a class in Apache Spark used to represent a one-to-one dependency between RDDs.

In Spark, RDD (Resilient Distributed Dataset) is an abstract representation of distributed data, and dependencies are used to describe the relationship between RDDs and the dependencies of transformation operations. A one-to-one dependency means that each parent RDD partition corresponds to exactly one child RDD partition.

The "OneToOneDependency" class has the following key characteristics and functionality:

  1. Partition mapping: Each parent RDD partition corresponds to a child RDD partition, and there is a one-to-one mapping relationship between them.

  2. Dependency: OneToOneDependency represents the dependency between RDDs and is used to build a dependency graph of RDDs.

  3. Conversion operation: OneToOneDependency is used to describe a one-to-one conversion operation, which means that each partition of the child RDD only depends on the corresponding partition of the parent RDD.

  4. Lazy loading: Similar to other RDDs, OneToOneDependency is also lazy-loaded, and calculations are only triggered when operations are performed.

The "OneToOneDependency" class is usually used internally by Spark and is not directly exposed at the user level. It is used to build a dependency graph of RDDs, ensuring that each child RDD partition maintains a one-to-one relationship with the corresponding parent RDD partition during transformation operations.

(2) Division of stages

[Figure: Stage division]

ShuffledRDD

ShuffledRDD has already been described above under (1) RDD dependencies.

RDD

RDD (Resilient Distributed Dataset) is a core concept in Apache Spark and the main data abstraction of Spark.

RDD is a distributed, immutable collection of data that can be computed and manipulated in parallel in a Spark cluster. It has the following properties:

  1. Resilient: RDD is fault-tolerant, that is, when a node fails, the lost data can be recalculated through the RDD lineage information to ensure data reliability and recoverability.

  2. Distributed: RDD can be distributed on multiple nodes in the cluster and supports parallel computing. Data on each node can be calculated and processed independently.

  3. Immutable (Immutable): RDD is an immutable collection of data that cannot be modified once created. Every time a transformation operation is performed on an RDD, a new RDD is generated.

  4. Lazy Evaluation: RDD has the characteristics of lazy evaluation, and the actual calculation will only be performed when it needs to return a result or trigger an action operation.

RDD supports rich transformation and action operations. Transformations, such as map, filter and reduceByKey, transform an RDD into a new RDD; actions, such as count, collect and save, trigger the computation and return results.

By using RDD, developers can take advantage of Spark's distributed computing capabilities for large-scale data processing and analysis. Spark provides rich APIs and functions for complex data operations and algorithm processing.

It should be noted that the Dataset API, introduced in Spark 1.6 and unified with DataFrame in Spark 2.0, is a higher-level abstraction over RDD that provides type safety and optimized execution plans. Therefore, in new Spark applications it is recommended to use the Dataset/DataFrame API to process data instead of manipulating RDDs directly.
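
A short sketch of the transformation/action split and lazy evaluation described above (spark-shell assumed; the numbers are made up):

```scala
// Transformations are lazy: these lines only record lineage, nothing runs yet.
val nums    = sc.parallelize(1 to 1000, numSlices = 4)
val squares = nums.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// The action triggers the actual distributed computation and returns a result to the driver.
println(evens.count())    // 500 (the squares of the 500 even numbers)
```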

ShuffleMapStage

"ShuffleMapStage" is a concept in Apache Spark, which is used to represent the stage involving Shuffle operation in Spark job.

In Spark, Shuffle refers to the operation of repartitioning and reordering data. When the data needs to be regrouped, aggregated or sorted, Spark will perform the Shuffle operation to redistribute the data to different computing nodes for processing. Shuffle is an important operation in Spark, which has a significant impact on performance and resource consumption.

"ShuffleMapStage" represents a stage, which contains a series of tasks (tasks) and the corresponding Shuffle dependencies. A ShuffleMapStage is responsible for processing input data and Shuffle operations, and generating Shuffle output results.

Here are some key features and functions of "ShuffleMapStage":

  1. Task division: ShuffleMapStage divides the data processing tasks in the job into a set of parallel execution tasks, and each task is responsible for processing a partition of the input data.

  2. Shuffle dependencies: ShuffleMapStage represents dependencies from data sources (such as RDDs) to Shuffle operations. It defines the partitioning method, partitioner, aggregation function and other information of the input data.

  3. Data processing and Shuffle: ShuffleMapStage is responsible for processing input data and performing Shuffle operations to repartition and reorder data.

  4. Task scheduling and execution: ShuffleMapStage schedules tasks to be executed on computing nodes in the cluster, and manages the progress and status of tasks.

ShuffleMapStage plays a key role in the execution of Spark jobs. It combines data processing and shuffle operations to ensure that data is repartitioned and reordered the way it needs to be. At the same time, it is also responsible for managing the scheduling and execution of tasks to achieve efficient parallel computing.

It should be noted that "ShuffleMapStage" is an internal concept of Spark, which is used to represent the stage and task division of the Shuffle operation. For ordinary Spark application developers, it is enough to pay more attention to the RDD conversion operation and data processing logic, without directly operating the "ShuffleMapStage" class.

ResultStage

"ResultStage" is a concept in Apache Spark used to represent the result stage in a Spark job.

In Spark, a job (Job) consists of a series of stages (Stage). Each stage represents a set of interdependent tasks (Tasks) that can be executed in parallel. Spark divides jobs into multiple stages for parallel execution on computing nodes, and implements the computing process of the entire job through data transmission and dependencies between stages.

"ResultStage" is a special kind of stage, which represents the last stage responsible for returning the calculation result to the application or saving it to an external storage system.

Here are some key features and functions of "ResultStage":

  1. Result calculation: ResultStage performs the final calculation operation, generates and returns the calculation result to the application program or saves it to an external storage system.

  2. Dependency: ResultStage usually depends on the previous stage, and performs further processing according to the calculation results of the previous stage.

  3. Data transfer: ResultStage may need to obtain data from previous stages and perform operations such as summarizing, merging, or aggregating to generate the final calculation result.

  4. Task scheduling and execution: ResultStage is responsible for scheduling tasks to be executed on computing nodes in the cluster, and managing the progress and status of tasks.

ResultStage plays a key role in the execution of Spark jobs. It represents the final stage of the job and is responsible for generating the final calculation results. It relies on the previous stages, and performs subsequent processing and aggregation based on the calculation results of the previous stages.

It should be noted that "ResultStage" is an internal concept of Spark, which is used to represent the execution process and result stage of a job. For ordinary Spark application developers, it is enough to pay more attention to the RDD conversion operation and data processing logic, without directly operating the "ResultStage" class.

(3) Segmentation of tasks

In Spark, task segmentation is the process of decomposing a job into small task units that can be executed in parallel. The purpose of task segmentation is to divide the work in a job into multiple tasks for parallel execution on multiple computing nodes in the cluster.

Task segmentation usually occurs in the following stages:

  1. RDD partition: If the input of the job is an RDD (Resilient Distributed Dataset), the RDD needs to be divided into multiple partitions first. The partition of RDD determines the granularity of the task, and each partition will become an independent task.

  2. Job division: A job (Job) usually consists of multiple stages (Stage). Each stage can be further divided into tasks. The basis of job division is to divide the job into multiple stages according to the dependency relationship of data and the conversion relationship of computing operations, and each stage contains a set of tasks that can be executed in parallel.

  3. Data locality: In the process of task segmentation, Spark also considers data locality, that is, assigns tasks to the computing nodes where the data resides as much as possible. This helps reduce data transfer overhead and improve task execution efficiency.

  4. Task scheduling: After tasks are divided, the tasks need to be scheduled to be executed on the computing nodes in the cluster. Spark's task scheduler will distribute tasks to available computing nodes according to factors such as cluster resources and task priority.

It should be noted that task segmentation is automatically handled by the Spark framework, and developers do not need to explicitly write task segmentation code. Spark automatically divides the job into multiple stages and tasks according to the RDD dependencies and conversion operations of the job, and schedules them to be executed in the cluster.

Task segmentation is an important step in Spark job execution, which allows Spark to implement parallel task execution and data processing in a distributed environment. Through task segmentation, Spark can make full use of cluster resources and improve job execution efficiency and performance.
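
A small sketch of the partition-to-task relationship (spark-shell assumed; path and partition counts are illustrative): the number of tasks in a stage equals the number of partitions of the RDD it computes, and repartitioning changes that count:

```scala
val data = sc.textFile("input.txt", minPartitions = 8)   // illustrative path and value
println(data.getNumPartitions)                           // e.g. 8 -> 8 tasks in this stage

// Changing the partition count changes how many tasks the next stage is split into.
val wider = data.repartition(16)
println(wider.getNumPartitions)                          // 16 -> 16 tasks
```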

(4) Scheduling of tasks

[Figure: Task scheduling]

Task

In Apache Spark, a Task is the smallest execution unit of a Job, representing a unit of work executed on the cluster. Each task typically processes one partition of the data and applies the specified transformations or operations to it.

The following are key features and concepts about tasks:

  1. Partition processing: Each task is responsible for processing a partition in the data set, which is the basic unit of data distribution in the cluster. Tasks perform computational operations by applying specified functions on partitions.

  2. Data locality: Spark optimizes task scheduling and assigns tasks to computing nodes that store corresponding data partitions as much as possible. This can reduce the overhead of data transfer and improve the efficiency of task execution.

  3. Parallel execution: Spark executes tasks in the cluster in parallel. It can run multiple tasks on multiple computing nodes at the same time, thereby utilizing cluster resources to achieve high-performance distributed computing.

  4. Task dependency: There is a dependency relationship between tasks, that is, some tasks need to be executed after other tasks are completed. This dependency is determined through RDD transformation operations and data dependencies.

  5. Task scheduling: The task scheduler is responsible for assigning tasks to available computing nodes in the cluster. It schedules and allocates based on factors such as cluster resources, task dependencies, and data locality.

Tasks play a key role in Spark, they are the basic unit of job execution, and realize large-scale data processing and analysis through parallel execution and distributed computing. The Spark framework automatically handles task scheduling and execution. Developers only need to define conversion operations and build jobs, and Spark will automatically divide them into tasks and execute them in the cluster.

It should be noted that tasks are an internal concept of Spark, and there is usually no need to directly manipulate tasks in application code. Developers are mainly concerned with defining transformation operations, building workflows, and optimizing data processing logic.

TaskPool

In Apache Spark, TaskPool (task pool) is a component used to manage and schedule tasks. It is part of the Spark task scheduler, responsible for maintaining the queue and execution status of tasks, and assigning tasks according to available resources and scheduling policies.

TaskPool has the following main functions and features:

  1. Task queue management: TaskPool maintains a task queue and manages the execution order of tasks according to first-in-first-out (FIFO) or other scheduling strategies.

  2. Resource management: TaskPool tracks available computing resources, such as CPU cores, memory, etc. It decides how many tasks to assign and how to execute them based on the availability of resources.

  3. Task scheduling: TaskPool is responsible for assigning tasks from the task queue to available executors (Executor) or computing nodes, and ensuring that tasks are executed according to the specified scheduling policy.

  4. Task Status Tracking: TaskPool tracks the execution status of each task, such as running, completed, failed, etc. It maintains the status information of the task and provides it to the task scheduler for further decision-making.

  5. Error handling and retry: TaskPool captures errors that may occur during task execution, and performs error handling and task retry according to the configured strategy.

TaskPool is an important component in the Spark task scheduler, which ensures that tasks are allocated and executed according to the specified scheduling strategy and resource management principles. It helps Spark efficiently manage the execution order of tasks, resource utilization, and error handling to achieve high performance and reliability of jobs.

It should be noted that TaskPool is an internal component of Spark, and usually does not need to be directly manipulated in application code. Spark provides flexible task scheduling and resource management configuration options, and developers can configure and optimize according to specific needs and cluster environments.
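
One concrete touch-point applications do have is choosing which scheduling pool their jobs go into. A sketch, assuming spark.scheduler.mode is set to FAIR and a pool named "production" is defined in fairscheduler.xml:

```scala
// Assumes spark.scheduler.mode=FAIR and a "production" pool defined in fairscheduler.xml.
sc.setLocalProperty("spark.scheduler.pool", "production")

// Jobs submitted from this thread are now queued in the "production" pool.
sc.parallelize(1 to 1000000).count()

// Reset to the default pool for subsequent jobs from this thread.
sc.setLocalProperty("spark.scheduler.pool", null)
```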

(5) Execution of tasks

[Figure: Task execution]

Task

Task has already been described above under (4) Scheduling of tasks.

Thread

Thread (Thread) is the smallest unit for executing tasks in a computer, and is the basic unit for task scheduling and execution by the operating system. In multithreaded programming, threads allow a program to execute multiple tasks simultaneously, with each task running in a separate thread.

Here are some key features and concepts of threads:

  1. Concurrent execution: Threads enable a program to execute multiple tasks concurrently, with each task running independently in its own thread. This can make full use of multi-core processors and system resources, improving the performance and responsiveness of the program.

  2. Context Switching: Context switching occurs when the operating system schedules threads. This means that the currently executing thread pauses, saves its state, and switches to another thread's execution.

  3. Shared resources: Process resources are shared between threads, such as memory, files, and network connections. This also requires consideration of thread safety in multithreaded programming to avoid data races and concurrent access issues.

  4. Synchronization and mutual exclusion: In a multi-threaded environment, shared data may be accessed and modified between threads. Mechanisms such as synchronization mechanisms and mutexes are used to ensure the order and consistency of data access among multiple threads to avoid race conditions and data inconsistencies.

  5. Thread scheduling: The thread scheduler is responsible for determining the execution order and priority of threads. The scheduler allocates CPU time slices to different threads according to different scheduling algorithms to achieve fairness and high efficiency.

Threads play an important role in programming, especially in the context of concurrency and parallel computing. Multi-threaded programming can improve program performance and resource utilization, but also need to pay attention to thread safety and synchronization mechanisms to avoid potential concurrency problems.

It should be noted that the way threads are created and managed depends on the programming language and platform used. In Java, threads can be created and managed using the Thread class or the Executor framework. In Apache Spark, thread management is handled by the Spark framework and the underlying execution engine, and developers usually do not need to directly manipulate threads.
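
A minimal Scala sketch of creating and joining threads directly (independent of Spark; requires Scala 2.12+ for the lambda-to-Runnable conversion):

```scala
object ThreadDemo {
  def main(args: Array[String]): Unit = {
    // Each Runnable runs concurrently in its own thread.
    val t1 = new Thread(() => println(s"hello from ${Thread.currentThread().getName}"))
    val t2 = new Thread(() => println(s"hello from ${Thread.currentThread().getName}"))

    t1.start(); t2.start()    // start both threads
    t1.join();  t2.join()     // wait for both to finish before exiting
  }
}
```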

ThreadPool

Thread Pool (ThreadPool) is a mechanism for managing and reusing threads, which is used to execute and manage multiple tasks. The thread pool can improve the performance and resource utilization of the program, and avoid the overhead of frequently creating and destroying threads.

Following are some key features and concepts of thread pools:

  1. Thread reuse: When the thread pool is initialized, a fixed number of threads will be created, and these threads can be reused to perform multiple tasks. When a task is completed, the thread will return to the thread pool and wait for a new task to be assigned instead of destroying the thread.

  2. Task queue: The thread pool maintains a task queue for storing tasks to be executed. When threads in the thread pool are idle, they fetch tasks from the task queue and execute them.

  3. Thread management: The thread pool is responsible for managing the life cycle of threads, including creating, allocating, executing, and reclaiming threads. It can also dynamically adjust the size of the thread pool as needed.

  4. Concurrency control: The thread pool manages system resources by controlling the number of concurrently executing threads. The maximum number of threads in the thread pool can be configured according to requirements to control the number of tasks executed at the same time.

  5. Thread pool strategy: The thread pool can adopt different scheduling strategies, such as fixed-size thread pool, cache thread pool, and work-stealing thread pool, etc., to adapt to different types of tasks and loads.

Thread pool is very commonly used in multi-threaded programming, it provides an efficient way to manage thread resources and concurrently executed tasks. Using the thread pool can reduce the overhead of thread creation and destruction, and improve the performance and responsiveness of the program.

Thread pools may be implemented and used differently in different programming languages ​​and platforms. For example, Java provides the ThreadPoolExecutor class to create and manage thread pools, and in Apache Spark, thread pools are also used to manage concurrently executed tasks.

It is necessary to select the appropriate thread pool configuration and strategy according to the specific application requirements and performance requirements to obtain the best performance and resource utilization. At the same time, it is necessary to pay attention to the synchronization mechanism of thread safety and shared resources to avoid the problem of concurrent access.
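
A minimal Scala sketch (Scala 2.12+) of a fixed-size thread pool using the JDK's java.util.concurrent, independent of Spark, illustrating thread reuse and the task queue described above:

```scala
import java.util.concurrent.{Executors, TimeUnit}

object ThreadPoolDemo {
  def main(args: Array[String]): Unit = {
    // A fixed pool of 4 worker threads; submitted tasks queue up and reuse these threads.
    val pool = Executors.newFixedThreadPool(4)

    (1 to 10).foreach { i =>
      pool.execute(() => println(s"task $i on ${Thread.currentThread().getName}"))
    }

    pool.shutdown()                              // stop accepting new tasks
    pool.awaitTermination(1, TimeUnit.MINUTES)   // wait for queued tasks to finish
  }
}
```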

Executor

Executor (Executor) is a component used to manage and execute tasks, usually used in conjunction with a thread pool. It provides a way to submit tasks to the thread pool for execution to achieve concurrent execution and resource management.

Here are some key features and concepts of Executor:

  1. Task submission: Executor provides an interface for submitting tasks to the thread pool for execution. By submitting tasks to Executor, tasks can be executed asynchronously without manually creating and managing threads.

  2. Resource management: Executor is responsible for managing available thread resources, and schedules and allocates threads according to the number and priority of tasks. It controls the number of tasks executed concurrently to avoid resource exhaustion or overloading.

  3. Thread scheduling: Executor uses a scheduling algorithm to determine the execution order and priority of tasks. It assigns tasks to available threads according to different scheduling policies to achieve fairness and high efficiency.

  4. Asynchronous execution: Through Executor, asynchronous execution of tasks can be realized. After the task is submitted, you can continue to perform other operations without waiting for the task to complete.

  5. Task status and results: Executor tracks the execution status of each task and provides a way to obtain task results. It can return the execution result of the task through mechanisms such as Future or CompletionService.

Executor plays an important role in multi-threaded programming, it provides a way to simplify task management and concurrent execution. By using Executor, the execution of tasks can be decoupled from the creation and management of threads, so as to achieve more efficient concurrent programming.

It should be noted that the specific implementation and usage of Executor may vary by programming language and platform. For example, in Java, you can use the ThreadPoolExecutor provided by the Executor framework to create and manage thread pools, and submit tasks through the ExecutorService interface. In Apache Spark, Executor is also used to manage task execution and resource allocation.

Selecting an appropriate Executor configuration and scheduling strategy, and adjusting it according to specific application and performance requirements can achieve high-performance concurrent execution and task management. At the same time, it is also necessary to pay attention to thread safety and synchronization of shared resources to avoid problems of concurrent access.
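
A hedged Scala sketch of the Executor pattern using the JDK's ExecutorService and Future (pool size and task bodies are illustrative):

```scala
import java.util.concurrent.{Callable, Executors}

object ExecutorDemo {
  def main(args: Array[String]): Unit = {
    val executor = Executors.newFixedThreadPool(3)

    // submit() returns a Future immediately; the task runs asynchronously on a pool thread.
    val futures = (1 to 5).map { i =>
      executor.submit(new Callable[Int] { def call(): Int = i * i })
    }

    // get() blocks until the corresponding task has completed, then returns its result.
    futures.foreach(f => println(f.get()))   // 1, 4, 9, 16, 25

    executor.shutdown()
  }
}
```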

Chapter 4 Spark Task Scheduling Mechanism

  • In the production environment, the deployment mode of the Spark cluster is generally the YARN-Cluster mode. In the subsequent kernel analysis content, our default cluster deployment mode is the YARN-Cluster mode. In the last chapter, we explained the task submission process in Spark YARN-Cluster mode, but we did not specify the workflow of the Driver. The Driver thread mainly initializes the SparkContext object, prepares the context required for running, and then maintains the RPC connection with the ApplicationMaster to apply for resources through the ApplicationMaster.
  • When the ResourceManager returns the Container resource to the ApplicationMaster, the ApplicationMaster tries to start the Executor process on the corresponding Container. After the Executor process starts, it will reversely register with the Driver. After the registration is successful, it will maintain a heartbeat with the Driver and wait for the Driver to distribute the task. When the distributed task is executed, it will report the task status to the Driver.

4.1 Overview of Spark task scheduling

Once the Driver starts running, it prepares tasks according to the user program logic and gradually distributes them as Executor resources become available. Before elaborating on task scheduling in detail, let us first clarify a few concepts in Spark. A Spark application involves three concepts: Job, Stage and Task:

  1. A Job is bounded by an Action method: a Job is triggered whenever an Action method is encountered;
  2. A Stage is a subset of a Job, bounded by RDD wide dependencies (i.e., Shuffle): a new Stage is created whenever a Shuffle is encountered;
  3. A Task is a subset of a Stage, measured by the degree of parallelism (the number of partitions): the number of partitions determines the number of Tasks.

Spark's task scheduling is divided into two levels: Stage-level scheduling and Task-level scheduling. The overall scheduling process is shown in the following figure:

[Figure: Overall Spark task scheduling process]
Spark RDDs form the RDD lineage (dependency) graph, i.e. the DAG, through Transformation operations. Finally, a Job is triggered by calling an Action and is scheduled for execution. During execution, two schedulers are created: DAGScheduler and TaskScheduler.

➢ DAGScheduler is responsible for Stage-level scheduling, mainly dividing the job into several Stages, and packaging each Stage into a TaskSet and handing it over to TaskScheduler for scheduling.

➢ TaskScheduler is responsible for Task-level scheduling, and distributes the TaskSets given by DAGScheduler to Executors for execution according to the specified scheduling policy. During the scheduling process, SchedulerBackend is responsible for providing available resources. There are multiple implementations of SchedulerBackend, which connect to different resource management systems.

[Figure: SparkContext initializing DAGScheduler, TaskScheduler, SchedulerBackend and HeartbeatReceiver]

  • When the Driver initializes the SparkContext, it will initialize DAGScheduler, TaskScheduler, SchedulerBackend and HeartbeatReceiver respectively, and start SchedulerBackend and HeartbeatReceiver. SchedulerBackend applies for resources through ApplicationMaster, and continuously obtains appropriate Tasks from TaskScheduler and distributes them to Executor for execution. HeartbeatReceiver is responsible for receiving the heartbeat information of Executor, monitoring the survival status of Executor, and notifying TaskScheduler.

4.2 Spark Stage Scheduling

Spark's task scheduling starts with DAG cutting, which is mainly done by DAGScheduler. When encountering an Action operation, it will trigger the calculation of a Job and hand it over to DAGScheduler for submission. The following figure is a flow chart of related method calls related to Job submission.

[Figure: Method-call flow of Job submission]

  1. Job is encapsulated by the final RDD and Action method;

  2. SparkContext submits the Job to DAGScheduler, which divides the Job into several Stages according to the DAG formed by the RDDs' lineage. The concrete division strategy is to backtrack from the final RDD along its dependencies and check whether each parent dependency is a wide dependency: Stages are divided with Shuffle as the boundary, while narrowly dependent RDDs are placed in the same Stage so that they can be computed as a pipeline. The divided Stages fall into two categories: one is called ResultStage, the most downstream Stage of the DAG, determined by the Action method; the other is called ShuffleMapStage, which prepares data for its downstream Stages. Let's look at a simple WordCount example.

[Figure: WordCount stage division]

  • The Job is triggered by saveAsTextFile and consists of RDD-3 and the saveAsTextFile method. Following the dependencies between the RDDs, the backtracking search starts from RDD-3 and continues until RDD-0, which depends on nothing. During the backtracking, RDD-3's dependency on RDD-2 is a wide dependency, so a Stage boundary is drawn between RDD-2 and RDD-3, and RDD-3 is assigned to the last Stage, the ResultStage. RDD-2's dependency on RDD-1 and RDD-1's dependency on RDD-0 are both narrow dependencies, so RDD-0, RDD-1 and RDD-2 are placed in the same Stage, the ShuffleMapStage, forming a pipeline in which, during actual execution, each data record is transformed from RDD-0 through RDD-2 in one pass. It is not hard to see that this is essentially a Depth-First Search (DFS) algorithm. The code for this example is sketched below.
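
A sketch of the WordCount in the figure (spark-shell assumed; the HDFS paths are illustrative); the comments map each RDD to the RDD-0 through RDD-3 labels used above:

```scala
// RDD-0: lines read from the input file
val rdd0 = sc.textFile("hdfs:///wordcount/input")     // illustrative path

// RDD-1: individual words (narrow dependency on RDD-0)
val rdd1 = rdd0.flatMap(_.split(" "))

// RDD-2: (word, 1) pairs (narrow dependency on RDD-1) -- last RDD of the ShuffleMapStage
val rdd2 = rdd1.map((_, 1))

// RDD-3: counts after the shuffle (wide dependency on RDD-2) -- computed in the ResultStage
val rdd3 = rdd2.reduceByKey(_ + _)

// The action triggers the Job: Stage 0 covers RDD-0..RDD-2, Stage 1 covers RDD-3 and the save.
rdd3.saveAsTextFile("hdfs:///wordcount/output")       // illustrative path
```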

  • Whether a Stage can be submitted depends on whether its parent Stages have finished executing: a Stage can only be submitted after all of its parent Stages have completed, and if a Stage has no parent Stage, submission starts from that Stage. When a Stage is submitted, the Task information (partition information, the function to execute, etc.) is serialized and packaged into a TaskSet and handed to the TaskScheduler; one partition corresponds to one Task. Meanwhile, the TaskScheduler monitors the running status of the Stage: only when an Executor is lost or a Task fails due to a Fetch failure does the failed Stage need to be resubmitted to rerun the failed Tasks; other types of Task failures are retried during the TaskScheduler's scheduling process.

  • Relatively speaking, what DAGScheduler does is relatively simple. It only divides DAG at the stage level, submits
    the stage and monitors relevant status information. TaskScheduler is relatively more complicated, and its details are elaborated below.

4.3 Spark Task-level Scheduling

  • The scheduling of Spark Tasks is completed by the TaskScheduler. As we saw above, the DAGScheduler packages each Stage into a TaskSet and hands it to the TaskScheduler, and the TaskScheduler encapsulates the TaskSet as a TaskSetManager and adds it to the scheduling queue. The structure of TaskSetManager is shown in the figure below.
    [Figure: TaskSetManager structure]

TaskSetManager is responsible for monitoring and managing Tasks in the same Stage, and TaskScheduler uses TaskSetManager as a unit to schedule tasks.

  • As mentioned earlier, the SchedulerBackend is started after the TaskScheduler is initialized. It is responsible for dealing with the outside world: it receives Executor registration information and maintains Executor status, so the SchedulerBackend is in charge of the "food" (the resources). After it starts, it also periodically "asks" the TaskScheduler whether there are any tasks to run; when the TaskScheduler is "asked" by the SchedulerBackend, it selects a TaskSetManager from the scheduling queue according to the specified scheduling strategy and schedules it to run. The general method-call flow is shown in the following figure:
    [Figure: Task scheduling method-call flow]

  • In the figure above, after the TaskSetManager is added to the rootPool scheduling pool, the reviveOffers method of SchedulerBackend is called to send a ReviveOffers message to the driverEndpoint. On receiving the ReviveOffers message, the driverEndpoint calls the makeOffers method to select the active Executors (the Executors that registered with the Driver when the application started) and encapsulates each of them into a WorkerOffer object. Once the computing resources (WorkerOffers) are prepared, the TaskScheduler calls resourceOffers on these resources to assign Tasks to the Executors.

4.3.1 Scheduling strategy

TaskScheduler supports two scheduling strategies, one is FIFO, which is also the default scheduling strategy, and the other is FAIR. In the initialization process of TaskScheduler, rootPool will be instantiated, representing the root node of the tree, which is of Pool type.

  1. FIFO Scheduling Strategy
    If the FIFO scheduling strategy is adopted, TaskSetManagers are simply placed in the queue on a first-come-first-served basis, and the TaskSetManager that entered the queue earliest is taken out first. The tree structure is shown in the figure below; the TaskSetManagers are stored in a FIFO queue.
    [Figure: FIFO scheduling tree structure]
  2. FAIR Scheduling Policy
    The tree structure of the FAIR scheduling policy is shown in the following figure:
    [Figure: FAIR scheduling tree structure]
  • In FAIR mode, there is one rootPool and multiple sub-Pools, and each sub-Pool stores all the TaskSetManagers waiting to be allocated.
  • In FAIR mode, the sub-Pools are sorted first, and then the TaskSetManagers inside each sub-Pool are sorted. Because both Pool and TaskSetManager inherit the Schedulable trait, the same sorting algorithm is used for both. The comparison during sorting is based on Fair-share: each object to be sorted carries three attributes, the runningTasks value (the number of running Tasks), the minShare value, and the weight value, and all three are considered together during comparison. Note that the minShare and weight values are specified in the fair scheduling configuration file fairscheduler.xml, which the scheduling pool reads during its construction phase.
  1. If object A's runningTasks is greater than its minShare while object B's runningTasks is less than its minShare, then B is ranked before A (objects whose runningTasks is below their minShare are executed first).
  2. If the runningTasks of both A and B are smaller than their minShare, then the ratio of runningTasks to minShare (the minShare usage rate) is compared, and the one with the smaller ratio is ranked first (the one with the lower minShare usage rate is executed first).
  3. If the runningTasks of both A and B are greater than their minShare, then the ratio of runningTasks to weight (the weight usage rate) is compared, and the one with the smaller ratio is ranked first (the one with the lower weight usage rate is executed first).
  4. If all of the above comparisons are equal, the names are compared.

Generally speaking, the comparison process is controlled by the two parameters minShare and weight, so that objects with lower minShare usage and lower weight usage (i.e. a smaller proportion of actually running tasks) run first; a simplified sketch of this comparison logic is given below. After the sorting in FAIR mode is completed, all TaskSetManagers are put into an ArrayBuffer and then taken out one by one and sent to the Executors for execution. Once a TaskSetManager is taken from the scheduling queue, since it encapsulates all the Tasks of a Stage and is responsible for managing and scheduling them, the next step is for the TaskSetManager to hand the Tasks one by one, according to certain rules, to the TaskScheduler, which in turn passes them to the SchedulerBackend to be sent to the Executors for execution.
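
A simplified Scala sketch of this comparison logic, modeled loosely on Spark's internal FairSchedulingAlgorithm; the Schedulable case class and its fields here are illustrative simplifications, not Spark's exact internals:

```scala
// Simplified model of a schedulable entity (a Pool or a TaskSetManager).
final case class Schedulable(name: String, runningTasks: Int, minShare: Int, weight: Int)

// Returns true if `a` should be scheduled before `b`, following rules 1-4 above.
def fairBefore(a: Schedulable, b: Schedulable): Boolean = {
  val aNeedy = a.runningTasks < a.minShare
  val bNeedy = b.runningTasks < b.minShare
  val aMinShareRatio = a.runningTasks.toDouble / math.max(a.minShare, 1)
  val bMinShareRatio = b.runningTasks.toDouble / math.max(b.minShare, 1)
  val aWeightRatio   = a.runningTasks.toDouble / a.weight
  val bWeightRatio   = b.runningTasks.toDouble / b.weight

  if (aNeedy && !bNeedy) true               // rule 1: below minShare runs first
  else if (!aNeedy && bNeedy) false
  else {
    val (ra, rb) =
      if (aNeedy && bNeedy) (aMinShareRatio, bMinShareRatio)    // rule 2: lower minShare usage first
      else (aWeightRatio, bWeightRatio)                         // rule 3: lower weight usage first
    if (ra != rb) ra < rb else a.name < b.name                  // rule 4: tie-break on the name
  }
}

// Example: sort a queue of pools / TaskSetManagers by this ordering.
// queue.sortWith(fairBefore)
```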

4.3.2 Localized Scheduling

  • The DAGScheduler cuts the Job, divides the Stages, and submits the Tasks corresponding to a Stage by calling submitStage, which in turn calls submitMissingTasks. submitMissingTasks determines the preferred locations (preferredLocations) of each Task that needs to be computed by calling getPreferredLocations() to obtain the preferred location of each partition. Since a partition corresponds to a Task, the preferred location of the partition is the preferred location of the Task. For each Task in the TaskSet submitted to the TaskScheduler, the Task's preferred location is consistent with the preferred location of its corresponding partition.
  • After a TaskSetManager is taken from the scheduling queue, the next step is for the TaskSetManager to hand out its Tasks one by one, according to certain rules, to the TaskScheduler, which then passes them to the SchedulerBackend to be sent to the Executors for execution. As mentioned earlier, the TaskSetManager encapsulates all the Tasks of a Stage and is responsible for managing and scheduling them. The Locality level of each Task is determined from its preferred location. There are five Locality levels; from highest to lowest priority they are:
    PROCESS_LOCAL (data is in the same Executor process as the Task), NODE_LOCAL (data is on the same node), NO_PREF (the Task has no locality preference, e.g. data read from an external database), RACK_LOCAL (data is on the same rack), ANY (data may be anywhere in the cluster).
  • During scheduling, Spark always tries to launch each task at the highest possible locality level. If a task is to be launched at locality level X but all the nodes corresponding to that level have no free resources, Spark does not immediately fall back to a lower level; instead it keeps trying to launch the task at level X within a certain time limit. Only when that time limit is exceeded does it downgrade and try the next locality level, and so on. By increasing the maximum tolerable delay for each level, the corresponding Executor may free up resources during the waiting phase to execute the task, which improves overall performance to a certain extent. These wait times are configurable, as sketched below.
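
A hedged sketch of these settings (the values shown are the commonly cited defaults; check the documentation of your Spark version):

```scala
import org.apache.spark.SparkConf

// Illustrative settings: how long to wait at each locality level before falling back.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "3s")            // base wait applied to each level
  .set("spark.locality.wait.process", "3s")    // wait before giving up PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")       // wait before giving up NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")       // wait before giving up RACK_LOCAL
```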

4.3.3 Failure retry and blacklist mechanism

  • Besides selecting appropriate Tasks to schedule and run, the execution status of Tasks also needs to be monitored. As mentioned earlier, it is the SchedulerBackend that deals with the outside world: after a Task is submitted to an Executor and starts executing, the Executor reports the execution status to the SchedulerBackend, the SchedulerBackend tells the TaskScheduler, and the TaskScheduler finds the TaskSetManager corresponding to the Task and notifies it. The TaskSetManager thus learns about the failure or success of its Tasks. For a failed Task, it records the number of times the Task has failed; if that number has not yet exceeded the maximum number of retries, the Task is put back into the pool of tasks to be scheduled, otherwise the entire Application fails. While recording the number of failures, it also records the Executor Id and Host on which the Task last failed, so that the next time the Task is scheduled, the blacklist mechanism prevents it from being scheduled onto the node where it last failed, which provides a degree of fault tolerance. The blacklist records the Executor Id and Host of the Task's last failure together with the corresponding "blackout" time; the "blackout" time means that the Task should not be scheduled on that node during this period. The relevant configuration is sketched below.
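
A hedged sketch of the relevant settings (property names vary across Spark versions; newer releases rename the blacklist settings to the spark.excludeOnFailure.* family):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fault-tolerance-demo")
  .set("spark.task.maxFailures", "4")        // individual Task failures tolerated before the job is aborted
  .set("spark.blacklist.enabled", "true")    // exclude executors/hosts where Tasks repeatedly fail
  .set("spark.blacklist.timeout", "1h")      // how long an excluded executor/host stays excluded
```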

Origin blog.csdn.net/weixin_43554580/article/details/131698513