Spark submission and execution process

Spark learning summary (Spark running process)
RDD creation methods
1. From a collection: sc.parallelize (the number of partitions can be specified) or sc.makeRDD
2. From a file at a specified path: sc.textFile
3. From HDFS: sc.hadoopFile("...")
4. From an existing RDD through transformations (a short sketch follows this list)
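A minimal sketch of these four methods as entered in the spark-shell, assuming a SparkContext named sc; the HDFS paths are placeholders:

// 1. From a collection; the optional second argument is the number of partitions
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4), 2)
val rdd2 = sc.makeRDD(Seq(1, 2, 3, 4))
// 2. From a file at a specified path (local or HDFS)
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
// 3. From HDFS through a Hadoop InputFormat
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://namenode:9000/data/input.txt")
// 4. From an existing RDD through a transformation
val doubled = rdd1.map(_ * 2)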
Spark program submission process
1. The DAGScheduler determines the preferred locations for running tasks and passes this information down to the lower-level TaskScheduler; it also handles Task failures caused by shuffle data loss.
2. The TaskScheduler allocates Tasks appropriately and maintains Task status.

RDD operation
1. RDD objects are created.
2. The DAGScheduler module steps in and computes the dependencies between RDDs to form the DAG.
3. Each job is divided into multiple stages (a job is split into multiple groups of Tasks; each group is a stage. Stages come in two kinds, shuffle and result: the transformations before a shuffle form one stage, and the operations after the shuffle form another; see the sketch after the standalone steps below).

Spark execution process
1. Standalone mode
1). SparkContext connects to the Master, registers, and applies for resources;
2). The Master decides on which Worker to allocate resources according to SparkContext's resource request and the information reported in each Worker's heartbeat cycle, allocates the resources on that Worker, and then starts StandaloneExecutorBackend;
3). StandaloneExecutorBackend registers with SparkContext;
4). SparkContext sends the application code to StandaloneExecutorBackend; meanwhile, SparkContext parses the application, builds the DAG graph, and submits it to the DAGScheduler, which decomposes it into Stages (when an Action operation is encountered, a Job is spawned; each Job contains one or more Stages, and Stages are generally generated before external data is acquired and before a shuffle). The Stages (as TaskSets) are then submitted to the TaskScheduler, which assigns each Task to a corresponding Worker and finally hands it to StandaloneExecutorBackend for execution;
5). StandaloneExecutorBackend builds an Executor thread pool, starts executing Tasks, and reports to SparkContext until the Tasks are completed;
6). After all Tasks are completed, SparkContext deregisters from the Master and releases the resources.
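To make the standalone flow and the stage division above concrete, here is a minimal sketch of a self-contained application; the Master URL is a placeholder, and the shuffle introduced by reduceByKey splits the job into a shuffle (map-side) stage and a result stage:

import org.apache.spark.{SparkConf, SparkContext}

object StandaloneDemo {
  def main(args: Array[String]): Unit = {
    // Connects and registers with the standalone Master (steps 1-2); the URL is a placeholder
    val conf = new SparkConf().setAppName("StandaloneDemo").setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq("a", "b", "a", "c")).map(w => (w, 1))
    val counts = pairs.reduceByKey(_ + _) // shuffle boundary: everything before it is one stage
    counts.collect()                      // the Action spawns a Job with two stages
    println(counts.toDebugString)         // lineage dump; shuffle dependencies mark the stage split

    sc.stop() // SparkContext deregisters from the Master and releases resources (step 6)
  }
}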
2. Delegating execution to another resource scheduling framework (on YARN)
Client mode
1). The Spark Yarn Client applies to YARN's ResourceManager to start the ApplicationMaster. At the same time, the DAGScheduler and TaskScheduler are created during SparkContext initialization; since Yarn-Client mode is chosen, the program selects YarnClientClusterScheduler and YarnClientSchedulerBackend;
2). After the ResourceManager receives the request, it selects a NodeManager in the cluster, allocates the first Container to the application, and asks it to start the application's ApplicationMaster in this Container. The difference from YARN-Cluster is that this ApplicationMaster does not run a SparkContext; it only communicates with the SparkContext for resource allocation;
3). After the SparkContext in the Client finishes initializing, it establishes communication with the ApplicationMaster, registers with the ResourceManager, and applies to the ResourceManager for resources (Containers) according to the task information;
4). Once the ApplicationMaster has obtained the resources (i.e. the Containers), it communicates with the corresponding NodeManagers and asks them to start CoarseGrainedExecutorBackend in the obtained Containers. After CoarseGrainedExecutorBackend starts, it registers with the SparkContext in the Client and applies for Tasks;
5). The SparkContext in the Client assigns Tasks to CoarseGrainedExecutorBackend for execution. CoarseGrainedExecutorBackend runs the Tasks and reports their status and progress to the Driver, so that the Client can track the running state of each task at any time and restart a task when it fails;
6). After the application finishes running, the Client's SparkContext applies to the ResourceManager to deregister and shuts itself down.
Cluster mode
When a user submits an application to YARN, YARN runs the application in two stages: in the first stage, the Spark Driver is started as an ApplicationMaster in the YARN cluster; in the second stage, the ApplicationMaster creates the application, requests resources from the ResourceManager for it, and starts Executors to run the Tasks, while monitoring the entire run until it completes.
1). The Spark Yarn Client submits the application to YARN, including the ApplicationMaster program, the command to start the ApplicationMaster, the program to be run in the Executors, and so on;
2). After the ResourceManager receives the request, it selects a NodeManager in the cluster, allocates the first Container to the application, and asks it to start the application's ApplicationMaster in this Container; this ApplicationMaster performs the SparkContext initialization, among other things;
3). The ApplicationMaster registers with the ResourceManager, so that the user can view the application's running status directly through the ResourceManager; it then uses polling over the RPC protocol to apply for resources for the tasks and monitors their running status until the run ends;
4). Once the ApplicationMaster has obtained the resources (i.e. the Containers), it communicates with the corresponding NodeManagers and asks them to start CoarseGrainedExecutorBackend in the obtained Containers. After CoarseGrainedExecutorBackend starts, it registers with the SparkContext in the ApplicationMaster and applies for Tasks. This is the same as in Standalone mode, except that when the SparkContext in the Spark application is initialized, CoarseGrainedSchedulerBackend works together with YarnClusterScheduler to schedule tasks; YarnClusterScheduler is just a thin wrapper around TaskSchedulerImpl that adds logic such as waiting for Executors;
5). The SparkContext in the ApplicationMaster assigns Tasks to CoarseGrainedExecutorBackend for execution. CoarseGrainedExecutorBackend runs the Tasks and reports their status and progress to the ApplicationMaster, so that the ApplicationMaster can keep track of the running state of each task and restart a task when it fails;
6). After the application finishes running, the ApplicationMaster applies to the ResourceManager to deregister and shuts itself down.
The difference between client and cluster
- In YARN-Cluster mode, the Driver runs in the AM (ApplicationMaster), which is responsible for applying to YARN for resources and supervising the job's running status. After the user submits the job, the Client can be shut down and the job continues to run on YARN, so YARN-Cluster mode is not suitable for interactive jobs;
- In YARN-Client mode, the ApplicationMaster only requests Executors from YARN, and the Client communicates with the requested Containers to schedule their work, which means the Client cannot exit.
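As a concrete illustration (a sketch; the class name, jar path, and resource flags are placeholder values), the two modes differ only in the --deploy-mode flag passed to spark-submit:

# YARN-Client: the Driver runs in the local Client process, which must stay alive
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp --num-executors 4 --executor-memory 2g \
  /path/to/my-app.jar

# YARN-Cluster: the Driver runs inside the ApplicationMaster on the cluster,
# so the Client may exit once the job is submitted
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp --num-executors 4 --executor-memory 2g \
  /path/to/my-app.jar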
