Spark on YARN: the Client and Cluster deployment modes, their workflows, and basic concepts

 

Table of Contents

1. Introduction to the YARN model

(1) YARN model diagram

(2) YARN workflow

2. Task submission flow in Cluster mode

(1) Flowchart

(2) How it works

3. Task submission flow in Client mode

(1) Flowchart in Client mode

(2) How it works

4. Spark task scheduling



1. Introduction to the YARN model

(1) YARN model diagram

(figure: YARN model diagram)

(2) The YARN workflow is as follows:

Submitting an application to YARN can be broken down into the following steps (a minimal client-side sketch follows the list):

(1) The user submits an application to the ResourceManager (RM) through the client.

(2) After receiving the new application, the RM first selects a container in which to start the application-specific ApplicationMaster (AM).

(3) After the AM starts, it requests from the RM the resources the application needs to run.

(4) The RM allocates containers to the AM as resources allow, each identified by a container ID and a host name.

(5) Using the given container IDs and host names, the AM asks the corresponding NodeManagers to use those resources to start the application's tasks.

(6) The NodeManagers start the tasks and monitor their health and resource usage.

(7) The AM continuously monitors task execution.

(8) When all tasks have finished, the AM reports completion to the RM, releases the containers used to run the tasks, and deregisters itself.
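For illustration, steps (1) and (2) look roughly like the following from the client side, using Hadoop's YarnClient API. This is a minimal sketch only: the AM's launch context and resource request are elided, and the application name is a placeholder.

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // Step (1): the client asks the RM for a new application
    // and fills in its submission context.
    val app = yarnClient.createApplication()
    val ctx = app.getApplicationSubmissionContext
    ctx.setApplicationName("demo-app") // placeholder name
    // ... the AM's ContainerLaunchContext and resource request go here ...

    // Step (2): the RM accepts the application and picks a container
    // in which to start the AM.
    val appId = yarnClient.submitApplication(ctx)
    println(s"Submitted application $appId")

    yarnClient.stop()
  }
}
```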

 

2. Task submission flow in Cluster mode

(1) The flowchart is as follows

(figure: Cluster-mode task submission flowchart)

(2) How it works:


1. When the cluster nodes start, the Master (ResourceManager) and Worker (NodeManager) processes start; once a Worker process has started successfully, it registers with the Master.
2. After the client submits a task, the Master tells a Worker node to start the Driver (ApplicationMaster) process. (The choice of Worker is arbitrary, as long as it has sufficient resources.)
3. Once the Driver process has started successfully, it registers back with the Master.
4. The Master tells the Workers to start Executor processes.
5. Once an Executor process has started successfully, it registers with the Driver.
6. The Driver splits the job into stages, divides each stage further, encapsulates each pipeline of operations into a task, and sends the tasks to the Executor processes registered with it, where they run in task threads.
7. After all tasks have finished, the program ends.

From the description above we know: the Master (ResourceManager) is responsible for resource management across the whole cluster and for creating the Workers (NodeManagers). Each Worker manages the resources of its own node, periodically reporting its CPU, memory, and other information to the Master, and is responsible for creating the Executor processes (the smallest units of resource allocation). The Driver (ApplicationMaster) is responsible for splitting the application's job into stages, dividing and optimizing the stages into tasks, distributing the tasks to the Executor processes on the corresponding Worker nodes to run in task threads, and collecting the task results. The Driver communicates with the Spark cluster through a SparkContext object: it obtains the Master's host and registers itself with the Master over RPC.
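In cluster mode the application's main() runs inside the ApplicationMaster on a cluster node, so the deploy mode is chosen at submit time (for example with spark-submit --master yarn --deploy-mode cluster). A minimal sketch of such a driver program, with placeholder names:

```scala
import org.apache.spark.sql.SparkSession

object ClusterModeApp {
  def main(args: Array[String]): Unit = {
    // In cluster mode this main() executes inside the ApplicationMaster
    // container on a worker node, not on the client machine.
    val spark = SparkSession.builder()
      .appName("cluster-mode-demo") // placeholder name
      .getOrCreate()                // master/deploy mode come from spark-submit

    // ... job code goes here; tasks are shipped to the Executors ...

    spark.stop()
  }
}
```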

The Master of a Spark standalone cluster corresponds to YARN's ResourceManager: it manages the resources of the entire cluster and creates the Workers.

The Worker of a Spark standalone cluster corresponds to YARN's NodeManager: it manages the resources (CPU, memory) of its own node.

Each application has exactly one Driver; on YARN this role is played by the ApplicationMaster.

Each action (as opposed to a transformation) triggers a job.

Each wide (shuffle) dependency introduces a new stage.

executors × cores per executor = the number of tasks that can run concurrently; each task corresponds to one partition.
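For example (the numbers and the path are hypothetical): with 4 executors of 2 cores each, up to 8 tasks run concurrently, and an RDD with 8 partitions produces 8 tasks per stage:

```scala
import org.apache.spark.sql.SparkSession

object TaskCountDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("task-count-demo").getOrCreate()

    // Ask for ~8 partitions; each partition becomes exactly one task per stage.
    val rdd = spark.sparkContext.textFile("hdfs:///path/to/input", 8) // placeholder path
    println(rdd.getNumPartitions) // the number of tasks launched per stage over this RDD

    spark.stop()
  }
}
```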

3. Task submission flow in Client mode: the Driver process starts on the client, and the client process stays alive until the application finishes running.

(1) The flowchart in Client mode

(figure: Client-mode task submission flowchart)

 

(2) How it works:


1. Start the Master and the Workers. The Master is responsible for resource management of the entire cluster; each Worker monitors its own CPU and memory and reports them to the Master periodically.
2. The Driver process starts on the client (as sketched below) and registers with the Master.
3. The Master communicates with the Workers over RPC, telling them to start one or more Executor processes.
4. The Executor processes register with the Driver, reporting their own information, including which host they run on.
5. The Driver splits the job into stages, divides each stage further, encapsulates each pipeline of operations into a task, and sends the tasks to the Executor processes registered with it, where they run in task threads.
6. When the application finishes executing, the Driver process exits.
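A minimal sketch of a client-mode application (the name is a placeholder): with master set to "yarn" and no deploy mode given, Spark defaults to client mode, so the Driver is the local JVM itself:

```scala
import org.apache.spark.sql.SparkSession

object ClientModeApp {
  def main(args: Array[String]): Unit = {
    // In client mode this JVM *is* the Driver; it stays alive until
    // the application finishes (step 6 above).
    val spark = SparkSession.builder()
      .appName("client-mode-demo") // placeholder name
      .master("yarn")              // no deploy mode given -> defaults to client mode
      .getOrCreate()

    // ... job code; the Executors register back to this local Driver ...

    spark.stop() // the Driver process exits once the app completes
  }
}
```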

 

 

4. Spark task scheduling

 

  The dependencies between RDDs form a directed acyclic graph (DAG). DAGScheduler splits this DAG into stages by a very simple rule: traversing backward from the final RDD, a narrow dependency is added to the current stage, while a wide dependency marks a stage boundary and starts a new stage. Once stage division is complete, DAGScheduler generates a TaskSet for each stage and submits it to TaskScheduler. TaskScheduler is responsible for the concrete task scheduling and finally launches the tasks on the Worker nodes.
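A word-count sketch makes the stage cut concrete (the input path is a placeholder): flatMap and map are narrow dependencies and stay pipelined in one stage, while reduceByKey needs a shuffle, so DAGScheduler starts a new stage there:

```scala
import org.apache.spark.sql.SparkSession

object StageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-demo").getOrCreate()

    val counts = spark.sparkContext
      .textFile("hdfs:///path/to/input") // placeholder path
      .flatMap(_.split(" "))             // narrow dependency: stage 0
      .map(word => (word, 1))            // narrow dependency: still stage 0
      .reduceByKey(_ + _)                // wide dependency: shuffle -> stage 1

    counts.collect() // the action submits the job; DAGScheduler builds the stages
    spark.stop()
  }
}
```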


Source: blog.csdn.net/xuehuagongzi000/article/details/103072841