Spark architecture

Standalone mode is Spark's built-in cluster deployment mode. It does not depend on an external resource scheduling framework (such as YARN) and is easy to deploy.

Standalone mode is further divided into client mode and cluster mode; the essential difference is where the Driver runs. If the Driver runs inside the SparkSubmit process, it is client mode; if the Driver runs inside the cluster (on a Worker), it is cluster mode.

Standalone client mode

Standalone cluster mode
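As a minimal sketch of how the two Standalone modes are selected at submission time (the master URL, application class, and jar name below are placeholders):

    # Standalone client mode: the Driver runs inside this SparkSubmit process
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --class com.example.WordCount \
      wordcount.jar

    # Standalone cluster mode: the Driver is launched on a Worker inside the cluster
    spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --class com.example.WordCount \
      wordcount.jar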

Spark on YARN cluster mode
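The corresponding YARN submission sketch (same placeholder class and jar). On YARN there are no Spark Master or Worker processes; in cluster mode the Driver runs inside the ApplicationMaster:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.WordCount \
      wordcount.jar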

Introduction to the Spark execution process

  • Job: Each action operation on an RDD generates a job, which consists of one or more scheduling stages (see the sketch after this list).
  • Scheduling stage (Stage): Each job is divided into ShuffleMapStages and a final ResultStage according to the dependency relationships, with a cut at each Shuffle. Each Stage corresponds to one TaskSet; a TaskSet contains multiple Tasks, and the number of Tasks equals the number of partitions of the last RDD in that Stage.
  • Task: The unit of work dispatched to an Executor; it is the smallest execution unit in Spark.
  • DAGScheduler: Divides the DAG into Stages at wide (Shuffle) dependencies, converts each Stage into a TaskSet, and submits the TaskSets to the TaskScheduler.
  • TaskScheduler: Schedules each Task to an Executor process on a Worker, where it is put into the Executor's thread pool for execution.
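To make this concrete, here is a minimal runnable sketch (the application name and data are illustrative). The reduceByKey below requires a Shuffle, so the single job triggered by collect() is cut into a ShuffleMapStage and a ResultStage, each executed as a TaskSet with one Task per partition:

    import org.apache.spark.{SparkConf, SparkContext}

    object StageDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StageDemo").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // 2 partitions => each Stage runs as a TaskSet of 2 Tasks
        val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)

        // reduceByKey introduces a Shuffle: everything before it becomes a
        // ShuffleMapStage, everything after it the final ResultStage
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

        // collect() is an action, so it triggers exactly one job; the
        // DAGScheduler cuts the job into the two Stages above and hands
        // each Stage's TaskSet to the TaskScheduler
        println(counts.collect().toSeq)

        sc.stop()
      }
    }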

Important roles in Spark

  • Master: A Java process that receives Workers' registration information and heartbeats, removes Workers whose heartbeats time out, receives applications submitted by clients, is responsible for resource scheduling, and instructs Workers to start Executors (see the launch sketch after this list).
  • Worker: A Java process responsible for managing the resources of its node; it registers with the Master and sends heartbeats periodically, starts Executors, and monitors their status.
  • SparkSubmit: A Java process responsible for submitting applications to the Master.
  • Driver: A general term covering several classes; you can roughly regard the SparkContext as the Driver. In client mode the Driver runs inside the SparkSubmit process; in cluster mode it runs as a separate process inside the cluster. It converts the user's code into Tasks, dispatches them to Executors for execution, and monitors the Tasks' status and progress.
  • Executor: A Java process responsible for executing the Tasks generated by the Driver, running each Task in a thread from its thread pool.
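As referenced in the Master entry above, a minimal sketch of bringing these processes up in Standalone mode (master-host is a placeholder; in releases before Spark 3.1 the Worker script is named start-slave.sh):

    # Start the Master (a Java process) on the current machine
    $SPARK_HOME/sbin/start-master.sh

    # Start a Worker (a Java process); it registers with the Master
    # and begins sending heartbeats
    $SPARK_HOME/sbin/start-worker.sh spark://master-host:7077

Running jps afterwards shows the Master and Worker processes; Executor processes (named CoarseGrainedExecutorBackend) appear only once an application is submitted.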

Comparison of Spark and YARN roles
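Roughly, the Standalone roles map onto YARN roles as follows:

  • Master ↔ ResourceManager: cluster-level resource management and scheduling
  • Worker ↔ NodeManager: manages the resources of a single node and launches execution processes
  • Driver ↔ ApplicationMaster: per-application scheduling, Task generation, and monitoring
  • Executor ↔ Container: on YARN, Executors run inside Containers, which actually execute the Tasks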

Source: blog.csdn.net/qq_61162288/article/details/131586049