Spark Architecture and Operating Mechanism (1) - System Architecture

Spark uses a master/slave architecture to build a computing cluster. The architecture diagram is as follows:

The Client is the node that submits the Spark program; the remaining nodes are physical nodes in the Spark distributed cluster. These nodes fall into two categories: cluster management (ClusterMaster) nodes and slave (Slave) nodes.

1. Cluster management node
The ClusterMaster node is the core of the entire Spark cluster. It does not perform actual computing tasks; instead, it manages the computing resources of the whole cluster, that is, physical resources such as the memory and CPU cores of every host other than the ClusterMaster itself. The ClusterMaster manages these resources centrally and allocates them appropriately to each application submitted by users.
All computing nodes must register with the ClusterMaster node and hand over their computing resources to it for unified scheduling.

2. Slave node
A Slave node is a node that executes job logic. By function, slave nodes are divided into two categories: task scheduling nodes (Driver) and task execution nodes (Worker).
(1) Task scheduling node
If the main function of a Spark program runs on a node, that node acts as the Driver node of the application.
The Driver node serves as the starting point for the entire application logic and is responsible for creating the SparkContext and defining one or more RDDs.
The Driver is mainly responsible for two tasks:
    1. dividing an application into physically executable tasks (Tasks);
    2. allocating tasks to the most suitable Worker nodes and coordinating their execution on those Workers.

(2) Worker node
A node that runs an Executor process is a Worker node; Spark creates an Executor process on the Worker node for each application. The Executor process is responsible for two aspects of work:
    1. executing Tasks and feeding the execution results back to the Driver node;
    2. providing in-memory storage for RDDs.


3. Spark program running mode
(1) Local mode (Local): the Spark job is executed on a single machine rather than in a distributed cluster;
(2) Standalone mode (Standalone): a Spark cluster running in Standalone mode schedules applications in first-in, first-out (FIFO) order; by default, each application has exclusive use of the resources of all available nodes;
(3) Pseudo-distributed mode (local-cluster): simulates Standalone mode in a single-machine environment. Resource scheduling works the same way as in Standalone mode, but the resources being scheduled belong not to actual physical nodes but to a pseudo-distributed Spark cluster running on one machine;
(4) YARN mode: the Spark cluster runs on the YARN resource management framework;
(5) Mesos mode: the Spark cluster runs on the Mesos resource management framework.
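Each running mode corresponds to a different master URL passed to `spark-submit --master` or `SparkConf.setMaster`. The mapping can be sketched as follows (host names, ports, and resource figures are placeholders, not values from this article):

```python
# Master URL per running mode; host names and resource numbers are placeholders.
MASTER_URLS = {
    "local": "local[*]",                         # single machine, one thread per core
    "standalone": "spark://master-host:7077",    # Standalone cluster manager
    "local-cluster": "local-cluster[2,1,1024]",  # 2 workers, 1 core, 1024 MB each
    "yarn": "yarn",                              # resolved from the Hadoop/YARN config
    "mesos": "mesos://master-host:5050",         # Mesos master
}

for mode, url in MASTER_URLS.items():
    print(f"{mode}: {url}")
```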
