1 Spark runtime architecture
Terminology:
- Driver node: the node responsible for central coordination
- Executor node: a worker node that carries out the actual work
- Spark application: the driver node and all executor nodes taken as a whole
  - Application = Driver + Executors
- Cluster Manager: an external service started on machines in the cluster
1.1 Driver Node
[Function]
Converting the user program into tasks
The Spark driver is responsible for converting the user program into multiple physical execution units called tasks.
All Spark programs follow the same structure: the program creates a series of RDDs from input data, uses transformation operations to derive new RDDs, and finally uses action operations to collect or store the data in the resulting RDDs.
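This create → transform → act structure can be sketched with plain Python iterators (an analogy only, not the Spark API): the intermediate steps are lazy, and only the final "action" forces computation.

```python
# Plain-Python analogue of a Spark program's structure (not real Spark).
data = range(1, 6)                             # create a dataset from input data
squared = map(lambda x: x * x, data)           # "transformation": lazy, nothing runs yet
evens = filter(lambda x: x % 2 == 0, squared)  # another lazy transformation
result = list(evens)                           # "action": forces evaluation
print(result)  # [4, 16]
```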
A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations. The driver also performs some optimizations on this logical execution plan, such as fusing consecutive mappings into pipelined execution and combining multiple steps into one. In this way, Spark converts the logical plan into a series of stages, each of which consists of multiple tasks.
Scheduling tasks on executor nodes
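The pipelining optimization can be sketched in plain Python: consecutive element-wise operations are fused into one function, so each element is traversed once instead of once per operation (an illustration of the idea, not Spark's implementation).

```python
from functools import reduce

def fuse(*funcs):
    """Fuse consecutive element-wise operations into a single function,
    mimicking how Spark pipelines consecutive narrow transformations."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

inc = lambda x: x + 1
double = lambda x: x * 2
pipelined = fuse(inc, double)             # one pass over the data instead of two
print([pipelined(x) for x in [1, 2, 3]])  # [4, 6, 8]
```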
After the executor process starts, it registers itself with the driver process.
The Spark driver tries to assign each task to a suitable executor based on the location of the data, given the current set of executor nodes. As tasks execute, executors may cache data; the driver tracks the locations of this cached data and uses them when scheduling future tasks, so as to minimize the network transfer of data.
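This locality preference can be sketched as a toy scheduler (names like `exec-1` are hypothetical, and Spark's real scheduler is far more elaborate): a task goes to the executor that caches its partition when possible, otherwise to any available executor.

```python
def schedule(tasks, cache_locations, executors):
    """Toy locality-aware assignment: prefer the executor that already
    caches a task's data; otherwise fall back to round-robin."""
    assignment = {}
    for i, task in enumerate(tasks):
        preferred = cache_locations.get(task)
        if preferred in executors:
            assignment[task] = preferred          # data-local placement
        else:
            assignment[task] = executors[i % len(executors)]
    return assignment

executors = ["exec-1", "exec-2"]
cached = {"part-0": "exec-2"}  # part-0 was cached on exec-2 by an earlier task
print(schedule(["part-0", "part-1"], cached, executors))
# part-0 goes to exec-2 (its cached copy lives there); part-1 to any executor
```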
1.2 Executor Node
A Spark executor node runs as a worker process.
[Life cycle] Executor nodes are launched when the Spark application starts and exist for the entire lifetime of the application. An executor failing or crashing does not cause the application to stop.
[Function]
- Responsible for running the tasks that make up the Spark application and returning the results to the driver process
- Provide in-memory storage for RDDs that require caching in user programs through its own Block Manager
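The Block Manager's caching role can be sketched as a tiny in-memory block store (the name BlockManager matches Spark's component, but this API is invented for illustration). A miss signals that the partition must be recomputed from its lineage.

```python
class BlockManager:
    """Minimal sketch of an executor-side in-memory block store."""
    def __init__(self):
        self._blocks = {}

    def put(self, block_id, data):
        self._blocks[block_id] = data

    def get(self, block_id):
        # None signals a cache miss: recompute the partition from lineage
        return self._blocks.get(block_id)

bm = BlockManager()
bm.put("rdd_1_part_0", [1, 2, 3])
print(bm.get("rdd_1_part_0"))  # [1, 2, 3]
```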
1.3 Cluster Manager
Spark relies on the cluster manager to start executor nodes and, in certain deployment modes, to start the driver node as well.
The cluster manager is a pluggable component in Spark
In addition to the built-in standalone cluster manager, Spark can also run on external cluster managers such as YARN or Mesos.
2 Detailed process of running Spark application
By analogy: the driver is the boss who decides the strategy, the cluster manager is the team leader who allocates the resources, and the executors are the soldiers who carry out the work.
(1) The user submits the application through the spark-submit script
(2) The spark-submit script starts the driver and calls the user-defined main() method
(3) The driver program communicates with the cluster manager to apply for resources to start the executor node
(4) The cluster manager starts executor nodes on behalf of the driver program
(5) The driver process runs the user application. Based on the transformation and action operations on RDDs defined in the program, the driver node sends work to the executor processes in the form of tasks
(6) Tasks are computed in the executor processes, which save the results
(7) When the driver program's main() method exits, or SparkContext.stop() is called, the driver terminates the executors and releases resources through the cluster manager
3 Deploying the application with spark-submit
It is necessary to understand this part.
bin/spark-submit [options] <app jar | python file> [app options]
Run with no other parameters, the Spark program is executed only locally.
The --master flag specifies the cluster URL to connect to:

Value | Description
---|---
spark://host:port | Connect to a Spark standalone cluster at the specified port; the default port is 7077
yarn | Run on YARN; the environment variable HADOOP_CONF_DIR must point to the Hadoop configuration directory so that cluster information can be found
local | Run in local mode with a single core
local[N] | Run in local mode with N cores
local[*] | Run in local mode, using as many cores as the machine has
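Putting the table together, typical invocations might look like the following (application names and paths are hypothetical; the flags and master-URL forms are the ones described above):

```shell
# Local mode on all available cores
bin/spark-submit --master "local[*]" my_script.py

# Standalone cluster; 7077 is the default master port
bin/spark-submit --master spark://masterhost:7077 --class com.example.Main my_app.jar

# YARN: HADOOP_CONF_DIR must point at the Hadoop configuration directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
bin/spark-submit --master yarn my_script.py
```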