Spark Basics | Running Spark on a Cluster

1 Spark runtime architecture

[Figure: Spark runtime architecture, showing the driver, the cluster manager, and the executor nodes]
Terminology:
Driver node: the node responsible for central coordination
Executor node: a worker node that performs the actual computation
Spark application: the driver node and all executor nodes taken as a whole
  (Application = Driver + Executors)
Cluster manager: an external service running on machines in the cluster, through which Spark launches executors

1.1 Driver Node

Functions

  1. Converting the user program into tasks

    The Spark driver is responsible for converting the user program into multiple physical execution units called tasks.

    All Spark programs follow the same structure: the program creates a series of RDDs from input data, derives new RDDs from them with transformation operations, and finally uses action operations to collect or store the data in the resulting RDDs.
    A Spark program thus implicitly creates a logical directed acyclic graph (DAG) of operations.

    The driver also performs optimizations on this logical execution plan, such as pipelining consecutive map transformations together and merging multiple steps into one. In this way, Spark converts the logical plan into a series of stages (see the sketch after this list).

  2. Scheduling tasks on executor nodes

    When an executor process starts, it registers itself with the driver process.

    Based on the current set of executor nodes, the Spark driver tries to assign each task to a suitable executor according to where the data is located. As tasks execute, executor processes cache data, and the driver uses the locations of this cached data to schedule future tasks, minimizing the network transfer of data (see the sketch below).
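The following minimal sketch ties both points together; the file path, application name, and word-count logic are illustrative assumptions, not from the original. The chained transformations build the logical DAG; the narrow flatMap and map operations are pipelined into one stage, while reduceByKey introduces a shuffle and therefore a new stage; and caching the result lets the driver schedule the second action's tasks on the executors that already hold the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    val lines = sc.textFile("hdfs:///input/logs.txt") // hypothetical input path
    val counts = lines
      .flatMap(_.split(" "))   // narrow transformation: pipelined with the map below
      .map(word => (word, 1))  // narrow transformation: runs in the same stage
      .reduceByKey(_ + _)      // shuffle boundary: starts a new stage
      .cache()                 // ask the executors to keep the result in memory

    counts.count()                              // action 1: runs the DAG and fills the cache
    println(counts.filter(_._2 > 100).count())  // action 2: scheduled near the cached partitions

    sc.stop()
  }
}
```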

1.2 Executor Node

A Spark executor is a kind of worker process

[Life cycle] Executor nodes are launched when the Spark application starts and normally run for the application's entire lifetime; if an executor fails or crashes, the application as a whole does not stop

Functions

  1. Run the tasks that make up the Spark application and return results to the driver process
  2. Provide in-memory storage for RDDs that user programs ask to cache, through each executor's own Block Manager (see the sketch below)
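A minimal sketch of point 2, assuming an existing SparkContext named sc; the input path and the choice of storage level are illustrative. Calling persist() asks each executor's Block Manager to keep that executor's partitions of the RDD in memory.

```scala
import org.apache.spark.storage.StorageLevel

val ratings = sc.textFile("hdfs:///input/ratings.csv") // hypothetical path
  .persist(StorageLevel.MEMORY_AND_DISK) // Block Managers hold partitions in memory,
                                         // spilling to disk if memory runs short

ratings.count() // the action materializes the cached partitions on the executors
```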

1.3 Cluster Manager

Spark depends on the cluster manager to launch executor nodes and, in certain cases, to launch the driver node as well.

The cluster manager is a pluggable component in Spark

In addition to the built-in standalone cluster manager, Spark can also run on top of external cluster managers such as Hadoop YARN and Apache Mesos
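Because the choice of cluster manager is expressed as a single master URL, the application code itself usually does not change when switching managers. A minimal sketch (the host and application name are hypothetical; in practice the master is more often supplied via spark-submit than hard-coded):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master-host:7077") // standalone cluster manager;
                                         // alternatives: "yarn", "mesos://host:5050", "local[*]"
val sc = new SparkContext(conf)
```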

2 The detailed process of running a Spark application

In short: the driver is the boss who sets the strategy, the cluster manager is the team leader who allocates resources, and the executors are the soldiers who carry out the work.

(1) The user submits the application through the spark-submit script

(2) The spark-submit script launches the driver program and invokes the user-defined main() method

(3) The driver program communicates with the cluster manager to request resources for launching executor nodes

(4) The cluster manager launches the executor nodes on behalf of the driver program

(5) The driver process works through the user application; based on the RDD transformation and action operations defined in the program, the driver node sends the work to the executor processes in the form of tasks

(6) Tasks run in the executor processes, computing and saving their results

(7) When the driver program's main() method exits, or SparkContext.stop() is called, the driver terminates the executors and releases resources through the cluster manager
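The steps above map onto a driver program roughly as follows; this is a minimal sketch, with the application name and computation chosen purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LifecycleDemo {
  // (2) spark-submit launches the driver and invokes this main() method
  def main(args: Array[String]): Unit = {
    // (3)-(4) creating the SparkContext requests resources from the
    // cluster manager, which launches the executors
    val sc = new SparkContext(new SparkConf().setAppName("LifecycleDemo"))

    // (5)-(6) transformations and actions are turned into tasks
    // that run on the executors
    val total = sc.parallelize(1 to 1000).map(_ * 2).sum()
    println(s"total = $total")

    // (7) stopping the SparkContext terminates the executors
    // and releases resources through the cluster manager
    sc.stop()
  }
}
```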

3 Deploying applications with spark-submit

This part is essential to understand

    bin/spark-submit [options] <app jar | python file> [app options]

With no other flags, the Spark program is executed only in local mode

The --master flag specifies the cluster URL to connect to

Value               Description
spark://host:port   Connect to a Spark standalone cluster at the specified port; the default port is 7077
yarn                Run on YARN; set the HADOOP_CONF_DIR environment variable to point to the Hadoop configuration directory so that cluster information can be found
local               Run in local mode with a single core
local[N]            Run in local mode with N cores
local[*]            Run in local mode, using as many cores as the machine has
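For example, a submission to a standalone cluster might look like this; the master host, class name, jar file, and input path are all hypothetical:

```
bin/spark-submit \
  --master spark://master-host:7077 \
  --class com.example.WordCount \
  wordcount.jar \
  hdfs:///input/logs.txt
```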
