Spark Deployment Modes and Launched Processes

Cluster Mode Overview

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).


Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.


Spark Standalone Mode

Launching Spark Applications

The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster.
For standalone clusters, Spark currently supports two deploy modes.

  • In client mode, the driver is launched in the same process as the client that submits the application. (The client and the driver share the same process.)
  • In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish. (The driver is launched by a Worker; the client exits quickly.)
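As a sketch of the two modes above, the only difference on the command line is the --deploy-mode flag passed to spark-submit. The master URL, main class, and jar path below are placeholders, not values from this article:

```python
# Sketch: building spark-submit command lines for a standalone cluster.
# The master URL, main class, and jar path are placeholders.

def standalone_submit(deploy_mode):
    """Return a spark-submit command list for the given deploy mode."""
    return [
        "spark-submit",
        "--master", "spark://master-host:7077",  # placeholder standalone master URL
        "--deploy-mode", deploy_mode,            # "client" or "cluster"
        "--class", "com.example.MyApp",          # placeholder main class
        "my-app.jar",                            # placeholder application jar
    ]

# In client mode the driver runs inside the spark-submit process itself;
# in cluster mode a Worker launches the driver and the client can exit early.
client_cmd = standalone_submit("client")
cluster_cmd = standalone_submit("cluster")
```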

Jar Distribution

If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
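Since --jars takes a single comma-delimited string, a list of dependency jars has to be joined before being passed. A minimal sketch (the helper name and jar paths are illustrative, not part of Spark):

```python
# Sketch: attaching dependency jars to a spark-submit command.
# Per the passage above, --jars takes one comma-delimited string.

def with_extra_jars(base_cmd, jars):
    """Append --jars with comma-joined paths; jar names here are placeholders."""
    if not jars:
        return list(base_cmd)
    return list(base_cmd) + ["--jars", ",".join(jars)]

cmd = with_extra_jars(["spark-submit", "my-app.jar"], ["dep1.jar", "dep2.jar"])
```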

Launching Spark on YARN

There are two deploy modes that can be used to launch Spark applications on YARN.

  • In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application.
  • In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.


spark-submit (Submitting an Application)

In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster.

Terminology

  • Application: User program built on Spark. Consists of a driver program and executors on the cluster.
  • Driver program: The process running the main() function of the application and creating the SparkContext.
  • Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
  • Deploy mode: Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster.
  • Worker node: Any node that can run application code in the cluster
  • Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  • Task: A unit of work that will be sent to one executor
  • Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you’ll see this term used in the driver’s logs.
  • Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.

Process Correspondence

Each executor corresponds to one CoarseGrainedExecutorBackend process.

spark-submit --master yarn --deploy-mode client

The ResourceManager launches an ExecutorLauncher process on one of the cluster's NodeManagers to act as the ApplicationMaster. In addition, CoarseGrainedExecutorBackend processes are launched on multiple NodeManagers to execute the application in parallel.

spark-submit --master yarn --deploy-mode cluster

The ResourceManager runs the ApplicationMaster on one of the cluster's NodeManagers, and that ApplicationMaster also runs the driver program. CoarseGrainedExecutorBackend processes are then launched on the NodeManagers to execute the application in parallel.
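The process correspondence described above can be summarized as a lookup table. The process names come from the text; the dictionary layout itself is only for illustration:

```python
# Sketch: which process plays which role in each YARN deploy mode.
# Process names (ExecutorLauncher, ApplicationMaster, CoarseGrainedExecutorBackend)
# are taken from the article; the structure is illustrative only.
YARN_PROCESSES = {
    "client": {
        "application_master": "ExecutorLauncher",           # requests resources only
        "driver": "spark-submit process on the client",
        "executor_backend": "CoarseGrainedExecutorBackend",
    },
    "cluster": {
        "application_master": "ApplicationMaster",          # also runs the driver
        "driver": "inside the ApplicationMaster",
        "executor_backend": "CoarseGrainedExecutorBackend",
    },
}
```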


Reposted from blog.csdn.net/zhixingheyi_tian/article/details/84974128