(1) Explanation of terms:

| concept | explanation |
| --- | --- |
| Application | A user program built on Spark; when running on a cluster, it consists of one driver and multiple executors |
| Driver program | The process that runs the application's main method and creates the SparkContext |
| Cluster manager | An external service for acquiring cluster resources (standalone manager, Mesos, YARN), specified dynamically via the --master parameter |
| Deploy mode | Distinguishes where the driver process is started: inside the cluster ("cluster" mode) or on the submitting machine ("client" mode) |
| Worker node | Any node in the cluster that can run application code |
| Executor | A process launched on a worker node that runs tasks (one executor can run multiple tasks) and stores data; each application has its own independent executors, and executors of different applications are independent of one another |
| Task | A unit of work sent to an executor for execution |
| Job | Each Spark action triggers one Spark job; a job consists of multiple tasks |
| Stage | Each job is divided into sets of tasks called stages, which depend on each other (similar to the map and reduce stages in MapReduce) |
(2) Correspondence:
1 action = 1 job, 1 job = n stages, 1 stage = n tasks
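The correspondence above can be sketched with a toy model (plain Python, not Spark; the rule that stages split at shuffle boundaries and run one task per partition is the assumption being illustrated):

```python
# Toy model of Spark's job decomposition (illustration only, not Spark code).
# Assumption: one action triggers one job, the job splits into stages at
# shuffle boundaries, and each stage runs one task per partition.

def decompose_job(num_partitions, num_shuffles):
    """One action -> one job -> (num_shuffles + 1) stages -> tasks per stage."""
    num_stages = num_shuffles + 1  # each shuffle adds one stage boundary
    stages = [
        {"stage_id": s, "tasks": [f"task-{s}-{p}" for p in range(num_partitions)]}
        for s in range(num_stages)
    ]
    return {"job_id": 0, "stages": stages}

# e.g. rdd.map(...).reduceByKey(...).collect() on 4 partitions:
# collect() is the action (1 job), reduceByKey introduces 1 shuffle (2 stages)
job = decompose_job(num_partitions=4, num_shuffles=1)
print(len(job["stages"]))                           # 2 stages
print(sum(len(s["tasks"]) for s in job["stages"]))  # 8 tasks in total
```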
(3) Summary:
- 1 application consists of 1 driver process + multiple executor processes
- The driver is the process that runs the main method and creates the SparkContext
- An executor is also a process; it runs tasks and stores data, and each application has its own executors
- A task is the smallest unit of work sent to an executor
- 1 job corresponds to 1 action; a job produces multiple stages, and each stage corresponds to multiple tasks. Submission happens stage by stage, by stage id from back to front: submitting a stage means sending all of its tasks to executors for execution
- In standalone mode, executors run on workers; in YARN mode, executors run in containers on the NodeManager. At submission time, --master and --deploy-mode specify the cluster manager and whether to use client or cluster mode
- Spark applications are sets of independent processes, coordinated by the SparkContext running in the driver; data is not shared across Spark applications unless a third-party storage system is used (HDFS, S3, Alluxio, etc.)
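The --master and --deploy-mode options above are passed to spark-submit. A sketch of both modes (the master URL, main class, and jar name are placeholders, not from the original notes):

```shell
# Standalone cluster, client deploy mode: the driver runs on the
# submitting machine; executors run on the workers.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar

# YARN, cluster deploy mode: the driver is launched inside the cluster;
# executors run in containers on the NodeManagers.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```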