Spark - core concepts

(1) Explanation of terms:

Concept: Explanation

Application: A user program built on Spark. When it runs on a cluster, it consists of one driver and multiple executors.

Driver program: The process that runs the application's main method and creates the SparkContext.

Cluster manager: An external service for acquiring cluster resources (standalone manager, Mesos, YARN). It is selected at submit time via the --master parameter.

Deploy mode: Determines where the driver process starts.
(1) cluster: the driver starts inside the cluster

  • YARN mode: on a NodeManager
  • standalone mode: on a worker
(2) client: the driver starts outside the cluster, on the machine where the application is submitted

Worker node: Any node in the cluster that can run application code.

  • standalone mode: a worker node
  • YARN mode: a container on a NodeManager
Executor: A process started on a worker node that runs tasks (one executor can run multiple tasks) and stores data. Each application has its own executors, independent of those of other applications.

Task: A unit of work sent to an executor for execution.

Job: Each Spark action triggers one Spark job, and a job consists of multiple tasks.

Stage: Each job is divided into sets of tasks called stages. Stages depend on each other (similar to the map and reduce stages in MapReduce).
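
The cluster manager and deploy mode are both chosen on the spark-submit command line. As a minimal sketch, the helper below assembles that command line for two of the combinations described above. `build_submit_command`, the host name `master-host` and the file names are hypothetical, chosen only for illustration; `--master` and `--deploy-mode` are the real spark-submit flags.

```python
import shlex

# Values accepted by spark-submit's --deploy-mode flag.
DEPLOY_MODES = ("client", "cluster")


def build_submit_command(master, deploy_mode, app, app_args=()):
    """Return the spark-submit argv for one application (hypothetical helper)."""
    if deploy_mode not in DEPLOY_MODES:
        raise ValueError(f"deploy mode must be one of {DEPLOY_MODES}")
    return [
        "spark-submit",
        "--master", master,            # which cluster manager to ask for resources
        "--deploy-mode", deploy_mode,  # where the driver process starts
        app,
        *app_args,
    ]


# YARN as cluster manager, driver started inside the cluster.
yarn_cluster = build_submit_command("yarn", "cluster", "wordcount.py", ["input.txt"])

# Standalone manager (assumed host and port), driver started on the
# machine that submits the application.
standalone_client = build_submit_command(
    "spark://master-host:7077", "client", "wordcount.py"
)

print(shlex.join(yarn_cluster))
print(shlex.join(standalone_client))
```

The helper only builds the argument list; running it against a real cluster would mean passing that list to something like `subprocess.run`.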

 

(2) Correspondence:

1 action = 1 job = n stages, and 1 stage = m tasks
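
To make the counting concrete, here is a small model of the relationship in plain Python. It is not Spark code: the `Job` and `Stage` classes are hypothetical, and the partition counts are made up. The one Spark fact it relies on is that a stage runs one task per partition.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    stage_id: int
    num_partitions: int

    @property
    def num_tasks(self):
        # One task per partition of the stage.
        return self.num_partitions


@dataclass
class Job:
    action: str   # the action that triggered this job
    stages: list  # the stages the job was divided into

    @property
    def num_tasks(self):
        return sum(stage.num_tasks for stage in self.stages)


# Hypothetical word count: a map-side stage over 4 partitions, then a
# reduce-side stage over 2 partitions after the shuffle.
job = Job(action="collect", stages=[Stage(0, 4), Stage(1, 2)])

print(len(job.stages))  # 2
print(job.num_tasks)    # 6
```

So the single `collect` action here maps to one job, two stages, and six tasks in total.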

 

(3) Summary:

  • 1 application consists of 1 driver process + multiple executor processes
  • The driver is a process that runs the main method and creates the SparkContext
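  • An executor is also a process; it runs tasks and stores data, and each application has its own executors
  • A task is the smallest unit of work sent to an executor
  • 1 job corresponds to 1 action, 1 job produces multiple stages, and 1 stage corresponds to multiple tasks. Submission happens stage by stage: the scheduler walks the stage dependencies backward from the final stage, and submitting a stage means sending all of its tasks to the executors
  • In standalone mode, executors run on workers; in YARN mode, executors run in containers on NodeManagers. At submit time, --master selects the cluster manager and --deploy-mode selects client or cluster mode
  • A Spark application is a set of independent processes coordinated by the SparkContext in the driver. Data is not shared across Spark applications unless an external storage system (HDFS, S3, Alluxio, etc.) is used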
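
The stage-by-stage submission mentioned in the summary can be illustrated with a toy model. This is a sketch under stated assumptions, not Spark's scheduler code: `submission_order` and the dependency map are hypothetical. It starts at the final stage and walks backward through the dependencies, and a stage is emitted only after all the stages it depends on.

```python
def submission_order(final_stage, parents):
    """Return stage ids in an order in which they can be submitted.

    `parents` maps a stage id to the ids of the stages it depends on.
    The walk starts at the final stage and goes backward, but a stage
    is only emitted once all of its parents have been emitted.
    """
    order = []
    seen = set()

    def visit(stage_id):
        if stage_id in seen:
            return
        seen.add(stage_id)
        for parent in sorted(parents.get(stage_id, [])):
            visit(parent)
        order.append(stage_id)

    visit(final_stage)
    return order


# Hypothetical job: stages 0 and 1 both feed stage 2, whose output
# feeds the final stage 3.
parents = {3: [2], 2: [0, 1]}
print(submission_order(3, parents))  # [0, 1, 2, 3]
```

The model emits one stage at a time for simplicity; stages that do not depend on each other, such as 0 and 1 here, have no required order between them.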
