Sorting out Spark's basic concepts

I have been learning and using Spark recently, so I have sorted out some basic concepts and terminology here, both to reinforce my own understanding and to make later review easier.

Spark is a memory-based distributed computing framework that integrates seamlessly with the existing Hadoop ecosystem. It mainly includes four major components: Spark Streaming, Spark SQL, Spark MLlib, and Spark GraphX.

Some basic concepts involved in Spark operation are as follows:

  1. Master: controls, manages, and supervises the entire Spark cluster.

  2. Client: the client process that submits the application; it carries the business logic and communicates with the Master.

  3. Application: a user-written Spark program. After the user submits it, Spark allocates resources for the application, then converts and executes the program.

  4. SparkContext: the entry point of a Spark application, responsible for scheduling the various computing resources and coordinating the Executors on each worker node. It also keeps bookkeeping information, recording who runs the application and how it runs, which is why every Spark program needs to create a SparkContext (see the sketch after Figure 1).

  5. Driver Program: the manager of a single application. You might ask why a Driver is needed when there is already a Master: the Master is the boss of the whole cluster, and if it also had to track every application in detail it would be overwhelmed. The Driver therefore runs and tracks the specifics of one application, running the Application's main() function and creating the SparkContext.

  6. RDD: Spark's core data structure, which can be operated on by a series of operators. When an RDD encounters an Action operator, all the preceding operators form a directed acyclic graph (DAG), which is then converted into a Job in Spark and submitted to the cluster for execution. An application can contain multiple Jobs.

  7. Worker Node: a worker node of the cluster, i.e. a node that can run Application code. It receives commands and tasks from the Master, reports execution progress and results back to the Master, and runs one or more Executor processes.

  8. Executor: a process that runs on a Worker Node on behalf of an application. It is responsible for running Tasks and storing data in memory or on disk. Each application requests its own Executors to handle its tasks.
    Figure 1 - Spark runtime architecture diagram
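To make the roles of the SparkContext and RDDs concrete, here is a minimal word-count sketch in Scala. It is an illustrative example, not code from the original post; the application name, master URL, and input file are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The Driver creates the SparkContext, which registers the application
    // with the cluster manager and requests Executors on the Worker Nodes.
    // "local[*]" and the app name are placeholder settings for this sketch.
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Transformations (flatMap, map, reduceByKey) are lazy: they only build
    // up the RDD lineage, i.e. the DAG of operators.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // An Action (collect) turns the DAG into a Job and submits it to the
    // cluster for execution.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```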

Concepts involved in the execution of a Spark application:

  1. Task: each partition of an RDD corresponds to a Task; a Task is the smallest unit of processing, applied to a single partition.

  2. TaskSet: a set of related Tasks with no shuffle dependencies among them.

  3. Stage (scheduling stage): the scheduling stage corresponding to a TaskSet. Each Job is split into Stages at the wide dependencies of its RDDs, and each Stage contains one TaskSet (see the lineage sketch after this list).

  4. Job: a computation triggered by an Action operator, composed of one or more Stages.

  5. Application: a user-written Spark application consisting of one or more Jobs. After it is submitted, Spark allocates resources for it, converts the program, and executes it.

  6. DAGScheduler: builds a Stage-based DAG for each Job and submits the Stages to the TaskScheduler.

  7. TaskScheduler: submits each TaskSet to the Worker Nodes of the cluster to run and returns the results.
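To see how the Job → Stage → Task chain plays out, consider the hypothetical word-count pipeline sketched after Figure 1. The shuffle introduced by reduceByKey is a wide dependency, so the DAGScheduler cuts the Job into two Stages, and each Stage becomes a TaskSet with one Task per partition. RDD.toDebugString prints the lineage with an indent at each shuffle boundary, which roughly mirrors that Stage split:

```scala
// Print the RDD lineage; the indentation marks the shuffle boundary where
// the DAGScheduler will cut the Job into Stages.
println(counts.toDebugString)

// Roughly, the output has this shape (RDD ids and partition counts depend
// on the input and configuration):
// (2) ShuffledRDD[4] at reduceByKey ...         <- Stage after the shuffle
//  +-(2) MapPartitionsRDD[3] at map ...         <- Stage of narrow dependencies
//     |  MapPartitionsRDD[2] at flatMap ...
//     |  input.txt MapPartitionsRDD[1] at textFile ...
```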

Figure 2 - Logical view of a Spark application

As the figure shows: an Application consists of one or more Jobs; a Job consists of one or more Stages, which are divided according to wide and narrow dependencies; a Stage consists of one TaskSet; and a TaskSet consists of one or more Tasks.
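As a small illustration of the Application/Job relationship, continuing the hypothetical word-count sketch above: each Action submits its own Job, so a single Application can run several Jobs.

```scala
// Each Action submits a separate Job to the cluster, so one Application
// can contain several Jobs.
val distinctWords = counts.count()                   // Action -> Job 1
val totalWords    = counts.map(_._2).reduce(_ + _)   // Action -> Job 2
println(s"distinct words = $distinctWords, total words = $totalWords")
```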

Source: blog.csdn.net/mrliqifeng/article/details/90581307