Big Data Development-Spark-Understanding Stage, Executor, Driver... in Spark

1. Introduction

For newcomers to Spark, the first hurdle is usually its operating mechanism: when you talk with them, they may not be sure what you are referring to, and terms such as deployment mode and run mode easily get mixed up. Even developers with some experience who understand the mechanism may struggle to express Spark's terminology precisely. Understanding Spark's terms is therefore essential for Spark developers to communicate with each other. This article starts from Spark's operating mechanism and works through a WordCount example to explain the various terms used in Spark.

2. Spark operating mechanism

First, a diagram from the official website illustrates the general execution framework of a Spark application on a distributed cluster. It consists mainly of the SparkContext (the Spark context), the cluster manager (resource manager), and the executors (the worker processes on individual nodes). The cluster manager is responsible for unified resource management across the entire cluster, while an executor is the main process in which the application runs, containing multiple task threads and its own memory space.

The main running process of Spark is as follows:

  1. After the application is submitted with spark-submit, the SparkContext, which is Spark's running environment, is initialized in the location determined by the deploy-mode parameter given at submission time, and the DAGScheduler and TaskScheduler are created. The Driver executes the application code and splits the whole program into multiple jobs according to its action operators. Each job builds a DAG graph; the DAGScheduler divides the DAG into multiple stages, and each stage is divided into multiple tasks. The DAGScheduler passes each stage's task set to the TaskScheduler, which is responsible for scheduling the tasks on the cluster. How stages and tasks relate to each other and how they are divided is discussed in detail later.

  2. The Driver applies to the resource manager for resources according to the resource requirements in the SparkContext, including the number of executors and the amount of memory.

  3. After the resource manager receives the request, it creates executor processes on worker nodes that meet the requirements.

  4. After an executor is created, it registers itself back with the driver so that the driver can assign tasks to it for execution.

  5. When the program finishes executing, the driver releases the requested resources back to the resource manager.
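To make this flow concrete, in particular how each action operator triggers a separate job that the DAGScheduler then splits into stages and tasks, here is a minimal sketch using made-up in-memory data and a local master (the app name, master URL, and data are all illustrative, not part of the original article):

import org.apache.spark.{SparkConf, SparkContext}

object JobTriggerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("JobTriggerDemo"))
    val nums = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions => 4 tasks per stage
    val doubled = nums.map(_ * 2)                      // transformation only: no job is submitted yet
    val sum = doubled.reduce(_ + _)                    // action #1 => job 0
    val cnt = doubled.count()                          // action #2 => job 1
    println(s"sum=$sum, count=$cnt")
    sc.stop()
  }
}

Each of the two actions shows up as its own job in the Spark UI, which matches step 1 above.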

3. Understand the terms in Spark

Starting from this operating mechanism, let's go through the following terms.

3.1 Driver program

The driver is the Spark application we wrote, which creates the SparkContext or SparkSession. The driver communicates with the cluster manager and assigns tasks to the executors for execution.
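As a small sketch, a driver program typically does little more at startup than build a SparkSession (or a plain SparkContext, as in the WordCount example later); the app name and master URL below are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyDriverApp")       // illustrative name
  .master("local[*]")           // illustrative master; in production this usually comes from spark-submit
  .getOrCreate()
val sc = spark.sparkContext     // the SparkContext the driver uses to talk to the cluster manager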

3.2 Cluster Manager

Responsible for resource scheduling of the entire program. The main cluster managers currently are:

YARN

Spark Standalone

Mesos
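A sketch of how the cluster manager is chosen in code: the master URL passed to SparkConf determines which manager the driver registers with (the host names and ports below are hypothetical):

import org.apache.spark.SparkConf

val onYarn       = new SparkConf().setMaster("yarn")                      // resources managed by YARN
val onStandalone = new SparkConf().setMaster("spark://master-host:7077")  // Spark Standalone master (hypothetical host)
val onMesos      = new SparkConf().setMaster("mesos://mesos-host:5050")   // Mesos master (hypothetical host)
val localTest    = new SparkConf().setMaster("local[*]")                  // no cluster manager: local threads for testing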

3.3 Executors

An executor is actually an independent JVM process, one on each worker node, whose main job is to execute tasks. Within an executor, multiple tasks can be executed concurrently.
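A sketch of the configuration knobs that control executors; the values are illustrative, and spark.executor.cores is what allows several tasks to run concurrently inside one executor:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorDemo")            // illustrative name
  .set("spark.executor.instances", "4")  // how many executor JVMs to request from the cluster manager
  .set("spark.executor.memory", "2g")    // heap size of each executor process
  .set("spark.executor.cores", "2")      // task slots per executor => up to 2 tasks run concurrently in it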

3.4 Job

A job is one complete processing flow of the user program; it is a logical concept.

3.5 Stage

A job can contain multiple stages, and the stages execute serially. Stage boundaries are triggered by shuffle operations such as reduceByKey, and by actions such as save.
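As a small sketch (assuming an existing SparkContext sc and hypothetical input/output paths), the shuffle introduced by reduceByKey is exactly where a stage boundary appears: the narrow transformations before it are pipelined into one stage, and everything after the shuffle belongs to the next stage:

val counts = sc.textFile("data/spark/wc.txt")  // hypothetical input path
  .flatMap(_.split(" "))
  .map((_, 1))                                 // narrow transformations: pipelined into the first stage
  .reduceByKey(_ + _)                          // shuffle => new stage boundary
counts.saveAsTextFile("data/spark/wc_out")     // action: triggers the job (hypothetical output path)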

3.6 Task

A stage can contain multiple tasks, for example sc.textFile("/xxxx").map().filter(), where map and filter each become a task; the output of each task is the input of the next.

3.7 Partition

A partition is a portion of the data source in Spark. A complete data source is divided by Spark into multiple partitions so that Spark can send them to multiple executors and execute tasks in parallel.
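A sketch of how partitioning can be observed and adjusted (assuming an existing sc; the path and partition counts are illustrative). Since one task handles one partition, this also controls the degree of parallelism:

val rdd = sc.textFile("data/spark/wc.txt", minPartitions = 8) // ask Spark for at least 8 partitions (hypothetical path)
println(rdd.getNumPartitions)                                 // one task per partition in each stage
val fewer = rdd.coalesce(4)                                   // shrink to 4 partitions without a full shuffle
val more  = rdd.repartition(16)                               // reshuffle the data into 16 partitions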

3.8 RDD

RDD stands for Resilient Distributed Dataset. In Spark, a data source can be regarded as one large RDD, and the RDD is composed of multiple partitions: the data Spark loads is stored in the RDD, which in fact is cut into those multiple partitions.
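To see that an RDD really is just a collection of partitions spread across executors, here is a minimal sketch (assuming an existing sc; the test data is made up):

val data = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 3) // one RDD cut into 3 partitions
data
  .mapPartitionsWithIndex { (idx, it) => it.map(x => s"partition $idx -> $x") }
  .collect()
  .foreach(println)  // shows which partition each element was stored in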

So the question is: how is a Spark job actually executed?

(1) The Spark program we wrote, also known as the driver, submits a job to the cluster manager.

(2) The cluster manager checks data locality and finds the most suitable nodes on which to schedule the tasks.

(3) The job is split into different stages, and each stage is split into multiple tasks.

(4) The driver sends the tasks to the executors, which execute them.

(5) The driver tracks the execution of each task and reports updates to the master node, which can be checked on the Spark master UI.

(6) After the job completes, the data from all nodes is aggregated on the master node again, including metrics such as average time, maximum time, and median.
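Besides the web UI mentioned in step (5), the driver's view of job and stage progress can also be read programmatically. A small sketch using SparkStatusTracker (assuming an existing sc; the printed IDs are whatever jobs happen to be running):

val tracker = sc.statusTracker
tracker.getActiveJobIds().foreach { jobId =>
  tracker.getJobInfo(jobId).foreach { job =>
    println(s"job $jobId status=${job.status()} stages=${job.stageIds().mkString(",")}")
  }
}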

3.9 Deployment mode and operation mode

The deployment mode refers to the cluster manager, generally Standalone or YARN, while the run mode refers to where the Driver runs: on the cluster, or on the machine that submitted the application. These correspond to cluster mode and client mode respectively; they differ in where the results and logs end up, in stability, and so on.
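A sketch of how the two choices show up as plain configuration (the values are illustrative, and in practice they are usually passed to spark-submit rather than hard-coded): the master URL picks the cluster manager, and spark.submit.deployMode picks where the driver runs:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("DeployModeDemo")               // illustrative name
  .setMaster("yarn")                          // deployment mode: which cluster manager owns the resources
  .set("spark.submit.deployMode", "cluster")  // run mode: "cluster" = driver runs inside the cluster,
                                              // "client" = driver runs on the machine that submitted the job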

4. Understanding various terms from the WordCount case

Let's first revisit the related concepts:

  • Job: a Job is triggered by an Action, so a Job contains one Action and N Transformation operations;

  • Stage: a Stage is a set of Tasks separated by shuffle operations; Stages are divided according to wide and narrow dependencies;

  • Task: the smallest execution unit. Because each Task is responsible for processing only one partition of data, there are generally as many Tasks as there are partitions; these Tasks all perform the same operation, just on different partitions;

The following is a WordCount program

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("yarn").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val lines1: RDD[String] = sc.textFile("data/spark/wc.txt")
    val lines2: RDD[String] = sc.textFile("data/spark/wc2.txt")
    val j1 = lines1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
    val j2 = lines2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
    j1.join(j2).collect()
    sc.stop()
  }
}

YARN mode is the one used most in production environments, so let's look at the program from the perspective of the YARN deployment mode. There is only one action in the code, collect, so there is only one job. Because of the shuffles, the job is divided into three stages: the flatMap, map, and reduceByKey on lines1 form Stage0; the same chain on lines2 forms Stage1; and Stage2 joins the results of the first two and then collects. Stage2 depends on Stage0 and Stage1, while Stage0 and Stage1 can run in parallel. In a real production environment you can look at the stage dependency graph (the DAG visualization) to see these dependencies clearly.
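If you want to check this split without opening the UI, a rough way (a sketch, not part of the original program) is to print the lineage of the joined RDD, reusing the j1 and j2 RDDs from the program above; the indentation levels in toDebugString mark the shuffle boundaries that separate Stage0, Stage1, and Stage2:

val joined = j1.join(j2)       // same j1/j2 as in the WordCount program above
println(joined.toDebugString)  // indented lineage: each extra indent level marks a shuffle/stage boundary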

Wu Xie, Xiao San Ye, a little rookie in backend, big data, and artificial intelligence. Follow me for more.
