Spark stage and task division

First, understand each level of the hierarchy

Spark cluster

  • A Spark cluster can run multiple Spark applications concurrently.

Spark application

  • A Spark application consists of a driver (where the logic code is written) and multiple executors (each executor runs tasks in threads). The Spark program runs on the driver side, which sends instructions to the nodes where the executors are located.
  • When the SparkContext is started, a driver is started along with multiple executors. An executor cannot span nodes, but a node can host multiple executors. An RDD is computed in parallel across multiple executors: one executor can process data from multiple partitions of the RDD, but the data of one partition is never processed by multiple executors (see the partition sketch after this list).
  • A Spark application can run multiple jobs concurrently; each triggered action operator produces a job.
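
A minimal sketch to make the partition-to-task relationship concrete (the object name, app name, and data below are hypothetical, assuming a local[*] setup): mapPartitionsWithIndex tags each element with the index of the partition that processed it, and each partition is handled by exactly one task.

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        // 3 partitions => the collect job's single stage runs 3 tasks
        val rdd = sc.makeRDD(1 to 10, 3)
        val tagged = rdd.mapPartitionsWithIndex { (idx, it) =>
          it.map(x => s"partition $idx -> $x")
        }
        tagged.collect().foreach(println)
        sc.stop()
      }
    }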

Job

  • Spark RDDs are lazily evaluated; each triggered action splits off a job.
  • A job can have multiple stages.

stage

  • Each wide dependency splits off a new stage.
  • The number of stages = the number of wide dependencies + 1.
  • A stage has multiple tasks.

task

  • The number of tasks = the number of partitions of the last RDD in each stage (see the counting sketch below).
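
A minimal counting sketch that mirrors the example in the next section (the object and app names are hypothetical): map followed by reduceByKey contains one wide dependency, so each job has 1 + 1 = 2 stages; assuming reduceByKey keeps the parent's 2 partitions, each stage ends in a 2-partition RDD, giving 2 + 2 = 4 tasks per job.

    import org.apache.spark.{SparkConf, SparkContext}

    object CountingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("counting-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val rdd = sc.makeRDD(List(1, 3, 4, 5, 1, 3, 9), 2)    // 2 partitions
        val resRDD = rdd.map((_, 1)).reduceByKey(_ + _)       // reduceByKey => wide dependency
        // stages per job = wide dependencies + 1 = 2
        // tasks per job  = partitions of the last RDD in each stage = 2 + 2 = 4
        println(s"final RDD partitions: ${resRDD.getNumPartitions}")
        resRDD.collect().foreach(println)                     // the action that triggers the job
        sc.stop()
      }
    }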

View the relationships between jobs, stages, and tasks through the web UI

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    val conf: SparkConf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[*]")
    val sc = new SparkContext(conf)
    val rdd: RDD[Int] = sc.makeRDD(List(1, 3, 4, 5, 1, 3, 9), 2)
    val resRDD: RDD[(Int, Int)] = rdd.map((_, 1)).reduceByKey(_ + _)
    // first action
    resRDD.foreach(println)

    // second action
    resRDD.saveAsTextFile("D:\\develop\\workspace\\bigdata2021\\spark2021\\out")

    // keep the application alive so the web UI (by default at http://localhost:4040) stays available
    Thread.sleep(100000000)
    sc.stop()

View the number of jobs (each action triggers a job)


View the number of stages of job 0 (each wide dependency adds a stage)


View the number of tasks (the number of partitions of the final RDD in each stage)


If there is a shuffle process

  • Spark automatically reuses the already-materialized output of the stages before a shuffle, so when a later job needs it those stages are displayed as "skipped" in the web UI (in the example above, the second action reuses the shuffle output produced by the first job).

Stage and task division

  • First, a DAG (directed acyclic graph) is generated. It is a topological graph composed of vertices and edges; the edges are directed and the graph contains no cycles.

  • The original RDD, after a series of transformations, forms a DAG, and the DAG is divided into different stages according to the wide dependencies between the RDDs.

  • The DAG records the RDD transformation process and the task stages (a lineage-printing sketch follows this list).

  • The RDD task-division hierarchy is: Application -> Job -> Stage -> Task.

  • An Application is started when the SparkContext is started.

  • An action operator generates a job.

  • Each job is divided into different stages according to its wide dependencies.

  • The number of partitions of the final RDD in each stage is the number of tasks.
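
A minimal lineage-printing sketch (the object and app names are hypothetical): RDD.toDebugString prints the lineage that forms the DAG, and the indentation break at the ShuffledRDD entry marks the wide dependency where the job is split into two stages.

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val resRDD = sc.makeRDD(List(1, 3, 4, 5, 1, 3, 9), 2)
          .map((_, 1))
          .reduceByKey(_ + _)
        // Prints the DAG lineage; the ShuffledRDD entry is the stage boundary.
        println(resRDD.toDebugString)
        sc.stop()
      }
    }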


Origin blog.csdn.net/FlatTiger/article/details/115085777