Spark Task Execution

After learning about Spark RDDs and RDD operations, you probably want to write a small program right away to consolidate what you have learned about Spark. And what big data learner can skip writing a WordCount?

Without further ado, here is the code:

// sc is the SparkContext, created automatically in spark-shell
val lines = sc.textFile("data/dataset.txt")

val result = lines.flatMap(_.split(" "))   // split each line into words
    .map((_, 1))                           // pair each word with a count of 1
    .reduceByKey(_ + _)                    // sum the counts for each word

result.collect()                           // action: triggers the actual computation
    .foreach(x => println(x._1 + " = " + x._2))

Output:

Deer = 2
Bear = 2
Car = 3
River = 2

Just a few lines of code and WordCount is done. Compared with writing the same WordCount as a MapReduce program, Spark really is easy to use.

Next, let's use the program we just wrote to explore a few other things...

First, we can use the toDebugString method that Spark provides to view an RDD's lineage:

result.toDebugString

Output:

(2) ShuffledRDD[4] at reduceByKey at WordCount.scala:21 []
 +-(2) MapPartitionsRDD[3] at map at WordCount.scala:20 []
    |  MapPartitionsRDD[2] at flatMap at WordCount.scala:19 []
    |  data/dataset.txt MapPartitionsRDD[1] at textFile at WordCount.scala:17 []
    |  data/dataset.txt HadoopRDD[0] at textFile at WordCount.scala:17 []

From this output we can see the RDDs that the WordCount program created. Let's read it from the bottom up.

The call to sc.textFile creates a MapPartitionsRDD. Internally, textFile first creates a HadoopRDD and then applies a map operation to it, which produces the MapPartitionsRDD. The subsequent flatMap, map, and reduceByKey transformations each produce a new RDD that depends on the previous one. Finally, calling the collect action generates a Job. At this point the Spark scheduler builds a physical execution plan for the RDDs the user's code computes: starting from the RDD on which collect() was called, it works backwards through all the RDDs required for the computation, visiting each RDD's parents, its parents' parents, and so on, recursively, until the plan covers every ancestor RDD that is needed.
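The same backward walk can be reproduced from user code. Below is a small sketch (printLineage is a helper defined here, not part of the Spark API) that starts from the final RDD and follows its public dependencies back to the HadoopRDD, mirroring what toDebugString shows:

import org.apache.spark.rdd.RDD

// Recursively visit an RDD and its parents, in the same direction the
// scheduler walks the lineage when it plans a job.
def printLineage(rdd: RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd)
  rdd.dependencies.foreach(dep => printLineage(dep.rdd, depth + 1))
}

printLineage(result)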

DAG generation

A series of transformations on the original RDD(s) forms a DAG. The dependencies between RDDs record which Parent RDD(s) an RDD was derived from and which partitions of those Parent RDD(s) it depends on. These dependencies form the RDD's Lineage. With lineage, Spark can guarantee that before an RDD is computed, the Parent RDDs it depends on have already been computed. Lineage also makes RDDs fault tolerant: if part or all of an RDD's results are lost, only the lost partitions need to be recomputed, rather than recomputing all of the data.
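As an illustration of how lineage supports fault tolerance, here is a minimal sketch reusing the result RDD from the WordCount example: if we cache it and a cached partition is later lost (for example because an executor dies), Spark rebuilds only that partition by replaying the lineage above it, instead of recomputing everything.

import org.apache.spark.storage.StorageLevel

result.persist(StorageLevel.MEMORY_ONLY)   // keep computed partitions in executor memory
result.count()                              // action: materializes and caches the partitions
// Later actions reuse the cache; any lost partition is recomputed from its lineage.
result.collect()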

How computation tasks are generated from the DAG

The DAG is divided into different stages (Stage) according to the dependency types. For a narrow dependency, the partition-to-partition relationship is deterministic, so a partition's conversion can be completed within a single task; Spark therefore places narrowly dependent RDDs in the same Stage. For a wide dependency, a Shuffle is involved, and the downstream computation can only begin after the Parent RDD(s) have been fully processed; Spark therefore uses wide dependencies as the boundaries when dividing Stages. Inside a Stage, each partition is assigned a computation task (Task), and these tasks can run in parallel. The dependencies between Stages turn the DAG into a coarser-grained DAG whose execution order is likewise front to back. In other words, a Stage can be executed only when it has no Parent Stage or all of its Parent Stages have already finished executing.
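We can check the dependency types directly. In a sketch along these lines, the ShuffledRDD produced by reduceByKey reports a ShuffleDependency (wide, hence a stage boundary), while the map-side RDD reports a OneToOneDependency (narrow, kept in the same stage):

import org.apache.spark.{Dependency, OneToOneDependency, ShuffleDependency}

def describe(deps: Seq[Dependency[_]]): Unit = deps.foreach {
  case _: ShuffleDependency[_, _, _] => println("wide (shuffle) dependency -> new stage")
  case _: OneToOneDependency[_]      => println("narrow dependency -> same stage")
  case other                         => println(s"other dependency: $other")
}

describe(result.dependencies)                        // ShuffledRDD: wide
describe(result.dependencies.head.rdd.dependencies)  // map-side RDD: narrow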

In the WordCount example above, a Shuffle is triggered when the reduceByKey transformation is executed. The Job is therefore split at this point, forming two Stages. In the Spark Web UI (http://127.0.0.1:4040), we can see:

[Figure: Spark Web UI showing the WordCount job and its two stages]

Stage1 depends on Stage0, so Stage0 must finish executing before Stage1 can run. Next, let's look at the relationship between the RDD transformations in Stage0 and Stage1.

[Figure: RDD transformations in Stage0]

Stage0 executes as a pipeline, from reading the file through to the final MapPartitionsRDD. Stage0 generates two tasks, as shown below:

[Figure: the two tasks generated in Stage0]
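The number of tasks in a stage equals the number of partitions of its RDDs; the two tasks here come from the two partitions of the text file RDD. A quick sketch (the partition counts shown are what this particular run produces, and the second variant is only an illustration of how the counts could be changed):

println(lines.getNumPartitions)    // 2 in this run -> 2 tasks in Stage0
println(result.getNumPartitions)   // 2 in this run -> 2 tasks in Stage1

// The partition (and therefore task) counts can be set explicitly:
val lines4  = sc.textFile("data/dataset.txt", 4)                               // at least 4 input partitions
val result4 = lines4.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 4)   // 4 reduce partitions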

The output of Stage0's tasks becomes the input of Stage1's tasks, which produce the final result.

[Figure: RDD transformations and tasks in Stage1]

Summary

After a series of RDD transformations, an action operation is called on the last RDD, which generates a Job. The Job is divided into a number of computation tasks (Task), and these tasks are submitted to compute nodes in the cluster for execution. Spark divides tasks into two kinds: ShuffleMapTask and ResultTask. The final Stage of the DAG (Stage1 in our example) generates one ResultTask for each partition of the result, while all remaining Stages generate ShuffleMapTasks. The generated tasks are sent to Executors that have already been started, and the Executors carry out the computation.

The Spark execution process:

  1. The user's code defines a directed acyclic graph (DAG) of RDDs
    Operations on an RDD create new RDDs, and each child RDD keeps references to its parent RDD(s), forming the DAG.
  2. An action operation forces the directed acyclic graph to be translated into an execution plan
    When an action is called on an RDD, that RDD must be computed, which in turn requires its parent RDDs to be computed. The Spark scheduler submits a job to compute all the required RDDs. The job comprises one or more stages, each stage corresponds to a number of computation tasks, and each stage also covers one or more RDDs in the DAG (executed as a pipeline).
  3. Tasks are scheduled and executed on the cluster
    Each task processes one partition of data. After an action is called on an RDD, many tasks are generated, and the Spark scheduler schedules them onto Workers for execution. Once the last stage of the job has finished, the action operation is complete. A self-contained sketch of this flow follows the figure below.
[Figure: overview of the Spark execution flow]
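To tie the three steps together, here is a minimal self-contained sketch of the same WordCount. It assumes Spark 2.x or later with spark-sql on the classpath; the app name and local[2] master are illustrative (in spark-shell, spark and sc already exist):

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Step 1: transformations only build the DAG of RDDs; nothing runs yet.
    val lines  = sc.textFile("data/dataset.txt")
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Step 2: the action turns the DAG into a job with two stages
    // (split at the reduceByKey shuffle).
    // Step 3: the scheduler runs the stages' tasks on the executors.
    result.collect().foreach { case (word, count) => println(s"$word = $count") }

    spark.stop()
  }
}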


Reproduced from: https://www.jianshu.com/p/8d2bf49cf97d
