Understanding DAG in Spark

St.Antario :

The question is I have the following DAG:

[Image: Spark DAG visualization showing Stage 0, Stage 1, and Stage 2]

I thought that Spark divides a job into different stages only when shuffling is required. Consider Stage 0 and Stage 1: they contain operations that do not require shuffling. So why does Spark split them into different stages?

I thought that the actual movement of data across partitions should happen at Stage 2, because that is where we need to cogroup. But to cogroup we need the data from Stage 0 and Stage 1.

So does Spark keep the intermediate results of these stages and then apply them in Stage 2?
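For reference, since the DAG image itself is not shown here, a job along the following lines would produce a DAG of this shape. The `users`/`orders` RDDs and their data are purely illustrative, not taken from the original question: two independent RDDs each get their own narrow transformations (Stage 0 and Stage 1), and the cogroup forms Stage 2.

```scala
// Illustrative only: two independently transformed RDDs fed into a cogroup
// produce a DAG with two parallel stages plus a third stage for the shuffle.
import org.apache.spark.{SparkConf, SparkContext}

object DagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-example").setMaster("local[*]"))

    // Stage 0: narrow transformations on the first RDD (no shuffle).
    val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
      .mapValues(_.toUpperCase)

    // Stage 1: narrow transformations on a second, unrelated RDD.
    val orders = sc.parallelize(Seq((1, 9.99), (2, 19.99), (1, 4.50)))
      .filter { case (_, amount) => amount > 5.0 }

    // Stage 2: cogroup needs matching keys from *both* RDDs on the same
    // partition, so it shuffles both inputs and starts a new stage.
    val grouped = users.cogroup(orders)
    grouped.collect().foreach(println)

    sc.stop()
  }
}
```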

Tzach Zohar :

You should think of a single "stage" as a series of transformations that can be performed on each of the RDD's partitions without having to access data in other partitions;

In other words, if I can create an operation T that takes in a single partition and produces a new (single) partition, and I apply the same T to each of the RDD's partitions, then T can be executed by a single "stage".
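For instance (a sketch under the same assumptions as the snippet above, reusing the SparkContext `sc`), a chain of `map` and `filter` is fused into one such T and runs as a single stage, while a `reduceByKey` on the result needs data from every partition and therefore starts a new stage:

```scala
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// map and filter are partition-local, so Spark pipelines them: the fused
// function runs over each of the 4 partitions within a single stage.
val evenSquares = numbers
  .map(n => n * n)
  .filter(_ % 2 == 0)

// reduceByKey needs all values for a key, which may live in different
// partitions, so it introduces a shuffle boundary, i.e. a new stage.
val sumsByLastDigit = evenSquares
  .map(n => (n % 10, n))
  .reduceByKey(_ + _)
```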

Now, stage 0 and stage 1 operate on two separate RDDs and perform different transformations, so they can't share the same stage. Notice that neither of these stages operates on the output of the other - so they are not "candidates" for creating a single stage.

NOTE that this doesn't mean they can't run in parallel: Spark can schedule both stages to run at the same time; In this case, stage 2 (which performs the cogroup) would wait for both stage 0 and stage 1 to complete, produce new partitions, shuffle them to the right executors, and then operate on these new partitions.
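One way to inspect that dependency structure (again a sketch, reusing the hypothetical `users` and `orders` RDDs from the first snippet) is `RDD.toDebugString`, which prints the lineage with each shuffle boundary shown as an extra level of indentation; the Spark UI's DAG visualization shows the same thing graphically.

```scala
// The indented parent lineages correspond to the work done in Stage 0 and
// Stage 1; the shuffle before the cogroup is what starts Stage 2.
println(users.cogroup(orders).toDebugString)
```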
