Spark Architecture and Operating Mechanism (4) - Building RDD Directed Acyclic Graph

    After the Spark application is initialized, the first RDD is created by reading the input data through the SparkContext. The program then transforms that RDD again and again with RDD operators until the final calculation result is obtained.
    Throughout this process, each RDD is immutable: the program converts one RDD into another, eventually producing the RDD it wants and writing out the result. To organize this conversion process, Spark builds a directed acyclic graph (DAG) of RDDs from the RDD operations in the program or script we write, which it later uses for splitting and scheduling the work.
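    As a rough sketch of such a transformation chain (the input path, application name, and variable names here are illustrative, not taken from the original text), each operator below returns a new RDD instead of modifying an existing one:

        import org.apache.spark.{SparkConf, SparkContext}

        // Minimal local setup; the master URL and file path are placeholder assumptions.
        val conf = new SparkConf().setAppName("RddChainSketch").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        val lines    = sc.textFile("hdfs:///input/data.txt")  // first RDD, built from the input data
        val words    = lines.flatMap(_.split(" "))            // new RDD derived from lines
        val filtered = words.filter(_.nonEmpty)               // another new RDD; words itself is unchanged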
    In the RDD directed acyclic graph, a newly generated RDD (the child RDD) is produced by transforming one or more existing RDDs (its parent RDDs), so the content of the child RDD depends on the content of its parents. Spark calls this dependency between RDDs the "lineage".
    Spark keeps the lineage information of every RDD, so from any RDD it can trace back through its parent RDDs to the most original RDD. To guarantee that this lineage is well defined, Spark requires that there be no circular dependencies between RDDs.
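    Continuing the illustrative sketch above, the RDD method toDebugString prints this lineage, tracing an RDD back through its parents to the original input (the exact output format varies by Spark version):

        println(filtered.toDebugString)
        // Typical output lists the chain of parent RDDs, for example:
        // (2) MapPartitionsRDD[3] at filter ...
        //  |  MapPartitionsRDD[2] at flatMap ...
        //  |  hdfs:///input/data.txt MapPartitionsRDD[1] at textFile ...
        //  |  hdfs:///input/data.txt HadoopRDD[0] at textFile ...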

    So why does Spark keep lineage information? Because Spark is a distributed parallel computing system, a node may go down or data transfers may be lost. To ensure that the entire Spark application can still run to completion when such failures occur, a fault-tolerance mechanism is required. Recording lineage information serves exactly this purpose: a lost piece of data can be recomputed from its parent RDDs by replaying the recorded transformations.
   
    The construction of the RDD directed acyclic graph in Spark consists of recording the series of RDD transformations in the Spark code as lineage relationships; the lineage information grows as the code is evaluated statement by statement. During this phase Spark does not actually execute the transformations, it only records them. Only when an action operator appears is the actual sequence of RDD operations triggered: all operator calls before the action are assembled into a directed acyclic graph forming a job (Job), and the job is submitted to the cluster for parallel processing.
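    A small sketch of this lazy behaviour (the data and operators are hypothetical, reusing the SparkContext sc from the earlier sketch): the transformations only extend the lineage, and nothing runs until the action count is invoked:

        val nums    = sc.parallelize(1 to 1000000)
        val doubled = nums.map(_ * 2)            // recorded in the lineage, not executed yet
        val evens   = doubled.filter(_ % 4 == 0) // recorded, not executed

        val n = evens.count()                    // action: the DAG built so far becomes a job and is submitted
        println(s"count = $n")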

    Through the processes above, a Spark program is decomposed into one or more jobs and submitted to the Spark cluster. Taking WordCount as an example, only when the saveAsTextFile action operator runs does Spark actually execute the RDD conversion process; since the whole program contains only one action operator, it submits only one job.
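    For reference, a minimal WordCount sketch (the input and output paths are placeholders, again reusing sc from above): everything before saveAsTextFile merely records lineage, and the single action submits the one and only job:

        val counts = sc.textFile("hdfs:///input/words.txt")
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///output/wordcount")  // the only action: exactly one job is submitted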
