The Operating Mechanism of RDDs in Spark

1. RDD Design and Operating Principles

Spark's core is built on the unified abstraction of the RDD. Through RDD transformation operations and actions, all Spark components can be seamlessly integrated, so that large-scale data computing tasks can be completed within the same application.

1.1. RDD concept

An RDD is a distributed collection of objects that provides a highly constrained shared-memory model: it is essentially a read-only collection of partitioned records and cannot be modified directly. Each RDD can be divided into multiple partitions, each partition being a fragment of the dataset. Different partitions of an RDD can be stored on different nodes in the cluster, so they can be computed in parallel across those nodes.

RDD provides a rich set of operations to support common data processing, divided into two types: "actions" (Action) and "transformations" (Transformation). The former perform computations and produce output in a specified form, while the latter specify the dependencies between RDDs. The transformation interfaces provided by RDD are very simple: coarse-grained data transformation operations such as map, filter, groupBy, and join, rather than fine-grained modifications of individual data items. RDD is therefore well suited to batch applications that apply the same operation to every element of a dataset, and less suited to applications that need asynchronous, fine-grained updates of state, such as web applications or incremental web crawlers.

The typical execution flow of an RDD is as follows:

  1. Create an RDD by reading an external data source (or an in-memory collection);
  2. The RDD goes through a series of "transformation" operations, each of which produces a new RDD that is supplied to the next transformation;
  3. The last RDD is processed by an "action" operation and outputs a value of the specified data type.

RDDs use lazy evaluation: during execution, none of the transformation operations perform any actual computation; they only record the dependencies. Only when an action is encountered is the real computation triggered, and the final result is obtained according to the previously recorded dependencies.
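As a concrete illustration, here is a minimal Scala sketch of this three-step flow; the input path, application name, and filtering logic are illustrative assumptions, not part of the original text. Nothing is computed until the final count action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddFlowExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddFlowExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 1. Create an RDD from an external data source (the path is hypothetical).
    val lines  = sc.textFile("hdfs://namenode:9000/logs/access.log")

    // 2. Transformations: only the lineage (dependencies) is recorded here.
    val errors = lines.filter(_.contains("ERROR"))
    val codes  = errors.map(_.split("\t")(0))

    // 3. The action triggers the actual computation over the recorded lineage.
    println(s"error lines: ${codes.count()}")

    sc.stop()
  }
}
```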


The following example describes the actual execution process of an RDD. As shown in the figure below, two input RDDs, A and C, are created first; after a series of transformation operations, a final RDD F is generated. Note that no actual computation is performed while these transformations are defined, nor when the RDDs are created; Spark only records the flow of data (the lineage). Only when the action on F is executed to generate output data does Spark build a directed acyclic graph (DAG) of the RDD dependencies and start the actual computation from the starting point. It is precisely this lazy-evaluation mechanism that allows the intermediate results of transformations to flow directly into the next operation in a pipeline, without having to be saved.

(Figure: an example RDD lineage, in which inputs A and C are transformed into the final RDD F)
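Below is a minimal Scala sketch of a pipeline shaped like the one in the figure, assuming a spark-shell session where `sc` is already available; the intermediate RDD names B and D and the concrete transformations are illustrative assumptions.

```scala
// Two input RDDs (A and C), a few transformations, and a final action on F.
val A = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val C = sc.parallelize(Seq((1, 10), (2, 20), (4, 40)))

val B = A.map { case (k, v) => (k, v.toUpperCase) }   // transformation on A: lineage only
val D = C.filter { case (_, v) => v > 5 }             // transformation on C: lineage only
val F = B.join(D)                                     // combine the two branches into F

F.collect().foreach(println)  // the action: the whole DAG is executed only at this point
```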

1.2. RDD properties

Overall, the main reasons why Spark achieves efficient computation by using RDDs are as follows:

  1. Efficient fault tolerance. In the RDD design, data can only be modified by transforming a parent RDD into a child RDD, which means that lost partitions can be recomputed directly from the dependency relationships between RDDs, without resorting to data redundancy such as replication. Nor is there any need to log fine-grained operations on specific data items, which greatly reduces the fault-tolerance overhead of data-intensive applications.

  2. Intermediate results are persisted in memory. Data passed between multiple RDD operations stays in memory and does not need to be written to and read back from disk, avoiding unnecessary disk I/O overhead (see the sketch below).
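A small sketch of keeping a reused intermediate result in memory, again assuming a spark-shell session where `sc` is available; the input path is an assumption.

```scala
import org.apache.spark.storage.StorageLevel

// Keep a reused intermediate RDD in memory so later actions do not recompute it.
val words = sc.textFile("hdfs://namenode:9000/docs/corpus.txt")
  .flatMap(_.split(" "))
  .filter(_.nonEmpty)
  .persist(StorageLevel.MEMORY_ONLY)

val total    = words.count()              // first action: computes and caches the partitions
val distinct = words.distinct().count()   // second action: reads the cached partitions from memory
```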

1.3. Dependencies between RDDs

Different operations on RDDs cause the partitions of the resulting RDDs to have different dependencies on their parents, divided into narrow dependencies (Narrow Dependency) and wide dependencies (Wide Dependency). A narrow dependency is a one-to-one or many-to-one relationship between parent partitions and a child partition; operations such as map, filter, and union produce narrow dependencies. A wide dependency is a one-to-many relationship between a parent partition and child partitions, i.e. a single partition of the parent RDD feeds multiple partitions of the child RDD; operations such as groupByKey and sortByKey produce wide dependencies.

(Figure: narrow dependencies vs. wide dependencies between RDD partitions)
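The contrast can be seen directly in the operations themselves; a short sketch (again assuming `sc` from a spark-shell session):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Narrow dependencies: each child partition depends on a bounded set of parent
// partitions, so map, filter and union require no shuffle.
val doubled  = pairs.map { case (k, v) => (k, v * 2) }
val filtered = doubled.filter { case (_, v) => v > 2 }
val merged   = filtered.union(pairs)

// Wide dependencies: one parent partition feeds multiple child partitions, so
// groupByKey and sortByKey require a shuffle across the cluster.
val grouped = merged.groupByKey()
val sorted  = merged.sortByKey()
```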

For RDDs with narrow dependencies, all parent partitions can be computed in a pipelined manner, without any shuffling of data across the network. For RDDs with wide dependencies, a Shuffle operation is usually involved: all the data of the parent partitions must be computed first, and the data is then shuffled between nodes. Consequently, during data recovery, a lost partition behind a narrow dependency only needs to be recomputed from its corresponding parent partitions, and this recomputation can be performed in parallel on different nodes. For a wide dependency, by contrast, recovering from a single node failure usually involves recomputing multiple parent RDD partitions, which is much more expensive. In addition, Spark provides checkpointing and data logging for persisting intermediate RDDs, so that failure recovery does not need to trace the lineage all the way back to the beginning. During recovery, Spark compares the overhead of reading checkpoint data with the cost of recomputing the RDD partitions and automatically chooses the better recovery strategy.
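A sketch of checkpointing an intermediate RDD, assuming a spark-shell session; the HDFS paths are assumptions. Caching before checkpointing is common practice, because the checkpoint is written by a separate job that would otherwise recompute the RDD from its lineage.

```scala
sc.setCheckpointDir("hdfs://namenode:9000/spark/checkpoints")  // checkpoint storage (path assumed)

val counts = sc.textFile("hdfs://namenode:9000/docs/corpus.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.cache()        // keep in memory so the checkpoint job does not recompute the lineage
counts.checkpoint()   // materialized to the checkpoint directory when the next action runs
counts.count()        // action: triggers both the computation and the checkpoint write
```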

1.4. Stage Division

Spark generates a DAG by analyzing the dependencies between RDDs, and then decides how to divide stages by analyzing the dependencies between the partitions of those RDDs. The specific method is: traverse the DAG in reverse; whenever a wide dependency is encountered, break the graph there; whenever a narrow dependency is encountered, add the current RDD to the current stage. Placing narrow dependencies in the same stage as far as possible makes pipelined computation possible. In the example in the figure below, the DAG is first generated from the data-reading, transformation, and action operations. When the action is executed, the DAG is analyzed in reverse; since the transformations from A to B and from B and F to G are both wide dependencies, the graph is broken at these wide dependencies and divided into three stages. After a DAG has been divided into multiple "stages", each stage represents a task set made up of a group of associated tasks that have no Shuffle dependencies on each other. Each task set is submitted to the task scheduler (TaskScheduler), which distributes the tasks to Executors for execution.

(Figure: a DAG divided into stages at wide-dependency boundaries)
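The stage boundary introduced by a wide dependency can be observed from an RDD's lineage; a small sketch assuming a spark-shell session:

```scala
val counts = sc.parallelize(Seq("spark rdd dag", "spark stage task"))
  .flatMap(_.split(" "))
  .map(w => (w, 1))     // narrow dependency: stays in the same stage as flatMap
  .reduceByKey(_ + _)   // wide dependency: introduces a shuffle and hence a new stage

println(counts.toDebugString)  // the indentation around ShuffledRDD marks the stage boundary
counts.collect()               // the action submits the stages as task sets to the TaskScheduler
```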

1.5. The Process of Running RDDs

Having introduced the concepts of RDDs, their dependencies, and stage division above, and combining them with the basic Spark execution flow introduced earlier, we can summarize the process of running RDDs in the Spark architecture as follows (as shown in the figure below):

  1. Create RDD objects;
  2. SparkContext is responsible for computing the dependencies between RDDs and building the DAG;
  3. The DAGScheduler is responsible for parsing the DAG in reverse into multiple stages, each stage containing multiple tasks; each task is distributed by the TaskScheduler to an Executor on a worker node for execution.

(Figure: the process of running RDDs in the Spark architecture)
