RDD operating mechanism

1. RDD design and operating principles

Spark's core is built on the unified abstraction of the RDD. Through transformation and action operations on RDDs, the various Spark components can be integrated seamlessly, so that a single application can complete large-scale data computing tasks.

In practice, many iterative algorithms and interactive data-mining tools share a common pattern: intermediate results are reused across computing stages, i.e., the output of one stage becomes the input of the next. Hadoop's MapReduce framework writes these intermediate results to HDFS, which incurs heavy data replication, disk I/O, and serialization costs, and it usually supports only a particular computing model. The RDD provides an abstract data structure that frees developers from worrying about the distributed nature of the underlying data: they only need to express the application logic as a series of transformations. The transformations between different RDDs form dependencies that can be pipelined, which avoids storing intermediate results and greatly reduces replication, disk I/O, and serialization overhead.

1.1. RDD concept

An RDD is a distributed collection of objects that provides a highly constrained shared-memory model: it is essentially a read-only collection of partitioned records that cannot be modified directly. Each RDD can be divided into multiple partitions, each partition is a fragment of the dataset, and different partitions of the same RDD can be stored on different nodes in the cluster, so they can be computed in parallel across the cluster.
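As a minimal sketch of this partitioned structure (the app name and master URL are illustrative, not from the article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPartitions {
  def main(args: Array[String]): Unit = {
    // Local "cluster" with 4 worker threads, just for demonstration.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-partitions").setMaster("local[4]"))

    // Split an in-memory collection into 4 partitions; on a real cluster the
    // partitions would live on different nodes and be processed in parallel.
    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.getNumPartitions)  // 4

    sc.stop()
  }
}
```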

RDDs provide a rich set of operations to support common data manipulation, divided into two types: "actions" (Action), which perform computation and produce output in a specified form, and "transformations" (Transformation), which specify the interdependencies between RDDs. The transformation interface an RDD exposes is very simple: coarse-grained operations such as map, filter, groupBy, and join that transform whole datasets, rather than fine-grained modifications of individual data items. RDDs are therefore well suited to batch applications that apply the same operation to every element of a dataset, and less suited to applications that need asynchronous, fine-grained updates of state, such as web applications or incremental web crawlers.
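A small sketch of the two operation types (spark-shell style, where `sc` is the predefined SparkContext; the data is made up):

```scala
val words = sc.parallelize(Seq("spark", "rdd", "spark", "dag"))

// Transformations: declare new RDDs and record dependencies; nothing runs yet.
val pairs    = words.map((_, 1))
val filtered = pairs.filter { case (w, _) => w != "dag" }

// Action: triggers computation and returns results to the driver program.
filtered.collect().foreach(println)  // (spark,1), (rdd,1), (spark,1)
```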

The typical execution process of an RDD is as follows:

Create an RDD by reading an external data source (or an in-memory collection);
the RDD goes through a series of "transformation" operations, each producing a new RDD for the next transformation to use;
the last RDD is processed by an "action" operation and outputs a value of the specified data type.
RDDs use lazy evaluation: during execution, none of the transformation operations perform any actual computation; they only record dependencies. Only when an action operation is encountered is a real computation triggered, which follows the previously recorded dependencies to obtain the final result.
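A sketch of lazy evaluation (spark-shell style; the file name is a placeholder):

```scala
val lines  = sc.textFile("data.txt")
val errors = lines.filter(_.contains("ERROR"))   // recorded, not executed
val pairs  = errors.map(line => ("error", line)) // still not executed

println(pairs.count())  // the action runs the whole pipeline and returns a value
```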

[Figure: RDD operating mechanism]

The figure below illustrates the actual execution process of an RDD pipeline. Execution starts from two input RDDs, A and C, which are transformed through a series of operations to eventually produce F, which is also an RDD. Note that no actual computation is performed while these transformations are declared, nor does creating the RDDs trigger any computation; Spark only records the flow of data. Only when an action is performed on F to produce output does Spark generate a directed acyclic graph (DAG) from the RDD dependencies and begin the real computation from the starting points. It is this lazy-evaluation mechanism that allows the intermediate results of transformations to flow directly down the pipeline to the next operation instead of being saved.
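One possible reconstruction of the figure's flow in code; the article names only the RDDs A, C, and F, so the intermediate RDDs and operators here are assumptions:

```scala
val A = sc.parallelize(Seq((1, "a"), (2, "b")))
val C = sc.parallelize(Seq((1, 10), (2, 20), (2, 30)))

val B = A.mapValues(_.toUpperCase)         // A -> B  (transformation, only recorded)
val D = C.filter { case (_, v) => v > 5 }  // C -> D  (transformation, only recorded)
val F = B.join(D)                          // (B, D) -> F, still only recorded

// Only this action builds the DAG from the recorded dependencies and runs it.
F.collect().foreach(println)
```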

[Figure: RDD operating mechanism]

1.2. RDD properties

Overall, the main reasons Spark achieves efficient computation by using RDDs are as follows:

Efficient fault tolerance. In the RDD design, data can only be modified by transforming a parent RDD into a child RDD, which means a lost partition can be recomputed directly from the dependency relationships between RDDs, without resorting to redundancy schemes such as data replication. Nor is there any need to log fine-grained operations on specific data items, which greatly reduces the fault-tolerance overhead of data-intensive applications.
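The recorded lineage can be inspected directly; a minimal sketch (spark-shell style):

```scala
val base    = sc.parallelize(1 to 1000, 4)
val derived = base.map(_ * 2).filter(_ % 3 == 0)

// The debug string shows the chain of dependencies: a lost partition of
// `derived` can be recomputed from the matching partition of `base`.
println(derived.toDebugString)
```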

Intermediate results are persisted in memory. Data is passed between RDD operations in memory, without needing to be written to and read from disk, avoiding unnecessary disk I/O overhead;
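A sketch of keeping an intermediate RDD in memory so several actions can reuse it without recomputation (the file name is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("logs.txt")
  .filter(_.nonEmpty)
  .persist(StorageLevel.MEMORY_ONLY)  // equivalent to .cache()

val total = cleaned.count()  // first action: computes the RDD and caches it
val head  = cleaned.take(5)  // later actions are served from memory
```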

Stored data can be plain Java objects, avoiding unnecessary object serialization and deserialization overhead.

1.3. Dependencies between RDDs

Different operations on an RDD can produce different dependencies between the partitions of parent and child RDDs, divided into narrow dependencies (Narrow Dependency) and wide dependencies (Wide Dependency). A narrow dependency is a one-to-one (or many-to-one) relationship between parent partitions and child partitions; operations that produce it include map, filter, and union. A wide dependency is a one-to-many relationship, in which one parent partition contributes to multiple child partitions; operations that produce it include groupByKey and sortByKey.
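The two kinds of dependency can be observed through the RDD API; a sketch (spark-shell style):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

val mapped  = pairs.mapValues(_ + 1)  // narrow: one parent partition per child partition
val grouped = pairs.groupByKey()      // wide: a parent partition feeds many child partitions

println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
println(grouped.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)
```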

[Figure: RDD operating mechanism]

For narrowly dependent RDDs, all parent partitions can be computed in a pipelined fashion, without shuffling data across the network. Wide dependencies, by contrast, usually involve a Shuffle operation: all the data of the parent partitions must first be computed, then shuffled between nodes. Consequently, during failure recovery, a lost partition with only narrow dependencies can be recomputed from its parent partitions alone, and the recomputation can run in parallel on different nodes. For wide dependencies, a single node failure usually means the recomputation involves multiple parent partitions of the parent RDD, at a much larger cost. In addition, Spark provides checkpointing and data logging to persist intermediate RDDs, so that failure recovery does not need to trace the lineage all the way back to the beginning. During recovery, Spark compares the cost of reading the checkpoint data with the cost of recomputing the RDD partitions, and automatically chooses the better recovery strategy.
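A sketch of checkpointing an RDD with a long lineage so that recovery does not have to replay the whole chain (the directory path is an assumption):

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")

var rdd = sc.parallelize(1 to 100, 4)
for (_ <- 1 to 50) rdd = rdd.map(_ + 1)  // a long chain of narrow dependencies

rdd.checkpoint()  // mark for checkpointing; materialized by the next action
rdd.count()       // triggers the computation and writes the checkpoint
```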

1.4. Stage division

Spark generates a DAG by analyzing the dependencies between RDDs, and then decides how to divide the stages by analyzing the dependencies between the partitions of each RDD. The specific method is: traverse the DAG in reverse, break it apart wherever a wide dependency is encountered, and when a narrow dependency is encountered add the current RDD to the current stage. Putting narrow dependencies into the same stage as far as possible enables pipelined computation. In the example in the figure, the DAG is first generated from the data reads, transformations, and actions. Then, when an action is executed, the DAG is analyzed in reverse; since the transformations from A to B and from B and F to G are wide dependencies, the graph is broken apart at these wide dependencies and divided into three stages. Once a DAG has been divided into multiple "stages", each stage represents a task set: a group of associated tasks with no Shuffle dependencies between them. Each task set is submitted to the task scheduler (TaskScheduler), which distributes the tasks to Executors to run.
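As a sketch, a job whose DAG splits into two stages at the wide dependency (stage boundaries can be verified in the Spark web UI):

```scala
val text   = sc.parallelize(Seq("a b a", "b c"), 2)
val words  = text.flatMap(_.split(" "))  // narrow dependency: pipelined
val pairs  = words.map((_, 1))           // narrow dependency: same stage
val counts = pairs.reduceByKey(_ + _)    // wide dependency: shuffle, new stage

counts.collect()  // submitting the action runs the job as two task sets
```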

[Figure: RDD operating mechanism]

1.5. The RDD execution process

Having covered the concept of RDDs, their dependency relationships, and stage division, and combining this with the basic Spark execution flow introduced earlier, we can now summarize how an RDD runs in the Spark architecture (as shown below):

Create the RDD objects;
SparkContext computes the dependencies between the RDDs and constructs the DAG;
DAGScheduler parses the DAG in reverse into multiple stages, each stage containing multiple tasks, and each task is distributed by the task scheduler to an Executor on a worker node for execution.
[Figure: RDD operating mechanism]
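The whole process in one small job, as a sketch: building the RDDs and declaring the DAG is the application's part; stage splitting and task scheduling happen inside Spark when the action runs (names here are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddRunThrough {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-run-through").setMaster("local[2]"))

    val result = sc.parallelize(Seq("spark rdd", "rdd dag"))  // 1. create the RDD
      .flatMap(_.split(" "))                                  //    transformations are
      .map((_, 1))                                            //    recorded into a DAG
      .reduceByKey(_ + _)                                     // 2. one wide dependency
      .collect()  // 3. action: DAGScheduler splits stages, TaskScheduler sends
                  //    the tasks to Executors

    result.foreach(println)
    sc.stop()
  }
}
```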
