Spark (26): RDD operating principles, operations, and characteristics

Why are there RDDs?

        MapReduce (MR) handles iterative computation poorly because it reads and writes the disk too often. In data mining, graph computing, machine learning, and similar workloads, many intermediate results need to be reused. Writing them to disk causes heavy disk overhead, along with a lot of serialization and deserialization.

        Spark organizes a job as a DAG (directed acyclic graph) and pipelines the processing, so intermediate results can be passed along without being written to disk.
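The difference between the two styles can be sketched in plain Python. This is only a toy illustration, not Spark or MapReduce code: the function names `staged_on_disk` and `pipelined_in_memory` and the JSON stage files are assumptions made up for the example.

```python
import json
import os
import tempfile


def staged_on_disk(numbers, workdir):
    """MapReduce-style: every stage writes its full output to disk."""
    stage1 = os.path.join(workdir, "stage1.json")
    with open(stage1, "w") as f:
        json.dump([n * 2 for n in numbers], f)            # serialize stage 1 output
    with open(stage1) as f:
        doubled = json.load(f)                            # deserialize it for stage 2
    stage2 = os.path.join(workdir, "stage2.json")
    with open(stage2, "w") as f:
        json.dump([n for n in doubled if n % 3 == 0], f)  # serialize stage 2 output
    with open(stage2) as f:
        return json.load(f)


def pipelined_in_memory(numbers):
    """DAG-style: stages are chained and each element flows straight through."""
    doubled = (n * 2 for n in numbers)          # generator, nothing computed yet
    kept = (n for n in doubled if n % 3 == 0)   # still nothing computed
    return list(kept)                           # one pass, no intermediate files


with tempfile.TemporaryDirectory() as workdir:
    disk_result = staged_on_disk(range(10), workdir)
memory_result = pipelined_in_memory(range(10))
```

Both functions return the same values, but the first pays for two write/read round trips and the second pays for none.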

RDD design background

What is an RDD?

Spark provides many transformation operations and action operations.

RDD operations

 

RDD execution process

Lazy evaluation and the DAG

RDD operations are either transformations or actions. A transformation only records the lineage (the steps to be performed) and does not actually compute anything, so it is evaluated lazily.

The recorded transformations are executed only when an action is called; this is the lazy evaluation mechanism.
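A minimal pure-Python sketch of this behavior is shown here. `ToyRDD` is a hypothetical class written for illustration and is not Spark's API; its `map` and `filter` play the role of transformations and `collect` plays the role of an action.

```python
class ToyRDD:
    """Toy stand-in for an RDD: stores how to compute the data, not the data."""

    def __init__(self, compute, lineage):
        self._compute = compute   # zero-argument function that yields the elements
        self.lineage = lineage    # names of the steps recorded so far

    @classmethod
    def from_list(cls, data):
        items = list(data)
        return cls(lambda: iter(items), ["from_list"])

    def map(self, fn):
        # Transformation: only records the step, runs nothing.
        return ToyRDD(lambda: (fn(x) for x in self._compute()),
                      self.lineage + ["map"])

    def filter(self, pred):
        # Transformation: only records the step, runs nothing.
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)),
                      self.lineage + ["filter"])

    def collect(self):
        # Action: this is the point where the recorded steps actually run.
        return list(self._compute())


calls = []


def square(x):
    calls.append(x)   # record every real invocation
    return x * x


rdd = ToyRDD.from_list([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)   # square has not been called yet
result = rdd.collect()             # the action triggers the whole chain
calls_after_action = len(calls)    # square has now run once per element
```

Until `collect` is called, `square` is never invoked; the object only carries the recorded steps in `rdd.lineage`.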

 

RDD characteristics

Efficient fault tolerance of RDD 

Typical fault-tolerance methods in existing systems: checkpointing and logging.

In large-scale distributed systems, checkpointing is usually avoided and logging is quite expensive.

RDDs instead record their lineage in the DAG. If any RDD runs into a problem, Spark goes straight to its parent RDD and recomputes the lost data from there.
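Lineage-based recovery can be sketched with a small toy model. `LineageRDD` and its `lose_data` method are hypothetical names invented for this example; real Spark tracks lineage and recomputes per partition, which this sketch simplifies to whole datasets.

```python
class LineageRDD:
    """Toy RDD that remembers its parent and the function applied to it."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent    # the RDD this one was derived from
        self.fn = fn            # the transformation applied to the parent
        self.source = source    # original input, only set on the root
        self.cached = None      # materialized data; may be lost

    def map(self, fn):
        return LineageRDD(parent=self, fn=fn)

    def compute(self):
        if self.cached is not None:
            return self.cached                 # data is still available
        if self.parent is None:
            data = list(self.source)           # root: reload from the source
        else:
            # Walk back to the parent and re-apply the recorded step.
            data = [self.fn(x) for x in self.parent.compute()]
        self.cached = data
        return data

    def lose_data(self):
        """Simulate a failure that wipes the materialized data."""
        self.cached = None


base = LineageRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

first = plus_one.compute()
plus_one.lose_data()             # simulate losing this RDD's data
doubled.lose_data()              # and its parent's data as well
recovered = plus_one.compute()   # rebuilt by walking back through the lineage
```

No checkpoint or log is consulted here: the lost data is rebuilt purely from the recorded parent links and functions.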

The video below explains this in great detail:

[3.11]--RDD dependencies and running process_哔哩哔哩_bilibili


Origin blog.csdn.net/qq_52128187/article/details/131106927