Why are there RDDs?
The iterative calculation of MR is not very good, and the disk read and write is too short. It is used in data mining, graph computing, machine learning, etc., and many of the results need to be reused. Writing to disk will cause a lot of disk overhead, and there are many serialization and deserialization .
DAG has directed acyclic graph pipelined processing and does not need to be written to disk.
RDD design background
What is RDD
Spark provides many transformation operations, action operations
RDD operations
RDD execution process
Inert mechanism and DAG diagram
Conversion operation and action operation, the conversion operation only records the trajectory and does not really calculate, so it is lazy loading.
The conversion will only occur when the action method is performed, which is an inert mechanism.
RDD characteristics
Efficient fault tolerance of RDD
Typical system fault tolerance methods: checkpoints, log methods
In large-scale distributed systems: checkpoints are usually not used, and the cost of logging is quite high
However, RDD uses DAG to record the track. If there is a problem with any RDD, it will directly find its parent node and roll back.
very thin