[Spark] The cache, persist, and checkpoint mechanisms

 

checkpoint: cuts off the lineage between an RDD and its parent RDDs and stores the current RDD in an external system such as HDFS. Advantages: 1) avoids spending too much storage on a long RDD lineage; 2) after a node failure, avoids repeated computation across wide dependencies.

How checkpointing works:

1) Set a checkpoint on the RDD with rdd.checkpoint().

2) This RDD is marked as "marked for checkpoint".

3) When the job finishes, finalRDD.doCheckpoint() is called; it walks back along the RDD lineage, finds every RDD marked "marked for checkpoint", and applies a second mark (inProgressCheckpoint) to indicate that checkpointing is in progress.

4) Spark then internally launches a new job that writes the data of every RDD marked "inProgressCheckpoint" to HDFS or another external system.

5) The parent of this RDD is set to a CheckpointRDD, and the original lineage is cut off. After a node failure, computeOrReadCheckpoint() first tries to read the checkpointed data in HDFS through the CheckpointRDD instead of recomputing.

6) Because this RDD would otherwise be computed twice (once by the original job, once by the internal checkpoint job), cache it before setting the checkpoint, i.e. call rdd.cache() before rdd.checkpoint(), as shown in the sketch below.
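A minimal sketch of this recommended pattern in Scala; the checkpoint directory path and the data set are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

    // The checkpoint directory should live on a fault-tolerant file system
    // such as HDFS (hypothetical path).
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // Cache first: otherwise the internal checkpoint job recomputes the
    // whole lineage a second time (step 6 above).
    rdd.cache()
    rdd.checkpoint() // step 1: mark this RDD for checkpointing

    // The first action runs the job; doCheckpoint() then launches the
    // separate job that writes the marked RDD to the checkpoint directory.
    println(rdd.count())

    sc.stop()
  }
}
```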

Within the same application it is easy to share data: use cache or persist.

Guidelines for caching: 1) cache RDDs that are reused many times; 2) the cached RDD must not be too large, or it will occupy too much memory or other resources.

cache: calls persist(MEMORY_ONLY), storing the RDD data only in memory; partitions that do not fit are recomputed from the lineage when needed.
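Caching pays off when an RDD feeds more than one action. A brief sketch, assuming a spark-shell session where sc is in scope and the input path is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///data/logs").flatMap(_.split(" "))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
words.cache()

// Both actions reuse the in-memory copy instead of re-reading the file;
// partitions that did not fit in memory are recomputed from the lineage.
val total    = words.count()
val distinct = words.distinct().count()
```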

persist: lets you choose memory, disk, serialization, and the number of replicas. But this is a temporary cache: when the application ends, the driver and executor processes exit, the BlockManager stops, and the data it manages is deleted.
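persist takes a StorageLevel that encodes these choices. A sketch of some common levels (an RDD can hold only one storage level at a time, so the alternatives are shown commented out):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)

// Memory first, spill the remaining partitions to local disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Serialized in memory: smaller footprint, extra CPU cost to deserialize.
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// The trailing "_2" keeps two replicas on different nodes.
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

rdd.count()     // the first action materializes the cached blocks
rdd.unpersist() // release the blocks early instead of waiting for app shutdown
```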

