SparkCore RDD Checkpoint in Detail

Introduction:

  • RDD data can be persisted, but persistence/caching keeps the data in memory: fast, yet the least reliable (it disappears if memory is lost). It can also spill data to disk, which is still not fully reliable, since a disk can be damaged.
  • Checkpoint was created for more reliable data persistence. With Checkpoint, the data is usually written to HDFS, so it naturally benefits from HDFS's built-in high fault tolerance and high reliability, giving the RDD the greatest possible data safety, fault tolerance, and availability.

RDD's checkpoint mechanism:

  • 1- What is the checkpoint mechanism

    • It uses Checkpoint to save data to non-volatile storage such as HDFS.
  • 2- What is the purpose of the checkpoint mechanism

    • The RDD cache cannot guarantee data safety because it lives in volatile storage. The Checkpoint mechanism writes the data to HDFS instead, so the checkpoint inherits HDFS's high fault tolerance and high reliability.
  • 3- How to use the checkpoint mechanism

    • sc.setCheckpointDir("path on HDFS") to set the checkpoint directory, then rdd.checkpoint() on the RDD to be saved (the next action triggers the write).
  • 4- What is the difference and connection between the checkpoint mechanism and the RDD cache?

    • Difference: cache/persist stores data in memory or on local disk, keeps the RDD's lineage, and is removed when the application ends; Checkpoint writes data to HDFS, truncates the lineage, and survives after the application exits.
    • Connection: both persist an RDD's data so later jobs can reuse it without recomputation.

  • 5- Note:

    • checkpoint() is lazy: the data is only written when an action triggers a job, and that checkpoint job recomputes the RDD unless it was cached first, so it is common to cache an RDD before checkpointing it.

How does Spark implement fault tolerance?

  • (1) Spark first checks whether the RDD's data is already in memory from cache or persist; if not, it continues.

  • (2) It then checks whether a Checkpoint has been set for the RDD; if so, the data is restored from the checkpoint.

  • (3) Otherwise, it recomputes the data from the RDD's lineage (its dependency chain).

Origin blog.csdn.net/m0_49834705/article/details/112724515