Spark Core: RDD Checkpoint

1. Overview

An RDD's lineage (its chain of dependencies) naturally provides fault tolerance: when a partition of an RDD fails or is lost, it can be reconstructed from its lineage. However, in long-running iterative applications the lineage grows longer with every iteration. If an error occurs late in the process, the partition must be rebuilt by replaying a very long lineage, which hurts performance. For this reason, RDDs support checkpointing, which saves data to persistent storage and allows the earlier lineage to be cut off: a checkpointed RDD no longer needs to know its parent RDDs, because it can read its data directly from the checkpoint.

Although cache and persist can keep RDD data in memory or on disk, this does not guarantee the data will never be lost. If the memory holding the data fails or the disk breaks, Spark must recompute the RDD from scratch. This is why the checkpoint mechanism exists.
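The iterative scenario described above can be sketched as follows. This is a minimal illustration, not code from the article; the checkpoint path reuses the one set later in this post, and the iteration counts are arbitrary. The pattern is: every few iterations, cache the current RDD, checkpoint it, and run an action so the checkpoint job writes the data and the lineage is cut.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: periodic checkpointing in a long iterative job to keep lineage short.
object IterativeCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-checkpoint"))
    sc.setCheckpointDir("hdfs://master:9000/checkpoint1")

    var rdd = sc.parallelize(1 to 1000000).map(_.toLong)
    for (i <- 1 to 100) {
      rdd = rdd.map(_ + 1)     // lineage grows by one step per iteration
      if (i % 10 == 0) {       // every 10 iterations, cut the lineage
        rdd.cache()            // avoid recomputation when the checkpoint job reads the RDD
        rdd.checkpoint()
        rdd.count()            // an action triggers the actual checkpoint write
      }
    }
    println(rdd.sum())
    sc.stop()
  }
}
```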

2. Function

The purpose of checkpoint is to save the more important intermediate data in the DAG at a checkpoint, storing the result in a highly available place (usually HDFS, which supports distributed storage and replication).

3. Features

  • Checkpoint behaves like a transformation in that it is lazy. When an action runs, Spark launches a separate job that computes the RDD's partitions and saves them to the configured checkpoint directory.

  • Once the checkpoint completes, the data is saved to HDFS and all of the RDD's previous dependencies are removed, because the persisted data can be read back directly without recomputation.

  • Calling checkpoint on an RDD does not execute anything immediately; an action must run to trigger it.

  • It is strongly recommended to persist the data in memory first (a cache operation); otherwise the checkpoint will trigger a second full computation of the RDD and waste resources.
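The laziness described in these points can be observed directly in the spark-shell with `isCheckpointed` and `getCheckpointFile`. This is a sketch assuming a running SparkContext and the HDFS path used in this article:

```scala
// Checkpoint is lazy: nothing is written until an action runs.
sc.setCheckpointDir("hdfs://master:9000/checkpoint1")
val rdd = sc.parallelize(1 to 10).map(_ * 2)
rdd.cache()                    // recommended: cache before checkpointing
rdd.checkpoint()
println(rdd.isCheckpointed)    // false: no action has run yet
rdd.collect()
println(rdd.isCheckpointed)    // true: the collect triggered the checkpoint job
println(rdd.getCheckpointFile) // Some(hdfs://...): path of the saved data
```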

4. Application

Prerequisite:

Before using checkpoint, you must set the checkpoint storage path, and if the task runs on a cluster, this path must be on HDFS.

How do you set the checkpoint directory?

Use the SparkContext to set an HDFS checkpoint directory. If no directory is set, calling checkpoint throws an exception: SparkException("Checkpoint directory has not been set in the SparkContext"):

scala> sc.setCheckpointDir("hdfs://master:9000/checkpoint1")

After executing the code above, a directory is created in HDFS:

/checkpoint1/ffe2e7d8-5558-4483-b116-377bfb04c7b1
(Screenshots omitted: the checkpoint directory viewed via HDFS shell commands and via the web UI.)

After the checkpoint directory is set, it can be used as follows:

rdd.cache()
rdd.checkpoint()
rdd.collect
  • Cache before checkpointing, because the checkpoint job recomputes the entire RDD before storing it to HDFS. Without a prior cache, the whole computation runs twice: once for the action, and once more for the checkpoint.
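The double computation can be made visible with an accumulator. This is an illustrative sketch (not from the article) assuming a running SparkContext; the accumulator name is made up. Without a cache, each element is evaluated by both the action job and the checkpoint job:

```scala
// Count how many times the map function actually runs.
val evals = sc.longAccumulator("evaluations")
val rdd = sc.parallelize(1 to 5).map { x => evals.add(1); x }
rdd.checkpoint()   // deliberately no cache() here
rdd.collect()
// evals.value ends up around 10, not 5: the collect job computed the
// RDD once, and the separate checkpoint job computed it a second time.
println(evals.value)
```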

A small example:
1) Set checkpoints

scala> sc.setCheckpointDir("hdfs://master:9000/checkpoint1")

2) Create an RDD

scala> val rdd = sc.parallelize(Array("atguigu"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[207] at parallelize at <console>:26

3) Map the RDD to append the current timestamp, then cache it

scala> val ch = rdd.map(_+System.currentTimeMillis)
ch: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[208] at map at <console>:28

scala> ch.cache
res47: ch.type = MapPartitionsRDD[208] at map at <console>:28

4) Checkpoint the RDD

scala> ch.checkpoint

5) Collect the result multiple times (the timestamp is identical each time, because the data comes from the cache/checkpoint rather than being recomputed)

scala> ch.collect
res49: Array[String] = Array(atguigu1610789426206)                              

scala> ch.collect
res50: Array[String] = Array(atguigu1610789426206)
  • View the checkpoint directory on HDFS (screenshot omitted)

5. The difference between Checkpoint, Cache and Persist

Cache computes an RDD and stores it in memory, but the RDD's dependency chain (analogous to the edits log in the NameNode) cannot be discarded, because the cache is unreliable. If something goes wrong (for example, an executor goes down), fault tolerance is only possible by backtracking the dependency chain and replaying the computation. Checkpoint, by contrast, saves the result in HDFS, which is reliable, so the dependency chain can be cut off; if something goes wrong, fault tolerance is achieved through HDFS's file replication.

The difference comes down to the following three points:

  • Checkpoint can save data to reliable storage such as HDFS; Persist and Cache can only save to local disk and memory

  • Checkpoint can cut off the dependency chain of RDD, but Persist and Cache cannot

  • Because a checkpointed RDD has no upstream dependency chain, its data still exists after the program ends and is not deleted automatically; Cache and Persist data is cleared as soon as the program ends
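The lineage truncation in the second point can be inspected with `toDebugString` and `dependencies`. A sketch assuming a running SparkContext with a checkpoint directory already set:

```scala
val base = sc.parallelize(1 to 10)
val mapped = base.map(_ + 1).filter(_ % 2 == 0)
mapped.cache()
println(mapped.toDebugString) // full lineage back to ParallelCollectionRDD
mapped.checkpoint()
mapped.collect()              // the action triggers the checkpoint
println(mapped.toDebugString) // lineage now starts from a ReliableCheckpointRDD
println(mapped.dependencies)  // a single dependency on the checkpointed data
```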


Origin blog.csdn.net/weixin_45666566/article/details/112712755