Detailed explanation of Spark checkpoint

There are two main uses of checkpoint in Spark. One is checkpointing RDDs in Spark Core: it cuts off the lineage (dependencies) of the checkpointed RDD and saves the RDD's data to reliable storage (such as HDFS) so that it can be recovered later. The other is in Spark Streaming, where checkpoint saves the DStreamGraph and related configuration information so that, when the driver crashes and restarts, it can pick up where it left off (for example, batch jobs that were still pending before the crash continue to be processed after the restart).

This article analyzes the checkpoint read and write processes in these two scenarios in detail.

1. Checkpoint analysis in Spark Core

1.1 How to use checkpoint

Using checkpoint to take a snapshot of an RDD is roughly as follows:

sc.setCheckpointDir(checkpointDir.toString)
val rdd = sc.makeRDD(1 to 20, numSlices = 1)
rdd.checkpoint()

First, set the checkpoint directory (usually an HDFS directory), which is used to store the RDD-related data (the actual data of each partition, plus the partitioner, if there is one). Then call the checkpoint method on the RDD.

1.2 Checkpoint write process

As you can see, using checkpoint is very simple: set the checkpoint directory, then call the RDD's checkpoint method. For the checkpoint write process, there are four main questions:

Q1: When is the RDD's data actually written? Is it when the RDD's checkpoint method is called?

Q2: When a checkpoint is taken, what data is written to HDFS?

Q3: After an RDD has been checkpointed, what cleanup work is done on the RDD itself?

Q4: In practice, what should you pay attention to when checkpointing an RDD?

Once these four questions are answered, the checkpoint write process should be basically clear. They are addressed one by one below.

A1: First look at the checkpoint method on RDD. You can see that this method only creates a ReliableRDDCheckpointData object; no actual writing happens. The write is triggered later: after runJob has finished computing the RDD, the RDD's doCheckpoint method is called.
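To make this laziness concrete, here is a minimal sketch (paths and data are illustrative) showing that nothing is written until an action runs and doCheckpoint is invoked after runJob:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/checkpoint-demo")   // assumed directory; usually an HDFS path

val rdd = sc.makeRDD(1 to 20, numSlices = 1)
rdd.checkpoint()                 // only creates the ReliableRDDCheckpointData marker
println(rdd.isCheckpointed)      // false: nothing has been written yet

rdd.count()                      // runJob completes, then rdd.doCheckpoint() writes the data
println(rdd.isCheckpointed)      // true
println(rdd.getCheckpointFile)   // Some(.../<uuid>/rdd-<id>)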

A2: Following the call chain RDD.doCheckpoint → RDDCheckpointData.checkpoint → ReliableRDDCheckpointData.doCheckpoint → ReliableCheckpointRDD.writeRDDToCheckpointDirectory, you can see in writeRDDToCheckpointDirectory that the data of each partition of the RDD is written to the checkpoint directory by a separate task launched through runJob (writePartitionToCheckpointFile), and that if the RDD's partitioner is not empty, that object is also serialized and stored in the checkpoint directory (writePartitionerToCheckpointDir). So the data written to HDFS during a checkpoint mainly consists of the actual data of each partition of the RDD, plus the partitioner object if one exists.
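As a hedged way of seeing what ends up in storage, the files under the RDD's checkpoint directory can be listed with the Hadoop FileSystem API (this builds on the rdd from the sketch above; the exact file names are illustrative):

import org.apache.hadoop.fs.{FileSystem, Path}

val checkpointedPath = new Path(rdd.getCheckpointFile.get)       // e.g. .../<uuid>/rdd-<id>
val fs = checkpointedPath.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(checkpointedPath).foreach { status =>
  // expect one part-0000N file per partition, plus a serialized partitioner
  // file if the RDD had a partitioner
  println(status.getPath.getName)
}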

A3: After the checkpoint data has been written to HDFS, the RDD's markCheckpointed method is called. Its main job is to cut off the RDD's dependencies on its upstream RDDs and to clear its partitions, among other cleanup.
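A small sketch of observing this cleanup from the user side (the transformations are illustrative): the lineage reported by toDebugString shrinks to the checkpoint RDD once the write has happened.

val mapped = sc.makeRDD(1 to 20, numSlices = 2).map(_ * 2).filter(_ > 10)
mapped.checkpoint()

println(mapped.toDebugString)    // full lineage back to the ParallelCollectionRDD
mapped.count()                   // the action triggers the checkpoint write
println(mapped.toDebugString)    // lineage now starts from a ReliableCheckpointRDD
println(mapped.dependencies)     // the upstream dependencies have been replaced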

A4: From A1 and A2 we know that after the RDD's computation completes, each partition's data is saved to HDFS through another runJob. This means the RDD is computed twice, so to avoid that it is best to cache the RDD first. In other words, the recommended way to write the code from 1.1 is:

sc.setCheckpointDir(checkpointDir.toString)
val rdd = sc.makeRDD(1 to 20, numSlices = 1)
rdd.cache()
rdd.checkpoint()

1.3 Checkpoint read process

After the checkpoint, the original RDD's dependencies and partitions are obtained from the CheckpointRDD. In other words, the per-partition data and objects such as the partitioner of the original RDD are taken over by the CheckpointRDD.

As can be seen in the compute method of ReliableCheckpointRDD, the concrete implementation of CheckpointRDD, the previously written partition data is read back from the checkpoint directory on HDFS, and the partitioner (if there was one) is likewise restored from the partitioner object previously written to HDFS.

In general, the checkpoint reading process is relatively simple.
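A hedged sketch of what this read path means in practice: once the RDD has been checkpointed, later actions read the partition files back via ReliableCheckpointRDD.compute instead of recomputing the lineage (the transformation here is illustrative).

val expensive = sc.makeRDD(1 to 1000000, numSlices = 4).map(_ * 2)
expensive.cache()
expensive.checkpoint()
expensive.count()        // computes once, caches, and writes the partitions to the checkpoint dir

expensive.unpersist()    // drop the cached copy
expensive.count()        // now served by reading the checkpoint files back from storage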

2. Checkpoint analysis in Spark Streaming

2.1 Using checkpoint in Streaming

Using checkpoint in Streaming mainly involves two things: setting the checkpoint directory, and calling the getOrCreate method when initializing the StreamingContext. That is, when the checkpoint directory contains no data, a new StreamingContext instance is created and the checkpoint directory is set on it; otherwise, the StreamingContext is created from the configuration and data read back from the checkpoint directory.

// Function to create and setup a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)   // new context
  val lines = ssc.socketTextStream(...) // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

2.2 Checkpoint write process in Streaming

Similarly, for the checkpoint write process in Streaming there are three main questions, explained below.

Q1: When are checkpoints taken in Streaming?

A1: In Spark Streaming, the JobGenerator periodically generates the jobs for each batch (JobGenerator.generateJobs). After the jobs are generated, the doCheckpoint method is called to checkpoint the system. In addition, doCheckpoint is also called when the current batch's jobs finish and the metadata is cleared (JobGenerator.clearMetadata).
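To connect this timing to user-facing settings, here is a hedged sketch (the source, port, and intervals are illustrative): the metadata checkpoint described above runs per batch once ssc.checkpoint has set a directory, while the data checkpoint interval of a stateful DStream can be tuned separately with DStream.checkpoint.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-checkpoint-demo")
val ssc = new StreamingContext(conf, Seconds(5))        // batch interval drives job generation
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")      // assumed checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)     // illustrative source
val counts = lines.flatMap(_.split(" ")).map((_, 1L))
  .updateStateByKey[Long]((values, state) => Some(values.sum + state.getOrElse(0L)))
counts.checkpoint(Seconds(25))                          // data checkpoint interval for the stateful stream
counts.print()

ssc.start()
ssc.awaitTermination()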

Q2: During the Streaming checkpoint process, what data is written to the checkpoint directory?

A2: The main checkpointing logic lives in the JobGenerator.doCheckpoint method.

In this method, the checkpoint-related information for the RDDs of the current batch is updated first. For example, DirectKafkaInputDStream updates the time, topic, partition, offset, and other information of the RDDs it has generated.

Then the Checkpoint object is written to the checkpoint directory through the CheckpointWriter (CheckpointWriter.write → CheckpointWriteHandler). So the data written to the checkpoint directory is in fact a Checkpoint object.

Checkpoint mainly contains the following information:

val master = ssc.sc.master
val framework = ssc.sc.appName
val jars = ssc.sc.jars
val graph = ssc.graph
val checkpointDir = ssc.checkpointDir
val checkpointDuration = ssc.checkpointDuration
val pendingTimes = ssc.scheduler.getPendingTimes().toArray
val sparkConfPairs = ssc.conf.getAll

Specifically, it includes the related configuration, the checkpoint directory, the DStreamGraph, and so on. The DStreamGraph mainly contains information such as the InputDStreams and output streams, which means the computation functions defined by the application are also serialized and saved into the checkpoint directory.

Q3: What are the pitfalls of Streaming checkpoints?

A3:

As can be seen from A2, the computation functions defined by the application are also serialized into the checkpoint directory. When the application code changes, the job can no longer be recovered from the checkpoint. Personally, I think this is the biggest obstacle to using checkpoint in a production environment.

In addition, when the StreamingContext is restored from the checkpoint directory, the configuration is also read from the checkpoint (only a small part of the configuration is reloaded; see the read process below for details). When the job is restarted, newly changed configuration values may therefore not take effect, which can lead to very strange problems.

Also, the use of broadcast variables with checkpoints is restricted (SPARK-5206).

2.3 Checkpoint read process in Streaming

When a Spark Streaming job restores the StreamingContext from a checkpoint, it triggers a read of the previously saved Checkpoint object. In StreamingContext.getOrCreate, the previously saved Checkpoint object is restored from the checkpoint directory via CheckpointReader.read. If such an object exists, a StreamingContext is created from the Checkpoint. The restoration of the DStreamGraph inside the StreamingContext also relies on that saved object, and restoreCheckpointData is called to restore the RDDs that had been generated but not yet computed, so that data processing continues from where it left off.
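A hedged sketch of this restore path, with the internal steps summarized as comments (they follow the description above, not the exact private APIs):

val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
// Roughly, getOrCreate does the following when checkpoint data exists:
//   1. CheckpointReader.read(checkpointDirectory) deserializes the latest Checkpoint file
//   2. a StreamingContext is constructed from that Checkpoint (instead of calling functionToCreateContext)
//   3. the DStreamGraph is restored, and restoreCheckpointData re-materializes the
//      RDDs that were generated but not yet computed
//   4. pendingTimes from the Checkpoint are rescheduled so unfinished batches run again
context.start()
context.awaitTermination()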

In addition, it should be noted that the following configuration entries are reloaded when the StreamingContext is created from a checkpoint:

val propertiesToReload = List(
"spark.yarn.app.id",
"spark.yarn.app.attemptId",
"spark.driver.host",
"spark.driver.bindAddress",
"spark.driver.port",
"spark.master",
"spark.yarn.jars",
"spark.yarn.keytab",
"spark.yarn.principal",
"spark.yarn.credentials.file",
"spark.yarn.credentials.renewalTime",
"spark.yarn.credentials.updateTime",
"spark.ui.filters",
"spark.mesos.driver.frameworkId")

3. Summary

This article analyzed the basic checkpoint read and write processes in Spark Core and Spark Streaming, and pointed out some pitfalls encountered when using checkpoints. For Spark Streaming in particular, I personally think checkpoint is not well suited to production use.

 
