RDDs in Spark are lazily evaluated: an RDD is computed only when an action operator is encountered, and when the same RDD is used multiple times, it is recomputed each time, which can add serious overhead. To avoid recomputing the same RDD, it can be persisted.
One of Spark's important features is the ability to persist the data of an RDD to memory or disk. Afterwards, every operator operation on that RDD can fetch the persisted data directly from memory or disk instead of computing the RDD from scratch.
(2) Case Demonstration: Persistence Operation
1. Dependency diagram of RDD
Reading the file and performing a series of operations produces multiple RDDs, as shown in the following figure.
2. Without persistence
In the figure above, two operator operations are performed on RDD3, generating RDD4 and RDD5 respectively. If RDD3 is not persisted, then every operation on RDD3 must start the computation from textFile(): the file data is converted into RDD1, then into RDD2, and finally RDD3 is obtained.
View the file to be manipulated
Start Spark Shell
Follow the diagram to obtain RDD4 and RDD5 (a sketch of the session follows these steps).
Compute RDD4: the run goes from RDD1 to RDD2 to RDD3 to RDD4; view the result.
Compute RDD5: the run likewise goes from RDD1 to RDD2 to RDD3 to RDD5; view the result.
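The actual file and operator chain are not shown here; a minimal spark-shell sketch of such a lineage, with a hypothetical input path and hypothetical transformations, might look like this:

```scala
// Hypothetical input path and transformations; the tutorial's actual file is not shown
val rdd1 = sc.textFile("/data/words.txt")       // RDD1: lines of the file
val rdd2 = rdd1.flatMap(_.split(" "))           // RDD2: words
val rdd3 = rdd2.map(word => (word, 1))          // RDD3: (word, 1) pairs
val rdd4 = rdd3.reduceByKey(_ + _)              // RDD4: word counts
val rdd5 = rdd3.groupByKey()                    // RDD5: grouped occurrences

rdd4.collect()  // runs RDD1 -> RDD2 -> RDD3 -> RDD4
rdd5.collect()  // runs RDD1 -> RDD2 -> RDD3 -> RDD5 all over again
```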
3. With persistence
You can mark an RDD for persistence with its persist() or cache() method (cache() simply calls persist() underneath). The data is computed on the first action and cached in the nodes' memory. Spark's cache is fault-tolerant: if any partition of a cached RDD is lost, Spark automatically recomputes it by replaying the RDD's original transformations.
When the computation reaches RDD3, mark it for persistence.
Computing RDD4 now starts from the data cached for RDD3 instead of running the chain from the beginning.
Computing RDD5 likewise starts from the data cached for RDD3 instead of running the chain from the beginning (see the sketch below).
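Continuing the hypothetical session above, a sketch of the persisted version:

```scala
rdd3.cache()    // mark RDD3 for persistence (MEMORY_ONLY); nothing is cached yet
rdd4.collect()  // first action: computes RDD1 -> RDD2 -> RDD3, caches RDD3, then RDD4
rdd5.collect()  // fetches RDD3 from the cache and only computes RDD5
```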
Second, the storage level
(1) Parameters of the persistence method
Persistence is achieved with the RDD's persist() method, passing a StorageLevel object to specify the storage level. Each persisted RDD can use a different storage level; the default is StorageLevel.MEMORY_ONLY.
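For example (a minimal sketch with hypothetical data):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)          // hypothetical data
rdd.persist(StorageLevel.MEMORY_AND_DISK)    // keep in memory, spill to disk if needed
```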
(2) Spark RDD storage level table
There are seven storage levels for Spark RDDs:
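The table itself is not reproduced here; per the Spark documentation, the seven levels are:

- MEMORY_ONLY: store the RDD as deserialized Java objects in JVM memory; partitions that do not fit are recomputed when needed (the default level).
- MEMORY_AND_DISK: same as above, but partitions that do not fit in memory are stored on disk and read from there.
- MEMORY_ONLY_SER: store the RDD as serialized Java objects (one byte array per partition); more space-efficient, but more CPU-intensive to read.
- MEMORY_AND_DISK_SER: same as MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk.
- DISK_ONLY: store the RDD partitions only on disk.
- MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the corresponding levels above, but each partition is replicated on two cluster nodes.
- OFF_HEAP (experimental): similar to MEMORY_ONLY_SER, but the data is stored in off-heap memory.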
In Spark's shuffle operations (such as reduceByKey()), some intermediate data is saved automatically even if the user never calls persist(). This avoids recomputing the entire input if a node fails during the shuffle. Even so, if you intend to use an RDD multiple times, it is strongly recommended to call persist() on it.
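For instance (hypothetical data), an RDD produced by a shuffle and reused by two actions still benefits from an explicit persist():

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val counts = pairs.reduceByKey(_ + _)  // shuffle; Spark keeps its intermediate data automatically
counts.persist()                       // explicit persist() for reuse across actions
counts.collect()                       // first action computes and caches counts
counts.count()                         // second action reads counts from the cache
```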
(3) How to choose a storage level: the trade-off between memory usage and CPU efficiency
If the RDD fits in memory, prefer the default storage level (MEMORY_ONLY); it is the most CPU-efficient and lets operations on the RDD run as fast as possible.
If the RDD does not fit in memory, use MEMORY_ONLY_SER and choose a fast serialization library; serialized objects save space while remaining reasonably fast to access.
Do not spill data to disk unless computing the RDD is very expensive or it filters out a large amount of data; otherwise, recomputing a partition can be as fast as reading it from disk.
If you want fast recovery after a server failure, use a replicated storage level such as MEMORY_ONLY_2 or MEMORY_AND_DISK_2. These levels let tasks keep running after data loss without waiting for lost partitions to be recomputed; all other storage levels require lost partitions to be recomputed after data loss.
(4) View the source code of the persist() and cache() methods
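The listing is not reproduced here; abridged from the Spark source (doc comments shortened, body elided, exact wording varies by version), the relevant methods of RDD look like this:

```scala
/** Set this RDD's storage level to persist its values across operations. */
def persist(newLevel: StorageLevel): this.type = { /* ... */ }

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
```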
As the code above shows, cache() calls the parameterless overload of persist(), and both default to the MEMORY_ONLY storage level. The difference is that cache() cannot change the storage level, whereas persist() accepts a parameter specifying a custom storage level.
(5) Case Demonstration: Setting the Storage Level
Create a TestPersist object in the net.huawei.rdd package.
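The original listing is not shown; a minimal sketch consistent with the steps below (hypothetical data, 8 partitions to match the WebUI output described later) can be entered in Spark Shell:

```scala
// parallelize() yields a ParallelCollectionRDD; the data is hypothetical
val rdd = sc.parallelize(1 to 100, 8)
rdd.persist()   // mark for persistence at the default MEMORY_ONLY level
```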
In a browser, open the Spark Shell WebUI at http://master:4040/storage/ to view RDD storage information; the storage page is initially empty.
Execute the command rdd.collect() to collect the RDD's data.
Refresh the WebUI: a ParallelCollectionRDD entry now appears in the storage information. The RDD's storage level is MEMORY, it has 8 persisted partitions, and it is stored entirely in memory.
Click the ParallelCollectionRDD hyperlink to view the RDD's detailed storage information.
These steps show that calling persist() on an RDD only marks it for persistence; an RDD marked for persistence is actually persisted only when an action operation is performed.
Execute the following commands to create rdd2 and persist it to disk.
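A sketch of those commands, continuing from the hypothetical rdd above (the map() transformation is an assumption; it yields a MapPartitionsRDD):

```scala
import org.apache.spark.storage.StorageLevel

val rdd2 = rdd.map(_ * 2)              // hypothetical transformation: a MapPartitionsRDD
rdd2.persist(StorageLevel.DISK_ONLY)   // persist to disk only
rdd2.collect()                         // the action triggers computation and persistence
```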
Refresh the WebUI again: a MapPartitionsRDD entry has been added to the storage information. Its storage level is DISK, it has 8 persisted partitions, and it is stored entirely on disk.
(6) Remove the RDD from the cache
Execute the following command to remove rdd (the ParallelCollectionRDD) from the cache.
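For example:

```scala
rdd.unpersist()   // remove the RDD's persisted blocks from memory and disk
```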
Refresh the WebUI: only the MapPartitionsRDD remains; the ParallelCollectionRDD has been removed.
Spark automatically monitors cache usage on each node and evicts old partition data in least-recently-used (LRU) order. To remove an RDD manually instead of waiting for Spark to evict it, call the RDD's unpersist() method.