Spark Basics Study Notes 20: RDD Persistence, Storage Levels, and Caching

0. Learning Objectives of This Lecture

  1. Understand the need for RDD persistence
  2. Understand the storage levels of RDDs
  3. Learn how to view the RDD cache

1. RDD Persistence

(1) The necessity of introducing persistence

  • RDDs in Spark are lazily evaluated: an RDD is actually computed from scratch only when an action operator is encountered. When the same RDD is used multiple times, it is recomputed each time, which seriously increases overhead. To avoid recomputing the same RDD, the RDD can be persisted.
  • One important capability in Spark is saving the data of an RDD to memory or disk. Whenever an operator needs this RDD again, the persisted data can be fetched directly from memory or disk instead of recomputing the RDD from the beginning.

(2) Case demonstration: persistence in action

1. Dependency diagram of RDD

  • Reading a file and performing a series of operations produces multiple RDDs, as shown in the following figure.
    (Figure: RDD dependency diagram, textFile() → RDD1 → RDD2 → RDD3, with RDD3 branching into RDD4 and RDD5)

2. Without persistence

  • In the figure above, two operators are applied to RDD3, producing RDD4 and RDD5 respectively. If RDD3 is not persisted, every operation on RDD3 must start the computation from textFile(): the file data is converted into RDD1, then into RDD2, and finally into RDD3.
  • View the file to be manipulated
    insert image description here
  • Start Spark Shell
    insert image description here
  • Follow the diagram to build RDD4 and RDD5 (a hedged sketch of this shell session follows the list).
    insert image description here
  • Compute RDD4: the job runs from RDD1 through RDD2 and RDD3 to RDD4; check the result.
    insert image description here
  • Compute RDD5: the job again runs the whole chain, from RDD1 through RDD2 and RDD3 to RDD5; check the result.
    insert image description here
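  • The actual commands are in the screenshots above. A minimal, hedged sketch of such a spark-shell session, assuming a word-count-style pipeline over a hypothetical file words.txt (the file path, transformations, and variable names are illustrative, not taken from the screenshots):

// Hypothetical pipeline matching the dependency diagram (RDD1 through RDD5)
val rdd1 = sc.textFile("/park/words.txt")     // RDD1: read the file
val rdd2 = rdd1.flatMap(_.split(" "))         // RDD2: split lines into words
val rdd3 = rdd2.map((_, 1))                   // RDD3: map each word to a (word, 1) pair
val rdd4 = rdd3.reduceByKey(_ + _)            // RDD4: word counts
val rdd5 = rdd3.groupByKey()                  // RDD5: words grouped by key
// Without persistence, each action below re-runs the whole lineage starting from rdd1
rdd4.collect()
rdd5.collect()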

3. With persistence

  • You can mark an RDD for persistence with its persist() or cache() method (under the hood, cache() simply calls persist()). The data is computed during the first action and cached in the nodes' memory. Spark's cache is fault-tolerant: if any partition of a cached RDD is lost, Spark automatically recomputes it by replaying the transformations that originally produced the RDD.
  • When the computation reaches RDD3, mark it for persistence (a hedged sketch follows the list).
    insert image description here
  • Computing RDD4 now starts from the data cached for RDD3, so the job does not have to run from the very beginning.
    insert image description here
  • Computing RDD5 likewise starts from the data cached for RDD3, without running the whole lineage again.
    insert image description here
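  • Continuing the same hedged sketch from above: marking rdd3 for caching before the actions means the first action materializes the cache and the second action reuses it.

rdd3.cache()      // mark rdd3 for persistence (equivalent to persist() with MEMORY_ONLY)
rdd4.collect()    // first action after marking: runs rdd1 -> rdd2 -> rdd3, caches rdd3's partitions, then computes rdd4
rdd5.collect()    // second action: computes rdd5 directly from the cached rdd3 partitions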

2. Storage Levels

(1) Parameters of the persistence method

  • Persistence is achieved with the RDD's persist() method, which accepts a StorageLevel object that specifies the storage level. Each persisted RDD can use a different storage level; the default is StorageLevel.MEMORY_ONLY.
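  • For example (a hedged snippet; rdd stands for any RDD you have already built):

import org.apache.spark.storage.StorageLevel
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // keep partitions in memory, spill those that do not fit to disk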

(2) Spark RDD storage level table

  • There are seven storage levels for Spark RDDs
    insert image description here
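  • The table itself is shown as an image in the original post; for reference, the seven levels as described in the Spark documentation are:
      MEMORY_ONLY: store the RDD as deserialized Java objects in the JVM; partitions that do not fit in memory are not cached and are recomputed on the fly when needed (the default level)
      MEMORY_AND_DISK: store as deserialized Java objects; partitions that do not fit in memory are written to disk and read from there when needed
      MEMORY_ONLY_SER: store as serialized Java objects (one byte array per partition); more space-efficient, but more CPU-intensive to read
      MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk
      DISK_ONLY: store the RDD partitions only on disk
      MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the corresponding levels above, but each partition is replicated on two cluster nodes
      OFF_HEAP (experimental): like MEMORY_ONLY_SER, but the data is stored in off-heap memory (off-heap memory must be enabled)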
  • In Spark's shuffle operations (such as reduceByKey()), some intermediate data is saved automatically even if the user never calls persist(). This avoids recomputing the entire input if a node fails during the shuffle. If you intend to use an RDD multiple times, it is still strongly recommended to call persist() on that RDD.

(3) How to choose a storage level: a trade-off between memory usage and CPU efficiency

  • If the RDD fits in memory, prefer the default storage level (MEMORY_ONLY); it is the most CPU-efficient choice and lets operations on the RDD run as fast as possible.
  • If the RDD does not fit in memory at the default level, use MEMORY_ONLY_SER and choose a fast serialization library; serialized objects save space while remaining reasonably fast to access (see the sketch after this list).
  • Do not spill data to disk unless computing the RDD is very expensive or the RDD filters out a large amount of data, because recomputing a partition can be as fast as reading it from disk.
  • If you want fast recovery after a node failure, use a replicated storage level such as MEMORY_ONLY_2 or MEMORY_AND_DISK_2. These levels let tasks keep running on the RDD after data loss without waiting for lost partitions to be recomputed; all other storage levels require lost partitions to be recomputed after data loss.
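  • A hedged sketch of the MEMORY_ONLY_SER option with Kryo as the fast serializer (the application name and data are illustrative; spark.serializer must be set before the SparkContext is created):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Configure Kryo serialization, then cache the RDD in serialized form
val conf = new SparkConf()
  .setAppName("SerializedCacheDemo")
  .setMaster("local")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized objects take less memory than MEMORY_ONLY
data.count()                                 // the first action materializes the serialized cache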

(4) View the source code of the persist() and cache() methods

/**
 * Persist the RDD when the first action is executed and set its storage level.
 * This method can only be used when the RDD has never had a storage level set.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // If this RDD was previously marked as a localCheckpoint, override the previous storage level
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/**
  * Persist the RDD using the default storage level (MEMORY_ONLY)
  */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
  * Persist the RDD using the default storage level (MEMORY_ONLY)
  */
def cache(): this.type = persist()
  • As the code above shows, cache() calls the parameterless persist() method, and both default to the MEMORY_ONLY storage level; however, cache() cannot change the storage level, while persist() can take a parameter to specify a custom one.

(5) Case demonstration: setting the storage level

  • Create a TestPersist object in the net.huawei.rdd package
    insert image description here
package net.huawei.rdd

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel // needed if an explicit storage level is passed to persist()

/**
  * Function: demonstrate persistence
  * Author: 华卫
  * Date: April 11, 2022
  */
object TestPersist {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object
    val conf = new SparkConf()
    conf.setAppName("TestPersist")
      .setMaster("local")
      .set("spark.testing.memory", "2147480000")
    // Create the Spark context from the configuration
    val sc = new SparkContext(conf)

    // Suppress Spark's runtime log output
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("com").setLevel(Level.OFF)
    System.setProperty("spark.ui.showConsoleProgress", "false")
    Logger.getRootLogger().setLevel(Level.OFF)

    // Create an RDD
    val rdd: RDD[Int] = sc.parallelize(List(100, 200, 300, 400, 500))

    // Mark the RDD for persistence; the default storage level is StorageLevel.MEMORY_ONLY
    rdd.persist()
    // rdd.persist(StorageLevel.DISK_ONLY)        // persist to disk only
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // persist to memory, spilling overflow to disk

    // The first action actually persists the RDD that was marked for persistence
    val result: String = rdd.collect().mkString(", ")
    println(result)

    // The second action reads the data directly from the persisted store
    // instead of recomputing it from the beginning
    rdd.collect().foreach(println)
  }
}
  • Run the program and see the results
    insert image description here

3. Using the Spark Web UI to View the Cache

(1) Create an RDD and mark it for persistence

insert image description here
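  • The exact commands are in the screenshot; a hedged sketch of what they likely look like in spark-shell (the data and the partition count of 8 are assumptions based on the storage information shown below):

val rdd = sc.parallelize(1 to 100, 8)   // a ParallelCollectionRDD with 8 partitions (illustrative data)
rdd.persist()                           // mark it for caching at the default MEMORY_ONLY level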

(2) Viewing RDD storage information in the Spark Web UI

  • Open the Spark shell's Web UI in a browser at http://master:4040/storage to view the RDD storage information; at this point the storage page is empty.
    insert image description here
  • Execute the command rdd.collect() to collect the RDD's data.
    insert image description here
  • Refresh the Web UI and a ParallelCollectionRDD entry appears in the storage information: its storage level is Memory, 8 partitions are persisted, and the RDD is stored entirely in memory.
    insert image description here
  • Click the ParallelCollectionRDD hyperlink to view the RDD's detailed storage information.
    insert image description here
  • The operations above show that calling an RDD's persist() method only marks the RDD for persistence; an RDD marked for persistence is actually persisted only when an action is executed.
  • Execute the following commands to create rdd2 and persist it to disk.
    insert image description here
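  • A hedged sketch of those commands (the transformation is illustrative; a map() over the previous rdd does yield a MapPartitionsRDD):

import org.apache.spark.storage.StorageLevel
val rdd2 = rdd.map(_ * 2)              // a MapPartitionsRDD derived from rdd (illustrative transformation)
rdd2.persist(StorageLevel.DISK_ONLY)   // persist rdd2 to disk only
rdd2.collect()                         // the action triggers the actual persistence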
  • Refresh the Web UI and a MapPartitionsRDD entry appears as well: its storage level is Disk, 8 partitions are persisted, and the RDD is stored entirely on disk.
    insert image description here

(3) Removing an RDD from the cache

  • Execute the following command to remove rdd (the ParallelCollectionRDD) from the cache.
    insert image description here
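  • The command shown in the screenshot is presumably the following:

rdd.unpersist()   // remove rdd's cached partitions; the ParallelCollectionRDD disappears from the storage page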
  • Refresh the Web UI: only the MapPartitionsRDD entry remains; the ParallelCollectionRDD has been removed.
    insert image description here
  • Spark automatically monitors cache usage on each node and evicts old partition data from the cache in a least-recently-used (LRU) fashion. If you want to remove an RDD manually instead of waiting for Spark to evict it, call the RDD's unpersist() method.

Source: blog.csdn.net/howard2005/article/details/124092714