Spark basics study notes: RDD persistence, storage levels, and caching

1. RDD persistence

(1) The necessity of introducing persistence

RDDs in Spark are lazily evaluated: an RDD is not actually computed until an action operator is encountered. If the same RDD is used multiple times, it is recomputed from scratch each time, which significantly increases the cost. To avoid repeatedly computing the same RDD, the RDD can be persisted.
One of Spark's important capabilities is saving the data of an RDD to memory or disk. Every subsequent operator applied to that RDD can then fetch the persisted data directly from memory or disk, without recomputing the RDD from scratch.

(2) Case demonstration: persistence operation

1. RDD dependency graph

2. Not using the persistence operation

View the file to be operated on.
Start the Spark Shell
Operate according to the dependency graph (RDD1 → RDD2 → RDD3, with RDD4 and RDD5 both derived from RDD3) to obtain RDD4 and RDD5.
To calculate RDD4, the computation runs from RDD1 to RDD2 to RDD3 to RDD4. Check the results.

To calculate RDD5, the computation again runs from RDD1 to RDD2 to RDD3 and then to RDD5. View the results.
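A minimal sketch of the kind of Spark Shell commands used for this step; the HDFS path, host name, and specific transformations are assumptions for illustration, not the original commands.

val rdd1 = sc.textFile("hdfs://master:9000/park/test.txt")  // RDD1: read the source file (path is hypothetical)
val rdd2 = rdd1.flatMap(_.split(" "))                       // RDD2
val rdd3 = rdd2.map(x => (x, 1))                            // RDD3
val rdd4 = rdd3.reduceByKey(_ + _)                          // RDD4, derived from RDD3
val rdd5 = rdd3.sortByKey()                                  // RDD5, also derived from RDD3

rdd4.collect()  // runs RDD1 -> RDD2 -> RDD3 -> RDD4
rdd5.collect()  // runs RDD1 -> RDD2 -> RDD3 -> RDD5 again, recomputing RDD3 from scratch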

3. Using the persistence operation

When the computation reaches RDD3, mark it for persistence.
When RDD4 is calculated, the computation starts from the data cached in RDD3, without running from the beginning.

When RDD5 is calculated, the computation likewise starts from the data cached in RDD3, without running from the beginning.
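Continuing the previous sketch, the same pipeline with RDD3 marked for persistence (again, the commands are illustrative rather than the original ones):

val rdd3 = rdd2.map(x => (x, 1))   // RDD3
rdd3.cache()                        // mark RDD3 for persistence (default level MEMORY_ONLY)

val rdd4 = rdd3.reduceByKey(_ + _)
val rdd5 = rdd3.sortByKey()

rdd4.collect()  // first action: computes RDD1 -> RDD2 -> RDD3, caches RDD3, then computes RDD4
rdd5.collect()  // second action: reads the cached RDD3 and only computes RDD5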

2. Storage level

(1) Parameters of the persistence method

Persistence is achieved with the RDD's persist() method; a StorageLevel object is passed to persist() to specify the storage level. Each persisted RDD can be stored at a different storage level; the default storage level is StorageLevel.MEMORY_ONLY.
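For example (a minimal sketch; the RDD and the chosen levels are only illustrative):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)
rdd.persist(StorageLevel.MEMORY_AND_DISK)      // keep in memory, spill overflow partitions to disk
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // store serialized to save memory
// rdd.persist()                               // no argument: defaults to StorageLevel.MEMORY_ONLY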

(2) Spark RDD storage level table

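The storage levels defined in org.apache.spark.storage.StorageLevel (as described in the Spark documentation) are:

MEMORY_ONLY: store the RDD as deserialized Java objects in JVM memory; partitions that do not fit are recomputed when needed (the default level).
MEMORY_AND_DISK: store as deserialized Java objects in memory; partitions that do not fit are written to disk and read from there when needed.
MEMORY_ONLY_SER: store as serialized Java objects (one byte array per partition); more space-efficient, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: like MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed.
DISK_ONLY: store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the corresponding level, but each partition is replicated on two cluster nodes.
OFF_HEAP (experimental): like MEMORY_ONLY_SER, but the data is stored in off-heap memory.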

(3) How to choose a storage level - balance memory usage and CPU efficiency

If the RDD fits in memory without overflow, prefer the default storage level (MEMORY_ONLY); it maximizes CPU efficiency and lets operations on the RDD run as fast as possible.
If the RDD does not fit entirely in memory, use MEMORY_ONLY_SER and choose a fast serialization library; serializing the objects saves space while access remains reasonably fast.
Unless computing the RDD is very expensive, or the RDD filters out a large amount of data, do not spill overflow data to disk, since recomputing a partition can be as fast as reading it from disk.
If you want to recover quickly from a server failure, use a replicated storage level such as MEMORY_ONLY_2 or MEMORY_AND_DISK_2. These levels allow tasks to keep running on the RDD after data loss, without waiting for lost partitions to be recomputed; with the other storage levels, lost partitions must be recomputed after data loss.

(4) View the source code of persist() and cache() methods

/**
 * Persist the RDD on the first action and set its storage level.
 * This method can only be used when the RDD has never had a storage level set.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // If this RDD was previously marked as a localCheckpoint, override the previous storage level
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/**
 * Persist the RDD with the default storage level (MEMORY_ONLY)
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
 * Persist the RDD with the default storage level (MEMORY_ONLY)
 */
def cache(): this.type = persist()

(5) Case demonstration: setting the storage level

Create a TestPersist object in the net.py.rdd package

package net.py.rdd

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel  // needed when passing an explicit storage level to persist()

object TestPersist {
  def main(args: Array[String]): Unit = {
    // Create the Spark configuration object
    val conf = new SparkConf()
    conf.setAppName("TestPersist")
      .setMaster("local")
      .set("spark.testing.memory", "2147480000")
    // Create the Spark context based on the configuration
    val sc = new SparkContext(conf)

    // Suppress Spark runtime log output
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("com").setLevel(Level.OFF)
    System.setProperty("spark.ui.showConsoleProgress", "false")
    Logger.getRootLogger().setLevel(Level.OFF)

    // Create the RDD
    val rdd: RDD[Int] = sc.parallelize(List(100, 200, 300, 400, 500))

    // Mark the RDD for persistence; the default storage level is StorageLevel.MEMORY_ONLY
    rdd.persist()
    // rdd.persist(StorageLevel.DISK_ONLY)        // persist to disk only
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)  // persist to memory, spilling overflow data to disk

    // On the first action, the RDD marked for persistence is actually persisted
    val result: String = rdd.collect().mkString(", ")
    println(result)

    // On the second action, the data is read directly from where it was persisted,
    // without recomputing it from scratch
    rdd.collect().foreach(println)
  }
}

3. Use Spark WebUI to view the cache

(1) Create an RDD and mark it as persistent

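A minimal sketch of the Spark Shell commands for this step (the sample data matches the earlier example; the exact commands may have differed):

val rdd = sc.parallelize(List(100, 200, 300, 400, 500))
rdd.persist()   // mark the RDD for persistence; nothing is cached until an action runs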

(2) Spark WebUI to view RDD storage information

Execute the command rdd.collect() to collect the RDD data.
Execute the following commands to create rdd2 and persist it to disk.
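A sketch of these commands, assuming a simple map transformation for rdd2 (the actual transformation may have differed):

import org.apache.spark.storage.StorageLevel

val rdd2 = rdd.map(_ * 2)             // hypothetical transformation
rdd2.persist(StorageLevel.DISK_ONLY)  // persist rdd2 to disk
rdd2.collect()                        // run an action so rdd2 appears under the WebUI "Storage" tab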

(3) Delete the RDD from the cache

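To remove an RDD from the cache, unpersist it (a minimal sketch):

rdd.unpersist()    // remove the persisted data; the RDD disappears from the WebUI Storage tab
rdd2.unpersist()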

Origin: blog.csdn.net/py20010218/article/details/125379579