Spark: Serialization, Dependencies, and Persistence

Serialization

closure check

Serialization Methods and Properties

dependencies 

RDD lineage

RDD narrow dependencies

RDD wide dependencies

RDD task division

RDD persistence

RDD Cache

RDD Checkpoint

Difference between Cache and Checkpoint


Serialization

closure check

        From a computation point of view, code outside operators runs on the Driver side, while code inside operators runs on the Executor side. In Scala functional programming, data defined outside an operator is often used inside the operator, which forms a closure. If that outside data cannot be serialized, it cannot be passed to the Executor side for execution and an error occurs. Therefore, before a task is computed, Spark checks whether the objects in the closure can be serialized. This operation is called closure checking. (Note that the way closures are compiled changed after Scala 2.12.)

Serialization Methods and Properties

        From a computation point of view, code outside operators runs on the Driver side and code inside operators runs on the Executor side, so any object referenced across that boundary, whether through a method or a class property, must be serializable, as the following example shows.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object spark_02 {
  def main(args: Array[String]): Unit = {
    // Prepare the environment
    // "*" means use all available CPU cores; the application name is "RDD"
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("RDD")
    val sc = new SparkContext(sparkConf)
    val rdd: RDD[String] = sc.makeRDD(Array("hello world", "hello spark", "hive", "atguigu"))

    // Create the query object
    val search = new Search("h")
    // Method passing: without `extends Serializable` on Search this would print
    // ERROR Task not serializable
    search.getMatch1(rdd).collect().foreach(println)

    println("========================================")

    // Property passing: likewise requires Search to be serializable
    search.getMatch2(rdd).collect().foreach(println)

    // Shut down the environment
    sc.stop()
  }
}

// Query object
// A class's constructor parameters are class properties, so they take part in the
// closure check (the whole class is closure-checked); this is why Search must
// extend Serializable
class Search(query: String) extends Serializable {
  def isMatch(s: String): Boolean = {
    s.contains(query)
  }
  // Method serialization case: filter references the method isMatch, so `this` is serialized
  def getMatch1(rdd: RDD[String]): RDD[String] = {
    rdd.filter(isMatch)
  }
  // Property serialization case: the lambda references the property query, so `this` is serialized
  def getMatch2(rdd: RDD[String]): RDD[String] = {
    rdd.filter(x => x.contains(query))
  }
}

dependencies 

        The relationship between two adjacent RDDs is called a dependency.
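
As a minimal sketch (assuming an existing SparkContext named sc), a child RDD exposes its dependencies on its parent through the dependencies field:

val words: RDD[String] = sc.makeRDD(List("hello spark", "hello scala"))
val wordToOne: RDD[(String, Int)] = words.map((_, 1))
// map() only depends on its direct parent, so exactly one dependency is listed
println(wordToOne.dependencies)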

RDD lineage

        The chain of dependencies across multiple consecutive RDDs is called lineage (blood relationship).

        RDDs only support coarse-grained transformations, i.e. a single operation applied to a large number of records. Spark records the series of transformations that created an RDD as its Lineage so that lost partitions can be restored. An RDD's Lineage records the RDD's metadata and transformation behavior; when some of the RDD's partition data is lost, Spark can recompute and restore the lost partitions from this information.

val fileRDD: RDD[String] = sc.textFile("input/1.txt")
println(fileRDD.toDebugString) // print the lineage
println("----------------------")
val wordRDD: RDD[String] = fileRDD.flatMap(_.split(" "))
println(wordRDD.toDebugString)
println("----------------------")
val mapRDD: RDD[(String, Int)] = wordRDD.map((_,1))
println(mapRDD.toDebugString)
println("----------------------")
val resultRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_+_)
println(resultRDD.toDebugString)
resultRDD.collect()

RDD narrow dependencies

        Narrow dependency means that each Partition of the parent (upstream) RDD is used by at most one Partition of the child (downstream) RDD. A narrow dependency can be pictured as an only child.
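
As a minimal sketch (assuming an existing SparkContext named sc), a one-to-one operator such as map produces a narrow dependency, which shows up as a OneToOneDependency:

val parent: RDD[Int] = sc.makeRDD(1 to 4, 2)
val child: RDD[Int] = parent.map(_ * 2)
// map keeps the partitioning, so each child partition uses exactly one parent partition
println(child.dependencies) // List(org.apache.spark.OneToOneDependency@...)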

RDD wide dependencies

        Wide dependency means that a Partition of the parent (upstream) RDD is depended on by the Partitions of multiple child (downstream) RDDs, which causes a Shuffle. In summary: if a narrow dependency is an only child, a wide dependency is like multiple births.
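
As a minimal sketch (assuming an existing SparkContext named sc), a shuffle operator such as reduceByKey produces a wide dependency, which shows up as a ShuffleDependency:

val pairs: RDD[(String, Int)] = sc.makeRDD(List(("a", 1), ("b", 1), ("a", 1)), 2)
val reduced: RDD[(String, Int)] = pairs.reduceByKey(_ + _)
// reduceByKey regroups data by key across partitions, so a shuffle is required
println(reduced.dependencies) // List(org.apache.spark.ShuffleDependency@...)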

RDD task division

        RDD tasks are divided into Application, Job, Stage and Task:

  • Application: initializing a SparkContext generates an Application;
  • Job: each Action operator generates a Job;
  • Stage: the number of Stages equals the number of ShuffleDependencies plus 1;
  • Task: within a Stage, the number of partitions of the last RDD is the number of Tasks.

Note: each layer of Application -> Job -> Stage -> Task is a 1-to-n relationship, as the sketch below illustrates.
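
As a minimal sketch (assuming an existing SparkContext named sc and an input file input/1.txt), a word count gives concrete numbers: one Action (collect) means one Job; one ShuffleDependency (from reduceByKey) means 1 + 1 = 2 Stages; and reading the file with 2 partitions means the final Stage runs 2 Tasks, which can be checked in the Spark Web UI:

val result = sc.textFile("input/1.txt", 2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// One Action -> one Job; one shuffle -> two Stages; two partitions -> two Tasks per Stage
result.collect()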

RDD persistence

RDD Cache

        An RDD can cache the result of a previous computation via the cache or persist method. By default the data is cached in JVM heap memory. However, the data is not cached at the moment these methods are called; only when a subsequent Action operator is triggered is the RDD cached in the memory of the computing nodes, where it can be reused later.

// The cache operation adds to the lineage; it does not change the existing lineage
println(wordToOneRdd.toDebugString)
// Cache the data
wordToOneRdd.cache()
// The storage level can also be changed explicitly
//mapRdd.persist(StorageLevel.MEMORY_AND_DISK_2)

        The cache may be lost, or data stored in memory may be evicted because memory is insufficient. The RDD cache fault-tolerance mechanism guarantees that the computation still runs correctly even if the cache is lost: using the chain of transformations that produced the RDD, the lost data is recomputed. Because each Partition of an RDD is relatively independent, only the missing partitions need to be recomputed, not all of them. Spark automatically persists the intermediate data of some Shuffle operations (for example reduceByKey), so that the entire input does not have to be recomputed when a node fails during a Shuffle. In practice, however, if you want to reuse data it is still recommended to call persist or cache explicitly.
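
As a minimal sketch (reusing the wordToOneRdd from the snippet above), an explicit storage level can be chosen when the data is reused across several Actions:

import org.apache.spark.storage.StorageLevel

wordToOneRdd.persist(StorageLevel.MEMORY_AND_DISK)
wordToOneRdd.count()                    // the first Action materializes the cache
wordToOneRdd.collect().foreach(println) // later Actions reuse the cached data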

RDD Checkpoint

        A checkpoint writes an RDD's intermediate result to disk. When the lineage becomes very long, the cost of fault tolerance through recomputation becomes too high, so it is better to checkpoint at an intermediate stage: if a node fails after the checkpoint, the lineage can be replayed from the checkpoint instead of from the beginning, reducing the overhead. The checkpoint operation on an RDD is not executed immediately; an Action operation must be executed to trigger it.

// Set the checkpoint directory
sc.setCheckpointDir("./checkpoint1")
// Create an RDD that reads the file at the given path: hello atguigu atguigu
val lineRdd: RDD[String] = sc.textFile("input/1.txt")
// Business logic
val wordRdd: RDD[String] = lineRdd.flatMap(line => line.split(" "))
val wordToOneRdd: RDD[(String, Long)] = wordRdd.map {
  word => (word, System.currentTimeMillis())
}
// Cache first, to avoid running an extra job just for the checkpoint
wordToOneRdd.cache()
// Data checkpoint: checkpoint the computation of wordToOneRdd
wordToOneRdd.checkpoint()
// Trigger execution
wordToOneRdd.collect().foreach(println)

Difference between Cache and Checkpoint

  1. Cache only saves the data and does not cut the lineage dependency; Checkpoint cuts the lineage dependency (see the sketch below).
  2. Data cached by Cache is usually stored on local disk or in memory, with relatively low reliability; Checkpoint data is usually stored in a fault-tolerant, highly available file system such as HDFS, with high reliability.
  3. It is recommended to Cache the RDD that will be checkpointed, so that the checkpoint job only needs to read the data from the cache; otherwise the RDD has to be computed from scratch.
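
As a minimal sketch (assuming an existing SparkContext named sc with a checkpoint directory already set), toDebugString shows that checkpoint truncates the lineage once an Action has materialized it, while cache only adds to it:

val rdd: RDD[(String, Int)] = sc.makeRDD(List("hello spark"), 1).map((_, 1))
rdd.cache()
rdd.checkpoint()
println(rdd.toDebugString) // full lineage: the checkpoint has not been written yet
rdd.collect()
println(rdd.toDebugString) // lineage now starts from the checkpoint data on disk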
