Spark Core RDD (Part 2)

Introduction

This part covers three topics: I. the Spark programming interface (API); II. using RDDs to express data-parallel applications; III. an analysis of RDDs in the Spark source code.

I. The Spark programming interface

Preliminaries:

1. Scala: a statically typed, functional, object-oriented language that runs on the JVM. It is concise (which makes it especially suitable for interactive use) and efficient (thanks to static typing).

2. Driver and workers: the driver program defines one or more RDDs and invokes operations on them; workers are long-running processes that cache RDD partitions in memory as Java objects.

3. Closure: essentially a compact object that carries a single method.

4. When invoking an operation on an RDD, the user supplies parameters such as the closure passed to map (a closure in the functional-programming sense). Scala represents closures as Java objects; when a closure is passed as a parameter, the object is serialized, sent over the network, and loaded on other nodes.
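A minimal sketch of point 4, assuming an existing SparkContext named sc: the function literal passed to map captures the local variable factor, so Scala compiles it into a closure object that Spark serializes and ships to the worker nodes that run the tasks.

val factor = 10                              // captured by the closure below
val scaled = sc.parallelize(1 to 5)
  .map(x => x * factor)                      // this closure object is serialized and sent to workers
println(scaled.collect().mkString(", "))     // prints: 10, 20, 30, 40, 50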

 

Figure 1: the Spark runtime, with a driver program and long-running workers that cache RDD partitions in memory

 

II. Using RDDs to express data-parallel applications

1. Machine learning algorithms: some perform one map and one reduce operation per iteration (logistic regression, k-means); some alternate between two different map/reduce steps (EM); others, such as alternating least squares, implement matrix factorization for collaborative filtering. The logistic regression example below illustrates the first pattern.

val points = spark.textFile(...)
     .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
     val gradient = points.map{ p =>
          p.x * (1/(1+exp(-p.y*(w dot p.x)))-1)*p.y
     }.reduce((a,b) => a+b)
     w -= gradient
}

Note: this defines a cached RDD named points, obtained by applying a map transformation to a text file so that each line of text is parsed into a Point object; map and reduce are then executed repeatedly over points, with each iteration computing the gradient by summing a function of the current w.
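The example above assumes a Point class and a parsePoint helper along the following lines; these definitions are illustrative guesses, not part of the Spark API (breeze.linalg.DenseVector is used here only to make the vector operations concrete).

import breeze.linalg.DenseVector

case class Point(x: DenseVector[Double], y: Double)    // feature vector and a +1/-1 label

def parsePoint(line: String): Point = {
  val tokens = line.split(" ").map(_.toDouble)
  Point(DenseVector(tokens.init), tokens.last)          // all but the last field are features
}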

2. Expressing cluster programming models with RDDs (MapReduce, Pregel, Hadoop)

1. MapReduce

data.flatMap(myMap)
    .groupByKey()
    .map { case (k, vs) => myReduce(k, vs) }

If the job includes a combiner, the corresponding code is:

data.flatMap(myMap)
    .reduceByKey(myCombiner)
    .map { case (k, v) => myReduce(k, v) }

Note: reduceByKey performs partial aggregation on the mapper-side nodes, similar to the combiner in MapReduce.
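As a concrete, hedged instance of the pattern above, assume data is an RDD[String] of text lines; for word count, myMap emits (word, 1) pairs and addition serves as both the combiner and the reducer.

val counts = data
  .flatMap(line => line.split(" ").map(word => (word, 1)))   // myMap: emit (word, 1) pairs
  .reduceByKey(_ + _)                                        // partial sums on the map side, then final sums
  .map { case (word, count) => s"$word\t$count" }            // myReduce: format one record per word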

2. Pregel
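The original post leaves this section empty. As a rough sketch only (not the Bagel or GraphX API), one Pregel superstep can be expressed with RDD operations by grouping the pending messages by vertex id, joining them with the vertex states, and applying the user's vertex program; the Vertex, Message, and vprog names below are illustrative assumptions.

import org.apache.spark.rdd.RDD

case class Vertex(id: Long, value: Double)
case class Message(targetId: Long, value: Double)

def superstep(vertices: RDD[(Long, Vertex)],
              messages: RDD[(Long, Message)],
              vprog: (Vertex, Seq[Message]) => (Vertex, Seq[Message]))
    : (RDD[(Long, Vertex)], RDD[(Long, Message)]) = {
  // Group incoming messages by vertex id and join them with the vertex states.
  val joined = vertices.leftOuterJoin(messages.groupByKey())
  // Run the vertex program on each vertex, producing a new state plus outgoing messages.
  val results = joined.mapValues { case (v, msgs) =>
    vprog(v, msgs.map(_.toSeq).getOrElse(Seq.empty))
  }.cache()
  val newVertices = results.mapValues(_._1)
  val newMessages = results.values.flatMap(_._2).map(m => (m.targetId, m))
  (newVertices, newMessages)
}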

3. Hadoop
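This section is also empty in the original post. As a minimal, hedged sketch, data stored in HDFS can be read through the Hadoop InputFormat API (sc.textFile is a shorthand for this path; the HDFS URL below is a placeholder).

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs://namenode:8020/data/input.txt")
  .map { case (_, text) => text.toString }   // drop the byte-offset key, keep the line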

III. Analysis of RDDs in the Spark source code

Table 1. Internal interface used to represent RDDs in Spark

Operation                   Meaning
partitions()                Returns a list of Partition objects, the atomic pieces of the dataset
preferredLocations(p)       Lists the nodes where partition p can be accessed faster, based on where its data is stored
dependencies()              Returns a list of dependencies, which describe the RDD's lineage
iterator(p, parentIters)    Computes the elements of partition p, given iterators over its parent partitions
partitioner()               Returns metadata indicating whether the RDD is hash- or range-partitioned
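In the Spark code base this interface appears on the abstract RDD class under slightly different names (getPartitions, compute, getPreferredLocations, getDependencies, partitioner). A minimal custom RDD sketch written against those methods, assuming an existing SparkContext sc, might look like this:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A synthetic RDD with no parent: each partition yields its own range of integers.
class RangeNumbersRDD(sc: SparkContext, numPartitions: Int, perPartition: Int)
  extends RDD[Int](sc, Nil) {                        // Nil: no dependencies, so no lineage to track

  // partitions(): the atomic pieces of the dataset
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numPartitions)(i => new Partition { override def index: Int = i })

  // iterator()/compute(): produce the elements of one partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val start = split.index * perPartition
    (start until start + perPartition).iterator
  }

  // preferredLocations(): no locality information for synthetic data
  override protected def getPreferredLocations(split: Partition): Seq[String] = Nil
}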

1. The abstract class Dependency:
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

Note: Dependency has two subclasses: NarrowDependency, for narrow dependencies, and ShuffleDependency, for wide dependencies.

2. The abstract class NarrowDependency:

abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

Note: NarrowDependency declares the abstract getParents method and overrides the rdd method. It has two subclasses: OneToOneDependency and RangeDependency.

NarrowDependency allows all parent partitions to be computed in a pipelined fashion on a single cluster node.

NarrowDependency also makes recovery from node failures more efficient: only the parent partitions of the lost RDD partitions need to be recomputed.

2.1. OneToOneDependency:

class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

Note: getParents is trivial: it takes a partitionId and returns it wrapped in a List, i.e. the child partition reads the data of the partition with the same partitionId in the parent RDD.
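A quick way to observe this, assuming an existing SparkContext sc: map is a narrow, one-to-one transformation, so the child's single dependency is a OneToOneDependency and partition i of the child reads partition i of the parent.

import org.apache.spark.OneToOneDependency

val parent = sc.parallelize(1 to 100, 4)
val child  = parent.map(_ * 2)

val dep = child.dependencies.head.asInstanceOf[OneToOneDependency[Int]]
println(dep.getParents(2))   // List(2)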

2.2. RangeDependency:

/**
 * Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
 * @param inStart the start of the range in the parent RDD
 * @param outStart the start of the range in the child RDD
 * @param length the length of the range
 */
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    // Child partitions [outStart, outStart + length) map to parent partitions starting at inStart.
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}

Note: the parent RDD's partitions starting at inStart generate, one for one, the child RDD's partitions starting at outStart, so the parent of a child partition is computed as partitionId - outStart + inStart.
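union is the standard example of RangeDependency: the child RDD's partitions are the parents' partitions laid end to end. Assuming an existing SparkContext sc:

import org.apache.spark.RangeDependency

val a = sc.parallelize(1 to 30, 3)   // becomes union partitions 0..2
val b = sc.parallelize(1 to 20, 2)   // becomes union partitions 3..4
val u = a.union(b)

// Dependency on b: inStart = 0, outStart = 3, length = 2
val depOnB = u.dependencies(1).asInstanceOf[RangeDependency[Int]]
println(depOnB.getParents(4))        // List(1): 4 - 3 + 0 = 1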

3. ShuffleDependency:

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)
  // Obtain a new shuffle id for this dependency
  val shuffleId: Int = _rdd.context.newShuffleId()
  // Register the shuffle with the ShuffleManager
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}

Note: ShuffleDependency requires all parent-partition data to be computed first and then shuffled between nodes.

With ShuffleDependency, the failure of a single node can cause partitions to be lost across all of the RDD's ancestors, forcing a complete recomputation.

Because the lineage chains behind a ShuffleDependency tend to be long, a checkpoint mechanism helps: persisting the RDD data behind the ShuffleDependency to stable storage lets a failed node reload it instead of recomputing it. However, CPU computation is usually faster than reading from disk, so this is a trade-off that depends on the actual workload.
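A quick way to see a wide dependency appear, assuming an existing SparkContext sc: reduceByKey on an RDD without a matching partitioner produces a shuffled RDD whose single dependency is a ShuffleDependency.

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
val reduced = pairs.reduceByKey(_ + _)

println(reduced.dependencies.head)   // org.apache.spark.ShuffleDependency@...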

1. writePartitionToCheckpointFile (in ReliableCheckpointRDD): writes the data of one RDD partition to one checkpoint file.

 

  def writePartitionToCheckpointFile[T: ClassTag](
      path: String,
      broadcastedConf: Broadcast[SerializableConfiguration],
      blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
    val env = SparkEnv.get
    // Output directory for the checkpoint files
    val outputDir = new Path(path)
    val fs = outputDir.getFileSystem(broadcastedConf.value.value)

    // Generate the checkpoint file name from the partition id
    val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
    // Final output path: directory + file name
    val finalOutputPath = new Path(outputDir, finalOutputName)
    // Temporary output path for this write attempt
    val tempOutputPath =
      new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

    if (fs.exists(tempOutputPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
    }
    // Buffer size for the output stream (spark.buffer.size, default 65536 bytes)
    val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
    // Open the file output stream (an explicit block size is only used for testing)
    val fileOutputStream = if (blockSize < 0) {
      fs.create(tempOutputPath, false, bufferSize)
    } else {
      // This is mainly for testing purpose
      fs.create(tempOutputPath, false, bufferSize,
        fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
    }
    // Wrap the file output stream in a serialization stream
    val serializer = env.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      // Write all elements of the partition
      serializeStream.writeAll(iterator)
    } {
      serializeStream.close()
    }

    if (!fs.rename(tempOutputPath, finalOutputPath)) {
      if (!fs.exists(finalOutputPath)) {
        logInfo(s"Deleting tempOutputPath $tempOutputPath")
        fs.delete(tempOutputPath, false)
        throw new IOException("Checkpoint failed: failed to save output of task: " +
          s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
      } else {
        // Some other copy of this task must've finished before us and renamed it
        logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
        if (!fs.delete(tempOutputPath, false)) {
          logWarning(s"Error deleting ${tempOutputPath}")
        }
      }
    }
  }

 

2. writeRDDToCheckpointDirectory: writes an RDD out to multiple checkpoint files (one per partition) and returns a ReliableCheckpointRDD that represents it.

 

 

 def writeRDDToCheckpointDirectory[T: ClassTag](
      originalRDD: RDD[T],
      checkpointDir: String,
      blockSize: Int = -1): ReliableCheckpointRDD[T] = {

    val sc = originalRDD.sparkContext

    // Create the output directory for the checkpoint files
    val checkpointDirPath = new Path(checkpointDir)
    val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
    if (!fs.mkdirs(checkpointDirPath)) {
      throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
    }

    // Save the files, then reload them as an RDD
    val broadcastedConf = sc.broadcast(
      new SerializableConfiguration(sc.hadoopConfiguration))
    sc.runJob(originalRDD,
      writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

    if (originalRDD.partitioner.nonEmpty) {
      writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
    }

    val newRDD = new ReliableCheckpointRDD[T](
      sc, checkpointDirPath.toString, originalRDD.partitioner)
    if (newRDD.partitions.length != originalRDD.partitions.length) {
      throw new SparkException(
        s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
          s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
    }
    newRDD
  }
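For context, a sketch of how these helpers are reached from user code (the checkpoint directory below is a placeholder): marking an RDD for checkpointing does nothing by itself; the first action afterwards triggers ReliableCheckpointRDD.writeRDDToCheckpointDirectory, which runs writePartitionToCheckpointFile as a job over all partitions.

sc.setCheckpointDir("hdfs://namenode:8020/checkpoints")   // placeholder path
val data = sc.parallelize(1 to 1000, 4).map(_ * 2)
data.checkpoint()    // only marks the RDD for checkpointing
data.count()         // running an action materializes the checkpoint files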
