Spark: Caching Upstream RDDs

In Spark Streaming, an upstream RDD is "cached" in two distinct senses.

When an input stream is created through the StreamingContext, an InputDStream is registered with the DStreamGraph.

When a transformation such as map or flatMap is applied to that stream, a MappedDStream or FlatMappedDStream is created on top of the original InputDStream.

DStream, the superclass of all streams, implements the map() method.

def map[U: ClassTag](mapFunc: T => U): DStream[U] = ssc.withScope {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}

class MappedDStream[T: ClassTag, U: ClassTag] (
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }
}

As the constructor of MappedDStream shows, the DStream on which map() was called is stored as parent and is also listed in dependencies. When compute() is called on the MappedDStream, it first calls getOrCompute() on its parent.
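
The parent chain can be illustrated with a small stand-alone sketch. This is not Spark code: LineageStream, LineageInput and LineageMapped are hypothetical names, an RDD is modeled as a plain Vector, and the batch time is a Long. It only shows the shape of the delegation, where a mapped stream keeps its parent and asks it for data before applying its own function.

```scala
// Toy model of the parent/compute chain; these are NOT Spark's real classes.
// An "RDD" is modeled as a plain Vector and a batch time as a Long.
abstract class LineageStream[T] {
  def dependencies: List[LineageStream[_]]
  def compute(time: Long): Option[Vector[T]]
}

// Hypothetical source stream: emits a fixed batch shifted by the batch time.
class LineageInput extends LineageStream[Int] {
  override def dependencies: List[LineageStream[_]] = Nil
  override def compute(time: Long): Option[Vector[Int]] =
    Some(Vector(1, 2, 3).map(_ + time.toInt))
}

// Same shape as MappedDStream: remember the parent, ask it first, then map.
class LineageMapped[T, U](parent: LineageStream[T], f: T => U) extends LineageStream[U] {
  override def dependencies: List[LineageStream[_]] = List(parent)
  override def compute(time: Long): Option[Vector[U]] =
    parent.compute(time).map(_.map(f))
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val input = new LineageInput
    val doubled = new LineageMapped[Int, Int](input, _ * 2)
    println(doubled.compute(0L))               // Some(Vector(2, 4, 6))
    println(doubled.dependencies.head eq input) // true
  }
}
```

The sketch calls parent.compute() directly for brevity; the real MappedDStream goes through getOrCompute(), which is where the first kind of caching comes in.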

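Back to the DStream implementation: as the processing pipeline is extended stage by stage and finally reaches an output operation, the resulting stream registers itself with the DStreamGraph as an output stream.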

private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}

For example, print() creates a ForEachDStream and calls its register() method to register it with the DStreamGraph as an output stream.

When the ForEachDStream is created, it also records the upstream DStream as its parent.

Now consider generateJobs(), the method DStreamGraph uses to generate jobs.

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}

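generateJobs() simply iterates over all registered output streams and calls generateJob() on each of them. Starting from the output stream, every DStream calls getOrCompute() on its parent, which in turn invokes compute(), level by level up the chain, until the original InputDStream actually defines the RDD.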
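
As a rough illustration of this walk, here is a toy graph in plain Scala. All names (GraphStream, GraphInput, GraphMapped, GraphOutput, ToyGraph) are made up for the example, a job is modeled as a simple thunk, and the log buffer exists only to make the call order visible. It assumes nothing about Spark beyond what the excerpt above shows.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model only: a "job" is a thunk, a batch is a Vector, time is a Long.
abstract class GraphStream[T] {
  def compute(time: Long): Option[Vector[T]]
}

class GraphInput(log: ArrayBuffer[String]) extends GraphStream[Int] {
  override def compute(time: Long): Option[Vector[Int]] = {
    log += s"input@$time"
    Some(Vector(1, 2, 3))
  }
}

class GraphMapped[T, U](parent: GraphStream[T], f: T => U, log: ArrayBuffer[String])
    extends GraphStream[U] {
  override def compute(time: Long): Option[Vector[U]] = {
    log += s"map@$time"
    parent.compute(time).map(_.map(f))
  }
}

// Plays the role of an output stream: turns the upstream batch into a job.
class GraphOutput[T](parent: GraphStream[T], sink: Vector[T] => Unit) {
  def generateJob(time: Long): Option[() => Unit] =
    parent.compute(time).map(batch => () => sink(batch))
}

class ToyGraph {
  private val outputStreams = ArrayBuffer.empty[GraphOutput[_]]

  def addOutputStream(out: GraphOutput[_]): Unit = synchronized { outputStreams += out }

  // One job per output stream that produced a batch for this time.
  def generateJobs(time: Long): Seq[() => Unit] = synchronized {
    outputStreams.flatMap(_.generateJob(time)).toList
  }
}

object GenerateJobsDemo {
  def main(args: Array[String]): Unit = {
    val log = ArrayBuffer.empty[String]
    val results = ArrayBuffer.empty[Int]
    val mapped = new GraphMapped[Int, Int](new GraphInput(log), _ + 1, log)
    val graph = new ToyGraph
    graph.addOutputStream(new GraphOutput[Int](mapped, batch => { results ++= batch; () }))
    graph.generateJobs(0L).foreach(job => job())
    println(log.mkString(", "))     // map@0, input@0
    println(results.mkString(", ")) // 2, 3, 4
  }
}
```

The log order shows that the request starts at the output end and reaches the input stream last, while the data itself is only consumed when the generated job is run.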

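During this process, each DStream maintains a generatedRDDs map. Once compute() has been called on an upstream DStream and has produced the RDD for a given batch time, that RDD is stored in the map, and later requests for the same time return it directly.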
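
The effect of this per-time memoization is easiest to see when two downstream streams share one upstream. The following is a minimal sketch under simplifying assumptions, not Spark's implementation: MemoStream, MemoInput and MemoMapped are hypothetical names, the generated map stands in for generatedRDDs, and computeCalls is a counter added purely for the demonstration.

```scala
import scala.collection.mutable

// Toy model only: counts how often compute() really runs for each stream.
abstract class MemoStream[T] {
  // Stands in for generatedRDDs: at most one cached batch per batch time.
  private val generated = mutable.HashMap.empty[Long, Vector[T]]
  var computeCalls = 0

  protected def compute(time: Long): Option[Vector[T]]

  // Return the batch already generated for this time, or compute and remember it.
  final def getOrCompute(time: Long): Option[Vector[T]] =
    generated.get(time).orElse {
      computeCalls += 1
      val fresh = compute(time)
      fresh.foreach(batch => generated.put(time, batch))
      fresh
    }
}

class MemoInput extends MemoStream[Int] {
  override protected def compute(time: Long): Option[Vector[Int]] = Some(Vector(1, 2, 3))
}

class MemoMapped[T, U](parent: MemoStream[T], f: T => U) extends MemoStream[U] {
  override protected def compute(time: Long): Option[Vector[U]] =
    parent.getOrCompute(time).map(_.map(f))
}

object MemoDemo {
  def main(args: Array[String]): Unit = {
    val input = new MemoInput
    val plusOne = new MemoMapped[Int, Int](input, _ + 1)
    val doubled = new MemoMapped[Int, Int](input, _ * 2)
    plusOne.getOrCompute(0L)
    doubled.getOrCompute(0L)    // reuses the input batch generated for time 0
    println(input.computeCalls) // 1
  }
}
```

Both mapped streams ask the same input for time 0, but the input defines its batch only once; a request for a new batch time would trigger a fresh compute().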

That is the first kind of caching: caching the RDD at the point where it is defined in the DStream.

The second kind is the caching of intermediate results when the RDD is actually computed on an executor.

While defining a stream, you can call cache() explicitly.

def persist(level: StorageLevel): DStream[T] = {
  if (this.isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of a DStream after streaming context has started")
  }
  this.storageLevel = level
  this
}

/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def persist(): DStream[T] = persist(StorageLevel.MEMORY_ONLY_SER)

/** Persist RDDs of this DStream with the default storage level (MEMORY_ONLY_SER) */
def cache(): DStream[T] = persist()

In fact, cache() merely changes the storage level recorded on the DStream.

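The actual caching step happens in the DStream's getOrCompute() method, which wraps the compute() call described above. There, the DStream's storage level is checked: if it is not NONE, as is the case after cache() has been called, persist() is called on the newly generated RDD.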
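
The split between "record the level now" and "apply it when a batch is generated" can be sketched as follows. This is an illustrative toy, not Spark code: ToyBatch and PersistingStream are hypothetical names, storage levels are plain strings, and initialize() is a stand-in for the streaming context being started.

```scala
// Toy model only: storage levels are plain strings and a batch is a tiny class.
class ToyBatch(val data: Vector[Int]) {
  var storageLevel: String = "NONE"
  def persist(level: String): this.type = { storageLevel = level; this }
}

class PersistingStream {
  private var storageLevel = "NONE"
  private var initialized = false

  // Only records the level; refuses changes once the stream has been started.
  def persist(level: String): this.type = {
    if (initialized) {
      throw new UnsupportedOperationException("cannot change storage level after start")
    }
    storageLevel = level
    this
  }

  def cache(): this.type = persist("MEMORY_ONLY_SER")

  // Hypothetical stand-in for the streaming context starting the stream.
  def initialize(): Unit = { initialized = true }

  // The level recorded by cache() only takes effect when a batch is generated.
  def getOrCompute(time: Long): ToyBatch = {
    val batch = new ToyBatch(Vector(1, 2, 3))
    if (storageLevel != "NONE") batch.persist(storageLevel)
    batch
  }
}

object PersistDemo {
  def main(args: Array[String]): Unit = {
    println(new PersistingStream().getOrCompute(0L).storageLevel)         // NONE
    println(new PersistingStream().cache().getOrCompute(0L).storageLevel) // MEMORY_ONLY_SER
  }
}
```

Calling cache() on the stream does no caching by itself; every batch generated afterwards is marked with the recorded level.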

private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

Here the RDD is registered with the SparkContext as a persistent RDD, which brings it under cache management.

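Now return to the point where the job is still inside the DAGScheduler. When a stage is about to be split into tasks for the executors, getPreferredLocs() is called to find the most suitable executor for each task. As part of this, the scheduler looks up where the RDD's blocks are cached: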

if (!cacheLocs.contains(rdd.id)) {
  // Note: if the storage level is NONE, we don't need to get locations from block manager.
  val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
    IndexedSeq.fill(rdd.partitions.length)(Nil)
  } else {
    val blockIds =
      rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
    blockManagerMaster.getLocations(blockIds).map { bms =>
      bms.map(bm => TaskLocation(bm.host, bm.executorId))
    }
  }
  cacheLocs(rdd.id) = locs
}

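In this method, if the RDD is linked to its upstream through narrow dependencies, the RDD id and partition index are used to build the corresponding RDDBlockId, the block managers are queried for the locations that hold the cached upstream RDD, and those locations are preferred as scheduling targets.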
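
The lookup logic can be approximated with a small stand-alone sketch. It is a simplification built on assumptions: ToyLocator is a hypothetical class, a location is just a host name, the "block manager master" is an immutable map keyed by (rddId, partitionIndex), and masterLookups is a counter added for the demonstration.

```scala
import scala.collection.mutable

// Toy model only: a location is a host name, and the "block manager master"
// is a plain map from (rddId, partitionIndex) to the hosts holding that block.
class ToyLocator(blockLocations: Map[(Int, Int), Seq[String]]) {
  // Plays the role of cacheLocs: one lookup result per RDD id.
  private val cacheLocs = mutable.HashMap.empty[Int, IndexedSeq[Seq[String]]]
  var masterLookups = 0

  def getCacheLocs(rddId: Int, numPartitions: Int, persisted: Boolean): IndexedSeq[Seq[String]] =
    cacheLocs.getOrElseUpdate(rddId, {
      if (!persisted) {
        // Nothing is cached, so there is no need to ask the master at all.
        IndexedSeq.fill(numPartitions)(Nil)
      } else {
        masterLookups += 1
        (0 until numPartitions).map(i => blockLocations.getOrElse((rddId, i), Nil))
      }
    })
}

object LocatorDemo {
  def main(args: Array[String]): Unit = {
    val locator = new ToyLocator(Map((7, 0) -> Seq("host-a"), (7, 1) -> Seq("host-b")))
    println(locator.getCacheLocs(7, 2, persisted = true))
    println(locator.getCacheLocs(8, 2, persisted = false))
  }
}
```

A partition with a non-empty location list would be scheduled preferentially on one of those hosts; an empty list means the scheduler has to fall back to other placement hints.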

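When an RDD is actually processed on an executor, iterator() is called to return the data of the given partition of that RDD. For an RDD with a storage level set, iterator() calls getOrCompute().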

private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}

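Here the executor first tries to fetch the cached data for this RDD partition from the BlockManager, so that the data does not have to be computed from scratch again. If no cached data is found, the RDD partition is computed from the beginning and then cached, so that downstream operators can read it from the cache. If the scheduling step described above placed the task directly on the executor that holds the cached upstream RDD, the data can be read straight from the local cache, which makes reuse of the RDD's results efficient.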
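
The read-through behaviour can be mimicked in a few lines of plain Scala. This is a sketch under stated assumptions rather than the real BlockManager: ToyBlockStore and ToyRdd are hypothetical, blocks are keyed by (rddId, partitionIndex), data is a Vector of Int, and the Boolean in the result stands in for the readCachedBlock flag.

```scala
import scala.collection.mutable

// Toy model only: blocks are keyed by (rddId, partitionIndex) and hold a Vector.
class ToyBlockStore {
  private val blocks = mutable.HashMap.empty[(Int, Int), Vector[Int]]

  // Returns the cached block if present; otherwise computes, stores and returns it.
  // The Boolean reports whether the data came from the cache.
  def getOrElseUpdate(key: (Int, Int), compute: () => Iterator[Int]): (Iterator[Int], Boolean) =
    blocks.get(key) match {
      case Some(cached) => (cached.iterator, true)
      case None =>
        val materialized = compute().toVector
        blocks.put(key, materialized)
        (materialized.iterator, false)
    }
}

class ToyRdd(val id: Int, store: ToyBlockStore, persisted: Boolean) {
  var computeCalls = 0

  private def compute(partition: Int): Iterator[Int] = {
    computeCalls += 1
    Iterator(partition, partition + 1)
  }

  // Like iterator(): go through the store only when a storage level is set.
  def iterator(partition: Int): Iterator[Int] =
    if (persisted) store.getOrElseUpdate((id, partition), () => compute(partition))._1
    else compute(partition)
}

object BlockStoreDemo {
  def main(args: Array[String]): Unit = {
    val rdd = new ToyRdd(1, new ToyBlockStore, persisted = true)
    rdd.iterator(0).toList
    rdd.iterator(0).toList    // served from the store
    println(rdd.computeCalls) // 1
  }
}
```

With persisted set to false, the same two reads would run compute() twice, which is exactly the recomputation that caching the upstream RDD avoids.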

tydhot
