15. Reading RDD data from the cache

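1. If storageLevel is not NONE, the RDD has been marked for persistence, so iterator() first tries to read the partition from the cache and recomputes it only when the read misses.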

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // storageLevel is not NONE, so the RDD is marked as persisted: try the cache first
      getOrCompute(split, context)
    } else {
      computeOrReadCheckpoint(split, context)
    }
  }
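
The dispatch in iterator() can be illustrated without Spark. The sketch below is a simplified model and not Spark's implementation: MiniRdd, its cache map, and computeCount are hypothetical names introduced here, and a Boolean stands in for the check storageLevel != StorageLevel.NONE.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD; `persisted` models storageLevel != StorageLevel.NONE.
class MiniRdd(data: Map[Int, Seq[Int]], persisted: Boolean) {
  private val cache = mutable.Map.empty[Int, Seq[Int]] // models the block store
  var computeCount = 0                                  // how many times compute() ran

  private def compute(split: Int): Seq[Int] = {
    computeCount += 1
    data(split).map(_ * 2)
  }

  // Same shape as RDD.iterator(): go through the cache only when persistence is on.
  def iterator(split: Int): Iterator[Int] =
    if (persisted) cache.getOrElseUpdate(split, compute(split)).iterator
    else compute(split).iterator
}

object MiniRddDemo {
  def main(args: Array[String]): Unit = {
    val cached = new MiniRdd(Map(0 -> Seq(1, 2, 3)), persisted = true)
    cached.iterator(0).toList
    cached.iterator(0).toList
    println(s"persisted: compute ran ${cached.computeCount} time(s)")

    val plain = new MiniRdd(Map(0 -> Seq(1, 2, 3)), persisted = false)
    plain.iterator(0).toList
    plain.iterator(0).toList
    println(s"not persisted: compute ran ${plain.computeCount} time(s)")
  }
}
```

With persistence on, the second call is served from the map and compute runs once; without it, every call recomputes the partition.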

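2. Read from the cache first: the local BlockManager is tried before remote ones. Depending on the storage level, the data is loaded from memory or from disk.

On a hit, the cached result is returned.

On a miss, the partition is computed, persisted, and then returned.

The computation on a miss goes through computeOrReadCheckpoint, so if the RDD has been checkpointed, the data is loaded from the checkpoint instead of being recomputed.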

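2.1 If the data came straight from the cache, the task's input metrics (existingMetrics) are updated: bytes read is increased once, and records read is incremented by 1 for each record consumed. The data is wrapped in an InterruptibleIterator.

2.2 If the cache read missed but the partition was computed and stored in the cache successfully, the data is wrapped in an InterruptibleIterator directly.

2.3 If the computed data could not be stored in the cache, the returned iterator iter is wrapped in an InterruptibleIterator directly.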
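
The lookup order described in step 2 can be sketched as a chain of fallbacks. This is only a toy model under stated assumptions: LookupOrder, Source, and load are hypothetical names, plain Maps stand in for the local store, the remote store, and the checkpoint, and the step that writes the computed result back into the cache is left out.

```scala
// Hypothetical model of the lookup order; none of these names are Spark APIs.
object LookupOrder {
  sealed trait Source
  case object LocalCache extends Source
  case object RemoteCache extends Source
  case object Checkpoint extends Source
  case object Computed extends Source

  // Tries each source in order and reports where the data came from.
  def load(
      key: Int,
      local: Map[Int, Seq[Int]],
      remote: Map[Int, Seq[Int]],
      checkpoint: Map[Int, Seq[Int]],
      compute: Int => Seq[Int]): (Source, Seq[Int]) =
    local.get(key).map(d => (LocalCache: Source, d))
      .orElse(remote.get(key).map(d => (RemoteCache: Source, d)))
      .orElse(checkpoint.get(key).map(d => (Checkpoint: Source, d)))
      .getOrElse((Computed, compute(key)))

  def main(args: Array[String]): Unit = {
    val remote = Map(1 -> Seq(10, 20))
    println(load(1, Map.empty, remote, Map.empty, k => Seq(k)))    // served by the remote store
    println(load(2, Map.empty, remote, Map.empty, k => Seq(k, k))) // nothing cached: computed
  }
}
```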

  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    // Fetch the data via blockManager.getOrElseUpdate
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      // This anonymous function runs only when the block is not found in the cache
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context) // read from the checkpoint if one exists, otherwise compute
    }) match {
      case Left(blockResult) => // the block is available from the cache
        if (readCachedBlock) { // read directly from the cache, no computation happened
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1) // count one more record read
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]]) // computed and cached just now: wrap the data directly
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]]) // caching failed: wrap iter in an InterruptibleIterator
    }
  }
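
Two small techniques in getOrCompute are easy to miss: a flag flipped inside the fallback closure to detect a cache miss, and an iterator wrapper that counts records as they are consumed. The sketch below reproduces both in isolation. CountingIterator and FlagDemo.fetch are hypothetical helpers written for this illustration, not Spark classes, and an immutable Map stands in for the block manager.

```scala
// Hypothetical wrapper: counts records as they are consumed,
// in the spirit of the overridden next() in getOrCompute.
class CountingIterator[T](delegate: Iterator[T]) extends Iterator[T] {
  var recordsRead = 0L
  override def hasNext: Boolean = delegate.hasNext
  override def next(): T = {
    recordsRead += 1
    delegate.next()
  }
}

object FlagDemo {
  // Returns the data plus whether it came from the cache; `compute` runs only on a miss.
  def fetch(cache: Map[String, Seq[Int]], key: String)(compute: () => Seq[Int]): (Seq[Int], Boolean) = {
    var readCached = true
    val data = cache.getOrElse(key, {
      readCached = false // evaluated only when the key is absent (by-name default)
      compute()
    })
    (data, readCached)
  }

  def main(args: Array[String]): Unit = {
    val cache = Map("a" -> Seq(1, 2, 3))
    val (data, hit) = fetch(cache, "a")(() => Seq(9))
    val it = new CountingIterator(data.iterator)
    while (it.hasNext) it.next()
    println(s"hit=$hit recordsRead=${it.recordsRead}")
  }
}
```

The count grows lazily, one record at a time, which is why Spark places the increment inside next() rather than counting the block up front.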

 // BlockManager class
 /**
   * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
   * to compute the block, persist it, and return its values.
   *
   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
   *         could not be cached.
   */
  def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // Attempt to read the block from local or remote storage. If it's present, then we don't need
    // to go through the local-get-or-put path.
    // Try to read the block from local or remote storage
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block) // found, return it
      case _ =>
        // Need to compute the block.
    }
    // Initially we hold no locks on this block.
    // The block was not found, so compute it and try to store it
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>
        // doPut() didn't hand work back to us, so the block already existed or was successfully
        // stored. Therefore, we now hold a read lock on the block.
        // doPut() has stored the data in the local cache; read it back locally and return it, or throw if the read fails
        val blockResult = getLocalValues(blockId).getOrElse {
          // Since we held a read lock between the doPut() and get() calls, the block should not
          // have been evicted, so get() not returning the block indicates some internal error.
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        // We already hold a read lock on the block from the doPut() call and getLocalValues()
        // acquires the lock again, so we need to call releaseLock() here so that the net number
        // of lock acquisitions is 1 (since the caller will only call release() once).
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>
        // The block could not be persisted in the cache
        // The put failed, likely because the data was too large to fit in memory and could not be
        // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
        // that they can decide what to do with the values (e.g. process them without caching).
        Right(iter)
    }
  }
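
The Either contract of getOrElseUpdate (Left when the block is cached, Right when it could not be cached) can be modelled with a store that has a size limit. This is a minimal sketch under stated assumptions: TinyBlockStore and its capacity parameter are hypothetical, the limit counts elements rather than bytes, and locking, disk storage, and remote reads are ignored.

```scala
import scala.collection.mutable

// Hypothetical bounded store; `capacity` is the maximum number of elements per block.
class TinyBlockStore(capacity: Int) {
  private val blocks = mutable.Map.empty[String, Vector[Int]]

  // Left(values): the block was already cached or was cached just now.
  // Right(iterator): the block did not fit; the caller consumes the values without caching.
  def getOrElseUpdate(
      id: String,
      makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =
    blocks.get(id) match {
      case Some(values) => Left(values)
      case None =>
        val it = makeIterator()
        val buffer = Vector.newBuilder[Int]
        var n = 0
        while (it.hasNext && n < capacity) { buffer += it.next(); n += 1 }
        val unrolled = buffer.result()
        if (it.hasNext) Right(unrolled.iterator ++ it) // too large: hand everything back
        else { blocks(id) = unrolled; Left(unrolled) }
    }
}

object TinyBlockStoreDemo {
  def main(args: Array[String]): Unit = {
    val store = new TinyBlockStore(capacity = 3)
    println(store.getOrElseUpdate("small", () => Iterator(1, 2, 3)))
    println(store.getOrElseUpdate("big", () => Iterator(1, 2, 3, 4)).map(_.toList))
  }
}
```

Returning the iterator on failure lets the caller still use the computed values once, instead of losing the work that was already done.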
