Spark 2.3 RDD reduce: source code walkthrough

  • reduce source
    /**
     * Reduces the elements of this RDD using the specified commutative and
     * associative binary operator.
     */
    def reduce(f: (T, T) => T): T = withScope {
      val cleanF = sc.clean(f)
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF))
        } else {
          None
        }
      }
      var jobResult: Option[T] = None
      val mergeResult = (index: Int, taskResult: Option[T]) => {
        if (taskResult.isDefined) {
          jobResult = jobResult match {
            case Some(value) => Some(f(value, taskResult.get))
            case None => taskResult
          }
        }
      }
      sc.runJob(this, reducePartition, mergeResult)
      // Get the final result out of our Option, or throw an exception if the RDD was empty
      jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
  • Scala reduceLeft source
    /** Applies a binary operator to all elements of this $coll,
     *  going left to right.
     *  $willNotTerminateInf
     *  $orderDependentFold
     *
     *  @param  op    the binary operator.
     *  @tparam  B    the result type of the binary operator.
     *  @return  the result of inserting `op` between consecutive elements of this $coll,
     *           going left to right:
     *           {{{
     *             op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
     *           }}}
     *           where `x,,1,,, ..., x,,n,,` are the elements of this $coll.
     *  @throws UnsupportedOperationException if this $coll is empty.   */
    def reduceLeft[B >: A](op: (B, A) => B): B = {
      if (isEmpty)
        throw new UnsupportedOperationException("empty.reduceLeft")
    
      var first = true
      var acc: B = 0.asInstanceOf[B]
    
      for (x <- self) {
        if (first) {
          acc = x
          first = false
        }
        else acc = op(acc, x)
      }
      acc
    }
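  • reduceLeft: acc is declared with type B (0.asInstanceOf[B] is only a placeholder value). The first element
   is assigned to acc directly; for every later element x, op(acc, x) is evaluated and the result is written
   back to acc, so the operator is applied strictly from left to right.
  • reduce: reducePartition applies the user-supplied function inside each partition via reduceLeft and yields
   None for an empty partition. mergeResult defines how the per-partition results are combined into jobResult.
   runJob runs a job over all partitions of the RDD and passes each task's result to the handler function
   (mergeResult). Once the job finishes, jobResult holds the final result; if the RDD was empty, an
   UnsupportedOperationException("empty collection") is thrown.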
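
The two phases described above can be tried out without a cluster. The following is a minimal sketch, not Spark's implementation: the object `ReduceSketch` and the use of a plain `Seq[Seq[T]]` to stand in for an RDD's partitions are assumptions made for illustration, and the loop merges results in partition order, whereas in Spark the handler is called as tasks complete, which is why the operator is expected to be commutative as well as associative.

```scala
// Spark-free sketch of the two-phase logic in RDD.reduce.
// ReduceSketch and the Seq[Seq[T]] "partitions" are illustrative, not Spark APIs.
object ReduceSketch {

  // Phase 1: what each task does on its own partition.
  // An empty partition yields None instead of throwing from reduceLeft.
  def reducePartition[T](f: (T, T) => T)(iter: Iterator[T]): Option[T] =
    if (iter.hasNext) Some(iter.reduceLeft(f)) else None

  // Phase 2: combine the per-partition results, skipping empty partitions.
  // Simplification: results are merged in partition order here.
  def reduce[T](partitions: Seq[Seq[T]])(f: (T, T) => T): T = {
    var jobResult: Option[T] = None
    for (part <- partitions) {
      val taskResult = reducePartition(f)(part.iterator)
      if (taskResult.isDefined) {
        jobResult = jobResult match {
          case Some(value) => Some(f(value, taskResult.get))
          case None        => taskResult
        }
      }
    }
    // Same behaviour as the original: an entirely empty input is an error.
    jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))
    println(reduce(parts)(_ + _)) // prints 15
  }
}
```

Here the partitions reduce to Some(6), None and Some(9); the empty partition is skipped and the remaining two are merged into 15.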

Reposted from blog.csdn.net/dpnice/article/details/80054614