Understanding the Spark RDD fold() and aggregate() operators

1. fold()

Signature: fold(self, zeroValue, op)

Example: sum the elements of the sequence [1, 2, 3, 4, 5]

>>> nums = sc.parallelize([1, 2, 3, 4, 5])

>>> sumCnt = nums.fold(0, lambda x, y: x + y)
>>> print(sumCnt)
15

What zeroValue does: (1) it supplies the initial value of the accumulator; (2) the accumulator then carries the intermediate result from step to step.

Step-by-step trace of the accumulation (assuming a single partition; acc is the accumulator, initialized to zeroValue):

1. [1, 2, 3, 4, 5], acc = zeroValue = 0

2. currentVal = 1, acc = 0, new acc = 1

3. currentVal = 2, acc = 1, new acc = 3

4. currentVal = 3, acc = 3, new acc = 6

5. currentVal = 4, acc = 6, new acc = 10

6. currentVal = 5, acc = 10, new acc = 15

7. sumCnt = 15

Note that on a multi-partition RDD, fold() applies zeroValue once per partition and once more when merging the per-partition results, so zeroValue must be an identity for op (e.g. 0 for addition) or the result changes with the number of partitions.
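The per-partition behavior described above can be sketched in plain Python, without a Spark cluster. `simulate_fold` is a hypothetical helper written for illustration, not part of the PySpark API:

```python
from functools import reduce

def simulate_fold(partitions, zero_value, op):
    """Illustrative model of RDD.fold(): not the real PySpark implementation."""
    # fold() folds each partition's elements starting from zeroValue...
    per_part = [reduce(op, part, zero_value) for part in partitions]
    # ...then folds the per-partition results, again starting from zeroValue.
    return reduce(op, per_part, zero_value)

# Single partition: matches the walkthrough above.
print(simulate_fold([[1, 2, 3, 4, 5]], 0, lambda x, y: x + y))  # 15

# Two partitions: same answer, because 0 is the identity for addition.
print(simulate_fold([[1, 2], [3, 4, 5]], 0, lambda x, y: x + y))  # 15

# A non-identity zeroValue is added once per partition plus once in the merge:
# 2 partitions + 1 merge = 3 extra units, so the result is 15 + 3 = 18.
print(simulate_fold([[1, 2], [3, 4, 5]], 1, lambda x, y: x + y))  # 18
```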

2. aggregate()

Signature: aggregate(self, zeroValue, seqOp, combOp)

seqOp: the function applied to the elements within each partition (node)

combOp: after seqOp has processed every partition, combOp merges the per-partition results to produce the final answer

Example: compute the mean of the sequence [1, 2, 3, 4, 5]

>>> nums = sc.parallelize([1, 2, 3, 4, 5])
>>> sumCnt = nums.aggregate(
...     (0, 0),
...     (lambda partSumAndNum, elem: (partSumAndNum[0] + elem, partSumAndNum[1] + 1)),
...     (lambda part1Ret, part2Ret: (part1Ret[0] + part2Ret[0], part1Ret[1] + part2Ret[1])))
>>> print(sumCnt[0] / float(sumCnt[1]))
3.0

partSumAndNum: the running (sum, count) accumulator for one partition (node); for example, if part1 holds the elements [1, 2, 3, 4, 5], then part1's partSumAndNum is (15, 5).
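The seqOp/combOp split can likewise be modeled in plain Python. `simulate_aggregate` is a hypothetical helper for illustration only; the seqOp and combOp lambdas mirror the ones passed to nums.aggregate() above:

```python
from functools import reduce

def simulate_aggregate(partitions, zero_value, seq_op, comb_op):
    """Illustrative model of RDD.aggregate(): not the real PySpark implementation."""
    # seqOp folds each partition's elements into an accumulator seeded with zeroValue.
    per_part = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp merges the per-partition accumulators, starting again from zeroValue.
    return reduce(comb_op, per_part, zero_value)

seq_op = lambda acc, elem: (acc[0] + elem, acc[1] + 1)        # accumulate (sum, count)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])             # merge two (sum, count) pairs

# Two partitions: seqOp yields (3, 2) and (12, 3); combOp merges them into (15, 5).
total, count = simulate_aggregate([[1, 2], [3, 4, 5]], (0, 0), seq_op, comb_op)
print(total / count)  # 3.0
```

Unlike fold(), the merged value and the element type may differ here: the elements are ints, while the accumulator is a (sum, count) tuple, which is why a separate combOp is needed.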



Reprinted from blog.csdn.net/u011376563/article/details/79045525