aggregate

First lets see how parallelize splits your data between partitions:

val x = sc.parallelize(List("12","23","345","4567"), 2)
x.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

val y = sc.parallelize(List("12","23","345",""), 2)
y.glom.collect
// Array[Array[String]] = Array(Array(12, 23), Array(345, ""))
and define two helpers:

def seqOp(x: String, y: String) =  math.min(x.length, y.length).toString
def combOp(x: String, y: String) = x + y
Now lets trace execution for x. Ignoring parallelism it can be represented as follows:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "4567"))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp (seqOp "" "345") "4567"))
(combOp "1" (seqOp "0" "4567"))
(combOp "1" "1")
"11"
The same thing for y:

(combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") ""))
(combOp (seqOp "0" "23") (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp (seqOp "" "345") ""))
(combOp "1" (seqOp "0" ""))
(combOp "1" "0")
"10"

猜你喜欢

转载自wang-peng1.iteye.com/blog/2315012
今日推荐