First, let's see how parallelize splits your data between partitions:

    val x = sc.parallelize(List("12","23","345","4567"), 2)
    x.glom.collect
    // Array[Array[String]] = Array(Array(12, 23), Array(345, 4567))

    val y = sc.parallelize(List("12","23","345",""), 2)
    y.glom.collect
    // Array[Array[String]] = Array(Array(12, 23), Array(345, ""))

and define two helpers:

    def seqOp(x: String, y: String) = math.min(x.length, y.length).toString
    def combOp(x: String, y: String) = x + y

Now let's trace the execution for x. Ignoring parallelism, it can be represented as follows:

    (combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") "4567"))
    (combOp (seqOp "0" "23") (seqOp (seqOp "" "345") "4567"))
    (combOp "1" (seqOp (seqOp "" "345") "4567"))
    (combOp "1" (seqOp "0" "4567"))
    (combOp "1" "1")
    "11"

The same thing for y:

    (combOp (seqOp (seqOp "" "12") "23") (seqOp (seqOp "" "345") ""))
    (combOp (seqOp "0" "23") (seqOp (seqOp "" "345") ""))
    (combOp "1" (seqOp (seqOp "" "345") ""))
    (combOp "1" (seqOp "0" ""))
    (combOp "1" "0")
    "10"
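The two traces can be reproduced without a Spark cluster. The following is a minimal sketch in plain Scala, assuming the partition layout reported by glom above and an empty string as the zero value. The names AggregateTrace and simulateAggregate are hypothetical helpers made up for illustration, not part of Spark. The sketch always combines partition results left to right; as far as I know, Spark does not guarantee that order, so on a real cluster y could plausibly come back as "01" instead of "10".

```scala
// Plain-Scala sketch of how aggregate evaluates over fixed partitions.
// No Spark dependency; the partitions are hard-coded from the glom output.
object AggregateTrace {
  // Same helpers as in the text, with explicit result types.
  def seqOp(x: String, y: String): String = math.min(x.length, y.length).toString
  def combOp(x: String, y: String): String = x + y

  // Hypothetical helper: fold every partition with seqOp starting from zero,
  // then merge the per-partition results with combOp, left to right.
  def simulateAggregate(partitions: Seq[Seq[String]], zero: String): String =
    partitions
      .map(partition => partition.foldLeft(zero)(seqOp))
      .foldLeft(zero)(combOp)

  def main(args: Array[String]): Unit = {
    val x = Seq(Seq("12", "23"), Seq("345", "4567"))
    val y = Seq(Seq("12", "23"), Seq("345", ""))

    println(simulateAggregate(x, "")) // prints 11
    println(simulateAggregate(y, "")) // prints 10
  }
}
```

The only difference between the two inputs is the last element: the empty string in y has length 0, so the second partition folds to "0" rather than "1".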
aggregate
Reposted from wang-peng1.iteye.com/blog/2315012