combineByKey underlies the more advanced aggregation APIs: groupBy on DataFrames and RDDs, RDD reduce, and reduceByKey all depend on it.
combineByKey takes three functions as arguments: the first turns a single value v into a combiner c, so one (k, v) pair becomes a new (k, c) pair; the second merges another v into an existing (k, c); and the third merges two (k, c) pairs into the final combined (k, c) for each key.
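As a rough illustration of that relationship (a sketch only, not the actual Spark implementation), reduceByKey(func) behaves like a combineByKey call whose combiner is just the value itself; sc is assumed to be an existing SparkContext, as in the pyspark shell:
from operator import add
nums = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
# createCombiner: a single v is already a valid combiner c here
# mergeValue / mergeCombiners: both just add, since v and c share one type
via_combine = nums.combineByKey(lambda v: v, add, add)
via_reduce = nums.reduceByKey(add)
sorted(via_combine.collect())  # [('a', 3), ('b', 1)]
sorted(via_reduce.collect())   # [('a', 3), ('b', 1)]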
The simplest example:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD
Search that page for combineByKey.
In a PySpark environment (e.g. the pyspark shell, where sc is a predefined SparkContext):
x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
def to_list(a):  # createCombiner: turn a single v into a c
    return [a]
def append(a, b):  # mergeValue: merge a v into an existing c
    a.append(b)
    return a
def extend(a, b):  # mergeCombiners: merge two c's
    a.extend(b)
    return a
sorted(x.combineByKey(to_list, append, extend).collect())
[('a', [1, 2]), ('b', [1])]
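A common follow-on pattern (sketched here with my own hypothetical helper names, not taken from the docs above) is to give the combiner a different type from the values, e.g. a (sum, count) pair for computing per-key averages:
scores = sc.parallelize([("a", 10), ("a", 20), ("b", 5)])
def make_acc(v):  # createCombiner: start a (sum, count) accumulator from one v
    return (v, 1)
def add_value(acc, v):  # mergeValue: fold another v into the accumulator
    return (acc[0] + v, acc[1] + 1)
def merge_accs(a, b):  # mergeCombiners: combine accumulators from two partitions
    return (a[0] + b[0], a[1] + b[1])
averages = scores.combineByKey(make_acc, add_value, merge_accs).mapValues(lambda acc: acc[0] / acc[1])
sorted(averages.collect())  # [('a', 15.0), ('b', 5.0)]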
For a more complex use case, see my article on per-user login percentage distribution statistics.