Advanced PySpark RDD API usage: combineByKey, merging multiple rows into one by key

combineByKey is one of the lower-level, more advanced APIs: higher-level keyed operations such as groupByKey and reduceByKey (and, conceptually, DataFrame groupBy) are built on top of it.
combineByKey takes three functions as arguments: one turns a single (k, v) pair into a new (k, c) pair, one merges another (k, v) into an existing (k, c), and one merges two (k, c) pairs into a single (k, c), yielding the final set of (k, c) key-value pairs.
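
In PySpark the call looks like this (a minimal annotated sketch; the parameter names match the official documentation):

rdd.combineByKey(
    createCombiner,   # v -> c: build a combiner from the first value seen for a key in a partition
    mergeValue,       # (c, v) -> c: fold another value for the same key into the combiner
    mergeCombiners,   # (c1, c2) -> c: merge combiners produced on different partitions
)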

The simplest example comes from the official API docs:
https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD
(search the page for combineByKey). Run it in a Python Spark environment:

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
def to_list(a):  # turn a single value v into a combiner c
    return [a]
def append(a, b):  # merge a value v into an existing combiner c
    a.append(b)
    return a
def extend(a, b):  # merge two combiners c together
    a.extend(b)
    return a
sorted(x.combineByKey(to_list, append, extend).collect())
[('a', [1, 2]), ('b', [1])]
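
The same three-function pattern also handles aggregations where the combiner type differs from the value type. Here is a sketch (my own example, not from the original article, assuming Python 3) that computes a per-key average by carrying a (sum, count) pair as the combiner:

x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

def make_acc(v):        # v -> c: start a (sum, count) accumulator
    return (v, 1)

def add_value(acc, v):  # (c, v) -> c: fold one more value into the accumulator
    return (acc[0] + v, acc[1] + 1)

def merge_accs(a1, a2): # (c, c) -> c: combine accumulators from different partitions
    return (a1[0] + a2[0], a1[1] + a2[1])

averages = x.combineByKey(make_acc, add_value, merge_accs) \
            .mapValues(lambda acc: acc[0] / acc[1])
sorted(averages.collect())
[('a', 1.5), ('b', 1.0)]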

For a more complex use case, see my article on computing per-user login percentage distribution statistics.
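
As a rough idea of what such a statistic might look like (a hypothetical sketch, not the code from that article; the sample data and field names are invented), combineByKey can count each user's logins per channel and then derive a percentage distribution per user:

logins = sc.parallelize([
    ("u1", "web"), ("u1", "app"), ("u1", "app"),
    ("u2", "web"),
])  # invented (user, login_channel) sample pairs

def start(ch):        # v -> c: start a {channel: count} dict
    return {ch: 1}

def add(counts, ch):  # (c, v) -> c: count one more login
    counts[ch] = counts.get(ch, 0) + 1
    return counts

def merge(c1, c2):    # (c, c) -> c: merge dicts built on different partitions
    for ch, n in c2.items():
        c1[ch] = c1.get(ch, 0) + n
    return c1

def to_percentages(counts):  # turn raw counts into a percentage distribution
    total = sum(counts.values())
    return {ch: 100.0 * n / total for ch, n in counts.items()}

dist = logins.combineByKey(start, add, merge).mapValues(to_percentages)
sorted(dist.collect())  # u1 -> roughly {'web': 33.3, 'app': 66.7}, u2 -> {'web': 100.0}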


Origin blog.csdn.net/u010720408/article/details/94434643