PySpark RDD: splitting a column by a delimiter into multiple rows

A DataFrame can also split a column by a delimiter into multiple rows, but a DataFrame requires more resources than an RDD, so this post first covers splitting a column of an RDD into multiple rows.
For the DataFrame approach, see https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame and the functions pyspark.sql.functions.explode(col), pyspark.sql.functions.explode_outer(col), pyspark.sql.functions.posexplode(col), and pyspark.sql.functions.posexplode_outer(col).
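
The four DataFrame functions differ only in whether they also emit a position and whether they keep rows whose array is null or empty. The sketch below is a plain-Python model of those semantics, not Spark code; the helper name `explode_model` and its `with_pos` and `outer` flags are hypothetical and exist only for illustration.

```python
def explode_model(rows, with_pos=False, outer=False):
    """Plain-Python model of explode / posexplode and their *_outer variants.

    Hypothetical helper for illustration; it is not part of PySpark.
    rows is a list of (key, list_or_None) tuples.
    """
    out = []
    for key, values in rows:
        if not values:  # None or an empty list
            if outer:
                # The *_outer variants keep the row and fill it with nulls.
                out.append((key, None, None) if with_pos else (key, None))
            continue  # the plain variants drop the row
        for pos, value in enumerate(values):
            out.append((key, pos, value) if with_pos else (key, value))
    return out


rows = [("a", ["x", "y"]), ("b", []), ("c", None)]
exploded = explode_model(rows)                      # models explode
exploded_outer = explode_model(rows, outer=True)    # models explode_outer
positions = explode_model(rows, with_pos=True)      # models posexplode
```

In this model, keys "b" and "c" disappear from `exploded` but survive in `exploded_outer` with a `None` value.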

Splitting an RDD (https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD) mainly uses the methods flatMap(f, preservesPartitioning=False) and flatMapValues(f).

flatMap takes a function f as its argument; f performs the split and returns the resulting rows.

x = sc.parallelize([("a", "x,y,z"), ("b", "p,r")])
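# Take the first two comma-separated items of the second column as two rows.
# This does not suit an unknown number of items, and non-standard production data can easily make the program fail.
x.flatMap(lambda row: [(row[0], row[1].split(",")[0]), (row[0], row[1].split(",")[1])]).collect()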
[('a', 'x'), ('a', 'y'), ('b', 'p'), ('b', 'r')]

def itemsToRow(x):
    rows = []
    for value in x[1].split(","):
        rows.append((x[0], value))
    return rows  # return after the loop so that every element is kept

# One row is produced per element, however many elements there are.
x.flatMap(itemsToRow).collect()
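
Production data often contains missing values or stray whitespace, which is the kind of non-standard input mentioned earlier. A minimal sketch of a more defensive row splitter follows; the name `split_row`, the whitespace stripping, and the choice to drop empty tokens are assumptions made for illustration, not part of the original post.

```python
def split_row(row, sep=","):
    """Yield one (key, token) pair per non-empty token in row[1]."""
    key, value = row[0], row[1]
    if value is None:  # missing value: emit no rows for this key
        return
    for token in value.split(sep):
        token = token.strip()
        if token:  # skip empty tokens such as the gap in "x,,y"
            yield (key, token)


# Plain-Python check of the function on sample rows (no Spark needed).
rows = [("a", "x, y,,z"), ("b", None), ("c", "p")]
result = [pair for row in rows for pair in split_row(row)]
```

Because flatMap accepts any function that returns an iterable, the same function can be passed to an RDD as x.flatMap(split_row).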

flatMapValues is simpler: for each (k, v) row it passes v to f and emits one (k, element) row for every element of the container that f returns. Note that v must be a container type here.

x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]

If v is not a container type, we need to find a way to convert it into one, for example:

x = sc.parallelize([("a", "x,y,z"), ("b", "p,r")])
def f(x): return x
x.map(lambda x:(x[0],x[1].split(","))).flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
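
The container requirement matters because a Python string is itself iterable. The sketch below is a plain-Python model of what flatMapValues does to each pair, not the Spark implementation; the helper name `flat_map_values` is hypothetical. It shows what an identity function yields on unsplit strings and how letting f do the split avoids the separate map step.

```python
def flat_map_values(pairs, f):
    """Plain-Python model of RDD.flatMapValues: one (k, e) per element e of f(v)."""
    return [(k, e) for k, v in pairs for e in f(v)]


pairs = [("a", "x,y,z"), ("b", "p,r")]

# Identity on a string iterates over its characters, commas included.
by_char = flat_map_values(pairs, lambda v: v)

# Letting f perform the split gives the intended rows directly.
by_token = flat_map_values(pairs, lambda v: v.split(","))
```

On an RDD the second form corresponds to x.flatMapValues(lambda v: v.split(",")), which should give the same rows as the map followed by flatMapValues shown above.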

This is the inverse of combineByKey, which merges multiple rows into one. Used together, the two allow many useful transformations, making them a powerful combination.
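
To make the inverse relationship concrete, here is a plain-Python model of combineByKey that joins split rows back into one delimited string per key. It is a sketch of the semantics only, not Spark code; the helper name `combine_by_key` and the two-partition layout are assumptions for illustration.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python model of combineByKey over a list of partitions."""
    per_partition = []
    for part in partitions:
        combined = {}
        for k, v in part:
            if k in combined:
                combined[k] = merge_value(combined[k], v)  # key seen before in this partition
            else:
                combined[k] = create_combiner(v)  # first value for this key in this partition
        per_partition.append(combined)

    merged = {}
    for combined in per_partition:
        for k, c in combined.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return sorted(merged.items())


# Rows as produced by the split, spread over two hypothetical partitions.
partitions = [[("a", "x"), ("a", "y"), ("b", "p")], [("a", "z"), ("b", "r")]]

rejoined = combine_by_key(
    partitions,
    lambda v: v,                   # createCombiner: start from the first value
    lambda acc, v: acc + "," + v,  # mergeValue: append within a partition
    lambda a, b: a + "," + b,      # mergeCombiners: join partition results
)
```

On a real RDD the same three functions would be passed to combineByKey(createCombiner, mergeValue, mergeCombiners); the order in which values are joined is not guaranteed there, so the rebuilt string may not match the original order.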

Origin blog.csdn.net/u010720408/article/details/94436873