A DataFrame can also split a column into multiple rows by a delimiter, but the author notes DataFrames require more resources than RDDs, so we start with splitting a column of an RDD into multiple rows.
For the DataFrame side, see https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame : the functions pyspark.sql.functions.explode(col), pyspark.sql.functions.explode_outer(col), pyspark.sql.functions.posexplode(col), and pyspark.sql.functions.posexplode_outer(col), combined with pyspark.sql.functions.split, provide the column-splitting interface.
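Without a live Spark session, the net effect of split plus explode can be sketched in plain Python; the function name below is illustrative, not a Spark API:

```python
# Plain-Python sketch of what split() + explode() do to each row:
# split turns "x,y,z" into ["x", "y", "z"], and explode emits one
# output row per list element, duplicating the other columns.
def split_then_explode(rows, delim=","):
    out = []
    for key, packed in rows:
        for value in packed.split(delim):
            out.append((key, value))
    return out

print(split_then_explode([("a", "x,y,z"), ("b", "p,r")]))
# → [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
```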
For RDDs (https://spark.apache.org/docs/2.3.1/api/python/pyspark.html#pyspark.RDD), the main methods are flatMap(f, preservesPartitioning=False) and flatMapValues(f). The function f passed to flatMap is what performs the split.
x = sc.parallelize([("a", "x,y,z"), ("b", "p,r")])
x.flatMap(lambda x: [(x[0], x[1].split(",")[0]), (x[0], x[1].split(",")[1])]).collect()  # take the first two comma-separated items of the second column as two rows; this does not handle an unknown number of items, and in production nonstandard data easily causes exceptions
[('a', 'x'), ('a', 'y'), ('b', 'p'), ('b', 'r')]
def itemsToRow(x):
    rows = []  # renamed from "list" to avoid shadowing the built-in
    for value in x[1].split(","):
        newrow = (x[0], value)
        rows.append(newrow)
    return rows
x.flatMap(itemsToRow).collect()  # this turns a value with any number of elements into that many rows
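Since dirty production data is the stated worry, a defensive variant can skip empty or non-string values instead of raising; this is a plain-Python sketch with an illustrative name, not code from the original:

```python
def items_to_rows_safe(x, delim=","):
    key, packed = x
    # skip None / non-string / empty values that dirty data may contain
    if not isinstance(packed, str) or not packed:
        return []
    # strip whitespace and drop empty fragments such as "x,,y"
    return [(key, v.strip()) for v in packed.split(delim) if v.strip()]

print(items_to_rows_safe(("a", "x, y,,z")))
# → [('a', 'x'), ('a', 'y'), ('a', 'z')]
```

Passing this to flatMap keeps the job running on malformed rows instead of failing the whole stage.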
flatMapValues is relatively simple: for each (k, v) row, v is passed to f, which splits it into elements, and each element is emitted as its own (k, element) row. Note that v must be an iterable (container) type.
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
def f(x): return x
x.flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
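The contract of flatMapValues can be mimicked in plain Python; this is a sketch of the semantics, not Spark's implementation:

```python
def flat_map_values(pairs, f):
    # for each (k, v), apply f to v and emit one (k, element) row
    # per element of the iterable f returns
    return [(k, e) for k, v in pairs for e in f(v)]

data = [("a", ["x", "y", "z"]), ("b", ["p", "r"])]
print(flat_map_values(data, lambda v: v))
# → [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
```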
If v is not a container type, we need to find a way to turn it into one first, for example:
x = sc.parallelize([("a", "x,y,z"), ("b", "p,r")])
def f(x): return x
x.map(lambda x:(x[0],x[1].split(","))).flatMapValues(f).collect()
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
This method is the functional inverse of combineByKey, which merges multiple rows into one; used together and repeatedly, the two allow many tricks and make a powerful combination.
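To illustrate the opposite direction, here is a plain-Python sketch of a combineByKey-style regroup that undoes the split above; the helper name is illustrative:

```python
from collections import defaultdict

def group_back(pairs, delim=","):
    # inverse of the split: collect values per key, then re-join them
    acc = defaultdict(list)
    for k, v in pairs:
        acc[k].append(v)
    return [(k, delim.join(vs)) for k, vs in acc.items()]

rows = [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
print(group_back(rows))
# → [('a', 'x,y,z'), ('b', 'p,r')]
```

In Spark itself the same regroup would be expressed with combineByKey (or groupByKey plus a join of the values).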