pyspark---将list作为df的新列添加

python中的list不能直接添加到dataframe中,需要先将list转为新的dataframe,然后新的dataframe和老的dataframe进行join操作, 下面的例子会先新建一个dataframe,然后将list转为dataframe,然后将两者join起来。

from pyspark.sql.functions import lit

df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("id", monotonically_increasing_id())
df.show()

+---+---+-----+---+
| x1| x2|   x3| id|
+---+---+-----+---+
|  1|  a| 23.0|  0|
|  3|  B|-23.0|  1|
+---+---+-----+---+
from pyspark.sql import Row
l = ['jerry', 'tom']
row = Row("pid", "name")
new_df = sc.parallelize([row(i, l[i]) for i in range(0,len(l))]).toDF()
new_df.show()

+---+-----+
|pid| name|
+---+-----+
|  0|jerry|
|  1|  tom|
+---+-----+
join_df = df.join(new_df, df.id==new_df.pid)
join_df.show()

+---+---+-----+---+---+-----+
| x1| x2|   x3| id|pid| name|
+---+---+-----+---+---+-----+
|  1|  a| 23.0|  0|  0|jerry|
|  3|  B|-23.0|  1|  1|  tom|
+---+---+-----+---+---+-----+

Guess you like

Origin blog.csdn.net/qq_42363032/article/details/118934268