Spark - adding an index column (auto-increment id column) to a DataFrame

Adding an auto-increment id to a Spark DataFrame

  When processing data with Spark, you often need to add an auto-increment ID over the full data set; when the data is stored in a database, this ID is often a key field. For example, using LightGBMRanker in mmlspark requires an int/long id column. Below are several ways to implement this.

Using the RDD zipWithIndex operator


// Add an "id" column to the original schema
val schema: StructType = dataframe.schema.add(StructField("id", LongType))

// Convert the DataFrame to an RDD, then call zipWithIndex
val dfRDD: RDD[(Row, Long)] = dataframe.rdd.zipWithIndex()

// Append the index to each Row
val rowRDD: RDD[Row] = dfRDD.map(tp => Row.merge(tp._1, Row(tp._2)))

// Convert the indexed RDD back to a DataFrame
val df2 = spark.createDataFrame(rowRDD, schema)

df2.show()
+-----------+-----------+---+
|        lon|        lat| id|
+-----------+-----------+---+
|106.4273071|29.63554591|  0|
|  106.44104|29.51372023|  1|
|106.4602661|29.60211821|  2|
|106.4657593|29.45394812|  3|
+-----------+-----------+---+
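A related point worth knowing: zipWithIndex triggers an extra Spark job to count the rows in each partition before assigning indices. RDD.zipWithUniqueId avoids that extra job, at the cost of ids that are unique but not consecutive. A minimal self-contained sketch (the object and helper names are illustrative, not from the original post):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField}

object ZipWithUniqueIdExample {
  // Adds a long "id" column whose values are unique but NOT consecutive:
  // zipWithUniqueId assigns k, k+n, k+2n, ... within partition k of n,
  // and needs no extra job to count partition sizes.
  def addUniqueId(df: DataFrame): DataFrame = {
    val schema = df.schema.add(StructField("id", LongType))
    val rdd = df.rdd.zipWithUniqueId().map { case (row, id) => Row.merge(row, Row(id)) }
    df.sparkSession.createDataFrame(rdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("uniqueId").getOrCreate()
    import spark.implicits._
    val df = Seq(106.4273071, 106.44104, 106.4602661, 106.4657593).toDF("lon")
    addUniqueId(df).show()
    spark.stop()
  }
}
```

If you only need a unique key (for a database primary key, say) rather than a strict 0..n-1 sequence, this is usually the cheaper choice.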

Using the SparkSQL function monotonically_increasing_id

    import org.apache.spark.sql.functions._
    // Note: the original snippet wrote `val inputDF = inputDF.withColumn(...)`,
    // which does not compile (a val cannot reference itself); bind a new name instead.
    val outputDF = inputDF.withColumn("id", monotonically_increasing_id)
    outputDF.show
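Be aware that monotonically_increasing_id guarantees values that are increasing and unique, but not consecutive: the partition id is encoded in the upper bits of the 64-bit value, so ids jump between partitions. When a strict 0..n-1 sequence is required, row_number over a window works, though an unpartitioned window pulls all rows onto a single executor, so it only suits small data. A sketch under those assumptions (names are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

object ConsecutiveIdExample {
  // Adds a strictly consecutive 0..n-1 "id" column. Ordering by
  // monotonically_increasing_id preserves the DataFrame's current row order.
  // Caution: the unpartitioned window moves all rows to one partition.
  def addConsecutiveId(df: DataFrame): DataFrame = {
    val w = Window.orderBy(monotonically_increasing_id())
    df.withColumn("id", row_number().over(w) - 1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("consecutiveId").getOrCreate()
    import spark.implicits._
    val df = Seq(106.4273071, 106.44104, 106.4602661, 106.4657593).toDF("lon")
    addConsecutiveId(df).show()
    spark.stop()
  }
}
```

For large data sets where gaps are acceptable, prefer monotonically_increasing_id or zipWithIndex, which stay distributed.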


Origin blog.csdn.net/Aeve_imp/article/details/104923222