Adding an auto-increment ID to a Spark DataFrame
When processing data with Spark, you often need to add an incrementing ID to an entire dataset; when the result is written to a database, this ID is frequently a key column. Some APIs also require one: mmlspark's LightGBMRanker, for example, expects an int/long-typed id column. Several implementations follow.
Using the RDD zipWithIndex operator
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append an "id" column to the original schema
val schema: StructType = dataframe.schema.add(StructField("id", LongType))
// Convert the DataFrame to an RDD, then call zipWithIndex
val dfRDD: RDD[(Row, Long)] = dataframe.rdd.zipWithIndex()
val rowRDD: RDD[Row] = dfRDD.map(tp => Row.merge(tp._1, Row(tp._2)))
// Convert the indexed RDD back to a DataFrame
val df2 = spark.createDataFrame(rowRDD, schema)
df2.show()
+-----------+-----------+---+
| lon| lat| id|
+-----------+-----------+---+
|106.4273071|29.63554591| 0|
| 106.44104|29.51372023| 1|
|106.4602661|29.60211821| 2|
|106.4657593|29.45394812| 3|
+-----------+-----------+---+
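If strictly consecutive IDs are not required, RDD.zipWithUniqueId is a related option: unlike zipWithIndex it does not trigger an extra Spark job to count each partition, but the IDs it assigns are unique rather than consecutive (partition k of n receives k, n+k, 2n+k, ...). A minimal sketch, reusing the dataframe and spark values from above:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Same schema extension as before: append a long "id" column
val uniqSchema: StructType = dataframe.schema.add(StructField("id", LongType))

// zipWithUniqueId assigns unique (not consecutive) IDs without the
// extra counting job that zipWithIndex launches
val uniqRDD: RDD[Row] = dataframe.rdd
  .zipWithUniqueId()
  .map { case (row, id) => Row.merge(row, Row(id)) }

val dfUnique = spark.createDataFrame(uniqRDD, uniqSchema)
```

For a use case like the LightGBMRanker id column, uniqueness is usually what matters, so this variant trades consecutive numbering for one fewer job.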
Using the Spark SQL function monotonically_increasing_id
import org.apache.spark.sql.functions._

// The generated IDs are guaranteed to be monotonically increasing and
// unique, but not consecutive (the partition ID is encoded in the upper bits)
val withId = inputDF.withColumn("id", monotonically_increasing_id())
withId.show()
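Because monotonically_increasing_id produces non-consecutive values, a common workaround when consecutive IDs are required is to rank those values with row_number over a window. A sketch, assuming the same inputDF as above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank the non-consecutive monotonic IDs to get consecutive 0-based IDs.
// Caution: a window with no partitioning moves all rows into a single
// partition, so this is only suitable for data that fits on one executor.
val consecutive = inputDF
  .withColumn("mono_id", monotonically_increasing_id())
  .withColumn("id", row_number().over(Window.orderBy("mono_id")) - 1)
  .drop("mono_id")
```

This stays entirely in the DataFrame API, at the cost of the single-partition shuffle; for large datasets the zipWithIndex approach above scales better.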