Baerrow :
With Spark and Java, I am trying to add an Integer identifier column to an existing Dataset[Row] with n columns.
I successfully added an id with zipWithUniqueId()
, with zipWithIndex
, and even with monotonically_increasing_id()
, but none of them gives a satisfactory result.
Example: I have one dataset with 195 rows. When I use any of these three methods, I get ids like 1584156487 or 12036. Moreover, those ids are not contiguous.
What I need/want is rather simple: an Integer id column whose value goes from 1 to dataset.count(), where the row with id = 1 is followed by the row with id = 2, and so on.
How can I do that in Java/Spark?
Fabich :
You can try the row_number window function, which produces a contiguous Integer sequence starting at 1 over the given ordering:
In Java:
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;
Dataset<Row> dfWithId = df.withColumn("id", functions.row_number().over(Window.orderBy("a column")));
Note that withColumn returns a new Dataset, so assign the result; and a window with no partition clause pulls all rows into a single partition, which can be slow on large datasets.
Or in Scala:
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.expressions.Window
val dfWithId = df.withColumn("id", row_number().over(Window.orderBy("a column")))
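Outside of Spark, the semantics of row_number().over(Window.orderBy(...)) can be sketched in plain Java: sort the rows by the ordering column, then hand out ids 1, 2, 3, ... in that order. This is only an illustrative sketch (the class name, the column values, and the use of a Map are made up for the example, not part of the Spark API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowNumberSketch {
    // Assigns ids 1..n to the distinct values of one column, ordered
    // ascending, mimicking row_number().over(Window.orderBy("col")).
    static Map<String, Integer> assignIds(List<String> column) {
        List<String> sorted = new ArrayList<>(column);
        sorted.sort(Comparator.naturalOrder()); // the Window.orderBy step
        Map<String, Integer> ids = new LinkedHashMap<>();
        int id = 1;                             // row_number starts at 1
        for (String value : sorted) {
            ids.put(value, id++);               // contiguous: 1, 2, 3, ...
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = assignIds(List.of("banana", "apple", "cherry"));
        System.out.println(ids); // {apple=1, banana=2, cherry=3}
    }
}
```

Unlike monotonically_increasing_id(), which encodes the partition id in the upper bits of a Long (hence the large, non-contiguous values in the question), this numbering is dense and starts at 1, which is exactly what row_number gives you inside Spark.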