rakeeee :
I have data as follows
//input data
df.show()
//+---+---+---+
//| x| y| z|
//+---+---+---+
//|tes| 45| 34|
//|tes| 43| 67|
//|tes| 56| 43|
//|raj| 45| 43|
//|raj| 44| 67|
//+---+---+---+
I want to convert it to the following, without changing the order of the input rows.
//output data
df.show()
//+---+---+---+---+
//| x| y| z| n|
//+---+---+---+---+
//|tes| 45| 34| 1|
//|tes| 43| 67| 1|
//|tes| 56| 43| 1|
//|raj| 45| 43| 2|
//|raj| 44| 67| 2|
//+---+---+---+---+
BlueSheepToken :
The idea is to add a new column that answers the question "is this a new x?", and then to compute a running sum over that column.
For this, we need a window function and the lag method.
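To see the two steps in isolation, here is a minimal sketch of the same logic on a plain Scala collection instead of a DataFrame. `RunIdSketch` and `runIds` are hypothetical names used only for illustration. The comparison with the previous element plays the role of lag, and `scanLeft` plays the role of the running sum.

```scala
object RunIdSketch {
  // Hypothetical helper, not part of Spark: numbers consecutive runs of equal values.
  def runIds(xs: Seq[String]): Seq[Int] = {
    // Previous value for each element; the first element has no predecessor.
    val prev: Seq[Option[String]] = None +: xs.map(Option(_))
    // Flag is 1 when the value differs from the previous one (or there is none), else 0.
    val flags = xs.zip(prev).map { case (x, p) => if (p.contains(x)) 0 else 1 }
    // Running sum of the flags; drop the initial 0 that scanLeft prepends.
    flags.scanLeft(0)(_ + _).tail
  }
}
```

For example, `RunIdSketch.runIds(Seq("tes", "tes", "tes", "raj", "raj"))` gives `1, 1, 1, 2, 2`.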
// Some imports
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
// inputs
val df = Seq(("tes", 45, 34), ("tes", 43, 67), ("tes", 56, 43), ("raj", 45, 43), ("raj", 44, 67)).toDF("x", "y", "z")
// adding new information
val windowSpec = Window.orderBy(F.monotonically_increasing_id())
val dfWithNewNameInfo = df
.withColumn("n", (F.lag($"x", 1).over(windowSpec) =!= $"x").cast("bigint"))
.na.fill(1, Seq("n"))
dfWithNewNameInfo.show
/*
+---+---+---+---+
| x| y| z| n|
+---+---+---+---+
|tes| 45| 34| 1|
|tes| 43| 67| 0|
|tes| 56| 43| 0|
|raj| 45| 43| 1|
|raj| 44| 67| 0|
+---+---+---+---+
*/
// A 1 in the last column marks each row where a new x starts
// Running sum of these 1s
val resultDf = dfWithNewNameInfo.withColumn("n", F.sum("n").over(windowSpec))
resultDf.show
/*
+---+---+---+---+
| x| y| z| n|
+---+---+---+---+
|tes| 45| 34| 1|
|tes| 43| 67| 1|
|tes| 56| 43| 1|
|raj| 45| 43| 2|
|raj| 44| 67| 2|
+---+---+---+---+
*/
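This method gives the desired result for a small DataFrame that can be fully loaded in memory, because a window without partitionBy moves all rows to a single partition.
Be aware that the windowSpec orders by row position, so this would fail for the following DataFrame, where the rows for the same x are not contiguous: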
//+---+---+---+
//| x| y| z|
//+---+---+---+
//|tes| 45| 34|
//|tes| 43| 67|
//|raj| 45| 43|
//|raj| 44| 67|
//|tes| 56| 43|
//+---+---+---+
It gives the result:
//+---+---+---+---+
//| x| y| z| n|
//+---+---+---+---+
//|tes| 45| 34| 1|
//|tes| 43| 67| 1|
//|raj| 45| 43| 2|
//|raj| 44| 67| 2|
//|tes| 56| 43| 3|
//+---+---+---+---+
That is why I strongly recommend ordering by "x" in the windowSpec.
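As a rough illustration of ordering by "x", here is a sketch that swaps the lag and running sum for Spark's built-in `dense_rank`, which assigns one number per distinct x when the window is ordered by that column. This is a different technique from the one above, and `DenseRankSketch` and `rankByX` are hypothetical names. Note the trade-off: the ids follow the sort order of x (so raj gets 1 and tes gets 2), and the rows come back grouped by x rather than in input order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window

object DenseRankSketch {
  // Hypothetical helper: adds a column "n" with one id per distinct value of "x".
  def rankByX(df: DataFrame): DataFrame = {
    // No partitionBy, so Spark still moves all rows to a single partition.
    val byX = Window.orderBy(F.col("x"))
    df.withColumn("n", F.dense_rank().over(byX))
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in spark-shell a session already exists.
    val spark = SparkSession.builder().master("local[1]").appName("dense-rank-sketch").getOrCreate()
    import spark.implicits._

    // The non-contiguous input from the failing case above.
    val df = Seq(("tes", 45, 34), ("tes", 43, 67), ("raj", 45, 43), ("raj", 44, 67), ("tes", 56, 43))
      .toDF("x", "y", "z")

    // Every "raj" row gets n = 1 and every "tes" row gets n = 2.
    rankByX(df).show()
    spark.stop()
  }
}
```

If the ids must follow first appearance and the input order must be preserved, this sketch is not enough on its own; it only shows what ordering the window by "x" buys you.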