Spark add same row number based on value in cell

rakeeee :

I have the following data:

//input data
df.show()
//+---+---+---+
//|  x|  y|  z|
//+---+---+---+
//|tes| 45| 34|
//|tes| 43| 67|
//|tes| 56| 43|
//|raj| 45| 43|
//|raj| 44| 67|
//+---+---+---+

I want to convert it to the following, without changing the order of the input rows.

//output data
df.show()
//+---+---+---+---+
//|  x|  y|  z|  n|
//+---+---+---+---+
//|tes| 45| 34|  1|
//|tes| 43| 67|  1|
//|tes| 56| 43|  1|
//|raj| 45| 43|  2|
//|raj| 44| 67|  2|
//+---+---+---+---+

BlueSheepToken :

The idea is to first add a column that answers "is this row the start of a new x?" (1 if yes, 0 if no), and then take a running sum of that column, so the counter goes up by one each time x changes.
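
As a rough illustration of that idea outside Spark, here is a minimal sketch on a plain Scala collection. The names xs, isNew and groupIds are made up for this sketch and are not part of the Spark code.

```scala
// Plain Scala collections only, no Spark needed.
// xs holds the x column of the question's input, in input order.
val xs = Seq("tes", "tes", "tes", "raj", "raj")

// Flag each position: 1 if the value differs from the previous one
// (the first row always counts as new), otherwise 0.
val isNew = xs.indices.map(i => if (i == 0 || xs(i) != xs(i - 1)) 1 else 0)

// A running sum of the flags gives the group number for each row.
// scanLeft emits the initial 0 first, so drop it with tail.
val groupIds = isNew.scanLeft(0)(_ + _).tail

println(groupIds.mkString(", ")) // prints "1, 1, 1, 2, 2"
```

The Spark version does the same two steps, with lag playing the role of xs(i - 1) and a windowed sum playing the role of scanLeft.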

In Spark, this needs a window function and the lag function.

// Some imports
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
// Needed for toDF and $"..." outside spark-shell; assumes a SparkSession named spark
import spark.implicits._

// inputs
val df = Seq(("tes", 45, 34), ("tes", 43, 67), ("tes", 56, 43), ("raj", 45, 43), ("raj", 44, 67)).toDF("x", "y", "z")

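// Window over the whole DataFrame, ordered by a generated id that stands in for the input row order
val windowSpec = Window.orderBy(F.monotonically_increasing_id())

// Flag each row: 1 if x differs from the previous row's x, 0 otherwise.
// lag returns null for the first row, so the comparison is null there; fill it with 1.
val dfWithNewNameInfo = df
  .withColumn("n", (F.lag($"x", 1).over(windowSpec) =!= $"x").cast("bigint"))
  .na.fill(1, Seq("n"))
dfWithNewNameInfo.show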
/*
+---+---+---+---+
|  x|  y|  z|  n|
+---+---+---+---+
|tes| 45| 34|  1|
|tes| 43| 67|  0|
|tes| 56| 43|  0|
|raj| 45| 43|  1|
|raj| 44| 67|  0|
+---+---+---+---+
*/
// A 1 in the last column marks each row where a new x starts

// Running sum of these flags
val resultDf = dfWithNewNameInfo.withColumn("n", F.sum("n").over(windowSpec))
resultDf.show
/*
+---+---+---+---+
|  x|  y|  z|  n|
+---+---+---+---+
|tes| 45| 34|  1|
|tes| 43| 67|  1|
|tes| 56| 43|  1|
|raj| 45| 43|  2|
|raj| 44| 67|  2|
+---+---+---+---+
*/

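This method gives the desired result for a small DataFrame. The window has no partitionBy, so Spark moves all rows to a single partition to evaluate it, which does not scale to large data.

Be aware that windowSpec orders by row position, so n increases every time x changes from one row to the next. That does not give one number per distinct x for the following DataFrame, where tes appears again after raj: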

//+---+---+---+
//|  x|  y|  z|
//+---+---+---+
//|tes| 45| 34|
//|tes| 43| 67|
//|raj| 45| 43|
//|raj| 44| 67|
//|tes| 56| 43|
//+---+---+---+

Giving the result:

//+---+---+---+---+
//|  x|  y|  z|  n|
//+---+---+---+---+
//|tes| 45| 34|  1|
//|tes| 43| 67|  1|
//|raj| 45| 43|  2|
//|raj| 44| 67|  2|
//|tes| 56| 43|  3|
//+---+---+---+---+

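That is why I strongly recommend ordering by "x" in the windowSpec. Note that the numbers then follow the sorted order of x rather than the order in which each x first appears in the input.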
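
If numbering by sorted x is acceptable, one way to sketch that ordering is Spark's dense_rank window function instead of the lag and sum pair. This is a substitute technique, not the code from the answer above. The local SparkSession and the names df2 and ranked are assumptions made only for this sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window

// Assumption: a local session, created here only to keep the sketch self-contained.
// In spark-shell, `spark` already exists and this line can be dropped.
val spark = SparkSession.builder().master("local[*]").appName("dense-rank-sketch").getOrCreate()
import spark.implicits._

// The problematic input from above: tes appears again after raj.
val df2 = Seq(("tes", 45, 34), ("tes", 43, 67), ("raj", 45, 43), ("raj", 44, 67), ("tes", 56, 43))
  .toDF("x", "y", "z")

// dense_rank gives rows with equal x the same number, with no gaps between numbers.
// The numbering follows the sort order of x, not the order of first appearance.
val ranked = df2.withColumn("n", F.dense_rank().over(Window.orderBy("x")))
ranked.show()
```

Here raj gets 1 and tes gets 2, because raj sorts before tes, and the trailing tes row gets the same number as the other tes rows. The window still has no partitionBy, so the single-partition caveat applies here as well, and the rows come back sorted by x rather than in input order.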
