Calculate a value in spark Dataset depending of previous version with Java Spark

Jack :

I have this dataset :

ID      timestamp   value
unique1 1584420000  120
unique1 1584410000  100
unique1 1584400000  20
unique2 1584410000  90
unique2 1584400000  10
unique3 1584400000  30

i need to calculate the value for an id and version depending of the previous version of the same id. If an ID doesn't have a last version before, the value is kept the same

ID      timestamp   valueCalculated
unique1 1584420000  20
unique1 1584410000  80
unique1 1584400000  20
unique2 1584410000  80
unique2 1584400000  10
unique3 1584400000  30

I have tried to achieve this but i am only able to aggregate by id and version and do a minus of the last two version avaible ( if an ID hasn't been updated it will keep its value ) which gives only one row per ID.

ID      timestamp   valueCalculated
unique1 1584420000  20
unique2 1584410000  80
unique3 1584400000  30

This my code :

dataset.groupBy("id","timestamp")
.agg(
max("timestamp").as("timestamp"),
functionscallUDF("CalculateValue",first("timestamp"),first("value"),last("timestamp"),last("value")
).as("valueCalculated")

i have used an UDF4 to calculate the value expected :

sparksession.udf().register("CalculatValue", (UDF4<Long,Double,Long,Double,Double>) this::calculateValue , DataTypes.DoubleType);

public Double calculateValue(Long Version1, Double Value1,Long Version2, Double Value2){
if(version1.equals(version2)){
return value1;
}else{
return value1 - value2;
}
}

I don't think i am using the good approach here becaure of the aggregation. Could you please help on to achieve this ? Thanks

Lamanus :

I don't understand what is the version, but you can calculate the difference of the value between the current row and the previous row in this way:

import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("ID").orderBy("timestamp")
df.withColumn("previousValue", lag($"value", 1, 0).over(w))
  .withColumn("valueCalculated", $"value" - $"previousValue")
  .orderBy("ID", "timestamp")
  .show(false)

which will give you the result as follows:

+-------+----------+-----+-------------+---------------+
|ID     |timestamp |value|previousValue|valueCalculated|
+-------+----------+-----+-------------+---------------+
|unique1|1584400000|20   |0            |20             |
|unique1|1584410000|100  |20           |80             |
|unique1|1584420000|120  |100          |20             |
|unique2|1584400000|10   |0            |10             |
|unique2|1584410000|90   |10           |80             |
|unique3|1584400000|30   |0            |30             |
+-------+----------+-----+-------------+---------------+

Guess you like

Origin http://43.154.161.224:23101/article/api/json?id=296433&siteId=1