Motheri Yani :
I am working on Spark code in Java. After a join we get multiple records per ID, because the same ID appears in different sources (the IDs are duplicated, but some attributes differ). What I need is to combine the duplicate records into a single unique row for each ID.
Input dataset:
+---+---+---+----+---+---+
|id |b  |c  |d   |f  |g  |
+---+---+---+----+---+---+
|1  |e  |dd |ddd |34 |r5t|
|1  |e  |dd2|ddd |34 |r5t|
|1  |e  |dd3|ddd |34 |rt |
|2  |e  |dd |ddd1|34 |5rt|
|4  |e  |dd |ddd1|34 |rt |
|1  |e  |dd4|ddd |34 |rt |
|4  |e  |dd4|ddd |34 |rt |
|4  |e  |dd4|ddd |3  |rt |
|2  |e  |dd |ddd |3  |r5t|
|2  |e  |dd |ddd |334|rt |
+---+---+---+----+---+---+
Expected output:
+---+--------------+--------------+--------------+-------------------+--------------+
|id |f             |b             |g             |c                  |d             |
+---+--------------+--------------+--------------+-------------------+--------------+
|1  |[34]          |[e]           |[r5t, rt]     |[dd4, dd3, dd2, dd]|[ddd]         |
+---+--------------+--------------+--------------+-------------------+--------------+
I tried calling collect_set explicitly for each column, as below:
df.groupBy("id").agg(
        functions.collect_set("f"),
        functions.collect_set("b")
).show(1, false);
But in my case the dataset has 300 columns, and the set of columns is not fixed; it changes from time to time.
Yashwanth Kambala :
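In Spark's org.apache.spark.sql package, RelationalGroupedDataset (the type returned by groupBy) has an agg(exprs: Map[String, String]) method. From Java it accepts a Map<String,String> in which each key is a column name and each value is the name of an aggregate function, here collect_set.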
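import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> df = spark.read().format("csv").option("header", "true")
        .load("...");
// Map every column except the grouping key to the aggregate function name.
Map<String, String> collect_MAP = Arrays.stream(df.columns())
        .filter(f -> !f.equals("id"))
        .collect(Collectors.toMap(f -> f, f -> "collect_set"));
df.groupBy("id").agg(collect_MAP).show(false);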
Result:
+---+--------------+--------------+--------------+-------------------+--------------+
|id |collect_set(f)|collect_set(b)|collect_set(g)|collect_set(c)     |collect_set(d)|
+---+--------------+--------------+--------------+-------------------+--------------+
|1  |[34]          |[e]           |[r5t, rt]     |[dd4, dd3, dd2, dd]|[ddd]         |
|4  |[3, 34]       |[e]           |[rt]          |[dd4, dd]          |[ddd1, ddd]   |
|2  |[334, 3, 34]  |[e]           |[r5t, rt, 5rt]|[dd]               |[ddd1, ddd]   |
+---+--------------+--------------+--------------+-------------------+--------------+
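Note that the result columns come back named collect_set(f), collect_set(b) and so on, whereas the expected output in the question keeps the original column names. One possible way around that, as a rough sketch only, is to build Column expressions instead of a Map and alias each one back to its source column. The class name CollectSetAliases and the helpers collectSetColumns and collapseDuplicates are hypothetical names made up for this illustration, and the sketch assumes the dataset has at least one column besides the key.

```java
import java.util.Arrays;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Hypothetical helper class, not part of Spark.
class CollectSetAliases {

    // Builds one collect_set(col) expression per non-key column,
    // aliased so the output column keeps the original name.
    static Column[] collectSetColumns(String[] columns, String key) {
        return Arrays.stream(columns)
                .filter(c -> !c.equals(key))
                .map(c -> functions.collect_set(functions.col(c)).alias(c))
                .toArray(Column[]::new);
    }

    // Groups by the key and applies every aggregate expression.
    // Assumption: at least one non-key column exists, otherwise aggs[0] fails.
    static Dataset<Row> collapseDuplicates(Dataset<Row> df, String key) {
        Column[] aggs = collectSetColumns(df.columns(), key);
        // agg(Column, Column...) takes the first expression separately from the rest.
        return df.groupBy(key)
                .agg(aggs[0], Arrays.copyOfRange(aggs, 1, aggs.length));
    }
}
```

With the df loaded above, CollectSetAliases.collapseDuplicates(df, "id").show(false) should give one row per id with columns named b, c, d, f and g. As with any collect_set, the order of elements inside each array is not guaranteed.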