Carl Ambroselli :
I have a dataframe with multiple columns:
| a | b | c | d |
-----------------
| 0 | 4 | 3 | 6 |
| 1 | 7 | 0 | 4 |
| 2 | 4 | 3 | 6 |
| 3 | 9 | 5 | 9 |
I would now like to combine [b,c,d] into a single column. However, I do not know how big the list of columns will be; otherwise I could just use a UDF3 to combine the three.
So the desired outcome is:
| a | combined  |
-----------------
| 0 | [4, 3, 6] |
| 1 | [7, 0, 4] |
| 2 | [4, 3, 6] |
| 3 | [9, 5, 9] |
How can I achieve this?
Non-working pseudo-code:
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
return ds.withColumn("combined", collectAsList(columns))
}
A worst-case workaround would be a switch statement on the number of input columns, with a separate UDF written for each count (e.g. 2-20 input columns) and an error thrown if more input columns are supplied.
Grisha Weintraub :
As Ramesh mentioned in his comment, you can use the array function. You only need to convert your list of column names to a Column array.
public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
    // Map each column name to a Column, then pass the resulting array to functions.array.
    return ds.withColumn("combined", functions.array(columns.stream().map(functions::col).toArray(Column[]::new)));
}
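The reason this works for any number of columns is that the array function takes a variable number of Column arguments, and Java accepts an array wherever varargs are declared. Below is a minimal, dependency-free sketch of that mechanism. Col, col, array and combine are hypothetical stand-ins (not Spark classes) that only mimic the shape of Column, functions.col, functions.array and mergeColumns; the stand-in array just renders text instead of building a Spark expression.

```java
import java.util.Arrays;
import java.util.List;

public class VarargsSketch {

    // Hypothetical stand-in for org.apache.spark.sql.Column.
    static final class Col {
        final String name;
        Col(String name) { this.name = name; }
    }

    // Stand-in for functions.col(String): wraps a column name.
    public static Col col(String name) {
        return new Col(name);
    }

    // Stand-in for a varargs function shaped like functions.array(Column...).
    // It only renders the call as text so the result can be inspected.
    public static String array(Col... cols) {
        StringBuilder sb = new StringBuilder("array(");
        for (int i = 0; i < cols.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(cols[i].name);
        }
        return sb.append(")").toString();
    }

    // Same shape as mergeColumns: any number of names, no per-arity code.
    public static String combine(List<String> columns) {
        Col[] cols = columns.stream().map(VarargsSketch::col).toArray(Col[]::new);
        return array(cols); // an array is accepted where varargs are declared
    }

    public static void main(String[] args) {
        System.out.println(combine(Arrays.asList("b", "c", "d")));
        System.out.println(combine(Arrays.asList("b", "c", "d", "e", "f")));
    }
}
```

Note that with the real Spark method, withColumn appends "combined" and keeps b, c and d. To end up with only a and combined as in the desired outcome, the source columns would presumably still need to be dropped afterwards, for example with something like ds.drop(columns.toArray(new String[0])).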