Spark Java - Collect multiple columns into array column

Carl Ambroselli:

I have a dataframe with multiple columns:

| a | b | c | d |
-----------------
| 0 | 4 | 3 | 6 |
| 1 | 7 | 0 | 4 |
| 2 | 4 | 3 | 6 |
| 3 | 9 | 5 | 9 |

I would now like to combine [b, c, d] into a single array column. However, I do not know how long the list of columns will be; otherwise I could just use a UDF3 to combine the three.

So the desired outcome is:

| a | combined  |
-----------------
| 0 | [4, 3, 6] |
| 1 | [7, 0, 4] |
| 2 | [4, 3, 6] |
| 3 | [9, 5, 9] |

How can I achieve this?

Non-working pseudo-code:

public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
   return ds.withColumn("combined", collectAsList(columns))
}

The worst-case workaround would be a switch statement on the number of input columns, writing a separate UDF for each case (say, 2-20 input columns) and throwing an error if more columns are supplied.

Grisha Weintraub:

As Ramesh mentioned in his comment, you can use the array function. You only need to convert your list of column names into an array of Column.

import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public static Dataset<Row> mergeColumns(Dataset<Row> ds, List<String> columns) {
    // Map each column name to a Column, collect into Column[], and pass to array(Column...)
    return ds.withColumn("combined",
            functions.array(columns.stream().map(functions::col).toArray(Column[]::new)));
}
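The key step is converting a List of names into a typed array via a mapping stream, which a varargs method like functions.array(Column...) then accepts directly. As a plain-Java sketch of that same pattern (no Spark on the classpath; the Col record here is a hypothetical stand-in for functions.col):

```java
import java.util.Arrays;
import java.util.List;

public class StreamToArrayDemo {
    // Hypothetical stand-in for org.apache.spark.sql.Column: just wraps a name.
    record Col(String name) {}

    public static void main(String[] args) {
        List<String> columns = List.of("b", "c", "d");

        // Same pattern as the answer: map each name, collect into a typed array.
        Col[] cols = columns.stream().map(Col::new).toArray(Col[]::new);

        // A varargs method (e.g. functions.array(Column...)) accepts this array as-is.
        System.out.println(Arrays.toString(cols));
    }
}
```

Because toArray(Col[]::new) produces a Col[] rather than an Object[], it can be passed wherever Column... varargs are expected, so the method works for any number of columns.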
