Drachens :
Does anyone know how to get the second-lowest value in each row of a DataFrame in PySpark?
For example:
Input Dataframe:
Col1 Col2 Col3 Col4
83 32 14 62
63 32 74 55
13 88 6 46
Expected output:
Col1 Col2 Col3 Col4 Res
83 32 14 62 32
63 32 74 55 55
13 88 6 46 13
Thank you
Shu :
We can use the concat_ws function to concatenate all the columns of a row into one string, then split to turn that string into an array.
split returns an array of strings, so cast it to array<int>; otherwise the sort is lexicographic and '46' comes before '6'.
Then use the array_sort function to sort within the array and extract its second element, [1].
Example:
from pyspark.sql.functions import *
df=spark.createDataFrame([('83','32','14','62'),('63','32','74','55'),('13','88','6','46')],['Col1','Col2','Col3','Col4'])
df.selectExpr("array_sort(cast(split(concat_ws(',',Col1,Col2,Col3,Col4),',') as array<int>))[1] Res").show()
#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+
A more dynamic way, without listing the columns by name:
df.selectExpr("array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res").show()
#+---+
#|Res|
#+---+
#|32 |
#|55 |
#|13 |
#+---+
EDIT:
#adding Res column to the dataframe
df1=df.selectExpr("*","array_sort(cast(split(concat_ws(',',*),',') as array<int>))[1] Res")
df1.show()
#+----+----+----+----+---+
#|Col1|Col2|Col3|Col4|Res|
#+----+----+----+----+---+
#|  83|  32|  14|  62| 32|
#|  63|  32|  74|  55| 55|
#|  13|  88|   6|  46| 13|
#+----+----+----+----+---+
#+----+----+----+----+---+
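A note on the third row: the columns in this example hold strings, and sorting strings compares them character by character rather than by numeric value. That is why a plain string sort picks 46 as the second element for that row, while a numeric sort picks the expected 13. Here is a minimal plain-Python sketch of the difference (no Spark needed; the variable names are only illustrative):

```python
# Third row of the example, stored as strings like in the DataFrame above
row = ['13', '88', '6', '46']

# Sorting strings compares character by character, so '46' sorts before '6'
second_as_text = sorted(row)[1]

# Converting to int first gives the numeric order 6, 13, 46, 88
second_as_number = sorted(int(v) for v in row)[1]

print(second_as_text, second_as_number)
```

The same reasoning should apply inside Spark: sorting an array of integers gives the numeric order, while sorting an array of strings does not.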