I am trying to improve the accuracy of a Logistic regression algorithm implemented in Spark using Java. To do this, I want to replace the null or invalid values in a column with the most frequent value of that column. For example:
Name|Place
a |a1
a |a2
a |a2
|d1
b |a2
c |a2
c |
|
d |c1
In this case I'll replace all the NULL values in column "Name" with 'a' and in column "Place" with 'a2'. So far I am only able to extract the most frequent value in a particular column. Can you please help me with the second step: how do I replace the null or invalid values with the most frequent value of that column?
You can use the .na.fill function (it is defined in org.apache.spark.sql.DataFrameNaFunctions).
Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame
You choose the columns to fill and the value that should replace null or NaN in them.
In your case it will be something like:
val df2 = df.na.fill("a", Seq("Name"))
.na.fill("a2", Seq("Place"))
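Since the question is written in Java and also mentions "invalid" values, here is a minimal sketch in plain Java, without Spark, of the two steps on a single in-memory column: find the most frequent non-missing value, then substitute it for the missing ones. The class ModeImputer and its methods are hypothetical names for illustration, not part of any Spark API, and treating blank strings as missing is an assumption about what "invalid" means in your data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Spark class): mode imputation on one column.
final class ModeImputer {
    private ModeImputer() {}

    // Assumption: null and blank strings both count as missing.
    static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Step 1: most frequent non-missing value, or null if the column has none.
    // Ties go to the value seen first, because LinkedHashMap keeps insertion order.
    static String mostFrequent(List<String> column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : column) {
            if (!isMissing(value)) {
                counts.merge(value.trim(), 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    // Step 2: copy the column, replacing each missing entry with the mode.
    // If every entry is missing, the mode is null and nothing can be filled.
    static List<String> fillMissing(List<String> column) {
        String mode = mostFrequent(column);
        List<String> filled = new ArrayList<>(column.size());
        for (String value : column) {
            filled.add(isMissing(value) ? mode : value.trim());
        }
        return filled;
    }
}
```

For the "Name" column in the question, mostFrequent returns "a", and for "Place" it returns "a2". In Spark you would pass the value found in step 1 to na.fill instead of hard-coding it. Note that na.fill only replaces null or NaN, so blank or otherwise invalid strings would first have to be converted to null.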