Avoiding multiple joins in a specific case with multiple columns sharing the same domain in Apache Spark SQL dataframes

Steven Salazar Molina :

I was asked to do something in Apache Spark SQL (Java API), through dataframes, that I think would be very expensive if done with a naive approach (I'm still working on the naive approach, but I think it would cost a lot since it would need at least 4 joins of some sort).

I got the following dataframe:

+----+----+----+----+----+----------+------+
|  C1|  C2|  C3|  C4|  C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
|   A|   A|null|null|null|      1234|     2|
|   A|null|null|   H|null|      1235|     3|
|   A|   B|null|null|null|      1236|     3|
|   B|null|null|null|   E|      1237|     1|
|   C|null|null|   G|null|      1238|     1|
|   F|null|   C|   E|null|      1239|     2|
|null|null|   D|   E|   G|      1240|     1|
+----+----+----+----+----+----------+------+

C1, C2, C3, C4 and C5 share the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once for each distinct value among the C columns of a row (e.g., the first row A,A,null,null,null,key,2 is equivalent to A,null,null,null,null,key,2 or A,A,A,A,null,key,2).

I got asked to "for each existing C value get the total number of points".

So the output should be:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+

I was going to split the dataframe into multiple small ones (one C column plus the points column) through simple .select("C1","points"), .select("C2","points") and so on, and then combine them. But I believe this would be very expensive if the amount of data is really big. I suspect there is some kind of map-reduce trick, but I couldn't find one myself since I'm still new to all of this; I think I'm missing some concepts on how to apply map reduce.
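To make the idea concrete, here is a rough sketch of that per-column approach in Scala (my real code uses the Java API; df and the renamed columns below are just illustrative). The dropDuplicates step is my attempt to count a value only once per row, and it is part of why I expect this to be expensive:

import org.apache.spark.sql.functions._

// One small dataframe per C column, then union them back together.
// dropDuplicates keeps each (value, row key) pair only once, so a value repeated
// within the same row (e.g. A,A in the first row) is not counted twice.
val perColumn = Seq("C1", "C2", "C3", "C4", "C5").map { c =>
  df.select(col(c).as("C"), col("UNIQUE KEY").as("key"), col("points"))
}
val naive = perColumn.reduce(_ union _)
  .filter(col("C").isNotNull)
  .dropDuplicates("C", "key")
  .groupBy("C")
  .agg(sum("points").as("points"))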

I also thought about using the explode function: I could put [C1, C2, C3, C4, C5] together in one column, then use explode so I get 5 rows for each original row, and then just group by the value... but I believe this would increase the amount of data at some point, and if we are talking about GBs it may not be feasible. I hope you can find the trick I'm looking for.

Thanks for your time.

Shaido - Reinstate Monica :

Using explode would probably be the way to go here. It won't increase the amount of data and will be a lot more computationally efficient than using multiple joins (note that a single join by itself is already an expensive operation).

In this case, you can convert the columns to an array, retaining only the unique values for each separate row. This array can then be exploded and all nulls filtered away. At this point, a simple groupBy and sum will give you the wanted result.

In Scala:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax

df.select(explode(array_distinct(array("C1", "C2", "C3", "C4", "C5"))).as("C1"), $"points")
  .filter($"C1".isNotNull) // drop the entries coming from null C values
  .groupBy($"C1")
  .agg(sum($"points").as("points"))
  .sort($"C1") // not really necessary

This will give you the wanted result:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+
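Note that array_distinct requires Spark 2.4 or later. On an older version, a sketch of an alternative (assuming the same df and imports as above) is to explode the plain array and use the unique key to deduplicate values within a row before aggregating:

df.select(explode(array("C1", "C2", "C3", "C4", "C5")).as("C1"), $"UNIQUE KEY", $"points")
  .filter($"C1".isNotNull)
  .dropDuplicates("C1", "UNIQUE KEY") // count each value at most once per row
  .groupBy($"C1")
  .agg(sum($"points").as("points"))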
