I want to transform a Spark 2.4 DataFrame imported from Avro files (which contain tracking data from Google Analytics).
The interesting part of the schema looks like this:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- hits: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- hitNumber: long (nullable = true)
| | |-- time: long (nullable = true)
| | |-- hour: long (nullable = true)
| | |-- minute: long (nullable = true)
| | |-- customDimensions: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- index: long (nullable = true)
| | | | |-- value: string (nullable = true)
The resulting Dataset should be nearly flat, without deeply nested structs. Arrays like hits should get a row of their own, which is easily achieved with the explode function. Arrays like hits.customDimensions are trickier: each array element has an index field (which does not correspond to the array position), and a new column should be created for each possible value. The final schema should look like this:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- hit_date: string (nullable = true)
|-- hit_hitNumber: long (nullable = true)
|-- hit_time: long (nullable = true)
|-- hit_hour: long (nullable = true)
|-- hit_minute: long (nullable = true)
|-- hit_customDimension_1: string (nullable = true)
|-- hit_customDimension_9: string (nullable = true)
Depending on the actual indices found in the data, more hit_customDimension_X columns can occur.
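To make the expected result concrete: the set of distinct index values found in the data determines which columns exist. A tiny plain-Java sketch of that naming rule (no Spark; class and method names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ColumnNamesSketch {
    /** Derive the flat column names from the distinct indices seen in the data. */
    static List<String> columnNames(List<Long> distinctIndices) {
        List<String> names = new ArrayList<>();
        for (long i : new TreeSet<>(distinctIndices)) { // dedupe and sort for stable output
            names.add("hit_customDimension_" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(columnNames(List.of(9L, 1L, 9L)));
        // [hit_customDimension_1, hit_customDimension_9]
    }
}
```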
The Dataset is transformed like this so far:
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.explode_outer;

public class Flattener {
    public static void main(String[] args) {
        String avroFiles = String.join(",", args); // @TODO: named parameters
        SparkConf conf = new SparkConf().setAppName("Simple Application").set("spark.ui.port", "8080");
        SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();

        Dataset<Row> sessions = spark.read().format("avro").load(avroFiles).limit(1000);

        // explode the hits into their own rows, drop the original array
        sessions = sessions.withColumn("hit", explode_outer(col("hits"))).drop(col("hits"));

        // sample the distinct indices; collect to the driver, since a foreach
        // lambda runs on the executors and cannot fill a driver-side list
        List<Row> indexRows = sessions.sample(0.1)
                .select(explode(col("hit.customDimensions")))
                .select(col("col.index"))
                .distinct()
                .collectAsList();

        // for each found index, extract the matching array element into its own column
        for (Row indexRow : indexRows) {
            long i = indexRow.getLong(0);
            // array_find does not exist -- this is the missing piece
            sessions = sessions.withColumn("hit_customDimension_" + i,
                    array_find("hit.customDimensions", "index", i));
        }
        // TODO: move the hit columns up one level
    }
}
The problem is: there is no such array_find function. I found a filter function (see the section about Filtering on an Array column), but it seems to filter rows, not array elements. I guess one could write a UDF to do this, but as far as I know UDFs may degrade performance. Performance is of great concern due to our volatile and large datasets (several terabytes). The task does not look too uncommon, so I wonder if there is a built-in way to do this that I simply missed.
It seems that you are looking for a SQL function that extracts elements from an array by a given index. Such a function is already present in the Spark API, but it is somewhat "hidden" because it is implemented not as a separate function but as the apply method of the Column class. Please check the scaladoc:
/**
* Extracts a value or values from a complex type.
* The following types of extraction are supported:
* <ul>
* <li>Given an Array, an integer ordinal can be used to retrieve a single value.</li>
* <li>Given a Map, a key of the correct type can be used to retrieve an individual value.</li>
* <li>Given a Struct, a string fieldName can be used to extract that field.</li>
* <li>Given an Array of Structs, a string fieldName can be used to extract a field
* of every struct in that array, and return an Array of fields.</li>
* </ul>
* @group expr_ops
* @since 1.4.0
*/
def apply(extraction: Any): Column
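In plain-collection terms, the extraction modes listed above behave roughly like the following (a non-Spark analogy using java.util types, not the actual Column API):

```java
import java.util.List;
import java.util.Map;

public class ApplyAnalogy {
    record Dim(long index, String value) {}

    public static void main(String[] args) {
        // Array + integer ordinal -> a single value
        List<String> arr = List.of("a", "b", "c");
        System.out.println(arr.get(1)); // b

        // Map + key -> an individual value
        Map<Long, String> map = Map.of(1L, "news", 9L, "logged-in");
        System.out.println(map.get(9L)); // logged-in

        // Array of Structs + field name -> an Array of that field
        List<Dim> dims = List.of(new Dim(1, "news"), new Dim(9, "logged-in"));
        System.out.println(dims.stream().map(Dim::value).toList()); // [news, logged-in]
    }
}
```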
So, I would propose to replace your array_find(...) with col("hit.customDimensions").apply(i) (in Java, Scala's apply has to be called as an explicit method; getItem(i) is equivalent):
result = result.withColumn("hit_customDimension_" + i, col("hit.customDimensions").apply(i));
[UPD]
As validly pointed out in the comments, customDimensions may be a sparse array where index is not an ordinal but an arbitrary integer. In that case, converting the Array to a Map from the outset looks most natural.
- Convert Array[Struct[Int, String]] to Map[Int, String]:
result = result.withColumn("hit_customDimensions_Map",
        map_from_arrays(col("hit.customDimensions").apply("index"), col("hit.customDimensions").apply("value")));
- And change the way you transpose the customDimensions columns:
result = result.withColumn("hit_customDimension_" + i, col("hit_customDimensions_Map").apply(i));
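Per row, the two steps above amount to zipping the index and value arrays into a map and then reading it by key. A minimal plain-Java sketch of that semantics (the Dim record and names are illustrative, not Spark API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapLookupSketch {
    record Dim(long index, String value) {}

    /** Per-row equivalent of map_from_arrays(indices, values): the sparse index becomes the key. */
    static Map<Long, String> toMap(List<Dim> dims) {
        Map<Long, String> m = new LinkedHashMap<>();
        for (Dim d : dims) {
            m.put(d.index(), d.value());
        }
        return m;
    }

    public static void main(String[] args) {
        Map<Long, String> m = toMap(List.of(new Dim(1, "news"), new Dim(9, "logged-in")));
        System.out.println(m.get(9L)); // looked up by sparse index, not array position: logged-in
        System.out.println(m.get(5L)); // absent index -> null, so the column stays null
    }
}
```

The null for a missing key mirrors how a lookup on the map column behaves in Spark when a row simply does not carry a given custom dimension.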