I want to transform a Spark 2.4 DataFrame imported from Avro files (which contain tracking data from Google Analytics).
The interesting part of the schema looks like this:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- date: string (nullable = true)
|-- hits: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- hitNumber: long (nullable = true)
| | |-- time: long (nullable = true)
| | |-- hour: long (nullable = true)
| | |-- minute: long (nullable = true)
| | |-- customDimensions: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- index: long (nullable = true)
| | | | |-- value: string (nullable = true)
The resulting Dataset should be nearly flat, without deeply nested structs. Arrays like hits should get a row of their own, which is easily achieved with the explode function. Arrays like hits.customDimensions are trickier: each array element has an index field (which does not correspond to the array position), and a new column should be created for each possible value. The final schema should look like this:
root
|-- visitorId: long (nullable = true)
|-- visitNumber: long (nullable = true)
|-- visitId: long (nullable = true)
|-- visitStartTime: long (nullable = true)
|-- hit_date: string (nullable = true)
|-- hit_hitNumber: long (nullable = true)
|-- hit_time: long (nullable = true)
|-- hit_hour: long (nullable = true)
|-- hit_minute: long (nullable = true)
|-- hit_customDimension_1: string (nullable = true)
|-- hit_customDimension_9: string (nullable = true)
Depending on the actual indices found in the data, more hit_customDimension_X columns can occur.
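To make the expected result concrete: the set of distinct index values found in the data determines which columns exist. A tiny plain-Java sketch of that naming rule (no Spark; class and method names are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ColumnNamesSketch {
    /** Derive the flat column names from the distinct indices seen in the data. */
    static List<String> columnNames(List<Long> distinctIndices) {
        List<String> names = new ArrayList<>();
        for (long i : new TreeSet<>(distinctIndices)) { // dedupe and sort for stable output
            names.add("hit_customDimension_" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(columnNames(List.of(9L, 1L, 9L)));
        // [hit_customDimension_1, hit_customDimension_9]
    }
}
```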
The Dataset is transformed like this so far:
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.explode_outer;

public class Flattener {
    public static void main(String[] args) {
        String avroFiles = String.join(",", args); // @TODO: named parameters
        SparkConf conf = new SparkConf().setAppName("Simple Application").set("spark.ui.port", "8080");
        SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();

        Dataset<Row> sessions = spark.read().format("avro").load(avroFiles).limit(1000);

        // explode the hits into their own rows, drop the original array
        sessions = sessions.withColumn("hit", explode_outer(col("hits"))).drop(col("hits"));

        // sample the distinct indices; collect to the driver, since a foreach
        // lambda runs on the executors and cannot fill a driver-side list
        List<Row> indexRows = sessions.sample(0.1)
                .select(explode(col("hit.customDimensions")))
                .select(col("col.index"))
                .distinct()
                .collectAsList();

        // for each found index, extract the matching array element into its own column
        for (Row indexRow : indexRows) {
            long i = indexRow.getLong(0);
            // array_find does not exist -- this is the missing piece
            sessions = sessions.withColumn("hit_customDimension_" + i,
                    array_find("hit.customDimensions", "index", i));
        }
        // TODO: move the hit columns up one level
    }
}
The problem is: there is no such array_find function. I found a filter function (see the section about Filtering on an Array column), but it seems to filter rows, not array elements. I guess one could write a UDF to do this, but as far as I know UDFs may degrade performance. Performance is of great concern due to our volatile and large datasets (several terabytes). The task does not look too uncommon, so I wonder if there is a built-in way to do this that I simply missed.
It seems that you are looking for a SQL function that extracts elements from an array by a given index. Such a function is already present in the Spark API, but it is somewhat "hidden" because it is implemented not as a separate function but as the apply method of the Column class. Please check the scaladoc:
/**
* Extracts a value or values from a complex type.
* The following types of extraction are supported:
* <ul>
* <li>Given an Array, an integer ordinal can be used to retrieve a single value.</li>
* <li>Given a Map, a key of the correct type can be used to retrieve an individual value.</li>
* <li>Given a Struct, a string fieldName can be used to extract that field.</li>
* <li>Given an Array of Structs, a string fieldName can be used to extract a field
* of every struct in that array, and return an Array of fields.</li>
* </ul>
* @group expr_ops
* @since 1.4.0
*/
def apply(extraction: Any): Column
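In plain-collection terms, the extraction modes listed above behave roughly like the following (a non-Spark analogy using java.util types, not the actual Column API):

```java
import java.util.List;
import java.util.Map;

public class ApplyAnalogy {
    record Dim(long index, String value) {}

    public static void main(String[] args) {
        // Array + integer ordinal -> a single value
        List<String> arr = List.of("a", "b", "c");
        System.out.println(arr.get(1)); // b

        // Map + key -> an individual value
        Map<Long, String> map = Map.of(1L, "news", 9L, "logged-in");
        System.out.println(map.get(9L)); // logged-in

        // Array of Structs + field name -> an Array of that field
        List<Dim> dims = List.of(new Dim(1, "news"), new Dim(9, "logged-in"));
        System.out.println(dims.stream().map(Dim::value).toList()); // [news, logged-in]
    }
}
```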
So, I would propose to replace your array_find(...) with col("hit.customDimensions").apply(i) (in Java, Scala's apply has to be called as an explicit method; getItem(i) is equivalent):
result = result.withColumn("hit_customDimension_" + i, col("hit.customDimensions").apply(i));
[UPD]
As validly pointed out in the comments, customDimensions may be a sparse array where index is not an ordinal but an arbitrary integer. In that case, converting the Array to a Map from the outset looks most natural.
- Convert Array[Struct[Int, String]] to Map[Int, String]:
result = result.withColumn("hit_customDimensions_Map",
        map_from_arrays(col("hit.customDimensions").apply("index"), col("hit.customDimensions").apply("value")));
- And change the way you transpose the customDimensions columns:
result = result.withColumn("hit_customDimension_" + i, col("hit_customDimensions_Map").apply(i));
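Per row, the two steps above amount to zipping the index and value arrays into a map and then reading it by key. A minimal plain-Java sketch of that semantics (the Dim record and names are illustrative, not Spark API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapLookupSketch {
    record Dim(long index, String value) {}

    /** Per-row equivalent of map_from_arrays(indices, values): the sparse index becomes the key. */
    static Map<Long, String> toMap(List<Dim> dims) {
        Map<Long, String> m = new LinkedHashMap<>();
        for (Dim d : dims) {
            m.put(d.index(), d.value());
        }
        return m;
    }

    public static void main(String[] args) {
        Map<Long, String> m = toMap(List.of(new Dim(1, "news"), new Dim(9, "logged-in")));
        System.out.println(m.get(9L)); // looked up by sparse index, not array position: logged-in
        System.out.println(m.get(5L)); // absent index -> null, so the column stays null
    }
}
```

The null for a missing key mirrors how a lookup on the map column behaves in Spark when a row simply does not carry a given custom dimension.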