Nested Top N in Apache Beam

Colin Bankier :

I'm using Apache Beam 2.14 with Java.

Given a dataset that looks like this:

| countryID | sessionID | pageID    | count    |
| --------- | --------- | --------- | -------- |
| a         | a         | a         | 1        |
| a         | b         | c         | 2        |
| b         | c         | a         | 4        |
| c         | d         | a         | 6        |

I'd like to return a dataset containing only the rows that belong to the top N countryIDs by summed count, within each of those countries the top N sessionIDs, and within each of those sessions the top N pageIDs.

The dataset has many billions of rows - it will not fit in memory. As an aside, the dataset resides in BigQuery, and attempting this directly in BigQuery using a DENSE_RANK() or ROW_NUMBER() function fails with a "memory limit exceeded" error at this size, hence the attempt to use Dataflow instead.

My Current Strategy is to:

  • group by a combined key of countryID, sessionID, pageID, find the sum of each group.
  • group the result by countryID, sessionID, and find the sum of each group.
  • group the result by countryID and find the sum of each group.
  • Use Top.of to get the top countryIDs
  • Flatten the result back to the 2nd level grouping and use Top.perKey to get the top sessions per country.
  • Flatten the result to the 1st level grouping and get the top pageIDs per session.
  • Flatten the result to emit the rows.
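As a pure-Java analogue of the strategy above (not Beam code), each "top N" step amounts to: sum counts per key, then keep the N entries with the largest sums. A minimal sketch, using a small in-memory map purely for illustration:

```java
import java.util.*;
import java.util.stream.*;

public class NestedTopN {
    // Keep the N entries with the largest summed counts from a map of key -> sum.
    static <K> List<Map.Entry<K, Long>> topN(Map<K, Long> sums, int n) {
        return sums.entrySet().stream()
                .sorted(Map.Entry.<K, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // countryID -> summed count, mirroring the sample table above
        Map<String, Long> byCountry = new HashMap<>();
        byCountry.put("a", 3L); // rows (a,a,a,1) + (a,b,c,2)
        byCountry.put("b", 4L);
        byCountry.put("c", 6L);
        System.out.println(topN(byCountry, 2)); // prints [c=6, b=4]
    }
}
```

In Beam the same per-key selection is what `Top.of` / `Top.perKey` performs, but without materializing all entries at once.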

The tricky part is that the rows need to be retained at each "group by" level so that they can be emitted at the end. I've attempted to create a tree structure in which each node holds the result of a "group by" step, so that the sum of its children is computed just once for comparison in subsequent steps. i.e., at each "group by" step the result is a KV<String, Iterable<Node>>, and a node has fields like:

    @DefaultCoder(SerializableCoder.class)
    public static class TreeNode implements Node, Serializable {
        private Long total = 0L;
        private KV<String, Iterable<LeafNode>> kv;
    ...

While this almost seems to work with the direct runner and a small sample dataset, when run on Dataflow I encounter serialization errors related to the Node classes, because the stored Iterable is a lazy view over a window of the input PCollection:

Caused by: java.io.NotSerializableException: org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn$WindowReiterable

(as per https://beam.apache.org/releases/javadoc/2.15.0/index.html?org/apache/beam/sdk/transforms/GroupByKey.html)
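The error concerns the runtime class, not the declared field type: the result of a GroupByKey is a lazily re-iterable view that does not implement Serializable, so any object that stores it cannot be Java-serialized. A minimal pure-Java illustration of the same failure mode (not Beam code; all names here are hypothetical):

```java
import java.io.*;
import java.util.*;

public class IterableSerialization {
    // Holder mimics a TreeNode storing a grouped Iterable in a field.
    static class Holder implements Serializable {
        Iterable<String> items; // the declared type is fine; the runtime class matters
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Holder ok = new Holder();
        ok.items = new ArrayList<>(List.of("a", "b")); // ArrayList is Serializable
        serialize(ok); // succeeds

        Holder bad = new Holder();
        // A lazy, non-Serializable Iterable, analogous to Beam's WindowReiterable:
        bad.items = () -> List.of("a", "b").iterator();
        try {
            serialize(bad);
        } catch (NotSerializableException e) {
            System.out.println("NotSerializableException, as on Dataflow");
        }
    }
}
```

This is why the direct runner (which may hand back an in-memory, serializable collection) can appear to work while Dataflow fails.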

Copying the data into a different collection in memory in order to be serializable would not be a viable option given the dataset size I need to work with.

Here is an example of the pipeline so far - using just 2 levels of grouping as an example:

        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply("Read from BQ", BigQueryIO.readTableRows().from(inputTable))
                .apply("Transform to row", ParDo.of(new RowParDo())).setRowSchema(SCHEMA)
                .apply("Set first level key", WithKeys.of(new GroupKey(key1)))
                .apply("Group by", GroupByKey.create())
                .apply("to leaf nodes", ParDo.of(new ToLeafNode()))
                .apply("Set 2nd level key", WithKeys.of(new GroupKey2()))
                .apply("Group by 2nd level", GroupByKey.create())
                .apply("To tree nodes", ParDo.of(new ToTreeNode()))
                .apply("Top N", Top.of(10, new CompareTreeNode<TreeNode>()))
                .apply("Flatten", FlatMapElements.via(new FlattenNodes<TreeNode>()))
                .apply("Expand", ParDo.of(new ExpandTreeNode()))
                .apply("Top N of first key", Top.perKey(10, new CompareTreeNode<LeafNode>()))
                .apply("Values", Values.create())
                .apply("Flatten", FlatMapElements.via(new FlattenNodes<LeafNode>()))
                .apply("Expand", ParDo.of(new ExpandLeafNode()))
                .apply("Values", Values.create())
                .apply("Write to bq",
                        BigQueryIO.<Row>write().to(outputTable).withSchema(BigQueryUtils.toTableSchema(SCHEMA))
                                .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
                                .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                                .withFormatFunction(BigQueryUtils.toTableRow()));
        pipeline.run();

It seems like this should be a common goal, so I am wondering whether there is an easier way, or any examples of achieving the same thing in Java with Beam.

Ankur :

You can try setting the coder explicitly by using setCoder, as follows.

    Pipeline pipeline = Pipeline.create(options);

    pipeline.apply("Read from BQ", BigQueryIO.readTableRows().from(inputTable))
            .apply("Transform to row", ParDo.of(new RowParDo())).setRowSchema(SCHEMA)
            .apply("Set first level key", WithKeys.of(new GroupKey(key1)))
            .apply("Group by", GroupByKey.create())
            .apply("to leaf nodes", ParDo.of(new ToLeafNode()))
            .apply("Set 2nd level key", WithKeys.of(new GroupKey2()))
            .apply("Group by 2nd level", GroupByKey.create())
            .apply("To tree nodes", ParDo.of(new ToTreeNode())).setCoder(SerializableCoder.of(TreeNode.class))
            .apply("Top N", Top.of(10, new CompareTreeNode<TreeNode>()))
            .apply("Flatten", FlatMapElements.via(new FlattenNodes<TreeNode>()))
            .apply("Expand", ParDo.of(new ExpandTreeNode()))
            .apply("Top N of first key", Top.perKey(10, new CompareTreeNode<LeafNode>()))
            .apply("Values", Values.create())
            .apply("Flatten", FlatMapElements.via(new FlattenNodes<LeafNode>()))
            .apply("Expand", ParDo.of(new ExpandLeafNode()))
            .apply("Values", Values.create())
            .apply("Write to bq",
                    BigQueryIO.<Row>write().to(outputTable).withSchema(BigQueryUtils.toTableSchema(SCHEMA))
                            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE)
                            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
                            .withFormatFunction(BigQueryUtils.toTableRow()));
    pipeline.run();

However, for your use case, where you need the top N countries, top N sessions, and top N pages, I would recommend simplifying the pipeline: group by the relevant field directly, then apply Sum and Top, as follows.

    Pipeline pipeline = Pipeline.create(options);

    rows = pipeline.apply("Read from BQ", BigQueryIO.readTableRows().from(inputTable))
            .apply("Transform to row", ParDo.of(new RowParDo())).setRowSchema(SCHEMA);

    sumByCountry = rows
            .apply("Set Country key", MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                    .via((Row row) -> KV.of(row.getString("countryID"), row.getInt32("count"))))
            .apply("Country Scores", Sum.<String>integersPerKey())
            .apply("Top Countries", Top.of(N, new CompareValues()));
    // Do the same for Session and Page
    sumBySession = rows....
    sumByPage = rows....

I am not sure whether you want all of the rows for the top N countries, but if you do, you can provide the top N countries as a side input on the rows PCollection and filter the rest out. You can do the same for session and page.
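The side-input suggestion reduces to a membership filter: materialize the top-N country keys, then keep only the rows whose countryID appears in that set. In Beam this would be a `ParDo` with `.withSideInputs(...)` over a `View` of the top-N result; the core logic, sketched in plain Java with hypothetical row arrays:

```java
import java.util.*;
import java.util.stream.*;

public class SideInputFilter {
    // rows are String[] of {countryID, sessionID, pageID, count},
    // mirroring the sample table in the question
    static List<String[]> keepTopCountries(List<String[]> rows, Set<String> topCountryIds) {
        return rows.stream()
                .filter(r -> topCountryIds.contains(r[0]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "a", "a", "1"},
                new String[] {"a", "b", "c", "2"},
                new String[] {"b", "c", "a", "4"},
                new String[] {"c", "d", "a", "6"});
        // Suppose Top.of selected countries "b" and "c"
        List<String[]> kept = keepTopCountries(rows, new HashSet<>(Arrays.asList("b", "c")));
        System.out.println(kept.size()); // prints 2
    }
}
```

The top-N key set is small (N entries per level), so it fits comfortably in a side input even though the rows PCollection does not fit in memory.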

Dataflow should scale as needed for this use case, so you shouldn't need to do the intermediate group-bys manually.
