Why is Apache Orc RecordReader.searchArgument() not filtering correctly?

Ashwin Jayaprakash :

Here is a simple program that:

  1. Writes records into an Orc file
  2. Then tries to read the file using predicate pushdown (searchArgument)

Questions:

  1. Is this the right way to use predicate pushdown in Orc?
  2. The read(..) method seems to return all the records, completely ignoring the search argument. Why is that?

Notes:

I have not been able to find any useful unit test that demonstrates how predicate pushdown works in Orc (Orc on GitHub), nor any clear documentation on this feature. I tried looking at the Spark and Presto code as well, but found nothing useful.

The code below is a modified version of https://github.com/melanio/codecheese-blog-examples/tree/master/orc-examples/src/main/java/codecheese/blog/examples/orc

public class TestRoundTrip {
    public static void main(String[] args) throws IOException {
        final String file = "tmp/test-round-trip.orc";
        new File(file).delete();

        final long highestX = 10000L;
        final Configuration conf = new Configuration();

        write(file, highestX, conf);
        read(file, highestX, conf);
    }

    private static void read(String file, long highestX, Configuration conf) throws IOException {
        Reader reader = OrcFile.createReader(
                new Path(file),
                OrcFile.readerOptions(conf)
        );

        //Retrieve x that is "highestX - 1000". So, only 1 value should've been retrieved.
        Options readerOptions = new Options(conf)
                .searchArgument(
                        SearchArgumentFactory
                                .newBuilder()
                                .equals("x", Type.LONG, highestX - 1000)
                                .build(),
                        new String[]{"x"}
                );
        RecordReader rows = reader.rows(readerOptions);
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();

        while (rows.nextBatch(batch)) {
            LongColumnVector x = (LongColumnVector) batch.cols[0];
            LongColumnVector y = (LongColumnVector) batch.cols[1];

            for (int r = 0; r < batch.size; r++) {
                long xValue = x.vector[r];
                long yValue = y.vector[r];

                System.out.println(xValue + ", " + yValue);
            }
        }
        rows.close();
    }

    private static void write(String file, long highestX, Configuration conf) throws IOException {
        TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
        Writer writer = OrcFile.createWriter(
                new Path(file),
                OrcFile.writerOptions(conf).setSchema(schema)
        );

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector x = (LongColumnVector) batch.cols[0];
        LongColumnVector y = (LongColumnVector) batch.cols[1];
        for (int r = 0; r < highestX; ++r) {
            int row = batch.size++;
            x.vector[row] = r;
            y.vector[row] = r * 3;
            // If the batch is full, write it out and start over.
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        if (batch.size != 0) {
            writer.addRowBatch(batch);
            batch.reset();
        }
        writer.close();
    }
}
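
A quick way to observe what the reader is actually doing is to print where each batch starts. This is a minimal sketch, assuming ORC's RecordReader.getRowNumber(), which reports the absolute position of the row the next read will return; if the search argument were skipping row groups, the printed positions would jump forward instead of advancing batch by batch:

long startRow = rows.getRowNumber(); // first row of the next batch, as a position in the file
while (rows.nextBatch(batch)) {
    System.out.println("batch of " + batch.size + " rows starting at row " + startRow);
    startRow = rows.getRowNumber();
}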

Vinayak Thatte :

I encountered the same issue, and I think it was rectified by changing

.equals("x", Type.LONG,

to

.equals("x",PredicateLeaf.Type.LONG

With this change, the reader returns only the batch that contains the relevant row, rather than every batch in the file. Note that it still returns the whole batch, not just the one row we asked for.
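
For reference, here is a minimal sketch of the corrected read path (assuming the plain hive storage-api sarg classes under org.apache.hadoop.hive.ql.io.sarg; the "nohive" ORC artifact relocates the same classes under org.apache.orc.storage):

import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;

// Build the predicate with the PredicateLeaf.Type enum, not another Type.
SearchArgument sarg = SearchArgumentFactory
        .newBuilder()
        .startAnd()
        .equals("x", PredicateLeaf.Type.LONG, highestX - 1000) // literal boxes to Long, matching Type.LONG
        .end()
        .build();

Options readerOptions = new Options(conf)
        .searchArgument(sarg, new String[]{"x"});
RecordReader rows = reader.rows(readerOptions);

Keep in mind that search arguments are evaluated against row-group statistics (10,000 rows per row group by default, controlled by orc.row.index.stride), so ORC can only skip whole row groups: every batch that might contain a match is still returned, and the caller must filter the individual rows.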
