1. Use the rowkeys sort order to read parquet files quickly
When a cube is defined, a default rowkeys sort order is created. When the cube is built, the dimension fields of each cuboid are sorted and stored according to this rowkeys order, so the data can be located quickly at query time.
In the Rowkeys section of Cube Designer's Advanced Setting, you can drag and drop rows in the ID area to customize the order of the rowkeys, as shown below:
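To see why sorted storage helps, here is a minimal sketch (not Kylin's actual storage code): rows in a cuboid are kept sorted by the rowkey columns, so a filter on the leading rowkey column can binary-search to the matching range instead of scanning every row. The column names and data are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Rows sorted by (country, date) -- standing in for the rowkey order.
rows = sorted([
    ("CN", "2024-01-01", 10),
    ("CN", "2024-01-02", 12),
    ("DE", "2024-01-01", 7),
    ("US", "2024-01-01", 20),
    ("US", "2024-01-03", 5),
])

def scan_country(rows, country):
    """Return all rows for one country via binary search on the sorted data."""
    keys = [r[0] for r in rows]
    lo = bisect_left(keys, country)   # first row with this country
    hi = bisect_right(keys, country)  # one past the last row with this country
    return rows[lo:hi]

print(scan_country(rows, "US"))
```

Because the data is sorted, the lookup touches O(log n) keys plus the matching range, rather than the whole cuboid; this is the intuition behind putting frequently filtered dimensions early in the rowkeys order.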
2. Use the shardBy column to prune parquet files
By default, each cuboid in a segment of a cube contains multiple parquet files, as shown below:
If a column is defined as a shardBy column, rows with different values of that column are written to different parquet files. At query time, a filter on the shardBy column lets the engine skip files that cannot contain matching rows.
It is recommended to choose high-cardinality columns (columns with few or no repeated values) that appear in multiple cuboids as shardBy columns.
Currently, only the following filter operations in SQL queries can prune parquet files: Equality, In, InSet, IsNull.
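The pruning idea can be sketched as follows. This is a simplified model, not Kylin's real partitioner: rows are routed to files by hashing the shardBy value, so an equality filter on that column only needs to open one file. The hash function, column names, and data are stand-ins.

```python
NUM_SHARDS = 4  # number of parquet files per cuboid (illustrative)

def shard_of(value, num_shards=NUM_SHARDS):
    # Stable stand-in hash; Kylin uses its own hashing internally.
    return sum(value.encode()) % num_shards

# Write phase: route each row to a "file" by its shardBy value (user_id here).
files = {i: [] for i in range(NUM_SHARDS)}
for user_id, amount in [("u1", 10), ("u2", 5), ("u3", 8), ("u1", 2)]:
    files[shard_of(user_id)].append((user_id, amount))

# Query phase: "WHERE user_id = 'u1'" hashes the constant and reads one file,
# skipping the other shards entirely.
target = shard_of("u1")
hits = [row for row in files[target] if row[0] == "u1"]
print(hits)
```

This also shows why high-cardinality columns make good shardBy candidates: with few distinct values, most files would share the same handful of hash buckets and little could be skipped.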
2.1 Using the shardBy column
First disable the cube, then purge it (the cube's segment metadata is deleted, but the data on HDFS is not deleted).
Then click Edit; in the Rowkeys section of Cube Designer's Advanced Setting, you can mark some dimensions as shardBy columns, as follows:
Finally, build the cube again.
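The same disable / purge / rebuild steps can also be scripted against Kylin's REST API. The sketch below is a dry run that only composes the calls; the endpoint paths follow the Apache Kylin documentation but are not verified here, the cube name is hypothetical, and actually sending the requests would need an HTTP client plus authentication.

```python
CUBE = "my_cube"  # hypothetical cube name

def workflow(cube):
    """Return the (method, path) REST calls for the shardBy rebuild workflow."""
    return [
        ("PUT", f"/kylin/api/cubes/{cube}/disable"),
        ("PUT", f"/kylin/api/cubes/{cube}/purge"),
        # (edit the shardBy setting in the web UI between purge and rebuild)
        ("PUT", f"/kylin/api/cubes/{cube}/rebuild"),
    ]

for method, path in workflow(CUBE):
    print(method, path)
```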