Kylin query performance optimization: use the rowkeys sort order to read parquet files quickly, and use shardBy columns to prune parquet files

1. Use the rowkeys sort order to read parquet files quickly

Every cube definition has a default rowkeys sort order. When the cube is built, the dimension columns of each cuboid are sorted by that order before being saved, so matching rows can be located quickly at query time.

In the Rowkeys section of the Advanced Setting step in Cube Designer, you can drag and drop entries in the ID area to customize the order of the rowkeys, as shown below:

[Figure: rowkeys]
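To illustrate why sorted data can be read faster, here is a minimal Python sketch. It is not Kylin's implementation: the rows, the chunk size, and the helper names `build_chunks` and `scan` are all hypothetical, and the chunks only stand in for parquet row groups that carry min/max statistics for the leading rowkey column.

```python
# Hypothetical cuboid rows: (dim_a, dim_b, measure). Assumed rowkeys order: dim_a, dim_b.
rows = [("cn", 3, 10), ("us", 1, 7), ("cn", 1, 4),
        ("de", 2, 9), ("us", 2, 5), ("de", 1, 8)]

def build_chunks(rows, size=2):
    """Split rows into fixed-size chunks and record min/max of dim_a per chunk."""
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    return [((min(r[0] for r in c), max(r[0] for r in c)), c) for c in chunks]

def scan(chunks, value):
    """Return rows with dim_a == value and the number of chunks that had to be read."""
    hits, chunks_read = [], 0
    for (lo, hi), chunk in chunks:
        if lo <= value <= hi:  # the min/max range may contain the value: read the chunk
            chunks_read += 1
            hits.extend(r for r in chunk if r[0] == value)
    return hits, chunks_read

# Unsorted layout: every chunk's min/max range overlaps the value being searched.
unsorted_hits, unsorted_read = scan(build_chunks(rows), "de")

# Sorted layout: rows are ordered by (dim_a, dim_b) before being split into chunks.
sorted_rows = sorted(rows, key=lambda r: (r[0], r[1]))
sorted_hits, sorted_read = scan(build_chunks(sorted_rows), "de")

print(unsorted_read, sorted_read)
```

With the sorted layout, equal values of the leading column sit next to each other, so fewer chunks have a min/max range that covers the filter value.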

2. Use the shardBy column to prune parquet files

By default, a single cuboid of a cube segment is stored as multiple parquet files, as follows:

[Figure: multiple parquet files]

If a column is defined as a shardBy column, rows with different values of the shardBy column are written to different parquet files, so a query that filters on the shardBy column can skip scanning the files that cannot contain matching rows.

It is recommended to choose as shardBy columns high-cardinality columns (columns whose values are unique or rarely repeated) that appear in multiple cuboids.
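The effect of cardinality can be sketched in a few lines of Python. This is only an illustration under assumptions: the shard count, the use of `zlib.crc32` as a stable hash, and the helper names `shard_of` and `rows_per_shard` are hypothetical and do not describe how Kylin actually assigns rows to files.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical number of parquet files for one cuboid

def shard_of(value, num_shards=NUM_SHARDS):
    """Map a shardBy value to a file index with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards

def rows_per_shard(values):
    """Count how many rows land in each file."""
    counts = Counter(shard_of(v) for v in values)
    return [counts.get(i, 0) for i in range(NUM_SHARDS)]

high_cardinality = [f"user_{i}" for i in range(1000)]  # nearly unique values
low_cardinality = ["male", "female"] * 500             # only two distinct values

high_counts = rows_per_shard(high_cardinality)
low_counts = rows_per_shard(low_cardinality)

print("high cardinality:", high_counts)
print("low cardinality: ", low_counts)
```

A column with only two distinct values can fill at most two files however many rows there are, so filtering on it skips little data and the files become uneven. A high-cardinality column can spread rows over all files, and a filter on a single value then needs only the one file that value maps to.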

Currently, only the following filter operations in SQL queries can be used to prune parquet files: Equality, In, InSet, IsNull
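The sketch below shows, in Python, why only these filter types can narrow down the set of files. It is a simplified model with hypothetical names (`shard_of`, `candidate_shards`), an assumed hash-based file assignment, and an assumed placement of NULL values; it is not Kylin's code.

```python
import zlib

NUM_SHARDS = 4  # hypothetical number of parquet files for one cuboid

def shard_of(value):
    """Stable hash of a shardBy value; NULL gets its own key (an assumption)."""
    key = b"\x00" if value is None else str(value).encode("utf-8")
    return zlib.crc32(key) % NUM_SHARDS

def candidate_shards(op, operand=None):
    """Return the file indexes that may contain rows matching the filter."""
    if op == "Equality":
        return {shard_of(operand)}               # one value, one file
    if op in ("In", "InSet"):
        return {shard_of(v) for v in operand}    # at most one file per listed value
    if op == "IsNull":
        return {shard_of(None)}
    # Any other filter (range, LIKE, ...) could match values in any file: no pruning.
    return set(range(NUM_SHARDS))

print(candidate_shards("Equality", "user_42"))
print(candidate_shards("In", ["user_1", "user_2"]))
print(candidate_shards("GreaterThan", "user_5"))
```

The listed filter types name the exact values being searched for, so the files holding them can be computed up front. A range filter does not enumerate its values, so in this model every file remains a candidate.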

2.1 Using the shardBy column

First Disable the cube, and then Purge it (this deletes the metadata of the cube's built segments, but the data on HDFS is not deleted).

Next, Edit the cube: in the Rowkeys section of the Advanced Setting step in Cube Designer, you can define some dimensions as shardBy columns, as follows:

[Figure: shardBy column]

Finally, build the cube again.

Origin blog.csdn.net/yy8623977/article/details/126055982