Derived dimension
Concept
A derived dimension excludes the non-primary-key columns of a dimension table from the effective dimensions and replaces them with the primary key of the dimension table (in practice, the corresponding foreign key on the fact table). Kylin records the mapping between the dimension table's primary key and its other columns, so that at query time it can dynamically "translate" the primary key into these non-primary-key dimensions and aggregate in real time.
Note:
Although derived dimensions are very attractive, not every dimension on a dimension table should be derived. If aggregating from the primary key of the dimension table to a given dimension-table dimension requires a large amount of work at query time, a derived dimension is not recommended for it.
As shown in the figure:
When the cube is originally built on three dimensions, the number of cuboids is 2^3 - 1 = 7.
A is the primary key of the dimension table and its values are unique, so E can be treated as a derived dimension of A, and A can stand in for E when the cube is built.
This makes building the cube much more efficient. If a query involves dimension E, A is translated into E at query time to obtain the result.
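The query-time "translation" can be illustrated with a minimal sketch. It is not Kylin's implementation: the lookup table, the cuboid contents, and the function name are hypothetical, and the measure is assumed to be a simple sum.

```python
from collections import defaultdict

# Hypothetical dimension table: primary key A -> non-key dimension E.
lookup_a_to_e = {1: "north", 2: "north", 3: "south"}

# Hypothetical pre-aggregated cuboid keyed by A only (E was never built).
cuboid_by_a = {1: 10, 2: 5, 3: 7}


def aggregate_by_derived(cuboid, lookup):
    """Translate each primary-key value to its derived value, then re-aggregate."""
    totals = defaultdict(int)
    for key, measure in cuboid.items():
        totals[lookup[key]] += measure
    return dict(totals)


result = aggregate_by_derived(cuboid_by_a, lookup_a_to_e)
print(result)
```

The second aggregation pass is the extra work paid at query time, which is why a derived dimension is a poor choice when many primary-key values collapse into each derived value.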
Specific operation
When selecting dimensions, set the dimensions of the selected dimension tables as derived.
Then select the primary keys of these two dimension tables in the aggregation group.
After this setting, you can see that Cuboid = 3, that is, 2^2 - 1 for the two remaining dimensions.
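The two cuboid counts follow from the same formula, 2^n - 1 for n dimensions. A small sketch (the function name is only illustrative):

```python
def cuboid_count(num_dimensions):
    """Number of non-empty dimension combinations: 2^n - 1."""
    return 2 ** num_dimensions - 1


print(cuboid_count(3))  # three ordinary dimensions
print(cuboid_count(2))  # two primary keys after using derived dimensions
```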
Dimension settings in aggregation groups
Mandatory dimension
If a dimension is defined as mandatory, every Cuboid generated by this aggregation group contains that dimension, and Cuboids that do not contain it are not calculated (one-dimension Cuboids are not counted).
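As a rough sketch of the pruning effect, the following enumerates dimension combinations and keeps only those that contain the mandatory dimension. The dimension names A, B, C and the function name are hypothetical, and the sketch keeps the one-dimension combination rather than modelling the parenthetical remark above.

```python
from itertools import combinations


def cuboids_with_mandatory(dimensions, mandatory):
    """Return every non-empty combination that contains all mandatory dimensions."""
    kept = []
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            if set(mandatory) <= set(combo):
                kept.append(combo)
    return kept


mandatory_cuboids = cuboids_with_mandatory(["A", "B", "C"], mandatory=["A"])
print(mandatory_cuboids)
```

With A mandatory, the seven combinations of three dimensions shrink to the four that include A.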
The operation is as follows
Hierarchical dimension
Put simply, dimension B depends on dimension A: a Cuboid that contains B but not A is not calculated.
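A minimal sketch of this rule, assuming a hierarchy listed from the top level down (here A, then B) and a hypothetical unrelated dimension C:

```python
from itertools import combinations


def cuboids_with_hierarchy(dimensions, hierarchy):
    """Keep combinations in which each hierarchy level appears only with its ancestors."""
    kept = []
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            chosen = set(combo)
            valid = all(
                set(hierarchy[:i]) <= chosen
                for i, level in enumerate(hierarchy)
                if level in chosen
            )
            if valid:
                kept.append(combo)
    return kept


hierarchy_cuboids = cuboids_with_hierarchy(["A", "B", "C"], hierarchy=["A", "B"])
print(hierarchy_cuboids)
```

The combinations (B) and (B, C) are dropped because B appears without A, leaving five of the original seven.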
The operation is as follows
Joint dimension
Joint dimensions must appear together: a Cuboid contains either all of them or none of them.
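The all-or-none rule can be sketched in the same way. The dimensions A, B, C and the choice of B and C as the joint pair are assumptions for illustration only.

```python
from itertools import combinations


def cuboids_with_joint(dimensions, joint):
    """Keep combinations that contain either every joint dimension or none of them."""
    joint_set = set(joint)
    kept = []
    for size in range(1, len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            present = joint_set & set(combo)
            if not present or present == joint_set:
                kept.append(combo)
    return kept


joint_cuboids = cuboids_with_joint(["A", "B", "C"], joint=["B", "C"])
print(joint_cuboids)
```

Only three combinations survive here, because B and C effectively behave as a single dimension.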
RowKey Design
1) Dimensions used in WHERE filters are placed first.
2) A dimension with a large cardinality is placed before a dimension with a small cardinality.
When the cube is aggregated from 3D to 2D, by default the smaller 3D Cuboid is selected as the source of the aggregation. As the figure shows, the cardinality of C is much larger than the cardinality of D, so the calculation on the right is somewhat faster.
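The two ordering rules can be combined into a single sort, as in the sketch below. How the rules interact (filter use first, cardinality as the tie-breaker) is an assumption of this sketch, and the dimension names and cardinalities are made up.

```python
def order_rowkey(dimensions):
    """Order (name, cardinality, used_in_filter) tuples for a RowKey.

    Filter dimensions come first; within each group, higher cardinality comes first.
    """
    ranked = sorted(dimensions, key=lambda d: (not d[2], -d[1]))
    return [name for name, _cardinality, _used_in_filter in ranked]


dims = [
    ("D", 10, False),       # low cardinality, not filtered
    ("C", 100000, False),   # high cardinality, not filtered
    ("DT", 365, True),      # used in WHERE filters
]
rowkey_order = order_rowkey(dims)
print(rowkey_order)
```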
Operation
Concurrency granularity optimization (for reference)
When the size of a Cuboid in a segment exceeds a certain threshold, the system splits that Cuboid's data into multiple partitions so that it can be read in parallel, which speeds up queries on the Cube. Specifically, the build engine uses the estimated size of the segment and the parameter "kylin.hbase.region.cut" to decide how many partitions the segment needs in the storage engine. If the storage engine is HBase, the number of partitions corresponds to the number of regions in HBase. The default value of kylin.hbase.region.cut is 5.0, in GB, which means that for a segment with an estimated size of 50 GB the build engine allocates 10 partitions.
Users can also set the minimum and maximum number of partitions per segment with the two settings kylin.hbase.region.count.min (default 1) and kylin.hbase.region.count.max (default 500).
A Cube is stored in HBase as segments, so this setting really tunes HBase's partitioning strategy. The more partitions, the better the concurrency. In general, keeping the defaults is fine.
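The arithmetic behind the partition count can be sketched as follows. The defaults mirror the values quoted above; rounding up and the function name are assumptions of this sketch, not a description of Kylin's actual code.

```python
import math


def partition_count(segment_size_gb, region_cut_gb=5.0, count_min=1, count_max=500):
    """Estimate partitions: segment size / region cut, clamped to [count_min, count_max]."""
    raw = math.ceil(segment_size_gb / region_cut_gb)  # rounding up is an assumption
    return max(count_min, min(count_max, raw))


print(partition_count(50))      # the 50 GB example with the 5 GB cut
print(partition_count(1))       # small segment, held at the minimum
print(partition_count(10000))   # very large segment, capped at the maximum
```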
Setting method