Optimization of Kylin cube construction

Derived dimension

Concept

A derived dimension removes a non-primary-key column of a dimension table from the set of effective dimensions and substitutes the dimension table's primary key (in practice, the corresponding foreign key on the fact table). Kylin stores the mapping between the dimension table's primary key and the table's other columns, so at query time it can dynamically "translate" the primary key into those non-primary-key dimensions and aggregate in real time.
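The query-time translation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Kylin's implementation: the mapping `pk_to_derived`, the pre-aggregated `cuboid_by_pk`, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical snapshot of the dimension table: primary key A -> derived column E.
pk_to_derived = {1: "east", 2: "east", 3: "west"}

# Hypothetical pre-built cuboid keyed by the primary key: A -> SUM(measure).
cuboid_by_pk = {1: 10, 2: 5, 3: 7}


def aggregate_by_derived(cuboid, mapping):
    """Translate each primary key into its derived value, then re-aggregate."""
    totals = defaultdict(int)
    for pk, measure in cuboid.items():
        totals[mapping[pk]] += measure
    return dict(totals)


result = aggregate_by_derived(cuboid_by_pk, pk_to_derived)
print(result)  # {'east': 15, 'west': 7}
```

The extra loop is the cost paid at query time: the more primary-key rows have to be folded into each derived value, the more work each query does.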

Note:

Derived dimensions are attractive, but that does not mean every column of a dimension table should become one. If aggregating from the dimension table's primary key up to a given column requires a large amount of work at query time, that column should not be made a derived dimension.

As shown in the figure:

With three ordinary dimensions, the number of cuboids is 2^3 - 1 = 7.

If A is the primary key of the dimension table, and therefore never repeats, then E can be treated as a dimension derived from A, and A can stand in for E when the cube is built.

This makes the cube build much more efficient. If a query involves the E dimension, Kylin translates A into E at query time to produce the result.
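The effect on the cuboid count can be checked with a short Python sketch. The dimension names and the `derived_from` mapping are hypothetical (the original figure is not available), and the count follows the text's formula of 2^n - 1.

```python
def effective_dimensions(dimensions, derived_from):
    """Replace each derived dimension by its host key and drop duplicates."""
    effective = []
    for dim in dimensions:
        host = derived_from.get(dim, dim)
        if host not in effective:
            effective.append(host)
    return effective


def cuboid_count(n_dimensions):
    """Number of non-empty dimension combinations, as counted in the text."""
    return 2 ** n_dimensions - 1


# Hypothetical cube with dimensions A, B and E, where E is derived from key A.
dims = ["A", "B", "E"]
effective = effective_dimensions(dims, {"E": "A"})

print(cuboid_count(len(dims)))       # 7
print(effective)                     # ['A', 'B']
print(cuboid_count(len(effective)))  # 3
```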

Specific steps

When selecting dimensions, mark the chosen dimension-table columns as derived dimensions.

Then select the primary keys of these two dimension tables in the aggregation group.

After this is set, the cuboid count is shown as 3.

 

Dimension settings in aggregation groups

Mandatory dimension

If a dimension is defined as mandatory, every cuboid generated by the aggregation group contains that dimension, and cuboids that do not contain it are not computed (single-dimension cuboids are not counted).
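As a rough illustration of the pruning rule, the Python sketch below enumerates every combination of three hypothetical dimensions A, B, C and keeps only those containing the mandatory dimension A. It models only the "must contain" rule, not the parenthetical remark about single-dimension cuboids, and it is not how Kylin enumerates cuboids internally.

```python
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of the given dimensions."""
    return [frozenset(c)
            for size in range(1, len(dims) + 1)
            for c in combinations(dims, size)]


def with_mandatory(dims, mandatory):
    """Keep only the cuboids that contain the mandatory dimension."""
    return [c for c in all_cuboids(dims) if mandatory in c]


kept = with_mandatory(["A", "B", "C"], "A")
labels = sorted("".join(sorted(c)) for c in kept)
print(labels)  # ['A', 'AB', 'ABC', 'AC']
```

Of the 7 combinations, the 3 without A (B, C, BC) are dropped.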

 

It is configured as follows:

 

 

 

Hierarchical dimension

Put simply, dimension B depends on dimension A: a cuboid that contains B without A is not computed.
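The hierarchy rule can be sketched in Python as a filter over all combinations. The dimensions A, B, C and the hierarchy A -> B are hypothetical, and the function names are made up for this illustration.

```python
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of the given dimensions."""
    return [frozenset(c)
            for size in range(1, len(dims) + 1)
            for c in combinations(dims, size)]


def respects_hierarchy(cuboid, chain):
    """A level may appear only if every higher level in the chain is present."""
    for i, dim in enumerate(chain):
        if dim in cuboid and any(parent not in cuboid for parent in chain[:i]):
            return False
    return True


# Hypothetical hierarchy: B depends on A. C is an ordinary dimension.
kept = [c for c in all_cuboids(["A", "B", "C"]) if respects_hierarchy(c, ["A", "B"])]
labels = sorted("".join(sorted(c)) for c in kept)
print(labels)  # ['A', 'AB', 'ABC', 'AC', 'C']
```

The two combinations that contain B without A (B and BC) are pruned, leaving 5 of 7.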

 

It is configured as follows:

 

 

 

 

 

Joint dimension

Joint dimensions must appear together: a cuboid contains either all of them or none of them.
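A minimal Python sketch of the joint rule, assuming three hypothetical dimensions A, B, C with B and C declared joint. The helper names are illustrative only.

```python
from itertools import combinations


def all_cuboids(dims):
    """Every non-empty combination of the given dimensions."""
    return [frozenset(c)
            for size in range(1, len(dims) + 1)
            for c in combinations(dims, size)]


def respects_joint(cuboid, joint):
    """The joint dimensions are either all present or all absent."""
    present = [d in cuboid for d in joint]
    return all(present) or not any(present)


# Hypothetical joint group: B and C always travel together.
kept = [c for c in all_cuboids(["A", "B", "C"]) if respects_joint(c, ["B", "C"])]
labels = sorted("".join(sorted(c)) for c in kept)
print(labels)  # ['A', 'ABC', 'BC']
```

Here the joint pair behaves like a single dimension, so 3 dimensions yield the same 3 combinations that 2 dimensions would.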

 

 

 

 

 

RowKey design

1) Place dimensions that are used in WHERE filters first.

 

 

2) Place dimensions with a large cardinality before dimensions with a small cardinality.

When a 2D cuboid is aggregated from a 3D cuboid, Kylin by default picks the 3D cuboid with the smaller cuboid ID as the source. In the figure, the cardinality of C is much larger than that of D, so the computation on the right is somewhat faster.
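The interaction between rowkey order and parent selection can be sketched in Python. This rests on two assumptions that the text does not spell out: that a cuboid ID is a bitmask in which the first rowkey column is the highest bit, and that the candidate parent with the smallest ID wins. The dimension names A to D and both function names are hypothetical.

```python
def cuboid_id(cuboid, rowkey_order):
    """Bitmask ID, assuming the first rowkey column maps to the highest bit."""
    n = len(rowkey_order)
    return sum(1 << (n - 1 - rowkey_order.index(d)) for d in cuboid)


def pick_parent(child, rowkey_order):
    """Among cuboids with exactly one extra dimension, pick the smallest ID."""
    candidates = [child | {d} for d in rowkey_order if d not in child]
    return min(candidates, key=lambda c: cuboid_id(c, rowkey_order))


child = frozenset("AB")

# High-cardinality C placed before low-cardinality D.
parent_cd = pick_parent(child, ["A", "B", "C", "D"])
# The opposite order: D before C.
parent_dc = pick_parent(child, ["A", "B", "D", "C"])

print("".join(sorted(parent_cd)))  # ABD
print("".join(sorted(parent_dc)))  # ABC
```

Under these assumptions, putting C before D means AB is aggregated from ABD, which has fewer rows than ABC because D has the smaller cardinality.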

 

 

Configuration:

 

Concurrency granularity optimization (for awareness)

    

When the size of a cuboid in a segment exceeds a certain threshold, the system splits that cuboid's data across multiple partitions so that the cuboid can be read in parallel, which speeds up queries against the cube. It works as follows: the build engine uses the estimated size of the segment and the parameter "kylin.hbase.region.cut" to decide how many partitions the segment needs in the storage engine. If the storage engine is HBase, the number of partitions corresponds to the number of HBase regions. The default value of kylin.hbase.region.cut is 5.0, in GB, which means that for a segment with an estimated size of 50 GB the build engine allocates 10 partitions. Users can also set kylin.hbase.region.count.min (default 1) and kylin.hbase.region.count.max (default 500) to control the minimum and maximum number of partitions per segment.
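The arithmetic above can be written as a small Python function. The function and parameter names are hypothetical; the defaults mirror the three settings quoted in the text. Rounding up when the size is not an exact multiple of the cut is an assumption, since the text only gives the 50 GB / 5 GB = 10 case.

```python
import math


def region_count(segment_size_gb, region_cut_gb=5.0, count_min=1, count_max=500):
    """Estimate the number of partitions for a segment.

    Divides the estimated segment size by the cut size (rounding up is an
    assumption), then clamps the result to the [count_min, count_max] range.
    """
    raw = math.ceil(segment_size_gb / region_cut_gb)
    return max(count_min, min(count_max, raw))


print(region_count(50))     # 10
print(region_count(1))      # 1
print(region_count(10000))  # 500
```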

 

Cubes are stored in HBase as segments, so this setting really tunes HBase's partitioning strategy: the more partitions, the better the concurrency. In general, keeping the defaults is fine.

Setting method

 


Origin www.cnblogs.com/yangxusun9/p/12731028.html