The Kylin Cube Build Process in Detail

1 Introduction

When using Kylin, the most important step is to create the definition of a cube model, specifying its dimensions and measures as well as some additional information, and then to build the cube. You can also choose a string field of the source table (its format must be a date format, so that the value expresses a date) as the partition column; such a cube can then be built several times, and every build generates a segment. Each segment corresponds to one time interval of the cube, and the time intervals of the segments are contiguous and do not overlap. Several segments of a cube can also be merged, which combines the segments covering adjacent time intervals into a single segment. Let's start analyzing the cube build process.

2 Cube Example

Take mobile phone sales as an example: table SALE records the yearly sales of each phone brand in each country, table PHONE lists the phone brands, and table COUNTRY lists the countries; the two tables are associated with the SALE table by foreign keys. Together these tables form a star schema, in which SALE is the fact table and PHONE and COUNTRY are dimension tables.

Suppose you need to know the total sales of each phone brand in China from 2010 to 2012; the SQL query is:

SELECT b.`NAME`, c.`NAME`, SUM(a.`count`)
FROM SALE AS a
LEFT JOIN PHONE AS b ON a.`pId` = b.`id`
LEFT JOIN COUNTRY AS c ON a.`cId` = c.`id`
WHERE a.`time` >= 2010 AND a.`time` <= 2012 AND c.`NAME` = '中国'
GROUP BY b.`NAME`, c.`NAME`

Here time (a.time), phone brand (b.NAME, referred to below simply as phone) and country (c.NAME, referred to below as country) are the dimensions, and the sales count (a.count) is the measure. The number of distinct phone brands can be obtained by grouping on the phone-brand column. Every yearly sales figure of one phone brand in one country is one small cube; all the small cubes together form the cube, as shown below:

The figure shows a three-dimensional cube. Each small cube holds the aggregated result of the measure for one combination of dimension members, for example Apple's sales in China in 2010.

3 The Build Entry Point

After you finish creating a cube on the Kylin web page, you can open the Action drop-down box and perform either a build or a merge operation; both call the cube rebuild interface (a sketch of such a call follows the list). The parameters of the call are:

  1. the cube name, which uniquely identifies a cube; in the current Kylin version the cube name is globally unique, not merely unique within each project;
  2. the startTime and endTime of the build; these two timestamps identify the interval of the segment to be built, and the data source selects only the data within this time range. For a BUILD operation, startTime is not needed, because the end time of the last existing segment is always used as the start time of the new segment;
  3. buildType, which identifies the type of operation and may be "BUILD", "MERGE" or "REFRESH".
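
For reference, a minimal sketch of making this call through Kylin's REST API. The endpoint PUT /kylin/api/cubes/{cubeName}/rebuild, the default port 7070 and the default ADMIN:KYLIN credentials come from the Kylin documentation; the host, cube name and timestamps are made-up placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RebuildCube {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:7070/kylin/api/cubes/sale_cube/rebuild");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        // Kylin's REST API uses HTTP basic authentication.
        String auth = Base64.getEncoder()
                .encodeToString("ADMIN:KYLIN".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setDoOutput(true);

        // Timestamps are epoch milliseconds; startTime is ignored for BUILD,
        // since the end time of the last segment is used instead.
        String body = "{\"startTime\": 0, "
                + "\"endTime\": 1356998400000, "  // 2013-01-01 00:00:00 UTC
                + "\"buildType\": \"BUILD\"}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}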

4 Cube build process

When Kylin builds a cube, it precomputes every combination of dimensions and stores the results in HBase, trading space for time: the RowKey of the HTable is a combination of dimension values and serves as the index, while the measures are stored in the columns. A SQL query over some combination of dimensions is thereby converted into a range scan over RowKeys followed by aggregation of the scanned measure values, which is what makes analytical queries fast. The whole process is shown below:

In order, the main steps can be divided into several stages:

  1. Compute the cuboid files from the user's cube definition;
  2. Generate the HTable from the cuboid files;
  3. Update the cube information;
  4. Garbage-collect the temporary files.

The input of each stage depends on the output of the previous one, so these operations are performed strictly sequentially. Below, the stages are broken down into 11 concrete steps:

4.1 Create an intermediate flat table from the fact table (Create Intermediate Flat Hive Table)

This step creates a new Hive external table and then, following the star schema defined in the cube, selects the dimension and measure values from the source tables and inserts them into the newly created table. Because this is an external table, its data files (stored on HDFS) serve directly as the input of the subsequent subtasks.
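
For intuition, a minimal sketch of the kind of HiveQL this step produces, assembled as a string the way Kylin assembles it from the cube definition. The flat-table name, column names, storage format and HDFS location are hypothetical, based on the sales example above:

public class FlatTableSql {
    public static void main(String[] args) {
        // Hypothetical name; Kylin derives it from the cube name and the segment range.
        String flatTable = "kylin_intermediate_sale_cube_20100101000000_20130101000000";

        String hql =
            "CREATE EXTERNAL TABLE IF NOT EXISTS " + flatTable + "\n" +
            "  (SALE_TIME int, PHONE_NAME string, COUNTRY_NAME string, SALE_COUNT bigint)\n" +
            "  STORED AS SEQUENCEFILE LOCATION '/kylin/flat_table/" + flatTable + "';\n" +
            "INSERT OVERWRITE TABLE " + flatTable + "\n" +
            "SELECT a.`time`, b.`NAME`, c.`NAME`, a.`count`\n" +
            "FROM SALE a\n" +
            "LEFT JOIN PHONE b ON a.`pId` = b.`id`\n" +
            "LEFT JOIN COUNTRY c ON a.`cId` = c.`id`;";
        System.out.println(hql);
    }
}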

4.2 Redistribute the intermediate table (Redistribute Flat Hive Table)

In the previous step, Hive generates the data files in a folder on HDFS; some files are very large, some are small, and some are even empty. This uneven file distribution causes an imbalance in the subsequent MR jobs: some mappers finish quickly while others run very slowly. To balance the work, Kylin adds this step to redistribute the data. Kylin first obtains the row count of the intermediate table and then, from that row count, computes the number of files the redistributed data needs. By default, Kylin allocates one file per 1 million rows.
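
A back-of-the-envelope sketch of that allocation, with a hypothetical row count. The 1-million-rows default is configurable, and the redistribution itself is done with a Hive DISTRIBUTE BY statement along the lines of the string below:

public class RedistributePlan {
    public static void main(String[] args) {
        long rowCount = 23_500_000L;     // hypothetical row count of the flat table
        long rowsPerFile = 1_000_000L;   // Kylin's default
        // One reducer writes one output file, so round up to get the reducer count.
        long numFiles = (rowCount + rowsPerFile - 1) / rowsPerFile;
        System.out.println(numFiles + " files");  // 24 files

        // RAND() spreads the rows evenly across the reducers (hypothetical table name).
        String hql = "INSERT OVERWRITE TABLE kylin_intermediate_sale_cube "
                   + "SELECT * FROM kylin_intermediate_sale_cube DISTRIBUTE BY RAND()";
        System.out.println(hql);
    }
}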

4.3 Extract the distinct fact table column values (Extract Fact Table Distinct Columns)

In this step, an MR task computes, from the intermediate table generated above, the distinct values appearing in the fact table for each dimension column and writes them to files. The task reads the temporary table created in the first step. If the number of distinct values in some dimension column is very large, the MR task may fail with an OOM during execution. The mapper side is sketched below.
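
A schematic of the mapper side only, not Kylin's actual job: for every row of the flat table it emits a (dimension-column index, value) pair, and the reducers then deduplicate the values of each column into one distinct-value file per dimension. The column positions are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctColumnsMapperSketch
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    // Hypothetical positions of the dimension columns within a flat-table row.
    private static final int[] DIMENSION_COLUMNS = {0, 1, 2};

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] fields = row.toString().split("\t", -1);
        for (int col : DIMENSION_COLUMNS) {
            // Key = column index, value = the cell value; duplicates are
            // collapsed on the reducer side.
            context.write(new IntWritable(col), new Text(fields[col]));
        }
    }
}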

4.4 Build the dimension dictionaries (Build Dimension Dictionary)

This step takes the distinct-value files generated in the previous step, together with the data of the dimension tables, computes the dictionary of every dimension, and encodes it with a trie for compression; the dictionary exists to save storage.
Every cell of a cuboid is stored in HBase as a key-value pair whose key is a combination of dimension members. Dimension values, however, are usually strings (for example product names), so mapping each dimension value to a unique integer id greatly shrinks the keys; after a key is read back from HBase, the true member values are recovered by looking the ids up in the dictionary. A minimal sketch of such an encoding follows.
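
A minimal sketch of dictionary encoding, using a plain sorted array instead of Kylin's actual trie-based dictionary (which does the same job but shares common prefixes to keep the dictionary itself compact):

import java.util.Arrays;
import java.util.TreeSet;

public class DictionarySketch {
    // Position in sorted order = the integer id of the value.
    private final String[] idToValue;

    public DictionarySketch(Iterable<String> distinctValues) {
        TreeSet<String> sorted = new TreeSet<>();
        distinctValues.forEach(sorted::add);
        this.idToValue = sorted.toArray(new String[0]);
    }

    public int encode(String value) {
        int id = Arrays.binarySearch(idToValue, value);
        if (id < 0) throw new IllegalArgumentException("not in dictionary: " + value);
        return id;
    }

    public String decode(int id) {
        return idToValue[id];
    }

    public static void main(String[] args) {
        DictionarySketch dict =
                new DictionarySketch(Arrays.asList("Apple", "华为", "小米"));
        int id = dict.encode("Apple");
        System.out.println(id + " -> " + dict.decode(id));  // 0 -> Apple
    }
}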

4.5 Save the cuboid statistics (Save Cuboid Statistics)

This step computes and saves statistics for all combinations of dimensions; each combination of dimensions is called a cuboid. In theory, a cube with N dimensions has 2^N dimension combinations. Borrowing a common example from the internet, a cube with the four dimensions time, item, location and supplier has 2^4 = 16 combinations (cuboids), which the sketch below enumerates:
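
A small sketch that enumerates the cuboids of that four-dimension example as bitmasks, from the full base cuboid down to the empty grand-total cuboid:

import java.util.ArrayList;
import java.util.List;

public class CuboidEnumeration {
    public static void main(String[] args) {
        String[] dims = {"time", "item", "location", "supplier"};
        int n = dims.length;
        // Every subset of the dimensions is one cuboid: 2^4 = 16 of them.
        for (long mask = (1L << n) - 1; mask >= 0; mask--) {
            List<String> combo = new ArrayList<>();
            for (int d = 0; d < n; d++) {
                if ((mask & (1L << d)) != 0) combo.add(dims[d]);
            }
            System.out.println(mask + " -> " + (combo.isEmpty() ? "(grand total)" : combo));
        }
    }
}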

4.6 Creating HTable

A few things need to be considered when creating the HTable:

  1. The column family setup.
  2. The compression of each column family.
  3. The deployment of the coprocessor.
  4. The size of each region of the HTable.

In this step, the column families are created according to the settings the user chose when creating the cube. The key stored in HBase is a combination of dimension members, and the values, which go into the column families, are the results of the corresponding aggregate functions; in general a single column family containing the results of all the aggregate functions is set up when a cube is created.
LZO compression is used by default when the HTable is created; if LZO is not supported, no compression is used (later Kylin versions support more codecs).
Kylin depends heavily on HBase coprocessors, so a coprocessor must be deployed on the created HTable: its jar file is first uploaded to the HDFS that HBase runs on and then registered in the table's meta information. This step is error-prone; if, for example, the coprocessor cannot be found, whole region servers fail to start, so special care is needed. The region split points were already determined in the previous step, so no dynamic region splitting is expected here, and Kylin creates the HTable through the following HBase interface:

public void createTable(final HTableDescriptor desc, byte[][] splitKeys)
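
A sketch of what this creation looks like against the classic HBase client API. The table name, family name, jar path, priority and split keys are made-up placeholders, and the coprocessor class name is an assumption based on Kylin's HBase storage module:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateCubeHTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("KYLIN_SALE_CUBE"));

        // One column family "F1" holding all aggregate-function results, LZO-compressed.
        HColumnDescriptor family = new HColumnDescriptor("F1");
        family.setCompressionType(Compression.Algorithm.LZO);
        desc.addFamily(family);

        // Register the coprocessor from a jar previously uploaded to HDFS.
        desc.addCoprocessor(
                "org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService",
                new Path("hdfs:///kylin/coprocessor/kylin-coprocessor.jar"), 1001, null);

        // Pre-split the table at the boundaries decided by the statistics step,
        // so the regions never have to split dynamically during the build.
        byte[][] splitKeys = {{0, 1}, {0, 2}};  // placeholder split points
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}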

4.7 Build the cube with the Spark engine (Build Cube with Spark)

In Kylin's cube model, every cube is composed of multiple cuboids; in theory an N-dimension cube consists of 2^N cuboids. We can therefore first compute the base cuboid, the one containing all N dimensions (corresponding to a query that groups by every dimension column), and then compute layer by layer upward from it until the topmost cuboid (corresponding to a query with no group-by column) is reached. This is in fact exactly what Kylin does at this stage, except that the computation has to be abstracted into these MapReduce-style models and submitted as a Spark job.
Spark generates the data of one layer of dimension combinations (cuboids) at a time:

Build Base Cuboid Data;
Build N-Dimension Cuboid Data : 7-Dimension;
Build N-Dimension Cuboid Data : 6-Dimension;
......
Build N-Dimension Cuboid Data : 2-Dimension;
Build Cube.
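
A schematic of this layer-by-layer idea, not Kylin's actual Spark cubing job. Records are keyed by a cuboid bitmask plus the dimension values (cleared positions hold an empty string), with a single SUM measure as the value; all names are hypothetical:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

public class ByLayerCubingSketch {

    // One step down: compute the (k-1)-dimension layer from the k-dimension layer.
    static JavaPairRDD<String, Long> buildNextLayer(JavaPairRDD<String, Long> parentLayer) {
        return parentLayer.flatMapToPair(rec -> {
            String[] parts = rec._1().split("\\|", 2);
            long mask = Long.parseLong(parts[0]);
            String[] dims = parts[1].split(",", -1);

            // Each child cuboid must be derived from exactly one parent, or its
            // measure would be double-counted. A simple canonical rule: only drop
            // dimensions below the lowest dimension already dropped. (Kylin's
            // cuboid scheduler picks parents with a smarter heuristic.)
            int lowestDropped = dims.length;
            for (int d = 0; d < dims.length; d++) {
                if ((mask & (1L << d)) == 0) { lowestDropped = d; break; }
            }
            List<Tuple2<String, Long>> children = new ArrayList<>();
            for (int d = 0; d < lowestDropped; d++) {
                String[] childDims = dims.clone();
                childDims[d] = "";                // drop dimension d
                long childMask = mask & ~(1L << d);
                children.add(new Tuple2<>(
                        childMask + "|" + String.join(",", childDims), rec._2()));
            }
            return children.iterator();
        }).reduceByKey(Long::sum);  // re-aggregate the measure inside each child cuboid
    }

    // Driver loop: start from the base cuboid (all N dimensions present) and walk
    // down, persisting each layer before it becomes the parent of the next one.
    static void buildAllLayers(JavaPairRDD<String, Long> baseCuboid, int n) {
        JavaPairRDD<String, Long> layer = baseCuboid;
        for (int level = n - 1; level >= 0; level--) {
            layer = buildNextLayer(layer);
            layer.saveAsTextFile("/tmp/cuboids/level_" + level);  // placeholder path
        }
    }
}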

4.8 Convert the cuboid data to HFile (Convert Cuboid Data to HFile)

After an HTable has been created, data is normally written to it through the insert interface, but because the amount of cuboid data is huge, frequent inserts would put very heavy pressure on HBase. Kylin therefore first converts the cuboid files into HBase's HFile format and then attaches the files to the HTable by bulk load, which greatly reduces the load on HBase. The conversion is done by an MR task, whose wiring is sketched below.
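
A sketch of how such a job is wired up with the stock HBase helper HFileOutputFormat2.configureIncrementalLoad, which sets the output format, the compression and a partitioner matching the table's region boundaries, so that each reducer writes the HFiles of exactly one region. Paths, table name and the omitted mapper are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CuboidToHFileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "convert cuboid data to HFile");
        job.setJarByClass(CuboidToHFileJob.class);
        // job.setMapperClass(...): a mapper that turns cuboid rows into KeyValues.

        FileInputFormat.addInputPath(job, new Path("/tmp/cuboids"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));   // placeholder

        TableName name = TableName.valueOf("KYLIN_SALE_CUBE");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name)) {
            HFileOutputFormat2.configureIncrementalLoad(job, table,
                    conn.getRegionLocator(name));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}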

4.9 Load the HFiles into the HBase table (Load HFile to HBase Table)

This step loads the HFile files into the HTable and relies entirely on an HBase tool. Once it finishes, the data lives in HBase: each key is composed of the cuboid number plus the dictionary id of every member, and the values may be stored in more than one column family, holding the measure values computed from the original data for that GROUP BY combination of members. A sketch of the bulk-load call follows.
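
The HBase tool in question is the bulk loader; a minimal sketch of invoking it from Java, with the same placeholder paths and table name as above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadHFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("KYLIN_SALE_CUBE");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name)) {
            // The loader moves (rather than copies) each HFile into the region
            // that owns its key range, which is why the load itself is cheap.
            new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"),
                    admin, table, conn.getRegionLocator(name));
        }
    }
}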

4.10 Update the cube information (Update Cube Info)

This step updates the cube's state, which includes marking the cube as available and recording the statistics of this build: the time the build finished, the number of input records, the size of the input data, the size of the data saved to HBase, and so on. This information is then persisted into the metadata store.

4.11 Clean up the intermediate Hive table (Hive Cleanup)

Whether this step succeeds has no effect on correctness, because after the previous step the new segment can already be found in the cube; but the whole build leaves a lot of garbage files behind, including:

  1. the temporary Hive table;
  2. because that Hive table is an external table, the files backing it must also be deleted separately;
  3. the distinct fact-table column values written to HDFS in preparation for building the dictionaries, which can now be deleted;
  4. the files generated while computing the rowKey statistics, which can now be deleted;
  5. the HFiles were generated under a path different from the one where HBase actually stores its files; although the bulk load moves them away, the top-level directory still exists and needs to be removed.

At this point, the whole build process is complete.

Origin: www.cnblogs.com/xiaodf/p/11685023.html