Building a data cube: an analysis from the BI perspective

This article does not analyze how to build a data cube through a specific example; instead, it looks at how a BI product can build a better data cube system.

Concepts

This part introduces the main concepts; readers who already know them can skip ahead.

A data cube is a multidimensional data model. Here are some related concepts of the multidimensional model:

• multidimensional data model: a model built on the fact tables and dimension tables of a database, organized into multiple levels so that users can query and analyze data from multiple perspectives; its basic application is OLAP (Online Analytical Processing)

• cube: the multidimensional space spanned by the dimensions; it contains the basic data to be analyzed together with the aggregated data, and all operations are carried out on it

• dimension: an angle from which to observe the data; for example, address, item, and time in the figure below can each be seen as a dimension. Intuitively, dimensions are the axes of the cube; three dimensions, for instance, span a cube in three-dimensional space

• dimension members: the basic units that make up a dimension; for example, the time dimension contains the four dimension members Q1, Q2, Q3, and Q4

• hierarchy: a hierarchy on a dimension, of which there are two kinds: natural hierarchies and user-defined hierarchies. For example, the time dimension can be divided into the three levels year, month, day, or into the three levels year, quarter, month. A dimension can have multiple hierarchies; a hierarchy is the path along which data is aggregated

• level: the components of a hierarchy; for example, year, month, and day are three levels of the time dimension

• measure: a numeric function that can be evaluated at each point in the cube space; a metric value is the natural result of such a measure

• fact table: the table that stores the measure values; it holds foreign keys to the dimension tables, and all data used in the analysis ultimately comes from the fact table

• dimension table: describes a dimension; each dimension corresponds to one or more dimension tables. One table per dimension gives the star schema, while multiple tables per dimension give the snowflake schema

[Figure: a three-dimensional data cube with dimensions such as address, item, and time]

The figure above is just a visualization of the multidimensional model; it happens to be three-dimensional, but the multidimensional data model is not limited to three dimensions and can be n-dimensional. It is called a "cube" to make it easy for users to imagine, interpret, and explain, and to distinguish it from the two-dimensional tables of traditional relational databases. Accordingly, any n-dimensional data cube can be viewed as a sequence of (n-1)-dimensional cubes; for example, a 4-D cube can be viewed as a sequence of 3-D cubes.

[Figure: a 4-D cube viewed as a sequence of 3-D cubes]
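To make this concrete, here is a tiny sketch (my own illustration, with hypothetical dimensions) showing how fixing one dimension turns a 4-D cube into a sequence of 3-D cubes:

object CubeViewSketch {
  // Hypothetical layout: a measure indexed by (supplier, item, time, location).
  type Cube3 = Array[Array[Array[Double]]] // (item, time, location) -> measure
  type Cube4 = Array[Cube3]                // a fourth dimension, e.g. supplier

  // Fixing the supplier index turns the 4-D cube into one 3-D cube per supplier.
  def asSequenceOf3DCubes(cube: Cube4): Seq[Cube3] = cube.toSeq
}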

The main schemas of the multidimensional data model are the star schema, the snowflake schema, and the fact constellation schema.

Star schema

It is the most common schema. It consists of a large central table (the fact table), which contains the bulk of the data with no redundancy, and a set of small attendant tables (the dimension tables), one per dimension. As shown below, the data is observed from four dimensions: item, time, branch, and location; the central Sales fact table contains the keys of the four dimension tables (generated by the system) and three measures.

Each dimension is represented by a single table, whose attributes may form a hierarchy or a lattice.

[Figure: star schema of a Sales fact table with time, item, branch, and location dimension tables]
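As an illustration (not from the original article), the star schema above could be modeled with Scala case classes like these; all field names are hypothetical guesses at the classic example:

// Star schema sketch: one fact table, four denormalized dimension tables.
case class TimeDim(timeKey: Int, day: Int, month: Int, quarter: String, year: Int)
case class ItemDim(itemKey: Int, itemName: String, brand: String, itemType: String, supplierType: String)
case class BranchDim(branchKey: Int, branchName: String, branchType: String)
case class LocationDim(locationKey: Int, street: String, city: String, province: String, country: String)

// The central fact table holds the four dimension keys plus three measures.
case class SalesFact(
  timeKey: Int, itemKey: Int, branchKey: Int, locationKey: Int, // FKs to dimensions
  dollarsSold: Double, unitsSold: Long, avgSales: Double        // measures
)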

Snowflake schema

It is a variant of the star schema, in which some dimension tables are normalized and further decomposed into additional tables, so that the schema graph is shaped like a snowflake.

As shown in the figure, the item dimension table is normalized, producing a new item table and a supplier table; likewise, the location dimension table is normalized into new location and city tables.

[Figure: snowflake schema with normalized item/supplier and location/city tables]
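Continuing the hypothetical sketch above, the snowflake version normalizes the item and location dimensions into extra tables:

// Snowflake schema sketch: item and location are decomposed further.
case class SupplierDim(supplierKey: Int, supplierType: String)
case class ItemDimSnow(itemKey: Int, itemName: String, brand: String, itemType: String, supplierKey: Int)
case class CityDim(cityKey: Int, city: String, province: String, country: String)
case class LocationDimSnow(locationKey: Int, street: String, cityKey: Int)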

Fact constellation

It allows multiple fact tables to share dimension tables and can be seen as a collection of star schemas. As shown below, the Sales and Shipping fact tables share the time, item, and location dimension tables.

[Figure: fact constellation with Sales and Shipping fact tables sharing dimension tables]

Overall, data warehouses mostly use the fact constellation schema, because it can model multiple related subjects; data marts tend to use the star or snowflake schema, because a data mart usually focuses on one specific subject.

Multidimensional OLAP analysis operations include drill-down, roll-up, slice, dice, and pivot. Each is explained below with an example on the data cube above:

[Figure: OLAP operations (drill-down, roll-up, slice, dice, pivot) on the data cube]

Drill-down: moving between levels of a dimension, from an upper level down to the next, i.e., splitting aggregated data into more detailed data. For example, drill down from the total sales of Q2 2010 to the sales of April, May, and June 2010, as shown above; similarly, you can drill down from the sales data of Zhejiang Province to the sales of cities such as Hangzhou, Ningbo, and Wenzhou.

Roll-up: the inverse of drill-down, i.e., aggregating data from a fine-grained level to a coarser one. For example, aggregate the sales data of Jiangsu, Zhejiang, and Shanghai to view the sales of the Jiangsu-Zhejiang-Shanghai region, as shown above.

Slice: selecting a specific value on one dimension for analysis, for example, looking only at the sales of electronic products, or only at the data for Q2 2010.

Dice: selecting a specific interval of one dimension, or specific values on several dimensions, for analysis; for example, selecting the sales data from Q1 2010 to Q2 2010, or the sales data of electronic products and daily commodities.

Pivot: swapping the positions of dimensions, like transposing the rows and columns of a two-dimensional table; the figure shows a rotation that swaps the product dimension and the geography dimension.
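To make these operations concrete, here is a minimal sketch (my own illustration, with hypothetical sample data) of slice, dice, and roll-up over a tiny in-memory cube keyed by (quarter, product, region):

object OlapSketch {
  // A tiny in-memory "cube": (quarter, product, region) -> sales.
  // All names and numbers are hypothetical, for illustration only.
  val cube: Map[(String, String, String), Double] = Map(
    ("2010Q1", "electronics", "Hangzhou") -> 100.0,
    ("2010Q2", "electronics", "Ningbo")   -> 120.0,
    ("2010Q2", "food",        "Hangzhou") -> 80.0
  )

  // Slice: fix one dimension to a single value (e.g. keep only one quarter).
  def slice(quarter: String) =
    cube.filter { case ((q, _, _), _) => q == quarter }

  // Dice: restrict several dimensions to sets of values at once.
  def dice(quarters: Set[String], products: Set[String]) =
    cube.filter { case ((q, p, _), _) => quarters(q) && products(p) }

  // Roll-up: aggregate the region dimension away (sum sales per quarter/product);
  // drill-down is the inverse, going back to the finer-grained data.
  def rollUpRegion: Map[(String, String), Double] =
    cube.groupBy { case ((q, p, _), _) => (q, p) }
        .map { case (k, cells) => k -> cells.values.sum }

  def main(args: Array[String]): Unit = {
    println(slice("2010Q2"))                                   // one quarter only
    println(dice(Set("2010Q1", "2010Q2"), Set("electronics"))) // Q1-Q2, one product
    println(rollUpRegion)                                      // region aggregated away
  }
}

Pivot has no counterpart here: it does not change the data at all, only the presentation axes.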

Kylin's Cube algorithm

The following is quoted from the references; interested readers can follow the reference links at the end to read the originals.
Layer Cubing algorithm

It can also be called the "layer-by-layer" algorithm: it launches N+1 rounds of MapReduce to compute an N-dimensional cube. The first round reads the raw data (RawData), removes the irrelevant columns, keeps only the relevant ones, and encodes the dimension columns; the result of this first round is called the Base Cuboid. In every subsequent MapReduce round, the input is the output of the previous round; each round removes one dimension to aggregate and compute a new Cuboid, and so on until all Cuboids have been computed.

[Figure: layer-by-layer build of a 4-dimensional Cube]

The figure above shows the build process of a 4-dimensional Cube.

The Mapper and Reducer of each round are relatively simple. The Mapper takes the rows of one or more parent Cuboids (Key-Value pairs) as input. Since the Key is the concatenation of the dimension values, the Mapper identifies the dimension to be aggregated away and removes its value to form a new Key, which it outputs together with the original Value; Hadoop MapReduce then sorts and shuffles all the new Keys and sends them to the Reducers. Each Reducer receives a group of Values sharing the same Key, aggregates them, and outputs the Key with the aggregated result, completing the round.

Each round of computation is one MapReduce job, and the jobs execute serially; an N-dimensional Cube therefore needs at least N MapReduce jobs.
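As a sketch of this round structure (my own illustration, not Kylin's actual code), the per-round Mapper and Reducer logic might look like this, with the shuffle simulated by a groupBy:

object LayerCubingSketch {
  // A record: key = encoded dimension values, value = the measure.
  type Record = (Vector[String], Long)

  // Mapper: remove the dimension being aggregated away to form the child key.
  def mapper(parent: Record, dropIdx: Int): Record = {
    val (key, value) = parent
    (key.patch(dropIdx, Nil, 1), value)
  }

  // Reducer: all values that now share a key are aggregated (SUM here).
  def reducer(key: Vector[String], values: Iterable[Long]): Record =
    (key, values.sum)

  // One round: mapper, an in-memory stand-in for the sort/shuffle, then reducer.
  def computeChildCuboid(parentRows: Seq[Record], dropIdx: Int): Seq[Record] =
    parentRows
      .map(mapper(_, dropIdx))
      .groupBy(_._1) // what Hadoop's shuffle does between Mapper and Reducer
      .map { case (k, rows) => reducer(k, rows.map(_._2)) }
      .toSeq
}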

Advantages of the algorithm

This algorithm makes full use of the capabilities of MapReduce, which handles the complex sorting and shuffling in the middle, so the algorithm code is clear, simple, and easy to maintain;

Benefiting from the growing maturity of Hadoop, this algorithm puts low demands on the cluster and runs stably; while maintaining Kylin internally, we have rarely seen these steps fail, and the jobs complete even when the Hadoop cluster is quite busy.

Disadvantages of the algorithm

When the Cube has many dimensions, the number of MapReduce jobs increases accordingly; since Hadoop job scheduling costs extra resources, especially on a large cluster, the overhead of repeatedly submitting jobs can be considerable;

Because the Mapper does no pre-aggregation, this algorithm outputs a large amount of data to Hadoop MapReduce; although a Combiner is used to reduce the data transferred from the Mapper side to the Reducer side, all the data still has to be sorted and combined by Hadoop MapReduce before it can be aggregated, which quietly adds pressure on the cluster;

There are many reads and writes to HDFS: because the output of each layer is used as the input of the next layer, all these Key-Value pairs must be written to HDFS; and when all computation is done, Kylin needs an extra round of jobs to convert these files into HBase's HFile format for loading into HBase;

Overall, the algorithm's efficiency is low, especially when the Cube has many dimensions; users often ask whether the cubing algorithm can be improved to shorten the build time.

Fast (in-mem) Cubing algorithm

It is also referred to as the "by-segment" or "by-split" algorithm.

Introduced in 1.5.x, this algorithm performs most of the aggregation on the Mapper side and sends the aggregated results to the Reducers, reducing the pressure on the network bottleneck.

The main idea

Each Mapper computes the data block assigned to it as a complete, small Cube segment (containing all Cuboids);

After each Mapper finishes, it outputs its Cube segment to the Reducers, which merge the segments into one large Cube, i.e., the final result; the figure below illustrates the process.

[Figure: Mappers build per-segment cubes; Reducers merge them into the final Cube]

Differences from the old algorithm

The Mapper pre-aggregates in memory and computes all combinations; every key the Mapper outputs is distinct, which reduces the amount of data output to Hadoop MapReduce, and the Combiner is no longer needed (see the sketch after this list);

A single MapReduce round completes the computation of all layers, reducing the scheduling of Hadoop jobs.
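A minimal sketch of the in-Mapper pre-aggregation idea (my own illustration, assuming a hash-map accumulator and SUM measures):

import scala.collection.mutable

object InMapperPreAggregation {
  // Aggregate all records in memory first; every emitted key is then distinct,
  // so no Combiner is needed and far fewer records reach the shuffle.
  def preAggregate(records: Iterator[(Vector[String], Long)]): Iterator[(Vector[String], Long)] = {
    val acc = mutable.Map.empty[Vector[String], Long]
    for ((key, value) <- records)
      acc.update(key, acc.getOrElse(key, 0L) + value)
    acc.iterator
  }
}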

An example

Suppose a Cube has four dimensions A, B, C, D; each Mapper has 1,000,000 source records to process; and the column cardinalities in the Mapper are Card(A), Card(B), Card(C), and Card(D).

When the source records are aggregated into the base cuboid (1111), the old "layer-by-layer" algorithm makes the Mapper output 1 million records to Hadoop; with the fast cubing algorithm, after pre-aggregation the Mapper outputs only the number of distinct [A, B, C, D] combinations, which is certainly smaller than the source data; in normal cases it may be 1/10 to 1/1000 of the source record count.

When aggregating from a parent cuboid to a child cuboid, say from the base cuboid (1111) to the three-dimensional cuboid 0111, dimension A is aggregated away; assuming dimension A is independent of the other dimensions, cuboid 0111 after aggregation is about 1/Card(A) the size of the base cuboid, so this step reduces the output to 1/Card(A) of the original.

In general, assuming the average dimension cardinality is Card(N), the records written from the Mappers to the Reducers can be reduced to 1/Card(N) of the original; the less output to Hadoop, the less I/O and computation, and the better the performance.

Traversal order of the Cuboid Spanning Tree

In the old algorithm, Kylin computed the Cuboids layer by layer, that is, in breadth-first search (BFS) order; in the fast cubing algorithm, the Mapper computes the Cuboids in depth-first search (DFS) order. The DFS is implemented recursively: a parent Cuboid is pushed onto the stack in order to compute its child Cuboids, and it is not popped and output to Hadoop until no more of its children remain to be computed; at most N Cuboids are on the stack at once, where N is the number of dimensions of the Cube.

DFS is used to balance CPU and memory:

computing a child Cuboid from its parent avoids repeated computation;

only the parent Cuboids of the Cuboid currently being computed are kept on the stack, reducing memory usage.

[Figure: complete spanning tree of a four-dimensional Cube]

The figure shows the complete spanning tree of a four-dimensional Cube;

in DFS order, the computation path before the zero-dimensional Cuboid can be output is ABCD -> BCD -> CD -> D -> 0-D, and ABCD, BCD, CD, and D must be held in memory during that time; once the 0-D Cuboid is output, D can be output and its memory released; after C is computed and output, CD can be output; ABCD is output last.
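The traversal can be sketched as follows (my own illustration; the child-generation rule is one convention that yields a valid spanning tree, with each Cuboid represented as a bitmask over N dimensions):

object CuboidDfsSketch {
  // Lowest dimension bit absent from the mask (n if the mask is full).
  def lowestMissingBit(mask: Int, n: Int): Int =
    (0 until n).find(b => (mask & (1 << b)) == 0).getOrElse(n)

  // Children: clear one set bit strictly below the lowest missing bit;
  // this guarantees every Cuboid has exactly one parent in the tree.
  def children(mask: Int, n: Int): Seq[Int] = {
    val limit = lowestMissingBit(mask, n)
    (0 until limit).filter(b => (mask & (1 << b)) != 0).map(b => mask & ~(1 << b))
  }

  // Post-order DFS: a parent stays in memory only until all of its
  // children have been computed and emitted, then it is emitted itself.
  def dfs(mask: Int, n: Int, emit: Int => Unit): Unit = {
    children(mask, n).sorted.foreach(child => dfs(child, n, emit))
    emit(mask)
  }

  def main(args: Array[String]): Unit = {
    val n = 4
    dfs((1 << n) - 1, n, m => println(m.toBinaryString.reverse.padTo(n, '0').reverse))
  }
}

Running main prints 0000, 0001, 0010, ..., 1111 in ascending order, which matches the sorted output described next.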

With the DFS visiting order, the Mapper's output is already fully sorted (apart from some special cases), because the Cuboid ID sits at the start of the row key and the rows within a Cuboid are already sorted:

0000
0001[D0]
0001[D1]
....
0010[C0]
0010[C1]
....
0011[C0][D0]
0011[C0][D1]
....
....
1111[A0][B0][C0][D0]
....

Since the Mapper output is already sorted, Hadoop's sorting is more efficient;

in addition, pre-aggregation happens in the Mapper's memory, which avoids unnecessary disk and network I/O and reduces the load on Hadoop.

During the development phase, we encountered OutOfMemory errors in the Mapper; this can happen when:

the Mapper's JVM heap size is small;

"distinct count" measures are used (HyperLogLog counters take up space);

the spanning tree is too deep (too many dimensions);

the data given to a Mapper is too large.

Kylin does not assume that the Mapper always has enough memory; the cubing algorithm needs to adapt to various situations.

When it proactively detects an OutOfMemory error, it optimizes memory usage and spills data to disk; the results are promising, and OOM errors rarely occur now.

Advantages and disadvantages

Advantages

It is faster than the old method; our comparison tests show the total build time can be reduced by 30% to 50%;

it produces less workload on Hadoop and leaves fewer intermediate files on HDFS;

other cube engines, such as Spark Cubing, can easily reuse its cubing code.

Disadvantages

The algorithm is a bit more complex, which increases the maintenance effort;

although the algorithm can automatically spill data to disk, it still wants the Mapper to have enough memory for the best performance;

users need more knowledge to tune the cube.

By-layer Spark Cubing algorithm

As we know, the RDD (Resilient Distributed Dataset) is a basic concept in Spark. The set of cuboids of an N-dimensional cube can be described well with RDDs: an N-dimensional cube has N+1 RDDs, one per layer. These RDDs have parent/child relationships, since a parent RDD can be used to generate a child RDD. With the parent RDD cached in memory, generating a child RDD is much more efficient than reading it from disk. The figure below describes this process.

[Figure: cuboid RDDs, one per layer, with their parent/child relationships]

Improvements

the cuboids of each layer are treated as one RDD;

the parent RDD is cached in memory whenever possible;

each RDD is exported to a sequence file;

by replacing "map" with "flatMap" and "reduce" with "reduceByKey", most of the code can be reused.

The cubing process in Spark

The DAG below details the process:

In the "Stage 5" in, reads the intermediate Kylin used HiveContext Hive table, and then perform "map" operation of one-one mapping of the original value is encoded as a byte KV. After completion of the encoding of an intermediate obtained Kylin RDD.

In the "Stage 6", with a middle RDD "reduceByKey" operation polymerized to obtain RDD-1, which is a base cuboid. Next, RDD-1 made on a "flatMap" (many Map), since there are N sub-base cuboid cuboid. And so on, RDD levels calculated. Upon completion, the RDD will be intact in the distributed file system, but can be used to calculate the next level cache in memory. When a child cuboid, it is removed from the cache.

[Figure: the DAG of the Spark cubing job]
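Here is a runnable sketch of the by-layer Spark flow (my own illustration, not Kylin's actual code; the paths, sample rows, and key encoding are hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object SparkCubingSketch {
  // One slot per dimension; None = aggregated away. The pattern of Some/None
  // plays the role of Kylin's cuboid ID at the start of the row key.
  type Key = Vector[Option[String]]

  // Spanning-tree rule: only drop dimensions to the right of the rightmost
  // dropped one, so each child cuboid is generated from exactly one parent.
  def childKeys(key: Key): Seq[Key] = {
    val rightmostDropped = key.lastIndexWhere(_.isEmpty)
    (rightmostDropped + 1 until key.length).map(i => key.updated(i, None))
  }

  // One layer: flatMap each parent row to its child rows, then reduceByKey.
  def nextLayer(parent: RDD[(Key, Long)]): RDD[(Key, Long)] =
    parent.flatMap { case (k, v) => childKeys(k).map(_ -> v) }
          .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cubing-sketch").setMaster("local[*]"))
    val nDims = 3
    // Hypothetical base cuboid; in Kylin it comes from the encoded Hive table.
    var layer: RDD[(Key, Long)] = sc.parallelize(Seq[(Key, Long)](
      (Vector(Some("2010Q2"), Some("electronics"), Some("Hangzhou")), 100L),
      (Vector(Some("2010Q2"), Some("food"), Some("Ningbo")), 80L)))
    layer.persist(StorageLevel.MEMORY_AND_DISK)
    layer.saveAsTextFile(s"/tmp/cuboids-level-$nDims") // Kylin writes sequence files
    for (level <- nDims - 1 to 0 by -1) {
      val child = nextLayer(layer)
      child.persist(StorageLevel.MEMORY_AND_DISK)
      child.saveAsTextFile(s"/tmp/cuboids-level-$level") // action: materializes child
      layer.unpersist() // the parent layer is no longer needed in the cache
      layer = child
    }
    sc.stop()
  }
}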

Performance Testing

[Figures: build-time comparison of Spark and MapReduce across three test cases]

In all three cases, Spark was faster than MR; overall, it can cut the build time roughly in half.

Comparison of the cubing algorithms

[Figure: comparison of the cubing algorithms]

Reference links:

https://blog.csdn.net/bbbeoy/article/details/79073725

https://blog.csdn.net/Forlogen/article/details/88634117

http://cxy7.com/articles/2018/06/09/1528549073259.html
----------------
Disclaimer: this article is an original article by the CSDN blogger "Qi Miao think think" and follows the CC 4.0 BY-SA copyright agreement; when reproducing it, please include the original source link and this statement.
Original link: https://blog.csdn.net/dafei1288/article/details/101443603
