How to make multidimensional analysis pre-aggregation work

Multidimensional analysis is an advanced statistical analysis method that places a product or market phenomenon in a coordinate space of two or more dimensions for examination.

Multidimensional analysis (OLAP) usually demands very fast response. When the data volume is large, summarizing from detail data on every query is slow, so pre-aggregation is often used to speed things up: the expected query results are computed in advance, and a query is answered in real time by simply reading the pre-computed result, which meets the needs of interactive analysis.
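As a minimal illustration of the idea (a Python sketch with made-up data, not SPL), pre-aggregation amounts to computing group summaries once, then answering later queries by direct lookup instead of rescanning the detail rows:

```python
from collections import defaultdict

# Toy fact table: (region, product, amount)
rows = [
    ("east", "pen", 10), ("east", "ink", 5),
    ("west", "pen", 7), ("west", "ink", 3),
    ("east", "pen", 2),
]

# Pre-summarize once: total amount per (region, product) pair.
cube = defaultdict(int)
for region, product, amount in rows:
    cube[(region, product)] += amount

# An interactive query now reads the stored result directly,
# instead of traversing the detail rows again.
print(cube[("east", "pen")])  # 12
```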

However, pre-aggregating every possible combination of dimensions is unrealistic. A full pre-aggregation over 50 dimensions would require storage on the order of 1M TB, that is, one million 1TB hard disks; even aggregating only 20 of those dimensions would still take about 470,000 TB, which is obviously unacceptable. Therefore, partial pre-aggregation is generally used: only some dimension combinations are pre-computed, balancing storage space against performance requirements.
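The article's TB figures depend on assumptions about dimension cardinality and row size, but the combinatorial explosion itself is easy to see: each cuboid corresponds to one non-empty subset of the dimension set, so the count alone grows as 2^n. A quick sketch:

```python
# A full pre-aggregation over n dimensions needs one cuboid per
# non-empty subset of dimensions: 2**n - 1 cuboids in total.
def cuboid_count(n_dims):
    return 2 ** n_dims - 1

print(cuboid_count(10))  # 1023 cuboids - still manageable
print(cuboid_count(20))  # 1048575
print(cuboid_count(50))  # ~1.1e15 - hopeless to materialize
```

Even before multiplying by the rows each cuboid holds, 50 dimensions already mean about 10^15 distinct cuboids.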

The Dilemma of the Pre-Aggregation Scheme

In fact, even setting the capacity problem aside, pre-aggregation can only serve a small number of relatively fixed query patterns in multidimensional analysis. It cannot handle even slightly more complex or flexible scenarios, which are common in real business:

  1. Unconventional aggregations: besides the common sums and counts, unconventional aggregations such as distinct counts, medians, and variances are easily left out of the pre-computed set, and they cannot be derived from other aggregated values. Since there are in principle infinitely many aggregation operations, they cannot all be pre-aggregated.

  2. Combinatorial aggregation: aggregations may be composed. For example, "average monthly sales" is computed by first summing daily sales within each month and then averaging the monthly totals. It is neither a plain sum nor a plain average, but a combination of two aggregations at different levels of the dimension hierarchy. Such combinations are also unlikely to be pre-aggregated in advance.

  3. Conditional measures: a measure may carry a condition. For example, we may want the total sales of orders whose transaction amount exceeds 100 yuan. This cannot be handled at pre-aggregation time either, because the threshold 100 is a parameter supplied at query time.

  4. Time-period statistics: time is a special dimension that can be sliced not only by enumeration but also as continuous intervals. The start and end of a query interval may be fine-grained (down to a particular date), so the statistics must be recomputed from fine-grained data; higher-level pre-aggregated data cannot be used directly.
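Two of these limits can be demonstrated in a few lines of Python (an illustration with made-up numbers, not SPL): a median cannot be combined from partial medians, and "average monthly sales" is a two-level aggregation rather than a plain sum or average:

```python
import statistics

# Point 1: a median is not derivable from per-partition medians.
part1, part2 = [1, 2, 9], [3, 4, 5]
true_median = statistics.median(part1 + part2)               # 3.5
median_of_medians = statistics.median(
    [statistics.median(part1), statistics.median(part2)])    # 3.0
assert true_median != median_of_medians

# Point 2: "average monthly sales" = sum per month first, then average.
daily = {"2023-01-02": 10, "2023-01-15": 20, "2023-02-07": 30}
monthly = {}
for day, amt in daily.items():
    month = day[:7]                       # "YYYY-MM"
    monthly[month] = monthly.get(month, 0) + amt
avg_monthly = sum(monthly.values()) / len(monthly)           # 30.0
```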

Pre-aggregation can indeed improve the performance of multidimensional analysis to some extent, but it covers only a few scenarios, and since only partial pre-aggregation is feasible, the applicable scenarios are narrower still. Even then it faces the problem of enormous storage space. Pinning the effectiveness of multidimensional analysis on a pre-aggregation scheme alone is not reliable. To do multidimensional analysis well, strong "hard traversal" (brute-force scanning) capability is fundamental; even pre-aggregated data only pays off when backed by excellent traversal performance.

SPL pre-aggregation

The open-source esProc SPL provides conventional pre-aggregation for multidimensional analysis as well as a special time-period pre-aggregation. More importantly, backed by SPL's excellent data traversal capability, it can serve a much wider range of multidimensional analysis scenarios.

First look at the pre-aggregation capabilities of SPL.

Partial pre-aggregation

Since full pre-aggregation is unrealistic, only partial pre-aggregation can be performed. Although it cannot achieve O(1) response time, it can still improve performance by dozens of times, which is worthwhile. SPL can build as many pre-aggregated intermediate results as needed. For example, suppose data table T has five dimensions A, B, C, D, E; the most commonly used intermediate results can be pre-computed based on business experience.

image

In the figure above, the length of each bar represents the storage space the cube occupies; cube1 is the largest and cube2 the smallest. When a front-end request asks for a summary by B and C, SPL chooses among the cubes roughly as follows.

image

In step i, SPL finds that the usable cubes are cube1 and cube3 (both contain B and C). In step ii, since cube1 is relatively large, SPL automatically selects the smaller cube3, and groups and summarizes by B and C on top of it.
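The two-step selection can be sketched in Python (a hypothetical catalogue with invented cube sizes, not SPL's internal implementation): keep the cubes whose dimensions cover the query, then take the smallest:

```python
# Hypothetical catalogue: cube name -> (dimension set, stored row count).
cubes = {
    "cube1": ({"A", "B", "C"}, 1_000_000),
    "cube2": ({"A", "C", "D"}, 200_000),
    "cube3": ({"B", "C", "E"}, 500_000),
}

def pick_cube(query_dims):
    # Step i: keep only cubes whose dimensions cover the query.
    usable = [(name, rows) for name, (dims, rows) in cubes.items()
              if query_dims <= dims]
    # Step ii: among those, choose the one with the fewest rows.
    return min(usable, key=lambda t: t[1])[0] if usable else None

print(pick_cube({"B", "C"}))  # cube3
```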

SPL code example:

A
1 =file("T.ctx").open()
2 =A1.cuboid(cube1,A,B,C;sum(…),avg(…),…)
3 =A1.cuboid(cube2,A,C,D;sum(…),avg(…),…)
4 =A1.cgroups(B,C;sum(…), avg(…))

The cuboid function creates pre-aggregated data (A2 and A3); it takes a name (such as cube1), followed by the grouping dimensions and the aggregate measures. When A4 runs, the cgroups function automatically applies the rules above: it uses the intermediate cubes and selects the one with the smallest amount of data.

Time period pre-summarization

Time is a particularly important dimension in multidimensional analysis. It can be sliced by enumeration, but also as continuous intervals. For example, a business often needs the total sales between May 8 and June 12; the start and end points are passed in as query parameters and are highly arbitrary. Time-period conditions can also be combined: for example, the total amount of goods sold between May 8 and June 12 whose production dates fall between January 9 and February 17. Such time-period statistics carry strong business meaning, but conventional pre-aggregation schemes cannot handle them.

For these special time-period statistics, SPL provides a time-period pre-aggregation mechanism. Suppose the order table already has cube1 pre-aggregated by order date; we can then add a cube2 pre-aggregated by month. To compute the total amount from January 22 to September 8, 2018, the general process is as follows:

image

The time period is divided into three segments: the whole months from February to August are aggregated from the monthly cube2, while January 22-31 and September 1-8 are aggregated from the daily cube1. The amount of computation involved is 7 (February-August) + 10 (January 22-31) + 8 (September 1-8) = 25 units, whereas aggregating from cube1 alone would process 230 units (the number of days from January 22 to September 8), almost a 10-fold reduction.
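The segmentation step can be sketched in Python (an illustration of the splitting logic, not SPL's actual code): peel off the leading partial month, the run of whole months, and the trailing partial month:

```python
from datetime import date, timedelta

def split_period(start, end):
    """Split [start, end] into leading days, whole months, trailing days."""
    head, months, tail = [], [], []
    d = start
    while d <= end:
        month_start = d.replace(day=1)
        # First day of the next month, then back one day = end of month.
        nxt = (month_start + timedelta(days=32)).replace(day=1)
        month_end = nxt - timedelta(days=1)
        if d == month_start and month_end <= end:
            months.append(d.strftime("%Y-%m"))   # served by the monthly cube
            d = nxt
        else:
            seg = head if not months else tail   # partial month: daily cube
            while d <= min(month_end, end):
                seg.append(d)
                d += timedelta(days=1)
    return head, months, tail

h, m, t = split_period(date(2018, 1, 22), date(2018, 9, 8))
print(len(h), len(m), len(t))  # 10 7 8 -> only 25 pre-aggregated units
```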

SPL code example:

A
1 =file("orders.ctx").open()
2 =A1.cuboid(cube1,odate,dept;sum(amt))
3 =A1.cuboid(cube2,month@y(odate),dept;sum(amt))
4 =A1.cgroups(dept;sum(amt);odate>=date(2018,1,22)&&odate<=date(2018,9,8))

Here the cgroups function takes an additional condition parameter. When SPL finds a time-period condition together with higher-level pre-aggregated data, it applies the time-period pre-aggregation mechanism to reduce the amount of computation: in this example, the relevant data is read from cube1 and cube2 respectively and then aggregated together.

SPL hard traversal

The scenarios pre-aggregation can handle are still very limited; flexible multidimensional analysis ultimately depends on excellent traversal capability. The multidimensional operations themselves are not complex, and traversal mainly serves dimension filtering. Traditional databases can only brute-force this with WHERE, treating dimension filters like ordinary conditions, so they cannot achieve better performance. SPL provides several dimension-filtering mechanisms to meet the performance requirements of various multidimensional analysis scenarios.

Boolean sequence

The most common slicing (dicing) in multidimensional analysis is on enumerated dimensions. Apart from the time dimension, almost all dimensions are enumerated: product, region, type, and so on. The conventional approach is expressed in SQL like this:
SELECT D1,…,SUM(M1),COUNT(ID)… FROM T
WHERE Di IN (di1,di2…) …
GROUP BY D1,…

Here Di IN (di1, di2…) means the filter field takes a value within an enumerated range; in practice, "slicing by customer gender, employee department, product type," and so on are all enumerated-dimension slices. The conventional IN approach needs multiple comparisons per row to decide whether it qualifies, so its performance is poor, and the more values in the IN list, the worse it gets.

SPL converts this search operation into positional access to improve performance. First, the enumerated dimension is converted to integers: as shown in the figure below, the values of dimension D5 in the fact table are replaced by their ordinals (positions) in the dimension table.

image

Then, at query time, the slice condition is converted into an aligned sequence of boolean values. During comparison, the judgment result (true/false) is fetched directly from the corresponding position in the sequence, completing the slice quickly.

image

SPL data preprocessing code example:

A
1 =file("T.ctx").open()
2 =file("T_new.ctx").create(…)
3 =DV=T("DV.btx")
4 =A1.cursor().run(D=DV.pos@b(D))
5 =A2.append@i(A4)

A3 reads the dimension table, and A4 uses DV to convert dimension D into an integer ordinal. DV is saved separately for use at query time.

Slice summary:

A
1 =file("T.ctx").open()
2 =DV.(V.pos(~))
3 =A1.cursor(…;A2(D))
4 =A3.groups(…)

A2 converts the parameter V into a sequence of boolean-like values with the same length as DV: when a member of DV is in V, the member at the corresponding position of A2 is non-null (acting as true); otherwise it is null (acting as false). During traversal, the converted integer dimension D is used directly as an index into this sequence; a non-null member means the original dimension value satisfies the slice condition V. Indexed access is far cheaper than IN comparison, which greatly improves slicing performance.
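The mechanism can be sketched in Python (an illustration with invented dimension values, not SPL): replace dimension values with ordinals once during preprocessing, then at query time turn the slice set into a boolean sequence aligned with the dimension table, so each row costs one indexed lookup instead of an IN comparison:

```python
# Dimension table: the global list of values for dimension D.
DV = ["red", "green", "blue", "black", "white"]

# Preprocessing: store each fact row's D as its 1-based ordinal in DV.
facts = [("red", 10), ("blue", 5), ("white", 7), ("blue", 2)]
facts_int = [(DV.index(v) + 1, amt) for v, amt in facts]

# Query time: convert the slice set V into a boolean sequence
# aligned with DV - one membership test per dimension value, not per row.
V = {"blue", "white"}
flags = [v in V for v in DV]

# Slicing is now a single indexed lookup per row.
total = sum(amt for d, amt in facts_int if flags[d - 1])
print(total)  # 14
```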

SPL's hard-traversal capability has shown clear results in practice. In the case study "Open-source SPL speeds up intersection calculation of bank user profiles and customer groups by 200+ times", hard-traversal techniques such as boolean sequences and cursor pre-filtering increased the efficiency of the intersection computation by more than 200 times.

Tag Dimensions

In multidimensional analysis there is also a special kind of enumerated dimension, often used for slicing but rarely for grouping, whose value can only be yes/no or true/false. It is called a tag dimension (or binary dimension): whether a person is married, attended college, holds a credit card, and so on. Tag-dimension slicing is a yes/no filter condition, expressed in SQL like this:
SELECT D1,…,SUM(M1),COUNT(ID)… FROM T
WHERE Dj=true AND Dk=false …
GROUP BY D1,…

Tag dimensions are very common; tagging customers and things is an important part of modern data analysis, and data sets for multidimensional analysis often have hundreds or even thousands of tag dimensions. Treating each tag as an ordinary field wastes a great deal of both storage and computation, making high performance hard to achieve.

Since a tag dimension has only two values, it can be stored in a single bit: one 16-bit integer can hold 16 tags, so one field carries what previously required 16 fields. This storage scheme is called the tag bit dimension. SPL provides this mechanism, which greatly reduces the storage volume, and therefore the amount of hard-disk reading, while integers remain fast to read.

For example, suppose there are 8 binary dimensions in total and an integer field c1 stores them as an 8-bit binary number. To compute binary-dimension slices bitwise, the fact table must first be preprocessed into this bit-packed form.

image

In the processed fact table, c1 in the first row is A0h, which is 10100000 in binary, indicating that D6 and D8 are true and the other binary dimensions are false. Binary-dimension slicing can then be implemented with bitwise operations.

image

The slicing condition passed in by the front end is "2,3", that is, filter out the rows whose second binary dimension (D7) and third binary dimension (D8) are both true.

SPL code example:

A B
1 ="2,3" =A1.split@p(",")
2 =to(8).(0) =B1.(A2(8-~+1)=1)
3 =bits(A2)
4 =file("T.ctx").open().cursor(;and(c1,A3)==A3)
5 =A4.groups(~.D1,~.D2,~.D3,~.D4;sum(~.M1):S,count(ID):C)

The 8 yes/no condition filters are thus realized with a single bitwise AND per row: the original multiple comparisons on binary dimensions become one bitwise operation, so performance improves significantly. Packing multiple yes/no values into one integer also reduces the storage space the data occupies.
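A Python sketch of the same idea (invented rows, not SPL): pack the 8 tags into one integer per row, build a mask from the slice condition, and filter with a single AND per row:

```python
# Pack 8 yes/no tag dimensions into one integer per row; the first
# tag is the highest bit, mirroring the A0h example above.
def pack(tags):                          # tags: list of 8 flags
    v = 0
    for t in tags:
        v = (v << 1) | int(t)
    return v

rows = [
    pack([1, 0, 1, 0, 0, 0, 0, 0]),      # 0xA0: tags 1 and 3 set
    pack([0, 1, 1, 0, 0, 0, 0, 0]),      # tags 2 and 3 set
    pack([0, 1, 0, 1, 0, 0, 0, 0]),      # tags 2 and 4 set
]

# Slice condition "2,3": tags 2 and 3 must both be true.
mask = pack([0, 1, 1, 0, 0, 0, 0, 0])
hits = [r for r in rows if r & mask == mask]   # one AND per row
print(len(hits))  # 1
```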

Redundant sorting

Redundant sorting is an optimization that exploits ordered storage to speed up reading (traversal). The data is sorted by dimensions D1,…,Dn and stored once, then sorted by Dn,…,D1 and stored again. The data volume doubles, which is still acceptable. For any dimension D, there is then always a copy in which D ranks in the first half of the sort-key list. If D is not the very first sort key, the rows selected by a slice on D are generally not one contiguous block, but they still form a number of fairly large contiguous areas. The earlier a dimension appears in the sort-key list, the more physically contiguous the sliced data is.

image

During computation it is enough to use the slice condition on one dimension for this positional filtering; conditions on the other dimensions are still evaluated by traversal. In multidimensional analysis, a slice on a single dimension often cuts the data volume by several times to dozens of times, so exploiting ordering on further dimensions brings little extra benefit. When several dimensions carry slice conditions, SPL picks the dimension whose slice range is smallest relative to its total value range, which usually means the filtered data set is smallest.
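The benefit of having the sliced dimension lead the sort order can be sketched in Python (a toy with two redundant copies of the same rows; real systems would read the contiguous block from disk rather than build a key list in memory):

```python
import bisect

# Two redundant copies of the same rows (D1 value, D2 value),
# one sorted by D1 and one by D2 - a stand-in for D1..Dn / Dn..D1.
rows = [(d1, d2) for d1 in range(100) for d2 in range(100)]
by_d1 = sorted(rows, key=lambda r: r[0])
by_d2 = sorted(rows, key=lambda r: r[1])

def slice_leading(sorted_rows, key_idx, lo, hi):
    # Binary search locates the contiguous block [lo, hi] directly.
    keys = [r[key_idx] for r in sorted_rows]   # extracted once, for clarity
    left = bisect.bisect_left(keys, lo)
    right = bisect.bisect_right(keys, hi)
    return sorted_rows[left:right]

# Slice D2 in [10, 11]: use the copy ordered by D2, so the hit rows
# are one contiguous physical block; only that block is traversed.
hits = slice_leading(by_d2, 1, 10, 11)
print(len(hits))  # 200 rows touched instead of 10000
```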

This selection is implemented in the cgroups function of SPL. If it is found that there are multiple pre-summary data sorted by different dimensions and there are slice conditions, the most suitable one will be selected.

A
1 =file("T.ctx").open()
2 =A1.cuboid(cube1,D1,D2,…,D10;sum(…))
3 =A1.cuboid(cube2,D10,D9,…,D1;sum(…))
4 =A1.cgroups(D2;sum(...);D6>=230&&D6<=910&&D8>=10&&D8<=100&&...)

When cuboid creates pre-aggregated data, the order of the grouping dimensions matters: different dimension orders produce different pre-aggregated data sets. You can also select a suitable sorted data set manually in code, and store additional sorted copies.

In addition, SPL provides many other high-performance computing mechanisms that suit not only multidimensional analysis but also other data-processing scenarios, such as high-performance storage, order-based computing, and parallel computing. Combined, these capabilities deliver a more efficient data-processing experience.

As discussed above, pre-aggregation solves only a small set of relatively simple, fixed requirements in multidimensional analysis; the other common requirements need a computing engine such as SPL that performs efficient hard traversal. On top of that traversal capability, SPL's partial pre-aggregation and time-period pre-aggregation can better satisfy the performance and flexibility requirements of multidimensional analysis while minimizing storage cost.

Handling multidimensional analysis with SPL offers wide scenario coverage, high query performance, and low cost of use, making it an ideal technical solution.

SPL information

Welcome those who are interested in SPL to join the SPL technical exchange group.



Origin blog.csdn.net/weixin_44299027/article/details/128091528