Performance Optimization Techniques: Association When the Dimension Table Is Filtered or Computed

When a query associates a fact table with a dimension table, you often need to filter the dimension table or perform calculations on it. There are two ways to handle this:

1. Associate first (in memory, this can be a pre-association), and then filter the associated fact table, as in "Performance Optimization Techniques: Pre-Association" and "Performance Optimization Techniques: Foreign Key Serialization".

2. Filter the dimension table first, and then associate it with the fact table. Association requires an index on the dimension table. After filtering, the original index can no longer be used directly, so a new index has to be built on the filtered result.

Which of the two methods is better cannot be stated in general; it depends on the relative sizes of the dimension table and the fact table. Let's explore the effect of these techniques through experiments.
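To make the difference between the two methods concrete, here is a minimal Python sketch (not SPL). The table contents, field names, and function names are all hypothetical. It counts how often the filter condition is evaluated: once per fact row in method 1, once per dimension row in method 2.

```python
# Toy data: 3 dimension rows and 6 fact rows (hypothetical values, not TPC-H data).
customers = [
    {"key": 1, "acctbal": 50.0},
    {"key": 2, "acctbal": 900.0},
    {"key": 3, "acctbal": 1500.0},
]
orders = [
    {"custkey": 1, "totalprice": 10.0},
    {"custkey": 2, "totalprice": 20.0},
    {"custkey": 2, "totalprice": 30.0},
    {"custkey": 3, "totalprice": 40.0},
    {"custkey": 3, "totalprice": 50.0},
    {"custkey": 1, "totalprice": 60.0},
]

def join_then_filter(orders, customers, bal):
    """Method 1: associate first, then evaluate the filter on every fact row."""
    index = {c["key"]: c for c in customers}
    evaluations = 0
    total = 0.0
    for o in orders:
        c = index[o["custkey"]]
        evaluations += 1              # the filter runs once per fact row
        if c["acctbal"] > bal:
            total += o["totalprice"]
    return total, evaluations

def filter_then_join(orders, customers, bal):
    """Method 2: filter the dimension table, index the survivors, then associate."""
    evaluations = 0
    kept = {}
    for c in customers:
        evaluations += 1              # the filter runs once per dimension row
        if c["acctbal"] > bal:
            kept[c["key"]] = c        # new index over the filtered records
    total = sum(o["totalprice"] for o in orders if o["custkey"] in kept)
    return total, evaluations
```

Both functions return the same total, but with 6 fact rows and 3 dimension rows the first evaluates the filter 6 times and the second only 3 times. The second pays for building a new index instead.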

 

1. Test environment

The test data consists of the 8 tables generated according to the TPC-H standard, 50 GB in total. The TPC-H table structures are widely documented online, so they are not repeated here.

The test machine has two Intel 2670 CPUs running at 2.6 GHz with 16 cores in total, 128 GB of memory, and an SSD.

To make the differences easier to see, all the tests below run single-threaded; the multiple cores are not used.

 

2. All data tables in memory

"All in memory" means that every data table to be used is loaded into memory in advance. We use customer as the dimension table, with 7.5 million records, and orders as the fact table, with 75 million records.

The filter condition on the dimension table is left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal, and the query computes the total price of the orders that meet this condition. The first two conditions are always true; they are included only to increase the cost of filtering the dimension table and make the comparison clearer. bal is a parameter used to vary the number of records that remain after the dimension table is filtered.

1. Pre-association

Let's first look at pre-association. The SPL script is as follows:

  A
1 >customer=file("/home/ctx/customer.ctx").create().memory().keys@i(C_CUSTKEY)
2 >orders=file("/home/ctx/orders.ctx").create().memory()
3 =orders.switch(O_CUSTKEY,customer)
4 =now()
5 =orders.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal)
6 =A5.sum(O_TOTALPRICE)
7 =interval@s(A4,now())

A1 reads the dimension table and creates the index, A2 reads the fact table, and A3 performs the pre-association. These steps are not included in the measured time; timing starts at A4.
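As a rough Python analogy (not SPL), pre-association can be pictured as replacing each foreign key value with a direct reference to the dimension record, so the timed query no longer performs any lookup. The function names and toy records below are hypothetical; only the field names are taken from the script.

```python
def pre_associate(orders, customers):
    """Replace each order's foreign key value with a reference to its customer record."""
    index = {c["C_CUSTKEY"]: c for c in customers}   # analogous to creating the index
    for o in orders:
        o["O_CUSTKEY"] = index[o["O_CUSTKEY"]]        # analogous to the switch step
    return orders

def query(orders, bal):
    """Timed part: filter through the reference and sum, with no lookup at query time."""
    return sum(o["O_TOTALPRICE"] for o in orders
               if o["O_CUSTKEY"]["C_ACCTBAL"] > bal)

# Toy records (hypothetical values).
customers = [{"C_CUSTKEY": 10, "C_ACCTBAL": 5.0},
             {"C_CUSTKEY": 20, "C_ACCTBAL": 500.0}]
orders = [{"O_CUSTKEY": 10, "O_TOTALPRICE": 1.0},
          {"O_CUSTKEY": 20, "O_TOTALPRICE": 2.0},
          {"O_CUSTKEY": 20, "O_TOTALPRICE": 3.0}]

pre_associate(orders, customers)   # done once, outside the timed section
total = query(orders, 100.0)       # only this part corresponds to the measured time
```

Note that the filter in query still runs once per order, which is why this approach becomes expensive when the fact table is much larger than the dimension table.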

2. Rebuild the index

Write the SPL script as follows:

  A
1 >customer=file("/home/ctx/customer.ctx").create().memory().keys@i(C_CUSTKEY)
2 >orders=file("/home/ctx/orders.ctx").create().memory()
3 =now()
4 =customer.select(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal).derive@o().keys@i(C_CUSTKEY)
5 =orders.switch@i(O_CUSTKEY,A4)
6 =A5.sum(O_TOTALPRICE)
7 =interval@s(A3,now())

A4 filters customer and rebuilds the index on the result; A5 performs the association.

 

3. Reuse Index

SPL supports reusing the existing index after filtering. Just change the A4 cell script above to:

=customer.select@i(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)

The @i option on select reuses customer's original index.
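The article does not describe how SPL reuses the index internally. The Python sketch below is only one way to picture why reuse can be cheaper than rebuilding, under the assumption that the original index maps each key to a record position. All function names are hypothetical.

```python
def build_index(customers):
    """Hash index: key -> position of the record in the table."""
    return {c["key"]: pos for pos, c in enumerate(customers)}

def rebuild(customers, predicate):
    """Filter, then hash every surviving key again to build a new index."""
    kept = [c for c in customers if predicate(c)]
    return kept, build_index(kept)

def reuse(customers, index, predicate):
    """Filter, but keep the old index; only store one pass/fail flag per record."""
    passed = [predicate(c) for c in customers]   # no hashing needed here
    def lookup(key):
        pos = index.get(key)
        if pos is None or not passed[pos]:
            return None                          # unknown key or filtered-out record
        return customers[pos]
    return lookup

# Toy records (hypothetical values).
customers = [{"key": 7, "bal": 1}, {"key": 8, "bal": 9}]
index = build_index(customers)
lookup = reuse(customers, index, lambda c: c["bal"] > 5)
kept, new_index = rebuild(customers, lambda c: c["bal"] > 5)
```

In this picture both variants evaluate the filter the same number of times; the reuse variant simply skips the second round of index construction.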

 

4. Foreign key serialization

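When preloading the data tables, load the serialized composite tables customer_xh.ctx and orders_xh.ctx, in which the foreign key values have already been converted to row numbers of the dimension table.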

Write the SPL script as follows:

  A
1 >customer=file("/home/ctx/customer_xh.ctx").create().memory()
2 >orders=file("/home/ctx/orders_xh.ctx").create().memory()
3 =now()
4 =orders.switch@i(O_CUSTKEY,customer:#)
5 =A4.select(left(O_CUSTKEY.C_NAME,4)!="shen" && O_CUSTKEY.C_NATIONKEY>-1 && O_CUSTKEY.C_ACCTBAL>bal)
6 =A5.sum(O_TOTALPRICE)
7 =interval@s(A3,now())

Association by row number needs no index, so A1 does not create one. In A4, customer:# means that the value of O_CUSTKEY is associated with the row number of customer.
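The idea of associating by row number can be sketched in Python (not SPL) as plain positional access. The sketch assumes that row numbers start at 1; the function name and toy records are hypothetical.

```python
# Toy serialized tables (hypothetical values).
customers = [{"C_NAME": "a", "C_ACCTBAL": 5.0},
             {"C_NAME": "b", "C_ACCTBAL": 500.0}]
# O_CUSTKEY holds the 1-based row number of the customer, not the original key value.
orders = [{"O_CUSTKEY": 2, "O_TOTALPRICE": 3.0},
          {"O_CUSTKEY": 1, "O_TOTALPRICE": 4.0}]

def associate_by_row_number(orders, customers):
    """Resolve each foreign key by position: no hash index and no hash computation."""
    for o in orders:
        o["O_CUSTKEY"] = customers[o["O_CUSTKEY"] - 1]   # 1-based row number -> list slot
    return orders

associate_by_row_number(orders, customers)
total = sum(o["O_TOTALPRICE"] for o in orders
            if o["O_CUSTKEY"]["C_ACCTBAL"] > 100.0)
```

Positional access replaces the hash lookup, but the filter is still evaluated once per fact row after the association.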

 

5. Alignment sequence after serialization

As in the previous test, the serialized composite tables customer_xh.ctx and orders_xh.ctx are preloaded.

Write the SPL script as follows:

  A
1 >customer=file("/home/ctx/customer_xh.ctx").create().memory()
2 >orders=file("/home/ctx/orders_xh.ctx").create().memory()
3 =now()
4 =customer.(left(C_NAME,4)!="shen" && C_NATIONKEY>-1 && C_ACCTBAL>bal)
5 =orders.select(A4(O_CUSTKEY))
6 =A5.sum(O_TOTALPRICE)
7 =interval@s(A3,now())

In A4, customer.(filter condition) computes a sequence of true/false values with one member per customer record, which we call an alignment sequence. O_CUSTKEY in the orders table has been serialized, so its value is the row number of the corresponding customer record. A5 can therefore use A4(O_CUSTKEY) to determine whether a row of orders meets the filter condition.
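The alignment sequence can be sketched in Python (not SPL) as a list of booleans indexed by row number. The sketch assumes 1-based row numbers; the function names and toy records are hypothetical.

```python
# Toy serialized tables (hypothetical values). O_CUSTKEY is a 1-based row number.
customers = [
    {"C_ACCTBAL": 50.0},
    {"C_ACCTBAL": 900.0},
    {"C_ACCTBAL": 1500.0},
]
orders = [
    {"O_CUSTKEY": 3, "O_TOTALPRICE": 10.0},
    {"O_CUSTKEY": 1, "O_TOTALPRICE": 20.0},
    {"O_CUSTKEY": 2, "O_TOTALPRICE": 30.0},
    {"O_CUSTKEY": 3, "O_TOTALPRICE": 40.0},
]

def alignment_sequence(customers, bal):
    """One true/false flag per dimension record, in record order."""
    return [c["C_ACCTBAL"] > bal for c in customers]

def filtered_total(orders, flags):
    """Filter the fact rows by looking up the flag at the foreign key's row number."""
    return sum(o["O_TOTALPRICE"] for o in orders if flags[o["O_CUSTKEY"] - 1])

flags = alignment_sequence(customers, 100.0)
total = filtered_total(orders, flags)
```

The filter condition is evaluated once per dimension record, and each fact row costs only one positional read of a boolean, with no association step at all.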

6. Test results and analysis

The test results obtained in the experiment are as follows (unit: second):

Records after dimension table filtering   7.16 million   6.13 million   4.77 million   2.73 million   0.68 million
Pre-association                           41             39             38             37             35
Rebuild index                             39             34             29             25             19
Reuse index                               35             31             27             23             17
Foreign key serialization                 53             51             49             48             46
Alignment sequence                        25             23             21             19             16

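In this experiment, the dimension table has 7.5 million records and the fact table orders has 75 million records, 10 times as many as the dimension table.

In the pre-association and foreign key serialization tests, the association is done first and the filtering afterward, so the complex filter computation runs on the rows of the fact table. That is 10 times the computation needed to filter the dimension table directly, which is why these two queries take the longest. Of the two, pre-association skips the association step at query time, so it is faster than foreign key serialization.

In the rebuild-index and reuse-index tests, the dimension table is filtered first and then associated with the fact table, so the complex filter computation runs only on the rows of the dimension table. This makes both faster than pre-association and foreign key serialization. Compared with rebuilding the index, reusing the index involves the same amount of filtering, association, and summing but saves the time spent creating the index, so the query is faster. As the filtered dimension table gets smaller, rebuilding the index takes less time and the overall gap narrows.

In the alignment sequence test, the filter computation also runs only on the rows of the dimension table. Once the alignment sequence has been computed, the fact table is filtered in a single pass, with no association, no index building, and no hash computation, so this method is the fastest.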

 

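3. Dimension table in memory, fact table on external storage

This time we use orders as the dimension table, with 75 million records, and lineitem as the fact table, with 300 million records.

The filter condition on the dimension table is left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price, and the query computes the total price for the orders that meet this condition (the scripts sum L_EXTENDEDPRICE). The first three conditions are always true; they are included only to increase the cost of filtering the dimension table and make the comparison clearer. price is a parameter used to vary the number of records that remain after the dimension table is filtered.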

 

1. Filter after association

Let's first look at filtering after association. The SPL script is as follows:

  A
1 >orders=file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY)
2 =now()
3 =file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE)
4 =A3.switch@i(L_ORDERKEY,orders)
5 =A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price)
6 =A5.total(sum(L_EXTENDEDPRICE))
7 =interval@s(A2,now())

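A1 reads the dimension table and creates the index. This is not included in the measured time; timing starts at A2.

Because the fact table is large, it is read through a cursor, associated with the dimension table, and then filtered.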
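The cursor-based flow can be pictured with a short Python sketch (not SPL): the fact rows are consumed batch by batch and are never all held in memory, while the dimension table stays in memory as an index. The cursor helper, the batch size, and the toy records are hypothetical; only the field names come from the script.

```python
from itertools import islice

def cursor(rows, batch_size):
    """Yield the fact rows batch by batch, so they are never all in memory at once."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def total_after_join_and_filter(fact_rows, dim_index, price, batch_size=2):
    """Associate each streamed fact row with the in-memory dimension table, then filter."""
    total = 0.0
    for batch in cursor(fact_rows, batch_size):
        for row in batch:
            order = dim_index.get(row["L_ORDERKEY"])
            # Rows without a matching dimension record are dropped.
            if order is not None and order["O_TOTALPRICE"] > price:
                total += row["L_EXTENDEDPRICE"]
    return total

# In-memory dimension index (hypothetical values).
dim_index = {1: {"O_TOTALPRICE": 100.0}, 2: {"O_TOTALPRICE": 10.0}}
# A generator stands in for the fact table on external storage.
lineitems = ({"L_ORDERKEY": k, "L_EXTENDEDPRICE": p}
             for k, p in [(1, 5.0), (2, 7.0), (1, 6.0), (3, 8.0), (1, 1.0)])

total = total_after_join_and_filter(lineitems, dim_index, 50.0)
```

Here the filter still runs once per fact row, so the cost of the filter scales with the size of the fact table.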

2. Rebuild the index

The SPL script is as follows:

  A
1 >orders=file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY)
2 =now()
3 =orders.select(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price).derive@o().keys@i(O_ORDERKEY)
4 =file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE).switch@i(L_ORDERKEY,A3)
5 =A4.total(sum(L_EXTENDEDPRICE))
6 =interval@s(A2,now())

A3 filters orders and then rebuilds the index.

 

3. Reuse index

Just change the A3 cell script above to:

=orders.select@i(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price)

The @i option on select reuses the original index of orders.

 

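4. Foreign key serialization

When preloading the data table, load the serialized composite table orders_xh.ctx; no index needs to be created.

The SPL script is as follows: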

  A
1 >orders=file("/home/ctx/orders_xh.ctx").create().memory()
2 =now()
3 =file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE)
4 =A3.switch@i(L_ORDERKEY,orders:#)
5 =A4.select(left(L_ORDERKEY.O_ORDERPRIORITY,2)!="9-" && L_ORDERKEY.O_ORDERSTATUS!="A" && L_ORDERKEY.O_ORDERDATE>date("1990-01-01") && L_ORDERKEY.O_TOTALPRICE>price)
6 =A5.total(sum(L_EXTENDEDPRICE))
7 =interval@s(A2,now())

In A4, orders:# means that the value of L_ORDERKEY is associated with the row number of orders.

 

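5. Alignment sequence after serialization

When preloading the data table, load the serialized composite table orders_xh.ctx; no index needs to be created.

The SPL script is as follows: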

  A
1 >orders=file("/home/ctx/orders_xh.ctx").create().memory()
2 =now()
3 =orders.(left(O_ORDERPRIORITY,2)!="9-" && O_ORDERSTATUS!="A" && O_ORDERDATE>date("1990-01-01") && O_TOTALPRICE>price)
4 =file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY,L_EXTENDEDPRICE).select(A3(L_ORDERKEY))
5 =A4.total(sum(L_EXTENDEDPRICE))
6 =interval@s(A2,now())

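The query works on the same principle as in the all-in-memory case.

6. Test results and analysis

The test results obtained in the experiment are as follows (unit: second):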

Records after dimension table filtering   64.43 million  49.95 million  35.90 million  22.49 million  4.28 million
Filter after association                  101            98             97             94             92
Rebuild index                             102            98             92             73             53
Reuse index                               85             82             77             74             57
Foreign key serialization                 79             78             76             75             72
Alignment sequence                        53             49             47             43             39

In this experiment, the dimension table has 75 million records and the fact table lineitem has 300 million records, 4 times as many as the dimension table.

The queries work on the same principles as analyzed in the previous section, but the ratio of fact table size to dimension table size has dropped from 10 to 4. As a result, the speed gap between foreign key serialization and reusing the index is no longer very obvious. When the filter leaves most of the dimension records in place, foreign key serialization is even slightly faster, because association by row number is cheaper than comparing hash values.

 

4. Summary

Based on the test results and analysis above, we summarize which optimization technique gives the best performance for queries that filter or compute on the dimension table.

1. The fact table has fewer records than the dimension table

1) If all the data tables fit in memory, use pre-association.

2) If they do not fit in memory but the dimension table and foreign key are serialized, associate by row number first and then filter the fact table.

3) If they do not fit in memory and no serialization has been done, associate by foreign key value first and then filter the fact table.

 

2. The fact table has far more records than the dimension table

1) If the data tables are serialized, use the alignment sequence technique.

2) If the data tables are not serialized, filter the dimension table first and reuse the index, and then associate by foreign key value.

 

3. The fact table is not much larger than the dimension table

1) If the data tables are serialized, use the alignment sequence technique.

2) If the data tables are not serialized, either pre-association (if the data fits in memory) or reusing the index may be better; it is best to measure both.
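The summary can be written down as a small decision helper. The Python sketch below is only an illustration: the function name, its parameters, and the returned strings are hypothetical. In particular, the default of much_larger_ratio=10.0 is an assumption chosen to match the 10x case in the first experiment; the article gives no numeric threshold for "far more".

```python
def choose_technique(fact_rows, dim_rows, fits_in_memory, serialized,
                     much_larger_ratio=10.0):
    """Map the three cases of the summary to a suggested technique.

    much_larger_ratio is a hypothetical threshold for "far more records".
    """
    if fact_rows < dim_rows:
        # Case 1: associate first, then filter the (smaller) fact table.
        if fits_in_memory:
            return "pre-association"
        if serialized:
            return "associate by row number, then filter the fact table"
        return "associate by foreign key value, then filter the fact table"
    if serialized:
        # Cases 2 and 3 agree when the tables are serialized.
        return "alignment sequence"
    if fact_rows >= much_larger_ratio * dim_rows:
        # Case 2 without serialization.
        return "filter the dimension table and reuse the index"
    # Case 3 without serialization: no clear winner.
    return "measure pre-association (if in memory) against reusing the index"
```

For example, with the table sizes of the first experiment (75 million fact rows, 7.5 million dimension rows) and no serialization, the helper suggests filtering the dimension table and reusing the index.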



Origin blog.51cto.com/12749034/2602014