Performance Optimization Techniques: Foreign Key Serialization

1. Problem background and applicable scenarios

In the article "Performance Optimization Techniques: Partial Pre-Association", we introduced keeping dimension tables in memory and pre-associating them. However, associating the fact table with a dimension table still requires computing and comparing hash values. How can this step be made faster? This article introduces another optimization technique: foreign key serialization.

The idea of foreign key serialization is this: if the primary key of a dimension table is the sequence of natural numbers starting from 1 (that is, each key equals the row number of its record), then a key value can be used directly as a row number to locate the dimension table record, with no HASH value to compute or compare. This speeds up the association with the dimension table and further improves performance. Moreover, because records are located directly by serial number, no index has to be built, so memory usage is much lower.
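To make the contrast concrete, here is a minimal Python sketch. It is not SPL and does not claim to show how SPL works internally. The tiny dimension table and the names `dim_rows`, `hash_index`, `lookup_by_hash` and `lookup_by_position` are made up for illustration.

```python
# Illustrative dimension table: the primary key equals the 1-based row number.
dim_rows = [
    {"key": 1, "name": "supplier-a"},
    {"key": 2, "name": "supplier-b"},
    {"key": 3, "name": "supplier-c"},
]

# Conventional association: build an index first (extra memory),
# then hash the key and compare on every lookup.
hash_index = {row["key"]: row for row in dim_rows}


def lookup_by_hash(key):
    return hash_index[key]


# Serialized association: no index, the key itself is the position.
def lookup_by_position(key):
    return dim_rows[key - 1]  # keys start at 1, Python lists start at 0


print(lookup_by_hash(2)["name"])
print(lookup_by_position(2)["name"])
```

Both functions return the same record; the second one simply skips the index and the hash step.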

Below we show how to use foreign key serialization in SPL. Using the test environment from the previous article and the same query, we compare the serialized data with the original data to verify the performance gain from serialization.

 

2. Preparation for serialization

To use foreign key serialization, the primary key values of the dimension table must be exactly the serial numbers (record row numbers). In real business data this is often not the case, so the primary key of the dimension table must first be converted to serial numbers. The conversion works as follows:

1) Create a key value-serial number correspondence table that records which natural serial number each key value of the dimension table maps to;

2) Replace the key values of the dimension table with the serial numbers to obtain a new dimension table file;

3) Replace the foreign key values in the fact table with serial numbers, looking them up in the key value-serial number correspondence table, to obtain a new fact table.
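The three steps above can be sketched in Python as follows. This is an illustration only, not the SPL implementation. The function `serialize`, the sample rows and the string key values such as "S-900" are assumptions made up for the example.

```python
def serialize(dim_rows, fact_rows, dim_key, fact_fk):
    """Return (key_to_seq, new_dim_rows, new_fact_rows)."""
    # Step 1: key value -> serial number (1-based row number) correspondence table.
    key_to_seq = {row[dim_key]: seq for seq, row in enumerate(dim_rows, start=1)}
    # Step 2: replace the dimension table key with its serial number.
    new_dim = [{**row, dim_key: key_to_seq[row[dim_key]]} for row in dim_rows]
    # Step 3: rewrite the foreign key in the fact table using the same table.
    new_fact = [{**row, fact_fk: key_to_seq[row[fact_fk]]} for row in fact_rows]
    return key_to_seq, new_dim, new_fact


# Made-up sample data; real key values would come from the data files.
suppliers = [
    {"S_SUPPKEY": "S-900", "S_NAME": "a"},
    {"S_SUPPKEY": "S-017", "S_NAME": "b"},
]
items = [
    {"L_SUPPKEY": "S-017", "L_QUANTITY": 5},
    {"L_SUPPKEY": "S-900", "L_QUANTITY": 2},
]

mapping, suppliers_xh, items_xh = serialize(suppliers, items, "S_SUPPKEY", "L_SUPPKEY")
print(mapping)
print(items_xh)
```

After this conversion, the foreign key of each fact row is the row number of the matching dimension record, so `suppliers_xh[fk - 1]` finds it directly.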

The three dimension tables used in this experiment are supplier, part, and orders, and the fact table is lineitem. They are serialized below.

1. Supplier serialization

The SPL script for serialization is as follows:

A
1 =file("/home/tbl/supplier.tbl").cursor(; , "|").new(_1:S_SUPPKEY, _2:S_NAME, _3:S_ADDRESS, _4:S_NATIONKEY, _5:S_PHONE, _6:S_ACCTBAL).fetch().derive(#:xh)
2 =file("/home/btx/supplier_xh_key.btx").export@b(A1,S_SUPPKEY,xh)
3 =A1.new(xh:S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL)
4 =file("/home/ctx/supplier_xh.ctx").create(#S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL)
5 >A4.append(A3.cursor())

A1 Read the data from the original data file supplier.tbl, and use the derive function to add a new column xh whose value is the row number

A2 Export the S_SUPPKEY and xh fields of A1 to the bin file supplier_xh_key.btx, producing the key value-serial number correspondence table that is used later when serializing the fact table

A3 Replace the S_SUPPKEY value with the xh value to rebuild the supplier table sequence

A4 Create the serialized composite table file supplier_xh.ctx

A5 Save the rebuilt supplier table sequence to the composite table file supplier_xh.ctx

2. Part serialization

The SPL script for serialization is as follows:

A
1 =file("/home/tbl/part.tbl").cursor(; , "|").new(_1:P_PARTKEY, _2:P_NAME, _3:P_MFGR, _4:P_BRAND, _5:P_TYPE, _6:P_SIZE, _7:P_CONTAINER, _8:P_RETAILPRICE).fetch().derive(#:xh)
2 =file("/home/btx/part_xh_key.btx").export@b(A1,P_PARTKEY,xh)
3 =A1.new(xh:P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE)
4 =file("/home/ctx/part_xh.ctx").create(#P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE)
5 >A4.append(A3.cursor())

The script works the same way as the supplier serialization script. The generated key value-serial number correspondence table is part_xh_key.btx, and the serialized composite table file is part_xh.ctx.

3. Orders serialization

The SPL script for serialization is as follows:

A
1 =file("/home/tbl/orders.tbl").cursor(; , "|").new(_1:O_ORDERKEY, _2:O_CUSTKEY, _3:O_ORDERSTATUS, _4:O_TOTALPRICE, _5:O_ORDERDATE, _6:O_ORDERPRIORITY, _7:O_SHIPPRIORITY).fetch().derive(#:xh)
2 =file("/home/btx/orders_xh_key.btx").export@b(A1,O_ORDERKEY,xh)
3 =A1.new(xh:O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_SHIPPRIORITY)
4 =file("/home/ctx/orders_xh.ctx").create(#O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_SHIPPRIORITY)
5 >A4.append(A3.cursor())

The script works the same way as the supplier serialization script. The generated key value-serial number correspondence table is orders_xh_key.btx, and the serialized composite table file is orders_xh.ctx.

4. Lineitem serialization

The SPL script for serialization is as follows:

A
1 =file("/home/tbl/lineitem.tbl").cursor(; , "|").new(_1:L_ORDERKEY, _4:L_LINENUMBER, _2:L_PARTKEY, _3:L_SUPPKEY, _5:L_QUANTITY, _6:L_EXTENDEDPRICE, _7:L_DISCOUNT, _8:L_TAX, _9:L_RETURNFLAG, _10:L_LINESTATUS, _11:L_SHIPDATE, _12:L_COMMITDATE, _13:L_RECEIPTDATE, _14:L_SHIPINSTRUCT, _15:L_SHIPMODE, _16:L_COMMENT)
2 =file("/home/btx/orders_xh_key.btx").import@b()
3 =file("/home/btx/part_xh_key.btx").import@b()
4 =file("/home/btx/supplier_xh_key.btx").import@b()
5 =A1.switch(L_ORDERKEY,A2:O_ORDERKEY;L_PARTKEY,A3:P_PARTKEY;L_SUPPKEY,A4:S_SUPPKEY)
6 =A5.run(L_ORDERKEY=L_ORDERKEY.xh, L_PARTKEY=L_PARTKEY.xh, L_SUPPKEY=L_SUPPKEY.xh)
7 =file("/home/ctx/lineitem_xh.ctx").create(#L_ORDERKEY, #L_LINENUMBER, L_PARTKEY, L_SUPPKEY, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT; L_ORDERKEY)
8 >A7.append(A6)

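A1 Create a cursor that reads the original lineitem data

A2/A3/A4 Read in the key value-serial number correspondence tables of orders, part, and supplier respectively

A5 Associate L_ORDERKEY with the orders correspondence table, L_PARTKEY with the part correspondence table, and L_SUPPKEY with the supplier correspondence table

A6 Replace the key values with the associated serial numbers to produce a new cursor

A7 Create the serialized composite table file lineitem_xh.ctx

A8 Write the serialized cursor data to the composite table lineitem_xh.ctx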

 

3. Serialization test

1. Test with the original data

The SPL script for preloading the dimension tables is as follows:

A
1 >env(supplier, file("/home/ctx/supplier.ctx").create().memory().keys@i(S_SUPPKEY))
2 >env(part, file("/home/ctx/part.ctx").create().memory().keys@i(P_PARTKEY))
3 >env(orders, file("/home/ctx/orders.ctx").create().memory().keys@i(O_ORDERKEY))

This loads the dimension tables and builds their indexes.

The SPL test script is as follows:

A
1 =file("/home/ctx/lineitem.ctx").create().cursor(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_EXTENDEDPRICE, L_DISCOUNT, L_SHIPDATE)
2 =A1.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)
3 =A2.select(L_ORDERKEY.O_TOTALPRICE>0 && L_PARTKEY.P_SIZE>0 && L_SUPPKEY.S_ACCTBAL<999999)
4 =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

Run the dimension table preloading script first, then the test script. The test script takes 450 seconds to run.

 

2. Test with the serialized data

The SPL script for preloading the dimension tables is as follows:

A
1 >env(supplier, file("/home/ctx/supplier_xh.ctx").create().memory())
2 >env(part, file("/home/ctx/part_xh.ctx").create().memory())
3 >env(orders, file("/home/ctx/orders_xh.ctx").create().memory())

The serialized dimension tables are loaded here, so no index needs to be built.

The SPL test script is as follows:

A
1 =file("/home/ctx/lineitem_xh.ctx").create().cursor(L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_EXTENDEDPRICE, L_DISCOUNT, L_SHIPDATE)
2 =A1.switch(L_ORDERKEY,orders:#;L_PARTKEY,part:#;L_SUPPKEY,supplier:#)
3 =A2.select(L_ORDERKEY.O_TOTALPRICE>0 && L_PARTKEY.P_SIZE>0 && L_SUPPKEY.S_ACCTBAL<999999)
4 =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

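Note: A2 uses ":#" when associating with the dimension tables, which means that the key value in the fact table is matched against the row number of the dimension table. If the key value is 7, for example, the record is associated directly with row 7 of the dimension table.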
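As a rough illustration of what the test query computes when the association is done by row number, here is a Python sketch. It is not SPL, it uses only the orders dimension for brevity, and all rows and values are made up.

```python
from collections import defaultdict
from datetime import date

# Made-up serialized dimension table: row i (1-based) holds the record whose key is i.
orders = [
    {"O_TOTALPRICE": 100.0},
    {"O_TOTALPRICE": 0.0},
    {"O_TOTALPRICE": 50.0},
]

# Made-up serialized fact rows: L_ORDERKEY is a row number into `orders`.
lineitem = [
    {"L_ORDERKEY": 1, "L_EXTENDEDPRICE": 200.0, "L_DISCOUNT": 0.5, "L_SHIPDATE": date(1995, 3, 1)},
    {"L_ORDERKEY": 2, "L_EXTENDEDPRICE": 999.0, "L_DISCOUNT": 0.0, "L_SHIPDATE": date(1995, 4, 1)},
    {"L_ORDERKEY": 3, "L_EXTENDEDPRICE": 80.0, "L_DISCOUNT": 0.25, "L_SHIPDATE": date(1995, 7, 9)},
    {"L_ORDERKEY": 3, "L_EXTENDEDPRICE": 100.0, "L_DISCOUNT": 0.0, "L_SHIPDATE": date(1996, 1, 5)},
]


def revenue_by_year(fact_rows, dim_rows):
    totals = defaultdict(float)
    for row in fact_rows:
        # Row-number association: key value k refers to row k of the dimension table.
        order = dim_rows[row["L_ORDERKEY"] - 1]
        # Filter on a dimension field, then group by year and sum.
        if order["O_TOTALPRICE"] > 0:
            year = row["L_SHIPDATE"].year
            totals[year] += row["L_EXTENDEDPRICE"] * (1 - row["L_DISCOUNT"])
    return dict(totals)


result = revenue_by_year(lineitem, orders)
print(result)
```

The second fact row points at an order whose total price is 0, so the filter drops it; the remaining rows are summed per year.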

Run the dimension table preloading script first, then the test script. The test script takes 269 seconds to run.

 

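3. Analysis and conclusion

In the two controlled experiments above, the serialized data and the original data have exactly the same number of records and fields in the dimension tables, and exactly the same number of records and fields in the fact table; only the relevant key values have been replaced with serial numbers. The two queries use exactly the same filter conditions, and the data that takes part in the calculation after filtering is the same, so filtering and grouped summation take the same time. The only difference is the association method (locating by row number versus comparing hash values of the key), and the running time drops by 450-269=181 seconds. Foreign key serialization therefore improves performance significantly.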
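For reference, the relative gain implied by the two measured times can be worked out as below. This small Python calculation uses only the 450 and 269 second figures reported above and adds no new measurements.

```python
before_s = 450  # running time with hash-based association, in seconds
after_s = 269   # running time with row-number association, in seconds

saved_s = before_s - after_s                 # absolute saving
reduction_pct = 100 * saved_s / before_s     # share of the original time saved
speedup = before_s / after_s                 # how many times faster

print(saved_s, round(reduction_pct, 1), round(speedup, 2))
```

In other words, the serialized query takes roughly 40% less time, or runs about 1.67 times as fast, in this particular test.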

 

4. Further notes

The field to be serialized must be the primary key of the dimension table, but its data type is not restricted: integers, strings, dates, times and so on can all be serialized. For a dimension table with a multi-field primary key, you can add a serial number field, create a correspondence table between the multi-field key values and the serial numbers, and serialize the fact table accordingly.
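One possible reading of the multi-field case is sketched below in Python. It is not SPL; the tables `price_list` and `sales`, the field names, and the helper functions are all hypothetical.

```python
def build_composite_mapping(dim_rows, key_fields):
    """Map each multi-field key (as a tuple) to its 1-based row number."""
    return {
        tuple(row[f] for f in key_fields): seq
        for seq, row in enumerate(dim_rows, start=1)
    }


def serialize_fact(fact_rows, fk_fields, mapping, seq_field):
    """Replace the multi-field foreign key with a single serial number field."""
    out = []
    for row in fact_rows:
        seq = mapping[tuple(row[f] for f in fk_fields)]
        rest = {k: v for k, v in row.items() if k not in fk_fields}
        out.append({seq_field: seq, **rest})
    return out


# Made-up dimension table whose primary key is (region, sku).
price_list = [
    {"region": "EU", "sku": "A1", "price": 10},
    {"region": "US", "sku": "A1", "price": 12},
]
# Made-up fact table that references it through the same two fields.
sales = [{"region": "US", "sku": "A1", "qty": 3}]

mapping = build_composite_mapping(price_list, ["region", "sku"])
sales_xh = serialize_fact(sales, ["region", "sku"], mapping, "price_row")
print(sales_xh)
```

The fact table ends up with one serial number field in place of the two key fields, so the lookup `price_list[price_row - 1]` works the same way as in the single-key case.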

Generally speaking, foreign key serialization is easy to apply to queries on historical data: the historical data only has to be serialized once, and the key value-serial number correspondence tables do not need to be kept afterwards.

Foreign key serialization also applies to queries on data that keeps growing, although this takes a few more steps.

1. Both dimension tables and fact tables have new data

1) First get the new dimension table records and add their key values to the key value-serial number correspondence table;

2) Then append the new records to the serialized dimension table, using the key value-serial number correspondence table;

3) Then append the new fact table records to the serialized fact table, again using the key value-serial number correspondence table.

2. Only the fact table has new data

In the case that the dimension table data remains unchanged, only step 3) above is required.
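The incremental steps above can be sketched in Python as follows. This is an illustration under the assumption that new dimension records are always appended at the end, so existing serial numbers never change; the helper names and sample rows are made up.

```python
def append_dimension_rows(key_to_seq, dim_xh, new_dim_rows, dim_key):
    """Steps 1) and 2): extend the correspondence table and the serialized dimension table."""
    for row in new_dim_rows:
        seq = len(dim_xh) + 1              # next free row number
        key_to_seq[row[dim_key]] = seq     # step 1: record key value -> serial number
        dim_xh.append({**row, dim_key: seq})  # step 2: append with the serial number as key


def append_fact_rows(key_to_seq, fact_xh, new_fact_rows, fact_fk):
    """Step 3): translate foreign keys of new fact rows and append them."""
    for row in new_fact_rows:
        fact_xh.append({**row, fact_fk: key_to_seq[row[fact_fk]]})


# Existing serialized state (made-up values).
key_to_seq = {"S-900": 1}
supplier_xh = [{"S_SUPPKEY": 1, "S_NAME": "a"}]
lineitem_xh = [{"L_SUPPKEY": 1, "L_QUANTITY": 2}]

# Case 1: both tables receive new data.
append_dimension_rows(key_to_seq, supplier_xh, [{"S_SUPPKEY": "S-017", "S_NAME": "b"}], "S_SUPPKEY")
append_fact_rows(key_to_seq, lineitem_xh, [{"L_SUPPKEY": "S-017", "L_QUANTITY": 4}], "L_SUPPKEY")

# Case 2: only the fact table receives new data, so only step 3) runs.
append_fact_rows(key_to_seq, lineitem_xh, [{"L_SUPPKEY": "S-900", "L_QUANTITY": 7}], "L_SUPPKEY")
print(lineitem_xh)
```

Note that, unlike the one-off conversion of historical data, this scenario needs the correspondence table to be kept so that later fact records can be translated.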

 

After the new data has been processed, queries can use foreign key serialized association as before.



Origin blog.51cto.com/12749034/2595012