1. Problem background and applicable scenarios
In "Performance Optimization Techniques: Pre-Association", we tested query performance after loading all data tables into memory in advance and associating them. However, if memory is not large enough, the dimension tables and fact tables cannot all be loaded. What should we do then? In this case, the dimension tables can be pre-loaded into memory and indexed, so that the dimension-table part of the association is done in advance, saving half of the hash calculation.
Let's test this scenario. This time we use lineitem, the table with the largest amount of data, and assume it does not fit in memory. In the SPL partial pre-association test, the six dimension tables are pre-loaded into memory, and lineitem is read in real time during the query.
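One way to read the "half of the hash calculation" claim is sketched below in Python rather than SPL. A hash join done entirely at query time has to build a hash table on the dimension side and then probe it with the fact rows; if the index already exists in memory, only the probe remains. The function names and sample rows are hypothetical, and plain dicts stand in for indexed in-memory tables.

```python
# Illustrative only: plain Python dicts stand in for indexed in-memory tables.

def build_index(rows, key):
    """Hash every dimension row once by its primary key."""
    return {row[key]: row for row in rows}

def probe(fact_rows, index, fk):
    """Probe phase only: look up each fact row's foreign key in an existing index."""
    for row in fact_rows:
        dim = index.get(row[fk])
        if dim is not None:          # inner-join semantics: drop unmatched rows
            yield row, dim

def join_at_query_time(fact_rows, dim_rows, fk, pk):
    """Classic hash join: the build phase is repeated for every query."""
    index = build_index(dim_rows, pk)
    return probe(fact_rows, index, fk)

# Hypothetical sample data.
part_rows = [{"P_PARTKEY": 1, "P_TYPE": "STEEL"}, {"P_PARTKEY": 2, "P_TYPE": "TIN"}]
lineitem_rows = [
    {"L_PARTKEY": 1, "L_EXTENDEDPRICE": 100.0},
    {"L_PARTKEY": 2, "L_EXTENDEDPRICE": 50.0},
    {"L_PARTKEY": 9, "L_EXTENDEDPRICE": 10.0},   # no matching part
]

# Built once at startup, then reused by every query.
PART_INDEX = build_index(part_rows, "P_PARTKEY")

# At query time only the probe runs.
matched = list(probe(lineitem_rows, PART_INDEX, "L_PARTKEY"))
```

Both paths return the same pairs; the difference is only whether the build step is paid per query or once at startup.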
2. SQL test
The Oracle database is again used as the representative for the SQL test. The query computes the total annual revenue of parts orders from the lineitem table.
1. Two-table association
The query SQL statement is as follows:
select
    l_year,
    sum(volume) as revenue
from
    (
        select
            extract(year from l_shipdate) as l_year,
            (l_extendedprice * (1 - l_discount)) as volume
        from
            lineitem,
            part
        where
            p_partkey = l_partkey
            and length(p_type) > 2
    ) shipping
group by
    l_year
order by
    l_year;
2. Six-table association
The query SQL statement is as follows:
select
    l_year,
    sum(volume) as revenue
from
    (
        select
            extract(year from l_shipdate) as l_year,
            (l_extendedprice * (1 - l_discount)) as volume
        from
            supplier,
            lineitem,
            orders,
            customer,
            part,
            nation n1,
            nation n2
        where
            s_suppkey = l_suppkey
            and p_partkey = l_partkey
            and o_orderkey = l_orderkey
            and c_custkey = o_custkey
            and s_nationkey = n1.n_nationkey
            and c_nationkey = n2.n_nationkey
            and length(p_type) > 2
            and n1.n_name is not null
            and n2.n_name is not null
            and s_suppkey > 0
    ) shipping
group by
    l_year
order by
    l_year;
3. Test results
 | Two-table association | Six-table association |
Running time (seconds) | 235 | 2669 |
Both figures are the fastest of multiple runs.
The test results show that the six-table association takes 2669/235 = 11.4 times as long as the two-table association, a substantial drop in performance.
3. SPL partial pre-association test
1. Partial pre-association
The SPL script for pre-association is as follows:
 | A |
1 | >env(region, file(path+"region.ctx").create().memory().keys@i(R_REGIONKEY)) |
2 | >env(nation, file(path+"nation.ctx").create().memory().keys@i(N_NATIONKEY)) |
3 | >env(supplier, file(path+"supplier.ctx").create().memory().keys@i(S_SUPPKEY)) |
4 | >env(customer, file(path+"customer.ctx").create().memory().keys@i(C_CUSTKEY)) |
5 | >env(part, file(path+"part.ctx").create().memory().keys@i(P_PARTKEY)) |
6 | >env(orders,file(path+"orders.ctx").create().memory().keys@i(O_ORDERKEY)) |
7 | >nation.switch(N_REGIONKEY,region) |
8 | >customer.switch(C_NATIONKEY,nation) |
9 | >supplier.switch(S_NATIONKEY,nation) |
10 | >orders.switch(O_CUSTKEY,customer) |
The first 6 lines of the script read the 6 dimension tables into memory, generate in-memory tables, build indexes on their primary keys, and register them as global variables. The last 4 lines complete the associations between the dimension tables. Run this script once when the SPL server starts to prepare the environment.
2. Two-table association
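The effect of the startup script can be modeled roughly as follows. This is a Python illustration, not SPL: `load_indexed` and `switch` are hypothetical helpers, the sample rows are made up, and the choice to turn unmatched keys into `None` is an assumption of the sketch. The point is that each foreign-key value is replaced by a direct reference to the target row, so later queries follow references instead of joining.

```python
# Illustrative Python model of the startup script; not SPL.

def load_indexed(rows, key):
    """Keep a table in memory, indexed by its primary key."""
    return {row[key]: row for row in rows}

def switch(table, fk, target_index):
    """Replace each foreign-key value with a reference to the target row.
    Unmatched keys become None in this sketch (an assumption)."""
    for row in table.values():
        row[fk] = target_index.get(row[fk])

# Hypothetical sample rows.
region = load_indexed([{"R_REGIONKEY": 0, "R_NAME": "ASIA"}], "R_REGIONKEY")
nation = load_indexed(
    [{"N_NATIONKEY": 7, "N_NAME": "CHINA", "N_REGIONKEY": 0}], "N_NATIONKEY")
customer = load_indexed([{"C_CUSTKEY": 42, "C_NATIONKEY": 7}], "C_CUSTKEY")
orders = load_indexed([{"O_ORDERKEY": 1000, "O_CUSTKEY": 42}], "O_ORDERKEY")

# Dimension-to-dimension associations, done once at startup.
switch(nation, "N_REGIONKEY", region)
switch(customer, "C_NATIONKEY", nation)
switch(orders, "O_CUSTKEY", customer)

# A query can now walk from an order to its customer's nation name directly.
name = orders[1000]["O_CUSTKEY"]["C_NATIONKEY"]["N_NAME"]
```

Because the references are set once and shared, no dimension-to-dimension lookup is repeated at query time.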
Write the SPL script as follows:
 | A |
1 | =file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 | =A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2) |
3 | =A2.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
Because lineitem is loaded on the fly, it must be read through a cursor, and the association is performed on the cursor. Beyond that, the script is written much like the fully in-memory version.
3. Six-table association
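The cursor-based flow can be pictured with a Python generator: fact rows stream past one at a time, each is matched against the in-memory part index, filtered, and accumulated by year, so lineitem never has to be held in memory as a whole. This is a minimal sketch with made-up rows; `lineitem_cursor` and `revenue_by_year` are hypothetical names, not SPL functions.

```python
from collections import defaultdict
from datetime import date

# Pre-loaded dimension index (hypothetical sample rows).
PART = {
    1: {"P_PARTKEY": 1, "P_TYPE": "STEEL"},
    2: {"P_PARTKEY": 2, "P_TYPE": "X"},      # type too short, filtered out
}

def lineitem_cursor():
    """Stand-in for a file cursor: yields one fact row at a time."""
    yield {"L_PARTKEY": 1, "L_EXTENDEDPRICE": 100.0, "L_DISCOUNT": 0.5,
           "L_SHIPDATE": date(1995, 3, 1)}
    yield {"L_PARTKEY": 1, "L_EXTENDEDPRICE": 200.0, "L_DISCOUNT": 0.25,
           "L_SHIPDATE": date(1996, 7, 9)}
    yield {"L_PARTKEY": 2, "L_EXTENDEDPRICE": 50.0, "L_DISCOUNT": 0.0,
           "L_SHIPDATE": date(1995, 1, 2)}
    yield {"L_PARTKEY": 9, "L_EXTENDEDPRICE": 10.0, "L_DISCOUNT": 0.0,
           "L_SHIPDATE": date(1996, 1, 2)}   # no matching part

def revenue_by_year(cursor, part_index):
    """Associate, filter and aggregate in a single pass over the cursor."""
    totals = defaultdict(float)
    for row in cursor:
        part = part_index.get(row["L_PARTKEY"])
        if part is None:                     # unmatched rows are dropped
            continue
        if len(part["P_TYPE"]) <= 2:         # same filter as the queries above
            continue
        volume = row["L_EXTENDEDPRICE"] * (1 - row["L_DISCOUNT"])
        totals[row["L_SHIPDATE"].year] += volume
    return dict(sorted(totals.items()))

result = revenue_by_year(lineitem_cursor(), PART)
```

Only the running totals and the dimension index stay in memory; each fact row is discarded after it is processed.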
Write the SPL script as follows:
 | A |
1 | =file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE) |
2 | =A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier) |
3 | =A2.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME != null && L_SUPPKEY.S_SUPPKEY>0 ) |
4 | =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
As before, once the cursor is created and associated, the rest of the script is written much like the fully in-memory version, which keeps it concise and easy to understand.
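The chained field references in the filter, such as L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME, can be read as following references from one row to the next. The Python sketch below models that under stated assumptions: the dimension rows are made up and already linked to each other, and `resolve` and `keep` are hypothetical helpers that mirror the association step and the filter step respectively.

```python
# Illustrative only: dimension rows already linked by reference (sample data).
nation = {7: {"N_NATIONKEY": 7, "N_NAME": "CHINA"},
          8: {"N_NATIONKEY": 8, "N_NAME": None}}
customer = {42: {"C_CUSTKEY": 42, "C_NATIONKEY": nation[7]}}
orders = {1000: {"O_ORDERKEY": 1000, "O_CUSTKEY": customer[42]}}
supplier = {5: {"S_SUPPKEY": 5, "S_NATIONKEY": nation[7]},
            6: {"S_SUPPKEY": 6, "S_NATIONKEY": nation[8]}}
part = {1: {"P_PARTKEY": 1, "P_TYPE": "STEEL"}}

def resolve(row):
    """Swap the three foreign keys for row references; None if any is unmatched."""
    o = orders.get(row["L_ORDERKEY"])
    p = part.get(row["L_PARTKEY"])
    s = supplier.get(row["L_SUPPKEY"])
    if o is None or p is None or s is None:
        return None
    return {**row, "L_ORDERKEY": o, "L_PARTKEY": p, "L_SUPPKEY": s}

def keep(r):
    """The four filter conditions, written as reference chains."""
    return (len(r["L_PARTKEY"]["P_TYPE"]) > 2
            and r["L_ORDERKEY"]["O_CUSTKEY"]["C_NATIONKEY"]["N_NAME"] is not None
            and r["L_SUPPKEY"]["S_NATIONKEY"]["N_NAME"] is not None
            and r["L_SUPPKEY"]["S_SUPPKEY"] > 0)

rows = [
    {"L_ORDERKEY": 1000, "L_PARTKEY": 1, "L_SUPPKEY": 5},   # passes every test
    {"L_ORDERKEY": 1000, "L_PARTKEY": 1, "L_SUPPKEY": 6},   # supplier nation has no name
    {"L_ORDERKEY": 999,  "L_PARTKEY": 1, "L_SUPPKEY": 5},   # order not found
]

resolved = [r for r in map(resolve, rows) if r is not None]
kept = [r for r in resolved if keep(r)]
```

Each fact row costs three index lookups; everything beyond that is reference traversal over structures prepared at startup.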
4. Run results
 | Two-table association | Six-table association |
Running time (seconds) | 266 | 472 |
The six-table association takes only 1.8 times as long as the two-table association. The additional time is mainly spent associating the L_ORDERKEY and L_SUPPKEY fields of the fact table lineitem and evaluating the extra filter conditions, which reference fields of the associated tables. Because of partial pre-association, the associations between dimension tables no longer cost anything at query time, and the association between the dimension tables and lineitem is also faster because the indexes are built in advance (halving the hash calculation).
4. Conclusion
Summary of test results:
Running time (seconds) | Two-table association | Six-table association | Performance reduction factor |
SQL | 235 | 2669 | 11.4 |
SPL pre-association | 266 | 472 | 1.8 |
In SQL, the six-table association takes 11.4 times as long as the two-table association, indicating that SQL JOIN processing consumes a lot of CPU and performance drops significantly. With partial pre-association, SPL takes only 1.8 times as long, so adding more JOIN tables has little effect and performance does not drop significantly.
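The factors in the summary table follow directly from the measured running times. A short Python check, using only the figures reported above (the variable names are arbitrary):

```python
# Running times in seconds, taken from the summary table:
# (two-table association, six-table association)
timings = {
    "SQL": (235, 2669),
    "SPL pre-association": (266, 472),
}

# Performance reduction factor: six-table time divided by two-table time.
factors = {name: round(six / two, 1) for name, (two, six) in timings.items()}

# How much faster SPL is than SQL on the six-table query.
six_table_speedup = round(timings["SQL"][1] / timings["SPL pre-association"][1], 1)
```

The same figures also show that the two-table query is slightly slower in SPL (266 s against 235 s), so the gain comes from how the cost grows as tables are added, not from the simplest case.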
For queries that associate many tables, if memory is large enough to hold all the dimension tables but not the fact table, partial pre-association can still improve computing performance effectively. A relational database, by contrast, does not optimize for this case when many tables are associated, which results in serious performance degradation.