Performance optimization techniques: partial pre-association

1. Problem background and applicable scenarios

In "Performance Optimization Techniques: Pre-Association", we tested how query performance improves when all data tables are loaded into memory in advance and associated there. But if memory is not large enough to hold all the dimension tables and fact tables, what should we do? In that case, the dimension tables alone can be pre-loaded into memory and indexed, so that the dimension-table part of the association is done in advance, which saves half of the hash calculation.

Let's test this scenario. This time we use lineitem, the table with the largest amount of data, and assume it does not fit in memory. In the SPL partial pre-association test, the other tables are pre-loaded into memory, and lineitem is read in at query time.

 

2. SQL test

As before, the Oracle database represents SQL in this test. The query computes the total annual revenue of part orders from the lineitem table.

1. Two-table association

The query SQL statement is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount)) as volume
              from
                     lineitem,
                     part
              where
                     p_partkey = l_partkey
                     and length(p_type) > 2
       ) shipping
group by
       l_year
order by
       l_year;

 

2. Six-table association

The query SQL statement is as follows:

select
       l_year,
       sum(volume) as revenue
from
       (
              select
                     extract(year from l_shipdate) as l_year,
                     (l_extendedprice * (1 - l_discount)) as volume
              from
                     supplier,
                     lineitem,
                     orders,
                     customer,
                     part,
                     nation n1,
                     nation n2
              where
                     s_suppkey = l_suppkey
                     and p_partkey = l_partkey
                     and o_orderkey = l_orderkey
                     and c_custkey = o_custkey
                     and s_nationkey = n1.n_nationkey
                     and c_nationkey = n2.n_nationkey
                     and length(p_type) > 2
                     and n1.n_name is not null
                     and n2.n_name is not null
                     and s_suppkey > 0
       ) shipping
group by
       l_year
order by
       l_year;

3. Test results


                         Two-table association   Six-table association
Running time (seconds)   235                     2669

As before, each figure is the fastest of multiple runs.

The results show that the six-table association takes 2669/235 = 11.4 times as long as the two-table association, a substantial drop in performance.

 

3. SPL partial pre-association test

1. Partial pre-association

The SPL script for pre-association is as follows:

     A
1    >env(region, file(path+"region.ctx").create().memory().keys@i(R_REGIONKEY))
2    >env(nation, file(path+"nation.ctx").create().memory().keys@i(N_NATIONKEY))
3    >env(supplier, file(path+"supplier.ctx").create().memory().keys@i(S_SUPPKEY))
4    >env(customer, file(path+"customer.ctx").create().memory().keys@i(C_CUSTKEY))
5    >env(part, file(path+"part.ctx").create().memory().keys@i(P_PARTKEY))
6    >env(orders, file(path+"orders.ctx").create().memory().keys@i(O_ORDERKEY))
7    >nation.switch(N_REGIONKEY,region)
8    >customer.switch(C_NATIONKEY,nation)
9    >supplier.switch(S_NATIONKEY,nation)
10   >orders.switch(O_CUSTKEY,customer)

The first 6 lines of the script read the 6 dimension tables into memory, generate in-memory tables, build their indexes, and set them as global variables. The last 4 lines establish the associations between dimension tables. Run this script once when the SPL server starts to prepare the environment.
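As a rough conceptual analogy in Python (not SPL, and not a description of how SPL is implemented internally), the script does two things: it indexes each in-memory dimension table by its primary key, and it replaces foreign-key values with direct references to the matching records. The sample rows and the helper names `build_index` and `switch` are hypothetical, chosen only for illustration.

```python
# Conceptual sketch only: plain dicts stand in for in-memory tables.
# Sample rows and helper names are invented for illustration.

def build_index(rows, key):
    """Index a list of records by primary key."""
    return {row[key]: row for row in rows}

def switch(rows, fk, index):
    """Replace each foreign-key value with a reference to the matching record.

    In this sketch, keys with no match become None.
    """
    for row in rows:
        row[fk] = index.get(row[fk])

region = [{"R_REGIONKEY": 1, "R_NAME": "ASIA"}]
nation = [{"N_NATIONKEY": 10, "N_NAME": "CHINA", "N_REGIONKEY": 1}]
customer = [{"C_CUSTKEY": 100, "C_NATIONKEY": 10}]
orders = [{"O_ORDERKEY": 1000, "O_CUSTKEY": 100}]

# Done once at startup: index each dimension table by primary key.
region_idx = build_index(region, "R_REGIONKEY")
nation_idx = build_index(nation, "N_NATIONKEY")
customer_idx = build_index(customer, "C_CUSTKEY")
orders_idx = build_index(orders, "O_ORDERKEY")

# Also done once at startup: associate the dimension tables with each other.
switch(nation, "N_REGIONKEY", region_idx)
switch(customer, "C_NATIONKEY", nation_idx)
switch(orders, "O_CUSTKEY", customer_idx)

# A query can now follow references without any further key lookup.
region_name = orders_idx[1000]["O_CUSTKEY"]["C_NATIONKEY"]["N_REGIONKEY"]["R_NAME"]
print(region_name)
```

Because the indexing and reference replacement happen before any query runs, their cost is paid once rather than on every query.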

 

2. Two-table association

Write the SPL script as follows:

     A
1    =file("/home/btx/lineitem.btx").cursor@tb(L_PARTKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2    =A1.switch@i(L_PARTKEY,part).select(len(L_PARTKEY.P_TYPE)>2)
3    =A2.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

Because lineitem is read from disk at query time, it has to be accessed through a cursor. The association is then performed on the cursor, and the rest of the script is written much as in the all-in-memory case.
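The same flow can be sketched in Python as a loose analogy: a generator stands in for the file cursor, the pre-built `part_idx` dictionary stands in for the in-memory part table, and rows whose key finds no match are dropped, mirroring what the `@i` option does in the script above. All sample data and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory dimension index, built once at startup.
part_idx = {
    1: {"P_PARTKEY": 1, "P_TYPE": "STANDARD BRASS"},
    2: {"P_PARTKEY": 2, "P_TYPE": "X"},
}

def lineitem_cursor():
    """Stand-in for a file cursor: yields fact rows one at a time."""
    rows = [
        (1, 100.0, 0.25, date(1995, 3, 1)),
        (1, 200.0, 0.5, date(1996, 7, 9)),
        (2, 50.0, 0.0, date(1995, 1, 2)),   # removed later by the P_TYPE filter
        (9, 10.0, 0.0, date(1996, 5, 5)),   # no matching part, dropped on association
        (1, 40.0, 0.5, date(1995, 8, 8)),
    ]
    for partkey, price, discount, shipdate in rows:
        yield {"L_PARTKEY": partkey, "L_EXTENDEDPRICE": price,
               "L_DISCOUNT": discount, "L_SHIPDATE": shipdate}

def switch_inner(rows, fk, index):
    """Replace fk with the referenced record and drop rows with no match."""
    for row in rows:
        ref = index.get(row[fk])
        if ref is not None:
            yield {**row, fk: ref}

def revenue_by_year(rows):
    """Filter on the referenced part record, then aggregate by ship year."""
    totals = defaultdict(float)
    for row in rows:
        if len(row["L_PARTKEY"]["P_TYPE"]) > 2:
            totals[row["L_SHIPDATE"].year] += (
                row["L_EXTENDEDPRICE"] * (1 - row["L_DISCOUNT"]))
    return dict(sorted(totals.items()))

result = revenue_by_year(switch_inner(lineitem_cursor(), "L_PARTKEY", part_idx))
print(result)
```

Only one fact row is held at a time, so the memory footprint is governed by the dimension tables rather than by lineitem.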

 

3. Six-table association

Write the SPL script as follows:

     A
1    =file("/home/btx/lineitem.btx").cursor@tb(L_ORDERKEY,L_PARTKEY,L_SUPPKEY,L_EXTENDEDPRICE,L_DISCOUNT,L_SHIPDATE)
2    =A1.switch@i(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier)
3    =A2.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_SUPPKEY>0)
4    =A3.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue)

Here too, once the cursor is created and associated, the script is written much as in the all-in-memory case, which keeps it concise and easy to follow.
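The multi-level field references in cell A3 work because every foreign key along the chain already holds a reference to a record. A minimal Python sketch of the same filter, assuming hypothetical sample records and a hypothetical helper `follow` that walks a chain of references, could look like this:

```python
# Hypothetical records after association: each foreign key already holds
# a reference to the matching record rather than a key value.
nation = {"N_NATIONKEY": 10, "N_NAME": "CHINA"}
customer = {"C_CUSTKEY": 100, "C_NATIONKEY": nation}
supplier = {"S_SUPPKEY": 7, "S_NATIONKEY": nation}
order = {"O_ORDERKEY": 1000, "O_CUSTKEY": customer}
part = {"P_PARTKEY": 1, "P_TYPE": "STANDARD BRASS"}

def follow(record, *path):
    """Walk a chain of references; return None if any link is missing."""
    for field in path:
        if record is None:
            return None
        record = record.get(field)
    return record

def keep(row):
    """Python counterpart of the four filter conditions in cell A3."""
    return (
        len(follow(row, "L_PARTKEY", "P_TYPE") or "") > 2
        and follow(row, "L_ORDERKEY", "O_CUSTKEY", "C_NATIONKEY", "N_NAME") is not None
        and follow(row, "L_SUPPKEY", "S_NATIONKEY", "N_NAME") is not None
        and (follow(row, "L_SUPPKEY", "S_SUPPKEY") or 0) > 0
    )

row = {"L_ORDERKEY": order, "L_PARTKEY": part, "L_SUPPKEY": supplier}
# An order whose customer reference is missing fails the second condition.
orphan = {"L_ORDERKEY": {"O_ORDERKEY": 2000, "O_CUSTKEY": None},
          "L_PARTKEY": part, "L_SUPPKEY": supplier}
print(keep(row), keep(orphan))
```

Each step in a chain is a plain reference dereference, so reaching a nation name through orders and customer involves no key lookup at query time.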

 

4. Run results


                         Two-table association   Six-table association
Running time (seconds)   266                     472

In SPL, the six-table association takes only 1.8 times as long as the two-table association. The extra time goes mainly to associating the L_ORDERKEY and L_SUPPKEY fields of the fact table lineitem and to evaluating the additional filter conditions that reference fields of the associated tables. Thanks to partial pre-association, the associations between dimension tables cost nothing at query time, and associating the dimension tables with lineitem is also faster because the index is built in advance (which halves the hash calculation).

4. Conclusion

Summary of test results:

Running time (seconds)   Two-table association   Six-table association   Performance reduction factor
SQL                      235                     2669                    11.4
SPL pre-association      266                     472                     1.8

In SQL, the six-table association takes 11.4 times as long as the two-table association, which indicates that JOIN processing in SQL consumes a lot of CPU and that performance drops significantly. SPL with partial pre-association takes only 1.8 times as long: adding more JOINed tables has little effect, and performance does not drop significantly.

For queries that associate many tables, if memory is large enough to hold all the dimension tables but not the fact table, partial pre-association can still improve performance effectively. A relational database engine, by contrast, does not perform this kind of optimization when many tables are associated, so its performance degrades severely.
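The factors quoted above can be rechecked from the measured running times. The short Python calculation below uses only the figures in the summary table; the final SQL-to-SPL ratio is derived here and is not stated in the article.

```python
# Running times in seconds, copied from the summary table.
times = {
    "SQL": {"two": 235, "six": 2669},
    "SPL pre-association": {"two": 266, "six": 472},
}

# Slowdown of the six-table query relative to the two-table query.
factors = {name: t["six"] / t["two"] for name, t in times.items()}
for name, factor in factors.items():
    print(f"{name}: six-table / two-table = {factor:.1f}")

# Derived figure, not stated in the article: SQL versus SPL on the six-table query.
six_table_ratio = times["SQL"]["six"] / times["SPL pre-association"]["six"]
print(f"SQL / SPL on the six-table query = {six_table_ratio:.1f}")
```

On the two-table query the two systems are close (235 versus 266 seconds), so the gap comes almost entirely from how each one handles the additional associations.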


Origin blog.51cto.com/12749034/2588479