1. Problem background and applicable scenarios
JOIN performance in SQL is a long-standing problem: when many tables are associated, computing performance drops sharply.
SQL databases generally implement JOIN by hash partitioning: the hash value of the association key is calculated first, records with the same hash value are grouped together, and the records in each group are then traversed and compared. Every JOIN has to go through a round of these operations.
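The hash-and-compare procedure described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Oracle or any other database actually implements it; the function `hash_join` and the miniature data are hypothetical, with column names borrowed from the tables used later in this article.

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Join two lists of dicts on left[left_key] == right[right_key]."""
    # Build phase: bucket the right-hand rows by the hash of their key.
    buckets = defaultdict(list)
    for r in right:
        buckets[hash(r[right_key])].append(r)
    # Probe phase: hash each left row's key, then compare within the bucket.
    joined = []
    for l in left:
        for r in buckets.get(hash(l[left_key]), []):
            if r[right_key] == l[left_key]:  # real comparison; hashes can collide
                joined.append({**l, **r})
    return joined

# Hypothetical miniature data (illustration only).
part = [{"p_partkey": 1, "p_type": "STEEL"}, {"p_partkey": 2, "p_type": "BRASS"}]
orderdetail = [
    {"l_partkey": 1, "l_extendedprice": 100.0},
    {"l_partkey": 2, "l_extendedprice": 50.0},
    {"l_partkey": 1, "l_extendedprice": 25.0},
]
rows = hash_join(orderdetail, part, "l_partkey", "p_partkey")
```

Every call to `hash_join` repeats both the hashing and the comparison, which is the per-JOIN cost the paragraph above refers to.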
If the data is small enough relative to memory to be loaded in advance, in-memory pointers can be used to establish the associations ahead of time. The hash and comparison operations are performed once, when the data is loaded, and the results are stored as pointers. Each later operation can then reference the associated record directly, without hashing or comparing again, which improves performance.
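As a rough sketch of this idea, the Python below replaces a foreign-key value with a direct object reference at load time. The helper `switch_key` and the two record classes are hypothetical names chosen for this illustration; they only mimic the concept and are not SPL's implementation.

```python
from dataclasses import dataclass

@dataclass
class Part:
    p_partkey: int
    p_type: str

@dataclass
class OrderDetail:
    l_partkey: object      # an int key at load time, a Part reference after switch_key
    l_extendedprice: float

def switch_key(rows, field, dim_rows, dim_key):
    """Replace the key stored in `field` with a reference to the matching record."""
    index = {getattr(d, dim_key): d for d in dim_rows}   # hashing happens once, here
    for row in rows:
        setattr(row, field, index.get(getattr(row, field)))

# Hypothetical miniature data (illustration only).
part = [Part(1, "STEEL"), Part(2, "BRASS")]
orderdetail = [OrderDetail(1, 100.0), OrderDetail(2, 50.0), OrderDetail(1, 25.0)]

switch_key(orderdetail, "l_partkey", part, "p_partkey")   # done once, at load time

# Later queries follow the reference directly; no hash or key comparison is needed.
total = sum(d.l_extendedprice for d in orderdetail if len(d.l_partkey.p_type) > 2)
```

The cost of building `index` is paid a single time; every subsequent query only dereferences `l_partkey`.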
Unfortunately, SQL has no pointer data type and cannot implement this optimization. Even when all of the data fits in memory, it is difficult to use pre-association to speed up queries. SQL-based in-memory databases share this shortcoming. SPL does have a pointer data type and can implement this mechanism.
Below we first measure the difference in SQL between a two-table association and a multi-table association, and then run the same queries in SPL with pre-association to compare the two.
2. Test environment
The data consists of the 8 tables generated according to the TPC-H standard, 50 GB in total (small enough to fit into memory). The structure of the TPC-H tables is widely documented online, so it is not repeated here.
The test machine has two Intel 2670 CPUs clocked at 2.6 GHz with 16 cores in total, 128 GB of memory, and a solid-state drive.
Because the lineitem table is too large for this server to load into memory, a table named orderdetail with the same structure is created, with its data volume reduced so that it fits in memory. This table is used in the tests below.
To make the gap easier to see, all of the following tests are single-threaded, so the multiple cores are not used.
3. SQL test
Oracle is used here as the representative SQL database. The test query computes the total annual revenue of part orders from the orderdetail table.
1. Two-table association
The query SQL statement is as follows:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
orderdetail,
part
where
p_partkey = l_partkey
and length(p_type)>2
) shipping
group by
l_year
order by
l_year;
2. Six-table association
The query SQL statement is as follows:
select
l_year,
sum(volume) as revenue
from
(
select
extract(year from l_shipdate) as l_year,
(l_extendedprice * (1 - l_discount) ) as volume
from
supplier,
orderdetail,
orders,
customer,
part,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and length(p_type) > 2
and n1.n_name is not null
and n2.n_name is not null
and s_suppkey > 0
) shipping
group by
l_year
order by
l_year;
3. Test results
 | Two-table association | Six-table association |
Running time (seconds) | 26 | 167 |
Both queries are written with a nested subquery. After Oracle's automatic optimization, this form performs better than the non-nested form (without nesting, the expressions in GROUP BY and SELECT may be computed twice).
Both figures are the best of multiple runs. During testing, Oracle's first execution of a query was often much slower than the second and later executions, which indicates that when memory exceeds the data size the database can cache all of the data in memory (Oracle's caching is very effective). We therefore take the fastest of several runs, so the figures contain almost no disk read time, only computation time.
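The "fastest of several runs" measurement can be sketched as follows. This is a generic timing helper, not the harness actually used for these tests; `best_of` and `run_query` are hypothetical names, and `run_query` is only a stand-in for executing the real query against the database.

```python
import time

def best_of(fn, runs=5):
    """Run fn several times and return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical stand-in for "execute the query"; a real test would call the database here.
def run_query():
    return sum(i * i for i in range(100_000))

elapsed = best_of(run_query, runs=5)
```

Taking the minimum rather than the average discards the slow first run, in which the data is still being read from disk into the cache.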
In both tests the filter conditions are always true, so no records are actually filtered out. Both queries process every record in the orderdetail table, and the scale of computation is equivalent.
The test results show that the six-table association takes 167/26 = 6.4 times as long as the two-table association, a large drop in performance. With disk time excluded, the additional time comes mainly from associating the tables and from evaluating the conditions on the associated tables' fields. Those conditions are very simple, so most of the time is spent on the associations themselves.
This test shows that JOIN is expensive in SQL.
4. SPL pre-association test
1. Pre-association
The SPL script for pre-association is as follows:
 | A |
1 | >env(region, file(path+"region.ctx").create().memory().keys@i(R_REGIONKEY)) |
2 | >env(nation, file(path+"nation.ctx").create().memory().keys@i(N_NATIONKEY)) |
3 | >env(supplier, file(path+"supplier.ctx").create().memory().keys@i(S_SUPPKEY)) |
4 | >env(customer, file(path+"customer.ctx").create().memory().keys@i(C_CUSTKEY)) |
5 | >env(part, file(path+"part.ctx").create().memory().keys@i(P_PARTKEY)) |
6 | >env(orders,file(path+"orders.ctx").create().memory().keys@i(O_ORDERKEY)) |
7 | >env(orderdetail,file(path+"orderdetail.ctx").create().memory()) |
8 | >nation.switch(N_REGIONKEY,region) |
9 | >customer.switch(C_NATIONKEY,nation) |
10 | >supplier.switch(S_NATIONKEY,nation) |
11 | >orders.switch(O_CUSTKEY,customer) |
12 | >orderdetail.switch(L_ORDERKEY,orders;L_PARTKEY,part;L_SUPPKEY,supplier) |
The first 7 lines of the script read the 7 tables into memory as in-memory tables and set them as global variables. The last 5 lines establish the associations between the tables. This script is run once when the SPL server starts, to prepare the environment.
Let's look at the data structure of a table object in memory after pre-association, taking orderdetail as an example:
The figure shows the pre-association of only the first record of orderdetail; the other records are similar. Because of the limited page width, only some of the fields of each table are listed.
2. Two-table association
Write the SPL script as follows:
 | A |
1 | =orderdetail.select(len(L_PARTKEY.P_TYPE)>2) |
2 | =A1.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
3. Six-table association
Write the SPL script as follows:
 | A |
1 | =orderdetail.select(len(L_PARTKEY.P_TYPE)>2 && L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_NATIONKEY.N_NAME!=null && L_SUPPKEY.S_SUPPKEY>0 ) |
2 | =A1.groups(year(L_SHIPDATE):l_year; sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)):revenue) |
After pre-association the SPL code is also very simple: the fields of an associated table can be accessed directly as sub-attributes of the association field in the current table, which is easy to understand.
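To make the sub-attribute access concrete, here is a Python analogue of the six-table query over records that have already been pre-associated. It is a sketch under stated assumptions: the miniature records and their values are invented for illustration, `Rec` and `keep` are hypothetical names, and the nested attribute access only imitates what expressions such as L_ORDERKEY.O_CUSTKEY.C_NATIONKEY.N_NAME do in SPL.

```python
from collections import defaultdict
from datetime import date
from types import SimpleNamespace as Rec

# Hypothetical miniature records, already pre-associated: foreign-key fields hold references.
china = Rec(n_name="CHINA")
france = Rec(n_name="FRANCE")
customer = Rec(c_nationkey=china)
supplier = Rec(s_suppkey=7, s_nationkey=france)
order = Rec(o_custkey=customer)
part = Rec(p_type="STEEL")

orderdetail = [
    Rec(l_orderkey=order, l_partkey=part, l_suppkey=supplier,
        l_shipdate=date(1995, 3, 1), l_extendedprice=100.0, l_discount=0.5),
    Rec(l_orderkey=order, l_partkey=part, l_suppkey=supplier,
        l_shipdate=date(1996, 7, 9), l_extendedprice=200.0, l_discount=0.25),
    Rec(l_orderkey=order, l_partkey=part, l_suppkey=supplier,
        l_shipdate=date(1995, 9, 30), l_extendedprice=40.0, l_discount=0.0),
]

def keep(d):
    # Same conditions as the six-table query, each reached by following references.
    return (len(d.l_partkey.p_type) > 2
            and d.l_orderkey.o_custkey.c_nationkey.n_name is not None
            and d.l_suppkey.s_nationkey.n_name is not None
            and d.l_suppkey.s_suppkey > 0)

# Group by shipping year and sum the discounted price.
revenue = defaultdict(float)
for d in filter(keep, orderdetail):
    revenue[d.l_shipdate.year] += d.l_extendedprice * (1 - d.l_discount)

result = sorted(revenue.items())
```

Each condition costs only a few attribute lookups per record, regardless of how many tables are reachable through the references.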
4. Run results
 | Two-table association | Six-table association |
Running time (seconds) | 28 | 56 |
The six-table association takes only twice as long as the two-table association. The extra time is essentially that of the additional computation (referencing the fields of the associated tables); because of pre-association, the association itself no longer consumes time.
5. Conclusion
Summary of test results:
Running time (seconds) | Two-table association | Six-table association | Performance reduction factor |
SQL | 26 | 167 | 6.4 |
SPL pre-association | 28 | 56 | 2 |
In SQL, the six-table association takes 6.4 times as long as the two-table association, which shows that SQL JOIN processing consumes a lot of CPU and that performance degrades significantly. With the pre-association mechanism, SPL takes only twice as long, so associating more tables no longer causes a significant performance drop.
When a query involves many associated tables and memory is large enough to hold all of the data (the typical in-memory database scenario), pre-association greatly improves computing performance. Relational databases, including in-memory ones, cannot implement this optimization in SQL.
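The reduction factors in the summary table can be reproduced from the measured running times with a short calculation. The Python below uses only the four figures reported above; the names `timings`, `slowdown`, `factors` and `extra` are chosen for this illustration.

```python
# Running times in seconds, copied from the summary table.
timings = {
    "SQL": {"two_table": 26, "six_table": 167},
    "SPL pre-association": {"two_table": 28, "six_table": 56},
}

def slowdown(t):
    """Ratio of six-table time to two-table time, rounded to one decimal."""
    return round(t["six_table"] / t["two_table"], 1)

factors = {name: slowdown(t) for name, t in timings.items()}

# Additional seconds spent when going from two tables to six.
extra = {name: t["six_table"] - t["two_table"] for name, t in timings.items()}
```

Looking at the absolute difference as well, the additional tables cost 141 seconds in SQL and 28 seconds in SPL with pre-association.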