Performance optimization tips: TopN

TopN is a common operation. In SQL it is written like this (taking Oracle as an example):
       select * from (select * from T order by x desc) where rownum<=N
Read literally, this statement first sorts all the rows (order by) and then takes the first N.

We know that sorting is slow, with a complexity of n*logn. If the data is too large to fit in memory, it has to be swapped between memory and external storage, and performance drops sharply.

In fact, TopN does not require a full sort. It is enough to maintain a small set of size N and traverse the data once, keeping the top N of the records seen so far in that set. When a new record is larger than the current Nth place, insert it and discard the current Nth place; when it is smaller, do nothing.

The complexity of this calculation is much lower (n*logN, where n is the total number of records). N is generally small enough to fit in memory, so no matter how large the data is, there is no swapping between memory and external storage.
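The single-pass algorithm described above can be sketched in Python. This is only an illustration, not code from the test: the function name top_n and the use of a min-heap from the standard heapq module are assumptions made for the example. The root of the heap plays the role of "the current Nth place".

```python
import heapq
import random


def top_n(values, n):
    """Return the n largest items of an iterable in one pass, largest first."""
    if n <= 0:
        return []
    heap = []  # min-heap of at most n items; heap[0] is the current N-th place
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)      # the small set is not full yet
        elif v > heap[0]:
            heapq.heapreplace(heap, v)   # insert v, discard the old N-th place
        # otherwise v is not in the top n: no action
    return sorted(heap, reverse=True)


random.seed(0)
data = [random.random() for _ in range(100_000)]
result = top_n(data, 5)
```

Each record costs at most one heap operation of log N steps, which gives the n*logN total, and only N values are ever held in memory.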

However, SQL cannot describe this calculation process, so we can only hope that the database engine optimizes it on its own. With SPL, the process is easy to describe, which makes high-performance computation possible.

Let's test whether Oracle performs this optimization by running the same TopN operation in Oracle and in SPL. Because SPL can use the optimized algorithm, a similar running time in Oracle suggests that Oracle has optimized the query; a much longer one suggests that it may be doing a full sort.

 

1. Data preparation and environment

An SPL script generates the data file. The data has two columns: id, a random integer less than 2 billion, and amount, a random real number not greater than 10 million. There are 8 billion rows, and the original text file is 169 GB. The file is imported into the Oracle table topn with the database's own import tool, and is also used to generate the SPL group table file topn.ctx.

The test ran on an Intel server with two Intel3014 CPUs clocked at 1.7 GHz (12 cores in total) and 64 GB of memory. The database table and the SPL group table file are stored on the same SSD.

The data is deliberately larger than memory, so that a full sort would force swapping between memory and external storage, and the resulting slowdown would be large and easy to observe.

 

2. Regular TopN

Get the 100 records with the largest amount from the topn table.

1. Oracle test

The SQL statement used for the query is:

       select * from (
           select /*+ parallel(4) */ *
           from topn
           order by amount desc
       ) where rownum<=100;

Note: /*+ parallel(4) */ is Oracle's parallel query hint, where 4 is the degree of parallelism.

 

2. SPL test

Write the SPL script to perform the test:

    A                                                       B
1   =now()                                                  /Record time
2   =4                                                      /Parallel number
3   =file("/home/topn/topn.ctx").create()                   /Generate group table object
4   =A3.cursor@m(id,amount;;A2).groups(;top(100;-amount))
5   =interval@s(A1,now())                                   /Calculate the execution time

Unlike SQL, SPL treats TopN as an aggregation, just like sum or count. The only difference is that TopN returns a set while sum and count return a single value. The calculation logic is the same: one traversal of the original data set, with no full sort.

The groups(;top(100;-amount)) in cell A4 performs the TopN aggregation over the whole set to get the top 100.
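Treating TopN as an aggregation also explains why it parallelizes well: each segment of the data can produce its own top N, and the partial results can be merged into the final top N, much as partial sums are added up. The Python sketch below illustrates that merge logic only. It is not how SPL is implemented internally, the names partial_top and parallel_top_n are made up for the example, and CPython threads are used just to show the structure, not to gain real speed.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def partial_top(chunk, n):
    """Top n of one data segment, largest first."""
    return heapq.nlargest(n, chunk)


def parallel_top_n(values, n, workers=4):
    """Split a sequence into segments, aggregate each, then merge the partial results."""
    chunks = [values[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: partial_top(c, n), chunks))
    # Merging is itself a TopN over at most workers * n candidates.
    return heapq.nlargest(n, (v for part in partials for v in part))


amounts = list(range(1000))
top3 = parallel_top_n(amounts, 3)
```

The merge step only sees workers * N values, so its cost is negligible next to the single pass over the data.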

 

3. Conclusion and analysis

The regular TopN test time is shown in the table below:

Test result (time unit: second)

Parallel number    1      2      4      8      12
Oracle             1922   952    526    308    256
SPL group table    2641   1565   729    371    321

The tests show that Oracle is somewhat faster than SPL. Since SPL does not do a full sort, this indicates that Oracle optimizes automatically in this case.

It is understandable that Oracle is faster: Oracle is developed in C++, while the current version of SPL is developed in Java, and C++ normally computes faster than Java. In addition, this test reads both columns, and the data is random and unordered and therefore hard to compress, so the columnar storage of the group table brings no advantage.

 

3. Increasing the complexity

For the basic TopN, Oracle is smart enough to optimize even when the query is written as a subquery. Now let's make the problem harder: group the data first, then take the TopN within each group.

Specifically, group by the last digit of the id field and query the 100 records with the largest amount in each group. Since id is an integer, there are only 10 groups, so the grouping itself is cheap, but taking TopN within each group makes the overall computation slightly more complex. Without a full sort, the running time should be longer than in the previous case but still within the same order of magnitude.
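To see why the grouped case need not be much more expensive, here is a minimal Python sketch of grouped TopN without sorting: one small heap per group, still filled in a single pass. The function name grouped_top_n and the sample rows are hypothetical and only illustrate the idea.

```python
import heapq
from collections import defaultdict


def grouped_top_n(rows, n, key):
    """One pass over (id, amount) rows, keeping the n largest amounts per group."""
    if n <= 0:
        raise ValueError("n must be positive")
    heaps = defaultdict(list)  # group key -> min-heap of at most n amounts
    for id_, amount in rows:
        heap = heaps[key(id_)]
        if len(heap) < n:
            heapq.heappush(heap, amount)
        elif amount > heap[0]:
            heapq.heapreplace(heap, amount)
    return {g: sorted(h, reverse=True) for g, h in heaps.items()}


rows = [(11, 5.0), (21, 9.0), (31, 7.0), (12, 1.0), (22, 4.0)]
top2 = grouped_top_n(rows, 2, key=lambda i: i % 10)
# top2 == {1: [9.0, 7.0], 2: [4.0, 1.0]}
```

With 10 groups and N = 100, the working set is at most 1,000 values, so memory use stays tiny regardless of the size of the table.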

1. Oracle test

The SQL statement used for the query is:

       select * from (
           select /*+ parallel(2) */
                  mod(id,10) as gid, amount,
                  row_number() over (partition by mod(id,10) order by amount desc) rn
           from topn
       ) where rn <= 100;

SQL cannot directly express taking TopN after grouping. It has to compute row numbers with a window function, and the syntax still contains a sort (order by).

 

2. SPL group table test

Write the SPL script to perform the test:

    A                                                                   B
1   =now()                                                              /Record time
2   =4                                                                  /Parallel number
3   =file("/home/topn/topn.ctx").create()                               /Generate group table object
4   =A3.cursor@m(id,amount;;A2).groups@u(id%10:gid;top(100;-amount))
5   =interval@s(A1,now())                                               /Calculate the execution time

Because SPL treats TopN as an aggregation, it fits naturally into a grouped aggregation, and the expression is almost the same as for the whole-set aggregation.

 

3. Conclusion and analysis

Test result (time unit: second)

Parallel number    1       2       4      8      12
Oracle             41649   19602   9359   4627   3211
SPL group table    4380    2127    1007   465    349

After the difficulty is increased, Oracle is 12 to 22 times slower than in the simple case, and 9 to 10 times slower than SPL on the same operation. This suggests that Oracle is probably sorting in this case: once the query becomes more complicated, Oracle's optimizer no longer applies the optimization.

For SPL, the running times in the two cases differ by less than a factor of 2, which is within the same order of magnitude. This matches the earlier analysis and shows the advantage of the algorithm.
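The ratios quoted in this analysis can be rechecked from the two result tables. The short Python snippet below simply divides the measured times; the variable names are chosen for the example, and the numbers are copied from the tables above.

```python
parallel = [1, 2, 4, 8, 12]
oracle_simple = [1922, 952, 526, 308, 256]
spl_simple = [2641, 1565, 729, 371, 321]
oracle_grouped = [41649, 19602, 9359, 4627, 3211]
spl_grouped = [4380, 2127, 1007, 465, 349]


def ratios(numerators, denominators):
    """Element-wise ratio of two lists of timings."""
    return [a / b for a, b in zip(numerators, denominators)]


oracle_slowdown = ratios(oracle_grouped, oracle_simple)  # grouped vs. simple, Oracle
spl_slowdown = ratios(spl_grouped, spl_simple)           # grouped vs. simple, SPL
oracle_vs_spl = ratios(oracle_grouped, spl_grouped)      # Oracle vs. SPL, grouped case

for p, o, s, x in zip(parallel, oracle_slowdown, spl_slowdown, oracle_vs_spl):
    print(f"parallel={p:2d}  Oracle x{o:.1f}  SPL x{s:.1f}  Oracle/SPL x{x:.1f}")
```

At every level of parallelism, Oracle's grouped query is more than 10 times slower than its simple one, while SPL's slowdown stays below 2.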


Origin blog.51cto.com/12749034/2588470