Performance optimization skills: traversal reuse to speed up multiple groupings

We know that the performance bottleneck of big data computing is often external storage (that is, hard disk) I/O, because external storage is one or two orders of magnitude slower to access than memory. When optimizing performance, reducing hard disk access is therefore sometimes more important than reducing CPU calculation. For the same task, an algorithm that accesses the hard disk less will perform better, even if its CPU calculation is unchanged or slightly higher.

Grouping and aggregation need to traverse the data set. The same data set may be grouped by different dimensions, which in principle means traversing it several times, and with big data every traversal means reading the hard disk again. If we can calculate the grouping results for several dimensions in a single traversal, we save a lot of hard disk access.
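To make the saving concrete, here is a minimal Python sketch (not SPL, and not how any database engine works internally) that counts how many rows are read when two groupings are computed in two passes versus one pass. The names `CountingSource`, `group_max`, and `two_groupings_one_pass` are hypothetical, and the in-memory list only stands in for a file on disk.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, float]  # (id, amount)


class CountingSource:
    """Stands in for a disk file: every iteration re-reads all rows."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows: List[Row] = list(rows)
        self.rows_read = 0

    def __iter__(self) -> Iterator[Row]:
        for row in self._rows:
            self.rows_read += 1  # one simulated read per row
            yield row


def group_max(rows: Iterable[Row], key: Callable[[int], int]) -> Dict[int, float]:
    """One full traversal that produces one grouping result."""
    result: Dict[int, float] = {}
    for rid, amount in rows:
        k = key(rid)
        if k not in result or amount > result[k]:
            result[k] = amount
    return result


def two_groupings_one_pass(
    rows: Iterable[Row],
    key_a: Callable[[int], int],
    key_b: Callable[[int], int],
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """A single traversal that fills two grouping results at once."""
    res_a: Dict[int, float] = {}
    res_b: Dict[int, float] = {}
    for rid, amount in rows:
        for res, key in ((res_a, key_a), (res_b, key_b)):
            k = key(rid)
            if k not in res or amount > res[k]:
                res[k] = amount
    return res_a, res_b
```

Calling `group_max` twice on a `CountingSource` reads every row twice, while `two_groupings_one_pass` reads each row once and does the same amount of grouping work per row.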

Unfortunately, SQL cannot express this kind of operation (returning multiple grouping results from one traversal). We can only write queries that traverse the data several times, or hope that the database engine optimizes them. SPL supports a traversal reuse syntax that calculates multiple grouping results in one traversal, thereby improving performance.

Let's run a test, taking Oracle as an example, to see whether the database optimizes away the repeated traversals, and how much the traversal reuse algorithm in SPL affects performance.

1. Data preparation and environment

An SPL script generates a data file with two columns. The first column, id, is a random integer less than 2 billion; the second column, amount, is a random real number no greater than 10 million. The file has 8 billion rows, and the original text file is 169 GB. The file is imported into the Oracle table topn with the data import tool provided by the database, and the same file is used to generate the SPL composite table file topn.ctx.
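The generating script itself is not shown in the article. As a rough illustration of the data shape only, the following Python sketch writes a small sample with the same two columns. The tab delimiter, the header row, the two-decimal formatting, the seed, and the function names are all assumptions made for this example.

```python
import random
from typing import Iterator, TextIO, Tuple

MAX_ID = 2_000_000_000       # ids are random integers below 2 billion
MAX_AMOUNT = 10_000_000.0    # amounts are random reals up to 10 million


def generate_rows(n_rows: int, seed: int = 0) -> Iterator[Tuple[int, float]]:
    """Yield (id, amount) pairs; the fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield rng.randrange(MAX_ID), round(rng.uniform(0.0, MAX_AMOUNT), 2)


def write_rows(out: TextIO, n_rows: int, seed: int = 0) -> None:
    """Write a header line and n_rows tab-separated data lines to out."""
    out.write("id\tamount\n")
    for rid, amount in generate_rows(n_rows, seed):
        out.write(f"{rid}\t{amount:.2f}\n")
```

For a quick local experiment, `write_rows(open("sample.txt", "w"), 1_000_000)` would produce a file small enough to inspect by hand; the real test file is several thousand times larger.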

The test was run on an Intel server with two Intel3014 CPUs clocked at 1.7 GHz, 12 cores in total, and 64 GB of memory. The database table data and the SPL composite table file are stored on the same SSD.

The data volume is deliberately larger than memory, so that the operating system cannot cache all of the data and the hard disk must actually be read during the calculation.

2. Oracle test

The test covers three cases: single group with single calculation amount, single group with double calculation amount, and double group with double calculation amount.

1. Single group and single calculation

select /*+ parallel(12) */ mod(id,100) Aid, max(amount) Amax from topn group by mod(id,100);

 

2. Single group and double calculation

select  /*+ parallel(12) */ mod(id,100)+floor(id/20000000) Aid, max(amount) Amax, min(amount) Amin from topn group by mod(id,100)+floor(id/20000000);

Both the grouping expression and the aggregates have doubled compared with case 1, which is roughly twice the amount of calculation.

3. Double group and double calculation

select  /*+ parallel(12) */  * from (select mod(id,100) Aid,max(amount) Amax from topn group by mod(id,100) ) A 
join
( select floor(id/20000000) Bid,min(amount) Bmin from topn group by floor(id/20000000) )  B
on A.Aid=B.Bid;

The amount of calculation here is roughly the same as in case 2, but there are two groupings, so we can observe whether the database traverses the table twice. The final JOIN involves only 100 rows of data, so its time is negligible.

 

3. SPL test

Now let's repeat the Oracle tests with SPL.

1. Single group and single calculation

Write the SPL script to perform the test:

    A
1   =now()
2   =12
3   =file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2)
4   =A3.groups@u(id%100:Aid;max(amount):Amax)
5   =interval@s(A1,now())

 

2. Single group and double calculation

Write the SPL script to perform the test:

    A
1   =now()
2   =12
3   =file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2)
4   =A3.groups@u(id%100+id\20000000:Aid;max(amount):Amax,min(amount):Amin)
5   =interval@s(A1,now())

 

3. Double group and double calculation

Write the SPL script to perform the test:

    A            B
1   =now()
2   =12
3   =file("/home/topn/topn.ctx").create().cursor@m(id,amount;;A2)
4   cursor A3    =A4.groups@u(id%100:Aid;max(amount):Amax)
5   cursor       =A5.groups@u(id\20000000:Bid;max(amount):Bmax)
6   =A4.join@i(Aid,A5:Bid,Bid,Bmax)
7   =interval@s(A1,now())

SPL's traversal reuse syntax is used here. The cursor is defined in A3, and two computations on this cursor are defined in A4/B4 and A5/B5, which means both results are calculated during a single traversal of the cursor.
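The behaviour described above can be pictured with a small Python analogy. This is not SPL's implementation: the `Channel` class, the `traverse` and `join_on_key` functions, and the batch size are hypothetical names chosen for illustration. The point is that the cursor is consumed once, in batches, and every batch is handed to each attached computation before the next batch is read.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, float]  # (id, amount)


class Channel:
    """Hypothetical stand-in for one computation attached to a cursor."""

    def __init__(self, key: Callable[[int], int]) -> None:
        self.key = key
        self.result: Dict[int, float] = {}

    def push(self, batch: List[Row]) -> None:
        """Fold one batch into this channel's grouped maximums."""
        for rid, amount in batch:
            k = self.key(rid)
            cur = self.result.get(k)
            if cur is None or amount > cur:
                self.result[k] = amount


def traverse(cursor: Iterable[Row], channels: List[Channel],
             batch_size: int = 10000) -> None:
    """Read the cursor once, feeding every batch to every channel."""
    it = iter(cursor)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        for ch in channels:
            ch.push(batch)


def join_on_key(a: Dict[int, float],
                b: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Inner join of two grouped results on equal keys."""
    return {k: (a[k], b[k]) for k in a if k in b}
```

Because `traverse` accepts any iterable, it also works on a one-shot generator that cannot be rewound, which is the situation a forward-only cursor over a large file is in.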

 

4. Analysis and conclusion

The test time for the three cases is as follows:

Test result (time unit: second)


         Single group, single calculation   Single group, double calculation   Double group, double calculation
Oracle   458                                692                                878
SPL      336                                350                                376

In Oracle's results, the double-group double calculation is nearly 200 seconds slower than the single-group double calculation. This difference is not negligible: the amount of calculation in the two cases is almost the same, so the extra time most likely comes from one more traversal. This suggests that the database does not automatically reuse the traversal, and that the table is traversed twice in the double-group case. As a result, adding one more grouping almost doubles the time of a single grouping.

SPL uses the traversal reuse mechanism, and the times of the three tests differ very little. One more grouping does not cause one more traversal; it only adds some logic for controlling the reuse, which does not slow things down much.
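The comparison can be checked with a few lines of arithmetic. The Python snippet below only restates the timings from the result table; the dictionary keys and the `summarize` helper are hypothetical names for this example.

```python
# Timings in seconds, copied from the result table.
times = {
    "Oracle": {"single": 458, "single_double": 692, "double_double": 878},
    "SPL": {"single": 336, "single_double": 350, "double_double": 376},
}


def summarize(t):
    """Extra seconds for the second grouping, and total time relative to case 1."""
    return {
        "extra_for_second_grouping": t["double_double"] - t["single_double"],
        "ratio_vs_single": round(t["double_double"] / t["single"], 2),
    }


summary = {name: summarize(t) for name, t in times.items()}
# Oracle: 186 extra seconds, 1.92 times the single-group time
# SPL:     26 extra seconds, 1.12 times the single-group time
```

With roughly equal calculation in the last two cases, the second grouping costs Oracle 186 seconds but costs SPL only 26 seconds.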

A note on the data: the amount field in Oracle was defined as decimal, so its calculation is relatively slow, while the SPL composite table uses the double type and is therefore much faster. However, this test is not meant to compare the computing performance of Oracle and SPL, and these differences do not affect the conclusions above.


Origin blog.51cto.com/12749034/2588469