Program optimization: loop unrolling

Reprinted from: https://www.cnblogs.com/liboyan/p/5011382.html

#include <iostream>
using namespace std;

// Baseline: one multiplication per iteration, one accumulator.
void combine5(double data[], int length)
{
    double sum = 1.0;   // 1.0 is the multiplicative identity; starting at 0.0 would force the result to 0
    for (int i = 0; i < length; i++)
    {
        sum *= data[i];
    }
    cout << sum << endl;
}

// 2x loop unrolling: two elements per iteration, still one accumulator.
void combine6(double data[], int length)
{
    double sum = 1.0;
    int limit = length - 1;
    int i;
    for (i = 0; i < limit; i += 2)
    {
        sum = sum * data[i] * data[i + 1];
    }
    for (; i < length; i++)   // handle the leftover element, if any
    {
        sum *= data[i];
    }
    cout << sum << endl;
}

// 2x loop unrolling with two-way parallelism: two independent accumulators.
void combine7(double data[], int length)
{
    double sum1 = 1.0, sum2 = 1.0;
    int limit = length - 1;
    int i;
    for (i = 0; i < limit; i += 2)
    {
        sum1 *= data[i];       // accumulates the elements at even indices (0 counts as even)
        sum2 *= data[i + 1];   // accumulates the elements at odd indices
    }
    double sum = sum1 * sum2;
    for (; i < length; i++)   // handle the leftover element, if any
    {
        sum *= data[i];
    }
    cout << sum << endl;
}

 

   The three functions above compute the same result, but they do not run at the same speed. Why?

   Because the adders and multipliers are fully pipelined, they can accept a new operation every clock cycle, so several operations can be in flight at the same time (see my earlier blog post on small optimization details). By reducing the data dependencies in the code and increasing its parallelism, we can greatly improve its efficiency and squeeze every bit of power out of the CPU.
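   To make the effect of pipelining concrete, here is a rough estimate. The symbols are assumptions of mine and do not appear in the original post: a multiplier with a latency of L cycles that can accept one new operation per cycle, and n multiplications to perform. If every multiplication depends on the previous one, each must wait for the full latency; if they are all independent, they can overlap in the pipeline:

```latex
T_{\text{dependent}} \approx n \cdot L
\qquad
T_{\text{independent}} \approx L + (n - 1)
```

   With the purely illustrative values L = 5 and n = 100, the dependent chain needs about 500 cycles, while the independent operations need about 104.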

   combine5 applies only some simple optimizations, combine6 adds loop unrolling, and combine7 adds two-way parallelism on top of loop unrolling. If the running time of combine5 is 5, then combine6 takes about 2.5 and combine7 about 1. Yes, the difference is that dramatic! The IA32 instruction set and its successors support this kind of optimization, but most compilers will not restructure the code like this for you, so it is best to write it yourself. If you also use SSE, you can squeeze out everything the CPU has.

   Why is multi-way parallel loop unrolling faster than plain loop unrolling? First, a term: the critical path. The efficiency of a loop depends on the instructions on its critical path, which is the chain of variables and calculations that runs through every iteration of the loop. Reducing the work on the critical path is equivalent to reducing the running time.

   Unrolling the loop can be seen as reducing the number of times the CPU has to traverse the critical path. However, in sum = sum * data[i] * data[i + 1] the calculations on the critical path are still tightly dependent: the two multiplications cannot run in parallel, because the second can only start once the first has finished.

   Multi-way parallelism solves this problem. In sum1 *= data[i]; sum2 *= data[i + 1]; the two multiplications do not depend on each other and can run entirely in parallel, so they cost about as much as a single multiplication, and the loop becomes much faster.
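   One way to check this on your own machine is to time a single dependency chain against two independent chains. The sketch below is a minimal example and not the original author's benchmark: the names product_one_chain, product_two_chains and time_us are hypothetical, and the functions return their result instead of printing it so that the two versions can be compared.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// One accumulator: every multiplication must wait for the previous one.
double product_one_chain(const std::vector<double>& data)
{
    double product = 1.0;
    for (std::size_t i = 0; i < data.size(); i++)
        product *= data[i];
    return product;
}

// Two accumulators: the two multiplications in each iteration are independent.
double product_two_chains(const std::vector<double>& data)
{
    double even = 1.0, odd = 1.0;
    std::size_t i = 0;
    for (; i + 1 < data.size(); i += 2)
    {
        even *= data[i];      // elements at even indices
        odd *= data[i + 1];   // elements at odd indices
    }
    double product = even * odd;
    for (; i < data.size(); i++)   // leftover element when the size is odd
        product *= data[i];
    return product;
}

// Hypothetical timing helper: runs f(data) once, stores its return value in
// result, and returns the elapsed wall-clock time in microseconds.
template <typename F>
long long time_us(F f, const std::vector<double>& data, double& result)
{
    auto start = std::chrono::steady_clock::now();
    result = f(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

   For example, time_us(product_one_chain, data, r) and time_us(product_two_chains, data, r) can be called on the same large vector and the two timings compared. The ratio you observe depends on the CPU, the compiler and the optimization flags, so it will not necessarily match the figures quoted above.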

   It follows from this analysis that unrolling the loop three times with three-way parallelism would be faster still than the code above, and so on. The gains, however, inevitably tend toward a limit. That limit is the throughput bound of the CPU, and in theory the code cannot be optimized beyond it.
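   As a sketch of what the three-way version could look like, the function below unrolls the loop three times and keeps three independent accumulators. The name product3x3 is hypothetical, and it returns the product rather than printing it; it is an illustration of the pattern, not code from the original post.

```cpp
#include <cstddef>

// Hypothetical helper: 3x loop unrolling with three independent accumulators.
double product3x3(const double data[], std::size_t length)
{
    double p0 = 1.0, p1 = 1.0, p2 = 1.0;
    std::size_t i = 0;
    // Main loop: three multiplications per iteration, none depending on another.
    for (; i + 2 < length; i += 3)
    {
        p0 *= data[i];
        p1 *= data[i + 1];
        p2 *= data[i + 2];
    }
    // Combine the three partial products, then handle up to two leftover elements.
    double product = p0 * p1 * p2;
    for (; i < length; i++)
        product *= data[i];
    return product;
}
```

   Note that floating-point multiplication is not associative, so splitting the product across several accumulators can change the last bits of the result compared with a strictly sequential loop.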

Origin www.cnblogs.com/nuchenghao/p/11297193.html