Performance analysis of the OpenCL version of cv::dft(): Android + OpenCV + OpenCL

I tested the OpenCL version of cv::dft() on a Xiaomi Mi MIX 2S (Qualcomm Snapdragon 845 with an Adreno 630 GPU).

Test Data

The table below lists the measurements; the last column explains each row:

| Name | Function | Maximum time (ms) | Average time (ms) | Notes |
|---|---|---|---|---|
| CPU version of dft | cv::dft() | - | 0.029448 | No other statistics; only the time of the cv::dft() call itself |
| OpenCL version | cv::dft(UMat) | 802.557000 | 0.202941 | Excludes copying the Mat to a UMat and padding/aligning the UMat |
| Main OpenCL compute function | cv::ocl_dft() | 802.553000 | 0.210583 | cv::dft() wraps cv::ocl_dft(); this wrapper layer costs little performance |
| First sub-step called by ocl_dft | ocl_dft_rows() | 802.518000 | 0.1031 | - |
| Second sub-step called by ocl_dft | ocl_dft_cols() | 338.004000 | 0.078061 | - |
| Plan cache lookup | OCL_FftPlanCache::getInstance().getFftPlan() | 0.190000 | 0.000028 | The lookup is fast and takes negligible time |
| Kernel compilation, argument binding, and execution | OCL_FftPlan::enqueueTransform() | 464.393000 | 0.075685 | - |
| Kernel compilation | enqueueTransform() | 464.237000 | 0.019422 | The first compilation is very slow; later ones are much faster. Still, it would be better not to compile repeatedly |
| Argument binding | enqueueTransform() | 0.122000 | 0.016015 | Argument binding is fast |
| Kernel execution | enqueueTransform() | 1.167000 | 0.028805 | - |
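
The post does not show how the maximum and average times were collected. The following is a minimal sketch of one way to gather such per-function statistics, assuming a hypothetical `TimingStats` accumulator and a hypothetical `ScopedTimer` helper built on `std::chrono`. Neither name comes from OpenCV or from the original measurements.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical accumulator for per-function maximum and average times (ms).
struct TimingStats {
    double max_ms = 0.0;
    double total_ms = 0.0;
    std::size_t calls = 0;

    // Record one sample in milliseconds.
    void add(double ms) {
        max_ms = std::max(max_ms, ms);
        total_ms += ms;
        ++calls;
    }

    // Average over all recorded samples (0 if nothing was recorded).
    double average_ms() const {
        return calls == 0 ? 0.0 : total_ms / static_cast<double>(calls);
    }
};

// Hypothetical RAII timer: adds the wall-clock time of its scope to `stats`.
class ScopedTimer {
public:
    explicit ScopedTimer(TimingStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - start_;
        stats_.add(elapsed.count());
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimingStats& stats_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping a call as `{ ScopedTimer t(dft_stats); cv::dft(src, dst); }` would add one sample to `dft_stats`. Because OpenCL work can be queued asynchronously, a wall-clock timer like this only reflects finished GPU work if the measured scope ends with a synchronization point.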

Result analysis

Several conclusions can be drawn:

  • Disappointing: the OpenCL (GPU) version averages 0.202941 ms, while the CPU version averages 0.029448 ms, so the GPU version is about 6.9 times slower than the CPU version. On top of that comes the time for copying the Mat to a UMat, padding the Mat for alignment, copying the UMat back to a Mat, and other overhead;
  • The first call to the OpenCL version of cv::dft() spends a very long time compiling the kernel (464 ms), and subsequent compilations still take some time;
  • In terms of pure computation time, the OpenCL kernel execution takes roughly 0.028805 * 2 ms, which is about 1.96 times the CPU version. This may be because my test data is very small; with large amounts of data, the GPU version's pure computation time may beat the CPU version.

Optimization Plan

  • Before the first real call to the OpenCL version of cv::dft(), start a thread that makes one dummy call to cv::ocl_dft(), so that the kernel compile time no longer counts toward the total call time.
  • A pool of ocl::Kernel objects could be created, instead of constructing a temporary object on every call. That would save 0.019422 ms per call, a performance improvement of about 9.6%. The temporary object is currently created like this:
    ocl::Kernel k(kernel_name.c_str(), ocl::core::fft_oclsrc, options);
  • If a GPU memory pool is used, so that the input and output addresses are the same for every calculation, the 0.016015 ms argument-binding step could be skipped, which may improve performance by about 7.9%.
  • In my cv::dft() usage scenario, 44 matrices are transformed in succession each time. Suppose a way can be found to put all 44 calculations into the queue at once, so that the GPU computes continuously. Assuming the GPU can run the 44 calculations concurrently, the theoretical per-matrix latency of the GPU version is 0.202941 / 44 = 0.004612 ms, which is 6.39 times faster than the CPU version!
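
The first two ideas (warm up once in the background, then reuse compiled kernels) can be sketched without any OpenCV code. Everything below is hypothetical: `CompiledKernel` is only a stand-in for a compiled GPU kernel, not `cv::ocl::Kernel`, and `KernelCache` and `warmUp` are illustrative names. The sketch shows the compile-once pattern only, not how OpenCV manages kernels internally.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for a compiled GPU kernel (not cv::ocl::Kernel).
struct CompiledKernel {
    std::string name;
    std::string options;
};

// Hypothetical cache: builds each (name, options) kernel once and reuses it.
class KernelCache {
public:
    std::shared_ptr<CompiledKernel> get(const std::string& name,
                                        const std::string& options) {
        // The lock is held during the "compile" for simplicity, so
        // concurrent callers wait instead of compiling the kernel twice.
        std::lock_guard<std::mutex> lock(mutex_);
        const auto key = std::make_pair(name, options);
        const auto it = kernels_.find(key);
        if (it != kernels_.end()) {
            return it->second;  // reuse: no compilation on this path
        }
        ++compile_count_;  // stands in for the expensive compile step
        auto kernel =
            std::make_shared<CompiledKernel>(CompiledKernel{name, options});
        kernels_.emplace(key, kernel);
        return kernel;
    }

    // Number of times a kernel actually had to be built.
    int compileCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return compile_count_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::pair<std::string, std::string>,
             std::shared_ptr<CompiledKernel>> kernels_;
    int compile_count_ = 0;
};

// Warm-up: trigger the one-time compile on a background thread at startup.
// The caller must join() the returned thread before it is destroyed.
std::thread warmUp(KernelCache& cache, const std::string& name,
                   const std::string& options) {
    return std::thread([&cache, name, options] { cache.get(name, options); });
}
```

Calling `warmUp(cache, ...)` during application startup and joining the thread before the first real transform would move the one-time cost off the measured path; later `get()` calls with the same name and options return the same object. Whether a warm-up from another thread is shared with the calling thread in a real OpenCV build depends on how its OpenCL context and caches are set up, which is not verified here.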

Origin www.cnblogs.com/ahfuzhang/p/11097141.html