Testing the OpenCL version of cv::dft() on a Xiaomi Mi MIX 2S (Qualcomm Snapdragon 845 + Adreno 630).
Test Data
See the Explanation column of the table for details:
Name | Function | Maximum time (ms) | Average time (ms) | Explanation |
---|---|---|---|---|
CPU version of dft | cv::dft() | - | 0.029448 | No other statistics collected; only the call time of cv::dft() is measured |
OpenCL version | cv::dft(UMat) | 802.557000 | 0.202941 | Does not count copying the Mat to a UMat, nor padding and aligning the UMat |
Main OpenCL compute function | cv::ocl_dft() | 802.553000 | 0.210583 | cv::dft() wraps cv::ocl_dft(); this wrapper layer costs little performance |
First sub-step of ocl_dft | ocl_dft_rows() | 802.518000 | 0.1031 | - |
Second sub-step of ocl_dft | ocl_dft_cols() | 338.004000 | 0.078061 | - |
Fetching the plan from the pool | OCL_FftPlanCache::getInstance().getFftPlan() | 0.190000 | 0.000028 | Fetching from the pool is fast and takes negligible time |
Compile the OpenCL kernel, bind arguments, execute | OCL_FftPlan::enqueueTransform() | 464.393000 | 0.075685 | - |
Kernel compilation | enqueueTransform() | 464.237000 | 0.019422 | The first compilation is very slow; later ones are much faster. Even so, it would be better not to compile repeatedly |
Argument binding | enqueueTransform() | 0.122000 | 0.016015 | Binding arguments is fast |
Kernel execution | enqueueTransform() | 1.167000 | 0.028805 | - |
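
The test code that produced these numbers is not shown, so the following is only a minimal sketch of how a maximum and an average time per call could be collected. `TimingStats` and `time_call` are hypothetical names introduced here for illustration; they are not part of OpenCV or of the original test.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the original test code): accumulates
// per-call durations and reports the maximum and the average in ms.
struct TimingStats {
    double max_ms = 0.0;
    double total_ms = 0.0;
    std::size_t count = 0;

    // Records one measured duration, given in milliseconds.
    void add(double ms) {
        max_ms = std::max(max_ms, ms);
        total_ms += ms;
        ++count;
    }

    // Mean duration over all recorded calls; 0 if nothing was recorded.
    double average_ms() const {
        return count == 0 ? 0.0 : total_ms / static_cast<double>(count);
    }
};

// Times a single invocation of `fn` with a monotonic clock and records it.
template <typename Fn>
void time_call(TimingStats& stats, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    stats.add(std::chrono::duration<double, std::milli>(stop - start).count());
}
```

With a helper like this, a single slow first call (for example one that includes kernel compilation) shows up in `max_ms`, while `average_ms()` is dominated by the many fast calls that follow.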
Result analysis
Several conclusions can be drawn:
- Disappointing: the average time of the OpenCL (GPU) version is 0.202941 ms, while the average time of the CPU version is 0.029448 ms, so the GPU version is 6.9 times slower than the CPU version. This does not yet include the time for copying the Mat to a UMat, padding and aligning the Mat, copying the UMat back to a Mat, and other overhead;
- The first call to the OpenCL version of cv::dft() spends a very long time compiling the kernel (464 ms), and subsequent compilations still take time;
- In terms of pure computation, the OpenCL kernel execution time is approximately 0.028805 * 2 ms (one kernel execution for the row pass and one for the column pass), about 1.96 times that of the CPU version. This may be because my test data is very small; with large amounts of data, the GPU version may beat the CPU version on pure computation time.
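
For reference, the two ratios quoted above follow directly from the average times in the table (all values in ms):

```latex
\[
\frac{t_{\text{OpenCL}}}{t_{\text{CPU}}} = \frac{0.202941}{0.029448} \approx 6.89,
\qquad
\frac{2 \times t_{\text{kernel}}}{t_{\text{CPU}}} = \frac{2 \times 0.028805}{0.029448} = \frac{0.057610}{0.029448} \approx 1.96
\]
```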
Optimization Plan
- Before the first real call to the OpenCL version of cv::dft(), start a thread that makes one dummy call to cv::ocl_dft(), so that the kernel compilation time does not count toward the total call time.
- The ocl::Kernel object could be created once and kept in a pool instead of being constructed as a temporary on every call. Each call would then save 0.019422 ms, improving performance by about 9.6%:
ocl::Kernel k(kernel_name.c_str(), ocl::core::fft_oclsrc, options);
- If GPU memory is pooled so that the input and output addresses are the same for every computation, the 0.016015 ms argument-binding step can be skipped, which may improve performance by about 7.9%.
- In my cv::dft() use case, 44 data matrices are computed in succession each time. Suppose a way is found to put all 44 computations into the queue together so that the GPU computes without interruption. Assuming the GPU can run the 44 computations concurrently, the theoretical latency of the GPU version is 0.202941 / 44 = 0.004612 ms per matrix, which is 6.39 times faster than the CPU version!
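
The first two ideas (warming up on a separate thread and keeping compiled kernels in a pool) can be sketched without OpenCV as follows. This is only an illustration of the pattern under stated assumptions: `CompiledKernel`, `KernelCache`, and `warm_up` are hypothetical names, and the sleep stands in for the expensive OpenCL kernel compilation. It is not how OpenCV itself is implemented.

```cpp
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a compiled OpenCL kernel; in OpenCV this role
// is played by cv::ocl::Kernel.
struct CompiledKernel {
    std::string name;
};

// Keeps compiled kernels in a pool keyed by name, so compilation happens
// once per kernel instead of once per transform call.
class KernelCache {
public:
    // Returns the pooled kernel, compiling it on first use.
    const CompiledKernel& get(const std::string& name) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = kernels_.find(name);
        if (it == kernels_.end()) {
            it = kernels_.emplace(name, compile(name)).first;
        }
        return it->second;  // std::map references stay valid after inserts
    }

    // Number of compilations performed so far.
    int compile_count() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return compile_count_;
    }

private:
    // Assumption: simulates an expensive compilation with a short sleep.
    // Called only while mutex_ is held.
    CompiledKernel compile(const std::string& name) {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        ++compile_count_;
        return CompiledKernel{name};
    }

    mutable std::mutex mutex_;
    std::map<std::string, CompiledKernel> kernels_;
    int compile_count_ = 0;
};

// Starts a background thread that triggers compilation ahead of time.
// The cache must outlive the returned thread, and the caller must join it.
std::thread warm_up(KernelCache& cache, const std::string& name) {
    return std::thread([&cache, name] { cache.get(name); });
}
```

In this sketch the application would call `warm_up` at startup, do other initialization work, and join the thread before the first time-critical transform. Every later `get` with the same name then returns the pooled object without recompiling.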