[OpenCV & CUDA] Combining OpenCV and CUDA Programming

[Original: http://www.cnblogs.com/dwdxdy/p/3528711.html ]

1. Use the GPU module provided by OpenCV

  OpenCV already provides many GPU functions, and most common image-processing operations can be accelerated simply by using the GPU module that OpenCV ships with.

  For basic usage, please refer to: http://www.cnblogs.com/dwdxdy/p/3244508.html

  The advantage of this approach is its simplicity: GpuMat manages the data transfer between the CPU and the GPU, the kernel launch parameters are handled internally, and the user only needs to focus on the processing logic.
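
  Below is a minimal sketch of this approach, assuming an OpenCV 2.x build with the gpu module enabled; the input file name is just a placeholder.

//gpu_module_example.cpp
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;

int main()
{
    Mat src = imread("lena.jpg");             // placeholder input image
    gpu::GpuMat d_src, d_dst;

    d_src.upload(src);                        // GpuMat handles the host-to-device copy
    gpu::cvtColor(d_src, d_dst, CV_BGR2GRAY); // library-provided GPU operation

    Mat dst;
    d_dst.download(dst);                      // device-to-host copy
    imshow("gpu", dst);
    waitKey(0);
    return 0;
}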

  The disadvantage is that this approach is limited by the development pace of the OpenCV library. When a custom operation is needed for which OpenCV provides no corresponding function, the module cannot meet the requirement, and the parallel implementation of the custom operation has to be written by hand. In addition, even where OpenCV does provide a parallel function, its performance is not always optimal, and further tuning may be required in a specific application.

2. Use the CUDA API alone

  Use the CUDA Runtime API and CUDA Driver API to implement parallel acceleration of custom operations. With this approach, the user must manage the data transfer between the CPU and the GPU, set the kernel launch parameters, and optimize the kernels.

  The advantage is that the whole processing pipeline is under the user's control, so a much wider range of parallel accelerated operations can be implemented.

  The disadvantage is that it is more complex to use, requires writing much more code, and demands familiarity with the CUDA documentation and API. Here is a simple example program:

#include <opencv2/opencv.hpp>   // Mat, imread, imshow, waitKey
#include <cuda_runtime.h>
#include <cutil.h>              // CUDA_SAFE_CALL, from the legacy CUDA SDK
using namespace cv;

//kernel: swap the red and blue channels, one thread per pixel
__global__ void swap_rb_kernel(const uchar3* src,uchar3* dst,int width,int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    
    if(x < width && y < height)
    {
        uchar3 v = src[y * width + x];
        dst[y * width + x].x = v.z;
        dst[y * width + x].y = v.y;
        dst[y * width + x].z = v.x;
    }
}

//host-side wrapper: configures the launch and runs the kernel
void swap_rb_caller(const uchar3* src,uchar3* dst,int width,int height)
{
    dim3 block(32,8);
    dim3 grid((width + block.x - 1)/block.x,(height + block.y - 1)/block.y);
    
    swap_rb_kernel<<<grid,block,0>>>(src,dst,width,height);
    cudaDeviceSynchronize();
}

int main()
{
    Mat image = imread("lena.jpg");
    imshow("src",image);
    
    size_t memSize = image.cols*image.rows*sizeof(uchar3);
    uchar3* d_src = NULL;
    uchar3* d_dst = NULL;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_src,memSize));
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_dst,memSize));
    CUDA_SAFE_CALL(cudaMemcpy(d_src,image.data,memSize,cudaMemcpyHostToDevice));
    
    swap_rb_caller(d_src,d_dst,image.cols,image.rows);
    
    CUDA_SAFE_CALL(cudaMemcpy(image.data,d_dst,memSize,cudaMemcpyDeviceToHost));
    imshow("gpu",image);
    waitKey(0);
    
    CUDA_SAFE_CALL(cudaFree(d_src));
    CUDA_SAFE_CALL(cudaFree(d_dst));
    return 0;
}

  In the code above, cudaMalloc, cudaMemcpy, and cudaFree manage the allocation, transfer, and release of GPU memory.

  Note: if image.data contains padding bytes for row alignment (that is, the rows of the Mat are not stored continuously), the program above will not process the image correctly, because it assumes a row pitch of exactly width * sizeof(uchar3).
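
  As an illustrative sketch (not part of the original program), the copy could honor the row pitch by using cudaMallocPitch and cudaMemcpy2D, with the kernel then indexing rows by the pitch instead of the width:

//sketch: allocating and copying with an explicit row pitch
uchar3* d_src = NULL;
size_t d_pitch = 0;
CUDA_SAFE_CALL(cudaMallocPitch((void**)&d_src,&d_pitch,
                               image.cols * sizeof(uchar3),image.rows));
//cudaMemcpy2D copies row by row, honoring both pitches
//(image.step is the number of bytes per Mat row, padding included)
CUDA_SAFE_CALL(cudaMemcpy2D(d_src,d_pitch,
                            image.data,image.step,
                            image.cols * sizeof(uchar3),image.rows,
                            cudaMemcpyHostToDevice));
//inside the kernel, a row would then be addressed as:
//    uchar3* row = (uchar3*)((char*)src + y * pitch);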

3. Combine the interfaces provided by OpenCV with the CUDA API

  Use the interfaces OpenCV already provides to handle the basics of CUDA programming, which greatly simplifies the code; only the kernel functions need to be written or extended according to the needs of the business. This makes full use of OpenCV's features while still meeting custom requirements; it is convenient to use and easy to extend. Here is a simple example program:

//swap_rb.cu
#include <opencv2/core/cuda_devptrs.hpp>
using namespace cv;
using namespace cv::gpu;
//custom kernel function; PtrStepSz/PtrStep carry the image size and row pitch
__global__ void swap_rb_kernel(const PtrStepSz<uchar3> src,PtrStep<uchar3> dst)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;

    if(x < src.cols && y < src.rows)
    {
        uchar3 v = src(y,x);
        dst(y,x) = make_uchar3(v.z,v.y,v.x);
    }
}

void swap_rb_caller(const PtrStepSz<uchar3>& src,PtrStep<uchar3> dst,cudaStream_t stream)
{
    dim3 block(32,8);
    dim3 grid((src.cols + block.x - 1)/block.x,(src.rows + block.y - 1)/block.y);

    swap_rb_kernel<<<grid,block,0,stream>>>(src,dst);
    if(stream == 0)
        cudaDeviceSynchronize(); //block only when running on the default stream
}
//swap_rb.cpp
#include <opencv2/gpu/gpu.hpp>
#include <opencv2/gpu/stream_accessor.hpp>
using namespace cv;
using namespace cv::gpu;

void swap_rb_caller(const PtrStepSz<uchar3>& src,PtrStep<uchar3> dst,cudaStream_t stream);

void swap_rb(const GpuMat& src,GpuMat& dst,Stream& stream = Stream::Null())
{
    CV_Assert(src.type() == CV_8UC3);
    dst.create(src.size(),src.type());
    cudaStream_t s = StreamAccessor::getStream(stream);
    swap_rb_caller(src,dst,s);
}
//main.cpp
#include <iostream>
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
using namespace cv::gpu;
void swap_rb(const GpuMat& src,GpuMat& dst,Stream& stream = Stream::Null());
int main()
{
    Mat image = imread("lena.jpg");
    imshow("src",image);
    GpuMat gpuMat,output;

    gpuMat.upload(image);
    swap_rb(gpuMat,output);
    output.download(image);

    imshow("gpu",image);
    waitKey(0);
    return 0;
}

  The swap_rb.cu file defines the kernel function and its host-side caller; the caller sets the kernel's launch parameters.

  The swap_rb.cpp file defines the entry function of the parallel operation, i.e., the function the main program calls to carry out the parallel operation. It mainly wraps the kernel's caller, adds validation of the input parameters, and can select different kernel functions according to the input parameters.
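
  As a hypothetical illustration of that last point, the entry function might dispatch to different callers by input type; the 4-channel caller below is assumed, not part of the original example:

//sketch: dispatching on the input type inside the entry function
void swap_rb(const GpuMat& src,GpuMat& dst,Stream& stream = Stream::Null())
{
    CV_Assert(src.type() == CV_8UC3 || src.type() == CV_8UC4);
    dst.create(src.size(),src.type());
    cudaStream_t s = StreamAccessor::getStream(stream);
    if(src.type() == CV_8UC3)
        swap_rb_caller(src,dst,s);    //the 3-channel caller shown above
    else
        swap_rb_caller_c4(src,dst,s); //assumed uchar4 caller, not shown
}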

  The main.cpp file is the main program, which completes data input, business processing and data output.

Summary

  Ease of programming and controllability are relative: the more convenient the programming, the less control you have. In practice, a balance should be struck between simplicity and control, and the appropriate method should be chosen according to the application requirements; method three is generally recommended.

