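OpenCL Programming Guide - 8: Images and Samplers

Image and Sampler Objects

GPUs were originally designed to render 3D graphics at high performance. One of the most important features of the 3D graphics pipeline is applying texture images to polygon surfaces, so GPUs evolved to access and filter texture images extremely efficiently. Most image operations can be emulated with the generic memory objects introduced in Chapter 7, but the emulation is usually much slower than using image objects. Image objects also make operations such as texture edge clamping and filtering very easy.

The first thing to understand, then, is the main reason image objects exist in OpenCL: they let a program take full advantage of the high-performance texture hardware in the GPU. Other hardware can offer additional benefits as well, so image objects are the best way to handle 2D and 3D image data in OpenCL.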

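An image object encapsulates several pieces of information about an image:
1) Image dimensions: the width and height of a 2D image (plus the depth of a 3D image).
2) Image format: the bit depth and layout of the image pixels in memory.
3) Memory access flags: for example, whether the image is for reading, for writing, or for both.

A kernel needs a sampler to fetch data from an image object. The sampler tells the image-reading functions how to access the image:
1) Coordinate mode: whether the texture coordinates used to fetch from the image are normalized to the range [0…1] or lie in the range [0…image_dim - 1].
2) Addressing mode: what a fetch does when the coordinate falls outside the image bounds.
3) Filter mode: whether a fetch takes a single sample or filters several samples (for example, bilinear filtering).

One thing about samplers can be confusing at first: there are two ways to create one. A sampler can be declared directly in kernel code (using sampler_t) or created as a sampler object in the C/C++ host program. The main reason to create the sampler as an object rather than declare it statically in the code is that the same kernel can then be used with different filtering and addressing options.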

Creating Image Objects

Image objects are created with clCreateImage2D() or clCreateImage3D():

cl_mem clCreateImage2D(cl_context context,
                       cl_mem_flags flags,
                       const cl_image_format * image_format,
                       size_t image_width,
                       size_t image_height,
                       size_t image_row_pitch,
                       void * host_ptr,
                       cl_int * errcode_ret)

cl_mem clCreateImage3D(cl_context context,
                       cl_mem_flags flags,
                       const cl_image_format * image_format,
                       size_t image_width,
                       size_t image_height,
                       size_t image_depth,
                       size_t image_row_pitch,
                       size_t image_slice_pitch,
                       void * host_ptr,
                       cl_int * errcode_ret)

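/*
context           The context in which the image object is created.
flags             A bit field that specifies allocation and usage information for the image.
                  Legal values are defined by the cl_mem_flags enumeration; see Table 7-1.

image_format      Describes the channel order and the image channel data type.
image_width       Width of the image in pixels.
image_height      Height of the image in pixels.
image_depth       (3D images only) The number of slices in the image.

image_row_pitch   If host_ptr is not NULL, the number of bytes in each row of the image.
                  If 0, the pitch is taken to be image_width * bytes_per_pixel.

image_slice_pitch (3D images only) If host_ptr is not NULL, the number of bytes in each slice of the image.
                  If 0, the pitch is taken to be image_height * image_row_pitch.

host_ptr          Pointer to an image buffer laid out linearly in memory.
                  For a 2D image, the buffer is a linear array of scanlines.
                  For a 3D image, the buffer is a linear array of 2D image slices,
                  each with the same layout as a 2D image.

errcode_ret       If non-NULL, the error code of the function is returned through this parameter.
*/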
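The pitch rules above are easy to get wrong, so here is a minimal sketch of them in plain C++. ImagePitches, ResolvePitches, and PixelOffset are hypothetical helper names introduced only for illustration; they are not part of the OpenCL API.

```cpp
#include <cstddef>

// Hypothetical helper type: the pitches that apply once a value of 0
// ("use the default") has been resolved.
struct ImagePitches {
    std::size_t rowPitch;    // bytes per scanline
    std::size_t slicePitch;  // bytes per 2D slice (only meaningful for 3D images)
};

// bytesPerPixel is, for example, 4 for CL_RGBA with CL_UNORM_INT8.
ImagePitches ResolvePitches(std::size_t width, std::size_t height,
                            std::size_t bytesPerPixel,
                            std::size_t rowPitch, std::size_t slicePitch)
{
    ImagePitches p;
    // A row pitch of 0 means "tightly packed rows".
    p.rowPitch = (rowPitch != 0) ? rowPitch : width * bytesPerPixel;
    // A slice pitch of 0 means "tightly packed slices".
    p.slicePitch = (slicePitch != 0) ? slicePitch : height * p.rowPitch;
    return p;
}

// Byte offset of pixel (x, y, z) inside the linear host buffer.
std::size_t PixelOffset(const ImagePitches& p, std::size_t bytesPerPixel,
                        std::size_t x, std::size_t y, std::size_t z)
{
    return z * p.slicePitch + y * p.rowPitch + x * bytesPerPixel;
}
```

For a 640 × 480 RGBA8 image with both pitches passed as 0, this yields a row pitch of 2560 bytes and a slice pitch of 1228800 bytes.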

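Listing 8-1, from the ImageFilter2D example, shows how to load an image from a file with the FreeImage library and create a 2D image object from its contents. The image is first loaded from disk and stored in a 32-bit RGBA buffer in which each channel is 1 byte (8 bits). Next, a cl_image_format structure is set up with channel order CL_RGBA and channel data type CL_UNORM_INT8. Finally, the image is created with clCreateImage2D(). The 32-bit image buffer is passed as host_ptr and copied to the OpenCL device. The mem_flags are set to CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, which copies the data from the host pointer and stores it in a 2D image object that is read-only in the kernel.

An important point is that clCreateImage2D() and clCreateImage3D() return a cl_mem object. There is no special object type for images, so they must be released with the standard memory object functions such as clReleaseMemObject().

// Listing 8-1 Creating a 2D image object from a file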

cl_mem LoadImage(cl_context context, char* fileName, int& width, int& height)
{
	FREE_IMAGE_FORMAT format = FreeImage_GetFileType(fileName, 0);
	FIBITMAP* image = FreeImage_Load(format, fileName);

	// Convert to 32-bit image
	FIBITMAP* temp = image;
	image = FreeImage_ConvertTo32Bits(image);
	FreeImage_Unload(temp);

	width = FreeImage_GetWidth(image);
	height = FreeImage_GetHeight(image);

	char* buffer = new char[width * height * 4];
	memcpy(buffer, FreeImage_GetBits(image), width * height * 4);

	FreeImage_Unload(image);

	// Create OpenCL image
	cl_image_format clImageFormat;
	clImageFormat.image_channel_order = CL_RGBA;
	clImageFormat.image_channel_data_type = CL_UNORM_INT8;

	cl_int errNum;
	cl_mem clImage;
	clImage = clCreateImage2D(context,
		CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		&clImageFormat,
		width,
		height,
		0,
		buffer,
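		&errNum);

	// CL_MEM_COPY_HOST_PTR copies the pixels during creation,
	// so the staging buffer is no longer needed
	delete[] buffer;

	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error creating CL image object" << std::endl;
		return 0;
	}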

	return clImage;
}

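In addition to the 2D image object used as input, the example program creates an output 2D image object that will store the result of applying the Gaussian filter to the input image. The output object is created with the code shown in Listing 8-2. Note that no host_ptr is specified, because the image will be filled with data by the kernel. The mem_flags are set to CL_MEM_WRITE_ONLY because the image is write-only in the kernel rather than read-only.

// Listing 8-2 Creating a 2D image object for output

// Create output image object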
cl_image_format clImageFormat;
clImageFormat.image_channel_order = CL_RGBA;
clImageFormat.image_channel_data_type = CL_UNORM_INT8;
imageObjects[1] = clCreateImage2D(context,
	CL_MEM_WRITE_ONLY,
	&clImageFormat,
	width,
	height,
	0,
	NULL,
	&errNum);

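After an image object is created, its information can be queried with the generic memory object function clGetMemObjectInfo() introduced in Chapter 7. Additional image-specific information can be queried with clGetImageInfo():

cl_int clGetImageInfo(cl_mem image,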
                      cl_image_info param_name,
                      size_t param_value_size,
                      void * param_value,
                      size_t * param_value_size_ret)

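/*
image      A valid image object to query.
param_name The parameter to query, which must be one of the following:
           CL_IMAGE_FORMAT (cl_image_format): The format the image was created with.
           CL_IMAGE_ELEMENT_SIZE (size_t): The size in bytes of a single pixel element of the image.
           CL_IMAGE_ROW_PITCH (size_t): The number of bytes in each row of the image.
           CL_IMAGE_SLICE_PITCH (size_t): The number of bytes in each 2D slice of a 3D image;
                                          0 for a 2D image.
           CL_IMAGE_WIDTH (size_t): The image width in pixels.
           CL_IMAGE_HEIGHT (size_t): The image height in pixels.
           CL_IMAGE_DEPTH (size_t): For a 3D image, the image depth in pixels;
                                    0 for a 2D image.

param_value_size     The number of bytes in param_value.
param_value          Pointer to the location where the result is stored.
                     Enough bytes must be allocated at this location to hold the requested result.
param_value_size_ret The number of bytes actually written to param_value.
*/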

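Image Formats

As Listing 8-1 shows, the cl_image_format argument passed to clCreateImage2D() and clCreateImage3D() specifies how the pixels of the image are laid out in memory. The cl_image_format structure specifies the channel order and bit representation and is defined as follows: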

typedef struct _cl_image_format
{
     cl_channel_order image_channel_order;
     cl_channel_type image_channel_data_type;
} cl_image_format;

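The legal values for image_channel_order and image_channel_data_type are given in Tables 8-1 and 8-2. Besides describing how the image bits are laid out in memory, cl_image_format also determines how the results are interpreted when read in the kernel. The choice of channel data type determines which OpenCL C functions are appropriate for reading and writing the image (for example read_imagef, read_imagei, or read_imageui). The last column of Table 8-1 shows how the image channel order affects the interpretation of the fetched result in the kernel.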


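Table 8-1 Image channel orders
Channel order        Description                                                    Result read in kernel
CL_R, CL_Rx          One channel of image data, read into the R component           (R, 0.0, 0.0, 1.0)
                     in the kernel. CL_Rx contains two channels, but only the
                     first is available when read in the kernel.

CL_A                 One channel of image data, read into the A component           (0.0, 0.0, 0.0, A)
                     in the kernel.

CL_INTENSITY         One channel of image data, read into all color components      (I, I, I, I)
                     in the kernel. This order can be used only with channel data
                     types CL_UNORM_INT8, CL_UNORM_INT16, CL_SNORM_INT8,
                     CL_SNORM_INT16, CL_HALF_FLOAT, or CL_FLOAT.

CL_RG, CL_RGx        Two channels of image data, read into the R and G              (R, G, 0.0, 1.0)
                     components in the kernel. CL_RGx contains three channels,
                     but the third channel of data is ignored.

CL_RA                Two channels of image data, read into the R and A              (R, 0.0, 0.0, A)
                     components in the kernel.

CL_RGB, CL_RGBx      Three channels of image data, read into the R, G, and B        (R, G, B, 1.0)
                     components in the kernel. These orders can be used only
                     with channel data types CL_UNORM_SHORT_565,
                     CL_UNORM_SHORT_555, or CL_UNORM_INT_101010.

CL_RGBA, CL_BGRA,    Four channels of image data, read into the R, G, B, and A      (R, G, B, A)
CL_ARGB              components in the kernel. CL_BGRA and CL_ARGB can be used
                     only with channel data types CL_UNORM_INT8, CL_SNORM_INT8,
                     CL_SIGNED_INT8, or CL_UNSIGNED_INT8.

CL_LUMINANCE         One channel of image data, replicated into the R, G, and B     (L, L, L, 1.0)
                     components in the kernel. This order can be used only with
                     channel data types CL_UNORM_INT8, CL_UNORM_INT16,
                     CL_SNORM_INT8, CL_SNORM_INT16, CL_HALF_FLOAT, or CL_FLOAT.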
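The last column of Table 8-1 can be read as a small expansion function. The sketch below restates it in plain C++; ChannelOrder, Float4, and ExpandChannels are hypothetical names used only for illustration (real code would use the CL_* constants, and the expansion is performed by the OpenCL runtime, not by user code).

```cpp
// Hypothetical stand-ins for a few CL_* channel orders from Table 8-1.
enum class ChannelOrder { R, A, Intensity, RG, RA, RGB, RGBA, Luminance };

// The four-component value a kernel receives from a read_image call.
struct Float4 { float x, y, z, w; };

// Expand the channels stored for one pixel (in storage order) into the
// (R, G, B, A) result listed in the last column of Table 8-1.
Float4 ExpandChannels(ChannelOrder order, const float stored[4])
{
    const float c0 = stored[0], c1 = stored[1], c2 = stored[2], c3 = stored[3];
    switch (order) {
    case ChannelOrder::R:         return Float4{c0, 0.0f, 0.0f, 1.0f};
    case ChannelOrder::A:         return Float4{0.0f, 0.0f, 0.0f, c0};
    case ChannelOrder::Intensity: return Float4{c0, c0, c0, c0};
    case ChannelOrder::RG:        return Float4{c0, c1, 0.0f, 1.0f};
    case ChannelOrder::RA:        return Float4{c0, 0.0f, 0.0f, c1};
    case ChannelOrder::RGB:       return Float4{c0, c1, c2, 1.0f};
    case ChannelOrder::RGBA:      return Float4{c0, c1, c2, c3};
    case ChannelOrder::Luminance: return Float4{c0, c0, c0, 1.0f};
    }
    return Float4{0.0f, 0.0f, 0.0f, 0.0f};  // not reached for the orders above
}
```

Missing color channels read as 0.0 and a missing alpha channel reads as 1.0, which is why a CL_R image looks opaque red-only to the kernel.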

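Table 8-2 Image channel data types
Channel data type       Description
CL_SNORM_INT8           Each 8-bit integer value is mapped to the range [-1.0, 1.0]
CL_SNORM_INT16          Each 16-bit integer value is mapped to the range [-1.0, 1.0]
CL_UNORM_INT8           Each 8-bit integer value is mapped to the range [0.0, 1.0]
CL_UNORM_INT16          Each 16-bit integer value is mapped to the range [0.0, 1.0]
CL_SIGNED_INT8          Each 8-bit integer value is read into the integer range [-128, 127]
CL_SIGNED_INT16         Each 16-bit integer value is read into the integer range [-32768, 32767]
CL_SIGNED_INT32         Each 32-bit integer value is read into the integer range [-2147483648, 2147483647]
CL_UNSIGNED_INT8        Each 8-bit unsigned integer value is read into the unsigned integer range [0, 255]
CL_UNSIGNED_INT16       Each 16-bit unsigned integer value is read into the unsigned integer range [0, 65535]
CL_UNSIGNED_INT32       Each 32-bit unsigned integer value is read into the unsigned integer range [0, 4294967295]
CL_HALF_FLOAT           Each 16-bit component is treated as a half-precision floating-point value
CL_FLOAT                Each 32-bit component is treated as a single-precision floating-point value
CL_UNORM_SHORT_565      A 5:6:5 16-bit value in which each component (R, G, B) is normalized to the range [0.0, 1.0]
CL_UNORM_SHORT_555      An x:5:5:5 16-bit value in which each component (R, G, B) is normalized to the range [0.0, 1.0]
CL_UNORM_INT_101010     An x:10:10:10 32-bit value in which each component (R, G, B) is normalized to the range [0.0, 1.0]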
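To make the normalized mappings in Table 8-2 concrete, the following sketch converts a few stored values to the floats a kernel would see. The function names are hypothetical, and the bit positions assumed for the 5:6:5 format (R in the high bits, B in the low bits) are my reading of the format name rather than something stated in the table.

```cpp
#include <algorithm>
#include <cstdint>

// CL_UNORM_INT8: [0, 255] maps linearly onto [0.0, 1.0].
float UnormInt8ToFloat(std::uint8_t v)
{
    return static_cast<float>(v) / 255.0f;
}

// CL_SNORM_INT8: [-128, 127] maps onto [-1.0, 1.0].
// Dividing by 127 would send -128 slightly below -1.0, so it is clamped.
float SnormInt8ToFloat(std::int8_t v)
{
    return std::max(static_cast<float>(v) / 127.0f, -1.0f);
}

struct Rgb { float r, g, b; };

// CL_UNORM_SHORT_565: unpack 5 bits of R, 6 bits of G, 5 bits of B
// (assumed layout: R in bits 15..11, G in bits 10..5, B in bits 4..0).
Rgb Unorm565ToFloat(std::uint16_t v)
{
    Rgb c;
    c.r = static_cast<float>((v >> 11) & 0x1F) / 31.0f;
    c.g = static_cast<float>((v >> 5) & 0x3F) / 63.0f;
    c.b = static_cast<float>(v & 0x1F) / 31.0f;
    return c;
}
```

This is why the ImageFilter2D kernel, which uses CL_UNORM_INT8, works with color values in [0.0, 1.0] even though each channel is stored as a byte.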

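Any of the image formats in Tables 8-1 and 8-2 may be supported by an OpenCL implementation, but only a subset is mandatory. Table 8-3 shows the image formats that every OpenCL implementation must support (if it supports images at all). An implementation may not support images at all, which can be checked by querying the Boolean CL_DEVICE_IMAGE_SUPPORT with clGetDeviceInfo(). If images are supported, the formats in Table 8-3 can be used directly without querying which formats are available.

Table 8-3 Mandatory supported image formats
Channel order                  Channel data type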
CL_RGBA                        CL_UNORM_INT8
                               CL_UNORM_INT16
                               CL_SIGNED_INT8
                               CL_SIGNED_INT16
                               CL_SIGNED_INT32
                               CL_UNSIGNED_INT8
                               CL_UNSIGNED_INT16
                               CL_UNSIGNED_INT32
                               CL_FLOAT

CL_BGRA                        CL_UNORM_INT8

If an image format not listed in Table 8-3 is used, OpenCL must be queried with clGetSupportedImageFormats() to determine whether that format is supported:

cl_int clGetSupportedImageFormats(cl_context context,
                                  cl_mem_flags flags,
                                  cl_mem_object_type image_type,
                                  cl_uint num_entries,
                                  cl_image_format * image_formats,
                                  cl_uint * num_image_formats)
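/*
context           The context for which supported image formats are queried.
flags             A bit field that specifies allocation and usage information for the image.
                  Legal values are defined by the cl_mem_flags enumeration; see Table 7-1.
                  Set this to the flags you plan to use when creating the image.

image_type        The image type, which must be CL_MEM_OBJECT_IMAGE2D or CL_MEM_OBJECT_IMAGE3D.
num_entries       The number of entries that can be returned.

image_formats     Pointer to the location that receives the list of supported image formats.
                  It can be set to NULL to query only the number of supported formats.

num_image_formats Pointer to a cl_uint that receives the number of image formats.
*/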

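Querying for Image Support

The ImageFilter2D example uses only a mandatory format, so it checks only whether images are supported, as shown in Listing 8-3. A program that uses a non-mandatory format would also need to call clGetSupportedImageFormats() to make sure the format is supported.

Listing 8-3 Querying the device for image support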
// Make sure the device supports images, otherwise exit
cl_bool imageSupport = CL_FALSE;
clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT, sizeof(cl_bool),
	&imageSupport, NULL);
if (imageSupport != CL_TRUE)
{
	std::cerr << "OpenCL device does not support images." << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 1;
}

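Creating Sampler Objects

So far we have seen how the ImageFilter2D example creates image objects for the input and output images. The kernel is almost ready to run, but one more object is needed: a sampler object. The sampler object specifies the filtering, addressing, and coordinate modes used when fetching from an image. Each of these options corresponds to a texture-fetch capability of GPU hardware.

The filter mode specifies whether data is fetched with nearest sampling or linear sampling. With nearest sampling, the value is read from the image location closest to the coordinate. With linear sampling, several values near the coordinate are combined in a weighted average. For a 2D image, the linear filter takes the 4 closest samples and averages them, which is called bilinear sampling. For a 3D image, the linear filter takes 4 samples from each of the nearest slices and then linearly interpolates between those averages, which is called trilinear sampling. The cost of filtering varies with the GPU hardware, but it is generally considered quite efficient, and far more efficient than filtering by hand.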
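As a rough CPU-side illustration of the two filter modes, the sketch below samples a single-channel image with nearest and bilinear filtering. GrayImage, SampleNearest, and SampleBilinear are hypothetical names; the half-pixel offset (pixel centers at i + 0.5) and the clamp-to-edge behavior are assumptions of this sketch, and real hardware may differ in precision.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical single-channel image stored row by row.
struct GrayImage {
    int width;
    int height;
    std::vector<float> pixels;

    // Fetch with clamp-to-edge addressing.
    float At(int x, int y) const
    {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[y * width + x];
    }
};

// Nearest sampling: take the pixel that contains the coordinate.
float SampleNearest(const GrayImage& img, float u, float v)
{
    return img.At(static_cast<int>(std::floor(u)),
                  static_cast<int>(std::floor(v)));
}

// Bilinear sampling: weighted average of the 4 pixels whose centers
// surround the coordinate (u, v), given in non-normalized pixel units.
float SampleBilinear(const GrayImage& img, float u, float v)
{
    const float fx = u - 0.5f, fy = v - 0.5f;
    const int x0 = static_cast<int>(std::floor(fx));
    const int y0 = static_cast<int>(std::floor(fy));
    const float a = fx - x0;  // horizontal weight of the right-hand pixels
    const float b = fy - y0;  // vertical weight of the lower pixels
    return (1 - a) * (1 - b) * img.At(x0, y0)
         + a       * (1 - b) * img.At(x0 + 1, y0)
         + (1 - a) * b       * img.At(x0, y0 + 1)
         + a       * b       * img.At(x0 + 1, y0 + 1);
}
```

On a 2 × 2 image holding 0, 1, 2, 3, sampling at the shared corner (1.0, 1.0) blends all four pixels equally, while sampling at a pixel center such as (0.5, 0.5) returns that pixel unchanged.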

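The coordinate mode specifies whether the coordinates used to read from the image are normalized (floating-point values in the range [0.0, 1.0]) or non-normalized (integer values in the range [0, image_dimension - 1]). Normalized coordinates are independent of the image size; non-normalized coordinates are expressed in pixels within the image dimensions.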
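The relationship between the two coordinate modes can be sketched with two small conversions. The function names are hypothetical, and the use of pixel centers at (i + 0.5) / size is an assumption of this sketch.

```cpp
#include <cmath>

// Normalized coordinate s -> pixel index, by scaling with the image size.
// s == 1.0 lands one past the last pixel; what happens then is decided by
// the sampler's addressing mode.
int NormalizedToPixel(float s, int size)
{
    return static_cast<int>(std::floor(s * size));
}

// Pixel index -> normalized coordinate of that pixel's center.
float PixelCenterToNormalized(int i, int size)
{
    return (static_cast<float>(i) + 0.5f) / static_cast<float>(size);
}
```

The same normalized coordinate 0.5 addresses pixel 320 of a 640-pixel-wide image and pixel 50 of a 100-pixel-wide one, which is the sense in which normalized coordinates do not depend on the image size.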

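The addressing mode specifies what to do when a coordinate falls outside the range [0.0, 1.0] (for normalized coordinates) or [0, dimension - 1] (for non-normalized coordinates). The modes are given in the description of clCreateSampler():

cl_sampler clCreateSampler(cl_context context,
                           cl_bool normalized_coords,
                           cl_addressing_mode addressing_mode,
                           cl_filter_mode filter_mode,
                           cl_int * errcode_ret)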

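/*
context           The context in which the sampler object is created.
normalized_coords Whether coordinates are normalized floating-point values or integer values within the image dimensions.

addressing_mode   Specifies what happens when the image is fetched with a coordinate outside the image bounds.

                  CL_ADDRESS_CLAMP: Coordinates outside the image return the border color.
                                    For CL_A, CL_INTENSITY, CL_Rx, CL_RA, CL_RGx, CL_RGBx, CL_ARGB, CL_BGRA, and CL_RGBA, this color is (0.0, 0.0, 0.0, 0.0).
                                    For CL_R, CL_RG, CL_RGB, and CL_LUMINANCE, this color is (0.0, 0.0, 0.0, 1.0).

                  CL_ADDRESS_CLAMP_TO_EDGE: Coordinates are clamped to the edge of the image.
                  CL_ADDRESS_REPEAT: Coordinates outside the image repeat.
                  CL_ADDRESS_MIRRORED_REPEAT: Coordinates outside the image are mirrored and repeat.

filter_mode       Specifies how the image is sampled.
                  CL_FILTER_NEAREST: Take the sample closest to the coordinate.
                  CL_FILTER_LINEAR: Take an average of the samples closest to the coordinate.
                                    For 2D images this performs bilinear filtering;
                                    for 3D images it performs trilinear filtering.

errcode_ret       If non-NULL, the error code of the function is returned through this parameter.
*/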
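The addressing modes described above can be pictured as a function that maps an out-of-range pixel index back into the image (or to a border marker). The sketch below is only a CPU-side illustration on integer indices; AddressMode, kBorder, and ResolveIndex are hypothetical names, and as far as I know real OpenCL samplers apply the repeat modes to normalized coordinates rather than to integer indices.

```cpp
// Hypothetical stand-ins for the CL_ADDRESS_* modes.
enum class AddressMode { ClampToEdge, Clamp, Repeat, MirroredRepeat };

// Sentinel meaning "outside the image: return the border color".
const int kBorder = -1;

// Map a pixel index i (possibly out of range) to an index in
// [0, size - 1], or to kBorder for CL_ADDRESS_CLAMP-style behavior.
int ResolveIndex(AddressMode mode, int i, int size)
{
    switch (mode) {
    case AddressMode::ClampToEdge:
        return i < 0 ? 0 : (i >= size ? size - 1 : i);
    case AddressMode::Clamp:
        return (i < 0 || i >= size) ? kBorder : i;
    case AddressMode::Repeat: {
        int m = i % size;               // may be negative in C++
        return m < 0 ? m + size : m;
    }
    case AddressMode::MirroredRepeat: {
        const int period = 2 * size;    // forward pass followed by mirrored pass
        int m = i % period;
        if (m < 0) m += period;
        return m < size ? m : period - 1 - m;
    }
    }
    return kBorder;
}
```

For a 4-pixel-wide image, index 5 resolves to 3 under clamp-to-edge, to 1 under repeat, and to 2 under mirrored repeat.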

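The ImageFilter2D example creates a sampler that performs nearest sampling and clamps coordinates to the image edge, as shown in Listing 8-4. The coordinates are non-normalized, so the x coordinate is an integer in [0, width - 1] and the y coordinate is an integer in [0, height - 1].

Listing 8-4 Creating a sampler object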
// Create sampler for sampling image object
sampler = clCreateSampler(context,
	CL_FALSE, // Non-normalized coordinates
	CL_ADDRESS_CLAMP_TO_EDGE,
	CL_FILTER_NEAREST,
	&errNum);

if (errNum != CL_SUCCESS)
{
	std::cerr << "Error creating CL sampler object." << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 1;
}

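Sampler objects do not have to be created in the C program. In the ImageFilter2D example, the sampler object created in Listing 8-4 is passed to the kernel function as an argument. The advantage of creating the sampler this way is that its properties can be changed without modifying the kernel. However, a sampler can also be created directly in kernel code. For example, the following sampler declared in kernel code behaves identically: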

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;

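Whether you need the flexibility of a sampler object created with clCreateSampler() or a sampler declared directly in the kernel is entirely up to you. For the ImageFilter2D example there is really no need to create the sampler outside the kernel; it is done only to show that it is possible. In general, though, doing so does provide more flexibility.

When the application has finished with a sampler object, it can release the object with clReleaseSampler():

cl_int clReleaseSampler(cl_sampler sampler)
// sampler  The sampler object to release.

The settings of a sampler object can also be queried with clGetSamplerInfo():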

cl_int clGetSamplerInfo(cl_sampler sampler,
                        cl_sampler_info param_name,
                        size_t param_value_size,
                        void * param_value,
                        size_t * param_value_size_ret)

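/*
sampler    A valid sampler object to query.
param_name The parameter to query, which must be one of the following:
           CL_SAMPLER_REFERENCE_COUNT (cl_uint): The reference count of the sampler object.
           CL_SAMPLER_CONTEXT (cl_context): The context the sampler is associated with.
           CL_SAMPLER_NORMALIZED_COORDS (cl_bool): Whether coordinates are normalized or non-normalized.
           CL_SAMPLER_ADDRESSING_MODE (cl_addressing_mode): The addressing mode of the sampler.
           CL_SAMPLER_FILTER_MODE (cl_filter_mode): The filter mode of the sampler.

param_value_size     The size in bytes of the memory pointed to by param_value.

param_value          Pointer to the location where the result is stored.
                     Enough bytes must be allocated at this location to hold the requested result.

param_value_size_ret The number of bytes actually written to param_value.
*/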

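OpenCL C Functions for Working with Images

We have explained how the ImageFilter2D example creates its image objects and sampler object. Now we turn to the Gaussian filter kernel itself, shown in Listing 8-5. A Gaussian filter is a convolution kernel commonly used to smooth or blur an image by reducing its high-frequency noise.

Listing 8-5 Gaussian filter kernel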

// Gaussian filter of image

__kernel void gaussian_filter(__read_only image2d_t srcImg,
                              __write_only image2d_t dstImg,
                              sampler_t sampler,
                              int width, int height)
{
    // Gaussian Kernel is:
    // 1  2  1
    // 2  4  2
    // 1  2  1
    float kernelWeights[9] = { 1.0f, 2.0f, 1.0f,
                               2.0f, 4.0f, 2.0f,
                               1.0f, 2.0f, 1.0f };

    int2 startImageCoord = (int2) (get_global_id(0) - 1, get_global_id(1) - 1);
    int2 endImageCoord   = (int2) (get_global_id(0) + 1, get_global_id(1) + 1);
    int2 outImageCoord = (int2) (get_global_id(0), get_global_id(1));

    if (outImageCoord.x < width && outImageCoord.y < height)
    {
        int weight = 0;
        float4 outColor = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        for( int y = startImageCoord.y; y <= endImageCoord.y; y++)
        {
            for( int x = startImageCoord.x; x <= endImageCoord.x; x++)
            {
                outColor += (read_imagef(srcImg, sampler, (int2)(x, y)) * (kernelWeights[weight] / 16.0f));
                weight += 1;
            }
        }

        // Write the output value to image
        write_imagef(dstImg, outImageCoord, outColor);
    }
}

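gaussian_filter() takes five arguments:

__read_only image2d_t srcImg: The source image object to be filtered.
__write_only image2d_t dstImg: The destination image object to which the filtered result is written.
sampler_t sampler: The sampler object that specifies the addressing, coordinate, and filter modes used by read_imagef().
int width, int height: The width and height, in pixels, of the image to filter.
                       Note that the source and destination image objects are the same size.

The ImageFilter2D program sets the kernel arguments and queues the kernel for execution, as shown in Listing 8-6. The kernel arguments are set by calling clSetKernelArg() for each argument, and the kernel is then queued for execution. localWorkSize is hard-coded to 16 × 16 (it may need tuning to the optimal size for a device, but a hard-coded value is used here for simplicity). The global work size rounds the width and height up to the nearest multiple of localWorkSize. This is necessary because globalWorkSize must be a multiple of localWorkSize. With this setup the kernel can process images of any size, not only those whose width and height are multiples of 16.

Listing 8-6 Queuing the Gaussian kernel for execution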
// Set the kernel arguments
errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &imageObjects[0]);
errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &imageObjects[1]);
errNum |= clSetKernelArg(kernel, 2, sizeof(cl_sampler), &sampler);
errNum |= clSetKernelArg(kernel, 3, sizeof(cl_int), &width);
errNum |= clSetKernelArg(kernel, 4, sizeof(cl_int), &height);
if (errNum != CL_SUCCESS)
{
	std::cerr << "Error setting kernel arguments." << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 1;
}

size_t localWorkSize[2] = { 16, 16 };
size_t globalWorkSize[2] = { RoundUp(localWorkSize[0], width),
                             RoundUp(localWorkSize[1], height) };

// Queue the kernel up for execution
errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
	globalWorkSize, localWorkSize,
	0, NULL, NULL);
if (errNum != CL_SUCCESS)
{
	std::cerr << "Error queuing kernel for execution." << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 1;
}

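Looking back at the Gaussian filter kernel in Listing 8-5, the image coordinates are checked to see whether they lie within the image width and height. The check is necessary because the global work size was rounded up. If the image size were known to be a multiple of some value, the test could be dropped, but this example is written to handle arbitrary image sizes, so the kernel performs the test to make sure that reads and writes stay within the image.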
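Listing 8-6 calls a helper named RoundUp whose definition does not appear in this excerpt. A plausible version, assuming the argument order used in the listing (group size first, then the size to round), might look like this:

```cpp
#include <cstddef>

// Round globalSize up to the nearest multiple of groupSize.
// Assumes groupSize > 0.
std::size_t RoundUp(std::size_t groupSize, std::size_t globalSize)
{
    const std::size_t remainder = globalSize % groupSize;
    if (remainder == 0)
        return globalSize;                       // already a multiple
    return globalSize + groupSize - remainder;   // pad up to the next multiple
}
```

With a 16 × 16 work-group, a 500-pixel-wide image gets a global size of 512 in that dimension; the 12 extra work-items per row are the ones the bounds check in the kernel turns into no-ops.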

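The main loop of gaussian_filter() (the nested for loops in Listing 8-5) reads the nine values in a 3×3 region. Each value read from the image is multiplied by a weighting factor specified by the Gaussian convolution kernel. The effect is to blur the input image. Each value is read from the image with the OpenCL C function read_imagef():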

read_imagef(srcImg, sampler, (int2)(x, y));

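The first argument is the image object, the second is the sampler, and the third is the image coordinate to use. Here the sampler specifies non-normalized coordinates, so the (x, y) values are integers in the ranges [0, width - 1] and [0, height - 1]. If the sampler used normalized coordinates, the call would look the same except that the last argument would be a float2 holding the normalized coordinates. read_imagef() returns a float4 color. The range of the color values depends on the format the image was created with. Our image is specified as CL_UNORM_INT8, so the returned color values fall in the floating-point range [0.0, 1.0]. Because the image channel order is CL_RGBA, the returned color is read into (R, G, B, A) of the result.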

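Tables 5-16 and 5-17 list all the functions for reading 2D and 3D images. Which image function to use depends on the channel data type specified for the image; the earlier discussion of image formats explained which function applies to which format. Which kind of coordinate to use (integer non-normalized or floating-point normalized) depends on the sampler settings used in the call to read_image{f|i|ui}.

At the end of Listing 8-5, the result of the Gaussian filter kernel is written to the destination image:

write_imagef(dstImg, outImageCoord, outColor);

When writing to an image, the coordinates must be integers within the image dimensions. Image writes take no sampler, because there is no filtering or addressing mode (the coordinates must be in range) and the coordinates are always non-normalized. Which write_image{f|i|ui} to use again depends on the channel format chosen for the destination image. All the functions for writing 2D and 3D images are given in Tables 5-21 and 5-22.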

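Transferring Image Objects

So far we have covered every operation on image objects except moving them. OpenCL provides functions, which can be placed in a command queue, that perform the following transfer operations on images:

clEnqueueReadImage() reads an image from device memory into host memory.
clEnqueueWriteImage() writes an image from host memory into device memory.
clEnqueueCopyImage() copies one image to another.
clEnqueueCopyImageToBuffer() copies an image object (or part of it) into a generic memory buffer.
clEnqueueCopyBufferToImage() copies a generic memory buffer into an image object (or part of it).
clEnqueueMapImage() maps an image (or part of it) to a host memory pointer.

clEnqueueReadImage() enqueues a read of an image from the device into host memory.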

cl_int clEnqueueReadImage(cl_command_queue command_queue,
                          cl_mem image,
                          cl_bool blocking_read,
                          const size_t origin[3],
                          const size_t region[3],
                          size_t row_pitch,
                          size_t slice_pitch,
                          void * ptr,
                          cl_uint num_events_in_wait_list,
                          const cl_event * event_wait_list,
                          cl_event * event)

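/*
command_queue The command queue in which the read command is queued.
image         A valid image object to read from.

blocking_read If CL_TRUE, clEnqueueReadImage blocks until the data has been read into ptr;
              otherwise it returns immediately and the user must query event to check the status of the command.

origin        The (x, y, z) integer coordinates, relative to the image origin, at which to start reading. For a 2D image, z is 0.
region        The (width, height, depth) of the region to read. For a 2D image, depth is 1.
row_pitch     The number of bytes in each row of the image. If 0, the pitch is taken to be image_width * bytes_per_pixel.
slice_pitch   The number of bytes in each slice of a 3D image. If 0, the pitch is taken to be image_height * image_row_pitch.

ptr           Pointer to the host memory into which the data is written.

num_events_in_wait_list The number of entries in the event_wait_list array.
                        Must be 0 if event_wait_list is NULL; otherwise it must be greater than 0.

event_wait_list If non-NULL, an array of events associated with OpenCL commands that must complete first.
                That is, those commands must be in the CL_COMPLETE state before the read command starts executing.

event         If non-NULL, the event corresponding to the read command is returned through this parameter.
*/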
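The roles of origin, region, and row_pitch can be modeled on the CPU with a plain copy loop. The sketch below is only an analogy for a 2D read, not what a driver actually does: ReadRegion2D is a hypothetical name, the source is assumed to be tightly packed, and a row pitch of 0 is taken to mean "rows of the region packed tightly", which is my reading of the default for read operations.

```cpp
#include <cstddef>
#include <cstring>

// Copy a 2D region out of a tightly packed source image into dst,
// placing consecutive region rows rowPitch bytes apart.
void ReadRegion2D(const unsigned char* image, std::size_t imageWidth,
                  std::size_t bytesPerPixel,
                  const std::size_t origin[2], const std::size_t region[2],
                  std::size_t rowPitch, unsigned char* dst)
{
    const std::size_t rowBytes = region[0] * bytesPerPixel;
    if (rowPitch == 0)
        rowPitch = rowBytes;  // assumed default: tightly packed destination rows

    for (std::size_t y = 0; y < region[1]; ++y) {
        const unsigned char* srcRow =
            image + ((origin[1] + y) * imageWidth + origin[0]) * bytesPerPixel;
        std::memcpy(dst + y * rowPitch, srcRow, rowBytes);
    }
}
```

The destination must hold at least rowPitch * (height - 1) + width * bytesPerPixel bytes for the region being read.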

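In the ImageFilter2D example, clEnqueueReadImage() is used with a blocking read to read the Gaussian-filtered image back into a host memory buffer. FreeImage then writes that buffer to disk as an image file, as shown in Listing 8-7.

Listing 8-7 Reading the image back to host memory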
bool SaveImage(char* fileName, char* buffer, int width, int height)
{
	FREE_IMAGE_FORMAT format = FreeImage_GetFIFFromFilename(fileName);
	FIBITMAP* image = FreeImage_ConvertFromRawBits((BYTE*)buffer, width,
		height, width * 4, 32,
		0xFF000000, 0x00FF0000, 0x0000FF00);
	bool saved = (FreeImage_Save(format, image, fileName) == TRUE);
	FreeImage_Unload(image); // release the FreeImage bitmap once it is saved
	return saved;
}

// Read the output buffer back to the Host
char* buffer = new char[width * height * 4];
size_t origin[3] = { 0, 0, 0 };
size_t region[3] = { (size_t)width, (size_t)height, 1 };
errNum = clEnqueueReadImage(commandQueue, imageObjects[1], CL_TRUE,
	origin, region, 0, 0, buffer,
	0, NULL, NULL);
if (errNum != CL_SUCCESS)
{
	std::cerr << "Error reading result buffer." << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 1;
}

An image can also be written from host memory to the destination (device) memory with clEnqueueWriteImage().

cl_int clEnqueueWriteImage(cl_command_queue command_queue,
                          cl_mem image,
                          cl_bool blocking_write,
                          const size_t origin[3],
                          const size_t region[3],
                          size_t input_row_pitch,
                          size_t input_slice_pitch,
                          const void * ptr,
                          cl_uint num_events_in_wait_list,
                          const cl_event * event_wait_list,
                          cl_event * event)
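/*
command_queue     The command queue in which the write command is queued.
image             A valid image object to write to.

blocking_write    If CL_TRUE, clEnqueueWriteImage blocks until the data has been written from ptr;
                  otherwise it returns immediately and the user must query event to check the status of the command.

origin            The (x, y, z) integer coordinates, relative to the image origin, at which to start writing. For a 2D image, z is 0.
region            The (width, height, depth) of the region to write. For a 2D image, depth is 1.
input_row_pitch   The number of bytes in each row of the input image.
input_slice_pitch The number of bytes in each slice of the input 3D image. For a 2D image this value is 0.

ptr               Pointer to the host memory from which the data is written.
                  It must point to enough storage to hold the image bytes specified by the region.

num_events_in_wait_list The number of entries in the event_wait_list array.
                        Must be 0 if event_wait_list is NULL; otherwise it must be greater than 0.

event_wait_list   If non-NULL, an array of events associated with OpenCL commands that must complete first.
                  That is, those commands must be in the CL_COMPLETE state before the write command starts executing.

event             If non-NULL, the event corresponding to the write command is returned through this parameter.
*/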

An image can also be copied from one image object to another without going through host memory. This is the fastest way to copy the contents of one image object to another. The copy is performed with clEnqueueCopyImage().

cl_int clEnqueueCopyImage(cl_command_queue command_queue,
                          cl_mem src_image,
                          cl_mem dst_image,
                          const size_t src_origin[3],
                          const size_t dst_origin[3],
                          const size_t region[3],
                          cl_uint num_events_in_wait_list,
                          const cl_event * event_wait_list,
                          cl_event * event)

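/*
command_queue The command queue in which the copy command is queued.
src_image     A valid image object to read from.
dst_image     A valid image object to write to.
src_origin    The (x, y, z) integer coordinates, relative to the source image origin, at which to start reading.
              For a 2D image, z is 0.
dst_origin    The (x, y, z) integer coordinates, relative to the destination image origin, at which to start writing.
              For a 2D image, z is 0.
region        The (width, height, depth) of the region to read and write. For a 2D image, depth is 1.
num_events_in_wait_list The number of entries in the event_wait_list array.
                        Must be 0 if event_wait_list is NULL; otherwise it must be greater than 0.
event_wait_list If non-NULL, an array of events associated with OpenCL commands that must complete first.
                That is, those commands must be in the CL_COMPLETE state before the copy command starts executing.
event         If non-NULL, the event corresponding to the copy command is returned through this parameter.
*/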

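A generic memory buffer can also be copied into an image, the reverse of what clEnqueueCopyImageToBuffer() does. As with a host memory buffer allocated to hold an image, the region of the memory buffer being copied must have the same linear layout. Copying from a buffer to an image is done with clEnqueueCopyBufferToImage().

cl_int clEnqueueCopyBufferToImage(cl_command_queue command_queue,
                                  cl_mem src_buffer,
                                  cl_mem dst_image,
                                  size_t src_offset,
                                  const size_t dst_origin[3],
                                  const size_t region[3],
                                  cl_uint num_events_in_wait_list,
                                  const cl_event * event_wait_list,
                                  cl_event * event)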
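/*
command_queue The command queue in which the buffer-to-image copy command is queued.
src_buffer    A valid buffer object to read from.
dst_image     A valid image object to write to.
src_offset    The starting offset, in bytes, at which to read from the source memory buffer.

dst_origin    The (x, y, z) integer coordinates, relative to the destination image origin, at which to start writing.
              For a 2D image, z is 0.
region        The (width, height, depth) of the region to write. For a 2D image, depth is 1.

num_events_in_wait_list The number of entries in the event_wait_list array.
                        Must be 0 if event_wait_list is NULL; otherwise it must be greater than 0.

event_wait_list If non-NULL, an array of events associated with OpenCL commands that must complete first.
                That is, those commands must be in the CL_COMPLETE state before the copy command starts executing.

event         If non-NULL, the event corresponding to the copy command is returned through this parameter.
*/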

Finally, there is one more way to access the memory of an image object. Like regular buffers, image objects can be mapped directly into host memory. The mapping is done with clEnqueueMapImage(), and the image is unmapped with the generic buffer function clEnqueueUnmapMemObject().

void * clEnqueueMapImage(cl_command_queue command_queue,
                         cl_mem image,
                         cl_bool blocking_map,
                         cl_map_flags map_flags,
                         const size_t origin[3],
                         const size_t region[3],
                         size_t * image_row_pitch,
                         size_t * image_slice_pitch,
                         cl_uint num_events_in_wait_list,
                         const cl_event * event_wait_list,
                         cl_event * event,
                         cl_int * errcode_ret)
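/*
command_queue     The command queue in which the map command is queued.
image             A valid image object (from which the data is read).

blocking_map      If CL_TRUE, clEnqueueMapImage blocks until the data has been mapped into host memory;
                  otherwise it returns immediately and the user must query event to check the status of the command.

map_flags         A bit field that indicates how the region of the image object specified by (origin, region) is mapped.
                  Legal values are defined by the cl_map_flags enumeration; see Table 7-4.

origin            The (x, y, z) integer coordinates, relative to the image origin, at which to start reading. For a 2D image, z is 0.
region            The (width, height, depth) of the region to read. For a 2D image, depth is 1.
image_row_pitch   If not NULL, set to the row pitch of the mapped image.

image_slice_pitch If not NULL, set to the slice pitch of the mapped 3D image.
                  For a 2D image this value is set to 0.

num_events_in_wait_list The number of entries in the event_wait_list array.
                        Must be 0 if event_wait_list is NULL; otherwise it must be greater than 0.

event_wait_list   If non-NULL, an array of events associated with OpenCL commands that must complete first.
                  That is, those commands must be in the CL_COMPLETE state before the map command starts executing.

event             If non-NULL, the event corresponding to the map command is returned through this parameter.
errcode_ret       If non-NULL, the error code of the function is returned through this parameter.
*/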

The ImageFilter2D example in this chapter can be modified to read the results back to the host with clEnqueueMapImage() instead of clEnqueueReadImage(). Listing 8-8 shows the changes needed to make the example read its results through clEnqueueMapImage().

Listing 8-8 Mapping the image result to a host memory pointer
//Create the image object. Needs to be
//created with CL_MEM_READ_WRITE rather than
//CL_MEM_WRITE_ONLY since it will need to
//be mapped to the host
imageObjects[1] = clCreateImage2D(context,
		CL_MEM_READ_WRITE,
		&clImageFormat,
		width,
		height,
		0,
		NULL,
		&errNum);

//...Execute the kernel...
//Map the results back to a host buffer
size_t rowPitch = 0;
char * buffer = (char*)clEnqueueMapImage(commandQueue,
                                         imageObjects[1],
                                         CL_TRUE,
                                         CL_MAP_READ,
                                         origin,
                                         region,
                                         &rowPitch,
                                         NULL,
                                         0,
                                         NULL,
                                         NULL,
                                         &errNum);
if(errNum != CL_SUCCESS)
{
     std::cerr << "Error mapping result buffer." << std::endl;
     Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
     return 1;
}
// Save the image out to disk
if (!SaveImage(argv[2], buffer, width, height))
{
	std::cerr << "Error writing output image: " << argv[2] << std::endl;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	// buffer is mapped memory owned by the OpenCL runtime; it must not be delete[]d
	return 1;
}
//Unmap the image buffer
errNum = clEnqueueUnmapMemObject(commandQueue,
                                 imageObjects[1],
                                 buffer,
                                 0,
                                 NULL,
                                 NULL);
if(errNum != CL_SUCCESS)
{
     std::cerr << "Error unmapping result buffer." << std::endl;
     Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
     return 1;
}

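This time the image object for the result is created with the memory flag CL_MEM_READ_WRITE (instead of the original CL_MEM_WRITE_ONLY). This is required because clEnqueueMapImage() is called with CL_MAP_READ as the map flag, which lets us read the contents of the returned host buffer. Another change is that the row pitch must be read back explicitly rather than assumed to equal width * bytesPerPixel. In addition, the host pointer buffer must be unmapped with clEnqueueUnmapMemObject() to release its resources.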
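Because the mapped row pitch may be larger than width * bytesPerPixel, code that expects tightly packed rows has to repack the data first. A minimal sketch of that step follows; PackRows is a hypothetical helper, not part of OpenCL or of the original example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copy height rows out of a mapped region whose rows are rowPitch bytes
// apart into a tightly packed buffer (width * bytesPerPixel per row).
std::vector<char> PackRows(const char* mapped, std::size_t rowPitch,
                           std::size_t width, std::size_t height,
                           std::size_t bytesPerPixel)
{
    const std::size_t packedRow = width * bytesPerPixel;
    std::vector<char> packed(packedRow * height);
    for (std::size_t y = 0; y < height; ++y) {
        // Skip any padding bytes at the end of each mapped row.
        std::memcpy(packed.data() + y * packedRow,
                    mapped + y * rowPitch,
                    packedRow);
    }
    return packed;
}
```

In the mapped variant of the example, the packed copy (rather than the mapped pointer itself) would be what gets handed to a routine that assumes a pitch of width * 4, and it must be made before the image is unmapped.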

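One important performance issue to understand about copying and mapping image data is that the OpenCL specification does not mandate the internal storage layout of images. Although an image looks like a linear buffer on the host, an OpenCL implementation is free to store it internally in a nonlinear format. Most commonly, an implementation tiles the image data so that the hardware can access it optimally. The tiled format is opaque (and often proprietary), and users of the implementation can neither see nor access the tiled buffer. From a performance standpoint, however, this means that when an image is read, written, or mapped to or from the host, the implementation may need to retile the data to match its own optimal internal format. The performance impact most likely depends entirely on the underlying OpenCL hardware device, but users should be aware of it so that they trigger these tiling and untiling operations only when absolutely necessary.

Gaussian Filter Kernel Example

The ImageFilter2D example program first loads a 2D image from a file (for example .png or .bmp) and stores the image bits in a 2D image object. It also creates a second 2D image object that will hold the result of running a Gaussian blur filter on the input image. The program queues the kernel for execution and then reads the image back from the OpenCL device into a host memory buffer. Finally, the contents of that host memory buffer are written to a file.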

//main.cpp

#include <iostream>
#include <fstream>
#include <sstream>
#include <string.h>

#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

#include "FreeImage.h"
#pragma warning(disable : 4996)

///
//  Create an OpenCL context on the first available platform using
//  either a GPU or CPU depending on what is available.
//
cl_context CreateContext()
{
	cl_int errNum;
	cl_uint numPlatforms;
	cl_platform_id firstPlatformId;
	cl_context context = NULL;

	// First, select an OpenCL platform to run on.  For this example, we
	// simply choose the first available platform.  Normally, you would
	// query for all available platforms and select the most appropriate one.
	errNum = clGetPlatformIDs(1, &firstPlatformId, &numPlatforms);
	if (errNum != CL_SUCCESS || numPlatforms <= 0)
	{
		std::cerr << "Failed to find any OpenCL platforms." << std::endl;
		return NULL;
	}

	// Next, create an OpenCL context on the platform.  Attempt to
	// create a GPU-based context, and if that fails, try to create
	// a CPU-based context.
	cl_context_properties contextProperties[] =
	{
		CL_CONTEXT_PLATFORM,
		(cl_context_properties)firstPlatformId,
		0
	};
	context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU,
		NULL, NULL, &errNum);
	if (errNum != CL_SUCCESS)
	{
		std::cout << "Could not create GPU context, trying CPU..." << std::endl;
		context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_CPU,
			NULL, NULL, &errNum);
		if (errNum != CL_SUCCESS)
		{
			std::cerr << "Failed to create an OpenCL GPU or CPU context." << std::endl;
			return NULL;
		}
	}

	return context;
}

///
//  Create a command queue on the first device available on the
//  context
//
cl_command_queue CreateCommandQueue(cl_context context, cl_device_id* device)
{
	cl_int errNum;
	cl_device_id* devices;
	cl_command_queue commandQueue = NULL;
	size_t deviceBufferSize = -1;

	// First get the size of the devices buffer
	errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &deviceBufferSize);
	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Failed call to clGetContextInfo(...,CL_CONTEXT_DEVICES,...)";
		return NULL;
	}

	if (deviceBufferSize <= 0)
	{
		std::cerr << "No devices available.";
		return NULL;
	}

	// Allocate memory for the devices buffer
	devices = new cl_device_id[deviceBufferSize / sizeof(cl_device_id)];
	errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceBufferSize, devices, NULL);
	if (errNum != CL_SUCCESS)
	{
		delete[] devices; // avoid leaking the device list on failure
		std::cerr << "Failed to get device IDs";
		return NULL;
	}

	// In this example, we just choose the first available device.  In a
	// real program, you would likely use all available devices or choose
	// the highest performance device based on OpenCL device queries
	commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);
	if (commandQueue == NULL)
	{
		delete[] devices; // avoid leaking the device list on failure
		std::cerr << "Failed to create commandQueue for device 0";
		return NULL;
	}

	*device = devices[0];
	delete[] devices;
	return commandQueue;
}

///
//  Create an OpenCL program from the kernel source file
//
cl_program CreateProgram(cl_context context, cl_device_id device, const char* fileName)
{
	cl_int errNum;
	cl_program program;

	std::ifstream kernelFile(fileName, std::ios::in);
	if (!kernelFile.is_open())
	{
		std::cerr << "Failed to open file for reading: " << fileName << std::endl;
		return NULL;
	}

	std::ostringstream oss;
	oss << kernelFile.rdbuf();

	std::string srcStdStr = oss.str();
	const char* srcStr = srcStdStr.c_str();
	program = clCreateProgramWithSource(context, 1,
		(const char**)&srcStr,
		NULL, NULL);
	if (program == NULL)
	{
		std::cerr << "Failed to create CL program from source." << std::endl;
		return NULL;
	}

	errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
	if (errNum != CL_SUCCESS)
	{
		// Determine the reason for the error
		char buildLog[16384];
		clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
			sizeof(buildLog), buildLog, NULL);

		std::cerr << "Error in kernel: " << std::endl;
		std::cerr << buildLog;
		clReleaseProgram(program);
		return NULL;
	}

	return program;
}


///
//  Cleanup any created OpenCL resources
//
void Cleanup(cl_context context, cl_command_queue commandQueue,
	cl_program program, cl_kernel kernel, cl_mem imageObjects[2],
	cl_sampler sampler)
{
	for (int i = 0; i < 2; i++)
	{
		if (imageObjects[i] != 0)
			clReleaseMemObject(imageObjects[i]);
	}
	if (commandQueue != 0)
		clReleaseCommandQueue(commandQueue);

	if (kernel != 0)
		clReleaseKernel(kernel);

	if (program != 0)
		clReleaseProgram(program);

	if (sampler != 0)
		clReleaseSampler(sampler);

	if (context != 0)
		clReleaseContext(context);

}

///
//  Load an image using the FreeImage library and create an OpenCL
//  image out of it
//
cl_mem LoadImage(cl_context context, char* fileName, int& width, int& height)
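{
	FREE_IMAGE_FORMAT format = FreeImage_GetFileType(fileName, 0);
	FIBITMAP* image = FreeImage_Load(format, fileName);
	// FreeImage_Load returns NULL when the file cannot be read or decoded
	if (image == NULL)
	{
		std::cerr << "Failed to load image file: " << fileName << std::endl;
		return 0;
	}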

	// Convert to 32-bit image
	FIBITMAP* temp = image;
	image = FreeImage_ConvertTo32Bits(image);
	FreeImage_Unload(temp);

	width = FreeImage_GetWidth(image);
	height = FreeImage_GetHeight(image);

	char* buffer = new char[width * height * 4];
	memcpy(buffer, FreeImage_GetBits(image), width * height * 4);

	FreeImage_Unload(image);

	// Create OpenCL image
	cl_image_format clImageFormat;
	clImageFormat.image_channel_order = CL_RGBA;
	clImageFormat.image_channel_data_type = CL_UNORM_INT8;

	cl_int errNum;
	cl_mem clImage;
	clImage = clCreateImage2D(context,
		CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
		&clImageFormat,
		width,
		height,
		0,
		buffer,
		&errNum);

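	// CL_MEM_COPY_HOST_PTR copies the pixels during creation, so the
	// host staging buffer is no longer needed
	delete[] buffer;

	if (errNum != CL_SUCCESS)
	{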
		std::cerr << "Error creating CL image object" << std::endl;
		return 0;
	}

	return clImage;
}

///
//  Save an image using the FreeImage library
//
bool SaveImage(char* fileName, char* buffer, int width, int height)
{
	FREE_IMAGE_FORMAT format = FreeImage_GetFIFFromFilename(fileName);
	FIBITMAP* image = FreeImage_ConvertFromRawBits((BYTE*)buffer, width,
		height, width * 4, 32,
		0xFF000000, 0x00FF0000, 0x0000FF00);
	bool saved = (FreeImage_Save(format, image, fileName) == TRUE);
	// Release the FreeImage bitmap once it has been written out
	FreeImage_Unload(image);
	return saved;
}

///
//  Round up to the nearest multiple of the group size
//
size_t RoundUp(int groupSize, int globalSize)
{
	int r = globalSize % groupSize;
	if (r == 0)
	{
		return globalSize;
	}
	else
	{
		return globalSize + groupSize - r;
	}
}

///
//	main() for ImageFilter2D example
//
int main(int argc, char** argv)
{
	cl_context context = 0;
	cl_command_queue commandQueue = 0;
	cl_program program = 0;
	cl_device_id device = 0;
	cl_kernel kernel = 0;
	cl_mem imageObjects[2] = { 0, 0 };
	cl_sampler sampler = 0;
	cl_int errNum;

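	// Use the paths given on the command line; otherwise fall back to the
	// hard-coded test paths below (adjust them for your own machine)
	std::string src_name = "C://Users//qzh//source//repos//ConsoleApplication5//x64//Debug//picture.jpeg";
	std::string dst_name = "C://Users//qzh//source//repos//ConsoleApplication5//x64//1.png";
	if (argc == 3)
	{
		src_name = argv[1];
		dst_name = argv[2];
	}
	// argv may hold fewer than three entries, so point it at a local
	// array instead of writing past the end of the original one
	char* args[3] = { argv[0], &src_name[0], &dst_name[0] };
	argv = args;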

	// Create an OpenCL context on first available platform
	context = CreateContext();
	if (context == NULL)
	{
		std::cerr << "Failed to create OpenCL context." << std::endl;
		return 1;
	}

	// Create a command-queue on the first device available
	// on the created context
	commandQueue = CreateCommandQueue(context, &device);
	if (commandQueue == NULL)
	{
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Make sure the device supports images, otherwise exit
	cl_bool imageSupport = CL_FALSE;
	clGetDeviceInfo(device, CL_DEVICE_IMAGE_SUPPORT, sizeof(cl_bool),
		&imageSupport, NULL);
	if (imageSupport != CL_TRUE)
	{
		std::cerr << "OpenCL device does not support images." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Load input image from file and load it into
	// an OpenCL image object
	int width, height;
	imageObjects[0] = LoadImage(context, argv[1], width, height);
	if (imageObjects[0] == 0)
	{
		std::cerr << "Error loading: " << std::string(argv[1]) << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Create output image object
	cl_image_format clImageFormat;
	clImageFormat.image_channel_order = CL_RGBA;
	clImageFormat.image_channel_data_type = CL_UNORM_INT8;
	imageObjects[1] = clCreateImage2D(context,
		CL_MEM_WRITE_ONLY,
		&clImageFormat,
		width,
		height,
		0,
		NULL,
		&errNum);

	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error creating CL output image object." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}


	// Create sampler for sampling image object
	sampler = clCreateSampler(context,
		CL_FALSE, // Non-normalized coordinates
		CL_ADDRESS_CLAMP_TO_EDGE,
		CL_FILTER_NEAREST,
		&errNum);

	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error creating CL sampler object." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Create OpenCL program
	program = CreateProgram(context, device, "ImageFilter2D.cl");
	if (program == NULL)
	{
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Create OpenCL kernel
	kernel = clCreateKernel(program, "gaussian_filter", NULL);
	if (kernel == NULL)
	{
		std::cerr << "Failed to create kernel" << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Set the kernel arguments
	errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &imageObjects[0]);
	errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &imageObjects[1]);
	errNum |= clSetKernelArg(kernel, 2, sizeof(cl_sampler), &sampler);
	errNum |= clSetKernelArg(kernel, 3, sizeof(cl_int), &width);
	errNum |= clSetKernelArg(kernel, 4, sizeof(cl_int), &height);
	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error setting kernel arguments." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	size_t localWorkSize[2] = { 16, 16 };
	size_t globalWorkSize[2] = { RoundUp(localWorkSize[0], width),
	                             RoundUp(localWorkSize[1], height) };

	// Queue the kernel up for execution
	errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL,
		globalWorkSize, localWorkSize,
		0, NULL, NULL);
	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error queuing kernel for execution." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		return 1;
	}

	// Read the output image back to the host
	char* buffer = new char[width * height * 4];
	size_t origin[3] = { 0, 0, 0 };
	// Cast explicitly: brace-initializing size_t from int is a narrowing conversion
	size_t region[3] = { static_cast<size_t>(width), static_cast<size_t>(height), 1 };
	errNum = clEnqueueReadImage(commandQueue, imageObjects[1], CL_TRUE,
		origin, region, 0, 0, buffer,
		0, NULL, NULL);
	if (errNum != CL_SUCCESS)
	{
		std::cerr << "Error reading result buffer." << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		delete[] buffer;
		return 1;
	}

	std::cout << std::endl;
	std::cout << "Executed program successfully." << std::endl;

	//memset(buffer, 0xff, width * height * 4);
	// Save the image out to disk
	if (!SaveImage(argv[2], buffer, width, height))
	{
		std::cerr << "Error writing output image: " << argv[2] << std::endl;
		Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
		delete[] buffer;
		return 1;
	}

	delete[] buffer;
	Cleanup(context, commandQueue, program, kernel, imageObjects, sampler);
	return 0;
}
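
The host code above rounds each global work size up to a multiple of the 16 x 16 work-group size, so some work-items fall outside the image; that is why the kernel checks its coordinates against width and height before writing. The following is a minimal standalone sketch of that arithmetic. The names RoundUpToMultiple and PaddingWorkItems are hypothetical helpers introduced only for illustration, and the 1000 x 700 image size in the note below is an assumed example, not a value from the article.

```cpp
#include <cassert>
#include <cstddef>

// Round globalSize up to the nearest multiple of groupSize
// (same arithmetic as RoundUp() in the host listing, using size_t throughout).
std::size_t RoundUpToMultiple(std::size_t groupSize, std::size_t globalSize)
{
    assert(groupSize > 0);
    std::size_t r = globalSize % groupSize;
    return (r == 0) ? globalSize : globalSize + groupSize - r;
}

// Number of launched work-items that fall outside a width x height image
// when both dimensions are padded to a multiple of groupSize.
// (Hypothetical helper, for illustration only.)
std::size_t PaddingWorkItems(std::size_t groupSize, std::size_t width, std::size_t height)
{
    std::size_t paddedWidth = RoundUpToMultiple(groupSize, width);
    std::size_t paddedHeight = RoundUpToMultiple(groupSize, height);
    return paddedWidth * paddedHeight - width * height;
}
```

For an assumed 1000 x 700 image, the padded global size is 1008 x 704, so 9632 work-items land outside the image and must be skipped by the bounds check in the kernel.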

//ImageFilter2D.cl
// Gaussian filter of image

__kernel void gaussian_filter(__read_only image2d_t srcImg,
                              __write_only image2d_t dstImg,
                              sampler_t sampler,
                              int width, int height)
{
    // Gaussian Kernel is:
    // 1  2  1
    // 2  4  2
    // 1  2  1
    float kernelWeights[9] = { 1.0f, 2.0f, 1.0f,
                               2.0f, 4.0f, 2.0f,
                               1.0f, 2.0f, 1.0f };

    int2 startImageCoord = (int2) (get_global_id(0) - 1, get_global_id(1) - 1);
    int2 endImageCoord   = (int2) (get_global_id(0) + 1, get_global_id(1) + 1);
    int2 outImageCoord = (int2) (get_global_id(0), get_global_id(1));

    if (outImageCoord.x < width && outImageCoord.y < height)
    {
        int weight = 0;
        float4 outColor = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        for( int y = startImageCoord.y; y <= endImageCoord.y; y++)
        {
            for( int x = startImageCoord.x; x <= endImageCoord.x; x++)
            {
                outColor += (read_imagef(srcImg, sampler, (int2)(x, y)) * (kernelWeights[weight] / 16.0f));
                weight += 1;
            }
        }

        // Write the output value to image
        write_imagef(dstImg, outImageCoord, outColor);
    }
}
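
To check the kernel's output on the host, it can help to have a CPU reference for the same 3 x 3 Gaussian filter. The sketch below is an assumption-laden illustration, not part of the original example: GaussianFilterReference is a hypothetical helper, it processes a single float channel (read_imagef returns normalized floats for CL_UNORM_INT8 images), and it emulates CL_ADDRESS_CLAMP_TO_EDGE by clamping the sample coordinates to the image bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the 3x3 Gaussian filter on one float channel.
// Out-of-range coordinates are clamped to the nearest edge pixel, which
// mirrors what a CL_ADDRESS_CLAMP_TO_EDGE sampler does in the kernel.
// (Hypothetical helper, for illustration only.)
std::vector<float> GaussianFilterReference(const std::vector<float>& src,
                                           int width, int height)
{
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    // Same weights as the kernel; they sum to 16.
    static const float kWeights[9] = { 1.0f, 2.0f, 1.0f,
                                       2.0f, 4.0f, 2.0f,
                                       1.0f, 2.0f, 1.0f };

    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            int weight = 0;
            for (int dy = -1; dy <= 1; ++dy)
            {
                for (int dx = -1; dx <= 1; ++dx)
                {
                    // Clamp the sample position to the image edge
                    int sx = std::min(std::max(x + dx, 0), width - 1);
                    int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += src[sy * width + sx] * (kWeights[weight] / 16.0f);
                    ++weight;
                }
            }
            dst[y * width + x] = sum;
        }
    }
    return dst;
}
```

Because the weights are normalized by 16, a constant image passes through unchanged, and a single bright pixel is spread over its 3 x 3 neighborhood with the center keeping 4/16 of its value.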

For how to set up FreeImage.h, FreeImaged.dll, and the related files, see the reference below.

References

https://blog.csdn.net/qq_36314864/article/details/132041933?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22132041933%22%2C%22source%22%3A%22qq_36314864%22%7D