CUDA数组（CUDA Array）

参考:
https://blog.csdn.net/hhko12322/article/details/12004329
http://blog.sciencenet.cn/blog-398465-342089.html

CUDA数组（CUDA Array）

引言

CUDA Array is used for the Texture memory.
本篇谈一下不同维数的CUDA数组的申请，赋值，复制和释放。

CUDA array 在 cuda 中是一个特殊的类型，叫做 cudaArray，在 CUDA 中，他是专门给 texture 用的一种数组；通过cudaMallocArray()、cudaFreeArray()、cudaMemcpyToArray()等函数对其进行管理。此外，由于 cudaArray 本身不是template 的型別，所以在通过cudaMallocArray()来申请CUDA Array时，要通过cudaChannelFormatDesc这个特别的方式，来定义CUDA Array的类型。

申请CUDA Array

如：申请一个一维CUDA Array,类型是float 大小是width×height

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaArray* cuArray;
cudaMallocArray(&cuArray, &channelDesc, width, height);

cudaChannelFormatDesc 是一个用来描述 fetch 一个 texture 時，返回值的类型；
定义如下：

template<class T> struct cudaChannelFormatDesc cudaCreateChannelDesc< T>();

复制CUDA Array

而CUDA Array 需要主机端的线性内存赋值（copy）：

cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray,
                              size_t dstX, size_t dstY,
                              const void* src, size_t count,
                              enum cudaMemcpyKind kind);

cudaMemcpyKind是用来指定赋值的方向，有cudaMemcpyHostToHost、cudaMemcpyHostToDevice、cudaMemcpyDeviceToHost、cudaMemcpyDeviceToDevice四种值。

绑定CUDA Array

CUDA Array是要被Texture使用的，需要绑定到Texture cudaBindTextureToArray()

template<class T, int dim, enum cudaTextureReadMode readMode>
cudaError_t cudaBindTextureToArray(
               const struct texture<T, dim, readMode>& texRef,
               const struct cudaArray* cuArray);

解绑定：cudaUnbindTexture()

取值

而在存取上，和 linear memory 的 texture 時的 tex1Dfetch() 不同，是要使用 tex1D()、tex2D() 这两种，分別是用在 1D 和 2D 的 texture。其形式分別为：

template<class Type, enum cudaTextureReadMode readMode>
Type tex1D(texture<Type, 1, readMode> texRef, float x);

template<class Type, enum cudaTextureReadMode readMode>
Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);

三维CUDA Array

上例描述了CUDA Array使用的全过程，其中CUDA Array是2维的，下面介绍3维的CUDA Array，为什么不介绍一维的CUDA Array？因为Texture本身对局部内存访问有优化，并且纹理内存一般都是基于二维的（图片都是二维的）。

在当前的CUDA版本中，3D的线性内存是无法直接绑定到texture memory，一维的可以，因此，需要将数据首先放进一个3D的CUDA array，然后将3D CUDA array绑定到texture memory上，访问数组元素时，通过取纹理的函数tex3D(tex,x,y,z)可以返回坐标为（x,y,z）的元素。

创建CUDA 3D array
在之前的CUDA版本中，extent.width与height，depth不同，其计数单位为bytes，所以在旧版本中必须使用array_width*sizeof(float)，最新的3.1竟然悄悄的修改了。可以CUDA的文档一直是错误的，文档中记载width,height,depth均是in bytes，实际上赋值时使用元素个数即可。如果不直接赋值，还可以调用函数make_cudaExtent(extent,width,height,depth), 原理类似。

cudaArray *d_u
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc< float>();
cudaExtent extent;
extent.width=array_width;
extent.height=array_height;
extent.depth=array_depth;
cudaMalloc3DArray(&d_u,&channelDesc,extent);
复制数据至3D array
首先解释一下pitched pointer的工具原理，如果访问数组元素u[x][y][z]，通过pitched pointer访问则是u_p[x+y*pitch+ z*pitch*height ]。显然，这里pitch=width，因此当创建pitched pointer时我们需要将width和height作为参数传递给函数make_cudaPitchedPtr()。在这里尤其要注意的是，pitched pointer指向的array与传统的C语言数组的存储方式不同，C语言访问元素u[x][y][z]是通过u[x*height*depth+y*depth+z]。因此为了正确读取所需元素，我建议逆序建立pitched pointer：
copyParams.srcPtr = make_cudaPitchedPtr((void*)u, array_depth*sizeof(float), array_depth, array_height);
此时相当于数组u[x][y][z]被转置，在CUDA3D array中对应元素为u[z][y][x]，CUDA文档与指南中并未提及这一点区别，这个问题当时也困扰我很久，费尽周折才搞清楚，希望以后的SDK sample能覆盖这个注意点。

cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr = make_cudaPitchedPtr((void*)u, array_width*sizeof(float), array_width, array_height);
copyParams.dstArray = d_u;
copyParams.extent = extent;
copyParams.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);
绑定3D array至texture memory

normalized 设置是否对纹理坐标是否进行归一化。如果normalized是一个非零值，那么就会使用归一化到[0，1)的坐标进行寻址，否则对尺寸为width, height, depth的纹理使用坐标[0,width-1], [0,height-1], [0,depth-1]寻址。例如，一个尺寸为64×32的纹理可以通过x维度范围为[0，63]，y维度范围[0,31]的坐标寻址。如果采用归一化方式对尺寸为64×32的纹理进行寻址，在x和y维度上的坐标就都是[0.0,1.0)。这样就可以保证纹理的坐标与纹理的尺寸无关。
filterMode用于设置纹理的滤波模式，即如何根据坐标计算返回的纹理值。滤波模式可以是cudaFilterModePoint或者cudaFilterModeLinear。滤波模式为CudaFilterModePoint时，返回值是与坐标最接近的像元的值。CudaFilterModeLinear模式只能对返回值为浮点型的纹理使用，启用这一种模式时将拾取纹理坐标周围的像元，然后根据坐标与这些像元之间的距离进行插值计算。对一维纹理可以使用线性滤波，对二维纹理可以使用双线性滤波。返回值会是对最接近纹理坐标的两个像元（对一维纹理），四个像元（对二维纹理）或者八个像元（对三维纹理）进行插值后得到的值。

texture< float,3,cudaReadModeElementType> tex_u;
tex_u.filterMode = cudaFilterModePoint;
tex_u.normalized = false;
tex_u.channelDesc = channelDesc;
if (cudaBindTextureToArray(tex_u, d_u, channelDesc) != (unsigned int) CUDA_SUCCESS) {
printf(“[ERROR] Could not bind texture un”);
return;
}

接下来会介绍，三通道的RGB影像怎么处理，以及Texture Layer怎么使用。