CUDA C Programming Guide 在线教程学习笔记 Part 10【坑】

▶ 动态并行。

● 动态并行直接从 GPU 上创建工作，可以减少主机和设备间数据传输，在设备线程中调整配置。有数据依赖的并行工作可以在内核运行时生成，并利用 GPU 的硬件调度和负载均衡。动态并行要求算法和程序要提前改进，消除递归、不规则的循环、结构或其他不适合并行的情况。

● 动态并行的经典图

● 主机中 Runtime API 提供了跟踪运行核、流与事件的函数，对主机进程中的所有线程来说 CUDA 对象都是可共享的，但是主机调用的各核函数之间是相互独立的，CUDA 对象不能共享（重叠读写）。同样的情况也存在于设备中创建子内核函数的时候。

● 父线程中的 Runtime API 操作对于该线程所在的线程块是可见的。意思就是在同一个线程块中可以由任意一个线程来调用子内核、调整流和事件，其作用等价。

● 动态并行隐式的完成了父线程和子线程之间的同步，要求所有的子线程都结束后父线程才能结束。若 “父线程所在的线程块中所有的线程” 都在 “子线程结束前” 全部结束了，则子线程隐式的强制结束。

● 父线程格与子线程格共享全局内存和常量内存，但是共享内存和局部内存私有。

● 父线程与子线程在两个时间节点上共享的全局内存具有一致性：父线程格调用子线程格的时候；子线程格计算完成后，在父线程格中调用同步函数的时候。意思就是，父线程格先对全局内存进行操作，然后调用子线程格，则这些操作对子线程格来说都是可见的；子线程格对全局内存进行操作，然后在父线程格中进行同步，则这些操作对父线程格来说都是可见的。

● 动态并行与父 - 子内核之间的全局内存同步的代码举例。

 1 // 子内核
 2 __global__ void child_launch(int *data) 
 3 {
 4     data[threadIdx.x] = data[threadIdx.x] + 1;
 5 }
 6 
 7 // 父内核
 8 __global__ void parent_launch(int *data)
 9 {
10     data[threadIdx.x] = threadIdx.x;
11     
12     __syncthreads();     // 同步所有父线程对全局内存的读写
13 
14     if (threadIdx.x == 0)// 使用一个线程来启动子内核
15     {
16         child_launch << < 1, 256 >> >(data);
17         // 调用子内核时，隐式保证了父线程对全局内存读写（data[0] = 0）对子内核可见，
18         // 但不能保证父内核中其他线程的全局内存读写可见（因为调用子线程时父内核中其他线程的全局内存读写不一定都完成了）
19         // 这里多亏调用子内核之前使用了 __syncthreads();，保证父内核中所有线程的全局内存读写在调用子内核之前都已经完成，保证了子内核可见
20         cudaDeviceSynchronize();// 退出子内核时使用同步，保证父线程对子内核的全局内存读写可见
21     }
22     __syncthreads();// 父内核同步，保证父内核中所有线程对子内核的全局内存读写都可见            
23 }
24 
25 void host_launch(int *data)
26 {
27     parent_launch << < 1, 256 >> >(data);
28 }

● 零拷贝内存与全局内存具有相同的一致性，且不能在设备代码中申请或释放。

● 常量内存不能被设备修改，保证了设备之间高度的一致性。所有的常量内存都应该调用核函数之前由主机读写完成，调用核函数时常量内存就被自动继承。主机与设备之间、设备与设备之间常量内存的指针可以平凡传递。

● 共享内存被线程块私有，局部内存被线程私有，两者均不能在父内核与子内核之间共享。把共享内存指针或局部内存指针传递给子内核的时候编译器会发出警告；可以使用函数 __isGlobal() 来检测一个指针是否指向全局内存，防止将一个共享内存指针或局部内存指针传递给子内核；

 1 // sm_20_intrinsics.h
 2 // "ptr" 指向全局内空间则返回 1；指向共享、局部、常量内存空间则返回 0
 3 __SM_20_INTRINSICS_DECL__ unsigned int __isGlobal(const void *ptr)
 4 {
 5     unsigned int ret;
 6     asm volatile ("{ \n\t"
 7         "    .reg .pred p; \n\t"
 8         "    isspacep.global p, %1; \n\t"
 9         "    selp.u32 %0, 1, 0, p;  \n\t"
10 #if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__) || defined(__CUDACC_RTC__)
11         "} \n\t" : "=r"(ret) : "l"(ptr));
12 #else
13         "} \n\t" : "=r"(ret) : "r"(ptr));
14 #endif
15 
16     return ret;
17 }

● 需要在父内核与子内核之间传递的内存可以在全局作用域里显式的声明（__device__ int array[256];），防止在父内核中将局部内存地址传入子内核。

● 纹理内存存与全局内存具有相同的一致性，注意使用同步函数来获得父内核与子内核之间的一致。

● 父内核启用子内核的过程是异步的，这与主机启用内核的情况相同。

● 子内核集成父内核的配置参数，即函数 cudaDeviceGetCacheConfig() 和 cudaDeviceGetLimit() 调整的缓存、共享内存及其他参数。从主机中调用内核时使用的设置优先于全局默认设置，且不能在设备中调用子内核的时候更改这些设置。

● 流可以在指定的线程块中的任意线程使用，但流句柄不能在线程块之间、父子内核之间传递。不同的流中启动的内核可以并行运行，但不保证并发，这点在主机和设备之间、设备父子内核之间局成立。

● 主机端 NULL 流的跨流屏障语义对设备穿件的流不适用。设备内的流不能使用函数 cudaStreamCreate() 来创建，而是要使用函数 cudaStreamCreateWithFlags() 来创建，并传入标志 cudaStreamNonBlocking。

● 设备流中的同步不能使用函数 cudaStreamSynchronize() 或 cudaStreamQuery()，而要使用函数cudaDeviceSynchronize()。

● 主机端的 NULL 流隐式地添加了标志 cudaStreamNonBlocking， NULL 流中启动的的内核不会依赖于其他流中挂起的工作（work launched into the NULL stream will not insert an implicit dependency on pending work in any other streams）。

● 自内核仅支持 CUDA 事件的流内同步功能，意思就是函数 cudaStreamWaitEvent()是可用的，但函数 cudaEventSynchronize()，cudaEventElapsedTime()，cudaEventQuery() 不可用。而且在子线程块中创建事件时，还需要向函数 cudaEventCreateWithFlags() 传入标志 cudaEventDisableTiming。
● 事件在创建它的线程块中的所有线程之间共享，不能传递给另一个内核或线程块。事件句柄不能保证在块之间唯一，所以使用事件句柄前要先创建。

● (?) It is up to the program to perform sufficient additional inter-thread synchronization, for example via a call to __syncthreads(), if the calling thread is intended to synchronize with child grids invoked from other threads.

● (?) The cudaDeviceSynchronize() function does not imply intra-block synchronization. In particular, without explicit synchronization via a __syncthreads() directive the calling thread can make no assumptions about what work has been launched by any thread other than itself. For example if multiple threads within a block are each launching work and synchronization is desired for all this work at once (perhaps because of event-based dependencies), it is up to the program to guarantee that this work is submitted by all threads before calling cudaDeviceSynchronize().

● (?) Because the implementation is permitted to synchronize on launches from any thread in the block, it is quite possible that simultaneous calls to cudaDeviceSynchronize() by multiple threads will drain all work in the first call and then have no effect for the later calls.

● 一个内核只能在一个设备上启用，父内核中不能使用函数 cudaSetDevice() 或 cudaGetDevicePropertites()，但可以使用函数 cudaDeviceGetAttribute() 来访问其他设备的属性。

● 在文件作用域内声明的 __device__ 和 __constant__ 变量可以被所有内核进行读写或读。

● 只能在主机端创建或销毁纹理或表面对象，设备中不能。只有顶层内核（直接被主机调用的内核）中才能使用纹理和表面（The device runtime does not support legacy module-scope textures and surfaces within a kernel launched from the device）。

● 主机与内核之间、设备中父子内核之间均可使用静态或动态的方法调用共享内存，但是数据传递需要借助全局内存来实现。全局内存地址可以直接用 & 算符来获取。

● 内核中常量内存可以直接引用，且不能更改其内容，所以内核中不支持函数 cudaMemcpyToSymbol() 或 cudaGetSymbolAddress()。

● 内核中也可以使用函数 cudaGetLastError() 来捕获调用内核的错误，注意同时启用多个内核的时候可能会发生多个错误，但是该函数返回值中只保存了最后一个。

● 调用内核的 <<< >>>　算符实际上在 PTX 中被解释为函数 cudaGetParameterBuffer() 和函数 cudaLaunchDevice()，（在 cuda_cevice_runtime_api.h中的定义稍有不同）。

1 extern __device__ cudaError_t cudaGetParameterBuffer(void **params);
2 extern __device__ cudaError_t cudaLaunchDevice(void *kernel, void *params, dim3 gridDim, dim3 blockDim, unsigned int sharedMemSize = 0, cudaStream_t stream = 0);

● 给出了所有能在设备代码中使用的 Runtime API 函数。

● PTX 阶段内核调用的低层次实现，详细说明函数 cudaGetParameterBuffer() 和函数 cudaLaunchDevice() 的细节。

■ PTX 阶段 cudaLaunchDevice() 的两种实现。

 1 // .address_size == 64
 2 .extern.func(.param.b32 func_retval0) cudaLaunchDevice
 3 (
 4     .param.b64 func,
 5     .param.b64 parameterBuffer,
 6     .param.align 4.b8 gridDimension[12],
 7     .param.align 4.b8 blockDimension[12],
 8     .param.b32 sharedMemSize,
 9     .param.b64 stream
10 );
11 
12 // .address_size == 32
13 .extern.func(.param.b32 func_retval0) cudaLaunchDevice
14 (
15     .param.b32 func,
16     .param.b32 parameterBuffer,
17     .param.align 4.b8 gridDimension[12],
18     .param.align 4.b8 blockDimension[12],
19     .param.b32 sharedMemSize,
20     .param.b32 stream
21 );

■ PTX 阶段 cudaGetParameterBuffer() 的两种实现。

 1 // .address_size == 64
 2 .extern.func(.param.b64 func_retval0) cudaGetParameterBuffer
 3 (
 4     .param.b64 alignment,
 5     .param.b64 size
 6 );
 7 
 8 // .address_size == 32
 9 .extern.func(.param.b32 func_retval0) cudaGetParameterBuffer
10 (
11     .param.b32 alignment,
12     .param.b32 size
13 );

■ 使用的两个函数的声明。动态并行中必须要有头文件 cuda_device_runtime_api.h，不过其已经在 Runtime 环境中被包含了（cuda_runtime.h）。

 1 extern "C" __device__ cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer, dim3 gridDimension, dim3 blockDimension, unsigned int sharedMemSize, cudaStream_t stream);
 2 extern "C" __device__ void *cudaGetParameterBuffer(size_t alignment, size_t size);
 3 
 4 // cuda_device_runtime_api.h
 5 static __inline__ __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaLaunchDevice(void *func, void *parameterBuffer, dim3 gridDimension, dim3 blockDimension, unsigned int sharedMemSize, cudaStream_t stream)
 6 {
 7     return cudaLaunchDevice_ptsz(func, parameterBuffer, gridDimension, blockDimension, sharedMemSize, stream);
 8 }
 9 static __inline__ __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaLaunchDeviceV2(void *parameterBuffer, cudaStream_t stream)
10 {
11     return cudaLaunchDeviceV2_ptsz(parameterBuffer, stream);
12 }
13 extern __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaLaunchDevice(void *func, void *parameterBuffer, dim3 gridDimension, dim3 blockDimension, unsigned int sharedMemSize, cudaStream_t stream);
14 extern __device__ __cudart_builtin__ cudaError_t CUDARTAPI cudaLaunchDeviceV2(void *parameterBuffer, cudaStream_t stream);
15 
16 extern __device__ __cudart_builtin__ void * CUDARTAPI cudaGetParameterBuffer(size_t alignment, size_t size);

■ 函数 cudaGetParameterBuffer() 的第一个参数是数据缓冲区的对齐值，默认为 64 Byte，以保证各种类型的数据都能容纳。(?) Parameter reordering in the parameter buffer is prohibited, and each individual parameter placed in the parameter buffer is required to be aligned. That is, each parameter must be placed at the nth byte in the parameter buffer, where n is the smallest multiple of the parameter size that is greater than the offset of the last byte taken by the preceding parameter. The maximum size of the parameter buffer is 4KB.

● 编译使用动态并行的程序需要显式链接库文件（Windows：cudadevrt.lib，Linux MacOS：libcudadevrt.a）。

1 # 直接编译和连接
2 $ nvcc hello_world.cu -o hello.exe -arch=sm_35 -rdc=true -lcudadevrt
3 
4 # 先编译后连接
5 $ nvcc hello_world.cu -o hello_world.o -arch=sm_35 -dc 
6 $ nvcc hello_world.o -o hello.exe -arch=sm_35 -rdc=true -lcudadevrt

CUDA C Programming Guide 在线教程学习笔记 Part 10【坑】

猜你喜欢