[CUDA beginner notes] GPU memory model (1)

GPU memory structure model

1. The CPU (host) can read and write data stored in the GPU's Global Memory, Constant Memory and Texture Memory; host code can transfer data to the device and read data back from it;

2. GPU threads use Registers, Shared Memory, Local Memory, Global Memory, Constant Memory and Texture Memory; each memory type has a different scope, tied to threads, blocks and grids;

Threads can read and write Registers, Shared Memory, Local Memory and Global Memory, but can only read Constant Memory and Texture Memory;

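Register

Registers are high-speed on-chip storage on the GPU; the execution units can access them with very low latency.
Registers are organized into a register file, and each register is 32 bits wide. Register variables are private to a thread and become invalid as soon as the thread finishes. Registers are allocated per thread, and a thread can only access the registers allocated to it;
If the registers are used up, data is stored in local memory instead. When a thread uses too many registers, declares large structs or arrays, or indexes an array in a way the compiler cannot resolve at compile time, the thread's private data may be placed in local memory. A thread's inputs and intermediate variables are therefore held either in registers or in local memory.
Registers are the fastest memory on the GPU. Automatic variables declared in a kernel without any special qualifier are normally placed in registers, and they are private to the thread. Arrays of built-in types whose indices are constants that can be resolved at compile time can also be kept in registers.
Registers are a scarce resource. On Fermi each thread can use at most 63 registers; on Kepler the limit is 255 (on devices of compute capability 3.5; compute capability 3.0 devices remain at 63). A kernel that uses fewer registers lets more blocks reside on an SM, which raises occupancy and often improves performance.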
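To make the register versus local memory distinction concrete, here is a minimal sketch. The kernel regDemo and the helper regDemoRegisters are hypothetical names introduced for illustration; cudaFuncGetAttributes is the runtime call that reports what the compiler actually chose. Whether the array ends up in local memory depends on the compiler and target architecture, so treat the comments as expectations rather than guarantees.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: the scalar is an ordinary automatic variable and is
// normally kept in a register; the array is indexed with a value unknown at
// compile time, so the compiler may place it in local memory.
__global__ void regDemo(const int *idx, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // automatic scalar
    float scratch[64];                                // thread-private array
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    if (tid < n)
        out[tid] = scratch[idx[tid] & 63];            // dynamic index
}

// Returns the number of registers per thread the compiler assigned to
// regDemo, or -1 on error. If localBytes is non-NULL it receives the
// per-thread local memory size in bytes.
int regDemoRegisters(size_t *localBytes)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, regDemo) != cudaSuccess)
        return -1;
    if (localBytes)
        *localBytes = attr.localSizeBytes;
    return attr.numRegs;
}
```

Compiling with nvcc -Xptxas -v prints the same register and local memory figures at build time, which is a convenient way to check whether a kernel spills.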

Shared Memory

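Shared memory, like the registers, is on-chip memory; variables stored on chip can be accessed at high speed and in a highly parallel fashion. Shared memory is allocated to thread blocks: every thread in a block can access the variables in that block's shared memory, because their storage has been allocated to the block;
Shared memory is an efficient way for threads to cooperate by sharing input data and intermediate results; it is typically used to hold the part of global memory that is needed frequently during the execution of a kernel function;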
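The cooperation pattern described above can be sketched as a per-block sum: each block stages its slice of global memory in shared memory, then its threads combine the values. The names blockSum, sumOnDevice and BLOCK are assumptions made for this example, the block size is assumed to be a power of two, and error checking is omitted to keep the sketch short.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed threads per block (power of two)

// Each block loads BLOCK elements into shared memory and reduces them.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];           // one copy per block
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    tile[tid] = (gid < n) ? in[gid] : 0.0f; // stage global data on chip
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];     // threads share intermediate sums
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = tile[0];          // one partial sum per block
}

// Hypothetical host helper: sums n floats, finishing the partial sums on the CPU.
float sumOnDevice(const float *h_in, int n)
{
    int blocks = (n + BLOCK - 1) / BLOCK;
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, BLOCK>>>(d_in, d_out, n);
    float *h_out = (float *)malloc(blocks * sizeof(float));
    cudaMemcpy(h_out, d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float total = 0.0f;
    for (int b = 0; b < blocks; ++b)
        total += h_out[b];
    free(h_out);
    cudaFree(d_in);
    cudaFree(d_out);
    return total;
}
```

The __syncthreads() calls matter here: without them a thread could read a shared value before the thread that produces it has written it.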

Local Memory

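Local memory physically resides in device memory (video memory), that is, in the same place as global memory;
when the registers available to a thread are exhausted, the data is stored there instead;
because data in local memory is kept in device memory rather than in on-chip registers,
accesses to local memory are much slower than register accesses.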

Global Memory

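Global memory is implemented with dynamic random access memory (DRAM);
this DRAM is what is usually called video memory, and it is a storage space separate from host memory that belongs to the device;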

Compute units on the GPU may see long latencies (hundreds of clock cycles) and limited bandwidth when accessing global memory; the path to global memory is often congested, so only some threads (not all) can proceed with their accesses at a time, which can leave some streaming multiprocessors (SMs) idle;

Constant Memory

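Constant memory stores read-only data. Constant variables physically reside in device memory alongside global memory,
but they are cached, which makes access more efficient; constant memory is used for read-only parameters that are accessed frequently;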

Texture Memory

Texture memory; as noted earlier, device threads can only read it.

Scope and lifetime of variables in device memory

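CUDA variables live in different memories, so they have different scopes and lifetimes;
Scope identifies the set of threads that can access a variable: a single thread, all threads in a block, or all threads in a grid;
1) With single-thread scope, each thread gets its own private copy of the variable, held in a register, and can access only that copy;
2) With block scope, one shared variable is created per thread block and is shared by the threads of that block;
3) With grid scope, the variable is stored in global memory or constant memory and is shared by all threads launched by the kernel. Note that variables in constant memory are shared by all threads in the grid, and constant variables must be declared outside any function body;
Lifetime specifies the part of the program's execution during which a variable is available: either during a kernel invocation or for the whole run of the application.
1) Variables in registers and local memory live for the duration of their thread; once the thread finishes, their contents no longer exist;
2) Variables in shared memory are declared inside the kernel function and live for the duration of the kernel's execution (more precisely, of the thread block that owns them); once it finishes, their contents no longer exist;
3) Variables in constant memory live for the whole run of the application;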
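The scope rules above can be seen side by side in one small sketch. All names here (scale, counter, scopeDemo, runScopeDemo) are hypothetical, the launch shape of 4 blocks of 8 threads is an arbitrary choice, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

__constant__ int scale;    // grid scope, read-only in kernels, declared outside any function
__device__   int counter;  // global memory, grid scope

__global__ void scopeDemo(int *out)
{
    int mine = threadIdx.x * scale;  // thread scope: private copy per thread
    __shared__ int blockTotal;       // block scope: one copy per block
    if (threadIdx.x == 0)
        blockTotal = 0;
    __syncthreads();
    atomicAdd(&blockTotal, mine);    // all threads of the block update the same variable
    __syncthreads();
    if (threadIdx.x == 0) {
        out[blockIdx.x] = blockTotal;
        atomicAdd(&counter, 1);      // visible to every thread in the grid
    }
}

// Hypothetical host helper: h_out must hold 4 ints. Returns the value of
// counter after the launch, i.e. the number of blocks that ran.
int runScopeDemo(int *h_out)
{
    const int blocks = 4, threads = 8;
    int two = 2, zero = 0, launched = -1;
    int *d_out = NULL;
    cudaMemcpyToSymbol(scale, &two, sizeof(int));
    cudaMemcpyToSymbol(counter, &zero, sizeof(int));
    cudaMalloc((void **)&d_out, blocks * sizeof(int));
    scopeDemo<<<blocks, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&launched, counter, sizeof(int));
    cudaFree(d_out);
    return launched;
}
```

With scale set to 2 and 8 threads per block, each block's total is 2 * (0 + 1 + ... + 7) = 56, and counter ends at 4 because one thread per block increments it.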

2. Common device memory APIs

2.1 Working with global memory
2.1.1 Allocate device memory
cudaError_t cudaMalloc(void **devPtr, size_t size);
Allocates size bytes of device memory and stores the address of the allocation in the pointer that devPtr points to; after cudaMalloc succeeds, *devPtr holds the address of the allocated device memory;

For example, allocate device memory for 32 floats:

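float *d_a = NULL;
size_t nBytes = 32 * sizeof(float);  // sizes are size_t, matching the cudaMalloc signature
cudaError_t err = cudaMalloc((void **)&d_a, nBytes);  // use d_a only if err == cudaSuccess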

2.1.2 Release device memory
Memory allocated with cudaMalloc is released with cudaFree;

cudaError_t cudaFree(void *devPtr);
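Both calls return a cudaError_t, so it is worth checking it. A minimal sketch of a checked allocate/free pair follows; allocFloats and freeFloats are hypothetical helper names, not part of the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocates room for count floats on the device.
// Returns NULL and prints the runtime's error string on failure.
float *allocFloats(size_t count)
{
    float *d_ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&d_ptr, count * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return d_ptr;
}

// Hypothetical helper: releases memory obtained from allocFloats.
// Passing NULL is allowed; cudaFree performs no operation in that case.
bool freeFloats(float *d_ptr)
{
    return cudaFree(d_ptr) == cudaSuccess;
}
```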

2.1.3 Copying data between host and device
cudaMemcpy copies data between the host and the device;

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind);
Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst; kind determines the direction of the copy:

cudaMemcpyHostToHost: copy from host memory to host memory;
cudaMemcpyHostToDevice: copy from host memory to device memory;
cudaMemcpyDeviceToHost: copy from device memory to host memory;
cudaMemcpyDeviceToDevice: copy from device memory to device memory;
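As an illustration of the copy directions, the following sketch sends a buffer host to device, device to device, and back to the host. roundTrip is a hypothetical helper written for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: copies count floats host -> device -> device -> host.
// Returns true only if every CUDA call succeeds.
bool roundTrip(const float *h_src, float *h_dst, size_t count)
{
    size_t nBytes = count * sizeof(float);
    float *d_a = NULL, *d_b = NULL;
    bool ok = cudaMalloc((void **)&d_a, nBytes) == cudaSuccess
           && cudaMalloc((void **)&d_b, nBytes) == cudaSuccess
           && cudaMemcpy(d_a, h_src, nBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_b, d_a, nBytes, cudaMemcpyDeviceToDevice) == cudaSuccess
           && cudaMemcpy(h_dst, d_b, nBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_a);  // safe even if an allocation failed and the pointer is NULL
    cudaFree(d_b);
    return ok;
}
```

Note that count in cudaMemcpy is a byte count, so the element count is multiplied by sizeof(float) before the call.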

2.1.4 Initialize a memory block
cudaMemset initializes device memory to a given value;

cudaError_t cudaMemset(void *devPtr, int value, size_t count);
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value;
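Because cudaMemset works byte by byte, the value is written into every byte, not into every element. The sketch below (memsetInts is a hypothetical helper) fills an int buffer and copies it back so the effect can be inspected on the host.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: sets every byte of a device buffer of count ints to
// `byte`, then copies the buffer to h_out. Returns true on success.
bool memsetInts(int *h_out, size_t count, int byte)
{
    size_t nBytes = count * sizeof(int);
    int *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, nBytes) != cudaSuccess)
        return false;
    bool ok = cudaMemset(d_buf, byte, nBytes) == cudaSuccess
           && cudaMemcpy(h_out, d_buf, nBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_buf);
    return ok;
}
```

With byte = 0 every int reads back as 0, but with byte = 1 each 4-byte int reads back as 0x01010101 rather than 1, which is a common source of confusion.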

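2.2 Working with constant memory
2.2.1 Copy from the host to constant memory
cudaMemcpyToSymbol copies data from host memory to a symbol on the GPU;

template<class T>
cudaError_t cudaMemcpyToSymbol(const T& symbol, const void* src, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyHostToDevice);
Copies count bytes from src to the location offset bytes from the start of symbol on the device. symbol is a variable that resides in global memory or constant memory space. Older CUDA releases also accepted a character string naming such a variable; that form was removed in CUDA 5.0. kind is either cudaMemcpyHostToDevice or cudaMemcpyDeviceToDevice.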
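A typical use is uploading a small set of read-only parameters that every thread reads. The sketch below stores polynomial coefficients in constant memory; coef, NCOEF, evalPoly and polyOnDevice are hypothetical names chosen for this example, and the launch configuration of 128 threads per block is an arbitrary assumption.

```cuda
#include <cuda_runtime.h>

#define NCOEF 3
__constant__ float coef[NCOEF];  // coefficients c0, c1, c2, declared at file scope

// Evaluates c0 + c1*x + c2*x^2 for each input using Horner's rule.
__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int k = NCOEF - 1; k >= 0; --k)
            acc = acc * x[i] + coef[k];  // every thread reads the same constants
        y[i] = acc;
    }
}

// Hypothetical host helper: uploads the coefficients, runs the kernel,
// and copies the results back. Returns true on success.
bool polyOnDevice(const float *h_coef, const float *h_x, float *h_y, int n)
{
    size_t nBytes = n * sizeof(float);
    float *d_x = NULL, *d_y = NULL;
    bool ok = cudaMemcpyToSymbol(coef, h_coef, NCOEF * sizeof(float)) == cudaSuccess
           && cudaMalloc((void **)&d_x, nBytes) == cudaSuccess
           && cudaMalloc((void **)&d_y, nBytes) == cudaSuccess
           && cudaMemcpy(d_x, h_x, nBytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        evalPoly<<<(n + 127) / 128, 128>>>(d_x, d_y, n);
        ok = cudaMemcpy(h_y, d_y, nBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

The symbol is passed by name, not by address, and the offset and kind arguments are left at their defaults.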

2.2.2 Copy from constant memory to the host
cudaMemcpyFromSymbol copies data from a symbol on the device to the host;

template<class T>
cudaError_t cudaMemcpyFromSymbol(void *dst, const T& symbol, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyDeviceToHost);

Copies count bytes, starting offset bytes from the start of symbol on the device, to the destination memory dst; kind determines the direction of the copy and is either cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice;
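The offset parameter is easiest to understand with a round trip. In this sketch, table and patchAndReadBack are hypothetical names: the host fills a constant array, overwrites its second half using a byte offset, and reads the whole array back.

```cuda
#include <cuda_runtime.h>

__constant__ int table[8];  // hypothetical constant array at file scope

// Writes 0..7 into table, patches elements 4..7, then copies all 8 ints
// back into h_out. Returns true if every call succeeds.
bool patchAndReadBack(int *h_out)
{
    const int base[8]  = {0, 1, 2, 3, 4, 5, 6, 7};
    const int patch[4] = {40, 50, 60, 70};
    return cudaMemcpyToSymbol(table, base, 8 * sizeof(int)) == cudaSuccess
        // offset is in bytes: skip the first 4 ints of the symbol
        && cudaMemcpyToSymbol(table, patch, 4 * sizeof(int),
                              4 * sizeof(int), cudaMemcpyHostToDevice) == cudaSuccess
        && cudaMemcpyFromSymbol(h_out, table, 8 * sizeof(int),
                                0, cudaMemcpyDeviceToHost) == cudaSuccess;
}
```

Like count, offset is measured in bytes, so element indices have to be scaled by sizeof(int).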