GPU memory architecture -- global memory, local memory, register file, shared memory, constant memory, texture memory

[Table: characteristics of the GPU memory types, including their scope and lifetime]
The table above summarizes the characteristics of the different memory types. The scope column defines which part of the program can use a given memory, and the lifetime column defines how long the data in that memory remains visible to the program. In addition, GPU programs also benefit from the L1 and L2 caches for faster memory access.
In short, every thread has its own registers, which are the fastest; shared memory can only be accessed by the threads of one block, but it is much faster than global memory; global memory is the slowest, but it can be accessed by all blocks.
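As a rough illustration of how these memory spaces are used from code, the short sketch below touches each of them. The variable and kernel names are made up for this example, and the exact register placement is ultimately up to the compiler:

#include <stdio.h>

__device__   float dev_global = 1.0f;   // statically allocated global memory, visible to all kernels
__constant__ float dev_const  = 2.0f;   // constant memory, read-only inside kernels

__global__ void memory_spaces_demo(float *buf)   // buf points to global memory from cudaMalloc
{
	float reg_var = buf[threadIdx.x];    // plain scalar: normally kept in a register
	__shared__ float tile[64];           // shared memory, one copy per block (block size <= 64 assumed)

	tile[threadIdx.x] = reg_var + dev_global + dev_const;
	__syncthreads();
	buf[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
	float h_buf[64] = { 0 };
	float *d_buf;
	cudaMalloc((void **)&d_buf, sizeof(h_buf));
	cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);
	memory_spaces_demo<<<1, 64>>>(d_buf);
	cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
	printf("buf[0] after kernel: %f\n", h_buf[0]);   // expected 3.0 = 0 + 1 + 2
	cudaFree(d_buf);
	return 0;
}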

Global memory

All blocks can read and write global memory. This memory is slower, but it can be read and written from anywhere in the code, and a cache speeds up access to it. All memory allocated via cudaMalloc is global memory. The following simple program demonstrates how to use global memory:

#include <stdio.h>
#define N 5

__global__ void gpu_global_memory(int *d_a)
{
	// d_a is a pointer into global memory on the device;
	// each thread writes its own index into the array
	d_a[threadIdx.x] = threadIdx.x;
}

int main()
{
	// Define host array
	int h_a[N];
	// Define device pointer
	int *d_a;

	// Allocate global memory on the device
	cudaMalloc((void **)&d_a, sizeof(int) * N);
	// Copy data from host memory to device memory
	cudaMemcpy((void *)d_a, (void *)h_a, sizeof(int) * N, cudaMemcpyHostToDevice);
	// Launch the kernel with one block of N threads
	gpu_global_memory<<<1, N>>>(d_a);
	// Copy the modified array back to host memory
	cudaMemcpy((void *)h_a, (void *)d_a, sizeof(int) * N, cudaMemcpyDeviceToHost);

	printf("Array in Global Memory is: \n");
	// Print the result on the console
	for (int i = 0; i < N; i++)
	{
		printf("At Index: %d --> %d \n", i, h_a[i]);
	}
	cudaFree(d_a);
	return 0;
}
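Besides memory obtained from cudaMalloc, a variable declared at file scope with the __device__ qualifier also lives in global memory; data can be moved between it and the host with cudaMemcpyToSymbol / cudaMemcpyFromSymbol. A minimal sketch (the name d_static and the kernel are only illustrative):

#include <stdio.h>
#define N 5

// Statically declared global memory on the device
__device__ int d_static[N];

__global__ void gpu_static_global(void)
{
	// Each thread writes its own index, just like the cudaMalloc version above
	d_static[threadIdx.x] = threadIdx.x * 10;
}

int main()
{
	int h_a[N];
	gpu_static_global<<<1, N>>>();
	// Copy from the device symbol back to host memory
	cudaMemcpyFromSymbol(h_a, d_static, sizeof(int) * N);
	for (int i = 0; i < N; i++)
	{
		printf("d_static[%d] = %d\n", i, h_a[i]);
	}
	return 0;
}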

Local memory and register file

Local memory and the register file are private to each thread. Registers are the fastest memory available to a thread. When the variables used by a kernel do not fit in the register file, they are stored in local memory; this is called register spilling (register overflow). Note that local memory is used in two cases: when there are not enough registers, and when something cannot be placed in registers at all, for example a local array that is indexed with a value not known at compile time. Local memory can essentially be thought of as a per-thread slice of global memory, so it is much slower than the register file. Although local memory accesses go through the L1 and L2 caches, register spilling can still hurt the performance of your program.
A simple program is demonstrated below:

#include <stdio.h>
#define N 5

__global__ void gpu_local_memory(int d_in)
{
	// t_local is private to each thread; here it fits in a register
	int t_local;
	t_local = d_in * threadIdx.x;
	printf("Value of Local variable in current thread is: %d \n", t_local);
}

int main()
{
	printf("Use of Local Memory on GPU:\n");
	gpu_local_memory<<<1, N>>>(5);
	// Wait for the kernel's printf output before the program exits
	cudaDeviceSynchronize();
	return 0;
}

The t_local variable in the code is private to each thread and will be stored in the register file, so computations involving it run at the highest possible speed.
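By contrast, a kernel along the following lines would typically push its array into local memory, because the array is indexed with a value that is not known at compile time. This is only a sketch to make the spilling case concrete; the actual placement is decided by the compiler, and compiling with nvcc --ptxas-options=-v reports the amount of local memory a kernel uses:

#include <stdio.h>
#define N 5

__global__ void gpu_local_memory_array(int *d_out, int n)
{
	int t_scalar = threadIdx.x;     // plain scalar: normally a register

	int t_arr[32];                  // indexed with a runtime value below, so usually spilled to local memory
	for (int i = 0; i < 32; i++)
	{
		t_arr[i] = i * t_scalar;
	}
	// The index depends on the kernel argument n, so the compiler generally
	// cannot keep the whole array in registers
	d_out[threadIdx.x] = t_arr[n % 32];
}

int main()
{
	int h_out[N];
	int *d_out;
	cudaMalloc((void **)&d_out, sizeof(int) * N);
	gpu_local_memory_array<<<1, N>>>(d_out, 3);
	cudaMemcpy(h_out, d_out, sizeof(int) * N, cudaMemcpyDeviceToHost);
	for (int i = 0; i < N; i++)
	{
		printf("Thread %d read %d from its local array\n", i, h_out[i]);
	}
	cudaFree(d_out);
	return 0;
}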

Shared memory

Shared memory is located on the chip, so it is much faster than global memory. (Memory speed in CUDA has two aspects, low latency and high bandwidth; here we specifically mean low latency.) Compared with uncached global memory accesses, shared memory latency is roughly 100 times lower. All threads of one block can access the same piece of shared memory (note: threads in different blocks see different shared memory contents), which is very useful in programs where many threads need to share their results with other threads. However, it can also cause chaos or wrong results if access is not synchronized: if the result computed by one thread is read by another thread before the write to shared memory has completed, the result will be wrong. Memory access therefore has to be controlled or managed properly. This is done with the __syncthreads() instruction, which ensures that all writes to memory complete before program execution continues; it is also known as a barrier. A barrier means that all threads of the block reach that line of code and wait there for the others; once every thread has arrived, they all continue executing together. To demonstrate the use of shared memory and thread synchronization, here is an example that computes a moving average (MA):

#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>

__global__ void gpu_shared_memory(float *d_a)
{
	// Local variables, private to each thread
	int i, index = threadIdx.x;
	float average, sum = 0.0f;

	// Define shared memory
	__shared__ float sh_arr[10];

	sh_arr[index] = d_a[index];
	__syncthreads();    // This ensures all the writes to shared memory have completed

	for (i = 0; i <= index; i++)
	{
		sum += sh_arr[i];
	}
	average = sum / (index + 1.0f);
	d_a[index] = average;
}

int main(int argc, char **argv)
{
	// Define host array
	float h_a[10];
	// Define device pointer
	float *d_a;

	for (int i = 0; i < 10; i++)
	{
		h_a[i] = i;
	}
	// Allocate global memory on the device
	cudaMalloc((void **)&d_a, sizeof(float) * 10);
	// Copy data from host memory to device memory
	cudaMemcpy((void *)d_a, (void *)h_a, sizeof(float) * 10, cudaMemcpyHostToDevice);

	gpu_shared_memory<<<1, 10>>>(d_a);

	// Copy the modified array back to host memory
	cudaMemcpy((void *)h_a, (void *)d_a, sizeof(float) * 10, cudaMemcpyDeviceToHost);
	printf("Use of Shared Memory on GPU:  \n");
	// Print the result on the console
	for (int i = 0; i < 10; i++)
	{
		printf("The running average after %d element is %f \n", i, h_a[i]);
	}
	cudaFree(d_a);
	return 0;
}

The MA operation is simple: it computes the average of all elements of the array up to and including the current element. Many threads use the same data from the array while computing, which makes this an ideal use case for shared memory: it gives faster data access than global memory and reduces the number of global memory accesses per thread, thereby reducing the program's latency. Variables in shared memory are declared with the __shared__ qualifier. In this example we define a shared-memory array of 10 float elements. In general, the size of the shared memory array should equal the number of threads per block; since we are working with an array of 10 elements, we size the shared memory accordingly.
The next step is to copy the data from global memory to shared memory. Each thread copies one element via its own index, so the block as a whole completes the copy and the data ends up in shared memory. On the next line we start reading from the shared-memory array, but before continuing we should make sure that all threads have completed their writes, so we synchronize with __syncthreads().
Then each thread uses the values stored in shared memory to compute, in the for loop, the average from the first element up to its own element, and stores its result in the corresponding location in global memory.
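When the required amount of shared memory is not known at compile time, it can also be requested at launch time instead: the array is declared with extern __shared__ inside the kernel, and its size in bytes is passed as the third parameter of the launch configuration. A sketch of the same moving-average kernel in that style (only the pieces that change are shown):

__global__ void gpu_shared_memory_dyn(float *d_a)
{
	int i, index = threadIdx.x;
	float sum = 0.0f;

	// Size is supplied by the third launch parameter instead of a compile-time constant
	extern __shared__ float sh_arr[];

	sh_arr[index] = d_a[index];
	__syncthreads();    // Wait until every thread has written its element

	for (i = 0; i <= index; i++)
	{
		sum += sh_arr[i];
	}
	d_a[index] = sum / (index + 1.0f);
}

// Launched with 10 threads and 10 floats of dynamic shared memory:
//   gpu_shared_memory_dyn<<<1, 10, 10 * sizeof(float)>>>(d_a);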

Constant memory

CUDA programmers often use another type of memory: constant memory. NVIDIA GPUs logically provide 64 KB of constant memory space, which can be used to store data that remains constant during kernel execution. For small amounts of data read by many threads, constant memory has an advantage over global memory in certain situations, and using it also reduces global memory bandwidth consumption to some extent. In this subsection we will look at how to use constant memory in CUDA, with a simple program that evaluates the expression a * x + b, where a and b are constants. The code is as follows:

#include "stdio.h"
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

//Defining two constants
__constant__ int constant_f;
__constant__ int constant_g;
#define N	5

//Kernel function for using constant memory
__global__ void gpu_constant_memory(float *d_in, float *d_out) 
{
    
    
	//Thread index for current kernel
	int tid = threadIdx.x;	
	d_out[tid] = constant_f*d_in[tid] + constant_g;
}

int main() 
{
    
    
	//Defining Arrays for host
	float h_in[N], h_out[N];
	//Defining Pointers for device
	float *d_in, *d_out;
	int h_f = 2;
	int h_g = 20;
	// allocate the memory on the cpu
	cudaMalloc((void**)&d_in, N * sizeof(float));
	cudaMalloc((void**)&d_out, N * sizeof(float));
	//Initializing Array
	for (int i = 0; i < N; i++) 
	{
    
    
		h_in[i] = i;
	}
	//Copy Array from host to device
	cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
	//Copy constants to constant memory
	cudaMemcpyToSymbol(constant_f, &h_f, sizeof(int), 0, cudaMemcpyHostToDevice);
	cudaMemcpyToSymbol(constant_g, &h_g, sizeof(int));

	//Calling kernel with one block and N threads per block
	gpu_constant_memory << <1, N >> >(d_in, d_out);
	//Coping result back to host from device memory
	cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
	//Printing result on console
	printf("Use of Constant memory on GPU \n");
	for (int i = 0; i < N; i++) 
	{
    
    
		printf("The expression for input %f is %f\n", h_in[i], h_out[i]);
	}
	//Free up memory
	cudaFree(d_in);
	cudaFree(d_out);
	return 0;
}

Variables in constant memory are declared with the __constant__ keyword. In the code above, the two integer constants constant_f and constant_g are declared with __constant__ outside the kernel and will not change during kernel execution; note that once declared this way, they must not be declared again inside the kernel. The kernel uses the two constants to perform a simple arithmetic operation, and in the main function their values are loaded into constant memory in a special way.
In the main function, the two constants h_f and h_g are defined and initialized on the host and then copied into constant memory on the device with the cudaMemcpyToSymbol function. This function takes five parameters: the first is the destination symbol, i.e. the variable declared with __constant__ (constant_f or constant_g); the second is the source address on the host; the third is the size of the transfer; the fourth is the byte offset into the destination, here 0; and the fifth is the direction of the transfer, host to device. The last two parameters are optional, which is why the second cudaMemcpyToSymbol call omits them.
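The counterpart function cudaMemcpyFromSymbol copies in the other direction (its direction parameter defaults to cudaMemcpyDeviceToHost), which can be handy for verifying on the host what is currently stored in constant memory. A small sketch that could be added after the two copies above:

	int check_f = 0, check_g = 0;
	// Read the constants back from constant memory into host variables
	cudaMemcpyFromSymbol(&check_f, constant_f, sizeof(int));
	cudaMemcpyFromSymbol(&check_g, constant_g, sizeof(int));
	printf("Constant memory now holds: constant_f = %d, constant_g = %d\n", check_f, check_g);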

Texture memory

Texture memory is another read-only memory that can speed up program execution and reduce memory bandwidth when data accesses follow certain patterns. Like constant memory, it is cached on the chip. It was originally designed for graphics rendering, but it can also be used for general-purpose computing. It becomes very efficient when a program's memory accesses exhibit spatial locality, meaning that each thread reads from a location adjacent to the locations read by nearby threads. This is very useful in image-processing applications that work on 4- or 8-point neighborhoods.
A general-purpose global memory cache cannot exploit this spatial locality effectively and may cause a large number of device memory transfers. Texture memory is designed for exactly this access pattern: data is read from device memory only once and then cached, so subsequent accesses are much faster. Texture memory supports one-, two-, and three-dimensional texture fetches. Using texture memory in a CUDA program is not entirely straightforward, especially for non-experts. In this section we explain, with an example, how to fill an array via texture memory:

#include "stdio.h"
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>

#define NUM_THREADS 10
#define N 10
texture <float, 1, cudaReadModeElementType> textureRef;

__global__ void gpu_texture_memory(int n, float *d_out)
{
    
    
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	if (idx < n)
	{
    
    
		float temp = tex1D(textureRef, float(idx));
		d_out[idx] = temp;
	}
}

int main()
{
    
    
	//Calculate number of blocks to launch
	int num_blocks = N / NUM_THREADS + ((N % NUM_THREADS) ? 1 : 0);
	//Declare device pointer
	float *d_out;
	// allocate space on the device for the result
	cudaMalloc((void**)&d_out, sizeof(float) * N);
	// allocate space on the host for the results
	float *h_out = (float*)malloc(sizeof(float) * N);
	//Declare and initialize host array
	float h_in[N];
	for (int i = 0; i < N; i++) 
	{
    
    
		h_in[i] = float(i);
	}
	//Define CUDA Array
	cudaArray *cu_Array;
	cudaMallocArray(&cu_Array, &textureRef.channelDesc, N, 1);
	//Copy data to CUDA Array
	cudaMemcpyToArray(cu_Array, 0, 0, h_in, sizeof(float)*N, cudaMemcpyHostToDevice);
	
	// bind a texture to the CUDA array
	cudaBindTextureToArray(textureRef, cu_Array);
	//Call Kernel	
  	gpu_texture_memory << <num_blocks, NUM_THREADS >> >(N, d_out);
	
	// copy result back to host
	cudaMemcpy(h_out, d_out, sizeof(float)*N, cudaMemcpyDeviceToHost);
	printf("Use of Texture memory on GPU: \n");
	for (int i = 0; i < N; i++)
	{
    
    
		printf("Texture element at %d is : %f\n",i, h_out[i]);
	}
	free(h_out);
	cudaFree(d_out);
	cudaFreeArray(cu_Array);
	cudaUnbindTexture(textureRef);	
}

A region of texture memory that can be fetched from is defined through a texture reference, declared as a variable of type texture<>. The declaration takes three template parameters: the first describes the type of the texture elements, float in this example; the second specifies the dimensionality of the texture reference, which can be 1D, 2D, or 3D (1D here); the third is the read mode, an optional parameter that indicates whether an automatic type conversion is performed on read. Make sure the texture reference is defined as a global static variable, and that it is not passed as an argument to any other function. In the kernel, each thread reads, through the texture reference, the data at the position given by its thread ID and copies it to the global memory pointed to by d_out.
In the main function, after defining and allocating host memory and the array in device memory, the host array is initialized with the values 0-9. This example also shows the first use of a CUDA array. CUDA arrays are similar to normal arrays, but they are dedicated to textures and are read-only for kernels. They are written from the host with the cudaMemcpyToArray function, as the code above shows; in that call, the 0s in the second and third parameters are the horizontal and vertical offsets into the target CUDA array. Offsets of 0 in both directions mean the transfer starts at the upper-left corner (0, 0) of the target array. The memory layout inside a CUDA array is opaque to the user and is optimized specifically for texture fetching.
The cudaBindTextureToArray function binds the texture reference to the CUDA array, so the array we just wrote becomes the backing store of the texture reference. After the binding we call the kernel, which fetches from the texture and writes the results to the destination array in device memory. Note that CUDA has two common ways of storing large amounts of data in device memory: ordinary linear memory, which can be accessed directly through pointers, and CUDA arrays, which are opaque to the user and cannot be dereferenced through pointers in kernels; they must be accessed through the texture or surface functions. In this example's kernel, the read goes through the texture fetch function, while the write uses a normal pointer (d_out[]). When the kernel has finished, the result array is copied back to host memory and printed to the console. When we are done with the texture, we unbind it by calling cudaUnbindTexture and then free the CUDA array with cudaFreeArray().
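One caveat worth noting: the texture reference API used above (the texture<> type, cudaBindTextureToArray, and cudaUnbindTexture) has long been deprecated and is removed in CUDA 12.x, so this example only builds with older toolkits. A rough sketch of the same 1D fetch written with the newer texture object API is shown below; it assumes the same cu_Array, d_out, N, and launch configuration as in the code above:

// The kernel receives the texture as an ordinary argument instead of a global reference
__global__ void gpu_texture_object(cudaTextureObject_t texObj, int n, float *d_out)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n)
	{
		d_out[idx] = tex1D<float>(texObj, float(idx));
	}
}

// Host-side helper: create a texture object backed by an existing CUDA array
cudaTextureObject_t make_texture_object(cudaArray *cu_Array)
{
	cudaResourceDesc resDesc = {};
	resDesc.resType = cudaResourceTypeArray;
	resDesc.res.array.array = cu_Array;

	cudaTextureDesc texDesc = {};
	texDesc.readMode = cudaReadModeElementType;

	cudaTextureObject_t texObj = 0;
	cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
	return texObj;
}

// Usage in main(), replacing the bind/unbind calls:
//	cudaTextureObject_t texObj = make_texture_object(cu_Array);
//	gpu_texture_object<<<num_blocks, NUM_THREADS>>>(texObj, N, d_out);
//	cudaDestroyTextureObject(texObj);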

Origin blog.csdn.net/taifyang/article/details/128512323