3.6 CUDA Runtime API: Shared Memory

Foreword

Teacher Du has launched a from-scratch TensorRT high-performance deployment course. I went through it before without taking notes and have since forgotten much of it, so this time I am working through it again and taking notes along the way.

This lesson covers the shared memory section of the streamlined CUDA tutorial.

The course outline can be seen in the mind map below

[Figure: course outline mind map]

1. Shared memory

For shared memory (shared_memory) you need to know:

  1. Shared memory is closer to the compute units, so access is faster (the closer, the faster; the closer, the more expensive).
  2. Shared memory is usually used as a cache for global memory accesses.
  3. Shared memory can be used for communication between threads.
  4. It usually appears together with __syncthreads(), which synchronizes all threads in the block: every thread must reach this line before any thread proceeds.
  5. A common pattern is to have the thread with index 0 load a value from global memory, call __syncthreads(), and then let all threads use it.
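The load-then-sync pattern in item 5 can be sketched as follows. This is a minimal illustration; the kernel name and variable names are hypothetical, not from the course code:

```cuda
#include <cstdio>

// Sketch of the pattern: thread 0 loads a value from global memory into
// shared memory, all threads synchronize, and then every thread in the
// block reads the cached value.
__global__ void broadcast_kernel(const float* global_value){
    __shared__ float cached;            // one value shared by the whole block
    if(threadIdx.x == 0)
        cached = global_value[0];       // only thread 0 touches global memory
    __syncthreads();                    // wait until the value is in place
    printf("thread %d sees %f\n", threadIdx.x, cached);
}
```

Without the __syncthreads() barrier, threads other than thread 0 could read `cached` before it is written, so the barrier is essential to the pattern.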

2. Shared memory case

The main.cpp sample code for the shared memory case is as follows:


#include <cuda_runtime.h>
#include <stdio.h>

#define checkRuntime(op)  __check_cuda_runtime((op), #op, __FILE__, __LINE__)

bool __check_cuda_runtime(cudaError_t code, const char* op, const char* file, int line){
    if(code != cudaSuccess){
        // Retrieve the error name and message, then report where the call failed
        const char* err_name    = cudaGetErrorName(code);
        const char* err_message = cudaGetErrorString(code);
        printf("runtime error %s:%d  %s failed. \n  code = %s, message = %s\n", file, line, op, err_name, err_message);
        return false;
    }
    return true;
}

void launch();

int main(){

    // Query the device properties and print the shared memory available per block
    cudaDeviceProp prop;
    checkRuntime(cudaGetDeviceProperties(&prop, 0));
    printf("prop.sharedMemPerBlock = %.2f KB\n", prop.sharedMemPerBlock / 1024.0f);

    launch();
    checkRuntime(cudaPeekAtLastError());
    checkRuntime(cudaDeviceSynchronize());
    printf("done\n");
    return 0;
}

The main.cu sample code for the shared memory case is as follows:

#include <cuda_runtime.h>
#include <stdio.h>

///////////////////////////// demo1 /////////////////////////////
/*
demo1 mainly shows how to inspect the addresses of static and dynamic shared variables
 */
const size_t static_shared_memory_num_element = 6 * 1024; // 6KB
__shared__ char static_shared_memory[static_shared_memory_num_element];
__shared__ char static_shared_memory2[2];

__global__ void demo1_kernel(){

    extern __shared__ char dynamic_shared_memory[];      // static and dynamic shared variables can be defined inside or outside the kernel; there is no restriction
    extern __shared__ char dynamic_shared_memory2[];
    printf("static_shared_memory = %p\n",   static_shared_memory);   // static shared variables: each definition gets its own, successively higher address
    printf("static_shared_memory2 = %p\n",  static_shared_memory2);
    printf("dynamic_shared_memory = %p\n",  dynamic_shared_memory);  // dynamic shared variables: no matter how many are declared, the address is the same
    printf("dynamic_shared_memory2 = %p\n", dynamic_shared_memory2);

    if(blockIdx.x == 0 && threadIdx.x == 0) // the first thread
        printf("Run kernel.\n");
}

///////////////////////////// demo2 /////////////////////////////
/*
demo2 mainly demonstrates how to assign values to shared variables
 */
// Define a shared variable. It cannot be given an initializer; it must be assigned by a thread or by some other means
__shared__ int shared_value1;

__global__ void demo2_kernel(){

    __shared__ int shared_value2;
    if(threadIdx.x == 0){

        // When the thread index is 0, give the shared values their initial values
        if(blockIdx.x == 0){
            shared_value1 = 123;
            shared_value2 = 55;
        }else{
            shared_value1 = 331;
            shared_value2 = 8;
        }
    }

    // Wait until all threads in the block have reached this point
    __syncthreads();

    printf("%d.%d. shared_value1 = %d[%p], shared_value2 = %d[%p]\n",
        blockIdx.x, threadIdx.x,
        shared_value1, &shared_value1,
        shared_value2, &shared_value2
    );
}

void launch(){

    // demo1: 1 block, 1 thread, 12 bytes of dynamic shared memory
    demo1_kernel<<<1, 1, 12, nullptr>>>();
    // demo2: 2 blocks, 5 threads each, no dynamic shared memory
    demo2_kernel<<<2, 5, 0, nullptr>>>();
}

The running effect is as follows:


Figure 2-1 Running effect of the shared memory case

In the main function, we obtain the properties of the current device by calling cudaGetDeviceProperties, and print the device's shared memory size per block, which is generally 48KB.

The sample code above shows two examples of using shared memory: demo1_kernel and demo2_kernel.

demo1_kernel

This example shows the addresses of static and dynamic shared variables. It uses two static shared variables and two dynamic shared variables; both kinds can be defined inside or outside the kernel function without restriction. The kernel is launched with a single thread block containing a single thread, so only one thread executes the kernel and prints the address of each shared variable.

From the print statements we can see that the addresses of static shared variables increase with each definition, while dynamic shared variables always share the same address.

demo2_kernel

This example mainly demonstrates how to assign values to shared variables, as shown in Figure 2-2.


Figure 2-2 Demo2_kernel schematic diagram

In this example we define two shared variables, shared_value1 and shared_value2. The kernel is launched with two thread blocks of 5 threads each, and the first thread of each block performs the assignment to the shared variables. When execution reaches __syncthreads(), the other 4 threads of each block wait for the first thread to complete the assignment.

The shared variables of the first block are assigned 123 and 55, and those of the second block are assigned 331 and 8. Since shared memory is shared at the block level, all threads in the first block print 123 and 55, while all threads in the second block print 331 and 8, as the running results confirm.

This example demonstrates the basic concepts and usage of shared memory. Shared memory lets threads within the same block share data, and offers low latency and high bandwidth. In real CUDA programs, shared memory is often used to improve memory access efficiency and to implement cooperative algorithms.
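As a hypothetical illustration of such a cooperative algorithm (not part of the course code), a classic block-level sum reduction stages data in shared memory and then cooperatively combines it. All names and sizes below are assumptions for the sketch:

```cuda
#include <cstdio>

#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements of the input. Threads first copy
// their values into shared memory (the "cache" role), then halve the
// active range each round, with a barrier between rounds.
__global__ void block_sum_kernel(const float* input, float* block_sums){
    __shared__ float cache[BLOCK_SIZE];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = input[idx];   // global -> shared
    __syncthreads();

    // Tree reduction inside the block; every round needs a barrier
    for(int stride = blockDim.x / 2; stride > 0; stride /= 2){
        if(threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if(threadIdx.x == 0)
        block_sums[blockIdx.x] = cache[0];  // one partial sum per block
}
```

Each round reads values written in the previous round by other threads, which is why a __syncthreads() barrier is required on every iteration, exactly as in the demo2 example above.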

3. Supplementary knowledge

Knowledge points about shared memory: (from Teacher Du)

  1. sharedMemPerBlock indicates the maximum shared memory available to a block
     • It is available so that threads in the block can communicate with each other
     • An application example of sharedMemPerBlock is shown below

[Figures: sharedMemPerBlock application examples]

  2. Shared memory is on-chip memory, closer to the compute units, so it is faster than global memory and is usually used as a cache
     • Data is first read into shared memory; computations then use shared memory instead of global memory
  3. In demo_kernel<<<1, 1, 12, nullptr>>>();, the third parameter 12 specifies the size of the dynamic shared memory dynamic_shared_memory
     • A dynamic shared memory variable must start with extern __shared__
     • and be defined as an array of unspecified size []
     • The unit of 12 is bytes, so 3 floats can be safely stored
     • Placing the variable outside a function is the same as placing it inside
     • Its pointer is assigned when the CUDA scheduler launches the kernel
  4. static_shared_memory is statically allocated shared memory
     • Do not add extern; start with __shared__
     • The size of the array must be specified at definition
     • Statically allocated addresses are lower than dynamically allocated addresses
  5. Dynamic shared variables: no matter how many are defined, the address is the same
  6. Static shared variables: each definition gets its own, successively higher address
  7. If the total of all configured shared memory exceeds sharedMemPerBlock, the kernel fails with an Invalid argument error
     • The memory regions of different static shared variables are not necessarily contiguous
     • A memory alignment policy may leave a gap between the first and second variables
     • So if there are gaps between your variables, the total you compute may be under the limit while the actual usage exceeds it and triggers the error
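The dynamic shared memory sizing rules in item 3 can be sketched as follows. This is a minimal illustration under assumed names; the kernel and launch configuration are hypothetical, not from the course code:

```cuda
#include <cstdio>

// The extern array has no size in its declaration; the third launch
// parameter supplies the byte count at launch time.
__global__ void dyn_kernel(int n){
    extern __shared__ float buffer[];   // size comes from the launch, not here
    if(threadIdx.x < n)
        buffer[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if(threadIdx.x == 0)
        printf("buffer holds %d floats\n", n);
}

void launch_dyn(){
    int n = 3;
    // 12 bytes of dynamic shared memory = 3 floats (sizeof(float) == 4)
    dyn_kernel<<<1, 32, n * sizeof(float), nullptr>>>(n);
}
```

Note that even with multiple extern __shared__ declarations, they would all alias the same 12-byte region, which is why the course stresses that dynamic shared variables always share one address.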

Summary

This lesson covered the use of shared memory. Shared memory allows data to be shared within a thread block, and because it sits close to the compute units, access is faster than global memory. Shared memory usually appears together with __syncthreads(), which synchronizes all threads in the block. Shared memory variables come in static and dynamic forms: static shared variables each get their own successively higher address, while dynamic shared variables all share the same address no matter how many are declared. A shared variable cannot be given an initializer; it must be assigned by a thread or by some other means.


Origin blog.csdn.net/qq_40672115/article/details/131627843