[GPU] Nvidia CUDA programming advanced tutorial - NVSHMEM memory model

The blogger has not authorized any person or organization to reprint any of the blogger's original articles. Thank you for supporting original content!

I work for an internationally renowned terminal manufacturer, where I am responsible for the research and development of modem chips.
In the early days of 5G, I was responsible for the development of the terminal data service layer and the core network. Currently, I am leading research on technical standards for 6G computing power networks.


The content of this blog mainly revolves around:
       5G/6G protocol explanations
       computing power network explanations (cloud computing, edge computing, on-device computing)
       advanced C language explanations
       Rust language explanations



NVSHMEM memory model


PE: processing element (a process in the NVSHMEM job, each typically bound to one GPU)

Symmetric memory

       NVSHMEM's memory allocation API nvshmem_malloc() works much like the standard cudaMalloc(), but cudaMalloc() returns an address that is private to the local GPU [1]. Objects allocated with nvshmem_malloc() are called symmetric data objects. Each symmetric data object has a corresponding object of the same name, type, and size on every PE. The virtual address corresponding to the pointer returned by nvshmem_malloc() is called a symmetric address. Symmetric addresses may legally be passed to NVSHMEM communication routines for remote access to other PEs (they can also be used directly for access to the PE's local memory). We can manipulate symmetric addresses just like ordinary local addresses. To access a copy of a symmetric data object on a remote PE with the NVSHMEM API, we index the storage with a pointer as usual and access the corresponding location on the remote target PE. For example,

       If we execute the following statement:

int* a = (int*) nvshmem_malloc(sizeof(int));

Then we can perform a local memory access on the local PE, or a remote memory access on a remote PE, to obtain the value of a[0]. One way to think about this operation is that, given M PEs, we distribute the elements of an array of length M evenly across all PEs so that each PE holds exactly one element. Since the symmetric data object has length 1 in this example, we only ever need to access a[0] on any given PE.
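As a quick illustration, here is a minimal host-side sketch (not part of the exercise below), assuming nvshmem_init() has already run, each PE has selected its GPU, and a is the symmetric address returned by the nvshmem_malloc() call above. It uses the point-to-point routines nvshmem_int_p() and nvshmem_int_g() to write to and read from a neighbour's copy of the symmetric object:

// Minimal sketch (illustrative): ring exchange through the symmetric address a.
int my_pe = nvshmem_my_pe();
int n_pes = nvshmem_n_pes();
int peer  = (my_pe + 1) % n_pes;

// Remote write: store this PE's ID into a[0] on the neighbouring PE.
nvshmem_int_p(a, my_pe, peer);

// Make sure every PE's write has completed before anyone reads.
nvshmem_barrier_all();

// Read this PE's own copy: it now holds the ID of the previous PE in the ring.
int received = nvshmem_int_g(a, my_pe);

// The same symmetric address works for a remote read as well.
int neighbours_copy = nvshmem_int_g(a, peer);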

       In NVSHMEM, dynamic memory allocation for symmetric data objects comes from a special memory region called the symmetric heap, which NVSHMEM creates during program execution [2] and then uses for all subsequent dynamic allocations of symmetric data objects.
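Because every PE must end up with a matching copy of each symmetric data object, the symmetric-heap allocation routines are collective: every PE calls nvshmem_malloc() (and later nvshmem_free()) with the same size. On platforms where a peer GPU's memory is directly accessible (for example over NVLink, see footnote [1]), nvshmem_ptr() can even return a raw pointer into a remote PE's copy. The following is only an illustrative sketch, not part of the exercise:

// Illustrative sketch: collective allocation on the symmetric heap, plus an
// optional direct pointer into a peer's copy where the hardware allows it.
int* a = (int*) nvshmem_malloc(sizeof(int));       // collective: same size on every PE

int peer = (nvshmem_my_pe() + 1) % nvshmem_n_pes();
int* peer_a = (int*) nvshmem_ptr(a, peer);         // NULL if the peer is not directly reachable

if (peer_a != NULL) {
    // The peer's copy is mapped locally; it can be passed to a kernel and
    // dereferenced like ordinary device memory.
} else {
    // Otherwise, fall back to NVSHMEM communication routines such as
    // nvshmem_int_p() / nvshmem_int_g().
}

nvshmem_free(a);                                   // also a collective call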

Exercise 1

       First, we replace the call to cudaMalloc() with a call to nvshmem_malloc(). We can still use atomicAdd() on the locally allocated data, so each PE's copy of the symmetric object will end up with the same result as before.
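In code, this first change is just the allocation. A rough before/after sketch (the cudaMalloc() line is a paraphrase of the earlier version; the nvshmem_malloc() line matches the full listing below):

// Before (CUDA only): the counter is private to the local GPU.
//   int* d_hits;
//   CUDA_CHECK(cudaMalloc(&d_hits, sizeof(int)));

// After (NVSHMEM): the counter is a symmetric data object, one copy per PE.
int* d_hits = (int*) nvshmem_malloc(sizeof(int));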

       Second, we sum the results across all PEs. This is a collective operation, specifically a global sum reduction. In NVSHMEM we can use nvshmem_int_sum_reduce(team, dest, source, nreduce) to sum all instances of a symmetric object, where:

  • team: the group of PEs participating in the sum [3] (we will use the default team NVSHMEM_TEAM_WORLD, which is the set of all PEs; a sketch of a custom team follows this list);
  • dest: the symmetric address where the result is stored;
  • source: the symmetric address of the values we want to sum;
  • nreduce: the number of elements to reduce (only one for us, since our data is a scalar).
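For completeness, the team argument also lets a reduction run over a subset of PEs rather than all of them. The following sketch is purely illustrative and not needed for this exercise; dest and source stand for the symmetric addresses described above. It builds a team of the even-numbered PEs with nvshmem_team_split_strided():

// Illustrative sketch: reduce over the even-numbered PEs only.
// dest and source are assumed to be symmetric addresses, as described above.
nvshmem_team_t even_team;
nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD,
                           0,                           // start at PE 0
                           2,                           // take every 2nd PE
                           (nvshmem_n_pes() + 1) / 2,   // number of PEs in the new team
                           NULL, 0,                     // default team configuration
                           &even_team);

// Only PEs that are members of even_team take part in this reduction.
if (even_team != NVSHMEM_TEAM_INVALID) {
    nvshmem_int_sum_reduce(even_team, dest, source, 1);
}

nvshmem_team_destroy(even_team);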

In summary, what we have to do is:

// Accumulate the results from all PEs
int* d_hits_total = (int*) nvshmem_malloc(sizeof(int));
nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, d_hits_total, d_hits, 1);

Now all PEs have the sum of the counts, so the third change we need to make is to print the result on a single PE only. By convention, we usually print on PE 0.

if (my_pe == 0) {
    // Copy the final result back to the host
    ...

    // Compute the final value of pi
    ...

    // Print the result
    ...
}

The complete code is as follows (file name: nvshmem_pi_step3.cpp):

#include <iostream>
#include <curand_kernel.h>

#include <nvshmem.h>
#include <nvshmemx.h>

inline void CUDA_CHECK (cudaError_t err) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}

#define N 1024*1024

__global__ void calculate_pi(int* hits, int seed) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Initialize the random number state (must be distinct for every thread in the grid)
    int offset = 0;
    curandState_t curand_state;
    curand_init(seed, idx, offset, &curand_state);

    // Generate random coordinates within (0.0, 1.0]
    float x = curand_uniform(&curand_state);
    float y = curand_uniform(&curand_state);

    // If this point falls inside the circle, increment the hit counter
    if (x * x + y * y <= 1.0f) {
        atomicAdd(hits, 1);
    }
}


int main(int argc, char** argv) {
    // Initialize NVSHMEM
    nvshmem_init();

    // Get the NVSHMEM processing element ID and the number of PEs
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Each PE (arbitrarily) selects the GPU corresponding to its ID
    int device = my_pe;
    CUDA_CHECK(cudaSetDevice(device));

    // Allocate the host and device counters
    int* hits = (int*) malloc(sizeof(int));
    int* d_hits = (int*) nvshmem_malloc(sizeof(int));

    // Initialize the hit count and copy it to the device
    *hits = 0;
    CUDA_CHECK(cudaMemcpy(d_hits, hits, sizeof(int), cudaMemcpyHostToDevice));

    // Launch the kernel to do the calculation
    int threads_per_block = 256;
    int blocks = (N / n_pes + threads_per_block - 1) / threads_per_block;

    int seed = my_pe;
    calculate_pi<<<blocks, threads_per_block>>>(d_hits, seed);
    CUDA_CHECK(cudaDeviceSynchronize());

    // Accumulate the results from all PEs
    int* d_hits_total = (int*) nvshmem_malloc(sizeof(int));
    nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, d_hits_total, d_hits, 1);

    if (my_pe == 0) {
        // Copy the final result back to the host
        CUDA_CHECK(cudaMemcpy(hits, d_hits_total, sizeof(int), cudaMemcpyDeviceToHost));

        // Compute the final value of pi
        float pi_est = (float) *hits / (float) (N) * 4.0f;

        // Print the result
        std::cout << "Estimated value of pi averaged over all PEs = " << pi_est << std::endl;
        std::cout << "Relative error averaged over all PEs = " << std::abs((M_PI - pi_est) / pi_est) << std::endl;
    }

    free(hits);
    nvshmem_free(d_hits);
    nvshmem_free(d_hits_total);

    // Finalize NVSHMEM
    nvshmem_finalize();

    return 0;
}

The compile and run instructions are as follows:

nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o nvshmem_pi_step3 exercises/nvshmem_pi_step3.cpp
nvshmrun -np $NUM_DEVICES ./nvshmem_pi_step3

The result is as follows:

Estimated value of pi averaged over all PEs = 3.14072
Relative error averaged over all PEs = 0.000277734




  [1] The exception is systems where the GPUs are connected with NVLink: there the CUDA IPC mechanism can be used to let the GPUs access each other's memory directly.

  [2] The default size of the symmetric heap is 1 GB; it can be controlled with the environment variable NVSHMEM_SYMMETRIC_SIZE.

  [3] Following the OpenSHMEM 1.5 specification, the ability to use a team to restrict an operation to a group of PEs is a feature introduced in NVSHMEM 2.0.


Origin blog.csdn.net/qq_31985307/article/details/128594791