[NVIDIA CUDA advanced programming] NVSHMEM histogram - distributed method


NVSHMEM Histogram - Distributed Approach


PE: processing element (NVSHMEM's term for each process in the job; typically one PE per GPU)

       Let's look at another solution to this problem. A feature of the previous solution is that all histogram tabulation is done locally: each PE builds a full local histogram, then all PEs synchronize and a final reduction combines the results.

       Another way is to split the histogram itself and assign each part to a different GPU. When an input entry does not belong to one of the histogram buckets owned by the local GPU, we directly increment the relevant histogram entry on the remote PE that owns it. At the end we stitch the parts of the histogram back together. We refer to this as the "distributed" approach.


Tradeoffs Between Replicated and Distributed Approaches

       Compared with the replicated approach, the distributed approach reduces the amount of GPU memory required for the histogram on each PE and relieves the pressure of local atomic operations on the histogram. In exchange, it increases the message-passing traffic between GPUs and the pressure of atomic operations on remote GPUs.

Exercise

       We partition the histogram into as many equal segments as there are GPUs and assign the segments to the GPUs in order. We also assume that this partitioning rule is known inside the kernel, so each thread can compute mathematically which PE a given update must be sent to (it would be a straightforward extension to handle the general case, where the mapping is not known in advance and has to be provided to the kernel as input data). A worked example of the mapping is shown below.
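
       As a concrete illustration (a sketch only, using NUM_BUCKETS = 16 from the listing below and assuming a run with 4 PEs), the mapping works out as follows:

// Worked example of the bucket-to-PE mapping, assuming NUM_BUCKETS = 16 and 4 PEs.
int buckets_per_pe = 16 / 4;          // = 4: PE 0 owns global buckets 0-3, PE 1 owns 4-7, and so on
int global_histogram_index = 9;       // suppose an input value lands in global bucket 9
int target_pe = global_histogram_index / buckets_per_pe;                 // 9 / 4 = 2
int local_index = global_histogram_index - target_pe * buckets_per_pe;   // 9 - 2 * 4 = 1
// The increment is therefore sent to bucket 1 of PE 2's local histogram segment.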

       To update the histogram on a remote PE, we want the equivalent of CUDA's atomicAdd(). The corresponding NVSHMEM function is nvshmem_int_atomic_add() (a minimal usage sketch follows the parameter list):

nvshmem_int_atomic_add(destination, value, target_pe);

where:

  • destination: must be a symmetric address (for example, one returned by nvshmem_malloc()); it identifies the location to update on the target PE;
  • value: the value to add;
  • target_pe: the remote PE whose copy of destination is updated.
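
       Below is a minimal device-side sketch (the kernel and buffer names are illustrative, not part of the exercise code): every thread adds 1 to a counter held in the symmetric heap of another PE.

// Sketch only: `counters` is assumed to have been allocated with nvshmem_malloc(),
// so the same address is valid on every PE; target_pe selects whose copy is updated.
__global__ void bump_remote_counter(int* counters, int target_pe)
{
    // Atomically add 1 to counters[0] as it exists on target_pe.
    nvshmem_int_atomic_add(&counters[0], 1, target_pe);
}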

       For the step that stitches the histogram segments back together, there is a convenient API, nvshmem_int_collect(), which concatenates the arrays from all PEs: the array from PE 0 goes into the first part of the destination, the array from PE 1 into the second part, and so on (a short usage sketch follows the parameter list):

nvshmem_int_collect(team, destination, source, nelems);

where:

  • team: the team (group) of PEs participating in the collective; for a global collective we use NVSHMEM_TEAM_WORLD, which contains all PEs;
  • destination: receives the concatenated array (the same result ends up on every PE); since the histogram is evenly distributed among the PEs, its length is n_pes * nelems, matching the length of the full histogram;
  • source: the source array of length nelems contributed by the calling PE.
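
       Below is a minimal host-side sketch (variable names are illustrative; it assumes nvshmem_init() has already been called, that every PE contributes the same number of elements, and that all PEs make the call collectively):

// Sketch only: both buffers live in the symmetric heap.
int    n_pes  = nvshmem_n_pes();
size_t nelems = NUM_BUCKETS / n_pes;                                   // this PE's share of the buckets
int* d_part = (int*) nvshmem_malloc(nelems * sizeof(int));             // source: this PE's segment
int* d_full = (int*) nvshmem_malloc(n_pes * nelems * sizeof(int));     // destination: full histogram
nvshmem_int_collect(NVSHMEM_TEAM_WORLD, d_full, d_part, nelems);
// Afterwards, d_full[p * nelems + i] holds element i contributed by PE p,
// so the per-PE segments appear back-to-back in PE order on every PE.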

The relevant code is as follows (file name: histogram_step2.cpp)

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>

#include <nvshmem.h>
#include <nvshmemx.h>

inline void CUDA_CHECK(cudaError_t err) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}

#define NUM_BUCKETS   16
#define MAX_VALUE     1048576
#define NUM_INPUTS    65536

__global__ void histogram_kernel(const int* input, int* histogram, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    int n_pes = nvshmem_n_pes();
    int buckets_per_pe = NUM_BUCKETS / n_pes;

    if (idx < N) {
        int value = input[idx];

        // Compute the "global" histogram index
        int global_histogram_index = ((size_t) value * NUM_BUCKETS) / MAX_VALUE;

        // Find the PE that owns this histogram index. Assuming every PE
        // holds the same number of buckets, PE 0 owns the first
        // buckets_per_pe buckets, PE 1 the next buckets_per_pe, and so on,
        // so simple integer division gives the target PE.
        int target_pe = global_histogram_index / buckets_per_pe;

        // Now compute the local histogram index on that PE by subtracting
        // the offset of the PE's first bucket.
        int local_histogram_index = global_histogram_index - target_pe * buckets_per_pe;

        nvshmem_int_atomic_add(&histogram[local_histogram_index], 1, target_pe);
    }
}

int main(int argc, char** argv) {
    // Initialize NVSHMEM
    nvshmem_init();

    // Get the NVSHMEM processing element ID and the number of PEs
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Each PE (arbitrarily) selects the GPU matching its ID
    int device = my_pe;
    CUDA_CHECK(cudaSetDevice(device));

    // Each device handles 1 / n_pes of the work.
    const int N = NUM_INPUTS / n_pes;

    // Construct the histogram input data on the host
    int* input = (int*) malloc(N * sizeof(int));

    // Initialize a different random seed on each PE.
    srand(my_pe);

    // Input data ranges from 0 to MAX_VALUE - 1
    for (int i = 0; i < N; ++i) {
        input[i] = rand() % MAX_VALUE;
    }

    // Copy to the device
    int* d_input;
    d_input = (int*) nvshmem_malloc(N * sizeof(int));
    CUDA_CHECK(cudaMemcpy(d_input, input, N * sizeof(int), cudaMemcpyHostToDevice));

    // Allocate the histogram arrays - the full histogram on the host,
    // but only this GPU's segment on the device.
    int* histogram = (int*) malloc(NUM_BUCKETS * sizeof(int));
    memset(histogram, 0, NUM_BUCKETS * sizeof(int));

    int buckets_per_pe = NUM_BUCKETS / n_pes;

    int* d_histogram;
    d_histogram = (int*) nvshmem_malloc(buckets_per_pe * sizeof(int));
    CUDA_CHECK(cudaMemset(d_histogram, 0, buckets_per_pe * sizeof(int)));

    // Also allocate a full-size device histogram for the concatenation
    int* d_concatenated_histogram = (int*) nvshmem_malloc(NUM_BUCKETS * sizeof(int));
    CUDA_CHECK(cudaMemset(d_concatenated_histogram, 0, NUM_BUCKETS * sizeof(int)));

    // Synchronize once so the timing is reasonably accurate
    nvshmem_barrier_all();

    using namespace std::chrono;

    high_resolution_clock::time_point tabulation_start = high_resolution_clock::now();

    // Tabulate the histogram
    int threads_per_block = 256;
    int blocks = (NUM_INPUTS / n_pes + threads_per_block - 1) / threads_per_block;

    histogram_kernel<<<blocks, threads_per_block>>>(d_input, d_histogram, N);
    CUDA_CHECK(cudaDeviceSynchronize());

    nvshmem_barrier_all();

    high_resolution_clock::time_point tabulation_end = high_resolution_clock::now();

    // Concatenate the segments across all PEs

    high_resolution_clock::time_point combination_start = high_resolution_clock::now();

    nvshmem_int_collect(NVSHMEM_TEAM_WORLD, d_concatenated_histogram, d_histogram, buckets_per_pe);

    high_resolution_clock::time_point combination_end = high_resolution_clock::now();

    // Print the results on PE 0
    if (my_pe == 0) {
        duration<double> tabulation_time = duration_cast<duration<double>>(tabulation_end - tabulation_start);
        std::cout << "Tabulation time = " << tabulation_time.count() * 1000 << " ms" << std::endl << std::endl;

        duration<double> combination_time = duration_cast<duration<double>>(combination_end - combination_start);
        std::cout << "Combination time = " << combination_time.count() * 1000 << " ms" << std::endl << std::endl;

        // Copy the data back to the host
        CUDA_CHECK(cudaMemcpy(histogram, d_concatenated_histogram, NUM_BUCKETS * sizeof(int), cudaMemcpyDeviceToHost));

        std::cout << "Histogram counters:" << std::endl << std::endl;
        int num_buckets_to_print = 4;
        for (int i = 0; i < NUM_BUCKETS; i += NUM_BUCKETS / num_buckets_to_print) {
            std::cout << "Bucket [" << i * (MAX_VALUE / NUM_BUCKETS) << ", " << (i + 1) * (MAX_VALUE / NUM_BUCKETS) - 1 << "]: " << histogram[i];
            std::cout << std::endl;
            if (i < NUM_BUCKETS - NUM_BUCKETS / num_buckets_to_print - 1) {
                std::cout << "..." << std::endl;
            }
        }
    }

    free(input);
    free(histogram);
    nvshmem_free(d_input);
    nvshmem_free(d_histogram);
    nvshmem_free(d_concatenated_histogram);

    // Finalize NVSHMEM
    nvshmem_finalize();

    return 0;
}

Compile and run commands:

nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step2 histogram_step2.cpp
nvshmrun -np $NUM_DEVICES ./histogram_step2

The output of a sample run is as follows:

Tabulation time = 0.195561 ms

Combination time = 0.029666 ms

Histogram counters:

Bucket [0, 65535]: 4135
...
Bucket [262144, 327679]: 4028
...
Bucket [524288, 589823]: 4088
...
Bucket [786432, 851967]: 4100

Comparing the Replicated and Distributed Approaches

       So far, we've focused on writing syntactically correct code without thinking about performance. Now let's examine the performance of the distributed and replicated approaches. In both cases, vary the NUM_BUCKETS parameter and the NUM_INPUTS parameter, while paying attention to the histogram building and combining times. Is one method faster than the other? If yes, is there a situation where the performance ratio is reversed?

For convenience, we provide solutions for both implementations below.

Replicated approach

The source code is as follows:

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>

#include <nvshmem.h>
#include <nvshmemx.h>

inline void CUDA_CHECK(cudaError_t err) {
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}

#define NUM_BUCKETS   16
#define MAX_VALUE     1048576
#define NUM_INPUTS    65536

__global__ void histogram_kernel(const int* input, int* histogram, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < N) {
        int value = input[idx];

        int histogram_index = ((size_t) value * NUM_BUCKETS) / MAX_VALUE;

        atomicAdd(&histogram[histogram_index], 1);
    }
}

int main(int argc, char** argv) {
    // Initialize NVSHMEM
    nvshmem_init();

    // Get the NVSHMEM processing element ID and the number of PEs
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Each PE (arbitrarily) selects the GPU matching its ID
    int device = my_pe;
    CUDA_CHECK(cudaSetDevice(device));

    // Each device handles 1 / n_pes of the work.
    const int N = NUM_INPUTS / n_pes;

    // Construct the histogram input data on the host
    int* input = (int*) malloc(N * sizeof(int));

    // Initialize a different random seed on each PE.
    srand(my_pe);

    // Input data ranges from 0 to MAX_VALUE - 1
    for (int i = 0; i < N; ++i) {
        input[i] = rand() % MAX_VALUE;
    }

    // Copy to the device
    int* d_input;
    d_input = (int*) nvshmem_malloc(N * sizeof(int));
    CUDA_CHECK(cudaMemcpy(d_input, input, N * sizeof(int), cudaMemcpyHostToDevice));

    // Allocate the histogram arrays
    int* histogram = (int*) malloc(NUM_BUCKETS * sizeof(int));
    memset(histogram, 0, NUM_BUCKETS * sizeof(int));

    int* d_histogram;
    d_histogram = (int*) nvshmem_malloc(NUM_BUCKETS * sizeof(int));
    CUDA_CHECK(cudaMemset(d_histogram, 0, NUM_BUCKETS * sizeof(int)));

    // Synchronize once so the timing is reasonably accurate
    nvshmem_barrier_all();

    using namespace std::chrono;

    high_resolution_clock::time_point tabulation_start = high_resolution_clock::now();

    // Tabulate the histogram
    int threads_per_block = 256;
    int blocks = (NUM_INPUTS / n_pes + threads_per_block - 1) / threads_per_block;

    histogram_kernel<<<blocks, threads_per_block>>>(d_input, d_histogram, N);
    CUDA_CHECK(cudaDeviceSynchronize());

    nvshmem_barrier_all();

    high_resolution_clock::time_point tabulation_end = high_resolution_clock::now();

    high_resolution_clock::time_point combination_start = high_resolution_clock::now();

    // Perform a sum reduction across all PEs
    nvshmem_int_sum_reduce(NVSHMEM_TEAM_WORLD, d_histogram, d_histogram, NUM_BUCKETS);

    high_resolution_clock::time_point combination_end = high_resolution_clock::now();

    // Print the results on PE 0
    if (my_pe == 0) {
        duration<double> tabulation_time = duration_cast<duration<double>>(tabulation_end - tabulation_start);
        std::cout << "Tabulation time = " << tabulation_time.count() * 1000 << " ms" << std::endl << std::endl;

        duration<double> combination_time = duration_cast<duration<double>>(combination_end - combination_start);
        std::cout << "Combination time = " << combination_time.count() * 1000 << " ms" << std::endl << std::endl;

        // Copy the data back to the host
        CUDA_CHECK(cudaMemcpy(histogram, d_histogram, NUM_BUCKETS * sizeof(int), cudaMemcpyDeviceToHost));

        std::cout << "Histogram counters:" << std::endl << std::endl;
        int num_buckets_to_print = 4;
        for (int i = 0; i < NUM_BUCKETS; i += NUM_BUCKETS / num_buckets_to_print) {
            std::cout << "Bucket [" << i * (MAX_VALUE / NUM_BUCKETS) << ", " << (i + 1) * (MAX_VALUE / NUM_BUCKETS) - 1 << "]: " << histogram[i];
            std::cout << std::endl;
            if (i < NUM_BUCKETS - NUM_BUCKETS / num_buckets_to_print - 1) {
                std::cout << "..." << std::endl;
            }
        }
    }

    free(input);
    free(histogram);
    nvshmem_free(d_input);
    nvshmem_free(d_histogram);

    // Finalize NVSHMEM
    nvshmem_finalize();

    return 0;
}

Compile and run instructions:

nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step1 histogram_step1.cpp
nvshmrun -np $NUM_DEVICES ./histogram_step1

The output of a sample run is as follows:

Tabulation time = 0.035362 ms

Combination time = 0.039909 ms

Histogram counters:

Bucket [0, 65535]: 4135
...
Bucket [262144, 327679]: 4028
...
Bucket [524288, 589823]: 4088
...
Bucket [786432, 851967]: 4100

Distributed approach

The source code is identical to histogram_step2.cpp listed above.

Compile and run commands:

nvcc -x cu -arch=sm_70 -rdc=true -I $NVSHMEM_HOME/include -L $NVSHMEM_HOME/lib -lnvshmem -lcuda -o histogram_step2 histogram_step2.cpp
nvshmrun -np $NUM_DEVICES ./histogram_step2

The output of a sample run is as follows:

Tabulation time = 0.18831 ms

Combination time = 0.028852 ms

Histogram counters:

Bucket [0, 65535]: 4135
...
Bucket [262144, 327679]: 4028
...
Bucket [524288, 589823]: 4088
...
Bucket [786432, 851967]: 4100



Origin blog.csdn.net/qq_31985307/article/details/128596995