[High Performance Computing] OpenCL syntax and related concepts (2): index, queue, kernel function

Data parallelism and task parallelism

Data parallelism divides a large-scale computing task into multiple subtasks and applies those subtasks to different data sets simultaneously. Each subtask executes on an independent processor, improving performance by processing different data sets in parallel. Data parallelism is particularly suitable for workloads that perform the same operation on a large data set, such as matrix multiplication or image processing.
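For example, a minimal data-parallel OpenCL kernel (a hypothetical vector addition, not from the original text) applies one operation across many elements:

```cpp
// Hypothetical data-parallel kernel: every work item performs the
// same operation (an addition) on a different element of the data set.
__kernel void vector_add(__global const float* a,
                         __global const float* b,
                         __global float* c)
{
    size_t i = get_global_id(0);  // each work item handles one element
    c[i] = a[i] + b[i];
}
```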

Task parallelism divides a computing task into multiple subtasks, where each subtask executes a different sequence of operations or instructions. Different subtasks execute simultaneously on different processors, improving performance by running multiple distinct operations in parallel. Task parallelism is particularly suitable for complex computing tasks in which subtasks depend on one another and must cooperate to complete the work; a sketch of the contrast follows below.
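As a hedged sketch of that contrast, here are two unrelated hypothetical kernels that could be submitted as independent tasks (the host-side pattern for running them concurrently appears in the multiple-command-queue section later in this article):

```cpp
// Hypothetical task-parallel workload: two *different* operations.
// Submitted to separate command queues, they may run concurrently.
__kernel void scale_image(__global const float* img, __global float* out)
{
    size_t i = get_global_id(0);
    out[i] = 0.5f * img[i];            // stand-in for a real filter
}

__kernel void histogram(__global const uchar* img, __global uint* bins)
{
    size_t i = get_global_id(0);
    atomic_inc(&bins[img[i]]);         // an entirely different operation
}
```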

Common features of heterogeneous programming languages

In high-performance computing, a "kernel" is the smallest execution unit or smallest computing unit of a computing task. It represents a set of instructions that can be executed on a processor to complete a specific calculation or operation.

A kernel is usually an implementation optimized for a specific computing problem or algorithm. It can contain many calculations and data processing operations, such as matrix multiplication, vector operations, sorting algorithms, etc. The kernel is designed to make full use of hardware resources (such as processors, memory, etc.) to improve computing performance and efficiency by executing multiple kernel instances in parallel.

The kernel typically runs independently of the main program and can execute in parallel on multiple processors. Kernel optimization includes exploiting data locality, vectorized instructions, cache-friendly access patterns, and other techniques to maximize use of the processor's computing power.
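As one concrete illustration of vectorization, OpenCL C's built-in vector types let a kernel process several elements per work item. This is only a sketch, with an assumed kernel name scale4:

```cpp
// Sketch of explicit vectorization: float4 processes four floats per
// work item, mapping naturally onto SIMD hardware where available.
__kernel void scale4(__global const float4* in,
                     __global float4* out,
                     const float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;           // one vector multiply, four elements
}
```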

In high-performance computing, partitioning a task into kernels and managing them well lets a program exploit the full potential of parallel computing, improving execution efficiency and overall performance. Kernel optimization is a key part of high-performance computing.

OpenCL work division

(1) Global index: the coordinates of a work item within the entire index space; for the square matrix shown in the original figure, the highlighted position is (6, 5).
(2) Work group (localSize): a second, coarser division of the global work items.
(3) Local index: the index of a work item within its work group.
Inside a kernel function, each work item can locate its data by querying its global and local indices. The index query functions are as follows:

1. Get the global index: use the built-in function get_global_id(dim) to get the global index of the current work item in dimension dim. The value of dim is usually 0, indicating the first dimension.

2. Get the local index: use the built-in function get_local_id(dim) to get the local index of the current work item in dimension dim.

3. Get the global range: use the built-in function get_global_size(dim) to get the size of the entire execution range (the number of global work items) in dimension dim.

4. Get the local range: use the built-in function get_local_size(dim) to get the size of the work group in dimension dim.

These indexes and ranges allow you to access data in global and local memory as needed. Here's a simple example:

```cpp
__kernel void myKernel(__global float* input, __global float* output,
                       __local float* localInput)
{
    // Get the global and local index
    size_t globalId = get_global_id(0);
    size_t localId = get_local_id(0);

    // Get the global and local range
    size_t globalSize = get_global_size(0);
    size_t localSize = get_local_size(0);

    // Stage this work item's input element in local memory
    // (the __local buffer is sized by the host via clSetKernelArg)
    localInput[localId] = input[globalId];

    // Wait until every work item in the group has written its element
    barrier(CLK_LOCAL_MEM_FENCE);

    // Access the input and output data through the indices and compute
    output[globalId] = input[globalId] + localInput[localId];

    // Work-group level computation or cooperation can follow here
}
```

In this example, we obtain the global index and the local index inside the kernel function myKernel. Using these indices, we compute which elements of the input and output arrays to access and perform the corresponding computation inside the kernel. Local indices can also drive work-group level computation or cooperation (such as sharing data through local memory). The barrier call waits until all work items in the work group have finished their local computation, ensuring that they proceed in step.
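To make "work-group level computation or cooperation" concrete, here is a sketch of a classic pattern: a per-work-group sum reduction in local memory. The kernel and argument names are assumptions, and it assumes the local size is a power of two:

```cpp
// Assumed example: each work group sums its own slice of `input`
// in local memory and writes one partial sum per group.
__kernel void group_sum(__global const float* input,
                        __global float* partialSums,
                        __local  float* scratch)
{
    size_t gid   = get_global_id(0);
    size_t lid   = get_local_id(0);
    size_t lsize = get_local_size(0);

    scratch[lid] = input[gid];
    barrier(CLK_LOCAL_MEM_FENCE);          // all loads finished

    // Tree reduction within the work group (lsize must be a power of two)
    for (size_t stride = lsize / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);      // sync after each step
    }

    if (lid == 0)
        partialSums[get_group_id(0)] = scratch[0];
}
```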

OpenCL context definition

Devices: the collection of OpenCL devices used by the host.
Kernels: the OpenCL functions that run on OpenCL devices.
Program objects: the program source code and executables that implement the kernels.
Memory objects: a set of objects in memory visible to OpenCL devices, containing values that kernel instances can process.
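A minimal host-side sketch of how these four pieces relate (error handling omitted; the no-op kernel source is purely illustrative):

```cpp
// Minimal sketch: device -> context -> program object -> kernel -> memory object.
cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
cl_context     context =  clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

// A program object built from an (illustrative) source string
const char* src = "__kernel void noop(__global int* p) { }";
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, NULL);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// A kernel pulled out of the program object
cl_kernel kernel = clCreateKernel(program, "noop", NULL);

// A memory object visible to the device
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, 1024 * sizeof(int), NULL, NULL);
```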

String-based program objects

The context also includes one or more program objects, which contain the kernel's code. The name "program object" can be confusing; it is best to think of it as a dynamic library from which the functions used by the kernels are pulled. Program objects are constructed by the host program at runtime.

This may seem strange to programmers outside graphics. Consider the challenge an OpenCL programmer faces: they write an OpenCL application and deliver it to end users, but those users may choose to run it elsewhere. The programmer has absolutely no control over which platform the end user runs the application on (be it a CPU, a GPU, or some other chip). All the OpenCL programmer knows is that the target platform conforms to the OpenCL specification.

The solution to this problem is to build the program object from source code at runtime. The host program defines the devices in the context; only then is it known how to compile the program source to create the kernel code. OpenCL is quite flexible about the form of the source itself. In many cases it is an ordinary string that can be statically defined in the host program, loaded from a file at runtime, or generated dynamically by the host program.
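Because compilation happens on the end user's machine, a common pattern is to retrieve the compiler's build log when clBuildProgram fails. A sketch, assuming program and device were created as above (requires <vector> and <iostream>):

```cpp
// Sketch: build at runtime and print the compiler log on failure.
cl_int err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);
    std::vector<char> log(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          logSize, log.data(), NULL);
    std::cerr << "Build failed:\n" << log.data() << std::endl;
}
```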

The same device, multiple command queues

In OpenCL, you can create multiple command queues on a single device. The main advantages are:

1. Concurrent execution: with multiple command queues, you can execute multiple kernels or commands in parallel, improving the program's performance and efficiency.

2. Independent management: each command queue has its own kernel execution order and state, so you can manage and schedule different kernels and tasks more flexibly.

3. Asynchronous operation: multiple command queues enable asynchronous operation. You can submit commands to different queues and synchronize them only when needed.

However, creating multiple command queues also comes with some caveats:

1. Device limits: the number of command queues a device supports depends on the hardware and driver, and exceeding the limit can cause errors. Device capabilities can be queried with clGetDeviceInfo.

2. Memory and resource management: each command queue manages its own memory and resources, so total resource usage in the system must stay within the device's limits.

3. Synchronization and data sharing: synchronization and data sharing between different command queues may require extra mechanisms, such as events or shared memory objects.

To demonstrate creating multiple command queues on the same device, here is a simple code snippet:

```cpp
cl_platform_id platform;
cl_device_id device;
cl_context context;
cl_command_queue queue1, queue2;

// Create the platform, device, and context
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

// Create two command queues on the same device
queue1 = clCreateCommandQueue(context, device, 0, NULL);
queue2 = clCreateCommandQueue(context, device, 0, NULL);

// Submit a command to queue 1
clEnqueueNDRangeKernel(queue1, kernel1, ...);

// Submit a command to queue 2
clEnqueueNDRangeKernel(queue2, kernel2, ...);
```

In this example, we create two command queues, queue1 and queue2, by calling clCreateCommandQueue twice. We can then submit different kernels or tasks to each of the two command queues.

It should be noted that the execution order across different command queues is not guaranteed unless you use additional synchronization mechanisms, such as events or markers, to enforce ordering and synchronization. Additionally, using multiple command queues may require more complex task and data management to ensure correct synchronization and coordination.



One mechanism for ordering work between queues is a marker command (these parameter names correspond to clEnqueueMarkerWithWaitList):

queue: the command queue to which the marker command is submitted.

num_events_in_wait_list: the number of events in the wait list. Usually 0, meaning there are no events to wait for.

event_wait_list: a pointer to the list of events that must complete before the marker executes. Usually NULL, meaning there are no events to wait for.

event: the returned event object identifying the marker command. Until it completes, this event can be used to wait on, or query, the execution status of the preceding work.
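A hedged sketch of using this marker for cross-queue ordering, reusing queue1, queue2, and kernel2 from the snippet above and assuming a globalSize like the one in the full example below:

```cpp
// Insert a marker into queue1. With an empty wait list, the marker's
// event completes only after all commands previously enqueued to
// queue1 have completed.
cl_event marker;
clEnqueueMarkerWithWaitList(queue1, 0, NULL, &marker);

// kernel2, submitted to queue2, waits for the marker event,
// i.e. for everything already enqueued in queue1.
clEnqueueNDRangeKernel(queue2, kernel2, 1, NULL, globalSize, NULL,
                       1, &marker, NULL);
clReleaseEvent(marker);
```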

Example of executing multiple kernel functions in one command queue

```cpp
#include <CL/cl.h>
#include <iostream>

#define NUM_ELEMENTS 1000

// R"(...)" is a C++11 raw string literal. It lets a string contain
// special characters such as backslashes and double quotes without
// escaping, which is convenient for embedding kernel source code.
const char* kernelSource[] = {
    R"(
    __kernel void copy(__global int* input, __global int* output) {
        int gid = get_global_id(0);
        output[gid] = input[gid];
    }
    )",

    R"(
    __kernel void doublevalue(__global int* output) {
        int gid = get_global_id(0);
        output[gid] = output[gid] * 2;
    }
    )"
};

int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_kernel kernel2;
    cl_mem inputBuffer, outputBuffer;
    int input[NUM_ELEMENTS];
    int output[NUM_ELEMENTS];

    // Initialize the input data
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        input[i] = i;
    }

    // Create the platform, device, and context
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // Create the memory objects
    inputBuffer = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int) * NUM_ELEMENTS, input, NULL);
    outputBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int) * NUM_ELEMENTS, NULL, NULL);

    cl_int err;
    // Create the program object from the two source strings and build it
    program = clCreateProgramWithSource(context, 2, kernelSource, NULL, &err);
    if (program == NULL || err != CL_SUCCESS) {
        std::cout << "Failed to create program object." << std::endl;
    }
    // The 1 is num_devices: the number of devices to build the program for
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Create the kernels; the second argument must match the kernel function name
    kernel = clCreateKernel(program, "copy", NULL);
    kernel2 = clCreateKernel(program, "doublevalue", NULL);

    // Set the kernel arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuffer);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputBuffer);
    clSetKernelArg(kernel2, 0, sizeof(cl_mem), &outputBuffer);

    size_t globalSize[1] = { NUM_ELEMENTS };

    // Note: within the same (in-order) command queue, kernels execute in
    // the order in which they were enqueued.
    // Enqueue the first kernel
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalSize, NULL, 0, NULL, NULL);

    // Enqueue the second kernel (it runs after the first completes,
    // because both are in the same in-order queue)
    clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, globalSize, NULL, 0, NULL, NULL);

    // Read back the result (CL_TRUE makes this a blocking read)
    clEnqueueReadBuffer(queue, outputBuffer, CL_TRUE, 0, sizeof(int) * NUM_ELEMENTS, output, 0, NULL, NULL);

    // Print the result
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        std::cout << "Output[" << i << "] = " << output[i] << std::endl;
    }

    // Release resources
    clReleaseMemObject(inputBuffer);
    clReleaseMemObject(outputBuffer);
    clReleaseKernel(kernel);
    clReleaseKernel(kernel2);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
    return 0;
}
```

Origin: blog.csdn.net/hh1357102/article/details/132564011