[High Performance Computing] OpenCL syntax and related concepts (1): workflow, examples

The difference between OpenCL running on GPU and CPU

In OpenCL, code that performs the same calculation on different devices may differ slightly. The main differences show up in the following areas:

  1. Getting a device: when calling the clGetDeviceIDs function, you need to specify the device type: pass the CL_DEVICE_TYPE_GPU parameter to get a GPU device and the CL_DEVICE_TYPE_CPU parameter to get a CPU device.

  2. Kernel code tuning: because of the architectural differences between GPUs and CPUs, the kernel code may need some tuning to realize each device's full performance. For example, GPU kernels can be optimized with strategies such as vectorization and well-chosen work-group sizes (see the sketch after this list), while for CPUs the focus is usually on parallelism and thread-level optimization.

  3. Memory access patterns: memory access patterns also differ between GPUs and CPUs. GPUs generally favor vectorized operations and coalesced global-memory access, while CPUs benefit more from cache-friendly access patterns. Your code may therefore use different memory objects (such as buffers or images) and different access patterns to suit each device.
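
As a hedged illustration of the vectorization idea in item 2 (a sketch, not code from the original post), an int4 variant of the vector-add kernel processes four elements per work item:

// Vectorized vector addition: each work item handles one int4, i.e. four
// elements at once, so the global range shrinks by a factor of 4.
__kernel void vector_add4(__global const int4* a,
                          __global const int4* b,
                          __global int4* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}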

However, most of the code logic and structure is the same on GPU and CPU. Many workflow steps and function calls (creating the context, command queue, program, kernel, etc.) are identical. When writing code, you can therefore use conditional statements or separate functions to handle the different device types while sharing most of the common code.
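
For example, a minimal sketch of such a conditional (assuming a platform handle has already been obtained) prefers a GPU and falls back to the CPU:

// Prefer a GPU; fall back to the CPU if the platform exposes none.
cl_device_id device;
cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
if (err != CL_SUCCESS) {
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
}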

Overall, OpenCL provides an abstraction layer that lets you run computations on different devices with largely unified code. Understanding each device's characteristics and limitations, and tuning the code accordingly, helps you make the most of both GPU and CPU computing capabilities.

OpenCL basic workflow and concepts

OpenCL involves the following basic concepts and workflow:

  1. Platform: a platform represents a computing environment, i.e. a combination of computing devices with particular characteristics and hardware configurations. The platform is the highest-level concept and can contain multiple devices, such as CPUs, GPUs, and FPGAs. Each platform is provided by a vendor and has its own features and supported hardware. Installing OpenCL SDKs (software development kits) from several hardware vendors makes several platforms appear at the same time, because each vendor's SDK registers its own platform with its specific device support.

By querying the list of available platforms, you can see platforms from multiple hardware vendors coexisting on the same system, each representing a computing environment with its own features and hardware configuration. Platforms and devices are queried with OpenCL functions such as clGetPlatformIDs and clGetDeviceIDs; a query sketch follows this list.

  2. Device: part of a platform, representing a computing resource, which can be a CPU, GPU, or other hardware device. Each device has its own computing capabilities and characteristics.

  3. Context: associates platforms and devices and provides access to computing resources. It is the basis for OpenCL function calls and holds the environment information required to perform operations on a device.

  4. Command Queue: used to submit operations to a device and manage their execution order. A command queue executes operations in order by default and allows dependencies between operations to be expressed.

  5. Buffer: used to transfer data between the host and the device; an area of memory that can be allocated and used on the device.

  6. Program: a collection of code consisting of one or more kernel functions. Kernel functions are tasks that execute in parallel on the device; they can accept parameters and access device memory.
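
As referenced above, a minimal sketch of enumerating the available platforms and printing their names (using only standard OpenCL calls):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// List every OpenCL platform on the system with its name.
int main(void) {
    cl_uint num_platforms;
    clGetPlatformIDs(0, NULL, &num_platforms);           // first query the count

    cl_platform_id* platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);    // then fetch the IDs

    for (cl_uint i = 0; i < num_platforms; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    free(platforms);
    return 0;
}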

The basic OpenCL workflow is as follows:

  1. Query and select platforms and devices: use the clGetPlatformIDs and clGetDeviceIDs functions to query the available platforms and devices and select suitable ones.

  2. Create a context: use the clCreateContext function to create a context that associates the selected devices with the platform.

  3. Create a command queue: use the clCreateCommandQueue function to create a command queue that manages the execution of operations.

  4. Create a program: use the clCreateProgramWithSource or clCreateProgramWithBinary function to create a program containing the source or binary code of one or more kernel functions.

  5. Compile and build the program: use the clBuildProgram function to compile the program into executable form (see the sketch after this list for retrieving the build log).

  6. Create a kernel object: use the clCreateKernel function to create a kernel object for executing a kernel function on the device.

  7. Create and manipulate buffers: use the clCreateBuffer function to create buffers, and functions such as clEnqueueReadBuffer and clEnqueueWriteBuffer to transfer data between the host and the device.

  8. Set kernel arguments: use the clSetKernelArg function to set the arguments of a kernel function.

  9. Enqueue the kernel for execution: use the clEnqueueNDRangeKernel function to enqueue the kernel function so that it executes in parallel on the device.

  10. Wait for completion and fetch the results: use the clFinish function to wait for all enqueued commands to complete, and use a function such as clEnqueueReadBuffer to read the result data from the device back to the host.
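
As referenced in step 5, if clBuildProgram fails, the compiler's log can be retrieved with clGetProgramBuildInfo; a minimal sketch (assuming program and device already exist):

// On build failure, fetch and print the kernel compiler's build log.
if (clBuildProgram(program, 1, &device, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &log_size);
    char* log = malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          log_size, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}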

The above are the basic concepts and workflow of OpenCL. Depending on the specific application and its needs, further operations such as memory management, event processing, and error handling may be required.
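
On error handling specifically: most OpenCL calls either return a cl_int status code or write one through an output parameter. A hedged sketch of checking it:

// Most OpenCL APIs report errors as cl_int status codes; CL_SUCCESS is 0.
cl_int err;
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               sizeof(int) * 1024, NULL, &err);
if (err != CL_SUCCESS) {
    fprintf(stderr, "clCreateBuffer failed with error %d\n", err);
    exit(EXIT_FAILURE);
}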

Example: element-wise addition of two arrays

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define SIZE 1024

int main() {
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;

    cl_mem bufferA, bufferB, bufferC;

    int i;
    int* A = (int*)malloc(sizeof(int) * SIZE);
    int* B = (int*)malloc(sizeof(int) * SIZE);
    int* C = (int*)malloc(sizeof(int) * SIZE);

    // Initialize the input data
    for (i = 0; i < SIZE; i++) {
        A[i] = i;
        B[i] = i * 2;
    }

    // Create and initialize the OpenCL environment
    clGetPlatformIDs(1, &platform, NULL);
    // The device type requested here is critical: if it is wrong,
    // clGetDeviceIDs finds no matching device and all results stay 0
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // Create the memory objects
    bufferA = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * SIZE, NULL, NULL);
    bufferB = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * SIZE, NULL, NULL);
    bufferC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * SIZE, NULL, NULL);

    // Transfer the input data to the device
    clEnqueueWriteBuffer(queue, bufferA, CL_TRUE, 0, sizeof(int) * SIZE, A, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, bufferB, CL_TRUE, 0, sizeof(int) * SIZE, B, 0, NULL, NULL);

    // Create and compile the kernel program.
    // In C/C++ a string literal may be split across lines: adjacent
    // double-quoted literals with no terminator between them are merged
    // by the compiler ("string literal concatenation").
    const char* source =
        "__kernel void vector_add(__global const int* a, __global const int* b, __global int* c) {"
        "   int i = get_global_id(0);"
        "   c[i] = a[i] + b[i];"
        "}";
    program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Create the kernel object
    kernel = clCreateKernel(program, "vector_add", NULL);

    // Set the kernel arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufferC);

    // Execute the kernel
    size_t globalSize[1] = { SIZE };
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalSize, NULL, 0, NULL, NULL);

    // Read the result back from the device
    clEnqueueReadBuffer(queue, bufferC, CL_TRUE, 0, sizeof(int) * SIZE, C, 0, NULL, NULL);

    // Print the results
    for (i = 0; i < SIZE; i++) {
        printf("%d + %d = %d\n", A[i], B[i], C[i]);
    }

    // Release resources
    clReleaseMemObject(bufferA);
    clReleaseMemObject(bufferB);
    clReleaseMemObject(bufferC);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    free(A);
    free(B);
    free(C);

    return 0;
}

Related knowledge points

Using memory flags, including automatic copying from host to device

CL_MEM_WRITE_ONLY is one of the memory-object flags in OpenCL; it indicates that the memory object will only be written to.

When you create a memory object, you can use different flags to specify its intended use. The CL_MEM_WRITE_ONLY flag indicates that kernels may only write to the object, not read from it: you can write data into the memory object, but kernel code cannot read data back out of it.

In the sample code above, the bufferC memory object used to store the calculation results was created with the CL_MEM_WRITE_ONLY flag. The kernel writes its results into this memory object but cannot read from it; to get the calculation results, the host uses the clEnqueueReadBuffer function to transfer them from device memory to host memory.

Such a flag allows OpenCL to perform optimizations: knowing that the memory object is write-only from the kernel's point of view, the implementation can optimize accordingly, which can yield better performance and efficiency.

You can also use flag bits to avoid the explicit host-to-device transfers:

// Create memory objects; CL_MEM_COPY_HOST_PTR copies the host data at creation
bufferA = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(int)*SIZE, A, NULL);
bufferB = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(int)*SIZE, B, NULL);
bufferC = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int)*SIZE, NULL, NULL);

// The explicit writes to the device are then no longer needed:
// clEnqueueWriteBuffer(queue, bufferA, CL_TRUE, 0, sizeof(int) * SIZE, A, 0, NULL, NULL);
// clEnqueueWriteBuffer(queue, bufferB, CL_TRUE, 0, sizeof(int) * SIZE, B, 0, NULL, NULL);

Global execution range and local execution range

In OpenCL, the global execution range and the local execution range are two different concepts used to describe the work items and work groups that execute in parallel.

The global execution range is the total number of work items executing in parallel on the computing device; it specifies the number and layout of the work items and determines the scale of the whole parallel computation. Each work item has a unique global identifier within the global range, obtained through the built-in function get_global_id.

The local execution range describes the number and size of the work groups executing in parallel on a computing device. A work group is a set of related work items that can share local memory, and a work item's local identifier within its group is obtained through the built-in function get_local_id. Work groups are commonly used for cooperative computation and communication through shared (local) memory on the device.

In summary, the global execution range determines the number of work items and their global identifiers, while the local execution range determines the number of work groups and how work items are divided among them.
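
A hedged sketch of such work-group cooperation (not code from the original post): each group stages its slice of the input in __local memory, synchronizes with a barrier, and then reads an element written by a different work item of the same group. On the host, the __local argument would be set with clSetKernelArg(kernel, 2, local_size * sizeof(int), NULL).

// Each work group reverses its own slice: values pass between work items
// through __local memory, so a barrier is required before reading.
__kernel void reverse_in_group(__global const int* in,
                               __global int* out,
                               __local int* tmp)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);

    tmp[lid] = in[gid];               // stage this work item's element
    barrier(CLK_LOCAL_MEM_FENCE);     // wait until the whole group has written
    out[gid] = tmp[lsz - 1 - lid];    // read a neighbour's element
}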

When setting the execution range, you specify the global and local sizes through clEnqueueNDRangeKernel (or the corresponding wrapper parameters). If you do not need to control the local execution range, you can pass NULL and let the implementation choose a work-group size.
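
For example, a minimal host-side sketch (assuming queue and kernel already exist) that launches 1024 work items in groups of 64; the global size must be a multiple of the local size:

// 1024 work items split into work groups of 64 (1024 % 64 == 0 is required).
size_t globalSize = 1024;
size_t localSize  = 64;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &globalSize, &localSize, 0, NULL, NULL);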

In OpenCL, get_global_id(0) is a built-in function that returns the current work item's globally unique identifier within the global execution range. It is the basic tool for determining where the current work item sits within the overall parallel work.

In the code above, get_global_id(0) is used in the kernel function vector_add to obtain the current work item's index in the first dimension of the global range (dimension number 0). This index selects the corresponding elements of the input arrays a and b and the slot in the result array.

Index example

For example, if a kernel is launched with 128 work items and the current work item's global index is 10, then get_global_id(0) returns 10, meaning the work item's index in the first dimension of the global range is 10.

Similarly, if you set the global execution range to 128 when launching a kernel, the work items are indexed from 0 to 127 in the first dimension, 128 unique index values in total, and each work item can obtain its index in this dimension through get_global_id(0).

The following line is an example of using OpenCL's C++ wrapper classes (cl::CommandQueue and cl::NDRange) to set the global execution range of a kernel:

queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange, NULL, &event);

Specifically, this line of code means the following:

queue is a command queue object that has already been created for executing commands.
enqueueNDRangeKernel is a member function of the command queue, used to add a kernel to the queue and specify its execution range.
kernel is the kernel object to be executed.
The first cl::NullRange is the offset: the starting position of the execution range in each dimension. An empty range here means execution starts at index 0 of each dimension.
cl::NDRange(N) is the size of the global execution range in each dimension. In this example we execute only in the first dimension, with a range of N, so N work items execute the kernel.
The second cl::NullRange is the size of the local execution range in each dimension. No local range is specified here, so the implementation chooses the work-group size.
NULL is the event-dependency list, specifying events that must complete before the kernel executes.
&event is a pointer that receives the generated event object, which can later be used to wait for the kernel to finish or to query related information.

In summary, this line adds the kernel to queue's command queue and sets a global execution range of N work items via cl::NDRange(N).

Example 1

How can the methods above be used to build a 3×5×7 multidimensional index and operate on a multidimensional array inside the kernel?
__kernel void my_kernel(__global float* array, const int N, const int M, const int P)
{
    int globalIdX = get_global_id(0);  // global index in the first dimension
    int globalIdY = get_global_id(1);  // global index in the second dimension
    int globalIdZ = get_global_id(2);  // global index in the third dimension

    if (globalIdX < N && globalIdY < M && globalIdZ < P)
    {
        int index = globalIdX + N * globalIdY + N * M * globalIdZ;
        array[index] = 0.0f;  // operate on the multidimensional array
    }
}
const int N = 3, M = 5, P = 7;            // size of the global execution range

cl::Kernel kernel(program, "my_kernel");  // create the kernel object
cl::CommandQueue queue(context, device);  // create the command queue

kernel.setArg(0, bufferArray);  // set the kernel arguments
kernel.setArg(1, N);
kernel.setArg(2, M);
kernel.setArg(3, P);

// cl::NDRange takes up to three sizes directly (not a size_t array)
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N, M, P), cl::NullRange);

When local work groups are introduced, taking the 1-dimensional case as an example, the range is divided as follows: with a global size of 64 and a local size of 4, there are 16 work groups, and each work group contains the four local IDs 0, 1, 2, and 3.
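
A host-side sketch of that division (assuming command_queue and kernel already exist):

// 64 work items in total, 4 per work group => 16 work groups.
size_t global_item_size = 64;
size_t local_item_size  = 4;
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                       &global_item_size, &local_item_size, 0, NULL, NULL);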

__kernel void add_vec(__global int* data_in,
                      __global int* mem_global_id,
                      __global int* mem_local_id,
                      __global int* data_out,
                      int length)
{
    int global_id = get_global_id(0);
    int local_id  = get_local_id(0);

    // Record each work item's global and local ID so the host can inspect them
    mem_global_id[global_id] = global_id;
    mem_local_id[global_id]  = local_id;

    // Note: every work item executes this whole loop, so each element of
    // data_out is written redundantly by all work items
    for (int i = 0; i < length; i++)
    {
        data_out[i] = data_in[i] * 2;
    }
}

For the 2-dimensional case:

cl_uint work_dim = 2;
size_t global_item_size[2] = { 8, 8 };
size_t local_item_size[2]  = { 2, 2 };

/* Execute the data-parallel kernel */
ret = clEnqueueNDRangeKernel(command_queue, kernel, work_dim, NULL,
                             global_item_size, local_item_size,
                             0, NULL, NULL);

An example of max pooling:

/**********************************************
function: max_pooling, 2*2
2018/05/24
**********************************************/

__kernel void pooling(__global int* data_in,
                      __global int* mem_global_id,
                      __global int* mem_local_id,
                      __global int* data_out,
                      int width)
{
    int global_id_x = get_global_id(0);
    int global_id_y = get_global_id(1);
    int local_id_x  = get_local_id(0);   // fetched for inspection, unused below
    int local_id_y  = get_local_id(1);

    // 2x2 max pooling: each work item reduces one 2x2 block of the input to
    // its maximum (assumes a row-major input of the given width and an
    // output with width/2 columns)
    int x = global_id_x * 2;
    int y = global_id_y * 2;

    int m0 = max(data_in[y * width + x],       data_in[y * width + x + 1]);
    int m1 = max(data_in[(y + 1) * width + x], data_in[(y + 1) * width + x + 1]);
    data_out[global_id_y * (width / 2) + global_id_x] = max(m0, m1);
}
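
A matching host-side launch uses one work item per output element; a sketch assuming a width × height input (height is a hypothetical name here):

// One work item per 2x2 block: the global range is half the input size in
// each dimension; the 2x2 local size matches the window but is optional.
size_t global_size[2] = { width / 2, height / 2 };
size_t local_size[2]  = { 2, 2 };
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
                       global_size, local_size, 0, NULL, NULL);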

Queue concept in OpenCL

OpenCL introduces the concept of an execution queue (the command queue) to organize parallel computing. The execution queue is the mechanism for controlling and managing the execution of kernel functions.

The execution queue holds the kernel functions to be executed together with their arguments. By inserting kernels and arguments into the queue, their scheduling and parallel execution are achieved: a queue can execute kernels one after another (in order, the default) or allow several kernels to run concurrently (an out-of-order queue, enabled with the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), thereby improving computing performance.

Execution queues are also used to control the execution order of kernels, synchronization operations, and event management. You can enforce a specific execution order, and you can use events to synchronize kernel execution, waiting for a kernel to complete before subsequent operations run.

Therefore, the execution queue in OpenCL provides an effective way to manage and schedule parallel computation, so that programs can fully exploit the parallel capability of the computing device and improve performance.

Using multiple execution queues lets you execute several kernel functions at the same time and take fuller advantage of the device's parallel capability. Each queue independently manages and schedules the execution of its kernels and can have its own execution order, parameters, and event dependencies.
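
A hedged sketch (assuming context, device, kernelA, kernelB, and globalSize already exist): two queues on one device, with an event forcing kernelB to wait for kernelA.

// Two command queues on the same device; an event expresses the dependency.
cl_command_queue q0 = clCreateCommandQueue(context, device, 0, NULL);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, NULL);

cl_event done;
// kernelA runs on q0 and signals `done` when it finishes
clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &globalSize, NULL, 0, NULL, &done);
// kernelB on q1 waits for `done` before it starts
clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &globalSize, NULL, 1, &done, NULL);

clFinish(q1);
clReleaseEvent(done);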

Used well, multiple execution queues let you exploit the device's parallel performance and improve computing efficiency and throughput. Note, however, that with multiple queues you must allocate resources sensibly and handle memory-access conflicts to avoid race conditions and resource contention.

In summary, multiple execution queues provide a more flexible parallel computing strategy and help optimize computing performance and resource utilization.

Origin blog.csdn.net/hh1357102/article/details/132477386