[High Performance Computing] OpenCL syntax and related concepts (3): events and memory

Event concept in OpenCL

In OpenCL, an event represents the status of a command at its various stages of execution. By using events, you can track and manage the progress and ordering of kernel executions and memory operations. The following are the key concepts related to OpenCL events:

  1. Creating events: You can create user events explicitly with the clCreateUserEvent function; in addition, most clEnqueue* calls return an event object automatically for the operation being enqueued.

  2. Kernel execution events: When you submit a kernel to the command queue for execution, an event object is returned that you can use to track the status of kernel execution.

  3. Waiting for events: You can use the clWaitForEvents function to block the host until all of the specified events have completed. This is important for enforcing kernel execution order and dependencies.

  4. Event callbacks: You can be notified asynchronously when an event completes by registering a callback on it with clSetEventCallback. Once the event reaches the requested status, the callback is invoked, which is useful for handling asynchronous tasks (see the sketch after this list).

  5. Event status query: Use the clGetEventInfo function to query an event's status, such as CL_EVENT_COMMAND_EXECUTION_STATUS; the event's start and end times come from clGetEventProfilingInfo when profiling is enabled on the queue. This information is helpful for performance analysis and debugging.
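
A minimal sketch of items 4 and 5 (not from the original article): `event` stands for an event returned by an earlier clEnqueue* call, and the timestamp queries assume the command queue was created with CL_QUEUE_PROFILING_ENABLE.

  // Callback invoked by the OpenCL runtime once the event reaches CL_COMPLETE
  void CL_CALLBACK on_done(cl_event ev, cl_int status, void* user_data) {
      printf("kernel finished, status = %d\n", status);
  }

  // Register the callback on the event
  clSetEventCallback(event, CL_COMPLETE, on_done, NULL);

  // Non-blocking status query
  cl_int status;
  clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
                 sizeof(status), &status, NULL);

  // Start/end timestamps, in nanoseconds
  cl_ulong t_start, t_end;
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                          sizeof(t_start), &t_start, NULL);
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                          sizeof(t_end), &t_end, NULL);
  printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);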

Note that OpenCL events are the tools for coordinating parallel execution and asynchronous operations. Understanding them helps you manage and optimize the execution flow and performance of OpenCL programs.

Event wait lists

In OpenCL command-queue execution, events control when kernel functions run. The clEnqueueNDRangeKernel() function takes two event-related parameters, event_wait_list and event. The former is a list of events: the enqueued kernel does not actually start until every event in the list has been signaled. The latter is a single event that the runtime signals once this kernel finishes executing.

Event synchronization mechanism

  // Enqueue kernel filter_A; the call returns immediately and hands back event_a
  cl_event event_a = NULL;
  err_code = clEnqueueNDRangeKernel(cmd_queue_, kernel_filterA_,
                                    2,  // work dimension: 2-D data
                                    NULL, global_work_size, local_work_size, 0,
                                    NULL, &event_a);

  // Enqueue kernel filter_B; the call returns immediately and hands back event_b
  cl_event event_b = NULL;
  err_code = clEnqueueNDRangeKernel(cmd_queue_, kernel_filterB_,
                                    2,  // work dimension: 2-D data
                                    NULL, global_work_size, local_work_size, 0,
                                    NULL, &event_b);

  // Build the wait list
  cl_event wait_events[2];
  wait_events[0] = event_a;
  wait_events[1] = event_b;

  // Enqueue kernel filter_sum.
  // It only starts running after the first two kernels have finished,
  // i.e., after both event_a and event_b have been signaled.
  cl_event event_sum = NULL;
  err_code = clEnqueueNDRangeKernel(cmd_queue_, kernel_filterSum_,
                                    2,  // work dimension: 2-D data
                                    NULL, global_work_size, local_work_size,
                                    2, wait_events, &event_sum);

  // Wait for filter_sum to finish.
  // The first argument of clWaitForEvents is the number of events in the
  // array; here the host blocks on the single event event_sum and resumes
  // executing the code below once it completes.
  clWaitForEvents(1, &event_sum);

  // Release all event objects
  clReleaseEvent(event_a);
  clReleaseEvent(event_b);
  clReleaseEvent(event_sum);

Host memory

(host memory): This memory region is visible only to the host. As with most details about the host, OpenCL defines only how host memory interacts with OpenCL objects and constructs.
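
A sketch of that interaction (hypothetical `context` and `queue`, not from the original article): an ordinary host array only becomes visible to OpenCL once it is copied into, or mapped from, a cl_mem object.

  // The array lives in host memory; the device cannot see it directly.
  float host_data[1024] = { 0.0f };
  cl_int err;
  cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                              sizeof(host_data), NULL, &err);
  // Blocking write: copy host memory into the device's global memory.
  err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                             sizeof(host_data), host_data, 0, NULL, NULL);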

global memory

(global memory): This memory region permits read and write access by all work items in all work groups. Work items can read from and write to any element of a memory object in global memory. Reads and writes to global memory may be cached, depending on the capabilities of the device.

constant memory

(constant memory): A region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed in constant memory; these objects are read-only for work items.

local memory

(local memory): This memory region is local to a work group. It can be used to allocate variables that are shared by all work items in that work group. It may be implemented as dedicated memory on the OpenCL device, or mapped onto a section of global memory.

private memory

(private memory): This memory region is private to a single work item. Variables defined in one work item's private memory are not visible to other work items.

Memory distribution diagram

These memory regions and their relationship to the platform and execution model are shown in the figure. Work items run on processing elements and have their own private memory. A work group runs on a compute unit and shares a local memory region with the work items in the group. OpenCL device memory works with the host to support global memory.
(Figure: OpenCL memory regions and their relationship to the platform and execution model.)
In OpenCL, there are mainly the following types of memory:

  1. Global Memory: Global memory is the most commonly used memory type and allows data to be shared across work items and kernels. In kernel code it is declared with the __global qualifier; on the host it is backed by cl_mem buffer objects.
__kernel void myKernel(__global float* data) {
    // Access global memory; a __global pointer argument can be read and written by all work items
    float value = data[0];
    // ...
}
  2. Constant Memory: Constant memory stores read-only global data and is declared with the __constant qualifier. It is useful for global data that is read frequently and can improve access efficiency.
__constant float constantData[10] = { 0.0f };  // program-scope __constant variables must be initialized

__kernel void myKernel() {
    // Read from constant memory
    float value = constantData[0];
    // ...
}
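Program-scope __constant data is compiled into the program. As an alternative sketch (hypothetical kernel, not from the original article), constant data can also be supplied at run time through a __constant pointer argument backed by an ordinary read-only buffer:

__kernel void scaleBy(__constant float* coeffs, __global float* out) {
    size_t gid = get_global_id(0);
    // coeffs is read-only for work items, like any constant memory
    out[gid] = out[gid] * coeffs[0];
}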
  3. Local Memory: Local memory holds data shared by the work items within a single work group and is declared with the __local qualifier. It is visible only inside that work group.
__kernel void myKernel(__local float* localData) {
    // A local array shared by all work items in the work group
    __local float localArray[128];
    // ...
}

In addition, there are special types of memory such as Private Memory and Image Memory.

Private memory is a private storage space for each work item (thread). It is used to store temporary variables and intermediate results in calculations. By default, all local variables are stored in private memory.

__kernel void myKernel() {
    // A local variable stored in private memory
    float value;
    // ...
}

The concept of memory consistency model

The Memory Consistency Model specifies the ordering and visibility rules for read and write operations when memory is shared among multiple processors (or threads) in parallel computing. In short, it defines how multiple processors see and interact with data in shared memory.

In parallel computing, instructions from different processors may be executed in different orders. Because of out-of-order execution and per-processor caching, reads and writes to shared memory can produce unexpected results.

The goal of a memory consistency model is to provide a memory access model that is intuitive, understandable, and easy to program against on multiprocessor systems. It specifies the behavior of concurrent programs so that execution on multiple processors produces results consistent with what the programmer could expect from the order in which the program was written.

Different memory consistency models define different ordering and visibility rules for read and write operations, such as sequential consistency, weak consistency, and relaxed consistency. Each model makes different trade-offs and suits different scenarios, and developers need to choose an appropriate model based on the application's requirements.

It is important to realize that memory consistency models apply to shared-memory parallel computing; distributed computing, message passing, and other models often have their own consistency models and mechanisms.
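
OpenCL itself uses a relaxed consistency model: within a work group, one work item's writes to local memory are only guaranteed to be visible to the other work items at a barrier. A small illustrative kernel (hypothetical, not from the original article):

__kernel void reverse_tile(__global float* data, __local float* tile) {
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);
    tile[lid] = data[gid];
    // Without this barrier, the read of tile[lsz - 1 - lid] below could
    // observe a stale value: local memory is only made consistent
    // across the work group at a barrier.
    barrier(CLK_LOCAL_MEM_FENCE);
    data[gid] = tile[lsz - 1 - lid];
}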

out-of-order execution

Out-of-Order Execution is a processor technique for executing instructions whose purpose is to improve instruction-level parallelism and execution efficiency. Traditionally a processor executes instructions in program order, but out-of-order execution allows the processor to execute them in a different order, as long as the final result is the same as the result of executing in program order.

The main idea of out-of-order execution is to decompose instructions into independent micro-operations, and to reschedule and execute those micro-operations by exploiting the parallelism of hardware resources, such as the execution units and the memory subsystem, so as to maximize the efficiency of parallel instruction execution.

Out-of-order execution must respect two kinds of dependences: data dependence and control dependence. Data-dependent instructions must still execute in the correct order relative to each other, while control dependence refers to conditional branch instructions (such as if statements), whose outcome determines which instructions execute next.

With out-of-order execution, the processor detects data and control dependences, then reorders instructions and executes them in parallel where possible, making full use of hardware resources to improve efficiency while still preserving the program's semantics.

Note that out-of-order execution is implemented inside the processor: viewed from the outside, the program's results must be the same as those of sequential execution. In other words, the reordering is transparent and conforms to the semantics specified by the instruction set architecture (ISA).
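
OpenCL exposes the same idea at the command-queue level: a queue created as out-of-order may reorder or overlap enqueued commands, and the event wait lists described earlier then become the way to express ordering. A sketch, assuming an OpenCL 2.0 platform with `context` and `device` already obtained:

  // Request an out-of-order queue; commands may execute in any order
  // unless constrained by event wait lists.
  cl_queue_properties props[] = {
      CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
  };
  cl_int err;
  cl_command_queue ooo_queue =
      clCreateCommandQueueWithProperties(context, device, props, &err);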


Source: blog.csdn.net/hh1357102/article/details/132585848