Notes Part 1 - OpenCL Basics

Note: These notes are compiled from various blog posts on the Internet; if anything here infringes, please let us know promptly.

Introduction to CPU and GPU architecture

CPU: multiple instruction, single data stream (MISD, pipelined execution); good at logic control.

GPU: single instruction, multiple data streams (SIMD, vector processing); good at parallel computing.

Hence the heterogeneous programming architecture: one CPU plus several GPGPUs (general-purpose parallel-processing GPUs).

Applications developed against the common OpenCL interface (API) can run on different SDKs. OpenCL itself is only a standard; Intel, AMD, NVIDIA and other major manufacturers provide their own SDK implementations.

Abstraction of OpenCL hardware layer

An OpenCL platform consists of a Host (the control processor, usually a CPU) and a set of Compute Devices (GPUs, CPUs, and other supported chips). Each Compute Device is divided into many Processing Elements, the smallest independent units that carry out single-data computation. How a Processing Element maps to hardware differs by implementation: on a GPU it may be one of the processors, and on a CPU it is most likely a core, I guess, because the mapping is hidden from developers. Processing Elements are grouped into Compute Units; elements within the same Compute Unit can easily share memory, and only elements within the same Compute Unit can synchronize with one another.

It can be understood as:

The Host is the company's general manager, responsible for overall task allocation and scheduling.

Each Compute Device is one of the implementation departments managed by the general manager: the departments complete the general manager's tasks in parallel, with every department performing the same computing task.

Each Compute Unit is a project team within a department, responsible for carrying out tasks at the technical level. A team's technical details are not disclosed outside the team (Local Memory is shared within the project team).

A Processing Element is an employee, responsible for carrying out tasks within the project team; it is the lowest-level working node. Employees have their own private know-how (Private Memory). Across the whole company, each Processing Element has a company-wide index (its Global ID), and within its own project team (Work-group) it has a team-local number (its Local ID).

At the memory level, a Compute Device distinguishes global memory (variables), global constants, local memory, and private variables; the kernel sketch below shows the IDs and the corresponding address-space qualifiers.
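A minimal OpenCL C kernel sketch of these concepts (the kernel name and arguments are illustrative, not from the original text):

__kernel void scale(__global float *data,      /* global memory (variables) */
                    __constant float *coeff,   /* global constants */
                    __local float *scratch)    /* local memory, shared within a work-group */
{
    size_t gid = get_global_id(0);  /* company-wide index: Global ID */
    size_t lid = get_local_id(0);   /* number within the project team: Local ID */
    float x = data[gid];            /* x is a private variable of this work-item */
    scratch[lid] = x * coeff[0];
    barrier(CLK_LOCAL_MEM_FENCE);   /* wait until all items in the work-group arrive */
    data[gid] = scratch[lid];
}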

Software layer architecture of OpenCL:

Setup-related knowledge (system initialization):

-Device: corresponds to one compute device. Note that a multi-core CPU counts as a single Device; that is, a Device is one whole CPU, GPU, or FPGA.

-Context: a Context contains one or more Devices and is the link between them; only Devices in the same Context can communicate with each other. A board can carry many Contexts. A Context can be created from a CPU alone or from a CPU plus GPUs.

-Command queue: the sequence of commands submitted to a Device. A minimal setup sketch follows this list.
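A minimal host-side initialization sketch, assuming the usual OpenCL 1.x flow (error handling omitted; all variable names are illustrative):

#include <CL/cl.h>

cl_platform_id platform;
cl_device_id   device;
cl_int         err;
clGetPlatformIDs(1, &platform, NULL);                            /* first available platform */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  /* one GPU as the Device */
cl_context context = clCreateContext(NULL, 1, &device,
                                     NULL, NULL, &err);          /* Context holding the Device */
cl_command_queue queue = clCreateCommandQueue(context, device,
                                              0, &err);          /* Command queue for that Device */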

Memory-related knowledge:

-Buffer: a plain block of memory (a linear buffer object).

-Images: the native image type for GPU operations, representing images of various dimensions (1D, 2D, 3D).

-Data interaction methods: (A) copy the data into the device's space and copy the result back once the computation completes (pass by value); (B) pass in the address of the data to be computed (pass by pointer). Both are shown in the sketch after this list.
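A minimal sketch of the two methods via clCreateBuffer flags, assuming the context and queue from the setup sketch (src, dst, and nbytes are illustrative): CL_MEM_COPY_HOST_PTR copies the data in (method A), while CL_MEM_USE_HOST_PTR hands the device the host pointer (method B).

/* Method A: the device gets its own copy of src; copy the result back afterwards */
cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                              nbytes, src, &err);
/* ... run the kernel ... */
clEnqueueReadBuffer(queue, buf_a, CL_TRUE, 0, nbytes, dst, 0, NULL, NULL);

/* Method B: the device works through the host allocation directly */
cl_mem buf_b = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              nbytes, src, &err);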

Knowledge about GPU code execution:

-Program: contains the Kernels and any libraries they use. OpenCL is dynamically compiled: the source code is first compiled into an identical intermediate file, and linking that intermediate file with different linkers produces different programs, which are then loaded onto the Compute Device.

-Kernel: the kernel program that runs on a processing element (PE, the employee); it is the most basic unit of processing. Every PE runs the same kernel (the same algorithm) on different data, which is precisely the single instruction, multiple data (SIMD) model.

-Work item: the representation of a processing element in code; one Work Item corresponds to one processing element (PE, employee).

-Program objects: the kernel source files (.cl files) and the executable form of the kernel program.

-Memory objects: the variables the compute device needs in order to execute the OpenCL kernel program. A build sketch for the flow above follows this list.
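A minimal sketch of the run-time compilation flow (error handling omitted; the one-line kernel and all names are illustrative):

const char *source =
    "__kernel void twice(__global float *d) { d[get_global_id(0)] *= 2.0f; }";
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);      /* compile and link for this Device */
cl_kernel kernel = clCreateKernel(program, "twice", &err);  /* one kernel object per __kernel */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);          /* bind a Memory Object as argument 0 */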

Knowledge about parallelism and synchronization:

-Task parallelism and data parallelism: task parallelism refers to the pipeline style of working; data parallelism refers to multiple sets of data undergoing the same computation at the same time (essentially all matrix operations qualify!).

-Command queue: each Compute Device has one or more command queues, but one command queue manages only one compute device. Through the command queue, the host gains asynchronous control over the compute device. Commands come in three kinds: launch commands (begin executing the kernel program), memory commands (move data between the host and device memory, or perform memory mapping), and synchronization commands (constrain the order of command execution on the compute device). A launch sketch follows.
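A minimal sketch of a launch command followed by a memory command on one queue, continuing the earlier sketches (work sizes are illustrative):

size_t global = 1024, local = 64;  /* 1024 work-items, in work-groups of 64 */
cl_event evt;
/* launch command: start the kernel over the ND-range */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &evt);
/* memory command: blocking read that waits on the kernel's event */
clEnqueueReadBuffer(queue, buf_a, CL_TRUE, 0, nbytes, dst, 1, &evt, NULL);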

-Events: every command in a command queue has a status, and whenever that status changes an Event is generated. Events are used for synchronization between different computing units.

-Divided by where the event occurs: Kernel-side events mainly handle synchronization of asynchronously executing commands (phase synchronization across multiple processing units) and synchronization between global and local memory. Host-side events handle synchronization between command queues (coordinating the different compute devices).

-Divided by the cause of the event:

Command events

CL_QUEUED: The command has been added to the command queue.

CL_SUBMITTED: The command has been submitted by the host to the device associated with the command queue.

CL_RUNNING: The command is being executed.

CL_COMPLETE: The command has completed.

Negative values: the command ended abnormally; different negative values denote different error conditions.
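The host can query these states through clGetEventInfo; a minimal sketch, where evt is an event returned by an earlier enqueue call (for example the launch sketch above):

cl_int status;
clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
               sizeof(status), &status, NULL);
if (status == CL_COMPLETE) {
    /* the command has finished executing */
}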

User-defined events

User-defined events are used when synchronization between Compute Devices in the same context is required.

cl_event clCreateUserEvent(
    cl_context context,   // the context in which to create the event
    cl_int *errcode_ret   // receives the error code for this call
)

If creation succeeds, errcode_ret is set to CL_SUCCESS; on failure it is set to one of the following:

CL_INVALID_CONTEXT: the context is invalid

CL_OUT_OF_RESOURCES: the device failed to allocate the required resources

CL_OUT_OF_HOST_MEMORY: the host failed to allocate the required resources

The host can then set the status of the returned event:

cl_int clSetUserEventStatus(
    cl_event event,            // the user event whose status to set
    cl_int execution_status    // the new status: CL_COMPLETE or a negative error value
)
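A minimal usage sketch (names illustrative): a read command is made to wait on a user event that the host completes later.

cl_event user_evt = clCreateUserEvent(context, &err);
/* this read may not start until user_evt reaches CL_COMPLETE */
clEnqueueReadBuffer(queue, buf_a, CL_FALSE, 0, nbytes, dst, 1, &user_evt, NULL);
/* ... host-side work that the read depends on ... */
clSetUserEventStatus(user_evt, CL_COMPLETE);  /* release the waiting command */
clReleaseEvent(user_evt);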

-Synchronization:

OpenCL is built on a task-parallel, host-controlled model in which each task is itself data-parallel; tasks are submitted through a thread-safe command queue associated with the device.

-Single-device synchronization, achieved in one of the following ways:

A. Set a barrier (a command enqueued into the queue: commands after the barrier do not begin executing until all commands before it have completed).

cl_int clEnqueueBarrier(
    cl_command_queue command_queue   // the queue into which the barrier is enqueued
)

B. Wait for events (enqueue a wait on one or more events; commands added to the queue afterwards execute only after the awaited events have completed).

C. Blocking access: set a blocking flag on a memory read or write; while the flag is in effect, the call blocks until the copy has completed. Methods B and C are sketched below.
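A minimal sketch of methods B and C, reusing names from the earlier sketches (method A is the clEnqueueBarrier call above): clEnqueueWaitForEvents implements B, and the CL_TRUE blocking flag implements C.

/* B: commands enqueued after this call run only once evt has completed */
clEnqueueWaitForEvents(queue, 1, &evt);

/* C: CL_TRUE makes the read blocking; the call returns only after the copy finishes */
clEnqueueReadBuffer(queue, buf_a, CL_TRUE, 0, nbytes, dst, 0, NULL, NULL);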

-Multi-device synchronization: clFinish waits for the execution of another command queue to complete before further commands proceed. For example, suppose a CPU and a GPU need to access each other's data; how is synchronization achieved? If the GPU wants to access the CPU's data, it must wait until the CPU's current command queue has finished executing and no longer occupies the memory before it can access the data. clFinish blocks until all commands in a command queue have been executed, which is effectively a barrier added at the end of the queue.
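A minimal sketch (the queue names are illustrative): the host drains the CPU's queue before letting the GPU touch the shared data.

clFinish(cpu_queue);  /* blocks until every command in the CPU's queue has executed */
/* now it is safe to enqueue GPU commands that access the shared data */
clEnqueueNDRangeKernel(gpu_queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);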


