OpenCL: Introduction

 

https://www.cnblogs.com/wangshide/archive/2012/01/07/2315830.html

Structure of this article:

1. Background
2. What is OpenCL?
3. Framework composition
4. Basic concepts
5. Basic steps for writing OpenCL programs
6. Reference blog post

1. Background
In the past, using the GPU to accelerate image rendering was already a mature technology: the GPU is a typical single instruction, multiple data (SIMD) architecture that excels at large-scale parallel computing, while the CPU is a multiple instruction, multiple data (MIMD) style architecture that is better at logic control.




As the amount of data to be processed grows ever larger, people hope to extend the GPU's large-scale parallel computing capability to more fields, not just image rendering, in order to improve computing efficiency. In this arrangement the CPU is responsible only for logic control, while the GPU takes on most of the computation. This architecture of one CPU (control unit) plus several GPUs as computing units (sometimes with several additional CPUs) is what is called heterogeneous programming.

OpenCL appeared in this context. It is a standard for heterogeneous computing and can be used for GPU programming. In fact, before OpenCL came out, NVIDIA had already launched its CUDA architecture for GPGPU computing. However, CUDA runs only on NVIDIA's own graphics cards and does not support other vendors' cards, while OpenCL is a general standard that supports AMD cards, NVIDIA cards, and others, and also supports computing on CPUs.

For the differences between the GPU and the CPU, you can refer to my previous blog post, GPU Programming: The design difference between CPU and GPU.

2. What is OpenCL?
OpenCL (Open Computing Language) is the first open standard for parallel programming of heterogeneous systems (which may be composed of CPUs, GPUs, or other types of processors). It is cross-platform.

OpenCL consists of two parts: a language used to write kernels (functions that run on OpenCL devices), and APIs (functions) used to define and control the platform. OpenCL provides both task-based and data-based parallel computing, which greatly expands the range of GPU applications so that GPUs are no longer limited to the graphics field.

OpenCL is a standard, and Intel, NVIDIA, ARM, AMD, Qualcomm, and Apple all have their own OpenCL implementations. For example, NVIDIA integrates its OpenCL implementation into its CUDA SDK, while AMD ships its implementation in the AMD APP (Accelerated Parallel Processing) SDK.

3. Framework composition
These concepts may be hard to grasp at first; that is fine, they should become clear after working through some related examples.

OpenCL platform API: the platform API defines the functions a host program uses to discover OpenCL devices, and what those functions do. It also defines the functions for creating a context for the OpenCL application (the context represents all of the software and hardware resources the program has while it runs: devices, memory, and processors). The platform here refers to the combination of the host, the OpenCL devices, and the OpenCL framework.
OpenCL runtime API: the platform API is mainly used to create a context, while the runtime API focuses on using that context to satisfy the application's needs. It is used to manage the context, create command queues, and handle other operations that occur at runtime, for example, submitting commands to a command queue.
OpenCL programming language: the programming language used to write kernel code. It is based on an extended subset of the ISO C99 standard and is usually called the OpenCL C programming language.
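As a small illustration of the platform API, the following sketch (not from the original post; error handling mostly omitted) enumerates the available platforms and prints their names:

/* Sketch: enumerating OpenCL platforms with the platform API. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cl_uint n = 0;
    clGetPlatformIDs(0, NULL, &n);            /* ask how many platforms exist */
    if (n == 0) { printf("no OpenCL platform found\n"); return 1; }

    cl_platform_id *ids = malloc(n * sizeof *ids);
    clGetPlatformIDs(n, ids, NULL);           /* fetch the platform IDs */

    for (cl_uint i = 0; i < n; ++i) {
        char name[256];
        clGetPlatformInfo(ids[i], CL_PLATFORM_NAME, sizeof name, name, NULL);
        printf("Platform %u: %s\n", i, name); /* e.g. the vendor SDK name */
    }
    free(ids);
    return 0;
}

On a machine with both an Intel and an AMD SDK installed, this would list two platforms; device discovery with clGetDeviceIDs then proceeds per platform.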
4. Basic concepts
An OpenCL program, like a CUDA program, is divided into two parts: one runs on the host (with the CPU at its core), and the other runs on the device (with the GPU at its core). A device contains one or more compute units, and each compute unit contains one or more processing elements. The program that runs on the device is called a kernel function. In CUDA, kernel functions are generally written directly inside the main program; in OpenCL, the kernel lives in a separate file with the .cl suffix, which the host code reads in, builds, and then executes. In this respect OpenCL closely resembles the shader programs of OpenGL.

The following briefly summarizes some of the basic concepts in OpenCL; they will come up again in the four OpenCL models described later:

The OpenCL platform consists of two parts, the host and the OpenCL devices:

Host: the host is generally a CPU and plays the role of organizer. Its role includes defining kernels, specifying the context for kernels, and defining NDRanges and queues; the queues control the details of how and when kernels execute.
Device: usually called a compute device. It can be a CPU, GPU, DSP, or any other processor provided by the hardware and supported by an OpenCL vendor.
SIMT (Single Instruction, Multiple Threads): the main way GPUs execute in parallel. Many threads execute the same instruction at the same time; the data each thread handles may differ, but the operation performed is the same.

Kernel: the entry function for computation in the device program, invoked from the host.

Work-item: the same concept as a thread in CUDA. Many work-items (threads) execute the same kernel function, and each work-item has a unique, fixed ID, which is generally used to distinguish the data it should process.

Work-group: the same concept as a thread block in CUDA. A number of work-items make up a work-group, and the work-items within a work-group can communicate and collaborate.

ND-Range: the same concept as a grid in CUDA; it defines how work-groups are organized.
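To make the relationship between these IDs concrete, here is a small illustrative kernel (a sketch, not from the original post; the name show_ids is hypothetical) that queries its own position in the index space:

__kernel void show_ids(__global int *out)
{
    int gid = get_global_id(0);    // unique ID across the whole ND-Range
    int lid = get_local_id(0);     // ID within this work-item's work-group
    int grp = get_group_id(0);     // ID of the work-group itself
    // with no global offset, gid == grp * get_local_size(0) + lid
    out[gid] = grp * get_local_size(0) + lid;
}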

Context: defines the operating environment of the whole OpenCL program, including kernels, devices, program objects, and memory objects:

Device: the compute device used by the OpenCL program.
Kernel: the parallel program executed on a compute device.
Program object: the source code (.cl file) and executable of the kernel program.
Memory objects: the variables a compute device needs in order to execute the OpenCL program.
The above lists only some of the concepts, not all of them, and the explanations are brief; the purpose is only to give a first, simple impression of OpenCL before gradually deepening and completing the picture.

5. Basic steps for writing OpenCL programs
1) Obtain the platform -> clGetPlatformIDs
2) Obtain devices from the platform -> clGetDeviceIDs
3) Create context -> clCreateContext
4) Create command queue -> clCreateCommandQueue
5) Create buffers -> clCreateBuffer
6) Read the program file and create the program -> clCreateProgramWithSource
7) Compile the program -> clBuildProgram
8) Create the kernel -> clCreateKernel
9) Set the parameters for the kernel -> clSetKernelArg
10) Submit the kernel to the command queue for execution -> clEnqueueNDRangeKernel
11) Read back the computed result -> clEnqueueReadBuffer
12) Release resources -> clRelease*
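To make these steps concrete, here is a minimal host-side sketch that follows them end to end for a simple vector-add kernel. It is illustrative only: the kernel source is inlined rather than read from a .cl file, and error checking is collapsed into one CHECK macro of our own.

/* Minimal sketch of the 12 steps above (OpenCL 1.x style APIs). */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(err) do { if ((err) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
    exit(1); } } while (0)

int main(void)
{
    enum { N = 1024 };
    int A[N], B[N], C[N];
    for (int i = 0; i < N; ++i) { A[i] = i; B[i] = 2 * i; }

    cl_int err;

    /* 1) platform, 2) device */
    cl_platform_id platform;
    CHECK(clGetPlatformIDs(1, &platform, NULL));
    cl_device_id device;
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL));

    /* 3) context, 4) command queue */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    CHECK(err);

    /* 5) buffers: A and B are copied in at creation time */
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 sizeof A, A, &err); CHECK(err);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 sizeof B, B, &err); CHECK(err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                 sizeof C, NULL, &err); CHECK(err);

    /* 6) create the program (source inlined instead of read from a file) */
    const char *src =
        "__kernel void VectorAdd(__global int *A, __global int *B,"
        "                        __global int *C) {"
        "    int id = get_global_id(0);"
        "    C[id] = A[id] + B[id];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    CHECK(err);

    /* 7) build, 8) kernel, 9) kernel arguments */
    CHECK(clBuildProgram(prog, 1, &device, NULL, NULL, NULL));
    cl_kernel kernel = clCreateKernel(prog, "VectorAdd", &err); CHECK(err);
    CHECK(clSetKernelArg(kernel, 0, sizeof bufA, &bufA));
    CHECK(clSetKernelArg(kernel, 1, sizeof bufB, &bufB));
    CHECK(clSetKernelArg(kernel, 2, sizeof bufC, &bufC));

    /* 10) enqueue the kernel over a 1-D NDRange of N work-items */
    size_t global = N;
    CHECK(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                 0, NULL, NULL));

    /* 11) read the result back (blocking read) */
    CHECK(clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof C, C,
                              0, NULL, NULL));
    printf("C[10] = %d\n", C[10]);   /* expect 30 */

    /* 12) release everything */
    clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);
    clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}

In a real program, when clBuildProgram fails you would also fetch the compiler log with clGetProgramBuildInfo(CL_PROGRAM_BUILD_LOG) before exiting.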

6. Reference blog post
OpenCL: a heterogeneous computing architecture
 

OpenCL: a heterogeneous computing architecture


1 Abstract

Due to limits on transistor power consumption and physics, improvements in CPU performance have been greatly restricted, and people have turned to other ways of improving system performance, such as multi-core processors and heterogeneous platforms. The emergence of the Open Computing Language (OpenCL) provides a standard for parallel computing on a wide range of heterogeneous systems. Through a series of API definitions, OpenCL provides a hardware-independent programming language, giving programmers a flexible and efficient programming environment. This article points out the advantages and disadvantages of OpenCL programming through an in-depth discussion of the OpenCL computing architecture, and reports related programming practice: parallel programming tests on different devices show that adopting the OpenCL parallel programming architecture can significantly improve the running efficiency of a program.

Judging from the current situation, heterogeneous systems offer a very high performance-to-cost ratio. I believe that in the near future, OpenCL will become an important part of parallel and heterogeneous computing.

Keywords: OpenCL, heterogeneous computing, CPU/GPU computing, parallel computing

2 Why do we need OpenCL?

Over the past few decades, the computer industry has undergone tremendous change, and the continuous improvement of computer performance has underpinned all kinds of applications. As Moore's Law describes, computer speed was long increased by adding transistors and raising clock frequency. Since the beginning of the 21st century, however, this approach has run into limits: transistors have become so small that their physical characteristics make further large-scale frequency increases difficult, and power consumption grows at a nonlinear rate with frequency. This trend will continue to be one of the most important factors shaping computer systems.

There are usually two ways to address this problem. The first is to increase the number of processor cores, providing support for multi-tasking, multi-threading, and so on, and thereby improving overall system performance. The second is the heterogeneous approach: combining the computing power of devices such as the CPU (Central Processing Unit), the GPU (Graphics Processing Unit), and even the APU (Accelerated Processing Unit, a fusion of CPU and GPU) to increase system speed.

Heterogeneous systems are becoming more and more common, and computing that supports this environment is receiving more and more attention. Currently, each manufacturer usually provides a programming implementation only for its own devices. For heterogeneous systems it is generally difficult to program every device in the same style of programming language, and treating different devices as one unified computing unit is also very difficult.

The Open Computing Language (OpenCL) is designed to meet exactly this need. By defining a set of mechanisms, it realizes a hardware-independent software development environment. OpenCL makes full use of devices' parallel features, supports different levels of parallelism, and can map efficiently onto single-device or multi-device systems, homogeneous or heterogeneous, composed of CPUs, GPUs, FPGAs (Field-Programmable Gate Arrays), and future devices. OpenCL defines a runtime that manages resources and combines different types of hardware in the same execution environment; it is hoped that in the near future it will also support a more natural dynamic balancing of computation, power consumption, and other resources.

I believe that in the near future, OpenCL will be widely used in heterogeneous parallel programming.

3 OpenCL architecture

 

3.1 Introduction

OpenCL provides an open framework standard for writing programs, especially parallel programs, for heterogeneous platforms. The heterogeneous platforms OpenCL supports can be composed of multi-core CPUs, GPUs, or other types of processors. OpenCL consists of two parts: a language used to write kernel programs (code that runs on OpenCL devices), and APIs that define and control the platform. OpenCL provides both task-based and data-based parallel computing, which greatly expands the range of GPU applications beyond the graphics field.

OpenCL is maintained by the Khronos Group, a non-profit technical organization that maintains several open industry standards such as OpenGL and OpenAL, which serve three-dimensional graphics and computer audio respectively.

An OpenCL source program can be compiled and executed on a multi-core CPU or on a GPU, which greatly improves the performance and portability of code. The OpenCL standard is formulated by a standards committee whose members come from important manufacturers across the industry (chiefly AMD, Intel, IBM, and NVIDIA). Long awaited by users and programmers, OpenCL brings two important changes: a cross-vendor, non-proprietary software solution, and a cross-platform heterogeneous framework that can use all of a system's computing units at once.

OpenCL supports a wide range of applications, and it is hard to generalize about how they are developed. Broadly, though, an application for a heterogeneous platform involves the following steps [3]:

  1. Discover all the components that make up the heterogeneous platform.
  2. Investigate the characteristics of those components, so that the software can adapt to the different hardware features.
  3. Create the set of kernels that will run on the platform.
  4. Set up the memory objects involved in the computation.
  5. Execute the kernels in the correct order on the appropriate components.
  6. Collect the results.

These steps are implemented through OpenCL's APIs and its kernel programming environment, following a "divide and conquer" strategy: the problem is decomposed into the following four models [1]:

  • Platform model
  • Execution model
  • Memory model
  • Programming model

These concepts are the core of OpenCL's overall architecture, and the four models run through the entire OpenCL programming process.

The following briefly introduces the relevant content of these four models.

3.2 Platform model

The platform model (Figure 1) designates one processor as the host, which coordinates program execution, and one or more processors as devices, which execute OpenCL C code. This is only an abstract hardware model, so that programmers can conveniently write OpenCL C functions (called kernels) and execute them on different devices.

[Figure 1: the OpenCL platform model]

The device in the figure can be seen as a CPU or GPU, and a compute unit within the device as a CPU/GPU core. All processing elements of a compute unit execute a single instruction stream, either as SIMD units or as SPMD units (each processing element maintaining its own program counter). The abstract platform model is closest to current GPU architecture.

A platform can be regarded as one vendor's implementation of the OpenCL API. Once a platform is selected, generally only the devices supported by that platform can be used. At present, if you choose Intel's OpenCL SDK you can compute only on Intel CPUs, and if you choose AMD's APP SDK you can use AMD CPUs and AMD GPUs. Generally speaking, once selected, company A's platform cannot drive company B's devices.

3.3 Execution model

The most important concepts in the execution model are the kernel, the context, and the command queue. A context manages multiple devices; each device has a command queue; and the host program submits kernels to the different command queues for execution.

3.3.1 Kernel

The kernel is the core of the execution model and is the code that executes on the device. Before a kernel executes, an N-dimensional index space (NDRange) must be specified; an NDRange is a one-, two-, or three-dimensional index space. The total number of global work-items and the number of work-items per work-group must also be given. In the NDRange figure below, the global work-item range is {12, 12} and the work-group range is {4, 4}, so there are 9 work-groups in total.

 

[Figure: a two-dimensional NDRange index space]

For example, a kernel program for vector addition:

__kernel void VectorAdd(__global int *A, __global int *B, __global int *C){
    int id = get_global_id(0);
    C[id]  = A[id] + B[id];
}

 

If the vector has 1024 dimensions, we can, for example, define 1024 global work-items with 128 work-items per work-group, giving 8 groups in total. Work-groups exist mainly for the convenience of programs that only need to exchange data within a group. Of course, the number of work-items is limited by the device: if a device has 1024 processing elements, a 1024-dimensional vector can be computed with one entry per element; if it has only 128 processing elements, each element must compute 8 entries. Choosing the number of work-items and work-groups sensibly improves the program's parallelism.
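On the host side, this choice of sizes is what gets passed to clEnqueueNDRangeKernel. A sketch, assuming queue and kernel were created as in the usual setup steps:

size_t global_size = 1024;   /* total work-items */
size_t local_size  = 128;    /* work-items per work-group -> 8 groups */
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,             /* 1-D NDRange */
                                    NULL,          /* no global offset */
                                    &global_size, &local_size,
                                    0, NULL, NULL);

Passing NULL instead of &local_size lets the implementation choose a work-group size itself.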

3.3.2 Context

For the host to make a kernel run on a device, there must be a context through which the host interacts with the device. A context is an abstract container that manages the memory objects on the device and keeps track of the programs and kernels created for the device.

3.3.3 Command Queue

The host program uses the command queue to submit commands to a device. Each device has its own command queue, which is associated with a context. The command queue schedules the commands executed on the device, and these commands execute asynchronously with respect to the host program. Commands in a queue can run in two modes: (1) in-order execution, and (2) out-of-order execution.

Kernel executions and memory commands submitted to a queue generate event objects, which are used to control command execution and to coordinate the host and the device.
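As a sketch of how such an event can be used (assuming queue, kernel, a buffer buf, and the sizes already exist), the read below is made to wait explicitly for the kernel to finish:

cl_event done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, &done);       /* kernel emits an event */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, nbytes, host_ptr,
                    1, &done, NULL);          /* read waits on that event */
clReleaseEvent(done);

With an in-order queue this dependency is implicit; with an out-of-order queue, event waits like this are what enforce ordering.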

3.4 Memory model

Different platforms generally have different memory systems; for example, CPUs have hardware-managed caches while many GPUs have none. For the sake of portability, OpenCL defines an abstract memory model: a program only needs to target the abstract model, and the driver completes the mapping onto concrete hardware. The memory regions, and roughly how they map to hardware, are shown in the figure below.

[Figure: the OpenCL memory model]

Memory regions are specified by keywords in the program; the different qualifiers determine where data lives. The basic regions are the following [2]:

  • Global memory: all work-items in all work-groups can read and write it, and a work-item can read or write any element of such a memory object. Reads and writes of global memory may be cached, depending on the device's capabilities.
  • Constant memory: a region of global memory that remains unchanged during kernel execution. The host allocates and initializes memory objects placed here.
  • Local memory: a memory region belonging to a work-group, used for variables shared by all work-items in that group. On a given OpenCL device it may be implemented as dedicated on-chip memory or mapped onto global memory.
  • Private memory: the memory region belonging to a single work-item. Variables defined in one work-item's private memory are invisible to other work-items.
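As an illustration of these regions (a sketch, not from the original article; it assumes the work-group size is a power of two), the following kernel computes one partial sum per work-group and touches all four address spaces:

__kernel void partial_sum(__global const int *in,   // global memory
                          __constant int *scale,    // constant memory
                          __global int *out,
                          __local int *scratch)     // local memory, sized by
                                                    // the host at setup time
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    int v = in[gid] * scale[0];    // v lives in private memory
    scratch[lid] = v;
    barrier(CLK_LOCAL_MEM_FENCE);  // all work-items in the group sync here

    // tree reduction within the work-group
    for (int stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];   // one result per work-group
}

The host allocates scratch with clSetKernelArg(kernel, 3, local_size * sizeof(cl_int), NULL); passing NULL is what marks the argument as local memory.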

3.5 Programming model

OpenCL supports both data-parallel and task-parallel programming, and the two modes can be mixed. For synchronization, OpenCL supports synchronization among the work-items of a single work-group, and synchronization among the commands of command queues within a single context.

4 Programming example based on OpenCL

This section uses an image-rotation example to walk through the steps of OpenCL programming. First the overall flow is given; then a plain C loop implementation and an OpenCL C kernel implementation of image rotation are shown.

 

4.1 Process

[Figure: flow of an OpenCL program]

4.2 Image rotation

 

4.2.1 Principle of Image Rotation

Image rotation means rotating an image around a given point, clockwise or counterclockwise, by a given angle; usually it means rotating counterclockwise around the image center. Assume the upper-left corner of the image is (l, t) and the lower-right corner is (r, b). When any point (x, y) of the image is rotated counterclockwise by an angle θ around the center (x_center, y_center), its new position (x', y') is given by:

x' = (x - x_center)·cosθ - (y - y_center)·sinθ + x_center

y' = (x - x_center)·sinθ + (y - y_center)·cosθ + y_center

C code:

void rotate(
      unsigned char* inbuf,
      unsigned char* outbuf,
      int w, int h,
      float sinTheta,
      float cosTheta)
{
   int i, j;
   int xc = w/2;   /* rotation center */
   int yc = h/2;
   for(i = 0; i < h; i++)
   {
      for(j = 0; j < w; j++)
      {
         /* forward-map each source pixel to its rotated position */
         int xpos = (j-xc)*cosTheta - (i-yc)*sinTheta + xc;
         int ypos = (j-xc)*sinTheta + (i-yc)*cosTheta + yc;
         if(xpos >= 0 && ypos >= 0 && xpos < w && ypos < h)
            outbuf[ypos*w + xpos] = inbuf[i*w + j];
      }
   }
}


OpenCL C kernel code:

#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void image_rotate(
      __global uchar * src_data,
      __global uchar * dest_data,        // data in global memory
      int W, int H,                      // image dimensions
      float sinTheta, float cosTheta )   // rotation parameters
{
   // one work-item per pixel: (ix, iy) are this pixel's coordinates
   const int ix = get_global_id(0);
   const int iy = get_global_id(1);
   int xc = W/2;
   int yc = H/2;
   int xpos = (ix-xc)*cosTheta - (iy-yc)*sinTheta + xc;
   int ypos = (ix-xc)*sinTheta + (iy-yc)*cosTheta + yc;
   if ((xpos >= 0) && (xpos < W) && (ypos >= 0) && (ypos < H))
      dest_data[ypos*W + xpos] = src_data[iy*W + ix];
}
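On the host side, this kernel would be launched over a two-dimensional NDRange with one work-item per pixel. A sketch, assuming queue and kernel have been set up as usual:

size_t global[2] = { (size_t)W, (size_t)H };   /* one work-item per pixel */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);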


[Figure: the test image (sky_rot) rotated 45 degrees]

As the code above shows, the C version needs a double loop to compute the new coordinates of every pixel. In fact, each point in the image-rotation algorithm can be computed independently, with no dependence on the coordinates of any other point, so the problem parallelizes very naturally; the OpenCL C kernel exploits exactly this.

The code above was tested on Intel's OpenCL platform with a dual-core processor and a 4288 × 3216 image. Using the loop version, the running time is stable at about 0.256 s; using the OpenCL C kernel in parallel on the CPU, it is stable at about 0.132 s. A GPU test on an NVIDIA GeForce G105M graphics card gave a running time stable at about 0.081 s. Going from the loop, to dual-core CPU parallelism, to GPU parallelism shows that OpenCL programming can indeed greatly improve execution efficiency.

5 Summary

The analysis and experiments above show that applications written in OpenCL are highly portable and can run on different devices, and that because OpenCL C kernels generally execute in parallel, they can greatly improve a program's efficiency.

Heterogeneous parallel computing is becoming more and more common, yet current versions of OpenCL still have notable shortcomings. For example, writing a kernel requires a fairly deep analysis of the parallelism in the problem, and memory management still requires the programmer to declare buffers explicitly and to move data explicitly between main memory and device memory; it cannot be left entirely to the system. In these respects OpenCL does need strengthening, and much work remains before people can develop applications with both efficiency and flexibility.

6 References

[1] Aaftab Munshi. The OpenCL Specification, Version 1.1, Document Revision 44 [M]. Khronos OpenCL Working Group, 2011.

[2] Aaftab Munshi; translated by Ni Qingliang. The OpenCL Specification, Version 1.0, Document Revision 48 [M]. Khronos OpenCL Working Group, 2009.

[3] Aaftab Munshi, Benedict R. Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg. OpenCL Programming Guide [M]. Addison-Wesley Professional, 2011.

[4] Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry. Heterogeneous Computing with OpenCL [M]. Morgan Kaufmann, 1st edition, 2011.

[5] Slo-Li Chu, Chih-Chieh Hsiao. OpenCL: Make Ubiquitous Supercomputing Possible [J]. IEEE International Conference on High Performance Computing and Communications, 2010, pp. 556-561.

[6] John E. Stone, David Gohara, Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems [J]. Copublished by the IEEE CS and the AIP, 2010, pp. 66-72.

[7] Kyle Spafford, Jeremy Meredith, Jeffrey Vetter. Maestro: Data Orchestration and Tuning for OpenCL Devices [J]. In P. D'Ambra, M. Guarracino, D. Talia (Eds.): Euro-Par 2010, Part II, LNCS 6272, pp. 275-286. Springer-Verlag Berlin Heidelberg, 2010.

 

Author: Let it be!

Date: 2011-11-13 00:12:07

Copyright reserved.

 



Origin blog.csdn.net/u010451780/article/details/114746028