OpenCL Programming Guide, 1.1: Introduction to OpenCL

What Is OpenCL?

OpenCL is an industry-standard framework for programming computers built from combinations of CPUs, GPUs, and other processors. These so-called "heterogeneous systems" have become an important class of platforms, and OpenCL is the first industry standard to directly address their needs. OpenCL was first released in December 2008, and the earliest products appeared only in the fall of 2009, so OpenCL is still a fairly new technology.

Using OpenCL, it is possible to write a program that will run successfully on a variety of systems, including mobile phones, laptops, and even nodes in massive supercomputers.

OpenCL achieves a high degree of portability by exposing the hardware rather than hiding it behind elegant abstractions. This means that OpenCL programmers must explicitly define the platform and context and schedule work onto different devices. Not all programmers need (or want) the detailed control that OpenCL provides, and that is fine: when other options are available, a high-level programming model is often the better approach. Even a high-level programming model, however, needs a solid (and portable) foundation, and OpenCL can serve as that foundation.

The Future of Multicore: Heterogeneous Platforms

The computing world has changed dramatically over the past decade. For years, innovation was driven by raw performance. In the last few years, however, the focus has shifted to performance per watt of power consumed. Semiconductor companies will continue to squeeze more and more transistors onto a chip, but these manufacturers now compete on power efficiency rather than raw performance.

This shift has fundamentally changed the computers the industry produces. First, the microprocessor in a computer is now built from multiple low-power cores. The argument for multicore was laid out by A. P. Chandrakasan et al. in their article "Optimizing Power Using Transformations"; their analysis is illustrated in Figure 1-1.
[Figure 1-1: The multicore power-optimization model of Chandrakasan et al.]
The energy consumed when a gate in a CPU switches is the capacitance (C) times the square of the voltage (V), and these gates switch f times per second, where f is the frequency. The power consumption of a microprocessor is therefore modeled as P = CV²f. Now compare a single-core processor running at frequency f and voltage V with a similar dual-core processor in which each core runs at f/2. The dual-core chip contains more circuitry; following the model described in "Optimizing Power Using Transformations", this increases the capacitance by a factor of 2.2. The voltage, however, can be reduced substantially, to 0.6V. In both cases the number of instructions executed per second is the same, yet the dual-core processor consumes only 0.396 times the power of the single-core processor. It is this fundamental relationship that drove the transition from single-core microprocessors to multicore chips: multiple cores running at lower frequency deliver a significant improvement in power efficiency.
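The 0.396 factor follows directly from substituting the dual-core parameters (2.2× capacitance, 0.6V, half the frequency) into the power model:

```latex
P_{\text{single}} = C V^2 f
\qquad
P_{\text{dual}} = (2.2\,C)\,(0.6\,V)^2\,\frac{f}{2}
               = 2.2 \times 0.36 \times 0.5 \; C V^2 f
               = 0.396\, C V^2 f
```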

The next question is: will these cores be the same (homogeneous) or different? To understand this trend, consider the power and performance of special-purpose versus general-purpose logic. A general-purpose processor must, by its nature, include a wide range of functional units so that it can respond to any computational demand; this is precisely what makes the chip general-purpose. A processor dedicated to a particular function, however, wastes fewer transistors, because it contains only the functional units that function requires. The results are shown in Figure 1-2.
[Figure 1-2: Power efficiency of a general-purpose CPU, a GPU, and a specialized research processor]
Figure 1-2 compares a general-purpose CPU (the Intel Core 2 Quad processor Q6700), a GPU (the NVIDIA GTX 280), and a highly specialized research processor (the Intel 80-core Terascale processor, whose cores are built around a pair of floating-point multiply-add units). To keep the comparison as fair as possible, each chip is manufactured in a 65nm process, and we use the manufacturers' published peak performance and thermal design power (TDP). Figure 1-2 makes it clear that the more specialized the chip, the better its power efficiency, as long as the tasks are well matched to the processor.

In a world that emphasizes maximizing performance per watt, it is therefore reasonable to expect that systems will increasingly rely on many cores and, where feasible, on specialized silicon. This is especially important for mobile devices, where conserving power is critical. The heterogeneous future, however, is already here. Consider the schematic diagram of a modern PC (see Figure 1-3).
[Figure 1-3: Block diagram of a modern PC]
There are two sockets, each of which can hold a different multicore CPU; a graphics/memory controller hub (GMCH) connected to system memory (DRAM); and a graphics processing unit (GPU). This is a heterogeneous platform: it offers multiple instruction sets and multiple levels of parallelism, all of which must be exploited to get the most out of the system.

At a high level, the basic platform, both now and in the future, is clear. The details and a stream of new ideas will surely surprise us, but the hardware trend is unmistakable: the future belongs to heterogeneous multicore platforms. The question we face is how software will fit onto these platforms.

Software in a Multicore World

Parallel hardware improves performance by running multiple operations at the same time. To be useful, parallel hardware requires software that executes as multiple streams of operations running concurrently; in other words, we need parallel software.

In order to understand parallel software, we must first understand a more general concept: concurrency. Concurrency is an old and familiar concept in computer science. A software system is concurrent when it contains multiple active streams of operations that make progress at the same time. Concurrency is fundamental to all modern operating systems: while some streams of operations (threads) wait for resources, others are allowed to make progress, which maximizes resource utilization. Concurrency also gives users interacting with the system the illusion of continuous, nearly instantaneous response.

When concurrent software runs on a computer with multiple processing units, the threads can actually run simultaneously, and we have parallel computation. Parallelism is concurrency supported by hardware.

The hard part for programmers is to find the concurrency in a problem, express that concurrency in software, and then run the resulting program so that the concurrency delivers the required performance. Finding the concurrency in a problem can be as simple as executing an independent stream of operations for each pixel in an image. Or it can be extremely complex, with multiple streams of operations sharing information and requiring closely coordinated execution.

Once the concurrency in a problem has been identified, programmers must express it in their source code. Specifically, they must define the streams of operations that will execute concurrently, launch them, and manage the dependencies among them so that running these streams in parallel produces correct results. This is the core problem of parallel programming.

Most people are not equipped to deal with the low-level details of parallel computing. Even expert parallel programmers cannot shoulder the burden of managing every memory conflict or scheduling individual threads by hand. The key to parallel programming, therefore, is a high-level abstraction, or model, that makes the parallel programming problem more manageable.

There are many parallel programming models, commonly grouped into categories that overlap and whose names are often vague and confusing. For our purposes, we will consider just two: task parallelism and data parallelism. At a high level, the basic idea behind each is simple.

In the data-parallel programming model, the programmer thinks in terms of a collection of data elements that can be updated concurrently. Parallelism is expressed by applying the same stream of instructions (a task) concurrently to each data element; the parallelism is in the data. Figure 1-4 gives a simple example of data parallelism. Consider a simple task: squaring each element of an input vector of numbers (A_vector). Using the data-parallel model, the vector is updated in parallel by applying the task to each element, producing a new result vector. This example is, of course, trivial; in practice a task must contain enough operations to amortize the cost of data movement and of managing the parallel computation. Still, the simple example in Figure 1-4 captures the core idea of this programming model.
[Figure 1-4: Data parallelism: squaring each element of a vector]
In the task-parallel programming model, programmers define and manipulate concurrent tasks directly. The problem is decomposed into tasks that can run concurrently, which are then mapped onto the processing elements (PEs) of a parallel computer for execution. This model is easiest to use when the tasks are completely independent, but it can also be used for tasks that share data. When a computation is carried out by a set of tasks, it is not finished until the last task completes. Because the computational demands of tasks vary widely, it can be difficult to distribute them so that they all finish at roughly the same time; this is the load-balancing problem. Consider the example in Figure 1-5, where six independent tasks execute concurrently on three processing elements. In the first distribution, the first PE has too much work and runs far longer than the others. The second distribution is closer to the ideal: every PE finishes at almost the same time. This illustrates a central idea in parallel computing: load balancing.
[Figure 1-5: Two distributions of six independent tasks across three processing elements]
The choice between data parallelism and task parallelism is driven by the needs of the problem being solved. Problems built around updating every node of a grid, for example, map immediately onto a data-parallel model. A problem naturally expressed as a graph traversal, on the other hand, suggests a task-parallel model. A well-rounded parallel programmer is therefore comfortable with both models, and a general-purpose programming framework such as OpenCL must support both.

Beyond choosing a programming model, the next step in the parallel programming process is mapping the program onto real hardware, and here heterogeneous computers pose unique problems. The compute units in such a system may have different instruction sets and different memory architectures, and may run at different speeds. An effective program must account for these differences and map the parallel software appropriately onto the most suitable OpenCL devices.

Often, programmers approach this problem by imagining their software as a set of modules implementing different parts of the problem. These modules are explicitly bound to components in heterogeneous platforms. For example, graphics software runs on the GPU and other software runs on the CPU.

General-purpose GPU (GPGPU) programming breaks this model. Algorithms outside graphics are recast so that they can run on the GPU. The CPU sets up the computation and manages I/O, but all "substantial" computation is offloaded to the GPU. In essence, the rest of the heterogeneous platform is ignored and attention focuses on one component in the system: the GPU.

OpenCL discourages this approach. Since the user has paid for all the OpenCL devices in the system, an effective program should use all of them. This is exactly the approach OpenCL encourages programmers to adopt, and it is what we expect of a programming environment designed for heterogeneous platforms.

Hardware heterogeneity is complex, and programmers increasingly rely on high-level abstractions that hide that complexity. A heterogeneous programming language, by exposing heterogeneity, runs against the trend toward increasing abstraction.

And that is fine. A language does not have to meet the needs of every programming community. High-level frameworks that simplify the programming problem can map onto a high-level language, which in turn maps onto a low-level hardware abstraction layer that ensures portability. OpenCL is exactly that hardware abstraction layer.

Origin: blog.csdn.net/qq_36314864/article/details/130599588