Qualcomm_Mobile_OpenCL.pdf translation, part 3

3 Using OpenCL on Snapdragon

Among the Android and IoT (Internet of Things) devices on the market today, Snapdragon is one of the highest-performing and most widely used chips. Snapdragon mobile platforms put together the best combination of components on a single chip, so that devices built on a Snapdragon platform deliver excellent power efficiency in an integrated solution, together with the latest mobile user experience.

        

Snapdragon is a multiprocessor system comprising a multimode modem, CPU, GPU, DSP, location/GPS, multimedia, power management, RF, software optimized for the operating system, memory, connectivity (Wi-Fi, Bluetooth), and more.

 

For the current list of consumer devices that include a Snapdragon processor, or to learn more about Snapdragon processors in other areas, visit http://www.qualcomm.com/snapdragon/devices. Adreno GPUs are generally used to render graphics, but they also have the capability of a general-purpose processor for computationally complex tasks, such as audio and image processing and computer vision. Using OpenCL for data-parallel computation harnesses this power of the GPU.

 

3.1 OpenCL on Snapdragon

The Adreno A3x, A4x, and A5x GPU series fully support OpenCL and are fully compliant with the OpenCL standard. OpenCL has different versions and profiles, and different Adreno GPU series may support different OpenCL versions, as shown in Table 3-1:

Table 3-1 OpenCL support on Adreno GPUs

GPU series     | Adreno A3x | Adreno A4x | Adreno A5x
OpenCL version | 1.1        | 1.2        | 2.0
OpenCL profile | Embedded   | Full       | Full

 

In addition to supporting different OpenCL versions and profiles, different Adreno GPUs may also differ in other properties, such as the extensions they support, the maximum dimensions of image objects they support, and so on. Full details of a device can be obtained by calling the API function clGetDeviceInfo.
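As a hedged illustration, the following minimal C host program (assuming an OpenCL runtime and headers are available; error checking is omitted for brevity) queries a few such properties with clGetDeviceInfo:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    char version[128], profile[128];
    size_t max_width, max_height;
    clGetDeviceInfo(device, CL_DEVICE_VERSION, sizeof(version), version, NULL);
    clGetDeviceInfo(device, CL_DEVICE_PROFILE, sizeof(profile), profile, NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH,
                    sizeof(max_width), &max_width, NULL);
    clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT,
                    sizeof(max_height), &max_height, NULL);

    printf("Version: %s\nProfile: %s\nMax 2D image: %zu x %zu\n",
           version, profile, max_width, max_height);
    return 0;
}
```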

        

3.2 Adreno GPU architecture

This section gives a high-level overview of the Adreno architecture as it relates to OpenCL.

3.2.1 Adreno architecture as it relates to OpenCL

 

Figure 3-1 Overall architecture of Adreno A5x GPUs as related to OpenCL

 

Adreno GPUs support many graphics and compute APIs, including OpenGL ES, OpenCL, DirectX, Vulkan, and others. Figure 3-1 illustrates the overall hardware architecture of the Adreno A5x as it relates to OpenCL, with graphics-related hardware modules omitted. The A5x differs from the other Adreno series in many ways, but from the OpenCL point of view the differences are small.

The key hardware modules for OpenCL execution are as follows:

- SP (Shader Processor, or Streaming Processor; the term comes from graphics rendering)
  - The core component of the Adreno GPU. It contains many hardware modules, including the arithmetic logic unit (ALU), load/store unit, flow control unit, register file, and others.
  - Executes graphics shader programs (such as vertex/triangle shading, fragment shading, and compute shading used in graphics rendering) and compute workloads such as OpenCL kernels.
  - Each SP corresponds to one or more OpenCL compute units.
  - Depending on the GPU series and tier, an Adreno GPU may contain one or more SPs. Low-tier chipsets may have only one SP, while higher-tier chipsets may have more. Figure 3-1 shows only one SP.
  - For image objects declared with the __read_write qualifier (an OpenCL 2.0 feature) and for buffer objects, the SP loads and stores data through the L2 cache.
  - For read-only image objects, the SP loads data through the texture processor / L1 module (see the kernel sketch after this list).
- TP (Texture Processor; the term comes from graphics rendering)
  - Performs texture operations, such as texture fetching and filtering, based on requests from the kernel.
  - The TP is coupled with the L1 cache; when a texture data fetch misses in L1, the L1 cache obtains the data from the UCHE (Unified L2 Cache, described below).
- UCHE (Unified L2 Cache)
  - Serves the SP's load/store requests for buffer object data, and serves the L1 cache's load requests for image objects. (In Figure 3-1, when the SP requests to load/store buffer data, the request goes through the UCHE; when the L1 cache requests to load image object data, that request also goes through the UCHE.)
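To make the two data paths concrete, here is a minimal, hypothetical OpenCL C kernel sketch (the names are illustrative, and the __read_write qualifier requires OpenCL 2.0, i.e. Adreno A5x): the read_only image is fetched through the TP/L1 path, while the __read_write image is written by the SP through the L2 (UCHE) path.

```c
// Sampler for unnormalized coordinates, nearest filtering.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void copy_and_scale(__read_only  image2d_t src,   // TP / L1 path
                             __read_write image2d_t dst)   // L2 / UCHE path
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 pixel = read_imagef(src, smp, pos);   // texture fetch via TP
    write_imagef(dst, pos, pixel * 0.5f);        // store via L2 cache
}
```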

3.2.2 Waves and fibers (terms defined internally by Qualcomm; they are kept in English here, and their specific meaning is explained below)

In Adreno GPUs, the smallest unit of execution is called a fiber. A fiber corresponds to an OpenCL work item (the term is kept in English because it also appears frequently in code). A group of fibers that execute in lockstep is called a wave. An SP can accommodate multiple active waves at the same time. Each wave makes progress independently, regardless of the execution state of the other waves. Note the following:

- The wave size, i.e. the number of fibers in a wave, is fixed for a given GPU and kernel. (The GPU defines the maximum; the value actually used at run time depends on the kernel.)
- On Adreno GPUs, the wave size depends on the GPU series and the compiler; typical values are 8, 16, 32, 64, and 128.
- A workgroup (an OpenCL concept) may execute as one or more waves, which is mainly determined by the workgroup size. For example, if the workgroup size is less than or equal to the wave size, only one wave may be needed. In general, more waves are better, because latency can be hidden between waves. (This is similar to a CPU pipeline: while one wave is computing on data it has loaded, another wave can begin its own load, so the second wave's load time is hidden.)
- An SP can execute ALU instructions for one or more waves simultaneously.
- The maximum number of waves that can run in a workgroup is determined by the hardware. Typically, Adreno GPUs support a maximum of 16 waves.
- For a given kernel, the maximum number of waves that can be active on an SP is determined by the size of the register file and by how many registers the kernel uses (even though the hardware supports up to 16 concurrent waves, a kernel that uses many registers may exhaust the register file, so that perhaps only 8 waves can run at the same time); it also depends on the GPU series and tier.
- In general, the more complex the kernel, the fewer waves can be active. (This suggests an optimization: splitting a complex kernel into several simpler kernels can increase parallelism and improve speed.)
- For a given kernel, the maximum workgroup size is the product of the wave size and the maximum number of waves allowed. The query sketch after this list shows how to obtain these limits from the runtime.
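The following hedged C helper (kernel and device are assumed to have been created and built elsewhere) shows how the runtime exposes these limits: CL_KERNEL_WORK_GROUP_SIZE is the maximum workgroup size for this kernel on this device, and on Adreno the preferred workgroup size multiple typically corresponds to the wave size.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Prints kernel-specific workgroup limits for a given device.
 * `kernel` and `device` are assumed to have been created/built earlier. */
void print_workgroup_limits(cl_kernel kernel, cl_device_id device) {
    size_t max_wg = 0, preferred_multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred_multiple), &preferred_multiple, NULL);
    printf("Max workgroup size for this kernel: %zu\n", max_wg);
    printf("Preferred workgroup size multiple:  %zu\n", preferred_multiple);
}
```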

        

The OpenCL 1.x specifications do not expose the concept of waves, while OpenCL 2.0 allows applications to use waves through the cl_khr_subgroups extension, which Adreno GPUs support starting with the A5x.
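As a hedged sketch of what this looks like on the kernel side (assuming an A5x-class device and a program built with -cl-std=CL2.0; the kernel name is hypothetical), the kernel below writes out the subgroup (wave) size seen by each work item:

```c
// Requires the cl_khr_subgroups extension (OpenCL 2.0, Adreno A5x).
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void report_wave_size(__global uint *out)
{
    // get_sub_group_size() returns the number of work items in this
    // subgroup, which on Adreno corresponds to the wave size.
    out[get_global_id(0)] = get_sub_group_size();
}
```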

        

3.2.3 Latency hiding

Latency hiding is one of the GPU's most powerful features for efficient parallel processing, and it allows the GPU to achieve very high throughput. Consider the following example:

- The SP begins executing the first wave.
- After a few ALU instructions have executed, this wave needs more data from external memory (possibly global, local, or private memory) to continue, and that data has not yet been fetched.
- The SP issues a data fetch request on behalf of this wave.
- The SP switches to the second wave, which is already prepared to start.
- The SP executes the second wave until that wave's dependencies (such as data or registers) are not ready.
- The SP may then switch to a third wave, or switch back to the first wave if the first wave is now ready to execute (its data has been fetched).

        

In this way, the SP is kept busy, working as if at full occupancy, and the latency of external dependencies is well hidden.

 

3.2.4 Workgroup assignment

A typical OpenCL kernel launches multiple workgroups. On Adreno GPUs, each workgroup is assigned to an SP, and in general each SP can run only one workgroup at a time. Any remaining workgroups are queued on the GPU, waiting to execute.

Take the 2-dimensional workgroups shown in Figure 3-2 as an example, and assume the GPU has four SPs. Figure 3-3 shows how these workgroups are assigned to the different SPs. In this example there are nine workgroups in total, each executed by one SP. Each workgroup consists of four waves, and the wave size is 16. A host-side launch sketch matching this layout follows the figures below.

 

Figure 3-2 Example layout of workgroups dispatched for execution on an Adreno GPU

 

Figure 3-3 Example of how workgroups are assigned to SPs for execution
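As an illustrative, hedged host-side sketch of launching such a grid: the 24x24 global range with 8x8 workgroups below is one possible layout that yields 9 workgroups of 64 work items each, i.e. 4 waves of 16 on a wave-size-16 GPU. The queue and kernel are assumed to have been created earlier; the sizes are placeholders, not taken from the original figure.

```c
#include <CL/cl.h>

/* Launches `kernel` over a 24x24 global range with 8x8 workgroups,
 * i.e. 9 workgroups of 64 work items (4 waves of 16 on a wave-16 GPU).
 * `queue` and `kernel` are assumed to have been created earlier. */
cl_int launch_example(cl_command_queue queue, cl_kernel kernel) {
    size_t global_size[2] = {24, 24};
    size_t local_size[2]  = {8, 8};
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global_size, local_size, 0, NULL, NULL);
}
```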

 

The OpenCL standard neither defines the order in which workgroups are launched and executed, nor defines any synchronization mechanism between workgroups. For Adreno GPUs, developers must not assume that workgroups are launched on the SPs in any particular order. Similarly, waves cannot be assumed to start in any particular order.

 

On most Adreno GPUs, an SP can run only one workgroup at a time, and one workgroup must finish before another can begin. On newer, higher-end GPUs such as the Adreno A540, however, one SP can execute multiple workgroups concurrently.

3.3 OpenCL differences among Adreno A3x, A4x, and A5x

Each new Adreno GPU series brings significant improvements in OpenCL features and performance. This section discusses the key changes that affect OpenCL performance.

      3.3.1 L2 cache

From the Adreno A320 and A330 GPUs to the Adreno A420, A430, A530, and A540 GPUs, the L2 cache architecture has been greatly improved for better efficiency and performance, and the capacity of the L2 cache has also increased.

3.3.2 Local memory

From the Adreno A3x to the A4x and A5x series, local memory has improved in capacity, coalesced load/store access, and throughput. Table 3-2 summarizes the differences in coalesced access across the series.

Table 3-2 Local memory performance summary

GPU series       | Adreno A3x    | Adreno A4x    | Adreno A5x
Coalesced access | Not supported | Not supported | Supported; one operation can combine loads/stores for up to four work items (128 bits)

 

Coalesced access is an important concept in OpenCL and GPU parallel computing. In essence, it means that the underlying hardware can combine the load/store requests of multiple work items into a single request, improving load/store efficiency. Without hardware support for coalesced access, each work item's load or store must be issued as a separate request, which results in poor performance.

Figure 3-4 illustrates the difference between coalesced and non-coalesced access. For the requests of multiple work items to be combined, the requested data addresses must be contiguous. With coalescing, Adreno GPUs can serve four work items with a single data load; without coalescing, four separate loads are needed to fetch the same amount of data. A kernel-level sketch of the two patterns follows the figure.

 

Figure 3-4 Coalesced vs. non-coalesced data loading
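To make the two patterns in Figure 3-4 concrete, here is a hedged OpenCL C sketch (hypothetical kernels; global memory is used for illustration). In the first kernel, neighbouring work items read neighbouring addresses, so the hardware can combine several loads into one request; in the second, the stride keeps the addresses far apart, so the loads cannot be coalesced.

```c
__kernel void coalesced_copy(__global const float *src, __global float *dst)
{
    int gid = get_global_id(0);
    dst[gid] = src[gid];            // neighbouring work items, contiguous addresses
}

__kernel void strided_copy(__global const float *src, __global float *dst,
                           int stride)
{
    int gid = get_global_id(0);
    dst[gid] = src[gid * stride];   // addresses far apart; cannot be coalesced
}
```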

 

3.4 Context switching between graphics and compute tasks

3.4.1 Context switching

On Adreno GPUs, if a high-priority task, such as a graphical user interface (UI) rendering request, arrives while a low-priority task is running on the GPU, the latter is suspended and the GPU switches to the high-priority task. When the high-priority task finishes, the low-priority task is resumed. This kind of task switching is called a context switch. Context switching is very expensive because it requires sophisticated hardware and software operations. However, it is a very important capability, as it allows urgent and time-critical tasks, such as those in automotive applications, to complete on time.

3.4.2 Limits on kernel/workgroup execution time on the GPU

Sometimes a compute kernel may run for too long, which can trigger a warning and cause the GPU to reset. To avoid such unpredictable behavior, kernels whose workgroups take a long time to complete are not recommended. On an Android device, UI rendering typically occurs frequently, for example every 30 ms, so a long-running kernel may cause the UI to lag or become unresponsive, leading to a poor user experience. The ideal kernel execution time depends on the actual use case, but a good general guideline is to keep kernel execution time on the order of 10 ms. One way to achieve this is sketched below.
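The following is a minimal, hedged C sketch of one way to keep individual launches short: split a large 1D NDRange into smaller chunks using the global work offset, so no single launch occupies the GPU for too long. The queue and kernel are assumed to have been created earlier; the chunk size is a placeholder to be tuned against the ~10 ms guideline, and this only works if the kernel indexes its data with get_global_id (which includes the offset).

```c
#include <CL/cl.h>

/* Enqueues `kernel` over `total_items` work items in chunks of
 * `chunk_items`, using the global work offset of each launch. */
cl_int run_in_chunks(cl_command_queue queue, cl_kernel kernel,
                     size_t total_items, size_t chunk_items) {
    for (size_t offset = 0; offset < total_items; offset += chunk_items) {
        size_t count = (total_items - offset < chunk_items)
                           ? (total_items - offset) : chunk_items;
        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, &offset,
                                            &count, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
    }
    return clFinish(queue);  /* wait for all chunks to complete */
}
```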

 

3.5 Improvements related to the OpenCL standard

Adreno A3x GPUs support the OpenCL 1.1 embedded profile, Adreno A4x GPUs support the OpenCL 1.2 full profile, and Adreno A5x GPUs support the OpenCL 2.0 full profile.

 

From the OpenCL 1.1 embedded profile to the OpenCL 1.2 full profile, the main changes are in software rather than hardware, such as improved API functions.

From the OpenCL 1.2 full profile to the OpenCL 2.0 full profile, however, many new hardware features were introduced, such as SVM (shared virtual memory), kernel-enqueue-kernel, and so on. Table 3-3 lists the differences in OpenCL support among the three Adreno GPU series; a small SVM usage sketch follows the table.

Table 3-3 OpenCL standard feature support on Adreno GPUs

Feature                                     | Adreno A3x (OpenCL 1.1 embedded) | Adreno A4x (OpenCL 1.2 full) | Adreno A5x (OpenCL 2.0 full)
Separate compilation and linking of objects | Not supported                    | Not supported                | Supported
Rounding mode                               | Round to zero                    | Round to nearest even        | Round to nearest even
Compiled kernels                            | Not supported                    | Supported                    | Supported
1D textures, 1D/2D image arrays             | Not supported                    | Not supported                | Supported (only with coalesced fetches)
Shared virtual memory (SVM)                 | Not supported                    | Not supported                | Supported
Pipes                                       | Not supported                    | Not supported                | Supported
Read/write images                           | Not supported                    | Not supported                | Supported
Nested parallelism                          | Not supported                    | Not supported                | Supported
KEK (kernel-enqueue-kernel)                 | Not supported                    | Not supported                | Supported
Generic memory                              | Not supported                    | Not supported                | Supported
C++ atomic operations                       | Not supported                    | Not supported                | Supported
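As one concrete illustration of the new OpenCL 2.0 hardware features in the table, the following hedged C sketch uses coarse-grained buffer SVM: the host allocates shared virtual memory, fills it, and passes the same pointer to the kernel, so no explicit clEnqueueWriteBuffer copy is needed. The context, queue, and kernel are assumed to have been created earlier, and the kernel is assumed to take the SVM pointer as its first argument.

```c
#include <string.h>
#include <CL/cl.h>

cl_int svm_example(cl_context context, cl_command_queue queue,
                   cl_kernel kernel, size_t n) {
    /* Allocate coarse-grained buffer SVM (OpenCL 2.0). */
    float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                      n * sizeof(float), 0);
    if (!data) return CL_OUT_OF_RESOURCES;

    /* Map before touching coarse-grained SVM on the host. */
    clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                    n * sizeof(float), 0, NULL, NULL);
    memset(data, 0, n * sizeof(float));
    clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

    clSetKernelArgSVMPointer(kernel, 0, data);   /* pass the SVM pointer */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    cl_int err = clFinish(queue);

    clSVMFree(context, data);
    return err;
}
```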

 

 

 

 

3.6 OpenCL extensions

In addition to the core OpenCL functionality, the Adreno OpenCL platform supports a number of extensions that provide additional features, which further improve usability and allow applications to make full use of the Adreno GPU's advanced hardware capabilities. The extensions available on a given Adreno GPU can be queried with the clGetPlatformInfo function. Documentation for the extended functionality is available on the QTI developer page (https://developer.qualcomm.com).
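A minimal hedged sketch of that query in C is shown below; device-specific extensions can be queried the same way with clGetDeviceInfo and CL_DEVICE_EXTENSIONS.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    char extensions[4096];
    clGetPlatformIDs(1, &platform, NULL);
    clGetPlatformInfo(platform, CL_PLATFORM_EXTENSIONS,
                      sizeof(extensions), extensions, NULL);
    printf("Platform extensions: %s\n", extensions);
    return 0;
}
```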
