Basic introduction to OpenCL task scheduling | JD Logistics Technical Team

Demand for scientific computing has grown sharply, and heterogeneous computing on CPU-GPU systems is now widely used in the field. OpenCL has become increasingly popular for heterogeneous computing because it is cross-platform, but scheduling OpenCL tasks has become correspondingly harder. Traditional OpenCL task scheduling requires the scheduling plan to be fixed at coding time. This manual scheduling is difficult, adapts poorly, is inefficient, and suffers from resource contention. MultiCL extends the OpenCL standard to decouple command queues from devices, implements adaptive scheduling, and offers different scheduling methods for developers of different experience levels, which alleviates these scheduling problems.

1 Basic introduction to OpenCL

OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems, suitable for heterogeneous programming across CPUs, GPUs, and other processors. By providing an efficient, low-level programming interface, OpenCL forms the foundation layer of a parallel computing ecosystem that is independent of hardware, operating system, and application. It coordinates parallel computation between a host and the heterogeneous compute devices that support the OpenCL standard, and it defines a cross-platform programming language for those devices.

OpenCL is more than a programming language: it is an industry-standard framework for programming heterogeneous systems. Unlike CUDA programs, OpenCL programs can be ported across hardware from different vendors, so they have good functional portability. However, as a low-level, general-purpose framework, OpenCL exposes many hardware details to the programmer. This leaves a great deal of room for optimization, but it also means that OpenCL programs lack performance portability: the same program can show large performance differences when moved from one platform to another. In addition, task partitioning and scheduling are specified entirely by the programmer, who can hardly observe or predict the running state and load of the whole system, so it is difficult for a program to reach optimal efficiency.

In order to achieve cross-platform heterogeneous computing, the OpenCL architecture includes the following parts:

  1. OpenCL platform layer: lets the host program discover OpenCL devices, enlist them in the computation, and create a context for a specific device or group of devices.
  2. OpenCL runtime: lets the host program operate on a context once it has been created, and provides the software support that OpenCL programs run on.
  3. OpenCL compiler: compiles kernel source into device executables, taking kernel and device extension information into account. The compiler supports a subset of ISO C99 (ISO/IEC 9899:1999, Programming languages) with extensions for parallelism. Executables can be produced in two ways: just-in-time compilation and loading of precompiled binaries.

The OpenCL architecture can be described in terms of four models: the platform model, the execution model, the memory model, and the programming model.

2 MultiCL

The OpenCL standard requires developers to fix the scheduling plan of an OpenCL program at coding time, which leads to difficult scheduling, poor adaptability, resource contention, and low efficiency. To address these problems, Ashwin M. Aji et al. built MultiCL, a task-level parallel scheduling framework for OpenCL, on top of SnuCL, a framework that provides cross-vendor OpenCL runtime support. MultiCL implements global, context-level scheduling of all queues as well as local scheduling for specific command queues, and on that basis the authors implement dynamic scheduling of OpenCL tasks.
The OpenCL kernel scheduling strategy is tied to the programming model: developers must decide in advance how each kernel in the program is scheduled. To enable adaptive task scheduling, Ashwin M. Aji et al. extended the OpenCL API. Table 1 lists the attributes that MultiCL adds.

Table 1 MultiCL extended attributes

To express the global queue scheduling mechanism, MultiCL adds a new context property, CL_CONTEXT_SCHEDULER, which enables context-level adaptive scheduling. Command queues associated with a context created with CL_CONTEXT_SCHEDULER are decoupled from OpenCL devices; that is, creating a command queue does not require specifying an OpenCL device. At the context level, this removes the requirement that kernel scheduling be decided at coding time. The property accepts two values, corresponding to two global scheduling policies: round-robin scheduling (ROUND_ROBIN) and adaptive scheduling (AUTO_FIT). Round-robin scheduling assigns the command queue to the next available device each time the scheduler is triggered. In practice, because a GPU can execute several kernels concurrently, the policy picks a device that still has sufficient computing resources. This approach has the lowest scheduling overhead but does not always produce the best queue-to-device mapping. AUTO_FIT is a dynamic scheduling policy, proposed by the MultiCL framework, that is based on kernel pre-execution: the MultiCL runtime uses it to select the best queue-to-device mapping.
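The round-robin policy can be illustrated with a small simulation. This is a minimal sketch, not MultiCL code: the function name `round_robin_map` and the `free_slots` capacity model (a stand-in for "sufficient computing resources") are assumptions made for illustration.

```python
def round_robin_map(queues, free_slots):
    """Map each queue to the next device in turn that still has capacity.

    free_slots: dict of device name -> number of queues it can still accept.
    This capacity model is hypothetical; it only stands in for the idea of
    skipping devices that lack free computing resources.
    """
    devices = list(free_slots)
    remaining = dict(free_slots)
    mapping = {}
    cursor = 0  # index of the device to try first for the next queue
    for q in queues:
        for step in range(len(devices)):
            dev = devices[(cursor + step) % len(devices)]
            if remaining[dev] > 0:
                mapping[q] = dev
                remaining[dev] -= 1
                cursor = (cursor + step + 1) % len(devices)
                break
        else:
            raise RuntimeError("no device has capacity left")
    return mapping


mapping = round_robin_map(["q0", "q1", "q2", "q3"], {"cpu": 1, "gpu": 3})
print(mapping)
```

Note that the policy never looks at what the queues contain, which is why it is cheap but can pick a poor device for a given kernel.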

To support local (per-queue) scheduling policies, MultiCL also extends the command queue properties with SCHED_OFF and SCHED_AUTO flags, which determine whether a particular queue is scheduled manually or automatically. With SCHED_OFF, the developer specifies the queue-to-device mapping by hand. With a SCHED_AUTO flag set when the queue is created, the queue is scheduled automatically, either statically (SCHED_AUTO_STATIC) or dynamically (SCHED_AUTO_DYNAMIC). These command queue properties are meant as fine-tuning parameters, and they give developers of different experience levels different degrees of control. Experienced developers can ignore the MultiCL queue properties and schedule every queue manually, which lets them partition tasks and optimize kernels for the underlying characteristics of the heterogeneous system. Mid-level developers who have some understanding of the underlying architecture can pass a queue property as a scheduling hint that describes the kernel's workload type (for example, SCHED_IO_BOUND or SCHED_COMPUTE_BOUND), so that MultiCL builds a mini-kernel and schedules according to that hint. Junior developers can rely on the adaptive dynamic scheduling that MultiCL provides and let the runtime decide the kernel schedule, at the cost of some runtime overhead.
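The way these queue flags combine can be modeled with a bit-flag enum. This is only a sketch: the flag names come from the text, but the numeric values, the `describe` helper, and the rule that a queue with no mode flag is treated as manual are assumptions, not MultiCL's actual definitions.

```python
from enum import IntFlag


class SchedFlag(IntFlag):
    # Flag names follow the article; the bit values are illustrative only.
    SCHED_OFF = 1
    SCHED_AUTO_STATIC = 2
    SCHED_AUTO_DYNAMIC = 4
    SCHED_COMPUTE_BOUND = 8
    SCHED_IO_BOUND = 16
    SCHED_MEM_BOUND = 32


HINTS = (SchedFlag.SCHED_COMPUTE_BOUND,
         SchedFlag.SCHED_IO_BOUND,
         SchedFlag.SCHED_MEM_BOUND)


def describe(flags):
    """Return (scheduling mode, workload hints) for a queue's flags."""
    if flags & SchedFlag.SCHED_OFF:
        mode = "manual"
    elif flags & SchedFlag.SCHED_AUTO_DYNAMIC:
        mode = "auto-dynamic"
    elif flags & SchedFlag.SCHED_AUTO_STATIC:
        mode = "auto-static"
    else:
        mode = "manual"  # assumption: no mode flag means the developer decides
    hints = [f.name for f in HINTS if flags & f]
    return mode, hints


# A mid-level developer: automatic scheduling plus a workload hint.
print(describe(SchedFlag.SCHED_AUTO_DYNAMIC | SchedFlag.SCHED_IO_BOUND))
# An experienced developer: everything mapped by hand.
print(describe(SchedFlag.SCHED_OFF))
```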

The MultiCL extension clSetCommandQueueSchedProperty marks scheduler regions and can set additional scheduler flags when needed. For example, SCHED_COMPUTE_BOUND, SCHED_MEM_BOUND, SCHED_IO_BOUND, or SCHED_ITERATIVE declares the kind of computation expected in the queue. These flags correspond to the extended properties accepted by clCreateCommandQueue and help the runtime optimize the scheduling plan. An OpenCL kernel launch command takes a command queue, a kernel object, and the kernel's launch configuration as parameters. To support automatic scheduling, MultiCL adds a new API call, clSetKernelWorkGroupInfo, which sets the kernel's launch configuration ahead of time. This lets the MultiCL runtime learn the number of workgroups and the number of work items per workgroup before the kernel launch command is enqueued, which makes it easier for the framework to schedule the kernel.

The MultiCL runtime design is shown in Figure 1.

Command queues created with the SCHED_OFF flag are statically mapped to the chosen device (the developer specifies the mapping at coding time), while command queues created with the SCHED_AUTO flag are scheduled automatically by MultiCL. When a kernel launch command is enqueued on a SCHED_OFF queue, the static schedule is used: the MultiCL runtime keeps the static scheduling of the SnuCL runtime, so developers can still hard-code the schedule. When the queue was created with SCHED_AUTO, the adaptive scheduler maps it to a device in the device pool. This process consists of static device profiling (performed while clGetPlatformIDs reads device information), static kernel analysis (during clCreateProgramWithSource and clBuildProgram), dynamic kernel profiling (during the clEnqueue* commands), and device mapping. The MultiCL dynamic scheduling module runs in separate threads, concurrently with the OpenCL program, and uses the time spent in the OpenCL initialization phase to hide the cost of kernel and device profiling.

Figure 1 MultiCL runtime structure diagram

MultiCL dynamic scheduling contains three main runtime modules: the device profiler, the kernel profiler, and the task scheduler.

  1. Device profiler: collects and analyzes device performance (memory, compute, and I/O).
  2. Kernel profiler: analyzes the kernel and predicts its execution time on each device.
  3. Task scheduler: maps tasks in command queues marked SCHED_AUTO to devices.

Building on MultiCL's extensions to the OpenCL standard, these three modules implement dynamic scheduling of OpenCL tasks. The device profiler and the kernel profiler analyze the OpenCL devices and kernels respectively, determine how well each kernel fits each device, and supply the data on which the task scheduler bases its decisions.

The device profiler runs during the call to clGetPlatformIDs. It retrieves the profile of each OpenCL device from a device profile file. If the file has no entry for a device, the MultiCL runtime runs data bandwidth and instruction throughput benchmarks and stores the measured metrics in the file as static information. The benchmarks are derived from the SHOC benchmark suite and the NVIDIA SDK and are shipped as part of the MultiCL runtime. They need to be rerun only when the system configuration changes (for example, when a device is added to or removed from the system).
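The "benchmark only on a cache miss" behavior can be sketched as follows. Everything here is an assumption made for illustration: the JSON file format, the function names, and the fixed numbers returned by the placeholder benchmark. The real runtime would measure bandwidth and instruction throughput on the device.

```python
import json
import os
import tempfile


def run_benchmarks(device_name):
    """Placeholder for the bandwidth/throughput benchmarks (fixed, made-up numbers)."""
    return {"bandwidth_gbs": 100.0, "gflops": 500.0}


def get_device_profile(device_name, profile_path, benchmark=run_benchmarks):
    """Return cached metrics for a device, benchmarking only if none are stored."""
    profiles = {}
    if os.path.exists(profile_path):
        with open(profile_path) as f:
            profiles = json.load(f)
    if device_name not in profiles:
        profiles[device_name] = benchmark(device_name)
        with open(profile_path, "w") as f:
            json.dump(profiles, f)  # persist so later runs skip the benchmark
    return profiles[device_name]


# Count how often the (expensive) benchmark really runs.
calls = []


def counting_benchmark(name):
    calls.append(name)
    return run_benchmarks(name)


path = os.path.join(tempfile.mkdtemp(), "device_profile.json")
first = get_device_profile("gpu0", path, counting_benchmark)
second = get_device_profile("gpu0", path, counting_benchmark)
print(first, second, calls)
```

The second lookup is served from the file, so the benchmark runs once per device until the stored profile is discarded, for example after a hardware change.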

The kernel profiler estimates kernel execution times using performance modeling or performance prediction techniques. To reduce profiling overhead, MultiCL runs a mini-kernel once per device and stores the resulting execution times as part of the kernel profile. The mini-kernel is derived from the original kernel and launched on every device taking part in the computation. It executes the work items of only a single workgroup, and the runtime records its relative performance on each device. Profiling this way adds some runtime overhead to the program, but with the mini-kernel optimization, experiments show that the overhead is very small and sometimes negligible. The approach also accurately reflects the differences in how the kernel behaves on different devices. Mini-kernels are created during the clCreateProgramWithSource and clBuildProgram calls: the MultiCL runtime intercepts clCreateProgramWithSource to create a mini-kernel for each kernel, and intercepts clBuildProgram to build the program containing the mini-kernels into a separate binary. Although this doubles the OpenCL build time, the authors consider it a one-time setup cost that does not change the actual running time of the program.
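A toy model shows why profiling with a mini-kernel is cheap. It is a sketch under strong assumptions: the per-workgroup times are made-up numbers, workgroups are assumed to run one after another, and a mini-kernel is modeled as a launch in which only one workgroup does any work. The function names are hypothetical.

```python
# Simulated time (ms) for ONE workgroup of a kernel on each device (made-up numbers).
WORKGROUP_MS = {"cpu": 4.0, "gpu": 0.5}


def run_kernel(device, num_workgroups, only_first=False):
    """Return the simulated run time of a launch on a device.

    With only_first=True, every workgroup except the first exits immediately,
    which is how this sketch models a mini-kernel.
    Simplification: workgroups are assumed to execute sequentially.
    """
    active = 1 if only_first else num_workgroups
    return WORKGROUP_MS[device] * active


def profile_with_mini_kernel(num_workgroups):
    """Run the mini-kernel once per device and record the time on each."""
    return {d: run_kernel(d, num_workgroups, only_first=True) for d in WORKGROUP_MS}


profile = profile_with_mini_kernel(1024)
best = min(profile, key=profile.get)          # device with the lowest estimate
overhead = profile["gpu"] / run_kernel("gpu", 1024)  # profiling cost vs. a full run
print(profile, best, overhead)
```

In this model the profiling run costs about one thousandth of a full launch while still ranking the devices in the same order as the full kernel would.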

The task scheduler takes each OpenCL command queue and adds it to a pool of ready queues awaiting scheduling. MultiCL derives the task mapping solely from the aggregated device profiles and kernel profiles. The runtime reads the aggregated kernel profile of each queue, and the task scheduler chooses the queue-to-device mapping that minimizes the overall execution time when the queues run concurrently. In this way, dynamic scheduling aims at an ideal kernel-to-device mapping. Because an OpenCL platform typically has only a few devices, the cost of computing the mapping is negligible. Once the scheduler has mapped a command queue to a device, the queue is removed from the pool.
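One way to picture the mapping step is an exhaustive search over all queue-to-device assignments, which stays cheap when there are only a few devices and queues. This is an illustrative sketch, not MultiCL's actual algorithm: the search strategy, the function name `best_mapping`, the cost model (queues on the same device run back to back, devices run concurrently), and the timing numbers are all assumptions.

```python
from itertools import product


def best_mapping(est_ms, devices):
    """Pick the queue-to-device mapping with the smallest overall finish time.

    est_ms[queue][device] is the estimated kernel time of that queue on that
    device, as a kernel profile might provide it.
    """
    queues = list(est_ms)
    best, best_span = None, float("inf")
    for choice in product(devices, repeat=len(queues)):
        load = dict.fromkeys(devices, 0.0)
        for q, d in zip(queues, choice):
            load[d] += est_ms[q][d]      # queues sharing a device run back to back
        span = max(load.values())        # devices work concurrently
        if span < best_span:
            best, best_span = dict(zip(queues, choice)), span
    return best, best_span


est = {
    "q0": {"cpu": 8.0, "gpu": 2.0},
    "q1": {"cpu": 3.0, "gpu": 2.5},
    "q2": {"cpu": 9.0, "gpu": 1.0},
}
mapping, span = best_mapping(est, ["cpu", "gpu"])
print(mapping, span)
```

Although q1 is slightly faster on the GPU in isolation, the search places it on the CPU, because that keeps both devices busy and gives the shortest overall time.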

3 Summary

OpenCL is the first open standard for general-purpose parallel programming of heterogeneous systems, suitable for heterogeneous programming across CPUs, GPUs, and other processors. MultiCL extends the OpenCL standard to support static and dynamic scheduling both globally, at the context level, and locally, at the command queue level. MultiCL is a runtime system that performs dynamic scheduling of OpenCL tasks through its device profiler, kernel profiler, and task scheduler, while keeping OpenCL's own developer-specified static scheduling. It lets developers focus on application-level data and task decomposition rather than on device-level architectural details and device scheduling, which greatly reduces the difficulty of OpenCL development and provides a good software base for future research on task scheduling. However, MultiCL also introduces pre-execution overhead, which lowers execution efficiency, and its pre-execution estimate does not reflect the kernel's actual run-time behavior, so it needs further optimization.

Author: JD Logistics Jin Ziwei

Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please credit the source when reprinting.


Origin my.oschina.net/u/4090830/blog/10143761