[Altera SoC Hands-On Journey] Officially Entering OpenCL Mode

http://bbs.eeworld.com.cn/forum.php


The past few weeks have been quite a saga. The earlier Lark board looked high-end, but its documentation was scarce, and building every module from scratch was not realistic for my application.
After talking with EEWorld's 影子, and with her help, I swapped boards with forum member @chenzhufly, who had been using the Arrow SoC board. That board is better documented; at the very least there are plenty of tutorials and materials on RocketBoards.
Everything looked perfect until I finished all the experiments and discovered that Altera's promised "OpenCL development support" was, for this board, just a slogan: I searched the entire official site and found no OpenCL BSP for it. I asked Arrow employee @Alex and was told that no BSP was available yet.
So, with no other option, I swapped again for a board that does support OpenCL development: Terasic's DE1-SoC, arguably the best value of the bunch. The person I traded with was @coyoo (author of 《深入理解Altera FPGA应用设计》). The forum really is full of hidden talent.

I feel lucky to be in this contest: I have gotten to try three different boards (out of only four in total, what a deal) and to meet a group of seriously skilled engineers. I have definitely come out ahead.

Some of you will surely ask: what exactly are you building that absolutely requires OpenCL?

More than one person has asked. When I first saw the contest I was not planning to sign up: I am no longer a student and do not have much spare time for competitions. But right around then I saw the deep neural network accelerator (DNN on FPGA) that Altera and Baidu had jointly developed and presented at a global computing conference, and I happened to have my own idea, related to my day job, of building a convolutional neural network on an FPGA. With all these coincidences lining up, I signed up without hesitation.

What is a neural network good for? It mimics the organization of the human brain: a large number of neurons pass messages to one another to implement cognitive functions. The simplest example is object recognition. When a person sees a table, they know it is a table and not a stool, because it matches the features of "table"; through extensive training, the brain has already encoded those features in the weights between neurons. To a computer, however, a table seen through a camera is just a pile of pixel values (RGB). Shallow processing such as median filtering, correlation, or Sobel filtering cannot recognize the concept "table"; it merely presents one dimension of the information to the user and leaves the judgment to them. To organize the information effectively, you build a large number of identical neurons, each performing the most basic operation (accumulate its inputs and, when a condition is met, pass an output on to the next neuron). Stacked layer upon layer, these units eventually implement deep cognition, and the neurons at the very end can directly answer "this is a table", "this is a stool", or "this is a chair".
A convolutional neural network applies some approximations on top of this: neurons in the same layer share their weights, which reduces the number of connections and makes the network easier to implement on a computer.
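To make those two ideas concrete, here is a minimal CPU-side C++ sketch. It is purely my own illustration (the function names, the ReLU activation, and the 1-D case are assumptions chosen for brevity, not code from the contest): a single neuron accumulates weighted inputs and fires through an activation, and a convolutional layer reuses one small weight vector at every position.

#include <cstddef>
#include <vector>
#include <algorithm>

// One artificial neuron: accumulate weighted inputs, then fire through an
// activation function (ReLU here, just as an illustration).
float neuron(const std::vector<float>& x,
             const std::vector<float>& w, float bias) {
    float acc = bias;
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += w[i] * x[i];
    return std::max(acc, 0.0f);          // "output when the condition is met"
}

// A 1-D convolutional layer: every output position reuses the SAME small
// weight vector (weight sharing), so the parameter count no longer grows
// with the number of connections.
std::vector<float> conv1d(const std::vector<float>& x,
                          const std::vector<float>& w, float bias) {
    std::vector<float> y(x.size() - w.size() + 1);
    for (std::size_t i = 0; i < y.size(); ++i) {
        float acc = bias;
        for (std::size_t k = 0; k < w.size(); ++k)
            acc += w[k] * x[i + k];       // shared weights w[k]
        y[i] = std::max(acc, 0.0f);
    }
    return y;
}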

In short, my current algorithms are implemented in C/C++ and CUDA, and OpenCL is the fastest way to move them onto an FPGA, which is also the most important part of this evaluation. (My previous FPGA work was all in VHDL/Verilog, where design plus simulation, verification, and debugging take far too long to finish in a short time frame. Right now I only care about the algorithm, not the low-level implementation: if I can get the basic functionality working, this stage is done, and resource, timing, and performance optimization can come later.)

After receiving the board, I read the official documentation carefully and set up the OpenCL environment.

For lack of time today, I will not go through OpenCL syntax and structure in detail; let's go straight to the examples.

Flash the TF card (see my earlier post for the procedure). Once it is written, set the SW10 DIP switch to "01010" (this is important: if the FPGA is not configured, the script later on will lock up), then power on.
Here is a picture:

On the PC, open PuTTY, set the baud rate to 115200, log in as root (no password), and you are in the system.

You can see the system is Poky 8.0 (Yocto Project 1.3 Reference Distro) 1.3 socfpga ttyS0, the same default system as on the Lark board.
Run ls: the current directory contains quite a few example programs.
First, a warm-up exercise: run the script that initializes the OpenCL environment:
source ./init_opencl.sh
It finishes almost instantly. Let's open the script and see what it actually does:

root@socfpga:~/vector_Add# cat ~/init_opencl.sh
export ALTERAOCLSDKROOT=/home/root/opencl_arm32_rte
export AOCL_BOARD_PACKAGE_ROOT=$ALTERAOCLSDKROOT/board/c5soc
export PATH=$ALTERAOCLSDKROOT/bin:$PATH
export LD_LIBRARY_PATH=$ALTERAOCLSDKROOT/host/arm32/lib:$LD_LIBRARY_PATH
insmod $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko


It first sets a few environment variables:
ALTERAOCLSDKROOT
AOCL_BOARD_PACKAGE_ROOT
PATH
LD_LIBRARY_PATH
and then runs insmod to load the driver.
From this we can tell that the OpenCL service is provided by the driver module $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko.
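As a quick sanity check that the script did its job, a minimal host-side probe like the one below (my own sketch, not one of the shipped examples; it uses only standard OpenCL 1.0 calls) should report the Altera platform once the environment variables are set and the driver is loaded:

// Minimal sanity check: if init_opencl.sh ran correctly (environment set up,
// aclsoc_drv.ko loaded), the Altera platform should be visible to the runtime.
#include <cstdio>
#include "CL/opencl.h"

int main() {
    cl_uint num_platforms = 0;
    cl_int status = clGetPlatformIDs(0, NULL, &num_platforms);
    if (status != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platform found -- did you source init_opencl.sh?\n");
        return -1;
    }

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    char name[256];
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
    printf("Found platform: %s\n", name);   // expect "Altera SDK for OpenCL"
    return 0;
}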
OK, everything is ready. Let's start with the helloworld directory.

root@socfpga:~# cd helloworld/
root@socfpga:~/helloworld# ls
hello_world.aocx  helloworld


This directory contains two files, hello_world.aocx and helloworld. The former runs on the FPGA (what OpenCL calls the kernel); the latter runs on the ARM side (the host program). Their build flows are shown in the figure.

The steps to run it are as follows:

root@socfpga:~/helloworld# aocl program /dev/acl0 hello_world.aocx
aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
Reprogramming was successful!
root@socfpga:~/helloworld# ./helloworld
Querying platform for info:
==========================
CL_PLATFORM_NAME                         = Altera SDK for OpenCL
CL_PLATFORM_VENDOR                       = Altera Corporation
CL_PLATFORM_VERSION                      = OpenCL 1.0 Altera SDK for OpenCL, Version 14.0

Querying device for info:
========================
CL_DEVICE_NAME                           = de1soc_sharedonly : Cyclone V SoC Development Kit
CL_DEVICE_VENDOR                         = Altera Corporation
CL_DEVICE_VENDOR_ID                      = 4466
CL_DEVICE_VERSION                        = OpenCL 1.0 Altera SDK for OpenCL, Version 14.0
CL_DRIVER_VERSION                        = 14.0
CL_DEVICE_ADDRESS_BITS                   = 64
CL_DEVICE_AVAILABLE                      = true
CL_DEVICE_ENDIAN_LITTLE                  = true
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE          = 32768
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE      = 0
CL_DEVICE_GLOBAL_MEM_SIZE                = 536870912
CL_DEVICE_IMAGE_SUPPORT                  = false
CL_DEVICE_LOCAL_MEM_SIZE                 = 16384
CL_DEVICE_MAX_CLOCK_FREQUENCY            = 1000
CL_DEVICE_MAX_COMPUTE_UNITS              = 1
CL_DEVICE_MAX_CONSTANT_ARGS              = 8
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE       = 134217728
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS       = 3
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS       = 8192
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE       = 1024
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR    = 4
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT   = 2
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT     = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG    = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT   = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE  = 0
Command queue out of order?              = false
Command queue profiling enabled?         = true
Using AOCX: hello_world.aocx

Kernel initialization is complete.
Launching the kernel...

Thread #2: Hello from Altera's OpenCL Compiler!

Kernel execution is complete.


As you can see, it ran successfully.
If you want the source code, it can be found in DE1-SoC_openCL_BSP.zip under examples/helloworld/.
Files with the .cl extension contain the kernels. The kernel for the example above is:

// AOC kernel demonstrating device-side printf call
__kernel void hello_world(int thread_id_from_which_to_print_message) {
  // Get index of the work item
  unsigned thread_id = get_global_id(0);

  if(thread_id == thread_id_from_which_to_print_message) {
    printf("Thread #%u: Hello from Altera's OpenCL Compiler!\n", thread_id);
  }
}


It looks like an ordinary C function, except for the "__kernel" qualifier, which specifies that it runs on the device (the FPGA). Altera's OpenCL compiler (the AOC tool) turns it into an FPGA bitstream configuration file (.aocx).
The function itself is trivial: each work item checks whether its own thread ID matches the one specified by the host; if so, it prints a message, and all other threads stay silent.
Next, let's look at what the host program looks like.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cstring>
#include "CL/opencl.h"
#include "AOCL_Utils.h"

using namespace aocl_utils;

#define STRING_BUFFER_LEN 1024

// Runtime constants
// Used to define the work set over which this kernel will execute.
static const size_t work_group_size = 8;  // 8 threads in the demo workgroup
// Defines kernel argument value, which is the workitem ID that will
// execute a printf call
static const int thread_id_to_output = 2;

// OpenCL runtime configuration
static cl_platform_id platform = NULL;
static cl_device_id device = NULL;
static cl_context context = NULL;
static cl_command_queue queue = NULL;
static cl_kernel kernel = NULL;
static cl_program program = NULL;

// Function prototypes
bool init();
void cleanup();
static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name);
static void device_info_uint( cl_device_id device, cl_device_info param, const char* name);
static void device_info_bool( cl_device_id device, cl_device_info param, const char* name);
static void device_info_string( cl_device_id device, cl_device_info param, const char* name);
static void display_device_info( cl_device_id device );

// Entry point.
int main() {
  cl_int status;

  if(!init()) {
    return -1;
  }

  // Set the kernel argument (argument 0)
  status = clSetKernelArg(kernel, 0, sizeof(cl_int), (void*)&thread_id_to_output);
  checkError(status, "Failed to set kernel arg 0");

  printf("\nKernel initialization is complete.\n");
  printf("Launching the kernel...\n\n");

  // Configure work set over which the kernel will execute
  size_t wgSize[3] = {work_group_size, 1, 1};
  size_t gSize[3] = {work_group_size, 1, 1};

  // Launch the kernel
  status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
  checkError(status, "Failed to launch kernel");

  // Wait for command queue to complete pending events
  status = clFinish(queue);
  checkError(status, "Failed to finish");

  printf("\nKernel execution is complete.\n");

  // Free the resources allocated
  cleanup();

  return 0;
}

/////// HELPER FUNCTIONS ///////

bool init() {
  cl_int status;

  if(!setCwdToExeDir()) {
    return false;
  }

  // Get the OpenCL platform.
  platform = findPlatform("Altera");
  if(platform == NULL) {
    printf("ERROR: Unable to find Altera OpenCL platform.\n");
    return false;
  }

  // User-visible output - Platform information
  {
    char char_buffer[STRING_BUFFER_LEN];
    printf("Querying platform for info:\n");
    printf("==========================\n");
    clGetPlatformInfo(platform, CL_PLATFORM_NAME, STRING_BUFFER_LEN, char_buffer, NULL);
    printf("%-40s = %s\n", "CL_PLATFORM_NAME", char_buffer);
    clGetPlatformInfo(platform, CL_PLATFORM_VENDOR, STRING_BUFFER_LEN, char_buffer, NULL);
    printf("%-40s = %s\n", "CL_PLATFORM_VENDOR ", char_buffer);
    clGetPlatformInfo(platform, CL_PLATFORM_VERSION, STRING_BUFFER_LEN, char_buffer, NULL);
    printf("%-40s = %s\n\n", "CL_PLATFORM_VERSION ", char_buffer);
  }

  // Query the available OpenCL devices.
  scoped_array<cl_device_id> devices;
  cl_uint num_devices;

  devices.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));

  // We'll just use the first device.
  device = devices[0];

  // Display some device information.
  display_device_info(device);

  // Create the context.
  context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
  checkError(status, "Failed to create context");

  // Create the command queue.
  queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &status);
  checkError(status, "Failed to create command queue");

  // Create the program.
  std::string binary_file = getBoardBinaryFile("hello_world", device);
  printf("Using AOCX: %s\n", binary_file.c_str());
  program = createProgramFromBinary(context, binary_file.c_str(), &device, 1);

  // Build the program that was just created.
  status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
  checkError(status, "Failed to build program");

  // Create the kernel - name passed in here must match kernel name in the
  // original CL file, that was compiled into an AOCX file using the AOC tool
  const char *kernel_name = "hello_world";  // Kernel name, as defined in the CL file
  kernel = clCreateKernel(program, kernel_name, &status);
  checkError(status, "Failed to create kernel");

  return true;
}

// Free the resources allocated during initialization
void cleanup() {
  if(kernel) {
    clReleaseKernel(kernel);
  }
  if(program) {
    clReleaseProgram(program);
  }
  if(queue) {
    clReleaseCommandQueue(queue);
  }
  if(context) {
    clReleaseContext(context);
  }
}

// Helper functions to display parameters returned by OpenCL queries
static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name) {
   cl_ulong a;
   clGetDeviceInfo(device, param, sizeof(cl_ulong), &a, NULL);
   printf("%-40s = %lu\n", name, a);
}
static void device_info_uint( cl_device_id device, cl_device_info param, const char* name) {
   cl_uint a;
   clGetDeviceInfo(device, param, sizeof(cl_uint), &a, NULL);
   printf("%-40s = %u\n", name, a);
}
static void device_info_bool( cl_device_id device, cl_device_info param, const char* name) {
   cl_bool a;
   clGetDeviceInfo(device, param, sizeof(cl_bool), &a, NULL);
   printf("%-40s = %s\n", name, (a?"true":"false"));
}
static void device_info_string( cl_device_id device, cl_device_info param, const char* name) {
   char a[STRING_BUFFER_LEN];
   clGetDeviceInfo(device, param, STRING_BUFFER_LEN, &a, NULL);
   printf("%-40s = %s\n", name, a);
}

// Query and display OpenCL information on device and runtime environment
static void display_device_info( cl_device_id device ) {

   printf("Querying device for info:\n");
   printf("========================\n");
   device_info_string(device, CL_DEVICE_NAME, "CL_DEVICE_NAME");
   device_info_string(device, CL_DEVICE_VENDOR, "CL_DEVICE_VENDOR");
   device_info_uint(device, CL_DEVICE_VENDOR_ID, "CL_DEVICE_VENDOR_ID");
   device_info_string(device, CL_DEVICE_VERSION, "CL_DEVICE_VERSION");
   device_info_string(device, CL_DRIVER_VERSION, "CL_DRIVER_VERSION");
   device_info_uint(device, CL_DEVICE_ADDRESS_BITS, "CL_DEVICE_ADDRESS_BITS");
   device_info_bool(device, CL_DEVICE_AVAILABLE, "CL_DEVICE_AVAILABLE");
   device_info_bool(device, CL_DEVICE_ENDIAN_LITTLE, "CL_DEVICE_ENDIAN_LITTLE");
   device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHE_SIZE");
   device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE");
   device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_SIZE, "CL_DEVICE_GLOBAL_MEM_SIZE");
   device_info_bool(device, CL_DEVICE_IMAGE_SUPPORT, "CL_DEVICE_IMAGE_SUPPORT");
   device_info_ulong(device, CL_DEVICE_LOCAL_MEM_SIZE, "CL_DEVICE_LOCAL_MEM_SIZE");
   device_info_ulong(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, "CL_DEVICE_MAX_CLOCK_FREQUENCY");
   device_info_ulong(device, CL_DEVICE_MAX_COMPUTE_UNITS, "CL_DEVICE_MAX_COMPUTE_UNITS");
   device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_ARGS, "CL_DEVICE_MAX_CONSTANT_ARGS");
   device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, "CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE");
   device_info_uint(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS");
   device_info_uint(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS");
   device_info_uint(device, CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE, "CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT");
   device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE");

   {
      cl_command_queue_properties ccp;
      clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(cl_command_queue_properties), &ccp, NULL);
      printf("%-40s = %s\n", "Command queue out of order? ", ((ccp & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)?"true":"false"));
      printf("%-40s = %s\n", "Command queue profiling enabled? ", ((ccp & CL_QUEUE_PROFILING_ENABLE)?"true":"false"));
   }
}


The host program is fairly long. Its main flow is:
initialize the platform, find a device, print device information, create a context for the device, create a command queue in that context, load the device binary, build the program, create the kernel object, set the kernel arguments, launch the kernel, wait for it to finish, and release all objects.
This is the most basic OpenCL flow. It looks tedious, but once you are familiar with it, it is almost identical every time and very little of the code changes between projects; what genuinely needs careful design is the kernel.
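Boiled down to the standard OpenCL C API, that flow looks roughly like the sketch below. This is my own condensed illustration, not code from the example: error checking is omitted, the .aocx bytes are assumed to have been loaded into a buffer by the caller, and the kernel name "hello_world" is simply the one from the example above. It is roughly what the AOCL_Utils helpers wrap.

#include <stdio.h>
#include "CL/opencl.h"

void run_skeleton(const unsigned char *aocx, size_t aocx_size) {
  cl_int status;

  cl_platform_id platform;                       // 1. platform
  clGetPlatformIDs(1, &platform, NULL);

  cl_device_id device;                           // 2. device
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

  cl_context context =                           // 3. context
      clCreateContext(NULL, 1, &device, NULL, NULL, &status);

  cl_command_queue queue =                       // 4. command queue
      clCreateCommandQueue(context, device, 0, &status);

  cl_program program = clCreateProgramWithBinary( // 5. load the .aocx binary
      context, 1, &device, &aocx_size, &aocx, NULL, &status);
  clBuildProgram(program, 0, NULL, "", NULL, NULL); // 6. still required, even for a prebuilt binary

  cl_kernel kernel = clCreateKernel(program, "hello_world", &status); // 7. kernel object

  int arg = 2;                                   // 8. kernel argument
  clSetKernelArg(kernel, 0, sizeof(int), &arg);

  size_t gsize = 8, lsize = 8;                   // 9. launch
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
  clFinish(queue);                               // 10. wait for completion

  clReleaseKernel(kernel);                       // 11. cleanup
  clReleaseProgram(program);
  clReleaseCommandQueue(queue);
  clReleaseContext(context);
}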

Alright, one more example and then it's bedtime.
Go up one directory, change into vector_Add, and run it:
 

root@socfpga:~/helloworld# cd ..
root@socfpga:~# ls
README            helloworld        opencl_arm32_rte  vector_Add
boardtest         init_opencl.sh    swapper
root@socfpga:~# cd vector_Add/
root@socfpga:~/vector_Add# ls
vectorAdd       vectorAdd.aocx
root@socfpga:~/vector_Add# aocl program /dev/acl0 vectorAdd.aocx
aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
Reprogramming was successful!
root@socfpga:~/vector_Add# ./vectorAdd
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
  de1soc_sharedonly : Cyclone V SoC Development Kit
Using AOCX: vectorAdd.aocx
Launching for device 0 (1000000 elements)

Time: 107.127 ms
Kernel time (device 0): 6.933 ms

Verification: PASS




This is a vector addition example, a classic in parallel computing. The kernel is as follows:

__kernel void vectorAdd(__global const float *x,
                        __global const float *y,
                        __global float *restrict z)
{
    // get index of the work item
    int index = get_global_id(0);

    // add the vector elements
    z[index] = x[index] + y[index];
}

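For reference, here is the same computation written as an ordinary serial loop (my own illustration, not code from the BSP): the kernel above is essentially this loop body, with the loop index replaced by get_global_id(0) and the iterations spread across work items.

// Serial reference: what the FPGA kernel replaces. Each OpenCL work item
// executes one iteration of this loop, identified by get_global_id(0).
void vector_add_serial(const float *x, const float *y, float *z, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}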

The host program is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "CL/opencl.h"
#include "AOCL_Utils.h"

using namespace aocl_utils;

// OpenCL runtime configuration
cl_platform_id platform = NULL;
unsigned num_devices = 0;
scoped_array<cl_device_id> device; // num_devices elements
cl_context context = NULL;
scoped_array<cl_command_queue> queue; // num_devices elements
cl_program program = NULL;
scoped_array<cl_kernel> kernel; // num_devices elements
scoped_array<cl_mem> input_a_buf; // num_devices elements
scoped_array<cl_mem> input_b_buf; // num_devices elements
scoped_array<cl_mem> output_buf; // num_devices elements

// Problem data.
const unsigned N = 1000000; // problem size
scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
scoped_array<scoped_array<float> > ref_output; // num_devices elements
scoped_array<unsigned> n_per_device; // num_devices elements

// Function prototypes
float rand_float();
bool init_opencl();
void init_problem();
void run();
void cleanup();

// Entry point.
int main() {
  // Initialize OpenCL.
  if(!init_opencl()) {
    return -1;
  }

  // Initialize the problem data.
  // Requires the number of devices to be known.
  init_problem();

  // Run the kernel.
  run();

  // Free the resources allocated
  cleanup();

  return 0;
}

/////// HELPER FUNCTIONS ///////

// Randomly generate a floating-point number between -10 and 10.
float rand_float() {
  return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
}

// Initializes the OpenCL objects.
bool init_opencl() {
  cl_int status;

  printf("Initializing OpenCL\n");

  if(!setCwdToExeDir()) {
    return false;
  }

  // Get the OpenCL platform.
  platform = findPlatform("Altera");
  if(platform == NULL) {
    printf("ERROR: Unable to find Altera OpenCL platform.\n");
    return false;
  }

  // Query the available OpenCL device.
  device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
  printf("Platform: %s\n", getPlatformName(platform).c_str());
  printf("Using %d device(s)\n", num_devices);
  for(unsigned i = 0; i < num_devices; ++i) {
    printf("  %s\n", getDeviceName(device[i]).c_str());
  }

  // Create the context.
  context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
  checkError(status, "Failed to create context");

  // Create the program for all device. Use the first device as the
  // representative device (assuming all device are of the same type).
  std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
  printf("Using AOCX: %s\n", binary_file.c_str());
  program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);

  // Build the program that was just created.
  status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
  checkError(status, "Failed to build program");

  // Create per-device objects.
  queue.reset(num_devices);
  kernel.reset(num_devices);
  n_per_device.reset(num_devices);
  input_a_buf.reset(num_devices);
  input_b_buf.reset(num_devices);
  output_buf.reset(num_devices);

  for(unsigned i = 0; i < num_devices; ++i) {
    // Command queue.
    queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
    checkError(status, "Failed to create command queue");

    // Kernel.
    const char *kernel_name = "vectorAdd";
    kernel[i] = clCreateKernel(program, kernel_name, &status);
    checkError(status, "Failed to create kernel");

    // Determine the number of elements processed by this device.
    n_per_device[i] = N / num_devices; // number of elements handled by this device

    // Spread out the remainder of the elements over the first
    // N % num_devices.
    if(i < (N % num_devices)) {
      n_per_device[i]++;
    }

    // Input buffers.
    input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
        n_per_device[i] * sizeof(float), NULL, &status);
    checkError(status, "Failed to create buffer for input A");

    input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
        n_per_device[i] * sizeof(float), NULL, &status);
    checkError(status, "Failed to create buffer for input B");

    // Output buffer.
    output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
        n_per_device[i] * sizeof(float), NULL, &status);
    checkError(status, "Failed to create buffer for output");
  }

  return true;
}

// Initialize the data for the problem. Requires num_devices to be known.
void init_problem() {
  if(num_devices == 0) {
    checkError(-1, "No devices");
  }

  input_a.reset(num_devices);
  input_b.reset(num_devices);
  output.reset(num_devices);
  ref_output.reset(num_devices);

  // Generate input vectors A and B and the reference output consisting
  // of a total of N elements.
  // We create separate arrays for each device so that each device has an
  // aligned buffer.
  for(unsigned i = 0; i < num_devices; ++i) {
    input_a[i].reset(n_per_device[i]);
    input_b[i].reset(n_per_device[i]);
    output[i].reset(n_per_device[i]);
    ref_output[i].reset(n_per_device[i]);

    for(unsigned j = 0; j < n_per_device[i]; ++j) {
      input_a[i][j] = rand_float();
      input_b[i][j] = rand_float();
      ref_output[i][j] = input_a[i][j] + input_b[i][j];
    }
  }
}

void run() {
  cl_int status;

  const double start_time = getCurrentTimestamp();

  // Launch the problem for each device.
  scoped_array<cl_event> kernel_event(num_devices);
  scoped_array<cl_event> finish_event(num_devices);

  for(unsigned i = 0; i < num_devices; ++i) {

    // Transfer inputs to each device. Each of the host buffers supplied to
    // clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
    // for the host-to-device transfer.
    cl_event write_event[2];
    status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
        0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
    checkError(status, "Failed to transfer input A");

    status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
        0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
    checkError(status, "Failed to transfer input B");

    // Set kernel arguments.
    unsigned argi = 0;

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
    checkError(status, "Failed to set argument %d", argi - 1);

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
    checkError(status, "Failed to set argument %d", argi - 1);

    status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
    checkError(status, "Failed to set argument %d", argi - 1);

    // Enqueue kernel.
    // Use a global work size corresponding to the number of elements to add
    // for this device.
    //
    // We don't specify a local work size and let the runtime choose
    // (it'll choose to use one work-group with the same size as the global
    // work-size).
    //
    // Events are used to ensure that the kernel is not launched until
    // the writes to the input buffers have completed.
    const size_t global_work_size = n_per_device[i];
    printf("Launching for device %d (%d elements)\n", i, global_work_size);

    status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
        &global_work_size, NULL, 2, write_event, &kernel_event[i]);
    checkError(status, "Failed to launch kernel");

    // Read the result. This the final operation.
    status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
        0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);

    // Release local events.
    clReleaseEvent(write_event[0]);
    clReleaseEvent(write_event[1]);
  }

  // Wait for all devices to finish.
  clWaitForEvents(num_devices, finish_event);

  const double end_time = getCurrentTimestamp();

  // Wall-clock time taken.
  printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);

  // Get kernel times using the OpenCL event profiling API.
  for(unsigned i = 0; i < num_devices; ++i) {
    cl_ulong time_ns = getStartEndTime(kernel_event[i]);
    printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);
  }

  // Release all events.
  for(unsigned i = 0; i < num_devices; ++i) {
    clReleaseEvent(kernel_event[i]);
    clReleaseEvent(finish_event[i]);
  }

  // Verify results.
  bool pass = true;
  for(unsigned i = 0; i < num_devices && pass; ++i) {
    for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
      if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
        printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",
            i, j, output[i][j], ref_output[i][j]);
        pass = false;
      }
    }
  }

  printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");
}

// Free the resources allocated during initialization
void cleanup() {
  for(unsigned i = 0; i < num_devices; ++i) {
    if(kernel && kernel[i]) {
      clReleaseKernel(kernel[i]);
    }
    if(queue && queue[i]) {
      clReleaseCommandQueue(queue[i]);
    }
    if(input_a_buf && input_a_buf[i]) {
      clReleaseMemObject(input_a_buf[i]);
    }
    if(input_b_buf && input_b_buf[i]) {
      clReleaseMemObject(input_b_buf[i]);
    }
    if(output_buf && output_buf[i]) {
      clReleaseMemObject(output_buf[i]);
    }
  }

  if(program) {
    clReleaseProgram(program);
  }
  if(context) {
    clReleaseContext(context);
  }
}


Adding two vectors of 1,000,000 elements took 107.127 ms end to end (6.933 ms of that in the kernel itself). You could try doing the same computation on the ARM core alone, time it, and compare the performance.
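Here is a minimal sketch of that ARM-only measurement (my own code, not part of the BSP examples; it assumes a C++11-capable ARM cross-compiler and times a plain serial loop with std::chrono):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const unsigned N = 1000000;
    std::vector<float> x(N, 1.0f), y(N, 2.0f), z(N);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (unsigned i = 0; i < N; ++i)          // same work the FPGA kernel does,
        z[i] = x[i] + y[i];                   // but executed serially on the ARM core
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("ARM-only time: %0.3f ms (z[0] = %f)\n", ms, z[0]);
    return 0;
}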

That's it for today. Good night, everyone!


Reposted from blog.csdn.net/sunjing_/article/details/81744540