Intel oneAPI: Bringing High-Performance Computing Within Reach

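1. Introduction
2. Overview of Intel oneAPI
3. Memory and Task Management
- 3.1 Queue
- 3.2 Device Memory
4. Multithreading and Correctness Guarantees
- 4.1 oneTBB
- 4.2 High-Performance Processing: Reduction
- 4.3 Atomic Operations
5. VTune Performance Analysis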

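1. Introduction

With the rise of artificial intelligence, large-scale high-performance computing has become a necessity. Social and transportation networks with tens of millions of nodes, the large neural networks behind conversational language models, and compute-heavy fields such as aerospace all depend on parallel computing. Parallel computing is a mode of computation in which many instructions execute at the same time. Its goals are to speed up computation and to solve larger, more complex problems by scaling up the problem size. Ideally, speedup grows with the number of processing units, although in practice it is bounded by the part of the program that must run serially. Even so, parallelism greatly increases the performance available from a computer and makes many everyday applications practical.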

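2. Overview of Intel oneAPI

Intel oneAPI is a cross-industry, open, standards-based unified programming model that gives developers a common experience across CPUs, GPUs, FPGAs, and other specialized accelerators. The oneAPI open specification builds on industry standards and existing developer programming models, and it applies to a wide range of architectures and to hardware from different vendors. The oneAPI industry initiative encourages collaboration on the specification across the ecosystem, as well as compatible oneAPI implementations.
The Intel oneAPI products are Intel's implementation of oneAPI. They include the standard oneAPI components: a direct programming tool (Data Parallel C++), API-based programming tools made up of a set of performance libraries, and advanced analysis and debugging tools. Developers can already test code and applications on Intel DevCloud for oneAPI against a variety of Intel architectures, including Intel Xeon Scalable processors, Intel Core processors with integrated graphics, and Intel FPGAs such as Intel Arria and Stratix.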
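Intel oneAPI is feature-rich and covers many aspects of programming. It is distributed as a set of toolkits, including the following:

- Intel oneAPI Base Toolkit, described in detail at https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html#gs.pptpr2
- Intel AI Analytics Toolkit, described in detail at https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html#gs.pptr7m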

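The following sections use concrete examples to introduce oneAPI's features: fine-grained memory and task management, parallelization, correctness guarantees, high-performance operations, and program analysis.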

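3. Memory and Task Management

3.1 Queue

oneAPI uses the DPC++ language. DPC++ builds on standard C++ and the SYCL standard and targets heterogeneous devices such as CPUs, GPUs, and FPGAs. It provides a single programming model for heterogeneous computing, so developers do not need to learn a separate language and API for each kind of device.

A queue in Intel oneAPI is a DPC++ command queue. Developers add commands to it (memory copies, compute kernels, and so on), and the runtime executes them on the device bound to the queue. Submission is asynchronous by default. When the host needs a result, it can wait on the queue or on the event returned by a submission, which lets host and device work overlap.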

Creating a queue and selecting a device:

queue myQueue(gpu_selector_v); // gpu_selector_v selects a GPU as the compute device
std::cout << "Device: " << myQueue.get_device().get_info<info::device::name>() << "\n"; // print the device name


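Executing commands:

Call submit and pass it a lambda expression. The lambda receives a handler object, which is used to add the commands to be executed to the queue:

myQueue.submit([&](handler& h) {
  // std::cout cannot be used in device code; sycl::stream is the device-side output stream
  stream out(1024, 256, h);
  // Add the command to execute
  h.single_task([=]() {
    out << "Hello, world!\n";
  });
});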

Executing tasks in parallel:

myQueue.submit([&](handler& h) {
  // Basic form: N work-items, each identified by an id
  h.parallel_for(range<1>(N), [=](id<1> i) {
    // CODE THAT RUNS ON DEVICE
  });
});

myQueue.submit([&](handler& h) {
  // ND-range form: 1024 work-items in work-groups of 64
  h.parallel_for(nd_range<1>(range<1>(1024), range<1>(64)), [=](nd_item<1> item) {
    // CODE THAT RUNS ON DEVICE
  });
});

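The following excerpt is part of a BFS computation built on a oneAPI queue:

q.submit([&](handler& h) {
    // N work-items in total, B per work-group
    h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
        auto tid = item.get_global_id()[0];
        // Grid-stride loop: each work-item handles active nodes tid, tid + N, ...
        for (uint i = tid; i < nodeNum; i += N) {
            uint id = activeNodesD[i];
            uint edgeIndex = nodePointersD[id];
            uint sourceValue = valueD[id];
            uint finalValue;
            // Visit every outgoing edge of the active node
            for (uint j = 0; j < degreeD[id]; j++) {
                finalValue = sourceValue + 1; // depth of the neighbor; not used further in this excerpt
                uint vertexId;
                vertexId = edgeListD[edgeIndex + j];
                // Mark neighbors that have not been reached yet
                if (1 != valueD[vertexId]) {
                    valueD[vertexId] = 1;
                    labelD[vertexId] = 1;
                }
            }
        }
    });
}).wait();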
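To make the access pattern of the kernel above easier to follow, here is a minimal host-only C++ sketch of the same frontier expansion over a CSR graph. It is a sequential reference under stated assumptions, not the author's implementation: the names `CsrGraph` and `expandFrontier` are hypothetical, and the device's grid-stride loop is emulated by an outer loop over `tid`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR container mirroring the arrays used by the kernel.
struct CsrGraph {
    std::vector<uint32_t> nodePointers; // start offset of each node's edges
    std::vector<uint32_t> degree;       // out-degree of each node
    std::vector<uint32_t> edgeList;     // concatenated neighbor ids
};

// Expands one BFS level sequentially and returns how many nodes were newly reached.
// numWorkItems plays the role of N in the kernel: "work-item" tid handles
// frontier entries tid, tid + N, tid + 2N, ...
inline uint32_t expandFrontier(const CsrGraph& g,
                               const std::vector<uint32_t>& activeNodes,
                               std::vector<uint32_t>& visited,
                               std::vector<uint32_t>& label,
                               uint32_t numWorkItems) {
    assert(numWorkItems > 0);
    assert(visited.size() == g.degree.size() && label.size() == g.degree.size());
    uint32_t newlyVisited = 0;
    for (uint32_t tid = 0; tid < numWorkItems; ++tid) {
        for (std::size_t i = tid; i < activeNodes.size(); i += numWorkItems) {
            const uint32_t id = activeNodes[i];
            const uint32_t edgeIndex = g.nodePointers[id];
            for (uint32_t j = 0; j < g.degree[id]; ++j) {
                const uint32_t v = g.edgeList[edgeIndex + j];
                if (visited[v] != 1) { // same test as in the kernel
                    visited[v] = 1;
                    label[v] = 1;      // v joins the next frontier
                    ++newlyVisited;
                }
            }
        }
    }
    return newlyVisited;
}
```

Calling `expandFrontier` repeatedly, each time with the nodes whose `label` was just set, walks the graph level by level. On a device the outer `tid` loop disappears because every work-item runs its own stride concurrently.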

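3.2 Device Memory

Intel oneAPI provides functions for working with device memory, such as allocating and freeing it, copying data, and launching compute tasks on it, which makes heterogeneous computing easier to manage.

Allocating and freeing memory on the device

In Intel oneAPI, malloc_device dynamically allocates memory on the device and free releases it.

void* malloc_device(size_t bytes, const queue& q);
// bytes: size of the allocation in bytes; q: the queue whose device is used
// The typed overload takes an element count instead of a byte count, for example:
int* deviceData = malloc_device<int>(vertexArrSize, myQueue);
free(deviceData, myQueue);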

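Copying from host memory to device memory

myQueue.memcpy(deviceData, hostData, 1000 * sizeof(int)).wait(); // memcpy is asynchronous; wait() blocks until the copy finishes

Copying from device memory to host memory

myQueue.memcpy(hostData, deviceData, 1000 * sizeof(int)).wait();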

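Sharing memory dynamically between host and device

void* malloc_shared(size_t bytes, const queue& q);
// bytes: size of the allocation in bytes; q: the queue whose device is used
// The typed overload takes an element count, for example:
int* sharedData = malloc_shared<int>(vertexArrSize, myQueue);
free(sharedData, myQueue);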

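Initialization

The queue's memset member function initializes device memory by setting every byte in a memory range to the given value.

event queue::memset(void* ptr, int value, size_t numBytes);
// ptr: pointer to the memory to set; value: the value to write, passed as an int and interpreted as a byte; numBytes: number of bytes to set
// For example:
myQueue.memset(deviceData, 0, bytes).wait();

The following excerpt initializes memory using device allocations: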

degreeD = malloc_device<SIZE_TYPE>(vertexArrSize,streamStatic);
isActiveD = malloc_device<uint>(vertexArrSize,streamStatic);
isStaticActive = malloc_device<uint>(vertexArrSize,streamStatic);
isOverloadActive = malloc_device<uint>(vertexArrSize,streamStatic);

streamStatic.memcpy(degreeD, degree, vertexArrSize * sizeof(SIZE_TYPE));
streamStatic.memcpy(isActiveD, label, vertexArrSize * sizeof(uint));

streamStatic.memset(isStaticActive, 0, vertexArrSize * sizeof(uint));
streamStatic.memset(isOverloadActive, 0, vertexArrSize * sizeof(uint));

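4. Multithreading and Correctness Guarantees

4.1 oneTBB

oneTBB (oneAPI Threading Building Blocks) is a task-oriented C++ library that hides low-level thread and lock management from the user. It is portable across platforms and architectures, so a single code base can run in many software and hardware environments. This lets users concentrate on the task itself: a small amount of code against its high-level interfaces is enough to implement large-scale graph computations.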

// Multithreaded filling of the edge array in a large-scale graph computation
oneapi::tbb::parallel_for(oneapi::tbb::blocked_range<size_t>(0, overloadNodeSize),
    [=](const oneapi::tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            // Locate the node and its edges in the CSR arrays
            unsigned int thisNode = overloadNodeList[i];
            unsigned int thisDegree = degree[thisNode];
            EDGE_POINTER_TYPE fromHere = activeOverloadNodePointers[i];
            EDGE_POINTER_TYPE fromThere = nodePointers[thisNode];

            // Copy the edge data of this on-demand node
            for (unsigned int j = 0; j < thisDegree; j++) {
                overloadEdgeList[fromHere + j] = edgeArray[fromThere + j];
            }
        }
    }
);
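The loop above can run without locks because each selected node writes to its own slice of the output array. The sketch below shows one way such slices could be laid out, assuming that `activeOverloadNodePointers` is an exclusive prefix sum of the selected nodes' degrees (the excerpt does not show how it is built). It is plain sequential C++ so that it runs without oneTBB, and the names `buildOffsets` and `gatherEdges` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Exclusive prefix sum over the degrees of the selected nodes.
// offsets[i] is where node nodeList[i] starts in the compact edge array;
// offsets.back() is the total number of edges to copy.
inline std::vector<uint64_t> buildOffsets(const std::vector<uint32_t>& nodeList,
                                          const std::vector<uint32_t>& degree) {
    std::vector<uint64_t> offsets(nodeList.size() + 1, 0);
    for (std::size_t i = 0; i < nodeList.size(); ++i) {
        assert(nodeList[i] < degree.size());
        offsets[i + 1] = offsets[i] + degree[nodeList[i]];
    }
    return offsets;
}

// Copies the CSR edges of the selected nodes into one compact array.
inline std::vector<uint32_t> gatherEdges(const std::vector<uint32_t>& nodeList,
                                         const std::vector<uint32_t>& degree,
                                         const std::vector<uint32_t>& nodePointers,
                                         const std::vector<uint32_t>& edgeArray) {
    const std::vector<uint64_t> offsets = buildOffsets(nodeList, degree);
    std::vector<uint32_t> compact(offsets.back());
    // Iteration i only writes compact[offsets[i] .. offsets[i + 1]), so the
    // iterations are independent and could be split across threads.
    for (std::size_t i = 0; i < nodeList.size(); ++i) {
        const uint32_t node = nodeList[i];
        const uint64_t fromHere = offsets[i];
        const uint32_t fromThere = nodePointers[node];
        for (uint32_t j = 0; j < degree[node]; ++j) {
            compact[fromHere + j] = edgeArray[fromThere + j];
        }
    }
    return compact;
}
```

Because the slices never overlap, the outer loop is the part that `parallel_for` with a `blocked_range` distributes across threads in the original code.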

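4.2 High-Performance Processing: Reduction

oneAPI provides reduction interfaces at several granularities, including work-groups and sub-groups, so users can organize work hierarchically. Combining asynchronous multithreaded execution with appropriate synchronization can substantially improve program efficiency.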

q.submit([&](handler &h) { // submit the task
   h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) [[intel::reqd_sub_group_size(sub_group_size)]] {
     // One parallel PageRank iteration
     auto sg = item.get_sub_group();
     auto tid = item.get_global_id()[0];
     // Global index of this sub-group; the constant 32 is presumably the sub-group size
     int gid = sg.get_group_id()[0] + B / 32 * (tid / B);
     SIZE_TYPE cnt_disenableD = 0;
     // Iterations are independent, so work-items process them in parallel
     for (SIZE_TYPE i = tid; i < vertexArrSize; i += N) {
       if (inactiveNodeD[i])
         continue;
       uint edgeIndex = nodePointers[i];
       T tempSum = 0;
       for (uint j = edgeIndex; j < edgeIndex + degree[i]; j++) {
         uint srcNodeIndex = edgeArray[j];
         if (outDegree[srcNodeIndex]) {
           T tempValue = output[srcNodeIndex] / outDegree[srcNodeIndex];
           tempSum += tempValue;
         }
       }
       valueD[i] = (1.0 - beta) + beta * tempSum;
     }
     // Mark vertices whose value changed by less than 0.001 as inactive
     for (SIZE_TYPE i = tid; i < vertexArrSize; i += N) {
       if (inactiveNodeD[i])
         continue;
       T diff = abs(valueD[i] - output[i]);
       output[i] = valueD[i];
       if (diff < 0.001) {
         inactiveNodeD[i] = true;
         cnt_disenableD++;
       }
     }
     item.barrier(access::fence_space::local_space);                        // synchronize the work-group
     SIZE_TYPE sum = reduce_over_group(sg, cnt_disenableD, sycl::plus<>()); // reduction: count the vertices deactivated in this pass
     if (sg.get_local_id()[0] == 0)
       disenableD[gid] = sum;
   });
 }).wait(); // block until the task completes
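What the `reduce_over_group` call in the kernel above computes can be emulated on the host. In the sketch below, each work-item's private counter is one vector element, and every run of `subGroupSize` consecutive elements is summed into one result, which is the value the first work-item of each sub-group stores in the kernel. This is a sequential illustration only, and `reducePerSubGroup` is a hypothetical name, not a SYCL function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// perItem[k] is the counter held by work-item k (cnt_disenableD in the kernel).
// Returns one sum per sub-group of subGroupSize consecutive work-items;
// a trailing partial group is summed as well.
inline std::vector<unsigned> reducePerSubGroup(const std::vector<unsigned>& perItem,
                                               std::size_t subGroupSize) {
    assert(subGroupSize > 0);
    std::vector<unsigned> sums;
    for (std::size_t begin = 0; begin < perItem.size(); begin += subGroupSize) {
        const std::size_t end = std::min(begin + subGroupSize, perItem.size());
        const auto first = perItem.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = perItem.begin() + static_cast<std::ptrdiff_t>(end);
        sums.push_back(std::accumulate(first, last, 0u)); // plus-reduction over one group
    }
    return sums;
}
```

Adding up the returned per-group sums on the host then gives the total number of vertices deactivated in the pass, which is how an array such as `disenableD` would typically be consumed.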

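4.3 Atomic Operations

A common way to speed up a program is multithreading: several threads access shared resources concurrently, each doing its own share of the computation. Ideally every thread sees a consistent view of the shared data. In practice, if one thread reads, modifies, and writes back a shared value while another thread is doing the same, one of the updates can be lost and the result is wrong. Special operations are needed to prevent this.

oneAPI provides an atomic interface. An atomic operation is indivisible: once a thread starts it, no other thread can observe or modify the target in a half-updated state, so concurrent updates to the same location are applied one after another.

The following code shows an atomic accumulation in oneAPI and, for comparison, a version that avoids atomics by giving each work-item its own partial sum: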

#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;
using namespace std;

int main() {
    sycl::queue q;
    const int size = 1000;
    const int NUM_THREADS = 50;

    // Host data 1..size, copied to device memory
    int* value = new int[size];
    for (int i = 0; i < size; i++) {
        value[i] = i + 1;
    }
    int* valueD = malloc_device<int>(size, q);
    q.memcpy(valueD, value, size * sizeof(int)).wait();

    // Atomic version: every work-item adds into the same buffer element
    int atomic_sum = 0;
    {
        buffer<int> buf(&atomic_sum, 1);
        q.submit([&](handler& h) {
            ext::oneapi::atomic_accessor acc(buf, h, ext::oneapi::relaxed_order, ext::oneapi::system_scope);
            h.parallel_for(sycl::range<1>(NUM_THREADS), [=](sycl::id<1> ind) {
                for (int index = ind[0]; index < size; index += NUM_THREADS) {
                    acc[0] += valueD[index]; // atomic read-modify-write
                }
            });
        }).wait();
        host_accessor host_acc(buf, read_only);
        cout << "atomic sum = " << host_acc[0] << "\n";
    }

    // Version without atomics: each work-item writes its own partial sum
    int* sum_noatomic_tmp = malloc_shared<int>(NUM_THREADS, q);
    q.memset(sum_noatomic_tmp, 0, NUM_THREADS * sizeof(int)).wait();
    q.submit([&](handler& h) {
        h.parallel_for(sycl::range<1>(NUM_THREADS), [=](sycl::id<1> index) {
            int global_id = index[0];
            int sum = 0;
            for (int i = global_id; i < size; i += NUM_THREADS)
                sum += valueD[i];
            sum_noatomic_tmp[global_id] = sum;
        });
    }).wait();

    // The host adds up the partial sums
    int noatomic_sum = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        noatomic_sum += sum_noatomic_tmp[i];
    cout << "noatomic sum = " << noatomic_sum << "\n";

    free(valueD, q);
    free(sum_noatomic_tmp, q);
    delete[] value;
    return 0;
}
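The same two strategies can be tried on the host without a SYCL device. The sketch below is a standard C++ analogue, not part of the original program: `atomicSum` has all threads add into one `std::atomic` counter, while `partialSum` gives each thread its own slot and combines the slots after the threads finish. Both function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Sums 1..size with numThreads threads that share one atomic counter.
inline long atomicSum(int size, int numThreads) {
    assert(numThreads > 0);
    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&total, t, size, numThreads] {
            for (int i = t; i < size; i += numThreads) {
                // Indivisible read-modify-write: no increment can be lost
                total.fetch_add(i + 1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}

// Same result without atomics: each thread writes only its own slot.
inline long partialSum(int size, int numThreads) {
    assert(numThreads > 0);
    std::vector<long> partial(numThreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&partial, t, size, numThreads] {
            long local = 0;
            for (int i = t; i < size; i += numThreads) local += i + 1;
            partial[t] = local; // no other thread touches partial[t]
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Relaxed ordering is enough here because the total is only read after every thread has been joined, which mirrors the `relaxed_order` choice in the SYCL version. Some toolchains need the `-pthread` flag to build code that uses `std::thread`.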

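5. VTune Performance Analysis

Commonly used oneAPI analysis tools include VTune and Advisor; this section covers VTune.
VTune is a tool for finding and optimizing performance bottlenecks, and it can analyze code running on CPU, GPU, and FPGA systems. The program under test runs the single-source shortest path (SSSP) algorithm on a graph dataset with 995 nodes and 24,087 edges. The executable is named MAIN_TEST1 and the dataset is email_Eu_core.bwcsr. The VTune analysis is started with the following command: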

$ vtune -collect io ../MAIN_TEST1 --input ../email_Eu_core.bwcsr --source 0 --type sssp

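The -collect io option selects the input and output analysis, which collects data on traffic between the CPU and I/O devices. After the program runs to completion, VTune reports the following categories of data.

CPU usage

This part includes the CPU time used by the program, the CPI (cycles per instruction) rate, the total thread count, and related figures.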

CPU Time: 0.308s
Effective Time: 0.308s
Spin Time: 0s
Overhead Time: 0s
Instructions Retired: 921,800,000
CPI Rate: 0.919
Total Thread Count: 4
Paused Time: 0s

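PCIe traffic summary

Before looking at the data, a brief note on PCIe:
PCI Express (Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard, first proposed by Intel in 2001. It uses a point-to-point topology in which a dedicated serial link connects each device to the root complex (the host), which brought a major increase in data transfer speed over the shared parallel PCI bus.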

PCIe Traffic Summary
    Inbound PCIe Read, MB/sec: 18.511
        L3 Hit, %: 0.000
        L3 Miss, %: 100.000
         | A significant portion of inbound I/O read requests misses the L3
         | cache. To reduce inbound read latency and to avoid induced DRAM and
         | UPI traffic, make sure both the device and the memory it accesses
         | reside on the same socket and/or consider optimizations that localize
         | I/O data in the L3.
         |

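The block above reports the L3 cache hit and miss rates for inbound PCIe reads. The 100% miss rate is probably because the dataset used here is much smaller than the ones normally used for testing.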

Inbound PCIe Write, MB/sec: 7.341
    L3 Hit, %: 20.662
    L3 Miss, %: 79.338
     | A significant portion of inbound I/O write requests misses the L3
     | cache. To reduce inbound write latency and to avoid induced DRAM and
     | UPI traffic, make sure both the device and the memory it accesses
     | reside on the same socket and/or consider optimizations that localize
     | I/O data in the L3.
     |
    CPU/IO Conflicts, %: 0.000
    Average Latency, ns: 195.302

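The block above reports the L3 cache hit and miss rates for inbound PCIe writes. The lower miss rate may be because the earlier reads had already brought part of the data into the L3 cache.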

    Outbound PCIe Read, MB/sec: 0.301
     | Non-zero outbound read traffic caused by loads from memory-mapped I/O
     | devices may significantly limit system throughput. Explore MMIO accesses
     | to locate the code reading the memory of I/O devices through MMIO space.
     |
Outbound PCIe Write, MB/sec: 3.711

The two entries above report the outbound PCIe read and write rates.

Bandwidth utilization

| Bandwidth Domain               | Platform Maximum | Observed Maximum | Average | % of Elapsed Time with High BW Utilization |
|--------------------------------|------------------|------------------|---------|--------------------------------------------|
| DRAM, GB/sec                   | 70               | 3.600            | 3.066   | 0.0%                                       |
| DRAM Single-Package, GB/sec    | 35               | 3.200            | 1.802   | 0.0%                                       |
| UPI Utilization Single-link, % | 100              | 3.200            | 2.475   | 0.0%                                       |
| PCIe Bandwidth, MB/sec         | 40               | 45.100           | 29.533  | 80.7%                                      |

The table above shows bandwidth utilization for four bandwidth domains, giving for each the platform maximum, the observed maximum, the average, and the percentage of elapsed time spent at high bandwidth utilization. PCIe has by far the highest share of elapsed time at high utilization, so most of the data movement in this run takes place over the PCIe bus.

Hotspot functions

Function	Module	CPU Time	% of CPU Time(%)
[Outside any known module]	[Unknown]	0.268s	87.0%
func@0x1f3f40	libcuda.so.510.85.02	0.010s	3.3%
__memset_evex_unaligned_erms	libc.so.6	0.003s	1.0%
func@0x331ef0	libcuda.so.510.85.02	0.003s	1.0%

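The table above ranks hotspot functions by CPU time. Most of the time is attributed to code outside any known module, followed by a function in libcuda.so that appears only as an address (func@0x1f3f40) because no symbol information is available. To see the names of your own functions, try compiling with the -g option.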

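In addition to the data above, VTune reports two core-utilization figures:
Effective Physical Core Utilization: 3.9% (0.781 out of 20)
Low physical core utilization can be caused by load imbalance, thread context-switch overhead, or thread communication and synchronization overhead.

Effective Logical Core Utilization: 2.1% (0.828 out of 40)
To improve logical core utilization, start by improving physical core utilization. Raising these figures increases processor throughput and the performance of multithreaded applications.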
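As a quick check on how to read these two figures, each percentage appears to be the average number of effectively used cores divided by the number of cores available. Worked out with the values reported above:

```latex
\[
\text{physical: } \frac{0.781}{20} \approx 0.039 = 3.9\%,
\qquad
\text{logical: } \frac{0.828}{40} \approx 0.021 = 2.1\%
\]
```

In other words, on average less than one core was doing useful work during this run.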

Other information

This part mainly contains platform and hardware information:

Collection and Platform Info

The command line used to launch the application:

Application Command Line: ../MAIN_TEST1 "--input" "../email_Eu_core.bwcsr" "--source" "0" "--type" "sssp"

The operating system name and version used for the run:

Operating System: DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
Result Size: 7.3 MB

The start and stop times of data collection:

Collection start time: 14:11:23 12/11/2022 UTC
Collection stop time: 14:11:24 12/11/2022 UTC
Collector Type: Driverless Perf system-wide sampling
CPU
    Name: Intel(R) Xeon(R) Processor code named Cascadelake
    Frequency: 2.195 GHz
    Logical CPU Count: 40
    Max DRAM Single-Package Bandwidth: 35.000 GB/s
    Hardware cache capabilities:
    Cache Allocation Technology
        Level 2 capability: not detected
        Level 3 capability: available

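The output above comes from VTune's command-line interface, which is typically used when profiling programs remotely. When VTune is run locally, or when the result files from a remote run are downloaded and opened in VTune locally, the results can be viewed in the graphical interface.

[Figure: PCIe bandwidth utilization distribution in the VTune GUI]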
