Core vs Runtime libraries

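The Core library is a collection of low-level algorithm implementations, designed to be embedded in existing projects and applications:

It doesn't allocate any memory (all memory allocations/mappings have to be handled by the caller).
It doesn't perform any kind of multi-threading (but it provides information to the caller about how the workload can be split).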

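The Runtime library is a very basic wrapper around the Core library that can be used for quick prototyping. It is basic in the sense that:

It allocates images and tensors by using standard malloc().
It multi-threads NEON code in a very basic way using a very simple pool of threads.
For OpenCL, it uses the default CLScheduler command queue for all mapping operations and kernels.

For maximum performance, users are expected to re-implement an equivalent of the runtime library that better suits their needs (a more clever multi-threading strategy, load balancing between NEON and OpenCL, etc.).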

Windows, kernels, multi-threading and functions

Windows

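A Window represents a workload to execute. It can handle up to Coordinates::num_max_dimensions dimensions. Each dimension is defined by a start, end and step.

It can be split into subwindows as long as all of the following rules remain true: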

max[n].start() <= sub[n].start() < max[n].end()
sub[n].start() < sub[n].end() <= max[n].end()
max[n].step() == sub[n].step()
(sub[n].start() - max[n].start()) % max[n].step() == 0
(sub[n].end() - sub[n].start()) % max[n].step() == 0
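
The five rules above can be checked mechanically for one dimension. The following is a minimal sketch, not the library's own code: the Dim struct and the is_valid_sub_dimension function are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for one window dimension (not the library's Window::Dimension).
struct Dim
{
    int start;
    int end;
    int step;
};

// Returns true when `sub` satisfies all five rules listed above with respect to `max`.
bool is_valid_sub_dimension(const Dim &max, const Dim &sub)
{
    assert(max.step > 0);
    return max.start <= sub.start && sub.start < max.end  // rule 1
           && sub.start < sub.end && sub.end <= max.end   // rule 2
           && max.step == sub.step                        // rule 3
           && (sub.start - max.start) % max.step == 0     // rule 4
           && (sub.end - sub.start) % max.step == 0;      // rule 5
}
```

For instance, with a full dimension of [0, 16) and step 4, the subwindow [4, 12) with step 4 passes, while [2, 10) fails rule 4 because its start is not aligned to the step.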

Kernels

Each implementation of the IKernel interface (the base class of all kernels in the Core library) works the same way:

OpenCL kernels:

// Initialize the CLScheduler with the default context and default command queue
// Implicitly initializes the CLKernelLibrary to use ./cl_kernels as location for OpenCL kernels files and sets a default device for which OpenCL programs are built.
CLScheduler::get().default_init();
cl::CommandQueue q = CLScheduler::get().queue();
//Create a kernel object:
MyKernel kernel;
// Initialize the kernel with the input/output and options you want to use:
kernel.configure( input, output, option0, option1);
// Retrieve the execution window of the kernel:
const Window& max_window = kernel.window();
// Run the whole kernel in the current thread:
kernel.run( q, max_window ); // Enqueue the kernel to process the full window on the default queue
// Wait for the processing to complete:
q.finish();

NEON/CPP kernels:

//Create a kernel object:
MyKernel kernel;
// Initialize the kernel with the input/output and options you want to use:
kernel.configure( input, output, option0, option1);
// Retrieve the execution window of the kernel:
const Window& max_window = kernel.window();
// Run the whole kernel in the current thread:
kernel.run( max_window ); // Run the kernel on the full window

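Multi-threading

The previous section shows how to run a NEON/CPP kernel in the current thread. However, if your system has several CPU cores, you will probably want the kernel to use several of them. Here is how this can be done: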

    ThreadInfo info;
    info.cpu_info = _info;
    const Window      &max_window     = kernel->window();
    const unsigned int num_iterations = max_window.num_iterations(split_dimension);
    info.num_threads                  = std::min(num_iterations, _num_threads);
    if(num_iterations == 0)
    {
        return;
    }
    if(!kernel->is_parallelisable() || info.num_threads == 1)
    {
        kernel->run(max_window, info);
    }
    else
    {
        int  t         = 0;
        auto thread_it = _threads.begin();
        for(; t < info.num_threads - 1; ++t, ++thread_it)
        {
            Window win     = max_window.split_window(split_dimension, t, info.num_threads);
            info.thread_id = t;
            thread_it->start(kernel, win, info);
        }
        // Run last part on main thread
        Window win     = max_window.split_window(split_dimension, t, info.num_threads);
        info.thread_id = t;
        kernel->run(win, info);
        try
        {
            for(auto &thread : _threads)
            {
                thread.wait();
            }
        }
        catch(const std::system_error &e)
        {
            std::cerr << "Caught system_error with code " << e.code() << " meaning " << e.what() << '\n';
        }
    }

This is the very basic implementation used in the NEON runtime library by all the NEON functions.

See also
CPPScheduler

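Note that some kernels, such as NEHistogramKernel, need a local temporary buffer to perform their calculations. To avoid memory corruption between threads, the local buffer must be of size memory_needed_per_thread * num_threads, and a unique thread_id in the range [0, num_threads) must be assigned to the ThreadInfo object passed to the run function.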
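
The sizing rule above can be illustrated with a small sketch of one shared allocation sliced by thread_id. ScratchBuffer and its members are hypothetical names, not part of the library; the sketch only assumes that each thread touches nothing outside its own slice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one allocation of bytes_per_thread * num_threads bytes,
// sliced so that each thread_id owns a disjoint region.
class ScratchBuffer
{
public:
    ScratchBuffer(std::size_t bytes_per_thread, unsigned int num_threads)
        : _bytes_per_thread(bytes_per_thread), _num_threads(num_threads), _storage(bytes_per_thread * num_threads)
    {
    }

    // Returns the slice owned by thread_id; valid ids are 0 .. num_threads - 1.
    std::uint8_t *slice(unsigned int thread_id)
    {
        assert(thread_id < _num_threads);
        return _storage.data() + thread_id * _bytes_per_thread;
    }

    std::size_t total_size() const
    {
        return _storage.size();
    }

private:
    std::size_t               _bytes_per_thread;
    unsigned int              _num_threads;
    std::vector<std::uint8_t> _storage;
};
```

Two threads sharing a thread_id would receive the same slice, which is exactly the corruption the uniqueness requirement prevents.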

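Functions

Functions automatically allocate the temporary buffers mentioned above and automatically multi-thread kernel execution using the very basic scheduler described in the previous section.

Simple functions only call a single kernel (e.g. NEConvolution3x3), while more complex ones consist of several kernels pipelined together (e.g. NEGaussianPyramid, NEHarrisCorners). Check their documentation to find out which kernels each function uses.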

//Create a function object:
MyFunction function;
// Initialize the function with the input/output and options you want to use:
function.configure( input, output, option0, option1);
// Execute the function:
function.run();

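Warning
The Compute Library requires Mali OpenCL DDK r8p0 or later (OpenCL kernels are compiled with the -cl-arm-non-uniform-work-group-size flag).

Note
All OpenCL functions and objects in the runtime library use the command queue associated with CLScheduler for all operations. A real implementation would be expected to use different queues for mapping operations and kernels in order to reach better GPU utilization.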

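OpenCL scheduler and kernel library

The Compute Library runtime uses a single command queue and context for all operations.

The user can get/set this context and command queue through CLScheduler's interface.

The user can get/set the target GPU device through CLScheduler's interface.

Note
Make sure the application uses the same context as the library, because in OpenCL sharing objects across contexts is forbidden. This is done by calling CLScheduler::init() or CLScheduler::default_init() at the beginning of your application.
Make sure the scheduler's target is not changed after function classes are created.

All OpenCL kernels used by the library are built and stored in CLKernelLibrary. If the library is compiled with embed_kernels=0, the application can set the path to the OpenCL kernels by calling CLKernelLibrary::init(). By default the path is set to "./cl_kernels".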

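OpenCL events and synchronization

In order to block until all the jobs in the CLScheduler's command queue are done executing, the user can call CLScheduler::sync() or create a sync event using CLScheduler::enqueue_sync_event().

For example: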

        PPMLoader     ppm;
        constexpr int scale_factor = 2;
        CLScheduler::get().default_init();
        if(argc < 2)
        {
            // Print help
            std::cout << "Usage: ./build/cl_events [input_image.ppm]\n\n";
            std::cout << "No input_image provided, creating a dummy 640x480 image\n";
            // Create an empty grayscale 640x480 image
            src.allocator()->init(TensorInfo(640, 480, Format::U8));
        }
        else
        {
            ppm.open(argv[1]);
            ppm.init_image(src, Format::U8);
        }
        TensorInfo dst_info(src.info()->dimension(0) / scale_factor, src.info()->dimension(1) / scale_factor, Format::U8);
        // Configure the temporary and destination images
        dst.allocator()->init(dst_info);
        tmp_scale_median.allocator()->init(dst_info);
        tmp_median_gauss.allocator()->init(dst_info);
        //Configure the functions:
        scale.configure(&src, &tmp_scale_median, InterpolationPolicy::NEAREST_NEIGHBOR, BorderMode::REPLICATE);
        median.configure(&tmp_scale_median, &tmp_median_gauss, BorderMode::REPLICATE);
        gauss.configure(&tmp_median_gauss, &dst, BorderMode::REPLICATE);
        // Allocate all the images
        src.allocator()->allocate();
        dst.allocator()->allocate();
        tmp_scale_median.allocator()->allocate();
        tmp_median_gauss.allocator()->allocate();
        // Fill the input image with the content of the PPM image if a filename was provided:
        if(ppm.is_open())
        {
            ppm.fill_image(src);
            output_filename = std::string(argv[1]) + "_out.ppm";
        }

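OpenCL / NEON interoperability

You can mix OpenCL and NEON kernels and functions. However, it is the user's responsibility to handle the mapping/unmapping of OpenCL objects, for example: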

        PPMLoader ppm;
        CLScheduler::get().default_init();
        if(argc < 2)
        {
            // Print help
            std::cout << "Usage: ./build/cl_convolution [input_image.ppm]\n\n";
            std::cout << "No input_image provided, creating a dummy 640x480 image\n";
            // Create an empty grayscale 640x480 image
            src.allocator()->init(TensorInfo(640, 480, Format::U8));
        }
        else
        {
            ppm.open(argv[1]);
            ppm.init_image(src, Format::U8);
        }
        TensorInfo scale_median_info(TensorInfo(src.info()->dimension(0) / 2, src.info()->dimension(1) / 2, Format::U8));
        // Configure the temporary and destination images
        scale_median.allocator()->init(scale_median_info);
        median_gauss.allocator()->init(scale_median_info);
        dst.allocator()->init(scale_median_info);
        scale.configure(&src, &scale_median, InterpolationPolicy::NEAREST_NEIGHBOR, BorderMode::REPLICATE);
        median.configure(&scale_median, &median_gauss, BorderMode::REPLICATE);
        gauss.configure(&median_gauss, &dst, BorderMode::REPLICATE);
        // Allocate all the images
        src.allocator()->allocate();
        scale_median.allocator()->allocate();
        median_gauss.allocator()->allocate();
        dst.allocator()->allocate();
        // Fill the input image with the content of the PPM image if a filename was provided:
        if(ppm.is_open())
        {
            ppm.fill_image(src);
            const std::string output_filename = std::string(argv[1]) + "_out.ppm";
        }

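See also
main_neoncl_scale_median_gaussian

Algorithms

All computer vision algorithms in this library have been implemented following the OpenVX 1.1 specification. Please refer to the Khronos documentation for more information.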

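Images, padding, border modes and tensors

Most kernels and functions in the library process images. However, in order to be future proof, most of the kernels actually accept tensors. See below for more information about how the two are related.

Note
Each memory object can be written by only one kernel, but it can be read by several kernels. Writing to the same object from several kernels will result in undefined behavior. The kernel writing to an object must be configured before the kernels reading from it.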

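Padding and border modes

Some algorithms need a neighborhood around the current pixel to compute its value. This means the algorithm cannot process the borders of the image unless you give it more information about how those border pixels should be processed. The BorderMode enum is used for this purpose.

There are three types of BorderMode:

BorderMode::UNDEFINED: Neighbor pixels outside of the image are treated as undefined. As a result, all the pixels on the border will have undefined values.
BorderMode::REPLICATE: Neighbor pixels outside of the image are treated as having the same value as the closest valid pixel.
BorderMode::CONSTANT: Neighbor pixels outside of the image are treated as having the same constant value (the user can choose this value).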
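
To make the REPLICATE and CONSTANT behaviours listed above concrete, here is a minimal sketch of a pixel fetch that resolves out-of-image coordinates. The Border enum and read_pixel function are hypothetical and independent of the library; UNDEFINED is left out because there is nothing to compute for it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical enum mirroring two of the three border modes described above.
enum class Border
{
    Replicate,
    Constant
};

// Reads pixel (x, y) of a row-major width x height image,
// resolving out-of-image coordinates according to `mode`.
std::uint8_t read_pixel(const std::vector<std::uint8_t> &img, int width, int height,
                        int x, int y, Border mode, std::uint8_t constant_value = 0)
{
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    const bool inside = x >= 0 && x < width && y >= 0 && y < height;
    if(!inside && mode == Border::Constant)
    {
        return constant_value;
    }
    // Replicate: clamp to the closest valid pixel (a no-op for in-image coordinates).
    const int cx = std::min(std::max(x, 0), width - 1);
    const int cy = std::min(std::max(y, 0), height - 1);
    return img[static_cast<std::size_t>(cy) * static_cast<std::size_t>(width) + static_cast<std::size_t>(cx)];
}
```

A neighborhood filter built on such a fetch can run on every pixel, at the cost of a branch per access. That cost is the reason padded buffers are preferred.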

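Moreover, both OpenCL and NEON use vector load and store instructions to access the data in buffers. To avoid special cases for handling the borders, all the images and tensors used in this library must therefore be padded.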

Padding

There are different ways padding can be calculated:

Accurate padding:

        PPMLoader ppm;
        if(argc < 2)
        {
            // Print help
            std::cout << "Usage: ./build/neon_convolution [input_image.ppm]\n\n";
            std::cout << "No input_image provided, creating a dummy 640x480 image\n";
            // Initialize just the dimensions and format of your buffers:
            src.allocator()->init(TensorInfo(640, 480, Format::U8));
        }
        else
        {
            ppm.open(argv[1]);
            // Initialize just the dimensions and format of your buffers:
            ppm.init_image(src, Format::U8);
        }
        // Initialize just the dimensions and format of the temporary and destination images:
        tmp.allocator()->init(*src.info());
        dst.allocator()->init(*src.info());
        // Apply a Gaussian 3x3 filter to the source image followed by a Gaussian 5x5:
        // The function will automatically update the padding information inside input and output to match its requirements
        conv3x3.configure(&src, &tmp, gaussian3x3, 0 /* Let arm_compute calculate the scale */, BorderMode::UNDEFINED);
        conv5x5.configure(&tmp, &dst, gaussian5x5, 0 /* Let arm_compute calculate the scale */, BorderMode::UNDEFINED);
        // Now that the padding requirements are known we can allocate the images:
        src.allocator()->allocate();
        tmp.allocator()->allocate();
        dst.allocator()->allocate();
        // Fill the input image with the content of the PPM image if a filename was provided:
        if(ppm.is_open())
        {
            ppm.fill_image(src);
            output_filename = std::string(argv[1]) + "_out.ppm";
        }

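Note
It is important to call allocate after the function is configured: if the image / tensor is already allocated, the function will shrink its execution window instead of increasing the padding. (See below for more details.)

Manual padding / no padding / auto padding: You can allocate your images / tensors up front (before configuring your functions). In that case the function will use whatever padding is available and will shrink its execution window if there isn't enough padding available (which translates into a smaller valid region for the output). See also Valid regions. If you don't want to set the padding manually but still want to allocate your objects up front, you can use auto_padding. It guarantees that the allocation will have enough padding to run any of the provided functions.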

Image     src, dst;
// Use auto padding for the input:
src.info()->init_auto_padding(TensorShape(640u,480u), Format::U8);
// Use manual padding for the destination image
dst.info()->init(src.info()->tensor_shape(), Format::U8, strides_in_bytes, offset_first_element_in_bytes, total_size_in_bytes);
// Allocate all the images
src.allocator()->allocate();
dst.allocator()->allocate();
// Fill the input image with the content of the PPM image if a filename was provided:
fill_image(src);
NEGaussian3x3 gauss;
// Apply a Gaussian 3x3 filter to the source image (Note: if the padding provided is not enough then the execution window and valid region of the output will be shrunk)
gauss.configure(&src, &dst, BorderMode::UNDEFINED);
//Execute the functions:
gauss.run();

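Warning
Some kernels need up to 3 neighbor values to calculate the value of a given pixel. Therefore, to be safe, we use a 4-pixel padding all around the image. In addition, some kernels read and write up to 32 pixels at the same time. To cover that case as well, we add an extra 32 pixels of padding at the end of each row. As a result, auto-padded buffers waste a lot of memory and are less cache friendly. It is therefore recommended to use accurate padding or manual padding wherever possible.

Valid regions

Some kernels (edge detectors, for example) need to read the values of neighboring pixels to calculate the value of a given pixel, so it is not possible to calculate the values of the pixels on the edges.

Another case: if a kernel processes 8 pixels per iteration, the image's dimensions are not a multiple of 8 and not enough padding is available, then the kernel will not be able to process the pixels near the right edge. As a result these pixels will be left undefined.

In order to know which pixels have been calculated, each kernel sets a valid region for each output image or tensor. See also TensorInfo::valid_region(), ValidRegion.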
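
The two cases described above can be combined in a simplified one-dimensional model. This is only a sketch under stated assumptions: no padding is available, the kernel skips `border` pixels on each side, and it processes exactly `step` pixels per iteration. Range and valid_range_1d are hypothetical names, and the library's actual computation may differ.

```cpp
#include <algorithm>
#include <cassert>

// Half-open range [start, end) of pixels that end up with defined values.
struct Range
{
    int start;
    int end;
};

// Simplified model: `border` pixels are skipped on each side and only whole
// groups of `step` pixels are processed, because no padding is available.
Range valid_range_1d(int width, int border, int step)
{
    assert(width >= 0 && border >= 0 && step > 0);
    const int available = std::max(width - 2 * border, 0);
    const int processed = (available / step) * step; // drop the incomplete last group
    return Range{ border, border + processed };
}
```

With a 20-pixel row, a 1-pixel border and 8 pixels per iteration, only the range [1, 17) is computed under this model. The last two interior pixels are left undefined even though they are not border pixels.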

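Tensors

Tensors are multi-dimensional arrays with a maximum of Coordinates::num_max_dimensions dimensions.

Depending on the number of dimensions, tensors can be interpreted as various objects. A scalar can be represented as a zero-dimensional tensor and a vector of numbers as a one-dimensional tensor. Further, an image is actually just a 2D tensor, a 3D tensor can be seen as an array of images and a 4D tensor as a 2D array of images.

Note
Most algorithms process images (i.e. a 2D slice of the tensor), so only padding along the X and Y axes is required (2D slices can be stored contiguously in memory).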

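Images and tensors description conventions

Image objects are defined by a Format and dimensions expressed as [width, height, batch].

Tensors are defined by a DataType plus a number of channels (always expected to be 1 for now), and their dimensions are expressed as [width, height, feature_maps, batch].

In other words, the lower three dimensions of a tensor specify a single input in [width, height, feature_maps], while any other specified dimension represents a batch in the appropriate dimension space. For example, a tensor with dimensions [128, 128, 64, 16] represents a 1D batch space with 16 batches, each element being 128 wide and 128 high with 64 feature maps. Each kernel specifies the expected layout of each of its tensors in its documentation.

Note
Unless specified otherwise in the kernel's or function's documentation, all tensor and image parameters passed must have identical dimensions.
Unless specified otherwise in the kernel's or function's documentation, the number of channels for tensors is expected to be 1 (for images, the number of channels is inferred from the Format).

Note
Regardless of the DataType used by a tensor, the ITensor::buffer() method always returns a uint8_t pointer, and all the metadata in TensorInfo is expressed in bytes. It is the user's responsibility to cast the pointer to the correct type.

For example, to read the element located at the coordinates (x, y) of a float tensor: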

float value = *reinterpret_cast<float*>(input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y)));
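
Because all metadata is expressed in bytes, the element offset is a plain sum of byte strides. The sketch below assumes a 2D layout described by a byte offset to the first element and one byte stride per axis. Layout2D, its fields and read_float are hypothetical stand-ins; only the name offset_element_in_bytes echoes the call used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical 2D byte layout: every quantity is in bytes, as in the text above.
struct Layout2D
{
    std::size_t offset_first_element; // bytes before element (0, 0), e.g. top/left padding
    std::size_t stride_x;             // bytes between horizontally adjacent elements
    std::size_t stride_y;             // bytes between vertically adjacent elements (row pitch)
};

// Byte offset of element (x, y) from the start of the buffer.
std::size_t offset_element_in_bytes(const Layout2D &l, std::size_t x, std::size_t y)
{
    return l.offset_first_element + x * l.stride_x + y * l.stride_y;
}

// Reads a float at (x, y) from a raw byte buffer.
float read_float(const std::vector<std::uint8_t> &buffer, const Layout2D &l, std::size_t x, std::size_t y)
{
    const std::size_t offset = offset_element_in_bytes(l, x, y);
    assert(offset + sizeof(float) <= buffer.size());
    float value;
    std::memcpy(&value, buffer.data() + offset, sizeof(float)); // memcpy avoids alignment assumptions
    return value;
}
```

Note that the row pitch stride_y can be larger than width * sizeof(float) when the rows are padded, which is why the stride has to come from the metadata rather than from the image width.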

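Working with images and tensors using iterators

The library provides some iterators to access objects' data. Iterators are created by associating a data object (an image or a tensor, for example) with an iteration window.

Iteration windows are defined by an array of dimensions, each of which consists of a start, end and step.

The execute_window_loop function takes an execution window, a lambda function and one or more iterators. It iterates through every element of the execution window, and for each element it updates the iterators accordingly and calls the lambda function.

Here are a couple of examples of how to use the iterators to fill / read tensors: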

        constexpr unsigned int width  = 4;
        constexpr unsigned int height = 3;
        constexpr unsigned int batch  = 2;
        src_data = new float[width * height * batch];
        dst_data = new float[width * height * batch];
        // Fill src_data with dummy values:
        for(unsigned int b = 0; b < batch; b++)
        {
            for(unsigned int h = 0; h < height; h++)
            {
                for(unsigned int w = 0; w < width; w++)
                {
                    src_data[b * (width * height) + h * width + w] = static_cast<float>(100 * b + 10 * h + w);
                }
            }
        }
        // Initialize the tensors dimensions and type:
        const TensorShape shape(width, height, batch);
        input.allocator()->init(TensorInfo(shape, 1, DataType::F32));
        output.allocator()->init(TensorInfo(shape, 1, DataType::F32));
        // Configure softmax:
        softmax.configure(&input, &output);
        // Allocate the input / output tensors:
        input.allocator()->allocate();
        output.allocator()->allocate();
        // Fill the input tensor:
        // Simplest way: create an iterator to iterate through each element of the input tensor:
        Window input_window;
        input_window.use_tensor_dimensions(input.info()->tensor_shape());
        std::cout << " Dimensions of the input's iterator:\n";
        std::cout << " X = [start=" << input_window.x().start() << ", end=" << input_window.x().end() << ", step=" << input_window.x().step() << "]\n";
        std::cout << " Y = [start=" << input_window.y().start() << ", end=" << input_window.y().end() << ", step=" << input_window.y().step() << "]\n";
        std::cout << " Z = [start=" << input_window.z().start() << ", end=" << input_window.z().end() << ", step=" << input_window.z().step() << "]\n";
        // Create an iterator:
        Iterator input_it(&input, input_window);
        // Iterate through the elements of src_data and copy them one by one to the input tensor:
        // This is equivalent to:
        // for( unsigned int z = 0; z < batch; ++z)
        // {
        //   for( unsigned int y = 0; y < height; ++y)
        //   {
        //     for( unsigned int x = 0; x < width; ++x)
        //     {
        //       *reinterpret_cast<float*>( input.buffer() + input.info()->offset_element_in_bytes(Coordinates(x,y,z))) = src_data[ z * (width*height) + y * width + x];
        //     }
        //   }
        // }
        // Except it works for an arbitrary number of dimensions
        execute_window_loop(input_window, [&](const Coordinates & id)
        {
            std::cout << "Setting item [" << id.x() << "," << id.y() << "," << id.z() << "]\n";
            *reinterpret_cast<float *>(input_it.ptr()) = src_data[id.z() * (width * height) + id.y() * width + id.x()];
        },
        input_it);
        // More efficient way: create an iterator to iterate through each row (instead of each element) of the output tensor:
        Window output_window;
        output_window.use_tensor_dimensions(output.info()->tensor_shape(), /* first_dimension =*/Window::DimY); // Iterate through the rows (not each element)
        std::cout << " Dimensions of the output's iterator:\n";
        std::cout << " X = [start=" << output_window.x().start() << ", end=" << output_window.x().end() << ", step=" << output_window.x().step() << "]\n";
        std::cout << " Y = [start=" << output_window.y().start() << ", end=" << output_window.y().end() << ", step=" << output_window.y().step() << "]\n";
        std::cout << " Z = [start=" << output_window.z().start() << ", end=" << output_window.z().end() << ", step=" << output_window.z().step() << "]\n";
        // Create an iterator:
        Iterator output_it(&output, output_window);
        // Iterate through the rows of the output tensor and copy them to dst_data:
        // This is equivalent to:
        // for( unsigned int z = 0; z < batch; ++z)
        // {
        //   for( unsigned int y = 0; y < height; ++y)
        //   {
        //     memcpy( dst_data + z * (width*height) + y * width, output.buffer() + output.info()->offset_element_in_bytes(Coordinates(0,y,z)), width * sizeof(float));
        //   }
        // }
        // Except it works for an arbitrary number of dimensions
        execute_window_loop(output_window, [&](const Coordinates & id)
        {
            std::cout << "Copying one row starting from [" << id.x() << "," << id.y() << "," << id.z() << "]\n";
            // Copy one whole row:
            memcpy(dst_data + id.z() * (width * height) + id.y() * width, output_it.ptr(), width * sizeof(float));
        },
        output_it);

MemoryManager

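IMemoryManager is a memory managing interface that can be used to reduce the memory requirements of a given pipeline by recycling temporary buffers.

MemoryGroup, MemoryPool and MemoryManager components

MemoryGroup

IMemoryGroup defines the memory managing granularity.

MemoryGroup binds a number of objects to a bucket of memory requirements that need to be fulfilled in order for an operation or list of operations to be executed.

Backing memory for a specific group is requested with IMemoryGroup::acquire and released with IMemoryGroup::release.

Note
Two types of memory groups are currently implemented:
* MemoryGroup, which manages Tensor objects
* CLMemoryGroup, which manages CLTensor objects.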

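MemoryPool

IMemoryPool defines a pool of memory that can be used to provide backing memory to a memory group.

Note
BlobMemoryPool is currently implemented. It models the memory requirements as a vector of distinct memory blobs.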

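MemoryManager components

IMemoryManager consists of two components:

ILifetimeManager, which tracks the lifetime of the objects registered with the memory groups and, given an IAllocator, creates an appropriate memory pool that fulfils the memory requirements of all the registered memory groups.
IPoolManager, which safely manages the registered memory pools.

Note
IMemoryManager::finalize should be called once the configuration of all the memory groups, kernels and functions is done, so that the memory manager can allocate the appropriate backing memory.
BlobLifetimeManager is currently implemented. It models the memory requirements as a vector of distinct memory blobs.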

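Working with the Memory Manager

Using a memory manager to reduce the memory requirements of a pipeline can be summed up in the following steps:

Initially, a memory manager must be set up: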

Allocator  allocator{};                                                               // Create an allocator to use for the backing memory allocation
auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager

Once done, memory groups can be registered to use the memory manager:

MemoryGroup memory_group(mm); // Create a memory group and set the memory manager to use

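Note
If a memory manager is not specified, all allocations are performed immediately instead of being deferred through the memory manager.

The next step is to set the objects to be managed by the memory group. It is important to remember that the lifetime of an object is tracked through the MemoryGroup::manage() and TensorAllocator::allocate calls. MemoryGroup::manage flags that the object will be needed from now on, and calling TensorAllocator::allocate signals the end of the object's lifetime.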

Tensor tmp1, tmp2, tmp3;            // Create example tensors
memory_group.manage(&tmp1);         // Start managing object tmp1 and start its lifetime
memory_group.manage(&tmp2);         // Start managing object tmp2 and start its lifetime
operation1.configure(&tmp1, &tmp2); // Configure a function/kernel using tmp1 and tmp2
tmp1.allocator()->allocate();       // Flag that the lifetime of object tmp1 has ended
memory_group.manage(&tmp3);         // Start managing object tmp3 and start its lifetime
operation2.configure(&tmp2, &tmp3); // Configure a function/kernel using tmp2 and tmp3
tmp2.allocator()->allocate();       // Flag that the lifetime of object tmp2 has ended
tmp3.allocator()->allocate();       // Flag that the lifetime of object tmp3 has ended
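
Why tracking lifetimes saves memory can be seen with a toy tracker that records how many objects are alive at the same time. This is only an illustrative sketch: LifetimeTracker is a hypothetical class, it ignores object sizes, and it is not how BlobLifetimeManager is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Toy lifetime tracker (hypothetical, not the library's BlobLifetimeManager).
class LifetimeTracker
{
public:
    // Plays the role of MemoryGroup::manage(): the object's lifetime starts.
    void start(const std::string &name)
    {
        _live.insert(name);
        _peak = std::max<std::size_t>(_peak, _live.size());
    }

    // Plays the role of TensorAllocator::allocate() on a managed object: the lifetime ends.
    void end(const std::string &name)
    {
        assert(_live.count(name) == 1);
        _live.erase(name);
    }

    // Largest number of objects alive at once, i.e. how many buffers must exist simultaneously.
    std::size_t peak() const
    {
        return _peak;
    }

private:
    std::set<std::string> _live;
    std::size_t           _peak = 0;
};
```

Replaying the sequence above (start tmp1, start tmp2, end tmp1, start tmp3, end tmp2, end tmp3) gives a peak of 2: tmp1 and tmp3 are never alive together, so in this model two buffers are enough for three tensors.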

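Warning
The configuration step should be done sequentially by a single thread so that all the lifetimes are captured correctly.

When the configuration of all the operations is finished, the memory manager has to be finalized: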

mm->set_allocator(&allocator); // Set allocator to use
mm->set_num_pools(2);          // Set number of pools to create in case parallel operations can be run
mm->finalize();                // Finalize memory manager (Object lifetime check, Memory pool creation etc)

Finally, during execution of the pipeline, the memory of the appropriate memory group should be requested before running:

memory_group.acquire(); // Request memory for the group
operation1.run();       // Run operation1
operation2.run();       // Run operation2
memory_group.release(); // Release memory so that it can be reused

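Note
A pipeline can be executed in a multi-threaded environment because memory acquisition/release is thread safe.

Function support

Most of the library's functions have been ported to use IMemoryManager for their internal temporary buffers.

If that is the case, a memory manager can be passed to them during construction so that memory is reused among these functions.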

// Setup Memory Manager
CLBufferAllocator  allocator{};                                                       // Create an allocator to use for the backing memory allocation
auto lifetime_mgr  = std::make_shared<BlobLifetimeManager>();                         // Create Lifetime Manager
auto pool_mgr      = std::make_shared<PoolManager>();                                 // Create Pool Manager
auto mm            = std::make_shared<MemoryManagerOnDemand>(lifetime_mgr, pool_mgr); // Create Memory Manager
// Create two convolution layers and use the memory manager to manage their internal temporary buffers
CLConvolutionLayer conv1(mm), conv2(mm);
// Configure layers
conv1.configure(...);
conv2.configure(...);
// Finalize memory manager
mm->set_allocator(&allocator); // Set allocator to use
mm->set_num_pools(1);          // Set number of pools to create in case parallel operations can be run
mm->finalize();                // Finalize memory manager (Object lifetime check, Memory pool creation etc)
// Run layers (memory will be recycled for the internal buffers of conv1 and conv2)
conv1.run();
conv2.run();

OpenCL Tuner

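OpenCL kernels take two arguments when dispatched to the GPU:

The Global Workgroup Size (GWS): the number of times to run an OpenCL kernel to process all the elements we want to process.
The Local Workgroup Size (LWS): the number of elements we want to run in parallel on a GPU core at a given point in time.

The LWS can be required by an algorithm (for example if it contains memory barriers or uses local memory), but it can also be used to tweak the performance of a kernel: the execution time of the overall kernel might vary significantly depending on how the GWS is broken down.

However, there is no universal rule regarding which LWS is best for a given kernel, so instead we created the CLTuner.

When the CLTuner is enabled (Target = 2 for the graph examples), the first time an OpenCL kernel is executed the Compute Library tries to run it with a variety of LWS values and remembers which one performed best for subsequent runs. At the end of the run, the graph::Graph class tries to save these tuning parameters to a file.

However, this process takes quite a lot of time, which is why it cannot be enabled all the time.

When the CLTuner is disabled (Target = 1 for the graph examples), the graph::Graph class tries to reload the file containing the tuning parameters. Then, for each executed kernel, the Compute Library uses the tuned LWS if one is present in the file, or a default LWS value if not.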
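
The tuning strategy described above (measure each candidate once, then reuse the best) can be sketched without any OpenCL. ToyTuner is a hypothetical class, the `measure` callback stands in for timing a real kernel launch, and saving to or loading from a file is left out.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical tuner: tries every candidate LWS on first use and caches the fastest one.
class ToyTuner
{
public:
    // Returns a measured execution time for a given LWS (smaller is better).
    using Cost = std::function<double(unsigned int)>;

    // Returns the cached LWS for kernel_id, measuring all candidates only on the first call.
    unsigned int tune(const std::string &kernel_id, const std::vector<unsigned int> &candidates, const Cost &measure)
    {
        assert(!candidates.empty());
        const auto cached = _best.find(kernel_id);
        if(cached != _best.end())
        {
            return cached->second; // later runs pay no tuning cost
        }
        unsigned int best_lws  = candidates.front();
        double       best_time = std::numeric_limits<double>::max();
        for(unsigned int lws : candidates)
        {
            const double t = measure(lws);
            if(t < best_time)
            {
                best_time = t;
                best_lws  = lws;
            }
        }
        _best[kernel_id] = best_lws;
        return best_lws;
    }

private:
    std::map<std::string, unsigned int> _best;
};
```

The first call for a kernel costs one measurement per candidate, which mirrors why tuning is slow and is not left enabled permanently. Every later call is a map lookup.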

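Additional notes

Graph

Graph is a module that was added later and is not covered by the documentation above. Its namespace is arm_compute::graph. It contains many classes with the same names as those in arm_compute, so take care to tell them apart.

Graph overloads operator<<, which can be used to add Tensors and Nodes.

The overall structure of library calls is: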

Graph --> Node
Node --> Function
Function --> Kernel

In Graph::Private, _current_hints and _next_hints identify the target device between INodes.

    GraphHints                                  _current_hints{};
    GraphHints                                  _next_hints{};

_current_output and _current_input, in turn, identify where the Tensors are.

    ITensorObject *_current_output{ nullptr };
private:
    ITensorObject *_current_input{ nullptr };

Graph::Private::configure handles the case where Nodes are on different devices and instantiates the nodes, producing

 std::vector<Stage> _pipeline{};

GraphHints can set the target device and the convolution implementation method.

Scheduler

All of Scheduler's member functions are static and its constructor is private: it follows the singleton pattern.

    /** Sets the user defined scheduler and makes it the active scheduler.
     *
     * @param[in] scheduler A shared pointer to a custom scheduler implemented by the user.
     */
    static void set(std::shared_ptr<IScheduler> scheduler);
    /** Access the scheduler singleton.
     *
     * @return A reference to the scheduler object.
     */
    static IScheduler &get();
    /** Set the active scheduler.
     *
     * Only one scheduler can be enabled at any time.
     *
     * @param[in] t the type of the scheduler to be enabled.
     */
    static void set(Type t);
    /** Returns the type of the active scheduler.
     *
     * @return The current scheduler's type.
     */
    static Type get_type();
    /** Returns true if the given scheduler type is supported. False otherwise.
     *
     * @return true if the given scheduler type is supported. False otherwise.
     */
    static bool is_available(Type t);

private:
    static Type                        _scheduler_type;
    static std::shared_ptr<IScheduler> _custom_scheduler;
    Scheduler();

Scheduler::get returns an instance of a different class depending on the Type.

Scheduler::get --> SingleThreadScheduler::get
Scheduler::get --> CPPScheduler::get
Scheduler::get --> OMPScheduler::get
SingleThreadScheduler::get --> SingleThreadScheduler
CPPScheduler::get --> CPPScheduler
OMPScheduler::get --> OMPScheduler

SingleThreadScheduler --> IScheduler
CPPScheduler --> IScheduler
OMPScheduler --> IScheduler
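
The dispatch shown above (a static facade that forwards to one of several singleton back ends) can be sketched as follows. All names prefixed with Toy are hypothetical, the OpenMP back end is left out, and the CUSTOM enumerator is an assumption about how a user-supplied scheduler is selected. This is an illustration of the pattern, not the library's source.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface standing in for IScheduler.
struct IToyScheduler
{
    virtual ~IToyScheduler() = default;
    virtual std::string name() const = 0;
};

struct ToySingleThread : IToyScheduler
{
    std::string name() const override { return "ST"; }
    static ToySingleThread &get() { static ToySingleThread instance; return instance; }
};

struct ToyCpp : IToyScheduler
{
    std::string name() const override { return "CPP"; }
    static ToyCpp &get() { static ToyCpp instance; return instance; }
};

// Static facade: never instantiated, forwards to the active back end.
class ToyScheduler
{
public:
    enum class Type { ST, CPP, CUSTOM };

    static void set(Type t) { type() = t; }
    static void set(std::shared_ptr<IToyScheduler> scheduler)
    {
        assert(scheduler != nullptr);
        custom() = std::move(scheduler);
        type()   = Type::CUSTOM;
    }
    static IToyScheduler &get()
    {
        switch(type())
        {
            case Type::ST:
                return ToySingleThread::get();
            case Type::CPP:
                return ToyCpp::get();
            default:
                assert(custom() != nullptr);
                return *custom();
        }
    }

private:
    ToyScheduler() = default; // private constructor: the class is only used through static members
    static Type &type() { static Type t = Type::CPP; return t; }
    static std::shared_ptr<IToyScheduler> &custom() { static std::shared_ptr<IToyScheduler> c; return c; }
};
```

Calling ToyScheduler::get() always returns the same object for a given type, so switching back ends is a matter of calling set() before any work is scheduled.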

Default values of the build environment:

vars = Variables("scons")
vars.AddVariables(
    BoolVariable("debug", "Debug", False),
    BoolVariable("asserts", "Enable asserts (this flag is forced to 1 for debug=1)", False),
    BoolVariable("logging", "Logging (this flag is forced to 1 for debug=1)", False),
    EnumVariable("arch", "Target Architecture", "armv7a", allowed_values=("armv7a", "arm64-v8a", "arm64-v8.2-a", "x86_32", "x86_64")),
    EnumVariable("os", "Target OS", "linux", allowed_values=("linux", "android", "bare_metal")),
    EnumVariable("build", "Build type", "cross_compile", allowed_values=("native", "cross_compile", "embed_only")),
    BoolVariable("examples", "Build example programs", True),
    BoolVariable("Werror", "Enable/disable the -Werror compilation flag", True),
    BoolVariable("standalone", "Builds the tests as standalone executables, links statically with libgcc, libstdc++ and libarm_compute", False),
    BoolVariable("opencl", "Enable OpenCL support", True),
    BoolVariable("neon", "Enable Neon support", False),
    BoolVariable("gles_compute", "Enable OpenGL ES Compute Shader support", False),
    BoolVariable("embed_kernels", "Embed OpenCL kernels and OpenGL ES compute shaders in library binary", True),
    BoolVariable("set_soname", "Set the library's soname and shlibversion (requires SCons 2.4 or above)", False),
    BoolVariable("openmp", "Enable OpenMP backend", False),
    BoolVariable("cppthreads", "Enable C++11 threads backend", True),
    PathVariable("build_dir", "Specify sub-folder for the build", ".", PathVariable.PathAccept),
    ("extra_cxx_flags", "Extra CXX flags to be appended to the build command", "")
)

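At build time, SConstruct sets macro definitions according to the options. The scheduler used by default is CPPScheduler.
The clang shipped with NDK r16b should support OpenMP, so it may be worth trying.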

if env['cppthreads']:
    env.Append(CPPDEFINES = [('ARM_COMPUTE_CPP_SCHEDULER', 1)])

if env['openmp']:
    if cpp_compiler == 'clang++':
        print "Clang does not support OpenMP. Use scheduler=cpp."
        Exit(1)

    env.Append(CPPDEFINES = [('ARM_COMPUTE_OPENMP_SCHEDULER', 1)])
    env.Append(CXXFLAGS = ['-fopenmp'])
    env.Append(LINKFLAGS = ['-fopenmp'])

#if !ARM_COMPUTE_CPP_SCHEDULER && ARM_COMPUTE_OPENMP_SCHEDULER
Scheduler::Type Scheduler::_scheduler_type = Scheduler::Type::OMP;
#elif ARM_COMPUTE_CPP_SCHEDULER && !ARM_COMPUTE_OPENMP_SCHEDULER
Scheduler::Type Scheduler::_scheduler_type = Scheduler::Type::CPP;
#elif ARM_COMPUTE_CPP_SCHEDULER && ARM_COMPUTE_OPENMP_SCHEDULER
Scheduler::Type Scheduler::_scheduler_type = Scheduler::Type::CPP;
#else  /* ARM_COMPUTE_*_SCHEDULER */
Scheduler::Type Scheduler::_scheduler_type = Scheduler::Type::ST;
#endif /* ARM_COMPUTE_*_SCHEDULER */

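CPPScheduler::set_num_threads() sets the number of threads the scheduler uses to run kernels. If the num_threads parameter is set to 0, the maximum number of threads supported by C++11 is used. Otherwise, the specified number of threads is used.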

OperationRegistry

get --> OperationRegistry

The OperationRegistry class has a static function

    /** Gets operation registry instance
     *
     * @return Operation registry instance
     */
    static OperationRegistry &get();

Its constructor is private

private:
    /** Default Constructor */
    OperationRegistry();

OperationRegistrar --> OperationRegistry::add_operation

#define REGISTER_SIMPLE_OPERATION(NAME, TARGET, OP)                                \
    class NAME : public IOperation                                                 \
    {                                                                              \
    public:                                                                    \
        std::unique_ptr<arm_compute::IFunction> configure(NodeContext &ctx) final; \
        TargetHint target() const final                                            \
        {                                                                          \
            return TargetHint::TARGET;                                             \
        }                                                                          \
    };                                                                             \
    static detail::OperationRegistrar<NAME> NAME##_registrar(OP);                  \
    std::unique_ptr<arm_compute::IFunction> NAME::configure(NodeContext &ctx)

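graph/operations/NESimpleOperations.cpp and graph/operations/CLSimpleOperations.cpp validate and register each Operation for their respective targets.

Tensor and accessor

A Tensor in Graph holds arm_compute::ITensor and ITensorAccessor member pointers. access_tensor is the interface for accessing a given Tensor. DummyAccessor makes it possible to measure speed without loading a network model.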

    arm_compute::ITensor *set_target(TargetHint target) override;
    arm_compute::ITensor       *tensor() override;
    const arm_compute::ITensor *tensor() const override;
    TargetHint                  target() const override;
    void                        allocate() override;

private:
    TargetHint                            _target;   /**< Target that this tensor is pinned on */
    TensorInfo                            _info;     /**< Tensor metadata */
    std::unique_ptr<ITensorAccessor>      _accessor; /**< Tensor Accessor */
    std::unique_ptr<arm_compute::ITensor> _tensor;   /**< Tensor */

TopNPredictionsAccessor is called as follows:

Graph::run --> Tensor::call_accessor
Tensor::call_accessor --> TopNPredictionsAccessor::access_tensor
TopNPredictionsAccessor::access_tensor --> access_predictions_tensor

NEConvolutionLayer

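Basic function to simulate a convolution layer.

This function calls one of the following NEON functions:

NEGEMMConvolutionLayer (executed only in case GEMM is required for the operation)
NEWinogradLayer (executed only in case Winograd is required for the operation)
NEDirectConvolutionLayer (executed only in case Direct Convolution is required for the operation)

/runtime/NEON/functions/NEConvolution.h is oddly named: the convolution it defines only supports the U8 type.

get_convolution_method returns a hint for the convolution method.

On the CPU, WINOGRAD is used for F32 3x3 convolutions with a stride of 1 in both directions and a non-null bias. Otherwise GEMM is used.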

ConvolutionMethod NEConvolutionLayer::get_convolution_method(const ITensorInfo *input, const ITensorInfo *weights, const ITensorInfo *biases, const ITensorInfo *output, const PadStrideInfo &conv_info,
                                                             const WeightsInfo &weights_info)
{
    ARM_COMPUTE_UNUSED(output);
    ARM_COMPUTE_UNUSED(weights_info);
    if((input->data_type() == DataType::F32) && (weights->dimension(0) == 3) && (weights->dimension(1) == 3) && (weights->num_dimensions() <= 4) && (conv_info.stride().first == 1)
       && (conv_info.stride().second == 1) && (biases != nullptr))
    {
        return ConvolutionMethod::WINOGRAD;
    }
    return ConvolutionMethod::GEMM;
}

CLConvolutionLayer::get_convolution_method, in contrast, always returns GEMM.

ConvolutionMethod CLConvolutionLayer::get_convolution_method(const ITensorInfo *input, const ITensorInfo *weights, const ITensorInfo *biases, const ITensorInfo *output, const PadStrideInfo &conv_info,
                                                             const WeightsInfo &weights_info, const GPUTarget gpu_target)
{
    ARM_COMPUTE_UNUSED(input);
    ARM_COMPUTE_UNUSED(weights);
    ARM_COMPUTE_UNUSED(biases);
    ARM_COMPUTE_UNUSED(output);
    ARM_COMPUTE_UNUSED(conv_info);
    ARM_COMPUTE_UNUSED(weights_info);
    ARM_COMPUTE_UNUSED(gpu_target);

    return ConvolutionMethod::GEMM;
}

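The NEWinogradLayer function calls the following NEON kernels:

NEWinogradLayerTransformWeightsKernel (executed only once, in the first call to the run() method)
NEWinogradLayerTransformInputKernel
NEWinogradLayerTransformOutputKernel
NEWinogradLayerBatchedGEMMKernel
CPPPermute (three times: weights, input and output)

NEWinogradLayer has the following kernel and function members: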

private:
    std::unique_ptr<INEKernel> _batched_gemm_kernel;
    std::unique_ptr<INEKernel> _transform_input_kernel;
    std::unique_ptr<INEKernel> _transform_output_kernel;
    std::unique_ptr<INEKernel> _transform_weights_kernel;

    CPPPermute     _permute_input;
    CPPPermute     _permute_weights;
    CPPPermute     _permute_output;

void NEWinogradLayer::run()
{
    _memory_group.acquire();
    if(!_reshaped_kernel)
    {
        _reshaped_kernel = true;
        _permute_weights.run();
        NEScheduler::get().schedule(_transform_weights_kernel.get(), Window::DimX);
    }
    //Bring channels to the front as Winograd code expects the tensor to be in the format NHWC
    _permute_input.run();

    // Transform input tensor to the winograd domain
    NEScheduler::get().schedule(_transform_input_kernel.get(), Window::DimX);

    //Run 16 GEMMs in multiple threads, each kernel runs one or more GEMMs
    NEScheduler::get().schedule(_batched_gemm_kernel.get(), Window::DimX);

    // Transform output tensor to the spatial domain
    NEScheduler::get().schedule(_transform_output_kernel.get(), Window::DimX);

    // Reorder the convoluted output to ACL's ordering NCHW
    _permute_output.run();
    _memory_group.release();
}

The main computation function is NEWinogradLayerBatchedGEMMKernel

NEWinogradLayerBatchedGEMMKernel::run --> NEWinogradLayerTransformWeightsKernel
NEWinogradLayerTransformWeightsKernel --> WeightsTransform::run
WeightsTransform::run --> WeightsTransform::execute
NEWinogradLayerBatchedGEMMKernel::run --> NEWinogradLayerTransformInputKernel::run
NEWinogradLayerTransformInputKernel::run --> InputTransform::run
InputTransform::run --> InputTransform::execute
InputTransform::execute --> InputTransform::process_tile_row
InputTransform::process_tile_row --> Transform::tile_fns
Transform::tile_fns --> Transform::process_tile
NEWinogradLayerBatchedGEMMKernel::run --> NEWinogradLayerTransformOutputKernel::run
NEWinogradLayerTransformOutputKernel::run --> OutputTransform::run
OutputTransform::run --> OutputTransform::execute
OutputTransform::execute --> OutputTransform::process_tile_row 
OutputTransform::process_tile_row --> Transform::tile_fns
Transform::tile_fns --> Transform::process_tile
NEWinogradLayerBatchedGEMMKernel::run -->   winograd::BatchedBlockedGemm::run
winograd::BatchedBlockedGemm::run --> BlockedGemm

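WeightsTransform::execute is defined in the weights_*x*_*x*_fp32.cpp files under the
ComputeLibrary/arm_compute/core/NEON/kernels/convolution/winograd/transforms/ folder.
convolution/winograd/gemm/a64_sgemm.hpp and convolution/winograd/gemm/a64_sgemm_4x16.hpp provide assembly-optimized versions of BlockedGemm<8, 12, float, float> and BlockedGemm<4, 16, float, float> for ARM64.

PadStrideInfo

The order in the PadStrideInfo class name is the reverse of its actual input parameters, which are [stride, pad].
PadStrideInfo has two constructors in which DimensionRoundingType defaults to FLOOR. When it is not specified, the resulting dimensions may differ from those obtained with Caffe.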

PadStrideInfo   (   unsigned int    stride_x = 1,
        unsigned int    stride_y = 1,
        unsigned int    pad_x = 0,
        unsigned int    pad_y = 0,
        DimensionRoundingType   round = DimensionRoundingType::FLOOR 
    )
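
The effect of the rounding type on the output size can be shown with the usual formula round((in + 2 * pad - kernel) / stride) + 1. The sketch below is a standalone illustration assuming symmetric padding; Rounding and output_dim are hypothetical names, not library API.

```cpp
#include <cassert>

// Hypothetical stand-in for DimensionRoundingType.
enum class Rounding
{
    Floor,
    Ceil
};

// Output size along one dimension: round((in + 2 * pad - kernel) / stride) + 1,
// where `round` is either floor or ceil.
unsigned int output_dim(unsigned int in, unsigned int kernel, unsigned int stride, unsigned int pad, Rounding round)
{
    assert(stride > 0 && in + 2 * pad >= kernel);
    const unsigned int span  = in + 2 * pad - kernel;
    const unsigned int steps = (round == Rounding::Floor) ? span / stride : (span + stride - 1) / stride;
    return steps + 1;
}
```

For a 112-wide input with a 3-wide kernel, stride 2 and no padding, FLOOR gives 55 while CEIL gives 56. Caffe's pooling layers round up, so a layer ported with the default FLOOR can end up one element smaller than expected.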

CPPDetectionWindowNonMaximaSuppressionKernel

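CPPDetectionWindowNonMaximaSuppressionKernel is meant to be used alongside HOG or other object detection algorithms to perform non-maxima suppression on an IDetectionWindowArray.

Speed test

For an ACL test program, see tvm-mali/acl_test.cc. One issue found during actual testing is that running with a loaded model is slower than running without one, and the difference is especially noticeable on the GPU. The cause is unknown.

Summary

ARM Compute Library brings together ARM's own CPU and GPU resources. It can implement common image processing operations and can also perform deep learning inference. Given that OpenCV has also added a dnn module, the two overlap to a large extent.

Now for the drawbacks of ARM Compute Library:

The library is large and heavy, with more than 150 MB of source code.
For image processing, it does not support operations such as reading and converting images.
For deep learning applications, loading models is very inconvenient and no complete model conversion tool is provided.
The library is built with SCons rather than CMake, which is more common and works better with the Android NDK.
The low-level code is not optimized enough: on the CPU it is slower than Tencent/ncnn.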

Arm Compute Library Architecture