Pure practical knowledge! One article covers everything you need to get started with Ascend C programming

This article is shared from Huawei Cloud Community " Ascend C Programming Introduction Tutorial ", author: Ascend CANN.

On May 6, 2023, at the Ascend AI Developer Summit, Huawei officially released Ascend C, a programming language for operator development. Ascend C natively supports C/C++ programming specifications and, through technologies such as multi-layer interface abstraction, a parallel programming paradigm, and twin debugging, it greatly improves operator development efficiency and helps AI developers complete operator development, model tuning, and deployment at low cost.

Ascend AI hardware and software foundation

Just as operators developed with CUDA run on GPUs, operators developed with Ascend C run on Ascend AI processors (NPUs for short) through the heterogeneous computing architecture CANN (Compute Architecture for Neural Networks). CANN is the software stack that enables the Ascend AI processor; through software-hardware co-optimization, it brings out the processor's full computing power. As the architecture diagram below shows, operators written in the Ascend C programming language are compiled by the compiler, scheduled by the runtime, and finally executed on the Ascend AI processor.

[Figure: CANN architecture. Ascend C operators are compiled by the compiler, scheduled by the runtime, and executed on the Ascend AI processor]

General-purpose computing is what we usually write and run on the CPU; it is good at logic control and serial computation. Compared with general-purpose computing, AI computing is better at parallel computing and can support large-scale, compute-intensive tasks. As shown on the left of the figure below, a matrix multiplication on the CPU requires three nested for loops, while on the right, using the vector computing unit of the Ascend AI processor, only two loops are needed: the innermost computation multiplies and adds multiple data elements at once. Going one step further, with the Cube computing unit a matrix multiplication can be completed with a single statement. This is what we call SIMD (Single Instruction, Multiple Data). That is why AI processors are usually used for massively parallel computation.

[Figure: matrix multiplication with three nested loops on the CPU vs. SIMD computation on the Ascend AI processor]
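To make the contrast concrete, below is a minimal illustrative sketch in plain C++ (not Ascend C): a scalar matrix multiplication with three nested loops, and a variant whose innermost row-wide update represents the unit of work a vector (SIMD) instruction performs in a single step. Sizes and names are arbitrary.

#include <vector>

constexpr int M = 64, K = 64, N = 64;

// Scalar version: three nested loops, one multiply-add per iteration.
void MatmulScalar(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& c)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < K; ++k) {
                c[i * N + j] += a[i * K + k] * b[k * N + j];
            }
        }
    }
}

// "Vector style" version: conceptually only two loops remain, because the whole
// row update in the innermost line is what a vector unit executes as one instruction.
void MatmulRowwise(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& c)
{
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            const float aik = a[i * K + k];
            for (int j = 0; j < N; ++j) {  // one row-wide multiply-accumulate
                c[i * N + j] += aik * b[k * N + j];
            }
        }
    }
}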

The NPU cannot run on its own; it works together with the CPU and can be regarded as a coprocessor of the CPU. The CPU runs the entire operating system, manages resources, and performs complex logic control, while the NPU is mainly responsible for parallel computing tasks. In a CPU+NPU heterogeneous computing architecture, the NPU and the CPU are connected through the PCIe bus and work together: the side where the CPU resides is called the host, and the side where the NPU resides is called the device. The schematic diagram is as follows:

[Figure: host (CPU) and device (NPU) connected through the PCIe bus]

Here is a closer look at the Ascend AI processor. Ascend AI processors come in different models and product forms, ranging from modules and accelerator cards to servers and clusters. The core part of the Ascend AI processor is the AI Core; there are multiple AI Cores, and they are the computing cores that accelerate neural networks. Each AI Core is comparable to a core in a multi-core CPU. Operators developed with the Ascend C programming language run on the AI Core, because the acceleration of the core neural network computation comes from the AI Core's computing power.

The abstraction of parallel computing architecture inside AI Core is shown in the figure below:

[Figure: abstraction of the parallel computing architecture inside the AI Core]

This abstracted parallel computing architecture contains several major components. Outside the AI Core there is a Global Memory, which is shared by multiple AI Cores. Inside the AI Core there is a Local Memory; because it is close to the computing units, its bandwidth is very high while its capacity is relatively small, typically a few hundred KB to 1 MB. The core components inside the AI Core are three computing units: the scalar computing unit, the vector computing unit, and the matrix (Cube) computing unit. There is also a DMA unit that moves data between Global Memory and Local Memory.

The asynchronous parallel computing process inside the AI Core works as follows: the scalar computing unit reads the instruction sequence and dispatches vector computing, matrix computing, and data transfer instructions to the instruction queues of the corresponding units; the vector computing unit, the matrix computing unit, and the data transfer unit then execute the received instructions asynchronously and in parallel. This corresponds to the instruction flow shown by the blue arrows in the figure above. There may be dependencies between instructions, so to ensure that instructions in different queues execute in the correct logical order, the scalar computing unit also issues synchronization instructions to the corresponding units. The synchronization between units corresponds to the synchronization signal flow shown by the orange arrows in the figure above.

The basic data processing flow inside the AI Core: the DMA unit moves data into Local Memory, the Vector/Cube computing unit completes the computation and writes the result back to Local Memory, and the DMA unit then moves the processed data back to Global Memory. This corresponds to the data flow shown by the red arrows in the figure above.

Ascend C Programming Model Fundamentals

Ascend C programming paradigm

The Ascend C programming paradigm is a pipelined programming paradigm. The processing in the operator kernel is divided into several pipeline tasks; communication and synchronization between tasks are handled through Queues, and a unified memory management module (Pipe) manages the memory used for inter-task communication. The pipeline programming paradigm applies the pipeline parallel computing method.

[Figure: pipeline programming paradigm: pipeline tasks communicate and synchronize through Queues, with memory managed by Pipe]

If n = 3, that is, the data to be processed is divided into 3 slices, the pipeline tasks above run as shown in the following schematic. As the figure shows, for the same data slice, the processing in Stage1, Stage2, and Stage3 has dependencies and must be serialized; at the same point in time, however, different data slices can be processed by multiple tasks in parallel, which achieves task parallelism and improves performance.

[Figure: pipeline task scheduling when the data is split into 3 slices]

Ascend C designs different pipeline tasks for Vector and Cube programming. Developers only need to implement the basic tasks; the underlying instruction synchronization and parallel scheduling are handled by the Ascend C framework, so developers do not need to worry about them.

Vector programming paradigm

The vector programming paradigm divides the implementation process of operators into three basic tasks: CopyIn, Compute, and CopyOut. CopyIn is responsible for the move-in operation, Compute is responsible for the vector calculation operation, and CopyOut is responsible for the move-out operation.

[Figure: the three basic tasks of the vector programming paradigm: CopyIn, Compute, and CopyOut]

We only need to complete the code implementation of basic tasks according to the programming paradigm, and the underlying instruction synchronization and parallel scheduling are implemented by the Ascend C framework.

How does Ascend C handle data communication and synchronization between different tasks? Ascend C provides the Queue management APIs, mainly the two queue operations EnQue and DeQue, together with a logical abstraction of memory.

The logical positions (QuePosition) used in vector programming are defined as follows:

  • The storage position of data that is moved in: VECIN;
  • The storage position of intermediate variables in computation: VECCALC;
  • The storage position of data to be moved out: VECOUT.

As described above, vector programming is divided into three tasks: CopyIn, Compute, and CopyOut. In the CopyIn task, after the input data is moved from Global Memory to Local Memory, EnQue is used to put the LocalTensor into the VECIN Queue. The Compute task waits until the LocalTensor can be dequeued from the VECIN Queue, performs the vector computation, and then uses EnQue to put the result LocalTensor into the VECOUT Queue. The CopyOut task waits until the LocalTensor can be dequeued from the VECOUT Queue and then copies it to Global Memory. In this way, the Queues handle the data communication and synchronization between the three tasks. The specific process is as follows:

  • Stage 1: CopyIn task.

    Use the DataCopy interface to copy data from the GlobalTensor to the LocalTensor.

    Use the EnQue interface to put the LocalTensor into the VECIN Queue.

  • Stage 2: Compute task.

    Use the DeQue interface to get the LocalTensor from VECIN.

    Use the Ascend C compute interfaces to perform the vector computation.

    Use the EnQue interface to put the result LocalTensor into the VECOUT Queue.

  • Stage 3: CopyOut task.

    Use the DeQue interface to take the LocalTensor out of the VECOUT Queue.

    Use the DataCopy interface to copy the LocalTensor to the GlobalTensor.

[Figure: data communication and synchronization between the CopyIn, Compute, and CopyOut tasks through the VECIN and VECOUT Queues]

In this way, the kernel implementation code is very clear: first initialize the memory and the Queues, and then implement the three stages CopyIn, Compute, and CopyOut according to the programming paradigm.
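As a minimal sketch of this structure (the class and member names such as KernelAdd and tileNum are illustrative placeholders, with the actual API calls only outlined in comments), a vector operator kernel typically looks like this:

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        // bind the GlobalTensors to the Global Memory addresses and
        // let Pipe allocate Queue memory for VECIN/VECOUT (details omitted)
    }
    __aicore__ inline void Process()
    {
        // drive each data tile through the three pipeline stages
        for (int32_t i = 0; i < tileNum; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }
private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        // DataCopy from Global Memory into a LocalTensor, then EnQue it into VECIN
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        // DeQue from VECIN, call a compute API such as Add, EnQue the result into VECOUT
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        // DeQue from VECOUT, DataCopy the result back to Global Memory
    }
    int32_t tileNum = 1;  // number of data tiles handled by this core (illustrative)
};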

SPMD parallel programming - multi-core

When introducing the Ascend AI processor earlier, we mentioned that it contains multiple AI Cores, so how can we make full use of them? Among the commonly used parallel computing methods there is SPMD (Single-Program Multiple-Data) data parallelism. Simply put, the data is divided into slices, and each slice goes through the complete processing pipeline. This matches the multiple cores of the Ascend AI processor well: we split the data into several parts, each part is processed on one core, and since all parts are processed in parallel, the whole data set gets processed. Ascend C programming is SPMD programming: multiple AI Cores share the same instruction code, and the only difference between the instances running on different cores is the built-in variable block_idx. We can therefore use block_idx to distinguish the cores; as long as the data addresses in Global Memory are split and offset accordingly, each core processes its own part of the data.

[Figure: SPMD data parallelism: the data is split into slices and each slice is processed on one AI Core]

When an operator is called, all computing cores execute the same implementation code, and the input arguments of the entry function are also the same. The address of the data processed on each core is obtained by adding an offset of block_idx * BLOCK_LENGTH (the data length processed by each block) to the starting address. In this way, data splitting for multi-core parallel computing is achieved.

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        // get start index for current core, core parallel
        GM_ADDR xGmOffset = x + BLOCK_LENGTH * GetBlockIdx();
        GM_ADDR yGmOffset = y + BLOCK_LENGTH * GetBlockIdx();
        GM_ADDR zGmOffset = z + BLOCK_LENGTH * GetBlockIdx();
        xGm.SetGlobalBuffer((__gm__ half*)xGmOffset, BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half*)yGmOffset, BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half*)zGmOffset, BLOCK_LENGTH);
        ...
    }
    ...
};

Introduction to Ascend C API

In the entire kernel implementation, the core line of code is Add(zLocal, xLocal, yLocal, TILE_LENGTH); the addition of all the data is completed through a single API provided by Ascend C. Yes, you read that right: one interface completes the whole computation.

Next, let's introduce the APIs provided by Ascend C. Ascend C operators are programmed using standard C++ syntax plus a set of class library APIs. The class library APIs mainly include the following categories, and you can choose the appropriate APIs as needed when implementing the kernel function:

[Figure: categories of the Ascend C class library APIs]

  • Compute APIs, including the scalar compute APIs, vector compute APIs, and matrix compute APIs, which invoke the Scalar, Vector, and Cube computing units to perform computation.
  • Data transfer APIs. The compute APIs above operate on data in Local Memory, so data must first be transferred from Global Memory to Local Memory, then computed with the compute interfaces, and finally moved back from Local Memory to Global Memory. The interfaces that perform these transfers are called data transfer interfaces, for example the DataCopy interface.
  • Memory management APIs, used to allocate and manage memory, such as the AllocTensor and FreeTensor interfaces.
  • Task synchronization APIs, which handle communication and synchronization between tasks, such as the EnQue and DeQue interfaces.

The calculation operands of Ascend C API are Tensor types: GlobalTensor and LocalTensor.

Having introduced the categories of Ascend C APIs, let's explain why a single Add interface can compute all the data: the Ascend C programming model is based on the SIMD (Single Instruction, Multiple Data) architecture, where a single instruction can operate on multiple data elements, and some advanced features of the instructions are encapsulated inside the API.
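As an illustration (the loop is pseudocode for the scalar view only; the Add call is the level-2 interface quoted above, operating on LocalTensors obtained as described in the paradigm section):

// Scalar view (pseudocode): one element per iteration would require an explicit loop,
//     for (int32_t i = 0; i < TILE_LENGTH; ++i) { z[i] = x[i] + y[i]; }
// SIMD view (Ascend C): the whole tile is processed by a single interface call,
// with the instruction-level details encapsulated inside the API.
Add(zLocal, xLocal, yLocal, TILE_LENGTH);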

Operator Execution Basic Process

As mentioned earlier, in the heterogeneous computing architecture the NPU and the CPU work together. In the Ascend C programming model we therefore need to implement both the code that runs on the NPU side and the code that runs on the CPU side. The NPU-side code is usually called the kernel implementation code, and the CPU-side code is usually called the host implementation code; a complete Ascend C program includes both. The basic execution flow of an Ascend C operator is as follows (a condensed code sketch follows the list):

  1. Initialize the Device device;
  2. Create a Context binding device;
  3. Allocate Host memory and initialize data;
  4. Allocate Device memory and copy data from Host to Device;
  5. Use the kernel call symbol <<<>>> to call the kernel function to complete the specified operation;
  6. Copy the operation result on Device back to Host;
  7. Release the requested resources.
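The following condensed sketch maps these seven steps onto the AscendCL calls introduced later in this article. It assumes a kernel named add_custom declared as in the next section; the buffer size is illustrative, the kernel's output pointer handling is simplified, and return codes should be checked in real code.

#include "acl/acl.h"

// assumed kernel declaration (see the next section)
extern "C" __global__ __aicore__ void add_custom(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z);

int32_t main()
{
    size_t size = 8 * 2048 * sizeof(uint16_t);                          // illustrative buffer size (half precision)
    aclInit(nullptr);                                                   // 1. initialize AscendCL and the Device
    aclrtSetDevice(0);
    aclrtContext context;
    aclrtCreateContext(&context, 0);                                    // 2. create the Context bound to the device
    aclrtStream stream;
    aclrtCreateStream(&stream);
    uint8_t *xHost, *yHost, *zHost, *xDevice, *yDevice, *zDevice;
    aclrtMallocHost((void**)&xHost, size);                              // 3. allocate Host memory (input data would be initialized here)
    aclrtMallocHost((void**)&yHost, size);
    aclrtMallocHost((void**)&zHost, size);
    aclrtMalloc((void**)&xDevice, size, ACL_MEM_MALLOC_HUGE_FIRST);     // 4. allocate Device memory
    aclrtMalloc((void**)&yDevice, size, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void**)&zDevice, size, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMemcpy(xDevice, size, xHost, size, ACL_MEMCPY_HOST_TO_DEVICE); //    copy inputs to the Device
    aclrtMemcpy(yDevice, size, yHost, size, ACL_MEMCPY_HOST_TO_DEVICE);
    add_custom<<<8, nullptr, stream>>>(xDevice, yDevice, zDevice);      // 5. launch the kernel on 8 cores
    aclrtSynchronizeStream(stream);
    aclrtMemcpy(zHost, size, zDevice, size, ACL_MEMCPY_DEVICE_TO_HOST); // 6. copy the result back to the Host
    aclrtFree(xDevice); aclrtFree(yDevice); aclrtFree(zDevice);         // 7. release the requested resources
    aclrtFreeHost(xHost); aclrtFreeHost(yHost); aclrtFreeHost(zHost);
    aclrtDestroyStream(stream);
    aclrtDestroyContext(context);
    aclrtResetDevice(0);
    aclFinalize();
    return 0;
}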

Kernel function introduction

In the flow above, the most important step is calling the kernel function to perform the parallel computing task. The kernel function is the entry point of an Ascend C operator on the device side; inside the kernel function, we specify the data access and computation to be performed by the code executed on the AI Core.

extern "C" __global__ __aicore__ void add_custom(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z);

The above is an example of a kernel function declaration. extern "C" means the kernel function is compiled and linked according to C-style compilation and linking conventions. The __global__ function type qualifier indicates that this is a kernel function, and the __aicore__ function type qualifier indicates that the kernel function executes on the AI Core on the device side. The variable type qualifier __gm__ in the parameter list indicates that the pointer points to an address in Global Memory. Note that the input parameters can only be pointers or C/C++ built-in data types; the pointer type used in the sample is uint8_t, which needs to be cast to the actual pointer type before use.

The kernel function in the Ascend C programming model is called by the kernel call symbol <<<...>>>, as follows:

kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);

kernel_name is the name of the kernel function above, and argument list is the list of arguments passed to the kernel function. Inside <<<...>>> there are 3 parameters:

  • blockDim specifies how many cores the kernel function will execute on; we can set it to 1 for now;
  • l2ctrl is a reserved parameter, temporarily set to the fixed value nullptr; we do not need to pay attention to it;
  • stream, created with aclrtCreateStream, is used for multi-thread scheduling.

Sample development explanation

Sample code structure

|-- CMakeLists.txt //Compile project files 

|-- cmake //Compile project files 

|-- data_utils.h //Data read in and write out functions 

|-- input //Store the input data directory generated by the script 

|-- leakyrelu_custom.cpp //Operator kernel implementation 

|-- leakyrelu_custom.py //Input data and ground truth data generation script file 

|-- leakyrelu_custom_tiling.h //host side tiling function 

|-- main.cpp //Main function, host-side call code, including CPU-domain and NPU-domain calls 

|-- output //The directory that stores the operator operation output data and benchmark data 

|-- readme.md //Execution command description 

|-- run.sh //Run script

Main files

Input data and ground truth data generation script file: KERNEL_NAME.py.

Write scripts that generate input data and ground truth data based on the input and output of the operator.

This example generates fp16 data of size 8 * 200 * 1024:

......

def gen_golden_data_simple():
    total_length_imm = 8 * 200 * 1024
    tile_num_imm = 8

    # Generate the tiling bin file
    total_length = np.array(total_length_imm, dtype=np.uint32)
    tile_num = np.array(tile_num_imm, dtype=np.uint32)
    scalar = np.array(0.1, dtype=np.float32)
    tiling = (total_length, tile_num, scalar)
    tiling_data = b''.join(x.tobytes() for x in tiling)
    with os.fdopen(os.open('./input/tiling.bin', WRITE_FILE_FLAGS, PEN_FILE_MODES_640), 'wb') as f:
        f.write(tiling_data)

    # Generate the input data
    input_x = np.random.uniform(-100, 100, [8, 200, 1024]).astype(np.float16)

    # Generate the golden data; the function is the same as LeakyRelu
    golden = np.where(input_x > 0, input_x, input_x * scalar).astype(np.float16)
    input_x.tofile("./input/input_x.bin")
    golden.tofile("./output/golden.bin")

Compile project file: CMakeLists.txt

Used to compile the Ascend C operator for CPU-side or NPU-side execution. The main thing to check is that all source files are listed in CMakeLists.txt.

The application that calls the operator: main.cpp

It mainly performs memory allocation, data copying, file reading and writing, and other operations, and finally calls the operator. The relevant APIs are introduced below:

1. The AscendCL initialization interface aclInit initializes the AscendCL runtime and is the first interface the program calls; aclrtCreateContext and aclrtCreateStream create the Context and the Stream, mainly for thread-related resource management.

2. The aclrtMallocHost interface, used to allocate memory on the Host:

aclError aclrtMallocHost(void **hostPtr, size_t size)

This function is similar to malloc in C: it allocates a number of bytes of memory on the Host, where hostPtr is a pointer to the allocated memory and size is the requested size. To release this memory, use the aclrtFreeHost interface, which corresponds to free in C.

3. The aclrtMalloc interface, used to allocate memory on the Device:

aclError aclrtMalloc(void **devPtr, size_t size, aclrtMemMallocPolicy policy)

Compared with the Host memory allocation interface, it has an additional policy parameter that sets the memory allocation rule, generally ACL_MEM_MALLOC_HUGE_FIRST. After use, the memory can be released with the corresponding aclrtFree interface.

4. The aclrtMemcpy interface, used to copy data between the Host and the Device:

The memory allocated above is divided into Host memory and Device memory, which raises the question of keeping data in sync between them. aclrtMemcpy is the interface used to transfer data between the Host and the Device:

aclError aclrtMemcpy(void *dst, size_t destMax, const void *src, size_t count, aclrtMemcpyKind kind)

Here src points to the data source and dst is the destination memory address; destMax is the maximum length of the destination memory, count is the number of bytes to copy, and aclrtMemcpyKind controls the copy direction: ACL_MEMCPY_HOST_TO_HOST, ACL_MEMCPY_HOST_TO_DEVICE, ACL_MEMCPY_DEVICE_TO_HOST, and ACL_MEMCPY_DEVICE_TO_DEVICE. For example, ACL_MEMCPY_HOST_TO_DEVICE copies data from the Host to the Device.

5. At the core, the kernel function is invoked. The CPU-side call is

ICPU_RUN_KF(leakyrelu_custom, blockDim, x, y, usrWorkSpace, tiling);

and the NPU-side call is

leakyrelu_custom_do(blockDim, nullptr, stream, xDevice, yDevice, workspaceDevice, tilingDevice);

The complete code is as follows:

// This file contains code for CPU debugging and NPU execution. We read data from a bin file and write the result to a file.

#include "data_utils.h"

#include "leakyrelu_custom_tiling.h"

#ifndef __CCE_KT_TEST__

#include "acl/acl.h"

extern void leakyrelu_custom_do(uint32_t coreDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y,

uint8_t* workspace, uint8_t* tiling);

#else

#include "tikicpulib.h"

extern "C" __global__ __aicore__ void leakyrelu_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling);

#endif

int32_t main(int32_t argc, char* argv[])

{

size_t tilingSize = sizeof(LeakyReluCustomTilingData);

size_t usrWorkspaceSize = 4096;

size_t sysWorkspaceSize = 16 * 1024 * 1024; 

uint32_t blockDim = 8; 

#ifdef __CCE_KT_TEST__ //CPU side call 

// allocate memory for storing workspace and tiling data 

uint8_t* usrWorkSpace = (uint8_t*)AscendC::GmAlloc(usrWorkspaceSize); 

uint8_t* tiling = (uint8_t*)AscendC::GmAlloc(tilingSize); 

ReadFile("./input/tiling.bin", tilingSize, tiling, tilingSize); 

size_t inputByteSize = blockDim * 200 * 1024 * sizeof(uint16_t); // uint16_t represent half 

size_t outputByteSize = blockDim * 200 * 1024 * sizeof(uint16_t); // uint16_t represent half 

// apply memory for storing input and output data 

uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize); 

uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize); 

//Get input data

ReadFile("./input/input_x.bin", inputByteSize, x, inputByteSize);

// PrintData(x, 16, printDataType::HALF);

// Execute on the AIV (AI Vector core)

AscendC::SetKernelMode(KernelMode::AIV_MODE);

// Call the kernel function

ICPU_RUN_KF(leakyrelu_custom, blockDim, x, y, usrWorkSpace, tiling); // use this macro for cpu debug

// PrintData(y, 16, printDataType::HALF);

WriteFile("./output/output_y.bin", y, outputByteSize);

AscendC::GmFree((void *)x);

AscendC::GmFree((void *)y);

AscendC::GmFree((void *)usrWorkSpace);

AscendC::GmFree((void *)tiling);

#else // NPU-side call

CHECK_ACL(aclInit(nullptr));

aclrtContext context;

int32_t deviceId = 0;

CHECK_ACL(aclrtSetDevice(deviceId));

CHECK_ACL(aclrtCreateContext(&context, deviceId));

aclrtStream stream = nullptr;

CHECK_ACL(aclrtCreateStream(&stream));

uint8_t *xHost, *yHost, *tilingHost, *workspaceHost;

uint8_t *xDevice, *yDevice, *tilingDevice, *workspaceDevice;

// Allocate tiling memory on the host and read in the tiling data

CHECK_ACL(aclrtMallocHost((void**)(&tilingHost), tilingSize));

ReadFile("./input/tiling.bin", tilingSize, tilingHost, tilingSize);

// Allocate workspace memory on the host

CHECK_ACL(aclrtMallocHost((void**)(&workspaceHost), tilingSize));

size_t inputByteSize = blockDim * 200 * 1024 * sizeof(uint16_t); // uint16_t represent half

size_t outputByteSize = blockDim * 200 * 1024 * sizeof(uint16_t); // uint16_t represent half

size_t workspaceByteSize = sysWorkspaceSize + usrWorkspaceSize;

// Allocate input/output memory on the host and device, and workspace and tiling memory on the device

CHECK_ACL(aclrtMallocHost((void**)(&xHost), inputByteSize));

CHECK_ACL(aclrtMallocHost((void**)(&yHost), inputByteSize));

CHECK_ACL(aclrtMallocHost((void**)(&workspaceHost), workspaceByteSize));

CHECK_ACL(aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));

CHECK_ACL(aclrtMalloc((void**)&yDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST));

CHECK_ACL(aclrtMalloc((void**)&tilingDevice, tilingSize, ACL_MEM_MALLOC_HUGE_FIRST));

CHECK_ACL(aclrtMalloc((void**)&workspaceDevice, workspaceByteSize, ACL_MEM_MALLOC_HUGE_FIRST));

ReadFile("./input/input_x.bin", inputByteSize, xHost, inputByteSize); 

// PrintData(xHost, 16, printDataType::HALF); 

// Copy input data and tiling data from host to device 

CHECK_ACL(aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE)); 

CHECK_ACL(aclrtMemcpy(tilingDevice, tilingSize, tilingHost, tilingSize, ACL_MEMCPY_HOST_TO_DEVICE)); 

//call kernel function 

leakyrelu_custom_do(blockDim, nullptr, stream, xDevice, yDevice, workspaceDevice, tilingDevice); 

//Wait for the kernel function to complete 

CHECK_ACL(aclrtSynchronizeStream(stream)); 

//Copy back the running result to host 

CHECK_ACL(aclrtMemcpy(yHost, outputByteSize, yDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST)); 

// PrintData(yHost, 16, printDataType::HALF);

WriteFile("./output/output_y.bin", yHost, outputByteSize);

// Release resources

CHECK_ACL(aclrtFree(xDevice));

CHECK_ACL(aclrtFree(yDevice));

CHECK_ACL(aclrtFree(workspaceDevice));

CHECK_ACL(aclrtFree(tilingDevice));

CHECK_ACL(aclrtFreeHost(xHost));

CHECK_ACL(aclrtFreeHost(yHost));

CHECK_ACL(aclrtFreeHost(workspaceHost));

CHECK_ACL(aclrtFreeHost(tilingHost));

CHECK_ACL(aclrtDestroyStream(stream));

CHECK_ACL(aclrtDestroyContext(context));

CHECK_ACL(aclrtResetDevice(deviceId));

CHECK_ACL(aclFinalize());

#endif

return 0;

}

One-click compile and run script run.sh

Compile and run the application.

Run the command on the cpu side:

bash run.sh leakyrelu_custom ascend910B1 VectorCore cpu

Run the command on the npu side:

bash run.sh leakyrelu_custom ascend910B1 VectorCore npu

The meaning of the parameters is as follows:

bash run.sh <kernel_name> <soc_version> <core_type> <run_mode>

<kernel_name> indicates the operator to be run.

<soc_version> indicates the model of the AI processor running the operator.

<core_type> means running on AI Core or Vector Core, and the parameter value is AiCore/VectorCore.

<run_mode> indicates that the operator runs in cpu mode or npu mode, and the parameter value is cpu/npu.

Kernel implementation

Function prototype definition

In this example, the function name is leakyrelu_custom. According to the analysis of the input and output of the operator, it is determined that there are two parameters x and y, where x is the input memory and y is the output memory. The prototype definition of the kernel function is as follows:

extern "C" __global__ __aicore__ void leakyrelu_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling){ }

The __global__ function type qualifier identifies it as a kernel function that can be called with <<<...>>>; the __aicore__ function type qualifier indicates that the kernel function executes on the device-side AI Core. For convenience, the GM_ADDR macro is used uniformly for the input parameters; the GM_ADDR macro is defined as:

#define GM_ADDR __gm__ uint8_t* __restrict__

Obtain the tiling data, then call the Init and Process functions of the operator class.

The Init function of the operator class completes the work related to memory initialization, and the Process function completes the core logic of the operator implementation.

extern "C" __global__ __aicore__ void leakyrelu_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)

{

GET_TILING_DATA(tilingData, tiling);

KernelLeakyRelu op;

op.Init(x, y, tilingData.totalLength, tilingData.tileNum, tilingData.scalar);

op.Process();

}

Encapsulate the call of the kernel function

After encapsulation we obtain the leakyrelu_custom_do function, which is convenient for the main program to call. #ifndef __CCE_KT_TEST__ indicates that this wrapper is only used when compiling and running the operator on the NPU side; when compiling and running on the CPU side, the leakyrelu_custom kernel can be invoked directly (via ICPU_RUN_KF). When calling the kernel function, in addition to the input and output parameters x and y and the tiling-related parameter tiling, we also pass blockDim (the number of cores on which the kernel function executes), l2ctrl (a reserved parameter, set to nullptr), and stream (which maintains the execution order of asynchronous operations) to specify the kernel's execution configuration.

#ifndef __CCE_KT_TEST__

// call of kernel function

void leakyrelu_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y,

uint8_t* workspace, uint8_t* tiling)

{

leakyrelu_custom<<<blockDim, l2ctrl, stream>>>(x, y, workspace, tiling);

}

#endif

Get tiling parameters

This mainly obtains the tiling parameters totalLength (the total data length), tileNum (the number of tiles, i.e., how many times each core loops over its data), and scalar (the LeakyRelu scalar) from tilingPointer (a sketch of the corresponding tiling structure follows the macro).

#define GET_TILING_DATA(tilingData, tilingPointer) \

LeakyReluCustomTilingData tilingData; \

INIT_TILING_DATA(LeakyReluCustomTilingData, tilingDataPointer, tilingPointer); \

(tilingData).totalLength = tilingDataPointer->totalLength; \

(tilingData).tileNum = tilingDataPointer->tileNum; \

(tilingData).scalar = tilingDataPointer->scalar;

#endif // LEAKYRELU_CUSTOM_TILING_H
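For reference, here is a plausible sketch of the tiling structure consumed by this macro, inferred from the fields used above and from the Python script that packs two uint32 values followed by one float32; the real leakyrelu_custom_tiling.h in the sample may differ.

#include <cstdint>

// Hypothetical layout; the actual header in the sample may differ.
struct LeakyReluCustomTilingData {
    uint32_t totalLength;  // total number of elements to process
    uint32_t tileNum;      // number of tiles each core loops over
    float scalar;          // LeakyRelu scalar coefficient
};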

Init function

After the tiling data is obtained, the Init function sets the Global Memory addresses handled by the current core and initializes the buffers.

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, uint32_t totalLength, uint32_t tileNum, float scalar)

{

ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");

this->blockLength = totalLength / GetBlockNum();

this->tileNum = tileNum;

this->scalar = static_cast<half>(scalar);

ASSERT(tileNum != 0 && "tile num can not be zero!");

this->tileLength = this->blockLength / tileNum / BUFFER_NUM;

// get start index for current core, core parallel

xGm.SetGlobalBuffer((__gm__ half*)x + this->blockLength * get_block_idx(), this->blockLength);

yGm.SetGlobalBuffer((__gm__ half*)y + this->blockLength * get_block_idx(), this->blockLength);

// pipe alloc memory to queue, the unit is Bytes

pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));

pipe.InitBuffer(outQueueY, BUFFER_NUM, this->tileLength * sizeof(half));

}

Process function

It mainly implements the three stages: CopyIn, Compute, and CopyOut.

__aicore__ inline void Process()

{

// loop count need to be doubled, due to double buffer

int32_t loopCount = this->tileNum * BUFFER_NUM;

// tiling strategy, pipeline parallel

for (int32_t i = 0; i < loopCount; i++) {

CopyIn(i);

Compute(i);

CopyOut(i);

}

}

CopyIn function

Responsible for copying data from Global Memory to Local Memory and enqueuing it into the Queue.

__aicore__ inline void CopyIn(int32_t progress)

{

// alloc tensor from queue memory

LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();

// copy progress_th tile from global tensor to local tensor

DataCopy(xLocal, xGm[progress * tileLength], tileLength);

// enque input tensors to VECIN queue

inQueueX.EnQue(xLocal);

}

Compute function

Responsible for dequeuing data from the Queue, performing the computation, and putting the result back into the Queue.

__aicore__ inline void Compute(int32_t progress)

{

// deque input tensors from VECIN queue

LocalTensor<half> xLocal = inQueueX.DeQue<half>();

LocalTensor<half> yLocal = outQueueY.AllocTensor<half>();

// call LeakyRelu instr for computation

LeakyRelu(yLocal, xLocal, scalar, tileLength);

// enque the output tensor to VECOUT queue

outQueueY.EnQue<half>(yLocal);

// free input tensors for reuse

inQueueX.FreeTensor(xLocal);

}

CopyOut function

Responsible for fetching data from Queue and copying data from Local Memory to Global Memory.

__aicore__ inline void CopyOut(int32_t progress)

{

// deque output tensor from VECOUT queue

LocalTensor<half> yLocal = outQueueY.DeQue<half>();

// copy progress_th tile from local tensor to global tensor

DataCopy(yGm[progress * tileLength], yLocal, tileLength);

// free output tensor for reuse

outQueueY.FreeTensor(yLocal);

}

Compile and execute

Execute on the CPU side

The execution results are as follows:

[Figure: CPU-side execution output: output_y.bin and golden.bin have the same MD5 value]

It can be seen that the final output result output_y.bin has the same MD5 value as the benchmark data golden.bin, indicating that the calculation results are the same.

After execution completes, the input data and tiling data are stored under input, the output data and benchmark data are stored under output, and the npu_check results for each core are in the npuchk directory.

There is also an executable binary file leakyrelu_custom_cpu in the current directory. If an error is reported during execution, you can debug this executable file through gdb. For specific debugging, please refer to the official tutorial at the end of the article.

Execute on NPU side

There are two ways to execute on the NPU side: simulation and on-board execution. The commands are the same; only the compilation options differ. By modifying the compilation option -DASCEND_RUN_MODE, we set it to SIMULATOR to run the CAModel simulation, or to ONBOARD to run on the board.

function compile_and_execute() { 

# Use cmake to compile cpu-side or npu-side operators, SIMULATOR or ONBOARD 

mkdir -p build; cd build; \ 

cmake .. \ 

-Dsmoke_testcase=$1 \ 

-DASCEND_PRODUCT_TYPE=$2 \ 

-DASCEND_CORE_TYPE=$3 \ 

-DASCEND_RUN_MODE="SIMULATOR" \ 

-DASCEND_INSTALL_PATH=$ASCEND_HOME_DIR 

VERBOSE=1 cmake --build . --target ${1}_${4} 

... 

}

References

In short, to learn Ascend C you only need to know C++ programming, understand the queue-based communication and memory allocation/release mechanisms, and call the corresponding compute and data transfer interfaces to write high-performance operators that run on the Ascend AI processor.

For more Ascend C learning resources, please visit the official tutorial: Ascend C Programming Guide (Official Tutorial)

Extra!

[Image: HUAWEI CONNECT 2023 banner]

Huawei will hold the 8th HUAWEI CONNECT at the Shanghai World Expo Exhibition Hall and Shanghai World Expo Center on September 20-22, 2023. With the theme of "Accelerating Industry Intelligence", the conference invites thought leaders, business elites, technical experts, partners, developers, and other industry peers to discuss how to accelerate industry intelligence from the perspectives of business, industry, and ecosystem.

We sincerely invite you to attend on site, share the opportunities and challenges of intelligent transformation, discuss its key measures, and experience the innovation and application of intelligent technology. You can:

  • Exchange views on accelerating industry intelligence in 100+ keynote speeches, summits, and forums
  • Visit the 17,000-square-meter exhibition area to experience the innovation and application of intelligent technology in the industry at close range
  • Meet face to face with technical experts to learn about the latest solutions, development tools, and hands-on practice
  • Explore business opportunities with customers and partners

Thank you for your support and trust as always, and we look forward to meeting you in Shanghai.

Official website of the conference: https://www.huawei.com/cn/events/huaweiconnect

Follow the "Huawei Cloud Developer Alliance" official account to get the conference agenda, activity highlights, and cutting-edge technical content.

