This article is shared from the Huawei Cloud Community article "Introduction Notes on Ascend Operator Development" by Jeff Ding.
Basic concepts
What is Ascend C
Ascend C is a programming language launched by CANN for operator development scenarios. It natively supports C and C++ standard specifications to best match developers' habits, and it greatly improves operator performance and development efficiency through key technologies such as multi-layer interface abstraction, automatic parallel computing, and twin debugging, helping AI developers complete operator development, model tuning, and deployment at low cost.
Advantages of using Ascend C to develop custom operators
- C/C++ native programming, maximally matching developers' existing habits
- The programming model shields hardware differences, and the programming paradigm improves development efficiency
- Multi-level API encapsulation, from simple to flexible, balancing ease of use and efficiency
- Twin debugging: the CPU side simulates the behavior of the NPU side, so operators can be debugged and optimized on the CPU side
Ascend Computing Architecture CANN
CANN introduction website: https://www.hiascend.com/software/cann
AI Core is the computing core of the NPU card. There are multiple AI Cores inside the NPU; each AI Core is equivalent to a core in a multi-core CPU.
SIMD
SIMD (Single Instruction, Multiple Data) means that one instruction can process multiple data elements. The Ascend C programming API consists mainly of vector computation APIs and matrix computation APIs, and these computation APIs follow the SIMD style.
SPMD data parallelism and pipeline parallelism in parallel computing
SPMD data parallelism principle
- Start a set of processes that all run the same program
- Divide the data to be processed and distribute the slices to the different processes
- Each process performs the three tasks T1, T2, and T3 on its own data slice
Pipeline parallelism principle
- Start a set of processes
- Segment the data
- Each process processes all data slices and only performs one task on the input data slices.
Ascend C programming model and paradigm
Parallel Computing Architecture Abstraction
Operators developed using the Ascend C programming language run on the AI Core, which is the computing core in the Ascend AI processor.
There are multiple AI Cores inside an AI processor. The AI Core includes core components such as computing units, storage units, and data transfer units.
The computing unit includes three basic computing resources
- Scalar computing unit: performs scalar calculations such as address calculation and loop control, and dispatches vector calculation, matrix calculation, data transfer, and synchronization instructions to the corresponding units for execution
- Cube computing unit: responsible for performing matrix operations
- Vector computing unit: responsible for performing vector calculations
The transfer unit is responsible for moving data between Global Memory and Local Memory. It includes the MTE (Memory Transfer Engine) data transfer units, such as MTE2 (the move-in unit) and MTE3 (the move-out unit).
The storage unit is the internal storage of AI Core, collectively called Local Memory. Correspondingly, the external storage of AI Core is called Global Memory.
Asynchronous instruction flow
The Scalar computing unit reads the instruction sequence and sends the vector calculation, matrix calculation, and data transfer instructions to the instruction queue of the corresponding unit. The vector calculation unit, matrix calculation unit, and data transfer unit execute the received instructions asynchronously and in parallel.
Synchronization signal flow
There may be dependencies between instructions. In order to ensure that instructions between different instruction queues are executed according to the correct logical relationship, the Scalar computing unit will also issue synchronization instructions to the corresponding units.
Compute data flow
The DMA move-in unit moves data into Local Memory, the Vector/Cube computing unit completes the data calculation and writes the calculation result back to Local Memory, and the DMA move-out unit moves the processed data back to Global Memory.
Introduction to SPMD programming model
Ascend C operator programming is SPMD programming: the data to be processed is split and distributed in parallel across multiple computing cores. The AI Cores share the same instruction code, and the only difference between the running instances on each core is the value of block_idx. Similar to a process ID uniquely identifying a process, block_idx uniquely identifies a core's running instance; it can be obtained in the program with the GetBlockIdx() function.
Kernel function writing and calling
The kernel function (Kernel Function) is the device-side entry point of an Ascend C operator. Ascend C allows users to use kernel functions, a syntax extension of C/C++ functions, to manage code running on the device side. Users write the operator logic in kernel functions, for example by defining an operator class and its member functions to implement all the functionality of the operator. The kernel function is the bridge between the host side and the device side.
The kernel function is the code that is executed directly on the device side. In the kernel function, the data access and calculation operations to be performed need to be specified for the code executed on one core. The SPMD programming model allows multiple cores to execute the same computing task in parallel when the kernel function is called.
Use function type qualifiers
In addition to defining the kernel function according to the C/C++ function declaration, additional function type qualifiers must be added to the kernel function, including __global__ and __aicore__
Use the __global__ function type qualifier to identify that it is a core function that can be called by <<<...>>>; use the __aicore__ function type qualifier to identify that the function is executed on the device side AI Core
__global__ __aicore__ void kernel_name(argument list);
Use variable type qualifiers
For convenience, pointer input variables are given the unified type __gm__ uint8_t*.
Users can uniformly use uint8_t-type pointers and convert them to the actual pointer type when used, or they can directly pass in the actual pointer type.
rules or suggestions
- Kernel functions must have void return type
- Only input parameters of pointer types or C/C++ built-in data types (Primitive Data Types) are supported, for example: half* s0, float* s1, int32_t c
- Provides an encapsulated macro GM_ADDR to avoid long function parameter lists
#define GM_ADDR __gm__ uint8_t* __restrict__
Call kernel function
The calling statement of the kernel function is an extension of the C/C++ function calling statement.
Common C/C++ function calling methods are as follows:
function_name(argument list);
The kernel function uses the kernel caller syntax <<<…>>> to specify the execution configuration of the kernel function:
kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);
Note: The kernel caller can only be called when compiling in NPU mode. This symbol cannot be recognized when compiling in CPU mode.
blockDim specifies how many cores the kernel function will execute on. Each core executing the kernel function is assigned a logical ID, represented by the built-in variable block_idx, numbered starting from 0. Different behavior can be defined for different logical cores, and the ID can be obtained with the GetBlockIdx() function in the operator implementation.
l2ctrl, a reserved parameter, is set to the fixed value nullptr.
stream: of type aclrtStream. A stream is a task queue; the application uses streams to manage the parallelism of tasks.
Use the kernel caller <<<…>>> to call the kernel function:
HelloWorld<<<8, nullptr, stream>>>(fooDevice);
blockDim is set to 8, which means the HelloWorld kernel function is called on 8 cores; each core executes the kernel function independently and in parallel. The stream can be created with aclrtCreateStream; its function is to explicitly create an aclrtStream. The argument list is set to the input parameter fooDevice.
The call of the kernel function is asynchronous. After the call of the kernel function ends, control is immediately returned to the host side.
The API that forces the host-side program to wait until all kernel functions have finished executing (a synchronization interface that blocks the application until all tasks in the specified stream are completed) is aclrtSynchronizeStream:
aclError aclrtSynchronizeStream(aclrtStream stream);
Introduction to Programming API
Ascend C operators are programmed using standard C++ syntax and a set of class library APIs
Computation APIs: scalar, vector, and matrix computation APIs, which invoke the Scalar, Vector, and Cube computing units respectively
Data transfer APIs: calculations operate on Local Memory data, so data must first be transferred from Global Memory to Local Memory, then processed with the computation interfaces, and finally moved from Local Memory back to Global Memory; for example, the DataCopy interface
Memory management API: used to allocate and manage memory, such as AllocTensor and FreeTensor interfaces
Task synchronization API: completes communication and synchronization between tasks, such as EnQue and DeQue interfaces. Different instructions are executed asynchronously and in parallel. In order to ensure that instructions between different instruction queues are executed according to the correct logical relationship, synchronous instructions need to be sent to different components.
The basic data types used in Ascend C API calculations are tensors: GlobalTensor and LocalTensor.
4-level API definition
APIs are divided into four levels according to user usage scenarios.
Level 3 API: operator overloading, supporting +, -, *, /, =, |, &, ^, >, <, >=, <= to express calculations simply, similar to dst = src1 + src2
Level 2 API: continuous-calculation APIs, similar to Add(dst, src1, src2, count), which compute over count contiguous elements of the source operands and write the destination operand contiguously, solving continuous calculations over one-dimensional tensors
Level 1 slice calculation API to solve slice calculation problems in multi-dimensional data (under development)
Level 0 API: fully featured computation APIs that can exploit the full power of the hardware instructions of the CANN series chips, supporting repeatTimes, repeatStride, and MASK operations for each operand. A call looks like: Add(dst, src1, src2, repeatTimes, repeatParams);
Introduction to pipeline programming paradigm
The Ascend C programming paradigm divides the internal processing of an operator into multiple pipeline tasks (Stage), uses tensors as the data carrier, uses queues (Queue) for communication and synchronization between tasks, and uses the memory management module (Pipe) to manage the communication memory between tasks.
- Fixed steps for rapid development
- A unified code framework as a development shortcut
- Development experience summarized from real users
- Programming ideas for specific scenarios
- A methodology distilled from customized development experience
Abstract programming model "TPIPE Parallel Computing"
Given the complex data flow of each generation of Davinci chips, and based on actual computing needs, the parallel programming paradigm abstracts and simplifies this into pipeline parallelism.
Core elements of Ascend C’s parallel programming paradigm
- A set of parallel computing tasks
- Communication and synchronization between tasks through queues
- Programmers can autonomously express scheduling of parallel computing tasks and resources
- Basic vector programming paradigm: computing tasks are divided into CopyIn, Compute, CopyOut
- Basic matrix programming paradigm: computing tasks are divided into CopyIn, Compute, Aggregate, CopyOut
- Complex vector/matrix programming paradigm: realizes complex calculation data flows by combining the In/Out stages of the vector/matrix paradigms
Running tasks
Pipeline tasks (Stage) refer to parallel tasks scheduled by the main program in a single-core processor.
Within the kernel function, parallel processing of data can be implemented through pipeline tasks to improve performance.
For example, the work of a single-core processor can be split into three pipeline tasks: Stage1, Stage2, and Stage3, each focusing on processing its data slice. The arrows between stages express the data dependency between them: for example, only after Stage1 has processed Progress1 can Stage2 process Progress1.
If the number of data slices n = 3, the data to be processed is divided into 3 slices. For the same slice, the processing in Stage1, Stage2, and Stage3 has dependencies and must be serial; at the same point in time, however, different data slices can be processed in parallel by multiple pipeline task stages, thereby achieving task parallelism and improving performance.
Inter-task communication and synchronization
Data communications and synchronization manager
Queue is used in Ascend C to complete data communication and synchronization between tasks. Queue provides basic APIs such as EnQue and DeQue.
When Queue manages different levels of physical memory on the NPU, it uses an abstract logical position (QuePosition) to express each level of storage (Storage Scope), replacing the concept of on-chip physical storage, and developers do not need to be aware of the hardware architecture.
Queue types (logical locations) in vector programming include: VECIN, VECOUT
data carrier
Ascend C uses GlobalTensor and LocalTensor as the basic operating units of data. They are the objects directly operated on by the instruction APIs, and they are also the carriers of data.
Inter-task communication and synchronization in vector programming
Logical position (QuePosition) in vector programming: the storage location of moved-in data: VECIN, and the storage location of moved-out data: VECOUT.
Vector programming is mainly divided into three tasks: CopyIn, Compute, and CopyOut.
- After the CopyIn task transfers the input data from the GlobalTensor to a LocalTensor, it uses EnQue to put the LocalTensor into the VECIN Queue.
- The Compute task waits to dequeue the LocalTensor from the VECIN Queue before it can perform the vector calculation; after the calculation completes, it uses EnQue to put the result LocalTensor into the VECOUT Queue.
- The CopyOut task waits to dequeue the LocalTensor from the VECOUT Queue, then copies it to the GlobalTensor.
Stage1: CopyIn task
Use the DataCopy interface to copy the GlobalTensor to the LocalTensor
Use EnQue to put LocalTensor into VECIN's Queue
Stage2: Compute task
Use DeQue to get LocalTensor from VECIN
Complete the vector calculation using the Ascend C compute API Add
Use EnQue to put the result LocalTensor into the Queue of VECOUT
Stage3: CopyOut task
Use the DeQue interface to get the LocalTensor from the Queue of VECOUT
Use the DataCopy interface to copy LocalTensor to GlobalTensor
Memory management
The memory used for task data transfer is uniformly managed by the memory management module Pipe.
As an on-chip memory manager, Pipe provides the Queue memory initialization function through the InitBuffer interface. Developers can allocate memory for the specified Queue through this interface.
After the Queue queue memory is initialized, when memory is needed, allocate memory to LocalTensor by calling AllocTensor. When the created LocalTensor completes relevant calculations and no longer needs to be used, call FreeTensor to recycle the memory of LocalTensor.
Temporary variable memory management
Temporary variables used during programming are also managed through Pipe. Temporary variables can use the TBuf data structure to apply for storage space at a specified QuePosition, and use Get() to assign the allocated storage to a new LocalTensor; Get() can obtain either the full length of the TBuf or a LocalTensor of a specified length.
LocalTensor<T> Get<T>(); LocalTensor<T> Get<T>(uint32_t len);
Example of the TBuf and Get interfaces:
// Initialize the TBuf; the allocated memory length is 1024 bytes
TPipe pipe;
TBuf<TPosition::VECIN> calcBuf; // the template parameter is the VECIN position in QuePosition
uint32_t byteLen = 1024;
pipe.InitBuffer(calcBuf, byteLen);
// Get a Tensor from calcBuf; the Tensor takes all 1024 bytes allocated by pipe
LocalTensor<int32_t> tempTensor1 = calcBuf.Get<int32_t>();
// Get a Tensor of 128 int32_t elements from calcBuf; the memory size is 512 bytes
LocalTensor<int32_t> tempTensor2 = calcBuf.Get<int32_t>(128);
The memory space applied for using TBuf can only participate in calculations and cannot perform the enqueue and dequeue operations of the Queue queue.
Ascend C vector programming
Operator analysis
Development Process
Operator analysis: analyze the operator's mathematical expression, inputs, outputs, and calculation logic, and identify the Ascend C interfaces that need to be called.
Kernel function definition: define the entry function of the Ascend C operator.
Implement the operator class according to the vector programming paradigm: complete the internal implementation of the kernel function.
Take the ElemWise (Add) operator as an example.
For the sake of simplicity, set the tensors x, y, z to a fixed shape (8, 2048), the data type dtype to half type, the data arrangement type format to ND, and the kernel function name to add_custom
Operator analysis
Clarify the mathematical expression and calculation logic of the operator
The mathematical expression of the Add operator is z = x + y.
Calculation logic: the input data must first be moved into on-chip storage, then the compute interface is used to complete the addition, and finally the result is moved back out to external storage.
Clear input and output
The Add operator has two inputs, x and y:
The input data type is half, and the output data type is the same as the input data type. The inputs have the fixed shape (8, 2048), the output shape is the same as the input shape, and the input data layout format is ND.
Determine kernel function name and parameters
Name the kernel function, for example add_custom. According to the inputs and outputs, the kernel function has three parameters x, y, z: x and y are the memory addresses of the inputs on Global Memory, and z is the memory address of the output on Global Memory.
Determine the interface required for operator implementation
Data transfers between internal and external storage are involved: use the data transfer interface DataCopy.
The addition is a vector calculation: use the vector binary (two-operand) instruction Add.
LocalTensor and Queue management are used, involving interfaces such as EnQue and DeQue.
Operator implementation
Kernel function definition
In the implementation of the add_custom kernel function, the KernelAdd operator class is instantiated, the Init() function is called to complete the memory initialization, and the Process() function is called to complete the core logic.
Note: There are no special requirements for operator class and member function names. Developers can decide the specific implementation of the kernel function based on their own C/C++ coding habits.
// implementation of kernel function
extern "C" __global__ __aicore__ void add_custom(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)
{
    KernelAdd op;
    op.Init(x, y, z);
    op.Process();
}
For calling the kernel function, the built-in macro __CCE_KT_TEST__ marks that <<<…>>> is only compiled in NPU mode (the g++ compiler used in CPU mode does not recognize the <<<…>>> expression). The kernel function call is wrapped in a function, to which other logic can also be added; only the kernel call itself is shown here.
#ifndef __CCE_KT_TEST__
// call of kernel function
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
#endif
Operator class implementation
CopyIn task: move the input tensors xGm and yGm on Global Memory into Local Memory and store them in xLocal and yLocal respectively.
Compute task: perform the addition of xLocal and yLocal, storing the result in zLocal.
CopyOut task: move the output data in zLocal out to the output tensor zGm on Global Memory.
The CopyIn and Compute tasks communicate and synchronize through the VECIN queues inQueueX and inQueueY.
The Compute and CopyOut tasks communicate and synchronize through the VECOUT queue outQueueZ.
The pipe memory management object uniformly manages the memory used for interaction between tasks and the memory used by temporary variables.
Code sample for vector addition z = x + y using the TPipe pipeline programming paradigm
Operator class implementation
Operator class name: KernelAdd
Initialization function Init() and core processing function Process()
Three pipeline tasks: CopyIn(), Compute(), CopyOut()
Process meaning
The meaning of the BUFFER_NUM template parameter of TQue:
The depth of the Queue, used for the double buffer optimization technique
class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // Initialization function, completes memory-initialization related operations
    __aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z) {}
    // Core processing function, implements the calculation logic by calling the
    // private member functions CopyIn, Compute, CopyOut to complete the operator logic
    __aicore__ inline void Process() {}

private:
    // Move-in function, completes the CopyIn stage, called by Process
    __aicore__ inline void CopyIn(int32_t progress) {}
    // Compute function, completes the Compute stage, called by Process
    __aicore__ inline void Compute(int32_t progress) {}
    // Move-out function, completes the CopyOut stage, called by Process
    __aicore__ inline void CopyOut(int32_t progress) {}

private:
    // pipe memory management object
    TPipe pipe;
    // Input data Queue management objects, QuePosition is VECIN
    TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
    // Output data Queue management object, QuePosition is VECOUT
    TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    // Objects managing the Global Memory addresses of inputs and outputs;
    // xGm and yGm are the inputs, zGm is the output
    GlobalTensor<half> xGm, yGm, zGm;
};
Init() function implementation
To use multi-core parallel computing, the data needs to be sliced to obtain the memory offset address on Global Memory that each core actually needs to process.
The overall data length TOTAL_LENGTH is 8 * 2048, evenly distributed across 8 cores, so the data size BLOCK_LENGTH processed on each core is 2048. block_idx is the logical ID of the core, and (__gm__ half*)x + GetBlockIdx() * BLOCK_LENGTH is the memory offset address on Global Memory of the input data for the core with index block_idx.
Within a single core, the data can be sliced further (Tiling): the per-core data is split into 8 tiles, and each tile is split again into BUFFER_NUM = 2 parts so that double buffering can be enabled to achieve pipeline parallelism.
The 2048 elements a single core needs to process are thus divided into 16 blocks, each with TILE_LENGTH = 128 elements. Pipe allocates BUFFER_NUM memory blocks of size TILE_LENGTH * sizeof(half) bytes for inQueueX; each memory block can hold TILE_LENGTH = 128 half-type elements.
code example
constexpr int32_t TOTAL_LENGTH = 8 * 2048;                            // total length of data
constexpr int32_t USE_CORE_NUM = 8;                                   // num of cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;         // length computed by each core
constexpr int32_t TILE_NUM = 8;                                       // split data into 8 tiles per core
constexpr int32_t BUFFER_NUM = 2;                                     // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separated into 2 parts, due to double buffer

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
    // get start address for current core, core parallel
    xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
    // pipe alloc memory to queue, the unit is bytes
    pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}
Process() function implementation
code example
__aicore__ inline void Process()
{
    // loop count needs to be doubled, due to double buffer
    constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
    // tiling strategy, pipeline parallel
    for (int32_t i = 0; i < loopCount; i++) {
        CopyIn(i);
        Compute(i);
        CopyOut(i);
    }
}

__aicore__ inline void CopyIn(int32_t progress)
{
    // alloc tensors from queue memory
    LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
    LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
    // copy progress-th tile from global tensor to local tensor
    DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
    DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
    // enque input tensors to VECIN queue
    inQueueX.EnQue(xLocal);
    inQueueY.EnQue(yLocal);
}

__aicore__ inline void Compute(int32_t progress)
{
    // deque input tensors from VECIN queue
    LocalTensor<half> xLocal = inQueueX.DeQue<half>();
    LocalTensor<half> yLocal = inQueueY.DeQue<half>();
    LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
    // call Add instruction for computation
    Add(zLocal, xLocal, yLocal, TILE_LENGTH);
    // enque the output tensor to VECOUT queue
    outQueueZ.EnQue<half>(zLocal);
    // free input tensors for reuse
    inQueueX.FreeTensor(xLocal);
    inQueueY.FreeTensor(yLocal);
}

__aicore__ inline void CopyOut(int32_t progress)
{
    // deque output tensor from VECOUT queue
    LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
    // copy progress-th tile from local tensor to global tensor
    DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
    // free output tensor for reuse
    outQueueZ.FreeTensor(zLocal);
}
Double buffer mechanism
Double buffering hides data transfer time and reduces the waiting time of vector instructions by overlapping data transfer with vector computation, ultimately improving the utilization of the vector computing unit. A Tensor can participate in only one of the three pipeline tasks (move-in, compute, move-out) at a time, so the hardware units involved in the other two pipeline tasks would otherwise sit in the Idle state.
If the data to be processed is divided into two parts, for example Tensor1 and Tensor2:
- While the vector computing unit performs Compute on Tensor1, Tensor2 can perform its CopyIn task.
- While the vector computing unit performs Compute on Tensor2, Tensor1 can perform its CopyOut task.
- While Tensor2 performs its CopyOut task, Tensor1 can perform the CopyIn task for the next data slice.
As a result, the incoming and outgoing transfer of data and vector computation are parallelized, and the problem of idle hardware units is effectively alleviated.
Ascend C operator call
HelloWorld example
Header files included when running in CPU mode
Header files included when running in NPU mode
Definition of kernel function
Built-in macro __CCE_KT_TEST__: flag used to distinguish the CPU-mode logic from the NPU-mode logic
Host-side execution logic: Responsible for data application in host-side memory, copying from host to device, kernel function execution synchronization and resource recycling
Device side execution logic
Host-side execution in CPU mode: use the encapsulated execution macro ICPU_RUN_KF
This mainly involves:
GmAlloc(…): apply for memory space in CPU mode
ICPU_RUN_KF: the encapsulated execution macro
GmFree: release memory space in CPU mode
process
AscendCL initialization—>Run management resource application—>Host data transfer to Device—>Execute tasks and wait—>Device data transfer to Host—>Run resource release—>AscendCL deinitialization
Execute NPU mode logic on the host side: use the kernel caller <<<…>>>
Important interfaces:
- aclInit
- aclrtCreateStream
- aclrtMallocHost
- aclrtMalloc
- aclrtMemcpy
- <<<…>>>
- aclrtSynchronizeStream
- aclrtFree
- aclrtFreeHost
- aclrtDestroyStream
- aclFinalize
AddCustomSample
Ascend C vector operator sample code
- Kernel function source file: add_custom.cpp
- Truth value data generation script: add_custom.py
- CMakeLists.txt: convenient for compiling multiple source files
- Auxiliary function for reading and writing data files: data_utils.h
- Host side source file: main.cpp
- Execute the script with one click: run.sh
- CMake scripts organizing CPU-mode and NPU-mode compilation