Hands-on introductory notes on Ascend operator development

This article is shared from the Huawei Cloud Community " Introduction Notes on Ascend Operator Development " by Jeff Ding.

Basic concepts

What is Ascend C

Ascend C is a programming language launched by CANN for operator development scenarios. It natively supports the C and C++ standards to best match users' development habits. Through key technologies such as multi-layer interface abstraction, automatic parallel computing, and twin debugging, it greatly improves operator performance and development efficiency, helping AI developers complete operator development, model tuning, and deployment at low cost.

Advantages of using Ascend C to develop custom operators

  • C/C++ primitive-level programming that closely matches users' development habits
  • A programming model that shields hardware differences and a programming paradigm that improves development efficiency
  • Multi-level API encapsulation, from simple to flexible, balancing ease of use and efficiency
  • Twin debugging: the CPU side simulates the behavior of the NPU side, so debugging and tuning can be done on the CPU side

Ascend Computing Architecture CANN

CANN introduction website: https://www.hiascend.com/software/cann

AI Core is the computing core of the NPU card. There are multiple AI Cores inside the NPU. Each AI Core is equivalent to a core in a multi-core CPU

SIMD

SIMD stands for Single Instruction, Multiple Data: one instruction can process multiple data items. The Ascend C programming APIs are mainly vector-compute and matrix-compute APIs, and these compute APIs are SIMD style.

SPMD data parallelism and pipeline parallelism in parallel computing

SPMD data parallelism principle

  • Start a set of processes that run the same program
  • Divide the data to be processed and distribute the slices to different processes.
  • Each process performs three tasks T1, T2, and T3 on its own data slice.

Pipeline parallelism principle

  • Start a set of processes
  • Segment the data
  • Each process processes all data slices but performs only one task on each input data slice.

Ascend C programming model and paradigm

Parallel Computing Architecture Abstraction

Operators developed in the Ascend C programming language run on the AI Core, the computing core of the Ascend AI processor.
An AI processor contains multiple AI Cores. Each AI Core includes core components such as computing units, storage units, and data transfer units.

The computing unit includes three basic computing resources

  1. Scalar computing unit: performs scalar calculations such as address computation and loop control, and dispatches vector-compute, matrix-compute, data-transfer, and synchronization instructions to the corresponding units for execution.
  2. Cube computing unit: responsible for performing matrix operations
  3. Vector computing unit: responsible for performing vector calculations

The transfer unit is responsible for moving data between Global Memory and Local Memory. It includes the MTE (Memory Transfer Engine) data transfer units, such as MTE2 (move-in) and MTE3 (move-out).

The storage unit is the internal storage of AI Core, collectively called Local Memory. Correspondingly, the external storage of AI Core is called Global Memory.

Asynchronous instruction flow

The Scalar computing unit reads the instruction sequence and sends the vector calculation, matrix calculation, and data transfer instructions to the instruction queue of the corresponding unit. The vector calculation unit, matrix calculation unit, and data transfer unit execute the received instructions asynchronously and in parallel.

Synchronization signal flow

There may be dependencies between instructions. In order to ensure that instructions between different instruction queues are executed according to the correct logical relationship, the Scalar computing unit will also issue synchronization instructions to the corresponding units.

Compute data flow

The DMA move-in unit moves data into Local Memory, the Vector/Cube computing unit completes the computation and writes the result back to Local Memory, and the DMA move-out unit moves the processed data back to Global Memory.

Introduction to SPMD programming model

Ascend C operator programming is SPMD programming. The data to be processed is split and distributed in parallel across multiple computing cores. The AI Cores share the same instruction code; the only difference between the running instances on each core is the value of block_idx. Analogous to a process ID, block_idx uniquely identifies each running instance and can be obtained in code with the GetBlockIdx() function.
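As a minimal sketch of this idea (the helper name SliceForThisCore is hypothetical, not part of the API), each running instance uses GetBlockIdx() to locate its own slice of the input in Global Memory:

__aicore__ inline __gm__ half* SliceForThisCore(__gm__ half* base, uint32_t blockLength)
{
    // block_idx differs per core, so each running instance addresses a different slice
    return base + GetBlockIdx() * blockLength;
}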


Kernel function writing and calling

A kernel function (Kernel Function) is the device-side entry point of an Ascend C operator. Ascend C lets users manage the code running on the device side through kernel functions, a syntax extension of C/C++ functions. Users write the operator logic in kernel functions, for example by defining custom operator classes and their member functions, to implement all functions of the operator. The kernel function is the bridge between the host side and the device side.
The kernel function is code executed directly on the device side. In a kernel function, the data access and computation to be performed are specified for the code executed on a single core; when the kernel function is called, the SPMD programming model lets multiple cores execute the same computing task in parallel.

Use function type qualifiers

In addition to defining the kernel function according to the C/C++ function declaration, additional function type qualifiers must be added to the kernel function, including __global__ and __aicore__

Use the __global__ function type qualifier to mark a kernel function that can be called with <<<...>>>; use the __aicore__ function type qualifier to mark that the function is executed on the device-side AI Core:

__global__ __aicore__ void kernel_name(argument list);


Use variable type qualifiers

For convenience, the unified type of pointer input variables is defined as __gm__ uint8_t*.

Users can uniformly use uint8_t type pointers and convert them into actual pointer types when used; they can also directly pass in the actual pointer type.
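For illustration, a minimal sketch of that conversion (the kernel name and parameters here are hypothetical):

extern "C" __global__ __aicore__ void cast_example(__gm__ uint8_t* x, __gm__ uint8_t* z)
{
    // reinterpret the unified uint8_t* pointers as the actual data types before use
    __gm__ half* xHalf = (__gm__ half*)x;
    __gm__ float* zFloat = (__gm__ float*)z;
    // ... xHalf and zFloat can now be handed to SetGlobalBuffer, etc. ...
}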


Rules and suggestions

  1. Kernel functions must have void return type
  2. Only input parameters that are pointer types or C/C++ built-in data types (Primitive Data Types) are supported, for example: half* s0, float* s1, int32_t c
  3. An encapsulated macro GM_ADDR is provided to avoid long function parameter lists (see the sketch after this list)
#define GM_ADDR __gm__ uint8_t* __restrict__
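With the macro, a kernel signature such as the hypothetical one below becomes more compact while remaining equivalent to the __gm__ uint8_t* __restrict__ form:

extern "C" __global__ __aicore__ void my_op(GM_ADDR x, GM_ADDR y, GM_ADDR z);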

Call kernel function

The calling statement of the kernel function is an extension of the C/C++ function calling statement.

Common C/C++ function calling methods are as follows:

function_name(argument list);

The kernel function uses the kernel caller syntax <<<…>>> to specify the execution configuration of the kernel function:

kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);

Note: The kernel caller can only be called when compiling in NPU mode. This symbol cannot be recognized when compiling in CPU mode.

blockDim specifies on how many cores the kernel function will be executed. Each core that executes the kernel function is assigned a logical ID, represented by the built-in variable block_idx, numbered starting from 0. Different behavior can be defined for different logical cores, and the ID can be obtained in the operator implementation with the GetBlockIdx() function.

l2ctrl: a reserved parameter, fixed to nullptr for now.

stream: of type aclrtStream. A stream is a task queue; the application uses streams to manage task parallelism.

Use the kernel caller <<<…>>> to call the kernel function:

HelloWorld<<<8, nullptr, stream>>>(fooDevice);

blockDim is set to 8, which means the HelloWorld kernel function is invoked on 8 cores; each core executes the kernel function independently and in parallel. The stream can be created with aclrtCreateStream, which explicitly creates an aclrtStream. The argument list is set to the fooDevice input parameter.

The call of the kernel function is asynchronous. After the call of the kernel function ends, control is immediately returned to the host side.

The API that forces the host-side program to wait until all kernel functions have finished executing (it blocks the application until all tasks in the specified stream are completed; a synchronization interface) is aclrtSynchronizeStream:

aclError aclrtSynchronizeStream(aclrtStream stream);
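A short usage sketch, continuing the HelloWorld example above (stream and fooDevice are assumed to exist):

HelloWorld<<<8, nullptr, stream>>>(fooDevice);  // asynchronous: control returns to the host immediately
aclrtSynchronizeStream(stream);                 // block until all tasks in the stream have completed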

Introduction to Programming API

Ascend C operators are programmed using standard C++ syntax and a set of class library APIs

Calculation APIs: the scalar calculation API, vector calculation API, and matrix calculation API, which call the Scalar, Vector, and Cube computing units respectively

Data transfer API: to compute on Local Memory data, the data must first be moved from Global Memory to Local Memory, then the calculation interfaces complete the computation, and finally the result is moved from Local Memory back to Global Memory, for example with the DataCopy interface

Memory management API: used to allocate and manage memory, such as AllocTensor and FreeTensor interfaces

Task synchronization API: completes communication and synchronization between tasks, such as EnQue and DeQue interfaces. Different instructions are executed asynchronously and in parallel. In order to ensure that instructions between different instruction queues are executed according to the correct logical relationship, synchronous instructions need to be sent to different components.

The basic data types used for computation by the Ascend C APIs are tensors: GlobalTensor and LocalTensor

Four-level API definition

Four-level API definition: the APIs are divided into four levels (Level 0 to Level 3) according to user usage scenarios.

Level 3 API: operator overloading; supports +, -, *, /, =, |, &, ^, >, <, >=, <= to express computation simply, e.g. dst = src1 + src2

Level 2 API: continuous computation, e.g. Add(dst, src1, src2, count), which computes count contiguous elements of the source operands and writes them contiguously to the destination operand, solving the continuous-count computation problem for one-dimensional tensors.

Level 1 slice calculation API to solve slice calculation problems in multi-dimensional data (under development)

Level 0 API: feature-rich compute APIs that can fully exploit the hardware. They make full use of the powerful instructions of the CANN-series chips and support repeatTimes, repeatStride, and mask operations for each operand. A call looks like: Add(dst, src1, src2, repeatTimes, repeatParams);
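The same addition written at three of the levels, as a sketch only; dst, src1, and src2 are assumed to be LocalTensor operands of the same shape, and the parameter lists follow the forms quoted above:

dst = src1 + src2;                                // Level 3: operator overloading
Add(dst, src1, src2, count);                      // Level 2: count contiguous elements
Add(dst, src1, src2, repeatTimes, repeatParams);  // Level 0: per-operand repeat control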


Introduction to pipeline programming paradigm

The Ascend C programming paradigm divides the internal processing of an operator into multiple pipeline tasks (Stage), uses tensors as the data carriers, uses queues (Queue) for communication and synchronization between tasks, and uses the memory management module (Pipe) to manage the communication memory between tasks.

  • Fixed steps for rapid development
  • A development shortcut in the form of a unified code framework
  • Development experience summarized by practitioners
  • A programming mindset for specific scenarios
  • A customized methodology and development experience

Abstract programming model "TPIPE Parallel Computing"

Given the complex data flow of each generation of Da Vinci chips, and based on actual computing needs, the parallel programming paradigm abstracts and simplifies pipeline parallelism.

Core elements of Ascend C’s parallel programming paradigm

  • A set of parallel computing tasks
  • Communication and synchronization between tasks through queues
  • Programmers can autonomously express scheduling of parallel computing tasks and resources

Typical computing paradigms

  • Basic vector programming paradigm: the computing task is split into CopyIn, Compute, and CopyOut
  • Basic matrix programming paradigm: the computing task is split into CopyIn, Compute, Aggregate, and CopyOut
  • Complex vector/matrix programming paradigms: complex computation data flows are realized by combining the In/Out stages of the vector and matrix paradigms


Running tasks

Pipeline tasks (Stage) refer to parallel tasks scheduled by the main program in a single-core processor.

Within the kernel function, parallel processing of data can be implemented through pipeline tasks to improve performance.

For example, the work of a single core can be split into three pipeline tasks: Stage1, Stage2, and Stage3. Each task focuses on processing its data slice, and the arrows between tasks express data dependencies: for instance, only after Stage1 has processed Progress1 can Stage2 process Progress1.


If Progress n = 3, the data to be processed is divided into 3 slices. For the same data slice, the processing by Stage1, Stage2, and Stage3 has dependencies and must be serial; at the same point in time, different data slices can be processed in parallel by multiple pipeline task stages, thereby achieving task parallelism and improving performance.


Inter-task communication and synchronization

Data communications and synchronization manager

Queue is used in Ascend C to complete data communication and synchronization between tasks. Queue provides basic APIs such as EnQue and DeQue.

When Queue manages the different levels of physical memory on the NPU, it uses an abstract logical position (QuePosition) to express each level of storage (Storage Scope), replacing the concept of on-chip physical storage; developers therefore do not need to be aware of the hardware architecture.

Queue types (logical locations) in vector programming include: VECIN, VECOUT

Data carrier

Ascend C uses GlobalTensor and LocalTensor as the basic operating units of data. They are the objects directly operated on by the instruction APIs and are also the carriers of data.

Inter-task communication and synchronization in vector programming

Logical position (QuePosition) in vector programming: the storage location of moved-in data: VECIN, and the storage location of moved-out data: VECOUT.

Vector programming is mainly divided into three tasks: CopyIn, Compute, and CopyOut.

  • After transferring the input data from GlobalTensor to LocalTensor in the CopyIn task, you need to use EnQue to put the LocalTensor into the Queue of VECIN.
  • The Compute task waits for the LocalTensor in the Queue of VECIN to be dequeued before it can perform vector calculations. After the calculation is completed, use EnQue to put the calculation result LocalTensor into the Queue of VECOUT.
  • The CopyOut task waits for the LocalTensor in the Queue of VECOUT to be dequeued, and then copies it to the GlobalTensor.

Stage1: CopyIn task

Use the DataCopy interface to copy the GlobalTensor to the LocalTensor

Use EnQue to put LocalTensor into VECIN's Queue

Stage2: Compute task

Use DeQue to get LocalTensor from VECIN

Complete the vector calculation using the Ascend C instruction API: Add

Use EnQue to put the result LocalTensor into the Queue of VECOUT

Stage3: CopyOut task

Use the DeQue interface to get the LocalTensor from the Queue of VECOUT

Use the DataCopy interface to copy LocalTensor to GlobalTensor


Memory management

The memory used for task data transfer is uniformly managed by the memory management module Pipe.

As an on-chip memory manager, Pipe provides the Queue memory initialization function through the InitBuffer interface. Developers can allocate memory for the specified Queue through this interface.

After the Queue memory has been initialized, call AllocTensor whenever memory is needed to allocate memory to a LocalTensor. When the created LocalTensor has completed the relevant calculations and is no longer needed, call FreeTensor to recycle its memory.
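A condensed sketch of this lifecycle (the queue name and sizes are illustrative; the full version appears in the Add sample later):

TPipe pipe;
TQue<QuePosition::VECIN, 2> inQueue;                     // queue depth 2 (double buffer)
pipe.InitBuffer(inQueue, 2, 128 * sizeof(half));         // give the queue 2 blocks of 128 half elements each
LocalTensor<half> xLocal = inQueue.AllocTensor<half>();  // take a tensor from the queue's memory
// ... move data in, enqueue/dequeue, compute ...
inQueue.FreeTensor(xLocal);                              // return the memory once it is no longer needed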


Temporary variable memory management

The memory for temporary variables used during programming is also managed through Pipe. Temporary variables can use the TBuf data structure to request storage space at a specified QuePosition, and use Get() to assign the allocated storage to a new LocalTensor; Get() can obtain either the full length of the TBuf or a LocalTensor of a specified length.

LocalTensor<T> Get<T>();
LocalTensor<T> Get<T>(uint32_t len);

Example of the TBuf and Get interfaces

// Allocate memory for TBuf initialization. The allocated memory length is 1024 bytes
TPipe pipe;
TBuf<TPosition::VECIN> calcBuf; // the template parameter is the VECIN position in QuePosition
uint32_t byteLen = 1024;
pipe.InitBuffer(calcBuf, byteLen);
// Get a tensor from calcBuf; this tensor uses all 1024 bytes of memory allocated by pipe
LocalTensor<int32_t> tempTensor1 = calcBuf.Get<int32_t>();
// Get a tensor from calcBuf; this tensor holds 128 int32_t elements, i.e. 512 bytes of memory
LocalTensor<int32_t> tempTensor2 = calcBuf.Get<int32_t>(128);

The memory space applied for using TBuf can only participate in calculations and cannot perform the enqueue and dequeue operations of the Queue queue.

Ascend C vector programming

Operator analysis

Development Process

Operator analysis: Analyze the mathematical expression, input, output and calculation logic implementation of the operator, and clarify the Ascend interface that needs to be called.

Kernel function definition: define the Ascend operator entry function

Implement the operator class according to the vector programming paradigm: complete the internal implementation of the kernel function


Take the element-wise Add operator (ElemWise ADD) as an example; its mathematical formula is z = x + y.

 

For the sake of simplicity, set the tensors x, y, z to a fixed shape (8, 2048), the data type dtype to half, the data layout format to ND, and the kernel function name to add_custom.

 

Operator analysis


Clarify the mathematical expression and calculation logic of the operator

The mathematical expression of the Add operator is z = x + y.

Computational logic: the input data must first be moved into on-chip storage, then the compute interface is used to perform the addition of the two inputs to obtain the final result, which is then moved back out to external storage.

 

Clear input and output

The Add operator has two inputs:

The input data type is half, and the output data type is the same as the input. The input has a fixed shape (8, 2048), the output shape is the same as the input shape, and the input data layout is ND.

 

Determine kernel function name and parameters

Name the kernel function, for example add_custom. Based on the inputs and outputs, the kernel function has three parameters x, y, z: x and y are the input memory addresses on Global Memory, and z is the output memory address on Global Memory.

Determine the interface required for operator implementation

Involving data transfer between internal and external storage, use the data transfer interface: DataCopy implementation

The addition involves vector computation and is implemented with the two-operand vector instruction: Add

LocalTensor is used, managed through Queue queues, with interfaces such as EnQue and DeQue.

Operator implementation

Kernel function definition

In the implementation of the add_custom kernel function, the KernelAdd operator class is instantiated, the Init() function is called to complete the memory initialization, and the Process() function is called to complete the core logic.

Note: There are no special requirements for operator class and member function names. Developers can decide the specific implementation of the kernel function based on their own C/C++ coding habits.

// implementation of kernel function
extern "C" __global__ __aicore__ void add_custom(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z)
{
	KernelAdd op;
	op.Init(x, y, z);
	op.Process();
}

For calling the kernel function, the built-in macro __CCE_KT_TEST__ is used to guard the <<<…>>> call so that it is compiled only in NPU mode (g++ in CPU mode does not recognize the <<<…>>> syntax). Wrapping the kernel call in a function allows other logic to be added to the wrapper; only the call to the kernel function is shown here.

#ifndef __CCE_KT_TEST__
// call of kernel function
void add_custom_do(uint32_t blockDim, void* l2ctrl, void* stream, uint8_t* x, uint8_t* y, uint8_t* z)
{
	add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}
#endif

Operator class implementation

CopyIn task: move the input tensors xGm and yGm from Global Memory to Local Memory and store them in xLocal and yLocal respectively.

Compute task: perform the addition on xLocal and yLocal and store the result in zLocal.

CopyOut task: move the output data from zLocal to the output tensor zGm on Global Memory.

The CopyIn and Compute tasks communicate and synchronize through the VECIN queues inQueueX and inQueueY.

The Compute and CopyOut tasks communicate and synchronize through the VECOUT queue outQueueZ.

The pipe memory management object uniformly manages the memory used for interaction between tasks and the memory used by temporary variables.


Code sample for vector addition z = x + y using the TPIPE pipeline programming paradigm


Operator class implementation

Operator class name: KernelAdd

Initialization function Init() and core processing function Process()

Three pipeline tasks: CopyIn(), Compute(), CopyOut()

Process meaning

The meaning of the BUFFER_NUM template parameter of TQue:

the depth of the Queue; a depth of 2 enables the double-buffer optimization technique

class KernelAdd {
public:
	__aicore__ inline KernelAdd() {}
	// Initialization function, completes the memory-initialization related operations
	__aicore__ inline void Init(__gm__ uint8_t* x, __gm__ uint8_t* y, __gm__ uint8_t* z) {}
	// Core processing function, implements the operator's computation logic by calling
	// the private member functions CopyIn, Compute, CopyOut
	__aicore__ inline void Process() {}

private:
	// Move-in function, completes the CopyIn stage, called by the Process function
	__aicore__ inline void CopyIn(int32_t progress) {}
	// Compute function, completes the Compute stage, called by the Process function
	__aicore__ inline void Compute(int32_t progress) {}
	// Move-out function, completes the CopyOut stage, called by the Process function
	__aicore__ inline void CopyOut(int32_t progress) {}

private:
	// pipe memory management object
	TPipe pipe;
	// Input data Queue management objects, QuePosition is VECIN
	TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
	// Output data Queue management object, QuePosition is VECOUT
	TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
	// Objects that manage the Global Memory addresses of the inputs and output; xGm and yGm are inputs, zGm is the output
	GlobalTensor<half> xGm, yGm, zGm;
};

Init() function implementation

To use multi-core parallel computing, the data needs to be sliced to obtain the memory offset address on Global Memory that each core actually needs to process.

The overall data length TOTAL_LENGTH is 8 * 2048, evenly distributed across 8 cores, so the data size BLOCK_LENGTH processed on each core is 2048. block_idx is the logical ID of the core, and (__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx() is the memory offset address on Global Memory of the input data for the core whose index is block_idx.

For the data processed on a single core, data slicing (tiling) can be applied to split it into 8 chunks; each chunk is split again into BUFFER_NUM = 2 blocks so that double buffering can be enabled to achieve parallelism between pipeline stages.

The 2048 elements a single core needs to process are thus divided into 16 blocks, each holding TILE_LENGTH = 128 elements. Pipe allocates BUFFER_NUM memory blocks of TILE_LENGTH * sizeof(half) bytes for inQueueX; each memory block can hold TILE_LENGTH = 128 half-type elements.


code example

constexpr int32_t TOTAL_LENGTH = 8 * 2048; // total length of data
constexpr int32_t USE_CORE_NUM = 8;  // num of cores used
constexpr int32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;  // length computed by each core
constexpr int32_t TILE_NUM = 8; // split data into 8 tiles per core
constexpr int32_t BUFFER_NUM = 2; // tensor num for each queue
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM; // separated into 2 parts, due to double buffer

__aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
{
	// get start address for the current core (core-level parallelism)
	xGm.SetGlobalBuffer((__gm__ half*)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
	yGm.SetGlobalBuffer((__gm__ half*)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
	zGm.SetGlobalBuffer((__gm__ half*)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
	// pipe allocates memory to the queues, the unit is bytes
	pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
	pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
	pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
}

Process() function implementation

image.png

code example

__aicore__ inline void Process()
{
	// loop count needs to be doubled, due to double buffer
	constexpr int32_t loopCount = TILE_NUM * BUFFER_NUM;
	// tiling strategy, pipeline parallel
	for (int32_t i = 0; i < loopCount; i++) {
		CopyIn(i);
		Compute(i);
		CopyOut(i);
	}
}

__aicore__ inline void CopyIn(int32_t progress)
{
	// alloc tensor from queue memory
	LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
	LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
	// copy progress_th tile from global tensor to local tensor
	DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
	DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
	// enque input tensors to VECIN queue
	inQueueX.EnQue(xLocal);
	inQueueY.EnQue(yLocal);
}

__aicore__ inline void Compute(int32_t progress)
{
	// deque input tensors from VECIN queue
	LocalTensor<half> xLocal = inQueueX.DeQue<half>();
	LocalTensor<half> yLocal = inQueueY.DeQue<half>();
	LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
	// call Add instr for computation
	Add(zLocal, xLocal, yLocal, TILE_LENGTH);
	// enque the output tensor to VECOUT queue
	outQueueZ.EnQue<half>(zLocal);
	// free input tensors for reuse
	inQueueX.FreeTensor(xLocal);
	inQueueY.FreeTensor(yLocal);
}

__aicore__ inline void CopyOut(int32_t progress)
{
	// deque output tensor from VECOUT queue
	LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
	// copy progress_th tile from local tensor to global tensor
	DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
	// free output tensor for reuse
	outQueueZ.FreeTensor(zLocal);
}

Double buffer mechanism

Double buffering hides data-transfer time and reduces the waiting time of vector instructions by overlapping data transfer with vector computation, ultimately improving the utilization of the vector computing unit. At any moment a tensor can take part in only one of the three pipeline tasks (move-in, compute, move-out), so the hardware involved in the other two pipeline tasks is left idle.

If the data to be processed is divided into two parts, for example Tensor1 and Tensor2:

  • While the vector computing unit performs Compute on Tensor1, Tensor2 can perform its CopyIn task.
  • While the vector computing unit performs Compute on Tensor2, Tensor1 can perform its CopyOut task.
  • While Tensor2 performs CopyOut, Tensor1 can already perform CopyIn for the next data slice.
    As a result, data move-in/move-out and vector computation run in parallel, which effectively alleviates idle hardware units.


Ascend C operator call

HelloWorld example

Header files included when running in CPU mode

Header files included when running in NPU mode

Definition of kernel function

Built-in macro __CCE_KT_TEST__: a flag used to distinguish CPU-mode and NPU-mode logic

Host-side execution logic: responsible for allocating host memory, copying data from host to device, synchronizing kernel function execution, and recycling resources

Device side execution logic

Host-side execution logic in CPU mode: use the encapsulated execution macro ICPU_RUN_KF

It mainly involves:

GmAlloc(…): apply for memory space in CPU mode

ICPU_RUN_KF: the encapsulated macro that executes the kernel function

GmFree: release memory space in CPU mode
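Putting these together, a sketch of the CPU-mode call path, modeled on the typical add_custom sample (the byte sizes and blockDim are assumptions):

uint8_t* x = (uint8_t*)AscendC::GmAlloc(inputByteSize);   // CPU-mode stand-in for Global Memory
uint8_t* y = (uint8_t*)AscendC::GmAlloc(inputByteSize);
uint8_t* z = (uint8_t*)AscendC::GmAlloc(outputByteSize);
// ... fill x and y with input data ...
ICPU_RUN_KF(add_custom, blockDim, x, y, z);               // run the kernel function in CPU mode
// ... compare z against the golden data ...
AscendC::GmFree((void*)x);
AscendC::GmFree((void*)y);
AscendC::GmFree((void*)z);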

Process

AscendCL initialization—>Run management resource application—>Host data transfer to Device—>Execute tasks and wait—>Device data transfer to Host—>Run resource release—>AscendCL deinitialization
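A host-side sketch of this flow in NPU mode (error checking is omitted; the buffer sizes, the yDevice/zDevice/zHost buffers, and the add_custom_do wrapper are assumed to be prepared the same way as shown for x):

aclInit(nullptr);                                         // AscendCL initialization
aclrtSetDevice(0);                                        // apply for run-management resources: device
aclrtStream stream = nullptr;
aclrtCreateStream(&stream);                               // apply for run-management resources: stream

uint8_t *xHost = nullptr, *xDevice = nullptr;
aclrtMallocHost((void**)&xHost, inputByteSize);           // host memory
aclrtMalloc((void**)&xDevice, inputByteSize, ACL_MEM_MALLOC_HUGE_FIRST);                 // device memory
aclrtMemcpy(xDevice, inputByteSize, xHost, inputByteSize, ACL_MEMCPY_HOST_TO_DEVICE);    // host -> device

add_custom_do(8, nullptr, stream, xDevice, yDevice, zDevice);  // launch the kernel on 8 cores
aclrtSynchronizeStream(stream);                           // execute the task and wait for it to finish

aclrtMemcpy(zHost, outputByteSize, zDevice, outputByteSize, ACL_MEMCPY_DEVICE_TO_HOST);  // device -> host

aclrtFree(xDevice);                                       // release run resources (repeat for y and z)
aclrtFreeHost(xHost);
aclrtDestroyStream(stream);
aclrtResetDevice(0);
aclFinalize();                                            // AscendCL deinitialization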

Execute NPU mode logic on the host side: use the kernel caller <<<…>>>

Important interfaces
  • aclInit
  • aclrtCreateStream
  • aclrtMallocHost
  • aclrtMalloc
  • aclrtMemcpy
  • <<<…>>>
  • aclrtSynchronizeStream
  • aclrtFree
  • aclrtFreeHost
  • aclrtDestroyStream
  • aclFinalize

AddCustomSample

Ascend C vector operator sample code

  1. Kernel function source file: add_custom.cpp
  2. Ground-truth data generation script: add_custom.py
  3. CMakeLists.txt: convenient for compiling multiple source files
  4. Helper functions for reading and writing data files: data_utils.h
  5. Host-side source file: main.cpp
  6. One-click execution script: run.sh
  7. CMake scripts that organize compilation in CPU mode and NPU mode
