[Ascend C operator development (entry)] - Ascend C programming model and paradigm

1. Parallel computing architecture abstraction

Operators developed with Ascend C run on the AI Core, so we first need to understand its structure. An AI Core mainly consists of computing units, storage units, and data transfer units.

  • The computing unit provides three computing resources: the Scalar unit (scalar calculations), the Cube unit (matrix calculations), and the Vector unit (vector operations).
  • The data transfer unit is mainly responsible for moving data between Global Memory and Local Memory.
  • The storage unit includes internal storage (Local Memory) and external storage (Global Memory).

Data moves and is computed among these units, which involves three flows: the asynchronous instruction flow, the synchronous signal flow, and the computing data flow.

  • Asynchronous instruction flow: the computing units and the data transfer unit execute the instruction sequences they receive asynchronously and independently of one another.
  • Synchronous signal flow: synchronization signals ensure that dependent instructions execute in the correct logical order.
  • Computing data flow: the data transfer unit moves data into Local Memory and moves the processed results back to Global Memory.

The internal architecture diagram of AI Core is as follows:
[Figure: internal architecture of the AI Core]


2. Introduction to SPMD programming model

SPMD (Single Program, Multiple Data) is a parallel programming model in which the same program processes multiple data elements simultaneously. In Ascend C, the SPMD model is used to write parallel computing tasks so as to fully utilize the parallel computing capability of the Ascend AI processor.

The key points of the SPMD model are as follows:

  • Single Program: under the SPMD model, the program written is the same for all data elements; the same code executes on different data. This improves code maintainability and reusability. The only difference between cores is their block_idx.

  • Multiple Data: one program processes multiple data elements, which are usually stored in arrays or tensors. Each core runs the same program on its own slice of the data, so the slices are processed in parallel (see the sketch after this list).
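
A minimal sketch of how this per-core slicing typically looks in Ascend C (the total length, core count, and the offset helper below are illustrative assumptions, not from a specific sample; GetBlockIdx() is the Ascend C API that returns the current core's index):

constexpr uint32_t TOTAL_LENGTH = 8 * 2048;                     // assumed total element count
constexpr uint32_t USE_CORE_NUM = 8;                            // assumed number of AI Cores
constexpr uint32_t BLOCK_LENGTH = TOTAL_LENGTH / USE_CORE_NUM;  // elements handled per core

// Every core executes this same function; only GetBlockIdx() differs,
// so each core addresses a different slice of Global Memory.
__aicore__ inline uint32_t MyBlockOffset()
{
    return BLOCK_LENGTH * AscendC::GetBlockIdx();
}

Same code on every core, different data per core: that is exactly the SPMD idea.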


3. Kernel function writing and calling

An Ascend C kernel function is the device-side entry point of an operator; it is the function in which the high-performance parallel computing task is written.

3.1 Kernel function definition

[Figure: structure of a kernel function definition]

A kernel function definition mainly consists of three parts: the function type qualifiers, the kernel function name, and the parameter list.

1. Use the __global__ function type qualifier to mark it as a kernel function that can be called with <<<...>>>; use the __aicore__ function type qualifier to indicate that the kernel function executes on the device-side AI Core.

2. Pointer input parameters need the variable type qualifier __gm__, which indicates that the pointer points to an address in Global Memory.

3. The kernel function is launched with the kernel caller <<<...>>>, which specifies its execution configuration:
kernel_name<<<blockDim, l2ctrl, stream>>>(argument list);
The meaning of each parameter is as follows:

  • blockDim: specifies how many cores the kernel function will execute on.
  • l2ctrl: currently fixed to nullptr; developers do not need to care about it.
  • stream: of type aclrtStream; a stream is a task queue that keeps the asynchronous tasks submitted to it executing in order on the device.
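
For example, launching the add_custom kernel discussed below on 8 cores would look roughly like this (the core count and the stream variable are illustrative):

add_custom<<<8, nullptr, stream>>>(x, y, z);  // 8 AI Cores, default l2ctrl, user-created stream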

I did an operator development experiment with the Ascend C addition operator yesterday; the code address is: Gitee code repository
In this Add sample, the core operator code is in add_custom.cpp, where the kernel function is defined as:
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z) { KernelAdd op; op.Init(x, y, z); op.Process(); }
This code uses the __global__ __aicore__ function type qualifiers to indicate that the kernel function executes on the AI Core.
void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z): this is the kernel function's declaration; it accepts three GM_ADDR parameters named x, y, and z. GM_ADDR is a pointer type that points to Global Memory (GM), indicating that this kernel function operates on data residing in Global Memory.

KernelAdd op: instantiates the operator class, which provides methods such as operator initialization and core processing.
op.Init(x, y, z): the initialization function; it obtains the input and output addresses the kernel function needs to process and completes the necessary memory initialization.

op.Process(): the core processing function; it carries out the operator's core logic such as data transfer and computation.
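
For context, the KernelAdd class in such samples usually follows the standard Ascend C vector-operator pattern: CopyIn moves data from Global Memory to Local Memory, Compute runs the vector add, and CopyOut writes results back, with double buffering to overlap the stages. The sketch below illustrates that pattern under assumed tiling constants and the half data type; it is a simplified illustration, not the exact repository code:

#include "kernel_operator.h"
using namespace AscendC;

constexpr int32_t BLOCK_LENGTH = 2048;  // assumed elements handled by one core
constexpr int32_t TILE_NUM = 8;         // assumed tiles per core
constexpr int32_t BUFFER_NUM = 2;       // double buffering
constexpr int32_t TILE_LENGTH = BLOCK_LENGTH / TILE_NUM / BUFFER_NUM;

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z)
    {
        // bind this core's slice of Global Memory using its block index
        xGm.SetGlobalBuffer((__gm__ half *)x + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
        yGm.SetGlobalBuffer((__gm__ half *)y + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
        zGm.SetGlobalBuffer((__gm__ half *)z + BLOCK_LENGTH * GetBlockIdx(), BLOCK_LENGTH);
        // allocate Local Memory queues for the three pipeline stages
        pipe.InitBuffer(inQueueX, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, TILE_LENGTH * sizeof(half));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, TILE_LENGTH * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        for (int32_t i = 0; i < TILE_NUM * BUFFER_NUM; i++) {
            CopyIn(i);   // Global Memory -> Local Memory
            Compute(i);  // element-wise add in Local Memory
            CopyOut(i);  // Local Memory -> Global Memory
        }
    }
private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        DataCopy(xLocal, xGm[progress * TILE_LENGTH], TILE_LENGTH);
        DataCopy(yLocal, yGm[progress * TILE_LENGTH], TILE_LENGTH);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        LocalTensor<half> yLocal = inQueueY.DeQue<half>();
        LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        Add(zLocal, xLocal, yLocal, TILE_LENGTH);  // z = x + y
        outQueueZ.EnQue<half>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        LocalTensor<half> zLocal = outQueueZ.DeQue<half>();
        DataCopy(zGm[progress * TILE_LENGTH], zLocal, TILE_LENGTH);
        outQueueZ.FreeTensor(zLocal);
    }

    TPipe pipe;
    TQue<QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
    TQue<QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    GlobalTensor<half> xGm, yGm, zGm;
};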

After defining the kernel function, you can call it:
void add_custom_do(uint32_t blockDim, void *l2ctrl, void *stream, uint8_t *x, uint8_t *y, uint8_t *z)
{
    add_custom<<<blockDim, l2ctrl, stream>>>(x, y, z);
}

<<<blockDim, l2ctrl, stream>>>: this is the kernel caller's execution configuration, which specifies how the kernel function add_custom is launched. The syntax resembles CUDA's execution configuration, but here blockDim specifies how many AI Cores the kernel runs on (not CUDA thread blocks), l2ctrl is fixed to nullptr, and stream is an aclrtStream that keeps the asynchronous tasks submitted to it executing in order.
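
To make the host-side call concrete, a minimal launch flow could look like the following sketch (the device pointers xDevice, yDevice, and zDevice and the core count are illustrative; error handling and memory management are omitted):

// hypothetical host-side flow using the ACL runtime
aclrtStream stream = nullptr;
aclrtCreateStream(&stream);       // create a stream for asynchronous execution
constexpr uint32_t blockDim = 8;  // assumed number of AI Cores to use
add_custom_do(blockDim, nullptr, stream, xDevice, yDevice, zDevice);
aclrtSynchronizeStream(stream);   // block until the kernel finishes
aclrtDestroyStream(stream);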

Origin: blog.csdn.net/qq_45257495/article/details/133972607