Get started with Ascend C programming in 3 days丨Get to know the basic concepts and common interfaces of Ascend C

This article is shared from "[2023 CANN Training Camp, Season 1] Introduction to Ascend C Operator Development - Lesson 1 (Definition and Implementation of the Kernel Function)", by dayao.

Ascend C is a programming language launched by CANN for operator development scenarios. It natively supports the C and C++ standards and matches users' development habits as closely as possible. Through key technologies such as multi-layer interface abstraction, automatic parallel computing, and twin debugging, it greatly improves operator development efficiency, helping AI developers complete operator development, model tuning, and deployment at low cost.

If you have plenty of time, we recommend the official tutorial: Ascend C Official Tutorial.

If you want a quick, low-effort start, read this article instead: it systematically organizes the most important knowledge points of Ascend C programming so that you can get started in 3 days without getting lost.

Day 1 learning points:


1. What are the advantages of using Ascend C

  1. Native C/C++ programming
  2. The programming model shields hardware differences, and the programming paradigm improves development efficiency
  3. Multi-level API encapsulation, from simple to flexible, balancing ease of use and efficiency
  4. Twin debugging: the CPU side simulates the behavior of the NPU side, so code can be debugged on the CPU side first

5. Multi-level API interfaces

Ascend C provides multi-level APIs, from Level 0 to Level 3. As the level rises, the freedom of API usage decreases while ease of use increases. Developers can choose the appropriate API for their needs: use the easiest-to-understand high-level interfaces to quickly build operator logic, and use the free and flexible low-level interfaces for complex logic and performance tuning. The main benefits are:

  • Reduce the difficulty of using complex instructions
  • Guarantee cross-generation compatibility
  • Retain maximum flexibility

1. Level 3 interface

Operator overloading is supported for +, -, *, /, |, &, <, >, <=, >=, ==, and !=, providing a simplified expression of the lower-level instructions. Users can write, for example, dst = src0 * src1 to compute over an entire Tensor.
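As an illustration, here is a minimal sketch of what a Level 3 call can look like inside an operator's compute phase. The tensor names (dstLocal, src0Local, src1Local) are placeholders and the surrounding queue/buffer setup is omitted; the operator-overload form follows the usage described above, but verify the exact behavior against your CANN version.

```cpp
// Device-side fragment (inside an operator's Compute stage), assuming
// dstLocal, src0Local and src1Local are LocalTensor<half> objects that
// have already been allocated from the operator's queues and filled.
using namespace AscendC;

// Level 3: operator overloading computes over the entire Tensor at once.
dstLocal = src0Local * src1Local;
```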

2. Level 2 interface

A Level 2 interface computes over COUNT consecutive elements of the source operand srcLocal and writes the results consecutively to the destination operand dstLocal, solving the continuous-computation problem for one-dimensional tensors.
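A minimal sketch of the Level 2 form, assuming the Add(dst, src0, src1, calCount) overload described in the API reference; the tensor names are placeholders.

```cpp
// Level 2: compute `count` consecutive elements of the source operands and
// write them consecutively to the destination (one-dimensional case).
constexpr int32_t count = 512;
AscendC::Add(dstLocal, src0Local, src1Local, count);
```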

3. Level 0 interface

The Level 0 interfaces are the lowest-level, function-style flexible computing interfaces. They expose the full capability of the hardware instructions and support non-contiguous computation. These interfaces unlock the powerful instructions of the CANN-series chips and support Block stride, Repeat stride, and Mask, allowing users to customize operations with the following general parameters:

1. Repeat times (number of iterations)

The vector computation unit reads 8 consecutive blocks (32 bytes per block, 256 bytes in total) per iteration. To process all of the input data, multiple iterations (repeats) are usually needed. Repeat times is the number of iterations.

For example, if the data to be processed occupies 16 blocks (512 bytes) and each iteration processes 8 blocks (256 bytes), two iterations are needed, so Repeat times should be set to 2.
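The rule is plain arithmetic: one repeat covers 8 blocks × 32 bytes = 256 bytes. The helper below is only an illustration and is not part of the Ascend C API.

```cpp
// Hypothetical helper for computing Repeat times; not an Ascend C API.
constexpr uint32_t BYTES_PER_REPEAT = 8 * 32;   // 8 blocks of 32 bytes each

inline uint8_t CalcRepeatTimes(uint32_t elementCount, uint32_t bytesPerElement)
{
    uint32_t totalBytes = elementCount * bytesPerElement;
    // Round up so a partial final repeat is still counted.
    return static_cast<uint8_t>((totalBytes + BYTES_PER_REPEAT - 1) / BYTES_PER_REPEAT);
}

// Example: 256 half elements * 2 bytes = 512 bytes -> 2 repeats.
```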

2. Repeat stride (address stride of the same block between adjacent iterations)

When Repeat times is greater than 1 and multiple iterations are needed to finish the vector computation, you can set the Repeat stride, the address stride of the same block between adjacent iterations, according to the usage scenario.

Continuous computation scenario: assume a single Tensor serves as both the destination and the source operand (that is, the addresses overlap) and Repeat stride is 8. The vector computation unit then reads 8 consecutive blocks in the first iteration and the next 8 consecutive blocks in the second iteration, completing the computation of all input data over multiple iterations.

Non-continuous computation scenario: when Repeat stride is greater than 8 (for example, 10), the data read by the vector computation unit in adjacent iterations is not contiguous in address; a gap of 2 blocks appears between iterations.

Repeated computation scenario: when Repeat stride is 0, the vector computation unit repeatedly reads and computes the first 8 consecutive blocks.

Partially repeated computation: when Repeat stride is greater than 0 and less than 8, some data between adjacent iterations is read and computed repeatedly by the vector computation unit; this is rarely used in ordinary scenarios.
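The sketch below shows how these scenarios map onto a Level 0 call. It assumes the BinaryRepeatParams structure and the Level 0 Add overload as described in the API reference; field names and exact signatures should be checked against your CANN version.

```cpp
// Level 0 Add with explicit repeat strides (sketch, half-precision data).
// dstRepStride / srcRepStride = 8  -> continuous computation
//                             = 10 -> a 2-block gap between iterations
//                             = 0  -> the same first 8 blocks are reused
uint64_t mask = 128;            // all 128 half elements of one repeat
uint8_t repeatTimes = 2;

AscendC::BinaryRepeatParams params;
params.dstBlkStride  = 1;       // contiguous blocks within one iteration
params.src0BlkStride = 1;
params.src1BlkStride = 1;
params.dstRepStride  = 8;       // change to 10 or 0 for the other scenarios
params.src0RepStride = 8;
params.src1RepStride = 8;

AscendC::Add(dstLocal, src0Local, src1Local, mask, repeatTimes, params);
```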

3. Block stride (address stride between different blocks within the same iteration)

If you need to control the stride of data processing within a single iteration, set Block stride, the address stride between different blocks within the same iteration.

  • Continuous computation: with Block stride set to 1, the 8 blocks within one iteration are processed contiguously.
  • Non-continuous computation: with Block stride greater than 1 (for example, 2), a gap appears between adjacent blocks read within the same iteration, as sketched below.
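Continuing the sketch from the Repeat stride section, Block stride is carried by the BlkStride fields of BinaryRepeatParams (assumed field names; verify against your CANN version):

```cpp
// Non-continuous read within one iteration: a gap between adjacent blocks.
params.dstBlkStride  = 2;       // 1 would mean contiguous blocks
params.src0BlkStride = 2;
params.src1BlkStride = 2;
AscendC::Add(dstLocal, src0Local, src1Local, mask, repeatTimes, params);
```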

4. Mask parameter

Mask is used to control which elements participate in the computation in each iteration. It can be set in two modes: continuous mode and bit-wise mode.

Continuous mode: indicates how many leading consecutive elements participate in the computation. The data type is uint64_t. The value range depends on the operand's data type, because the maximum number of elements processed per iteration differs by type (maximum elements per iteration = 256 / sizeof(data type)). When the operand's data type occupies 16 bits (such as half or uint16_t), mask ∈ [1, 128]; when it occupies 32 bits (such as float or int32_t), mask ∈ [1, 64].

Bit-wise mode: controls, bit by bit, which elements participate in the computation; a bit value of 1 means the element participates, 0 means it does not. The parameter is an array of two uint64_t values. The value range depends on the operand's data type. When the operand is 16 bits, mask[0], mask[1] ∈ [0, 2^64 - 1]; when the operand is 32 bits, mask[1] is 0 and mask[0] ∈ [0, 2^64 - 1].
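A sketch of the two mask forms, assuming the Level 0 Add overloads that accept either a scalar uint64_t mask (continuous mode) or a two-element uint64_t array (bit-wise mode); check the exact overloads in your CANN version.

```cpp
// Continuous mode: the first 64 elements of each repeat participate.
uint64_t maskCount = 64;
AscendC::Add(dstLocal, src0Local, src1Local, maskCount, repeatTimes, params);

// Bit-wise mode: every second element participates (illustrative values).
uint64_t maskBits[2] = { 0x5555555555555555ULL, 0x5555555555555555ULL };
AscendC::Add(dstLocal, src0Local, src1Local, maskBits, repeatTimes, params);
```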

Taking the addition of 512 int16 elements as an example, the Level 0, Level 2, and Level 3 interfaces can be compared; choose the interface that fits your actual needs, as sketched below.
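A side-by-side sketch of the three levels for that example (512 int16_t elements, i.e. 512 × 2 bytes = 1024 bytes = 4 repeats of 256 bytes). The forms follow the interfaces described above but should be verified against your CANN version.

```cpp
// Level 3: operator overloading over the whole tensor.
dstLocal = src0Local + src1Local;

// Level 2: contiguous computation over 512 elements.
AscendC::Add(dstLocal, src0Local, src1Local, 512);

// Level 0: explicit mask, repeat times and strides.
uint64_t mask = 128;                 // 128 int16 elements fill one repeat
uint8_t repeatTimes = 4;             // 4 * 256 bytes = 1024 bytes
AscendC::BinaryRepeatParams rp;      // assumed defaults: blkStride 1, repStride 8
AscendC::Add(dstLocal, src0Local, src1Local, mask, repeatTimes, rp);
```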


6. More learning resources

That's all for this sharing. There are many more learning resources for Ascend C; to learn more, refer to the official tutorial: Ascend C Official Tutorial.

Extra!

Huawei will hold the eighth HUAWEI CONNECT at the Shanghai World Expo Exhibition Hall and Shanghai World Expo Center on September 20-22, 2023. Under the theme of "accelerating industry intelligence", the conference invites thought leaders, business elites, technical experts, partners, developers, and other industry colleagues to discuss how to accelerate industry intelligence from the perspectives of business, industry, and ecosystem.

We sincerely invite you to join us on site to share the opportunities and challenges of intelligence, discuss key measures for intelligence, and experience the innovation and application of intelligent technologies. You can:

  • Exchange views on accelerating industry intelligence in 100+ keynote speeches, summits, and forums
  • Tour the 17,000-square-meter exhibition area and experience the innovation and application of intelligent technologies in industry up close
  • Meet technical experts face to face to learn about the latest solutions, development tools, and hands-on practices
  • Explore business opportunities with customers and partners

Thank you for your support and trust as always, and we look forward to meeting you in Shanghai.

Official conference website: HUAWEI CONNECT 2023

Follow the "Huawei Cloud Developer Alliance" official account to get the conference agenda, exciting activities, and cutting-edge content.

Click Follow to be the first to learn about Huawei Cloud's latest technologies.
