CUDA PTX ISA reading notes (1)

If you don't know what this is, look here: Parallel Thread Execution ISA Version 5.0.
In short, PTX is the intermediate code that .cu files are compiled into; the PTX is then compiled again to generate the executable code. If you don't want to read the web version, there is a PDF version in the doc folder of the CUDA installation directory, which is very comfortable to read.
ps: The document is in English (and runs to more than 200 pages = =). Given my limited English and limited time (mainly laziness), I only paraphrase the parts I consider important. If you want to study it in depth, do the sensible thing and read the documentation.

Chapter 1 Introduction

1.1. Using GPUs for Scalable Data Parallelism

Introduces some background knowledge on parallel computing.

1.2. Objectives of PTX

PTX provides a stable programming model and instruction set. The ISA spans multiple GPU generations and serves as a target for optimizing compilers, among other goals.

1.3. PTX ISA version 5.0

Lists the new features in PTX ISA 5.0.

1.4. Document Structure

  • Programming Model: an overview of the programming model
  • PTX Machine Model: an overview of the PTX virtual machine
  • Syntax: describes the basic syntax of the PTX language
  • State Spaces, Types, and Variables: describes exactly these things
  • Instruction Operands
  • Abstracting the ABI: describes the syntax of function definitions and calls, as well as the application binary interface supported by PTX
  • Instruction Set
  • Special Registers
  • Release Notes

Chapter 2 Programming Model

2.1. A highly parallel coprocessor

More background on GPUs.

2.2. Thread Hierarchy

2.2.1 Cooperative Thread Array

2.2.2 Grid of Cooperative Thread Arrays

The two sections above cover GPU basics such as blocks and grids. If you want to know more, you can read my other article: "GPU High Performance Programming CUDA Practice" (CUDA By Example) Reading Notes - Part 1, Chapter 5. The diagrams here are from this manual.

2.3. Memory Hierarchy

This picture is really good:
Different levels correspond to different memory

Chapter 3 PTX Machine Model

3.1. A set of SIMT multiprocessors with on-chip shared memory

Mainly describes the hardware hierarchy, and sure enough, the picture explains it best:
hardware structure

Chapter 4 Syntax

The PTX language is composed of operation instructions and operands.

4.1. Code Format

Lines are separated by \n, and whitespace has no meaning. Lines beginning with # are preprocessor directives, as in C. The language is case-sensitive. Every PTX module starts with a .version directive that states the PTX version.

4.2. Comments

Same as C.

4.3. Statements

A statement starts with an optional label and ends with a semicolon, like this:
start: mov.b32 r1, %tid.x;

4.3.1. Directive Statements

Lists the directives that PTX provides:
PTX directives

4.3.2. Instructions

Lists the instructions that PTX provides:
PTX instructions
ps: The difference between the words directive and instruction comes from assembly language. A directive generally does not generate code; it tells the assembler how to behave. An instruction generates actual machine code. You can read more here: What-is-the-difference-between-an-instruction-and-a-directive-in-assembly-language

4.4. Identifiers

These are the naming rules for identifiers, which are basically the same as in C. Predefined system variables all start with %.

4.5. Constants

I suspect the numbering is off here: this should be a parent heading that contains the constant sections below.

4.6. Integer Constants

Every integer constant is 64 bits wide and is either signed or unsigned, that is, of type .s64 or .u64. Literals in each base are written as follows:

Base         Representation
hexadecimal  0[xX]{hexdigit}+U?
octal        0{octal digit}+U?
binary       0[bB]{bit}+U?
decimal      {nonzero-digit}{digit}*U?
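
As a rough illustration of the table above, the sketch below classifies a literal with regular expressions. It is written in Python rather than PTX, the name parse_ptx_int is made up for this note, and accepting a lone 0 is my own assumption because the table does not list it.

```python
import re

# (base name, radix, pattern); the optional trailing U marks an unsigned literal.
_PATTERNS = [
    ("hexadecimal", 16, re.compile(r"0[xX]([0-9a-fA-F]+)(U?)")),
    ("binary", 2, re.compile(r"0[bB]([01]+)(U?)")),
    ("octal", 8, re.compile(r"0([0-7]+)(U?)")),
    ("decimal", 10, re.compile(r"(0)(U?)")),  # assumption: a lone 0 is plain zero
    ("decimal", 10, re.compile(r"([1-9][0-9]*)(U?)")),
]


def parse_ptx_int(text):
    """Return (base name, value, type) for a PTX-style integer literal."""
    for name, radix, pattern in _PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            value = int(match.group(1), radix)
            kind = ".u64" if match.group(2) else ".s64"
            return name, value, kind
    raise ValueError(f"not a PTX integer literal: {text!r}")


if __name__ == "__main__":
    for literal in ("0x1F", "017U", "0b101", "42"):
        print(literal, parse_ptx_int(literal))
```

Note that a leading 0 switches the literal to octal, exactly as in C, which is easy to trip over.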

4.6.1. Floating point constants

Floating-point constants are all 64-bit, except that a 32-bit hexadecimal form can be used to express a single-precision value exactly (confused face???). The exact-bit-pattern forms are as follows:

Precision         Representation
single precision  0[fF]{hexdigit}{8}
double precision  0[dD]{hexdigit}{16}
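
To make the hexadecimal forms less mysterious: the digits are simply the IEEE 754 bit pattern of the value. Here is a small Python sketch that prints the pattern for a given number; the helper names ptx_f32 and ptx_f64 are made up for this note.

```python
import struct


def ptx_f32(value):
    """Format a number as a PTX single-precision hex constant (0f + 8 hex digits)."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return f"0f{bits:08X}"


def ptx_f64(value):
    """Format a number as a PTX double-precision hex constant (0d + 16 hex digits)."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return f"0d{bits:016X}"


if __name__ == "__main__":
    print(ptx_f32(1.0))  # 0f3F800000
    print(ptx_f64(1.0))  # 0d3FF0000000000000
```

This is why the hex form is exact: no decimal-to-binary rounding is involved.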

4.6.2. Predicate Constants

Integer constants can be used as predicate values: 0 is false, non-zero is true.

4.6.3. Constant Expressions

These are the operators that can be used in constant expressions, which are basically the same as in C:
expression

4.6.4 Integer Constant Expression Evaluation

Same as C.

4.6.5 Summary of Expression Evaluation Rules

C language +1
evaluate

Chapter 5 State Spaces, Types, and Variables

5.1. State Space

As far as I understand, a state space is a kind of storage area that instructions operate on.

5.1.1. Register State Space

Registers are declared with .reg and can hold almost any type of data. Unlike the other state spaces, registers do not have addresses.

5.1.2. Special Register State Space

Declared with .sreg. It mainly holds variables predefined by the system, such as the grid dimensions.

5.1.3. Constant State Space

The constant state space is declared with .const, is read-only, and is limited to 64 KB. In addition, there is a further 640 KB of constant memory organized as ten independent 64 KB regions. The driver allocates buffers in these ten regions, and pointers to those buffers can then be passed to the kernel function as parameters.

5.1.3.1. Banked Constant State Space (deprecated)

In the past, the bank number had to be specified explicitly, like this:
.extern .const[2] .b32 const_buffer[];

5.1.4. Global State Space

Use ld.global, st.global and atom.global to access the global state space. Accesses to global memory are not sequentially consistent, so bar.sync is needed to synchronize.

5.1.5. Local State Space

.local declares variables in the local state space, which is private to each thread.

5.1.6. Parameter state space

The parameter state space is used to (1) pass input parameters from the host to the kernel function, (2) declare the formal input and return parameters of device functions called from within the kernel, and (3) declare local byte arrays that serve as function call arguments, typically for passing large structures by value.

5.1.6.1. Kernel function parameters

.entry foo ( .param .b32 N, .param .align 8 .b8 buffer[64]) 
{ 
    .reg .u32 %n; 
    .reg .f64 %d; 
    ld.param.u32 %n, [N]; 
    ld.param.f64 %d, [buffer]; 
    ...

5.1.6.2. Kernel function parameter properties

5.1.6.3. Kernel function parameter attribute: .ptr

The .ptr attribute marks a kernel parameter as a pointer. It can also specify the state space being pointed to and the alignment of the memory being pointed to.

 .entry foo ( 
    .param .u32 param1, 
    .param .u32 .ptr.const.align 8 param3, 
    .param .u32 .ptr.align 16 param4 
    ) { .. }

5.1.6.4. Device function parameters

This is most commonly used to pass by value objects whose size does not match a register, such as structures.

 .func foo ( .reg .b32 N, .param .align 8 .b8 buffer[12] ) {
    .reg .f64 %d;
    .reg .s32 %y;
    ld.param.f64 %d, [buffer];
    ld.param.s32 %y, [buffer+8];
    ...
}

5.1.7. Shared State Space

Declared with .shared. Shared memory has optimizations that support sharing: one is broadcast, where all threads read the same address, and another is sequential access from sequential threads (is there some kind of consistency mechanism?)

5.1.8. Texture state space (deprecated)

Texture memory is also part of global memory. It is shared by all threads of the context and is read-only. Declarations using .tex should be replaced by .texref in the .global state space, like this:

  .tex .u32 tex_a;
  // is converted to the following
  .global .texref tex_a;

5.2. Types

5.2.1. Basic types

These basic types are like int, float in C language, used to define variables:
ptx basic type

5.2.2. Restricted Use of Sub-Word Sizes

The .u8, .s8, and .b8 types are restricted to the ld, st, and cvt instructions. .f16 can only be converted to and from the .f32 and .f64 types. The floating-point type .f16x2 is only allowed in half-precision floating-point arithmetic instructions and texture fetch instructions.

5.3. Texture Sampler and Surface Types

The following passage on surface references is excerpted from the expert handbook:

Instructions for reading and writing textures and surfaces involve more implicit state than other instructions. Parameters such as base address, dimensions, format, and how the texture content is interpreted are contained in a header structure. The header is an intermediate data structure whose software abstraction is called a texture reference or surface reference.
Here is a table of the opaque types provided specifically for textures, samplers, and surfaces:
opaque types

5.3.1. Texture and Surface Settings

Fields such as the width, height, and depth in the table above describe properties such as the size of the texture memory.

5.3.2. Sampler Properties

There are various modes, see the CUDA C Programming Guide for more details.

5.3.3. Channel Data Type and Channel Order Fields

These fields are defined by the source-language API, and so far only OpenCL defines them.
To tell the truth, I understand too little about texture memory, so this section is rather thin.

5.4. Variables

5.4.1. Variable declarations

Variable declarations need to declare both the state space and the data type. For example:

.global .u32 loc; 
.reg .s32 i; 
.const .f32 bias[] = {-1.0, 1.0}; 
.global .u8 bg[4] = {0, 0, 0, 0}; 
.reg .v4 .f32 accel; 
.reg .pred p, q, r;

5.4.2. Vectors

The vector length is fixed by PTX: it can only be 2 or 4, and the element type cannot be a predicate (true or false). Vectors are declared in the same way as ordinary variables: .global .v4 .f32 V;

5.4.3. Array declarations

The definition of an array is similar to that of C, and the length can be specified or not specified and then initialized:

      .local  .u16 kernel[19][19];
      .shared .u8  mailbox[128];
      .global .u32 index[] = { 0, 1, 2, 3, 4, 5, 6, 7 };
      .global .s32 offset[][2] = { {-1, 0}, {0, -1}, {1, 0}, {0, 1} };

5.4.4. Initializers

For initialization, it's like this:

.const .f32 vals[8] = { 0.33, 0.25, 0.125 };
.global .s32 x[3][2] = { {1,2}, {3} };
// is equivalent to
.const .f32 vals[8] = { 0.33, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0 };
.global .s32 x[3][2] = { {1,2}, {3,0}, {0,0} };
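
The zero-filling rule for incomplete initializers can be sketched in a few lines of Python. This only models the rule on nested lists and is not how any PTX tool implements it; the function name pad_initializer is invented for this note.

```python
def pad_initializer(values, shape, default=0):
    """Fill an incomplete (possibly nested) initializer up to `shape`."""
    extent, inner = shape[0], shape[1:]
    if len(values) > extent:
        raise ValueError("too many initializer elements")
    if not inner:
        # Innermost dimension: append default values to the given elements.
        filled = list(values)
        missing = [default] * (extent - len(filled))
    else:
        # Outer dimension: pad each given row, then add fully default rows.
        filled = [pad_initializer(row, inner, default) for row in values]
        missing = [pad_initializer([], inner, default)
                   for _ in range(extent - len(filled))]
    return filled + missing


if __name__ == "__main__":
    print(pad_initializer([[1, 2], [3]], (3, 2)))
    print(pad_initializer([0.33, 0.25, 0.125], (8,), 0.0))
```

The first call reproduces the x[3][2] example: the short row {3} becomes {3,0} and the missing third row becomes {0,0}.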

Currently, variable initialization is only supported in the constant and global state spaces, and the default initial value is 0. A variable can also be initialized with the address of another variable, using the following tricks:

.const .u32 foo = 42;
.global .u32 p1 = foo;          // offset of foo in .const space
.global .u32 p2 = generic(foo); // generic address of foo

// array of generic-address pointers to elements of bar
.global .u32 parr[] = { generic(bar), generic(bar)+4, generic(bar)+8 };

5.4.5. Memory Alignment

That is, you can specify the memory alignment, in bytes, when declaring a variable such as an array:

// allocate array at 4-byte aligned address. Elements are bytes.
.const .align 4 .b8 bar[8] = {0,0,0,0,2,0,0,0};

5.4.6. Parameterized variable names

Here is a quick way to declare many variables at once: .reg .b32 %r<100>; // equivalent to declaring %r0, %r1, ..., %r99
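
To spell out what the shorthand expands to, here is a small Python sketch. The function name expand_parameterized is made up, and the identifier pattern is a simplification of the real PTX identifier rules.

```python
import re

# Simplified identifier pattern (an assumption, not the full PTX grammar):
# an optional %, a letter/underscore/$, then letters, digits, _ or $.
_PARAMETERIZED = re.compile(r"(%?[A-Za-z_$][A-Za-z0-9_$]*)<([0-9]+)>")


def expand_parameterized(name):
    """Expand a name such as %r<100> into the list %r0 ... %r99."""
    match = _PARAMETERIZED.fullmatch(name)
    if match is None:
        raise ValueError(f"not a parameterized name: {name!r}")
    prefix, count = match.group(1), int(match.group(2))
    return [f"{prefix}{i}" for i in range(count)]


if __name__ == "__main__":
    names = expand_parameterized("%r<100>")
    print(names[0], names[1], "...", names[-1])
```

The number in angle brackets is a count, so the last register is numbered one less than it.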

5.4.7. Variable Properties

see next section

5.4.8. Variable attribute indication: .attribute

Variables can have a .managed attribute, which can only be used in the .global state space. With this attribute, the variable is placed in a managed virtual memory space that is accessible to both the host and the device. It is used like this: .global .attribute(.managed) .s32 g;

Chapter 6 Instruction Operands

6.1. Operand Type Information

The operand in each instruction must declare its type, and the type must conform to the template of the instruction, and there is no automatic type conversion.

6.2. Source Operands

PTX describes a load-store machine, so operands of ALU instructions must be in the .reg register state space. The cvt instruction accepts operands of many types and sizes and converts one type (or size) to another. The ld, st, mov and cvt instructions copy data from one location to another: ld and st move data between addressable state spaces and registers, and mov copies data from one register to another. Most instructions have an optional predicate guard, and some instructions take additional predicate source operands, which are conventionally named p, q, r, s.

6.3. Destination Operands

The destination operand receives the result of an instruction and is usually a register.

6.4. Using addresses, arrays and vectors

6.4.1. Addresses as Operands

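An address operand is written in square brackets, and the name of a declared variable stands for its address. Together with the various declarations it looks like this: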

.shared .u16 x; 
.reg .u16 r0; 
.global .v4 .f32 V; 
.reg .v4 .f32 W; 
.const .s32 tbl[256];
.reg .b32 p; .reg .s32 q; 

ld.shared.u16 r0,[x]; 
ld.global.v4.f32 W, [V]; 
ld.const.s32 q, [tbl+12]; 
mov.u32 p, tbl;
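
One detail worth spelling out: the +12 in [tbl+12] is a byte offset, not an element index, so with 4-byte .s32 elements it selects element 3. Below is a minimal Python sketch of that arithmetic. The table contents and the helper load_s32 are made up for illustration, and little-endian storage is assumed.

```python
import struct

ELEMENT_SIZE = 4  # bytes in one .s32 element


def load_s32(memory, byte_offset):
    """Read one signed 32-bit value at a byte offset, in the spirit of ld.const.s32."""
    return struct.unpack_from("<i", memory, byte_offset)[0]


# Hypothetical table contents: element i holds 10 * i.
tbl = struct.pack("<8i", *[10 * i for i in range(8)])

value = load_s32(tbl, 12)      # byte offset 12
index = 12 // ELEMENT_SIZE     # the element index that offset corresponds to

if __name__ == "__main__":
    print(index, value)
```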

6.4.2. Arrays as Operands

The use of arrays is basically the same as in the C language:

ld.global.u32 s, a[0]; 
ld.global.u32 s, a[N-1]; 
mov.u32 s, a[1]; // move address of a[1] into s

6.4.3. Vectors as Operands

A vector feels more like a structure or an array. With a vector, several values can be copied in a single instruction, which is very powerful:

.reg .v4 .f32 V; 
.reg .f32 a, b, c, d; 
mov.v4.f32 {a,b,c,d}, V;
// the reverse direction also works
ld.global.v4.f32  {a,b,c,d}, [addr+offset];
ld.global.v2.u32  V2, [addr+offset2];

6.4.4. Labels and Function Names as Operands

This is mainly used to take a label or function name and use it as the target of a branch or call.

6.5. Type conversion

6.5.1. Scalar conversions

Abbreviations used in the table: sext means sign-extend, zext means zero-extend, and chop means keep only the low bits. s is a signed integer, u is an unsigned integer, and f is a floating-point number. The 2 reads as "to", so s2f means signed-to-float.
type conversion
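
For readers who have not met these three operations before, here is a minimal Python sketch of what they do to a bit pattern. The function names mirror the table abbreviations but are otherwise my own, and Python's unbounded integers stand in for fixed-width registers.

```python
def chop(value, to_bits):
    """Keep only the low `to_bits` bits (narrowing conversion)."""
    return value & ((1 << to_bits) - 1)


def zext(value, from_bits):
    """Zero-extend: treat the low `from_bits` bits as an unsigned number."""
    return value & ((1 << from_bits) - 1)


def sext(value, from_bits):
    """Sign-extend: treat the low `from_bits` bits as a two's-complement number."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    return (value ^ sign_bit) - sign_bit


if __name__ == "__main__":
    print(zext(0xFF, 8))           # 255
    print(sext(0xFF, 8))           # -1
    print(hex(chop(0x1234, 8)))    # 0x34
```

The same 8-bit pattern 0xFF widens to 255 or to -1 depending on whether the source type is unsigned or signed.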

6.5.2. Rounding Modifiers

These are the modifiers that select the rounding mode: round down, round up, and so on. (The least significant bit, lsb, is bit 0 of a binary number, that is, the lowest bit. It has a weight of 2^0 and can be used to test whether a number is odd or even.)
rounding modifiers
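
As a rough guide to the integer rounding modifiers, the Python sketch below maps each one to the standard-library function with the same behaviour. The modifier names (.rni, .rzi, .rmi, .rpi) are quoted from the PTX documentation as I remember it rather than from the notes above, and the helper round_like is made up.

```python
import math

# Each PTX integer-rounding modifier paired with a Python function that
# rounds a float to an integer in the same direction.
INTEGER_ROUNDING = {
    ".rni": round,       # nearest integer, ties go to the even integer
    ".rzi": math.trunc,  # toward zero
    ".rmi": math.floor,  # toward negative infinity
    ".rpi": math.ceil,   # toward positive infinity
}


def round_like(modifier, x):
    """Round x to an integer the way the given modifier would."""
    return INTEGER_ROUNDING[modifier](x)


if __name__ == "__main__":
    for modifier in INTEGER_ROUNDING:
        print(modifier, round_like(modifier, 2.5), round_like(modifier, -2.5))
```

The tie-to-even case is where the least significant bit comes in: of the two candidate integers, the one whose lsb is 0 is chosen.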

6.6. Operand Costs

The state space an operand lives in affects the speed of an operation. Registers are the fastest and global memory is the slowest. The latency can be hidden by running many threads, or by issuing load instructions as early as possible. Here is the cost of fetching values from each of these places:
operand costs

Chapter 7 Abstracting the ABI

ABI is the abbreviation of Application Binary Interface. To put it bluntly, it is the set of low-level conventions that compiled code follows, such as how arguments are passed and how the stack is laid out. PTX hides these details behind its function syntax.

7.1. Declaration and Definition of Functions

Without further ado, just look at the code:

// a C structure is defined
struct {
    double dbl;
    char   c[4];
};
// a function with a return value and input parameters
.func (.reg .s32 out) bar (.reg .s32 x, .param .align 8 .b8 y[12]) 
{ 
    .reg .f64 f1; 
    .reg .b32 c1, c2, c3, c4; 
    ... 
    ld.param.f64 f1, [y+0]; 
    ld.param.b8 c1, [y+8];
    ld.param.b8 c2, [y+9];
    ld.param.b8 c3, [y+10]; 
    ld.param.b8 c4, [y+11]; 
    ... ... // computation using x,f1,c1,c2,c3,c4; 
} 
{
    .param .b8 .align 8 py[12]; 
    ...
    // fill in the parameter through byte offsets
    st.param.b64 [py+ 0], %rd; 
    st.param.b8 [py+ 8], %rc1; 
    st.param.b8 [py+ 9], %rc2; 
    st.param.b8 [py+10], %rc1; 
    st.param.b8 [py+11], %rc2; 
    // scalar args in .reg space, byte array in .param space 
    call (%out), bar, (%x, py); 
    ...

Note that the st.param instructions that set up arguments must immediately precede the call instruction, and the ld.param instructions that collect return values must immediately follow it. This lets the compiler optimize so that the .param variable does not take up extra space. The .param variable also allows data that lives in several places (registers and memory) to be mapped simply onto something that can be passed to a function.
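
The offsets 0 and 8..11 in the code above follow directly from the layout of the C structure: an 8-byte double followed by four 1-byte chars, 12 bytes in total. The Python sketch below checks that arithmetic with the struct module. It only models the byte layout, not the call itself, and the helper names are invented; little-endian storage is assumed.

```python
import struct

# 8-byte double followed by 4 raw bytes, with no padding in between.
LAYOUT = struct.Struct("<d4s")


def pack_arg(dbl, chars):
    """Build the 12-byte buffer that the caller fills with st.param."""
    return LAYOUT.pack(dbl, bytes(chars))


def unpack_arg(buf):
    """Read the fields back the way the callee does with ld.param."""
    dbl = struct.unpack_from("<d", buf, 0)[0]   # like ld.param.f64 f1, [y+0]
    chars = [buf[8 + i] for i in range(4)]      # like ld.param.b8 cN, [y+8+i]
    return dbl, chars


if __name__ == "__main__":
    buf = pack_arg(1.5, [1, 2, 3, 4])
    print(LAYOUT.size, unpack_arg(buf))
```

This is why the parameter is declared as a 12-element .b8 array with 8-byte alignment: the array is just the raw bytes of the structure.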

7.1.1. Changes in PTX ISA Version 1.x

PTX ISA 1.x only supported .reg parameters. Support for .param parameters came later.

7.2. Variadic Functions

The current PTX does not support variadic functions. (Not supported, so why make a fuss? Next!)

7.3. Alloca

Ditto
