Analysis of the instructions of the NVIDIA Volta architecture

Preface

Since our experimental results were not very good, we have started modifying SASS at the lowest level. For commercial reasons NVIDIA publishes very little about SASS, so we can only piece things together from various papers and other scattered sources. Two days ago I found a document about the Volta architecture that covers some SASS content. It is roughly the same as the maxas introduction, but easier to understand. The relevant part, the second chapter, is translated below.


Main text

Volta uses a different way of encoding instructions than the Pascal and Maxwell architectures.

The biggest difference from previous architectures is that Volta uses one 128-bit word to encode each instruction together with the control information for that instruction. Previous architectures used 64 bits to encode each instruction, plus a separate 64-bit word of control information that could be associated with multiple instructions. Here is an example:

(Figure: Volta instruction encoding)

The code above was disassembled with nvdisasm, which splits each 128-bit word into two 64-bit words for display. The first line holds only encoded instruction information; the second line holds both instruction information and control information.

As far as we can tell from exhaustively decoding the instruction set, the 128 bits are divided as follows:

  • At least 91 bits are used for instruction encoding
  • At least 23 bits are used for control information
  • According to our experiments, the remaining 14 bits appear to be unused
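As a rough illustration of this split, the sketch below joins the two 64-bit halves into one 128-bit value and slices off the top 23 bits. It assumes that the first displayed half is the low half and that the control information sits in the most significant 23 bits (two zero bits plus one 21-bit section, as the control information section describes). The function names and the sample value are made up for illustration, and the remaining 105 bits (91 + 14) are not separated further because the text does not say where the unused bits lie.

```python
MASK64 = (1 << 64) - 1
CONTROL_SHIFT = 128 - 23          # 105: control occupies the top 23 bits (assumed)
SECTION_MASK = (1 << 21) - 1      # one 21-bit control section


def join_halves(low: int, high: int) -> int:
    """Combine the two 64-bit halves of a Volta instruction into 128 bits."""
    return ((high & MASK64) << 64) | (low & MASK64)


def split_volta_word(word128: int):
    """Return (top_two_bits, control_section, remaining_105_bits)."""
    if not 0 <= word128 < (1 << 128):
        raise ValueError("value must fit in 128 bits")
    top_two = word128 >> 126                              # expected to be 0
    control = (word128 >> CONTROL_SHIFT) & SECTION_MASK   # 21-bit section
    rest = word128 & ((1 << CONTROL_SHIFT) - 1)           # instruction + unused bits
    return top_two, control, rest


# Hypothetical value: low half 0x1, control section 0x7f1 placed at bit 105.
word = join_halves(0x1, 0x7F1 << (CONTROL_SHIFT - 64))
print(split_volta_word(word))  # (0, 2033, 1)
```

Here 2033 is 0x7f1 in decimal, so the section written into the high half comes back unchanged.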

control information

The Kepler architecture introduced control information, which encodes the compiler's scheduling decisions for instructions. Control information prevents data hazards and allows simpler on-chip logic, which lets GPUs achieve higher computational density and lower power consumption.

On Volta, each 128-bit word contains one instruction together with the control information for that instruction.

Architectures before Volta attach one control word to multiple instructions (3 on Pascal and Maxwell, 7 on Kepler). Each control word specifies how its associated instructions are scheduled. The following code is an example from the Pascal architecture, consisting of four 64-bit words: the first word is shown only in hexadecimal, with no corresponding instruction, because it is the control word; the remaining three are instructions.

                                                /* 0x000f8800fe2007f1 */ 
/*0288*/    @P5 LDG.E.CI R66, [R86+0x100];      /* 0xeed4a00010055642 */
/*0290*/    @!P5 MOV R66, RZ;                   /* 0x5c9807800ffd0042 */
/*0298*/    @P6 LDG.E.CI R67, [R86+0x180];      /* 0xeed4a00018065643 */

Control information is encoded differently on each architecture:

  • Kepler: 6 zero bits at the most significant end, 2 zero bits at the least significant end, and 7 sections of 8 bits each
  • Pascal and Maxwell: 1 zero bit as the most significant bit, followed by 3 sections of 21 bits each
  • Volta: 2 zero bits as the most significant bits, followed by 1 section of 21 bits. Read from the most significant end, each 128-bit word begins with the control information, followed by the encoding of the instruction.
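To make the Pascal and Maxwell layout concrete, here is a minimal sketch that splits the control word from the listing above (0x000f8800fe2007f1) into its three 21-bit sections. The function name is hypothetical, and the text does not say which section belongs to which of the three instructions, so the sketch simply returns them from least significant to most significant.

```python
SECTION_BITS = 21
SECTION_MASK = (1 << SECTION_BITS) - 1


def split_pascal_control(word: int) -> list:
    """Split a 64-bit Pascal/Maxwell control word into three 21-bit sections.

    Sections are returned from least significant to most significant.
    """
    if not 0 <= word < (1 << 64):
        raise ValueError("control word must fit in 64 bits")
    if word >> 63:
        raise ValueError("most significant bit is expected to be 0")
    return [(word >> (i * SECTION_BITS)) & SECTION_MASK for i in range(3)]


sections = split_pascal_control(0x000F8800FE2007F1)
print([hex(s) for s in sections])  # ['0x7f1', '0x7f1', '0x3e2']
```

The three sections plus the leading zero bit account for all 64 bits (1 + 3 × 21 = 64).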

Control information is organized the same way on Volta, Pascal, and Maxwell: each 21-bit section contains 6 fields, as follows:

(Figure: control information fields)

Their respective meanings are as follows:

  1. reuse flags

    Volta, Pascal, and Maxwell have 4 register reuse caches and 4 source operand slots. Each of the four reuse flag bits corresponds to one of the 8-byte slots. When a reuse flag is set, the associated register value is stored in the register reuse cache so that instructions reusing that register can take it from there. Reuse also reduces register bank conflicts. The least significant bit corresponds to the first source operand slot and the most significant bit to the fourth.

  2. wait barrier mask; read and write barrier numbers

    While most instructions have a fixed execution time and can be statically scheduled by the assembler, instructions that involve memory access or shared computing resources have variable execution times. Volta, Pascal, and Maxwell use "dependency barriers" to track the completion of these variable-latency instructions and to resolve data hazards. When a variable-latency instruction writes to a register, the assembler associates it with a barrier by setting the "write barrier number". When a later instruction wants to access that register, the assembler makes it wait on the barrier by setting the corresponding bit in its "wait barrier mask". The hardware stalls the later instruction until the data it needs is ready. An instruction may need to wait on several barriers, which is why the wait field is a mask rather than a single number.

  3. read dependency barrier

    Read dependency barriers protect against write-after-read hazards. Unbuffered instructions that read a register and write its contents to memory need the register to remain unchanged for the duration of the operation. To ensure this, the assembler associates such an instruction with a barrier by setting the "read barrier number". Later instructions that want to write to this register wait on that barrier.

  4. stall cycles

    This four-bit field indicates how many cycles the scheduler should wait before issuing the next instruction, in the range 0-15. On Pascal and Maxwell, when this field and the "yield flag" together contain a particular combination of bits, the two dispatchers in a processing block can issue two consecutive instructions at the same time, which is called "dual issue". A Volta processing block has only one dispatcher, so no dual issue is observed.

  5. yield flag

    Volta uses the yield flag to balance the workload assigned to a processing block. When the flag is set, the scheduler prefers to issue the next instruction from the current warp. When it is not set, the scheduler prefers to switch to another warp and disables all register reuse flags. Switching to another warp costs one extra clock cycle.
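The fields described above can be unpacked from a 21-bit section with a few shifts and masks. The sketch below is only an illustration: the text states the widths of the reuse field (4 bits) and the delay field (4 bits), while the other widths (6-bit wait mask, 3-bit read and write barrier numbers, 1-bit flag) and the order of the fields from most to least significant bit are assumptions taken from how maxas describes Maxwell control codes. All names are made up for this example.

```python
from typing import NamedTuple


class ControlFields(NamedTuple):
    reuse: int          # 4 bits, one per source operand slot
    wait_mask: int      # 6 bits, one per barrier (assumed width)
    read_barrier: int   # 3 bits (assumed width)
    write_barrier: int  # 3 bits (assumed width)
    yield_flag: int     # 1 bit (assumed width)
    stall: int          # 4 bits, 0-15 cycles


def decode_section(section: int) -> ControlFields:
    """Unpack one 21-bit control section (assumed layout, MSB to LSB:
    reuse | wait mask | read barrier | write barrier | yield | stall)."""
    if not 0 <= section < (1 << 21):
        raise ValueError("section must fit in 21 bits")
    return ControlFields(
        reuse=(section >> 17) & 0xF,
        wait_mask=(section >> 11) & 0x3F,
        read_barrier=(section >> 8) & 0x7,
        write_barrier=(section >> 5) & 0x7,
        yield_flag=(section >> 4) & 0x1,
        stall=section & 0xF,
    )


def must_wait(fields: ControlFields, pending: set) -> bool:
    """True if any barrier named in the wait mask is still pending."""
    return any((fields.wait_mask >> b) & 1 for b in pending)


# 0x7f1 is the lowest 21-bit section of the Pascal control word shown earlier.
print(decode_section(0x7F1))
```

Under these assumptions 0x7f1 decodes to a stall of 1 cycle with the flag bit set, an empty wait mask, and both barrier numbers equal to 7. Because the wait field is a mask, `must_wait` can test several barriers at once, which a single barrier number could not express.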

scheduler

Volta's SM is divided into four processing blocks. The instructions of a given warp are assigned to one specific block and can only use the computing resources within that block. The mapping between a warp and its scheduler (in the processing block) is scheduler_id = warp_id % 4. To verify this, we ran an experiment (omitted here).
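The mapping is simple enough to write down directly. The sketch below applies scheduler_id = warp_id % 4 and, as an added convenience, derives the warp from a linear thread index within a block using the 32-thread warp size. The function names are hypothetical.

```python
WARP_SIZE = 32        # threads per warp
NUM_SCHEDULERS = 4    # one scheduler per processing block


def warp_of_thread(linear_thread_id: int) -> int:
    """Warp index of a thread, given its linear index within the block."""
    return linear_thread_id // WARP_SIZE


def scheduler_of_warp(warp_id: int) -> int:
    """Scheduler (processing block) a warp is mapped to."""
    return warp_id % NUM_SCHEDULERS


# Warps 0..7 are distributed round-robin over the four schedulers.
print([scheduler_of_warp(w) for w in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

One consequence of this mapping is that warps 0, 4, 8, ... all share a scheduler and therefore compete for the resources of the same processing block.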

instruction encoding

Volta uses more bits to encode instructions than previous architectures.

Unlike previous architectures (Pascal, Maxwell, and Kepler), which place the opcode in the most significant bits, Volta places the opcode in the least significant bits of the first 64-bit word. We show the encodings of Pascal and Volta in the appendix.

Volta's opcodes are 10-13 bits long.

As with previous architectures, Volta instructions can operate on registers (general, special, or predicate) and on memory addresses (constant, shared, or global). A predicate is encoded in 4 bits: the first bit is a negation flag, and the remaining three bits are the index of a predicate register.
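A small sketch of the 4-bit predicate field follows. It treats the "first bit" as the most significant bit of the field and reads it as a flag that negates the predicate, matching the @P5 and @!P5 guards in the Pascal listing above. Both the bit order and that reading of the flag are assumptions, and the function names are made up.

```python
def decode_predicate(field: int):
    """Split a 4-bit predicate field into (negated, register_index)."""
    if not 0 <= field < 16:
        raise ValueError("predicate field must fit in 4 bits")
    negated = bool((field >> 3) & 1)   # assumed: top bit is the flag
    index = field & 0x7                # 3-bit predicate register index
    return negated, index


def format_predicate(field: int) -> str:
    """Render the field the way a guard appears in disassembly."""
    negated, index = decode_predicate(field)
    return "@" + ("!" if negated else "") + "P" + str(index)


print(format_predicate(0b0101))  # @P5
print(format_predicate(0b1101))  # @!P5
```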


Origin http://43.154.161.224:23101/article/api/json?id=325982078&siteId=291194637