Assembly Language (4): An Introduction to Programming Syntax

Table of Contents

0. A first assembly program

1. Language constants

(1) Integer constant

(2) Real constant

(3) Character constant

2. Reserved words

3. Identifier

4. Pseudo-instructions (directives)

5. Instructions

(1) Label

(2) Instruction mnemonic

(3) Operand

(4) Comments

(5) NOP (no operation) instruction

6. Assembler and assembly process

7. Detailed data definition

(1) Define BYTE and SBYTE data

(2) Define WORD and SWORD data

(3) Define DWORD and SDWORD data

(4) Define QWORD data

(5) Define packed BCD (TBYTE) data

(6) Define floating point types

(7) Equal sign (=) pseudo-instruction

8. Array and string length

9. Common pseudo-instructions

(1) EQU pseudo-instruction

(2) TEXTEQU pseudo-instruction


0. A first assembly program

Assembly language exposes almost everything about the machine. The programmer can see exactly what is happening, right down to the registers and flags in the CPU. With that power comes responsibility: the programmer must manage the details of data representation and instruction formats, and works at a level dense with detail. A simple assembly language program shows how this works in practice.

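The following program adds two numbers in a register and saves the result in a variable:

.386
.model flat,stdcall            ; flat memory model, stdcall calling convention
.stack 4096                    ; reserve 4096 bytes for the runtime stack
ExitProcess PROTO, dwExitCode:DWORD   ; Windows routine that ends the program

.data                          ; data segment
sum DWORD 0                    ; define a variable named sum
.code                          ; code segment
main PROC
    mov eax,5                  ; move the number 5 into the eax register
    add eax,6                  ; add 6 to the eax register
    mov sum,eax                ; store the result in sum
    INVOKE ExitProcess,0       ; end the program
main ENDP
END main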

Note the DWORD keyword in the declaration of sum. Size keywords like this are more specific than the types programmers may already know, such as int, double, and float: they specify only a size, and the assembler does not check what is actually stored in the variable. Remember, the programmer has full control. The areas marked by the .code and .data directives are called segments, so this program has a code segment and a data segment.

1. Language constants

Microsoft syntax notation is used here. Elements in square brackets [..] are optional. Elements in braces {..} are separated by | symbols, and exactly one of them must be selected. Italics identify elements that have known definitions or descriptions.

(1) Integer constant

An integer literal (also called an integer constant) consists of an optional leading sign, one or more digits, and an optional radix character (a suffix) that indicates the number's base:

[{+|-}] digits [radix]

For example, 26 is a valid integer literal. It has no radix suffix, so it is assumed to be decimal. To represent hexadecimal 26, write 26h. Similarly, 1101 is treated as a decimal value unless a "b" is added to the end to make it 1101b (binary). The following table lists the possible radix suffixes:
 

Suffix   Radix          Suffix   Radix
h        Hexadecimal    r        Encoded real
q/o      Octal          t        Decimal (alternate)
d        Decimal        y        Binary (alternate)
b        Binary

Hexadecimal literals that begin with a letter must have a leading zero, so that the assembler does not interpret them as identifiers. For example, write 0ABCDh rather than ABCDh.

An integer constant expression is an arithmetic expression made up of integer literals and arithmetic operators. Each expression must evaluate to an integer that can be stored in 32 bits (0 through FFFFFFFFh). Integer constant expressions are evaluated only at assembly time, never at runtime. From here on they are simply called integer expressions. The following table lists the arithmetic operators in order of precedence, from highest (1) to lowest (4):

Operator   Name                 Precedence
( )        Parentheses          1
+, -       Unary plus, minus    2
*, /       Multiply, divide     3
MOD        Modulus              3
+, -       Add, subtract        4
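
The radix suffixes and precedence rules above can be sketched as a fragment of a MASM data segment. The variable names are hypothetical, chosen only for illustration, and the fragment assumes it sits inside a complete 32-bit program.

```asm
.data
; Integer literals: the same value, 26, written in different radixes
decVal   BYTE  26            ; decimal (no radix suffix)
hexVal   BYTE  1Ah           ; hexadecimal
octVal   BYTE  32q           ; octal
binVal   BYTE  00011010b     ; binary
hexLead  WORD  0A3h          ; leading zero required: A3h would be read as an identifier

; Integer expressions, evaluated by the assembler at assembly time
expr1    DWORD 4 + 5 * 2     ; 14: multiply before add
expr2    DWORD (4 + 5) * 2   ; 18: parentheses first
expr3    DWORD 12 - 1 MOD 5  ; 11: MOD before subtract
expr4    DWORD -5 + 2        ; -3: unary minus before add
```

None of these expressions costs anything at runtime: only the computed result is stored in the object file.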

(2) Real constant

A real number literal (also called a floating-point literal) represents either a decimal real or an encoded (hexadecimal) real. A decimal real contains an optional sign, followed by an integer, a decimal point, an optional integer that expresses the fraction, and an optional exponent:

[sign]integer.[integer] [exponent]

The formats of the sign and exponent are:

sign            {+|-}

exponent        E[{+|-}]integer
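
A few literals that fit this syntax, written as a hedged sketch of a MASM data segment fragment. The names are hypothetical; the last line assumes the r (encoded real) suffix from the radix table and uses the IEEE single-precision bit pattern for +1.0.

```asm
.data
real1  REAL4 2.           ; digits and a decimal point, no fraction
real2  REAL4 +3.0         ; explicit sign
real3  REAL8 -44.2E+05    ; signed exponent
real4  REAL8 26.E5        ; exponent without a sign
real5  REAL4 3F800000r    ; encoded real: bit pattern of +1.0 in single precision
```

Each decimal real has at least one digit and a decimal point, which is what distinguishes it from an integer literal.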

(3) Character constant

A character literal is a single character enclosed in single or double quotation marks. The assembler stores the value in memory as the character's binary ASCII code. For example:

'A'

"d"

A string literal is a sequence of characters (including spaces) enclosed in single or double quotes, for example:

'ABC'

'X'

"Good night, Gracie"

Embedded quotation marks are also allowed, provided the inner quotes differ from the enclosing ones. Just as character literals are stored as integers, string literals are stored in memory as sequences of integer byte values.
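
As an illustration, the following data segment fragment stores character and string literals, including both kinds of embedded quotes. The variable names are hypothetical.

```asm
.data
charA   BYTE 'A'                          ; stored as 41h, the ASCII code for A
charD   BYTE "d"                          ; stored as 64h
str1    BYTE 'ABCD'                       ; stored as the bytes 41h, 42h, 43h, 44h
str2    BYTE "This isn't a test"          ; single quote inside double quotes
str3    BYTE 'Say "Good night," Gracie'   ; double quotes inside single quotes
```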

2. Reserved words

Reserved words have special meaning and can only be used in their correct context. By default, reserved words are not case sensitive. For example, MOV is the same as mov and Mov. There are different types of reserved words:

  • Instruction mnemonics, such as MOV, ADD, and MUL.
  • Register names.
  • Directives (pseudo-instructions), which tell the assembler how to assemble programs.
  • Attributes, which provide size and usage information for variables and operands, such as BYTE and WORD.
  • Operators, used in constant expressions.
  • Predefined symbols, such as @data, which return constant integer values at assembly time.

The following table lists commonly used reserved words.

$         PARITY?   DWORD       STDCALL
?         PASCAL    FAR         SWORD
@B        QWORD     FAR16       SYSCALL
@F        REAL4     FORTRAN     TBYTE
ADDR      REAL8     FWORD       VARARG
BASIC     REAL10    NEAR        WORD
BYTE      SBYTE     NEAR16      ZERO?
C         SDWORD    OVERFLOW?
CARRY?    SIGN?

3. Identifier

An identifier is a programmer-chosen name that identifies a variable, a constant, a procedure, or a code label.

There are a few rules for forming identifiers:

  • An identifier may contain between 1 and 247 characters.
  • It is not case sensitive.
  • The first character must be a letter (A..Z, a..z), an underscore (_), @, ?, or $. Subsequent characters may also be digits.
  • An identifier cannot be the same as an assembler reserved word.

In general, avoid @ and underscore as the leading character, because both the assembler and high-level language compilers use them for their own names.
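
The rules can be illustrated with a short data segment fragment. All of the names are hypothetical, and the invalid examples are left commented out so the fragment stays assemblable.

```asm
.data
lineCount   DWORD 0      ; letters only
first_value DWORD 0      ; underscore inside the name
x2          DWORD 0      ; digit after the first character
$first      DWORD 0      ; legal, though a leading $ is unusual
_count      DWORD 0      ; legal, but a leading underscore is best avoided

; Not valid as identifiers:
; 2ndValue  DWORD 0      ; starts with a digit
; mov       DWORD 0      ; reserved word (instruction mnemonic)
```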

4. Pseudo-instructions (directives)

Directives (pseudo-instructions) are commands embedded in the source code that are recognized and acted upon by the assembler. Directives do not execute at runtime, but they let you define variables, macros, and procedures, assign names to memory segments, and perform many other housekeeping tasks related to the assembler. By default, directives are not case sensitive: .data, .DATA, and .Data are equivalent.

One important function of assembler directives is to define program sections, also called segments. Segments have different roles. The segment identified by the .DATA directive is used to define variables; the segment identified by the .CODE directive contains executable instructions; and the segment identified by the .STACK directive defines the runtime stack and sets its size, for example: .stack 100h
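
The difference between directives and instructions is easiest to see side by side. The following is a minimal sketch assuming a 32-bit MASM program; myVar is a hypothetical name.

```asm
.386
.model flat,stdcall
.stack 100h              ; directive: reserve 100h (256) bytes for the runtime stack

.data                    ; directive: start of the data segment
myVar DWORD 26           ; directive: reserve a doubleword initialized to 26

.code                    ; directive: start of the code segment
main PROC
    mov eax, myVar       ; instruction: executes at runtime, copies myVar into EAX
    ret                  ; instruction: return to the caller
main ENDP
END main
```

Only the two lines inside main produce machine code that the CPU executes. Everything else is consumed by the assembler.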

5. Instructions

An instruction is a statement that becomes executable when a program is assembled. The assembler translates instructions into machine language bytes, which are loaded and executed by the CPU at runtime. An instruction has four basic parts:

  • Label (optional)
  • Instruction mnemonic (required)
  • Operand(s) (usually required)
  • Comment (optional)

The parts are arranged as follows:

[label: ] mnemonic [operands] [;comment]

(1) Label

A label is an identifier that acts as a place marker for instructions and data. A label placed just before an instruction implies the instruction's address; similarly, a label placed just before a variable implies the variable's address. There are two types of labels: data labels and code labels.

A data label identifies the location of a variable, providing a convenient way to reference the variable in code. The assembler assigns a numeric address to each label. Several data items can be defined following one label.

A label in the code area of a program (where instructions are located) must end with a colon (:). Code labels are used as targets of jumping and looping instructions. A code label can share a line with an instruction or sit on a line by itself.
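
Both kinds of label appear in the following fragment. It is a sketch rather than a complete program, and the names count, array, L1, and done are hypothetical.

```asm
.data
count  DWORD 100               ; data label: count names the address of this doubleword
array  DWORD 1024, 2048        ; one label, several data items
       DWORD 4096, 8192        ; still part of array: no label on this line

.code
    mov ecx, count             ; refer to the variable through its data label
    mov eax, 0
L1: add eax, ecx               ; code label on the same line as an instruction
    loop L1                    ; LOOP decrements ECX and jumps to L1 while ECX is not zero
done:                          ; code label on a line by itself
    mov array, eax             ; store the result in the first element of array
```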

(2) Instruction mnemonic

 

An instruction mnemonic is a short word that identifies an instruction. In English, a mnemonic is a device that aids memory. Likewise, assembly language instruction mnemonics such as mov, add, and sub hint at the type of operation they perform. Some examples:

Mnemonic   Description                          Mnemonic   Description
MOV        Move (assign) one value to another   MUL        Multiply two values
ADD        Add two values                       JMP        Jump to a new location
SUB        Subtract one value from another      CALL       Call a procedure

(3) Operand

An operand is a value that serves as an instruction's input or output. Assembly language instructions can have between zero and three operands, each of which can be a register, a memory operand, an integer expression, or an input/output port. Operands have a natural ordering. When an instruction has multiple operands, the first is typically called the destination operand and the second the source operand. In general, the instruction modifies the contents of the destination operand.
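
One instruction for each operand count, as a hedged fragment. The variable count is hypothetical.

```asm
.data
count DWORD 0

.code
    stc                    ; no operands: set the Carry flag
    inc eax                ; one operand: add 1 to EAX
    mov count, ebx         ; two operands: count is the destination, EBX is the source
    imul eax, ebx, 5       ; three operands: EAX = EBX * 5
```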

(4) Comments

 

Comments are an important way for the writer of a program to communicate information about its design to anyone reading the source code. The top of a program listing typically includes the following information:

  • A description of the program's purpose
  • The names of the people who created or revised the program
  • The dates of program creation and revision
  • Technical notes about the program's implementation

Comments can be specified in two ways:

  • Single-line comments begin with a semicolon (;). The assembler ignores all characters after the semicolon on the same line.
  • Block comments begin with the COMMENT directive and a user-specified symbol. The assembler ignores all subsequent lines of text until the same user-specified symbol appears again.
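
Both comment styles in one fragment, assuming it sits inside a code segment. The delimiter symbols ! and & are arbitrary choices.

```asm
; Single-line comment: everything after the semicolon is ignored.
    mov eax, 5             ; a comment can also follow an instruction

COMMENT !
    This line is a comment.
    This line is also a comment.
!

COMMENT &
    Any symbol can serve as the delimiter,
    as long as it does not appear inside the comment text.
&
```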

(5) NOP (no operation) instruction

The safest (and most useless) instruction is NOP (no operation). It takes up 1 byte of program storage and does nothing. Compilers and assemblers sometimes use it to align code on efficient address boundaries.
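
A sketch of NOP used for alignment, assuming 32-bit code that starts at offset 0. The offsets and machine code bytes in the comments are what a typical listing file would show for these instructions.

```asm
.code
    mov ax, bx       ; offset 00000000: encoded in 3 bytes (66 8B C3)
    nop              ; offset 00000003: 1 byte (90), pads to the next doubleword boundary
    mov edx, ecx     ; offset 00000004: now starts on an address divisible by 4
```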

6. Assembler and assembly process

A source program written in assembly language cannot be executed directly on its target computer. It must first be translated, or assembled, into executable code. In fact, an assembler is very similar to a compiler, the type of program used to translate a C++ or Java program into executable code. The assembler produces a file containing machine language, called an object file. This file is not quite ready to execute: it must be passed to another program called a linker, which in turn produces an executable file. That file is ready to run from the operating system's command prompt.

The assemble-link-execute cycle

  • Step 1: The programmer uses a text editor to create an ASCII text file, called the source file.
  • Step 2: The assembler reads the source file and produces an object file, a machine-language translation of the program. Optionally, it also produces a listing file. If any errors occur, the programmer must return to Step 1 and fix the program.
  • Step 3: The linker reads the object file and checks whether the program contains any calls to procedures in a link library. The linker copies any required procedures from the link library, combines them with the object file, and produces the executable file.
  • Step 4: The operating system loader reads the executable file into memory and branches the CPU to the program's starting address, and the program begins to execute.

The listing file contains a copy of the program's source code, along with line numbers, the numeric address of each instruction, the machine code bytes of each instruction (in hexadecimal), and a symbol table. The symbol table contains the names of all program identifiers, their segments, and related information.

7. Detailed data definition

The assembler recognizes a basic set of intrinsic data types, which describe types in terms of their size (byte, word, doubleword, and so on), whether they are signed, and whether they are integers or reals. A programmer might declare a value as SDWORD to tell readers that it is signed, but the assembler does not enforce this: it only evaluates the size of operands. So, for example, a 32-bit value can be declared as DWORD, SDWORD, or REAL4.

 

The following table lists all the intrinsic data types. The notation IEEE in some entries refers to the standard real number formats published by the IEEE Computer Society.

Type     Usage
BYTE     8-bit unsigned integer; B stands for byte
SBYTE    8-bit signed integer; S stands for signed
WORD     16-bit unsigned integer
SWORD    16-bit signed integer
DWORD    32-bit unsigned integer; D stands for double (word)
SDWORD   32-bit signed integer; SD stands for signed double (word)
FWORD    48-bit integer (far pointer in protected mode)
QWORD    64-bit integer; Q stands for quad (word)
TBYTE    80-bit (10-byte) integer; T stands for ten-byte
REAL4    32-bit (4-byte) IEEE short real
REAL8    64-bit (8-byte) IEEE long real
REAL10   80-bit (10-byte) IEEE extended real

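A data definition statement sets aside storage in memory for a variable and gives it an optional name. Data definition statements create variables based on the intrinsic data types listed in the table above. The syntax is:

[name] directive initializer [,initializer]...

  • Name: the optional name assigned to a variable must follow the rules for identifiers.
  • Directive: the directive in a data definition statement can be BYTE, WORD, DWORD, SBYTE, SWORD, or any other type listed in the table above. It can also be one of the legacy data definition directives shown in the following table.
  • Initializer: at least one initializer is required, and additional initializers are separated by commas.

Directive   Usage                     Directive   Usage
DB          8-bit integer             DQ          64-bit integer or real
DW          16-bit integer            DT          80-bit (10-byte) integer
DD          32-bit integer or real

The .MODEL directive tells the assembler which memory model to use. In 32-bit programs, the flat memory model is always used; it is associated with the processor's protected mode. The stdcall keyword tells the assembler how to manage the runtime stack when procedures are called. The .STACK directive tells the assembler how many bytes of memory to reserve for the program's runtime stack. The ENDP directive marks the end of a procedure.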

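(1) Define BYTE and SBYTE data

The BYTE (define byte) and SBYTE (define signed byte) directives allocate storage for one or more unsigned or signed values. Each initializer must fit into 8 bits of storage. A question mark (?) initializer leaves the variable uninitialized, which implies that it will be assigned a value at runtime.

If multiple initializers are used in the same data definition, the label refers only to the offset of the first initializer. Not every data definition needs a label. To continue an array of bytes that begins with list, for example, the additional bytes can be defined on the following lines without a label. Within a single data definition, the initializers can use different radixes, and character and string literals can be freely mixed in.

To define a string of characters, enclose them in single or double quotation marks. The most common type of string ends with a null byte (a byte containing 0) and is called a null-terminated string; many programming languages use this type of string. Each character uses one byte of storage. Strings are an exception to the rule that byte values must be separated by commas.

The DUP operator allocates storage for multiple data items, using an integer expression as a counter. It is particularly useful when allocating space for a string or array, and it can be used with initialized or uninitialized data.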
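
The cases described above can be sketched in one data segment fragment. The variable names are hypothetical and chosen only for illustration.

```asm
.data
value1   BYTE  'A'                      ; character literal
value2   BYTE  0                        ; smallest unsigned byte
value3   BYTE  255                      ; largest unsigned byte
value4   SBYTE -128                     ; smallest signed byte
value5   SBYTE +127                     ; largest signed byte
value6   BYTE  ?                        ; uninitialized

list     BYTE  10, 20, 30, 40           ; list labels the offset of the first byte (10)
         BYTE  50, 60, 70, 80           ; continuation line, no label needed
mixed    BYTE  10, 32h, 'A', 00100010b  ; radixes and characters can be mixed

greeting BYTE  "Good afternoon", 0      ; null-terminated string

zeros    BYTE  20 DUP(0)                ; 20 bytes, all zero
buffer   BYTE  20 DUP(?)                ; 20 bytes, uninitialized
stack4   BYTE  4 DUP("STACK")           ; 20 bytes: "STACKSTACKSTACKSTACK"
```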

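(2) Define WORD and SWORD data

The WORD (define word) and SWORD (define signed word) directives allocate storage for one or more 16-bit integers. The legacy DW directive can also be used.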
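
A few 16-bit definitions as a hedged illustration, with hypothetical names:

```asm
.data
word1   WORD  65535          ; largest unsigned 16-bit value
word2   SWORD -32768         ; smallest signed 16-bit value
word3   WORD  ?              ; uninitialized
word4   DW    65535          ; legacy DW form
myList  WORD  1, 2, 3, 4, 5  ; array of five words
array   WORD  5 DUP(?)       ; five uninitialized words
```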

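(3) Define DWORD and SDWORD data

The DWORD (define doubleword) and SDWORD (define signed doubleword) directives allocate storage for one or more 32-bit integers. The legacy DD directive can also be used to define doubleword data. DWORD can also declare a variable that contains the 32-bit offset of another variable.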
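
The following fragment sketches 32-bit definitions, including a variable that holds the offset of another variable. The names are hypothetical.

```asm
.data
val1    DWORD  12345678h     ; unsigned
val2    SDWORD -2147483648   ; smallest signed 32-bit value
val3    DWORD  20 DUP(?)     ; uninitialized array of 20 doublewords
val4    DD     12345678h     ; legacy DD form
pVal    DWORD  val3          ; holds the 32-bit offset of val3
```

A variable like pVal plays the role of a pointer: its contents are an address, not a data value.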

(4) Define QWORD data

The QWORD (define quadword) directive allocates storage for 64-bit (8-byte) values. The legacy DQ directive can also be used to define quadword data.
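
For instance, with hypothetical names:

```asm
.data
quad1   QWORD 1234567812345678h   ; 64-bit value
quad2   DQ    1234567812345678h   ; same value, legacy DQ form
```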

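(5) Define packed BCD (TBYTE) data

Intel stores a packed binary coded decimal (BCD) integer in a 10-byte package. Each byte (except the highest) contains two decimal digits: in the lower 9 storage bytes, each half-byte holds a single decimal digit. In the highest byte, the highest bit indicates the number's sign. If the highest byte equals 80h, the number is negative; if it equals 00h, the number is positive. The integer range is -999,999,999,999,999,999 to +999,999,999,999,999,999.

MASM uses the TBYTE directive to declare packed BCD variables. Constant initializers must be in hexadecimal, because the assembler does not automatically translate decimal initializers to BCD. To encode a real number as packed BCD, first load it onto the floating-point register stack with the FLD instruction, then convert it with the FBSTP instruction, which rounds the value to the nearest integer.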
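
A minimal sketch of both approaches, with hypothetical names. The first initializer spells out all ten bytes (sign byte 80h followed by 18 BCD digits); the code fragment assumes the FPU is in its default rounding mode.

```asm
.data
negVal  TBYTE 80000000000000001234h   ; packed BCD for -1234: sign byte 80h, digits ...1234
posVal  TBYTE 1234h                   ; packed BCD for +1234: upper bytes are zero
; badVal TBYTE -1234                  ; would store a binary integer, not packed BCD
realVal REAL8 1.5
bcdVal  TBYTE ?

.code
    fld   realVal                     ; push 1.5 onto the floating-point register stack
    fbstp bcdVal                      ; store as packed BCD and pop: 1.5 is rounded to 2
```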

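(6) Define floating point types

The REAL4 directive defines a 4-byte single-precision floating-point variable, REAL8 defines an 8-byte double-precision value, and REAL10 defines a 10-byte extended-precision value. Each requires one or more real constant initializers. The following table describes the minimum number of significant digits and the approximate range of each standard real type:

Data type                 Significant digits   Approximate range
Short real                6                    1.18 x 10^-38 to 3.40 x 10^38
Long real                 15                   2.23 x 10^-308 to 1.79 x 10^308
Extended-precision real   19                   3.37 x 10^-4932 to 1.18 x 10^4932

The DD, DQ, and DT directives can also define real numbers.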
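
For illustration, the same real values defined with the REAL directives and with their legacy equivalents. The names are hypothetical.

```asm
.data
rVal1       REAL4  -1.2
rVal2       REAL8  3.2E-260
rVal3       REAL10 4.6E+4096
ShortArray  REAL4  20 DUP(0.0)    ; array of 20 single-precision zeros

rVal4       DD     -1.2           ; legacy forms of the first three definitions
rVal5       DQ     3.2E-260
rVal6       DT     4.6E+4096
```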

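(7) Equal sign (=) pseudo-instruction

The equal-sign directive associates a symbol name with an integer expression, using the form name = expression. Ordinarily, expression is a 32-bit integer value. When a program is assembled, every occurrence of name is replaced by expression during the assembler's preprocessing step. For example, if the statement COUNT = 500 appears near the beginning of a source file, a later mov eax, COUNT is assembled as mov eax, 500.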

One of the most important symbols of all is the current location counter, written $. A symbol defined with = can be redefined within the same program.
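
The following fragment sketches substitution, the location counter, and redefinition. COUNT and selfPtr are hypothetical names.

```asm
COUNT = 500              ; a symbol, not a variable: it uses no storage

.data
selfPtr DWORD $          ; $ is the current location counter: selfPtr holds its own offset

.code
    mov eax, COUNT       ; assembled as: mov eax, 500

COUNT = 5                ; redefinition is allowed with =
    mov al, COUNT        ; AL = 5
COUNT = 10
    mov al, COUNT        ; AL = 10
```

The changing value of COUNT has nothing to do with the runtime order of execution. The symbol changes according to the order in which the assembler processes the source lines.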

8. Array and string length

Declaring an array's size explicitly invites programming errors, particularly if elements are later inserted or removed. A better approach is to let the assembler calculate the value. The $ operator (current location counter) returns the offset associated with the current program statement, so subtracting the array's offset from $ immediately after the array yields its size in bytes. When the elements are larger than a byte, divide the total array size (in bytes) by the size of a single element to get the number of elements.
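
A sketch of the technique for byte, string, word, and doubleword data. The names are hypothetical, and each size symbol must directly follow the data it measures.

```asm
.data
list      BYTE  10, 20, 30, 40
ListSize  = ($ - list)                 ; 4 bytes from list to here

myString  BYTE  "This is a long string, containing"
          BYTE  "any number of characters"
myString_len = ($ - myString)          ; length of both lines combined

list16    WORD  1000h, 2000h, 3000h, 4000h
List16Size = ($ - list16) / 2          ; 8 bytes / 2 bytes per word = 4 elements

list32    DWORD 10000000h, 20000000h, 30000000h, 40000000h
List32Size = ($ - list32) / 4          ; 16 bytes / 4 bytes per doubleword = 4 elements
```

If another definition were placed between list and ListSize, its bytes would be counted too, so the position of the size symbol matters.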

9. Common pseudo-instructions

(1) EQU pseudo-instruction

The EQU directive associates a symbol name with an integer expression or some arbitrary text. There are three formats:

name EQU expression
name EQU symbol
name EQU <text>

In the first format, expression must be a valid integer expression. In the second format, symbol is an existing symbol name, already defined with = or EQU. In the third format, any text may appear within the brackets <...>. When the assembler later encounters name in the program, it substitutes the integer value or text for the symbol. EQU is particularly useful for defining a value that does not evaluate to an integer, such as a real number constant. Unlike the = directive, a symbol defined with EQU cannot be redefined in the same source file.
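
The difference between an integer expression and bracketed text can be sketched as follows. All of the names are hypothetical.

```asm
PI        EQU <3.1416>                              ; text: not an integer value
pressKey  EQU <"Press any key to continue...",0>    ; text: a string and its null terminator
matrix1   EQU 10 * 10                               ; integer expression: evaluated to 100
matrix2   EQU <10 * 10>                             ; text: copied as written

.data
prompt    BYTE pressKey      ; assembled as: BYTE "Press any key to continue...",0
M1        WORD matrix1       ; assembled as: WORD 100
M2        WORD matrix2       ; assembled as: WORD 10 * 10
```

M1 and M2 end up holding the same value, but they get there differently: matrix1 is evaluated once when it is defined, while the text of matrix2 is pasted in and evaluated where it is used.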

(2) TEXTEQU pseudo-instruction

The TEXTEQU directive, similar to EQU, creates what is known as a text macro. There are three formats: the first assigns text, the second assigns the contents of an existing text macro, and the third assigns a constant integer expression:

name TEXTEQU <text>
name TEXTEQU textmacro
name TEXTEQU %constExpr
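
The three formats can be sketched together. The names are hypothetical, and text macros can build on one another.

```asm
continueMsg TEXTEQU <"Do you wish to continue (Y/N)?">   ; format 1: text
rowSize = 5
count       TEXTEQU %(rowSize * 2)                       ; format 3: constant expression, gives 10
move        TEXTEQU <mov>                                ; format 1: text
setupAL     TEXTEQU <move al,count>                      ; text built from other text macros

.data
prompt1 BYTE continueMsg     ; assembled as: BYTE "Do you wish to continue (Y/N)?"

.code
    setupAL                  ; assembled as: mov al,10
```

Unlike a symbol defined with EQU, a symbol defined with TEXTEQU can be redefined at any time.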

 

Origin blog.csdn.net/qq_35789421/article/details/113722447