Assembly code, registers, operands, and machine code

This article covers the instruction set architecture (ISA), machine code, and assembly code on x86 systems.

A simple overview:

You don't need to understand everything at once. Start with the general picture; several of these ideas are explained in detail later.

In general, while a program runs, the state it needs most often is kept inside the CPU: the PC (program counter), which holds the address of the next instruction and is called "RIP" (instruction pointer) in x86-64; the condition codes, which record the outcome of the most recent arithmetic or logical operation and are used for conditional branches; and the registers, which hold frequently used temporary values. This is the data we most need to keep "at hand" while the program runs, so it all lives in the CPU.

Of course, the CPU's storage is limited. The program's main data, its code, and the stack that supports procedure calls and recursion all live in memory, which is the Memory part of the figure. Memory can be thought of as a byte-addressable array, and all data in memory is accessed through addresses. Unlike data of type int or double, an address at this level is an untyped pointer. The address size depends on the system; on x86-64 it is 64 bits.
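
The "byte-addressable array" view can be made concrete in C. The following is a minimal sketch, not taken from the original article: the helper names bytes_of, from_little_endian and pointer_size are hypothetical, and the byte order assumes a little-endian machine such as x86-64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the raw bytes of a 64-bit value into out[0..7],
   lowest address first, exactly as they sit in memory. */
void bytes_of(uint64_t value, unsigned char out[8]) {
    memcpy(out, &value, sizeof value);
}

/* Hypothetical helper: rebuild the value from its bytes, assuming
   little-endian order (least significant byte at the lowest address). */
uint64_t from_little_endian(const unsigned char in[8]) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | in[i];
    }
    return v;
}

/* Size in bytes of an untyped pointer; 8 on typical x86-64 systems. */
size_t pointer_size(void) {
    return sizeof(void *);
}
```

On x86-64, calling bytes_of on 0x1122334455667788 should leave 0x88 in out[0], because the least significant byte is stored at the lowest address.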

List of registers:

These are the register names in x86-64 (%rax, %rbx, %rcx, ...). They have different responsibilities: for example, %rax holds the return value and %rsp is the stack pointer. Don't worry too much about why the registers have these names; there is little logic to them, and it is fine to put them down to historical reasons.

The registers in the figure above are 64 bits wide, which is the x86-64 register set. On the earlier IA32 systems the registers were only 32 bits wide; those are now the lower 32 bits of the 64-bit registers, and they keep their original names (%eax, %ebx, %ecx, ...).

The lower 16 bits and lower 8 bits of these 32-bit registers have their own names as well (for example, %ax and %al within %eax); compare them in the picture.
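
Reading a sub-register is similar to truncating a 64-bit value in C. The sketch below only illustrates that analogy; the function names low32, low16 and low8 are made up for this example.

```c
#include <stdint.h>

/* Each hypothetical helper keeps only the low bits of a 64-bit value,
   which is what reading a narrower register name gives you. */
uint32_t low32(uint64_t r) { return (uint32_t)r; } /* like %eax within %rax */
uint16_t low16(uint64_t r) { return (uint16_t)r; } /* like %ax within %rax  */
uint8_t  low8(uint64_t r)  { return (uint8_t)r; }  /* like %al within %rax  */
```

For the value 0x1122334455667788 these return 0x55667788, 0x7788 and 0x88 respectively.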

From .c file to final executable

The following shows, step by step, how a C program we have written becomes the final executable file.

What are all the register names above for? You will see shortly where they are used. An ordinary program we write is converted step by step into assembly code, then machine code, and finally into the binary executable that the computer runs. The register names show up in the assembly code.

Moving values between registers and memory: the mov operation

In assembly code, values are transferred with the mov instruction: movq Source, Dest. The q suffix means 8 bytes (64 bits) are moved; movl moves 4 bytes (32 bits). There are three operand types:

Immediate: constant integer data that represents a fixed value, such as $0x400 and $-533, written with a dollar sign ($) as a prefix.
Register: such as %rax and %r13. Writing a register name refers to the value held in that register.
Memory: a memory operand refers to a location in memory. In the simplest case a register supplies the base address, e.g. (%rax), which means the 8 consecutive bytes starting at the address stored in %rax. Adding parentheses is like applying * to a pointer in C: it takes the value stored at the address.

The following are some combinations of mov operands. Comparing the assembly code with the C code makes the mov operation easy to understand.

Note that in the syntax used here the first operand of mov is the source and the second is the destination. This is the AT&T syntax; Intel syntax reverses the order, so which one you see depends on the assembler or disassembler. Also note that no single mov goes from memory to memory: such a transfer takes two instructions and passes through a register. The hardest part of the figure above is that parentheses denote the value stored at an address, which is why the corresponding C code uses the * operator. That is, (R) stands for Mem[Reg[R]].
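
The operand combinations can be written as small C functions. This is a sketch for illustration: the function names are hypothetical, and the instructions in the comments assume the usual x86-64 Linux convention in which the first two arguments arrive in %rdi and %rsi. A real compiler may emit something different.

```c
/* Immediate -> memory, roughly: movq $-147, (%rdi) */
void store_constant(long *p) { *p = -147; }

/* Register -> memory, roughly: movq %rsi, (%rdi) */
void store_value(long *p, long v) { *p = v; }

/* Memory -> register, roughly: movq (%rdi), %rax */
long load_value(const long *p) { return *p; }

/* Memory -> memory has no single mov: load into a register, then store. */
void copy_value(long *dst, const long *src) {
    long tmp = *src; /* roughly: movq (%rsi), %rax */
    *dst = tmp;      /* roughly: movq %rax, (%rdi) */
}
```

In each case the * in the C code lines up with a parenthesized operand in the assembly.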

There are other forms of expansion around the parentheses, such as:

D(R) stands for Mem[Reg[R]+D]. For example, movq 8(%rbp),%rdx adds 8 to the address held in %rbp, reads the value at the resulting address, and assigns it to %rdx. D here is the displacement, a constant offset.
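
One common use of the D(R) form is reading a field at a fixed offset inside a structure. The sketch below assumes a hypothetical struct pair and helper load_at, and a platform where long is 8 bytes (as on x86-64 Linux).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical structure: `second` sits sizeof(long) bytes past the start. */
struct pair {
    long first;
    long second;
};

/* Read the long stored `disp` bytes past `base`: the C analogue of
   Mem[Reg[R] + D]. memcpy is used so the read is valid for any offset. */
long load_at(const void *base, size_t disp) {
    long v;
    memcpy(&v, (const unsigned char *)base + disp, sizeof v);
    return v;
}
```

With a pointer to a struct pair as base, load_at(base, offsetof(struct pair, second)) reads the same field as base->second; with 8-byte longs that offset is 8, matching a displacement like the one in movq 8(%rbp),%rdx.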

D(Rb,Ri,S) stands for Mem[Reg[Rb]+S*Reg[Ri]+D], where:
D: the constant displacement, which can be 1, 2 or 4 bytes.
Rb: the base register, which can be any of the 16 integer registers.
Ri: the index register, which can be any register except %rsp.
S: the scale factor, which can be 1, 2, 4 or 8.
These forms let us access elements in contiguous blocks of memory, such as arrays or structures.

Let's work through an example. Suppose an array a holds 3 values of type double, the address of a is in Rb, and Ri holds the index 1. Then (Rb,Ri,8) first computes the address Rb+8*1 = Rb+8. Because a double is 8 bytes, the result is exactly the address of the element at index 1. Because of the parentheses, the value at that address is then read, which gives a[1], the second element of the array. This is why S is 1, 2, 4 or 8: those are the sizes of the basic data types.
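
The D(Rb,Ri,S) formula can be checked with ordinary integer arithmetic. This is a sketch under stated assumptions: effective_address and load_double are names invented for the example, and converting an integer back to a pointer relies on the flat address space of platforms such as x86-64.

```c
#include <stdint.h>

/* Address computed by D(Rb,Ri,S): Reg[Rb] + S*Reg[Ri] + D. */
uintptr_t effective_address(uintptr_t base, uintptr_t index,
                            uintptr_t scale, intptr_t disp) {
    return base + scale * index + (uintptr_t)disp;
}

/* Load a[i] by computing its address explicitly, as (Rb,Ri,8) would. */
double load_double(const double *a, uintptr_t i) {
    uintptr_t addr = effective_address((uintptr_t)a, i, sizeof(double), 0);
    return *(const double *)addr;
}
```

With base 0x1000, index 1 and scale 8, effective_address gives 0x1008, which is where a[1] lives if the array starts at 0x1000.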

Some other situations and examples:

leaq operation

The difference between lea and mov is that lea only computes the address; it does not access memory at that address. For example:

leaq (%rdi,%rdi,2), %rax # t = x+2*x

In this example, the value in %rdi is multiplied by 3 and the result is placed directly in %rax. The computed result is not used as an address to fetch a value from memory.
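
In C terms, leaq is plain arithmetic with no dereference. The two functions below are a sketch with hypothetical names; the instructions in the comments are what compilers commonly produce for this kind of expression, but the exact output depends on the compiler and its flags.

```c
/* x + 2*x, commonly compiled to: leaq (%rdi,%rdi,2), %rax */
long mul3(long x) {
    return x + 2 * x;
}

/* x + 4*y + 7 has the shape D(Rb,Ri,S), so it fits a single
   leaq 7(%rdi,%rsi,4), %rax. No memory is read in either function. */
long scaled_sum(long x, long y) {
    return x + 4 * y + 7;
}
```

This is why lea often shows up in compiled code that has nothing to do with addresses: it is a cheap way to combine an add, a small multiply and a constant.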

Other operations:

Any operation that can be written in a C program, such as addition, subtraction, multiplication, left and right shifts, and bitwise operations, has a natural counterpart among the assembly instructions:

addq Src,Dest //Dest = Dest + Src
subq Src,Dest //Dest = Dest − Src
imulq Src,Dest //Dest = Dest * Src
salq Src,Dest //Dest = Dest << Src Also called shlq
sarq Src,Dest //Dest = Dest >> Src Arithmetic
shrq Src,Dest //Dest = Dest >> Src Logical
xorq Src,Dest //Dest = Dest ^ Src
andq Src,Dest //Dest = Dest & Src
orq Src,Dest //Dest = Dest | Src

incq Dest //Dest = Dest + 1
decq Dest //Dest = Dest − 1
negq Dest //Dest = − Dest
notq Dest //Dest = ~Dest
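
The two right shifts in the list differ only in what fills the vacated high bits. The sketch below shows the difference in C; the function names are hypothetical, n is assumed to be less than 64, and the arithmetic version is written with unsigned operations because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Logical right shift, like shrq: vacated high bits are filled with zeros. */
uint64_t shift_logical(uint64_t x, unsigned n) {
    return x >> n;
}

/* Arithmetic right shift, like sarq: vacated high bits copy the sign bit.
   For negative x, invert, shift in zeros, and invert back so the high
   bits end up as ones. Assumes two's complement, as on x86-64. */
int64_t shift_arithmetic(int64_t x, unsigned n) {
    if (x >= 0) {
        return (int64_t)((uint64_t)x >> n);
    }
    return (int64_t)~(~(uint64_t)x >> n);
}
```

For the bit pattern of -16 shifted right by 2, the arithmetic shift gives -4, while the logical shift gives the large positive value 0x3FFFFFFFFFFFFFFC.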

An example: a swap function with its assembly code:

void swap
(long *xp, long *yp)
{
long t0 = *xp;    //movq (%rdi), %rax
long t1 = *yp;    //movq (%rsi), %rdx
*xp = t1;        //movq %rdx, (%rdi)
*yp = t0;        //movq %rax, (%rsi)
}                //ret

From the perspective of the assembly code, the values at the two addresses are first loaded into the registers %rax and %rdx, and then written back to the opposite addresses. Notice that the incoming parameters xp and yp are placed in %rdi and %rsi by default. This is the common use of these two registers: they hold the first two function arguments.

Another example: an arithmetic function and its assembly code:

long arith
(long x, long y, long z)
{
long t1 = x+y; //leaq (%rdi,%rsi), %rax # t1
long t2 = z+t1; //addq %rdx, %rax # t2
long t3 = x+4; //(no instruction of its own; folded into the leaq for t5)
long t4 = y * 48; //leaq (%rsi,%rsi,2), %rdx # 3*y
                  //salq $4, %rdx # t4 = (3*y)<<4 = 48*y
long t5 = t3 + t4; //leaq 4(%rdi,%rdx), %rcx # t5 = x+4+t4
long rval = t2 * t5; //imulq %rcx, %rax # rval
return rval; //ret
}

It must be emphasized that the examples here are relatively simple, so the C code and the assembly code correspond almost line by line. In more complex cases the assembly produced by the compiler will not necessarily match the source line by line, because the compiler compiles, optimizes, and generates assembly code "in its own way".
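
Even in arith the mapping is not strictly one to one: y * 48 takes two instructions, and x + 4 is folded into a later leaq. The sketch below places the source version next to a hypothetical arith_like_asm that follows the assembly one instruction per statement, with local variables named after the registers purely for illustration.

```c
/* The function as written in the source. */
long arith(long x, long y, long z) {
    long t1 = x + y;
    long t2 = z + t1;
    long t3 = x + 4;
    long t4 = y * 48;
    long t5 = t3 + t4;
    return t2 * t5;
}

/* The same computation, one statement per assembly instruction. */
long arith_like_asm(long x, long y, long z) {
    long rax = x + y;       /* leaq (%rdi,%rsi), %rax   : t1            */
    rax += z;               /* addq %rdx, %rax          : t2            */
    long rdx = y + 2 * y;   /* leaq (%rsi,%rsi,2), %rdx : 3*y           */
    rdx *= 16;              /* salq $4, %rdx            : 48*y, i.e. t4 */
    long rcx = x + rdx + 4; /* leaq 4(%rdi,%rdx), %rcx  : t5            */
    return rax * rcx;       /* imulq %rcx, %rax         : rval          */
}
```

The multiplication by 16 stands in for the left shift by 4, since shifting a negative signed value left is undefined in C. For x=1, y=2, z=3 both versions return 606.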

Assembler, linker, and disassembler

Once we have assembly code (a .s file), an assembler turns it into machine code (a .o file).

Assembler:
- Translate assembly language source code (.s files) into object files (.o files).
- Generate the binary encoding of each instruction.
- Generate an almost complete image of the executable code; only the links between code in different files are missing.
Linker:
- Resolve references between different files.
- Merge object files with static run-time libraries, such as the implementations of malloc and printf.
- Handle dynamically linked libraries, which are linked when the program starts executing.

Disassembler (the upward red arrow I drew in the picture):

Disassembly works on generated machine code and recovers the assembly code for analysis. objdump accepts either an object file such as sum.o or the linked executable. For example, for the executable sum, the command:

objdump -d sum

produces output similar to this. The left side shows the contents of the file (addresses and machine code bytes), and the right side shows the readable assembly code:

0000000000400595 <sumstore>:
400595: 53 push %rbx
400596: 48 89 d3 mov %rdx,%rbx
400599: e8 f2 ff ff ff callq 400590 <plus>
40059e: 48 89 03 mov %rax,(%rbx)
4005a1: 5b pop %rbx
4005a2: c3 retq

To add to this: the raw machine code is just the following. The number on the left of each line is the address of the instruction, and the bytes on the right are the instruction itself in hexadecimal.

400595: 53
400596: 48 89 d3
400599: e8 f2 ff ff ff
…………

You can see that 53 is a single byte, so the address of the next instruction is only 1 higher; the following 48 89 d3 occupies three bytes, so the address of the third instruction is another 3 higher. The meaning of these hexadecimal bytes is hard to read directly. (There are tables that map assembly instructions to their encodings, for example which bytes mov %rax, %rdx corresponds to. In the attack lab, we use such a table to find the machine code we need to inject to perform a given operation.) That is why we disassemble: with the assembly code in hand, we can at least understand a little.
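
The address arithmetic can be checked in a few lines of C, and the same listing allows one more check. The bytes e8 f2 ff ff ff are a call whose last four bytes are a little-endian signed offset relative to the address of the next instruction. The helper names below are hypothetical, and the sketch only handles this 5-byte e8 form of call.

```c
#include <stdint.h>

/* Address of the instruction that follows a `length`-byte instruction. */
uint64_t next_address(uint64_t addr, uint64_t length) {
    return addr + length;
}

/* Target of a 5-byte call (opcode e8 followed by a 32-bit offset).
   The offset is little-endian, signed, and relative to the next instruction. */
uint64_t call_target(uint64_t addr, const uint8_t insn[5]) {
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    int32_t rel = (int32_t)raw; /* 0xfffffff2 becomes -14 */
    return next_address(addr, 5) + (uint64_t)(int64_t)rel;
}
```

For the call at 0x400599 the next instruction is at 0x40059e, and 0x40059e - 14 = 0x400590, which matches the callq 400590 <plus> shown by objdump.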

gdb is a commonly used debugging tool that can also disassemble. In the bomb lab of CSAPP, gdb is used to disassemble the provided bomb executable and analyze what the program is doing.

Summary

This article covered the differences between C code, assembly code and machine code, and introduced registers, operands, mov, arithmetic instructions and more. Even a simple introduction to these topics goes a long way toward understanding how computers work underneath and toward writing efficient code.