Machine-Level Representation of Programs (Part 1)

Program code

Suppose we write a C program as two files, p1.c and p2.c. We compile this code from a Unix command line:

linux> gcc -Og -o p p1.c p2.c

  

The gcc command invokes the GCC C compiler, the default compiler on Linux. The -Og option tells the compiler to apply a level of optimization that yields machine code following the overall structure of the original C code. Higher optimization levels can transform the code so heavily that the relationship between the generated machine code and the original source becomes hard to understand.

In fact, the gcc command invokes a whole sequence of programs to turn source code into executable code. First, the C preprocessor expands the source code, inserting any files named in #include directives and expanding any macros defined with #define. Second, the compiler generates assembly-code versions of the two source files, named p1.s and p2.s. Next, the assembler converts the assembly code into the binary object-code files p1.o and p2.o. Object code is one form of machine code: it contains binary representations of all the instructions, but the addresses of global values are not yet filled in. Finally, the linker merges the two object-code files with code implementing library functions (such as printf) and generates the final executable file p (as specified by the option -o p).
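As a minimal sketch of the preprocessing step, the following C fragment shows what macro expansion does before the compiler proper sees the code. The names SCALE, TIMES_SCALE, and scaled are hypothetical and are not part of p1.c or p2.c. Running gcc -E on a file like this should print the expanded source.

```c
/* Hypothetical macros, for illustration only. */
#define SCALE 3
#define TIMES_SCALE(x) ((x) * SCALE)

/* Hypothetical helper that uses the macros. */
long scaled(long v) {
    /* After preprocessing, the next line reads: return ((v) * 3); */
    return TIMES_SCALE(v);
}
```

The compiler never sees SCALE or TIMES_SCALE; by the time assembly code is generated, only the multiplication by the constant remains.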

Machine-level code

Two abstractions are especially important for machine-level programming. The first is the instruction set architecture (ISA), which defines the format and behavior of a machine-level program: the processor state, the format of the instructions, and the effect each instruction has on that state. Most ISAs, including x86-64, describe the behavior of a program as if each instruction executes in sequence, with one instruction completing before the next begins. The processor hardware is far more elaborate, executing many instructions concurrently, but it employs safeguards to ensure that the overall behavior matches the sequential operation dictated by the ISA. The second abstraction is that the memory addresses used by a machine-level program are virtual addresses, providing a memory model that appears to be a very large byte array. The actual memory system is implemented by a combination of multiple hardware memories and operating system software.

The compiler does most of the work in the overall compilation sequence, transforming programs expressed in the relatively abstract execution model of C into the very elementary instructions that the processor executes. The assembly-code representation is very close to machine code. Its main feature, compared with the binary format of machine code, is that it is in a more readable textual format.

The machine code for x86-64 differs greatly from the original C code. Parts of the processor state that are normally hidden from the C programmer become visible:

  • The program counter (commonly called the PC, and named %rip in x86-64) gives the memory address of the next instruction to be executed.
  • The integer register file contains 16 named locations, each storing a 64-bit value. These registers can hold addresses (corresponding to C pointers) or integer data. Some registers keep track of critical parts of the program state, while others hold temporary data, such as the arguments and local variables of a procedure and the value a function returns.
  • The condition code registers hold status information about the most recently executed arithmetic or logical instruction. They are used to implement conditional changes in control or data flow, such as those needed for if and while statements.
  • A set of vector registers can each hold one or more integer or floating-point values.

Whereas C provides a model in which objects of different data types can be declared and allocated in memory, machine code views memory as simply a large byte-addressable array. Aggregate data types in C, such as arrays and structures, are represented in machine code as contiguous collections of bytes. Even for scalar data types, assembly code makes no distinction between signed and unsigned integers, between different types of pointers, or even between pointers and integers.
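The point that types disappear at the machine level can be illustrated from within C itself. The sketch below uses hypothetical helper names (raw_bytes, same_bits_unsigned, pointer_as_integer) and simply exposes the underlying bytes and bit patterns; it assumes a platform that provides uintptr_t, as mainstream x86-64 toolchains do.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bytes of any object into out; returns the number copied.
   An array or structure is just n contiguous bytes at this level. */
size_t raw_bytes(const void *obj, size_t n, unsigned char *out) {
    memcpy(out, obj, n);
    return n;
}

/* Reinterpret the bit pattern of a signed int as unsigned; no bits change. */
unsigned int same_bits_unsigned(int v) {
    unsigned int u;
    memcpy(&u, &v, sizeof u);
    return u;
}

/* A pointer is held in a register as an ordinary integer value. */
uintptr_t pointer_as_integer(const void *p) {
    return (uintptr_t)p;
}
```

For example, same_bits_unsigned(-1) yields a value with every bit set: the same register contents, read under a different C type.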

The program memory contains the executable machine code for the program, some information required by the operating system, a run-time stack for managing procedure calls and returns, and blocks of memory allocated by the user (for example, with the malloc library function). As mentioned earlier, program memory is addressed using virtual addresses. At any given time, only limited subranges of virtual addresses are considered valid. For example, x86-64 virtual addresses are represented by 64-bit words. In current implementations, the upper 16 bits of these addresses must be set to zero, so an address can specify a byte within a range of 2^48 bytes, or 256 TB. A more typical program accesses only a few megabytes or gigabytes of data. The operating system manages this virtual address space, translating virtual addresses into the physical addresses of values in the actual processor memory.
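As a quick check of the arithmetic, here is a small sketch with hypothetical helper names: with 48 usable address bits there are 2^48 distinct byte addresses, which works out to 256 TiB (taking 1 TiB as 2^40 bytes).

```c
#include <stdint.h>

/* Number of distinct byte addresses expressible with `bits` address bits.
   Only valid for bits < 64. */
uint64_t addressable_bytes(unsigned bits) {
    return (uint64_t)1 << bits;
}

/* Convert a byte count to whole tebibytes (1 TiB = 2^40 bytes). */
uint64_t bytes_to_tib(uint64_t bytes) {
    return bytes >> 40;
}
```

Calling bytes_to_tib(addressable_bytes(48)) gives 256.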

A single machine instruction performs only a very elementary operation. For example, it might add two numbers stored in registers, transfer data between memory and a register, or conditionally branch to a new instruction address. The compiler must generate sequences of such instructions to implement program constructs such as arithmetic expression evaluation, loops, and procedure calls and returns.
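To get a feel for how a loop reduces to elementary steps, the sketch below writes the same computation twice: once as a structured while loop, and once in a "goto" style in which every step is an addition, an increment, or a conditional jump. This is only an illustration of the idea, with hypothetical function names; the exact instruction sequence a compiler emits may differ.

```c
/* Sum of 1..n written with a structured loop. */
long sum_to_n(long n) {
    long sum = 0;
    long i = 1;
    while (i <= n) {
        sum += i;
        i++;
    }
    return sum;
}

/* The same computation using only simple steps and jumps,
   roughly the shape a compare-and-branch sequence takes. */
long sum_to_n_goto(long n) {
    long sum = 0;
    long i = 1;
    goto test;          /* jump to the loop test first */
loop:
    sum += i;           /* add */
    i++;                /* increment */
test:
    if (i <= n)         /* compare, then conditional branch */
        goto loop;
    return sum;
}
```

Both functions return the same result for any n; only the control structure is spelled out differently.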

Sample code

Suppose the file mstore.c contains the following function definition:

long mult2(long,long);

void multstore(long x, long y, long *dest) {
  long t = mult2(x, y);
  *dest = t;
}

  

With the -S command-line option, we can see the assembly code that the C compiler generates:

# gcc -Og -S mstore.c 

  

This causes GCC to run the compiler, generating the assembly file mstore.s, but to go no further. The assembly-code file contains various declarations, including the following lines:

multstore:
        pushq   %rbx
        movq    %rdx, %rbx
        call    mult2
        movq    %rax, (%rbx)
        popq    %rbx
        ret

  

Each indented line in the code above corresponds to a single machine instruction. For example, the pushq instruction indicates that the contents of register %rbx should be pushed onto the program stack. All information about local variable names and data types has been stripped away.

If we use the -c command-line option, GCC both compiles and assembles the code:

# gcc -Og -c mstore.c
# ll
total 637312
……
-rw-r--r--  1 root root      1368 Aug  7 14:59 mstore.o
……

    

This produces the object-code file mstore.o, which is in binary format and so cannot be viewed directly. Embedded within the 1,368 bytes of mstore.o is a 14-byte sequence with the following hexadecimal representation:

53 48 89 d3 e8 00 00 00 00 48 89 03 5b c3

  

This is the object code corresponding to the assembly instructions listed above. A key lesson is that the program the machine executes is simply a sequence of bytes encoding a series of instructions. The machine has very little information about the source code from which these instructions were generated.

To inspect the contents of machine-code files, a class of programs known as disassemblers can be invaluable. These programs generate a format similar to assembly code from the machine code. On Linux systems, the program objdump (for "object dump") can serve this role when given the -d command-line flag:

# objdump -d mstore.o 

mstore.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <multstore>:
   0:   53                      push   %rbx
   1:   48 89 d3                mov    %rdx,%rbx
   4:   e8 00 00 00 00          callq  9 <multstore+0x9>
   9:   48 89 03                mov    %rax,(%rbx)
   c:   5b                      pop    %rbx
   d:   c3                      retq  

  

On the left we see the 14 hexadecimal byte values listed earlier, partitioned into groups of 1 to 5 bytes each. Each group is a single instruction, with the equivalent assembly language shown on the right.
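The grouping can be reproduced with a little bookkeeping. In the sketch below, the byte values and the instruction lengths are copied from the objdump listing above; the code does not decode x86-64 itself, it only accumulates the lengths to recover each instruction's starting offset. The identifiers are hypothetical.

```c
#include <stddef.h>

/* The 14 code bytes of multstore, copied from the listing above. */
static const unsigned char multstore_code[14] = {
    0x53, 0x48, 0x89, 0xd3, 0xe8, 0x00, 0x00, 0x00, 0x00,
    0x48, 0x89, 0x03, 0x5b, 0xc3
};

/* Instruction lengths read off the objdump output (not decoded here). */
static const size_t multstore_lengths[6] = {1, 3, 5, 3, 1, 1};

/* Fill offsets[i] with the starting offset of instruction i.
   Returns the total number of bytes covered. */
size_t instruction_offsets(const size_t *lengths, size_t count,
                           size_t *offsets) {
    size_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        offsets[i] = pos;
        pos += lengths[i];
    }
    return pos;
}
```

Applied to multstore_lengths, the offsets come out as 0, 1, 4, 9, c, and d in hexadecimal, matching the left-hand column of the disassembly, and the total is 14 bytes.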

Several features of machine code and its disassembled representation are worth noting:

  • x86-64 instructions range in length from 1 to 15 bytes. Commonly used instructions and those with fewer operands require fewer bytes, while less common instructions and those with more operands require more.
  • The instruction format is designed so that, from a given starting position, there is a unique decoding of the bytes into machine instructions. For example, only the instruction pushq %rbx can start with the byte value 53.
  • The disassembler determines the assembly code purely from the byte sequences in the machine-code file. It does not need access to the source code or assembly-code versions of the program.
  • The disassembler uses a slightly different naming convention for instructions than the assembly code generated by GCC. In our example, it has omitted the suffix 'q' from many of the instructions. These suffixes are size designators and can be omitted in most cases. Conversely, the disassembler adds the suffix 'q' to the call and ret instructions; again, these suffixes can safely be omitted.

Generating actual executable code requires running a linker on a set of object-code files, one of which must contain a function main. Suppose the file main.c contains the following functions:

#include <stdio.h>

void multstore(long, long, long *);

int main() {
    long d;
    multstore(2, 3, &d);
    printf("2 * 3 --> %ld\n", d);
    return 0;
}

long mult2(long a, long b) {
    long s = a * b;
    return s;
}

  

Then we can generate the executable program prog as follows:

# gcc -Og -o prog main.c mstore.c 
# ll
total 637312
……
-rwxr-xr-x  1 root root      8616 Aug  7 15:55 prog

  

The file prog has grown to 8,616 bytes, because it contains not only the code for the procedures we wrote but also code to start and terminate the program and to interact with the operating system. We can disassemble prog as well:

# objdump -d prog 
……
0000000000400563 <mult2>:
  400563:       48 89 f8                mov    %rdi,%rax
  400566:       48 0f af c6             imul   %rsi,%rax
  40056a:       c3                      retq   
000000000040056b <multstore>:
  40056b:       53                      push   %rbx
  40056c:       48 89 d3                mov    %rdx,%rbx
  40056f:       e8 ef ff ff ff          callq  400563 <mult2>
  400574:       48 89 03                mov    %rax,(%rbx)
  400577:       5b                      pop    %rbx
  400578:       c3                      retq   
  400579:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
……

    

The code for multstore in this disassembly is almost identical to that generated by disassembling mstore.o. One important difference is that the addresses listed along the left are different: the linker has shifted the location of this code to a different range of addresses. A second difference is that the linker has filled in the address that the callq instruction should use in calling the function mult2 (the line at address 40056f). One task of the linker is to match function calls with the locations of the executable code for those functions. A final difference is that there is one additional line of code (at address 400579). This instruction has no effect on the program, since it occurs after the return instruction (at address 400578); it is padding, which lets the next block of code start at an aligned address.
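The bytes that the linker filled in can be checked by hand. In a 5-byte call instruction with opcode e8, the four bytes after the opcode are a signed little-endian displacement, added to the address of the instruction that follows the call. The sketch below, with hypothetical function names, does that arithmetic for the bytes shown in the two listings.

```c
#include <stdint.h>
#include <string.h>

/* Decode a 4-byte little-endian signed displacement. */
int32_t read_disp32(const unsigned char *p) {
    uint32_t u = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    int32_t d;
    memcpy(&d, &u, sizeof d);   /* same bits, read as signed */
    return d;
}

/* Target of a 5-byte call (opcode e8) located at call_addr:
   address of the next instruction plus the displacement. */
uint64_t call_target(uint64_t call_addr, const unsigned char *insn) {
    uint64_t next = call_addr + 5;
    return next + (uint64_t)(int64_t)read_disp32(insn + 1);
}
```

For the linked code, the bytes e8 ef ff ff ff at address 40056f encode a displacement of -17, and 400574 - 17 = 400563, the address of mult2. For the unlinked mstore.o, the displacement is still zero, so the disassembler shows the target as 9, the address of the next instruction.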

Notes on format

The assembly code generated by GCC is difficult for a human to read. On one hand, it contains information with which we need not be concerned. On the other hand, it provides no description of the program or of how it works. For example, suppose we generate the file mstore.s with the following commands:

# gcc -Os -S mstore.c 
# cat mstore.s 
        .file   "mstore.c"
        .text
        .globl  multstore
        .type   multstore, @function
multstore:
.LFB0:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        movq    %rdx, %rbx
        call    mult2
        movq    %rax, (%rbx)
        popq    %rbx
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE0:
        .size   multstore, .-multstore
        .ident  "GCC: (GNU) 4.8.5 20150623 (Red Hat 4.8.5-16)"
        .section        .note.GNU-stack,"",@progbits

  

All of the lines beginning with '.' are directives that guide the assembler and linker. We can generally ignore them. On the other hand, nothing in the file explains what the instructions do or how they relate to the source code.

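To present assembly code more clearly, we show it in a form that omits most of the directives while adding line numbers and explanatory annotations. For our example, the annotations would note that the arguments x, y, and dest arrive in registers %rdi, %rsi, and %rdx; that pushq and popq save and restore %rbx; that the first movq copies dest into %rbx; that call invokes mult2(x, y); and that the second movq stores the result at *dest.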

Data Format

Because of its origins as a 16-bit architecture that later expanded to 32 bits, Intel uses the term "word" to refer to a 16-bit data type. Accordingly, 32-bit quantities are called "double words" and 64-bit quantities "quad words". Figure 3-1 shows the x86-64 representations used for the primitive data types of C. Standard int values are stored as double words (32 bits). Pointers (shown here as char *) are stored as 8-byte quad words, as would be expected on a 64-bit machine. In x86-64, the data type long is implemented with 64 bits, allowing a very wide range of values. Most of the code examples here use pointers and long data types, so they operate on quad words. The x86-64 instruction set also includes a full complement of instructions for bytes, words, and double words.

Figure 3-1: Sizes of C data types in x86-64. On a 64-bit machine, pointers are 8 bytes long.

Floating-point numbers come in two principal formats: single-precision (4-byte) values, corresponding to the C data type float, and double-precision (8-byte) values, corresponding to the C data type double.

As Figure 3-1 indicates, most assembly-code instructions generated by GCC have a single-character suffix denoting the size of the operand. For example, the data movement instruction has four variants: movb (move byte), movw (move word), movl (move double word), and movq (move quad word). The suffix 'l' is used for double words because 32-bit quantities are considered "long words". Assembly code also uses the suffix 'l' to denote both a 4-byte integer and an 8-byte double-precision floating-point number. This causes no ambiguity, since floating-point code involves an entirely different set of instructions and registers.
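The size-to-suffix correspondence for integer operands can be written down as a small lookup. The sketch below uses a hypothetical helper, int_suffix, and assumes the usual x86-64 Linux data model, in which int is 4 bytes and long and pointers are 8 bytes.

```c
#include <stddef.h>

/* Size suffix used in GCC's assembly output for an integer operand
   of n bytes; returns '?' for sizes that have no integer suffix. */
char int_suffix(size_t n) {
    switch (n) {
    case 1:  return 'b';   /* byte */
    case 2:  return 'w';   /* word */
    case 4:  return 'l';   /* double word */
    case 8:  return 'q';   /* quad word */
    default: return '?';
    }
}
```

Under that assumption, int_suffix(sizeof(int)) gives 'l' and int_suffix(sizeof(long)) gives 'q', which is why the multstore code, working with long values and a pointer, uses movq, pushq, and popq throughout.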

Origin www.cnblogs.com/beiluowuzheng/p/11313710.html