Principles of program linking

This article briefly introduces how program linking works. Understanding linking helps programmers see what a program really is, lays a solid foundation for developing large software projects, and makes it easier to diagnose otherwise puzzling problems in day-to-day development, such as undefined or duplicate symbol errors.

Simply put, linking is the process of collecting the various pieces of code and data in a project and combining them into a single executable file. The combined file can then be loaded into memory and executed.

Linking can happen at three points:

1. Compile time: when the source code is translated into machine code

2. Load time: when the program is loaded into memory by the loader

3. Run time: while the application is executing

1. Static linking

1.1 Program compilation process

// Sample program 1

/* /code/link/main.c */
void swap();

int buf[2] = {1, 2};

int main()
{
    swap();
    return 0;
}

/* /code/link/swap.c */
extern int buf[];

int *bufp0 = &buf[0];
int *bufp1;

void swap()
{
    int temp;
    
    bufp1 = &buf[1];
    temp = *bufp0;
    *bufp0 = *bufp1;
    *bufp1 = temp;
}

The program above simply swaps two numbers. The executable object file is generated in the following steps:

C preprocessor (cpp): translates the C source file *.c into an ASCII intermediate file *.i

C compiler (cc1): translates *.i into an ASCII assembly language file *.s

Assembler (as): translates *.s into a relocatable object file *.o

Finally, the linker (ld) combines all the *.o files with the necessary system files to create an executable object file.

1.2 Tasks of the linker

The linker combines multiple object files into one complete, loadable, and executable object file. Its input is a set of relocatable object files. Its two main tasks are:

1. Symbol resolution: associate each symbol reference in the object files with a symbol definition. Every function and every variable can be regarded as a symbol, and each symbol reference must be associated with exactly one definition.

2. Relocation: the linker associates the definition of each symbol with a specific memory location, and then modifies all references to that symbol so that they point to this location.

1.3 Object files

Object files come in three forms:

1. Relocatable object files

These files contain binary code and data that have already been compiled into machine instructions, but they cannot be executed directly. Their instructions and data often reference symbols in other modules (object files), and those symbols are unknown to the current module. Resolving them requires the linker to process all modules together and then relocate the code and data, which is why such a file is called a "relocatable object file". The suffix is usually *.o.

2. Executable object files

These files also contain binary code and data. The difference is that they have already been linked: the linker has combined all the required relocatable object files into one executable object file. At this point, every reference from one object file to a symbol in another has been resolved and relocated, so every symbol is known and the file can be loaded and executed directly.

3. Shared object files

This is a special kind of relocatable object file that can be loaded into memory and linked dynamically, either when the program that needs it is loaded or while it is running. The suffix is usually *.so. Shared object files are often called "dynamic libraries" or "shared libraries".

1.4 Relocatable object files

On Linux, relocatable object files and executable files normally use ELF (Executable and Linkable Format). The typical structure of an ELF file is as follows:

An object file consists mainly of two parts: the ELF header and the sections of the object file. The first 16 bytes of the ELF header form an identification sequence (the Magic line in the listing below) that describes the word size and byte order of the system that generated the file. The rest of the header holds other information about the file, including the size of the ELF header, the object file type, the target machine type, and the file offset of the section header table. This information is essential when linking and loading ELF programs.

/* ELF file header */
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x4003e0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          6736 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 28
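
As a rough illustration of how the identification bytes are interpreted, the sketch below decodes the Magic line from the listing above. The index and value constants follow the ELF specification, but their names here (IDX_CLASS, CLASS_64, and so on) and the helper functions are made up for this example; a real program on Linux would more likely use the definitions in <elf.h>.

```c
#include <stddef.h>
#include <string.h>

/* Positions and values defined by the ELF specification. The names are
   local to this sketch so that it compiles on any platform. */
enum { IDENT_SIZE = 16, IDX_CLASS = 4, IDX_DATA = 5 };
enum { CLASS_32 = 1, CLASS_64 = 2 };
enum { DATA_LITTLE = 1, DATA_BIG = 2 };

/* Returns 1 if the buffer starts with the ELF magic number 7f 'E' 'L' 'F'. */
int has_elf_magic(const unsigned char *ident, size_t len)
{
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };
    return len >= IDENT_SIZE && memcmp(ident, magic, sizeof magic) == 0;
}

/* Word size in bits encoded in the identification bytes, or 0 if unknown. */
int elf_word_bits(const unsigned char *ident)
{
    switch (ident[IDX_CLASS]) {
    case CLASS_32: return 32;
    case CLASS_64: return 64;
    default:       return 0;
    }
}

/* Returns 1 for little-endian files, 0 for big-endian, -1 if unknown. */
int elf_is_little_endian(const unsigned char *ident)
{
    switch (ident[IDX_DATA]) {
    case DATA_LITTLE: return 1;
    case DATA_BIG:    return 0;
    default:          return -1;
    }
}

/* The 16 bytes shown on the Magic line of the listing above. */
const unsigned char sample_ident[IDENT_SIZE] = {
    0x7f, 0x45, 0x4c, 0x46, 0x02, 0x01, 0x01, 0x00
};
```

For the sample bytes, elf_word_bits reports 64 and elf_is_little_endian reports 1, which agrees with the Class and Data lines of the listing.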

Apart from the ELF header, the rest of the file consists of the sections of the object file. These sections are the core of the ELF file:

.text : Code segment, which stores the binary machine instructions that the machine can execute directly.

.rodata : Read-only data segment, which stores constants used in the program, such as strings.

.data : Data segment, which stores global data that has been explicitly initialized in the program, including C global variables and static variables. Global data initialized to 0 is not stored in the data segment but in the block storage segment. C local variables are stored on the stack and do not appear in the data segment.

.bss : Block storage segment, which holds global data that has not been explicitly initialized or that is initialized to 0. This section occupies no actual space in the object file; it is only a placeholder stating how much space must be reserved for this data at run time. Block storage segments exist to save disk space.

.symtab : Symbol table, which stores information about the functions and global variables defined and referenced in the program. Every relocatable object file has one. Every global symbol (function or global variable) defined in this module, and every global symbol that this module references from other modules (object files), has an entry in this table. Relocation during linking determines the final location of these symbols.

.rel.text : Relocation information for the code segment, listing the places in the code segment that must be modified when the linker relocates the file. These are usually references to function names and labels.

.rel.data : Relocation information for the data segment, listing the places in the data segment that must be modified by relocation. These are references to global variables.

.debug : Debugging information, which stores a symbol table for debugging. It is generated when the program is compiled with the -g option of gcc and covers the references and definitions of all symbols in the source program. With this section, the gdb debugger can print and inspect the values of variables.

.line : A mapping between line numbers in the source program and machine instructions. It is also generated by the -g option of gcc and is very useful when debugging the program with gdb.

.strtab : String table, which stores the names of the symbols in the .symtab and .debug symbol tables. The names are strings terminated by '\0'.
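
To make the section list more concrete, here is a small sketch showing where a typical compiler would place a few C definitions. All names are invented for this example, and the exact placement can vary with the compiler and its options, so the comments describe the usual case rather than a guarantee.

```c
int initialized_global = 42;         /* .data: explicitly initialized, nonzero */
int zero_global = 0;                 /* .bss: initialized to 0 */
int uninitialized_global;            /* .bss (or COMMON in a relocatable file) */
static int file_local_counter = 7;   /* .data: static variables live here too */
const char greeting[] = "hello";     /* .rodata: read-only constant data */

/* The machine code of this function goes into .text. The variable `local`
   lives on the stack at run time and appears in no data section. */
int sum_of_globals(void)
{
    int local = initialized_global + zero_global + uninitialized_global;
    return local + file_local_counter;
}
```

Compiling such a file with gcc -c and inspecting the result with readelf -S (section headers) or readelf -s (symbol table) is one way to check the placement on a given system.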

1.5 Symbols and symbol tables in object files

Symbol resolution is one of the main tasks of linking. Only after every symbol has been resolved correctly can the linker fix up the locations of the referenced symbols, complete relocation, and generate an executable object file that can be loaded and executed directly. Each relocatable object file has a symbol table. The symbols in it fall into 3 categories:

1. Global symbols defined in this module

2. Global symbols referenced in this module but defined by other modules

3. Local symbols defined and referenced only in this module

Note: local variables and local symbols are not the same thing. Local variables are stored on the stack and exist only in memory at run time; local symbols include static variables and local labels, and they do appear in the object file on disk.
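
The difference between local symbols and local variables can be seen in a short sketch. The function and variable names below are invented for illustration. Each function has its own static variable called count; the compiler emits a separate local symbol for each one (GCC typically gives them distinct names of the form count.<number>), while the automatic variable step gets no symbol at all.

```c
/* Returns 2, 4, 6, ... on successive calls. */
int next_even(void)
{
    static int count = 0;   /* local symbol, stored in .data or .bss */
    int step = 2;           /* local variable, lives on the stack, no symbol */
    count += step;
    return count;
}

/* Returns 3, 5, 7, ... on successive calls. */
int next_odd(void)
{
    static int count = 1;   /* a different local symbol with the same C name */
    int step = 2;
    count += step;
    return count;
}
```

Because the two counters are distinct symbols, calls to next_even never disturb the sequence produced by next_odd.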

Symbol table structure:

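typedef struct {
    int name;           /* name of the symbol (byte offset into the string table) */
    int value;          /* address of the symbol. Relocatable module: offset from
                           the start of the section that defines the symbol;
                           executable object file: absolute run-time address */
    int size;           /* size of the object in bytes */
    char type:4;        /* type of the symbol */
    char binding:4;     /* whether the symbol is local or global */
    char reserved;      /* reserved */
    char section;       /* section the symbol is associated with
                           (the Ndx field in the symbol table) */
} Elf_Symbol;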

Symbol table of main.o from sample program 1:

Symbol table '.symtab' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS main.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    6 
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
     8: 0000000000000000     8 OBJECT  GLOBAL DEFAULT    3 buf
     9: 0000000000000000    21 FUNC    GLOBAL DEFAULT    1 main
    10: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND swap

What the symbol table entries mean:

**buf:** An 8-byte object located at offset 0 (Value) in the .data section; a global symbol

**main:** A 21-byte function located at offset 0 in the .text section; a global function

**swap:** A reference to the external symbol swap, which is defined in another module

Integers are used in the symbol table to identify the sections: in this listing, Ndx=1 is the .text section and Ndx=3 is the .data section. ABS marks symbols that should not be relocated. UND (UNDEF) marks undefined symbols, that is, symbols referenced in this object module but defined elsewhere. COMMON marks uninitialized data objects that have not yet been allocated a location, typically uninitialized global variables. In the Bind column, LOCAL denotes local symbols and GLOBAL denotes global symbols.
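
The Ndx column can be read mechanically, as the following sketch suggests. The three special values are the reserved section indexes from the ELF specification; the constant names and both functions are made up for this example, and the mapping of 1 to .text and 3 to .data holds only for the main.o listing above.

```c
/* Reserved section indexes defined by the ELF specification. */
enum { NDX_UNDEF = 0, NDX_ABS = 0xfff1, NDX_COMMON = 0xfff2 };

/* Maps an Ndx value to a short description. Indexes 1 and 3 are specific
   to the main.o listing; other object files may number sections differently. */
const char *describe_ndx(unsigned ndx)
{
    switch (ndx) {
    case NDX_UNDEF:  return "undefined, defined in another module";
    case NDX_ABS:    return "absolute, not relocated";
    case NDX_COMMON: return "common, not yet allocated";
    case 1:          return ".text";
    case 3:          return ".data";
    default:         return "other section";
    }
}

/* Returns 1 if the linker must find the definition in another module. */
int needs_external_definition(unsigned ndx)
{
    return ndx == NDX_UNDEF;
}
```

Applied to the listing, buf (Ndx 3) maps to .data, main (Ndx 1) maps to .text, and swap (UND) is the only symbol that needs an external definition.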

1.6 Symbol resolution

The linker resolves symbol references by associating each reference with exactly one symbol definition from the symbol tables of its input relocatable object files.

1. Local symbol resolution

Resolution is simple for references to local symbols defined in the same module. The compiler allows only one definition of each local symbol per module. Local static variables also receive local linker symbols, and the compiler makes sure each one has a unique name.
 

2. Global symbol resolution 

When the compiler encounters a symbol (variable or function) that is not defined in the current module, it assumes that the symbol is defined in some other module, generates a linker symbol table entry, and leaves the rest to the linker. If the linker then cannot find a definition for the referenced symbol in any of its input modules, it reports an error (typically an undefined reference) and stops.
 

3. Linker rules for a global symbol with the same name defined in multiple object files

Rule 1: Multiple strong symbols with the same name are not allowed
Rule 2: If there is one strong symbol and multiple weak symbols, choose the strong symbol
Rule 3: If there are multiple weak symbols and no strong symbol, choose any one of the weak symbols
Strong symbol: Functions and initialized global variables
Weak symbol: Uninitialized global variables
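
The three rules can be sketched as a toy resolver. This is only a model of the decision logic, not how a real linker is implemented; the struct, the function, the return codes, and the sample modules a.o, b.o, and c.o are all assumptions made for the example. Where rule 3 allows any weak symbol, the sketch simply keeps the first one it sees.

```c
#include <stddef.h>
#include <string.h>

/* One global definition of a symbol as seen by this toy linker. */
struct definition {
    const char *name;    /* symbol name */
    const char *module;  /* object file that defines it */
    int strong;          /* 1 = strong, 0 = weak */
};

/* Applies the three rules to all definitions of `name`.
   Returns the index of the chosen definition, -1 if there is none,
   or -2 if more than one strong definition exists (rule 1). */
int resolve_symbol(const struct definition *defs, size_t n, const char *name)
{
    int chosen = -1;
    int strong_seen = 0;

    for (size_t i = 0; i < n; i++) {
        if (strcmp(defs[i].name, name) != 0)
            continue;
        if (defs[i].strong) {
            if (strong_seen)
                return -2;       /* rule 1: duplicate strong symbol */
            strong_seen = 1;
            chosen = (int)i;     /* rule 2: strong beats weak */
        } else if (chosen == -1) {
            chosen = (int)i;     /* rule 3: any weak one; here, the first */
        }
    }
    return chosen;
}

/* Hypothetical input: three modules with overlapping global definitions. */
const struct definition sample_defs[] = {
    { "x", "a.o", 0 }, { "x", "b.o", 1 },
    { "y", "a.o", 0 }, { "y", "c.o", 0 },
    { "z", "a.o", 1 }, { "z", "c.o", 1 },
};
```

With this input, x resolves to the strong definition in b.o, y resolves to one of the two weak definitions, and z is rejected because it has two strong definitions.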

 

1.7 Relocation

Once symbol resolution is complete, every symbol reference is associated with exactly one definition, and the linker knows the size of the code and data sections in each input module. In the relocation step, the linker merges all the object files taking part in the link and assigns a run-time address to each symbol. Relocation is performed in two steps:

1. Relocating sections and symbol definitions

In this step, the linker merges all sections of the same type into one new section. For example, the .data sections of all input object modules are merged into the single .data section of the executable object file, and the linker then assigns run-time memory addresses to the new .data section. Other sections are handled the same way. When this step is complete, every instruction and global variable in the program has a unique run-time memory address.

2. Relocating symbol references within sections

In this step, the linker modifies every symbol reference in the code and data sections so that it points to the correct run-time memory address.
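
The first step can be modelled with a few lines of arithmetic. The sketch below lays out the .data sections of the input modules back to back and derives the run-time address of a symbol from the start of its module's section plus its offset. The types, function names, and any base address passed in are assumptions for illustration, and the model ignores the alignment padding a real linker would insert.

```c
#include <stddef.h>

/* A symbol defined in the .data section of one input module. */
struct data_symbol {
    const char *name;
    size_t module;          /* index of the defining input module */
    unsigned long offset;   /* offset inside that module's .data section */
};

/* Places the input .data sections one after another starting at `base`
   and records where each module's section begins. Returns the total size
   of the merged section. */
unsigned long merge_sections(const unsigned long *sizes, size_t n,
                             unsigned long base, unsigned long *starts)
{
    unsigned long next = base;

    for (size_t i = 0; i < n; i++) {
        starts[i] = next;
        next += sizes[i];
    }
    return next - base;
}

/* Run-time address of a symbol: start of its module's section plus offset. */
unsigned long symbol_address(const struct data_symbol *sym,
                             const unsigned long *starts)
{
    return starts[sym->module] + sym->offset;
}
```

For example, if main.o and swap.o each contribute 8 bytes of .data and the merged section is placed at a hypothetical address 0x601020, then buf (module 0, offset 0) ends up at 0x601020 and bufp0 (module 1, offset 0) at 0x601028.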

When the assembler generates an object file, it does not know where the code and data will finally be stored in memory, nor does it know the locations of external symbols defined in other files. Therefore, whenever the assembler encounters a reference to an object whose final location is unknown, it generates a relocation entry that tells the linker how to modify the reference when it merges the object files. Relocation entries for code are stored in the .rel.text section, and relocation entries for data are stored in the .rel.data section. An entry can be understood as a structure that stores the relocation information for one reference:

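typedef struct {
    int offset;  /* offset of the reference within its section */
    int symbol;  /* symbol the reference refers to */
    int type;    /* relocation type */
} symbol_rel;
/*
offset is the offset of the reference within the section that stores it.
symbol identifies the symbol by name; the string itself is stored in the
.strtab section, and this field holds the index of the start of that string.
type is the relocation type. The linker cares about only two types:
PC-relative references and absolute address references.
*/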

A PC-relative reference is resolved at run time by adding the offset encoded in the instruction to the current PC value (usually the address of the next instruction). An absolute reference uses the address encoded in the instruction directly as the target address, without further modification.

With this information, the linker adds the offset of a symbol within its section to the new run-time address of that section after relocation; the result is the final address of the symbol. Likewise, every place in the program that references the symbol must be patched to use this new address instead of the old placeholder value. Once all references have been updated, the linker's work is done.
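
The arithmetic behind the two relocation types can be sketched as follows, assuming 32-bit addresses and a 4-byte reference field. The function names, the `addend` parameter (the value the assembler leaves in the field, for instance -4 for a call), and all addresses used in the example are illustrative assumptions, not values taken from a real link of sample program 1.

```c
#include <stdint.h>

/* PC-relative reference: the value stored in the field is the distance
   from the PC (the address just past the 4-byte field) to the target. */
int32_t relocate_pc_relative(uint32_t section_addr, uint32_t offset,
                             uint32_t symbol_addr, int32_t addend)
{
    uint32_t ref_addr = section_addr + offset;  /* run-time address of the field */
    return (int32_t)(symbol_addr + (uint32_t)addend - ref_addr);
}

/* Absolute reference: the field receives the symbol's run-time address,
   plus any value already stored in the field. */
uint32_t relocate_absolute(uint32_t symbol_addr, int32_t addend)
{
    return symbol_addr + (uint32_t)addend;
}

/* What the CPU computes when it executes a PC-relative call. */
uint32_t call_target(uint32_t next_instruction_addr, int32_t displacement)
{
    return next_instruction_addr + (uint32_t)displacement;
}
```

Suppose .text is placed at 0x80483b4, the call to swap has its reference field at offset 7, and swap ends up at 0x80483c8. Then relocate_pc_relative(0x80483b4, 7, 0x80483c8, -4) yields 9, and at run time call_target(0x80483bf, 9) gives 0x80483c8, the address of swap.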

1.8 Executable object files

The format of an executable object file (ELF):

 

The ELF header describes the overall format of the file. It is similar to the header of a relocatable object file, but it also includes the entry point of the program.

Segment header table: describes how contiguous chunks of the executable file are mapped to contiguous memory segments.

.init defines a function, _init, which is called by the program's initialization code.

.text, .rodata, and .data are similar to the corresponding sections of a relocatable object file, except that they have been relocated to their final run-time memory addresses.

Example of a segment header table:

 

off: file offset; vaddr: virtual address; paddr: physical address; align: segment alignment;

filesz: segment size in the object file; memsz: segment size in memory; flags: access permissions

Explanation:

Lines 1 and 2 tell us that the first segment (the code segment) is aligned to a 4 KB boundary, has read/execute permissions, starts at memory address 0x08048000, has a total memory size of 0x448 bytes, and is initialized with the first 0x448 bytes of the executable object file, which include the ELF header, the segment header table, and the .init, .text, and .rodata sections.

Lines 3 and 4 tell us that the second segment (the data segment) is aligned to a 4 KB boundary, has read/write permissions, starts at memory address 0x08049448, has a total memory size of 0x104 bytes, and is initialized with the 0xe8 bytes starting at file offset 0x448, which in this case is the beginning of the .data section. The remaining bytes of the segment correspond to .bss data that will be initialized to zero at run time.
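
The numbers in this explanation can be checked with a little arithmetic. The sketch below uses a simplified struct whose field names follow the legend above; the struct itself and the two helper functions are invented for this example. It also relies on one fact that the text does not state: ELF requires vaddr and off to be congruent modulo align, so that file pages can be mapped directly into memory.

```c
#include <stdint.h>

/* One loadable entry of the segment header table (simplified). */
struct segment {
    uint32_t off;     /* file offset */
    uint32_t vaddr;   /* virtual address */
    uint32_t align;   /* alignment */
    uint32_t filesz;  /* size in the file */
    uint32_t memsz;   /* size in memory */
};

/* Bytes that exist only in memory and are zero-filled at load time (.bss). */
uint32_t zero_filled_bytes(const struct segment *s)
{
    return s->memsz - s->filesz;
}

/* Returns 1 if vaddr and off are congruent modulo align. */
int is_mappable(const struct segment *s)
{
    return s->align != 0 && s->vaddr % s->align == s->off % s->align;
}

/* Values quoted in the explanation above; 4 KB alignment is 0x1000. */
const struct segment code_segment = { 0x0,   0x08048000, 0x1000, 0x448, 0x448 };
const struct segment data_segment = { 0x448, 0x08049448, 0x1000, 0xe8,  0x104 };
```

For the data segment, zero_filled_bytes returns 0x104 - 0xe8 = 0x1c, so 28 bytes of .bss are zero-filled, and both segments pass the is_mappable check.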


Origin blog.csdn.net/qq_40648827/article/details/128021316