Linker basics

 

Sometimes you can learn knowledge, but not time. -Zhong Yunlong


Basic: https://blog.csdn.net/qq_35865125/article/details/105214201


Overview

In a compilation system, the linker acts as the "glue": it splices the relocatable object files produced by the assembler into an executable ELF file. The splicing is not mechanical, however. The linker must also complete the work that cannot be done at the assembly stage: allocating segment addresses, calculating symbol addresses, and correcting the contents of data and instructions.

These three tasks correspond to the core phases of linking: address space allocation, symbol resolution, and relocation.


In every entry of the section header table of a relocatable object file, the virtual address of the section defaults to 0, because the assembler cannot know where the segment will be loaded. The main purpose of the linker's address space allocation is to assign a load address to each segment (that is, to decide where each section of the object files is placed in the executable file).

 

Once the load address of a section (its base address) is determined, the address of each symbol in the executable file (the symbol address, for example the address of a defined function) can be calculated from the symbol's offset within the object file. Symbol resolution does not stop at calculating symbol addresses: the linker must also analyze the symbol references between object files and calculate the addresses of the external symbols that each object file references.

 

After symbol resolution, the symbol addresses (that is, the addresses in the executable file) of all object files are known. The linker then uses relocation to correct the symbol addresses referenced in the code and data segments (for example, if the code segment contains call printf, the placeholder for printf must be replaced with the address of the function).

 

Finally, the linker exports the file information processed by the above operations as an executable ELF file to complete the linking work.

 

Information collection

For the linker, the input is a series of relocatable object files. To carry out the later steps, it must scan the object files one by one and extract the information it needs.

The linker has to analyze the symbol references in each object file because an object file may contain undefined symbols, that is, references to symbols defined in other object files. To simplify symbol resolution, two symbol sets are generally maintained: the export symbol set, which holds every global symbol defined in any object file and available for other object files to reference; and the import symbol set, which holds the symbols that are undefined inside an object file and must be provided by other object files.
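As a rough illustration of the two sets, the sketch below scans a hand-written description of two object files. The `Symbol` record and the `collect_symbol_sets` helper are hypothetical names invented for this example; a real linker would read the same information from each file's ELF symbol table.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    defined: bool           # False: the file only references the symbol
    is_global: bool = True  # local symbols are invisible to other files


def collect_symbol_sets(object_files):
    """Return (exports, imports), each mapping symbol name -> list of files."""
    exports, imports = {}, {}
    for file_name, symbols in object_files.items():
        for sym in symbols:
            if not sym.is_global:
                continue  # a local symbol is neither exported nor imported
            target = exports if sym.defined else imports
            target.setdefault(sym.name, []).append(file_name)
    return exports, imports


# Hypothetical input: a.o defines main and calls sum, which b.o defines.
objects = {
    "a.o": [Symbol("main", True), Symbol("sum", False),
            Symbol("tmp", True, is_global=False)],
    "b.o": [Symbol("sum", True)],
}
exports, imports = collect_symbol_sets(objects)
print(exports)
print(imports)
```

Keeping the list of files per name (rather than a plain set of names) makes it easy to report later which file defines or references a problematic symbol.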

 

Address space allocation

When the assembler generates an object file, it cannot determine the load address of a segment, so the segment base address is recorded as 0 by default. The linker's first step is to determine the base address of every segment to be loaded. This process is called address space allocation.

When assigning base addresses, the linker has to consider three aspects.

1) Start address of segment loading.

      This address is the starting position of all load segments. In 32-bit Linux systems, it is generally set to 0x08048000.

2) The splicing sequence of segments.

     The linker scans the segments of the same name in each object file in order and "places" their binary data one after another.

3) Segment alignment.

      Segment alignment includes two levels: the alignment of the segment file offset and the alignment of the segment base address.

In a relocatable object file, the file offset of a segment is generally aligned to 4 bytes. The alignment of the segment base address is not considered (the base address is 0, so alignment is meaningless).

In an executable file, the file offset of the code segment ".text" is aligned to 16 bytes, while the file offsets of the other segments are still aligned to 4 bytes by default. Aligning the segment base address is more involved: the linear address of the segment and its file offset must be congruent modulo the segment alignment value (that is, the page size, 4096 bytes by default on Linux).

(This alignment value is the p_align field of the program header table. The rule is p_vaddr % p_align = p_offset % p_align, that is, the linear address of the segment and its file offset leave the same remainder when divided by p_align. In general p_align is 0x1000 = 4096, the default page size on Linux.)
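The congruence requirement between a segment's linear address and its file offset can be sketched as follows. This is a minimal illustration, not the book's implementation: `align_up` and `next_base_address` are hypothetical helpers, and the offsets and addresses in the example are made up.

```python
PAGE_SIZE = 0x1000  # the default alignment value mentioned in the text


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def next_base_address(min_addr, file_offset, page_size=PAGE_SIZE):
    """Smallest address >= min_addr that is congruent to file_offset
    modulo page_size."""
    base = min_addr - min_addr % page_size + file_offset % page_size
    if base < min_addr:
        base += page_size
    return base


# Hypothetical example: a code segment stored at file offset 0x60,
# loaded no lower than the start address 0x08048000.
text_base = next_base_address(0x08048000, 0x60)
print(hex(text_base))

# Hypothetical example: a data segment at file offset 0xcc that must
# start on a fresh page after the code segment ends at 0x080480cb.
data_base = next_base_address(align_up(0x080480CB, PAGE_SIZE), 0xCC)
print(hex(data_base))
```

With this rule a whole page of the file can be mapped directly into memory: every byte ends up at the address the linker assigned to it without the loader having to shift any data.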

 

The following figure shows an example of address space allocation. The object file a.o has a code segment of 0x4a bytes and a data segment of 0x08 bytes; b.o has a code segment of 0x21 bytes and a data segment of 0x04 bytes.

 

- The section header table is not required in an executable file; it is only required in object files.
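The splicing order can be sketched with the segment sizes quoted for a.o and b.o. This is a deliberately simplified model: it ignores the ELF headers and the page alignment between the code and data segments, and it assumes each piece is aligned to 4 bytes, so the addresses it prints are illustrative only and are not claimed to match the original figure. The `allocate` function is a hypothetical helper.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def allocate(objects, section_order, start_addr, piece_align=4):
    """Assign a base address to every (file, section) piece.

    Sections with the same name are placed back to back, in the order
    the object files were given to the linker.
    """
    layout = {}
    addr = start_addr
    for section in section_order:
        for file_name, sections in objects.items():
            size = sections.get(section, 0)
            if size == 0:
                continue
            addr = align_up(addr, piece_align)
            layout[(file_name, section)] = addr
            addr += size
    return layout


# Segment sizes taken from the text; everything else is an assumption.
objects = {
    "a.o": {".text": 0x4A, ".data": 0x08},
    "b.o": {".text": 0x21, ".data": 0x04},
}
layout = allocate(objects, [".text", ".data"], 0x08048000)
for (file_name, section), base in layout.items():
    print(f"{file_name:4} {section:6} {base:#010x}")
```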

 

Symbol resolution

The symbol table of an object file stores the offset of each defined symbol relative to the base address of its segment. Once address space allocation is finished, the base address of every segment is known, so the symbol address can be calculated with the following formula:

Symbol address = segment base address + symbol offset from segment base address

Before symbol addresses are calculated, however, some preparation is required.

First, the symbol tables of the object files are scanned to obtain the definition and reference information of the symbols, that is, the export symbol set and the import symbol set described above.

Second, the import and export symbol sets must be validated. Validation covers two cases:

1) Symbol redefinition: the export symbol set contains more than one symbol with the same name. Symbols are looked up by name during linking, so with a redefined symbol a referencing file cannot tell which definition to use.

2) Undefined symbol: the import symbol set contains a symbol that does not exist in the export symbol set. When an external symbol referenced by an object file has no definition in any other object file, its address cannot be determined.

If a symbol is redefined or undefined, the linker cannot continue.
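The two checks can be sketched as below. The representation (symbol name mapped to the list of files that define or reference it) and the function name `verify_symbols` are assumptions made for this example, and the error wording is invented rather than copied from any real linker.

```python
def verify_symbols(exports, imports):
    """Return a list of error messages; an empty list means linking may go on.

    exports: symbol name -> files that define it
    imports: symbol name -> files that reference it without defining it
    """
    errors = []
    for name, files in exports.items():
        if len(files) > 1:  # check 1: the same name is defined more than once
            errors.append(f"redefinition of '{name}' in {', '.join(files)}")
    for name, files in imports.items():
        if name not in exports:  # check 2: referenced but never defined
            errors.append(
                f"undefined symbol '{name}' referenced by {', '.join(files)}")
    return errors


# Hypothetical input with one error of each kind.
exports = {"main": ["a.o"], "sum": ["b.o", "c.o"]}
imports = {"sum": ["a.o"], "printf": ["a.o"]}
errors = verify_symbols(exports, imports)
for message in errors:
    print(message)
```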

 

Note:

There is an important difference between an object file and an executable file: the program entry point field e_entry in the ELF header of an object file is 0, whereas in an executable file it holds a linear address. Assume that the entry address of the program is recorded in a symbol named "@start". This name clearly cannot be generated by the compiler from source code. To guarantee that the linker can find the program entry point, the symbol validation phase must insist that the "@start" symbol is exported. As for who provides "@start", assume for now that it comes from an existing object file.

 

In general, symbol address resolution takes two steps:

1) Scan the symbols defined locally in each ELF object file and calculate their addresses.

2) Scan every symbol in the import set (that is, symbols a file uses but other object files define) and pass the resolved address to the symbol table of each object file that references it.
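The two steps, together with the formula "symbol address = segment base address + symbol offset", can be sketched as follows. The dictionary layout and the `resolve` function are hypothetical, all defined symbols are treated as global for brevity, and the base addresses are example values rather than ones taken from the source.

```python
def resolve(objects, section_bases):
    """Return file -> {symbol name: address} for every symbol a file uses.

    objects:       file -> {"defined": {name: (section, offset)},
                            "undefined": [names]}
    section_bases: (file, section) -> base address from address allocation
    """
    # Step 1: each file's own definitions, base address + offset.
    all_defined = {}
    per_file = {}
    for file_name, obj in objects.items():
        table = {}
        for name, (section, offset) in obj["defined"].items():
            table[name] = section_bases[(file_name, section)] + offset
        per_file[file_name] = table
        all_defined.update(table)

    # Step 2: copy resolved addresses into the files that import them.
    for file_name, obj in objects.items():
        for name in obj["undefined"]:
            per_file[file_name][name] = all_defined[name]
    return per_file


# Hypothetical input: main sits at offset 0 of a.o's .text,
# sum at offset 0x10 of b.o's .text.
objects = {
    "a.o": {"defined": {"main": (".text", 0x00)}, "undefined": ["sum"]},
    "b.o": {"defined": {"sum": (".text", 0x10)}, "undefined": []},
}
section_bases = {("a.o", ".text"): 0x08048000, ("b.o", ".text"): 0x0804804C}
tables = resolve(objects, section_bases)
print({name: hex(addr) for name, addr in tables["a.o"].items()})
```

The sketch assumes validation has already passed, so every imported name is guaranteed to be present when step 2 looks it up.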

 

Relocation

 

(https://blog.csdn.net/qq_35865125/article/details/105214201

The symbols that need to be relocated are stored in the relocation tables of each object file, in sections whose names begin with ".rel". Each section of an ELF file that needs relocation generally has its own relocation table. For example, the relocation table of the code section ".text" is stored in the ".rel.text" section, and the relocation table of ".data" is stored in ".rel.data".)

 

The relocation information in an object file contains three key elements:

# Relocation symbol: which symbol's address is used for the relocation (recorded in the relocation table of each object file).

# Relocation location: where to apply the relocation. This also comes from the relocation table, which records the name of the symbol to relocate, the section of the object file that contains the reference, and the offset within that section. After address space allocation the address of that section is known, so the position to patch in the executable file can be found from the offset.

# Relocation type: which method to use for the relocation.

 

Because relocation depends on the addresses of the relocated symbols, it cannot start until symbol resolution is complete.

 

There are two types of relocation:

Absolute address relocation and relative address relocation. Correcting segment data according to different relocation types is the core of relocation.

1) Absolute address relocation is the simpler case. It generally arises from a direct reference to a symbol's address. Since the assembler cannot determine the symbol's virtual address, it fills the referenced address with 0 as a placeholder. Absolute address relocation therefore only needs to write the virtual address of the relocation symbol into the relocation position.

Absolute relocation address = relocation symbol address
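A minimal sketch of the absolute case, assuming a 32-bit little-endian target: the linker overwrites the 4-byte zero placeholder with the symbol address. The helper name `relocate_absolute` and the symbol address are invented for the example; the instruction bytes are the 32-bit x86 encoding of a load from an absolute address into eax (opcode A1 followed by a 4-byte address).

```python
import struct


def relocate_absolute(section_data, offset, symbol_addr):
    """Write symbol_addr over the 4-byte placeholder at offset."""
    struct.pack_into("<I", section_data, offset, symbol_addr)


# Opcode A1 followed by four zero bytes left by the assembler.
code = bytearray(b"\xa1\x00\x00\x00\x00")

# Hypothetical address of the referenced variable after symbol resolution.
relocate_absolute(code, 1, 0x08049070)
print(code.hex())
```

The relocation position is given as an offset into the section's bytes, which is how the relocation table describes it; the section's own base address is not needed for the absolute case.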

 

2) Relative address relocation is slightly more complicated. It generally arises from a jump or call instruction that references the address of a symbol in another file.

Here too the assembler cannot determine the virtual address of the referenced symbol, but instead of filling the position with 0 it stores "the offset of the relocation position relative to the address of the next instruction". When the linker performs a relative address relocation, it calculates the offset of the symbol address relative to the relocation position and adds it to the content stored at the relocation position.

Relative relocation address = relocation symbol address - relocation position + data stored at relocation position

                           = (relocation symbol address - relocation position) + (relocation position - next instruction address)

                           = relocation symbol address - next instruction address

The calculation shows that the final relative relocation address is the offset of the symbol address from the address of the next instruction, which is exactly the operand a jump instruction requires. As for why relative relocation is calculated in this seemingly cumbersome way, the author believes the benefit is that, for instructions of different lengths and encodings, the calculation stays the same as long as the data at the relocation position is set up according to the relative addressing rule; only the value stored at the relocation position differs. For example, for an x86 call or jump instruction with a 32-bit displacement the stored value is -4, because the next instruction begins 4 bytes after the relocation position; an instruction with a different displacement width would store a different constant.
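The relative formula can be checked on a single x86 call instruction (opcode E8 followed by a 32-bit displacement). In this sketch the assembler is assumed to have stored -4 at the relocation position; `relocate_relative` is a hypothetical helper and both addresses are example values.

```python
import struct


def relocate_relative(section_data, section_base, offset, symbol_addr):
    """Patch a 32-bit relative reference and return the value written.

    value = symbol address - relocation position + data already stored there
    """
    position = section_base + offset
    (stored,) = struct.unpack_from("<i", section_data, offset)
    value = symbol_addr - position + stored
    struct.pack_into("<i", section_data, offset, value)
    return value


# call <symbol>: E8, then the placeholder -4 (fc ff ff ff) from the assembler.
code = bytearray(b"\xe8\xfc\xff\xff\xff")
section_base = 0x08048000  # hypothetical address of this code
symbol_addr = 0x0804804C   # hypothetical address of the callee

value = relocate_relative(code, section_base, 1, symbol_addr)
next_instruction = section_base + len(code)
print(hex(value), code.hex())
```

The patched displacement equals the callee address minus the address of the instruction after the call, which is the last line of the derivation.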

 

The following describes the relocation process with an example.

 

 

Program entry point and runtime library

As mentioned in the previous section, the address of the program entry point is stored in a special symbol named "@start", and the object file that defines the symbol is not generated by the compiler based on the source code. Then there are two problems that need to be clarified:

1) Why introduce new symbols instead of the main function as the entry point of the program?

2) How do I get the target file that defines the new symbol?

First explain the first question. The form of assembly code fragments generated for the main function is as follows:

 

In essence, the main function is not much different from an ordinary function: it consists of function prologue code (lines 3 to 5), the function body (line 6, omitted), and function epilogue code (lines 7 to 9). Suppose main were used as the program entry point, that is, the linear address of the main symbol were written to the e_entry field of the ELF header. After the program is loaded, execution would start at the address of main. Nothing goes wrong while main executes, until the ret instruction. According to its semantics, ret takes the 32-bit value at the top of the stack as the return address and jumps to it. But what was on the top of the stack before main started is unknown, so the final behavior of the program cannot be predicted. The most common outcome is a segmentation fault.

Therefore, in order for the program to exit gracefully, a caller of the main function must be constructed to complete the "cleanup" work after the function call. This also provides a solution to the second problem.

Among the Linux system calls, number 1 is exit, and calling it makes the process exit normally. The assembly code that calls exit is shown in lines 6 to 8: register eax holds the exit system call number 1, ebx holds the exit argument 0, and the int instruction triggers the system call to end the process. The code at the symbol "@start" calls main and then uses the exit system call to exit the process. Before and after the call to main, some initialization (omitted on line 3) and cleanup (omitted on line 5) can be performed.

If the compiler saves the above code in start.s, after processing by the assembler, the object file start.o can be obtained. Then, use the readelf tool to view the symbol table of start.o:

 

From the point of view of the whole compilation system's workflow, start.o is an object file that the system needs in order to work properly. However the source code being compiled is written, start.o must be linked with the other object files at the final linking stage to produce a working executable. Object files of this kind share a common name: the "language runtime library". start.o is clearly the simplest possible runtime library: it only bootstraps and calls the main function, and does nothing else.

 

  ------ Great insight -----------

In the same way, the runtime library of a programming language can easily be extended.

For example, printf.s can be written to implement the standard output function printf, and the assembler turns it into the object file printf.o. As long as the source code declares and uses printf and printf.o is linked into the executable, standard output becomes available in the high-level language. Going further, math.c can be written to implement math functions and turned into math.o by the compiler and assembler, so that the high-level language can perform complex mathematical calculations.

If the compilation system implements a preprocessor that supports include directives, the declarations of functions such as printf or those in math.c can be placed in header files like "stdio.h" or "math.h".

If the linker accepts archive files as input, object files such as printf.o and math.o can be packed into an archive (library) like "libc.a"; the linker only needs to unpack the archive before linking. When writing a high-level language program, including the required header files and supplying the corresponding library files at the linking stage is enough to make these more powerful language features available.
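The text describes unpacking the whole archive before linking. As a refinement that goes beyond the source, the sketch below pulls in only the archive members that define a symbol still undefined, which is how linkers such as GNU ld typically treat static libraries. The archive contents, the `write.o` member, and the `select_members` function are all hypothetical.

```python
def select_members(archive, undefined):
    """Pick the archive members needed to satisfy the undefined symbols.

    archive:   member name -> {"defines": set of names, "needs": set of names}
    undefined: names the program's own object files still need
    Returns (selected member names in order, names still unresolved).
    """
    selected = []
    pending = set(undefined)
    resolved = set()
    progress = True
    while pending and progress:
        progress = False
        for member, info in archive.items():
            if member in selected:
                continue
            if info["defines"] & pending:
                selected.append(member)
                resolved |= info["defines"]
                # A new member may introduce new undefined symbols.
                pending |= info["needs"]
                pending -= resolved
                progress = True
    return selected, pending


# Hypothetical archive contents.
libc = {
    "printf.o": {"defines": {"printf"}, "needs": {"write"}},
    "write.o": {"defines": {"write"}, "needs": set()},
    "math.o": {"defines": {"sin", "cos"}, "needs": set()},
}
selected, unresolved = select_members(libc, {"printf"})
print(selected, unresolved)
```

Here math.o is left out because nothing references it, which keeps the executable smaller than linking every member would.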

 

 

In comparison, GCC's C Runtime Library (CRT) is much more complicated. Recall the example in Chapter 1:

(When linking statically, GCC links five important object files, crt1.o, crti.o, crtbeginT.o, crtend.o and crtn.o, and three static libraries from the C runtime library (CRT), libgcc.a, libgcc_eh.a and libc.a, into the executable file hello.)

 

The five object files crt1.o, crti.o, crtbeginT.o, crtend.o and crtn.o and the three static libraries libgcc.a, libgcc_eh.a and libc.a take part in GCC's static linking workflow. Their functions are:

1) crt1.o: defines the program entry point "_start", calls the ".init" code to initialize the program, calls the main function, and calls the ".fini" code to clean up. The earlier version was crt0.o, which did not support the ".init" and ".fini" sections.

2) crti.o: defines the function-entry (prologue) code of the ".init" section and calls the C++ global constructor code.

3) crtn.o: defines the function-exit (epilogue) code of the ".fini" section and calls the C++ global destructor code.

4) crtbeginT.o: defines the C++ global constructor code.

5) crtend.o: defines the C++ global destructor code.

6) libc.a: defines the C standard library code. It should be a prebuilt file that is installed along with gcc. To use its functions, #include the corresponding header files in the code; the header files only declare that the functions exist, and at the linking stage the linker takes the function code from libc.a. See http://www.delorie.com/djgpp/doc/libc/libc_1.html

7) libgcc.a: defines helper functions needed because of platform differences.

8) libgcc_eh.a: defines platform-specific code for C++ exception handling.

As can be seen, for a high-level language, the compiler, assembler and linker are essential, and the language runtime library is just as indispensable. A feature-rich runtime library makes a high-level language more expressive.

 

ELF file generation

 

 


Ref:

Fan Zhidong; Zhang Qiongsheng. "Building a Compilation System by Oneself: Compilation, Assembly and Linking". Machinery Industry Press.


Origin blog.csdn.net/qq_35865125/article/details/105458421