C++ compilation and linking process

As we know, a program written in C/C++ must be compiled and linked before its source code becomes an executable file (a .exe program under Windows, an ELF executable under Linux).

So what exactly does the compilation and linking process do, and where is this executable file loaded and run?

Data and instructions

No matter which language the code is written in, it ultimately produces two things: instructions and data. Which parts of the code become instructions, and which become data?

All global and static variables are data; everything else, including local variables, becomes instructions. Let's look at a piece of code.
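A minimal sketch of such code, using the variable names the rest of this article refers to (data1 through data9). The initial values and the function name `demo` are assumptions for illustration, not the original listing.

```c
int data1 = 10;          /* global, initialized to a non-zero value */
int data2 = 0;           /* global, initialized to zero */
int data3;               /* global, uninitialized */

static int data4 = 20;   /* file-scope static, non-zero */
static int data5 = 0;    /* file-scope static, zero */
static int data6;        /* file-scope static, uninitialized */

/* `demo` is a hypothetical name; any function would do. */
int demo(void)
{
    int a = 10;          /* locals get no symbol; they become instructions */
    int b = 0;
    int c;

    static int data7 = 30;   /* function-scope static, non-zero */
    static int data8 = 0;    /* function-scope static, zero */
    static int data9;        /* function-scope static, uninitialized */

    c = a + b;
    return c + data1 + data2 + data3 + data4 + data5 + data6
             + data7 + data8 + data9;
}
```

Here a, b, and c exist only as stack operations inside the function's instructions, while the nine data variables occupy storage for the whole run of the program.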

In the code above, data1 through data9 are all data, and everything else is instructions. A program is loaded into memory to run. Since the code is divided into instructions and data, the two cannot be mixed together arbitrarily in memory; there must be rules for how the memory is divided.

Virtual address space

When a program runs, the operating system gives it a fixed-size virtual address space (4G by default on 32-bit x86 Linux). How is this space laid out? Let's take a look.

This is the layout of the regions of the virtual address space. As the figure shows, 1G of the 4G space is kernel space used by the operating system, which user code cannot access, and the remaining 3G is user space in which the process runs. The 3G user space is divided into many sections. The 128M starting at address 0 is reserved by the system and cannot be accessed by the user. Next comes the .text section, which stores code, followed by the .data and .bss sections. Both store data, but they differ: .data holds data initialized to a non-zero value, while .bss holds data that is uninitialized or initialized to 0. We can look at how this is laid out under Linux.
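The split described above can be written down as a small classifier. This is only a sketch of the figures quoted in the text (128M reserved, 3G user, 1G kernel); the names `classify_address`, `RESERVED_END`, and `USER_END` are hypothetical, and a real system's layout depends on the kernel configuration.

```c
#include <stdint.h>

/* Boundaries assumed from the description in the text. */
#define RESERVED_END 0x08000000u   /* first 128M: reserved */
#define USER_END     0xC0000000u   /* first 3G: user space ends here */

enum region { REGION_RESERVED, REGION_USER, REGION_KERNEL };

/* Tell which part of the 4G address space a 32-bit address falls into. */
enum region classify_address(uint32_t addr)
{
    if (addr < RESERVED_END)
        return REGION_RESERVED;
    if (addr < USER_END)
        return REGION_USER;
    return REGION_KERNEL;
}
```

For example, `classify_address(0xC0000000u)` yields `REGION_KERNEL`, the first address of the top 1G.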

The above is the section table of our program under Linux. We can clearly see the .text, .data, and .bss sections in it, which shows that these sections really exist and are not something we made up. But looking at the section table, we find that the .bss section and the .comment section start at the same offset. Why? A handy way to remember bss is "better save space" (historically the name comes from "Block Started by Symbol"). Whose space is being saved? A program has to be loaded from a file into memory before it runs, and the section table above was dumped from the intermediate (object) file.

In other words, the generated intermediate file allocates no space for the .bss section, which raises the question of how its data is stored. Let's set this question aside for now and come back to it later. First, let's look at what compilation and linking mainly do.
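One way to picture how .bss can take no room in the file: a segment may be larger in memory than in the file, and the loader zero-fills the difference. The sketch below is a simplified model, not real loader code. `load_segment` is a hypothetical name, and `filesz` and `memsz` are modelled on the p_filesz and p_memsz fields of an ELF program header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the `filesz` bytes that exist in the file, then zero-fill the
 * rest of the `memsz` bytes the segment occupies in memory. The
 * zero-filled tail is where the .bss data lives.
 */
void load_segment(unsigned char *dst, const unsigned char *src,
                  size_t filesz, size_t memsz)
{
    assert(filesz <= memsz);
    memcpy(dst, src, filesz);
    memset(dst + filesz, 0, memsz - filesz);
}
```

Since every variable in .bss starts out as 0, the file only has to record how large the section is, not its contents.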

Compilation process

Compilation mainly consists of three steps: preprocessing, compilation, and assembly.

Preprocessing: remove comments, process the preprocessor directives beginning with "#", and perform macro replacement

Compilation: generate symbols and translate the source code into assembly instructions

Assembly: generate a binary relocatable file (.o)
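A small file is enough to watch the three steps at work. The macro `SQUARE` and the function `area` below are made-up names for illustration.

```c
/* Comments like this one are removed by the preprocessor. */
#define SQUARE(x) ((x) * (x))   /* macro: replaced textually before compilation */

/* After preprocessing, the return statement reads: return ((side) * (side)); */
int area(int side)
{
    return SQUARE(side);
}
```

Assuming the file is saved as stage.c (a hypothetical name) and GCC is used, `gcc -E stage.c` prints the preprocessed source, `gcc -S stage.c` stops after generating assembly in stage.s, and `gcc -c stage.c` produces the relocatable file stage.o.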

What is probably least familiar here is symbol generation, which is also our focus. C/C++ code generates symbols when it is compiled: every piece of data generates a symbol, while instructions generate symbols only for function names. Look again at the section table above. In the code above, six variables should be stored in .bss, yet the size of the .bss section is only 20 bytes (hexadecimal 14 is decimal 20), enough for just five 4-byte variables. Has one variable been lost?

In fact, when symbols are generated, every static variable produces a local symbol (visible only in the current file), every initialized non-static global variable produces a global strong symbol (visible to all files), and every uninitialized non-static global variable produces a global weak symbol. A weak symbol is provisional: another file may define a variable with the same name that produces a strong symbol, or one that produces a weak symbol but occupies more memory. In both cases the weak symbol is replaced during linking, so a weak symbol is only settled at link time. Two strong symbols with the same name are an error: within one file the compiler rejects the redefinition, and across files the linker reports a multiple definition.

In the code above, data3 produces a weak symbol, so at compile time it is not placed in the .bss section but in the COMMON block, shown as *COM* in the symbol table. (GCC 10 and later default to -fno-common, in which case data3 is placed in .bss directly.) Let's take a look at the symbol table.

We see that the uninitialized data3 is not stored in the .bss section but in the COMMON block (*COM*), because it is a weak symbol. Now we define a variable and a function in another file, and use them in main.c.
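A sketch of what the two files might contain. The names data and fun come from the text. The values, the file name other.c, and the helper `use_both` are assumptions. Both parts are shown in one listing so that it compiles as a single unit, whereas in the setup described here they live in separate files.

```c
/* ---- other.c (hypothetical file name): the definitions ---- */
int data = 5;               /* initialized global: a strong symbol */

int fun(void)
{
    return data * 2;
}

/* ---- main.c: only declarations are visible here ---- */
extern int data;            /* declaration only, no storage allocated */
int fun(void);              /* declaration only, defined elsewhere */

/* `use_both` stands in for the code in main.c that uses the two names. */
int use_both(void)
{
    return data + fun();
}
```

When the second half is compiled on its own, the compiler has nothing but the two declarations to go on, so it can only record data and fun as names to be found later.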

Let's look at the symbol table again

You can clearly see the two *UND* symbols at the bottom. This is because each file is compiled separately, so the compiler cannot see what is defined in other files. main.c contains only declarations of the variable data and the function fun(), so both are treated as undefined symbols.
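The strong and weak rules from this section can be condensed into one function. This is a sketch of the rules as stated in the text, not of any real linker's code; `struct symbol` and `pick_definition` are hypothetical names.

```c
#include <stddef.h>

enum strength { SYM_WEAK, SYM_STRONG };

struct symbol {
    const char   *name;
    enum strength strength;
    size_t        size;     /* bytes occupied by the variable */
};

/*
 * Choose between two global symbols that share a name.
 * Returns NULL when both are strong (a multiple definition error).
 */
const struct symbol *pick_definition(const struct symbol *a,
                                     const struct symbol *b)
{
    if (a->strength == SYM_STRONG && b->strength == SYM_STRONG)
        return NULL;                        /* two strong symbols: error */
    if (a->strength == SYM_STRONG)
        return a;                           /* strong replaces weak */
    if (b->strength == SYM_STRONG)
        return b;
    return (b->size > a->size) ? b : a;     /* two weak: the larger one wins */
}
```

A weak 4-byte data3 in one file therefore gives way to a strong data3, or to a weak 8-byte data3, found in another file.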

Linking process

After the compilation is completed, the linking process will follow. Let's take a look at what the linking process does.

Merge segments

In a relocatable ELF file (.o), sections are aligned to 4 bytes, but in an executable program segments are page-aligned (a page is 4K). If we loaded every section of every .o file into the executable separately at link time, a lot of space would be wasted, as the following table shows.

Therefore the linker merges sections: the .text sections of all .o files are merged together, and so are the .data sections, with their offsets adjusted accordingly. As a result, the generated executable contains only one section of each kind, as shown in the figure below. Because only the code segment (.text) and the data segment (.data and .bss) need to be loaded when the program runs, after merging the system only has to allocate two pages here, one for code and one for data, as shown in the figure.
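The saving can be checked with a little arithmetic. The sketch below assumes 4K pages as in the text; the function names and the section sizes in the usage note are made up for illustration.

```c
#include <stddef.h>

#define PAGE_BYTES 4096u   /* 4K pages, as in the text */

/* Number of whole pages needed to hold `bytes` bytes. */
size_t pages_for(size_t bytes)
{
    return (bytes + PAGE_BYTES - 1) / PAGE_BYTES;
}

/* Pages used when every input section is loaded on its own page(s). */
size_t pages_separate(const size_t *sizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += pages_for(sizes[i]);
    return total;
}

/* Pages used when the same sections are merged first. */
size_t pages_merged(const size_t *sizes, size_t count)
{
    size_t total_bytes = 0;
    for (size_t i = 0; i < count; i++)
        total_bytes += sizes[i];
    return pages_for(total_bytes);
}
```

Three .text sections of 0x100, 0x80, and 0x40 bytes would take three pages if loaded separately, but fit in a single page once merged.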

Adjust segment offsets

After merging, the segment offsets and segment lengths must be adjusted. Each process has its own virtual address space, which starts at address 0. Once the sections of the individual files have been merged, the size of each segment changes, and so does its offset relative to address 0, so both the segment offset and the segment length have to be adjusted, as shown in the figure.
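As a rough sketch of this adjustment, the input sections of one kind can be laid out one after another, each at a 4-byte aligned offset, with the running total giving the new segment length. `struct input_section`, `align_up`, and `assign_offsets` are hypothetical names, and real linkers honour each section's own alignment rather than a fixed 4.

```c
#include <stddef.h>

struct input_section {
    size_t size;     /* length of this section in its .o file */
    size_t offset;   /* filled in: offset inside the merged segment */
};

/* Round `value` up to the next multiple of `align` (a power of two). */
static size_t align_up(size_t value, size_t align)
{
    return (value + align - 1) & ~(align - 1);
}

/*
 * Place the input sections back to back, 4-byte aligned, and return
 * the length of the merged segment.
 */
size_t assign_offsets(struct input_section *sections, size_t count)
{
    size_t cursor = 0;
    for (size_t i = 0; i < count; i++) {
        cursor = align_up(cursor, 4);
        sections[i].offset = cursor;
        cursor += sections[i].size;
    }
    return cursor;
}
```

With input sizes of 0x1A, 0x10, and 0x07 bytes, the sections land at offsets 0, 0x1C, and 0x2C, and the merged segment is 0x33 bytes long.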

Merge all symbols

Each .o file gets its own symbol table when it is compiled, so the linker has to merge these tables in order to perform symbol resolution.
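A toy model of this step: every undefined entry in one table is looked up among the definitions of another. The types and function names are hypothetical, and a real symbol table carries far more information than a name and a flag.

```c
#include <stddef.h>
#include <string.h>

struct sym {
    const char *name;
    int         defined;   /* 0 corresponds to an undefined (*UND*) entry */
};

/* Look for a definition of `name` in one object's symbol table. */
static const struct sym *find_definition(const struct sym *table, size_t count,
                                         const char *name)
{
    for (size_t i = 0; i < count; i++)
        if (table[i].defined && strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/*
 * Count the undefined entries of `refs` that no entry of `defs` satisfies.
 * A result of 0 means every undefined symbol was resolved.
 */
size_t count_unresolved(const struct sym *refs, size_t nrefs,
                        const struct sym *defs, size_t ndefs)
{
    size_t unresolved = 0;
    for (size_t i = 0; i < nrefs; i++)
        if (!refs[i].defined &&
            find_definition(defs, ndefs, refs[i].name) == NULL)
            unresolved++;
    return unresolved;
}
```

If main.o leaves data and fun undefined and the other object defines both, nothing remains unresolved; if a definition is missing, the linker would report an undefined reference.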

Complete symbol relocation

Once the segments have been merged and their offsets adjusted, the virtual address of every section of the input files is known. The linker then computes the virtual address of each symbol. Because the position of each symbol within its section is fixed, the linker only needs to add the right offset to each symbol to move it to its correct virtual address. This is the symbol relocation process.

ELF files contain a structure called the relocation table, which holds the information needed for relocation. It is usually stored as one or more sections of the ELF file.
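The two steps, computing a symbol's final address and patching it into the code, can be sketched as follows. This is a deliberately simplified model of a 32-bit absolute relocation (symbol address plus addend); the struct layout and function names are assumptions and do not match the real ELF relocation entry format.

```c
#include <stdint.h>
#include <string.h>

struct relocation {
    uint32_t offset;   /* where in the section the address must be patched */
    uint32_t addend;   /* constant added to the symbol address */
};

/* Final address of a symbol: base of its merged segment plus its offset there. */
uint32_t symbol_address(uint32_t segment_base, uint32_t offset_in_segment)
{
    return segment_base + offset_in_segment;
}

/* Write the 32-bit value (symbol address + addend) into the section bytes. */
void apply_relocation(unsigned char *section, const struct relocation *rel,
                      uint32_t symbol_addr)
{
    uint32_t value = symbol_addr + rel->addend;
    memcpy(section + rel->offset, &value, sizeof value);
}
```

Before linking, the four bytes at `rel->offset` are only a placeholder; after this step they hold the address the instruction will actually use at run time.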

Origin blog.csdn.net/weiweiqiao/article/details/131318901