C/C++ program compilation and linking (1): The concept of compiling and linking


For languages such as Python and JavaScript, the compiling and linking process is completely transparent to programmers: code can be run as soon as it is written, and errors are found while it runs.

For C++, there is still a big step between writing the code and running the program: compiling and linking. These steps are not transparent to programmers, who need to control the compiling and linking process and the generation of executable files according to their requirements.

Most C++ programmers have encountered problems such as undefined symbols, duplicate symbol definitions, dynamic library dependencies, and symbol conflicts caused by dependent libraries being built with different rules. If you don't understand the basic knowledge and rules of compiling and linking, solving them is quite troublesome: you can only treat the head when the head aches and the feet when the feet hurt, without ever addressing the root cause.

This will be a series of articles, introducing the basic concepts and processes of compiling and linking, the rules of the loader, and the use of related tools.

This article introduces the basic concepts and processes of compiling and linking.

First of all, the compiler is responsible for compiling and linking: it turns a set of source files into a runnable program.

Commonly used compilers include GCC and Visual Studio, which correspond to the Linux and Windows platforms respectively (of course, GCC can also run under Windows).

The compiler generates an executable file in two steps: compilation and linking. Compilation turns source files into object files; the linker then links the object files and the libraries that the program depends on into an executable file. On Linux, the linker is the program ld.

Compile

The goal of compilation is to turn source files into object files. It consists of preprocessing, language analysis, assembly, optimization, and generation of the object file:

preprocessing

Preprocessing replaces the macros in the source file.

In C/C++ programs, macros are widely used: they can select code for different platforms through conditional compilation, and they can also simplify code.

Macros are handled in the first stage of compilation. After the preprocessor runs, the macros in the source file have been replaced with their definitions. The rules are as follows:

  • Insert the contents of each file named by an #include directive into the source file.
  • Replace each name defined by a #define directive with its value.
  • Expand each macro into code at the point where the macro is invoked.
  • Include or exclude specific sections of code based on the #if, #elif, and #endif directives.

In gcc, we can use the following command to generate the code after macro replacement:
gcc -E -P <input file> -o <output preprocessed file>.i

This command tells gcc to run only the preprocessor and not compile (-E stops after preprocessing; -P omits line markers from the output). If we encounter a complicated macro and are not sure what code it expands to, we can extract the macro into a separate file, without worrying about whether the rest of the syntax is correct, and use this command to see the exact code it expands to.

For example the following code:

#define _SUM_(a,b) (a) + (b)
void function()
{
    _SUM_(2,3);
}

gcc -E -P sum.cpp -o sum.i

After processing, it becomes:

void function()
{
    (2) + (3);
}
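
Looking at the expanded text also makes operator-precedence pitfalls visible. The sketch below is only an illustration: SUM_UNSAFE and SUM_SAFE are hypothetical macros (the first has the same shape as _SUM_ above, the second adds an outer pair of parentheses), and the two functions show how the same call expands inside a larger expression.

```cpp
// Hypothetical macros for illustration; neither name comes from the article.
#define SUM_UNSAFE(a, b) (a) + (b)      // same shape as _SUM_ above
#define SUM_SAFE(a, b)   ((a) + (b))    // whole expansion parenthesized

// Expands to (2) + (3) * 2, so the multiplication binds first.
int unsafe_times_two() { return SUM_UNSAFE(2, 3) * 2; }

// Expands to ((2) + (3)) * 2, which is what the caller most likely meant.
int safe_times_two() { return SUM_SAFE(2, 3) * 2; }
```

Running gcc -E -P on a file with these definitions shows both expansions side by side, which is often the quickest way to spot this kind of mistake.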

The most common technique in cross-platform code is to use #if and #define macros to separate platform-specific code; only the code whose macro conditions are satisfied is kept after the preprocessing stage.
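
As a minimal sketch of this kind of conditional compilation, the function below returns a different string depending on which predefined macro the compiler sets for the target. platform_name is a hypothetical helper; _WIN32, __linux__, and __APPLE__ are the macros that mainstream compilers predefine for those targets.

```cpp
#include <string>

// Hypothetical helper: only one of the return statements below survives
// preprocessing, depending on the target platform.
std::string platform_name() {
#if defined(_WIN32)
    return "windows";
#elif defined(__linux__)
    return "linux";
#elif defined(__APPLE__)
    return "macos";
#else
    return "unknown";
#endif
}
```

Preprocessing this file with gcc -E -P on Linux should leave a function body with a single return statement.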

language analysis stage

During this phase, the compiler converts the C/C++ code into a form that is easier to process (removing comments and unnecessary whitespace, extracting tokens from the text, and so on). This streamlined form of the source code is obtained through lexical analysis. The purpose of the language analysis stage as a whole is to check whether the program satisfies the grammatical rules of the programming language. The compiler reports an error or issues a warning when it finds code that violates those rules. Compilation errors interrupt the compilation process.
It can be further subdivided into three stages:

  • Lexical analysis
  • Syntax analysis
  • Semantic analysis

Compilation errors, when reported, come from this stage.
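
To make the first of these sub-stages more concrete, here is a toy sketch of lexical analysis: it splits a line of source text into identifier, number, and punctuation tokens while skipping whitespace. The Token type and the tokenize function are hypothetical, keywords are not distinguished from identifiers, and a real compiler front end is far more elaborate.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model for illustration only.
enum class TokenKind { Identifier, Number, Punct };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits the input into tokens, skipping whitespace.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens but is not a token itself
        } else if (std::isalpha(c) || c == '_') {
            // Identifier (or keyword): letter or '_' followed by letters, digits, '_'.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, src.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            // Number: a run of decimal digits.
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else {
            // Anything else is treated as a one-character punctuation token.
            tokens.push_back({TokenKind::Punct, std::string(1, src[i])});
            ++i;
        }
    }
    return tokens;
}
```

For example, tokenize("return a+b;") yields five tokens: return, a, +, b, and ;. Syntax analysis would then check that this token sequence forms a valid statement.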

assembly stage

After the source code has been verified to contain no syntax errors, the compiler enters the assembly phase. In this phase, the compiler converts the standard language constructs into the instruction set of a specific CPU. Different CPUs offer different features and usually have different instruction sets, registers, and interrupts, which is why different processors require different compiler support.

gcc can convert the source code into the corresponding assembly, saved as an ASCII text file, with the following command:

gcc -S -masm=intel function.c -o function.s

-masm=intel tells gcc to emit the assembly code in Intel syntax instead of the default AT&T syntax.

optimization stage

The optimization process begins once the initial version of the assembly code has been generated from the source file. Its goals include minimizing the program's register usage. In addition, analysis can identify and delete parts of the code that never actually need to be executed.
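
As a small illustration of code that never needs to be executed, consider the hypothetical function below. Because the condition is known to be false at compile time, an optimizing compiler is free to drop the whole branch; whether it actually does depends on the compiler and the optimization level.

```cpp
// Hypothetical function for illustration only.
int scaled(int x) {
    const bool debug = false;
    if (debug) {
        x += 1000;  // unreachable: a candidate for removal by the optimizer
    }
    return x * 2;
}
```

One way to check is to compare the assembly produced by gcc -S with and without an optimization flag such as -O2.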

Object file

The product of the compilation phase is the object file, a binary file in a standard format that the operating system can understand. Object files end with .obj under Visual Studio (Windows) and with .o under GCC.

Each source file, also known as a compilation unit, generates a corresponding object file. At this stage, assembly instructions are converted into the binary values of the corresponding machine instructions and written to specific locations in the object file.

Function and variable names in source files are called symbols in object files. In practice, we only need to know that an object file contains symbols of various types (mainly distinguished by scope); we don't need to care about most of the other details. In general, the symbols in an object file fall into three types:

  • Exported symbols, which can be used by other compilation units (object files).
  • Local symbols, which are used only by the current compilation unit (object file).
  • Undefined symbols, which are defined in other compilation units (object files), so the current object file cannot determine their addresses.
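
The three kinds of symbols can be sketched in a single source file. All function names below are hypothetical. When the compiled object file is inspected with nm, exported functions are typically marked T, local ones t, and undefined ones U, although the optimizer may inline a local function so that it does not appear at all.

```cpp
#include <cstdio>

// Local symbol: 'static' gives internal linkage, so only this file can use it.
static int helper(int x) { return x + 1; }

// Exported symbol: external linkage, so other object files may call it.
int exported_add_one(int x) { return helper(x); }

// std::puts is only declared in <cstdio>; its definition lives in the C
// library, so this object file refers to it as an undefined symbol.
void exported_greet() { std::puts("hello"); }
```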

An object file can be pictured as follows:

[Figure: Object file from the perspective of the compiler]

When we compile a program, we usually need to pay attention to its symbols, because most of the build errors we encounter are undefined symbols or symbol redefinitions. The nm command lists the symbols in an object file. Take the following function_test.cpp file as an example:

int addNum(int a,int b) {
    return a+b;
}

int subtractNum(int a,int b) {
    return a-b;
}

Run gcc -c function_test.cpp to generate function_test.o, then view the symbols it contains with the nm command:

nm function_test.o

0000000000000014 T _Z11subtractNumii
0000000000000000 T _Z6addNumii

If the source file ends in .c instead, the symbol names are as follows:

0000000000000000 T addNum
0000000000000014 T subtractNum

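This is because gcc uses different symbol naming rules for C and C++ files: C++ symbol names are mangled so that they encode the parameter types (which is what makes function overloading possible), while C symbol names are kept as written.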
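
The difference between the two naming rules can be seen inside a single C++ file. In the sketch below, the double overload of addNum and the function addNumC are hypothetical additions for illustration. The mangled names in the comments are what GCC on Linux would be expected to produce.

```cpp
// Two overloads: their symbol names differ because the parameter types
// are encoded into them (expected: _Z6addNumii and _Z6addNumdd).
int addNum(int a, int b) { return a + b; }
double addNum(double a, double b) { return a + b; }

// extern "C" asks the C++ compiler to apply the C naming rule, so this
// function keeps the plain name addNumC in the symbol table.
extern "C" int addNumC(int a, int b) { return a + b; }
```

Because C names carry no type information, two extern "C" functions cannot share a name, which is why overloading is not available to them.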

Link

After the compilation phase, the source files have become object files (independent compilation units). The link phase generates an executable file from these object files.

Functions and variables that source files use from one another have no known address during the compilation phase. The compilation phase sees only a single file at a time, so it does not know where external functions and variables live, and their addresses cannot be recorded in the object file. A complete program, however, needs these addresses, so the work of connecting them is done in the linking stage. The familiar undefined-symbol and duplicate-definition errors are produced in this stage.

The input of the linker is a set of object files. Its main job is to generate the correct addresses for the symbols through which these object files call each other, and then produce a correct executable file. The work resembles stacking building blocks, as follows:

[Figure: Stitching]

It includes the following steps:

  1. Relocation

The first stage of the linking process is simply splicing: the sections of various types scattered across separate object files are spliced into the corresponding sections of the program's memory map.

  2. Resolve references
  • Examine the sections that were spliced into the program's memory map.
  • Find out which parts of the code make external calls.
  • Calculate the exact address of each reference (its address in the memory map).
  • Finally, replace the placeholder address in the machine instruction with the actual address in the program's memory map, which completes the reference resolution.
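
The two steps can be mimicked with a toy model. This is only a sketch under heavy simplifications: ObjectFile and link_objects are hypothetical names, each object file has a single section, and real linkers handle many section types, relocation records, and libraries.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model of an object file (hypothetical, for illustration only).
struct ObjectFile {
    std::size_t size;                             // size of its code section
    std::map<std::string, std::size_t> defined;   // symbol -> offset in the section
    std::vector<std::string> referenced;          // external symbols it uses
};

// Returns the final address of every referenced symbol.
std::map<std::string, std::size_t> link_objects(const std::vector<ObjectFile>& objs) {
    // Step 1: lay the sections out one after another and turn each
    // symbol's offset into an address in the combined memory map.
    std::map<std::string, std::size_t> address;
    std::size_t base = 0;
    for (const ObjectFile& obj : objs) {
        for (const auto& def : obj.defined) {
            bool inserted = address.emplace(def.first, base + def.second).second;
            if (!inserted) {
                throw std::runtime_error("multiple definition of " + def.first);
            }
        }
        base += obj.size;
    }
    // Step 2: look up the address of every symbol that is referenced.
    std::map<std::string, std::size_t> resolved;
    for (const ObjectFile& obj : objs) {
        for (const std::string& name : obj.referenced) {
            auto it = address.find(name);
            if (it == address.end()) {
                throw std::runtime_error("undefined reference to " + name);
            }
            resolved[name] = it->second;
        }
    }
    return resolved;
}
```

Given one object that defines main and references addNum, followed by a second object that defines addNum at offset 0, the model places addNum right after the first object's section. Leaving out the second object produces the "undefined reference" error, and defining addNum twice produces the "multiple definition" error.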

A linker, unlike a compiler, doesn't care about the details of the code itself. Instead, the linker focuses on a collection of object files and works to stitch these object files into a program memory map.

The following shows object files from the perspective of the linker: based on the symbols they export (functions or variables implemented/defined in the source file) and the symbols they need (functions or variables called/used from other source files), they are spliced together, with symbol addresses confirmed, into a complete executable file.

[Figure: Object file from the linker's perspective]

epilogue

The compiler essentially translates the high-level language of the source code into a lower-level one, for example assembly code for the Intel x86 platform. Even the brief description above gives a sense of how complex a compiler is. A large C/C++ project usually has thousands of source files with complicated calling relationships between them. To produce a correct executable, the toolchain has to understand these relationships and generate the correct symbol addresses, and it has to do so quickly. Even thinking it through briefly shows how complicated this is.

Usually we think that C/C++ is complicated and relatively low-level, but behind it the compiler still does a lot of work.

So for simple and efficient languages like Python and Java, it is conceivable that the compiler systems behind them are even more powerful and take on even more work.

Nothing is truly easy and simple; it is just that someone else is carrying the burden for you.

Origin blog.csdn.net/mo4776/article/details/129235751