"Programmer's Self-Cultivation" study notes: Chapter 2, Compiling and Linking

2.1 The hidden process

In everyday application development you usually do not need to pay attention to the compiling and linking process, because an IDE generally merges compilation and linking into a single step that directly produces an executable file. This combined process is usually called building (Build).

Take the classic C "Hello World" program:

#include <stdio.h>

int main()
{
    printf("Hello World\n");
    return 0;
}

We can compile this code with GCC using one simple command. Under Linux (assuming the source file is named hello.c), the resulting program can be run directly:

$ gcc hello.c
$ ./a.out
Hello World

In fact, the process above can be decomposed into four steps: preprocessing (Preprocessing), compilation (Compilation), assembly (Assembly), and linking (Linking). The first three stages all operate on text files.

The following briefly introduces what each step does.

Preprocessing

Preprocessing mainly handles the directives in the source file that begin with "#", such as "#include" and "#define". The main processing rules are as follows:

  • Delete all "#define" directives and expand all macro definitions.
  • Process all conditional directives, such as "#if", "#ifdef", and "#endif".
  • Process "#include" directives by inserting the included file at the position of the directive. Note that this process is recursive: the included file may itself include other files.
  • Delete all comments, both "//" and "/* */".

The hello.i file produced by preprocessing contains no macro definitions, since all macros have been expanded, and the contents of the included files have also been inserted into the .i file.

Compilation

Compilation takes the preprocessed file through a series of steps (lexical analysis, syntax analysis, semantic analysis, and optimization) to produce the corresponding assembly code file. Section 2.2 below describes these steps in detail.

The output of compilation, the assembly file hello.s, is still in text form.

Assembly

The assembler's job is to turn assembly code into instructions that the machine can execute. This process is relatively simple: there is no complex syntax or semantics and no instruction optimization. The assembler just translates assembly instructions one by one into binary machine code, according to a table that maps assembly instructions to machine instructions.

This step produces the object file (Object File) hello.o, which contains binary machine code that the machine can recognize and execute.

Linking

Finally, linking produces the executable file a.out (an .exe file under Windows). But as said above, the object file output by the assembler already contains machine code that the computer can execute, so why can't the object file run directly? Why must it go through linking before it becomes an executable? Section 2.3 below covers what linking involves and why it is needed.


2.2 What the compiler does

What a compiler does is roughly what one learns in a compiler principles course. It is generally divided into six steps: lexical analysis, syntax analysis, semantic analysis, source code optimization, code generation, and target code optimization.

We use one very simple line of C code as an example to walk through the process from source code to final target code:

array[index] = (index + 4) * (2 + 6)

Lexical analysis

First, the source code is fed into the scanner (Scanner), which uses a finite state machine algorithm to split the character sequence into a series of tokens (Token). The code above contains 28 non-blank characters; after scanning, it yields 16 tokens:

Token     Type
array     Identifier
[         Left square bracket
index     Identifier
]         Right square bracket
=         Assignment
(         Left parenthesis
index     Identifier
+         Plus sign
4         Number
)         Right parenthesis
*         Multiplication sign
(         Left parenthesis
2         Number
+         Plus sign
6         Number
)         Right parenthesis

The tokens produced by lexical analysis generally fall into the following categories: keywords, identifiers, and literals (numbers, strings, etc.). While recognizing tokens, the scanner also does other work, such as storing identifiers in the symbol table and storing numeric and string constants in a literal table, in preparation for the later steps.

Syntax analysis

Next, the parser (Grammar Parser) analyzes the tokens produced by the scanner, using a context-free grammar (Context-free Grammar) to produce a syntax tree (Syntax Tree). A syntax tree is a tree whose nodes are expressions (Expression).

In C, a statement is an expression, and a complex statement is a combination of many expressions. The example statement above is a complex statement composed of assignment expressions, addition expressions, parenthesized expressions, and so on. The resulting syntax tree has the following structure:

The whole statement can be viewed as an assignment expression: its left side is an array expression and its right side is a multiplication expression. Applying this recursively, symbols and numbers are the smallest expressions, which are not composed of other expressions.

For illegal expressions, such as mismatched parentheses or an expression missing an operator, the compiler reports a syntax error during the syntax analysis phase.

Semantic analysis

Syntax analysis only analyzes the expression at the grammatical level; it does not know whether the statement is actually meaningful. For example, multiplying two pointers is syntactically legal but meaningless. The semantics a compiler can analyze are static semantics. The counterpart is dynamic semantics, which can only be determined at run time; for example, using zero as a divisor is a run-time semantic error.

Static semantics typically include declaration and type matching and type conversion. After the semantic analysis phase, every expression in the syntax tree is annotated with a type.

In addition, the semantic analyzer updates the symbol types in the symbol table.

Intermediate language generation

Modern compilers often have a source-level optimization pass. In the example above, it is easy to see that the expression (2 + 6) can be optimized away, because its value can be determined at compile time to be 8. There are many other, more complicated optimizations of this kind.

This optimization is generally not carried out directly on the syntax tree. Instead, the source-level optimizer often converts the whole syntax tree into intermediate code (Intermediate Code), which is a sequential representation of the syntax tree. Intermediate code is independent of the target machine and the runtime environment: it contains no data sizes, variable addresses, or similar information.

Translated into intermediate code (in three-address code form), the syntax tree of the example above looks like this:

t1 = 2 + 6
t2 = index + 4
t3 = t2 * t1
array[index] = t3

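Optimization is then performed on this three-address code: the optimizer computes the result of 2 + 6 to get t1 = 8, and can eliminate the temporary variable t3: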

t2 = index + 4
t2 = t2 * 8
array[index] = t2

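Target code generation and optimization

First comes target code generation. In this step the code generator converts the intermediate code into target machine code, so it depends heavily on the target machine, because different machines have different word lengths, registers, integer data types, floating-point data types, and so on. If we express the result in x86 assembly language, the code generator might produce the following code sequence: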

movl index, %ecx             # ecx = index
addl $4, %ecx                # ecx = ecx + 4
imull $8, %ecx               # ecx = ecx * 8
movl index, %eax             # eax = index
movl %ecx, array(,%eax,4)    # array[index] = ecx

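Finally, the target code optimizer optimizes the target code above, for example by choosing appropriate addressing modes, replacing multiplication with shift operations, and deleting redundant instructions.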


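2.3 Linkers are older than compilers

After lexical analysis, syntax analysis, semantic analysis, source code optimization, and target code generation and optimization, the source code above has finally been compiled into target code. But one problem remains: the addresses of index and array are not yet determined. If index and array are defined in the same compilation unit as the source code above, the compiler can allocate space for them and determine their addresses. But if they are defined in other program modules, how are their addresses determined? This is where the linker comes in.

In "ancient times" there were no high-level languages, or even assembly language; programs were written directly in machine code. One of the most primitive devices for storing programs was paper tape, with holes punched in the corresponding positions.

Suppose we have a machine-code program on such a tape, and on its target machine every instruction is one byte. There is a jump instruction whose high 4 bits are 0001, marking it as a jump, and whose low 4 bits hold the absolute address of the jump destination. In this program the first instruction is a jump to the fifth instruction (whose absolute address is 4).

Here is the problem: the program may well be modified later. If we insert new instructions between the first and the fifth instruction, the target address of the first jump instruction must be changed. If we have several tapes of programs, there may be similar jumps from one tape to another. Having to recompute every target address each time such a modification is made (this process is called relocation) is clearly intolerable.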

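Later, pioneers invented assembly language, which greatly boosted productivity in two ways:

  • Mnemonics replace machine instructions; for example, jmp stands for the jump instruction.
  • Symbols can be used to mark positions. For example, if the subroutine starting at the fifth instruction of the tape program above is named "foo", the first instruction is written in assembly as: jmp foo

Once people could use symbols to name subroutines or jump targets, it no longer mattered that instructions inserted or removed before "foo" changed its target address: every time the assembler assembles the program, it recomputes the address of the symbol "foo" and fixes up every instruction that references "foo" to the correct address.

With assembly language, productivity rose sharply, and software grew steadily larger along with it, so people began to divide code by function or nature. Once a program is split into multiple modules, how those modules are finally combined into a single program becomes a problem to solve. The problem of combining modules comes down to how modules communicate, which has two main aspects: function calls between modules and variable accesses between modules. Both can be reduced to one thing: references to symbols across modules. The process of "piecing together" the modules into an executable program and determining the final address for every symbol reference in them is one of the themes of the book: linking (Linking).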


2.4 Assembling modules: static linking

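Here is an example that illustrates the most basic process and purpose of static linking.
Suppose the program module main.c uses the function foo() from another module, fun.c. Every call to foo in main.c must know the exact address of foo, but because each module is compiled separately, the compiler does not know that address when it compiles main.c. So it leaves the target addresses of the instructions that call foo unresolved for now and waits for the linker to fix them up at link time. With a linker you can reference functions and global variables in other modules without knowing their addresses: at link time, the linker uses the referenced symbol foo to look up foo's address in the corresponding fun.c module, then fixes up every instruction in main.c that references foo so that its target is the real address of foo.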

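What the linker does is essentially the same as the "manual address adjustment" described earlier for machine-code programs when instructions are added or removed. Modern high-level languages have many features that make compilers and linkers more complex and more powerful, but in principle the linker's job is simply to fix up the references that instructions make to the addresses of other symbols. Linking mainly consists of address and storage allocation (Address and Storage Allocation), symbol resolution (Symbol Resolution), and relocation (Relocation). (Symbol resolution is roughly the process of determining the symbols of each object file and finding, in the other object files, the definitions of the symbols it references; the later chapters on linking cover it in detail.)

In the most basic static linking process, the source file of each module (such as a .c file) is compiled into an object file (Object File, usually with the extension .o or .obj), and the object files are linked together with libraries (Library) to form the final executable. The most common library is the runtime library (Runtime Library), a collection of basic functions that support running programs. A library is really just a package of object files: commonly used code compiled into object files and bundled together. Libraries are analyzed in detail later in the book.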


Origin www.cnblogs.com/geek1116/p/11946458.html