[Backend tutorial] A look into the principles of the Go compiler

The directory is as follows:

  • Meet go build

  • Compiler principles

  • Lexical analysis

  • Syntax analysis

  • Semantic analysis

  • Intermediate code generation

  • Code optimization

  • Machine code generation

  • To sum up

Meet go build

When we type `go build`, what actually happens to the source files we wrote, and how do they finally become an executable file?

This command compiles Go code. Let's take a look at Go's compilation process today!

First of all, let's get to know how Go classifies source files:

  • Command source files: in short, files containing the main function. There is usually one per project; I have never seen a project that needs two command source files

  • Test source files: the unit test code we write, in files whose names end with _test.go

  • Library source files: source files with neither of the above features; many of the third-party packages we use fall into this category

The `go build` command compiles a command source file together with the library source files it depends on. The following table summarizes some commonly used options.

| Option | Explanation |
| --- | --- |
| -a | Rebuild all command source files and library source files, even those that are already up to date |
| -n | Print all the commands involved in the compilation without executing them, which is very convenient for learning |
| -race | Enable race-condition detection; the supported platforms are limited |
| -x | Print the commands used during compilation; the difference from -n is that it not only prints them but also executes them |

Next, let's use a hello world program to demonstrate the options above.

[image: hello world program]
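For concreteness, a minimal program like the following works (the file name hello.go is my assumption):

```go
package main

import "fmt"

func main() {
	fmt.Println("hello world")
}
```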

If we run `go build -n` on the code above, we can look at the output:

[image: output of go build -n]
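As a rough, paraphrased sketch of what that output looks like (heavily abridged; the exact tool paths, flags, and work-directory layout vary by Go version and platform):

```
mkdir -p $WORK/b001/
# compile the command source file into a package archive
.../compile -o $WORK/b001/_pkg_.a -p main ... ./hello.go
# write a build ID into the archive
.../buildid -w $WORK/b001/_pkg_.a
# link the archive into an executable
.../link -o $WORK/b001/exe/a.out ... $WORK/b001/_pkg_.a
mv $WORK/b001/exe/a.out hello
```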

Let's analyze the entire execution process:

[image: breakdown of the go build execution process]

The core of the compilation is three commands: `compile`, `buildid`, and `link`, which together produce the executable `a.out`.

Then the `mv` command moves `a.out` into the current folder and renames it after the project (you can also specify a name of your own).

Later in this article, we will mainly discuss the compilation process behind these three commands: `compile`, `buildid`, and `link`.

Compiler principles

This is the source code path of the go compiler: https://github.com/golang/go/tree/master/src/cmd/compile

[image: the stages of the Go compiler, front end and back end]

As you can see in the picture above, the entire compiler can be divided into a compilation front end and a compilation back end. Now let's see what the compiler does at each stage, starting with the front end.

Lexical analysis

Lexical analysis simply translates the source code we wrote into tokens. What does that mean?

To understand how Go translates source code into tokens, let's look at how a piece of code maps one-to-one onto its tokens.

[image: source code mapped one-to-one to tokens]

I have annotated the important places in the figure, but there are still a few words to say. Looking at the code, imagine we wanted to implement this "translation" ourselves: how should the program recognize tokens?

First, let's classify Go's token types: identifiers (variable names), literals, operators, separators, and keywords. We need to split a pile of source code according to rules, which is essentially word segmentation. Looking at the example code above, we can roughly formulate rules as follows:

  1. Recognize spaces: when we hit a space, we can split off a word;

  2. When we encounter special operators such as (, ), <, or >, they count as a word on their own;

  3. When we encounter a " or the start of a numeric literal, split off a string or number literal.

From the simple analysis above, we can see that turning source code into tokens is actually not that complicated; we could write code to do it ourselves. Of course, many common lexical analyzers are implemented with regular expressions or generator tools; Go used lex early on, and only in a later version switched to a scanner implemented in Go itself.
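If you want to see this token stream for yourself without digging into the compiler, the standard library's go/scanner and go/token packages implement the same idea (a small sketch; the compiler proper uses its own internal scanner, but the output is analogous):

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte(`package main

func main() { println("hello") }
`)
	// A FileSet records position information for every token.
	fset := token.NewFileSet()
	file := fset.AddFile("hello.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

	// Scan until EOF, printing each token with its position and literal.
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
}
```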

Syntax analysis

After lexical analysis, we get a token sequence, which serves as the parser's input. The parser then produces an AST structure as output.

So-called syntax analysis converts the tokens into a grammatical structure the program can recognize, and the AST is an abstract representation of that grammar. There are two ways to construct this tree:

  1. The top-down approach constructs the root node first and then starts scanning tokens: on seeing STRING or another type it knows a type is being stated, and on seeing func it knows a function is being declared. It just keeps scanning until the end of the program.

  2. The bottom-up approach is the opposite: it constructs the subtrees first and then assembles them into a complete tree.

Go's parser constructs the AST bottom-up (this describes the original yacc-generated parser; newer releases use a hand-written recursive-descent parser). Let's look at what the tree built from Go's tokens looks like.

[image: the AST built from the tokens]

I have marked all the interesting parts in the figure. You will find that each node of the AST is associated with the physical position of a corresponding token.

After the tree is constructed, we can see that different types are represented by corresponding structures. Only lexical and grammatical errors can be caught up to this point; deeper errors cannot be resolved yet, because so far it has all been string processing.
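Again, the standard library lets us reproduce this stage: go/parser builds the same kind of tree from source, and ast.Print dumps it along with the position information mentioned above (a sketch using the exported packages rather than the compiler's internal ones):

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := `package main

func main() { println("hello") }
`
	fset := token.NewFileSet()
	// Parse the source into an *ast.File, the root of the AST.
	f, err := parser.ParseFile(fset, "hello.go", src, 0)
	if err != nil {
		panic(err)
	}
	// Dump the tree; note how every node carries a position that
	// points back at the tokens produced by lexical analysis.
	ast.Print(fset, f)
}
```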

Semantic analysis

In compiler terminology, the stage after syntax analysis is called semantic analysis; in Go this stage is called type checking. Having read Go's own documentation, there isn't much difference between the two, so I'll follow the mainstream terminology when describing this process.

So what exactly does semantic analysis (type checking) do?

After the AST is generated, semantic analysis takes it as input, and some of the related operations also rewrite this tree directly.

The first task is the type checking mentioned in Go's documentation, as well as type inference: checking whether types match and whether implicit conversion is needed (Go has no implicit conversions). As the documentation says:

The AST is then type-checked. The first steps are name resolution and type inference, which determine which object belongs to which identifier, and what type each expression has. Type-checking includes certain extra checks, such as “declared and not used” as well as determining whether or not a function terminates.

The main idea is: after the AST is generated, it is type-checked. The first steps are name resolution and type inference, which determine which object each identifier refers to and what type each expression has. Type checking also includes some extra checks, such as "declared and not used" and determining whether a function terminates.

Certain transformations are also done on the AST. Some nodes are refined based on type information, such as string additions being split from the arithmetic addition node type. Some other examples are dead code elimination, function call inlining, and escape analysis.

This paragraph says: the AST is also transformed, and some nodes are refined according to type information, such as splitting string addition off from the arithmetic addition node type. Other examples are dead code elimination, inlining of function calls, and escape analysis.

The two paragraphs above are from golang compile: https://github.com/golang/go/tree/master/src/cmd/compile
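The exported go/types package performs the same kind of checking as this stage, including the "declared and not used" check quoted above, so we can watch it in action (a sketch; the compiler itself runs its own internal type checker):

```go
package main

import (
	"fmt"
	"go/ast"
	"go/importer"
	"go/parser"
	"go/token"
	"go/types"
)

func main() {
	src := `package main

func main() { x := 1 }` // x is declared and never used

	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "bad.go", src, 0)
	if err != nil {
		panic(err)
	}

	conf := types.Config{Importer: importer.Default()}
	// Check performs name resolution, type inference, and the extra
	// checks described above on the parsed AST.
	_, err = conf.Check("main", fset, []*ast.File{f}, nil)
	fmt.Println(err) // e.g. "declared and not used: x" (wording varies by Go version)
}
```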

One more thing: we often need to disable inlining when debugging code, and that is actually an operation performed at this stage.

```
# disable optimizations and inlining when compiling
go build -gcflags '-N -l'

# -N disables compiler optimizations
# -l disables inlining; disabling inlining can also reduce the size of the executable to some extent
```
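Relatedly, you can ask the compiler to report the inlining and escape-analysis decisions it makes (the exact output wording varies by Go version):

```
# print the compiler's inlining and escape-analysis decisions
go build -gcflags '-m'
```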


Once semantic analysis passes, we know that the structure and grammar of our code are sound. So the compiler front end mainly produces a correct AST structure that the compiler back end can process.

Next, let's look at what the compiler backend has to do.

The machine can only understand and run binary, so the compiler back end's task, simply put, is to translate the AST into machine code.

Intermediate code generation

Now that we have the AST, what the machine needs to run is binary. So why not translate the AST directly into binary? Technically, there would be no problem at all.

However, we have a variety of operating systems and CPU types, each possibly with a different word size; the registers and instructions available also differ (think complex versus reduced instruction sets). Besides compatibility, we also need to replace some calls with lower-level functions. For example, when we use make to initialize a slice, it is replaced with makeslice or makeslice64 according to the types passed in. The same kind of replacement happens for panic, channel operations, and so on during intermediate code generation. This part of the replacement can be seen in the compiler source.
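As a conceptual illustration of that rewriting (the comment paraphrases what the compiler's lowering pass does; the real helper lives in the runtime package, and its exact signature differs across Go versions):

```go
package main

func main() {
	// What we write:
	s := make([]int, 0, 10)

	// Roughly what the compiler rewrites it into during lowering:
	//   s := runtime.makeslice(<element type descriptor>, 0, 10)
	_ = s
}
```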

Another value of intermediate code is that it improves reuse in back-end compilation. For example, once we have defined what the intermediate code should look like, machine code generation in the back end is relatively fixed, and each language only needs to complete its own compiler front end. This is one reason new languages can be developed so quickly now: the compilation back end is largely reusable.

For the optimization work that follows, intermediate code is also extraordinarily significant. Because there are so many platforms, having an intermediate code lets us put the common optimizations in one place.

Intermediate code also comes in a variety of formats. Go uses an intermediate representation (IR) with the SSA (static single assignment) property; the most important characteristic of this form is that every variable is defined before it is used and each variable is assigned exactly once.
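You can dump the SSA that the compiler builds for a particular function with the GOSSAFUNC environment variable, which writes an ssa.html report showing every pass (behavior and output format vary by Go version):

```
# dump every SSA pass of function main into ssa.html
GOSSAFUNC=main go build hello.go
```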

Code optimization

In Go's compilation documentation, I did not find an independent code-optimization step. However, from the analysis above, we can see that code optimization happens at every stage of the compiler; each stage does what it can.

Generally, in addition to replacing inefficient code with efficient code, there are also the following kinds of optimization:

  • Parallelism: make full use of today's multi-core processors

  • Pipelining: the CPU can sometimes begin processing instruction B while it is still processing instruction A

  • Instruction selection: the CPU needs instructions to complete an operation, and different instructions differ greatly in efficiency, so the choice of instructions is optimized here

  • Use of registers and caches: we all know the CPU reads fastest from registers and second fastest from cache, so both are exploited as fully as possible

Machine code generation

At this stage, the optimized intermediate code is first converted into assembly (Plan 9 style). Assembly language is just a textual representation of machine code; the machine cannot actually execute it. So this stage invokes the assembler, which calls the code for the architecture we set at compile time to generate the target machine code.

What is more interesting here is that Go always says its assembler is cross-platform. In fact, it contains per-architecture code for the final machine-code translation: at startup it uses the GOARCH=xxx we set to initialize its parameters, and at the end it calls architecture-specific methods to generate the machine code. This pattern of a consistent upper layer with divergent lower layers is very common and worth learning from. Let's take a brief look at the process.

First, look at the entry function cmd/compile/main.go:main():

```go
var archInits = map[string]func(*gc.Arch){
	"386":      x86.Init,
	"amd64":    amd64.Init,
	"amd64p32": amd64.Init,
	"arm":      arm.Init,
	"arm64":    arm64.Init,
	"mips":     mips.Init,
	"mipsle":   mips.Init,
	"mips64":   mips64.Init,
	"mips64le": mips64.Init,
	"ppc64":    ppc64.Init,
	"ppc64le":  ppc64.Init,
	"s390x":    s390x.Init,
	"wasm":     wasm.Init,
}

func main() {
	// Select the handler for the target architecture from the map above.
	archInit, ok := archInits[objabi.GOARCH]
	if !ok {
		// ......
	}
	// Hand the architecture-specific initializer to the compiler proper.
	gc.Main(archInit)
}
```

Then cmd/internal/obj/plist.go calls the processing methods corresponding to the architecture:

```go
func Flushplist(ctxt *Link, plist *Plist, newprog ProgAlloc, myimportpath string) {
	// ......
	for _, s := range text {
		mkfwd(s)
		linkpatch(ctxt, s, newprog)
		// The architecture-specific methods do their own machine-code translation.
		ctxt.Arch.Preprocess(ctxt, s, newprog)
		ctxt.Arch.Assemble(ctxt, s, newprog)
		linkpcln(ctxt, s)
		ctxt.populateDWARF(plist.Curfn, s, myimportpath)
	}
}
```

Having walked through the whole process, you can see there is a great deal of work on the compiler back end: translating machine code correctly requires understanding a given instruction set and CPU architecture, and correctness alone is not enough. How fast a language runs also depends to a large extent on the back end's optimizations. Especially as we enter the AI era and more and more chip manufacturers appear, I expect demand for talent in this area to keep growing.

To sum up

To summarize a few gains from studying this old body of compiler knowledge:

  1. I now know which stages the whole compilation consists of and what each stage does; some deeper implementation details of each stage I still don't know, and don't intend to dig into;

  2. Even something as complex and low-level as a compiler can be decomposed so that each stage is independent, simple, and reusable, which is meaningful to me as an application developer;

  3. Layering divides responsibilities, but some things, such as optimization, need to be done globally and are in fact done at every stage; this is also a useful reference for designing our own systems;

  4. I learned that much of what Go exposes externally is actually syntactic sugar (for example make, panic, etc.) that the compiler translates for me. At first I thought it was handled at the code level by the runtime, something like a factory pattern; looking back now, that was naive;

  5. It lays some groundwork for my next step: learning Go's runtime mechanics and Plan 9 assembly.


Origin blog.csdn.net/weixin_47143210/article/details/105612027