Since C compilers are written in C, where did the first C compiler come from?

Source: Bole Online, Author: Chaobs

First, pay tribute to Dennis Ritchie, the father of the C language!

Almost all practical compilers and interpreters (hereinafter collectively called compilers) are written in C. Some languages, such as Clojure and Jython, are implemented on top of the JVM or Java, and IronPython is implemented on .NET, but Java and C# themselves rely on C/C++ for their implementation, which amounts to calling C indirectly. So measuring the portability of a high-level language really means discussing the portability of ANSI/ISO C.

C is a very low-level language, and in many ways it is similar to assembly language. The book "Intel 32-bit Assembly Language Programming" even introduces a method for manually translating simple C code into assembly. For system software such as compilers, writing in C is the natural choice. Even high-level languages like Python still rely on C underneath (Python is used as the example because Intel hackers are trying to make Python run without an operating system; in effect, this eliminates the one-time C code on the BIOS). Nowadays, after learning the principles of compilation, anyone with a little programming ability can implement a simple C-like compiler.

But here comes the problem. Have you ever thought about it? Everyone writes compilers in C or in a language built on C, so how was the world's first C compiler written? Isn't this a "chicken and egg" question?

The prototype of the first C compiler may have been written in B, or in a mix of B and PDP assembly language.

(Image omitted. Image source: C language and programming)

Early C compilers therefore took a clever approach: first write a compiler for a subset of C in assembly language, then use that subset to recursively build up the complete C compiler. The detailed process is as follows:

First, create a subset containing only the most basic features of C, called the C0 language. C0 is simple enough that a C0 compiler can be written directly in assembly language. Next, building on what C0 provides, design C1, another subset of C that is more complex than C0 but still incomplete, where C0 is a subset of C1 and C1 is a subset of C, and use C0 to develop a C1 compiler. On the basis of C1, design yet another subset, C2, which is more complex than C1 but still not the complete C language, and use C1 to develop a C2 compiler. Continue in this way until CN, which is powerful enough to develop a complete C compiler. How large N is depends on the complexity of the target language (here, C) and on the programmer's ability. Simply put, once you reach a subset in which the full C language can be implemented comfortably using the features already available, you have found N. The following diagram illustrates this abstract process:

C language
CN language
……
C0 language
Assembly language
Machine language
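
The ladder above can be modeled in a few lines of C. This is only a rough sketch of the bookkeeping, not a compiler: the names struct rung and build_chain are made up for illustration, and the only thing the code records is which language each stage's compiler is written in.

```c
#include <assert.h>
#include <stdio.h>

#define NAME_LEN 16

/* One rung of the ladder: a compiler for `target` whose own source
 * code is written in `written_in`. Both names are illustrative. */
struct rung {
    char target[NAME_LEN];
    char written_in[NAME_LEN];
};

/* Fill `chain` for the subsets C0..Cn plus full C and return the
 * number of rungs written (n + 2). `capacity` is the size of `chain`. */
int build_chain(struct rung *chain, int capacity, int n)
{
    assert(n >= 0 && n <= 99 && capacity >= n + 2);

    /* C0 is small enough for its compiler to be written in assembly. */
    snprintf(chain[0].target, NAME_LEN, "C0");
    snprintf(chain[0].written_in, NAME_LEN, "assembly");

    /* Each later subset Ck gets a compiler written in C(k-1). */
    for (int k = 1; k <= n; k++) {
        snprintf(chain[k].target, NAME_LEN, "C%d", k);
        snprintf(chain[k].written_in, NAME_LEN, "C%d", k - 1);
    }

    /* The full C compiler is finally written in the last subset, Cn. */
    snprintf(chain[n + 1].target, NAME_LEN, "C");
    snprintf(chain[n + 1].written_in, NAME_LEN, "C%d", n);
    return n + 2;
}
```

Calling build_chain(chain, 8, 3) on an array of 8 rungs fills 5 entries, from a C0 compiler written in assembly up to a full C compiler written in C3.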

So how is this bold subset-simplification method carried out, and what is its theoretical basis?

First, a concept: "self-compilation" (Self-Compile). For certain strongly typed programming languages with obvious bootstrapping properties, a limited subset of the language can express the whole language through a finite number of recursive steps. (Strongly typed here means that every variable must be declared before it is used, as in C; some scripting languages, by contrast, have no notion of declared types at all.) Such languages include C, Pascal, and Ada. As for why they can be self-compiled, see "Compilation Principles" from Tsinghua University Press, which implements a compiler for a subset of Pascal.

In short, computer scientists have proved that, in theory, a complete C compiler can be realized through the layered approach described above, called the CVM method here. So how is the language actually simplified in practice?

Does the diagram above look familiar? You have probably seen something like it in discussions of virtual machines, but here it is a CVM (C Language Virtual Machine). Each language can be compiled independently at its own virtual layer, and, except for the C language itself, the output of each layer serves as the input of the next (the output of the last layer is the application). It works like rolling a snowball: pack a small handful of snow with your hands (assembly language) and roll it along little by little until it becomes a big snowball. This is probably what is meant by "0 begets 1, 1 begets C, and C begets everything".

The following are the keywords of C99:

auto        enum        restrict    unsigned
break       extern      return      void
case        float       short       volatile
char        for         signed      while
const       goto        sizeof      _Bool
continue    if          static      _Complex
default     inline      struct      _Imaginary
do          int         switch
double      long        typedef
else        register    union
// 37 keywords in total

Take a closer look: many of these keywords exist to help the compiler optimize, and others restrict the scope, linkage, or lifetime of variables and functions (functions have no lifetime). None of these are needed in the early stages of implementing a compiler, so we can remove auto, restrict, extern, volatile, const, sizeof, static, inline, register, and typedef. This gives a subset of C, the C3 language, whose keywords are as follows:

enum        unsigned
break       return      void
case        float       short
char        for         signed      while
goto        _Bool
continue    if          _Complex
default     struct      _Imaginary
do          int         switch
double      long
else        union
// 27 keywords in total
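
The step from C99 to C3 is plain set subtraction, which can be checked mechanically. The sketch below is only an illustration: the names C99_KEYWORDS, DROPPED_FOR_C3, contains, and derive_c3 are invented here, and the keyword data is copied from the two tables above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* The 37 C99 keywords, in the order of the table above. */
static const char *const C99_KEYWORDS[] = {
    "auto", "enum", "restrict", "unsigned", "break", "extern", "return",
    "void", "case", "float", "short", "volatile", "char", "for", "signed",
    "while", "const", "goto", "sizeof", "_Bool", "continue", "if", "static",
    "_Complex", "default", "inline", "struct", "_Imaginary", "do", "int",
    "switch", "double", "long", "typedef", "else", "register", "union"
};

/* The 10 keywords the article drops to obtain C3. */
static const char *const DROPPED_FOR_C3[] = {
    "auto", "restrict", "extern", "volatile", "const",
    "sizeof", "static", "inline", "register", "typedef"
};

/* Return 1 if `word` occurs among the first n entries of `set`. */
int contains(const char *const *set, size_t n, const char *word)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], word) == 0)
            return 1;
    return 0;
}

/* Copy every C99 keyword that is not dropped into `out`
 * and return how many keywords were kept. */
size_t derive_c3(const char **out, size_t capacity)
{
    size_t kept = 0;
    for (size_t i = 0; i < COUNT(C99_KEYWORDS); i++) {
        if (contains(DROPPED_FOR_C3, COUNT(DROPPED_FOR_C3), C99_KEYWORDS[i]))
            continue;
        assert(kept < capacity);
        out[kept++] = C99_KEYWORDS[i];
    }
    return kept;
}
```

With an output array of 37 entries, derive_c3 keeps 27 keywords, matching the count in the C3 table. The later steps to C2 and C1 could be checked the same way with different drop lists.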

On further thought, C3 contains many types and type modifiers that do not all need to be supported at once. For example, of the three integer types, implementing int alone is enough. So we further remove these keywords: unsigned, float, short, char (a char can be treated as an int), signed, _Bool, _Complex, _Imaginary, and long. This gives the C2 language, whose keywords are as follows:

enum
break       return      void
case
for         while
goto
continue    if
default     struct
do          int         switch
double
else        union
// 18 keywords in total

Thinking further, even the C2 language, with only 18 keywords, still has many advanced features, such as compound data structures built on the basic data types. In addition, operators do not appear in the keyword table at all: C's compound assignment operators, the -> operator, ++, --, and other overly flexible forms of expression can also be dropped entirely at this point. The keywords that can be removed are enum, struct, and union, which gives the keywords of the C1 language:

break       return      void
case
for         while
goto
continue    if
default
do          int         switch
double
else
// 15 keywords in total
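
Dropping ++ and the compound assignment operators costs nothing in expressive power, because each has a direct spelling in plain assignment. As a small illustration (the function names are made up for this example), the two functions below compute the same sum, the second using only = and binary +:

```c
#include <assert.h>

/* Written with the operators the article proposes deleting. */
int sum_to_sugared(int n)
{
    int sum = 0;
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}

/* The same loop using only plain assignment and binary +. */
int sum_to_plain(int n)
{
    int sum = 0;
    int i;
    for (i = 1; i <= n; i = i + 1)
        sum = sum + i;
    return sum;
}

/* Sanity check: both versions must agree for every n in [0, limit]. */
void check_equivalent(int limit)
{
    int n;
    for (n = 0; n <= limit; n = n + 1)
        assert(sum_to_sugared(n) == sum_to_plain(n));
}
```

A compiler for the smaller subset only has to handle the second form; the first can be reintroduced at a later stage as a rewrite into the second.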

We are close to done, but the last step is naturally a bigger one. At this point arrays and pointers have to go. Beyond that, C1 still has a lot of redundancy. For example, there are several ways to express loops and branches, and they can all be reduced to one. Loop statements come as while loops, do...while loops, and for loops; only the while loop needs to be kept. Branch statements come in four forms: if...{}, if...{}...else, if...{}...else if..., and switch. All of them can be built from two or more if...{} statements, so if...{} alone is enough. Thinking further still, branches and loops are just conditional jump statements, and a function call is just a stack push plus a jump, so only goto (unrestricted goto) is needed. Therefore we boldly remove all the structured keywords, and even functions, leaving the following C0 keywords:

break       void
goto
int
double
// 5 keywords in total
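
To see why jumps are enough, here is a sketch of a while loop with an if inside it, next to a hand-lowered version that uses only labels and goto. Two assumptions are worth stating: standard C has no conditional jump without if, so "if (...) goto" stands in for the conditional jump instruction here, and the code is still wrapped in ordinary functions (with illustrative names) so that it can be compiled and compared.

```c
#include <assert.h>

/* Structured version: counts the even numbers in [0, n). */
int count_even_structured(int n)
{
    int count = 0;
    int i = 0;
    while (i < n) {
        if (i % 2 == 0)
            count = count + 1;
        i = i + 1;
    }
    return count;
}

/* Lowered version: the same logic with labels and jumps only. */
int count_even_lowered(int n)
{
    int count = 0;
    int i = 0;
loop_test:
    if (!(i < n)) goto loop_end;     /* conditional jump out of the loop */
    if (!(i % 2 == 0)) goto skip;    /* conditional jump over the branch body */
    count = count + 1;
skip:
    i = i + 1;
    goto loop_test;                  /* unconditional jump back to the test */
loop_end:
    return count;
}

/* Sanity check: both versions must agree for every n in [0, limit]. */
void check_lowering(int limit)
{
    int n = 0;
    while (n <= limit) {
        assert(count_even_structured(n) == count_even_lowered(n));
        n = n + 1;
    }
}
```

This is the same rewrite a compiler performs when it turns structured control flow into branch instructions, which is why a goto-only language maps so directly onto assembly.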

This is the ultimate simplicity.

With only 5 keywords, C0 can be implemented quickly in assembly language. Working backwards in this way, we have reconstructed how the first C compiler could have been written, and come to appreciate the wisdom and hard work of the scientists who came before us. We are nothing but dust on the shoulders of giants! 0 begets 1, 1 begets C, and C begets everything. How clever!
