[Notice: All rights reserved. Reposting is welcome, but please do not use this for commercial purposes. Contact: feixiaoxing @163.com]
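When people talk about open-source compilers, gcc is usually the first one that comes to mind. But gcc's code base is now far too large to be a good subject for study. If you google for compilers whose size is suitable for learning, you are basically left with two projects: lcc and ucc. Of the two, lcc supports multiple CPUs, while ucc currently supports only x86 and has not been maintained for a long time. In terms of readability, though, I still recommend reading ucc. Its code base is modest, roughly 15,000 lines by my count, and its structure is fairly clear. A compiler can of course be generated with lex and bison, but reading a purely handwritten one like ucc is a good learning experience too.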
Note (another compact compiler, in much the same spirit as lua):
https://github.com/rswier/c4/blob/master/c4.c
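This is a simple C compiler that handles everything: lexical analysis, parsing, code generation and code execution. The key point is that the whole file is only 527 lines.
Build it as 32-bit code, i.e. gcc -m32 c4.c -o c4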
1. The code can be read or downloaded from GitHub
https://github.com/sheisc/ucc162.3/tree/master/ucc
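To build ucc, run make in the top-level directory.
Then run ./ucl --dump-ast --dump-IR hello.c, which generates hello.s, hello.ast and hello.uil.
hello.s is the assembly file, hello.ast is the generated syntax tree, and hello.uil is the generated intermediate code.
a. Suppose there is a file iterate.c. Compile it with ./ucl --dump-ast --dump-IR iterate.c: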
    int
    iterate(int data){

        if(1 == data)
            return 1;
        else
            return iterate(data-1) + data;
    }
b. The generated syntax tree is iterate.ast:
    function iterate
    {
     (if  (== 1
              data)
       (then
         (ret 1)
       end-then)
       (else
         (ret (+ (call iterate
                       (- data
                          1))
                 data))
       end-else)
     end-if)
    }
c. The intermediate code file is iterate.uil:
    function iterate
        if (1 != data) goto BB0;
        return 1;
        goto BB1;
        goto BB1;
    BB0:
        t0 = data + -1;
        t1 = iterate(t0);
        t2 = t1 + data;
        return t2;
    BB1:
        ret
d. And of course there is the final assembly file, iterate.s:
    # Code auto-generated by UCC

    .data

    .text

    .globl  iterate

    iterate:
        pushl %ebp
        pushl %ebx
        pushl %esi
        pushl %edi
        movl %esp, %ebp
        subl $12, %esp
        movl $1, %eax
        cmpl 20(%ebp), %eax
        jne .BB0
        movl $1, %eax
        jmp .BB1
        jmp .BB1
    .BB0:
        movl 20(%ebp), %eax
        addl $-1, %eax
        pushl %eax
        call iterate
        addl $4, %esp
        addl 20(%ebp), %eax
    .BB1:
        movl %ebp, %esp
        popl %edi
        popl %esi
        popl %ebx
        popl %ebp
        ret
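To see how the three-address code in iterate.uil relates to the C source, the sketch below transcribes it by hand into C, using the same temporaries (t0, t1, t2) and labels (BB0, BB1). The function name iterate_ir and the variable result are my own additions for illustration and are not produced by ucl. The duplicated goto BB1 in the listing is unreachable, so only one is kept here.

```c
#include <assert.h>

/* Hand transcription of iterate.uil; iterate_ir and result are illustrative names. */
int iterate_ir(int data)
{
    int t0, t1, t2;
    int result;

    assert(data >= 1);          /* the recursion only terminates for data >= 1 */

    if (1 != data) goto BB0;
    result = 1;                 /* return 1; */
    goto BB1;
BB0:
    t0 = data + -1;
    t1 = iterate_ir(t0);
    t2 = t1 + data;
    result = t2;                /* return t2; */
BB1:
    return result;              /* ret */
}
```

Calling iterate_ir(n) for n >= 1 gives the same value as the recursive iterate, namely 1 + 2 + ... + n.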
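2. Study ucc together with the book
There is a book, available online and on e-commerce sites, called 《C编译器剖析》 (roughly, "Dissecting a C Compiler"), written by Zou Changwei (邹昌伟). The book is about the ucc compiler. If you can get hold of it, study ucc with the book at your side.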
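3. ucc is really a toolchain
ucc itself is a collection of tools. The compiler proper, ucl, is only responsible for translating C into an assembly file. Assembling that into an object file, and linking object files into an executable, is done by as and gcc. gcc works the same way: compile something with gcc -v hello.c and it all becomes clear.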
4. The entry point of ucc
    int main(int argc, char *argv[])
    {
        int i;

        if (argc <= 1) {
            ShowHelp();
            exit(0);
        }

        Option.oftype = EXE_FILE;
        SetupToolChain();
        Command = Alloc((argc + 60) * sizeof(char *));
        Command[0] = NULL;

        i = ParseCmdLine(--argc, ++argv);
        for (; i < argc; ++i) {
            if (argv[i][0] == '-') {
                Option.linput = ListAppend(Option.linput, argv[i]);
            } else {
                AddFile(argv[i]);
            }
        }

        for (i = PP_FILE; i <= Option.oftype; ++i) {
            if (InvokeProgram(i) != 0) {
                RemoveFiles();
                fprintf(stderr, "ucc invoke command error:");
                PrintCommand();
                return -1;
            }
        }

        RemoveFiles();
        return 0;
    }
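The last loop in main is what makes ucc a driver: it runs one stage after another, starting at PP_FILE and stopping at the output type stored in Option.oftype. The sketch below imitates only that control flow. PP_FILE and EXE_FILE appear in the listing; the intermediate enumerators ASM_FILE and OBJ_FILE, the stage descriptions (including the file suffixes), and the functions invoke_stage and run_stages are assumptions made for illustration, not ucc's actual definitions.

```c
#include <assert.h>
#include <stdio.h>

/* PP_FILE and EXE_FILE come from the listing; ASM_FILE and OBJ_FILE are assumed. */
enum { PP_FILE, ASM_FILE, OBJ_FILE, EXE_FILE };

/* Assumed descriptions of what each stage produces. */
static const char *stage_name[] = {
    "preprocess the C source",
    "compile with ucl to assembly (.s)",
    "assemble to an object file",
    "link into an executable"
};

/* Hypothetical stand-in for InvokeProgram: report the stage, return 0 on success. */
static int invoke_stage(int stage)
{
    printf("stage %d: %s\n", stage, stage_name[stage]);
    return 0;
}

/* Run every stage up to and including oftype.
 * Returns the number of stages run, or -1 if a stage fails. */
int run_stages(int oftype)
{
    int i, count = 0;

    assert(oftype >= PP_FILE && oftype <= EXE_FILE);
    for (i = PP_FILE; i <= oftype; ++i) {
        if (invoke_stage(i) != 0)
            return -1;
        ++count;
    }
    return count;
}
```

With this layout, asking for an assembly file only (run_stages(ASM_FILE)) stops after two stages, while the default EXE_FILE runs all four.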
5. The entry point of ucl
    /**
     * The compiler's main entry point.
     * The compiler handles C files one by one.
     */
    int main(int argc, char *argv[])
    {
        int i;

        CurrentHeap = &ProgramHeap;
        argc--; argv++;

        i = ParseCommandLine(argc, argv);

        SetupRegisters();
        SetupLexer();
        SetupTypeSystem();

        for (; i < argc; ++i) {
            Compile(argv[i]);
        }

        return (ErrorCount != 0);
    }
6. The core logic of ucl
    static void Compile(char *file)
    {
        AstTranslationUnit transUnit;

        Initialize();

        // parse preprocessed C file, generate an abstract syntax tree
        transUnit = ParseTranslationUnit(file);

        // perform semantic check on abstract syntax tree
        CheckTranslationUnit(transUnit);

        if (ErrorCount != 0)
            goto exit;

        if (DumpAST) {
            DumpTranslationUnit(transUnit);
        }

        // translate the abstract syntax tree into intermediate code
        Translate(transUnit);

        if (DumpIR) {
            DAssemTranslationUnit(transUnit);
        }

        // emit assembly code from intermediate code
        EmitTranslationUnit(transUnit);

    exit:
        Finalize();
    }
7. Self-compilation
    C_SRC = alloc.c ast.c decl.c declchk.c dumpast.c dom.c emit.c \
            error.c expr.c exprchk.c flow.c fold.c gen.c \
            input.c lex.c output.c reg.c simp.c stmt.c \
            stmtchk.c str.c symbol.c tranexpr.c transtmt.c type.c \
            ucl.c uildasm.c vector.c x86.c x86linux.c
    OBJS = $(C_SRC:.c=.o)
    CC = gcc
    CFLAGS = -g -D_UCC
    UCC = ../driver/ucc

    all: $(OBJS) assert.o
    	$(CC) -o ucl $(CFLAGS) $(OBJS)

    clean:
    	rm -f *.o ucl

    test: $(C_SRC)
    	$(UCC) -o ucl1 $(C_SRC)
    	mv $(UCCDIR)/ucl $(UCCDIR)/ucl.bak
    	cp ucl1 $(UCCDIR)/ucl
    	$(UCC) -o ucl2 $(C_SRC)
    	mv $(UCCDIR)/ucl.bak $(UCCDIR)/ucl
    	strip ucl1 ucl2
    	cmp -l ucl1 ucl2
    	rm ucl1 ucl2
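8. Miscellaneous
ucl parses top-down. Generally speaking, a handwritten top-down parser is easier to follow than a bottom-up one, and in practice it is often at least as fast. Picture a bottom-up analysis of the source: the parser has to keep performing shift and reduce operations, and every reduction means consulting the whole grammar again. When studying a compiler, you can concentrate on the grammar (the BNF). Once you understand the grammar, you understand half of the compiler. The later stages, intermediate code generation, peephole optimization and the mapping to assembly, are not particularly difficult by comparison. And if all you want is to design an interpreter, you can actually stop at the syntax tree.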
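As one concrete picture of a peephole pass: the iterate.s listing earlier contains two consecutive jmp .BB1 instructions, and the second one can never execute. The sketch below drops any unconditional jmp that immediately follows another unconditional jmp. It is a minimal illustration of the idea, not ucc's own optimizer; the function names is_jmp and drop_dead_jumps and the representation of instructions as strings are assumptions.

```c
#include <assert.h>
#include <string.h>

/* Returns nonzero if the line is an unconditional jump ("jmp ..."). */
static int is_jmp(const char *insn)
{
    return strncmp(insn, "jmp ", 4) == 0;
}

/*
 * Compact insns[0..n-1] in place, dropping every jmp that directly
 * follows a kept jmp (such a jmp is unreachable).
 * Returns the new number of instructions.
 */
size_t drop_dead_jumps(const char *insns[], size_t n)
{
    size_t in, out = 0;

    assert(insns != NULL || n == 0);
    for (in = 0; in < n; ++in) {
        if (out > 0 && is_jmp(insns[in]) && is_jmp(insns[out - 1]))
            continue;           /* control already left at the previous jmp */
        insns[out++] = insns[in];
    }
    return out;
}
```

A label between two jumps resets the check, because the instruction after a label may be reached from elsewhere.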
P.S.
BNF of C:
The syntax of C in Backus-Naur Form

<translation-unit> ::= {<external-declaration>}*

<external-declaration> ::= <function-definition>
    | <declaration>

<function-definition> ::= {<declaration-specifier>}* <declarator> {<declaration>}* <compound-statement>

<declaration-specifier> ::= <storage-class-specifier>
    | <type-specifier>
    | <type-qualifier>

<storage-class-specifier> ::= auto
    | register
    | static
    | extern
    | typedef

<type-specifier> ::= void
    | char
    | short
    | int
    | long
    | float
    | double
    | signed
    | unsigned
    | <struct-or-union-specifier>
    | <enum-specifier>
    | <typedef-name>

<struct-or-union-specifier> ::= <struct-or-union> <identifier> { {<struct-declaration>}+ }
    | <struct-or-union> { {<struct-declaration>}+ }
    | <struct-or-union> <identifier>

<struct-or-union> ::= struct
    | union

<struct-declaration> ::= {<specifier-qualifier>}* <struct-declarator-list>

<specifier-qualifier> ::= <type-specifier>
    | <type-qualifier>

<struct-declarator-list> ::= <struct-declarator>
    | <struct-declarator-list> , <struct-declarator>

<struct-declarator> ::= <declarator>
    | <declarator> : <constant-expression>
    | : <constant-expression>

<declarator> ::= {<pointer>}? <direct-declarator>

<pointer> ::= * {<type-qualifier>}* {<pointer>}?

<type-qualifier> ::= const
    | volatile

<direct-declarator> ::= <identifier>
    | ( <declarator> )
    | <direct-declarator> [ {<constant-expression>}? ]
    | <direct-declarator> ( <parameter-type-list> )
    | <direct-declarator> ( {<identifier>}* )

<constant-expression> ::= <conditional-expression>

<conditional-expression> ::= <logical-or-expression>
    | <logical-or-expression> ? <expression> : <conditional-expression>

<logical-or-expression> ::= <logical-and-expression>
    | <logical-or-expression> || <logical-and-expression>

<logical-and-expression> ::= <inclusive-or-expression>
    | <logical-and-expression> && <inclusive-or-expression>

<inclusive-or-expression> ::= <exclusive-or-expression>
    | <inclusive-or-expression> | <exclusive-or-expression>

<exclusive-or-expression> ::= <and-expression>
    | <exclusive-or-expression> ^ <and-expression>

<and-expression> ::= <equality-expression>
    | <and-expression> & <equality-expression>

<equality-expression> ::= <relational-expression>
    | <equality-expression> == <relational-expression>
    | <equality-expression> != <relational-expression>

<relational-expression> ::= <shift-expression>
    | <relational-expression> < <shift-expression>
    | <relational-expression> > <shift-expression>
    | <relational-expression> <= <shift-expression>
    | <relational-expression> >= <shift-expression>

<shift-expression> ::= <additive-expression>
    | <shift-expression> << <additive-expression>
    | <shift-expression> >> <additive-expression>

<additive-expression> ::= <multiplicative-expression>
    | <additive-expression> + <multiplicative-expression>
    | <additive-expression> - <multiplicative-expression>

<multiplicative-expression> ::= <cast-expression>
    | <multiplicative-expression> * <cast-expression>
    | <multiplicative-expression> / <cast-expression>
    | <multiplicative-expression> % <cast-expression>

<cast-expression> ::= <unary-expression>
    | ( <type-name> ) <cast-expression>

<unary-expression> ::= <postfix-expression>
    | ++ <unary-expression>
    | -- <unary-expression>
    | <unary-operator> <cast-expression>
    | sizeof <unary-expression>
    | sizeof <type-name>

<postfix-expression> ::= <primary-expression>
    | <postfix-expression> [ <expression> ]
    | <postfix-expression> ( {<assignment-expression>}* )
    | <postfix-expression> . <identifier>
    | <postfix-expression> -> <identifier>
    | <postfix-expression> ++
    | <postfix-expression> --

<primary-expression> ::= <identifier>
    | <constant>
    | <string>
    | ( <expression> )

<constant> ::= <integer-constant>
    | <character-constant>
    | <floating-constant>
    | <enumeration-constant>

<expression> ::= <assignment-expression>
    | <expression> , <assignment-expression>

<assignment-expression> ::= <conditional-expression>
    | <unary-expression> <assignment-operator> <assignment-expression>

<assignment-operator> ::= = | *= | /= | %= | += | -= | <<= | >>= | &= | ^= | |=

<unary-operator> ::= & | * | + | - | ~ | !

<type-name> ::= {<specifier-qualifier>}+ {<abstract-declarator>}?

<parameter-type-list> ::= <parameter-list>
    | <parameter-list> , ...

<parameter-list> ::= <parameter-declaration>
    | <parameter-list> , <parameter-declaration>

<parameter-declaration> ::= {<declaration-specifier>}+ <declarator>
    | {<declaration-specifier>}+ <abstract-declarator>
    | {<declaration-specifier>}+

<abstract-declarator> ::= <pointer>
    | <pointer> <direct-abstract-declarator>
    | <direct-abstract-declarator>

<direct-abstract-declarator> ::= ( <abstract-declarator> )
    | {<direct-abstract-declarator>}? [ {<constant-expression>}? ]
    | {<direct-abstract-declarator>}? ( {<parameter-type-list>}? )

<enum-specifier> ::= enum <identifier> { <enumerator-list> }
    | enum { <enumerator-list> }
    | enum <identifier>

<enumerator-list> ::= <enumerator>
    | <enumerator-list> , <enumerator>

<enumerator> ::= <identifier>
    | <identifier> = <constant-expression>

<typedef-name> ::= <identifier>

<declaration> ::= {<declaration-specifier>}+ {<init-declarator>}* ;

<init-declarator> ::= <declarator>
    | <declarator> = <initializer>

<initializer> ::= <assignment-expression>
    | { <initializer-list> }
    | { <initializer-list> , }

<initializer-list> ::= <initializer>
    | <initializer-list> , <initializer>

<compound-statement> ::= { {<declaration>}* {<statement>}* }

<statement> ::= <labeled-statement>
    | <expression-statement>
    | <compound-statement>
    | <selection-statement>
    | <iteration-statement>
    | <jump-statement>

<labeled-statement> ::= <identifier> : <statement>
    | case <constant-expression> : <statement>
    | default : <statement>

<expression-statement> ::= {<expression>}? ;

<selection-statement> ::= if ( <expression> ) <statement>
    | if ( <expression> ) <statement> else <statement>
    | switch ( <expression> ) <statement>

<iteration-statement> ::= while ( <expression> ) <statement>
    | do <statement> while ( <expression> ) ;
    | for ( {<expression>}? ; {<expression>}? ; {<expression>}? ) <statement>

<jump-statement> ::= goto <identifier> ;
    | continue ;
    | break ;
    | return {<expression>}? ;

This grammar was adapted from Section A13 of The C Programming Language, 2nd edition, by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1988.
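The expression productions in this grammar translate almost mechanically into a handwritten top-down parser: one function per nonterminal, with the left-recursive alternatives turned into loops. The sketch below covers only a tiny subset (decimal integer constants, parentheses, and + - * /) and evaluates while parsing instead of building a tree. The function names parse_primary, parse_multiplicative, parse_additive and eval_expr are hypothetical and unrelated to ucc's own code, and errors are handled with assert purely to keep the example short.

```c
#include <assert.h>
#include <ctype.h>

static const char *cur;                 /* current position in the input */

static int parse_additive(void);

static void skip_spaces(void)
{
    while (isspace((unsigned char)*cur))
        ++cur;
}

/* primary ::= integer-constant | '(' additive ')' */
static int parse_primary(void)
{
    int value = 0;

    skip_spaces();
    if (*cur == '(') {
        ++cur;
        value = parse_additive();
        skip_spaces();
        assert(*cur == ')');            /* a real parser would report an error */
        ++cur;
        return value;
    }
    assert(isdigit((unsigned char)*cur));
    while (isdigit((unsigned char)*cur))
        value = value * 10 + (*cur++ - '0');
    return value;
}

/* multiplicative ::= primary, then any number of ('*' or '/') primary */
static int parse_multiplicative(void)
{
    int value = parse_primary();

    for (;;) {
        skip_spaces();
        if (*cur == '*') {
            ++cur;
            value *= parse_primary();
        } else if (*cur == '/') {
            int rhs;
            ++cur;
            rhs = parse_primary();
            assert(rhs != 0);
            value /= rhs;
        } else {
            return value;
        }
    }
}

/* additive ::= multiplicative, then any number of ('+' or '-') multiplicative */
static int parse_additive(void)
{
    int value = parse_multiplicative();

    for (;;) {
        skip_spaces();
        if (*cur == '+') {
            ++cur;
            value += parse_multiplicative();
        } else if (*cur == '-') {
            ++cur;
            value -= parse_multiplicative();
        } else {
            return value;
        }
    }
}

/* Parse and evaluate a complete expression string. */
int eval_expr(const char *text)
{
    int value;

    cur = text;
    value = parse_additive();
    skip_spaces();
    assert(*cur == '\0');               /* nothing may follow the expression */
    return value;
}
```

Because each loop folds its operands from left to right, eval_expr("10 - 4 - 3") groups as (10 - 4) - 3, which is the associativity the left-recursive productions describe.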