Random notes (the open-source compiler ucc)

[Notice: All rights reserved. Reposting is welcome; please do not use for commercial purposes. Contact: feixiaoxing @163.com]


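    When people think of open-source compilers, gcc is usually the first that comes to mind. But gcc's code base is now far too large to be a good learning vehicle. If you search for compilers whose size is suitable for study, you end up with essentially two projects: lcc and ucc. Of the two, lcc supports several CPUs, while ucc currently supports only x86 and has not been maintained for a long time. Even so, for readability I still recommend reading ucc. Its code base is modest, about 15000 lines by my count, and its structure is fairly clear. A compiler front end can of course be generated with tools such as lex and bison, but reading ucc's entirely hand-written code is a worthwhile learning experience in its own right.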


Note (another compact compiler, similar in spirit to lua):

https://github.com/rswier/c4/blob/master/c4.c

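This is a simple C compiler that performs lexical analysis, parsing, code generation and execution of the generated code. The striking part is that the whole file is only 527 lines long.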

Build it as 32-bit code, i.e. gcc -m32 c4.c -o c4


1. Read or download the code from github

https://github.com/sheisc/ucc162.3/tree/master/ucc

To build ucc, type make in the top-level directory.

Running ./ucl --dump-ast --dump-IR hello.c then produces hello.s, hello.ast and hello.uil.

Here hello.s is the assembly file, hello.ast is the generated syntax tree, and hello.uil is the generated intermediate code.


a. Suppose there is a file iterate.c; compile it with ./ucl --dump-ast --dump-IR iterate.c

int
iterate(int data){

	if(1 == data)
		return 1;
	else
		return iterate(data-1) + data;
}

b. The generated syntax tree, iterate.ast:

function iterate
{
  (if  (== 1
           data)
    (then
      (ret 1)
    end-then)
    (else
      (ret (+ (call iterate
                    (- data
                       1))
              data))
    end-else)
  end-if)
}

c. The intermediate code file, iterate.uil:

function iterate
	if (1 != data) goto BB0;
	return 1;
	goto BB1;
	goto BB1;
BB0:
	t0 = data + -1;
	t1 = iterate(t0);
	t2 = t1 + data;
	return t2;
BB1:
	ret
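
One way to read the intermediate code is as C restricted to one operation per statement, with explicit temporaries and gotos. The sketch below is a hand transcription of iterate.uil into such C, not something ucc produces; the name iterate_tac is made up for illustration.

```c
/* Hand transcription of iterate.uil into C: one operation per
 * statement, explicit temporaries t0..t2, and a goto instead of
 * if/else. The name iterate_tac is hypothetical. */
int iterate_tac(int data)
{
	int t0, t1, t2;

	if (1 != data) goto BB0;
	return 1;

BB0:
	t0 = data + -1;          /* data - 1 appears as data + (-1) in the dump */
	t1 = iterate_tac(t0);    /* the recursive call */
	t2 = t1 + data;
	return t2;
}
```

The two "goto BB1" lines that follow "return 1" in the dump can never execute, so they are left out here. For inputs of 1 or more the function returns the sum 1 + 2 + ... + data, the same as the original iterate.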

d. And of course the final assembly file, iterate.s:

# Code auto-generated by UCC

.data


.text

.globl  iterate

iterate:
	pushl %ebp
	pushl %ebx
	pushl %esi
	pushl %edi
	movl %esp, %ebp
	subl $12, %esp
	movl $1, %eax
	cmpl 20(%ebp), %eax
	jne .BB0
	movl $1, %eax
	jmp .BB1
	jmp .BB1
.BB0:
	movl 20(%ebp), %eax
	addl $-1, %eax
	pushl %eax
	call iterate
	addl $4, %esp
	addl 20(%ebp), %eax
.BB1:
	movl %ebp, %esp
	popl %edi
	popl %esi
	popl %ebx
	popl %ebp
	ret
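
The operand 20(%ebp) may look odd if you expect the more common 8(%ebp). In the listing the prologue pushes four registers before copying %esp into %ebp, so the return address and the first argument sit further above %ebp. A small sketch of that arithmetic, with hypothetical names:

```c
/* Offset of the n-th stack argument (n = 0 for the first) relative to
 * %ebp, for a prologue that pushes `saved_regs` 4-byte registers and
 * only then executes "movl %esp, %ebp". Names are hypothetical. */
enum { WORD_SIZE = 4 };

int arg_offset_from_ebp(int saved_regs, int n)
{
	/* the return address sits just above the saved registers */
	int return_addr = saved_regs * WORD_SIZE;

	/* the arguments start one word above the return address */
	return return_addr + WORD_SIZE + n * WORD_SIZE;
}
```

With the four pushes above, arg_offset_from_ebp(4, 0) is 20, which matches 20(%ebp). A prologue consisting only of "pushl %ebp; movl %esp, %ebp" corresponds to arg_offset_from_ebp(1, 0), that is 8.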


2. Study ucc together with a book

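    There is a book, available online and on e-commerce sites, written by Zou Changwei: 《C编译器剖析》. It is devoted to the ucc compiler. If you can get hold of it, you can study ucc alongside the book.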


3. ucc is really a toolchain

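    ucc is really a combination of tools. The compiler proper, ucl, is only responsible for compiling C into an assembly file. Assembling that file into an object file and linking object files into an executable is done by as and gcc. gcc itself works the same way: compile something with gcc -v hello.c and the sequence of tools becomes obvious.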
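
As a rough picture of the hand-off between the tools, the sketch below derives the intermediate file names for one source file. The suffixes .i, .s and .o follow the usual Unix convention and are an assumption here; the helper replace_suffix is hypothetical and not part of ucc.

```c
#include <stdio.h>
#include <string.h>

/* Copy `path` into `out`, replacing its suffix (everything from the
 * last '.') with `suffix`. Returns 0 on success, -1 if the result does
 * not fit in `outsz` bytes. Hypothetical helper, not taken from ucc. */
int replace_suffix(const char *path, const char *suffix, char *out, size_t outsz)
{
	const char *dot = strrchr(path, '.');
	size_t stem;

	/* a '.' inside a directory name is not a file suffix */
	if (dot && strchr(dot, '/'))
		dot = NULL;
	stem = dot ? (size_t)(dot - path) : strlen(path);

	if (stem + strlen(suffix) + 1 > outsz)
		return -1;
	memcpy(out, path, stem);
	strcpy(out + stem, suffix);
	return 0;
}
```

For hello.c this gives hello.i after preprocessing, hello.s after ucl, and hello.o after the assembler; the link step then combines the object files into the executable.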


4. The entry point of ucc

int main(int argc, char *argv[])
{
	int i;

	if (argc <= 1)
	{
		ShowHelp();
		exit(0);
	}

	Option.oftype = EXE_FILE;
	SetupToolChain();
	Command = Alloc((argc + 60) * sizeof(char *));
	Command[0] = NULL;

	i = ParseCmdLine(--argc, ++argv);
	for (; i < argc; ++i)
	{
		if (argv[i][0] == '-')
		{
			Option.linput = ListAppend(Option.linput, argv[i]);
		}
		else
		{
			AddFile(argv[i]);
		}
	}

	for (i = PP_FILE; i <= Option.oftype; ++i)
	{
		if (InvokeProgram(i) != 0)
		{
			RemoveFiles();
			fprintf(stderr, "ucc invoke command error:");
			PrintCommand();
			return -1;
		}
	}

	RemoveFiles();
	return 0;
}
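
The second loop is the heart of the driver: the stages are ordered, Option.oftype marks the last one to run, and the first failing stage aborts the build. A minimal sketch of that idea follows. PP_FILE and EXE_FILE appear in the code above; ASM_FILE, OBJ_FILE and the function first_failure are assumptions for illustration, and the array of exit statuses stands in for the calls to InvokeProgram.

```c
/* Ordered build stages. PP_FILE and EXE_FILE appear in the driver
 * above; ASM_FILE and OBJ_FILE are assumed names for the steps
 * in between. */
enum stage { PP_FILE, ASM_FILE, OBJ_FILE, EXE_FILE };

/* Walk the stages from PP_FILE up to and including `last`, in the same
 * way as "for (i = PP_FILE; i <= Option.oftype; ++i)". status[i] is
 * the exit status stage i would return. Returns the first failing
 * stage, or -1 if every requested stage succeeds. */
int first_failure(const int status[], enum stage last)
{
	int i;

	for (i = PP_FILE; i <= (int)last; ++i) {
		if (status[i] != 0)
			return i;   /* the real driver cleans up and returns -1 here */
	}
	return -1;
}
```

Stopping early is what options such as "compile only" amount to in this scheme: a smaller value of `last` simply means the later stages never run, so a failure there cannot be observed.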


5. The entry point of ucl

/**
 * The compiler's main entry point. 
 * The compiler handles C files one by one.
 */
int main(int argc, char *argv[])
{
	int i;

	CurrentHeap = &ProgramHeap;
	argc--; argv++;
	i = ParseCommandLine(argc, argv);

	SetupRegisters();
	SetupLexer();
	SetupTypeSystem();
	for (; i < argc; ++i)
	{
		Compile(argv[i]);
	}

	return (ErrorCount != 0);
}

6. The core logic of ucl

static void Compile(char *file)
{
	AstTranslationUnit transUnit;

	Initialize();

	// parse preprocessed C file, generate an abstract syntax tree
	transUnit = ParseTranslationUnit(file);

	// perform semantic check on abstract syntax tree
	CheckTranslationUnit(transUnit);

	if (ErrorCount != 0)
		goto exit;

	if (DumpAST)
	{
		DumpTranslationUnit(transUnit);
	}

	// translate the abstract syntax tree into intermediate code
	Translate(transUnit);

	if (DumpIR)
	{
		DAssemTranslationUnit(transUnit);
	}

	// emit assembly code from intermediate code
	EmitTranslationUnit(transUnit);

exit:
	Finalize();
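}

    I consider the code above the most important part of ucc. ParseTranslationUnit builds the syntax tree, CheckTranslationUnit performs semantic analysis, Translate generates the intermediate code, and EmitTranslationUnit maps the intermediate code to assembly. To inspect the syntax tree, turn on DumpAST; likewise, to inspect the intermediate code, turn on DumpIR. The entry file of the project is ucl.c. Lexical analysis lives in lex.c and input.c; parsing in decl.c, stmt.c and expr.c; semantic analysis in declchk.c, stmtchk.c and exprchk.c; intermediate code generation in tranexpr.c, transtmt.c, simp.c and gen.c; and assembly emission in emit.c, x86.c and x86linux.c. The remaining files are supporting code, so the layout is quite clear.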
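
To make the Translate step more concrete, here is a minimal sketch of turning an expression tree into three-address code of the kind shown in iterate.uil. It is not ucc's implementation: struct node, struct emitter and translate are hypothetical names, and the tree is far simpler than ucc's AST.

```c
#include <stdio.h>
#include <string.h>

/* A toy expression tree: a leaf holds a name or number as text,
 * an inner node holds an operator and two children. */
struct node {
	char op;                 /* 0 for a leaf, else '+', '-', '*' or '/' */
	const char *text;        /* leaf text, unused for inner nodes */
	const struct node *lhs, *rhs;
};

struct emitter {
	char buf[256];           /* generated code, always NUL-terminated */
	size_t len;              /* number of bytes used in buf */
	int next_temp;           /* index of the next temporary */
};

/* Translate `n`, appending one instruction per operator to e->buf.
 * The name holding the value (leaf text or a fresh tN) is written to
 * `name`; long leaf names are truncated in this sketch. */
void translate(const struct node *n, struct emitter *e, char name[16])
{
	char l[16], r[16];
	int w;

	if (n->op == 0) {
		snprintf(name, 16, "%s", n->text);
		return;
	}
	translate(n->lhs, e, l);     /* operands first ... */
	translate(n->rhs, e, r);
	snprintf(name, 16, "t%d", e->next_temp++);
	w = snprintf(e->buf + e->len, sizeof e->buf - e->len,
	             "%s = %s %c %s;\n", name, l, n->op, r);
	if (w > 0 && (size_t)w < sizeof e->buf - e->len)
		e->len += (size_t)w;     /* ... then the operation itself */
	else
		e->buf[e->len] = '\0';   /* buffer full: drop this instruction */
}
```

For a tree representing a + b * c, the emitter ends up holding "t0 = b * c;" followed by "t1 = a + t0;", and `name` is "t1". Each operator gets its own temporary, which is the same shape as the t0, t1, t2 sequence in the dump of iterate.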


7. Self-compilation

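    One of the more entertaining things about ucl is self-compilation. ucl itself is first built with gcc. Once the ucl binary exists, it can be used to compile its own source code and produce a new ucl. This is what is called bootstrapping. The test target of the Makefile below does this twice and compares the two results.

C_SRC       = alloc.c ast.c decl.c declchk.c dumpast.c dom.c emit.c \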
              error.c expr.c exprchk.c flow.c fold.c gen.c \
              input.c lex.c output.c reg.c simp.c stmt.c \
              stmtchk.c str.c symbol.c tranexpr.c transtmt.c type.c \
              ucl.c uildasm.c vector.c x86.c x86linux.c
OBJS        = $(C_SRC:.c=.o)
CC          = gcc
CFLAGS      = -g -D_UCC
UCC         = ../driver/ucc

all: $(OBJS) assert.o
	$(CC) -o ucl $(CFLAGS) $(OBJS)

clean:
	rm -f *.o ucl

test: $(C_SRC)
	$(UCC) -o ucl1 $(C_SRC)
	mv $(UCCDIR)/ucl $(UCCDIR)/ucl.bak
	cp ucl1 $(UCCDIR)/ucl
	$(UCC) -o ucl2 $(C_SRC)
	mv $(UCCDIR)/ucl.bak $(UCCDIR)/ucl
	strip ucl1 ucl2
	cmp -l ucl1 ucl2
	rm ucl1 ucl2
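
The point of the test target is that ucl1 (built by the gcc-compiled ucl) and ucl2 (built by ucl1) should be byte-for-byte identical once stripped. The check itself is just a file comparison; a minimal sketch of what cmp verifies, with a hypothetical function name:

```c
#include <stdio.h>

/* Return 1 if the two files have identical contents, 0 if they differ,
 * and -1 if either cannot be opened. Hypothetical helper that mirrors
 * the comparison done by "cmp ucl1 ucl2". */
int files_identical(const char *path_a, const char *path_b)
{
	FILE *a = fopen(path_a, "rb");
	FILE *b = fopen(path_b, "rb");
	int ca, cb, same = 1;

	if (!a || !b) {
		if (a) fclose(a);
		if (b) fclose(b);
		return -1;
	}
	do {
		ca = getc(a);
		cb = getc(b);
		if (ca != cb)
			same = 0;        /* different byte, or one file ended early */
	} while (same && ca != EOF);
	fclose(a);
	fclose(b);
	return same;
}
```

If the two binaries differ, the compiler built by gcc and the compiler built by ucl do not generate the same code for the same source, which points to a code generation bug somewhere in ucl.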


8. Miscellaneous

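    ucl parses top-down. For a hand-written parser I find this approach more direct than bottom-up parsing: each grammar rule becomes a function, and the code can be read alongside the grammar. A bottom-up parser, by contrast, has to keep shifting and reducing, and every reduction has to consult the productions of the grammar. When studying a compiler, it pays to concentrate on the grammar. Once you understand the grammar, you understand half of the compiler. The later stages (intermediate code generation, peephole optimization, mapping to assembly) are not that difficult by comparison. And if all you want is to design an interpreter, you can essentially stop once you have the syntax tree.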
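
To show what "each grammar rule becomes a function" looks like, here is a minimal recursive-descent sketch for a tiny subset of the expression grammar. It evaluates the expression directly instead of building a tree. It is not taken from ucc; all names are hypothetical and there is no error reporting.

```c
#include <ctype.h>

/* Recursive-descent evaluator for a tiny subset of the grammar:
 *   additive       ::= multiplicative { (+|-) multiplicative }*
 *   multiplicative ::= primary { (*|/) primary }*
 *   primary        ::= integer | ( additive )
 * The left-recursive BNF rules are rewritten as loops, the usual trick
 * in a top-down parser. */
static const char *cur;     /* current position in the input */

static long parse_additive(void);

static void skip_spaces(void)
{
	while (isspace((unsigned char)*cur))
		cur++;
}

static long parse_primary(void)
{
	long v = 0;

	skip_spaces();
	if (*cur == '(') {
		cur++;
		v = parse_additive();
		skip_spaces();
		if (*cur == ')')
			cur++;
		return v;
	}
	while (isdigit((unsigned char)*cur))
		v = v * 10 + (*cur++ - '0');
	return v;
}

static long parse_multiplicative(void)
{
	long v = parse_primary();

	for (;;) {
		skip_spaces();
		if (*cur == '*') {
			cur++;
			v *= parse_primary();
		} else if (*cur == '/') {
			long d;
			cur++;
			d = parse_primary();
			v = d ? v / d : 0;   /* sketch: division by zero yields 0 */
		} else {
			return v;
		}
	}
}

static long parse_additive(void)
{
	long v = parse_multiplicative();

	for (;;) {
		skip_spaces();
		if (*cur == '+') { cur++; v += parse_multiplicative(); }
		else if (*cur == '-') { cur++; v -= parse_multiplicative(); }
		else return v;
	}
}

/* Evaluate an expression string. */
long eval_expr(const char *s)
{
	cur = s;
	return parse_additive();
}
```

Operator precedence falls out of which function calls which: parse_additive only ever sees complete multiplicative terms, so in "1 + 2 * 3" the multiplication is performed first. The loops also give left associativity, so "10 - 4 - 3" is read as (10 - 4) - 3.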


P.S.

BNF of C:

The syntax of C in Backus-Naur Form
<translation-unit> ::= {<external-declaration>}*

<external-declaration> ::= <function-definition>
                         | <declaration>

<function-definition> ::= {<declaration-specifier>}* <declarator> {<declaration>}* <compound-statement>

<declaration-specifier> ::= <storage-class-specifier>
                          | <type-specifier>
                          | <type-qualifier>

<storage-class-specifier> ::= auto
                            | register
                            | static
                            | extern
                            | typedef

<type-specifier> ::= void
                   | char
                   | short
                   | int
                   | long
                   | float
                   | double
                   | signed
                   | unsigned
                   | <struct-or-union-specifier>
                   | <enum-specifier>
                   | <typedef-name>

<struct-or-union-specifier> ::= <struct-or-union> <identifier> { {<struct-declaration>}+ }
                              | <struct-or-union> { {<struct-declaration>}+ }
                              | <struct-or-union> <identifier>

<struct-or-union> ::= struct
                    | union

<struct-declaration> ::= {<specifier-qualifier>}* <struct-declarator-list>

<specifier-qualifier> ::= <type-specifier>
                        | <type-qualifier>

<struct-declarator-list> ::= <struct-declarator>
                           | <struct-declarator-list> , <struct-declarator>

<struct-declarator> ::= <declarator>
                      | <declarator> : <constant-expression>
                      | : <constant-expression>

<declarator> ::= {<pointer>}? <direct-declarator>

<pointer> ::= * {<type-qualifier>}* {<pointer>}?

<type-qualifier> ::= const
                   | volatile

<direct-declarator> ::= <identifier>
                      | ( <declarator> )
                      | <direct-declarator> [ {<constant-expression>}? ]
                      | <direct-declarator> ( <parameter-type-list> )
                      | <direct-declarator> ( {<identifier>}* )

<constant-expression> ::= <conditional-expression>

<conditional-expression> ::= <logical-or-expression>
                           | <logical-or-expression> ? <expression> : <conditional-expression>

<logical-or-expression> ::= <logical-and-expression>
                          | <logical-or-expression> || <logical-and-expression>

<logical-and-expression> ::= <inclusive-or-expression>
                           | <logical-and-expression> && <inclusive-or-expression>

<inclusive-or-expression> ::= <exclusive-or-expression>
                            | <inclusive-or-expression> | <exclusive-or-expression>

<exclusive-or-expression> ::= <and-expression>
                            | <exclusive-or-expression> ^ <and-expression>

<and-expression> ::= <equality-expression>
                   | <and-expression> & <equality-expression>

<equality-expression> ::= <relational-expression>
                        | <equality-expression> == <relational-expression>
                        | <equality-expression> != <relational-expression>

<relational-expression> ::= <shift-expression>
                          | <relational-expression> < <shift-expression>
                          | <relational-expression> > <shift-expression>
                          | <relational-expression> <= <shift-expression>
                          | <relational-expression> >= <shift-expression>

<shift-expression> ::= <additive-expression>
                     | <shift-expression> << <additive-expression>
                     | <shift-expression> >> <additive-expression>

<additive-expression> ::= <multiplicative-expression>
                        | <additive-expression> + <multiplicative-expression>
                        | <additive-expression> - <multiplicative-expression>

<multiplicative-expression> ::= <cast-expression>
                              | <multiplicative-expression> * <cast-expression>
                              | <multiplicative-expression> / <cast-expression>
                              | <multiplicative-expression> % <cast-expression>

<cast-expression> ::= <unary-expression>
                    | ( <type-name> ) <cast-expression>

<unary-expression> ::= <postfix-expression>
                     | ++ <unary-expression>
                     | -- <unary-expression>
                     | <unary-operator> <cast-expression>
                     | sizeof <unary-expression>
                     | sizeof ( <type-name> )

<postfix-expression> ::= <primary-expression>
                       | <postfix-expression> [ <expression> ]
                       | <postfix-expression> ( {<assignment-expression>}* )
                       | <postfix-expression> . <identifier>
                       | <postfix-expression> -> <identifier>
                       | <postfix-expression> ++
                       | <postfix-expression> --

<primary-expression> ::= <identifier>
                       | <constant>
                       | <string>
                       | ( <expression> )

<constant> ::= <integer-constant>
             | <character-constant>
             | <floating-constant>
             | <enumeration-constant>

<expression> ::= <assignment-expression>
               | <expression> , <assignment-expression>

<assignment-expression> ::= <conditional-expression>
                          | <unary-expression> <assignment-operator> <assignment-expression>

<assignment-operator> ::= =
                        | *=
                        | /=
                        | %=
                        | +=
                        | -=
                        | <<=
                        | >>=
                        | &=
                        | ^=
                        | |=

<unary-operator> ::= &
                   | *
                   | +
                   | -
                   | ~
                   | !

<type-name> ::= {<specifier-qualifier>}+ {<abstract-declarator>}?

<parameter-type-list> ::= <parameter-list>
                        | <parameter-list> , ...

<parameter-list> ::= <parameter-declaration>
                   | <parameter-list> , <parameter-declaration>

<parameter-declaration> ::= {<declaration-specifier>}+ <declarator>
                          | {<declaration-specifier>}+ <abstract-declarator>
                          | {<declaration-specifier>}+

<abstract-declarator> ::= <pointer>
                        | <pointer> <direct-abstract-declarator>
                        | <direct-abstract-declarator>

<direct-abstract-declarator> ::=  ( <abstract-declarator> )
                               | {<direct-abstract-declarator>}? [ {<constant-expression>}? ]
                               | {<direct-abstract-declarator>}? ( {<parameter-type-list>}? )

<enum-specifier> ::= enum <identifier> { <enumerator-list> }
                   | enum { <enumerator-list> }
                   | enum <identifier>

<enumerator-list> ::= <enumerator>
                    | <enumerator-list> , <enumerator>

<enumerator> ::= <identifier>
               | <identifier> = <constant-expression>

<typedef-name> ::= <identifier>

<declaration> ::=  {<declaration-specifier>}+ {<init-declarator>}* ;

<init-declarator> ::= <declarator>
                    | <declarator> = <initializer>

<initializer> ::= <assignment-expression>
                | { <initializer-list> }
                | { <initializer-list> , }

<initializer-list> ::= <initializer>
                     | <initializer-list> , <initializer>

<compound-statement> ::= { {<declaration>}* {<statement>}* }

<statement> ::= <labeled-statement>
              | <expression-statement>
              | <compound-statement>
              | <selection-statement>
              | <iteration-statement>
              | <jump-statement>

<labeled-statement> ::= <identifier> : <statement>
                      | case <constant-expression> : <statement>
                      | default : <statement>

<expression-statement> ::= {<expression>}? ;

<selection-statement> ::= if ( <expression> ) <statement>
                        | if ( <expression> ) <statement> else <statement>
                        | switch ( <expression> ) <statement>

<iteration-statement> ::= while ( <expression> ) <statement>
                        | do <statement> while ( <expression> ) ;
                        | for ( {<expression>}? ; {<expression>}? ; {<expression>}? ) <statement>

<jump-statement> ::= goto <identifier> ;
                   | continue ;
                   | break ;
                   | return {<expression>}? ;

This grammar was adapted from Section A13 of The C Programming Language, 2nd edition, by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 1988.



Reposted from blog.csdn.net/feixiaoxing/article/details/80169954