"Challenge to write a C compiler in 500 lines of Python"

Author | Theia Vogel

Translator | Ric Guan Editor | Tu Min

Listing | CSDN (ID: CSDNnews)

A few months ago, after taking on the challenge of writing a signed distance function (SDF) in 46 lines of Python, I set myself a new challenge: write a C compiler in 500 lines of Python. How hard could it be this time?

It turns out that, even after giving up a lot of functionality, it was still quite hard! But the whole process was also very interesting, and the end result is surprisingly functional and not too hard to understand!

There is too much code to cover in one article, so here I will outline the decisions I made and what I had to cut, and walk through the overall architecture of the compiler along with representative code from each part. I hope that after reading this article, my open source code will be easier for you to understand!

GitHub address: https://github.com/vgel/c500/blob/main/compiler.py

Find your position and make a decision!

The first and most critical decision was to make this a single-pass compiler: it goes through each part of each compilation unit only once, immediately converting each piece of code to its final machine code.

500 lines is too few to define and transform an abstract syntax tree! What does that mean?

Most compilers use syntax trees

The internal structure of most compilers looks like this:

The tokens are lexed, and then the parser runs over them and builds a neat little syntax tree:

The important thing here is that there are two passes: first, parsing builds the syntax tree, and then a second pass walks that tree and converts it into machine code.

This is really useful for most compilers! It separates parsing and code generation so each can be developed independently. This also means that you can transform a syntax tree before using it to generate code, for example, by applying optimizations to it. In fact, most compilers have multiple levels of "IR" (Intermediate Representation) between the syntax tree and code generation!

This is really cool, good engineering, best practices, expert recommendations and more. But... it requires too much code, so I can't do it here.

Therefore, I chose to write a single-pass compiler instead: code generation happens during parsing. We parse a little, emit some code, parse a little more, emit more code. For example, here is some actual code from the c500 compiler for parsing the prefix ~ op:

Please note that there is no syntax tree, no PrefixNegateOp node. We see some tokens and immediately spit out the corresponding instructions.
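The parse-and-emit idea can be illustrated with a toy that is much smaller than the real thing. This is a minimal sketch, not code from c500: the function name compile_prefix and the plain token list are assumptions made for illustration. It handles only integer literals and the prefix ~ operator, emitting WebAssembly text as soon as each token is seen.

```python
# Toy single-pass code generator: no syntax tree is ever built.
# All names here are illustrative and not taken from c500.

def compile_prefix(tokens: list[str], out: list[str]) -> None:
    """Consume tokens from the front of `tokens`, appending WASM text to `out`."""
    tok = tokens.pop(0)
    if tok == "~":
        # Generate code for the operand first, so its value is on the stack...
        compile_prefix(tokens, out)
        # ...then immediately emit bitwise NOT as XOR with all ones.
        out.append("i32.const 0xffffffff")
        out.append("i32.xor")
    elif tok.isdigit():
        out.append(f"i32.const {tok}")
    else:
        raise SyntaxError(f"unexpected token {tok!r}")


code: list[str] = []
compile_prefix(["~", "5"], code)
print("\n".join(code))
```

The operand's code is emitted before the operator's, which is all a stack machine needs; nothing about the expression is stored once its instructions are out.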

You may have noticed that these instructions are based on WebAssembly, which leads us to the next part...

Using WebAssembly for some reason?

I decided to make the compiler target WebAssembly. I honestly don't know why I did this; it didn't make things any easier. I guess I was just curious?

WebAssembly is a very strange target, especially for the C language. Apart from some external issues that confused me, such as realizing that WebAssembly v2 is quite different from WebAssembly v1, the instruction set itself is weird.

First, there is no goto. Instead, it has blocks (structured assembly, imagine that!) and "break" instructions that jump to the beginning or end of a block at a specific nesting level. This is mostly irrelevant for if and while, but it makes implementing for extremely cursed, which we'll discuss later.

Additionally, WebAssembly has no registers; it has a stack and is a stack machine. At first you might think this is great, right? C needs a stack! We can use the WebAssembly stack as our C stack! But no, because you can't take references to the WebAssembly stack. So we need to maintain our own stack in memory anyway, and then shuffle values between it and the WASM parameter stack.

So in the end, I think I ended up with slightly more code than I would need to target a more common ISA like x86 or ARM. But it's fun! In theory you could run code compiled with c500 in a browser, although I haven't tried that (I'm just using the wasmer CLI).

Error handling

As for error handling, there basically isn't any. There is a function die() that is called when anything weird happens and dumps a compiler stack trace; if you're lucky, you get a line number and a somewhat vague error message.

What to give up?

In the end, I had to decide what not to support, since it wasn't feasible to fit all of the C language into 500 lines. I decided I wanted a decent sample of features that would really test what the general implementation approach was capable of. For example, if I had skipped pointers, I could have just used the WASM parameter stack and avoided a lot of complexity, but that would have felt like cheating.

I ended up implementing the following functionality:

Arithmetic operations and binary operators, with appropriate precedence
int, short and char types
String constants (with escapes)
Pointers (of any number of levels), including correct pointer arithmetic (incrementing an int* adds 4)
Arrays (single-level only, not int[][])
Functions
typedefs (and the lexer hack!)

Notably, it does not support:

structs: possible with more code; the basics are there, I just couldn't squeeze it in
enum/union
Preprocessor directives (this by itself would probably be 500 lines...)
Floating point: also possible, since the wasm_type machinery is there, but again I couldn't squeeze it in
8-byte types (long/long long or double)
Other little things like inline initializers, which didn't quite fit
Any kind of standard library or I/O other than returning an integer from main()
Cast expressions

The compiler passed 34/220 test cases in c-testsuite. More importantly to me, it compiles and runs the following program successfully.

Okay, the decisions are made, let’s take a look at the code!

Helper types

The compiler uses a small set of helper types and classes. None of them are particularly weird, so I'll go through them quickly.

Emitter class

This is a singleton helper for emitting nicely formatted WebAssembly code.

WebAssembly, at least in its text format, is written as s-expressions, but individual instructions do not need to be parenthesized:

Emitter just helps emit code with good indentation so it's easier to read. It also has a no_emit method that will be used for ugly hacks later - stay tuned!
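To make the indentation idea concrete, here is a minimal sketch of such a helper. It is an assumption about the general shape, not c500's Emitter: the block() context manager is invented for this example, and the no_emit hack is deliberately left out.

```python
from contextlib import contextmanager


class Emitter:
    """Collects lines of WebAssembly text, indenting nested blocks (illustrative only)."""

    def __init__(self) -> None:
        self.lines: list[str] = []
        self.depth = 0

    def emit(self, code: str) -> None:
        # Prefix each line with two spaces per nesting level.
        self.lines.append("  " * self.depth + code)

    @contextmanager
    def block(self, start: str, end: str):
        # Hypothetical helper: emit an opener, indent the body, emit a closer.
        self.emit(start)
        self.depth += 1
        yield
        self.depth -= 1
        self.emit(end)


e = Emitter()
with e.block("(func $answer (result i32)", ")"):
    e.emit("i32.const 42")
print("\n".join(e.lines))
```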

StringPool class

The StringPool class is used to hold all string constants so that they are arranged in a contiguous memory area and the addresses are assigned for use by the code generator. When you write char *s = "abc" in c500, what really happens is:

StringPool appends a null terminator
StringPool checks if "abc" is already stored, if so, returns the address
Otherwise, StringPool adds it to a dictionary, keyed to the base address plus the total byte length stored so far, which is the address of this new string in the pool
StringPool returns that address
When all the code has finished compiling, we create a rodata section from the big concatenated string built by the StringPool, stored at the StringPool base address (retroactively making all the addresses the StringPool handed out valid)
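The steps above can be sketched in a few lines. This is an illustrative reconstruction from the description, not the c500 source: the method name add, the attribute names, and the base address 1024 are all assumptions.

```python
class StringPool:
    """Deduplicating pool of null-terminated strings (sketch; names are assumed)."""

    def __init__(self, base: int) -> None:
        self.base = base                       # address where the pool will be placed
        self.data = b""                        # all strings concatenated, in order
        self.addresses: dict[bytes, int] = {}  # string -> address handed out

    def add(self, s: bytes) -> int:
        s += b"\0"                             # append the null terminator
        if s not in self.addresses:
            # New string: it lives at base + bytes stored so far.
            self.addresses[s] = self.base + len(self.data)
            self.data += s
        return self.addresses[s]


pool = StringPool(base=1024)
first = pool.add(b"abc")
second = pool.add(b"hi")
again = pool.add(b"abc")   # already stored, so the same address comes back
print(first, second, again, pool.data)
```

Once compilation is done, pool.data is the byte string that would be written to memory at pool.base.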

Lexer class

The Lexer class is complex because lexing C is complex ((\\([\\abfnrtv'"?]|[0-7]{1,3}|x[A-Fa-f0-9]{1,2})) is a real regex used for character escapes in this code), but it's conceptually simple: the lexer keeps track of what the token at the current position is. The caller can peek at that token, or use next to tell the lexer to advance, "consuming" that token. It can also use try_next to advance only if the next token is of a certain kind. Basically, try_next is shorthand for if self.peek().kind == token: return self.next().

There is some additional complexity due to the so-called "Lexer Hack". Essentially, when parsing C, you want to know whether something is a type name or a variable name (because context is important for compiling certain expressions), but there is no syntactic distinction between them: int int_t = 0; is perfectly valid C, the same goes for typedef int int_t; int_t x = 0;.

To know whether an arbitrary token int_t is a type name or a variable name, we need to feed type information back to the lexer from the parsing/code generation phase. This is a huge pain for a regular compiler that wants to keep its lexer, parser and code generation modules pure and separate, but it's actually not that hard for us!

I'll explain it in more detail when we get to the typedef part, but basically we just keep a types: set[str] in the lexer, and when lexing, check whether a token is in that set before giving it a token kind:
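A much-reduced lexer shows how peek, next, try_next and the type-name set fit together. This is a sketch under assumptions: the token kinds ("Type", "Name", "IntConst", "EOF"), the Token class, and the tiny regex are invented for illustration, and the real lexer handles far more of C.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    kind: str
    content: str


# Optional whitespace, then a number, an identifier, or a single symbol.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))")


class Lexer:
    def __init__(self, src: str, types: set[str]) -> None:
        self.src = src
        self.pos = 0
        self.types = types  # grows as typedefs are parsed (the "lexer hack")

    def _scan(self) -> tuple[Token, int]:
        m = TOKEN_RE.match(self.src, self.pos)
        if m is None:  # nothing but whitespace is left
            return Token("EOF", ""), len(self.src)
        number, name, symbol = m.groups()
        if number is not None:
            return Token("IntConst", number), m.end()
        if name is not None:
            # Consult the type set at lex time to classify identifiers.
            kind = "Type" if name in self.types else "Name"
            return Token(kind, name), m.end()
        return Token(symbol, symbol), m.end()

    def peek(self) -> Token:
        return self._scan()[0]

    def next(self) -> Token:
        tok, self.pos = self._scan()
        return tok

    def try_next(self, kind: str) -> Optional[Token]:
        if self.peek().kind == kind:
            return self.next()
        return None


lexer = Lexer("int_t x;", types={"int"})
print(lexer.peek())        # int_t is an ordinary name so far
lexer.types.add("int_t")   # what handling a typedef would do
print(lexer.peek())        # the same token is now lexed as a type name
```

Because peek re-scans from the current position each time, adding a name to the set takes effect immediately, with no token cache to invalidate.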

CType class

This is just a data class used to represent information about a C type, as written in int **t or short t[5] or char **t[17], minus the t.

It contains:

The name of the type (with any typedefs resolved), such as int or short
The pointer level (0 = not a pointer, 1 = int *t, 2 = int **t, etc.)
The array size (None = not an array, 0 = int t[0], 1 = int t[1], etc.)

It is worth noting that, as mentioned before, this type only supports single-level arrays and not nested arrays like int t[5][6].
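As a rough illustration, the three fields map naturally onto a small dataclass. The field names and the sizeof method below are assumptions, as are the sizes (1, 2 and 4 bytes for char, short and int, and 4-byte pointers on a 32-bit WebAssembly target); this is not the c500 definition.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed sizes in bytes for the supported base types.
BASE_SIZES = {"char": 1, "short": 2, "int": 4}


@dataclass(frozen=True)
class CType:
    name: str                         # base type name, typedefs already resolved
    pointer_level: int = 0            # 0 = not a pointer, 1 = T*, 2 = T**, ...
    array_size: Optional[int] = None  # None = not an array

    def sizeof(self) -> int:
        # Any pointer is 4 bytes; otherwise use the base type's size.
        element = 4 if self.pointer_level > 0 else BASE_SIZES[self.name]
        count = 1 if self.array_size is None else self.array_size
        return element * count


print(CType("int", pointer_level=2).sizeof())                  # int **t
print(CType("short", array_size=5).sizeof())                   # short t[5]
print(CType("char", pointer_level=2, array_size=17).sizeof())  # char **t[17]
```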

FrameVar and StackFrame classes

These classes handle our C stack frames.

As I mentioned before, since you can't take references to the WASM stack, we can't use it as the C stack and have to manage the C stack manually.

To set up the C stack, the prelude emitted in __main__ sets up a global __stack_pointer variable, which each function call then decrements by the space required for the function's parameters and local variables (calculated by the function's StackFrame instance).

I'll go into more detail on how this calculation works when we get to parsing functions, but essentially, each parameter and local variable gets a slot in that stack space and increases StackFrame.frame_size (and thus the offset of the next variable) by an amount that depends on its size. The offset, type information, and other data for each parameter and local variable are stored in FrameVar instances in StackFrame.variables, in declaration order.
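The bookkeeping can be sketched as follows. The names frame_size, variables, add_var and is_parameter come from the text; everything else is an assumption, including the simplification that add_var takes a byte size rather than a full type.

```python
from dataclasses import dataclass, field


@dataclass
class FrameVar:
    name: str
    size: int                 # bytes occupied (simplification: the text stores a type)
    offset: int               # offset of this slot from the frame's base address
    is_parameter: bool = False


@dataclass
class StackFrame:
    variables: dict[str, FrameVar] = field(default_factory=dict)
    frame_size: int = 0

    def add_var(self, name: str, size: int, is_parameter: bool = False) -> FrameVar:
        # The new variable starts where the frame currently ends...
        var = FrameVar(name, size, self.frame_size, is_parameter)
        self.variables[name] = var   # dicts keep declaration order
        # ...and pushes the next variable's offset up by its own size.
        self.frame_size += size
        return var


frame = StackFrame()
frame.add_var("a", 4, is_parameter=True)
frame.add_var("b", 4, is_parameter=True)
frame.add_var("tmp", 2)
for v in frame.variables.values():
    print(v.name, v.offset)
print("frame size:", frame.frame_size)
```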

ExprMeta class

This final data class is used to track whether the result of an expression is a value or a place. We need to keep track of this distinction so that we can handle certain expressions differently depending on how they are used.

For example, if you have a variable x of type int, you can use it in two ways:

x + 1 wants to operate on the value of x (say, 1)
&x wants the address of x, such as 0xcafedead

When we parse the x expression, we can easily get the address from the stack frame:

But what now? If we i32.load this address to get the value, then &x will not be able to get the address. But if we don't load it, then x + 1 will try to increment the address by one, and the result will be 0xcafedeae instead of 2!

This is where ExprMeta comes in: we leave the address on the stack and return an ExprMeta indicating that this is a place:

Then, for operations like + that always want to operate on values rather than places, there is a function load_result that converts any place into a value:

Meanwhile, an operation like & doesn't load the result, but leaves the address on the stack: in an important sense, & is a no-op in our compiler, since it emits no code!

Also note that although the result of & is an address, it is not a place! (The code returns an ExprMeta with is_place=False.) The result of & should be treated as a value, since &x + 1 should add 1 (or rather, sizeof(x)) to the address. That's why we need the place/value distinction: "being an address" alone is not enough to know whether the result of an expression should be loaded.
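The place/value mechanism can be sketched end to end in a few functions. ExprMeta, is_place and load_result are names from the text; the functions variable and address_of, the type_name field, and the restriction to 4-byte int loads are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class ExprMeta:
    is_place: bool    # True: the WASM stack holds an address; False: a value
    type_name: str


out: list[str] = []   # emitted WASM instructions


def variable(offset: int, type_name: str) -> ExprMeta:
    # Push the variable's address (stack pointer + offset) and report a place.
    out.append("global.get $__stack_pointer")
    out.append(f"i32.const {offset}")
    out.append("i32.add")
    return ExprMeta(is_place=True, type_name=type_name)


def load_result(meta: ExprMeta) -> ExprMeta:
    # Turn a place into a value by loading from the address on the stack.
    if meta.is_place:
        out.append("i32.load")   # assumption: only 4-byte ints in this sketch
    return ExprMeta(is_place=False, type_name=meta.type_name)


def address_of(meta: ExprMeta) -> ExprMeta:
    # &x emits no code: the address is already on the stack.
    if not meta.is_place:
        raise SyntaxError("cannot take the address of a value")
    return ExprMeta(is_place=False, type_name=meta.type_name + "*")


load_result(variable(0, "int"))        # x as used in x + 1: address, then load
ptr = address_of(variable(0, "int"))   # &x: address only, no load
print(out, ptr)
```

Calling load_result on ptr afterwards emits nothing, which is exactly why &x + 1 adds to the address instead of dereferencing it.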

OK, enough about helper classes. Let's move on to the core parts of Codegen!

Parsing and code generation

The general control flow of the compiler is as follows:

__main__

This one is very short and boring. Here it is in full:

Apparently I never finished that TODO! The only really interesting thing here is the fileinput module, which you probably haven't heard of. From the module documentation,

Typical uses are:

This iterates over the lines of all files listed in sys.argv[1:], defaulting to sys.stdin if the list is empty. If the filename is "-", it will also be replaced by sys.stdin, and the optional parameters mode and openhook will be ignored. To specify a list of alternative filenames, pass it as an argument to input(). A single filename is also allowed.

This means that, technically, c500 supports multiple files! (If you don't mind them all being concatenated and the line numbers being messed up :-) fileinput is actually fairly sophisticated and has a filelineno() method; I just didn't use it for space reasons.)
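The concatenation behaviour is easy to see in isolation. The helper below is a hypothetical sketch, not the c500 entry point: read_source is an invented name, and it simply joins every line fileinput yields into one string.

```python
import fileinput
from typing import Optional, Sequence


def read_source(files: Optional[Sequence[str]] = None) -> str:
    """Join all input files into one string.

    With files=None, fileinput falls back to sys.argv[1:], or to stdin
    if no arguments were given.
    """
    with fileinput.input(files=files) as lines:
        return "".join(lines)


# Usage (reads the files named on the command line, or stdin):
#     source = read_source()
```

Since the result is a single string, any line number computed from it counts across file boundaries, which is the caveat mentioned above.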

compile()

compile() is the first interesting function here, and it's short enough to include verbatim:

This function handles emitting module-level preludes.

First, we emit a directive for the WASM VM to reserve 3 pages of memory ((memory 3)) and set the stack pointer to start at the end of this reserved region (it will grow downwards).

Then, we define two stack manipulation helpers, __dup_i32 and __swap_i32. If you've ever used Forth, these should be familiar: dup duplicates the item at the top of the WASM stack (a -- a a), and swap swaps the two items at the top of the WASM stack (a b -- b a).

Next, we initialize a stack frame to hold the global variables, initialize the lexer with the built-in type names for the lexer hack, and chew through global declarations until we run out!

Finally, we export main and dump the string pool.
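For a sense of what such a prelude might contain, here is a sketch that builds the opening lines of a module. The names __stack_pointer, __dup_i32 and __swap_i32 come from the text; the exact WebAssembly text, the function module_prelude, and the 65536-byte page arithmetic are my own reconstruction and may differ from what c500 emits.

```python
PAGE_SIZE = 65536  # bytes per WebAssembly memory page


def module_prelude(pages: int = 3) -> list[str]:
    """Return the opening lines of a module (the module is closed later)."""
    return [
        "(module",
        f"  (memory {pages})",
        # The C stack starts at the end of memory and grows downwards.
        f"  (global $__stack_pointer (mut i32) (i32.const {pages * PAGE_SIZE}))",
        # dup: (a -- a a)
        "  (func $__dup_i32 (param i32) (result i32 i32)",
        "    (local.get 0) (local.get 0))",
        # swap: (a b -- b a)
        "  (func $__swap_i32 (param i32) (param i32) (result i32 i32)",
        "    (local.get 1) (local.get 0))",
    ]


print("\n".join(module_prelude()))
```

The two helpers return two values each, so this sketch relies on multi-value results being available in the WebAssembly runtime.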

global_declaration()

This function is too long to include in full, but its signature looks like this:

It handles typedefs, global variables and functions.

Typedefs are cool because that's where the lexer hack happens!

We reuse a general type-name parsing helper here, which is convenient for us since typedef inherits all of C's weird "declaration reflects usage" rules. (Less so for the perplexed newbie!) We then notify the lexer that we have discovered a new type name, so that in the future that token will be lexed as a type name rather than a variable name.

Finally, for typedefs, we store the type in the global typedef registry, consume the trailing semicolon, and return to compile() for the next global declaration. It's important that the type we store is a fully resolved type, because if you do typedef int* int_p; and then later write int_p *x, x should get the resulting type of int** - pointer levels are cumulative! This means that we cannot just store the base C type name, but need to store the entire CType.

If the declaration is not a typedef, we parse the variable type and name. If we find a ; token, we know it's a global variable declaration (since we don't support global initializers). In this case, we add the global variable to the global stack frame and return.
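The cumulative pointer levels are the subtle part, so here is a small sketch of a registry that stores whole types. The names typedefs, lexer_types, define_typedef and resolve are hypothetical, and CType is cut down to the two fields that matter here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CType:
    name: str
    pointer_level: int = 0


typedefs: dict[str, CType] = {}                   # alias -> fully resolved type
lexer_types: set[str] = {"int", "short", "char"}  # the lexer's type-name set


def define_typedef(alias: str, target: CType) -> None:
    typedefs[alias] = target   # store the whole type, not just its base name
    lexer_types.add(alias)     # lexer hack: the alias now lexes as a type name


def resolve(name: str, pointer_level: int) -> CType:
    # Look through a typedef (if any) and add the pointer levels written here.
    base = typedefs.get(name, CType(name))
    return CType(base.name, base.pointer_level + pointer_level)


define_typedef("int_p", resolve("int", 1))   # typedef int* int_p;
x_type = resolve("int_p", 1)                 # int_p *x;
print(x_type)
```

Had the registry stored only the name "int" for int_p, x would wrongly come out as int* instead of int**.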

However, without the semicolon, we are definitely dealing with a function. To generate function code we need:

Create a new StackFrame for this function, named frame
Then, parse all parameters and store them in the frame: frame.add_var(varname.content, type, is_parameter=True)
After that, all variable declarations are parsed using variable_declaration(lexer, frame), which adds them to the frame
Now we know how big the function's stack frame needs to be (frame.frame_size), so we can start emitting the function prologue!
First, for all parameters in the stack frame (added with is_parameter=True), we generate WASM parameter declarations so that the function can be called using the WASM calling convention (passing parameters on the WASM stack):

We can then emit the result annotation for the return type and adjust the C stack pointer to make room for the function's parameters and variables:

For each parameter (in reverse order, because of the stack), copy it from the WASM stack to our stack:

Finally, we can call statement(lexer, frame) in a loop to generate code for all the statements in the function until we reach the closing brace:

Extra step: we assume that the function will always return explicitly, so we emit("unreachable") at the end so that the WASM validator doesn't complain.

It's a lot, but that's everything for functions, and therefore for global_declaration(), so let's move on to statement().
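Putting the prologue steps together, a sketch might look like the following. It is a reconstruction from the description rather than c500's code: function_prologue is a hypothetical name, parameters are passed as (name, offset) pairs, and every parameter and the return value are assumed to be i32.

```python
def function_prologue(
    name: str,
    params: list[tuple[str, int]],   # (parameter name, offset in the C frame)
    frame_size: int,
    out: list[str],
) -> None:
    # 1. WASM parameter declarations, in declaration order.
    header = f"(func ${name}"
    for pname, _ in params:
        header += f" (param ${pname} i32)"
    out.append(header)
    # 2. Result annotation for the return type.
    out.append("(result i32)")
    # 3. Move the C stack pointer down to make room for the frame.
    out += [
        "global.get $__stack_pointer",
        f"i32.const {frame_size}",
        "i32.sub",
        "global.set $__stack_pointer",
    ]
    # 4. Copy each parameter into its slot (in reverse order, as in the text).
    for pname, offset in reversed(params):
        out += [
            "global.get $__stack_pointer",
            f"i32.const {offset}",
            "i32.add",             # address of the slot
            f"local.get ${pname}",
            "i32.store",           # store the value at that address
        ]


code: list[str] = []
function_prologue("add", [("a", 0), ("b", 4)], frame_size=8, out=code)
print("\n".join(code))
```

The function body, the trailing unreachable, and the closing parenthesis would follow these lines.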

statement()

There is a lot of code in statement(). However, much of it is quite repetitive, so I'll just explain while and for, which should provide a good overview.

Remember that WASM has no jumps, but structured control flow? Now this is important.

First let's see how it works with while, which isn't too much trouble. A while loop in WASM looks like this:

As you can see, there are two types of blocks - block and loop (there is also an if block type, which I did not use). Each block contains some number of statements and ends with end. Inside a block, you can break out with br, or break conditionally with br_if based on the top of the WASM stack (there's also br_table, which I didn't use).

The br series takes a labelidx parameter, here 1 or 0, indicating to which level of block the operation applies. So, in our while loop, br_if 1 applies to the outer block - index 1, and br 0 applies to the inner block - index 0. (Indices are always relative to the related instruction - 0 is the innermost block of that instruction.)

Finally, the last rule to know is that br in a block jumps forward to the end of the block, while br in a loop jumps backward to the beginning of the loop.

I hope it will be easier to understand after reading it again now:

In a more normal assembly, this corresponds to:

But with jumps, you can express things that you can't (easily) express in WASM - for example, you can jump into the middle of a block.

(This mostly matters for compiling C's goto, which I didn't even attempt - there is an algorithm that can convert any code that uses goto into an equivalent program that uses structured control flow, but it's complicated and I don't think it would work with our single-pass approach.)

But for a while loop, this isn't too bad. All we have to do is:
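In outline, single-pass emission for while follows the block/loop shape described above. The sketch below is an approximation under assumptions: compile_while is a hypothetical name, and the condition and body callbacks stand in for the parser's expression and statement routines, which emit code as they consume tokens.

```python
from typing import Callable


def compile_while(
    emit: Callable[[str], None],
    condition: Callable[[], None],   # emits code leaving the condition on the stack
    body: Callable[[], None],        # emits code for the loop body
) -> None:
    emit("block")     # "br 1" from inside the loop jumps to the end of this block
    emit("loop")      # "br 0" from inside the loop jumps back to here
    condition()
    emit("i32.eqz")   # invert: leave the loop when the condition is false
    emit("br_if 1")
    body()
    emit("br 0")      # go around again
    emit("end")
    emit("end")


out: list[str] = []
compile_while(
    out.append,
    condition=lambda: out.append(";; code for the condition"),
    body=lambda: out.append(";; code for the body"),
)
print("\n".join(out))
```

The source order (condition, then body) matches the emission order, so no lookahead or buffering is needed.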

But with for loops, it can get annoying. Consider this for loop:

The order in which the lexer/code generator sees the parts of the for loop is:

1. i = 0
2. i < 5
3. i = i + 1
4. j = j * 2 + i

But in order to use WASM's structured control flow, the order we need to put them into the code is:

Note that 3 and 4 are reversed in the generated code, so the order is 1, 2, 4, 3. This is a problem for single-pass compilers! Unlike a normal compiler, we cannot store the advancement statement for later use. Or... can we?

The way I ended up dealing with this was to make the lexer cloneable and re-parse the advancement statement after parsing the body. Essentially, the code looks like this:

As you can see, the trick is to save the lexer and then use it to go back and handle the advancement statement later, rather than saving a syntax tree like a normal compiler would. It's not very elegant - compiling for loops is probably the grossest code in the compiler - but it works well enough!

The other parts of statement() are mostly similar, so I'll skip them to get to the last major part of the compiler - expression().
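The clone-and-re-parse trick can be shown with a toy that works on a pre-split token list. Everything here is an illustrative assumption: the Lexer class, compile_until (a stand-in for the real statement and expression parsers), and the emit_code flag used to skip over the advancement statement without producing output.

```python
import copy


class Lexer:
    def __init__(self, tokens: list[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def next(self) -> str:
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def clone(self) -> "Lexer":
        return copy.copy(self)   # shares the token list, copies the position


def expect(lexer: Lexer, tok: str) -> None:
    if lexer.next() != tok:
        raise SyntaxError(f"expected {tok!r}")


def compile_until(lexer: Lexer, stop: str, out: list[str], emit_code: bool = True) -> None:
    """Stand-in for real parsing: consume tokens up to `stop`, emit a placeholder."""
    part = []
    while (tok := lexer.next()) != stop:
        part.append(tok)
    if emit_code:
        out.append(f";; code for: {' '.join(part)}")


def compile_for(lexer: Lexer, out: list[str]) -> None:
    expect(lexer, "for")
    expect(lexer, "(")
    compile_until(lexer, ";", out)                   # 1. initializer
    out += ["block", "loop"]
    compile_until(lexer, ";", out)                   # 2. condition
    out += ["i32.eqz", "br_if 1"]
    saved = lexer.clone()                            # remember where part 3 starts
    compile_until(lexer, ")", out, emit_code=False)  # skip part 3 for now
    expect(lexer, "{")
    compile_until(lexer, "}", out)                   # 4. body
    compile_until(saved, ")", out)                   # 3. advancement, re-parsed
    out += ["br 0", "end", "end"]


tokens = "for ( i = 0 ; i < 5 ; i = i + 1 ) { j = j * 2 + i ; }".split()
out: list[str] = []
compile_for(Lexer(tokens), out)
print("\n".join(out))
```

This toy would break on an advancement statement containing parentheses; a real parser tracks nesting. The point is only that the saved lexer replaces the stored syntax tree.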

expression()

expression() is the last big method in the compiler, and as you might expect, it handles parsing expressions. It contains a number of inner functions, one for each precedence level, each of which returns the ExprMeta structure described earlier (which handles the "place versus value" distinction and can be converted to a value using load_result).

The bottom of the precedence stack is value() (the name is somewhat confusing, since it can return ExprMeta(is_place=True, ...)). It handles constants, parenthesized expressions, function calls, and variable names.

Beyond that, the basic pattern for a precedence level is a function like this:

In fact, this pattern is so consistent that most operations (including muldiv) are not written out but are instead defined by the higher-order function makeop:

Only a few operations with special behavior need to be defined explicitly, such as plusminus, which needs to handle the nuances of C pointer arithmetic.
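The higher-order pattern can be sketched with integer-only expressions. The names makeop, muldiv, plusminus and value come from the text, but the signatures here are invented, ExprMeta and load_result are left out, and because this toy has no pointers, plusminus is built with makeop too, unlike in c500.

```python
from typing import Callable

Parser = Callable[[list[str], list[str]], None]


def makeop(higher: Parser, ops: dict[str, str]) -> Parser:
    """Build the parser for one left-associative precedence level.

    higher: parser for the next-higher precedence level
    ops:    operator token -> WASM instruction
    """
    def op(tokens: list[str], out: list[str]) -> None:
        higher(tokens, out)                 # left operand
        while tokens and tokens[0] in ops:
            instr = ops[tokens.pop(0)]
            higher(tokens, out)             # right operand
            out.append(instr)               # both operands are on the stack now
    return op


def value(tokens: list[str], out: list[str]) -> None:
    # Bottom of the precedence stack: integer constants only in this toy.
    out.append(f"i32.const {tokens.pop(0)}")


muldiv = makeop(value, {"*": "i32.mul", "/": "i32.div_s"})
plusminus = makeop(muldiv, {"+": "i32.add", "-": "i32.sub"})

code: list[str] = []
plusminus("1 + 2 * 3".split(), code)
print("\n".join(code))
```

Each level only knows about the level above it, so adding a precedence level is one more makeop call chained onto the previous one.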

That's it! This is the last major part of the compiler.

Summary

This has been my journey through the challenge of writing a C compiler in 500 lines of Python! Compilers have a reputation for being complex - GCC and Clang are huge, and even TCC (the Tiny C Compiler) has tens of thousands of lines of code - but a compiler can also be quite small if you're willing to sacrifice code quality and do everything all at once!

I'd be interested to know if you've ever written your own single-pass compiler - perhaps for a custom language? I think a compiler like this could be a good stage 0 for a self-hosted language, because it's so simple.