Implementing a Python Interpreter in Python

1. Introduction

This article starts by implementing a toy interpreter to learn the basics of how an interpreter works. Next, we look at real Python bytecode through Python's dis module to better understand the internals of the Python interpreter. Finally, following Byterun (an existing Python interpreter written in Python), we implement a Python interpreter in under 500 lines.

2. Experimental description

1. Python interpreter

What exactly does "Python interpreter" mean here? Sometimes the term refers to Python's REPL (the interactive prompt on the command line), and sometimes it refers to Python as a whole: the system that compiles source code into bytecode and then executes it. The interpreter implemented in this course handles only the last step, executing bytecode, so it is effectively a Python virtual machine that runs Python bytecode.

You may wonder: isn't Python an interpreted language? Doesn't a virtual machine that runs bytecode make it look like a compiled language such as Java? In fact, this classification is not very precise. Most interpreted languages, including Python, have a compilation step. They are called interpreted because that compilation step does relatively little work compared with a compiled language.

2. A Python interpreter implemented in Python

Byterun, the prototype for this course, is a Python interpreter implemented in Python. That may sound strange, as strange as saying you gave birth to yourself, but it is not so unusual: gcc, a C compiler, was itself written in C for a long time. You could also implement a Python interpreter in another language. Apart from what it does, an interpreter is no different from an ordinary program.

Implementing a Python interpreter in Python has advantages and disadvantages. The biggest disadvantage is speed: Byterun runs a Python program much more slowly than CPython does. One advantage is that we can reuse parts of the host Python directly, such as its object system: when Byterun needs to create a class, it can simply let the underlying Python create it. The biggest advantage, of course, is that Python code is short and expressive, so about 500 lines are enough to implement a fully functional interpreter. As the saying goes, life is short, so use Python.

3. Structure of the Python Interpreter

Our Python interpreter is a virtual machine that emulates a stack machine: it does all of its work using a few stacks. The bytecode it processes is the instruction sequence stored in a code object, which is produced by lexical analysis, parsing, and compilation of the source code. Bytecode is an intermediate representation of Python code, much as assembly code is for C code.
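To make the idea of a code object concrete, here is a minimal sketch that inspects the one CPython attaches to an ordinary function. The function s is only an example, and the exact bytes in co_code differ between Python versions, so the sketch checks only their type.

```python
# Inspect the code object that CPython builds when it compiles a function.
def s():
    a = 1
    b = 2
    return a + b

code = s.__code__                # the code object attached to the function
bytecode = code.co_code          # the raw bytecode, stored as a bytes object
local_names = code.co_varnames   # names of the function's local variables

print(type(bytecode).__name__)   # bytes
print(local_names)               # ('a', 'b')
```

The toy interpreter in the next section uses a dictionary with a similar shape: a list of instructions plus the lists of constants and names that those instructions refer to.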

3. Hello, interpreter

Let's start with the "hello world" of interpreters: the simplest possible interpreter, which only implements addition. It recognizes just three instructions, so the only programs it can run are combinations of those three. That sounds rather shabby now, but it will not once you finish this course.

The three most basic instructions for our first interpreter:

LOAD_VALUE
ADD_TWO_VALUES
PRINT_ANSWER

Since we only care about the part that runs the bytecode, we will not worry about how source code gets compiled into some combination of these three instructions. We just need to run the compiled instructions one by one. Conversely, if you invent a new language and write a compiler for it that emits this bytecode, you can run it on our interpreter.

Taking 7 + 5 as the source code, compilation produces the following instructions:

what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),  # the first number
                     ("LOAD_VALUE", 1),  # the second number
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [7, 5] }

Here what_to_execute plays the role of a code object, and instructions plays the role of the bytecode.

Our interpreter is a stack machine, so the addition is done on the stack. The first LOAD_VALUE instruction pushes the first number onto the stack, and the second LOAD_VALUE pushes the second number. The third instruction, ADD_TWO_VALUES, pops two numbers from the stack, adds them, and pushes the result. The last instruction, PRINT_ANSWER, pops the answer from the stack and prints it. The stack contents therefore go from [7] to [7, 5] to [12], and the stack is empty again at the end.

The LOAD_VALUE instruction needs to find the data named by its argument and push it onto the stack, so where does that data come from? Notice that what_to_execute has two parts: the instructions themselves and a list of constants. The data comes from the list of constants, and the argument of LOAD_VALUE is an index into that list.

With this in mind, we can write our interpreter. We use a list to represent the stack, and for each instruction we write a method with the same name that simulates its effect.
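The walkthrough above can be checked with a few lines of Python before any class is written. This is only an illustrative sketch: the snapshots list is a hypothetical helper added here to record the stack after each instruction, and it is not part of the interpreter.

```python
# Trace how the stack changes while the 7 + 5 program runs.
what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),
                     ("LOAD_VALUE", 1),
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [7, 5]}

stack = []
snapshots = []  # copy of the stack after each instruction (illustration only)

for instruction, argument in what_to_execute["instructions"]:
    if instruction == "LOAD_VALUE":
        # The argument is an index into the constants list, not the value itself.
        stack.append(what_to_execute["numbers"][argument])
    elif instruction == "ADD_TWO_VALUES":
        stack.append(stack.pop() + stack.pop())
    elif instruction == "PRINT_ANSWER":
        print(stack.pop())
    snapshots.append(list(stack))

print(snapshots)  # [[7], [7, 5], [12], []]
```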

class Interpreter:
    def __init__(self):
        self.stack = []

    def LOAD_VALUE(self, number):
        self.stack.append(number)

    def PRINT_ANSWER(self):
        answer = self.stack.pop()
        print(answer)

    def ADD_TWO_VALUES(self):
        first_num = self.stack.pop()
        second_num = self.stack.pop()
        total = first_num + second_num
        self.stack.append(total)

Next, add a run_code method to the Interpreter class that takes the instructions as input and executes them one by one:

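def run_code(self, what_to_execute):
    # the list of instructions
    instructions = what_to_execute["instructions"]
    # the list of constants
    numbers = what_to_execute["numbers"]
    # walk through the instructions and execute them one by one
    for each_step in instructions:
        # unpack the instruction and its argument
        instruction, argument = each_step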
        if instruction == "LOAD_VALUE":
            number = numbers[argument]
            self.LOAD_VALUE(number)
        elif instruction == "ADD_TWO_VALUES":
            self.ADD_TWO_VALUES()
        elif instruction == "PRINT_ANSWER":
            self.PRINT_ANSWER()

Let's test it:

interpreter = Interpreter()
interpreter.run_code(what_to_execute)

Output:

12

Although our interpreter is still very limited, the way it executes instructions is similar to what real Python does. There are a few things to notice in the code:

The argument of the LOAD_VALUE method is the constant that has already been looked up, not the argument of the instruction.
ADD_TWO_VALUES takes no arguments. The numbers it adds are popped directly from the stack, which is characteristic of stack-based interpreters.

With the existing instructions we can add three numbers, or even more:
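Because operands come off the stack rather than from arguments, the order of the two pops matters as soon as an operation is not commutative. The sketch below uses a hypothetical SUBTRACT_TWO_VALUES instruction and a hypothetical StackDemo class, neither of which belongs to the interpreter in this lesson, purely to illustrate that point.

```python
# Hypothetical instruction, not part of the interpreter above: subtraction
# shows why the order of the two pops matters on a stack machine.
class StackDemo:
    def __init__(self):
        self.stack = []

    def LOAD_VALUE(self, number):
        self.stack.append(number)

    def SUBTRACT_TWO_VALUES(self):
        right = self.stack.pop()  # pushed last, so it is the right operand
        left = self.stack.pop()   # pushed first, so it is the left operand
        self.stack.append(left - right)

demo = StackDemo()
demo.LOAD_VALUE(7)
demo.LOAD_VALUE(5)
demo.SUBTRACT_TWO_VALUES()
print(demo.stack)  # [2]
```

For ADD_TWO_VALUES the pop order makes no difference to the result, which is why the interpreter above does not need to care about it.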

what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),
                     ("LOAD_VALUE", 1),
                     ("ADD_TWO_VALUES", None),
                     ("LOAD_VALUE", 2),
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [7, 5, 8] }

Output:

20

Variables

Next we want to add the concept of variables to our interpreter, so we need two new instructions:

STORE_NAME: stores a value in a variable by popping the top of the stack and saving it under the variable's name.
LOAD_NAME: reads a variable by pushing its value onto the stack.

We also need to add a list of variable names.
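For readers who want to run the three-number example end to end, the pieces shown so far can be assembled into a single script. This is a condensed sketch of the same class and run_code method, with run_code placed inside the class body.

```python
class Interpreter:
    def __init__(self):
        self.stack = []

    def LOAD_VALUE(self, number):
        self.stack.append(number)

    def PRINT_ANSWER(self):
        print(self.stack.pop())

    def ADD_TWO_VALUES(self):
        self.stack.append(self.stack.pop() + self.stack.pop())

    def run_code(self, what_to_execute):
        numbers = what_to_execute["numbers"]
        for instruction, argument in what_to_execute["instructions"]:
            if instruction == "LOAD_VALUE":
                # Look up the constant before calling the method.
                self.LOAD_VALUE(numbers[argument])
            elif instruction == "ADD_TWO_VALUES":
                self.ADD_TWO_VALUES()
            elif instruction == "PRINT_ANSWER":
                self.PRINT_ANSWER()

what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),
                     ("LOAD_VALUE", 1),
                     ("ADD_TWO_VALUES", None),
                     ("LOAD_VALUE", 2),
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [7, 5, 8]}

interpreter = Interpreter()
interpreter.run_code(what_to_execute)  # prints 20
```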

Here is the set of instructions we need to run:

# Source code
def s():
    a = 1
    b = 2
    print(a + b)

# Compiled bytecode
what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),
                     ("STORE_NAME", 0),
                     ("LOAD_VALUE", 1),
                     ("STORE_NAME", 1),
                     ("LOAD_NAME", 0),
                     ("LOAD_NAME", 1),
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [1, 2],
    "names":   ["a", "b"] }

Because we are not considering namespaces and scopes here, the interpreter can simply store the mapping from variable names to values in a dictionary held as an attribute of the interpreter object. Instruction arguments are still indexes, but now they may refer to either of two lists: the constants list or the variable-name list. Which list to use depends on the instruction, so we need a method that resolves instruction arguments.

The code implementation of the interpreter with variables is as follows:

class Interpreter:
    def __init__(self):
        self.stack = []
        # dictionary that maps variable names to their values
        self.environment = {}

    def STORE_NAME(self, name):
        val = self.stack.pop()
        self.environment[name] = val

    def LOAD_NAME(self, name):
        val = self.environment[name]
        self.stack.append(val)

    def LOAD_VALUE(self, number):
        self.stack.append(number)

    def PRINT_ANSWER(self):
        answer = self.stack.pop()
        print(answer)

    def ADD_TWO_VALUES(self):
        first_num = self.stack.pop()
        second_num = self.stack.pop()
        total = first_num + second_num
        self.stack.append(total)

    def parse_argument(self, instruction, argument, what_to_execute):
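        # Resolve the instruction's argument.
        # Instructions whose argument indexes the constants list:
        numbers = ["LOAD_VALUE"]
        # Instructions whose argument indexes the variable-name list:
        names = ["LOAD_NAME", "STORE_NAME"]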

        if instruction in numbers:
            argument = what_to_execute["numbers"][argument]
        elif instruction in names:
            argument = what_to_execute["names"][argument]

        return argument

    def run_code(self, what_to_execute):
        instructions = what_to_execute["instructions"]
        for each_step in instructions:
            instruction, argument = each_step
            argument = self.parse_argument(instruction, argument, what_to_execute)

            if instruction == "LOAD_VALUE":
                self.LOAD_VALUE(argument)
            elif instruction == "ADD_TWO_VALUES":
                self.ADD_TWO_VALUES()
            elif instruction == "PRINT_ANSWER":
                self.PRINT_ANSWER()
            elif instruction == "STORE_NAME":
                self.STORE_NAME(argument)
            elif instruction == "LOAD_NAME":
                self.LOAD_NAME(argument)

Running this interpreter on the what_to_execute defined above prints:

3

You have probably noticed that we have only implemented five instructions, yet run_code already looks a bit bloated, and every new instruction will make it worse. Fortunately, we can use Python's dynamic method lookup. Because each instruction has the same name as the method that implements it, we can call getattr, which returns the method with the given name. This gets rid of the long chain of branches, and new instructions can be added without modifying run_code.

The following is execute, an evolved version of run_code:

def execute(self, what_to_execute):
    instructions = what_to_execute["instructions"]
    for each_step in instructions:
        instruction, argument = each_step
        argument = self.parse_argument(instruction, argument, what_to_execute)
        bytecode_method = getattr(self, instruction)
        if argument is None:
            bytecode_method()
        else:
            bytecode_method(argument)
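Putting everything from this lesson together gives a short script that can be run as is. The sketch below is a condensed version of the class above with execute in place of run_code. It assumes every instruction name matches a method on the class.

```python
class Interpreter:
    def __init__(self):
        self.stack = []
        self.environment = {}  # maps variable names to values

    def STORE_NAME(self, name):
        self.environment[name] = self.stack.pop()

    def LOAD_NAME(self, name):
        self.stack.append(self.environment[name])

    def LOAD_VALUE(self, number):
        self.stack.append(number)

    def PRINT_ANSWER(self):
        print(self.stack.pop())

    def ADD_TWO_VALUES(self):
        self.stack.append(self.stack.pop() + self.stack.pop())

    def parse_argument(self, instruction, argument, what_to_execute):
        # Turn an index into the constant or the variable name it refers to.
        if instruction == "LOAD_VALUE":
            return what_to_execute["numbers"][argument]
        if instruction in ("LOAD_NAME", "STORE_NAME"):
            return what_to_execute["names"][argument]
        return argument  # None for instructions that take no argument

    def execute(self, what_to_execute):
        for instruction, argument in what_to_execute["instructions"]:
            argument = self.parse_argument(instruction, argument, what_to_execute)
            bytecode_method = getattr(self, instruction)  # look up by name
            if argument is None:
                bytecode_method()
            else:
                bytecode_method(argument)

what_to_execute = {
    "instructions": [("LOAD_VALUE", 0),
                     ("STORE_NAME", 0),
                     ("LOAD_VALUE", 1),
                     ("STORE_NAME", 1),
                     ("LOAD_NAME", 0),
                     ("LOAD_NAME", 1),
                     ("ADD_TWO_VALUES", None),
                     ("PRINT_ANSWER", None)],
    "numbers": [1, 2],
    "names": ["a", "b"]}

interpreter = Interpreter()
interpreter.execute(what_to_execute)  # prints 3
```

One consequence of this design is that an instruction name with no matching method makes getattr raise AttributeError, so a misspelled instruction fails loudly instead of being silently skipped as it would be in the if/elif version.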

That's it. In this lesson, we implemented a toy interpreter that runs a code object structure of our own design. In the next lesson, we will look at the code objects and bytecode of real Python. See you in the next lesson!