"Python Interpreter Source Analysis", Chapter 8: Python bytecode and pyc files

8.0 Preface

When we write Python scripts day to day, running one only takes typing python xxx.py. So the question is: how does a .py file get turned into a series of machine instructions and executed?

8.1 How a Python program is executed

How Python executes a program can be summed up in two terms: virtual machine and bytecode.

First, there is one critical piece of Python called the interpreter. Typing python on the command line starts the interpreter. If a .py file follows the command, the interpreter starts and then executes the code in that file. Before it actually begins executing, however, it has to complete a fairly complex task: compiling the .py file.

Yes, although Python is an interpreted language, it also has a compilation step. Whenever a .py file is executed, the source code is first compiled into Python bytecode; the bytecode is then handed to the Python virtual machine, which executes the bytecode instructions one by one, and that completes the run of the program. As for the difference between the virtual machine and the interpreter, my personal view is that in Python, interpreter = compiler + virtual machine. Do not misread "compiler" here: it compiles source into Python bytecode, not directly into machine code as a C compiler does.
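The two stages described above can be tried out from Python itself. The following is only a minimal sketch using the built-in compile and exec; the source string and the file name "<demo>" are made up for illustration.

```python
# Stage 1: the compiler turns source text into a code object (bytecode plus metadata).
source = "x = 1 + 2"
code = compile(source, "<demo>", "exec")

# Stage 2: the virtual machine executes that bytecode inside a namespace.
namespace = {}
exec(code, namespace)

print(type(code).__name__)          # code
print(type(code.co_code).__name__)  # bytes
print(namespace["x"])               # 3
```

Nothing here touches the disk: the code object exists only in memory, which is the form the virtual machine actually works with.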

So where do Python's compiler and virtual machine live? On Windows, open the Python installation directory and you will see python.exe. Double-clicking it does start a terminal, but the file is less than 100 KB, far too small to hold a compiler plus a virtual machine. Next to it, however, there is python37.dll, and that is where the compiler and the virtual machine actually live.

8.2 The result of Python compilation: the PyCodeObject object

8.2.1 PyCodeObject objects and pyc files

Let's start with a simple example and see what compiling a .py file should produce.

class A:
    pass


def foo():
    pass


a = A()
foo()

We know that when Python executes this file, the first step is compilation, and the result of compilation is bytecode. Besides the bytecode, though, the result must also contain other information that Python needs at run time.

During compilation, the Python compiler collects the static information in the source code, such as constant values and strings, and this information ends up in the compilation result. While Python runs, this static information is held in a runtime object; when the run ends, the information in that object can be stored in a file. That object and that file are the focus of this chapter: the PyCodeObject object and the pyc file.

We tend to say that the result of compilation is a pyc file, but what the file contains is PyCodeObject objects. From the compiler's point of view, the PyCodeObject is the real result of compilation, and the pyc file is merely the form that object takes on disk. The two are simply different representations of the same compiled source code.

While the program runs, the compilation result lives in memory as PyCodeObject objects; once the run is over, the result can be saved to a pyc file (as we will see, this does not always happen). On the next run, Python rebuilds the PyCodeObject objects in memory directly from the pyc file instead of compiling again.

8.2.2 PyCodeObject in the Python source code

How Python compiles a .py file is not covered here, because that involves lexical analysis, building the syntax tree and so on; our focus is the result of compilation. To fully understand the runtime behavior of the Python virtual machine, you must thoroughly understand the PyCodeObject object.

/* Bytecode object */
typedef struct {
    PyObject_HEAD
    int co_argcount;            /* #arguments, except *args */
    int co_kwonlyargcount;      /* #keyword only arguments */
    int co_nlocals;             /* #local variables */
    int co_stacksize;           /* #entries needed for evaluation stack */
    int co_flags;               /* CO_..., see below */
    int co_firstlineno;         /* first source line number */
    PyObject *co_code;          /* instruction opcodes */
    PyObject *co_consts;        /* list (constants used) */
    PyObject *co_names;         /* list of strings (names used) */
    PyObject *co_varnames;      /* tuple of strings (local variable names) */
    PyObject *co_freevars;      /* tuple of strings (free variable names) */
    PyObject *co_cellvars;      /* tuple of strings (cell variable names) */
    /* The rest aren't used in either hash or comparisons, except for co_name,
       used in both. This is done to preserve the name and line number
       for tracebacks and debuggers; otherwise, constant de-duplication
       would collapse identical functions/lambdas defined on different lines.
    */
    Py_ssize_t *co_cell2arg;    /* Maps cell vars which are arguments. */
    PyObject *co_filename;      /* unicode (where it was loaded from) */
    PyObject *co_name;          /* unicode (name, for reference) */
    PyObject *co_lnotab;        /* string (encoding addr<->lineno mapping) See
                                   Objects/lnotab_notes.txt for details. */
    void *co_zombieframe;       /* for optimization only (see frameobject.c) */
    PyObject *co_weakreflist;   /* to support weakrefs to code objects */
    /* Scratch space for extra data relating to the code object.
       Type is a void* to keep the format private in codeobject.c to force
       people to go through the proper APIs. */
    void *co_extra;
} PyCodeObject;

You can ignore what each field means for now, since we will go through them one by one. One spoiler, though: co_code is the field that stores the bytecode instruction sequence produced by compilation.

When the Python compiler compiles source code, it creates one PyCodeObject for each code block. But how much code counts as a block? Python has a simple, clear rule: whenever a new namespace (that is, a new scope) is entered, a new code block begins.

Looking back at the .py file above, compilation yields three PyCodeObject objects: one for the whole file, one for class A, and one for def foo.
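The count of three can be checked with a short experiment. This is a sketch rather than anything from the interpreter source: walk_code is a hypothetical helper name, and it relies on the fact that nested code objects show up among the constants of the enclosing one.

```python
import types

SOURCE = '''
class A:
    pass


def foo():
    pass


a = A()
foo()
'''


def walk_code(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const)


module_code = compile(SOURCE, "<example>", "exec")
names = [c.co_name for c in walk_code(module_code)]
print(len(names))     # 3
print(sorted(names))  # ['<module>', 'A', 'foo']
```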

This brings us to a crucial Python concept: the namespace. A namespace is the context of a symbol, and a symbol's meaning depends on its namespace. More specifically, the value a variable name refers to is not fixed in Python; it is determined by the namespace.

A symbol such as a may refer to a PyLongObject in one namespace and to a PyListObject in another, but within a single namespace a symbol has exactly one meaning. Namespaces also nest one inside another to form a namespace chain, and while executing, the Python virtual machine spends a good share of its time walking this chain to determine which object a symbol refers to. This partly explains why Python variables need no declared type, and why Python is comparatively slow.
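A tiny illustration of one symbol meaning two different things, depending on the namespace it is looked up in. The names a and show are arbitrary and exist only for this sketch.

```python
a = 123  # in the module namespace, a is an int


def show():
    a = [1, 2, 3]  # in the function's own namespace, a is a list
    return type(a).__name__


local_type = show()
global_type = type(a).__name__
print(global_type, local_type)  # int list
```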

If namespaces are still unclear at this point, don't worry: as the analysis goes deeper you will gain a much better understanding of namespaces and namespace chains. For now, remember one sentence: a code block corresponds to one namespace and also to one PyCodeObject. In Python, a class, a function and a module each have their own namespace, and therefore each has its own PyCodeObject.

8.2.3 pyc files

Each PyCodeObject contains the bytecode sequence obtained by compiling its code block. We said earlier that Python stores these bytecode sequences, together with the PyCodeObject, in a pyc file. Unfortunately this is not always the case: sometimes running a simple program produces no pyc file at all. A reasonable guess is that some Python programs only do a one-off chore, run once and are never used again, so there is no point in saving a pyc file for them.

If we add a statement such as import abc to the code and run it again, we find that Python generates a pyc file for the imported module, which shows that import triggers pyc generation. In fact, when Python meets import abc at run time, it searches the configured paths for abc.pyc or abc.dll. If neither is found and only abc.py exists, Python first compiles abc.py into a PyCodeObject, then creates a pyc file and writes the PyCodeObject into it. Only then does the import proceed from abc.pyc. In other words, according to this account, the freshly compiled PyCodeObject is not used directly: it is first written to the pyc file and then read back from the pyc file into memory.

Python's import mechanism is analyzed in a later chapter; here it is only used to trigger the creation of pyc files. There are of course other ways to obtain a pyc file, such as the py_compile module.
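As a rough sketch of the py_compile route, the snippet below writes a throwaway source file into a temporary directory and compiles it. The file name a.py and its content are placeholders; py_compile.compile returns the path of the pyc file it wrote.

```python
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.py")  # placeholder module name
    with open(src, "w", encoding="utf-8") as f:
        f.write("class A:\n    a = 1\n")

    # With no cfile argument the pyc goes to the default cache location.
    pyc_path = py_compile.compile(src, doraise=True)
    pyc_exists = os.path.exists(pyc_path)

print(os.path.basename(pyc_path))
print(pyc_exists)  # True
```

Passing doraise=True makes a compilation error raise py_compile.PyCompileError instead of only printing a message, which is usually easier to handle in a script.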

a.py

class A:
    a = 1

b.py

import a 

When b.py is executed, a pyc file for a.py is created. As for where it goes: a directory named __pycache__ is created in the same directory as the source file, and the pyc file inside it is named <py file name>.cpython-<version>.pyc.
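The naming rule can be queried instead of memorized. A small sketch, assuming a CPython interpreter (other implementations may use a different tag or none at all):

```python
import importlib.util
import sys

tag = sys.implementation.cache_tag  # for example "cpython-37"
pyc_path = importlib.util.cache_from_source("a.py")

print(tag)
print(pyc_path)
```

importlib.util.cache_from_source only computes the path; it does not check whether the source file or the pyc file exists.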

Opening it in Notepad++ in binary mode shows a pile of seemingly meaningless numbers.

So how Python interprets this byte stream is what matters, which means the pyc file format is what we care about.

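To understand the pyc file format, you must know what every field of PyCodeObject means; there is no way around this.

Let's go back through the fields of the PyCodeObject struct:

  • PyObject_HEAD: everything really is an object. Bytecode is an object too, so it naturally needs the PyObject header information.

  • co_argcount: the number of positional parameters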

    import inspect
    
    frame = None
    
    
    def foo(a, b, c, d):
        global frame
        frame = inspect.currentframe()
    
    
    foo(1, 2, 3, 4)
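    # frame is the stack frame of this function; ignore the frame itself for now
    # frame.f_code gives us the code object, which is the underlying PyCodeObject
    # its co_argcount attribute gives the number of positional parameters
    print(frame.f_code.co_argcount)  # 4
  • co_kwonlyargcount: the number of keyword-only parameters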

    def foo(a, b, *, c, d):
        global frame
        frame = inspect.currentframe()
    
    
    foo(1, 2, c=3, d=4)
    print(frame.f_code.co_argcount)  # 2
    print(frame.f_code.co_kwonlyargcount)  # 2
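    # c and d can only be passed by keyword, so the count is 2
    # without the * in the definition they would be ordinary parameters:
    # passing them by keyword at the call site would not make them keyword-only
    # co_kwonlyargcount counts only parameters that must be passed by keyword
  • co_nlocals: the number of local variables in the code block, parameters included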

    import inspect
    
    frame = None
    
    
    def foo(a, b, *, c):
        name = "xxx"
        age = 16
        gender = "f"
        c = 33
        global frame
        frame = inspect.currentframe()
    
    
    foo(1, 2, c=3)
    print(frame.f_code.co_nlocals)  # 6
    """
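    Three parameters plus three local variables makes six.
    The c in c = 33 and the parameter c are one and the same variable.
    """
  • co_stacksize: the stack space needed to execute this code block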

    import inspect
    
    frame = None
    
    
    def foo1(a, b):
        global frame
        frame = inspect.currentframe()
    
    
    foo1(1, 222)
    print(frame.f_code.co_stacksize)  # 2
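  • co_flags: a bit mask of flags describing the code block (for example, whether it is a generator); it is rarely needed and can be ignored for now

  • co_firstlineno: the line in the source file where the code block starts

    What if foo is called from another function?

    Every function has its own namespace and its own PyCodeObject, so even when foo is called through bar, co_firstlineno is still the first line of foo's own code block.

  • co_code: the instruction sequence the code block is compiled into, stored as a PyBytesObject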

    import inspect
    
    frame = None
    
    
    def bar():
        foo(1, 2)
    
    
    def foo(a, b):
        global frame
        frame = inspect.currentframe()
    
    
    bar()
    print(frame.f_code.co_code)  # b't\x00\xa0\x01\xa1\x00a\x02d\x00S\x00'
  • co_consts: the constant pool, a PyTupleObject holding all constants in the code block.

    import inspect
    
    frame = None
    
    
    def foo(a, b):
        name = "satori"
        global frame
        age = 16
        frame = inspect.currentframe()
    
    
    foo(1, 2)
    print(frame.f_code.co_consts)  # (None, 'satori', 16)
  • co_names: a PyTupleObject holding the names used in the code block other than local variables (global names and attribute names)

    import inspect
    
    frame = None
    
    
    def foo(a, b):
        name = "satori"
        global frame
        age = 16
        frame = inspect.currentframe()
    
    
    foo(1, 2)
    print(frame.f_code.co_names)  # ('inspect', 'currentframe', 'frame')
  • co_varnames: the names of the local variables in the code block, parameters first

    import inspect
    
    frame = None
    
    
    def foo(a, b):
        name = "satori"
        global frame
        age = 16
        frame = inspect.currentframe()
    
    
    foo(1, 2)
    print(frame.f_code.co_varnames)  # ('a', 'b', 'name', 'age')
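  • co_freevars: what Python needs to implement closures, namely the names of variables from an enclosing function that this code block references.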

    import inspect
    
    frame = None
    
    
    def foo(a, b):
        name = "satori"
        global frame
        age = 16
        frame = inspect.currentframe()
    
    
    foo(1, 2)
    print(frame.f_code.co_freevars)  # ()

    import inspect
    
    frame = None
    
    
    def foo():
    
        name = "satori"
        age = 16
    
        def inner():
            global frame
            name
            age
            frame = inspect.currentframe()
    
        return inner
    
    foo()()
    print(frame.f_code.co_freevars)  # ('age', 'name')
  • co_cellvars: the names of this function's local variables that are referenced by nested inner functions

    import inspect
    
    frame = None
    
    
    def foo():
    
        global frame
        name = "satori"
        age = 16
        frame = inspect.currentframe()
    
        def inner():
            name
            age
    
        return inner
    
    foo()
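    # Note that this is foo(), not foo()(). co_freevars belongs to the inner function,
    # so there we had to call twice to reach the inner function's frame.
    # co_cellvars belongs to the outer function foo, and global frame is set inside foo,
    # so calling once is enough.
    print(frame.f_code.co_cellvars)  # ('age', 'name')
    • co_filename: the name of the file containing the code block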

      import inspect
      
      frame = None
      
      
      def foo():
      
          global frame
          name = "satori"
          age = 16
          frame = inspect.currentframe()
      
          def inner():
              name
              age
      
          return inner
      
      foo()
      print(frame.f_code.co_filename)  # C:/Users/satori/Desktop/love_minami/a.py
    • co_name: the name of the code block, usually the function name or class name

      import inspect
      
      frame = None
      
      
      def foo():
      
          global frame
          name = "satori"
          age = 16
          frame = inspect.currentframe()
      
          def inner():
              name
              age
      
          return inner
      
      foo()
      print(frame.f_code.co_name)  # foo
    • co_lnotab: the mapping between bytecode offsets and source line numbers, stored as a PyBytesObject

      import inspect
      
      frame = None
      
      
      def foo():
      
          global frame
          name = "satori"
          age = 16
          frame = inspect.currentframe()
      
          def inner():
              name
              age
      
          return inner
      
      
      foo()
      print(frame.f_code.co_lnotab)  # b'\x00\x03\x04\x01\x04\x01\x08\x02\x0e\x04'

    In fact, Python does not record these offset and line pairs directly; it records increments instead.
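To make the increment idea concrete, here is a hedged sketch that expands the co_lnotab bytes printed above into absolute pairs. decode_lnotab is a hypothetical helper written for this illustration, and the starting line 6 is an assumption obtained by counting lines in the listing (def foo is its sixth line).

```python
def decode_lnotab(lnotab, first_lineno):
    """Expand (offset increment, line increment) byte pairs into
    absolute (bytecode offset, source line) pairs."""
    addr, line = 0, first_lineno
    table = []
    for addr_incr, line_incr in zip(lnotab[0::2], lnotab[1::2]):
        if line_incr >= 0x80:  # line increments are stored as signed bytes
            line_incr -= 0x100
        addr += addr_incr
        line += line_incr
        table.append((addr, line))
    return table


lnotab = b'\x00\x03\x04\x01\x04\x01\x08\x02\x0e\x04'
# Assumption: "def foo" sits on line 6 of the file, as in the listing above.
print(decode_lnotab(lnotab, first_lineno=6))
# [(0, 9), (4, 10), (8, 11), (16, 13), (30, 17)]
```

Read against the listing, offset 0 maps to line 9 (name = "satori"), offset 16 to line 13 (def inner) and offset 30 to line 17 (return inner). Storing small increments keeps the table compact, since each pair fits in two bytes.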

That covers the attributes of PyCodeObject. Strictly speaking, two or three fields were skipped, because they are rarely used and cannot be reached through frame.f_code anyway.
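Two of the fields above, co_flags and co_firstlineno, came without a listing, so here is a small sketch for them. It reads the code object through the function's __code__ attribute and uses inspect.CO_GENERATOR, one of the flag bits exposed by the inspect module; the function names and the source string are made up for the example.

```python
import inspect


def plain():
    return 1


def gen():
    yield 1


# co_flags is a bit mask; test a single bit with &.
plain_is_gen = bool(plain.__code__.co_flags & inspect.CO_GENERATOR)
gen_is_gen = bool(gen.__code__.co_flags & inspect.CO_GENERATOR)
print(plain_is_gen, gen_is_gen)  # False True

# co_firstlineno belongs to the code block itself, not to whoever calls it.
src = "\n\ndef foo():\n    pass\n\ndef bar():\n    foo()\n"
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
foo_line = ns["foo"].__code__.co_firstlineno
bar_line = ns["bar"].__code__.co_firstlineno
print(foo_line, bar_line)  # 3 6
```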

8.2.4 Accessing PyCodeObject from Python

We have already seen one way to get hold of a PyCodeObject. Are there others?

__code__

def foo():

    name = "satori"
    age = 16


code = foo.__code__
# As you can see, the function itself exposes its PyCodeObject through the __code__ attribute
# This is the same object as frame.f_code
print(code.co_varnames)  # ('name', 'age')
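The claim that __code__ and frame.f_code are the same object can be checked directly. A minimal sketch, assuming CPython, where inspect.currentframe() returns a real frame:

```python
import inspect


def foo():
    # The running frame's f_code is the code object of the function being executed.
    return inspect.currentframe().f_code


same_object = foo() is foo.__code__
print(same_object)  # True
```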

compile

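Before introducing compile, a quick look at eval and exec.

  • eval: takes a string and evaluates its content as an expression.

    a = 1
    # so eval("a") is equivalent to a
    print(eval("a"))  # 1
    
    print(eval("1 + 1 + 1"))  # 3
    
    # Note: eval has a return value, namely the value of the expression in the string.
    # Put differently, eval can appear on the right-hand side, e.g. a = eval("xxx")
    # So eval must never contain an assignment: eval("a = 3") raises a SyntaxError,
    # because a = 3 is a statement, not an expression
    # Once the quotes are stripped the content must be a plain value; statements such as if or def are not allowed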
    
    
    try:
        eval("xxx")
    except NameError as e:
        print(e)  # name 'xxx' is not defined
  • exec: takes a string and executes its content as statements. It has no return value, or rather it returns None.

    exec("a = 1")  # executes the content of the string "a = 1" as a statement
    print(a)  # 1
    
    statement = """a = 123
    if a == 123:
        print("a equals 123")
    else:
        print("a does not equal 123")
    """
    exec(statement)  # a equals 123
    # Note: "a equals 123" is not returned by exec; it is printed while the string runs as ordinary code
    # That is what exec does.
    
    
    # The difference from eval is now clear: eval requires the string to be an expression and returns its value,
    # while exec simply executes the content
    # For example
    print(eval("1 + 1"))  # 2
    print(exec("1 + 1"))  # None
    
    exec("a = 1 + 1")
    print(a)  # 2
    
    try:
        eval("a = 1 + 1")
    except SyntaxError as e:
        print(e)  # invalid syntax (<string>, line 1)

  • compile: roughly a combination of the two

    compile returns a PyCodeObject.

    statement = "a, b = 1, 2"
    co = compile(statement, "hanser", "exec")
    print(co.co_name)  # <module>
    print(co.co_filename)  # hanser
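The third argument of compile selects the mode, and the resulting code object can be handed to eval or exec. A short sketch; the file names "<expr>" and "<stmts>" are arbitrary labels.

```python
# "eval" mode: the source must be a single expression; eval() returns its value.
expr_code = compile("1 + 2 * 3", "<expr>", "eval")
result = eval(expr_code)
print(result)  # 7

# "exec" mode: the source may contain statements; exec() returns None.
stmt_code = compile("a, b = 1, 2", "<stmts>", "exec")
ns = {}
exec(stmt_code, ns)
print(ns["a"], ns["b"])  # 1 2
```

Compiling once and running the code object several times avoids re-parsing the string on every call.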

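8.3 Generating pyc files

8.3.1 How a pyc file is created

As mentioned earlier, when Python loads a module through import and finds no matching pyc or dll file, it automatically creates a pyc file from the .py file. So to understand how a pyc file is created, we only need to understand how the PyCodeObject is written out. Writing a pyc file mainly involves three things:

  • magic number

    An integer defined by Python. Different Python versions define different magic numbers, which ensures that Python only loads a compatible pyc. For example, Python 3.7 will not load a pyc produced by 3.6: when loading a pyc, Python first checks its magic number and refuses to load the file if it differs from its own.

  • the pyc creation time

    This one is easy to understand: what if the source is modified after compilation? Python compares the source file's last-modification time with the time recorded in the pyc. If the source was modified after the pyc was generated, Python recompiles and writes a new pyc; otherwise it loads the pyc directly. (Strictly speaking, the recorded value is the source file's modification time, and since Python 3.7 the header also contains a flags field and the source size.)

  • the PyCodeObject object

    Needless to say, this has to be stored.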
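The pieces listed above can be read back out of a real pyc file. The following is a sketch under one explicit assumption: the 16-byte header layout used by CPython 3.7 and later (magic number, flags, source modification time, source size), with a timestamp-based pyc. The file names are placeholders.

```python
import importlib.util
import marshal
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.py")
    with open(src, "wb") as f:
        f.write(b"class A:\n    a = 1\n")  # 19 bytes of source
    pyc = py_compile.compile(
        src,
        cfile=os.path.join(tmp, "a.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    with open(pyc, "rb") as f:
        data = f.read()

# Assumption: 16-byte header as in CPython 3.7+ (timestamp-based pyc).
magic = data[:4]
flags, mtime, size = struct.unpack("<III", data[4:16])
code = marshal.loads(data[16:])  # everything after the header is the code object

print(magic == importlib.util.MAGIC_NUMBER)  # True
print(flags, size)                           # 0 19
print(code.co_name)                          # <module>
```

On interpreters older than 3.7 the header is shorter, so the slice offsets would have to be adjusted.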

The file object:

//Location: Python/marshal.c

//FILE is a file handle; WFILE can be seen as a wrapper around FILE
typedef struct {
    FILE *fp;  //file handle
    //the fields below show up when data is written
    int error;  
    int depth;
    PyObject *str;
    char *ptr;
    char *end;
    char *buf;
    _Py_hashtable_t *hashtable;
    int version;
} WFILE;

Writing the magic number and the time:

Both the magic number and the time are written by calling PyMarshal_WriteLongToFile. Let's see what it looks like.

void
PyMarshal_WriteLongToFile(long x, FILE *fp, int version)
{   
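    //declare a char array of 4 elements
    char buf[4];
    //declare a variable wf of type WFILE
    WFILE wf;
    //zero-initialize it
    memset(&wf, 0, sizeof(wf));
    //set fp, the file handle
    wf.fp = fp;
    //point wf.ptr and wf.buf at the start of buf
    wf.ptr = wf.buf = buf;
    //wf.end points just past the last element of buf
    wf.end = wf.ptr + sizeof(buf);
    //error status, initially "no error"
    wf.error = WFERR_OK;
    //record the marshal format version
    wf.version = version;
    //call w_long to write x (the magic number or the time) into wf
    w_long(x, &wf);
    //flush the buffer to the file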
    w_flush(&wf);
}


//So this step only initializes a WFILE object; the actual writing is done by w_long
static void
w_long(long x, WFILE *p)
{
    w_byte((char)( x      & 0xff), p);
    w_byte((char)((x>> 8) & 0xff), p);
    w_byte((char)((x>>16) & 0xff), p);
    w_byte((char)((x>>24) & 0xff), p);
}
//w_long writes x into the file one byte at a time, lowest byte first.
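The byte order produced by w_long is little-endian. The Python function below is only a re-creation of the same shifts and masks for illustration (it borrows the name w_long but is not part of any library), compared against struct.pack.

```python
import struct


def w_long(x):
    """Return the 4 bytes the C w_long would emit, lowest byte first."""
    return bytes((x >> shift) & 0xFF for shift in (0, 8, 16, 24))


value = 0x12345678
print(w_long(value).hex())                        # 78563412
print(w_long(value) == struct.pack("<I", value))  # True
```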

Writing the PyCodeObject object:

The PyCodeObject is written by calling PyMarshal_WriteObjectToFile. Let's look at that one too.

void
PyMarshal_WriteObjectToFile(PyObject *x, FILE *fp, int version)
{
    char buf[BUFSIZ];
    WFILE wf;
    memset(&wf, 0, sizeof(wf));
    wf.fp = fp;
    wf.ptr = wf.buf = buf;
    wf.end = wf.ptr + sizeof(buf);
    wf.error = WFERR_OK;
    wf.version = version;
    if (w_init_refs(&wf, version))
        return; /* caller mush check PyErr_Occurred() */
    w_object(x, &wf);
    w_clear_refs(&wf);
    w_flush(&wf);
}
//As you can see, it is very similar to PyMarshal_WriteLongToFile
//The difference is that PyMarshal_WriteLongToFile calls w_long, while PyMarshal_WriteObjectToFile calls w_object


static void
w_object(PyObject *v, WFILE *p)
{
    char flag = '\0';

    p->depth++;

    if (p->depth > MAX_MARSHAL_STACK_DEPTH) {
        p->error = WFERR_NESTEDTOODEEP;
    }
    else if (v == NULL) {
        w_byte(TYPE_NULL, p);
    }
    else if (v == Py_None) {
        w_byte(TYPE_NONE, p);
    }
    else if (v == PyExc_StopIteration) {
        w_byte(TYPE_STOPITER, p);
    }
    else if (v == Py_Ellipsis) {
        w_byte(TYPE_ELLIPSIS, p);
    }
    else if (v == Py_False) {
        w_byte(TYPE_FALSE, p);
    }
    else if (v == Py_True) {
        w_byte(TYPE_TRUE, p);
    }
    else if (!w_ref(v, &flag, p))
        w_complex_object(v, flag, p);

    p->depth--;
}

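As you can see, it still boils down to calls to w_byte, but there is no sign here of how data such as list or tuple is stored. Note the final call to w_complex_object: this is where the key part is.

//the source is long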

static void
w_complex_object(PyObject *v, char flag, WFILE *p)
{
    Py_ssize_t i, n;

    if (PyLong_CheckExact(v)) {
        long x = PyLong_AsLong(v);
        if ((x == -1)  && PyErr_Occurred()) {
            PyLongObject *ob = (PyLongObject *)v;
            PyErr_Clear();
            w_PyLong(ob, flag, p);
        }
        else {
#if SIZEOF_LONG > 4
            long y = Py_ARITHMETIC_RIGHT_SHIFT(long, x, 31);
            if (y && y != -1) {
                /* Too large for TYPE_INT */
                w_PyLong((PyLongObject*)v, flag, p);
            }
            else
#endif
            {
                W_TYPE(TYPE_INT, p);
                w_long(x, p);
            }
        }
    }
    else if (PyFloat_CheckExact(v)) {
        if (p->version > 1) {
            unsigned char buf[8];
            if (_PyFloat_Pack8(PyFloat_AsDouble(v),
                               buf, 1) < 0) {
                p->error = WFERR_UNMARSHALLABLE;
                return;
            }
            W_TYPE(TYPE_BINARY_FLOAT, p);
            w_string((char*)buf, 8, p);
        }
        else {
            char *buf = PyOS_double_to_string(PyFloat_AS_DOUBLE(v),
                                              'g', 17, 0, NULL);
            if (!buf) {
                p->error = WFERR_NOMEMORY;
                return;
            }
            n = strlen(buf);
            W_TYPE(TYPE_FLOAT, p);
            w_byte((int)n, p);
            w_string(buf, n, p);
            PyMem_Free(buf);
        }
    }
    else if (PyComplex_CheckExact(v)) {
        if (p->version > 1) {
            unsigned char buf[8];
            if (_PyFloat_Pack8(PyComplex_RealAsDouble(v),
                               buf, 1) < 0) {
                p->error = WFERR_UNMARSHALLABLE;
                return;
            }
            W_TYPE(TYPE_BINARY_COMPLEX, p);
            w_string((char*)buf, 8, p);
            if (_PyFloat_Pack8(PyComplex_ImagAsDouble(v),
                               buf, 1) < 0) {
                p->error = WFERR_UNMARSHALLABLE;
                return;
            }
            w_string((char*)buf, 8, p);
        }
        else {
            char *buf;
            W_TYPE(TYPE_COMPLEX, p);
            buf = PyOS_double_to_string(PyComplex_RealAsDouble(v),
                                        'g', 17, 0, NULL);
            if (!buf) {
                p->error = WFERR_NOMEMORY;
                return;
            }
            n = strlen(buf);
            w_byte((int)n, p);
            w_string(buf, n, p);
            PyMem_Free(buf);
            buf = PyOS_double_to_string(PyComplex_ImagAsDouble(v),
                                        'g', 17, 0, NULL);
            if (!buf) {
                p->error = WFERR_NOMEMORY;
                return;
            }
            n = strlen(buf);
            w_byte((int)n, p);
            w_string(buf, n, p);
            PyMem_Free(buf);
        }
    }
    else if (PyBytes_CheckExact(v)) {
        W_TYPE(TYPE_STRING, p);
        w_pstring(PyBytes_AS_STRING(v), PyBytes_GET_SIZE(v), p);
    }
    else if (PyUnicode_CheckExact(v)) {
        if (p->version >= 4 && PyUnicode_IS_ASCII(v)) {
            int is_short = PyUnicode_GET_LENGTH(v) < 256;
            if (is_short) {
                if (PyUnicode_CHECK_INTERNED(v))
                    W_TYPE(TYPE_SHORT_ASCII_INTERNED, p);
                else
                    W_TYPE(TYPE_SHORT_ASCII, p);
                w_short_pstring((char *) PyUnicode_1BYTE_DATA(v),
                                PyUnicode_GET_LENGTH(v), p);
            }
            else {
                if (PyUnicode_CHECK_INTERNED(v))
                    W_TYPE(TYPE_ASCII_INTERNED, p);
                else
                    W_TYPE(TYPE_ASCII, p);
                w_pstring((char *) PyUnicode_1BYTE_DATA(v),
                          PyUnicode_GET_LENGTH(v), p);
            }
        }
        else {
            PyObject *utf8;
            utf8 = PyUnicode_AsEncodedString(v, "utf8", "surrogatepass");
            if (utf8 == NULL) {
                p->depth--;
                p->error = WFERR_UNMARSHALLABLE;
                return;
            }
            if (p->version >= 3 &&  PyUnicode_CHECK_INTERNED(v))
                W_TYPE(TYPE_INTERNED, p);
            else
                W_TYPE(TYPE_UNICODE, p);
            w_pstring(PyBytes_AS_STRING(utf8), PyBytes_GET_SIZE(utf8), p);
            Py_DECREF(utf8);
        }
    }
    else if (PyTuple_CheckExact(v)) {
        n = PyTuple_Size(v);
        if (p->version >= 4 && n < 256) {
            W_TYPE(TYPE_SMALL_TUPLE, p);
            w_byte((unsigned char)n, p);
        }
        else {
            W_TYPE(TYPE_TUPLE, p);
            W_SIZE(n, p);
        }
        for (i = 0; i < n; i++) {
            w_object(PyTuple_GET_ITEM(v, i), p);
        }
    }
    else if (PyList_CheckExact(v)) {
        W_TYPE(TYPE_LIST, p);
        n = PyList_GET_SIZE(v);
        W_SIZE(n, p);
        for (i = 0; i < n; i++) {
            w_object(PyList_GET_ITEM(v, i), p);
        }
    }
    else if (PyDict_CheckExact(v)) {
        Py_ssize_t pos;
        PyObject *key, *value;
        W_TYPE(TYPE_DICT, p);
        /* This one is NULL object terminated! */
        pos = 0;
        while (PyDict_Next(v, &pos, &key, &value)) {
            w_object(key, p);
            w_object(value, p);
        }
        w_object((PyObject *)NULL, p);
    }
    else if (PyAnySet_CheckExact(v)) {
        PyObject *value, *it;

        if (PyObject_TypeCheck(v, &PySet_Type))
            W_TYPE(TYPE_SET, p);
        else
            W_TYPE(TYPE_FROZENSET, p);
        n = PyObject_Size(v);
        if (n == -1) {
            p->depth--;
            p->error = WFERR_UNMARSHALLABLE;
            return;
        }
        W_SIZE(n, p);
        it = PyObject_GetIter(v);
        if (it == NULL) {
            p->depth--;
            p->error = WFERR_UNMARSHALLABLE;
            return;
        }
        while ((value = PyIter_Next(it)) != NULL) {
            w_object(value, p);
            Py_DECREF(value);
        }
        Py_DECREF(it);
        if (PyErr_Occurred()) {
            p->depth--;
            p->error = WFERR_UNMARSHALLABLE;
            return;
        }
    }
    else if (PyCode_Check(v)) {
        PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);
    }
    else if (PyObject_CheckBuffer(v)) {
        /* Write unknown bytes-like objects as a bytes object */
        Py_buffer view;
        if (PyObject_GetBuffer(v, &view, PyBUF_SIMPLE) != 0) {
            w_byte(TYPE_UNKNOWN, p);
            p->depth--;
            p->error = WFERR_UNMARSHALLABLE;
            return;
        }
        W_TYPE(TYPE_STRING, p);
        w_pstring(view.buf, view.len, p);
        PyBuffer_Release(&view);
    }
    else {
        W_TYPE(TYPE_UNKNOWN, p);
        p->error = WFERR_UNMARSHALLABLE;
    }
}

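The source is long, so we will not go through it line by line. Long as it is, the logic is simple: each kind of object gets its own write action, and ultimately everything is written to the pyc file through w_byte. In other words, when Python writes a list to a pyc file, it only writes the numbers, strings and other objects the list contains. This also means that when loading a pyc file, Python must rebuild the list object from those numbers and strings.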
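The same write-then-rebuild cycle is available from Python through the marshal module, which is the Python-level face of marshal.c. A minimal sketch with an arbitrary sample list:

```python
import marshal

original = [1, 2.5, "abc", (3, 4)]
data = marshal.dumps(original)  # a flat byte string
restored = marshal.loads(data)  # a brand-new list rebuilt from those bytes

print(type(data).__name__)   # bytes
print(restored == original)  # True
print(restored is original)  # False
```

The restored list is equal to the original but is a different object, which is exactly the "reconstruct on load" behavior described above.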

For a PyCodeObject, w_object clearly walks through the fields of the PyCodeObject and writes them one after another:

PyCodeObject *co = (PyCodeObject *)v;
        W_TYPE(TYPE_CODE, p);
        w_long(co->co_argcount, p);
        w_long(co->co_kwonlyargcount, p);
        w_long(co->co_nlocals, p);
        w_long(co->co_stacksize, p);
        w_long(co->co_flags, p);
        w_object(co->co_code, p);
        w_object(co->co_consts, p);
        w_object(co->co_names, p);
        w_object(co->co_varnames, p);
        w_object(co->co_freevars, p);
        w_object(co->co_cellvars, p);
        w_object(co->co_filename, p);
        w_object(co->co_name, p);
        w_long(co->co_firstlineno, p);
        w_object(co->co_lnotab, p);

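What changes when the object is a PyListObject? Nothing really: just as with PyCodeObject, w_object iterates and writes each element of the PyListObject to the pyc file in turn.

//PyTupleObject, PyListObject and PyDictObject are all handled the same way
//note the W_TYPE calls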
else if (PyTuple_CheckExact(v)) {
        n = PyTuple_Size(v);
        if (p->version >= 4 && n < 256) {
            W_TYPE(TYPE_SMALL_TUPLE, p);
            w_byte((unsigned char)n, p);
        }
        else {
            W_TYPE(TYPE_TUPLE, p);
            W_SIZE(n, p);
        }
        for (i = 0; i < n; i++) {
            w_object(PyTuple_GET_ITEM(v, i), p);
        }
    }
    else if (PyList_CheckExact(v)) {
        W_TYPE(TYPE_LIST, p);
        n = PyList_GET_SIZE(v);
        W_SIZE(n, p);
        for (i = 0; i < n; i++) {
            w_object(PyList_GET_ITEM(v, i), p);
        }
    }
    else if (PyDict_CheckExact(v)) {
        Py_ssize_t pos;
        PyObject *key, *value;
        W_TYPE(TYPE_DICT, p);
        /* This one is NULL object terminated! */
        pos = 0;
        while (PyDict_Next(v, &pos, &key, &value)) {
            w_object(key, p);
            w_object(value, p);
        }
        w_object((PyObject *)NULL, p);
    }

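We can see that whatever the object, W_TYPE is called first to write something that looks like a type. Indeed, markers such as TYPE_LIST, TYPE_TUPLE and TYPE_DICT are crucial for loading a pyc file.

As noted, Python writes only numbers and strings to the pyc file. Once a PyCodeObject has been written, all of its data has become a byte stream and the type information is gone. Without type information, Python would have no way of knowing the structure hidden in the byte stream when it loads the pyc again. Python therefore has to write markers into the pyc file, and these markers are exactly the type tags Python defines. When Python meets such a tag in a pyc, it knows that the previous object has ended and a new one begins, what kind of object the new one is, and hence which loading action to perform. The tags can be seen in the source: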

//marshal.c
#define TYPE_NULL               '0'
#define TYPE_NONE               'N'
#define TYPE_FALSE              'F'
#define TYPE_TRUE               'T'
#define TYPE_STOPITER           'S'
#define TYPE_ELLIPSIS           '.'
#define TYPE_INT                'i'
/* TYPE_INT64 is not generated anymore.
   Supported for backward compatibility only. */
#define TYPE_INT64              'I'
#define TYPE_FLOAT              'f'
#define TYPE_BINARY_FLOAT       'g'
#define TYPE_COMPLEX            'x'
#define TYPE_BINARY_COMPLEX     'y'
#define TYPE_LONG               'l'
#define TYPE_STRING             's'
#define TYPE_INTERNED           't'
#define TYPE_REF                'r'
#define TYPE_TUPLE              '('
#define TYPE_LIST               '['
#define TYPE_DICT               '{'
#define TYPE_CODE               'c'
#define TYPE_UNICODE            'u'
#define TYPE_UNKNOWN            '?'
#define TYPE_SET                '<'
#define TYPE_FROZENSET          '>'
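These tags are visible from Python as the first byte of marshal.dumps output. The sketch below deliberately requests marshal version 2, which is an assumption made to keep the output simple: with newer versions a reference flag may be OR-ed into the tag byte and short tuples are written in a separate "small tuple" form. type_tag is a hypothetical helper name.

```python
import marshal


def type_tag(obj):
    # Version 2 is requested so that the first byte is the bare type tag.
    return chr(marshal.dumps(obj, 2)[0])


samples = [None, True, False, ..., 1, 1.5, b"x", "x", (1,), [1], {1: 2}]
tags = [type_tag(obj) for obj in samples]
print(tags)
# ['N', 'T', 'F', '.', 'i', 'g', 's', 'u', '(', '[', '{']
```

Each character matches one of the TYPE_* definitions listed above, for example '[' for TYPE_LIST and 'g' for TYPE_BINARY_FLOAT.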

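At this point we can see that exporting a PyCodeObject is not complicated. Whatever the object, everything eventually reduces to two simple forms: writing a number and writing a string. The writes shown above are numeric and simple: the value is written to the pyc byte by byte. For strings Python uses a more elaborate mechanism; read the source if you are interested, as it is not covered here.

8.3.2 Writing the PyCodeObject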

# a.py
class A:
    pass

def foo():
    pass


a = A()
foo()

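We said that three PyCodeObject objects are created for a .py file like this, yet only one PyCodeObject is written to the pyc file. Does that mean two of them are lost? Obviously not: two of the PyCodeObject objects sit inside the third. The PyCodeObject objects for foo and A are contained in the PyCodeObject for a.py, more precisely in its co_consts.

When a PyCodeObject is being written to a pyc file and another PyCodeObject is found inside it, the write operation is applied recursively. In this way every PyCodeObject ends up in the pyc file, and the PyCodeObject objects in the pyc file are linked by the same nesting relationship.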
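The nesting survives a round trip through marshal, which can be shown without touching a pyc file at all. A sketch, with the source passed as a string and "a.py" used only as a label:

```python
import marshal
import types

source = "class A:\n    pass\n\ndef foo():\n    pass\n\na = A()\nfoo()\n"
module_code = compile(source, "a.py", "exec")

# Only the module-level code object is dumped explicitly ...
restored = marshal.loads(marshal.dumps(module_code))

# ... yet the class and function code objects come back inside co_consts.
nested = sorted(
    c.co_name for c in restored.co_consts if isinstance(c, types.CodeType)
)
print(nested)  # ['A', 'foo']
```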

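8.4 Python bytecode

Python bytecode is the focus of the later chapters on the virtual machine; here is a first look. We know that before executing source code Python compiles it into a bytecode sequence, and the virtual machine carries out a series of operations according to that sequence, thereby executing the program.

In the Python version studied here, 130 instructions are defined in total: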

#define POP_TOP                   1
#define ROT_TWO                   2
#define ROT_THREE                 3
#define DUP_TOP                   4
#define DUP_TOP_TWO               5
#define NOP                       9
#define UNARY_POSITIVE           10
#define UNARY_NEGATIVE           11
#define UNARY_NOT                12
#define UNARY_INVERT             15
#define BINARY_MATRIX_MULTIPLY   16
#define INPLACE_MATRIX_MULTIPLY  17
#define BINARY_POWER             19
#define BINARY_MULTIPLY          20
#define BINARY_MODULO            22
#define BINARY_ADD               23
#define BINARY_SUBTRACT          24
#define BINARY_SUBSCR            25
#define BINARY_FLOOR_DIVIDE      26
#define BINARY_TRUE_DIVIDE       27
#define INPLACE_FLOOR_DIVIDE     28
...
...

Anyone who has used the dis module will find these familiar. This is only a first look; later chapters cover them in detail.
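For readers who have not used dis yet, a quick sketch of how to list the instructions of a function. The exact instruction names and numbers differ between Python versions, so no output is shown here; add is an arbitrary example function.

```python
import dis


def add(a, b):
    return a + b


# Collect the instruction names the compiler produced for add().
names = [ins.opname for ins in dis.get_instructions(add)]
print(names)

# Or print the usual human-readable disassembly.
dis.dis(add)
```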

Origin www.cnblogs.com/traditional/p/11806454.html