"Python source code analysis - bytecode and virtual machine"

Python source code is first compiled into bytecode, and the virtual machine then interprets that bytecode one instruction at a time. Compiled bytecode is stored on disk in files with extensions such as .pyc and .pyo. At runtime, the bytecode exists as a PyCodeObject. A PyCodeObject can be thought of as the counterpart of the text segment in C: it stores the compiled bytecode, debugging information, constant values, variable names and so on.

This article does not describe how source code is compiled step by step into a PyCodeObject. It briefly explains the meaning of each PyCodeObject field, then focuses on the Python virtual machine and its execution flow.

Python's bytecode object: PyCodeObject


A PyCodeObject stores the static information produced by compilation. At runtime it is combined with a context to form the complete execution environment. Let's look at what information is produced statically.

    typedef struct {
        PyObject_HEAD
        int co_argcount;          // number of arguments, not counting variable arguments
        int co_nlocals;           // number of local variables: co_argcount + variable-argument
                                  // parameters + co_kwonlyargcount (Python 3) + other locals
        int co_stacksize;         // stack size (maximum stack depth, computed at compile time)
        int co_flags;             // flag bits of the PyCodeObject, used to optimize run-time behavior
        PyObject *co_code;        // compiled bytecode, stored as a string
        PyObject *co_consts;      // constants
        PyObject *co_names;       // names used by the code (string objects)
        PyObject *co_varnames;    // tuple of variable names
        PyObject *co_freevars;    // tuple of free variables
        PyObject *co_cellvars;    // tuple of cell variables
        /* The rest doesn't count for hash/cmp */
        PyObject *co_filename;    // file name
        PyObject *co_name;        // object name, e.g. the function or class name
        int co_firstlineno;       // line number in the source file where this code starts
        PyObject *co_lnotab;      // mapping between bytecode offsets and line numbers
        void *co_zombieframe;     // optimization for some special cases
        PyObject *co_weakreflist; // weak reference support
    } PyCodeObject;

Some fields require special explanation.

  • co_flags stores information from the compiler, mainly used for optimization. For example, CO_VARARGS (0x0004) indicates that the function takes variable positional arguments. See code.h for the full list.
  • co_freevars holds free variables: variables used in this scope but not defined in it.
  • co_cellvars holds variables defined in the current scope that are also used by nested closures.
  • co_lnotab maps bytecode offsets to source line numbers, storing each entry as an increment relative to the previous one. For example:

    bytecode offset in co_code    actual source line    line increment
    0                             1                     0
    6                             2                     1
    50                            7                     5

So what co_lnotab actually records is (0, 0), (6, 1), (44, 5): pairs of (bytecode offset increment, line number increment). The parentheses are of course not stored; the values are packed one after another in a byte string. The mapping from a bytecode offset to the actual line number can be computed with the following algorithm.

    // codeobject.c
    int
    PyCode_Addr2Line(PyCodeObject *co, int addrq)
    {
        int size = PyString_Size(co->co_lnotab) / 2;
        unsigned char *p = (unsigned char*)PyString_AsString(co->co_lnotab);
        int line = co->co_firstlineno;
        int addr = 0;
        while (--size >= 0) {
            addr += *p++;
            if (addr > addrq)
                break;
            line += *p++;
        }
        return line;
    }
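
As a cross-check, the same lookup can be sketched in Python. This is a minimal port under assumptions, not CPython's own code: the name `addr_to_line` is made up for illustration, and the sample table reuses the three increment pairs from the example above, with co_firstlineno assumed to be 1.

```python
def addr_to_line(lnotab, first_lineno, addrq):
    """Map a bytecode offset to a source line (sketch of PyCode_Addr2Line)."""
    line = first_lineno
    addr = 0
    # lnotab is a flat byte string: offset increment, line increment, ...
    for addr_incr, line_incr in zip(lnotab[0::2], lnotab[1::2]):
        addr += addr_incr
        if addr > addrq:
            break
        line += line_incr
    return line


# Assumed sample table: the pairs (0, 0), (6, 1), (44, 5) from the text.
SAMPLE_LNOTAB = bytes([0, 0, 6, 1, 44, 5])

lines = [addr_to_line(SAMPLE_LNOTAB, 1, offset) for offset in (0, 5, 6, 49, 50)]
print(lines)
```

With this table, offsets 0 and 5 fall on line 1, offsets 6 and 49 on line 2, and offset 50 on line 7, which agrees with the three rows of the example table.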
  • co_code records the compiled bytecode. It is stored as a string, but it is really just a sequence of numbers. A detailed example follows later.

A PyCodeObject example


We start with a sample of Python code and print each of its fields.

    from __future__ import print_function
    import dis

    def out(a, b=1, *args, **kwargs):
        c = 2

        def inner(d, e=3, *iargs, **ikwargs):
            f = 4
            g = c

        print('inner-->co_argcount :', inner.__code__.co_argcount)
        # print('inner-->co_kwonlyargcount :', inner.__code__.co_kwonlyargcount)
        print('inner-->co_nlocals :', inner.__code__.co_nlocals)
        print('inner-->co_stacksize :', inner.__code__.co_stacksize)
        print('inner-->co_flags :', inner.__code__.co_flags)
        print('inner-->co_code :', inner.__code__.co_code)
        print('inner-->co_consts :', inner.__code__.co_consts)
        print('inner-->co_names :', inner.__code__.co_names)
        print('inner-->co_varnames :', inner.__code__.co_varnames)
        print('inner-->co_freevars :', inner.__code__.co_freevars)
        print('inner-->co_cellvars :', inner.__code__.co_cellvars)
        print('inner-->co_filename :', inner.__code__.co_filename)
        print('inner-->co_name :', inner.__code__.co_name)
        print('inner-->co_firstlineno :', inner.__code__.co_firstlineno)
        print('inner-->co_lnotab :', inner.__code__.co_lnotab)

    print('out-->co_argcount :', out.__code__.co_argcount)
    # print('out-->co_kwonlyargcount :', out.__code__.co_kwonlyargcount)
    print('out-->co_nlocals :', out.__code__.co_nlocals)
    print('out-->co_stacksize :', out.__code__.co_stacksize)
    print('out-->co_flags :', out.__code__.co_flags)
    print('out-->co_code :', out.__code__.co_code)
    print('out-->co_consts :', out.__code__.co_consts)
    print('out-->co_names :', out.__code__.co_names)
    print('out-->co_varnames :', out.__code__.co_varnames)
    print('out-->co_freevars :', out.__code__.co_freevars)
    print('out-->co_cellvars :', out.__code__.co_cellvars)
    print('out-->co_filename :', out.__code__.co_filename)
    print('out-->co_name :', out.__code__.co_name)
    print('out-->co_firstlineno :', out.__code__.co_firstlineno)
    print('out-->co_lnotab :', out.__code__.co_lnotab)
    print('=========================================================')
    out(1, 2, 3, 4, 5, 6, 7, e=8, f=9)

    print()
    print('disassemble:')
    dis.dis(out)  # dis.dis prints the listing itself and returns None

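First, a note on co_kwonlyargcount. This field exists only in Python 3. It counts keyword-only parameters, that is, parameters defined after the variable positional parameter, for example def func(*args, kwonly=None).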
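
To make the keyword-only count concrete, here is a small Python 3 sketch built on the def func(*args, kwonly=None) signature quoted above. The body of `func` and the variable `counts` are illustrative additions.

```python
def func(*args, kwonly=None):
    # kwonly comes after *args, so it can only be passed by keyword.
    return args, kwonly


code = func.__code__
# (regular positional parameters, keyword-only parameters)
counts = (code.co_argcount, code.co_kwonlyargcount)
print(counts)
```

Here `func` has no regular positional parameters and one keyword-only parameter, so the pair is (0, 1). Note that *args itself is counted in neither field.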

The output of this example shows the contents of each field in detail.

    out-->co_argcount : 2 # a, b
    out-->co_nlocals : 5 # a, b, args, kwargs, inner
    out-->co_stacksize : 3
    out-->co_flags : 65551 # 0b10000000000001111: CO_FUTURE_PRINT_FUNCTION|CO_VARKEYWORDS|CO_VARARGS|CO_NEWLOCALS|CO_OPTIMIZED
    out-->co_code : ddfd}t... # truncated; analyzed later
    out-->co_consts : (None, 2, 3, <code object inner>, 'inner-->co_argcount :', # other 'inner-->' strings omitted) # constants; the implicit return value None is included
    out-->co_names : ('print', '__code__', 'co_argcount', 'co_nlocals', 'co_stacksize', 'co_flags', 'co_code', 'co_consts', 'co_names', 'co_varnames', 'co_freevars', 'co_cellvars', 'co_filename', 'co_name', 'co_firstlineno', 'co_lnotab') # names used
    out-->co_varnames : ('a', 'b', 'args', 'kwargs', 'inner') # variable names: parameters and local variables
    out-->co_freevars : () # none
    out-->co_cellvars : ('c',) # variables used by nested scopes
    out-->co_filename : pycode.py
    out-->co_name : out
    out-->co_firstlineno : 3 # first line number
    out-->co_lnotab : # omitted
    =========================================================
    inner-->co_argcount : 2 # d, e
    inner-->co_nlocals : 6 # d, e, iargs, ikwargs, f, g
    inner-->co_stacksize : 1 #
    inner-->co_flags : 65567 # 0b10000000000011111: CO_FUTURE_PRINT_FUNCTION|CO_NESTED|CO_VARKEYWORDS|CO_VARARGS|CO_NEWLOCALS|CO_OPTIMIZED
    inner-->co_code : d}}dS # truncated
    inner-->co_consts : (None, 4) # constants
    inner-->co_names : () #
    inner-->co_varnames : ('d', 'e', 'iargs', 'ikwargs', 'f', 'g') # variable names
    inner-->co_freevars : ('c',) # free variables: references to variables of the enclosing scope
    inner-->co_cellvars : () # none
    inner-->co_filename : pycode.py
    inner-->co_name : inner
    inner-->co_firstlineno : 6 # first line number
    inner-->co_lnotab : # omitted
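
The two co_flags values in this output can be decoded mechanically. The sketch below is an illustration, not part of CPython: the helper `decode_flags` is hypothetical, and the table lists only the six flags that appear in the output, with the bit values they have in Python 2.7's code.h (newer versions may assign different values to the future-feature flags).

```python
# Flag bits as defined in Python 2.7's code.h (subset, assumed for this example).
CO_FLAGS = {
    0x0001: "CO_OPTIMIZED",
    0x0002: "CO_NEWLOCALS",
    0x0004: "CO_VARARGS",
    0x0008: "CO_VARKEYWORDS",
    0x0010: "CO_NESTED",
    0x10000: "CO_FUTURE_PRINT_FUNCTION",
}


def decode_flags(flags):
    """Return the names of the known flag bits set in `flags`, lowest bit first."""
    return [name for bit, name in sorted(CO_FLAGS.items()) if flags & bit]


out_flags = decode_flags(65551)
inner_flags = decode_flags(65567)
print(out_flags)
print(inner_flags)
```

The only difference between the two results is CO_NESTED, which is set for inner because it is defined inside another function.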

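This example makes the meaning of constants, variables, free variables and cell variables clear. Next we look at co_code: we convert it to hexadecimal with the Linux xxd tool and disassemble the bytecode with the dis module.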

    import dis

    def out(a, b=1, *args, **kwargs):
        c = 2

        def inner(d, e=3, *iargs, **ikwargs):
            f = 4
            g = c

    print out.__code__.co_code
    dis.dis(out)

    # hexadecimal contents of co_code
    0000000: 6401 0089 0000 6402 0087 0000 6601 0064  d.....d.....f..d
    0000010: 0300 8601 007d 0400 6400 0053 0a         .....}..d..S.

    # disassembled bytecode
      4           0 LOAD_CONST               1 (2)
                  3 STORE_DEREF              0 (c)

      6           6 LOAD_CONST               2 (3)
                  9 LOAD_CLOSURE             0 (c)
                 12 BUILD_TUPLE              1
                 15 LOAD_CONST               3 (<code object inner at 00000000039E69B0, file "<ipython-input-2-656e8bface8a>", line 6>)
                 18 MAKE_CLOSURE             1
                 21 STORE_FAST               4 (inner)
                 24 LOAD_CONST               0 (None)
                 27 RETURN_VALUE

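  • The first byte is 0x64, which is 100 in decimal. opcode.h contains #define LOAD_CONST 100, which matches the LOAD_CONST instruction in the disassembly.
  • The next two bytes, 01 00, are the argument of LOAD_CONST: a 16-bit little-endian value, here 1.
  • The 00 byte is therefore the high byte of the argument, not a STOP_CODE terminator. Instructions that take no argument, such as the final RETURN_VALUE (0x53), occupy a single byte, which is why the offsets in the disassembly advance by 3 until the last instruction.

The rest of the bytecode and the corresponding operations can be decoded in the same way. The format of an instruction with an argument is therefore:

    opcode (64)    argument low byte (01)    argument high byte (00)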
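
To check this reading of the hex dump, the bytes can be decoded by hand. The sketch below is a minimal decoder written for this article, not the dis module: `decode` is a hypothetical helper, the opcode table contains only the seven opcodes that occur in the dump (numbers as in Python 2.7's opcode.h), and the trailing 0a byte, which is the newline added by print, is left out.

```python
HAVE_ARGUMENT = 90  # in Python 2.7, opcodes >= 90 carry a 2-byte argument

# Subset of Python 2.7's opcode.h, just enough for this dump.
OPNAMES = {
    83: "RETURN_VALUE",
    100: "LOAD_CONST",
    102: "BUILD_TUPLE",
    125: "STORE_FAST",
    134: "MAKE_CLOSURE",
    135: "LOAD_CLOSURE",
    137: "STORE_DEREF",
}

# co_code of out(), copied from the xxd output (without the final 0a).
CO_CODE = bytes.fromhex(
    "64 01 00 89 00 00 64 02 00 87 00 00 66 01 00 64"
    "03 00 86 01 00 7d 04 00 64 00 00 53"
)


def decode(code):
    """Return a list of (offset, opname, argument or None) tuples."""
    result = []
    i = 0
    while i < len(code):
        op = code[i]
        if op >= HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)  # little-endian 16-bit
            result.append((i, OPNAMES[op], arg))
            i += 3
        else:
            result.append((i, OPNAMES[op], None))
            i += 1
    return result


instructions = decode(CO_CODE)
for offset, name, arg in instructions:
    print(offset, name, "" if arg is None else arg)
```

The offsets, names and arguments produced this way line up with the dis listing: ten instructions at offsets 0 through 27, ending with a one-byte RETURN_VALUE.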

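So far we have covered the bytecode data structure, the meaning of each field, the format of the co_code bytecode and how it maps to instructions. Next, let's see how this bytecode runs.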

PyFrameObject


Python emulates the C call stack as its runtime environment. Each stack frame is represented by a PyFrameObject structure.

    typedef struct _frame {
        PyObject_VAR_HEAD
        struct _frame *f_back;      // previous frame (the caller)
        PyCodeObject *f_code;       // the PyCodeObject being executed
        PyObject *f_builtins;       // builtins namespace
        PyObject *f_globals;        // global variables
        PyObject *f_locals;         // local variables
        PyObject **f_valuestack;    // start of the value stack, just after the last local variable
        PyObject **f_stacktop;      // stack pointer: the next free slot in the stack
        PyObject *f_trace;          // trace function
        PyObject *f_exc_type, *f_exc_value, *f_exc_traceback;  // exception handling state

        PyThreadState *f_tstate;    // current thread
        int f_lasti;                // offset of the bytecode instruction being executed
        int f_lineno;               // current line number
        int f_iblock;               // index into f_blockstack (the local block stack)
        PyTryBlock f_blockstack[CO_MAXBLOCKS];  /* for try and loop blocks */
        PyObject *f_localsplus[1];  // locals plus stack; size is local variables + co_stacksize
    } PyFrameObject;

The corresponding structure diagram:

image

When a function call is executed, a new stack frame is entered, and the current frame becomes the f_back field of that new frame.

image
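
The f_back link can be observed from Python itself. The following sketch assumes CPython, where `sys._getframe()` returns the frame of the calling function; the function names `caller` and `callee` are illustrative.

```python
import sys


def callee():
    frame = sys._getframe()  # the frame currently executing callee()
    # f_back points at the frame that made the call.
    return frame.f_code.co_name, frame.f_back.f_code.co_name


def caller():
    return callee()


names = caller()
print(names)
```

The pair contains the name of the running function followed by the name of the function that called it, so walking f_back repeatedly reconstructs the call chain.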

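A chain of stack frames belongs to one thread. Several threads may exist at the same time, and each thread has its own chain of frames. Together these form the runtime environment of the Python virtual machine.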

image
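
The one-chain-per-thread picture can also be inspected at runtime. This is a sketch assuming CPython, where `sys._current_frames()` returns a mapping from thread id to that thread's topmost frame; the `worker` function and the `frame_chain` helper are made up for the example.

```python
import sys
import threading

started = threading.Event()
release = threading.Event()


def worker():
    started.set()
    release.wait()  # stay alive until the main thread has taken its snapshot


def frame_chain(frame):
    """Follow f_back links and collect the function name of each frame."""
    names = []
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


thread = threading.Thread(target=worker)
thread.start()
started.wait()

# One entry per live thread, each entry being the head of a frame chain.
chains = {tid: frame_chain(top) for tid, top in sys._current_frames().items()}

release.set()
thread.join()

print(chains[thread.ident])
```

Each thread id maps to a separate chain, and the chain of the second thread contains worker somewhere below whatever it is currently executing.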

How Python executes bytecode


As the figure above shows, bytecode execution consists of one big loop around a switch statement. The logical skeleton is fairly simple.

    for (;;) {

        switch (opcode) {

        case 100:  // LOAD_CONST
        {
            ...  // the actual operation
            break;
        }

        case 101:  // LOAD_NAME
        {
            ...
            break;
        }
        ...
        }
    }
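
The loop-and-dispatch skeleton can be turned into a toy interpreter. The sketch below is only a model of the idea, not CPython's evaluation loop: `run` is a hypothetical function, it supports five opcodes (numbered as in Python 2.7's opcode.h), and the program is assembled by hand rather than produced by the compiler.

```python
# Opcode numbers as defined in Python 2.7's opcode.h.
BINARY_ADD = 23
RETURN_VALUE = 83
HAVE_ARGUMENT = 90  # opcodes >= 90 are followed by a 2-byte argument
LOAD_CONST = 100
LOAD_FAST = 124
STORE_FAST = 125


def run(code, consts, nlocals):
    """Interpret a tiny subset of Python 2.7 bytecode (toy model)."""
    stack = []                     # plays the role of the value stack
    fastlocals = [None] * nlocals  # plays the role of the local variable slots
    pc = 0                         # plays the role of next_instr
    while True:                    # the big loop
        opcode = code[pc]
        pc += 1
        oparg = None
        if opcode >= HAVE_ARGUMENT:
            oparg = code[pc] | (code[pc + 1] << 8)
            pc += 2
        # The "switch": one branch per opcode.
        if opcode == LOAD_CONST:
            stack.append(consts[oparg])
        elif opcode == STORE_FAST:
            fastlocals[oparg] = stack.pop()
        elif opcode == LOAD_FAST:
            stack.append(fastlocals[oparg])
        elif opcode == BINARY_ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == RETURN_VALUE:
            return stack.pop()
        else:
            raise ValueError("unsupported opcode %d" % opcode)


# Hand-assembled program, equivalent to: x = 2; return x + 3
program = bytes([
    LOAD_CONST, 1, 0,
    STORE_FAST, 0, 0,
    LOAD_FAST, 0, 0,
    LOAD_CONST, 2, 0,
    BINARY_ADD,
    RETURN_VALUE,
])
result = run(program, consts=(None, 2, 3), nlocals=1)
print(result)
```

The real loop differs in many ways (reference counting, exception handling, block stack, computed gotos), but the fetch, decode and dispatch cycle has the same shape.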

Next, we use the disassembled code to trace how it executes step by step.

    # disassembled bytecode
      4           0 LOAD_CONST               1 (2)
                  3 STORE_DEREF              0 (c)

      6           6 LOAD_CONST               2 (3)
                  9 LOAD_CLOSURE             0 (c)
                 12 BUILD_TUPLE              1
                 15 LOAD_CONST               3 (<code object inner at 00000000039E69B0, ...>)
                 18 MAKE_CLOSURE             1
                 21 STORE_FAST               4 (inner)
                 24 LOAD_CONST               0 (None)
                 27 RETURN_VALUE

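By tracing the execution of each instruction and the corresponding changes in the PyFrameObject stack frame, we can watch the virtual machine run step by step.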

    PyObject *
    PyEval_EvalFrame(PyFrameObject *f) {

        co = f->f_code;
        names = co->co_names;
        consts = co->co_consts;
        fastlocals = f->f_localsplus;
        // note: freevars points at the cell-variable slots, not at the free-variable slots
        freevars = f->f_localsplus + co->co_nlocals;
        first_instr = (unsigned char*) PyString_AS_STRING(co->co_code);
        // f->f_lasti defaults to -1
        next_instr = first_instr + f->f_lasti + 1;
        // top of the evaluation stack
        stack_pointer = f->f_stacktop;

        for (;;) {

        fast_next_opcode:
            f->f_lasti = INSTR_OFFSET();

            opcode = NEXTOP();       // fetch the opcode
            oparg = 0;
            if (HAS_ARG(opcode))     // if the opcode takes an argument, fetch it
                oparg = NEXTARG();

            TARGET(LOAD_CONST)       // LOAD_CONST at offsets 0, 6, 15, 24 of the disassembly
            {
                x = GETITEM(consts, oparg);  // fetch the value from consts and push it
                Py_INCREF(x);
                PUSH(x);
                FAST_DISPATCH();     // goto fast_next_opcode
            }

            ...


            TARGET(STORE_DEREF)      // offset 3
            {
                w = POP();           // pop a value and store it as the cell object's value
                x = freevars[oparg];
                PyCell_Set(x, w);
                Py_DECREF(w);
                DISPATCH();
            }

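How the PyFrameObject structure changes at initialization and after executing the bytecode at offsets 0 and 3:

  • LOAD_CONST pushes the corresponding value from co_consts onto the stack.
  • STORE_DEREF dereferences the cell and stores the value popped from the stack into it.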

image
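
The cell object that STORE_DEREF writes to is visible from Python 3 through a function's `__closure__` attribute. The sketch below is an illustration under that assumption: `set_c` is an extra helper added for the demonstration, and `nonlocal` is used so that the assignment goes through the shared cell.

```python
def out():
    c = 2  # c is a cell variable: the value is stored inside a cell object

    def inner():
        return c  # reads c through the cell

    def set_c(value):
        nonlocal c
        c = value  # writes c through the same cell

    return inner, set_c


inner, set_c = out()
cell = inner.__closure__[0]
before = cell.cell_contents
set_c(10)
after = (cell.cell_contents, inner())
shared = inner.__closure__[0] is set_c.__closure__[0]
print(before, after, shared)
```

Both nested functions hold the same cell, so a value stored by one is seen by the other. That indirection is the reason c lives in a cell instead of an ordinary local slot.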

            TARGET(LOAD_CLOSURE)     // offset 9
            {
                x = freevars[oparg];
                Py_INCREF(x);
                PUSH(x);
                if (x != NULL) DISPATCH();
                break;
            }


            TARGET(BUILD_TUPLE)      // offset 12
            {
                x = PyTuple_New(oparg);  // create a tuple and fill it with elements popped from the stack
                if (x != NULL) {
                    for (; --oparg >= 0;) {
                        w = POP();
                        PyTuple_SET_ITEM(x, oparg, w);
                    }
                    PUSH(x);
                    DISPATCH();
                }
                break;
            }

            TARGET(MAKE_CLOSURE)     // offset 18
            {
                v = POP(); /* code object */
                x = PyFunction_New(v, f->f_globals);  // create the function
                Py_DECREF(v);
                if (x != NULL) {
                    v = POP();
                    if (PyFunction_SetClosure(x, v) != 0) {
                        /* Can't happen unless bytecode is corrupt. */
                        why = WHY_EXCEPTION;
                    }
                    Py_DECREF(v);
                }
                if (x != NULL && oparg > 0) {
                    v = PyTuple_New(oparg);
                    if (v == NULL) {
                        Py_DECREF(x);
                        x = NULL;
                        break;
                    }
                    while (--oparg >= 0) {
                        w = POP();
                        PyTuple_SET_ITEM(v, oparg, w);
                    }
                    if (PyFunction_SetDefaults(x, v) != 0) {
                        /* Can't happen unless
                           PyFunction_SetDefaults changes. */
                        why = WHY_EXCEPTION;
                    }
                    Py_DECREF(v);
                }
                PUSH(x);
                break;
            }

  • LOAD_CLOSURE pushes the cell object from freevars onto the stack.
  • BUILD_TUPLE builds a tuple from elements on the stack and pushes it.
  • MAKE_CLOSURE creates a PyFunction object and sets its closure (func_closure) and default arguments.

image
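
The result of these three instructions can be inspected on the finished function object. The sketch below reuses the out/inner pair from the article but is written for Python 3, where the attributes are called `__closure__` and `__defaults__`; returning inner from out and the `summary` dictionary are additions made for the example.

```python
def out(a, b=1, *args, **kwargs):
    c = 2

    def inner(d, e=3, *iargs, **ikwargs):
        f = 4
        g = c

    return inner  # added so the nested function can be inspected


inner = out(1)
summary = {
    "out_cellvars": out.__code__.co_cellvars,          # names of cells owned by out
    "inner_freevars": inner.__code__.co_freevars,      # names inner takes from out
    "defaults": inner.__defaults__,                    # set from the tuple built for e=3
    "closure_len": len(inner.__closure__),             # one cell, built by BUILD_TUPLE
    "cell_value": inner.__closure__[0].cell_contents,  # the value stored for c
}
print(summary)
```

The default value 3 and the one-element closure tuple are exactly the two things the C code pops from the stack and attaches to the new function object.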


            TARGET(STORE_FAST)       // offset 21
            {
                v = POP();           // set the local variable
                SETLOCAL(oparg, v);
                FAST_DISPATCH();
            }

            TARGET_NOARG(RETURN_VALUE)  // offset 27
            {
                retval = POP();
                why = WHY_RETURN;
                goto fast_block_end;
            }
        }

image

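  • STORE_FAST pops one element from the stack and stores it in the corresponding local variable slot.
  • RETURN_VALUE returns, setting the exit reason to WHY_RETURN.

With the code and the diagrams above, the whole execution process becomes clear :)

(End)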

Origin www.cnblogs.com/cx2016/p/12082984.html