为什么Python代码在函数中运行得更快？

本文翻译自：Why does Python code run faster in a function?

def main():
    for i in xrange(10**8):
        pass
main()

This piece of code in Python runs in (Note: The timing is done with the time function in BASH in Linux.) Python中的这段代码在其中运行（注意：计时是通过Linux中的BASH中的time函数完成的。）

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn't placed within a function, 但是，如果for循环未放在函数内，

for i in xrange(10**8):
    pass

then it runs for a much longer time: 那么它会运行更长的时间：

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this? 为什么是这样？

#1楼

参考：https://stackoom.com/question/lAQt/为什么Python代码在函数中运行得更快

#2楼

Inside a function, the bytecode is 在函数内部，字节码为

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE

At top level, the bytecode is 在顶层，字节码是

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE

The difference is that STORE_FAST is faster (!) than STORE_NAME . 区别在于STORE_FAST （！）比STORE_NAME更快（！）。 This is because in a function, i is a local but at toplevel it is a global. 这是因为在函数中， i是局部变量，但在顶层是全局变量。

To examine bytecode, use the dis module . 要检查字节码，请使用dis模块。 I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the compile builtin . 我可以直接反汇编该函数，但是要反汇编顶层代码，我必须使用compile Builtin 。

#3楼

You might ask why it is faster to store local variables than globals. 您可能会问为什么存储局部变量比全局变量更快。 This is a CPython implementation detail. 这是CPython实现的细节。

Remember that CPython is compiled to bytecode, which the interpreter runs. 请记住，CPython被编译为解释器运行的字节码。 When a function is compiled, the local variables are stored in a fixed-size array ( not a dict ) and variable names are assigned to indexes. 编译函数时，局部变量存储在固定大小的数组（不是 dict ）中，并且变量名称分配给索引。 This is possible because you can't dynamically add local variables to a function. 这是可能的，因为您无法将局部变量动态添加到函数中。 Then retrieving a local variable is literally a pointer lookup into the list and a refcount increase on the PyObject which is trivial. 然后检索一个局部变量实际上是对列表的指针查找和对PyObject的引用计数的增加，这是微不足道的。

Contrast this to a global lookup ( LOAD_GLOBAL ), which is a true dict search involving a hash and so on. 将此与全局查找（ LOAD_GLOBAL ）进行对比，这是一个真正的dict搜索，涉及哈希等等。 Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FAST s for its access unless you tell it not to. 顺便说一句，这就是为什么如果要使其成为全局变量，则需要指定global i变量global i原因：如果您在作用域内分配了变量，则编译器将发出STORE_FAST进行访问，除非您告知不STORE_FAST 。

By the way, global lookups are still pretty optimised. 顺便说一句，全局查找仍然非常优化。 Attribute lookups foo.bar are the really slow ones! 属性查询foo.bar 真的很慢！

Here is small illustration on local variable efficiency. 这是关于局部变量效率的小插图。

#4楼

Aside from local/global variable store times, opcode prediction makes the function faster. 除了局部/全局变量存储时间外， 操作码预测还使函数运行更快。

As the other answers explain, the function uses the STORE_FAST opcode in the loop. 正如其他答案所解释的，该函数在循环中使用STORE_FAST操作码。 Here's the bytecode for the function's loop: 这是函数循环的字节码：

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the a stack and preforming other checks on the stack frame after each opcode is executed. 通常，在运行程序时，Python会依次执行每个操作码，跟踪堆栈并在执行每个操作码后对堆栈帧执行其他检查。 Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead. 操作码预测意味着在某些情况下，Python能够直接跳转到下一个操作码，从而避免了其中的一些开销。

In this case, every time Python sees FOR_ITER (the top of the loop), it will "predict" that STORE_FAST is the next opcode it has to execute. 在这种情况下，每次Python看到FOR_ITER （循环的顶部）时，它将“预测” STORE_FAST是它必须执行的下一个操作码。 Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST . 然后，Python窥视下一个操作码，如果预测正确，它将直接跳转到STORE_FAST 。 This has the effect of squeezing the two opcodes into a single opcode. 这具有将两个操作码压缩为单个操作码的效果。

On the other hand, the STORE_NAME opcode is used in the loop at the global level. 另一方面，在全局级别的循环中使用了STORE_NAME操作码。 Python does *not* make similar predictions when it sees this opcode. 看到此操作码时，Python *不会*做出类似的预测。 Instead, it must go back to the top of the evaluation-loop which has obvious implications for the speed at which the loop is executed. 相反，它必须返回到评估循环的顶部，该循环对循环的执行速度有明显的影响。

To give some more technical detail about this optimization, here's a quote from the ceval.c file (the "engine" of Python's virtual machine): 为了提供有关此优化的更多技术细节，以下是ceval.c文件（Python虚拟机的“引擎”）的ceval.c ：

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. 一些操作码往往成对出现，因此可以在运行第一个代码时预测第二个代码。 For example, GET_ITER is often followed by FOR_ITER . 例如， GET_ITER之后通常是FOR_ITER 。 And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE . 并且FOR_ITER后面通常是STORE_FAST或UNPACK_SEQUENCE 。

Verifying the prediction costs a single high-speed test of a register variable against a constant. 验证预测需要对寄存器变量进行一个针对常数的高速测试。 If the pairing was good, then the processor's own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. 如果配对良好，则处理器自己的内部分支谓词成功的可能性很高，从而导致到下一个操作码的开销几乎为零。 A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. 成功的预测可以节省通过评估循环的旅程，该评估循环包括其两个不可预测的分支， HAS_ARG测试和开关情况。 Combined with the processor's internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined. 结合处理器的内部分支预测，成功的PREDICT可以使两个操作码像合并了主体的单个新操作码一样运行。

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made: 我们可以在FOR_ITER操作码的源代码中看到准确的FOR_ITER预测STORE_FAST ：

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally

The PREDICT function expands to if (*next_instr == op) goto PRED_##op ie we just jump to the start of the predicted opcode. PREDICT函数扩展为if (*next_instr == op) goto PRED_##op即我们只是跳转到预测操作码的开头。 In this case, we jump here: 在这种情况下，我们跳到这里：

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. 现在设置了局部变量，下一个操作码可以执行了。 Python continues through the iterable until it reaches the end, making the successful prediction each time. Python继续执行迭代直到到达终点，每次都成功进行预测。

The Python wiki page has more information about how CPython's virtual machine works. Python Wiki页面包含有关CPython虚拟机如何工作的更多信息。

xfxf996

发布了0 篇原创文章 · 获赞 137 · 访问量 84万+

私信关注

为什么Python代码在函数中运行得更快？

#1楼

#2楼

#3楼

#4楼

猜你喜欢