I didn't expect that, beyond the Shannon Plan, Python 3.11 would have so many other performance improvements!

As we all know, Python 3.11 brings major performance improvements, but where exactly do they come from? Besides the famous "Shannon Plan", what other performance-related optimizations does it contain? This article walks through them.

By Beshr Kayali

Translator: pea flower under the cat @Python cat

Original (English): https://log.beshr.com/python-311-speedup-part-1

Please keep the author and translator information when reprinting!

Python 3.11 was released a few days ago, and as usual it brings many new features, such as exception groups, fine-grained error locations in tracebacks, standard-library support for TOML parsing, and of course the much-anticipated speedups from the Faster CPython project.

According to the pyperformance benchmarks, CPython 3.11 is on average 25% faster than CPython 3.10. One reason for this improvement is what Guido named the "Shannon Plan" (i.e. Faster CPython). For version 3.11, the plan focused on heavy optimization in two main areas: startup and runtime.

In addition, Python 3.11 contains other optimizations that are not part of the Shannon Plan.

In this article, I'll go into the details of the general optimizations in the 3.11.0 stable release (i.e. improvements that are not part of the Faster CPython project).

(Annotation: The author says he will write another article covering the Faster CPython improvements in detail. I will translate that one as well, so stay tuned!)

Table of contents

  • Optimized some printf-style % format codes
  • Optimized division of large integers in Python
  • Optimized summation of single-digit PyLongs
  • Simplified list resizing, improving the performance of list.append
  • Reduced memory footprint for dictionaries with all-Unicode keys
  • Improved the speed of transferring large files using asyncio.DatagramProtocol
  • For the math library: optimized comb(n, k) and perm(n, k=None)
  • For the statistics library: optimized mean(data), variance(data, xbar=None) and stdev(data, xbar=None)
  • unicodedata.normalize() now constant-time for pure-ASCII strings

Optimized some printf-style % format codes

Using formatted string literals (f-strings) is the fastest way to format strings.

A simple benchmark in Python 3.10:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 187 ns +- 8 ns

But using an f-string seems to be about 42% faster:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- 'f"{k!s} = {v!r}"'
.....................
Mean +- std dev: 131 ns +- 9 ns

The optimization works by converting simple C-style formatting into the f-string mechanism. In 3.11.0, only %s, %r and %a are converted, but there is an open pull request that will add support for %d, %i, %u, %o, %x, %X, %f, %e, %g, %F, %E and %G.
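For a concrete feel of what gets converted, the two spellings below produce identical output; this is just an illustration of the equivalence that the optimization exploits, not the interpreter's actual rewrite:

k, v = "foo", "bar"

old_style = "%s = %r" % (k, v)   # printf-style formatting
f_style = f"{k!s} = {v!r}"       # the equivalent f-string

assert old_style == f_style == "foo = 'bar'"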

For example, here are the results of the same benchmark in Python 3.11:

$ python -m pyperf timeit -s \
  'k = "foo"; v = "bar"' -- '"%s = %r" % (k, v)'
.....................
Mean +- std dev: 100 ns +- 5 ns

About 87% faster! Of course, other optimizations in 3.11 also contribute to this, such as faster interpreter startup times.

Optimized division of large integers in Python

In Python 3.10:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 1.18 us +- 0.02 us

In Python 3.11:

python -m pyperf timeit -s 'x=10**1000' -- 'x//10'
.....................
Mean +- std dev: 995 ns +- 15 ns

About 18% faster.

This optimization stems from Mark Dickinson's observation that the compiler always generates a 128-bit-by-64-bit divide instruction, despite working with 30-bit values.

Even on x64, Python's division is somewhat crippled. Assuming 30-bit numbers, the basic structure required for multi-precision division is 64-bit divided by 32-bit unsigned integer division, yielding a 32-bit quotient (and ideally a 32-bit remainder). There is an x86/x64 instruction to do this, namely DIVL. But current versions of GCC and Clang apparently can't emit that instruction from longobject.c without using inline assembly -- they'll just use DIVQ (128-bit by 64-bit division, although the first 64 bits of the dividend are set to zero) on x64, and the intrinsic __udivti3 or __udivti4 on x86.

—Mark Dickinson (full text)
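For context on the "30-bit" figure: on a typical 64-bit build, CPython stores big integers as arrays of 30-bit digits, which you can check via sys.int_info. Here is a minimal sketch that looks at the benchmark value above (the exact figures depend on your build):

import sys

print(sys.int_info.bits_per_digit)   # usually 30 on 64-bit builds
x = 10 ** 1000                       # the value divided in the benchmark above
digits = -(-x.bit_length() // sys.int_info.bits_per_digit)
print(digits)                        # number of 30-bit digits needed to store x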

Optimized summation of single-digit PyLongs

There is an issue here reporting that sum is much faster in Python 2.7 than in Python 3. Unfortunately, under certain conditions this still seems to be the case in 3.11.0.

Python 2.7:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 37.4 us +- 1.1 us

Python 3.10:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 52.7 us +- 1.3 us

Python 3.11:

$ python -m pyperf timeit -s 'd = [0] * 10000' -- 'sum(d)'
.....................
Mean +- std dev: 39.0 us +- 1.0 us

The difference between Python 3.10 and 3.11 is that calling sum on single-digit PyLongs is faster, thanks to inlining the unpacking of single-digit PyLongs in the fast addition branch of the sum function. Doing this avoids a call to PyLong_AsLongAndOverflow when unpacking.

It's worth noting that in some cases Python 3.11 is still significantly slower than Python 2.7 at summing integers. Hopefully a more efficient integer implementation will bring further improvements to Python.
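If you want to probe the scope of this fast path yourself, the optimization targets "single digit" PyLongs, i.e. values that fit in one 30-bit digit. A benchmark in the same format as above, but with values too large for a single digit, should see little or no benefit from this particular change (no numbers included here, since they depend on your machine):

$ python -m pyperf timeit -s 'd = [2**40] * 10000' -- 'sum(d)'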

Simplified list resizing, improving the performance of list.append

In Python 3.11, list.append has a significant performance boost (about 54% faster).

List append for Python 3.10:

$ python -m pyperf timeit -s \
  'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 605 us +- 20 us

List append for Python 3.11:

$ python -m pyperf timeit -s \
  'x = list(map(float, range(10_000)))' -- '[x.append(i) for i in range(10_000)]'
.....................
Mean +- std dev: 392 us +- 14 us

There are also some minor improvements for simple list comprehensions:

Python 3.10:

$ python -m pyperf timeit -s \
  '' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 553 us +- 19 us

Python 3.11:

$ python -m pyperf timeit -s \
  '' -- '[x for x in list(map(float, range(10_000)))]'
.....................
Mean +- std dev: 516 us +- 16 us

(Annotation: I recall that Python 3.9 already sped up calls to built-in types such as list(), dict() and range(), and Python keeps finding inconspicuous places to optimize!)

Reduced memory footprint for dictionaries with all-Unicode keys

This optimization makes Python's caches more effective for dictionaries whose keys are all Unicode strings. Less memory is used because the per-entry hashes can be dropped: those Unicode key objects already cache their own hashes.

For example, on a 64-bit platform, Python 3.10 results in:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
232

In Python 3.11:

>>> sys.getsizeof(dict(foo="bar", bar="foo"))
184

(Annotation: A digression: Python's getsizeof performs a "shallow" size calculation. The article "What should you watch out for when measuring memory in Python?" distinguishes "deep" from "shallow" measurement and can give you a deeper understanding of how Python memory is measured.)
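If you want to reproduce the comparison yourself, here is a small (shallow, per the note above) check; the exact byte counts depend on your build and platform, so none are hard-coded:

import sys

all_str_keys = {"foo": "bar", "bar": "foo"}   # every key is a Unicode string
mixed_keys = {"foo": "bar", 1: "foo"}         # one non-string key

print(sys.getsizeof(all_str_keys))   # expected to shrink on 3.11 thanks to this change
print(sys.getsizeof(mixed_keys))     # not eligible for the all-Unicode-keys layout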

Improved the speed of transferring large files using asyncio.DatagramProtocol

asyncio.DatagramProtocol provides a base class for implementing datagram (UDP) protocols. With this optimization, transferring large files (say, 60 MiB) over asyncio UDP is more than 100 times faster than in Python 3.10.

This is achieved by computing the size of the buffer once and storing it in an attribute, which yields an order-of-magnitude speedup when transferring large files over UDP.

The author of the PR, msoxzw, provided a test script.
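That script isn't reproduced here, but for readers unfamiliar with the API, below is a minimal sketch of a UDP echo server built on asyncio.DatagramProtocol. The address and port are arbitrary placeholders, and this is only an illustration of the base class, not the PR's benchmark:

import asyncio

class EchoServerProtocol(asyncio.DatagramProtocol):
    def connection_made(self, transport):
        self.transport = transport

    def datagram_received(self, data, addr):
        # Echo every datagram straight back to the sender.
        self.transport.sendto(data, addr)

async def main():
    loop = asyncio.get_running_loop()
    transport, protocol = await loop.create_datagram_endpoint(
        EchoServerProtocol, local_addr=("127.0.0.1", 9999)
    )
    try:
        await asyncio.sleep(3600)   # serve for an hour
    finally:
        transport.close()

asyncio.run(main())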

For the math library: optimized comb(n, k) and perm(n, k=None)

Python 3.8 added the comb(n, k) and perm(n, k=None) functions to the math module. Both count the number of ways to choose k elements from n distinct elements: comb counts unordered selections, while perm counts ordered ones. (Annotation: That is, one gives the number of combinations and the other the number of permutations.)
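A quick worked example of the difference (easy to verify by hand):

import math

# Choosing 2 out of 4 distinct elements:
print(math.comb(4, 2))   # 6 combinations: AB, AC, AD, BC, BD, CD (order ignored)
print(math.perm(4, 2))   # 12 permutations: each combination counted in both orders
assert math.perm(4, 2) == math.comb(4, 2) * math.factorial(2)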

The 3.11 optimizations consist of several smaller improvements, such as using a divide-and-conquer algorithm that benefits from Karatsuba multiplication of large numbers, and doing the computation with C unsigned long long types instead of Python integers when possible (*).

Another improvement is for small values of k (0 <= k <= n <= 67):

(Annotation: The following two paragraphs are rather hard to grasp; feel free to skip them.)

For 0 <= k <= n <= 67, comb(n, k) always fits into a uint64_t. We compute it as comb_odd_part << shift, where 2**shift is the largest power of two dividing comb(n, k) and comb_odd_part is comb(n, k) >> shift. comb_odd_part can be calculated efficiently via arithmetic modulo 2**64, using three lookups and two uint64_t multiplications, while the necessary shift can be computed via Kummer's theorem: it's the number of carries when adding k to n - k in binary, which in turn is the number of set bits of n ^ k ^ (n - k). (*)

One more improvement is that the previous popcount-based code for computing the largest power of two dividing math.comb(n, k) (for small n) got replaced with a more direct method based on counting trailing zeros of the factorials involved. (*).
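To make the quoted trick a bit more concrete, here is a small pure-Python check (just an illustration, requiring Python 3.10+ for int.bit_count) that the shift from Kummer's theorem really is the exponent of the largest power of two dividing comb(n, k):

import math

n, k = 10, 4
c = math.comb(n, k)                    # 210
shift = (n ^ k ^ (n - k)).bit_count()  # number of carries when adding k to n - k in binary
comb_odd_part = c >> shift

assert c == comb_odd_part << shift
assert comb_odd_part % 2 == 1          # so 2**shift is the largest power of two dividing c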

Python 3.10:

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 3.72 us +- 0.07 us

# ---

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 11.9 ms +- 0.1 ms

Python 3.11:

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(100, 55)'
.....................
Mean +- std dev: 476 ns +- 20 ns

# ---

$ python -m pyperf timeit -s \
  'import math' -- 'math.comb(10000, 5500)'
.....................
Mean +- std dev: 2.28 ms +- 0.10 ms

For the statistics library: optimized mean(data), variance(data, xbar=None) and stdev(data, xbar=None)

Python 3.11 optimized the mean, variance and stdev functions in the statistics module. If the input is an iterator, it is now consumed directly during the computation instead of first being converted to a list, which makes the computation about twice as fast as before. (*)

Python 3.10:

# Mean
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 255 us +- 11 us

# Variance
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 77.0 us +- 2.9 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 78.0 us +- 2.2 us

Python 3.11:

# Mean
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.mean(range(1_000))'
.....................
Mean +- std dev: 193 us +- 7 us

# Variance
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.variance((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 56.1 us +- 2.3 us

# Sample standard deviation (stdev)
$ python -m pyperf timeit -s \
  'import statistics' -- 'statistics.stdev((x * 0.1 for x in range(0, 10)))'
.....................
Mean +- std dev: 59.4 us +- 2.6 us

unicodedata.normalize() now constant-time for pure-ASCII strings

If the input to unicodedata.normalize() is a pure-ASCII string, the result is now returned immediately via the Unicode quick-check algorithm. The check is implemented with PyUnicode_IS_ASCII.
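For illustration: a pure-ASCII string is already in every normalization form, so it can be returned as-is, while non-ASCII input still goes through the full algorithm:

import unicodedata

print(unicodedata.normalize("NFC", "python"))    # 'python' (hits the ASCII fast path)
decomposed = "cafe\u0301"                        # 'e' followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)
print(composed, len(decomposed), len(composed))  # café 5 4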

Python 3.10:

$ python -m pyperf timeit -s \
  'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 83.3 ns +- 4.3 ns

Python 3.11:

$ python -m pyperf timeit -s \
  'import unicodedata' -- 'unicodedata.normalize("NFC", "python")'
.....................
Mean +- std dev: 34.2 ns +- 1.2 ns

Final words:

  • I wrote this article to deepen my understanding of the latest improvements in Python 3.11. If something is wrong, please let me know via email or Twitter. (Annotation: Likewise, this translation was done to aid my own learning and understanding. If there are any mistakes or omissions, please correct me!)
  • Comments on HackerNews
  • In the next article, I will analyze the optimizations brought by the Faster CPython project. Stay tuned!
