Reading notes: High Performance Python

Foreword

Recently I read the book High Performance Python (in Chinese translation). I actually had high hopes for it, but once I started reading I found the translation truly poor: plenty of Chinglish, and some passages that are simply not fluent. Still, for someone like me with average English, it is more efficient than reading the English original. The book covers ways to optimize Python for computing performance and for saving memory, and parts of it are genuinely instructive, so here is a casual write-up. (The order and contents below do not necessarily match the book, but most of the material can be found in the corresponding chapters.)

Python's performance limitations

Back when I was studying atmospheric modeling I touched several programming languages: C, C++, Fortran. But this is my first attempt at high performance in a dynamic language. Python is slow for a few reasons. The main one is that Python is a high-level language: the interpreter abstracts away the low-level elements used in computation and does a great deal of work for us, so what we face is effectively a black box. We know our operations and computations get done, but the process is not transparent to the developer. Another reason is that the Python interpreter introduced a global interpreter lock for thread safety, the "notorious GIL".
The first problem has fairly direct consequences. For example, because of the virtual machine, vectorized operations are not directly available in Python; in practice we can get them from an external library such as numpy, so this is not a fundamental problem. A further issue is data locality: Python's abstractions get in the way of arranging data so the CPU cache is filled for the next computation. Python is a garbage-collected language; objects are collected once nothing references them, and memory is automatically allocated and freed as needed, which leads to memory fragmentation and hurts transfers to the CPU cache. Finally, because Python is a high-level language, it lets developers prototype a program conveniently, but it also means the code is never properly optimized by a compiler. When static code is compiled, the compiler can do a lot: change the memory layout of objects and optimize the CPU instructions. Moreover, since Python is dynamically typed, a variable that starts out as one type may be a completely different type in the blink of an eye, which makes such optimization even harder.
As for the second problem, first a word about the GIL itself. As already said, GIL stands for global interpreter lock. It needs to be made clear that although the GIL is often cited as a performance flaw of Python, it is not a characteristic of the Python language; it is merely a property of the reference implementation of the interpreter (CPython). Some other Python interpreters, such as Jython, do not have this problem. But CPython is the dominant implementation, so what can you do?
Back to the GIL. The official explanation goes:

In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython’s memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)

As you can see, to keep data consistent and state synchronized across threads, the most natural solution people had at the time was a lock. So many libraries have since come to depend on this arrangement that the "feature" has become hard to remove, which is why it is not a very Python-like characteristic. Still, with the optimizations that began in Python 3, multithreading remains a viable way to speed up at least I/O-heavy workloads.
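
A minimal stdlib sketch of that boundary: threads overlap simulated I/O waits but cannot overlap pure bytecode execution under the GIL (timings are illustrative and vary by machine):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(_):
    time.sleep(0.2)                       # simulated blocking I/O; releases the GIL

def cpu_task(_):
    sum(i * i for i in range(10**6))      # pure bytecode; holds the GIL

for name, task in [("I/O-bound", io_task), ("CPU-bound", cpu_task)]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(task, range(4)))
    print(f"{name} with 4 threads: {time.perf_counter() - start:.2f}s")
# Expected: the I/O-bound run takes ~0.2s (the waits overlap), while the
# CPU-bound run takes about as long as doing the work serially.
```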

Some profiling tools

In the second chapter, the author introduces several useful profiling tools (written for Python 2, though some of them are also available in Python 3).

  • import time: use Python's time module directly. This is the most intuitive and convenient way, but inserting such calls throughout a code segment is not very elegant
  • Define a decorator. Under the hood this still calls the time module, but profiling can then be switched on for any function that needs it just by applying the @ decorator, which is tidier; and because the extra function call adds overhead, the time measured this way is normally a bit higher than with the first method (see the sketch after this list)
  • Use the timeit module, e.g. python -m timeit -n 5 -r 5 -s "import the module under test" "call the function with concrete arguments"; in the IPython interpreter, the %timeit magic can be used directly to save effort
  • Use the system's /usr/bin/time --verbose (note that this differs from the shell built-in time). It reports three times: real, user, and sys. On a single-core machine real ≈ user + sys, but on a multi-core machine the task may be spread across several cores, so this generally does not hold
  • Use the standard library's cProfile tool: python -m cProfile -s cumulative ex.py
  • For CPU-intensive programs, use line_profiler
  • heapy, for inspecting the heap
  • dowser, for plotting live instance variables
  • dis, for checking the bytecode
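
As a concrete version of the decorator approach above, here is a minimal timing sketch (stdlib only; the names timefn and busy_work are my own):

```python
import time
from functools import wraps

def timefn(fn):
    """Print how long each call to fn takes."""
    @wraps(fn)  # preserve fn's name and docstring
    def measured(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"@timefn: {fn.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return measured

@timefn
def busy_work(n):
    return sum(i * i for i in range(n))

busy_work(10**6)
```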

The tools above basically cover profiling at every granularity, and I have checked that each of them is usable with Python 3; after all, in a few months it will be time to say a thorough goodbye to Python 2.
The first four tools only give a rough picture of run time: they let us know which parts of the code run for a long time, but not really why. For that we need cProfile to pin down exactly where the calls happen, then the line-by-line analysis of line_profiler, plus a dis inspection of the bytecode where necessary; in essence, an optimization process from coarse to fine. One thing worth noting along the way: many of these tools are used as decorators, and a decorator left in the code can break the tests, which is a genuinely annoying presence, so we need a no-op stand-in decorator to keep the tests working (a sketch follows below).
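
The usual trick, and my assumption of what the book means here: kernprof (line_profiler's driver) injects a profile decorator at run time, so ordinary runs and tests would die with a NameError unless we provide a do-nothing stand-in:

```python
# kernprof injects `profile` into builtins when it runs the script; during
# normal runs and unit tests the name does not exist, so fall back to a
# no-op decorator that returns the function unchanged.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def hot_loop(n):
    return sum(range(n))
```
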
Finally, if at all possible, keep the system relatively "clean" during profiling runs, so that other processes do not contaminate the results.

Basic data structures for computation

Lists and tuples

First, one sentence before getting into lists and tuples: both are containers that store arrays; lists are dynamic arrays, tuples are static arrays. A list can be resized, but a tuple cannot: once created, its contents cannot change. Moreover, the Python runtime caches tuples, which means we do not need to ask the kernel for memory allocation on every access, so tuples are naturally somewhat more efficient.
A few words on list memory allocation. Because lists are resizable, when the length grows Python actually creates a new list, and the size of this new list is larger than the old length + 1. The reasoning is that one append is likely to be followed by many similar ones, so over-allocating lets appends be treated, amortized, as a local operation. Note that resizing involves a memory copy, so it is an expensive operation.
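
A quick way to watch the over-allocation happen (exact sizes vary by Python version and platform):

```python
import sys

items = []
last = sys.getsizeof(items)
for i in range(20):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last:               # the capacity just jumped
        print(f"len={len(items):2d}  size={size} bytes")
        last = size
# The size grows in steps (e.g. at lengths 1, 5, 9, 17, ...), not one slot
# per append: each step is a reallocation plus a copy of the old contents.
```
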
So how do tuples differ from lists? Tuples have no such "over-allocation", so every time a tuple gains an element (by building a new one) there is an allocation and a copy, whereas a list only does this when its current capacity is exhausted. On the other hand, because tuples are static, the Python runtime caches their resources: even when a tuple is no longer in use, its space is not returned to the system immediately but kept for future use. If a tuple of the same size is needed later, no memory request to the operating system is necessary; the reserved space is used directly.
For searching through data in a list, the recommendation is Python's built-in sort (Timsort) plus binary search.
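
A minimal sketch of that recipe with the stdlib bisect module:

```python
import bisect

data = [9, 1, 8, 4, 7, 2]
data.sort()                          # Timsort, O(n log n)

def index(a, x):
    """Locate x in sorted list a, or return -1 if absent (O(log n))."""
    i = bisect.bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    return -1

print(index(data, 7))   # position of 7 in the sorted list
print(index(data, 5))   # -1: not present
```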

Dictionaries and sets

In fact, dictionaries and sets can be regarded as a generalization of the lists discussed above. Look at it this way: when we use a list, the index is essentially the key to a memory location, and the dictionary is precisely a realization of that idea. In general, if we have data with no intrinsic order but with a unique index that can reference each object (any hashable type can serve as the index), we can use a dictionary or a set. A set is just a bit more special: it can be thought of as a dictionary's keys without the corresponding values.
Lookup in a dictionary is relatively straightforward, but inserting data requires the help of a hash function. Essentially, where new data lands depends on two properties of the key: its hash value and how it compares with other objects. When we insert data, the key's hash value is computed first and masked to obtain a valid array index; the mask ensures that the hash value, which may be any integer, is transformed into the range of valid indices. During insertion, if the slot at the computed index is empty, we can store the value there. If the slot is already in use, there are two cases: if the value at that slot equals the one we want to insert, we return directly; if not, we must probe for a new slot, using a probing function to compute the new position. We want this function to have two properties: for a given key its output must be deterministic, otherwise later lookups are in big trouble; and its results for different keys should be well dispersed, i.e. the function should have high entropy. As for dictionary performance, the main consideration is resizing: as more and more content is inserted, the hash table itself must be resized to fit. The general rule is that a table no more than two-thirds full still has a good collision-avoidance rate while making optimal use of space. When a table gets too full, a larger table must be allocated, the mask adjusted to fit the new table, and all the elements of the old table re-inserted into the new one. Indices must be recomputed along the way, so be vigilant about this when optimizing for performance: allocate sufficient capacity up front to minimize repeated rehashing.
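
A toy sketch of the index-and-probe scheme just described (CPython's real probe sequence is more elaborate; this linear probe is purely illustrative and assumes the table never fills up):

```python
def insert(table, key, value):
    """Insert into a fixed-size open-addressing table (a list of slots)."""
    mask = len(table) - 1              # valid when len(table) is a power of 2
    i = hash(key) & mask               # hash, then mask down to a valid index
    while table[i] is not None:
        if table[i][0] == key:         # same key already here: overwrite it
            break
        i = (i + 1) & mask             # occupied by another key: probe on
    table[i] = (key, value)

table = [None] * 8
insert(table, "spam", 1)
insert(table, "eggs", 2)
print(table)   # slot positions vary run to run (string hashes are randomized)
```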

Dictionaries and namespaces

Namespaces in Python also have a lot to do with dictionaries; in fact, Python can fairly be accused of excessive dictionary lookups. When accessing a variable, function, or module, there is a mechanism that decides where to look for it. In order: Python first checks the locals() array, which holds entries for all local variables. Python is heavily optimized here; it is the only part of the lookup chain that needs no dictionary lookup. If the name is not found there, Python searches the globals() dictionary, and finally the __builtin__ object. So when trying to optimize native Python code, one option is to hold an outer-scope name in a local variable. Of course, it is best to do this explicitly, for the sake of readability.
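
A minimal sketch of that trick; this is a micro-optimization, so measure before relying on it:

```python
import math
import timeit

def slow(n=1_000_000):
    s = 0.0
    for i in range(n):
        s += math.sin(i)        # looks up `math`, then `sin`, on every pass
    return s

def fast(n=1_000_000):
    sin = math.sin              # one lookup, held in a local variable
    s = 0.0
    for i in range(n):
        s += sin(i)             # only a cheap local lookup per pass
    return s

print(timeit.timeit(slow, number=1))
print(timeit.timeit(fast, number=1))   # typically measurably faster
```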

Iterators and generators

First, a splash of cold water: generators in fact contribute nothing to computational efficiency, but they can genuinely reduce memory consumption. The book also explains how to pair generators with ordinary functions: the generator creates the data, while the ordinary function is responsible for operating on the data produced. This division of labor makes the functions and logic of the code clearer, an instance of decoupling. The book also describes a problem generators bring, the "single-pass" issue: we can only access the current value, never the other elements of the sequence. But there is nothing to complain about; after all, saving memory is exactly where generators save, and the trade-off is not inevitable anyway: the standard library's itertools offers functions that help with this (islice and so on).
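
A short sketch of the division of labor, with itertools.islice restoring some of the slicing that single-pass access takes away:

```python
from itertools import islice

def fibonacci():
    """Generator: produce Fibonacci numbers one at a time, O(1) memory."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def count_divisible(seq, k, limit):
    """Ordinary function: consumes the data the generator produces."""
    return sum(1 for x in islice(seq, limit) if x % k == 0)

# islice lets us take "the first 10,000 elements" from a single-pass stream
print(count_divisible(fibonacci(), 3, 10_000))
```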

Matrix and vector calculation

This chapter introduces the bottlenecks that may exist in Python vector computation, along with their causes and solutions. It opens with a diffusion equation as the running example, and a few points are worth noting. First, in CPU-intensive computation memory allocation is not cheap: every time we need memory to store a variable or a list, Python must take the time to request more memory from the operating system and then traverse the new space to initialize it. So where enough memory has already been allocated to a variable, we should make maximum use of (or reuse) that space; this alone buys some speed. The second point is memory fragmentation, mentioned earlier, which is the core issue of vector computation in Python: Python does not support vector operations. There are two main causes: Python lists store pointers to the actual data, so the actual data is not laid out sequentially in memory; and the Python bytecode itself is not optimized for vector operations.
The impact of these causes on vector operations is considerable. A simple read breaks down into: locate the element's position in the list by index, obtain the address stored at that position, then dereference the address to get the value. At a coarser granularity, when we try to move data in blocks, we can only transfer small pieces separately rather than the entire block in one go, and we cannot predict how that will play out in the cache. Bad! A "von Neumann bottleneck" has appeared. Before hunting for a solution, you can use the Linux perf tool to understand how the CPU handles the running program; personally tested, it is a good tool, though it is best to confirm you install the build matching your kernel version. The conclusion from the book's analysis is easy to understand: vector computation is achieved when the relevant data can fill the CPU cache; but since the bus can only move contiguous chunks of memory, this is possible only when the data is stored contiguously in RAM. Python's array objects do store their data contiguously in memory, but the bytecode problem remains unresolved, not to mention that building from the array type is actually slower than building a list. So we need a good tool; it's time to introduce numpy.
numpy stores data contiguously in memory and supports vector operations on that data. Replacing the original code with the corresponding numpy functions (specialized ones where possible; specialized code generally outperforms generic code) brings a real performance improvement. On top of that, we can use in-place operations to avoid the cost of memory allocation; after all, an allocation is more expensive than a cache miss. Not only must data missing from the cache be fetched from RAM, but an allocation must also ask the operating system to make the space available before it can be used, and the overhead of a request to the operating system is much larger than simply refilling a cache. Refilling after a cache miss is an optimized hardware behavior on the motherboard (the tuple caching mentioned earlier similarly makes refills fast), whereas allocating memory means dealing with another process, the kernel. The painful part is that although the numexpr tool makes in-place operations (especially chained in-place operations) more intuitive and more convenient, the readability of in-place code is still relatively poor.
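
A small sketch of the difference (numpy and numexpr are third-party packages; actual speedups depend on array size and hardware):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

c = a * b + 1.0            # idiomatic, but allocates temporaries plus the result

# In-place versions reuse already-allocated buffers: no new allocations
np.multiply(a, b, out=a)   # a <- a * b
np.add(a, 1.0, out=a)      # a <- a * b + 1, same values as c, no temporaries

# numexpr evaluates the whole expression in cache-sized blocks
d = np.empty_like(b)
ne.evaluate("b * b + 1.0", out=d)
```
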
Finally, it is worth mentioning that while studying vector operations, it pays to learn about the hardware cache in advance; it helps greatly in understanding data locality, in reading perf results, and in optimizing programs.

Solving the dynamic language problem: compiling to C

As introduced at the outset, dynamic languages get no compiler optimization, so it is hard to make the actual execution of the code efficient. The book gives several ways to compile parts of our code down to machine code.

  • Cython, the most common tool for compiling to C; supports numpy and plain Python code, uses gcc by default
  • Shed Skin, an automatic Python-to-C++ converter for non-numpy code, uses g++
  • Numba, a newer compiler specialized for numpy code
  • Pythran, a newer compiler for both numpy and non-numpy code
  • PyPy, a just-in-time compiler that can replace the regular Python interpreter (meaning CPython); note the JIT-versus-AOT cold-start problem: do not use a JIT for short scripts that run frequently

From my own testing, the tools with the better Python 3 experience are Cython and PyPy, though PyPy's numpy support is poor. There used to be a numpypy project that set out to port the CPython-based numpy library, but it was discontinued after about 80% of the work and replaced by a home-spun numpy-like layer; as for its efficiency, well, don't hold your breath. From a pure-efficiency point of view, though, PyPy still has an advantage over Cython: its JIT is commonly around three times as fast as Cython, and PyPy has also committed to supporting Python 2 permanently. So friends who use few external libraries like numpy, or who need to stay on Python 2, can consider PyPy; in a pure Python environment it earns the most praise. I secretly still hope we will one day be able to use numpy safely inside PyPy; after all, I have faith in the JIT. Can the speedup the JIT brings really offset the presence of cpyext?
Honestly, when this chapter started I thought this was the really high-tech content, but let's not celebrate too early; let's do some quick accounting: how much improvement can compilation actually bring us?
First, code that calls out to external libraries gains nothing from compiling; and second, we cannot hope that I/O-intensive code will gain speed. In other words, compiled Python code will never soberly be faster than well-written C code (after all, we chose Python). So when we optimize, we are bound to ride a diminishing-returns curve: first we profile the code to understand the program's behavior, then we make evidence-based algorithm changes and harvest the compiler's performance gains, and eventually we realize that growing effort brings shrinking returns; that is the time to call it a day!

Cython

Its workflow is easy to understand: running a setup.py has Cython translate the .pyx file into C code and compile it, leaving us with the intermediate C code and a compiled extension module.
When using Cython, its annotation option generates a browser view of the code, highlighted block by block; then by adding type annotations (bound to lose some generality) and removing boundscheck, the list-based computation gets optimized. Of course, having OpenMP ready is a nice bonus: mark a section with nogil and you can spin up parallelism.
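
A minimal sketch of that workflow under my own file names (build with python setup.py build_ext --inplace):

```python
# ---- diffusion.pyx : Cython source with C type declarations ----
# cython: boundscheck=False, wraparound=False
def evolve(double[:] grid, double dt):
    cdef int i                        # typed loop index compiles to a C loop
    cdef int n = grid.shape[0]
    for i in range(1, n - 1):
        grid[i] += dt * (grid[i - 1] - 2.0 * grid[i] + grid[i + 1])

# ---- setup.py : drives the .pyx -> .c -> extension-module build ----
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("diffusion.pyx"))
```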

Shed Skin

This feels like a lukewarm project, and its Python 3 support is limited, so I did not dig deep. My private impression is that for the average user it is a very foolproof compiler: you only need to provide a seed example, and Shed Skin drives the compiler automatically on the developer's behalf without requiring the user to modify the code, so users get acceleration "comfortably", without having to master the relevant knowledge. Note, though, that Shed Skin executes in a separate memory space, so there is extra memory-copy overhead.

PyPy

PyPy is my favorite way to get compiled acceleration, even though it has no native support for numpy and connects through cpyext as glue(?), which makes it unfriendly to numpy and other libraries built on C underneath. But PyPy is a very active project; with the arrival of PyPy 6.0 you can see that numpy performance has already improved greatly, not to mention that the JIT really is strong. I am very optimistic about PyPy's development.

Concurrency

Concurrency mainly addresses the impact of I/O on a program's execution: it lets us perform other operations while waiting for an I/O operation to complete, putting that time to use. Our goal is to hide the I/O wait.
Looking deeper: when a program enters an I/O wait, it is suspended so that the kernel can perform the low-level operations of the I/O request (that is, a context switch happens), and execution resumes only after the I/O operation completes. A context switch is heavyweight and consumes considerable resources. Specifically, it requires first saving the program's state (in other words, losing whatever we had in the CPU-level caches) and giving up the CPU. When we are later allowed to run again, time must be spent re-initializing the program and getting it ready to continue.
So what is the usual idea behind concurrency? Typically, we need an "event loop" to manage what in the program should run and when. In essence, an event loop is just a list (or a queue?) of functions that need to run: the function at the top of the list executes, then the next one in turn, and so on. Such switching does come at a cost; switching between functions has overhead too, since the kernel must take time to set up the function call in memory, and the cache state is unpredictable. But when the program has a lot of I/O wait time, function switching puts those waits to good use, and compared with its cost, overall performance improves greatly.
The event loops just mentioned mainly come in two programming styles: callbacks or futures.
The main libraries available for asynchronous work are the following:

  • gevent follows the pattern of asynchronous functions returning futures, so most of the code logic stays the same
  • tornado uses callbacks to implement asynchronous behavior
  • asyncio, in the standard library (a sketch follows this list)
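
A minimal asyncio sketch of overlapping I/O waits (the waits are simulated with asyncio.sleep; asyncio.run needs Python 3.7+):

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)        # simulated I/O wait; yields to the loop
    return f"{name} done after {delay}s"

async def main():
    # Three "requests" whose waits overlap on a single event loop
    results = await asyncio.gather(
        fetch("a", 1.0), fetch("b", 1.0), fetch("c", 1.0),
    )
    print(results)                    # finishes in ~1s total, not ~3s

asyncio.run(main())
```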

Multiprocessing

Earlier we introduced the GIL and the harm it does to our optimization efforts. But a GIL's scope is a single Python process; when we run several Python processes, they are not affected by the GIL (though of course we then have to consider the problem of message passing).
Python's multiprocessing module lets us use process-based and thread-based parallelism, share tasks over queues, and share data among processes. Its main target is single-machine multi-core problems, and its broadest application is CPU-intensive workloads. Earlier we saw compiling C code with Cython and using the OpenMP framework; the multiprocessing presented here works at a higher level, and it is widely used when we need parallel computation with libraries like numpy.

The multiprocessing module can handle some typical tasks:

  • Parallelize CPU-intensive tasks with Process and Pool objects
  • Parallelize I/O-intensive tasks with a thread pool via the dummy module
  • Share work via a Queue
  • Share state between parallel workers: bytes, native datatypes, dictionaries, and lists can be shared

The main components of the multiprocessing module are:

  • Process, a forked copy of the current process; it gets a new process identifier and runs as an independent child process in the operating system. Developers can start it, query its status, and give it a run method. Because of the forking just described, be careful with operations such as random number generation: since forking copies state, every process may generate exactly the same random numbers :-)
  • Pool, which wraps processes or threads into a pool of workers that share a chunk of work and return an aggregated result
  • Queue
  • Pipe, a unidirectional or bidirectional communication channel between two processes
  • ctypes, which allows native data types to be shared between parent and child after a process is forked
  • Synchronization primitives

The book describes two fairly representative examples of parallelism: estimating pi by the Monte Carlo method, and finding primes. In the former the task load is balanced; in the latter the computation varies across intervals. From the analysis of the first example, hyper-threading turns out to be of very limited value; worse, CPython uses a lot of RAM when hyper-threading, and since hyper-threading is not cache-friendly, the utilization of the leftover resources on each chip is very low. So in general, hyper-threading should be seen as a bonus rather than an optimization target; in most cases, communication pressure permitting, adding real CPUs is king.
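
A minimal sketch of the balanced case, Monte Carlo pi on a process pool (stdlib only; worker and sample counts are arbitrary):

```python
import random
from multiprocessing import Pool

def sample(n):
    """Count how many of n random points land inside the unit quarter-circle."""
    random.seed()   # re-seed per task: forked workers inherit the same state
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":          # required on platforms that spawn workers
    n_workers, n_per_worker = 4, 2_500_000
    with Pool(n_workers) as pool:
        hits = sum(pool.map(sample, [n_per_worker] * n_workers))
    print(4.0 * hits / (n_workers * n_per_worker))   # ~3.1415
```
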
In the second example, finding primes, the author summarizes some strategies for the thorny parts of parallel problems:

  • Divide the work into independent work units
  • For a problem space as uneven as prime-finding, randomize the order of the work
  • Balance against the work queue's overhead so the average time per unit is as small as possible, even pre-screening serially to keep trivial work out of the parallel portion
  • Without a good reason to do otherwise, honestly stick with the default chunksize

Clusters and work queues

For handling ordinary problems, the effort one needs to spend on a single computer is far less than on a cluster, so first decide whether the time and effort are worth it. Python has three relatively mature clustering solutions: Parallel Python, IPython Parallel, and NSQ; their ease of adoption is basically in that order.

  • Parallel Python: the interface is easy to use, only a minor modification of the multiprocessing interface, but it is not very powerful
  • IPython clusters support MPI, and debugging is more convenient (after all, it runs interactively)
  • NSQ is a high-performance, robust distributed messaging platform, but correspondingly harder to get into

Wrapping up

Overall the book's content is very rich; basically every way we might want to make a Python program run faster is presented (what's missing, perhaps: hardware configuration? overclocking? Linux tuning?). And it gives the reader a great reminder: in high-performance computing, practical problem-solving experience is very important; merely reading the book will not get it into your bones. Also, since tuning a program inevitably involves the operating system and computer architecture, those two foundational subjects cannot be avoided, so when time permits I still need to read the relevant material.
Finally, thanks to nathanmarz's blog post "You should blog even if you have no readers", which gave me the motivation to start writing blog posts again.
