In-Depth Study of Concurrency in Python (2): Advanced Topics

Consider the following requirement:

We have a log directory full of gzip-compressed log files.

Each log file has a fixed format, and we want to extract every host that has requested the robots.txt file:

1.1.1.1 - - [10/Jun/2012:00:18:50 -0500] "GET /robots.txt ..." 200 71

2.1.1.3 - - [12/Jun/2013:00:18:50 -0500] "GET /a.txt ..." 202 73

122.1.1.3 - - [12/Jun/2013:00:18:50 -0500] "GET /robots.txt ..." 202 73

 

Without concurrency, we might write the following code:

import gzip
import glob
import io

def find_robots(filename):
    # Find all hosts that accessed robots.txt in a single log file
    robots = set()
    with gzip.open(filename) as f:
        for line in io.TextIOWrapper(f, encoding='ascii'):
            fields = line.split()
            if fields[6] == '/robots.txt':
                robots.add(fields[0])
    return robots

def find_all_robots(logdir):
    # Find all such hosts across an entire directory of compressed files
    files = glob.glob(logdir + '/*.log.gz')
    all_robots = set()
    for robots in map(find_robots, files):
        all_robots.update(robots)
    return all_robots

The above program is written in a map-reduce style.
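As a sanity check, the sequential version can be exercised against a throwaway log directory; the directory, file name, and sample entries below are invented for illustration:

```python
import glob
import gzip
import io
import os
import tempfile

def find_robots(filename):
    # Find all hosts that accessed robots.txt in a single log file
    robots = set()
    with gzip.open(filename) as f:
        for line in io.TextIOWrapper(f, encoding='ascii'):
            fields = line.split()
            if fields[6] == '/robots.txt':
                robots.add(fields[0])
    return robots

def find_all_robots(logdir):
    files = glob.glob(logdir + '/*.log.gz')
    all_robots = set()
    for robots in map(find_robots, files):
        all_robots.update(robots)
    return all_robots

# Build a throwaway log directory holding one gzip-compressed file
# whose entries follow the format shown above.
logdir = tempfile.mkdtemp()
sample = (
    '1.1.1.1 - - [10/Jun/2012:00:18:50 -0500] "GET /robots.txt HTTP/1.0" 200 71\n'
    '2.1.1.3 - - [12/Jun/2013:00:18:50 -0500] "GET /a.txt HTTP/1.0" 202 73\n'
    '122.1.1.3 - - [12/Jun/2013:00:18:50 -0500] "GET /robots.txt HTTP/1.0" 202 73\n'
)
with gzip.open(os.path.join(logdir, 'access.log.gz'), 'wt', encoding='ascii') as f:
    f.write(sample)

hosts = find_all_robots(logdir)
print(hosts)  # the two hosts that requested /robots.txt
```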

To make this program use multiple CPU cores, simply replace the map() call with a similar operation that executes in a process pool from the concurrent.futures library.

Here is the slightly modified code:

from concurrent.futures import ProcessPoolExecutor

def find_all_robots(logdir):
    files = glob.glob(logdir + '/*.log.gz')
    all_robots = set()
    with ProcessPoolExecutor() as pool:
        for robots in pool.map(find_robots, files):
            all_robots.update(robots)
    return all_robots

The typical usage of ProcessPoolExecutor is as follows:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as pool:
    ...  # do work in parallel using pool

Under the hood, ProcessPoolExecutor starts N independent Python interpreter processes, where N defaults to the number of CPUs on the machine; you can also set it explicitly with ProcessPoolExecutor(N). The pool runs until the last statement in the with block has executed, and before exiting it waits for all submitted tasks to complete.

Tasks submitted to a process pool must take the form of a function, and there are two ways to submit them. To process a list-comprehension- or map-style operation in parallel, use pool.map(); alternatively, submit individual tasks manually with pool.submit():

def work(x):
    # ... some computation ...
    return result

from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor() as pool:
    # submit(func, *args) schedules work(x) in a worker process
    future_result = pool.submit(work, x)

    # blocks until the result is available
    r = future_result.result()

Manually submitting a task returns a Future object. You can retrieve the result through its result() method, but that call blocks until the result is ready.

To avoid blocking, you can instead attach a completion callback:

def when_done(r):
    print('Got:', r.result())

with ProcessPoolExecutor() as pool:
    future_result = pool.submit(work, x)
    future_result.add_done_callback(when_done)
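A related pattern, not shown in the original, is concurrent.futures.as_completed, which yields each future as soon as it finishes rather than in submission order. The sketch below uses a thread pool with an artificial sleep so it stays self-contained; the same code works unchanged with ProcessPoolExecutor:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(x):
    # Artificial delay: later submissions tend to finish sooner,
    # so completion order differs from submission order.
    time.sleep(0.01 * (5 - x))
    return x * x

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(work, n) for n in range(5)]
    # as_completed yields futures in the order they finish
    results = [f.result() for f in as_completed(futures)]

print(results)  # completion order, not submission order
```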

Although a process pool is simple to use, there are several points to keep in mind:


1. This parallel-processing technique is only suitable when the problem can be decomposed into independent parts.

2. Tasks must be submitted as plain functions; closures, lambdas, and other callables that cannot be pickled do not work with process pools.

3. Function arguments and return values must be compatible with pickle encoding. Tasks execute in separate interpreter processes, so exchanging data between interpreters requires inter-process communication, and that data must be serialized.

4. Submitted work functions should not maintain persistent state or have side effects.

5. On UNIX, the process pool is implemented via the fork() system call (note that newer CPython versions default to other start methods on some platforms, e.g. spawn on macOS and Windows).

6. Be extra careful when combining process pools with thread pools; generally, you should start the process pool before creating any threads.
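Point 3 above can be checked directly with the pickle module. This is a sketch of my own, not from the original: a module-level function pickles by reference and so can cross the process boundary, while a lambda cannot:

```python
import pickle

def plain_function(x):
    return x + 1

# A module-level function is pickled by reference (module + name),
# so it can be sent to a worker process.
payload = pickle.dumps(plain_function)

# A lambda has no importable name, so pickling it fails.
try:
    pickle.dumps(lambda x: x + 1)
except Exception as e:
    print('cannot pickle:', e)
```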

 

How to work around the limitations of the GIL

In the C implementation of the Python interpreter, some of the code is not thread-safe and therefore cannot execute fully concurrently. In fact, the interpreter is protected by the global interpreter lock (GIL), which allows only one Python thread to execute at any given moment. The most visible consequence of the GIL is that multi-threaded Python programs cannot take full advantage of multi-core CPUs: a compute-intensive application using threads still runs on only one core at a time.

To understand the GIL, you need to know when Python releases it.

The interpreter releases the GIL whenever a thread blocks waiting on an I/O operation. For CPU-intensive threads that never block, the interpreter periodically releases the GIL so that other threads get a chance to run (older interpreters did this after a fixed number of bytecode instructions; modern CPython uses a time-based switch interval). C extension modules behave differently: when a C function is called, the GIL remains held until the function returns.

Because the C code runs outside the interpreter's control, no Python bytecode executes during the call, so the interpreter cannot release the GIL.
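The switch interval mentioned above is exposed through the sys module. A small sketch, reflecting CPython 3.2+ behavior, of inspecting and tuning it:

```python
import sys

# CPython switches between threads based on elapsed time, not a
# bytecode count. The default switch interval is 5 milliseconds.
print(sys.getswitchinterval())  # 0.005 by default

# It can be tuned (rarely needed): a longer interval reduces
# switching overhead but makes other threads less responsive.
sys.setswitchinterval(0.01)
```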

With that background, there are usually two strategies for working around the GIL:

1. If you are programming entirely in Python, use the multiprocessing module to create a process pool and treat it as a co-processor: hand CPU-bound work to the pool instead of running it in threads.

2. Move the work into a C extension. The main idea is to push computation-intensive tasks into C code that runs independently of Python, and to release the GIL in the C code while that work runs. This is done by wrapping the code in special macros:

#include "Python.h"

PyObject *pyfunc(PyObject *self, PyObject *args)
{
    ...
    Py_BEGIN_ALLOW_THREADS
    // Threaded C code
    ...
    Py_END_ALLOW_THREADS
    ...
}

If you access C code through the ctypes library, the GIL is released automatically around foreign calls, with no intervention on your part. (Cython, by contrast, requires you to mark GIL-free sections explicitly with nogil.)
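As an illustration, a ctypes call into the C math library releases the GIL for the duration of the foreign call. This sketch assumes a platform where libm can be located (the 'libm.so.6' fallback is glibc-specific):

```python
import ctypes
import ctypes.util

# Locate the C math library; the fallback name is glibc-specific.
libm = ctypes.CDLL(ctypes.util.find_library('m') or 'libm.so.6')
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

# While this C call runs, ctypes has released the GIL, so other
# Python threads may execute concurrently.
root = libm.sqrt(2.0)
print(root)
```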

 

 

Origin blog.csdn.net/happyAnger6/article/details/104483242