Parallelism in one line of Python code!

Python has something of a reputation for being awkward to parallelize. Leaving aside technical issues such as the threading implementation and the GIL, I think the real problem is poor teaching material. The classic Python multi-threading and multi-processing tutorials tend to be "heavy", and they usually only scratch the surface without getting to what is actually most useful in day-to-day work.

The traditional example

Search for "Python multithreading tutorial" and you will find that almost every hit walks you through the same pattern: classes and queues, producers and consumers.
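As a minimal sketch of that pattern (written here in the same Python 2 style as the rest of this article's examples, and not taken from any particular tutorial), the boilerplate looks roughly like this:

import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # get() blocks until an item is available
            msg = self._queue.get()
            # 'quit' is the "poison pill" that tells the worker to stop
            if isinstance(msg, str) and msg == 'quit':
                break
            # "Process" the item (here we just print it)
            print 'I received %s' % msg

def Producer():
    queue = Queue.Queue()
    worker = Consumer(queue)
    worker.start()
    for i in range(10):
        queue.put(i)
    queue.put('quit')   # tell the worker to shut down
    worker.join()

if __name__ == '__main__':
    Producer()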


Ha, looks a bit like Java, doesn't it?

I'm not saying the producer/consumer model is wrong for multi-threaded or multi-process tasks (it certainly has its place). It's just that for everyday scripting work there is a far more convenient approach.

The problem is that…

First, you need a boilerplate class;
second, you need a queue to pass objects around;
and third, you need matching methods at both ends of the pipeline to do the actual work (if you want two-way communication or want to store results, you need to introduce yet another queue).

The more workers, the more problems

Following this line of thought, you eventually need a pool of worker threads. Here is an example from a classic IBM tutorial, "Accelerating Web Retrieval through Multithreading".

#Example2.py
'''
A more realistic thread pool example 
'''

import time 
import threading 
import Queue 
import urllib2 

class Consumer(threading.Thread): 
    def __init__(self, queue): 
        threading.Thread.__init__(self)
        self._queue = queue 

    def run(self):
        while True: 
            content = self._queue.get() 
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc.. 
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls: 
        queue.put(url)  
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start() 
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

This code works, but look at everything we had to do: write different methods, keep track of a list of threads, and, to avoid the dreaded deadlock, feed each worker a 'quit' sentinel and then join every one of them. And this is just the beginning...

That is the classic multithreading tutorial in a nutshell, and it feels somewhat hollow, doesn't it? It is boilerplate-heavy and error-prone: a lot of effort for very little result, which makes it a poor fit for everyday use. Fortunately, we have a better way.

Why not try map?

map is the small, elegant function that is the key to painless parallelization in Python. map comes from functional programming languages such as Lisp: it applies a function to each element of a sequence and collects the results.

    urls = ['http://www.yahoo.com', 'http://www.reddit.com']
    results = map(urllib2.urlopen, urls)

The two lines above pass each element of the urls sequence to urlopen in turn and collect all of the results in the results list. The result is roughly equivalent to:

results = []
for url in urls: 
    results.append(urllib2.urlopen(url))

The map function single-handedly takes care of iterating over the sequence, passing each element to the function, and collecting the results.
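One caveat if you are following along on Python 3 (the snippets in this article are Python 2): there the built-in map is lazy and returns an iterator rather than a list, and urllib2 has become urllib.request, so the equivalent would be something like:

import urllib.request

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = list(map(urllib.request.urlopen, urls))  # list() forces evaluation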

Why does this matter? Because, with the right library, map can be parallelized almost for free.

There are two modules in Python's standard library that provide a parallel map: multiprocessing, and its little-known sub-module multiprocessing.dummy.

A quick aside: multiprocessing.dummy? A threaded clone of the multiprocessing module? What on earth is that? Even the official documentation of multiprocessing spends only a single sentence on this sub-module, and that description, translated into plain language, basically says: "Well, this thing exists; now you know." Believe me, this module is seriously underrated!

dummy is an exact clone of the multiprocessing module; the only difference is that multiprocessing works with processes while dummy works with threads (and therefore comes with all of Python's usual multithreading limitations).
So swapping one for the other is trivial, and you can choose the thread-based library for IO-bound tasks and the process-based one for CPU-bound tasks.

Try it yourself

The following two lines of code import the libraries that provide a parallelized map function:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Instantiate a Pool object:

pool = ThreadPool()

This single statement does the work of the seven-line build_worker_pool function from example2.py: it spawns a set of worker threads, starts them, and stores them inside the Pool object for easy access.

The Pool constructor takes several parameters, but the only one that matters here is the first: processes. It sets the number of workers in the pool and defaults to the number of CPU cores on the current machine.

In general, for CPU-intensive tasks, the more cores you use, the faster you go. For network-intensive tasks, things are less predictable, and it is wise to experiment with the size of the thread pool.

pool = ThreadPool(4) # Sets the pool size to 4

When there are too many threads, the time spent switching threads may even exceed the actual working time. It's a good idea to experiment to find the optimal thread pool size for different jobs.
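As a convenient starting point for those experiments (this is just a suggestion, not something the examples above rely on), multiprocessing can report how many cores the machine has:

import multiprocessing
from multiprocessing.dummy import Pool as ThreadPool

cores = multiprocessing.cpu_count()   # e.g. 4 on a quad-core machine
pool = ThreadPool(cores * 2)          # a common starting point for IO-bound work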

Once the Pool object is created, the parallelized program is ready to go. Let's take a look at the rewritten example2.py:

import urllib2 
from multiprocessing.dummy import Pool as ThreadPool 

urls = [
    'http://www.python.org', 
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc.. 
    ]

# Make the Pool of workers
pool = ThreadPool(4) 
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
#close the pool and wait for the work to finish 
pool.close() 
pool.join()

Only four lines of code do any real work, and only one of them is essential. The map call handily replaces the 40+ line example above. To make things more interesting, I also measured the running time of the different approaches with different pool sizes.

# results = [] 
# for url in urls:
#   result = urllib2.urlopen(url)
#   results.append(result)

# # ------- VERSUS ------- # 

# # ------- 4 Pool ------- # 
# pool = ThreadPool(4) 
# results = pool.map(urllib2.urlopen, urls)

# # ------- 8 Pool ------- # 

# pool = ThreadPool(8) 
# results = pool.map(urllib2.urlopen, urls)

# # ------- 13 Pool ------- # 

# pool = ThreadPool(13) 
# results = pool.map(urllib2.urlopen, urls)

Results:

#        Single thread:  14.4 Seconds 
#               4 Pool:   3.1 Seconds
#               8 Pool:   1.4 Seconds
#              13 Pool:   1.3 Seconds

A great result, isn't it? It also shows why pool size is best determined experimentally: on my machine, going beyond nine threads brings very little additional benefit.
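If you want to reproduce this experiment, a small harness along the following lines works (this is my own sketch in the article's Python 2 style; the url list is abbreviated and should be replaced with the full list from the example above):

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.python.org/about/']  # use the full list above

def time_pool(size):
    # Fetch every url with a pool of the given size and return the elapsed time
    pool = ThreadPool(size)
    start = time.time()
    pool.map(urllib2.urlopen, urls)
    pool.close()
    pool.join()
    return time.time() - start

for size in [1, 4, 8, 13]:
    print 'Pool of %2d: %4.1f seconds' % (size, time_pool(size))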

Another real example

Generating thumbnails for thousands of images is a CPU-intensive task and is well suited to parallelization.

Basic single-process version:

import os 
import PIL 

from multiprocessing import Pool 
from PIL import Image

SIZE = (75,75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f) 
            for f in os.listdir(folder) 
            if 'jpeg' in f)

def create_thumbnail(filename): 
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename) 
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)

The main job of the code above is to traverse the image files in the incoming folder, generate thumbnails one by one, and save these thumbnails to a specific folder.

On my machine, it takes 27.9 seconds to process 6000 images with this program.

If we use the map function instead of the for loop:

import os 
import PIL 

from multiprocessing import Pool 
from PIL import Image

SIZE = (75,75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f) 
            for f in os.listdir(folder) 
            if 'jpeg' in f)

def create_thumbnail(filename): 
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename) 
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

5.6 seconds!

Although we changed only a few lines of code, we have significantly improved the program's execution speed. In a production environment you can go further by choosing the multi-process library for CPU-intensive tasks and the multi-thread library for IO-intensive tasks. And because map takes manual thread management out of your hands entirely, it also helps you steer clear of deadlocks and makes debugging much simpler.
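To make that rule of thumb concrete, here is a small self-contained sketch (cpu_heavy and the url list are placeholders of my own for illustration, not code from the examples above):

import urllib2
from multiprocessing import Pool                       # process pool: suits CPU-bound work
from multiprocessing.dummy import Pool as ThreadPool   # thread pool: suits IO-bound work

def cpu_heavy(n):
    # stand-in for real CPU-bound work such as decoding and resizing an image
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    # CPU-bound: separate processes sidestep the GIL
    pool = Pool()
    pool.map(cpu_heavy, [10 ** 6] * 8)
    pool.close()
    pool.join()

    # IO-bound: the threads spend most of their time waiting on the network anyway
    urls = ['http://www.python.org', 'http://www.yahoo.com']
    pool = ThreadPool(8)
    pool.map(urllib2.urlopen, urls)
    pool.close()
    pool.join()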

At this point, we've achieved (essentially) parallelization with one line of Python.

Source: blog.csdn.net/s_frozen/article/details/129252853