foreword
Python is somewhat notorious when it comes to parallelizing programs. Setting aside technical issues such as the threading implementation and the GIL, I think misguided teaching is the main problem. The classic Python multithreading and multiprocessing tutorials tend to be "heavy", and they often only scratch the surface without touching what is most useful in day-to-day work.
traditional example
A simple search for "Python multithreading tutorial" reveals that almost all tutorials give examples involving classes and queues:
import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()
Ha, looks a bit like Java, doesn't it?
I'm not saying it's wrong to use the producer/consumer model for multithreaded/multiprocess tasks (in fact, this model has its place). It's just that for everyday scripting tasks we can use a more efficient model.
The problem is that:
First, you need a boilerplate class;
Second, you need a queue to pass objects around;
Moreover, you also need to build corresponding methods on both ends of the channel to do the actual work (and if you want two-way communication, or want to store the results, you need to introduce yet another queue).
The more workers, the more problems
Following this line of thinking, you now need a pool of worker threads. The following is an example from a classic IBM tutorial: accelerating web page retrieval through multithreading.
# Example2.py
'''
A more realistic thread pool example
'''
import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()
This code works correctly, but take a closer look at everything we had to do: construct different methods, keep track of a series of threads, and, to avoid annoying deadlock problems, perform a series of join operations. And this is just the beginning...
So far we have reviewed the classic multithreading tutorial. Somewhat hollow, isn't it? It is boilerplate-heavy and error-prone, and this do-a-lot-to-gain-a-little style is clearly ill-suited to everyday use. Fortunately, there is a better way.
why not try map
The small and elegant function map is the key to parallelizing Python programs with ease. map comes from functional programming languages such as Lisp: it maps a function over a sequence, applying the function to each element.
urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)
The two lines of code above pass each element of the sequence urls as an argument to the urlopen method and collect all the results in the list results. The result is roughly equivalent to:
results = []
for url in urls:
    results.append(urllib2.urlopen(url))
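As a minimal, self-contained illustration of map's semantics, here is a sketch with a toy square function standing in for urlopen (my substitution, so the snippet runs without network access):

```python
# A toy stand-in for urlopen: map applies the function to every element
def square(x):
    return x * x

nums = [1, 2, 3, 4]
# list() is needed on Python 3, where map returns a lazy iterator
results = list(map(square, nums))
print(results)  # → [1, 4, 9, 16]
```

The order of the results matches the order of the inputs, which is part of what makes map such a convenient drop-in replacement for a loop.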
The map function takes care of sequence iteration, argument passing, and result storage for us.
Why does this matter? Because, with the right libraries, map makes parallelizing operations trivially easy.
There are two libraries in Python that contain a parallel map function: multiprocessing and its little-known sub-module multiprocessing.dummy.
A couple more words here. multiprocessing.dummy? A threaded clone of the multiprocessing library? What on earth is that? Even the official documentation of the multiprocessing library has only one sentence about this sub-module, and that sentence basically translates to: "Well, there is such a thing. Now you know." Believe me, this library is seriously underrated!
dummy is a complete clone of the multiprocessing module; the only difference is that multiprocessing works on processes, while the dummy module works on threads (and therefore inherits all of Python's usual multithreading limitations).
So swapping one library for the other is surprisingly easy, and you can choose a different one for IO-bound tasks and CPU-bound tasks.
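Because the two Pool classes expose the same interface, switching between threads and processes is literally a one-line change. A minimal sketch (Python 3 syntax, with a trivial `double` function of my own standing in for real work):

```python
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def double(x):
    return x * 2

if __name__ == '__main__':
    items = list(range(10))

    # Thread-based pool: a good fit for IO-bound work
    tpool = ThreadPool(4)
    print(tpool.map(double, items))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
    tpool.close()
    tpool.join()

    # Process-based pool: a good fit for CPU-bound work; identical interface
    ppool = Pool(4)
    print(ppool.map(double, items))  # → same result, computed in child processes
    ppool.close()
    ppool.join()
```

Note that the process-based version requires `double` to be defined at module level so that worker processes can import it.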
try it out
Use the following two lines of code to import the libraries that contain the parallelized map function:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
Instantiate a Pool object:
pool = ThreadPool()
This simple statement replaces the seven lines of the build_worker_pool function in example2.py: it spawns a group of worker threads, initializes them, and stores them in a variable for easy access.
The Pool object takes several parameters, but the only one we need to pay attention to here is the first: processes. It sets the number of workers in the pool and defaults to the number of CPU cores on the current machine.
In general, for CPU-bound tasks, the more cores you use, the faster things go. For network-bound tasks, however, things are less predictable, and it is wise to determine the pool size experimentally.
pool = ThreadPool(4) # Sets the pool size to 4
When there are too many threads, the time spent switching between them can even exceed the actual working time. Experimenting to find the optimal pool size for each job is never a bad idea.
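To see the effect of pool size without hitting the network, here is a small self-contained experiment in which time.sleep stands in for network latency (my substitution, chosen so the script runs anywhere):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_download(url):
    time.sleep(0.1)  # pretend this is a 0.1-second network round trip
    return url

urls = ['url%d' % i for i in range(8)]

# Eight 0.1 s "downloads": one thread takes ~0.8 s in total,
# while eight threads overlap the waits and finish in ~0.1 s.
for size in (1, 4, 8):
    pool = ThreadPool(size)
    start = time.time()
    pool.map(fake_download, urls)
    pool.close()
    pool.join()
    print('%2d threads: %.2f seconds' % (size, time.time() - start))
```

Because the work is pure waiting, threads overlap almost perfectly here; real network requests will show the same shape, just with noisier numbers.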
Once the Pool object is created, the parallelized program is ready to go. Let's take a look at the rewritten example2.py:
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# close the pool and wait for the work to finish
pool.close()
pool.join()
Only four lines of code do real work, and only one of them is critical. The map function easily replaces the 40+ line example above. Just for fun, I also measured the time taken by different approaches and different pool sizes.
# ------- Single thread ------- #
# results = []
# for url in urls:
#     result = urllib2.urlopen(url)
#     results.append(result)

# ------- VERSUS ------- #

# ------- 4 Pool ------- #
# pool = ThreadPool(4)
# results = pool.map(urllib2.urlopen, urls)

# ------- 8 Pool ------- #
# pool = ThreadPool(8)
# results = pool.map(urllib2.urlopen, urls)

# ------- 13 Pool ------- #
# pool = ThreadPool(13)
# results = pool.map(urllib2.urlopen, urls)

Results:

# Single thread: 14.4 Seconds
#        4 Pool:  3.1 Seconds
#        8 Pool:  1.4 Seconds
#       13 Pool:  1.3 Seconds
Great results, aren't they? They also show why pool size should be determined experimentally: on my machine, any pool size above 9 brought very limited benefit.
Another real example
Generate thumbnails of thousands of images
This is a CPU-intensive task and well suited for parallelization.
Basic single process version
import os
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)
The main job of the code above is to traverse the image files in the given folder, generate a thumbnail for each one, and save the thumbnails to a specific folder.
On my machine, it took 27.9 seconds to process 6000 images with this program.
If we use the map function instead of the for loop:
import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()
5.6 seconds!
Even though we changed only a few lines of code, we significantly improved its execution speed. In a production environment, we can go even further by choosing multiprocessing for CPU-bound tasks and multiprocessing.dummy for IO-bound ones, which also happens to be a good way to avoid deadlocks. Moreover, since map takes thread management out of our hands, it makes the related debugging work remarkably simple.
At this point, we have (basically) achieved parallelization in one line of Python.
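For readers on modern Python 3, the standard-library concurrent.futures module (available since Python 3.2) offers the same map-style interface, with a context manager taking care of close/join for you. A minimal sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return x * 2

# The with-block replaces pool.close() / pool.join()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(double, range(5)))

print(results)  # → [0, 2, 4, 6, 8]
```

Swap in ProcessPoolExecutor for CPU-bound work; the map call stays exactly the same.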