foreword
Python is somewhat notorious when it comes to parallelizing programs. Setting aside technical issues such as the threading implementation and the GIL, I think misguided teaching is the main problem. The classic Python multithreading and multiprocessing tutorials tend to be "heavy", and they often only scratch the surface without touching what is most useful in day-to-day work.
traditional example
A simple search for "Python multithreading tutorial" reveals that almost all tutorials give examples involving classes and queues:
import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()
Ha, looks a bit like Java, doesn't it?
I'm not saying it's wrong to use the producer/consumer model for multithreaded/multiprocess tasks (in fact, this model has its place). It's just that for everyday scripting tasks we can use a more efficient model.
The problem is that:
First, you need a boilerplate class;
Second, you need a queue to pass objects around;
Moreover, you also need to build corresponding methods on both ends of the channel to do the actual work (and if you want two-way communication, or want to store the results, you need to introduce yet another queue).
The more workers, the more problems
Following this line of thinking, you now need a pool of worker threads. The following is an example from a classic IBM tutorial: accelerating web page retrieval through multithreading.
# Example2.py
'''
A more realistic thread pool example
'''
import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()
This code works correctly, but take a closer look at everything we had to do: construct different methods, keep track of a series of threads, and, to avoid annoying deadlock problems, perform a series of join operations. And this is just the beginning...
So far we have reviewed the classic multithreading tutorial. Somewhat hollow, isn't it? It is boilerplate-heavy and error-prone, and this do-a-lot-to-gain-a-little style is clearly ill-suited to everyday use. Fortunately, there is a better way.
why not try map
The small and elegant function map is the key to parallelizing Python programs with ease. map comes from functional programming languages such as Lisp: it maps a function over a sequence, applying the function to each element.
urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)
The two lines of code above pass each element of the sequence urls as an argument to the urlopen method and collect all the results in the list results. The result is roughly equivalent to:
results = []
for url in urls:
    results.append(urllib2.urlopen(url))
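As a minimal, self-contained illustration of map's semantics, here is a sketch with a toy square function standing in for urlopen (my substitution, so the snippet runs without network access):

```python
# A toy stand-in for urlopen: map applies the function to every element
def square(x):
    return x * x

nums = [1, 2, 3, 4]
# list() is needed on Python 3, where map returns a lazy iterator
results = list(map(square, nums))
print(results)  # → [1, 4, 9, 16]
```

The order of the results matches the order of the inputs, which is part of what makes map such a convenient drop-in replacement for a loop.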
The map function takes care of sequence iteration, argument passing, and result storage for us.
Why does this matter? Because, with the right libraries, map makes parallelizing operations trivially easy.
There are two libraries in Python that contain a parallel map function: multiprocessing and its little-known sub-module multiprocessing.dummy.
A couple more words here. multiprocessing.dummy? A threaded clone of the multiprocessing library? What on earth is that? Even the official documentation of the multiprocessing library has only one sentence about this sub-module, and that sentence basically translates to: "Well, there is such a thing. Now you know." Believe me, this library is seriously underrated!
dummy is a complete clone of the multiprocessing module; the only difference is that multiprocessing works on processes, while the dummy module works on threads (and therefore inherits all of Python's usual multithreading limitations).
So swapping one library for the other is surprisingly easy, and you can choose a different one for IO-bound tasks and CPU-bound tasks.
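Because the two Pool classes expose the same interface, switching between threads and processes is literally a one-line change. A minimal sketch (Python 3 syntax, with a trivial `double` function of my own standing in for real work):

```python
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def double(x):
    return x * 2

if __name__ == '__main__':
    items = list(range(10))

    # Thread-based pool: a good fit for IO-bound work
    tpool = ThreadPool(4)
    print(tpool.map(double, items))  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
    tpool.close()
    tpool.join()

    # Process-based pool: a good fit for CPU-bound work; identical interface
    ppool = Pool(4)
    print(ppool.map(double, items))  # → same result, computed in child processes
    ppool.close()
    ppool.join()
```

Note that the process-based version requires `double` to be defined at module level so that worker processes can import it.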
try it out
Use the following two lines of code to import the libraries that contain the parallelized map function:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
Instantiate a Pool object:
pool = ThreadPool()
This simple statement replaces the seven lines of the build_worker_pool function in example2.py: it spawns a group of worker threads, initializes them, and stores them in a variable for easy access.
The Pool object takes several parameters, but the only one we need to pay attention to here is the first: processes. It sets the number of workers in the pool and defaults to the number of CPU cores on the current machine.
In general, for CPU-bound tasks, the more cores you use, the faster things go. For network-bound tasks, however, things are less predictable, and it is wise to determine the pool size experimentally.
pool = ThreadPool(4) # Sets the pool size to 4
When there are too many threads, the time spent switching between them can even exceed the actual working time. Experimenting to find the optimal pool size for each job is never a bad idea.
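To see the effect of pool size without hitting the network, here is a small self-contained experiment in which time.sleep stands in for network latency (my substitution, chosen so the script runs anywhere):

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_download(url):
    time.sleep(0.1)  # pretend this is a 0.1-second network round trip
    return url

urls = ['url%d' % i for i in range(8)]

# Eight 0.1 s "downloads": one thread takes ~0.8 s in total,
# while eight threads overlap the waits and finish in ~0.1 s.
for size in (1, 4, 8):
    pool = ThreadPool(size)
    start = time.time()
    pool.map(fake_download, urls)
    pool.close()
    pool.join()
    print('%2d threads: %.2f seconds' % (size, time.time() - start))
```

Because the work is pure waiting, threads overlap almost perfectly here; real network requests will show the same shape, just with noisier numbers.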
Once the Pool object is created, the parallelized program is ready to go. Let's take a look at the rewritten example2.py:
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)

# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)

# close the pool and wait for the work to finish
pool.close()
pool.join()
Only four lines of code do real work, and only one of them is critical. The map function easily replaces the 40+ line example above. Just for fun, I also measured the time taken by different approaches and different pool sizes.
# ------- Single thread ------- #
# results = []
# for url in urls:
#     result = urllib2.urlopen(url)
#     results.append(result)

# ------- VERSUS ------- #

# ------- 4 Pool ------- #
# pool = ThreadPool(4)
# results = pool.map(urllib2.urlopen, urls)

# ------- 8 Pool ------- #
# pool = ThreadPool(8)
# results = pool.map(urllib2.urlopen, urls)

# ------- 13 Pool ------- #
# pool = ThreadPool(13)
# results = pool.map(urllib2.urlopen, urls)

Results:

# Single thread: 14.4 Seconds
#        4 Pool:  3.1 Seconds
#        8 Pool:  1.4 Seconds
#       13 Pool:  1.3 Seconds
Great results, aren't they? They also show why pool size should be determined experimentally: on my machine, any pool size above 9 brought very limited benefit.
Another real example
Generate thumbnails of thousands of images
This is a CPU-intensive task and well suited for parallelization.
Basic single process version
import os
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)
The main job of the code above is to traverse the image files in the given folder, generate a thumbnail for each one, and save the thumbnails to a specific folder.
On my machine, it took 27.9 seconds to process 6000 images with this program.
If we use the map function instead of the for loop:
import os
from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))
    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()
5.6 seconds!
Even though we changed only a few lines of code, we significantly improved its execution speed. In a production environment, we can go even further by choosing multiprocessing for CPU-bound tasks and multiprocessing.dummy for IO-bound ones, which also happens to be a good way to avoid deadlocks. Moreover, since map takes thread management out of our hands, it makes the related debugging work remarkably simple.
At this point, we have (basically) achieved parallelization in one line of Python.
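For readers on modern Python 3, the standard-library concurrent.futures module (available since Python 3.2) offers the same map-style interface, with a context manager taking care of close/join for you. A minimal sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return x * 2

# The with-block replaces pool.close() / pool.join()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(double, range(5)))

print(results)  # → [0, 2, 4, 6, 8]
```

Swap in ProcessPoolExecutor for CPU-bound work; the map call stays exactly the same.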