Three lines of code to make your Python data processing script four times faster!

Python is a language well suited to data processing and to automating repetitive tasks. Before we train a machine learning model on our data, we usually need to preprocess it, and Python is great for that job: if you need to resize tens of thousands of images, for example, Python handles it without complaint. You can almost always find a Python library that makes the data processing work easy.

However, while Python is easy to learn and easy to use, it is not the fastest of languages. By default, a Python program runs as a single process on a single CPU. If your computer was made in the last few years, it most likely has a quad-core processor, that is, four CPUs. That means that while you sit waiting for your Python script to finish its data processing, 75% or more of your computer's computing power is sitting idle!
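Not sure how many CPUs your own machine has? Python's standard library can tell you. This quick check is separate from the thumbnail script we'll build below:

import os

# Number of logical CPUs the operating system reports
print(os.cpu_count())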

Today I'll show you how to run Python functions in parallel and put all of your computer's processing power to work. Thanks to Python's concurrent.futures module, it takes only three lines of code to turn an ordinary data processing script into one that processes data in parallel, four times faster.

The ordinary Python approach to data processing

Let's say we have a folder full of image files, and we want to create a thumbnail for each image with Python.

Here is a short script that lists all the JPEG files in the folder using Python's built-in glob module, then uses the Pillow image processing library to save a 128-pixel thumbnail of each image:

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# Loop through all the JPEG images in the folder and make a thumbnail for each one
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

This script follows a simple pattern you'll see again and again in data processing scripts:

  • Get a list of the files (or other data) you want to process.
  • Write a helper function that can process one piece of that data.
  • Call the helper function in a for loop, processing each piece of data one at a time.

Let's test this script on a folder containing 1,000 JPEG images and see how long it takes to run:

$ time python3 thumbnails_1.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real 0m8.956s
user 0m7.086s
sys 0m0.743s

The program took 8.9 seconds to run. But how hard was the computer really working?

Let's run the program again, this time watching the Activity Monitor to see what's going on:

![Activity Monitor showing CPU usage while the script runs](http://upload-images.jianshu.io/upload_images/13090773-b965ee6f42944c06.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The computer's processing resources are 75% idle! What's going on here?

The reason is that my computer has 4 CPUs, but Python is using only one of them. So the program is pushing one CPU as hard as it can while the other three do nothing. What I need is a way to split the work into four chunks that I can process in parallel. Luckily, Python gives us a really easy way to do exactly that!

Let's try creating multiple processes

Here's an approach that lets us process this data in parallel:

1. Split the list of JPEG files into 4 smaller chunks.

2. Launch 4 separate instances of the Python interpreter.

3. Have each Python instance process one of the 4 chunks of data.

4. Combine the results from the 4 chunks into one final list of results.

Four copies of Python running on four separate CPUs should be able to do roughly four times the work of one CPU, right?

Best of all, Python handles the most troublesome part of the work for us. All we have to tell it is which function we want to run and how many instances to use, and it takes care of the rest. The whole change takes just three lines of code.

First, we need to import the concurrent.futures library, which is built into Python:

import concurrent.futures

Next, we need to tell Python to launch four extra Python instances. We do that by telling it to create a Process Pool:

with concurrent.futures.ProcessPoolExecutor() as executor:

By default, this creates one Python process for each CPU in your computer, so if you have 4 CPUs, it will start 4 Python processes.
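If you'd rather pick the number of worker processes yourself instead of taking the default, ProcessPoolExecutor accepts a max_workers argument. A minimal sketch (the value 4 here is just an example):

# Cap the pool at 4 worker processes instead of one per CPU
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    ...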

The final step is to ask the Process Pool to run our helper function across the data list using those four processes. To do that, we take the for loop we had before:

for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)

and replace it with a new call to executor.map():

image_files = glob.glob("*.jpg")
for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):

The executor.map() call takes the helper function and the list of data to process. It does all the troublesome work for us: splitting the list into sub-lists, sending each sub-list to a child process, running the child processes, and merging the results. Nicely done!

It also hands back the result of each function call. Importantly, executor.map() returns the results in the same order as the input data, so I used Python's zip() function as a shortcut to match each original filename with its result in one step.
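To see that ordering guarantee in action, here's a toy sketch separate from the thumbnail script (the square function is purely illustrative):

import concurrent.futures

def square(x):
    return x * x

if __name__ == "__main__":
    numbers = [5, 10, 15, 20]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map() yields results in the same order as the inputs,
        # so zip() pairs each input with its own result
        for n, result in zip(numbers, executor.map(square, numbers)):
            print(f"{n} squared is {result}")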

Here is the complete program after those three changes:

import glob
import os
from PIL import Image
import concurrent.futures

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")

    return thumbnail_filename

# Create a Process Pool; by default, one process per CPU on your machine
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get the list of files we need to process
    image_files = glob.glob("*.jpg")

    # Process the list of files, but split the work across the Process Pool to use every CPU!
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

Let's run this script and see whether it finishes the data processing any faster:

$ time python3 thumbnails_2.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real 0m2.274s
user 0m8.959s
sys 0m0.951s

The script processed all the data in 2.2 seconds, 4 times faster than the original version! It got through the data faster because we used 4 CPUs instead of 1.

But if you look carefully, you'll notice the "user" time is almost 9 seconds. How can the program have finished in 2.2 seconds yet somehow still have run for 9 seconds? That doesn't seem possible, does it?

It's possible because "user" time is CPU time summed across all CPUs. Our job still consumed about 9 seconds of total CPU time, but we spread that work across 4 CPUs, so the actual elapsed data processing time was only 2.2 seconds (and indeed 4 × 2.2 ≈ 8.9).

Note: spawning extra Python processes and shuttling data out to the child processes both take time, so this method won't always guarantee a big speedup. If you're working with very large datasets, there are articles on how to slice a dataset into chunks that are worth reading; they'll help you considerably.
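One knob worth knowing about here: executor.map() takes an optional chunksize argument (available since Python 3.5) that controls how many items are shipped to each worker per trip; for lots of small work items, a larger chunk cuts down the inter-process overhead. A sketch of the change, dropped into the same with block as before (50 is just an example value to tune):

    # Send filenames to the workers in batches of 50 instead of one at a time
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files, chunksize=50)):
        print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")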

Will this method always speed up my data processing scripts?

If you have a list of data and each item can be processed independently, a Process Pool like the one we've been discussing is a good way to get a speedup. Here are a few examples of work that suits parallel processing:

  • Scraping statistics out of a set of separate web server logs.
  • Parsing data out of a pile of XML, CSV, and JSON files.
  • Preprocessing large numbers of images to build a machine learning dataset.

However, remember that a Process Pool is no panacea. Using one means shipping data back and forth between separate Python processes. If the data you're working with can't be transferred efficiently between processes, this approach won't work. In short, the data you process must be of a type Python knows how to serialize and hand off.
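In practice, "a type Python knows how to hand off" means the function and its data must be picklable: functions defined at the top level of a module are fine, while lambdas and nested functions generally are not. A minimal sketch of the rule (the double function is purely illustrative):

import concurrent.futures

def double(x):  # top-level function: picklable, safe to hand to a process pool
    return x * 2

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(list(executor.map(double, [1, 2, 3])))  # [2, 4, 6]
        # By contrast, executor.map(lambda x: x * 2, [1, 2, 3]) would fail,
        # because a lambda can't be pickled and shipped to a worker process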

Also, the work won't be done in a predictable sequence. If one step of your processing depends on the result of another, this approach won't work either.
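That said, if your items are independent and you simply don't care what order the results arrive in, concurrent.futures also offers submit() together with as_completed(), which hands you each result as soon as it's ready. A sketch reusing the make_image_thumbnail function from above:

import concurrent.futures
import glob

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # submit() immediately returns a Future for each file
        futures = [executor.submit(make_image_thumbnail, f)
                   for f in glob.glob("*.jpg")]
        # as_completed() yields each future as it finishes, not in input order
        for future in concurrent.futures.as_completed(futures):
            print(f"Saved {future.result()}")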

What about that GIL thing?

You may have heard of a Python feature called the Global Interpreter Lock, or GIL. It means that even if your program is multi-threaded, only one thread can execute a Python instruction at a time: the GIL guarantees that only one Python thread runs at any given moment. In other words, multi-threaded Python code doesn't truly run in parallel and can't take full advantage of a multi-core CPU.

But a Process Pool solves this problem! Because we're running entirely separate Python instances, each instance gets its own GIL, so our Python code really does run in parallel!
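A handy detail: ThreadPoolExecutor and ProcessPoolExecutor share the same interface, so switching between them is a one-line change. For CPU-bound work like resizing images, the process pool is the one that escapes the GIL; a thread pool is still a fine choice for I/O-bound work like downloading files, since the GIL is released while a thread waits on I/O. A sketch of the two:

import concurrent.futures

# CPU-bound work: separate processes, each with its own interpreter and GIL
with concurrent.futures.ProcessPoolExecutor() as executor:
    ...

# I/O-bound work: threads are fine, because the GIL is released during I/O waits
with concurrent.futures.ThreadPoolExecutor() as executor:
    ...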

Do not be afraid of parallel processing!

With the concurrent.futures library, Python lets you make a tiny change to a script and immediately put every CPU in your computer to work. Don't be afraid to try it: once you've mastered it, it's as easy as writing a for loop, and it can make your data processing scripts fly.

