Just Three Lines of Code Can Make Your Python Run 4x Faster!

Python is a language well suited to processing data and automating repetitive tasks. Before training a machine learning model, we usually need to preprocess our data, and Python is great for that kind of job: if you need to resize tens of thousands of images, no problem! You can almost always find a Python library that makes the work easy.

But while Python is easy to learn and easy to use, it is not the fastest of languages. By default, a Python program runs as a single process on a single CPU. If your computer was made in the last few years, it most likely has a quad-core processor, that is, four CPUs. That means that while you sit waiting for your Python script to finish processing data, 75% or more of your computer's computing resources are just idling!

Today I'll show you how to run Python functions in parallel and put all of your computer's processing power to work. Thanks to Python's concurrent.futures module, it takes only three lines of code to turn an ordinary data processing script into one that processes data in parallel, 4x faster!

The Usual Way to Process Data in Python

▲▲▲

Let's say we have a folder full of image files and we want to create a thumbnail for each image with Python.

Here's a short script that uses Python's built-in glob function to get a list of all the JPEG files in the folder, and then uses the Pillow image processing library to save a 128-pixel thumbnail of each one:

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

# Loop over all the JPEG images in the folder, creating a thumbnail for each
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)

    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")


The script follows a simple pattern you'll see over and over again in data processing scripts:

  • First, get a list of the files (or other data) you want to process.

  • Write a helper function that can process one piece of data from the list.

  • Call the helper function in a for loop, processing each piece of data one at a time.

Let's test the script on a folder containing 1,000 JPEG images and time how long it takes to run:

$ time python3 thumbnails_1.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real 0m8.956s
user 0m7.086s
sys 0m0.743s


The program took 8.9 seconds to run, but how hard was the computer really working?

Let's run the program again while keeping an eye on Activity Monitor:

[Activity Monitor screenshots taken while the script runs]

75% of the computer's processing resources are idle! What's going on?

The reason is that my computer has 4 CPUs, but Python is using only one of them. So the program is straining one CPU to its limit while the other three do nothing. I need a way to split the work into four chunks that I can process in parallel. Luckily, Python gives us a very easy way to do that!

Let's Try Creating Multiple Processes

▲▲▲

Here's one approach that would let us process this data in parallel:

1. Split the list of JPEG files into 4 chunks.
2. Run 4 separate instances of the Python interpreter.
3. Have each Python instance process one of the 4 chunks of data.
4. Merge the results from the 4 processes to get the final list of results.

Four copies of Python running on four separate CPUs should be able to handle roughly four times the workload of a single CPU, right?
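Just for intuition, here's a minimal sketch of what step 1 might look like if we did the splitting by hand. The chunk helper is made up for illustration; as we'll see, Python does all of this for us:

def chunk(items, n):
    # Yield up to n contiguous slices that together cover all of items
    size = (len(items) + n - 1) // n
    for i in range(0, len(items), size):
        yield items[i:i + size]

print(list(chunk([1, 2, 3, 4, 5, 6, 7, 8], 4)))
# [[1, 2], [3, 4], [5, 6], [7, 8]]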

Best of all, Python does the most troublesome part of this work for us. We just tell it which function to run and how many instances to use, and it takes care of everything else. The whole change amounts to three lines of code.

First, we need to import the concurrent.futures library, which comes built into Python:

import concurrent.futures


Next, we need to tell Python to launch four extra Python instances. We do that by creating a Process Pool:

with concurrent.futures.ProcessPoolExecutor() as executor:


By default, this creates one Python process for each CPU in your computer, so if you have 4 CPUs, it will start four processes.
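If you want to override that default, ProcessPoolExecutor accepts a max_workers argument. A minimal sketch (the value 2 here is just an example):

import concurrent.futures

# Cap the pool at 2 worker processes instead of one per CPU
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
    pass  # submit work to the pool here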

The final step is to have the Process Pool execute our helper function on the data list using those four processes. To do that, we take the for loop we had before:

for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)


and replace it with a new call to executor.map():

image_files = glob.glob("*.jpg")
for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):


The executor.map() call takes the helper function and the list of data to process. It does all the troublesome work for us: splitting the list into sub-lists, sending each sub-list to a child process, running the child processes, and merging the results. Well done!

It also hands back the result of each function call. executor.map() returns the results in the same order as the input data, so I used Python's zip() function as a shortcut to match each original filename with its result in a single step.
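To see what that one-liner is doing for us, here is a rough, more verbose equivalent built from executor.submit() and Future.result(). This is a sketch for illustration, not the library's actual internals, and it reuses the make_image_thumbnail helper from the script above:

import concurrent.futures
import glob

with concurrent.futures.ProcessPoolExecutor() as executor:
    image_files = glob.glob("*.jpg")
    # Schedule one call per file; each submit() returns a Future immediately
    futures = [executor.submit(make_image_thumbnail, f) for f in image_files]
    # result() blocks until that particular call has finished
    for image_file, future in zip(image_files, futures):
        print(f"A thumbnail for {image_file} was saved as {future.result()}")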

Here's the program after those three changes:

import glob
import os
import concurrent.futures
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"

    # Create and save the thumbnail
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

# Create a Process Pool; by default, one worker per CPU in the machine
with concurrent.futures.ProcessPoolExecutor() as executor:
    # Get the list of files to process
    image_files = glob.glob("*.jpg")

    # Process the list of files, splitting the work across the Process Pool to use all CPUs!
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")


Let's run this script and see whether it finishes the data processing any faster:

$ time python3 thumbnails_2.py
A thumbnail for 1430028941_4db9dedd10.jpg was saved as 1430028941_4db9dedd10_thumbnail.jpg
[... about 1000 more lines of output ...]
real 0m2.274s
user 0m8.959s
sys 0m0.951s


The script finished processing the data in 2.2 seconds, 4x faster than the original version! It processed the data faster because we used four CPUs instead of one.

But if you look closely, you'll see the "user" time is almost 9 seconds. How can the program have finished in 2.2 seconds but still show 9 seconds of running time? That doesn't seem possible, does it?

It's possible because "user" time is CPU time summed across all CPUs. The total CPU time needed to finish the work was the same 9 seconds as before, but we spread it over four CPUs (about 9 / 4 ≈ 2.25 seconds each), so the actual wall-clock processing time was only 2.2 seconds!

Note: Starting extra Python processes and distributing data to the child processes takes time of its own, so this method is not always guaranteed to deliver a large speedup.
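If each individual work item is tiny, that inter-process overhead can dominate. One knob worth knowing: executor.map() accepts a chunksize argument (since Python 3.5) that batches items into fewer messages between processes. A minimal sketch, reusing the make_image_thumbnail helper from above; the batch size of 50 is an arbitrary example:

import concurrent.futures
import glob

if __name__ == "__main__":
    image_files = glob.glob("*.jpg")
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Ship 50 filenames per message to each worker instead of one at a time
        thumbnails = list(executor.map(make_image_thumbnail, image_files, chunksize=50))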

Will This Always Speed Up My Data Processing Scripts?

▲▲▲

If you have a list of data and each item can be processed independently of the others, a Process Pool like the one described here is a good way to speed things up. Here are a few examples of work well suited to parallel processing:

  • Scraping statistics from a set of separate web server logs.

  • Parsing data from a pile of XML, CSV, and JSON files (see the sketch after this list).

  • Preprocessing large numbers of images to build a machine learning dataset.
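For instance, here's the same Process Pool pattern applied to the file-parsing bullet. The row-counting task and the *.csv filenames are made up for illustration:

import concurrent.futures
import csv
import glob

def count_rows(csv_path):
    # Parse one CSV file and return how many rows it contains
    with open(csv_path, newline="") as f:
        return sum(1 for _ in csv.reader(f))

if __name__ == "__main__":
    csv_files = glob.glob("*.csv")
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for path, rows in zip(csv_files, executor.map(count_rows, csv_files)):
            print(f"{path}: {rows} rows")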

But remember: a Process Pool is not a panacea. Using one means passing data back and forth between separate Python processes. If the data you're working with can't be efficiently handed between processes, this approach won't work. In short, you must be processing data of a type that Python knows how to pickle.
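A quick way to check whether a value can make that trip: arguments and results cross the process boundary via the pickle module, so anything pickle.dumps() rejects can't be used. The is_picklable helper below is made up for illustration:

import pickle

def is_picklable(value):
    try:
        pickle.dumps(value)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(is_picklable({"file": "photo.jpg", "size": (128, 128)}))  # True: plain data pickles fine
print(is_picklable(lambda x: x))                                # False: lambdas can't be pickled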

Also, the items are not processed in a predictable, sequential order. If processing one item depends on the result of processing an earlier one, this approach won't work either.
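For example, a running total is inherently sequential: each step needs the previous step's result before it can start, so there is no independent work to hand out to a pool.

def running_totals(values):
    total = 0
    totals = []
    for v in values:
        total += v          # depends on the result of the previous iteration
        totals.append(total)
    return totals

print(running_totals([1, 2, 3, 4]))  # [1, 3, 6, 10]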

What About That GIL Thing?

▲▲▲

You may have heard of something in Python called the Global Interpreter Lock, or GIL. The GIL ensures that only one Python thread is executing at any moment, even in a multi-threaded application. In other words, multi-threaded Python code doesn't truly run in parallel, so it can't take full advantage of a multi-core CPU.

But a Process Pool solves this problem! Because we're running separate Python instances, each instance gets its own GIL. The result is Python code that really does run in parallel!
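You can see the difference with a small experiment: a purely CPU-bound task run through a thread pool (throttled by the GIL) and through a process pool (one GIL per process). This is a sketch; exact timings depend on your machine, and the loop size of 10,000,000 is arbitrary:

import concurrent.futures
import time

def burn_cpu(n):
    # Purely CPU-bound work: no I/O for threads to overlap
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [10_000_000] * 4

    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        list(pool.map(burn_cpu, work))
    print(f"threads:   {time.perf_counter() - start:.2f}s")  # typically about serial speed

    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor() as pool:
        list(pool.map(burn_cpu, work))
    print(f"processes: {time.perf_counter() - start:.2f}s")  # scales with CPU count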

Don't Be Afraid of Parallel Processing!

▲▲▲

With the concurrent.futures library, Python lets you make a tiny change to a script and immediately put every CPU in your computer to work. Don't be afraid to try it: once you've mastered it, it's as simple as writing a for loop, and it can make your data processing scripts fly.


Source: blog.51cto.com/14510224/2438069