Reading text files with multiprocessing slower than without

Aenaon :

I have some text files that I need to read with Python. Each text file contains only an array of floats (i.e. no strings), and the size of the array is 2000-by-2000. I tried to use the multiprocessing package, but for some reason it now runs slower. The times I get on my PC for the code attached below are

  • Multi thread: 73.89 secs
  • Single thread: 60.47 secs

What am I doing wrong here, and is there a way to speed up this task? My PC has an Intel Core i7 processor, and in real life I have several hundred of these text files, 600 or even more.

import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
import os
import time
from datetime import datetime


def read_from_disk(full_path):
    print('%s reading %s' % (datetime.now().strftime('%H:%M:%S'), full_path))
    out = np.genfromtxt(full_path, delimiter=',')
    return out

def make_single_path(n):
    return r"./dump/%d.csv" % n

def save_flatfiles(n):
    for i in range(n):
        temp = np.random.random((2000, 2000))
        _path = os.path.join('.', 'dump', str(i)+'.csv')
        np.savetxt(_path, temp, delimiter=',')


if __name__ == "__main__":
    # make some text files
    n = 10
    save_flatfiles(n)

    # list with the paths to the text files
    file_list = [make_single_path(d) for d in range(n)]

    pool = ThreadPool(8)
    start = time.time()
    results = pool.map(read_from_disk, file_list)
    pool.close()
    pool.join()
    print('finished multi thread in %s' % (time.time()-start))

    start = time.time()
    for d in file_list:
        out = read_from_disk(d)
    print('finished single thread in %s' % (time.time() - start))
    print('Done')
Shubham Sharma :

You are using multiprocessing.dummy, which replicates the API of multiprocessing but is actually a wrapper around the threading module.

So basically you are using threads instead of processes, and threads in Python are not useful for computational tasks because of the GIL (Global Interpreter Lock): only one thread can execute Python bytecode at a time, so the CPU-bound parsing inside np.genfromtxt cannot actually run in parallel.

So replace:

from multiprocessing.dummy import Pool as ThreadPool

With:

from multiprocessing import Pool

I tried running your code on my machine (an i5 processor), and it finished execution in 45 seconds, so I would say that's a big improvement.

Hope this clears your understanding.
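For reference, here is a minimal, self-contained sketch of the question's reading step with the import swapped as suggested. The `read_all` helper name and the tiny 4-by-4 demo arrays are illustrative choices, not part of the original code; note that worker functions must live at module level so they can be pickled and sent to the child processes.

```python
import os
import tempfile
from multiprocessing import Pool  # real processes, not threads

import numpy as np


def read_from_disk(full_path):
    # np.genfromtxt parses the comma-separated floats into a 2-D array
    return np.genfromtxt(full_path, delimiter=',')


def read_all(file_list, workers=8):
    # A process pool sidesteps the GIL: each worker parses files in its
    # own interpreter.  Pool.map preserves the order of file_list.
    with Pool(processes=workers) as pool:
        return pool.map(read_from_disk, file_list)


if __name__ == "__main__":
    # Demo with tiny 4x4 arrays so it runs quickly; the real files
    # in the question are 2000x2000.
    tmpdir = tempfile.mkdtemp()
    file_list = []
    for i in range(3):
        path = os.path.join(tmpdir, '%d.csv' % i)
        np.savetxt(path, np.random.random((4, 4)), delimiter=',')
        file_list.append(path)

    results = read_all(file_list, workers=3)
    print([r.shape for r in results])
```

With hundreds of 2000-by-2000 files, the parsing cost dominates, which is exactly the workload where processes beat threads.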
