I have some text files that I need to read with Python. Each file holds a 2000-by-2000 array of floats only (i.e. no strings). I tried to use the multiprocessing package, but for some reason the code now runs slower. The times I get on my PC for the code below are:
- Multithreaded: 73.89 secs
- Single-threaded: 60.47 secs
What am I doing wrong here, and is there a way to speed up this task? My PC has an Intel Core i7 processor, and in real life I have several hundred of these text files, 600 or even more.
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool
import os
import time
from datetime import datetime


def read_from_disk(full_path):
    print('%s reading %s' % (datetime.now().strftime('%H:%M:%S'), full_path))
    out = np.genfromtxt(full_path, delimiter=',')
    return out


def make_single_path(n):
    return r"./dump/%d.csv" % n


def save_flatfiles(n):
    # create the output directory if it does not exist yet
    os.makedirs(os.path.join('.', 'dump'), exist_ok=True)
    for i in range(n):
        temp = np.random.random((2000, 2000))
        _path = os.path.join('.', 'dump', str(i)+'.csv')
        np.savetxt(_path, temp, delimiter=',')


if __name__ == "__main__":
    # make some text files
    n = 10
    save_flatfiles(n)

    # list with the paths to the text files
    file_list = [make_single_path(d) for d in range(n)]

    pool = ThreadPool(8)
    start = time.time()
    results = pool.map(read_from_disk, file_list)
    pool.close()
    pool.join()
    print('finished multi thread in %s' % (time.time()-start))

    start = time.time()
    for d in file_list:
        out = read_from_disk(d)
    print('finished single thread in %s' % (time.time() - start))
    print('Done')
You are using multiprocessing.dummy, which replicates the API of multiprocessing but is actually a wrapper around the threading module. So you are using threads instead of processes. Because of the GIL, only one thread can execute Python bytecode at a time, so threads in Python do not help with computational (CPU-bound) tasks such as parsing text files.
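The difference can be checked with a minimal sketch like the one below. The function count_primes and the helper timed_map are hypothetical names made up for this illustration and are not part of the question's code; count_primes simply stands in for any CPU-bound pure-Python work. Both pool classes expose the same map API, so only the class changes. Timings depend on the machine, so none are quoted here.

```python
import time
from multiprocessing import Pool                        # process-based pool
from multiprocessing.dummy import Pool as ThreadPool    # thread-based pool, same API


def count_primes(limit):
    # Hypothetical CPU-bound task: count primes below limit by trial division.
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed_map(pool_cls, func, items, workers=4):
    # Map func over items with the given pool class; return (results, seconds).
    start = time.time()
    with pool_cls(workers) as pool:
        results = pool.map(func, items)
    return results, time.time() - start


if __name__ == "__main__":
    limits = [20000] * 8
    for name, cls in (("threads", ThreadPool), ("processes", Pool)):
        results, secs = timed_map(cls, count_primes, limits)
        print('%s: %.2f s for %d tasks' % (name, secs, len(results)))
```

On a multi-core machine the process-based run would normally be the faster of the two for work like this, since each process has its own interpreter and its own GIL.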
So replace:
from multiprocessing.dummy import Pool as ThreadPool
with:
from multiprocessing import Pool
and create the pool with Pool(8) instead of ThreadPool(8).
I ran your code on my machine, which has an i5 processor, and it finished in 45 seconds, so I would say that's a big improvement.
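For reference, here is a minimal sketch of reading the files with a process pool. read_from_disk mirrors the function in the question (without the logging); read_all, the temporary directory, and the small 50-by-50 demo arrays are assumptions made for this illustration so that it runs quickly, and are not the setup that produced the timing above.

```python
import os
import tempfile
from multiprocessing import Pool

import numpy as np


def read_from_disk(full_path):
    # Parse one comma-separated text file into a 2-D float array.
    return np.genfromtxt(full_path, delimiter=',')


def read_all(paths, processes=4):
    # Hypothetical helper: each worker is a separate process with its own GIL,
    # so several files can be parsed at the same time on different cores.
    with Pool(processes) as pool:
        return pool.map(read_from_disk, paths)


if __name__ == "__main__":
    # Write a few small demo files, then read them back in parallel.
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(4):
            path = os.path.join(tmp, '%d.csv' % i)
            np.savetxt(path, np.random.random((50, 50)), delimiter=',')
            paths.append(path)
        arrays = read_all(paths, processes=2)
        print([a.shape for a in arrays])
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps child processes from re-running the pool setup on platforms that start workers by spawning a new interpreter.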
Hope this clears things up.