Fast data processing with multiprocessing in Python

Reprinted from: https://blog.csdn.net/bryan__/article/details/78786648

Data sharding: Tasks that can be split over the data are well suited to multi-process code. The core idea is to shard the data, have each process return a result for its shard (possibly out of order), and then merge the results. Typical applications: multi-process crawlers and MapReduce-like jobs. The drawback is that every child process copies the full state of the parent process, so the memory waste can be severe.

import math
from multiprocessing import Pool

def run(data, index, size):  # data: full data set, index: shard index, size: number of processes
    chunk = math.ceil(len(data) / size)
    start = chunk * index
    end = (index + 1) * chunk if (index + 1) * chunk < len(data) else len(data)
    temp_data = data[start:end]
    # do something with temp_data here
    return temp_data  # return the processed shard and collect it in the parent

data = list(range(1000))  # the data to shard; use your own data set here
processor = 40
res = []
p = Pool(processor)
for i in range(processor):
    res.append(p.apply_async(run, args=(data, i, processor)))
    print(str(i) + ' process started!')
p.close()
p.join()
for r in res:
    print(r.get())  # use get() to fetch each shard's result
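The merge step mentioned above can then be as simple as concatenating the collected results; a minimal sketch, assuming each shard's result is a list as in the code above:

# merge the per-shard results returned by the workers
merged = []
for r in res:
    merged.extend(r.get())  # get() returns the cached result, so calling it again is fine
print(len(merged))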

Split file processing: When memory is tight, plain data sharding is no longer an option, because each child process copies the parent's full state and memory is wasted. In that case, first write the large data set to disk in fragments, use del to release the in-memory data, and have each worker read only its own fragment inside the processing function. The child processes then load only the data they actually need and do not take up much RAM.

from multiprocessing import Pool
import pandas as pd
import math

data = pd.DataFrame({'user_id': [1, 2, 3, 4], 'item_id': [6, 7, 8, 9]})
users = pd.DataFrame(data['user_id'].unique(), columns=['user_id'])
processor = 4
l_data = len(users)
size = math.ceil(l_data / processor)
res = []

def run(i):
    # each worker reads only its own fragment from disk
    data = pd.read_csv('../data/user_' + str(i) + '.csv')
    # do something with data here
    return data

# write each fragment to disk first
for i in range(processor):
    start = size * i
    end = (i + 1) * size if (i + 1) * size < l_data else l_data
    user = users[start:end]
    t_data = pd.merge(data, user, on='user_id').reset_index(drop=True)
    t_data.to_csv('../data/user_' + str(i) + '.csv', index=False)
    print(len(t_data))

# release the in-memory data before creating the pool,
# so the child processes do not inherit a copy of it
del data, l_data, users
p = Pool(processor)
for i in range(processor):
    res.append(p.apply_async(run, args=(i,)))
    print(str(i) + ' process started!')
p.close()
p.join()
data = pd.concat([i.get() for i in res])
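Since the fragments on disk are only temporary, they can be removed once the results have been concatenated; a minimal cleanup sketch, assuming the same '../data/user_<i>.csv' naming used above:

import os
# delete the temporary fragment files after the merge
for i in range(processor):
    os.remove('../data/user_' + str(i) + '.csv')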

Multi-process data sharing: When child processes need to modify shared data, use a Manager to share it between processes:

from multiprocessing import Process, Manager

# function executed by each child process;
# d is a special dictionary shared between processes
def func(i, d):
    d[i] = i + 100
    print(d.values())

# create the shared dictionary in the main process
m = Manager()
d = m.dict()
processes = []
for i in range(5):
    # let each child process modify the main process's shared dictionary
    p = Process(target=func, args=(i, d))
    p.start()
    processes.append(p)
# wait for all children, not just the last one started
for p in processes:
    p.join()
Example output (the order varies between runs):
[100]
[100, 101]
[100, 101, 102, 103]
[100, 101, 102, 103]
[100, 101, 102, 103, 104]
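Once all children have been joined, the parent can read the shared dictionary like a normal one, for example by copying it into a plain dict:

# copy the shared Manager dict into an ordinary dict in the parent process
result = dict(d)
print(result)  # e.g. {0: 100, 1: 101, 2: 102, 3: 103, 4: 104}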
