Reprinted from: https://blog.csdn.net/bryan__/article/details/78786648
Data sharding: tasks that slice up data are well suited to multi-process handling. The core idea is to shard the data, let each process return the result for its shard (possibly out of order), and then merge the results. Typical scenarios: multi-process crawlers and MapReduce-style jobs. The drawback is that each child process copies the parent process's entire state, which can waste a lot of memory.
```python
import math
from multiprocessing import Pool

def run(data, index, size):
    # data: the full dataset; index: this shard's index; size: number of processes
    shard_size = math.ceil(len(data) / size)
    start = shard_size * index
    end = (index + 1) * shard_size if (index + 1) * shard_size < len(data) else len(data)
    temp_data = data[start:end]
    # do something with temp_data
    return temp_data  # return this shard's result and collect it later

# data is assumed to already be defined here
processor = 40
res = []
p = Pool(processor)
for i in range(processor):
    res.append(p.apply_async(run, args=(data, i, processor)))
    print(str(i) + ' processor started !')
p.close()
p.join()
for r in res:
    print(r.get())  # use get() to fetch each worker's result
```
Split-file processing: when memory is tight, plain data sharding no longer works, because every child process copies the parent's full state and wastes memory. In that case, first write the large dataset's fragments to disk, `del` the in-memory data to release it, and have each worker read its own fragment inside the multiprocessing function. The child processes then load only the data they actually need, without occupying large amounts of RAM.
```python
import math
from multiprocessing import Pool
import pandas as pd

data = pd.DataFrame({'user_id': [1, 2, 3, 4], 'item_id': [6, 7, 8, 9]})
users = pd.DataFrame(data['user_id'].unique(), columns=['user_id'])
processor = 4
l_data = len(users)
size = math.ceil(l_data / processor)
res = []

def run(i):
    # Each worker reads only its own fragment back from disk
    data = pd.read_csv('../data/user_' + str(i) + '.csv')
    # do something
    return data

# First, write each fragment to disk
for i in range(processor):
    start = size * i
    end = (i + 1) * size if (i + 1) * size < l_data else l_data
    user = users[start:end]
    t_data = pd.merge(data, user, on='user_id').reset_index(drop=True)
    t_data.to_csv('../data/user_' + str(i) + '.csv', index=False)
    print(len(t_data))

# Release the large objects before the workers are created,
# so they are not copied into the children
del data, users

p = Pool(processor)  # create the pool only after the data has been freed
for i in range(processor):
    res.append(p.apply_async(run, args=(i,)))
    print(str(i) + ' processor started !')
p.close()
p.join()
data = pd.concat([r.get() for r in res])
```
Multi-process data sharing: when child processes need to modify shared data, use data sharing via a `Manager`:
```python
from multiprocessing import Process, Manager

# Function executed by each child process.
# d is a special dictionary shared between processes.
def func(i, d):
    d[i] = i + 100
    print(d.values())

# Create the special dictionary in the main process
m = Manager()
d = m.dict()
for i in range(5):
    # Let each child process modify the main process's special dictionary
    p = Process(target=func, args=(i, d))
    p.start()
p.join()  # note: this only waits for the last process started
```

One possible output (the order depends on process scheduling):

```
[100]
[100, 101]
[100, 101, 102, 103]
[100, 101, 102, 103]
[100, 101, 102, 103, 104]
```