33 Pipes, data sharing, process pools, and callback functions

Pipes

The second form of inter-process communication (IPC) is the pipe. It is not recommended for everyday use (just understand how it works), because data passing through the pipe's endpoints can easily end up in an unsafe state.

from multiprocessing import Pipe,Process

def func(conn1,conn2):
    msg = conn1.recv()  # receives what was sent through conn2
    # msg1 = conn2.recv()  # would receive what was sent through conn1
    print('>>>',msg)
    # print('>>>',msg1)

if __name__ == '__main__':
    # get both ends of the pipe; it is duplex, so either end can send and receive messages
    conn1,conn2 = Pipe()  # the pipe must be created before the Process
    p = Process(target=func,args=(conn1,conn2,))  # pass both ends to the child process
    p.start()
    conn1.send('hello')
    conn1.close()
    conn2.send('kid')
    conn2.close()

    print('main process finished')

# Note: close pipe ends as soon as they are no longer needed, to avoid exceptions (e.g. a recv() that hangs or raises EOFError).
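
As a minimal sketch of that note (reader, left, and right are names made up here, not from the original code), the receiving end raises EOFError once every send end of the pipe has been closed, which gives the child a clean way to know the parent is done sending:

from multiprocessing import Pipe, Process

def reader(conn):
    while True:
        try:
            print('got:', conn.recv())
        except EOFError:   # raised once every send end of the pipe has been closed
            print('pipe closed, reader exits')
            break

if __name__ == '__main__':
    left, right = Pipe()
    p = Process(target=reader, args=(right,))
    p.start()
    left.send('one')
    left.send('two')
    left.close()    # closing the parent's send end lets the child's recv() raise EOFError
    right.close()   # the parent no longer needs its own copy of the receive end
    p.join()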

Data sharing

Data sharing between processes: the Manager module (use it sparingly). Data between processes is normally independent, and communication can be implemented with a Queue or a Pipe, both of which are based on message passing. When genuinely shared state is needed, the Manager module provides it.

from multiprocessing import Manager,Process,Lock

def func1(dic,loc):
    # loc.acquire()  # without the lock the result is frequently wrong
    dic['num'] -= 1
    # loc.release()

if __name__ == '__main__':
    m = Manager()
    loc = Lock()
    dic = m.dict({'num':100})
    p_list = []
    for i in range(100):
        p = Process(target=func1, args=(dic,loc))
        p_list.append(p)
        p.start()

    [pp.join() for pp in p_list]

    print('>>>>>',dic['num'])
# Without a lock, several child processes can read the same value before any of them writes it back, so the shared data is unsafe; spawning very many processes also consumes a lot of resources and can easily hang the machine.

When multiple processes work on shared data (the same applies when several processes operate on the same data file at the same time), running without a lock produces incorrect results and the processes are unsafe, so a lock is required.
Summary: sharing data between processes should be avoided; if processes really need to communicate, prefer a process-safe tool such as a Queue, which avoids the problems that manual locking brings. A locked version of the counter above is sketched below.
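
As a minimal sketch of the locked version (safe_decrement and shared are illustrative names, not from the post above), holding the Lock in a with statement makes the read-modify-write of the shared counter atomic across processes:

from multiprocessing import Manager, Process, Lock

def safe_decrement(shared, lock):
    with lock:              # the lock serializes the read-modify-write
        shared['num'] -= 1

if __name__ == '__main__':
    with Manager() as m:
        lock = Lock()
        shared = m.dict({'num': 100})
        workers = [Process(target=safe_decrement, args=(shared, lock)) for _ in range(100)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print('>>>>>', shared['num'])  # reliably 0 when the update is locked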

Process pool Pool

A process pool saves memory and saves the time spent creating and destroying processes.
Creating a process costs time, and destroying one (freeing its memory, variables, file information, and so on) costs time as well. If thousands of processes are opened, the operating system cannot run them all at once; it has to maintain a large process list and, while scheduling, keep switching between them and recording each process's context (all the variables and bookkeeping you never see, but that the OS must track), which hurts the program's efficiency. So we cannot open an unlimited number of processes for our tasks; instead we use a process pool. The sketch below compares the two approaches.
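
A rough comparison sketch, under arbitrary assumptions (100 tasks, a pool of 4 workers, and a trivial task function made up for illustration): starting one process per task pays the creation and destruction cost every time, while the pool reuses a few workers:

from multiprocessing import Pool, Process
import time

def task(n):
    n += 1   # trivial work, so the timing mostly measures process start-up/tear-down cost

if __name__ == '__main__':
    start = time.time()
    procs = [Process(target=task, args=(i,)) for i in range(100)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print('100 separate processes:', time.time() - start)

    start = time.time()
    pool = Pool(4)
    pool.map(task, range(100))
    pool.close()
    pool.join()
    print('pool of 4 workers:', time.time() - start)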

Pool([processes[, initializer[, initargs]]]): create a process pool

Parameter Description:

  1. processes: the number of worker processes to create; if omitted, it defaults to the value of os.cpu_count()
  2. initializer: a callable that each worker process executes on startup; defaults to None
  3. initargs: a tuple of arguments to pass to initializer (a short sketch follows)
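
A short sketch of initializer and initargs under assumed names (init_worker and square are made up for illustration): each worker announces its own pid once as it starts, then serves tasks as usual:

from multiprocessing import Pool
import os

def init_worker(prefix):
    # runs exactly once in every worker process when it starts
    print(prefix, os.getpid())

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4, initializer=init_worker, initargs=('worker started:',))
    print(pool.map(square, range(8)))   # [0, 1, 4, 9, 16, 25, 36, 49]
    pool.close()
    pool.join()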

Common methods:

p.apply(func [, args [, kwargs]])
'''Execute func(*args, **kwargs) in one of the pool's worker processes and return the result. Note that this does NOT run func concurrently across all pool workers. To run func concurrently with different arguments, call p.apply() from different threads or use p.apply_async().'''

p.apply_async(func [, args [, kwargs]])
'''Execute func(*args, **kwargs) in one of the pool's worker processes. The result of this method is an AsyncResult instance; callback is a callable that receives func's return value as its single argument. As soon as func's result becomes available it is passed to callback. The callback must not perform any blocking operations, otherwise it will hold up the delivery of results from the other asynchronous tasks.'''
    
p.close()    # Close the pool to prevent any further task submission. Tasks still pending will be completed before the worker processes exit.

p.join()    # Wait for all worker processes to exit. This method may only be called after close() or terminate().

Features:

  • The pool is usually created with a number of processes equal to the CPU count (or the CPU count plus one).
  • map(func, iterable) is asynchronous and has close and join built in; its return value is a list (a map sketch appears after the apply_async example below).
  • apply(func, args=()) is synchronous: the code after it only runs once func has finished, and the return value is whatever func returns, as in the example below.
from multiprocessing import Process,Pool
import time

def func1(i):
    num = 0
    for j in range(3):
        num += i
    time.sleep(1)
    print(num)
    return num

if __name__ == '__main__':
    pool = Pool(6)
    for i in range(10):
        res = pool.apply(func1,args=(i,))  # apply is a synchronous/serial call: inefficient and rarely used
        # print(res)

  • apply_async(func, args=()) is asynchronous: it only submits the task, and once func has been handed to a pool process the program keeps running. The return value is the object returned by apply_async (an AsyncResult). At the end, close() followed by join() is needed to keep the pool's processes synchronized with the main process's code.
from multiprocessing import Process,Pool
import time

def func1(i):
    num = 0
    for j in range(5):
        num += i
    time.sleep(1)
    # print('>>>>>',num)
    return num

if __name__ == '__main__':
    pool = Pool(6)
    red_list = []
    for i in range(10):
        res = pool.apply_async(func1,args=(i,))
        red_list.append(res)
    pool.close()  # not shutting the pool down, just locking it: it tells the pool that no more tasks will be submitted
    pool.join()  # wait for the pool's tasks to finish
    for ress in red_list:
        print(ress.get())  # get() retrieves the return value num; results were cached and come back in submission order, so they print in order
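
As referenced above, a minimal sketch of map (cube is a made-up function name): the call blocks until every task is done and returns the results as a list in input order:

from multiprocessing import Pool

def cube(x):
    return x ** 3

if __name__ == '__main__':
    pool = Pool(4)
    results = pool.map(cube, range(10))   # blocks until all tasks finish
    print(results)                        # [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
    pool.close()
    pool.join()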

Callback

The callback function is executed in the main process; it is commonly used in web crawlers.

from multiprocessing import Pool
def func1(n):
    return n+1

def func2(m):
    print(m)

if __name__ == '__main__':
    p = Pool(5)
    for i in range(10,20):
        p.apply_async(func1,args=(i,),callback=func2)
    p.close()
    p.join()
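
To see where the callback runs, here is a minimal sketch with made-up names (work, on_done): the pid printed inside the callback matches the main process, while each task reports a worker pid:

from multiprocessing import Pool
import os

def work(n):
    return n, os.getpid()          # runs in a pool worker process

def on_done(result):
    n, worker_pid = result
    # this runs back in the main process, so os.getpid() matches 'main pid' below
    print('task', n, 'ran in', worker_pid, '| callback in', os.getpid())

if __name__ == '__main__':
    print('main pid:', os.getpid())
    p = Pool(3)
    for i in range(5):
        p.apply_async(work, args=(i,), callback=on_done)
    p.close()
    p.join()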

Crawler example:

import requests
from multiprocessing import Pool
def get(url):
    response = requests.get(url)
    if response.status_code == 200:
        return url,response.content.decode('utf-8')
def call_back(args):
    url,content = args
    print(url,len(content))

if __name__ == '__main__':
    url_list = [
        'http://www.cnblogs.com/',
        'http://www.baidu.com',
        'http://www.sogou.com',
        'http://www.sohu.com',
    ]
    p = Pool(5)
    for url in url_list:
        p.apply_async(get,args=(url,),callback=call_back)  # Note: the callback takes a single parameter; if the task function returns several values, that one parameter receives them all as a tuple.
    p.close()
    p.join()


Origin blog.csdn.net/weixin_43265998/article/details/91457912