Python multiprocessing notes

# -*- coding:utf-8 -*-
# multiprocessing
import time
import requests
import multiprocessing
from multiprocessing import Pool
from bs4 import BeautifulSoup


def job(url, header):
    # fetch one page of the Top 250 list and print the movie titles on it
    r = requests.get(url, headers=header)
    content = r.text
    soup = BeautifulSoup(content, 'html.parser')
    items = soup.select(".item")
    for i in items:
        print(i.select('.title')[0].text)
if __name__ == "__main__":
    MAX_WORKER_NUM = multiprocessing.cpu_count()
    t1 = time.time()
    urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 226, 25)]
    header = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    }
    p = Pool(MAX_WORKER_NUM)
    for url in urls:
        # apply_async is asynchronous and non-blocking: it does not wait for the current
        # child process to finish, and the operating system scheduler switches between
        # processes, i.e. the processes run in parallel, which improves efficiency
        p.apply_async(job, args=(url, header))
    p.close()
    p.join()
    print("time-consuming:", time.time() - t1)

The example above measures the total time spent crawling the movie names of Douban's Top 250 with multiple processes. It involves some multiprocessing knowledge; to understand the code I looked up more material, and I record some of it here.
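For comparison, a minimal single-process version of the same crawl could look like the following sketch (my own addition, reusing the job(), urls and header defined above); running both makes the time saved by the process pool visible:

t1 = time.time()
for url in urls:
    job(url, header)  # fetch and parse each page one after another, no parallelism
print("time-consuming (single process):", time.time() - t1)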

 

(1) The difference between p.apply_async() and p.apply().

How apply() works: it blocks the main process and runs the child processes one after another; only after all child processes have finished does the main process continue with the code after apply();

How apply_async() works: asynchronous and non-blocking; it does not wait for the child processes to finish, the main process keeps going, and process switching is left to the operating system scheduler;
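Another practical difference worth noting (a small sketch of my own, not from the original post): apply() returns the function's result directly, while apply_async() returns an AsyncResult object whose get() blocks until the result is ready.

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(2) as pool:
        print(pool.apply(square, (3,)))       # blocks until done, prints 9
        res = pool.apply_async(square, (4,))  # returns immediately with an AsyncResult
        print(res.get())                      # blocks here until the result is ready, prints 16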

This can be understood from the execution results of the code; first, look at apply().

# -*- coding:utf-8 -*-
import time
import multiprocessing

def doIt(num):
    print("Process num is: %s" % num)
    time.sleep(1)
    print('Process end %s' % num)

if __name__ == '__main__':
    print('mainProcess start')
    # record the start time
    start_time = time.time()
    # create a pool of three child processes
    pool = multiprocessing.Pool(3)
    print('Child start')
    for i in range(3):
        pool.apply(doIt, [i])
        # pool.apply_async(doIt, [i])
    print('mainProcess done time:%s s' % (time.time() - start_time))

Execution result:
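The output is roughly the following (exact timings vary from run to run):

mainProcess start
Child start
Process num is: 0
Process end 0
Process num is: 1
Process end 1
Process num is: 2
Process end 2
mainProcess done time:3.0... s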

 

 

As the results show, the main process starts, the three child processes it creates start as well, and only after the three child processes have finished in order does the main process go on to execute the subsequent code.

 

When apply() is replaced with apply_async(), the execution result is as follows:
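Roughly, the output becomes (timings vary):

mainProcess start
Child start
mainProcess done time:0.0... s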

 

As can be seen from the results, the main process is not blocked, but it looks as if the child processes were never executed. This is because process switching is controlled by the operating system: the main process runs first, and since the CPU runs very fast, the main process finishes and exits before the system gets around to scheduling the child processes, so they never run. This is of course not what we want; adding join() to the program solves the problem, as follows:

# -*- coding:utf-8 -*-
import time
import multiprocessing

def doIt(num):
    print("Process num is: %s" % num)
    time.sleep(1)
    print('Process end %s' % num)

if __name__ == '__main__':
    print('mainProcess start')
    # record the start time
    start_time = time.time()
    # create a pool of three child processes
    pool = multiprocessing.Pool(3)
    print('Child start')
    for i in range(3):
        # pool.apply(doIt,[i])
        pool.apply_async(doIt,[i])
    pool.close()
    pool.join()
    print('mainProcess done time:%s s' % (time.time() - start_time))

Execution result:
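Roughly (the order of the three "Process num is" lines can vary between runs, and timings differ):

mainProcess start
Child start
Process num is: 0
Process num is: 1
Process num is: 2
Process end 0
Process end 1
Process end 2
mainProcess done time:1.1... s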

 

As the results show, even though apply_async() does not block the main process, the child processes still run, and they can be seen running alternately at roughly the same time. The CPU starts the first child process, and before it finishes the operating system schedules the second child process, and so on, until all child processes have been scheduled and run; after the child processes finish one after another, the main process finally runs to the end.

 

(2) What is the join() used above for?

What join() does: it blocks the main process, waits until all child processes have finished, and only then lets the main process continue running the code that follows. (For a Pool, close() must be called before join().)
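The same blocking behaviour can be seen with a single Process object (a small sketch of my own, not taken from the original example):

import time
import multiprocessing

def work():
    time.sleep(1)
    print('child done')

if __name__ == '__main__':
    p = multiprocessing.Process(target=work)
    p.start()
    p.join()             # the main process blocks here until the child finishes
    print('main done')   # therefore always printed after 'child done'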


Source: www.cnblogs.com/python-kp/p/12125455.html