Python Crawler Case Study: Multi-threading, Multi-processing, and Coroutines

After writing crawlers many times, you will find, once the requirements are met, many places that could be improved, and a very important one is crawling speed. This article explains how to improve crawling speed in code through multi-processing, multi-threading, and coroutines. Note: we will not dig into theory and principles; everything is in the code.

2. Synchronous

First, let us write a simplified crawler, splitting it into separate functions and consciously doing a bit of functional programming. The following code requests the Baidu homepage 300 times and prints the status codes: the parse_1 function sets the number of loop iterations and passes the url to the parse_2 function on each iteration.

import requests

def parse_1():
    url = 'https://www.baidu.com'
    for i in range(300):
        parse_2(url)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

Performance is mostly spent on the IO of the requests: requesting URLs in single-threaded mode inevitably leads to waiting.

The sample code is typical serial logic: parse_1 passes the url to parse_2, parse_2 requests the page and prints the status code, and then parse_1 moves on to the next iteration and repeats the same steps.
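
To see how much of the runtime is spent waiting, here is a quick sketch that times the run with the time module (it assumes the parse_1 defined above is in the same file):

import time

start = time.perf_counter()
parse_1()  # the synchronous version above
# Almost all of the elapsed time is network wait, not computation
print('total time: %.2f seconds' % (time.perf_counter() - start))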

3. Multi-threading

Because the CPU executes only one thread at any given instant while a program runs, multi-threading raises how much of that time the process actually uses and thereby improves CPU utilization.

There are many multi-threading libraries; here concurrent.futures' ThreadPoolExecutor is used for the demonstration, because its code is more concise than that of the other libraries.

For ease of illustration, newly added lines in the code below are prefixed with a > symbol so they are easy to spot; the symbol must be removed before the code is actually run.

import requests
> from concurrent.futures import ThreadPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create thread pool
    > pool = ThreadPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
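
pool.submit schedules the tasks, but the code above never collects what parse_2 returns. When you simply want all the results back, Executor.map is a compact alternative. A minimal sketch under the same assumptions (the same url repeated 300 times, 6 worker threads), with parse_2 changed to return the status code so there is something to collect:

import requests
from concurrent.futures import ThreadPoolExecutor

def parse_2(url):
    response = requests.get(url)
    return response.status_code

def parse_1():
    url = 'https://www.baidu.com'
    # map() submits every url and yields the results in submission order
    with ThreadPoolExecutor(6) as pool:
        for status in pool.map(parse_2, [url] * 300):
            print(status)

if __name__ == '__main__':
    parse_1()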

Synchronous has its counterpart: asynchronous. Asynchronous means the tasks are independent of each other: while waiting for one event, you carry on with your own work instead of waiting for that event to finish before continuing. Multi-threading is one way to implement asynchrony, so multi-threading is asynchronous processing. With asynchronous processing we do not know when the results arrive, and when we do need them we can use a callback.

import requests
from concurrent.futures import ThreadPoolExecutor

# Add a callback function
> def callback(future):
    > print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ThreadPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        # The key step: register the callback on the future
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    # Return the status code so the callback can print it
    return response.status_code

if __name__ == '__main__':
    parse_1()

Python's multi-threading is criticized by countless people because of the GIL (Global Interpreter Lock), but it is still very well suited to IO-bound tasks, and crawling web pages is mostly IO-bound.
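
A small, hedged illustration of that point: the count function below is a made-up CPU-bound helper, and because only one thread can hold the GIL at a time, running four copies of it in a thread pool takes roughly as long as running them one after another. A network request, by contrast, releases the GIL while it waits, which is why the crawler still benefits from threads.

import time
from concurrent.futures import ThreadPoolExecutor

def count(n=5_000_000):
    # Pure Python loop: CPU-bound, holds the GIL while it runs
    while n:
        n -= 1

start = time.perf_counter()
for _ in range(4):
    count()
print('serial  : %.2f s' % (time.perf_counter() - start))

start = time.perf_counter()
with ThreadPoolExecutor(4) as pool:
    # Four threads, but the GIL lets only one execute Python bytecode at a time
    list(pool.map(count, [5_000_000] * 4))
print('threaded: %.2f s' % (time.perf_counter() - start))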

4. Multi-processing

Multi-processing can be implemented in two ways: ProcessPoolExecutor and multiprocessing.

1. ProcessPoolExecutor

It is similar to the ThreadPoolExecutor used for multi-threading.

import requests
> from concurrent.futures import ProcessPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create process pool
    > pool = ProcessPoolExecutor(6)
    for i in range(300):
        > pool.submit(parse_2, url)
    > pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

Apart from the class name changing in two places, the code is still very simple; in the same way, a callback function can also be added.

import requests
from concurrent.futures import ProcessPoolExecutor

> def callback(future):
    > print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ProcessPoolExecutor(6)
    for i in range(300):
        > results = pool.submit(parse_2, url)
        > results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    # Return the status code so the callback can print it
    return response.status_code

if __name__ == '__main__':
    parse_1()

2. multiprocessing

Look at the code directly; everything is explained in the comments.

import requests
> from multiprocessing import Pool

def parse_1():
    url = 'https://www.baidu.com'
    # Build the process pool
    > pool = Pool(processes=5)
    # Store results
    > res_lst = []
    for i in range(300):
        # Submit the task to the pool asynchronously
        > res = pool.apply_async(func=parse_2, args=(url,))
        # Keep the AsyncResult objects; their values must be fetched later
        > res_lst.append(res)
    # Store the final results (they could also be saved or printed directly)
    > good_res_lst = []
    > for res in res_lst:
        # Fetch the result with get() once the process has finished
        > good_res = res.get()
        # Keep only the valid results
        > if good_res:
            > good_res_lst.append(good_res)
    # Close the pool and wait for all tasks to finish
    > pool.close()
    > pool.join()

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)
    # Return the status code so it can be fetched with res.get()
    return response.status_code

if __name__ == '__main__':
    parse_1()
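
For comparison, when all you want is to collect every result, multiprocessing's Pool.map condenses the submit-and-fetch pattern above into one call. A minimal sketch, assuming parse_2 returns the status code so map has something to collect:

import requests
from multiprocessing import Pool

def parse_2(url):
    response = requests.get(url)
    return response.status_code

def parse_1():
    url = 'https://www.baidu.com'
    # map() blocks until every task finishes and returns the results in order
    with Pool(processes=5) as pool:
        status_codes = pool.map(parse_2, [url] * 300)
    print(status_codes)

if __name__ == '__main__':
    parse_1()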

As you can see, the multiprocessing version with apply_async is a bit more tedious, but the library supports more extension. Multi-processing and multi-threading can indeed achieve the goal of speeding things up, but if a thread or process blocks on IO it is still wasted, so there is an even better way ......

5. Asynchronous non-blocking

Coroutines plus callbacks, together with dynamic cooperative scheduling, let us achieve asynchronous non-blocking behaviour. In essence only one thread is used, so the resources are exploited to a much greater extent.

The classic asynchronous non-blocking approach is the asyncio library plus yield; to make it easier to use, higher-level wrappers such as aiohttp gradually appeared, and to really understand asynchronous non-blocking it is best to understand the asyncio library itself. gevent, on the other hand, is a coroutine library that is very easy to use.
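
For reference, before turning to gevent, here is a minimal asyncio + aiohttp sketch of the same 300 requests. It is only an illustration of the asyncio style, not the approach used in the rest of this article, and it assumes aiohttp is installed separately (pip install aiohttp):

import asyncio
import aiohttp

async def parse_2(session, url):
    # await is where control is handed to other coroutines while waiting on IO
    async with session.get(url) as response:
        print(response.status)

async def parse_1():
    url = 'https://www.baidu.com'
    async with aiohttp.ClientSession() as session:
        tasks = [parse_2(session, url) for _ in range(300)]
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(parse_1())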

> from gevent import monkey
# The monkey patch is the soul of making this run as coroutines; apply it before importing requests
> monkey.patch_all()
> import gevent
import requests

def parse_1():
    url = 'https://www.baidu.com'
    # Create a task list
    > tasks_list = []
    for i in range(300):
        > task = gevent.spawn(parse_2, url)
        > tasks_list.append(task)
    > gevent.joinall(tasks_list)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()

gevent can greatly increase the speed, but it also introduces a new problem: what if we do not want to go so fast that we put too much of a burden on the server? With multi-process or multi-thread pools, you can control the pool size. gevent also has a good way to control the speed: build a queue. gevent provides a Queue class for this; the code below changes a little more.

from gevent import monkey
monkey.patch_all()
import gevent
import requests
> from gevent.queue import Queue

# The queue is shared by parse_1 and parse_2, so it is created at module level
> queue = Queue()

def parse_1():
    url = 'https://www.baidu.com'
    tasks_list = []
    for i in range(300):
        # Push every url into the queue
        > queue.put_nowait(url)
    # Spawn two greenlets that consume urls from the queue
    > for _ in range(2):
        > task = gevent.spawn(parse_2)
        > tasks_list.append(task)
    gevent.joinall(tasks_list)

# No parameter is needed; the urls are taken from the queue
> def parse_2():
    # Keep looping as long as the queue is not empty
    > while not queue.empty():
        # Pop a url from the queue
        > url = queue.get_nowait()
        response = requests.get(url)
        # Print the remaining queue size together with the status code
        > print(queue.qsize(), response.status_code)

if __name__ == '__main__':
    parse_1()
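
As a side note, gevent also ships a pool class, gevent.pool.Pool, that caps how many greenlets run at the same time, which is another way to keep the crawl speed under control. A minimal sketch with at most 2 concurrent requests, under the same assumptions as above:

from gevent import monkey
# Patch the standard library before importing requests
monkey.patch_all()
import gevent
from gevent.pool import Pool
import requests

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

def parse_1():
    url = 'https://www.baidu.com'
    # The pool size limits how many greenlets run concurrently
    pool = Pool(2)
    for i in range(300):
        pool.spawn(parse_2, url)
    # Wait until every spawned greenlet has finished
    pool.join()

if __name__ == '__main__':
    parse_1()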

Conclusion

These are several commonly used acceleration methods. If you are interested, you can use the time module to measure and compare their running times. Speeding up a crawler is an important skill, but properly controlling the speed is also a good habit for crawler developers; do not put too much pressure on the server. Bye~

Source: blog.csdn.net/weixin_46089319/article/details/105365180