After writing a few crawlers, you start to notice plenty of room for improvement, and one very important point is crawling speed. This article explains how to improve crawling speed with multi-processing, multi-threading, and coroutines. Note: we won't dig deep into theory and principles; everything is in the code.
II. Synchronous
First, let's write a simplified crawler, splitting it into functions in a deliberately functional style. The following code visits the Baidu homepage 300 times and prints the status code each time: parse_1 controls the number of loop iterations (counting from 0) and passes the url to parse_2 on each pass.
import requests

def parse_1():
    url = 'https://www.baidu.com'
    for i in range(300):
        parse_2(url)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
Performance is consumed mainly by the IO of the requests: requesting URLs in a single thread inevitably means waiting for each response before moving on.
The sample code is typical serial logic: parse_1 passes the url to parse_2 on each iteration, parse_2 requests the page and prints the status code, and parse_1 then continues to the next iteration, repeating the previous steps.
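To make the cost of the serial version concrete, here is a minimal timing sketch. time.perf_counter is the standard timer; fake_get is a hypothetical stand-in for requests.get (sleeping instead of touching the network), so the numbers are reproducible:

```python
import time

def fake_get(url):
    # Stand-in for requests.get: pretend each request takes 0.1 s of IO wait
    time.sleep(0.1)
    return 200

def parse_1():
    url = 'https://www.baidu.com'
    for i in range(10):  # 10 instead of 300 to keep the demo short
        fake_get(url)

start = time.perf_counter()
parse_1()
elapsed = time.perf_counter() - start
# Serial: 10 requests x 0.1 s each, so roughly 1 second in total
print(f'{elapsed:.2f}s')
```

The same harness can be wrapped around each of the versions below to compare them.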
III. Multi-threading
At any given time slice the CPU executes only one thread, so multi-threading raises CPU utilization by letting the CPU work on one thread while the others wait on IO.
There are many multi-threading libraries; here we use ThreadPoolExecutor from concurrent.futures to demonstrate, because its code is more concise than that of the other libraries.
For ease of illustration, newly added lines in the code below are marked with a > symbol at the start of the line; remove the markers before actually running the code.
import requests
> from concurrent.futures import ThreadPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create a thread pool
>   pool = ThreadPoolExecutor(6)
    for i in range(300):
>       pool.submit(parse_2, url)
>   pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
Asynchronous is the opposite of synchronous. Asynchronous events are independent of one another: while waiting for one event you carry on with your own work, without needing that event to complete first. Threads are one way to implement asynchrony, in the sense that multi-threading is a form of asynchronous processing. With asynchrony we don't know the result right away; when we do need the result, we can use a callback.
import requests
from concurrent.futures import ThreadPoolExecutor

# Add a callback function
> def callback(future):
>     print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ThreadPoolExecutor(6)
    for i in range(300):
>       results = pool.submit(parse_2, url)
        # The key step: attach the callback
>       results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    # Return the status code so the callback has a result to print
>   return response.status_code

if __name__ == '__main__':
    parse_1()
Python's multi-threading has been endlessly criticized because of the GIL (Global Interpreter Lock), but for IO-bound tasks, which most page crawling belongs to, multi-threading is still very suitable.
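A quick way to convince yourself that the GIL doesn't hurt IO-bound work: in the sketch below, time.sleep stands in for waiting on a socket (both release the GIL). Four fake requests run serially and then through a thread pool; the threaded version finishes in roughly the time of a single request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_get(url):
    # time.sleep releases the GIL, just like waiting on a socket does
    time.sleep(0.2)
    return 200

urls = ['https://www.baidu.com'] * 4

start = time.perf_counter()
for u in urls:
    fake_get(u)
serial = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(4) as pool:
    results = list(pool.map(fake_get, urls))
threaded = time.perf_counter() - start

# Serial takes about 0.8 s; threaded takes about 0.2 s
print(f'serial {serial:.2f}s, threaded {threaded:.2f}s')
```

For CPU-bound work the picture is the opposite: the GIL lets only one thread compute at a time, which is where multi-processing comes in.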
IV. Multi-processing
Multi-processing can be implemented in two ways: ProcessPoolExecutor and multiprocessing.
1. ProcessPoolExecutor

This works much like the multi-threaded ThreadPoolExecutor:
import requests
> from concurrent.futures import ProcessPoolExecutor

def parse_1():
    url = 'https://www.baidu.com'
    # Create a process pool
>   pool = ProcessPoolExecutor(6)
    for i in range(300):
>       pool.submit(parse_2, url)
>   pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
Notice that only the class name changed; the code stays just as simple, and in the same way you can add a callback function:
import requests
from concurrent.futures import ProcessPoolExecutor

> def callback(future):
>     print(future.result())

def parse_1():
    url = 'https://www.baidu.com'
    pool = ProcessPoolExecutor(6)
    for i in range(300):
>       results = pool.submit(parse_2, url)
>       results.add_done_callback(callback)
    pool.shutdown(wait=True)

def parse_2(url):
    response = requests.get(url)
    # Return the status code so the callback has a result to print
>   return response.status_code

if __name__ == '__main__':
    parse_1()
2. multiprocessing
Let's go straight to the code; everything is in the comments.
import requests
> from multiprocessing import Pool

def parse_1():
    url = 'https://www.baidu.com'
    # Build the pool
>   pool = Pool(processes=5)
    # Store the pending results
>   res_lst = []
    for i in range(300):
        # Submit the task to the pool
>       res = pool.apply_async(func=parse_2, args=(url,))
        # Collect the pending result objects for later retrieval
>       res_lst.append(res)
    # Store the final results (you could also store or print them directly)
>   good_res_lst = []
>   for res in res_lst:
        # Use get() to fetch the result once the task has completed
>       good_res = res.get()
        # Keep only the usable results
>       if good_res:
>           good_res_lst.append(good_res)
    # Close the pool and wait for all tasks to finish
>   pool.close()
>   pool.join()

def parse_2(url):
    response = requests.get(url)
    # Return the status code so res.get() has a value to collect
>   return response.status_code

if __name__ == '__main__':
    parse_1()
As you can see, the multiprocessing code is a little more verbose, but it supports more kinds of extension. Multi-processing and multi-threading do genuinely speed things up, but a thread or process that blocks on IO is still wasted, so there is a better way...
V. Asynchronous non-blocking
With coroutines plus callbacks cooperating, asynchronous non-blocking execution achieves the same goal while essentially using only a single thread, so it makes far better use of resources.
The classic asynchronous non-blocking approach uses the asyncio library plus yield; to make it easier to use, higher-level wrappers such as aiohttp gradually appeared, and understanding asyncio helps you understand asynchronous non-blocking better. gevent, on the other hand, is a library that makes coroutines very easy to use.
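To give a feel for the asyncio side before the gevent version, here is a minimal sketch of the same fan-out pattern in the modern async/await syntax. asyncio.sleep is a stand-in for the network wait (a plain requests.get would block the event loop; a real version would use something like aiohttp here):

```python
import asyncio

async def parse_2(url):
    # Stand-in for a non-blocking HTTP request (aiohttp would go here)
    await asyncio.sleep(0.1)
    return 200

async def parse_1():
    url = 'https://www.baidu.com'
    # Create all the coroutines, then await them concurrently
    tasks = [parse_2(url) for _ in range(10)]
    results = await asyncio.gather(*tasks)
    # Ten fake requests complete in roughly the time of one
    print(results)

if __name__ == '__main__':
    asyncio.run(parse_1())
```
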
# The monkey patch is the soul that makes blocking IO run cooperatively;
# apply it before importing requests
> from gevent import monkey
> monkey.patch_all()
> import gevent
import requests

def parse_1():
    url = 'https://www.baidu.com'
    # Create a task list
>   tasks_list = []
    for i in range(300):
>       task = gevent.spawn(parse_2, url)
>       tasks_list.append(task)
>   gevent.joinall(tasks_list)

def parse_2(url):
    response = requests.get(url)
    print(response.status_code)

if __name__ == '__main__':
    parse_1()
gevent can greatly increase speed, but it also introduces a new problem: what if we don't want to go so fast that we put too much load on the server? With multi-process or multi-thread pools we can control the pool size. gevent also has a good way to control the speed: build a queue. gevent provides a Queue class for this; the code below changes a little more.
from gevent import monkey
monkey.patch_all()
import gevent
import requests
> from gevent.queue import Queue

# Instantiate the queue at module level so parse_2 can see it
> queue = Queue()

def parse_1():
    url = 'https://www.baidu.com'
    tasks_list = []
    for i in range(300):
        # Push every url into the queue
>       queue.put_nowait(url)
    # Spawn just two workers to drain the queue
>   for _ in range(2):
>       task = gevent.spawn(parse_2)
>       tasks_list.append(task)
    gevent.joinall(tasks_list)

# No parameter needed; the urls are all in the queue
> def parse_2():
    # Loop while the queue is not empty
>   while not queue.empty():
        # Pop a url off the queue
>       url = queue.get_nowait()
        response = requests.get(url)
        # Show the remaining queue size along with the status code
>       print(queue.qsize(), response.status_code)

if __name__ == '__main__':
    parse_1()
VI. Conclusion
These are several commonly used acceleration methods. If you are interested, you can use the time module to measure and compare how long each version runs. Speeding up your crawler is an important skill, but properly controlling the speed is a good habit for crawler developers; don't put too much pressure on servers. Bye~