Comparison of crawlers: synchronous, multi-process, and coroutine implementations

Use Python to write a crawler that fetches the titles of the top ten Weibo hot-search articles, implemented synchronously, with multiple processes, and with coroutines. Compare the differences among the three implementations; their execution efficiency also differs considerably.

1. Using synchronization to crawl and output the hot-search titles one by one

This method is the simplest: pages are crawled and processed one at a time, so the efficiency is the lowest.

import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        # Sina Weibo has anti-crawler measures, so a header dict carrying the
        # cookie is needed; press F12 on the page to view and copy it.
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)  # fetch the page
        r.encoding=r.apparent_encoding  # detect the encoding
        soup=BeautifulSoup(r.text,'html.parser')  # parse with BeautifulSoup
        # print the string of the tag whose class attribute is 'title',
        # i.e. the page title
        print(soup.find(attrs={'class':'title'}).string)
    except:
        print('error')
# list of page links
urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]
# crawl the pages one by one
def main(urls):
    for url in urls:
        get_title(url)

start=time.time()
main(urls)
end=time.time()
print('run time is %.5f'%(end-start))  # print the elapsed time

Running it prints the titles of the top ten Weibo hot searches; the measured run time is 7.73 s.

2. Using multiple processes to implement the crawler

Now use multiprocessing.Pool to implement a multi-process crawler. Since my computer has 4 cores, I set the process pool size to 4 (p=Pool(4)).

from multiprocessing import Pool
import time
import requests
from bs4 import BeautifulSoup

def get_title(url):
    try:
        header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
        r=requests.get(url,timeout=30,headers=header)
        r.encoding=r.apparent_encoding
        soup=BeautifulSoup(r.text,'html.parser')
        print(soup.find(attrs={'class':'title'}).string)
    except:
        print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls):
    p=Pool(4)  # process pool with 4 workers
    for url in urls:
        p.apply_async(get_title,args=[url])  # submit tasks to run concurrently
    p.close()
    p.join()  # wait until all child processes finish before moving on
# Wrap the driver code in a function and run it under if __name__=='__main__':,
# otherwise the program raises an error (see below).
if __name__=='__main__':
    start=time.time()
    main(urls)
    end=time.time()
    print('run time is %.5f'%(end-start))
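Since every task calls the same function, the apply_async loop can also be written more compactly with Pool.map, which blocks until all results are in. A minimal alternative sketch, reusing the get_title and urls defined above:

from multiprocessing import Pool

def main(urls):
    with Pool(4) as p:          # pool of 4 worker processes
        p.map(get_title, urls)  # blocks until every URL has been handled

if __name__=='__main__':
    main(urls)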

Note that under Windows the code that actually drives the program must be wrapped in a function and called under if __name__=='__main__':, otherwise it raises a RuntimeError mentioning freeze_support(), as shown below:

    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

This error does not appear under Linux, because Linux starts child processes with fork; Windows has no fork call and uses spawn instead, which re-imports the main module in every child process.
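To check which start method the current platform uses, the standard library exposes it directly; a tiny check (the printed value differs by OS):

import multiprocessing

if __name__ == '__main__':
    # 'fork' on Linux, 'spawn' on Windows (and on macOS since Python 3.8)
    print(multiprocessing.get_start_method())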
Running the multi-process crawler shows a run time of 4.56 s; efficiency has clearly improved.
[Figure: execution flow chart of the multi-process crawler (picture from the Internet)]

3. Using coroutines to implement the crawler

Since requests is not awaitable and cannot be placed after await, the third-party aiohttp library provides asynchronous web requests and related functionality. It can be regarded as an asynchronous version of the requests library and must be installed manually: pip install aiohttp

import aiohttp
import asyncio
import time
from bs4 import BeautifulSoup
header={"user-agent":"Chrome/78.0.3904.108",'cookie':'SUB=_2AkMqIoz1f8PxqwJRmPAVxWPnaIpyyQnEieKcfn0uJRMxHRl-yT9kqkAFtRB6AaKiGn7WbCVQVtWDseW_5JZsJ0NoICGr; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWJrDJ8H0hrSpRG9GpmEzGF; SINAGLOBAL=5229331688388.405.1568539601424; login_sid_t=1a1165955454339327bdef141125e35c; cross_origin_proto=SSL; Ugrow-G0=5c7144e56a57a456abed1d1511ad79e8; YF-V5-G0=8c1ea32ec7cf68ca3a1513783c276b8c; _s_tentry=-; wb_view_log=1280*7201.5; Apache=2516595840555.2446.1585651161757; ULV=1585651161764:5:1:1:2516595840555.2446.1585651161757:1582956310594; WBStorage=42212210b087ca50|undefined; YF-Page-G0=c704b1074605efc315869695a91e5996|1585653703|1585653701'}
sem=asyncio.Semaphore(10)  # semaphore capping the number of concurrent coroutines, so we don't crawl too fast
async def get_title(url,header):
    async with sem:
        async with aiohttp.ClientSession() as session:
            async with session.request('GET',url,headers=header) as result:
                try:
                    text=await result.text()
                    soup=BeautifulSoup(text,'html.parser')
                    print(soup.find(attrs={'class':'title'}).string)
                except:
                    print('error...')

urls=[
    'https://weibo.com/ttarticle/p/show?id=2309404488307177292254',
    'https://weibo.com/ttarticle/p/show?id=2309404488460353274040',
    'https://weibo.com/ttarticle/p/show?id=2309354488485582012495',
    'https://weibo.com/ttarticle/p/show?id=2309354488485540069723',
    'https://weibo.com/ttarticle/p/show?id=2309354488485808505228',
    'https://weibo.com/ttarticle/p/show?id=2309404488184535843057',
    'https://weibo.com/ttarticle/p/show?id=2309354488519753007179',
    'https://weibo.com/ttarticle/p/show?id=2309354488514216526214',
    'https://weibo.com/ttarticle/p/show?id=2309354488464673406980',
    'https://weibo.com/ttarticle/p/show?id=2309354488533355135102'
    ]

def main(urls,header):
    loop=asyncio.get_event_loop()  # get the event loop
    tasks=[get_title(url,header) for url in urls]  # build the list of coroutine tasks
    loop.run_until_complete(asyncio.wait(tasks))  # run the coroutines until all complete

if __name__=='__main__':
    start=time.time()
    main(urls,header)
    end=time.time()
    print('run time is %.5f'%(end-start))
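On Python 3.7+ the same driver can be written with asyncio.run and asyncio.gather, which also avoids the deprecation warning newer versions emit when coroutines are passed directly to asyncio.wait. A minimal sketch (assuming Python 3.10+, where the module-level Semaphore no longer binds to an event loop at creation):

async def main(urls,header):
    # gather schedules all the coroutines and waits for them to finish
    await asyncio.gather(*(get_title(url,header) for url in urls))

if __name__=='__main__':
    start=time.time()
    asyncio.run(main(urls,header))
    end=time.time()
    print('run time is %.5f'%(end-start))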

The result of running it: only 0.82 s, which is extremely efficient.
Notes:
1. The Semaphore is a synchronization tool that limits how many coroutines work at the same time; a standalone sketch follows this list.
2. Inside the coroutine, aiohttp's ClientSession() issues the web request.
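To illustrate point 1 in isolation, a minimal sketch (names are illustrative): with Semaphore(3), at most three of the nine workers are inside the async with block at any moment, so the output arrives in waves of three.

import asyncio

async def worker(i, sem):
    async with sem:             # at most 3 coroutines pass this point at once
        print(f'worker {i} running')
        await asyncio.sleep(1)  # stand-in for a slow network request

async def demo():
    sem = asyncio.Semaphore(3)
    await asyncio.gather(*(worker(i, sem) for i in range(9)))

asyncio.run(demo())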

An asynchronous crawler differs from a multi-process one in that it handles many tasks concurrently on a single thread: it creates one event loop and adds all tasks to it. When a task hits a time-consuming operation (such as requesting a URL), the task is suspended and the loop switches to the next one. When a suspended task's state changes (for example, the web response has arrived), it is woken up and execution resumes from where it was suspended. This greatly reduces needless waiting in between; the toy example below shows the effect.
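A toy illustration of that suspend-and-resume behavior, with asyncio.sleep standing in for network wait: both tasks start immediately, and the total time is the longest single wait rather than the sum of the waits.

import asyncio, time

async def task(name, delay):
    print(f'{name}: request sent')
    await asyncio.sleep(delay)   # suspended here; the loop runs the other task
    print(f'{name}: response received')

async def demo():
    start = time.time()
    await asyncio.gather(task('A', 2), task('B', 1))
    print('total %.2f s' % (time.time() - start))  # ~2 s, not 3 s

asyncio.run(demo())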
——————————————
Copyright statement: this article is an original article by the CSDN blogger "SL_World" and is licensed under the CC 4.0 BY-SA agreement; please attach the original source link and this statement when reprinting.
Original link: https://blog.csdn.net/SL_World/article/details/86633611

[Figure: execution flow chart of the coroutine crawler (picture from the Internet)]
