Asynchronous crawlers: the basic principles of coroutines

Basic concepts

  • Blocking : a program is blocked (suspended) while it waits for computing resources it needs; it cannot do anything else until the pending operation completes
  • Non-blocking : while waiting for an operation, the program is not suspended and can continue with other work (a program can be non-blocking only if its level of encapsulation allows independent subprogram units)
  • Synchronous : program units that must coordinate through some form of communication in order to complete a task together are executed synchronously (in order)
  • Asynchronous : program units that can complete a task without communicating or coordinating with one another can run asynchronously (out of order)
  • Multi-process : takes advantage of a multi-core CPU to execute multiple tasks at the same time
  • Coroutine : essentially a single process that keeps its own register context and stack, which can be used to implement asynchronous operations (a minimal sketch follows this list)
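
A minimal sketch of the blocking/non-blocking contrast, assuming only the standard library and Python 3.7+ for asyncio.run: time.sleep blocks the whole program, while await asyncio.sleep only suspends the current coroutine so the event loop can run others.

import asyncio
import time

async def task(name):
    print(name, 'start')
    await asyncio.sleep(1)
    # non-blocking: only this coroutine is suspended; time.sleep(1) here
    # would block the whole event loop and double the total time
    print(name, 'end')

async def main():
    await asyncio.gather(task('a'), task('b'))
    # both coroutines wait concurrently, so the total is about 1 second, not 2

start = time.time()
asyncio.run(main())
print('Cost time:', time.time() - start)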

Coroutine usage

Environment: Python 3.5+, asyncio library

Basic concepts

  • event_loop : the event loop; functions can be registered with it, and when the triggering condition occurs the corresponding handler is called;
  • coroutine : the coroutine object type. A method defined with the async keyword is not executed when called; the call returns a coroutine object instead, which can then be registered with the event loop to be scheduled and run.
  • task : a further encapsulation of a coroutine object; the returned task object carries its running status.
  • future : another way to define a task object; not fundamentally different from task.

Define a coroutine (register the coroutine object directly with the event loop)

import asyncio
# import the asyncio package so that the async and await keywords can be used

async def execute(x):
    print('Number:', x)
# define a method with async; it takes one argument and prints it when executed

coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')
# calling the method defined above does not execute it; it returns a coroutine object

loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(coroutine)
# the loop's run_until_complete method registers the coroutine object with the event loop
# only after registration does the method defined and called above actually run
print('After calling loop')

Coroutine: <coroutine object execute at 0x000001E7D19743C0>
After calling execute
Number: 1
After calling loop

The async keyword: calling a method defined with async returns a coroutine object instead of executing the method; the coroutine object must be registered with an event loop before it can execute!
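
On Python 3.7 and later, asyncio.run offers a shorthand that creates the event loop, runs the coroutine to completion, and closes the loop; a minimal equivalent of the example above:

import asyncio

async def execute(x):
    print('Number:', x)

asyncio.run(execute(1))
# creates an event loop, runs the coroutine until it completes, then closes the loop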

Define a coroutine (first wrap the coroutine into a task object with the create_task method)

import asyncio

async def execute(x):
    print('Number:', x)
    return x

coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')

loop = asyncio.get_event_loop()
# create an event loop object
task = loop.create_task(coroutine)
# the loop's create_task method turns the coroutine object into a task object
print('Task:', task)
# the task object printed here is in the pending state
loop.run_until_complete(task)
# the loop's run_until_complete method registers the task with the event loop
print('Task', task)
# the task object printed here is in the finished state
print('After calling loop')

Coroutine: <coroutine object execute at 0x000001EB521D5440>
After calling execute
Task: <Task pending name='Task-1' coro=<execute() running at D:\Python\demo\临时测试2.py:125>>
Number: 1
Task <Task finished name='Task-1' coro=<execute() done, defined at D:\Python\demo\临时测试2.py:125> result=1>
After calling loop 

Define a coroutine (first wrap the coroutine into a task object with the ensure_future method)

import asyncio

async def execute(x):
    print('Number:', x)
    return x

coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')

task = asyncio.ensure_future(coroutine)
# asyncio's ensure_future method turns the coroutine object into a task object directly (no loop object needed yet)
print('Task:', task)
# the task object printed here is in the pending state
loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(task)
# the loop's run_until_complete method registers the task with the event loop
print('Task', task)
# the task object printed here is in the finished state
print('After calling loop')

Coroutine: <coroutine object execute at 0x0000023C80885440>
After calling execute
Task: <Task pending name='Task-1' coro=<execute() running at D:\Python\demo\临时测试2.py:125>>
Number: 1
Task <Task finished name='Task-1' coro=<execute() done, defined at D:\Python\demo\临时测试2.py:125> result=1>
After calling loop

Bind a callback method (add_done_callback) to the task object

import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url)
    return status
# define a method with async; it requests the site and returns the response

def callback(task):
    print('Status:', task.result())
# define an ordinary callback method that prints the task's result via the result method

coroutine = request()
# receive the returned coroutine object
task = asyncio.ensure_future(coroutine)
# wrap the coroutine object into a task object
task.add_done_callback(callback)
# attach the callback method to the task object
print('Task:', task)
# print the task object (it carries its running status, currently pending)

loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(task)
# register the task with the event loop (the method defined above then runs)
print('Task:', task)
# print the task object again (running status now: finished)

Task: <Task pending name='Task-1' coro=<request() running at D:\Python\demo\临时测试2.py:148> cb=[callback() at D:\Python\demo\临时测试2.py:154]>
Status: <Response [200]>
Task: <Task finished name='Task-1' coro=<request() done, defined at D:\Python\demo\临时测试2.py:148> result=<Response [200]>>

In fact, in this example the callback is not even needed: after the task finishes running, its result method can be called directly to get the result:

import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url)
    return status
# define a method with async; it requests the site and returns the response

coroutine = request()
# receive the returned coroutine object
task = asyncio.ensure_future(coroutine)
# wrap the coroutine object into a task object
print('Task:', task)
# print the task object (it carries its running status, currently pending)

loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(task)
# register the task with the event loop (the method defined above then runs)
print('Task:', task)
# print the task object again (running status now: finished)
print('Task:', task.result())
# call the result method to get and print the task's result

Task: <Task pending name='Task-1' coro=<request() running at D:\Python\demo\临时测试2.py:148>>
Task: <Task finished name='Task-1' coro=<request() done, defined at D:\Python\demo\临时测试2.py:148> result=<Response [200]>>
Task: <Response [200]>

Multitasking coroutines (execute multiple requests)

import asyncio
import requests

async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url)
    return status
# define a method with async; it requests the site and returns the response

tasks = [asyncio.ensure_future(request()) for _ in range(5)]
# wrap the coroutine object into a task object five times and collect the tasks in a list
print('Tasks:', tasks)
# print the tasks list (five task objects with their running status, currently pending)

loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(asyncio.wait(tasks))
# wrap the tasks list with asyncio.wait and register it with the event loop (the tasks then run)
for task in tasks:
    print('Task Result', task.result())
# iterate over the five task objects and print each result via the result method

Tasks: [<Task pending name='Task-1' coro=<request() running at D:\Python\demo\临时测试2.py:148>>, <Task pending name='Task-2' coro=<request() running at D:\Python\demo\临时测试2.py:148>>, <Task pending name='Task-3' coro=<request() running at D:\Python\demo\临时测试2.py:148>>, <Task pending name='Task-4' coro=<request() running at D:\Python\demo\临时测试2.py:148>>, <Task pending name='Task-5' coro=<request() running at D:\Python\demo\临时测试2.py:148>>]
Task Result <Response [200]>
Task Result <Response [200]>
Task Result <Response [200]>
Task Result <Response [200]>
Task Result <Response [200]>
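
Note that requests.get is a blocking call, so although five tasks are registered here, the event loop cannot switch away while each request is in flight and the tasks actually run one after another. Getting real concurrency requires an awaitable HTTP client; that is what the await keyword and the aiohttp library below provide.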

Coroutine implementation

  • async keyword : a method defined with async cannot be executed directly (calling it returns a coroutine object); the coroutine object must be registered with an event loop before it can run;
  • await keyword : suspends a time-consuming waiting operation and yields control (when a coroutine hits await during execution, the event loop suspends it and runs other coroutines until they are in turn suspended or finished);
  • aiohttp library : supports asynchronous requests and is used together with asyncio (installation: pip install aiohttp)

The object after await must be one of the following:

  • a native coroutine object (defined with async, supporting asynchronous operations)
  • a generator-based coroutine (a generator decorated with types.coroutine, which can return a coroutine object)
  • an object with an __await__ method (the method must return an iterator; see the sketch after this list)
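
A minimal sketch of the third case, assuming nothing beyond the standard library: a plain class becomes awaitable by defining __await__ and returning an iterator, here by delegating to a real awaitable:

import asyncio

class Pause:
    # any object whose __await__ method returns an iterator can follow await
    def __await__(self):
        yield from asyncio.sleep(1).__await__()
        # delegate the actual suspension to asyncio.sleep
        return 42

async def main():
    result = await Pause()
    print('Result:', result)

asyncio.run(main())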

Official documentation of aiohttp: https://docs.aiohttp.org/en/stable/

import asyncio
import aiohttp
import time

start = time.time()

async def get(url):
    session = aiohttp.ClientSession()
    # this first step is non-blocking, so it is executed immediately
    response = await session.get(url)
    # request the url with the get method of aiohttp's ClientSession (await marks it as suspendable)
    # every task suspends here, then they all receive their responses at almost the same time
    await response.text()
    await session.close()
    return response

async def request():
    url = 'https://www.httpbin.org/delay/5'
    print('Waiting for', url)
    response = await get(url)
    # the whole get method is awaited (suspendable)
    print('Get response from', url, 'response', response)

tasks = [asyncio.ensure_future(request()) for _ in range(10)]
# wrap the coroutine object into a task object ten times and collect the tasks in a list
loop = asyncio.get_event_loop()
# create an event loop object
loop.run_until_complete(asyncio.wait(tasks))
# wrap the tasks list with asyncio.wait and register it with the event loop (the tasks then run)

end = time.time()
print('Cost time:', end - start)

Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Waiting for https://www.httpbin.org/delay/5
Get response from https://www.httpbin.org/delay/5 response <ClientResponse(https://www.httpbin.org/delay/5) [200 OK]>
<CIMultiDictProxy('Date': 'Thu, 24 Feb 2022 13:05:43 GMT', 'Content-Type': 'application/json', 'Content-Length': '367', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
.....

Higher Concurrency Tests

import asyncio
import aiohttp
import time


def akb(number):
    start = time.time()

    async def get(url):
        session = aiohttp.ClientSession()
        response = await session.get(url)
        # request the url with the get method of aiohttp's ClientSession (await marks it as suspendable)
        await response.text()
        await session.close()
        return response

    async def request():
        url = 'https://www.csdn.net/'
        response = await get(url)

    tasks = [asyncio.ensure_future(request()) for _ in range(number)]
    # wrap the coroutine object into a task object `number` times and collect the tasks in a list
    loop = asyncio.get_event_loop()
    # create an event loop object
    loop.run_until_complete(asyncio.wait(tasks))
    # wrap the tasks list with asyncio.wait and register it with the event loop (the tasks then run)

    end = time.time()
    print('Number:', number, 'Cost time:', end - start)


for number in [1, 10, 50, 100, 500]:
    akb(number)

Number: 1 Cost time: 2.363093376159668
Number: 10 Cost time: 2.376859188079834
Number: 50 Cost time: 3.716461420059204
Number: 100 Cost time: 7.078423500061035
Number: 500 Cost time: 15.283033847808838

An asynchronous crawler can issue hundreds of network requests in a short time: even 500 concurrent requests finish in about 15 seconds here!

Basic usage of aiohttp (client part)

Key modules:

  • asyncio module : implements asynchronous operation of the TCP, UDP, and SSL protocols (it must be imported, because asynchronous crawling starts coroutines, and coroutines need an event loop to run);
  • aiohttp module : an asynchronous HTTP network request module built on top of asyncio.

The server and client provided by the aiohttp module:

  • Server : used to build a server that handles requests and returns responses asynchronously (comparable to web servers and frameworks such as Django, Flask, and Tornado);
  • Client : used to initiate requests, much like requests (send an HTTP request, then get a response); the difference is that requests makes synchronous network requests while aiohttp makes asynchronous ones.

Basic example (GET request):

import aiohttp
import asyncio

async def fetch(session, url):
# every asynchronous method is defined with the async modifier
    async with session.get(url) as response:
    # the with-as statement declares a context manager that allocates and releases resources automatically
    # here it also needs the async modifier (declaring a context manager that supports asynchronous operation)
        return await response.text(), response.status
        # operations that return coroutine objects need the await modifier
        # response.text() returns a coroutine object
        # response.status is a plain value

async def main():
    async with aiohttp.ClientSession() as session:
        html, status = await fetch(session, 'https://www.csdn.net')
        print(f'html: {html[:100]}...')
        print(f'status: {status}')

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # create the event loop object
    loop.run_until_complete(main())
    # register the coroutine object with the event loop

Basic example 2 (GET request with URL parameters):

import aiohttp
import asyncio

async def main():
    params = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.httpbin.org/get', params=params) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # declare the event loop, register the coroutine with it, and run

{
  "args": {
    "age": "25", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python/3.9 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6217aad9-6711f6a8646bbf54671998f1"
  }, 
  "origin": "116.66.127.55", 
  "url": "https://www.httpbin.org/get?name=germey&age=25"
}

Basic example 3 (POST request)

For a form submission, the Content-Type in the request headers is application/x-www-form-urlencoded:

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.post('https://www.httpbin.org/post', data=data) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # declare the event loop, register the coroutine with it, and run

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "25", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python/3.9 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6217ab09-717e0f595b6720491faa7623"
  }, 
  "json": null, 
  "origin": "116.66.127.55", 
  "url": "https://www.httpbin.org/post"
}

To submit JSON data with POST, the Content-Type of the request is application/json, and the data parameter of the post method is replaced with json:

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.post('https://www.httpbin.org/post', json=data) as response:
            print(await response.text())

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # declare the event loop, register the coroutine with it, and run

 {
  "args": {}, 
  "data": "{\"name\": \"germey\", \"age\": 25}", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "29", 
    "Content-Type": "application/json", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python/3.9 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6217ac9b-05b45563594b0c367b7f302d"
  }, 
  "json": {
    "age": 25, 
    "name": "germey"
  }, 
  "origin": "116.66.127.55", 
  "url": "https://www.httpbin.org/post"
}

Other request types

session.post('https://www.httpbin.org/post', data=data)
session.put('https://www.httpbin.org/put', data=data)
session.delete('https://www.httpbin.org/delete')
session.head('https://www.httpbin.org/get')
session.options('https://www.httpbin.org/get')
session.patch('https://www.httpbin.org/patch', data=data)
# just substitute the corresponding method and parameters

Getting information from the response

import aiohttp
import asyncio

async def main():
    data = {'name': 'germey', 'age': 25}
    async with aiohttp.ClientSession() as session:
        async with session.post('https://www.httpbin.org/post', data=data) as response:
            print('status:', response.status)
            # status code of the response
            print('headers', response.headers)
            # headers of the response
            print('body', await response.text())
            # body of the response
            print('bytes', await response.read())
            # body of the response as bytes
            print('json', await response.json())
            # body of the response parsed as JSON

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # declare the event loop, register the coroutine with it, and run

Some fields need await in front of them and some do not. The rule: if accessing it returns a coroutine object (as response.text(), response.read(), and response.json() do), await must be added; plain attributes such as response.status and response.headers are read directly.

For details, see the aiohttp client API reference: https://docs.aiohttp.org/en/stable/client_reference.html

status: 200
headers <CIMultiDictProxy('Date': 'Thu, 24 Feb 2022 16:10:30 GMT', 'Content-Type': 'application/json', 'Content-Length': '510', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true')>
body {
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "age": "25", 
    "name": "germey"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "18", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python/3.9 aiohttp/3.8.1", 
    "X-Amzn-Trace-Id": "Root=1-6217adf6-6003f76d24e8c82b28fec349"
  }, 
  "json": null, 
  "origin": "116.66.127.55", 
  "url": "https://www.httpbin.org/post"
}

bytes b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "age": "25", \n    "name": "germey"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Content-Length": "18", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "www.httpbin.org", \n    "User-Agent": "Python/3.9 aiohttp/3.8.1", \n    "X-Amzn-Trace-Id": "Root=1-6217adf6-6003f76d24e8c82b28fec349"\n  }, \n  "json": null, \n  "origin": "116.66.127.55", \n  "url": "https://www.httpbin.org/post"\n}\n'
json {'args': {}, 'data': '', 'files': {}, 'form': {'age': '25', 'name': 'germey'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '18', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'www.httpbin.org', 'User-Agent': 'Python/3.9 aiohttp/3.8.1', 'X-Amzn-Trace-Id': 'Root=1-6217adf6-6003f76d24e8c82b28fec349'}, 'json': None, 'origin': '116.66.127.55', 'url': 'https://www.httpbin.org/post'}

Timeout setting

import aiohttp
import asyncio

async def main():
    timeout = aiohttp.ClientTimeout(total=1)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get('https://www.httpbin.org/get') as response:
            print('status:', response.status)
            # status code of the response

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # declare the event loop, register the coroutine with it, and run

If the server responds within 1 second, this prints status 200; otherwise an asyncio.exceptions.TimeoutError is raised.

(ClientTimeout accepts other parameters besides total, such as connect, sock_connect, and sock_read.)
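
A short sketch of setting per-phase limits and catching the timeout, assuming the same httpbin host (the /delay/5 endpoint is used here just to force the timeout):

import aiohttp
import asyncio

async def main():
    timeout = aiohttp.ClientTimeout(total=1, connect=0.5, sock_read=0.5)
    # total caps the whole request; connect and sock_read cap individual phases
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get('https://www.httpbin.org/delay/5') as response:
                print('status:', response.status)
        except asyncio.TimeoutError:
            # aiohttp raises an asyncio.TimeoutError subclass when a limit is exceeded
            print('request timed out')

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())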

Concurrency limit

(Reference link: https://www.cnblogs.com/lymmurrain/p/13805690.html )

import asyncio
import aiohttp

url = 'https://www.baidu.com'
num = 5
# the maximum number of concurrent requests
semaphore = asyncio.Semaphore(num)
# create the object that controls concurrency

async def scrape_api():
    async with semaphore:
        print('scraping', url)
        async with session.get(url) as response:
            await asyncio.sleep(2)
            return len(await response.text())


async def main():
    print(await asyncio.gather(*[scrape_api() for _ in range(20)]))

async def create_session():
    return aiohttp.ClientSession()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # create the event loop
    session = loop.run_until_complete(create_session())
    # create_session was defined with async, so run its coroutine in the event loop to build the session
    loop.run_until_complete(main())
    # main was defined with async; register the coroutine it returns with the event loop
    loop.run_until_complete(session.close())
    # a manually created session must be closed manually; session.close() is a coroutine, so it runs in the event loop
    loop.run_until_complete(asyncio.sleep(3))
    # give aiohttp a moment to close its connections before closing the loop, via asyncio's sleep method
    loop.close()
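
On Python 3.7 and later the same idea reads more simply: asyncio.run manages the loop, and async with closes the session automatically, removing the manual close and sleep; a minimal sketch under those assumptions:

import asyncio
import aiohttp

URL = 'https://www.baidu.com'

async def scrape_api(session, semaphore):
    async with semaphore:
        # at most 5 coroutines pass this point at the same time
        print('scraping', URL)
        async with session.get(URL) as response:
            return len(await response.text())

async def main():
    semaphore = asyncio.Semaphore(5)
    # the session is closed automatically when the async with block exits
    async with aiohttp.ClientSession() as session:
        print(await asyncio.gather(*[scrape_api(session, semaphore) for _ in range(20)]))

asyncio.run(main())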

aiohttp asynchronous crawling practice

Target site:

  • http://spa5.scrape.center/

Site Features:

  • It contains information on thousands of books; the pages are rendered by JavaScript, and the data can be obtained through an Ajax API.

Crawl target:

  1. Use aiohttp to crawl the book data of the whole site
  2. Save data to MongoDB asynchronously

Environment:

  • Python 3.7 or above + a MongoDB database + the asyncio, aiohttp, and motor libraries
  • Note: motor's connection statements are similar to pymongo's, and the calls for saving data are basically the same; the difference is that motor supports asynchronous operation (a short sketch follows this list).
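
A minimal sketch of the pymongo/motor parallel, assuming a local MongoDB on the default port and a hypothetical books database and collection; the only visible change from pymongo is the client class and the await:

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def main():
    client = AsyncIOMotorClient('mongodb://localhost:27017')
    collection = client['books']['books']
    # database and collection are accessed exactly as with pymongo
    await collection.insert_one({'id': '1', 'name': 'demo'})
    # with pymongo the call is identical, minus the await

asyncio.run(main())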

Page analysis:

  • The Ajax API of the list pages has the format https://spa5.scrape.center/api/book/?limit={limit}&offset={offset}
  • limit is the number of books per page, and offset is the page's offset, computed as offset = limit * (page - 1); for example, the offset of the first page is 0 and the offset of the second page is 18

  • In the data returned by the list-page API, the results field contains the information of all books on the current page, and each book's id can be used to request its detail page

  • The Ajax API of the detail pages has the format https://spa5.scrape.center/api/book/{id}
  • id is the book's id from the list page, and the book's details can be obtained from this endpoint.

Implementation ideas

  1. Crawl all list pages asynchronously: gather the crawling tasks of every list page into one list of tasks and run them concurrently.
  2. Parse the content of the list pages fetched in the previous step, collect the ids of all books into a list of crawling tasks for all detail pages, run it asynchronously, and save each crawled result to MongoDB asynchronously as it arrives.
  3. (The two stages run serially here, which is not the optimal way to execute them.)

Code:

import asyncio
import aiohttp
import logging
import json
from motor.motor_asyncio import AsyncIOMotorClient

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
# define the format for logged status messages

INDEX_URL = 'https://spa5.scrape.center/api/book/?limit=18&offset={offset}'
# URL format of the index (list) pages
DETAIL_URL = 'https://spa5.scrape.center/api/book/{id}'
# URL format of the detail pages
PAGE_SIZE = 18
# paging offset step of the index-page URLs
PAGE_NUMBER = 5
# number of index pages to crawl
CONCURRENCY = 5
# concurrency level


semaphore = asyncio.Semaphore(CONCURRENCY)
# declare a semaphore based on the concurrency level, used to cap the number of concurrent tasks

MONGO_CONNECTION_STRING = 'mongodb://localhost:27017'
MONGO_DB_NAME = 'books'
MONGO_COLLECTION_NAME = 'books'
# MongoDB connection settings (connection string, database name, collection name)
client = AsyncIOMotorClient(MONGO_CONNECTION_STRING)
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]
# declare the objects needed to connect to MongoDB, based on the settings above


async def scrape_api(url):
    # a generic asynchronous scrape method (used for both index and detail pages)
    # requests the url and returns the JSON-decoded response
    async with semaphore:
        # open an asynchronous context manager guarded by the semaphore
        try:
            logging.info('scraping %s', url)
            # log the current state (which URL is being crawled)
            async with session.get(url) as response:
                # request the url with the get method
                return await response.json()
                # return the JSON-decoded response
        except aiohttp.ClientError:
            # handle exceptions by catching ClientError
            logging.error('error occurred while scraping %s', url, exc_info=True)
            # log the URL that failed, with the traceback


async def main():
    global session
    session = aiohttp.ClientSession()
    # declare a session object and make it global (so it does not have to be passed into every method)

    scrape_index_tasks = [asyncio.ensure_future(scrape_index(page)) for page in range(1, PAGE_NUMBER + 1)]
    # the list of tasks that crawl the index pages (to crawl N pages, call scrape_index to build N tasks and collect them in a list)
    results = await asyncio.gather(*scrape_index_tasks)
    # pass the task list to the gather method, which collects all scrape_index results into results
    logging.info('results %s', json.dumps(results, ensure_ascii=False, indent=2))
    # log the crawl results (serialize the JSON results into a string)
    # json.dumps pretty-prints the data, with indent controlling the indentation width
    # (by default non-ASCII characters are escaped; ensure_ascii=False makes Chinese print correctly)

    ids = []
    for index_data in results:
        # iterate over the entries extracted from results
        if not index_data:
            continue
            # skip this round if the entry is empty
        for item in index_data.get('results'):
            ids.append(item.get('id'))
            # for a non-empty index_data, collect every book id on this index page into the ids list

    scrape_detail_tasks = [asyncio.ensure_future(scrape_detail(id)) for id in ids]
    # the list of tasks that crawl the detail pages (each id in ids is passed to scrape_detail to build a task)
    await asyncio.wait(scrape_detail_tasks)
    # pass the list to asyncio's wait method to start crawling the detail pages (same effect as gather here)
    await session.close()


async def scrape_index(page):
    url = INDEX_URL.format(offset=PAGE_SIZE * (page - 1))
    # build the index-page url
    return await scrape_api(url)
    # call the generic scrape_api method to request and return this index page (JSON response)


async def scrape_detail(id):
    url = DETAIL_URL.format(id=id)
    # build the detail-page url
    data = await scrape_api(url)
    # call the generic scrape_api method to request and return this detail page (JSON response)
    await save_data(data)
    # call save_data to store the extracted data asynchronously


async def save_data(data):
    # the data-saving method
    logging.info('saving data %s', data)
    # log what is being saved
    if data:
        return await collection.update_one(
            {
                'id': data.get('id')
            },
            {
                "$set": data
            }, upsert=True)
    # the insert is done with update_one (an update keyed on the book id extracted from data)
    # $set updates only the fields present in the data dict; existing fields are neither replaced nor deleted
    # (without $set the whole stored document would be replaced)
    # upsert=True inserts the document if no record matches the query condition (the id)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
    # register the main method with the event loop

