Asynchronous Data Crawling in Python: A Hands-On Crawler Example




Foreword

There are three main approaches to asynchronous programming in Python: callbacks, generators (coroutines), and threads.
Asynchronous programming uses processes, threads, coroutines, and functions/methods as the basic units for executing tasks, and combines mechanisms such as callbacks, event loops, and semaphores to improve a program's overall throughput and concurrency.
If a running program can determine exactly which operation it will perform next based on the instructions already executed, it is synchronous; otherwise it is asynchronous (the difference between ordered and unordered execution).
Synchronous/asynchronous and blocking/non-blocking are not mutually exclusive; which one applies depends on the level of encapsulation being discussed. For example, a shopping program can handle browsing requests from many users asynchronously, but must update inventory synchronously.
Advantages: asynchronous operations avoid the overhead of extra threads and are handled with callbacks. With a good design, handler functions need few or no shared variables (even if shared state cannot be eliminated entirely, it can at least be reduced), which lowers the risk of deadlock.
Disadvantages: asynchronous programming is more complex and harder to debug. The biggest problem is the callbacks, which make software design more difficult.
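
As a minimal illustration of this difference (a sketch added here, not part of the original crawler), the three simulated I/O waits below run concurrently in a single thread, so the total run time is close to the longest single wait rather than the sum of all three:

import asyncio
import time


async def fake_request(delay):
    # stand-in for an I/O-bound call such as a network request
    await asyncio.sleep(delay)
    return delay


async def main():
    start = time.time()
    # the three coroutines are awaited concurrently, not one after another
    results = await asyncio.gather(fake_request(1), fake_request(1), fake_request(1))
    print(results, 'took about', round(time.time() - start, 1), 'seconds')


asyncio.run(main())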


1. Requirements

  • Use the asyncio and aiohttp modules to crawl data asynchronously
  • Control the amount of concurrency
  • Crawl the page data

2. Steps

1. Approach

  • Visit the website and analyze the data to be crawled
  • Request all the pages to be crawled in one asynchronous batch to obtain the data
  • Set a value that limits how many requests run at the same time (see the sketch below)
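
The core pattern behind these steps is a semaphore-limited fetch. The sketch below (simplified from the full code in section 3, with placeholder URLs) schedules every page at once with asyncio.gather, while an asyncio.Semaphore caps how many requests are actually in flight:

import asyncio
import aiohttp

CONCURRENCY = 5
# at most CONCURRENCY requests run at the same time
semaphore = asyncio.Semaphore(CONCURRENCY)


async def fetch(session, url):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()


async def main(urls):
    async with aiohttp.ClientSession() as session:
        # create one task per page and wait for all of them together
        return await asyncio.gather(*(fetch(session, url) for url in urls))


# example call with placeholder URLs:
# pages = asyncio.run(main(['https://example.com/1.html', 'https://example.com/2.html']))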


2. Import libraries

The code is as follows (example):

import asyncio
import aiohttp
import json
import time
import requests
import re
from lxml import etree
import datetime

3. The code is as follows

The code is as follows (example):

import asyncio
import aiohttp
import json
import time
import requests
import re
from lxml import etree
import datetime


CONCURRENCY = 5
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
}
# URL = 'http://127.0.0.1:30328'
# asyncio's Semaphore is used to limit the number of concurrent requests
semaphore = asyncio.Semaphore(CONCURRENCY)


url_3011 = 'xxxx'
response_3011 = requests.get(
    url=url_3011, headers=headers)
HTML_1 = etree.HTML(response_3011.text)
# list that collects the scraped records
json_data_list = []


async def scrape_api(session, URL):
    # limit the number of concurrent requests
    async with semaphore:
        # print('scraping', URL)
        # request the page and return its HTML text
        async with session.get(URL, headers=headers) as response:
            await asyncio.sleep(1)
            # the session is closed by the caller's "async with" block
            # await session.close()
            return await response.text()


async def session_url(url):

    # set a timeout
    timeout = aiohttp.ClientTimeout(total=7)
    # "async with" closes the session automatically
    # the request library is aiohttp instead of requests; the request is made
    # through the get method of aiohttp's ClientSession class
    async with aiohttp.ClientSession(timeout=timeout) as session:

        html = await scrape_api(session, url)
        print('scraping', url)
        pages_1 = etree.HTML(html)
        for b in pages_1.xpath('/html/body/div[2]/div[3]/ul/li'):
            game_name = b.xpath('div[2]/div[1]/a/text()')[0]
            service = b.xpath('div[3]/text()')[0].strip()
            print({"game": game_name, "server": service,
                   "mobile": "安卓", "time": timestamp})
            json_data_list.append(
                {"game": game_name, "server": service,
                 "mobile": "安卓", "time": timestamp})


def url_list():
    for number1, day in enumerate(HTML_1.xpath('/html/body/div[2]/div[2]/div'), 1):
        day1 = day.xpath('a/div[1]/text()')[0]
        # convert to a time struct
        timeArray = time.strptime(str(datetime.datetime.now().year) + '-' + str(
            datetime.datetime.now().month) + '-' + str(day1) + ' ' + '00:00:00', "%Y-%m-%d %H:%M:%S")
        # convert to a millisecond timestamp
        global timestamp
        timestamp = int(time.mktime(timeArray))*1000
        urls = ('https://www.3011.cn/server/%s/1.html' % (number1))
        response_3011_page = requests.get(url=urls, headers=headers)
        pattern_page = r'<li>共(\d+)页</li>'
        pages = re.findall(pattern_page, response_3011_page.text, re.S)[0]
        for a in range(1, int(pages)+1):
            # build the page URL to request
            urls_1 = ('https://www.3011.cn/server/%s/%s.html' % (number1, a))
            yield urls_1


async def main():
    scrape_index_tasks = []
    for url1 in url_list():
        # ensure_future also returns a Task object, so we can create tasks
        # without referencing the event loop directly
        scrape_index_tasks.append(asyncio.ensure_future(session_url(url1)))
    # pass all the created tasks to gather and run them concurrently
    await asyncio.gather(*scrape_index_tasks)

    # scrape_index_tasks = [asyncio.ensure_future(scrape_api()) for _ in range(10000)]
    # declare 10000 tasks and pass them all to gather to run
    # await asyncio.gather(*scrape_index_tasks)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
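
A small note on the entry point: on Python 3.10 and later (where primitives such as Semaphore no longer bind to a loop at creation time), asyncio.run is the usual way to start the event loop, so the last two lines can equivalently be written as:

if __name__ == '__main__':
    asyncio.run(main())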


Summary

That is all for today. This article only briefly introduces asyncio and aiohttp; using these two modules together can greatly speed up a crawler.

Origin blog.csdn.net/weixin_45688123/article/details/127413089