Asynchronous crawlers

Python's asyncio module is commonly used for asynchronous processing and lets us run I/O operations concurrently. I will cover the asyncio module itself in a future article; here we look at aiohttp, an HTTP framework built on top of asyncio. It lets us make asynchronous HTTP requests, which can greatly improve the efficiency of our programs.
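
For readers who have never used aiohttp, here is a minimal sketch (not part of the original project; the URL is only a placeholder) of a single asynchronous GET request, written in the same event-loop style as the code later in this article:

import asyncio
import aiohttp

async def get_text(url):
    # Open a client session, issue one GET request and return the response body as text
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

loop = asyncio.get_event_loop()
html = loop.run_until_complete(get_text('http://example.com'))  # placeholder URL
print(html[:100])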

This article introduces a simple application of aiohttp in a web crawler.

The project comes from an earlier post, Scrapy crawlers (5): crawling the Dangdang books bestseller list, in which we used the Python crawler framework Scrapy to scrape information about the best-selling books on Dangdang. In this article I will write the crawler in two ways and compare the efficiency of the synchronous crawler with that of the asynchronous one (implemented with aiohttp), to demonstrate the advantage of aiohttp for crawling.

First, let's look at the crawler implemented in the ordinary, synchronous way. The complete Python code is as follows:

'''
Crawl the book information from the Dangdang bestseller list synchronously
'''

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

# table stores the information of each book
table = []

# Download and parse one page
def download(url):
    html = requests.get(url).text

    # Parse the downloaded text into HTML with BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller entries from the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')

    for book in book_list:
        info = book.find_all('div')

        # Get each bestseller's rank, title, number of comments, author and publisher
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ''

        # Append this book's information to table
        table.append([rank, name, comments, author, publisher])


# All pages to crawl
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Time how long the crawler takes
print('#' * 50)
t1 = time.time()  # start time

for url in urls:
    download(url)

# Convert table into a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)

t2 = time.time()  # end time
print('Ordinary (synchronous) method, total time: %s' % (t2 - t1))
print('#' * 50)

Output:

##################################################
Ordinary (synchronous) method, total time: 23.522345542907715
##################################################

Next, let's look at the efficiency of the asynchronous crawler written with aiohttp. The complete source code is as follows:

'''
Crawl the book information from the Dangdang bestseller list asynchronously
'''

import time
import aiohttp
import asyncio
import pandas as pd
from bs4 import BeautifulSoup

# table stores the information of each book
table = []

# Fetch a page (its text content)
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text(encoding='gb18030')

# Parse the page
async def parser(html):
    
    # Parse the downloaded text into HTML with BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    # Get the bestseller entries from the page
    book_list = soup.find('ul', class_="bang_list clearfix bang_list_mode")('li')

    for book in book_list:
        
        info = book.find_all('div')

        # Get each bestseller's rank, title, number of comments, author and publisher
        rank = info[0].text[0:-1]
        name = info[2].text
        comments = info[3].text.split('条')[0]
        author = info[4].text
        date_and_publisher = info[5].text.split()
        publisher = date_and_publisher[1] if len(date_and_publisher) >= 2 else ''

        # Append this book's information to table
        table.append([rank, name, comments, author, publisher])
        
# Handle one page: fetch it and parse it
async def download(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)
        await parser(html)

# All pages to crawl
urls = ['http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-recent7-0-0-1-%d' % i for i in range(1, 26)]

# Time how long the crawler takes
print('#' * 50)
t1 = time.time()  # start time

# Use the asyncio module for asynchronous I/O
loop = asyncio.get_event_loop()
tasks = [asyncio.ensure_future(download(url)) for url in urls]
tasks = asyncio.gather(*tasks)
loop.run_until_complete(tasks)

# Convert table into a pandas DataFrame and save it as a CSV file
df = pd.DataFrame(table, columns=['rank', 'name', 'comments', 'author', 'publisher'])
df.to_csv('E://douban/dangdang.csv', index=False)
    
t2 = time.time()  # end time
print('With aiohttp, total time: %s' % (t2 - t1))
print('#' * 50)

As you can see, the overall approach and logic of this crawler are essentially the same as in the ordinary version; the only differences are that aiohttp is used to make the HTTP requests and that the functions which fetch and parse the pages have become coroutines, which asyncio then runs concurrently. This, unsurprisingly, improves the crawler's efficiency. Its output is as follows (a small refinement of the session handling is sketched after the output):

##################################################
With aiohttp, total time: 2.405137538909912
##################################################
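
The code above opens a new ClientSession for every page and fires off all 25 requests at once. A common refinement, sketched below as an illustration rather than as the author's implementation (fetch_all, fetch_one and max_concurrency are names invented for this example), is to share one session across all requests and bound the concurrency with a semaphore:

import asyncio
import aiohttp

async def fetch_all(urls, max_concurrency=10):
    # At most max_concurrency requests are in flight at any moment
    semaphore = asyncio.Semaphore(max_concurrency)

    async with aiohttp.ClientSession() as session:
        async def fetch_one(url):
            async with semaphore:
                async with session.get(url) as response:
                    return await response.text(encoding='gb18030')

        tasks = [asyncio.ensure_future(fetch_one(url)) for url in urls]
        return await asyncio.gather(*tasks)

# Usage sketch: html_pages = loop.run_until_complete(fetch_all(urls)), then parse each page as before.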

In summary, the synchronous and asynchronous crawlers differ enormously in efficiency (here roughly 23.5 seconds versus 2.4 seconds, about a tenfold speed-up), so when building crawlers in practice it is well worth considering an asynchronous crawler and making more use of asynchronous modules such as asyncio and aiohttp. Note also that aiohttp requires Python 3.5.3 or later.

Of course, this is only meant as an example of an asynchronous crawler and does not go into the ideas behind asynchronous programming in any depth. Asynchronous thinking is already widely used in real-world production, websites included; I will write up my own understanding of asynchronous programming in a later article, so stay tuned.

That's all for this article. You are welcome to follow the author's WeChat public account, "Easy to Learn Python Crawling" (WeChat ID: easy_web_scrape), and to get in touch.

From https://www.cnblogs.com/jclian91/p/9641856.html
