Python: a simple asynchronous crawler with the asyncio + aiohttp modules

I have read many blog posts on this topic and realised how little I know about it; I am not yet fluent with the aiohttp library. This post follows other people's code as a first hands-on attempt, and I will read the official documentation later to fill in the gaps.

Chinese documentation:
https://segmentfault.com/p/1210000013564725
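For context, the core pattern in aiohttp is a ClientSession used as an async context manager, with both the request and the reading of the response body awaited. A minimal sketch (the URL is just the target site's front page, picked as an arbitrary example):

import asyncio
import aiohttp

async def demo(url):
    async with aiohttp.ClientSession() as session:        # one session, reused for all requests
        async with session.get(url) as response:          # non-blocking GET
            print(response.status)
            text = await response.text(encoding='utf-8')  # reading the body is awaited too
            print(len(text))

asyncio.get_event_loop().run_until_complete(demo('http://www.ivsky.com/'))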

Below is the crawler I wrote with aiohttp, modeled on other people's code.

Target site:
	http://www.ivsky.com/tupian/ziranfengguang/
	A simple crawl of the photos on the ivsky.com gallery site.
The logic is simple (fetch each listing page, parse out the image URLs, download the images), so I will go straight to the code:
import time
import aiohttp
import asyncio
from scrapy import Selector

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36'
}


# Fetch a page and return its text
async def fetch(session, url):
    async with session.get(url, headers=headers) as response:
        return await response.text(encoding='utf-8')

# Parse all image URLs out of one page's HTML
async def url_parse(html):
    selector = Selector(text=html)
    url_list = selector.xpath('//ul[@class="ali"]//li//img/@src').extract()
    return url_list

# Download the images
async def down_img(session, url_list):
    for each_url in url_list:
        print('Downloading %s' % each_url)
        async with session.get(each_url, headers=headers) as response:
            img_response = await response.read()
            # note: the ./image directory must already exist
            with open('./image/%s.jpg' % time.time(), 'wb') as file:
                file.write(img_response)


# Crawl one listing page: fetch, parse, download
async def start(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, url)   # fetch the page's HTML
        url_list = await url_parse(html)   # parse the image URLs out of the page
        await down_img(session, url_list)  # download the images


if __name__ == '__main__':
    url_template = "http://www.ivsky.com/tupian/ziranfengguang/index_{page}.html"
    full_urllist = [url_template.format(page=i) for i in range(1, 20)]
    event_loop = asyncio.get_event_loop()
    tasks = [start(url) for url in full_urllist]
    event_loop.run_until_complete(asyncio.wait(tasks))  # run all page tasks and wait for them to finish
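
As a side note, on Python 3.7+ the event-loop boilerplate above can be written more simply with asyncio.run and asyncio.gather. A minimal sketch, reusing the start() coroutine and the same page range:

# Alternative entry point for Python 3.7+ (a sketch, not a drop-in part of the script above)
async def main():
    url_template = "http://www.ivsky.com/tupian/ziranfengguang/index_{page}.html"
    await asyncio.gather(*(start(url_template.format(page=i)) for i in range(1, 20)))

asyncio.run(main())  # creates, runs and closes the event loop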

I still need to study the aiohttp library properly; today I am only sharing the code.
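
One obvious improvement to try next: down_img above downloads a page's images one after another inside the loop. They can be fetched concurrently while still capping the number of simultaneous connections, using an asyncio.Semaphore together with asyncio.gather. A rough sketch (down_img_concurrent and the limit of 5 are my own example choices, not part of the original code):

# Sketch: download a page's images concurrently, at most `limit` requests at a time
async def down_img_concurrent(session, url_list, limit=5):
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(img_url):
        async with semaphore:  # at most `limit` coroutines get past this point at once
            async with session.get(img_url, headers=headers) as response:
                data = await response.read()
        # time.time() file names may collide under concurrency; good enough for a demo
        with open('./image/%s.jpg' % time.time(), 'wb') as file:
            file.write(data)

    await asyncio.gather(*(fetch_one(u) for u in url_list))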

Reposted from blog.csdn.net/weixin_42812527/article/details/83794787