Scraping Sogou Beauty Images with asyncio Coroutines (Part 2): Hands-On Practice

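The previous post covered how to use the asyncio library in detail (link: https://blog.csdn.net/MG1723054/article/details/81778460). In this post we apply it in practice, building on the earlier example that scrapes Sogou beauty images by analyzing Ajax requests (link: https://blog.csdn.net/MG1723054/article/details/81735834).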

The code is shown directly below, with comments in the code explaining what each line does.

We crawl the first 25 pages. First, here again is the earlier code, without coroutines:

# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a NJUer.
"""
import requests
import time
from urllib.parse import urlencode  # URL-encodes the query parameters
import json  # JSON parsing
urls=[]
def image_json(url):
    # Request the constructed URL with requests, parse the response as JSON,
    # then yield the title and image URL of each item.
    response=requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
    data=json.loads(response.text)['all_items']
    for m in range(len(data)):
        items={
            'image_url':data[m]['thumbUrl'],'title':data[m]['title']
        }
        yield items

def image_download(item):  # download one image
    resource=requests.get(item['image_url'])
    # Rename: some titles contain characters that are not valid in a .jpg file name
    item['title']=item['title'].replace('|','_')
    item['title']=item['title'].replace('/','_')
    file='C:\\Users\\FangWei\\Desktop\\网络爬虫\\爬取酷狗美女图片\\'+item['title'][0:20]+'.jpg'
    with open(file,'wb') as f:
        f.write(resource.content)  # save the image into the target folder
def get_image(offest):  # get_image() builds the Ajax URL to request
    base_url='http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?'
    data={'category':'美女',
          'tag':'全部',
          'start':str(offest*15),
          'len':'15',}
    url=base_url+urlencode(data)  # urlencode turns the dict into a query string
    return url
def main(offest):
    infor=get_image(offest)  # main() calls get_image() to build the URL
    #time.sleep(1)
    for item in image_json(infor):
        image_download(item)
if __name__=='__main__':
    start=time.time()
    for x in range(1,26):  # 25 pages; each request returns 15 images, so up to 25*15 images
        offest=x
        main(offest)  # call main() for each page
    end=time.time()
    times=end-start
    print(times)

The running time is:

As you can see, crawling these pages takes a fairly long time.
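Two parts of the script above can be checked without touching the network: the URL built for each page, and the way a title is turned into a file name. The following is a minimal sketch under stated assumptions, not the author's code. `BASE_URL`, `build_url` and `safe_filename` are hypothetical names introduced here; `build_url` mirrors the logic of `get_image`, while `safe_filename` assumes a Windows target and replaces every character Windows forbids in file names, not only `|` and `/`.

```python
import re
from urllib.parse import urlencode, urlsplit, parse_qs

# Same endpoint as in the script above.
BASE_URL = 'http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?'

def build_url(page, page_size=15):
    """Hypothetical mirror of get_image(): page N starts at item N * page_size."""
    params = {'category': '美女',
              'tag': '全部',
              'start': str(page * page_size),
              'len': str(page_size)}
    return BASE_URL + urlencode(params)

def safe_filename(title, max_len=20):
    """Hypothetical helper: replace characters Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    return cleaned[:max_len] + '.jpg'

if __name__ == '__main__':
    for page in (1, 2, 25):
        # Decode the query string again to see which items this page requests.
        query = parse_qs(urlsplit(build_url(page)).query)
        print(page, query['start'][0], query['len'][0])
    print(safe_filename('a/b|c?d'))
```

One thing this makes visible: because the loop runs over `range(1,26)`, the first request starts at item 15. If the endpoint counts items from 0, the first 15 items are never requested.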

Next, we modify the program to use single-threaded coroutine concurrency to improve efficiency.

# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a NJUer.
"""
import requests
import time,json
from urllib.parse import urlencode
import asyncio,aiohttp
urls=[]
def image_json(url):
    response=requests.get(url,headers={'User-Agent':'Mozilla/5.0'})
    data=json.loads(response.text)['all_items']
    for m in range(len(data)):
        items={
            'image_url':data[m]['thumbUrl'],'title':data[m]['title']
        }
        # yield makes this a generator: like return it hands back a value, but items are
        # produced one at a time, which saves memory. Generators can also serve as coroutines.
        yield items

async def image_download(item):
    # Clean up the title: some titles contain characters that are not valid in a .jpg file name
    item['title']=item['title'].replace('|','_')
    item['title']=item['title'].replace('/','_')
    file='C:\\Users\\FangWei\\Desktop\\网络爬虫\\爬取搜狗美女图\\'+item['title'][0:20]+'.jpg'  # output file name
    # The two nested "async with" statements around aiohttp.ClientSession are the most reliable form.
    # Writing session=aiohttp.ClientSession() and resp=session.get(item['image_url']) directly
    # may raise errors; if it does, use the form below.
    async with aiohttp.ClientSession() as session:
        async with session.get(item['image_url']) as resp:
            #print(resp.status)
            # Read the body as bytes. This differs from requests, where the bytes are in .content
            imgcode=await resp.read()
    with open(file,'wb') as f:
        f.write(imgcode)  # write the image bytes to the file
def get_image(offest):
    base_url='http://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?'
    data={'category':'美女',
          'tag':'全部',
          'start':str(offest*15),
          'len':'15',}
    url=base_url+urlencode(data)
    return url
def main(offest):
    infor=get_image(offest)
    #time.sleep(1)
    #for item in image_json(infor):
        #image_download(item)
    # Create one task per image. This list comprehension does the same work as the two
    # commented-out lines above, but the downloads now run concurrently as coroutines.
    # Note: on recent Python versions, creating tasks before a loop is running is deprecated;
    # asyncio.run() together with asyncio.gather() is the preferred pattern there.
    tasks=[asyncio.ensure_future(image_download(item)) for item in image_json(infor)]
    loop=asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))  # register the tasks with the event loop and run them to completion
if __name__=='__main__':
    start=time.time()
    for x in range(1,26):
        offest=x
        main(offest)
    end=time.time()
    times=end-start
    print(times)

After the code above finishes, the running time is:

Clearly, coroutine concurrency cut the running time by more than half, so coroutines are worth using where appropriate in real crawling tasks.
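The speedup comes from overlapping the time spent waiting on the network. The sketch below illustrates the effect without any real requests: each download is replaced by `asyncio.sleep`, and the same 15 simulated downloads are run once sequentially and once concurrently. `DELAY`, `fake_download`, `run_sequential`, `run_concurrent` and `timed` are hypothetical names, and the 0.05 second delay is an arbitrary stand-in for network latency, not a measurement.

```python
import asyncio
import time

DELAY = 0.05  # assumed stand-in for the network wait of one image download

async def fake_download(i):
    # While this coroutine sleeps, the event loop is free to run other coroutines.
    await asyncio.sleep(DELAY)
    return i

async def run_sequential(n):
    # Each download is awaited before the next one starts, so the waits add up.
    return [await fake_download(i) for i in range(n)]

async def run_concurrent(n):
    # gather() schedules all downloads at once, so the waits overlap.
    return await asyncio.gather(*(fake_download(i) for i in range(n)))

def timed(coro_func, n):
    """Run coro_func(n) in a fresh event loop and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_func(n))
    return result, time.perf_counter() - start

if __name__ == '__main__':
    seq_result, seq_time = timed(run_sequential, 15)
    con_result, con_time = timed(run_concurrent, 15)
    print(f'sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s')
```

The real script gains less than this idealized case suggests. The page listing is still fetched with the blocking `requests` call, and pages are processed one after another, so only the 15 image downloads within a page overlap. The `asyncio.run` plus `asyncio.gather` pattern used here is also one possible replacement for `ensure_future`, `get_event_loop` and `run_until_complete` on newer Python versions.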

Original work takes effort. If you repost this article, please credit the source and the author. Thank you.
