python 爬虫保存图片/多进程

 
 
踩过的坑:
1. OSError: [Errno 22] Invalid argument 创建jpg文件时,直接用的图片链接作为图片名的,而链接中有'/',所以报错了,解决方法是链接切片
2. TypeError: a bytes-like object is required, not 'str' 把URL返回的response写入图片时报错,resp.text返回的是Unicode型的数据,
所以用resp.content,它返回的是bytes型也就是二进制的数据
 
 
#coding=utf-8
import time
import requests
from lxml import etree
import time
from multiprocessing.dummy import Pool


# Request headers for every HTTP call. The correct header name is
# 'User-Agent' — the original key 'userAgent' is not a real HTTP header,
# so the mobile UA string was never actually applied by the server.
# Implicit string concatenation (instead of a backslash continuation)
# also avoids embedding stray leading spaces inside the UA value.
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/65.0.3325.181 Mobile Safari/537.36'}


def get_info(url):
    """Scrape one Pexels listing page and save every photo it shows.

    Fetches *url*, extracts the ``src`` of each photo card's <img>,
    downloads the images and writes them to the current directory
    as ``.jpg`` files.

    :param url: listing-page URL, e.g. 'https://www.pexels.com/?page=1'
    :raises requests.HTTPError: if the page or an image request fails.
    """
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    # Feed the raw bytes to lxml so it can honour the page's declared
    # encoding; round-tripping through .text.encode('utf-8') is
    # unnecessary and can double-decode.
    selector = etree.HTML(response.content)
    imgs = selector.xpath('//*[@class="photo-item photo-item--overlay"]/a[1]/img')

    photo_urls = [img.get('src') for img in imgs if img.get('src')]

    for photo_url in photo_urls:
        # Build the file name from the last URL path segment instead of
        # the hard-coded slice item[33:39], which breaks as soon as the
        # URL layout changes; also strip any query string, and note that
        # '/' in a name is what caused the original OSError (Errno 22).
        name = photo_url.rsplit('/', 1)[-1].split('?')[0] or 'photo'
        if not name.lower().endswith('.jpg'):
            name += '.jpg'
        # Download before opening the file so a failed GET does not
        # leave a truncated/empty .jpg on disk.
        data = requests.get(photo_url, headers=headers)
        data.raise_for_status()
        with open(name, 'wb') as fp:
            # .content is bytes (binary image data); .text would be a
            # decoded str and cannot be written to a 'wb' file.
            fp.write(data.content)


if __name__ == '__main__':
    # Scrape listing page 1 only; widen the range to cover more pages.
    page_urls = ['https://www.pexels.com/?page={}'.format(i)
                 for i in range(1, 2)]

    # Sequential run, timed for comparison with the pooled variant below.
    t0 = time.time()
    for page_url in page_urls:
        print(page_url)
        get_info(page_url)
    t1 = time.time()
    print('time1 : ', t1 - t0)

    # Thread-pool variant (multiprocessing.dummy wraps threading);
    # kept commented out for timing comparison against the loop above.
    # t2 = time.time()
    # pool = Pool(processes=6)
    # pool.map(get_info, page_urls)
    # t3 = time.time()
    # print('time2 : ', t3 - t2)


猜你喜欢

转载自blog.csdn.net/qq_18525247/article/details/80323963