Using a Python crawler to quickly download a large number of pictures

Table of contents

1. Import libraries

2. Write the image download URLs to a file

3. Download the pictures

4. The main function

Summary


Recommended learning video:

https://www.bilibili.com/video/BV1v24y127E3?p=27&vd_source=ed36b2700bbc2bac7746c270bc391540

1. Import libraries

import os  # used later to create the image output directory
from concurrent.futures import ThreadPoolExecutor  # thread pool for crawling the listing pages
import requests
from bs4 import BeautifulSoup
import asyncio
import aiohttp   # asynchronous HTTP client
import aiofiles  # asynchronous file I/O
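
Apart from os, asyncio and concurrent.futures (all standard library), these are third-party packages; if they are missing, a typical installation is pip install requests beautifulsoup4 aiohttp aiofiles.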

2. Write the image download URLs to a file

def download_one_page(k):
    # The URL pattern was found by inspecting the different listing pages
    if k == 1:
        url = "http://www.gaoimg.com/photo/nature/"
    else:
        url = f"http://www.gaoimg.com/photo/nature/list_39_{k}.html"
    resp = requests.get(url)
    resp.encoding = 'utf-8'  # fix garbled characters
    # print(resp.text)
    # Parse with bs4
    main_page = BeautifulSoup(resp.text, "html.parser")
    alist = main_page.find_all("div", class_="item")  # tag, attribute
    # print(alist)
    for a in alist:
        # get() returns the attribute value; prepend the domain to build the full image URL
        href = "http://www.gaoimg.com" + a.find("img").get('data-original')
        # print(href)
        # Append the image download link to the file
        with open("photo_http.txt", mode="a+", encoding="utf-8") as f:
            f.write(href)
            f.write('\n')
    print(f"Page {k} done!!!")

3. Download the pictures

Downloading a picture simply means writing the picture's bytes to a file.
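
As a plain synchronous sketch of that idea (assuming a direct image URL and an existing img2/ directory; download_one is a hypothetical helper, not part of the original code):

import requests

def download_one(href, img_name):
    # Fetch the image and write its raw bytes to disk
    resp = requests.get(href)
    with open(f"img2/{img_name}", mode="wb") as f:
        f.write(resp.content)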

To increase download speed, the actual downloading is done asynchronously.

# Download a single image
async def download_ts(href, img_name, session):
    async with session.get(href) as img_resp:
        async with aiofiles.open(f"img2/{img_name}", mode="wb") as f:
            await f.write(await img_resp.content.read())  # write the image bytes to the file
    print(f"{img_name} downloaded")

# Asynchronous download
async def aio_download():
    tasks = []
    async with aiohttp.ClientSession() as session:  # prepare the session up front
        async with aiofiles.open("photo_http.txt", mode="r", encoding='utf-8') as f:
            async for href in f:  # read the image URLs from the file
                href = href.strip()  # drop surrounding whitespace and the trailing newline
                if not href:  # skip blank lines
                    continue
                # print(href)
                img_name = href.split("/")[-1]  # file name for the image
                # print(img_name)
                # Create one task per image
                task = asyncio.create_task(download_ts(href, img_name, session))
                tasks.append(task)
            await asyncio.wait(tasks)
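
One caveat: asyncio.wait() raises a ValueError if the task list is empty, for example when photo_http.txt contains no usable lines. asyncio.gather handles an empty list and also re-raises exceptions from the tasks, so the last line of aio_download could instead be written as (a sketch, not part of the original code):

            # Alternative ending: gather accepts an empty task list
            # and propagates exceptions raised inside the tasks
            await asyncio.gather(*tasks)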

4. The main function

# Main entry point
if __name__ == '__main__':
    pages = int(input('Enter the number of pages: '))
    os.makedirs("img2", exist_ok=True)  # make sure the output directory exists
    # Collect the image links with a thread pool
    with ThreadPoolExecutor(5) as t:
        for i in range(1, pages + 1):
            # Submit the download task for each listing page to the pool
            t.submit(download_one_page, i)
    # Download the images asynchronously
    asyncio.run(aio_download())
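
Note that the with ThreadPoolExecutor block only exits once every submitted page task has finished, so photo_http.txt is complete before the asynchronous download phase starts.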

Summary

The code above uses a thread pool to fetch the image download links, in order to improve speed.

The images themselves are downloaded asynchronously, which likewise improves download speed.

Origin: blog.csdn.net/HX091624/article/details/127839728