Crawling HD stills of "Pantyhose Horizon" from Douban with Python

Preface

Today a friend asked me to write a script: he wanted to crawl the stills of an anime he had just finished watching, so I set out to meet his needs, hahaha~ Because the script involves some batch operations, and the modules it uses also fall under Python penetration programming, I will briefly walk through the idea behind it in three steps.

Step 1: Assemble the URLs

I first took a look at the URL structure of the site. This kind of image crawling is a bit like a nesting doll: you have to clean the data layer by layer until you finally reach the real .jpg address. The script ends up as a chain of four functions: get_num() collects the photo-page URLs, turn_page() follows the pagination, get_img_url() extracts the real image address, and down_img() saves the files.

The script itself is easy to understand; what you really need is a grasp of Python regular expressions. I happen to have a detailed Python regex write-up waiting for your review:
the Python regex martial-arts cheat sheet.
The photo list is paginated, so two regexes are needed, fed through two for loops.

Right-click to view the page source and work out the matching rule from it. The code for the first step then comes out:

# get_num() gathers the URLs of all "Pantyhose Horizon" photo pages
def get_num(url1):
    url = 'https://movie.douban.com/subject/30419644/photos?type=S'
    req = requests.get(url=url, headers=header)
    html = req.text
    page = re.findall(r'<a href="https://movie\.douban\.com/photos/photo/(.*?)">', html)
    pages = re.findall(r'<a href="(.*?)" >\d</a>', html)
    # Filter out the URLs of the pages we have to jump to and hand them to turn_page()
    for j in pages:
        turn_page(j)
    # Join each filtered-out photo ID onto url1 and pass the result to get_img_url()
    for i in page:
        url2 = url1 + i
        get_img_url(url2)
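
To make the two patterns concrete, here is a minimal sketch of what each one captures, run against hand-written fragments (the markup and IDs are illustrative, not copied from Douban):

import re

# Illustrative photo link: pattern 1 captures what follows .../photos/photo/
photo_html = '<a href="https://movie.douban.com/photos/photo/2561234567/">'
print re.findall(r'<a href="https://movie\.douban\.com/photos/photo/(.*?)">', photo_html)
# -> ['2561234567/']

# Illustrative numbered pagination link: pattern 2 captures its href
page_html = '<a href="https://movie.douban.com/subject/30419644/photos?type=S&start=30" >2</a>'
print re.findall(r'<a href="(.*?)" >\d</a>', page_html)
# -> ['https://movie.douban.com/subject/30419644/photos?type=S&start=30']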

This is the code for automatic page turning:

# Fetch each pagination URL and give it a second round of processing
def turn_page(page):
    host = 'https://movie.douban.com/photos/photo/'
    req = requests.get(url=page, headers=header)
    html = req.text
    pages = re.findall(r'<a href="https://movie\.douban\.com/photos/photo/(.*?)">', html)
    for j in pages:
        url2 = host + j
        get_img_url(url2)

Step 2: Filter the URLs

After assembling the URLs, clicking into one of them shows that the real image address still has to be extracted, so we construct one more regex. The third piece of code:

# One round of data cleaning: scrape the real image URL from the photo page passed in
def get_img_url(url2):
    req = requests.get(url=url2, headers=header)
    html = req.text
    url = re.findall(r'<img src="(.*?)" width="686" />', html)
    down_img(url)
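
A quick sanity check of the pattern against a hand-written <img> tag (illustrative markup, assuming the photo page renders the still at a fixed width of 686):

import re

# Illustrative photo-page fragment; the width="686" attribute anchors the match
html = '<img src="https://img1.doubanio.com/view/photo/l/public/p2561234567.jpg" width="686" />'
print re.findall(r'<img src="(.*?)" width="686" />', html)
# -> ['https://img1.doubanio.com/view/photo/l/public/p2561234567.jpg']

Note that anchoring on a fixed width attribute is brittle: if Douban ever changes the page layout, the pattern will silently return an empty list.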

Step 3: Download the pictures in batches

After this final round of filtering we have the real .jpg addresses; download them and call it a day!

# Finally, take the cleaned-up URLs and download the actual images~
def down_img(url3):
    # Receive the list of image URLs and loop over it
    for img_url in url3:
        req = requests.get(url=img_url, headers=header)
        img = req.content
        # Name each file after the last path segment of its URL,
        # so images are not overwritten because of duplicate names
        img_name = img_url.split('/')[-1]
        # dir is the download directory defined in __main__
        local_name = os.path.join(dir, img_name)
        # The download starts here~ and sleep a bit at the end to avoid an IP ban
        with open(local_name, 'wb') as f:
            print 'Downloading {}......'.format(local_name)
            f.write(img)
            print 'Done, please check!'
            time.sleep(3)
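
The split('/')[-1] naming trick simply keeps the last path segment of the URL as the file name, for example (illustrative URL):

img_url = 'https://img1.doubanio.com/view/photo/l/public/p2561234567.jpg'
print img_url.split('/')[-1]
# -> p2561234567.jpg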

Complete code

Note: the interpreter here is Python 2.7, not 3.8; students on Python 3 will have to adapt the code!
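
For reference, here is a minimal sketch of what a Python 3 port of down_img() could look like; the main changes are print becoming a function and catching OSError instead of WindowsError in the directory check (the rest of the script ports the same way):

# Python 3 sketch of down_img(); header and dir are the same globals as below
def down_img(url3):
    for img_url in url3:
        req = requests.get(url=img_url, headers=header)
        img_name = img_url.split('/')[-1]
        local_name = os.path.join(dir, img_name)
        with open(local_name, 'wb') as f:
            print('Downloading {}......'.format(local_name))
            f.write(req.content)
            print('Done, please check!')
            time.sleep(3)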

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2020/10/13 16:50
# @Author  : Shadow
# @Site    : 
# @File    : Panty_hose.py
# @Software: PyCharm

# Suggested reading order: start from the bottom
import requests
import re
import os
import time


# Finally, take the cleaned-up URLs and download the actual images~
def down_img(url3):
    # Receive the list of image URLs and loop over it
    for img_url in url3:
        req = requests.get(url=img_url, headers=header)
        img = req.content
        # Name each file after the last path segment of its URL,
        # so images are not overwritten because of duplicate names
        img_name = img_url.split('/')[-1]
        # dir is the download directory defined in __main__
        local_name = os.path.join(dir, img_name)
        # The download starts here~ and sleep a bit at the end to avoid an IP ban
        with open(local_name, 'wb') as f:
            print 'Downloading {}......'.format(local_name)
            f.write(img)
            print 'Done, please check!'
            time.sleep(3)

# One round of data cleaning: scrape the real image URL from the photo page passed in
def get_img_url(url2):
    req = requests.get(url=url2, headers=header)
    html = req.text
    url = re.findall(r'<img src="(.*?)" width="686" />', html)
    down_img(url)

# Fetch each pagination URL and give it a second round of processing
def turn_page(page):
    host = 'https://movie.douban.com/photos/photo/'
    req = requests.get(url=page, headers=header)
    html = req.text
    pages = re.findall(r'<a href="https://movie\.douban\.com/photos/photo/(.*?)">', html)
    for j in pages:
        url2 = host + j
        get_img_url(url2)

# get_num() gathers the URLs of all "Pantyhose Horizon" photo pages
def get_num(url1):
    url = 'https://movie.douban.com/subject/30419644/photos?type=S'
    req = requests.get(url=url, headers=header)
    html = req.text
    page = re.findall(r'<a href="https://movie\.douban\.com/photos/photo/(.*?)">', html)
    pages = re.findall(r'<a href="(.*?)" >\d</a>', html)
    # Filter out the URLs of the pages we have to jump to and hand them to turn_page()
    for j in pages:
        turn_page(j)
    # Join each filtered-out photo ID onto url1 and pass the result to get_img_url()
    for i in page:
        url2 = url1 + i
        get_img_url(url2)

if __name__=='__main__':

    # Define the request header
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    }

    # Define the directory where the images are saved
    dir = r'D:\Pantyhose'

    # Check whether the directory Pantyhose already exists on the D: drive; create it if not
    # (os.mkdir() returns None, so we report success right after it, not via its return value)
    try:
        print 'Checking the image directory......'
        os.mkdir(dir)
        print 'Created the image directory for you: ' + dir
    # If it already exists, just report the path
    except OSError:
        print 'The directory already exists: ' + dir

    # Go crawl!
    get_num('https://movie.douban.com/photos/photo/')


Origin blog.csdn.net/qq_43573676/article/details/109062860