Python crawler practice (6) - Use proxy IP to batch download high-definition young lady pictures (complete source code attached)

1. Crawling target

The target of this crawl is the 4K high-definition pictures of young ladies on a certain website:

2. Achieved effect

Batch-download pictures for specified keywords and store them in a specified folder:

3. Preparation work

Python: 3.10

Editor: PyCharm

Third-party modules (install them yourself):

pip install requests  # fetch web page data
pip install lxml  # extract data from the page

4. Proxy IP

4.1 What are the benefits of using a proxy?

The benefits of crawlers using proxy IPs are as follows:

  • Rotate IP addresses: Rotating through proxy IPs reduces the risk of being banned and keeps the crawl running continuously and stably (a minimal rotation sketch follows this list).
  • Improve collection speed: A proxy service provides multiple IP addresses, so the crawler can run multiple threads at the same time and collect data faster.
  • Bypass anti-crawler mechanisms: Many websites use anti-crawler measures such as IP bans, CAPTCHAs, and request-rate limits. Proxy IPs help the crawler work around these mechanisms and keep collecting data normally.
  • Protect personal privacy: A proxy IP hides your real IP address and protects your identity and private information.
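
To illustrate the first point, here is a minimal rotation sketch. The proxy addresses are hypothetical placeholders, and the simple round-robin pool is only one way to rotate; the rest of this article instead pulls a fresh IP from a provider API before each request:

import itertools
import requests

# hypothetical placeholder proxies -- replace with addresses from your own provider
proxy_pool = itertools.cycle([
    '111.111.111.111:8000',
    '222.222.222.222:8000',
])

def fetch(url):
    """Send one request, switching to the next proxy in the pool each time."""
    ip = next(proxy_pool)
    proxies = {'http': ip, 'https': ip}
    return requests.get(url, proxies=proxies, timeout=10)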

The blogger often writes crawler code with high-anonymity proxy IPs from Juliang IP, which offers 1,000 free IPs every day: Click for a free trial

4.2 Get a free proxy

1. Open the official website of Juliang IP: Juliang IP official website

2. Enter account information to register:

3. Real-name authentication is required here. If you don’t know how, you can read the personal real-name registration tutorial:

4. Enter the member center and click to claim today’s free IP:

5. For detailed steps, refer to the official tutorial document: Tutorial on claiming the Juliang HTTP free proxy IP package. After claiming it, it looks like this:

6. Click Product Management > Dynamic Proxy (Time Package) to see the free IP information we just claimed:

7. Add your computer’s IP to the whitelist to obtain the proxy IP. Click Authorization Information:

8. Click Modify Authorization > Quick Add > Confirm

9. After the addition is completed, click to generate the extraction link:

10. Set the quantity for each extraction, click Generate Link, and copy the link:

11. Copy the link to the address bar and you can see the proxy IP we obtained:
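
For the get_ip() helper in section 4.3 to work as written, set the extraction quantity to 1 and the format to plain text, so that the API returns a single ip:port line. The value below is an illustrative placeholder, not a real proxy:

123.123.123.123:8888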

4.3 Obtain proxy

After obtaining the image links, we need to send requests again to download the images. Since the request volume is usually large, we need to use proxy IPs. We obtained a proxy IP manually above; now let’s see how to attach a proxy IP to a request in Python:

1. Use a small crawler to fetch a proxy IP from the API endpoint (Note: for the proxy URL below, see section 4.2 and replace it with your own API link):

import requests
import time


def get_ip():
    """Fetch one proxy IP from the provider's API and wrap it in a proxies dict."""
    url = "put your own API link here"
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API says "requests too frequent"
            print('IP request too frequent, retrying...')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': ip
    }
    return proxies


if __name__ == '__main__':
    proxies = get_ip()
    print(proxies)

From the running result, you can see that a proxy IP from the API is returned:
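
For reference, the printed proxies dict will look roughly like this (the address is an illustrative placeholder, not a real proxy):

{'https': '123.123.123.123:8888'}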

2. Next, when writing the crawler, we can attach the proxy IP to the requests we send. We only need to pass proxies as a parameter to requests.get when requesting other URLs:

requests.get(url, headers=headers, proxies=proxies) 
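
As a quick sanity check, you can point a proxied request at an IP-echo service and confirm that the proxy is actually being used. This is a minimal sketch; httpbin.org/ip is just one commonly used echo endpoint and is not part of the original tutorial:

import requests

proxies = get_ip()  # the helper defined above
headers = {'User-Agent': 'Mozilla/5.0'}
# httpbin echoes back the IP it sees; it should show the proxy's IP, not yours
r = requests.get('https://httpbin.org/ip', headers=headers, proxies=proxies, timeout=10)
print(r.text)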

5. Hands-on proxy practice

5.1 Import module

import requests  # basic Python crawling library
from lxml import etree  # converts the page into an Element object for XPath parsing
import time  # sleep briefly to avoid crawling too fast
import os  # create folders

5.2 Handle pagination

First, let’s analyze how the site paginates. There are 62 pages in total:

First page link:

https://pic.netbian.com/4kmeinv/index.html

Second page link:

https://pic.netbian.com/4kmeinv/index_2.html

Third page link:

https://pic.netbian.com/4kmeinv/index_3.html

As you can see, from the second page onward the URL is just index followed by an underscore and the page number, so we use a loop to construct all the page links:

if __name__ == '__main__':
    # number of pages
    page_number = 1
    # build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'

5.3 Get image link

You can see that all the image URLs sit under the ul tag > li tag > a tag > img tag:

We create a get_imgurl_list(url, imgurl_list) function that takes the page link, fetches the page source, and uses XPath to locate the link to each image:

def get_imgurl_list(url, imgurl_list):
    """Collect the image links on one page."""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all the li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@src")[0]
        # build the full URL
        imgurl = 'https://pic.netbian.com' + imgurl
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)

Running result:

Click on a picture link to see:

OK no problem!!!

5.4 Download pictures

The image links are now available, and so is the proxy IP, so we can download the images. Define a get_down_img(imgurl_list) function that takes the list of image links, traverses it, switches to a fresh proxy IP for every download, and saves all the pictures to the specified folder:

def get_down_img(imgurl_list):
    # create a folder in the current directory to store the images
    os.mkdir("小姐姐")
    # image counter used for file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip to fetch a proxy IP
        proxies = get_ip()
        # use a different proxy IP for every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1
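
Note that the time module is imported with a comment about sleeping to avoid crawling too fast, but no delay is actually added in the download loop, and os.mkdir will raise an error if the folder already exists. Both are easy to address; the fragment below is an optional sketch, not part of the original code:

import os
import time

# create the folder only if it does not already exist, so reruns don't crash
os.makedirs("小姐姐", exist_ok=True)

# inside the download loop, pause briefly between requests to be gentler on the site
time.sleep(1)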

5.5 Call the main function

Here we can set the number of pages to crawl:

if __name__ == '__main__':
    # 1. set the number of pages to crawl
    page_number = 63
    imgurl_list = []  # stores all the image links
    # 2. build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'
        # 3. collect the image links
        get_imgurl_list(url, imgurl_list)
    # 4. download the images
    get_down_img(imgurl_list)

5.6 Complete source code

Note: for the proxy URL below, see section 4.2 and replace it with your own API link:

import requests  # basic Python crawling library
from lxml import etree  # converts the page into an Element object for XPath parsing
import time  # sleep briefly to avoid crawling too fast
import os  # create folders


def get_ip():
    """Fetch one proxy IP from the provider's API and wrap it in a proxies dict."""
    url = "put your own API link here"
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        ip = r.text.strip()
        if '请求过于频繁' in ip:  # the API says "requests too frequent"
            print('IP request too frequent, retrying...')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': ip
    }
    return proxies


def get_imgurl_list(url, imgurl_list):
    """Collect the image links on one page."""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab all the li tags
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@src")[0]
        # build the full URL
        imgurl = 'https://pic.netbian.com' + imgurl
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)


def get_down_img(imgurl_list):
    # create a folder in the current directory to store the images
    os.mkdir("小姐姐")
    # image counter used for file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip to fetch a proxy IP
        proxies = get_ip()
        # use a different proxy IP for every request to avoid being banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1


if __name__ == '__main__':
    # 1. set the number of pages to crawl
    page_number = 50
    imgurl_list = []  # stores all the image links
    # 2. build the link for each page in a loop
    for i in range(1, page_number + 1):
        # the first page is fixed; later pages append the page number
        if i == 1:
            url = 'https://pic.netbian.com/4kmeinv/index.html'
        else:
            url = f'https://pic.netbian.com/4kmeinv/index_{i}.html'
        # 3. collect the image links
        get_imgurl_list(url, imgurl_list)
    # 4. download the images
    get_down_img(imgurl_list)

Running result:

The downloads succeeded without errors. The quality of the proxy IPs is quite good!!!

5.7 What should I do if the proxy IP is not enough?

What if the 1,000 free proxy IPs per day are not enough? For friends who write crawler code often and have a high demand for proxy IPs, I recommend Juliang IP's unlimited proxy IP package; an IP validity period of 30-60 seconds is plenty: Click to buy

There are 5 proxy pools by default, with a maximum of 50 IPs per extraction and one extraction per second; if you extract 1 IP at a time, you can make up to 50 extractions per second. If 50 proxy IPs at a time is not enough, you can add more IP pools.

I did the math for the default five pools: 50 proxy IPs can be extracted per second, and a day has 86,400 seconds, so 50 x 86,400 = 4,320,000 proxy IPs can be extracted per day. So I, the blogger, decisively got myself an annual package, and it is hard to overstate how comfortable that is:
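
The arithmetic, written out as a quick sketch:

# daily extraction capacity with the default five pools described above
ips_per_second = 50  # 50 proxy IPs extracted per second
seconds_per_day = 24 * 60 * 60  # 86,400 seconds in a day
print(ips_per_second * seconds_per_day)  # 4320000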

6. Summary

Proxy IPs are indispensable for crawlers: they help hide the crawler's real IP address. Friends who need proxy IPs can try Juliang IP: Juliang IP official website

Origin blog.csdn.net/yuan2019035055/article/details/134116662