Python crawler practice (7): Use proxy IPs to batch-download 4K high-definition girl pictures (complete source code attached)

1. Crawling target

The target this time is another set of 4K high-definition girl pictures from a certain website:


2. The end result

Batch-download the pictures and store them in a specified folder:


3. Preparation

Python: 3.10

Editor: PyCharm

Third-party modules (install them yourself):

pip install requests # fetch web page data
pip install lxml # extract data from the page
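As an optional sanity check (just a small sketch), both modules should import without errors:

# optional check that the third-party modules installed correctly
import requests
from lxml import etree

print(requests.__version__)  # prints the installed requests version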

4. Obtain a free proxy IP

4.1 What are the benefits of using a proxy?

The benefits of crawlers using proxy IPs are as follows:

  • Rotate IP addresses: rotating through proxy IPs reduces the risk of being banned and keeps crawling continuous and stable.
  • Improve collection speed: a proxy pool provides multiple IP addresses, letting the crawler run multiple threads at the same time and collect data faster.
  • Bypass anti-crawler mechanisms: many websites use IP bans, CAPTCHAs, request-frequency limits, and similar defenses. Proxy IPs help the crawler get around them and keep collecting data normally.
  • Protect personal privacy: a proxy IP hides your real IP address, protecting your identity and private information.

The blogger recently found a pretty good proxy service: the high-anonymity proxy IPs from Yipin HTTP. You get 1 GB of traffic for free after registration, which is quite generous: click to try it for free.

4.2 Get a free proxy

1. Open the Yipin HTTP official website: click Free Trial

2. All proxy IPs require real-name authentication before they can be used. If you are not sure how, see the real-name tutorial: Real-name Tutorial

3. Select Whitelist > Add whitelist.

4. Click API Extraction > select direct-connection extraction.

5. Select traffic extraction and click to generate the API link (we have 1 GB of free traffic to use here).

If you still cannot extract a free proxy IP, ask customer service in the lower-left corner.

4.3 Obtain the proxy in code

After obtaining an image link, we still need to send another request to download the image. Since the request volume is fairly large, we use a proxy IP. We generated the API link manually above; now let's see how to attach the proxy IP to a request in Python:

1. Fetch a proxy IP from the API endpoint with a small script (note: replace the proxy URL below with your own API link from section 4.2):

import requests
import time


def get_ip():
    url = "put your own API link here"  # replace with the API link generated in section 4.2
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        ip = r.text.strip()
        # the API returns this message when it is polled too often
        if '请求过于频繁' in ip:
            print('IP requested too frequently')
            time.sleep(1)
            continue
        break
    # requests expects a dict that maps the scheme to the proxy address
    proxies = {
        'https': ip
    }
    return proxies


if __name__ == '__main__':
    proxies = get_ip()
    print(proxies)
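To confirm the proxy actually works before crawling, here is a minimal sketch that sends one request through it; https://httpbin.org/ip is just an echo service used for illustration, any URL would do:

# quick check: send one request through the proxy returned by get_ip()
# and print the IP address the target server sees
proxies = get_ip()
resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=10)
print(resp.text)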

5. Putting the proxy into practice

5.1 Import modules

import requests  # basic Python crawling library
from lxml import etree  # converts an HTML page into an Element object for XPath parsing
import time  # sleep between requests so we do not crawl too fast
import os  # create the output folder

5.2 Set up pagination

First, let's analyze how the site paginates. There are 10 pages in total.

First page link:

https://www.moyublog.com/95-2-2-0.html

Second page link:

https://www.moyublog.com/95-2-2-1.html

Third page link:

https://www.moyublog.com/95-2-2-2.html

As you can see, only the number after 95-2-2- changes: it is 0 for the first page, 1 for the second, and so on. So we can build every page link with a loop:

if __name__ == '__main__':
    # number of pages to crawl
    page_number = 10
    # build each page's link in a loop
    for i in range(page_number):
        # the index in the URL starts at 0 for the first page
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
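Equivalently, the same URL pattern can be collected into a list up front; this is just an alternative sketch of the loop above:

# build all ten page links in one go
page_urls = [f'https://www.moyublog.com/95-2-2-{i}.html' for i in range(10)]
print(page_urls[0])   # https://www.moyublog.com/95-2-2-0.html
print(page_urls[-1])  # https://www.moyublog.com/95-2-2-9.html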

5.3 Get the image links

Looking at the page source, all the image URLs sit under the ul tag > li tag > a tag > img tag.

We create a get_imgurl_list(url, imgurl_list) function that takes a page link, fetches the page source, and uses XPath to locate each image link:

def get_imgurl_list(url, imgurl_list):
    """Get the image links on one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab every li tag
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@data-original")[0]
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)

Running it prints 20 followed by the 20 image links on the page.

Open one of the printed links in a browser to confirm the image loads correctly.
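If you prefer to verify from code instead of in the browser, a small sketch that checks the HTTP status of the first collected link (assuming imgurl_list has already been filled by get_imgurl_list) could look like this:

import requests

# check that the first collected image link responds with HTTP 200
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
test_url = imgurl_list[0]  # first link gathered by get_imgurl_list
resp = requests.get(test_url, headers=headers, timeout=10)
print(test_url, resp.status_code)  # expect 200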

5.4 Download pictures

Now we have both the image links and the proxy IP, so we can download the images. Define a get_down_img(imgurl_list) function that takes the list of image links, traverses it, switches to a fresh proxy for every picture, and saves all pictures to the specified folder:

def get_down_img(imgurl_list):
    # create a folder for the images under the current path
    os.makedirs("小姐姐", exist_ok=True)
    # image counter used for the file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip to fetch a fresh proxy IP
        proxies = get_ip()
        # switch proxy IP on every request to avoid getting banned
        img_data = requests.get(url=img_url, headers=headers, proxies=proxies).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1
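Free proxies fail fairly often, so in practice it can help to retry a failed download with a fresh proxy instead of letting one bad IP crash the run. A minimal sketch of such a helper (the download_one name and max_tries value are my own additions, not part of the original code):

def download_one(img_url, img_path, headers, max_tries=3):
    """Try to download one image, switching to a new proxy on every failed attempt."""
    for _ in range(max_tries):
        try:
            proxies = get_ip()  # fresh proxy for each attempt
            img_data = requests.get(img_url, headers=headers,
                                    proxies=proxies, timeout=15).content
            with open(img_path, 'wb') as f:
                f.write(img_data)
            return True
        except requests.RequestException:
            continue  # bad proxy or timeout, try again with another IP
    return False

Inside get_down_img, the requests.get call could then be replaced with a call to download_one.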

5.5 Call the main function

Here we set how many pages to crawl:

if __name__ == '__main__':
    page_number = 10  # number of pages to crawl
    imgurl_list = []  # holds the image links
    # 1. build each page's link in a loop
    for i in range(page_number):
        # the index in the URL starts at 0 for the first page
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
        print(url)
        # 2. collect the image links
        get_imgurl_list(url, imgurl_list)
    # 3. download the images
    get_down_img(imgurl_list)

5.6 Complete source code

Note: for the proxy URL below, follow section 4.2 to extract the 1 GB of free proxy traffic and replace it with your own API link:

import requests  # basic Python crawling library
from lxml import etree  # converts an HTML page into an Element object for XPath parsing
import time  # sleep between requests so we do not crawl too fast
import os  # create the output folder


def get_ip():
    url = "put your own API link here"  # replace with the API link generated in section 4.2
    while True:
        try:
            r = requests.get(url, timeout=10)
        except requests.RequestException:
            continue

        ip = r.text.strip()
        # the API returns this message when it is polled too often
        if '请求过于频繁' in ip:
            print('IP requested too frequently')
            time.sleep(1)
            continue
        break
    proxies = {
        'https': ip
    }
    return proxies


def get_imgurl_list(url, imgurl_list):
    """Get the image links on one page"""
    # request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
    # send the request
    response = requests.get(url=url, headers=headers)
    # get the page source
    html_str = response.text
    # convert the HTML string into an etree object so we can parse it with XPath
    html_data = etree.HTML(html_str)
    # use XPath to grab every li tag
    li_list = html_data.xpath("//ul[@class='clearfix']/li")
    # print the number of li tags to check it matches the number of images per page
    print(len(li_list))  # prints 20, as expected
    for li in li_list:
        imgurl = li.xpath(".//a/img/@data-original")[0]
        print(imgurl)
        # append to the list
        imgurl_list.append(imgurl)


def get_down_img(imgurl_list):
    # create a folder for the images under the current path
    os.makedirs("小姐姐", exist_ok=True)
    # image counter used for the file names
    n = 0
    for img_url in imgurl_list:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'}
        # call get_ip to fetch a fresh proxy IP
        proxies = get_ip()
        # switch proxy IP on every request to avoid getting banned
        img_data = requests.get(url=img_url, proxies=proxies, headers=headers).content
        # build the file path and name
        img_path = './小姐姐/' + str(n) + '.jpg'
        # write the image to the target location
        with open(img_path, 'wb') as f:
            f.write(img_data)
        # increment the image counter
        n = n + 1


if __name__ == '__main__':
    page_number = 10  # number of pages to crawl
    imgurl_list = []  # holds the image links
    # 1. build each page's link in a loop
    for i in range(page_number):
        url = f'https://www.moyublog.com/95-2-2-{i}.html'
        print(url)
        # 2. collect the image links
        get_imgurl_list(url, imgurl_list)
    # 3. download the images
    get_down_img(imgurl_list)

Run it: the download completes without errors, and the quality of the proxy IPs holds up well!

6. Summary

Proxy IPs are indispensable for crawlers: they help hide your real IP address. If you need proxy IPs, you can try Yipin HTTP: click for a free trial.

Book recommendations

Customer retention data analysis and prediction (data science and big data technology)

For any business that relies on recurring revenue and repeat sales, keeping customers active and buying is essential. Customer churn is a costly and frustrating event, and one that can be prevented. Using the techniques described in this book, you can spot the early warning signs of churn and learn to identify and retain customers before they leave.

Customer Retention Data Analysis and Prediction teaches developers and data scientists proven techniques and methods to stop customer churn before it happens. This book contains many real-world examples of how to transform raw data into measurable behavioral indicators, calculate customer lifetime value, and use demographic data to improve customer churn predictions. By following Zuora's Chief Data Scientist Carl Gold's approach, you'll reap the benefits of high customer retention.

Main content:

● Calculate churn indicators

● Predict customer churn through customer behavior

● Use customer segmentation strategies to reduce customer churn

● Apply churn analysis technology to other business areas

● Use artificial intelligence technology for accurate customer churn prediction

JD.com link: https://item.jd.com/13999686.html

Origin blog.csdn.net/yuan2019035055/article/details/135049461