Python crawler practice: crawl images from a certain search engine by keyword and download them locally in batches

This article introduces how to use a Python crawler to crawl images from a certain search engine by keyword and download them locally in batches, and how to add proxy IPs to bypass anti-crawling measures and improve the usability and stability of the program. It contains the code implementation with detailed explanations and is suitable for beginners.

Table of contents

Preface

Preparation

Requests library

BeautifulSoup library

Proxy IP

Implementation steps

1. Send a request to get HTML text

2. Parse HTML text to obtain image URL

3. Create folders and download images

4. Add proxy IP

Complete code

Summary


Preface

With the development of the Internet, we can easily find all kinds of pictures through search engines, such as travel and scenery photos. Sometimes, though, we need to download these images in batches, and downloading them one by one by hand is far too tedious, so we use a crawler to do it.

In real crawler development we also run into anti-crawling measures such as IP restrictions and request-frequency limits. To get around them, we can use proxy IPs to hide our real IP address and reduce the risk of being banned.

Therefore, in this article we will use a Python crawler to crawl images from a certain search engine by keyword and download them locally in batches, adding proxy IPs to bypass anti-crawling measures.

1. Preparation work

Before we start writing code, we need to understand some necessary knowledge and tools.

Requests library

Requests is a third-party Python library that provides a simple, intuitive API for sending HTTP/1.1 requests from Python. It is built on top of urllib3 and is far more convenient than the standard library's urllib, supporting all the common HTTP methods such as GET, POST, PUT, DELETE, HEAD, and OPTIONS. Requests also provides a handy Session class that keeps cookies and other state across requests and makes advanced features such as proxies easier to use.

We can install the Requests library using the following command:

pip install requests
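
To get a feel for the API, here is a minimal sketch of a GET request; httpbin.org is a public echo service used here purely for illustration and is not part of the original article:

import requests

# Send a GET request with query parameters and a timeout
response = requests.get("https://httpbin.org/get", params={"word": "scenery"}, timeout=10)
print(response.status_code)   # e.g. 200
print(response.json())        # the echoed request, as a Python dict

# A Session keeps cookies, headers and proxy settings across requests
with requests.Session() as session:
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    page = session.get("https://httpbin.org/get", timeout=10)
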
BeautifulSoup library

BeautifulSoup is a third-party Python library for extracting data from HTML or XML files. It automatically parses even messy HTML into a tree structure and provides built-in traversal and search methods, which greatly simplifies parsing HTML. With BeautifulSoup you can easily pick out specific tags or attributes from a page and process them as needed.

We can install the BeautifulSoup library using the following command:

pip install beautifulsoup4
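
As a small self-contained sketch (the HTML snippet below is invented for illustration), this is how BeautifulSoup finds tags and reads their attributes:

from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment with two img tags
html = "<div><img data-src='https://example.com/a.jpg'><img src='https://example.com/b.jpg'></div>"
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    # get() returns None when the attribute is absent
    print(img.get("data-src"), img.get("src"))
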
Proxy IP

A proxy IP is a kind of relay server: our requests are sent through the proxy server, which hides our real IP address. Using proxy IPs can bypass some anti-crawling measures, keep our own IP from being blocked, and improve the usability of the program.

We can obtain proxy IPs, both HTTP and HTTPS, from free proxy websites on the Internet. However, pay attention to the availability and stability of each proxy IP to avoid unnecessary trouble.
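
Because free proxies are often unreliable, it can help to test one before using it in the crawler. A minimal sketch, assuming a placeholder proxy address and using httpbin.org/ip (a public service that simply echoes back the IP it sees) only for illustration:

import requests

# placeholder address - substitute a real proxy before running
proxy = {"http": "http://ip_address:port", "https": "http://ip_address:port"}

try:
    r = requests.get("https://httpbin.org/ip", proxies=proxy, timeout=5)
    print("proxy works, exit IP seen by the server:", r.json())
except requests.RequestException as e:
    print("proxy failed:", e)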

2. Implementation steps

1. Send a request to get HTML text

We first need to send a request to obtain the HTML text. Taking an image search page as an example, we use the get method of the Requests library to send the request and save the returned content in the content variable.

import requests

# Search page URL; the word parameter is the search keyword (美景 = "beautiful scenery")
url = "https://image.baidu.com/search/index?tn=baiduimage&word=美景"
response = requests.get(url)
content = response.content
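
In practice the search page may check the User-Agent header and refuse requests that do not look like a browser. Here is a sketch of the same request with a browser-style header and a basic status check; the header string is only an example and is not taken from the original article:

import requests

url = "https://image.baidu.com/search/index?tn=baiduimage&word=美景"
headers = {"User-Agent": "Mozilla/5.0"}       # example browser-style header
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()                   # raise an error on 4xx/5xx responses
content = response.content
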
2. Parse HTML text to obtain image URL

Next, we use the BeautifulSoup library to parse the HTML text, find all img tags, and extract the image URLs from them. Here we extract only the URLs stored in the data-src attribute and collect them in a list.

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
img_tags = soup.find_all('img')

img_urls = []
for tag in img_tags:
    img_url = tag.get('data-src')
    if img_url:
        img_urls.append(img_url)

Note that the image URL may live in the data-src attribute rather than the src attribute, which is why we check whether data-src exists. Some images may not have a data-src attribute at all, so adjust the code to the actual page; one possible fallback is sketched below.
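
A minimal sketch of that fallback, assuming we simply prefer data-src and use src when it is missing:

img_urls = []
for tag in img_tags:
    # prefer data-src, fall back to src when data-src is absent
    img_url = tag.get('data-src') or tag.get('src')
    if img_url and img_url.startswith('http'):
        img_urls.append(img_url)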

3. Create folders and download images

After collecting all the image URLs, we download them locally in batches. Here we create a folder named images and save the downloaded pictures into it.

import os

save_dir = "./images"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

Next, we use the get method of the Requests library to request each image URL and save the response body as a binary file. The downloaded files are named 0.jpg, 1.jpg, 2.jpg, ... and written to the images folder in order.

for i, img_url in enumerate(img_urls):
    response = requests.get(img_url)

    with open(os.path.join(save_dir, f"{i}.jpg"), "wb") as f:
        f.write(response.content)

Note that using the image URL directly as the file name may cause the save to fail, because URLs can contain special characters. That is why we name the files with sequential numbers instead; if you want to keep the original extensions, see the sketch below.
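
A small sketch of deriving the extension from the URL path, defaulting to .jpg when none is found (the URLs here are made up for illustration):

import os
from urllib.parse import urlparse

def numbered_filename(index, img_url):
    # keep the extension from the URL path, default to .jpg
    ext = os.path.splitext(urlparse(img_url).path)[1] or ".jpg"
    return f"{index}{ext}"

print(numbered_filename(0, "https://example.com/images/photo.png"))  # 0.png
print(numbered_filename(1, "https://example.com/images/photo"))      # 1.jpg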

4. Add proxy IP

Before using proxy IPs, we need to obtain some available proxy addresses. Here we use a list named proxies to hold several proxy IP addresses together with their port numbers.

proxies = [
    "http://ip_address1:port",
    "http://ip_address2:port",
    "http://ip_address3:port",
    ...
]

Next, we use the proxies parameter of the Requests library to set a proxy for the request. We pick a random proxy from the list for each request to reduce the risk of being banned.

import random

# The proxies list entries already include the http:// prefix, so pick one as-is
proxy = {
    "http": random.choice(proxies)
}
response = requests.get(img_url, proxies=proxy)

Note that each proxy must be written as http://ip_address:port or https://ip_address:port. Here we use proxies with the http protocol; to use an https proxy, simply replace http with https, as shown below.
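
For reference, the proxies argument in Requests is a dictionary keyed by URL scheme, so both keys can be set at once. A minimal sketch with placeholder addresses:

proxy = {
    "http": "http://ip_address:port",    # used for http:// URLs
    "https": "https://ip_address:port",  # used for https:// URLs
}
# response = requests.get(img_url, proxies=proxy)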

3. Complete code

The following is the complete code implementation, including the use of proxy IP:

import requests
import os
import random

url = "https://image.baidu.com/search/index?tn=baiduimage&word=美景"

proxies = [
    "http://ip_address1:port",
    "http://ip_address2:port",
    "http://ip_address3:port",
    ...
]

response = requests.get(url)
content = response.content

# Parse the HTML with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
img_tags = soup.find_all('img')

img_urls = []
for tag in img_tags:
    img_url = tag.get('data-src')
    if img_url:
        img_urls.append(img_url)

# Create the folder that will hold the downloaded images
save_dir = "./images"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

# Request each image through a randomly chosen proxy IP
for i, img_url in enumerate(img_urls):
    proxy = {
        "http": random.choice(proxies)  # entries already include the http:// prefix
    }
    response = requests.get(img_url, proxies=proxy)

    with open(os.path.join(save_dir, f"{i}.jpg"), "wb") as f:
        f.write(response.content)
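
As a possible refinement, not part of the original code, the download loop can skip images that fail instead of stopping the whole run. A sketch of the loop body, reusing the names (proxies, img_urls, save_dir) and imports from the complete code above:

for i, img_url in enumerate(img_urls):
    proxy = {"http": random.choice(proxies)}
    try:
        response = requests.get(img_url, proxies=proxy, timeout=10)
        response.raise_for_status()          # treat 4xx/5xx responses as failures
    except requests.RequestException as e:
        print(f"skipping {img_url}: {e}")    # log the failure and move on
        continue

    with open(os.path.join(save_dir, f"{i}.jpg"), "wb") as f:
        f.write(response.content)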

4. Summary

This article has shown how to use a Python crawler to crawl images from a certain search engine by keyword and download them locally in batches, and how to add proxy IPs to bypass anti-crawling measures. Note that the availability and stability of the proxy IPs have a large impact on how well the program runs, so choose and test them carefully to improve the usability and stability of the program.

Origin blog.csdn.net/wq10_12/article/details/133271031