Python crawlers: a detailed beginner's tutorial on using proxy IPs

A proxy IP gives a Python crawler extra network identities to work with. Routing requests through multiple IP addresses lets a crawler fetch data faster and avoid being blocked by a website for sending too many requests from a single address. This article explains how to use Python to collect proxy IPs and then use them for crawling.

1. Obtaining proxy IPs

First we need to find a source of usable proxy IPs. Here we take the Zdaye proxy site (zdaye.com) as an example: it offers both paid proxies and ordinary free proxy IPs and is convenient to use.

The address of the site's free proxy list page: `https://www.zdaye.com/free/inha/1/`

By requesting this page we get one page of proxy IP information, including IP addresses and port numbers. We can fetch it with the `get` method of the requests library. The sample code is as follows:

import requests

url = 'https://www.zdaye.com/free/inha/1/'
response = requests.get(url)
print(response.text)

Running this code prints the page containing the proxy IP information. Next we need to parse the response and keep only the useful parts, the IP addresses and ports:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zdaye.com/free/inha/1/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

proxies = []
# Skip the header row, then take the IP and port from the first two columns of each row
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    proxy = tds[0].text + ':' + tds[1].text
    proxies.append(proxy)

print(proxies)

In the code above, we use the BeautifulSoup library to parse the returned HTML, find all `<tr>` tags (skipping the header row), extract the IP address and port from each row, and save them to a list.
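
The URL above only covers the first page of the free list. As a small optional extension (this assumes the trailing number in `/free/inha/1/` is a page index, which is an assumption about the site's URL scheme), the same parsing can be looped over several pages:

import requests
from bs4 import BeautifulSoup

proxies = []
# Assumption: /free/inha/<n>/ is page n of the free proxy list
for page in range(1, 4):
    url = 'https://www.zdaye.com/free/inha/{}/'.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Skip the header row, take the IP and port from the first two columns
    for tr in soup.find_all('tr')[1:]:
        tds = tr.find_all('td')
        proxies.append(tds[0].text + ':' + tds[1].text)

print(len(proxies), 'proxies collected')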

2. Verifying proxy IPs

After collecting proxy IPs, we need to check which of them actually work. Here we test each one with the `get` method of the requests library, passing the proxy through the `proxies` parameter of `requests.get`; if the request returns status code 200, the proxy is considered usable. The sample code is as follows:

import requests

url = 'http://www.baidu.com'

proxies = {
    'http': 'http://222.74.237.246:808',
    'https': 'https://222.74.237.246:808',
}
try:
    response = requests.get(url, proxies=proxies, timeout=10)
    if response.status_code == 200:
        print('Proxy IP is usable:', proxies)
except requests.RequestException:
    print('Proxy IP is not usable:', proxies)

In the code above, we send a request to `http://www.baidu.com` through the proxy IP. If the returned HTTP status code is 200, the proxy is usable; if the request fails or times out, it is not.

To verify every proxy IP we collected, we wrap the same check in a loop, for example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zdaye.com/free/inha/1/'
test_url = 'http://www.baidu.com'  # target used to check whether a proxy works

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

proxies = []
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    proxy = tds[0].text + ':' + tds[1].text
    proxies.append(proxy)

for proxy in proxies:
    proxies_dict = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    try:
        response = requests.get(test_url, proxies=proxies_dict, timeout=10)
        if response.status_code == 200:
            print('Proxy IP is usable:', proxies_dict)
    except requests.RequestException:
        print('Proxy IP is not usable:', proxies_dict)

In the loop above, we go through all the proxy IPs and verify each one: if a proxy works, we print it as usable, otherwise we print that it is unavailable.
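
Checking proxies one at a time with a 10-second timeout can take a long time when the list is large. As an optional sketch (not part of the original tutorial), the same check can be run in parallel with the standard library's concurrent.futures, assuming `proxies` is the list of 'ip:port' strings collected above:

import requests
from concurrent.futures import ThreadPoolExecutor

def check(proxy):
    # Build the proxies dict for this candidate and test it against Baidu
    proxies_dict = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    try:
        r = requests.get('http://www.baidu.com', proxies=proxies_dict, timeout=10)
        return proxy if r.status_code == 200 else None
    except requests.RequestException:
        return None

# 'proxies' is the list of 'ip:port' strings from the previous step
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(check, proxies))

valid = [p for p in results if p is not None]
print('Usable proxy IPs:', valid)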

3. Testing proxy IPs

After filtering out the working proxy IPs, it is worth testing them further before crawling to make sure they return real page content rather than an error or interception page. Common search engines such as Baidu or 360 Search are convenient targets for this. Here we use Baidu as an example.

import requests

url = 'http://www.baidu.com'

proxies = {
    'http': 'http://222.74.237.246:808',
    'https': 'https://222.74.237.246:808',
}
try:
    response = requests.get(url, proxies=proxies, timeout=10)
    if response.status_code == 200:
        if '百度一下' in response.text:
            print('Proxy IP is usable:', proxies)
        else:
            print('Proxy IP is not usable:', proxies)
    else:
        print('Proxy IP is not usable:', proxies)
except requests.RequestException:
    print('Proxy IP is not usable:', proxies)

In the code above, we send a request to Baidu through the proxy and decide whether it is truly usable by checking that the returned HTML contains the keyword '百度一下' (the text on Baidu's search button); if the keyword is missing, the proxy most likely returned an error or interception page.
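
Besides looking for a known keyword, another sanity check (not from the original article) is to request a service that echoes the caller's IP, such as httpbin.org/ip, and confirm that the reported address is the proxy's rather than your own:

import requests

proxy_ip = '222.74.237.246'
proxies = {
    'http': 'http://' + proxy_ip + ':808',
    'https': 'https://' + proxy_ip + ':808',
}
try:
    # httpbin.org/ip returns JSON such as {"origin": "x.x.x.x"}
    response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
    if proxy_ip in response.json().get('origin', ''):
        print('Request went out through the proxy:', proxy_ip)
    else:
        print('Proxy did not take effect, reported origin:', response.json())
except requests.RequestException:
    print('Proxy IP is not usable:', proxies)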

4. Using proxy IPs

Once we have working proxy IPs, we can crawl with them. To crawl through a proxy, we pass it to the `requests.get` method via the `proxies` parameter. The sample code is as follows:

import requests

url = 'http://www.baidu.com'

proxies = {
    'http': 'http://222.74.201.49:9999',
    'https': 'https://222.74.201.49:9999',
}
response = requests.get(url, proxies=proxies)
print(response.text)

In the code above, we access Baidu through the proxy IP by passing it to `requests.get` as the `proxies` parameter. If the proxy is reachable, the request goes out through it instead of our own IP.
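
In a real crawl there is usually a list of verified proxies rather than a single one. A minimal sketch of rotating through them (assuming `valid_proxies` holds 'ip:port' strings that already passed the checks above) picks one at random per request so that no single address carries all the traffic:

import random
import requests

# Assumption: these strings came out of the verification step earlier
valid_proxies = ['222.74.201.49:9999', '222.74.237.246:808']

def fetch(url):
    proxy = random.choice(valid_proxies)
    proxies = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    try:
        return requests.get(url, proxies=proxies, timeout=10).text
    except requests.RequestException:
        # Fall back to a direct request if the chosen proxy fails
        return requests.get(url, timeout=10).text

print(fetch('http://www.baidu.com')[:200])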

5. Complete code

The following is complete reference code covering the acquisition, verification, testing and use of proxy IPs:

import requests
from bs4 import BeautifulSoup

# 1. Fetch the proxy IP list
def get_proxy_list():
    # Request headers that mimic a normal browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }

    # Request the proxy list page
    url = "http://www.zdaye.com/"
    response = requests.get(url, headers=headers)

    # Parse the page and collect the proxy entries
    # Note: the table id and column positions depend on the page's current markup and may need adjusting
    soup = BeautifulSoup(response.text, "html.parser")
    proxy_list = []
    table = soup.find("table", {"id": "ip_list"})
    if table is None:
        return proxy_list
    for tr in table.find_all("tr"):
        td_list = tr.find_all("td")
        if len(td_list) > 0:
            ip = td_list[1].text.strip()
            port = td_list[2].text.strip()
            proxy_type = td_list[5].text.strip()
            proxy_list.append({
                "ip": ip,
                "port": port,
                "type": proxy_type
            })
    return proxy_list

# 2. Check whether a single proxy IP works
def verify_proxy(proxy):
    # Request headers that mimic a normal browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }

    # Build the proxies dict expected by requests from the ip/port entry
    proxies = {
        "http": "http://" + proxy["ip"] + ":" + proxy["port"],
        "https": "https://" + proxy["ip"] + ":" + proxy["port"]
    }

    # Request the target page and check the status code
    url = "http://www.baidu.com"
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

# 3. Filter the proxy list down to the working entries
def test_proxy_list(proxy_list):
    valid_proxy_list = []
    for proxy in proxy_list:
        if verify_proxy(proxy):
            valid_proxy_list.append(proxy)
    return valid_proxy_list

# 4. Send a request through a proxy IP
def send_request(url, headers, proxy):
    # Send the request and return the response body
    response = requests.get(url, headers=headers, proxies=proxy)
    return response.text

# Entry point
if __name__ == "__main__":
    # Fetch the proxy IP list
    proxy_list = get_proxy_list()

    # Keep only the working proxies
    valid_proxy_list = test_proxy_list(proxy_list)

    # Print the usable proxies
    print("Valid proxy IPs:")
    for proxy in valid_proxy_list:
        print(proxy)

    # Send a request through the first usable proxy
    if not valid_proxy_list:
        print("No usable proxy IP found")
    else:
        url = "http://www.baidu.com"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
        }
        proxy = {
            "http": "http://" + valid_proxy_list[0]["ip"] + ":" + valid_proxy_list[0]["port"],
            "https": "https://" + valid_proxy_list[0]["ip"] + ":" + valid_proxy_list[0]["port"]
        }
        response = send_request(url, headers, proxy)
        print(response)

In the code above, we first obtain the proxy IP list by crawling the proxy site's list page. Then we verify each proxy IP, keep the working ones in a list, and finally pick one of the working proxies and send a request through it.
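
Free proxies expire quickly, so re-crawling and re-validating the list on every run wastes time. One optional extension (not part of the original code; the cache file name and lifetime below are arbitrary, hypothetical choices) is to cache the validated list in a small JSON file and reuse it while it is still fresh:

import json
import os
import time

CACHE_FILE = 'valid_proxies.json'   # hypothetical cache file name
CACHE_TTL = 600                     # seconds before the cache is considered stale

def load_or_build_proxy_list():
    # Reuse the cached list if it is recent enough
    if os.path.exists(CACHE_FILE) and time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
        with open(CACHE_FILE, 'r', encoding='utf-8') as f:
            return json.load(f)
    # Otherwise rebuild it with get_proxy_list() and test_proxy_list() defined above
    valid = test_proxy_list(get_proxy_list())
    with open(CACHE_FILE, 'w', encoding='utf-8') as f:
        json.dump(valid, f)
    return valid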

6. Summary

This article introduced the basic idea of proxy IPs, how to obtain free proxy IPs, and how to verify and use them in Python with sample code, along with some points to watch out for when relying on free proxies. Hopefully it is helpful to crawler developers.

Origin blog.csdn.net/wq10_12/article/details/132667632