Quickly increase blog read counts with Python crawler proxy IPs

Preface

Using proxy IPs in a Python crawler can quickly increase your blog read count, because they let you bypass some anti-crawler restrictions. This article explains how to obtain proxy IPs in Python and how to use them to get more blog reads.

1. What is a proxy IP

A proxy IP is an IP address used on the network to hide your real IP address. In crawling, proxy IPs are often used to bypass anti-crawler restrictions, making the crawler harder to identify and ban.

2. Obtaining proxy IPs

There are several ways to obtain proxy IPs. Free public proxies found on the Internet are often already blocked, so it is usually worth purchasing stable proxy IPs.

Here are two recommended proxy IP services:

  1. Zdaye: https://www.zdaye.com
  2. Xdaili: https://www.xdaili.com

These proxy IP service providers offer API interfaces through which we can obtain proxy IPs.
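
The exact endpoint and response format differ by provider, so the snippet below is only a rough sketch: it assumes a hypothetical API_URL copied from your provider's account page that returns one ip:port per line.

import requests

# Hypothetical extraction URL copied from your provider's account page;
# the real URL, parameters, and response format depend on the provider.
API_URL = 'https://provider.example.com/api/get_ips?num=10'

def fetch_proxies_from_api():
    """Return a list of 'ip:port' strings, assuming a plain-text response with one per line."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    return [line.strip() for line in response.text.splitlines() if line.strip()]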

Taking Zdaye as an example, fetch the free proxy list with a GET request:

import requests
from requests.exceptions import RequestException

def get_proxy():
    """Fetch the free proxy list page and return its HTML, or None on failure."""
    try:
        response = requests.get('https://www.zdaye.com/free/')
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

What is returned is the HTML of a web page, so we use a regular expression to extract each IP address and port number:

import re

def parse_proxy(html):
    """Extract 'ip:port' strings from the free proxy list HTML."""
    pattern = re.compile(r'<tr.*?>\s*?<td data-title="IP">(.*?)</td>\s*?<td data-title="PORT">(.*?)</td>.*?</tr>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield item[0] + ':' + item[1]

Here is what each part of the regular expression means:

  • `<tr.*?>`: Matches an opening <tr> tag, including any attributes
  • `\s*?`: Matches zero or more whitespace characters
  • `<td data-title="IP">(.*?)</td>`: Captures the IP address between <td data-title="IP"> and </td>
  • `\s*?`: Matches zero or more whitespace characters
  • `<td data-title="PORT">(.*?)</td>`: Captures the port number between <td data-title="PORT"> and </td>
  • `.*?</tr>`: Matches the rest of the row, non-greedily, up to the closing </tr>
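
As a quick sanity check, here is a minimal sketch that runs parse_proxy on a hand-written HTML fragment; the fragment is made up for illustration, and the real page layout may differ.

# A made-up table row mimicking the structure the regular expression expects
sample_html = '''
<tr class="clearfix">
    <td data-title="IP">127.0.0.1</td>
    <td data-title="PORT">8080</td>
    <td data-title="Anonymity">high</td>
</tr>
'''

for proxy in parse_proxy(sample_html):
    print(proxy)  # prints: 127.0.0.1:8080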

3. Using proxy IPs

With proxy IPs in hand, we can use them to crawl blog pages. Here we take a CSDN blog post as an example.

First, we need to randomly select a proxy IP:

import random

proxy_list = ['123.206.189.74:1080', '118.24.61.212:1080', '118.24.61.213:1080']
PROXY = random.choice(proxy_list)
proxies = {'http': 'http://{proxy}'.format(proxy=PROXY), 'https': 'https://{proxy}'.format(proxy=PROXY)}

Python's random module is used here to pick a proxy IP at random. The proxies parameter is a dictionary whose keys are the protocols and whose values are the proxy URLs.

Then we use the requests library to send the HTTP request with the proxies parameter:

import requests

url = 'https://blog.csdn.net/xxx/article/details/xxx'
response = requests.get(url, proxies=proxies)

Replace the url with the blog address you want to visit. If the proxy IP is unavailable, the requests library raises a ProxyError exception; we can catch it and pick another proxy IP:

from requests.exceptions import ProxyError

while True:
    try:
        response = requests.get(url, proxies=proxies)
        break
    except ProxyError:
        PROXY = random.choice(proxy_list)
        proxies = {'http': 'http://{proxy}'.format(proxy=PROXY), 'https': 'https://{proxy}'.format(proxy=PROXY)}

A while loop is used here to keep retrying with a new proxy until the request succeeds.
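
Note that this loop never exits if every proxy in the list is dead. A safer variant, sketched below, caps the number of attempts and adds a timeout so a slow or unresponsive proxy does not hang the request; the cap and timeout values are arbitrary choices, not from the original article.

import random
import requests
from requests.exceptions import RequestException

MAX_ATTEMPTS = 5  # give up after a few bad proxies instead of looping forever

response = None
for _ in range(MAX_ATTEMPTS):
    PROXY = random.choice(proxy_list)
    proxies = {'http': 'http://{proxy}'.format(proxy=PROXY), 'https': 'https://{proxy}'.format(proxy=PROXY)}
    try:
        # timeout prevents hanging on a proxy that accepts the connection but never answers
        response = requests.get(url, proxies=proxies, timeout=10)
        break
    except RequestException:
        continue  # covers ProxyError, connection errors, and timeouts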

4. Complete code

The following is the complete code, including obtaining proxy IPs, randomly selecting a proxy IP, visiting a blog page, and retrying on failure. You can modify it to suit your needs.

import requests
import re
import random
from requests.exceptions import ProxyError, RequestException

PROXY_LIST = ['123.206.189.74:1080', '118.24.61.212:1080', '118.24.61.213:1080']

def get_proxy():
    try:
        response = requests.get('https://www.zdaye.com/free/')
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_proxy(html):
    pattern = re.compile(r'<tr.*?>\s*?<td data-title="IP">(.*?)</td>\s*?<td data-title="PORT">(.*?)</td>.*?</tr>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield item[0] + ':' + item[1]

def get_random_proxy():
    """Randomly pick a proxy from PROXY_LIST and build a proxies dict for requests."""
    PROXY = random.choice(PROXY_LIST)
    proxies = {'http': 'http://{proxy}'.format(proxy=PROXY), 'https': 'https://{proxy}'.format(proxy=PROXY)}
    return proxies

def retry_get(url, retry_times=3):
    """Request url through a random proxy, retrying up to retry_times on proxy failure."""
    while retry_times > 0:
        try:
            proxies = get_random_proxy()
            response = requests.get(url, proxies=proxies)
            if response.status_code == 200:
                return response.text
        except ProxyError:
            pass
        retry_times -= 1
    return None

if __name__ == '__main__':
    url = 'https://blog.csdn.net/xxx/article/details/xxx'
    html = retry_get(url)
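
In the code above, get_proxy and parse_proxy are defined but never called: PROXY_LIST is hard-coded. As a sketch, you could refresh the list from the free page before crawling, assuming the page layout still matches the regular expression in parse_proxy:

def refresh_proxy_list():
    """Replace the hard-coded PROXY_LIST with proxies scraped from the free page."""
    global PROXY_LIST
    html = get_proxy()
    if html:
        fresh = list(parse_proxy(html))
        if fresh:
            PROXY_LIST = fresh

# Call refresh_proxy_list() before retry_get(url) to crawl with freshly scraped proxies.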

5. Precautions

Although using proxy IPs can bypass anti-crawler restrictions to a certain extent, heavy use will be recognized by the website as malicious access and the IP will be banned. Therefore, pay attention to the following points when using proxy IPs:

  • Choose a stable proxy IP service provider to avoid frequent proxy IP changes.
  • Randomly select a proxy IP for each request instead of always using the same one (see the sketch after this list).
  • Do not overuse proxy IP. It is recommended not to use proxy IP for more than 30% of the visits.
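
For the last two points, a minimal sketch is to rotate the proxy on every request and pause for a random interval between requests; the delay range below is an arbitrary illustration, not a recommendation from this article.

import time
import random

urls = ['https://blog.csdn.net/xxx/article/details/xxx']  # pages to visit

for url in urls:
    html = retry_get(url)              # retry_get already picks a random proxy per attempt
    time.sleep(random.uniform(3, 10))  # random pause so the traffic pattern looks less mechanical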

6. Summary

This article introduced how to use proxy IPs in a Python crawler to quickly increase blog read counts. Obtaining proxy IPs, randomly selecting one, visiting blog pages, and retrying on failure can all be implemented in Python. When using proxy IPs, pay attention to their stability and usage rate to avoid getting the IP banned.
