[Python Crawler Notes] Crawler Proxy IP and Access Control

1. Introduction

In web crawler development, many factors can prevent a crawler program from running normally, the most important being anti-crawler mechanisms. To stop crawlers from sending a large number of requests to the same website in a short period of time, website administrators apply various restrictions. Proxy IPs are one way to work around these restrictions.

This article introduces how to use proxy IPs in a crawler program to deal with anti-crawler mechanisms, and how to apply access control so that the program runs reliably.

2. What is a proxy IP?

A proxy IP is the IP address of a proxy server. In a crawler program, we can route requests through a proxy IP to hide the real IP address while still reaching the target website. Using a proxy IP can solve the following problems:

  1. Break through access restrictions: some websites restrict access from certain regions, and a proxy IP can bypass these restrictions.
  2. Bypass anti-crawler mechanisms: some websites judge whether traffic is crawler behavior based on how often the same IP makes requests; a proxy IP hides the real IP address, which reduces the risk of being detected or banned (see the sketch after this list).
  3. Improve access speed: requests routed through a proxy may be served from the proxy server's cache, which can speed up repeated access.
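
As a quick way to confirm point 2, the minimal sketch below compares the IP address a remote server sees with and without a proxy. It uses https://httpbin.org/ip as an echo service and a placeholder proxy address; both are illustrative choices, not part of the original article.

import requests

# Hypothetical proxy address used only for illustration; replace it with a real one.
PROXY = 'http://127.0.0.1:8080'


def visible_ip(proxies=None):
    """Return the IP address that the remote server sees for this request."""
    response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
    response.raise_for_status()
    return response.json()['origin']


if __name__ == '__main__':
    print('Without proxy:', visible_ip())
    print('With proxy:   ', visible_ip({'http': PROXY, 'https': PROXY}))

If the proxy is working, the two printed addresses differ, which is exactly the "hide the real IP" effect described above.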

3. How to obtain proxy IP

There are many free and paid proxy IP providers, and we can obtain proxy IPs from their websites. One example is:

Proxy IP website: https://www.zdaye.com

After obtaining proxy IPs, we need to check their validity, filter out the dead ones, and store the usable ones, so that only working proxy IPs are kept.

The following is a Python code example that can detect the validity of proxy IPs and store available proxy IPs:

import requests
import time


def check_proxy(proxy):
    """
    Check whether a proxy IP is usable.
    :param proxy: proxy IP (ip:port)
    :return: True or False
    """
    proxies = {
        'http': proxy,
        'https': proxy,
    }
    try:
        response = requests.get('https://www.baidu.com/', proxies=proxies, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False


def save_proxy(ip, port, protocol='http'):
    """
    Store a usable proxy IP.
    :param ip: IP address
    :param port: port number
    :param protocol: protocol type
    :return: None
    """
    with open('proxies.txt', 'a+', encoding='utf-8') as f:
        f.write('{}://{}:{}\n'.format(protocol, ip, port))


def main():
    for page in range(1, 11):  # fetch the first 10 pages of proxy IPs
        url = 'https://www.zdaye.com/nn/{}'.format(page)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/89.0.4389.82 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            # Assumes each line of the response is a plain "ip:port" entry;
            # a real listing page usually needs HTML parsing instead.
            html = response.text
            proxy_list = html.split('\n')
            for proxy in proxy_list:
                if proxy:
                    ip = proxy.split(':')[0]
                    port = proxy.split(':')[1]
                    if check_proxy(proxy):
                        save_proxy(ip, port)
        time.sleep(1)  # brief pause between pages so the listing site is not requested too quickly


if __name__ == '__main__':
    main()
    print('Done!')

The above code uses the requests library to fetch the proxy provider's pages, checks each proxy's validity, and stores the usable proxy IPs in a local file.
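
Checking proxies one by one is slow when the list is long. As an optional sketch (not part of the original code), the check can be run in parallel with the standard-library concurrent.futures module; it reuses the same validity test and the proxies.txt file produced above.

import requests
from concurrent.futures import ThreadPoolExecutor


def check_proxy(proxy):
    """Same validity test as above: True if the proxy can fetch a test page."""
    proxies = {'http': proxy, 'https': proxy}
    try:
        return requests.get('https://www.baidu.com/', proxies=proxies, timeout=5).status_code == 200
    except requests.RequestException:
        return False


def filter_working(proxy_list, workers=20):
    """Check many proxies concurrently and return only the working ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_proxy, proxy_list))
    return [p for p, ok in zip(proxy_list, results) if ok]


if __name__ == '__main__':
    with open('proxies.txt', 'r', encoding='utf-8') as f:
        candidates = [line.strip() for line in f if line.strip()]
    print(filter_working(candidates))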

4. How to use a proxy IP

To use a proxy IP in a crawler program, pass it through the proxies parameter of the requests library. The sample code is as follows:

import requests


def get_page(url, proxy):
    """
    Request a web page through a proxy IP.
    :param url: page URL
    :param proxy: proxy IP
    :return: page content, or None on failure
    """
    proxies = {
        'http': proxy,
        'https': proxy,
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36'
    }
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            return None
    except requests.RequestException:
        return None


def main():
    url = 'https://www.baidu.com/'
    proxy = 'http://121.69.46.218:9000'  # example proxy address; replace it with one from your own pool
    page = get_page(url, proxy)
    print(page)


if __name__ == '__main__':
    main()

The above code passes the proxy IP to the request through the proxies parameter of the requests library, so the page is fetched through the proxy rather than directly.
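
If the same proxy is reused for many requests, it can also be attached once to a requests.Session instead of being passed on every call. The sketch below is an optional variation on the code above, not part of the original article: it tries each proxy from a list in turn and falls back to the next one when a request fails; the proxy addresses shown are placeholders.

import requests


def get_page_with_fallback(url, proxy_list, timeout=10):
    """Try each proxy in turn until one returns the page, or give up and return None."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36'
    }
    for proxy in proxy_list:
        session = requests.Session()
        session.headers.update(headers)
        session.proxies.update({'http': proxy, 'https': proxy})
        try:
            response = session.get(url, timeout=timeout)
            if response.status_code == 200:
                return response.text
        except requests.RequestException:
            continue  # this proxy failed, move on to the next one
    return None


if __name__ == '__main__':
    # Placeholder proxies for illustration only.
    proxies = ['http://127.0.0.1:8080', 'http://127.0.0.1:8081']
    print(get_page_with_fallback('https://www.baidu.com/', proxies))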

5. How to perform access control

When using proxy IPs, we also need access control to keep the program running smoothly and to avoid overloading the target site. Specifically, access can be controlled in the following ways:

  1. Control the request frequency: set time intervals, request counts, and similar limits to slow the crawler down and avoid putting excessive pressure on the website.
  2. Rotate proxy IPs: store multiple usable proxy IPs and use them in turn to spread the access load.
  3. Pick proxy IPs at random: choose a random proxy from the available pool for each request to make the traffic harder to fingerprint (a sketch covering points 1 and 3 follows this list; the rotation example after it covers point 2).
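
As a complementary sketch for points 1 and 3, the snippet below picks a random proxy from the stored pool for each request and enforces a minimum interval between requests. The interval value is an illustrative choice, and proxies.txt is the file produced in section 3.

import random
import time

import requests

MIN_INTERVAL = 2.0  # illustrative minimum number of seconds between requests
_last_request = 0.0


def load_pool(path='proxies.txt'):
    """Load the stored proxy pool from disk."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def throttled_get(url, pool, **kwargs):
    """Wait until MIN_INTERVAL has passed since the last request, then fetch through a random proxy."""
    global _last_request
    wait = MIN_INTERVAL - (time.time() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.time()
    proxy = random.choice(pool)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10, **kwargs)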

Here is an example of Python code that implements access control and rotates proxy IPs:

import requests
import time


def get_proxy():
    """
    Take a single proxy IP out of the proxy pool.
    :return: proxy IP
    """
    proxy_list = []
    with open('proxies.txt', 'r', encoding='utf-8') as f:
        for line in f:
            proxy = line.strip()
            proxy_list.append(proxy)
    return proxy_list[0]


def check_proxy(proxy):
    """
    Check whether a proxy IP is usable.
    :param proxy: proxy IP
    :return: True or False
    """
    proxies = {
        'http': proxy,
        'https': proxy,
    }
    try:
        response = requests.get('https://www.baidu.com/', proxies=proxies, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False


def save_proxy(ip, port, protocol='http'):
    """
    Store a usable proxy IP.
    :param ip: IP address
    :param port: port number
    :param protocol: protocol type
    :return: None
    """
    with open('proxies.txt', 'a+', encoding='utf-8') as f:
        f.write('{}://{}:{}\n'.format(protocol, ip, port))


def rotate_proxy():
    """
    Yield proxy IPs from the pool in rotation, looping forever.
    :return: generator of proxy IPs
    """
    proxy_list = []
    with open('proxies.txt', 'r', encoding='utf-8') as f:
        for line in f:
            proxy = line.strip()
            proxy_list.append(proxy)
    while True:
        for proxy in proxy_list:
            yield proxy


def main():
    proxy_generator = rotate_proxy()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36'
    }
    for i in range(10):  # limit the total number of requests
        proxy = next(proxy_generator)
        while not check_proxy(proxy):  # skip proxies that are no longer usable
            proxy = next(proxy_generator)
        try:
            url = 'https://www.baidu.com/'
            response = requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy}, timeout=10)
            if response.status_code == 200:
                print(response.text)
        except requests.RequestException:
            pass
        time.sleep(1)  # control the interval between requests


if __name__ == '__main__':
    main()
    print('Done!')

The above code uses a generator with yield to hand out the stored proxy IPs in rotation, and adds a time interval so the crawler does not make requests too frequently. It also re-checks each proxy's validity before use, so only working proxy IPs are used.
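
During a long crawl, proxies that were valid at startup can stop working. As a further hedged sketch, not part of the original article, the pool itself can track failures and evict proxies that keep failing; MAX_FAILURES is an illustrative threshold.

import collections

MAX_FAILURES = 3  # illustrative threshold: evict a proxy after this many consecutive failures


class ProxyPool:
    """A small in-memory pool that hands out proxies in rotation and evicts ones that keep failing."""

    def __init__(self, proxies):
        self._pool = collections.deque(proxies)
        self._failures = {p: 0 for p in proxies}

    def get(self):
        """Return the next proxy in rotation, or None if the pool is empty."""
        if not self._pool:
            return None
        proxy = self._pool[0]
        self._pool.rotate(-1)  # move it to the back so the next call gets a different proxy
        return proxy

    def report(self, proxy, ok):
        """Record the outcome of a request; evict the proxy after too many consecutive failures."""
        if ok:
            self._failures[proxy] = 0
        elif proxy in self._pool:
            self._failures[proxy] += 1
            if self._failures[proxy] >= MAX_FAILURES:
                self._pool.remove(proxy)

Each request then calls pool.get() before fetching and pool.report(proxy, ok) afterwards, so dead proxies gradually disappear from rotation.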

6. Summary

This article has shown how to use proxy IPs in a crawler program to cope with anti-crawler mechanisms, and how to apply access control so the program runs reliably. Doing this well requires understanding how web crawlers and anti-crawler mechanisms work. You should also respect the website's access rules and avoid placing an excessive burden on it.

Origin: https://blog.csdn.net/wq10_12/article/details/132689944