Python crawlers: IP hiding techniques and proxy crawling

Preface

When developing and running crawlers, you will often run into a target website's anti-crawler mechanisms, the most common of which is IP blocking. In that case you need to use IP hiding techniques and proxy crawling.

1. IP hiding techniques

IP hiding techniques disguise the source of requests so that the target website does not recognize the requesting IP address as belonging to a crawler. With these techniques you can effectively bypass the target website's restrictions on specific IP addresses.

1. Random User-Agent

User-Agent is a string the client sends to the server with each request. It usually includes information such as the client's software version, operating system, language environment, and browser vendor. If a crawler's User-Agent differs from that of a real browser, the server can easily recognize it as a crawler and restrict it.

Therefore, picking a User-Agent string at random effectively disguises the client so that the server believes a real user is visiting. The following is sample code for selecting a random User-Agent:

import random

def get_user_agent():
    # Return a random User-Agent string chosen from a list of common browsers
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 OPR/39.0.2256.48"
    ]
    return random.choice(user_agents)
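
If you would rather not maintain such a list by hand, the third-party fake-useragent package (an assumption here: it is installed separately, e.g. with pip install fake-useragent) can supply strings for you:

from fake_useragent import UserAgent  # third-party package, not in the standard library

ua = UserAgent()
print(ua.random)  # a random real-world User-Agent string
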
2. Set request header information

When making a crawler request, you need to set the request's header information, especially the Referer and Cookie fields. When setting headers, take care that they look like a real user's request.

import requests

url = "http://www.example.com"

headers = {
    "User-Agent": get_user_agent(),
    "Referer": "http://www.example.com/",
    "Cookie": "xxx"
}

response = requests.get(url, headers=headers)
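
Rather than hard-coding the Cookie header, you can also let requests manage cookies for you with a Session object; a minimal sketch, reusing get_user_agent() from above:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": get_user_agent(),
    "Referer": "http://www.example.com/",
})

# Cookies set by earlier responses are stored on the session
# and sent automatically with subsequent requests.
response = session.get("http://www.example.com")
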
3. Use a dynamic IP proxy

A dynamic IP proxy hides the real IP address by routing requests to the target website through a proxy server, so the server cannot identify the crawler's real IP address.

Using proxies requires preparing a proxy pool, that is, a set of available proxy IP addresses. These can be purchased from a proxy IP provider or obtained from free sources.

import requests

def get_proxy():
    # Placeholder credentials and address; replace with a real proxy
    return {
        "http": "http://username:password@proxy_address:port",
        "https": "https://username:password@proxy_address:port"
    }

url = "http://www.example.com"

response = requests.get(url, proxies=get_proxy())
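
As an aside, requests also honors the standard HTTP_PROXY and HTTPS_PROXY environment variables by default, so a proxy can be configured without passing the proxies argument; a minimal sketch (the proxy URL is a placeholder):

import os
import requests

os.environ["HTTP_PROXY"] = "http://username:password@proxy_address:port"
os.environ["HTTPS_PROXY"] = "http://username:password@proxy_address:port"

# requests picks up the proxy settings from the environment
response = requests.get("http://www.example.com")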

2. Proxy crawling

When performing proxy crawling, you need to pay attention to the following issues:

  1. The proxy IP addresses need to be available, otherwise failed requests will reduce the crawler's efficiency (see the retry sketch right after this list).
  2. There need to be enough proxy IP addresses, otherwise the same addresses will be reused so often that the server blocks them.
  3. The proxy IP addresses need to be of good quality, because low-quality proxies are prone to connection timeouts and network errors.
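
On point 1, a simple way to cope with an unavailable proxy is to retry the failed request with a different one; a minimal sketch, assuming a get_random_proxy() helper like the one shown later in this section:

import requests

def fetch_with_retry(url, max_retries=3):
    # Try the request with a fresh proxy on each attempt.
    for attempt in range(max_retries):
        proxy = get_random_proxy()
        try:
            return requests.get(url, proxies=proxy, timeout=10)
        except requests.RequestException:
            continue  # this proxy failed; try another one
    return None

response = fetch_with_retry("http://www.example.com")
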
1. Use a proxy pool

A proxy pool is a collection of available proxy IP addresses. With a proxy pool, available proxies can be maintained automatically, avoiding manual addition and removal of addresses. A proxy pool can be implemented along the lines of the following sample code:

import requests
import time

class ProxyPool:
    def __init__(self):
        self.pool = []
        self.index = 0

    def get_proxy(self):
        # Return proxies from the pool in round-robin order
        if len(self.pool) == 0:
            return None
        proxy = self.pool[self.index]
        self.index += 1
        if self.index == len(self.pool):
            self.index = 0
        return proxy

    def add_proxy(self, proxy):
        if proxy not in self.pool:
            self.pool.append(proxy)

    def remove_proxy(self, proxy):
        if proxy in self.pool:
            self.pool.remove(proxy)

    def check_proxy(self, proxy):
        # Verify that the proxy can fetch a test URL with an HTTP 200 response
        try:
            response = requests.get("http://www.example.com", proxies=proxy, timeout=10)
            return response.status_code == 200
        except requests.RequestException:
            return False

    def update_pool(self):
        new_pool = []
        for proxy in self.pool:
            if self.check_proxy(proxy):
                new_pool.append(proxy)
        self.pool = new_pool

pool = ProxyPool()

# Add proxy IP addresses (placeholder credentials and address)
pool.add_proxy({"http": "http://username:password@proxy_address:port", "https": "http://username:password@proxy_address:port"})

# Refresh the proxy pool periodically (in practice this loop would run in a background thread)
while True:
    pool.update_pool()
    time.sleep(60)
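
Once the pool is populated, the crawler can draw a proxy for each request and drop any proxy that fails; a minimal sketch using the ProxyPool class above:

def crawl(url, pool):
    proxy = pool.get_proxy()
    if proxy is None:
        return None  # no proxies available
    try:
        return requests.get(url, proxies=proxy, timeout=10)
    except requests.RequestException:
        pool.remove_proxy(proxy)  # drop the failing proxy
        return None

response = crawl("http://www.example.com", pool)
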
2. Randomly switch proxies

When performing proxy crawling, you need to switch the proxy IP address randomly to avoid being blocked by the server for making repeated requests from the same IP address. Random proxy switching can be implemented as in the following sample code:

import random
import requests

# Placeholder proxies; in practice these would come from your proxy pool
proxy_list = [
    {"http": "http://username:password@proxy_address:port", "https": "http://username:password@proxy_address:port"},
    {"http": "http://username:password@proxy_address2:port", "https": "http://username:password@proxy_address2:port"}
]

def get_random_proxy():
    return random.choice(proxy_list)

url = "http://www.example.com"

for i in range(10):
    proxy = get_random_proxy()
    response = requests.get(url, proxies=proxy)
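
To make the disguise more convincing, proxy rotation can be combined with the random User-Agent from earlier and a short pause between requests; a minimal sketch reusing url, get_user_agent() and get_random_proxy() from the examples above:

import random
import time
import requests

for i in range(10):
    headers = {"User-Agent": get_user_agent()}  # random User-Agent from the earlier example
    proxy = get_random_proxy()
    response = requests.get(url, headers=headers, proxies=proxy)
    time.sleep(random.uniform(1, 3))  # short random delay to mimic human browsing
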
3. Use high-quality proxies

When performing proxy crawling, low-quality proxy IP addresses easily lead to connection timeouts or network errors, which hurts the crawler's efficiency. Choosing high-quality proxy IP addresses is therefore very important.

You can obtain high-quality proxy IP addresses through the services of a proxy IP provider. You can also test proxy availability regularly and remove invalid addresses promptly. Here is sample code that tests whether a proxy IP address is available:

import requests

def check_proxy(proxy):
    # Returns True if the proxy can fetch a test URL with an HTTP 200 response
    try:
        response = requests.get("http://www.example.com", proxies=proxy, timeout=10)
        return response.status_code == 200
    except requests.RequestException:
        return False

proxy = {"http": "http://username:password@proxy_address:port", "https": "http://username:password@proxy_address:port"}

if check_proxy(proxy):
    print("Proxy IP address is available")
else:
    print("Proxy IP address is not available")

3. Summary

When developing Python crawlers, you often run into a target website's anti-crawler mechanisms, the most common of which is IP blocking. To bypass this restriction, you can use IP hiding techniques and proxy crawling. IP hiding techniques include using a random User-Agent, setting request header information, and using a dynamic IP proxy. Proxy crawling, in turn, requires attention to the availability, quantity, and quality of proxy IP addresses, which can be handled by using a proxy pool, switching proxies randomly, and selecting high-quality proxies.

Origin blog.csdn.net/wq10_12/article/details/132832491