Python web crawler notes (9): ProxyHandler processor (proxy settings)

Using proxy IPs is the second major weapon in the crawler/anti-crawler fight, and usually the most effective one.

Many websites monitor how often a given IP visits within a certain period (through traffic statistics, server logs, etc.); if the visit frequency does not look like a normal person's, the site bans that IP.

So we can set up a pool of proxy servers and switch to a different one every so often. Even if one IP gets banned, we can simply continue crawling through another.

In urllib.request, ProxyHandler is used to configure a proxy server. The following code shows how to send requests through a proxy via a custom opener:

#urllib2_proxy1.py

import urllib.request

# Build two proxy handlers: one configured with a proxy IP, one without
httpproxy_handler = urllib.request.ProxyHandler({"http" : "124.88.67.81:80"})
nullproxy_handler = urllib.request.ProxyHandler({})

proxySwitch = True  # a switch to turn the proxy on or off

# Pass the proxy handler objects to urllib.request.build_opener() to create a custom opener,
# choosing the proxy mode according to the switch
if proxySwitch:
    opener = urllib.request.build_opener(httpproxy_handler)
else:
    opener = urllib.request.build_opener(nullproxy_handler)

request = urllib.request.Request("http://www.baidu.com/")

# 1. Written this way, only requests sent with opener.open() go through the custom proxy;
#    urllib.request.urlopen() does not use it.
response = opener.open(request)

# 2. Written this way, the opener is installed globally: from then on, every request,
#    whether sent with opener.open() or urlopen(), goes through the custom proxy.
# urllib.request.install_opener(opener)
# response = urllib.request.urlopen(request)

print(response.read())

Free open proxies cost essentially nothing to obtain. We can collect them from public proxy-list websites, test them, and keep the working ones for our crawlers.

Plenty of websites publish free, short-lived proxies of this kind.
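Before dropping a collected proxy into a crawler, it is worth verifying that it still works. Below is a minimal test sketch, assuming a plain reachability check: the check_proxy helper, the test URL, and the 5-second timeout are illustrative choices, not part of the original notes.

import urllib.request

def check_proxy(proxy, test_url="http://www.baidu.com/", timeout=5):
    # Try to fetch test_url through the given proxy; treat any
    # error or non-200 status as "proxy unusable"
    handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(handler)
    try:
        response = opener.open(test_url, timeout=timeout)
        return response.getcode() == 200
    except Exception:
        return False

# Example: test a free proxy before adding it to our pool
print(check_proxy({"http" : "124.88.67.81:80"}))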

If we have enough proxy IPs, we can pick one at random for each request, just as we randomly picked a User-Agent earlier:

import urllib.request
import random

proxy_list = [
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"}
]

# Randomly pick one proxy from the list
proxy = random.choice(proxy_list)
# Build a proxy handler object with the chosen proxy
httpproxy_handler = urllib.request.ProxyHandler(proxy)

opener = urllib.request.build_opener(httpproxy_handler)

request = urllib.request.Request("http://www.baidu.com/")
response = opener.open(request)
print(response.read())

However, these free open proxies are typically shared by many users, and they suffer from short lifespans, slow speeds, low anonymity, and unstable HTTP/HTTPS support (free products are rarely good).

Therefore, professional crawler engineers and crawler companies use high-quality private proxies instead. These are usually purchased from dedicated proxy vendors and accessed with username/password authorization (as the saying goes, nothing ventured, nothing gained).
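As a rough illustration of how such an authorized proxy plugs into the same ProxyHandler pattern, here is a minimal sketch; the username, password, IP, and port below are placeholders that a real vendor would supply, not values from the original notes.

import urllib.request

# Placeholder credentials and endpoint; a real vendor supplies these values.
# Embedding "user:password@" in the proxy URL makes urllib.request send the
# Proxy-Authorization header automatically.
authproxy_handler = urllib.request.ProxyHandler(
    {"http" : "http://username:password@123.45.67.89:16816"}
)

opener = urllib.request.build_opener(authproxy_handler)

request = urllib.request.Request("http://www.baidu.com/")
response = opener.open(request)
print(response.read())

Alternatively, urllib.request also provides a ProxyBasicAuthHandler, which holds the username and password separately in an HTTPPasswordMgrWithDefaultRealm and keeps credentials out of the proxy URL.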
