[Python3 crawler] 12_Use of proxy IP

When we crawl pages, if we hit a website from the same IP address for a long time, the site will eventually block us. The fix is to go through proxy IPs: since the visible address can be changed at any time, the crawler will not be restricted.

The following site provides free proxy IPs in China: http://www.xicidaili.com/

After we open this page, we can see the proxy IPs and their addresses, as shown below:

[Image: list of free proxy IPs and their ports on xicidaili.com]

The marked part in the picture above is a proxy IP and its port number.
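Later in the code, urllib.request.ProxyHandler expects the proxy as a dict mapping the URL scheme to a "host:port" string, so an entry copied from the page would look something like this (the address below is just the example used later in this article):

# ProxyHandler maps each scheme to a "host:port" string
proxies = {"http": "14.118.254.1:6666"}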

Now let's start crawling content through a proxy IP.

First of all, we need to build a custom opener. Why? Because the basic urlopen method does not support proxies, so we have to add that capability ourselves:

  • Use the relevant Handler class (here, ProxyHandler) to create a specific handler object
  • Pass these handler objects to urllib.request.build_opener to create a custom opener object
  • Install the custom opener as the global opener (meaning that any later call to urlopen will go through it)

The specific implementation code is as follows:

import urllib.request

def proxy_use(url, tm_ip):
    """Fetch url through the given "host:port" HTTP proxy."""
    # 1. Create a handler for the proxy address
    proxy = urllib.request.ProxyHandler({"http": tm_ip})
    # 2. Build a custom opener from the handler
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    # 3. Install it as the global opener, so later urlopen calls use it
    urllib.request.install_opener(opener)
    # Fetch the page content
    content = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return content

ip = "14.118.254.1:6666"
url = "http://www.baidu.com"
content = proxy_use(url, ip)
print(len(content))

The results are as follows:

[Image: console output of the script, the length of the fetched page]
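Free proxies like these tend to die quickly, so in practice it helps to keep a small pool and rotate through it until one works. The sketch below is not from the original article; the second address in the list is a placeholder, and the pool would normally be filled with live proxies collected from a page like the one above:

import urllib.request

# Candidate proxies; the second entry is a placeholder, not a live proxy
proxy_list = [
    "14.118.254.1:6666",
    "123.123.123.123:8080",
]

def fetch_with_rotation(url, proxies):
    # Try each proxy in turn until one returns the page
    for ip in proxies:
        try:
            handler = urllib.request.ProxyHandler({"http": ip})
            opener = urllib.request.build_opener(handler)
            # Short timeout so a dead proxy fails fast;
            # URLError and socket timeouts are both OSError subclasses
            response = opener.open(url, timeout=5)
            return response.read().decode("utf-8", "ignore")
        except OSError:
            continue  # this proxy failed, try the next one
    return None

content = fetch_with_rotation("http://www.baidu.com", proxy_list)
print(len(content) if content else "all proxies failed")

Rotating per request like this avoids installing a global opener, so each attempt is isolated and a dead proxy only costs one timeout.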
