Implementing a Python crawler with proxy IPs

When crawling, if you hit the target website too fast or too often, your IP can easily be blocked, which means you cannot continue working for a while. This is where proxy IPs are extremely convenient: no matter how a website blocks you, as long as you can find a fresh proxy IP, you can carry on with the next step of your work.

At present, many websites offer free proxy IPs, and paid ones are of course more reliable. Besides showing how to use a proxy IP, this article is also a chance to exercise the proxy IP pool built in the previous article; if you haven't seen it, you can click here: Build a proxy IP pool with Python (1) - Get IP. As long as you access the interface the pool provides, you can get a proxy IP. Next, let's see how to use it!


API test URL: http://httpbin.org/get. Visiting this address returns information about the request you sent; the origin field is the client IP as seen by the server, so it can be used to judge whether the proxy was set successfully, i.e. whether the IP masquerade worked.
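For reference, you can first check your real outgoing IP before configuring any proxy. The short sketch below (assuming the requests library is installed) just prints the origin field:

import requests

# With no proxy configured, "origin" should be your own public IP
resp = requests.get('http://httpbin.org/get')
print(resp.json()['origin'])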

Get IP

The proxy pool uses Flask to expose an interface for fetching proxies: http://localhost:5555/random. Visit this interface and the content it returns is a usable proxy IP.
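As a minimal sketch (assuming the pool from the previous article is running locally on port 5555), fetching a proxy is a single GET request; note that the host:port format of the returned value is an assumption here:

import requests

# Ask the proxy pool's Flask interface for a random proxy
ip = requests.get('http://localhost:5555/random').text.strip()
print(ip)  # assumed to look like host:port, e.g. 1.2.3.4:8080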

Urllib

Let's take a look at Urllib's proxy setting method:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener, urlopen
 
# Get a proxy IP from the pool
ip_response = urlopen("http://localhost:5555/random")
ip = ip_response.read().decode('utf-8').strip()
 
proxy_handler = ProxyHandler({
    'http': 'http://' + ip,
    'https': 'https://' + ip
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Result:

{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "origin": "108.61.201.231, 108.61.201.231",
  "url": "https://httpbin.org/get"
}

Urllib sets the proxy through ProxyHandler. Its parameter is a dictionary whose keys are protocol types and whose values are the proxies, each prefixed with its protocol, i.e. http or https. When the requested link uses the http protocol, the http proxy is called; when it uses the https protocol, the https proxy is called. So the effective proxies here are http://108.61.201.231 and https://108.61.201.231.

Once the ProxyHandler object is created, pass it to the build_opener() method to create an Opener. The Opener then has the proxy configured, and you can access links through the proxy simply by calling its open() method.
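If the proxy requires authentication, ProxyHandler also accepts credentials embedded in the proxy URL. A hedged sketch follows; the address and credentials below are placeholders, not from this article:

from urllib.request import ProxyHandler, build_opener

# Hypothetical authenticated proxy in user:password@host:port form
proxy_handler = ProxyHandler({
    'http': 'http://user:password@10.0.0.1:3128',
    'https': 'https://user:password@10.0.0.1:3128'
})
opener = build_opener(proxy_handler)
print(opener.open('http://httpbin.org/get').read().decode('utf-8'))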

Requests

Setting a proxy in Requests only requires passing in the proxies parameter:

import requests
 
# Get a proxy IP from the pool
ip_response = requests.get("http://localhost:5555/random")
ip = ip_response.text.strip()
 
proxies = {
    'http': 'http://' + ip,
    'https': 'https://' + ip,
}
try:
    response = requests.get('http://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

Result:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "47.90.28.54, 47.90.28.54",
  "url": "https://httpbin.org/get"
}

With Requests you only need to construct a proxy dictionary and pass it in through the proxies parameter, which is much simpler.
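If every request in a script should go through the same proxy, the dictionary can also be attached to a Session once instead of being passed on each call. A small sketch, reusing the same local pool interface:

import requests

# Fetch a proxy once and attach it to a session
ip = requests.get('http://localhost:5555/random').text.strip()
session = requests.Session()
session.proxies.update({
    'http': 'http://' + ip,
    'https': 'https://' + ip,
})

# Every request made through this session now uses the proxy
print(session.get('http://httpbin.org/get').text)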

Selenium

import requests
from selenium import webdriver
import time
 
# Get a proxy IP from the pool with the requests library
ip_response = requests.get("http://localhost:5555/random")
ip = ip_response.text.strip()
 
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + ip)
browser = webdriver.Chrome(options=chrome_options)  # chrome_options= is deprecated in Selenium 4
browser.get('http://httpbin.org/get')
time.sleep(5)
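Selenium does not hand back a response object, so one way to confirm that the proxy took effect is to read the JSON that httpbin renders in the page, which is typically placed inside a pre element. A sketch, continuing from the code above:

from selenium.webdriver.common.by import By

# httpbin.org/get renders its JSON response in a <pre> tag;
# the "origin" field should now show the proxy IP, not your own
print(browser.find_element(By.TAG_NAME, 'pre').text)
browser.quit()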

That's all for this article. I hope it helps with your study, and I welcome further discussion and exchange.


Origin: blog.csdn.net/weixin_44617651/article/details/131533514