Why use a proxy?
Have you ever run into this situation while writing a crawler? The crawler works fine during testing, but after running for a while it starts reporting errors or returning no data, and the page may respond with a message like "IP access too frequent". This means the website has IP-based anti-crawling measures: it tracks the number and rate of requests from each IP over a time window, and once a threshold is exceeded it simply denies service. This is commonly known as "banning the IP".
This is where proxy IPs come in. A proxy is just a proxy server, and its working principle is simple. In a normal request, we send the request directly to the web server and the web server returns the response directly to us. Using a proxy IP inserts a "bridge" between our machine and the web server: the local machine first sends the request to the proxy server, which forwards it to the web server, and the response travels back along the same path through the proxy. As a result, it is much harder for the web server to identify our local IP.
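To make the relay concrete, here is a minimal sketch of how this looks with the `requests` library. The proxy address `10.0.0.1:8080` and the `httpbin.org` echo service are purely illustrative:

```python
def build_proxies(proxy_addr):
    """Build the proxies mapping that requests expects, from a 'host:port' string."""
    return {
        "http": "http://{}".format(proxy_addr),
        "https": "http://{}".format(proxy_addr),
    }

# Direct request: the web server sees your real IP.
#   requests.get("http://httpbin.org/ip")
# Proxied request: the proxy server relays it, so the web server
# sees the proxy's IP instead of yours.
#   requests.get("http://httpbin.org/ip", proxies=build_proxies("10.0.0.1:8080"))
```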
Proxy IPs vary in quality, however. By anonymity level there are three types: elite (high-anonymity) proxies, anonymous proxies, and transparent proxies.
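The difference between the three types lies in which headers the proxy forwards to the target server: a transparent proxy passes your real IP along (e.g. in `X-Forwarded-For`), an anonymous proxy hides your IP but still reveals that a proxy is in use (e.g. via the `Via` header), and an elite proxy reveals neither. A rough sketch of that distinction (a hypothetical heuristic for illustration, not how real detection works):

```python
def classify_proxy(seen_headers):
    """Roughly classify a proxy by the headers the target server received.

    Hypothetical heuristic for illustration only.
    """
    if "X-Forwarded-For" in seen_headers:
        return "transparent"  # real client IP is exposed
    if "Via" in seen_headers or "Proxy-Connection" in seen_headers:
        return "anonymous"    # IP hidden, but proxy use is detectable
    return "elite"            # neither the IP nor the proxy is revealed
```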
Next, let's look at ProxyPool, a popular open-source project on GitHub, and see how to use it.
Introduction to ProxyPool
ProxyPool is a crawler proxy IP pool. It periodically collects free proxies published on the Internet, validates them before storing them, periodically re-checks the stored proxies for availability, and provides both an API and a CLI. You can also add your own proxy sources to improve the quality and quantity of IPs in the pool.
Download the code
Via git clone:
git clone git@github.com:jhao104/proxy_pool.git
Or download the corresponding zip file from the repository.
Install the dependencies
pip install -r requirements.txt
Modify the configuration file
Open setting.py and modify the project configuration to your needs:
# API service configuration
HOST = "0.0.0.0"  # listening IP
PORT = 5000  # listening port
# Database configuration
DB_CONN = 'redis://:pwd@127.0.0.1:8888/0'
# Without a password
DB_CONN = 'redis://:@127.0.0.1:8888/0'
# Proxy table name (a table you create yourself)
TABLE_NAME = 'use_proxy'
# ProxyFetcher configuration
PROXY_FETCHER = [
    "freeProxy01",  # names of the enabled proxy-fetching methods; all fetch methods live in fetcher/proxyFetcher.py
    "freeProxy02",
    # ....
]
Run the project
Start the Redis service: redis-server.exe (the executable is in the Redis installation path).
1. Start the scheduler
In the proxy_pool project directory, open a cmd window and enter:
python proxyPool.py schedule
Read the proxies from the database:
import redis
r = redis.StrictRedis(host="127.0.0.1", port=6379, db=0)
result = r.hgetall('use_proxy')  # all stored proxies, keyed by host:port
result.keys()
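Note that redis-py returns the hash keys as bytes. A small helper (my own, not part of ProxyPool) turns the result of hgetall into plain host:port strings:

```python
def decode_proxy_keys(raw_hash):
    """Decode the bytes keys from StrictRedis.hgetall into 'host:port' strings."""
    return sorted(k.decode("utf-8") for k in raw_hash)

# e.g. decode_proxy_keys(r.hgetall('use_proxy'))
```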
2. Start the webApi service
python proxyPool.py server
After the web service starts, the following API endpoints are enabled by default:
| api | method | description | params |
|---|---|---|---|
| / | GET | API introduction | none |
| /get | GET | get a random proxy | optional: ?type=https to filter proxies that support https |
| /pop | GET | get and remove a proxy | optional: ?type=https to filter proxies that support https |
| /all | GET | get all proxies | optional: ?type=https to filter proxies that support https |
| /count | GET | get the number of proxies | none |
| /delete | GET | remove a proxy | ?proxy=host:ip |
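As a sketch, a small URL helper (the names are my own, not part of ProxyPool) makes the endpoints above easy to call with requests. The base address assumes the HOST/PORT configured in setting.py:

```python
API_BASE = "http://127.0.0.1:5000"  # assumed HOST/PORT from setting.py

def api_url(endpoint, https_only=False):
    """Build a pool API URL; ?type=https filters https-capable proxies."""
    url = API_BASE + endpoint
    if https_only:
        url += "?type=https"
    return url

# Example calls (require the webApi service to be running):
#   requests.get(api_url("/get")).json()    # one random proxy
#   requests.get(api_url("/count")).json()  # pool size
#   requests.get(api_url("/get", https_only=True)).json()
```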
Use in a crawler
import requests

def get_proxy():
    # use the HOST/PORT configured in setting.py
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code
def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            # access the page through the proxy
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            return html
        except Exception:
            retry_count -= 1
    # remove the dead proxy from the pool
    delete_proxy(proxy)
    return None
After all, the proxies in the pool are free proxies scraped from the web, so IP quality is uneven, but it is good enough for everyday development and testing.
That's all for today's content.